
Show HN: Visual Search using features extracted from the TensorFlow Inception model - visualsearchsv
https://github.com/AKSHAYUBHAT/VisualSearchServer
======
visualsearchsv
Inspired by the Pinterest paper at KDD on implementing Visual Search, I have
created this barebones but functional implementation of a Visual Search server
using ~450,000 female fashion images crawled from an image aggregation
website. The code uses pool_3 layer features extracted from Google's latest
Inception model using TensorFlow. The extracted vectors are then indexed
using an Approximate Nearest Neighbor implementation from NearPy. The AMI
provided contains both the images and the pre-computed vectors.
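
For anyone curious how the two pieces fit together, here is a minimal sketch
of the pipeline: extract the pool_3 activations with TensorFlow, then index
them with NearPy. The graph file name, image paths, and hash parameters are
illustrative assumptions, not the repo's exact code:

    import numpy as np
    import tensorflow as tf
    from nearpy import Engine
    from nearpy.hashes import RandomBinaryProjections

    # Load the pre-trained Inception graph (classify_image_graph_def.pb
    # from the inception-2015 release; the path is an assumption).
    with tf.gfile.FastGFile('classify_image_graph_def.pb', 'rb') as f:
        graph_def = tf.GraphDef()
        graph_def.ParseFromString(f.read())
        tf.import_graph_def(graph_def, name='')

    def extract_pool3(sess, image_path):
        """Return the 2048-d pool_3 activation for one JPEG."""
        image_data = tf.gfile.FastGFile(image_path, 'rb').read()
        pool3 = sess.graph.get_tensor_by_name('pool_3:0')
        return np.squeeze(sess.run(pool3,
                                   {'DecodeJpeg/contents:0': image_data}))

    # LSH-based approximate nearest neighbor index over 2048-d vectors.
    engine = Engine(2048, lshashes=[RandomBinaryProjections('rbp', 10)])

    with tf.Session() as sess:
        for path in ['img_0001.jpg', 'img_0002.jpg']:  # hypothetical files
            engine.store_vector(extract_pool3(sess, path), data=path)
        # Query: nearest indexed images to a new image.
        for _, path, dist in engine.neighbours(extract_pool3(sess, 'query.jpg')):
            print(path, dist)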

In the future I plan to add more images (~2 million) in the same domain, and
to test various combinations of nearest neighbor indexes and multiple vectors
per image using some form of multibox-style detector. I will also add a
script to launch spot GPU instances via CloudFormation to economically index
images using S3 and SQS. I am building a companion iOS Swift app; however,
since TensorFlow hasn't been ported to iOS yet, it's still in development.

[https://engineering.pinterest.com/blog/building-scalable-machine-vision-pipeline](https://engineering.pinterest.com/blog/building-scalable-machine-vision-pipeline)
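
To sketch how the S3/SQS indexing plan above could work, here is a
hypothetical worker loop for a GPU spot instance. The queue URL, bucket
names, and message format are all assumptions; the post only commits to
using S3 and SQS:

    import boto3

    sqs = boto3.client('sqs')
    s3 = boto3.client('s3')
    QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/index-jobs'

    def run_worker(sess):
        """Consume image keys from SQS, extract features on this spot
        instance, and upload the vectors back to S3."""
        while True:
            resp = sqs.receive_message(QueueUrl=QUEUE_URL,
                                       MaxNumberOfMessages=1,
                                       WaitTimeSeconds=20)
            for msg in resp.get('Messages', []):
                key = msg['Body']  # S3 key of one image to index
                s3.download_file('images-bucket', key, '/tmp/img.jpg')
                vector = extract_pool3(sess, '/tmp/img.jpg')  # see sketch above
                s3.put_object(Bucket='vectors-bucket', Key=key + '.npy',
                              Body=vector.tobytes())
                # Delete only after success so failed work gets retried.
                sqs.delete_message(QueueUrl=QUEUE_URL,
                                   ReceiptHandle=msg['ReceiptHandle'])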

------
nl
That's pretty nice. I thought that the Pinterest implementation used VGG?

~~~
visualsearchsv
According to their KDD paper, a modified/fine-tuned AlexNet performed as well
as VGG with significantly less computation time. Not sure what they actually
use in practice.

[http://www.kevinjing.com/visual_search_at_pinterest.pdf](http://www.kevinjing.com/visual_search_at_pinterest.pdf)

~~~
nl
That sounds like it makes sense, right?

I believe that most of the strength of VGG (and Inception) vs. AlexNet is
that they learn the feature relationships better, not that they learn better
features.

VGG is pretty computationally intensive, which is why Google concentrated so
much on computational complexity for GoogLeNet/Inception/ReCeption.

So if you are just using the features directly, it would make sense to use
whichever network is quickest to compute.

~~~
visualsearchsv
Yes, being able to compute features quickly is especially important for
reducing query latency, much more so than during indexing. What stood out for
me in the paper was that the out-of-the-box performance of VGG (trained on
ImageNet alone) was as good as a fine-tuned AlexNet.

I am interested in assessing whether there are any tricks that could be used
when querying from a mobile device. In that case feature extraction can be
performed on the device itself, with only the feature vector sent over the
network. Pinterest is another special case in that a lot of queries are
performed on images already present in the system: the user simply readjusts
a bounding box to highlight the object of interest, so they can pre-compute
4~20 crops per image. Online feature computation is much more expensive and
complicated than offline.
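
Rough numbers on why shipping just the vector is attractive (mine, for
illustration): a 2048-d float32 pool_3 vector is only 8 KB, far smaller than
a typical JPEG upload, and quantization shrinks it further:

    import numpy as np

    # Hypothetical on-device pool_3 features for one query image.
    vec = np.random.rand(2048).astype(np.float32)

    print(len(vec.tobytes()))                     # 8192 bytes (8 KB)
    print(len(vec.astype(np.float16).tobytes()))  # 4096 bytes after fp16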

~~~
nl
TensorFlow runs on Android, right? And AlexNet runs on a Raspberry Pi, so it
should be fine on a phone.

But it would be interesting to know if that is better. I'd imagine most phones
have some kind of hardware support for resizing images, so it might be better
to take advantage of that and then do feature extraction on a server?
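
Server-side, that split could look like this hypothetical Flask endpoint
(not the repo's actual API; it reuses `sess` and `engine` from the sketch in
the top comment):

    from flask import Flask, request, jsonify

    app = Flask(__name__)

    # `sess` (TensorFlow session with the Inception graph) and `engine`
    # (NearPy index) are assumed to be set up as in the first sketch.

    @app.route('/search', methods=['POST'])
    def search():
        # The phone uses its hardware scaler to shrink the photo (e.g. to
        # 299x299) before uploading; the expensive part stays server-side.
        jpeg_bytes = request.files['image'].read()
        pool3 = sess.graph.get_tensor_by_name('pool_3:0')
        vector = sess.run(pool3, {'DecodeJpeg/contents:0': jpeg_bytes}).squeeze()
        return jsonify([data for _, data, _ in engine.neighbours(vector)])

    if __name__ == '__main__':
        app.run(port=8080)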

~~~
krasin
Contemporary phones (e.g. the iPhone 6S) are capable of running GoogLeNet at
1 FPS; see, for example, this demo of mine:
[https://github.com/krasin/MetalDetector](https://github.com/krasin/MetalDetector)

AlexNet will run at ~10 FPS, I guess.

