

Show HN: Cocktail search written in Python using scrapy and sphinx - wallunit
http://cocktails.p24dev.de/

======
mapgrep
Nice, but there should be

1\. Some stemming/normalizing such that "créme" and "creme" (as in, say, creme
de menthe) are not two distinct result sets. (To your credit, whisky/whiskey
appear normalized)

2\. A list of sources available apart from mouse-scrubbing the results. Any
reason Esquire (considered borderline definitive by certain cocktail snobs)
isn't in there?

Wishlist:

-A way to make a search for "brandy" return anything containing "cognac" or "calvados," but not vice versa. Ditto for whisky vs scotch/rye/bourbon.

-Ability to exclude certain sources from all searches

~~~
wallunit
Actually, the English stemmer from libstemmer/Snowball is already used.
Additionally, there is a wordforms list that at the moment only maps whisky to
whiskey. ;) Feel free to send a pull request if you know other words that
should be normalized. I have also just added an extended charset_table that
maps, among others, "é" to "e". So when you now search for "créme", you get the
same recipes as for "creme".
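
To illustrate what folding "é" to "e" achieves, here is a minimal Python sketch. It is not Sphinx's charset_table (that mapping lives in the Sphinx config); it only shows the same idea using the standard library, and the function name `fold_accents` is made up for this example.

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Strip combining marks so accented letters match their base letters."""
    # NFKD splits "é" into "e" plus a combining acute accent (U+0301).
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks and keep every other character unchanged.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(fold_accents("créme de menthe"))  # creme de menthe
```

With both the indexed text and the query folded this way, "créme" and "creme" end up as the same token.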

At the moment I only crawl Wikipedia, liquor.com and seriouseats.com for
cocktail recipes. But thanks for mentioning Esquire. I didn't know that
website yet. Maybe I will crawl it as well in the future.

One-directional mapping of words is unfortunately not possible with Sphinx.
However, I could expand "brandy" to "brandy OR cognac OR calvados" just before
I run the query. I will consider that approach.
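
The expansion idea could look roughly like the following sketch. The `EXPANSIONS` table and the `expand_term` name are hypothetical, and the output uses `|`, the OR operator of Sphinx's extended query syntax, rather than the literal word OR.

```python
# Hypothetical one-directional map: a general spirit expands to its sub-types,
# but a sub-type never expands back to the general term.
EXPANSIONS = {
    "brandy": ["cognac", "calvados"],
    "whisky": ["scotch", "rye", "bourbon"],
}


def expand_term(term: str) -> str:
    """Return the term itself, or an OR group of the term and its sub-types."""
    key = term.strip().lower()
    subtypes = EXPANSIONS.get(key, [])
    if not subtypes:
        return key
    # Group the alternatives so the result can be embedded in a larger query.
    return "(" + " | ".join([key] + subtypes) + ")"


print(expand_term("brandy"))  # (brandy | cognac | calvados)
print(expand_term("cognac"))  # cognac
```

Because the lookup only goes from the general term to the specific ones, a search for "cognac" stays a search for "cognac".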

------
chris_p
I really like this! I recently started making simple cocktails, and I am
certainly going to use this.

Things I'd like to see:

* I prefer milliliters over oz (I didn't even know what it was before I googled it, since I live in Europe), please make this an option

* Some form of autocomplete, preferably with a dropdown list (but make sure it doesn't obstruct the next textbox). There's a jquery plugin that does dropdown autocomplete well.

* Pressing enter should make the next textbox active (seems more intuitive than tab in this case)

* It would be nice if the ingredients in the ingredient list below a cocktail were clickable. Clicking them should add them in the list of selected ingredients and update the cocktails visible.

~~~
imjared
On things to fix:

While I appreciate the use of history, I promise you I don't need to hold my
position on the page for every single letter I type in the search box:
<http://screencast.com/t/1qFwxlz5>

I ended up with a history entry for "search results for 'c'", "search results
for 'co'", "search results for 'cog'", etc.

Maybe trim this down to only firing after 10s of inactivity in a text input or
use onblur? I'm not too sure what the answer is but it (page history)
definitely loses any usefulness when architected in this manner.

~~~
wallunit
The primary reason I am using the history API is that you can bookmark
and share links to a specific search. However, I agree that when you actually
use the history, it becomes annoying to have every search that was executed
while you were typing in it.

However, if the history is only updated every 10s, you might have to wait
before you can copy the link to your search from the address bar. And if you
aren't typing fast enough, there can still be incomplete searches in your
history.

Using onblur would be better. But what would you think about updating the
history as soon as the cursor is moved? I quite like that idea.

~~~
umaar
I think <http://caniuse.com> handles this situation quite well (try searching
for something). Notice the short delay between a keypress and the URL
updating.

~~~
wallunit
They reset the timeout every time you type another letter. That approach seems
to work quite well, unless you are a very slow typist.
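
To make the reset-the-timeout behavior concrete, here is a small simulation in Python rather than browser code. The function name, the event format and the one-second delay are all assumptions for illustration; it only computes which searches would end up in the history.

```python
def debounced_commits(keystrokes, delay):
    """Return the search texts that would be committed to the history.

    keystrokes: list of (timestamp_in_seconds, text_after_keystroke),
                sorted by timestamp.
    delay:      quiet period required before a search is committed.
    """
    commits = []
    for i, (timestamp, text) in enumerate(keystrokes):
        is_last = i == len(keystrokes) - 1
        # A later keystroke within `delay` resets the timer, which cancels
        # the pending commit for this text.
        if is_last or keystrokes[i + 1][0] - timestamp >= delay:
            commits.append(text)
    return commits


typing = [(0.0, "c"), (0.2, "co"), (0.4, "cog"), (0.5, "cogn"),
          (2.0, "cogna"), (2.1, "cognac")]
print(debounced_commits(typing, delay=1.0))  # ['cogn', 'cognac']
```

The pause after "cogn" shows the slow-typist case: a gap longer than the delay still leaves an incomplete search in the history.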

However, I have just implemented another approach to deal with that issue. Now
the history is updated either when the current field loses focus or when you
move the mouse after you have entered something. Please try it out and let me
know what you think.

------
pscheufele
This is great. I would really love to see this tool with a curated list of
recipes. The recipes from scraping are hit or miss. Please message me if you
want to chat about it, maybe I can help.

I would also have a couple suggestions for improvement. (1) include the recipe
for how to make the cocktail, as opposed to simply the ingredients. This can
make a big difference in the result. (2) include an AND function for the
ingredients search (eg. a search for pisco AND lime, I didn't see that this
exists currently).

~~~
wallunit
I'm not sure if either approach would improve the results.

Searching the instructions in addition to the list of ingredients will rank
recipes higher when the ingredients you were searching for occur more often
in the instructions. But a recipe isn't necessarily more relevant just because
its instructions are more verbose.

I'm not sure if an AND function would add much benefit. Yes, at the moment
only one of the given ingredients must be part of the cocktail for it to
appear in the search results. But the search results are also sorted by
relevance, so cocktails that contain more of the searched ingredients are
ranked higher than cocktails that contain fewer of them.

I'm pretty sure that there is no way to PM other users here at Hacker News,
and you don't have an email address in your profile. However, if you want to
have a further discussion on that, feel free to write me an email. You'll find
my email address in my Hacker News and GitHub profiles.

------
tharshan09
Very nice. I particularly like the crawler implementations. I have used scrapy
before in a project. I could not guess from the code, but is all the data
going into a database? Could you give an overview of how things work in the
backend? Thanks

~~~
wallunit
Actually, the entire backend is on GitHub. But let me explain it for you.

There is, as you have already discovered, a crawler implemented with scrapy.
However, I don't use the scrapy server and pipelines. Instead, I have a script
that lets scrapy generate JSON files with the crawled recipes and builds the
Sphinx index from the crawled data. There is no RDBMS. Basically, Sphinx is my
database. :)

And then there is the website. Its server side is implemented with
Werkzeug and its UI with jQuery.

~~~
tharshan09
Oh, I actually make scrapy do that too. Do you use scrapy crawl with the -o and
-t json options? I did not know Sphinx can index JSON files etc. Any reason why
Sphinx was chosen rather than an alternative? Thanks for the info.

~~~
wallunit
Yes, I use "scrapy crawl <spider> -o <spider>.json" (see bootstrap.sh). Sphinx
cannot index JSON files directly, but it can index an XML stream written to
stdout by a given command. So I have written a script (sphinx/xmlpipe.py)
that reads the JSON files generated by scrapy and writes the crawled recipes
in Sphinx's XML format to stdout.
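
For readers who have not seen that XML format, here is a rough sketch of such a converter. It is not the actual sphinx/xmlpipe.py. The item keys (`title`, `url`, `ingredients`), the schema and the function name are assumptions, and joining the ingredients with "!" follows the description elsewhere in this thread.

```python
import json
from xml.sax.saxutils import escape, quoteattr


def recipes_to_xmlpipe2(recipes):
    """Render recipe dicts as an xmlpipe2 document stream (returned as str)."""
    lines = [
        '<?xml version="1.0" encoding="utf-8"?>',
        "<sphinx:docset>",
        "<sphinx:schema>",
        # One full-text field plus two stored string attributes (assumed).
        '<sphinx:field name="ingredients"/>',
        '<sphinx:attr name="title" type="string"/>',
        '<sphinx:attr name="url" type="string"/>',
        "</sphinx:schema>",
    ]
    for doc_id, recipe in enumerate(recipes, start=1):
        # "!" ends a sentence, so each ingredient becomes its own sentence.
        ingredients = " ! ".join(recipe["ingredients"])
        lines.append("<sphinx:document id=%s>" % quoteattr(str(doc_id)))
        lines.append("<ingredients>%s</ingredients>" % escape(ingredients))
        lines.append("<title>%s</title>" % escape(recipe["title"]))
        lines.append("<url>%s</url>" % escape(recipe["url"]))
        lines.append("</sphinx:document>")
    lines.append("</sphinx:docset>")
    return "\n".join(lines)


# Items in the shape scrapy's JSON export might produce (made-up sample data).
sample = json.loads(
    '[{"title": "Manhattan", "url": "http://example.com/manhattan",'
    ' "ingredients": ["rye whiskey", "sweet vermouth", "bitters"]}]'
)
print(recipes_to_xmlpipe2(sample))
```

A real script would read the JSON files from disk and write the result to stdout, so that the indexer can run it as the source command.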

There are alternatives to Sphinx? ;)

* Sphinx is ridiculously fast, as you can see when searching. Even building the index takes only 340ms (of which 220ms are spent by the Python script that generates the XML) for 1699 recipes on my three-year-old notebook.

* Sphinx doesn't require an RDBMS to index documents from. It can index documents from any source. You just need to write a simple script that puts the documents into the XML format expected by Sphinx.

* Sphinx is not only a full text search engine. It is also a multi-value store. You can add extra information like the title and url to indexed documents. And so you don't need an additional database.

* I need the ability to limit a fulltext search to sentence boundaries. I don't know if there are other fulltext search engines that can do that.

~~~
tharshan09
I was quite surprised at how fast the searches were. It sounds really
great and I'm sure I can put up with XML. I just don't like having to return
HTML, but is that your preference and the way you have done it? It looks like
it's just returning a dict of results, so there's no reason why I can't form a
JSON response, I guess. What does a typical query look like? And how are you
using the sentence boundaries (tbh I'm not sure what they even are)? I guess
this is perfect for fast text search, but I'm guessing there is some
computation it can't do?

~~~
wallunit
For example, if you enter the ingredients "rye whiskey" and "vermouth", the
following query is generated:

@ingredients (rye SENTENCE whiskey) | vermouth

The SENTENCE operator basically works like the & operator, except that both
operands must occur in the same sentence. That is very helpful, since the
index field "ingredients" contains a list of all ingredients of the recipe,
separated by an "!".
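
Generating such a query from the entered ingredients could be sketched like this. The function name `build_query` and the word-splitting rule are assumptions, not the site's actual code.

```python
import re


def build_query(ingredients, field="ingredients"):
    """Build an extended-syntax query string from a list of ingredients."""
    groups = []
    for ingredient in ingredients:
        # Keep only word characters so user input cannot inject operators.
        # Lowercasing also keeps a typed "SENTENCE" from acting as one.
        words = re.findall(r"\w+", ingredient.lower())
        if not words:
            continue
        # All words of one ingredient must occur in the same sentence.
        group = " SENTENCE ".join(words)
        if len(words) > 1:
            group = "(" + group + ")"
        groups.append(group)
    if not groups:
        return ""
    # Any one of the ingredients is enough for a recipe to match.
    return "@%s %s" % (field, " | ".join(groups))


print(build_query(["rye whiskey", "vermouth"]))
# @ingredients (rye SENTENCE whiskey) | vermouth
```

Because every ingredient in the index field ends with "!", each one counts as its own sentence, so "rye" and "whiskey" only match together when they belong to the same ingredient.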

However, Sphinx can do more than full-text search. You can also filter and sort
by attributes and complex expressions that involve any attribute, the
relevance from the full-text search, arithmetic and some built-in functions.
And Sphinx does that much faster than any RDBMS does. I never managed to
generate a Sphinx query that took a measurable amount of time on my
three-year-old notebook. ;)

The Sphinx index doesn't contain any HTML. However, I return HTML from the
WSGI app that serves the AJAX calls, so I guess that is what you are talking
about. It just seemed to me that it would be simpler to generate the HTML for
the search results on the server side with Python than in JavaScript.

By the way if you want to use Sphinx for your own project, and you have the
data to index already in a MySQL or PostgreSQL database, there is no need to
write a script to generate XML. Sphinx can index data from MySQL and
PostgreSQL databases directly. However I was just saying that, thanks to
xmlpipe support, you don't need an RDBMS just to use Sphinx. And that thanks
to index attributes, Sphinx can completely replace an RDBMS in a lot of cases.

~~~
tharshan09
Great I think you have me sold :) I want to give this a try since a standard
db is what I would normally use to keep the data. This would be an interesting
experience. I will be reading up on sphinx and mysql etc, and do any future
data scraping straight to sphinx through xmlpipe. Last question :) - Do you
have any useful links you can provide on the matter of sphinx etc? Thanks.

~~~
wallunit
<http://sphinxsearch.com/docs/2.0.6/>

~~~
tharshan09
So I have been trying to get the Sphinx search working on OSX today. I have
built the index. I can search stuff on the mysql port using SQL queries but
when I search using the test api scripts in php or python or the search
command line tool, it always returns empty results. Not sure what is going
on.

~~~
tharshan09
nvm, works perfectly on my vps. I guess my local machine is just a bit messed
up. Damn MAMP.

------
jtchang
Good stuff.

Back button behavior is pretty annoying :)

