Not the best code I've ever written by far, but it was a neat little experiment.
Basically, indexing each record with Soundex (phonetics) makes the search semi-tolerant of misspellings and also brings up similar-sounding results, while Damerau-Levenshtein distance is used to sort the results by closeness. I also had it index timestamps with
Here it is in action:
As you can see, you can misspell search terms and it will usually get it right. Granted, a lot of unrelated records usually come up in a search, but the important thing is the user should get what they're looking for in the first few results, usually the first.
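For the curious, a minimal sketch of that scheme in Python, using the jellyfish library for Soundex and Damerau-Levenshtein (the record set here is made up):

    import jellyfish  # pip install jellyfish

    records = ["Jonathan Smith", "John Smyth", "Joanna Smithson", "Jane Doe"]

    # Index: Soundex code of each word -> records containing that word.
    index = {}
    for rec in records:
        for word in rec.split():
            index.setdefault(jellyfish.soundex(word), set()).add(rec)

    def search(term):
        """Phonetic lookup, then sort candidates by edit-distance closeness."""
        candidates = index.get(jellyfish.soundex(term), set())
        return sorted(
            candidates,
            key=lambda rec: min(
                jellyfish.damerau_levenshtein_distance(term.lower(), w.lower())
                for w in rec.split()
            ),
        )

    # Misspelled, but Soundex("Smeth") == Soundex("Smith"), so both the
    # Smith and Smyth records still come back, closest match first:
    print(search("Smeth"))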
Regex is a powerful thing.
One big reason for this is the need for really good training data to do supervised learning on search problems. If you’re Google or Wikipedia, you can get this data. But many sites struggle to understand what a good result was for a query, based on clicks and other user data, due to low volumes of traffic.
The key is to use machine learning to generalize user behavior across multiple items in roughly the same category, rather than individual items. In fact, with tiny amounts of data you should pretty much never deal in individual items -- always in groups of related items.
Yup that’s one method I teach people in Learning to Rank training. In addition to reducing dimensionality on queries, you can also do so by geo and user depending on the domain. But do you want to do this much munging of training data?
Even still the basis for how you group similar information needs isn’t always obvious, and most teams prefer a simpler solution they can understand and manage that’s not ML based. A lot depends on the team that’s ultimately going to maintain the solution.
- maroon walking shoes
- walking shoes
- casual shoes
And obviously the same thing can be applied to generalize over the query. These hierarchies are constructed statically and dynamically with unsupervised learning, and then associations from query to groups happen dynamically.
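A toy illustration of that backoff in Python (the hierarchy, click counts, and threshold are all hypothetical):

    # Hypothetical static hierarchy: each group points to its broader parent.
    PARENT = {
        "maroon walking shoes": "walking shoes",
        "walking shoes": "casual shoes",
        "casual shoes": None,
    }

    # Behavioral data (e.g. click counts) per group; sparse at the leaves.
    CLICKS = {"casual shoes": 1900, "walking shoes": 240}

    def group_with_enough_data(group, min_events=100):
        """Walk up the hierarchy until a group has enough behavioral data."""
        while group is not None:
            if CLICKS.get(group, 0) >= min_events:
                return group
            group = PARENT.get(group)
        return None

    print(group_with_enough_data("maroon walking shoes"))  # walking shoes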
For example, a common training data storage format looks like:

    grade(0-4) queryId list of feature values...
    4 qid:1 1:0.5 2:0.24 # a doc with these features is 'good' for this query
    0 qid:1 1:-0.5 2:0.24 # a bad doc for query id 1 is described by these features

Features in an LTR context are some relationship between the query and the doc (such as scores computed from Solr/Elasticsearch queries).
Of course turning clicks/session info into grades isn't easy. One way to go about it is with click models (https://pdfs.semanticscholar.org/0b19/b37da5e438e6355418c726...). Another way is to equate conversions in a session with one of the grades (a purchase is a 4).
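As a rough sketch of the conversion approach (only "a purchase is a 4" comes from above; the rest of the event-to-grade mapping is an assumption for illustration):

    # Assumed mapping from strongest session event to relevance grade.
    EVENT_GRADE = {"purchase": 4, "add_to_cart": 3, "click": 1, "impression": 0}

    def to_ltr_line(qid, features, event):
        """Emit one judgment in the grade/qid/features format shown above."""
        grade = EVENT_GRADE[event]
        feats = " ".join(f"{i}:{v}" for i, v in enumerate(features, start=1))
        return f"{grade} qid:{qid} {feats}"

    print(to_ltr_line(1, [0.5, 0.24], "purchase"))  # 4 qid:1 1:0.5 2:0.24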
Needless to say, such usage data is sparse for long tail queries (queries that rarely happen - 'fuscha shoes'). Luckily it's not sparse for head queries (common queries - 'shoes'). So those head queries can provide some benefit, helping to generalize to the rare long tail queries.
You'll see a lot of patterns that are modifiers on head queries. Like 'shoes' is a head query. But 'blue shoes' and 'purple shoes' and other 'color shoes' are tail queries.
You can do a lot with that insight:
- Supervised grouping of queries into one 'query id' by watching how users refine queries. For example, users type "shoes" and then in the same session add a color adjective ("red shoes"); you can use that to group '<color> shoes' queries (see the sketch below)
- Unsupervised clustering of similar sessions or searches where users seem to expect ranking to perform similarly based on user behavior
The downside/tradeoff is you lose a lot of nuance that you're getting with per-query training data. But it's a technique people use.
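Here's the sketch promised above: a toy version of the supervised-grouping idea in Python (the color list and the exact grouping rule are assumptions):

    # Collapse "<color> shoes" variants into one shared query group, based on
    # the observation that users refine "shoes" with a color adjective.
    COLORS = {"red", "blue", "purple", "maroon"}

    def query_group(query):
        words = query.split()
        if len(words) == 2 and words[0] in COLORS and words[1] == "shoes":
            return "<color> shoes"  # all color variants share one query id
        return query                # otherwise the query is its own group

    for q in ["red shoes", "purple shoes", "shoes", "fuscha shoes"]:
        print(q, "->", query_group(q))
    # "fuscha shoes" stays its own group: an unseen modifier gets no pooled data.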
It also substitutes one ML problem for another, which may or may not be harder than the original one.
This is what I based Debian Code Search on: https://codesearch.debian.net/
See also https://codesearch.debian.net/research/bsc-thesis.pdf if you’re interested in details.
There is no information on the subject besides the 3 blog posts and the proof of concept.
The demo page works. The signup just sends me a confirmation email and the setup is still manual, so expect an hours-to-a-day delay. I'm aiming to have this all automated by tomorrow. I'd appreciate any/all feedback.
And this is really the bare minimum. An even better search would e.g. allow searching for identifiers (comments and strings disregarded).
In other words, I agree with you but I also know it is an extremely hard problem to solve at Github's scale.
Git is easy to host, especially now with several good web frontends.
For example, here's a hacky regex that finds lambdas:
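A stand-in illustration of the idea, in Python with the standard re module (this exact pattern is hypothetical, not the original snippet):

    import re

    # Hacky on purpose: matches "lambda <params>:" anywhere, including inside
    # strings and comments -- regex has no idea about semantic structure.
    LAMBDA_RE = re.compile(r"\blambda\b[^:\n]*:")

    source = "handlers = sorted(fns, key=lambda f: f.__name__)"
    print(LAMBDA_RE.findall(source))  # ['lambda f:']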
This is not the obvious search strategy for code, since it ignores semantic structure, though apparently it suits regex-searchable indexes.
I did write a book on the topic, developed the Elasticsearch learning to rank plugin, and have helped several Alexa top 50 sites deliver more accurate search.
I’m only pointing out that the search industry is oversold, but when you read about what Google actually does, it’s more normal stuff that we would all probably do to solve a similar problem.
(For those curious, it was a link to content at a community conference I help run, pointing out how hard ML can be for search. I admit it's a bit tangential, perhaps.)
The last time someone posted a link to a hopeful successor to GitHub: https://news.ycombinator.com/item?id=18022357
Even a traditional RDBMS will start to become slow with datasets in the 10-100 GiB range.
Note that products like BigQuery, Snowflake, Redshift, etc. might be able to handle that scale, but they are neither cheap nor particularly fast.
That said, I approve of naive brute force, and one should always benchmark their system against the brute-force solution.
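In that spirit, a minimal brute-force baseline one could benchmark against (a naive substring scan over files; the corpus path and query below are placeholders):

    import time
    from pathlib import Path

    def brute_force_search(root, needle):
        """Naive baseline: scan every file, line by line, for a substring."""
        hits = []
        for path in Path(root).rglob("*"):
            if not path.is_file():
                continue
            try:
                text = path.read_text(errors="ignore")
            except OSError:
                continue
            for lineno, line in enumerate(text.splitlines(), start=1):
                if needle in line:
                    hits.append(f"{path}:{lineno}")
        return hits

    start = time.perf_counter()
    hits = brute_force_search(".", "TODO")  # placeholder corpus and query
    print(f"{len(hits)} hits in {time.perf_counter() - start:.2f}s")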