I had a... suboptimal experience with SQLite's FTS, so here it is for others: MATCH queries don't work inside `OR` conditions. You have to use this workaround:
    WHERE
        fa.rowid IN (SELECT rowid FROM address_fts WHERE address_fts MATCH 'google')
        OR
        ta.rowid IN (SELECT rowid FROM address_fts WHERE address_fts MATCH 'google')
I'm building a search feature for my app that parses a user's search query (a la GitHub search) and returns relevant results. Generating the above workaround SQL on the fly is just... ugh.
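For illustration, here's a minimal sketch of generating that workaround on the fly from Python. The schema (messages, addresses, fa, ta, address_fts) is a placeholder I made up, not from the original post, and it assumes an SQLite build with FTS5:

    import sqlite3

    # Placeholder schema: messages joined to an addresses table twice
    # (fa = from address, ta = to address), with address_fts as an
    # external-content FTS5 index over addresses.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE addresses(id INTEGER PRIMARY KEY, address TEXT);
        CREATE TABLE messages(id INTEGER PRIMARY KEY, from_id INT, to_id INT, body TEXT);
        CREATE VIRTUAL TABLE address_fts USING fts5(address, content='addresses', content_rowid='id');
        INSERT INTO addresses VALUES (1, 'alice@google.com'), (2, 'bob@example.org');
        INSERT INTO address_fts(rowid, address) SELECT id, address FROM addresses;
        INSERT INTO messages VALUES (1, 1, 2, 'hello');
    """)

    # Build the "rowid IN (subquery)" workaround for each table alias,
    # instead of putting MATCH directly inside an OR.
    def fts_or_clause(aliases, term):
        sub = "rowid IN (SELECT rowid FROM address_fts WHERE address_fts MATCH ?)"
        return " OR ".join(f"{a}.{sub}" for a in aliases), [term] * len(aliases)

    where, params = fts_or_clause(["fa", "ta"], "google")
    rows = conn.execute(f"""
        SELECT m.id, fa.address, ta.address
        FROM messages m
        JOIN addresses fa ON fa.id = m.from_id
        JOIN addresses ta ON ta.id = m.to_id
        WHERE {where}
    """, params).fetchall()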
It isn't made clear in the repo, but what this does [1] is use a small amount of code to glue SQLite's tokenizer API to the Rust unicode-segmentation crate [2], which in turn uses the TR-29 (UAX #29) segmentation algorithm with generated tables [3].
Stemmers tend to cover only English and a handful of other, mostly European, languages. For example, you can see what Snowball covers here [1], while NLTK has a few more [2].
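If you want a quick way to see that coverage from Python, a small sketch assuming the snowballstemmer package (pip install snowballstemmer):

    import snowballstemmer

    # List the available Snowball algorithms.
    print(snowballstemmer.algorithms())
    # e.g. ['arabic', 'danish', 'dutch', 'english', ...]

    # Stem a few English words.
    stemmer = snowballstemmer.stemmer("english")
    print(stemmer.stemWords(["running", "connections", "cities"]))
    # e.g. ['run', 'connect', 'citi']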
What you may find useful is unidecode, which transliterates Unicode to similar-sounding ASCII. There are packages available for most programming languages, with the original being in Perl. I highly recommend reading the original article describing how it works. My practical experience is that it is fairly good for text searching and produces reasonable results.
Just pass both the content and the query through unidecode and a simple filter that removes stop words and cruft; it works wonders for autocomplete. Applying that idea to true full-text search is somewhat more involved (you have to somehow produce a stemmer that works on the output of unidecode), but it's doable and it works.
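A minimal sketch of that normalisation, assuming the Python unidecode package (the stop-word list is just illustrative):

    import re
    from unidecode import unidecode

    STOP_WORDS = {"the", "a", "an", "of", "and"}   # illustrative, not a real list

    # Apply the same normalisation to indexed content and incoming queries:
    # transliterate to ASCII, lowercase, keep alphanumeric tokens, drop stop words.
    def normalise(text):
        ascii_text = unidecode(text).lower()
        tokens = re.findall(r"[a-z0-9]+", ascii_text)
        return [t for t in tokens if t not in STOP_WORDS]

    print(normalise("Crème brûlée"))             # ['creme', 'brulee']
    print(normalise("Björk Guðmundsdóttir"))     # ['bjork', 'gudmundsdottir']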
I hadn't seen that tokenizer, and it's really great that they open-sourced it, because not supporting CJK languages out of the box is one of the major issues with SQLite FTS.
I think this also shows why a standard third-party distribution of SQLite is needed, so that extensions like this can be included with the SQLite binaries that ship with bindings for other languages.
The problem with doing CJK and other languages properly is that you need a lot of data lookup tables. As an example, Python's unicodedata module has several tables available, but not enough to implement word splitting, and it is already 1.2MB of compiled code.
The ICU C library does include all the tables and algorithms, which come to 30MB of compiled data plus 12MB of compiled code. Run through gzip, the compiled data becomes 12MB, giving a reasonable lower bound. That gives an idea of the payload that would have to be shipped if you want up-to-date Unicode definitions, or that would have to be included in base images.
SQLite does have optional integration with ICU, but it doesn't cover segmentation.
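For comparison, ICU's UAX #29 word segmentation is reachable from Python via the PyICU package, which gives an idea of what those tables buy you (a sketch, not something SQLite itself provides):

    from icu import BreakIterator, Locale

    # Dictionary-based word segmentation from ICU; iterating the
    # BreakIterator yields successive boundary offsets.
    def icu_words(text, locale="zh"):
        bi = BreakIterator.createWordInstance(Locale(locale))
        bi.setText(text)
        words, start = [], bi.first()
        for end in bi:
            piece = text[start:end]
            if piece.strip():
                words.append(piece)
            start = end
        return words

    print(icu_words("你好世界"))   # typically ['你好', '世界']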
> The problem with doing CJK and other languages properly is that you need a lot of data lookup tables
Hmm... My impression is that with other search software, just splitting CJK text into bigrams often works well enough for a lot of things. But I guess because SQLite FTS doesn't allow fuzzy searches, where it returns the best-ranked results even if they don't include all the input tokens, that might not work here.
Perhaps a good compromise would be to include an extension by default that loads the additional lookup tables for CJK support at runtime if they're needed? That way the binary SQLite distribution could still be standardized, since loading custom binary extensions isn't really that convenient if you're using a language other than C and trying to build binaries for multiple platforms.
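For what it's worth, the runtime-loading step itself is simple from e.g. Python when the build allows it; the hard part is shipping the compiled extension per platform. The extension and tokenizer names here are hypothetical:

    import sqlite3

    conn = sqlite3.connect("search.db")
    conn.enable_load_extension(True)
    conn.load_extension("./cjk_tokenizer")   # hypothetical compiled tokenizer extension
    conn.enable_load_extension(False)

    # The tokenizer would then be named when creating the FTS table, e.g.:
    # conn.execute("CREATE VIRTUAL TABLE docs USING fts5(body, tokenize='cjk')")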
Unfortunately, it doesn't seem like the creators of SQLite are that interested in solving this problem, since theoretically you can just use extensions to do whatever you need.
The problem with bigrams is that they create a lot of non-words. In the example given, 你好世界, we'd get the bigram 好世, which no one is ever going to search for. Lots of wasted space in the index.
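For concreteness, a character bigram tokenizer is just this, and you can see the 好世 token it emits:

    # Every adjacent pair of characters becomes a token, non-words included.
    def bigrams(text):
        return [text[i:i + 2] for i in range(len(text) - 1)]

    print(bigrams("你好世界"))   # ['你好', '好世', '世界']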
> Unfortunately, it doesn't seem like the creators of sqlite are that interested in solving this problem since theoretically you can just use extensions to do whatever you need.
Small. Fast. Reliable. Choose any three. :)
> sqlite FTS doesn't allow fuzzy searches where it returns the best ranked results even if they don't include all the input tokens
I'm busy updating my Python SQLite extension with full-text search, and this is one of the things I'll be addressing. You can do a fuzzy search by having the query OR all the tokens in the MATCH, and then using a ranking function to figure out an appropriate score.
I'm also doing things like query expansion (finding statistically similar search terms), plus spelling correction (playystation -> playstation), split tokens (noone -> no one), combining (play station -> playstation), and larger tokens (one -> gone, phone, opponent).
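A rough sketch of the OR-all-tokens-then-rank idea against a plain FTS5 table (the table, column, and sample data are made up; bm25() is FTS5's built-in ranking function, where more negative means a better match):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE VIRTUAL TABLE docs_fts USING fts5(body);
        INSERT INTO docs_fts(body) VALUES
            ('play station portable games'),
            ('radio station schedule'),
            ('portable radio');
    """)

    # OR all the query tokens so partial matches are still returned, then let
    # bm25() order them by relevance.
    def fuzzy_search(query, limit=10):
        match = " OR ".join(f'"{t}"' for t in query.split())   # e.g. '"play" OR "station"'
        return conn.execute(
            """SELECT body, bm25(docs_fts) AS score
               FROM docs_fts
               WHERE docs_fts MATCH ?
               ORDER BY score
               LIMIT ?""",
            (match, limit),
        ).fetchall()

    # Rows matching only some of the tokens still come back, just ranked lower.
    for body, score in fuzzy_search("play station portable"):
        print(round(score, 2), body)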
> I'm busy updating my Python SQLite extension with full text search and this is one of things I'll be addressing. You can do a fuzzy search by having the query OR all tokens for the match, and then use a ranking function to figure out an appropriate score.
> I'm also doing other things like query expansion (find statistically similar search terms), and other issues like spelling correction (playystation -> playstation), split tokens (noone -> no one), combining (play station -> playstation), larger tokens (one -> gone, phone, opponent).
That's nice, but it just seems unfortunate that people have to keep reimplementing this stuff in different languages on top of SQLite.