In particular I love that one of the examples in your comment history is in Latin as that language is not currently supported by Postgres FTS. Are Latin and Ancient Greek supported by Manticore? (dare I hope for Anglo Saxon...)
There's one big list of mappings for almost every script under the sun, including Greek. (With mappings like 'U+1F08..U+1F0F->U+1F00..U+1F07' turning U+1F08 Ἀ [CAPITAL ALPHA WITH PSILI] into U+1F00 ἀ [SMALL ALPHA WITH PSILI], and the same for seven other accented alphas. I've considered turning them all into unaccented alpha instead, but I don't know enough about Greek orthography to decide that.) https://github.com/Tatoeba/tatoeba2/blob/3170f7326ad2939c691...
For Latin, there are some special exceptions so that "GAIVS IVLIVS CAESAR" and "Gaius Julius Caesar" are treated the same: https://github.com/Tatoeba/tatoeba2/blob/3170f7326ad2939c691...
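The same idea can be sketched in a few lines of Python. This is a toy sketch, not the actual Manticore charset_table mechanism, and the u/v and i/j folding is just my reading of the Latin exceptions described above:

```python
# Toy normalizer: fold accented capital alphas (U+1F08..U+1F0F) down to their
# small counterparts (U+1F00..U+1F07), and fold Latin V/J to u/i so that
# classical and modern spellings match. Hypothetical helper, for illustration.
GREEK = {cp: cp - 0x8 for cp in range(0x1F08, 0x1F10)}  # int -> int codepoint map
LATIN = str.maketrans({"V": "u", "v": "u", "J": "i", "j": "i"})

def fold(s: str) -> str:
    s = s.translate(GREEK)
    s = s.translate(LATIN)
    return s.lower()

print(fold("GAIVS IVLIVS CAESAR"))  # gaius iulius caesar
print(fold("Gaius Julius Caesar"))  # gaius iulius caesar
```

Both spellings normalize to the same string, which is what makes them hit the same index entries.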
It's not beautiful, but it's used in production. People who don't need to support quite as many languages as Tatoeba will probably want a simpler config, but it might still be useful as a reference.
The corpus-search functions those researchers use include wildcards, exact-phrase specification with quotation marks, proximity searches, and Boolean search strings. When you have a chance, you might want to add a list of the syntax formats that currently work. (I tried using * as a wildcard in a phrase surrounded by quotation marks, and it didn’t seem to work.)
One small improvement you could make would be to widen the “Search terms” field so that longer search strings are visible.
It wouldn't have the same nice interface and searches may take several seconds, but there's only 60,000 books...
The current search box is a wrapper around Postgres's phraseto_tsquery, while the Discovery tab uses plainto_tsquery, so you could play with either as an ersatz for some of these features for now, although special characters might get stripped or parsed incorrectly.
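The practical difference between the two is roughly this (a toy illustration, not Postgres internals; it ignores stemming and stop words): plainto_tsquery only requires all lexemes to be present, while phraseto_tsquery additionally requires them to be adjacent and in order.

```python
# Toy word-level matchers mimicking the difference between plainto_tsquery
# (AND of terms, position ignored) and phraseto_tsquery (terms adjacent, in order).
def plain_match(doc_words, query_words):
    return all(w in doc_words for w in query_words)

def phrase_match(doc_words, query_words):
    n = len(query_words)
    return any(doc_words[i:i + n] == query_words
               for i in range(len(doc_words) - n + 1))

doc = "call me ishmael some years ago".split()
print(plain_match(doc, ["ishmael", "call"]))   # True: order ignored
print(phrase_match(doc, ["ishmael", "call"]))  # False: not adjacent in order
print(phrase_match(doc, ["call", "me"]))       # True
```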
Do you know where the people you are talking about hang out online (for example, subreddits)? I'd love to get in touch with them once the features are built and for more general feedback.
I will probably write a blog post once I've tried a few of the approaches suggested in this thread.
Most of the researchers I remember were literature scholars. While they were comfortable using computers, not many of them seemed savvy enough to write their own programs, run their own servers, or even use grep.
I'm not familiar with Dash, but in the worst case I think you can add an nginx route (as an additional app) to pass GET parameters to the app.
The next line I can find, though. How are you parsing?
To be more specific, "what is a" are stop words and "cynic" is very common, so a lot of rows are returned (see my other comment). ts_rank takes too long to rank them, and the server times out, leaving you with the previous query's table because I didn't take the time to program a correct response to this issue. "Cecil Graham. What is a cynic?" returns Lady Windermere's Fan almost instantly.
The workarounds I've thought of would be to cache these queries (assuming I've seen them before, and after I've set up logging), buy a larger server, or pay Second Quadrant to speed up ts_rank... I'd love any suggestions from more experienced Postgres engineers!
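A minimal sketch of the first workaround: memoize repeated queries in-process with functools.lru_cache. The ranked_search function here is a hypothetical stand-in for the real phraseto_tsquery + ts_rank round trip; a production version would cache in Redis or a materialized table so the cache survives restarts.

```python
import functools
import time

@functools.lru_cache(maxsize=1024)
def ranked_search(query):
    # Stand-in for the expensive DB call: phraseto_tsquery + ts_rank.
    time.sleep(0.01)  # simulate slow ranking work
    return (f"results for {query!r}",)

ranked_search("what is a cynic")  # pays the full cost once
ranked_search("what is a cynic")  # answered from the in-process cache
```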
Edit to your edit, re parsing. The subset of rows returned follows:
where language = %s::regconfig and textsearchable_index_col @@ phraseto_tsquery(%s::regconfig, %s)
ts_rank_cd(textsearchable_index_col, phraseto_tsquery(%s::regconfig, %s), 32)
As others have mentioned, there isn't one "right answer" but there are a few nifty tricks you can use to boost performance.
You could try a trigram index as the primary search index. You'd lose stemming, but gain language independence and indexed regular-expression search, which is a nice power-user feature.
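For intuition, here's roughly what pg_trgm computes under the hood (approximate: the real extension also splits on non-alphanumerics and handles padding per word):

```python
# Toy trigram similarity in the spirit of pg_trgm: pad, extract all
# 3-character windows, then take the Jaccard similarity of the two sets.
def trigrams(s):
    s = "  " + s.lower() + " "  # approximate pg_trgm-style padding
    return {s[i:i + 3] for i in range(len(s) - 2)}

def similarity(a, b):
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

print(similarity("caesar", "caesar"))  # 1.0
print(similarity("caesar", "cesar"))   # partial match despite the typo
```

The index works because a candidate row only needs to share trigrams with the query, so typos and regexes can still narrow the search to a small set of rows.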
Looking at your where clause, did you index language and textsearchable_index_col together? If not, you are giving the planner the opportunity to skip your full-text index.
If you order by ts_rank_cd, the query will sort all rows, including those that don't match. A better pattern is to compute the ranking score as a column and then sort it in an outer query; there's an example of this on Stack Overflow. (As an aside, as of PostgreSQL 12, CTEs are no longer an optimization fence, so you can write the query out more cleanly with a CTE and still get the desired performance.)
You should experiment with GIN vs. GiST indices. GIN indexes are larger, so on a large dataset the index might not fit in memory. You could pay for more memory, but it's worth trying a GiST index to see whether things get faster simply because it fits in RAM.
A final frontend comment: I'm a fan of infinite scroll for this kind of thing. You effectively have pagination set up already, so you could spoil me as a user with infinite scroll. react-virtualized is a great library for that.
Hope that helps.
> Looking at your where clause, did you index language and textsearchable_index_col together? If not, you are giving the planner the opportunity to skip your full-text index.
Here is the index:
set textsearchable_index_col = to_tsvector(b.cfgname::regconfig, coalesce(paragraph, ' '))
inner join pg_ts_config b on lower(a.language) = b.cfgname
where gutenberg.paragraphs.num = a.num;
create index textsearch_paragraph_idx on gutenberg.paragraphs using gin (textsearchable_index_col);
Ranking was already done as a column in a CTE and ranked further down:
with paragraphs as (
, ts_rank_cd(textsearchable_index_col, phraseto_tsquery(%s::regconfig, %s), 32) as rank
, ts_headline(%s, paragraph, phraseto_tsquery(%s::regconfig, %s), 'MaxFragments=1000, StartSel=**, StopSel=**') as highlighted_result
where language = %s::regconfig and textsearchable_index_col @@ phraseto_tsquery(%s::regconfig, %s)
Pagination is a tricky one. I was in a rush and hit an issue where building large Dash tables was very expensive and prone to crashing. I initially set up pagination as per the Dash examples, but that didn't play nicely with the markdown I needed for the URLs and highlighting (giving these up for plain-text tables made it HN-fast, but it confused my trial users). So the quick and dirty solution "to ship now" was to set up custom pagination in the SQL and show the results.
I think that if there is enough interest in this project, the front end will have to move off Dash.
I don't believe I'm going to summon cranks with the following:
There are more specialized tools for your use case than Postgres, and you should be looking into "n-gram indexing".
Lucene-based systems such as Elasticsearch are quite popular; there are also Sphinx and Xapian, both fairly widespread.
You need to read the documentation and configure them: they are flexible and need to be tuned to this use case.
In the end, there is no "correct" way to do things. For instance, sometimes stemming words is the right way to go, but if you're building, say, a medical system where two very different medicines can be spelled similarly and stem to the same word, a mistake could lead to death; stemming is very bad there.
Sometimes removing stop words is the way to go, but if people are looking for quotes such as "to be or not to be", well, now you have the empty string. Splendid.
So yeah, configure configure configure.
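For the curious, the core of "n-gram indexing" fits in a few lines of Python: a toy inverted index over character trigrams with a verification pass to drop false positives. Real engines add tokenization, compression, and scoring on top, but this is the skeleton.

```python
from collections import defaultdict

def ngrams(text, n=3):
    # All overlapping character n-grams of the lowercased text.
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

class NgramIndex:
    def __init__(self, n=3):
        self.n = n
        self.postings = defaultdict(set)  # gram -> set of doc ids
        self.docs = {}

    def add(self, doc_id, text):
        self.docs[doc_id] = text
        for g in ngrams(text, self.n):
            self.postings[g].add(doc_id)

    def search(self, query):
        grams = ngrams(query, self.n)
        if not grams:
            return set()
        # Candidates must contain every gram of the query...
        candidates = set.intersection(
            *(self.postings.get(g, set()) for g in grams))
        # ...then verify, since co-occurring grams aren't a substring match.
        return {d for d in candidates
                if query.lower() in self.docs[d].lower()}
```

Notice there's no stemming and no stop-word list here, which is exactly why this style of index handles literal quotes like "to be or not to be" without choking.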
This may bring out the cranks:
If you want to roll your own, NoSQL systems like Redis, Mongo, or Couch seem to work really well (I've rolled my own in all three on separate occasions).
I guarantee there are advanced features of Maria and Postgres that aren't widely used, and some people reading this will confidently claim those databases are superior, but I assure you that is a minority opinion. Most people go with the other options.
If you ever doubt it, ask the actual developers of Postgres or Maria on chat. They are extremely nice and way more honest about the limitations than their hardcore fans are. The databases are under constant development with advanced features and you'll learn a lot (really, they are both super chill dev communities, impressively so).
Perhaps your solution (as mine has been) is a hybrid. For instance, you can store term existence in one system and the offsets and chunks in another, so you get a parallelizable, asynchronous worker pipeline; it's impressively fast when you scale it horizontally: <0.5 sec over multiple terabytes of text (and yes, I'm talking NVMe, hundreds of GB of RAM per node, >=10Gb network). I've legitimately just run random queries to marvel at the speed.
I really hope I save myself from the rock throwing.
FWIW lemmatization may be a good alternative to stemming. Stemming is algorithmic and can generate errors, as you point out; "caring" for example might naively be stemmed to "car". Lemmatization uses a dictionary of senses and their root words to avoid this. For common English, there's Princeton's Wordnet (https://wordnet.princeton.edu/). Lemmatizing technical niches, like medicine, would require an additional dictionary.
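A toy comparison of the two approaches (the LEMMA dict here is a stand-in for a real lexicon like WordNet, not an actual implementation of it):

```python
# Naive suffix-stripping stemmer vs. a tiny dictionary lemmatizer.
def naive_stem(word):
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

# Stand-in lexicon: a real lemmatizer would load thousands of entries.
LEMMA = {"caring": "care", "cared": "care", "cares": "care"}

def lemmatize(word):
    return LEMMA.get(word, word)

print(naive_stem("caring"))  # 'car'  -- the over-stemming collision
print(lemmatize("caring"))   # 'care'
```

The trade-off is the usual one: the stemmer needs no data but makes algorithmic errors, while the lemmatizer is only as good as its dictionary's coverage.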
Sphinx is now Manticore, and as luck has it, a Manticore dev is in this thread, offering support: https://news.ycombinator.com/item?id=25890828
Nowadays I just go to LibGen when I want to have a look into a book.
That LibGen doesn’t limit itself to works in the public domain is rather a feature ;)
I was quite sad to read about the case in 2018, and it is unfortunate that it is still not resolved.
„On February 9 2018, the Court issued a judgment granting essentially most of the Plaintiff's demands. The Court did not order that the 18 items no longer be made available by Project Gutenberg, and instead wrote that it is sufficient to instead make them no longer accessible to German Internet (IP) addresses.
PGLAF complied with the Court's order on February 28, 2018 by blocking all access to www.gutenberg.org and sub-pages to all of Germany.“, from https://cand.pglaf.org/germany/index.html
When I lived in London I stopped buying Time Out to find out what's on, and instead simply paid closer attention to Tube posters.
This is much closer to what I’m looking for.
EDIT: and it’s going to be open sourced? I love it
I've begun adding to this repository, it'll come in piece by piece as I clean up the code: https://github.com/cordb/gutensearch
The main performance issue with the Postgres FTS approach (possibly also with the others?) is ranking. Matching rows uses the index, but ts_rank cannot.
Most of the time, few results are returned and the front end gets its answer in ~300ms including formatting the text for the front end (~20ms without).
However, a reasonably common sentence will return tens or hundreds of thousands of rows, which takes a minute or more to get ranked. In production, this could be worked around by tracking and caching such queries if they are common enough.
I'd love to hear from anyone experienced with the other options (Lucene, Solr, ElasticSearch, etc.) whether and how they get around this.
https://news.ycombinator.com/item?id=6562126 (the link does not load for me)
Subset of sudo lshw --short:
processor Intel(R) Atom(TM) CPU C2750 @ 2.40GHz
memory 16GiB System Memory
disk 256GB Micron_1100_MTFD
It seems like you could do it more easily, include all recent additions, and have faster search responses:
1. Mirror the current gutenberg archive (e.g. rsync -av --del aleph.gutenberg.org::gutenberg gutenberg)
2. Install recoll-webui from https://www.lesbonscomptes.com/recoll/pages/recoll-webui-ins... or using docker-recoll-webui: https://github.com/sunde41/recoll
3. Run the recoll indexer
4. Each week, repeat steps #1 and #3
Can you recall the steps that led to the error?
It is an amazing site for sure. I am very happy one can just search for any phrase and get results in a matter of seconds! Allowing fuzzier results is a very useful feature, I must say.
The site could use a few improvements on the CSS side, as you may have already noticed (search boxes get smaller while typing; a dark theme would be nice; the results table could be cleaned up; etc.). Also, the "Discovery" section was not displaying the correct results, although the query executed for some time (I searched "days of our lives" in place of "call me Ismael"). With the same parameters, the Search tab shows results anyway, so not a big deal.
Please release the pg_dump some day. It would be extremely useful.
"Days of our lives" returns fine on the Search tab and times out (as I assumed it would, with such common words) on the Discovery tab because the broader plainto_tsquery is returning too many rows to rank in time.
...which made me think, Discovery ranks by random(), so I do not need to calculate ts_rank... time to push a quick fix!
Regarding pg_dump: I originally wanted to do this, but at 60GB a pop the bandwidth would be quite expensive if the thing got popular at all; I recommend you head to the repository and build it locally instead.
It has to do (I think) with Dash's columnar layout, which unfolds the menu over the next few columns, at least in Chrome.
The quick workaround I found was to type out the language until it appears below, then either click on it or finish typing and press Enter. That should select it.
I'd love to hear from other Dash developers who've had, and solved this issue.
There's a workaround to the dropdown issue, see this other comment thread: https://news.ycombinator.com/item?id=25890458