Show HN: Full text search Project Gutenberg (60m paragraphs) (gutensearch.com)
202 points by gutensearch 9 months ago | 57 comments



Hi. Nice project! I don't know why there's no BM25 (or at least SOME TF-IDF implementation) in Postgres FTS, but if you decide you need it (and more language support, highlighting, and lower response times) ping us at contact@manticoresearch.com and we'll help you with integrating your Postgres dataset with Manticore Search. 60M docs shouldn't be a problem at all (it should take about an hour to build an index) and you'll get proper ranking and nice highlighting with just a few lines of new code. Here's an interactive course about indexing from MySQL (https://play.manticoresearch.com/mysql/), but with Postgres it's the same.


Thank you for your generous offer of help! I look forward to taking it up (may take a while as I'm about to move countries and quarantine).

In particular I love that one of the examples in your comment history is in Latin as that language is not currently supported by Postgres FTS. Are Latin and Ancient Greek supported by Manticore? (dare I hope for Anglo Saxon...)


In terms of advanced NLP (stemming, lemmatization, stopwords, wordforms) - no. In terms of just general tokenization - I've never dealt with Latin and Ancient Greek characters (if there are specific characters for those languages), but even if they are not supported by default it's not a problem to add them in the config (https://mnt.cr/charset_table)


For the character mappings, it might be useful to have a look at the config for https://tatoeba.org (or rather, the PHP script that generates the config): https://github.com/Tatoeba/tatoeba2/blob/dev/src/Shell/Sphin...

There's one big list of mappings for almost every script under the sun, including Greek. (With mappings like 'U+1F08..U+1F0F->U+1F00..U+1F07' turning U+1F08 Ἀ [CAPITAL ALPHA WITH PSILI] into U+1F00 ἀ [SMALL ALPHA WITH PSILI], and the same for seven other accented alphas. I've considered turning them all into unaccented alpha instead, but I don't know enough about Greek orthography to decide that.) https://github.com/Tatoeba/tatoeba2/blob/3170f7326ad2939c691...

For Latin, there are some special exceptions so that "GAIVS IVLIVS CAESAR" and "Gaius Julius Caesar" are treated the same: https://github.com/Tatoeba/tatoeba2/blob/3170f7326ad2939c691...

It's not beautiful, but it's used in production. People who don't need to support quite as many languages as Tatoeba will probably want a simpler config, but it might still be useful as a reference.


This is a very useful tool. I’ve known people doing research on the evolution of grammar, vocabulary, literary style, etc. who use only small subsets of the Project Gutenberg data. I'm sure they would appreciate being able to search the entire corpus.

The corpus-search functions those researchers use include wildcards, exact-phrase specification with quotation marks, proximity searches, and Boolean search strings. When you have a chance, you might want to add a list of the syntax formats that currently work. (I tried using * as a wildcard in a phrase surrounded by quotation marks, and it didn’t seem to work.)

One small improvement you could make would be to widen the “Search terms” field so that longer search strings are visible.


Surely someone doing serious research would download the entire corpus and use grep (or equivalent)?

It wouldn't have the same nice interface and searches may take several seconds, but there are only 60,000 books...


Thank you for taking the time to lay out feature requests in such detail! I really appreciate it.

The current search box is a wrapper around Postgres phraseto_tsquery [1] whilst the Discovery tab uses plainto_tsquery, so you could play with either as an ersatz for some of these features for now, although special characters might get stripped or parsed incorrectly.
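If you want to experiment with the difference between the two parsers, here is a rough sketch (the output shapes are approximate and depend on the Postgres version and dictionary):

```sql
-- phraseto_tsquery keeps positional information, so terms must occur
-- as a phrase; plainto_tsquery simply ANDs the lexemes together.
SELECT phraseto_tsquery('english', 'days of our lives');
-- roughly 'day <3> live' (stop words dropped, positions preserved)

SELECT plainto_tsquery('english', 'days of our lives');
-- roughly 'day & live'
```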

Do you know where the people you are talking about hang out online (for example, subreddits)? I'd love to get in touch with them once the features are built and for more general feedback.

[1] https://www.postgresql.org/docs/12/textsearch-controls.html


I'm curious if you ever tried MeiliSearch for this, I tried it recently for something unrelated and had a very good experience with it. Since you have the corpus already, it might be worth trying and seeing if it speeds search up? I'd be interested in the results either way.


Thank you for the suggestion! The demo looks great and I'm curious to see how they automatically handle language. It would be nice to add support for Chinese, which is the great absentee from this attempt, even at the cost of several other languages which in any case have few books transcribed to text in Gutenberg.

I will probably write a blog post once I've tried a few of the approaches suggested in this thread.


That would be great, thanks!


I know /r/CompLing on Reddit is quite popular.


Do you know what tools they're using for that?


I’m afraid it’s been a few years since I heard their research presentations. The only specific tool I remember is AntConc [0], as I happen to know the developer. (I teach at a university in Japan, as does he.) But I think they mentioned other concordance and corpus-related tools as well.

Most of the researchers I remember were literature scholars. While they were comfortable using computers, not many of them seemed savvy enough to write their own programs, run their own servers, or even use grep.

[0] https://www.laurenceanthony.net/software/antconc/


What a useful thing! It would be much better if it were possible to pass the search query using GET requests.

e.g. https://gutensearch.com/?q=avicenna


An API is already on my roadmap! I couldn't quite figure out the state of passing query parameters in the URL with Dash, though.


Having an API is good but not suitable for what I mean.

I'm not familiar with Dash, but in the worst case I think you can add a route to nginx (an additional app) to pass GET parameters to the app.


"What is a cynic", the famous line from Lady Windermere's Fan, comes up empty. You can find it here: https://www.gutenberg.org/files/790/790-h/790-h.htm

The next line I can find though. How are you parsing?


Thanks for noticing!

To be more specific, "what is a" are stop words and "cynic" is very common, so a lot of rows are returned (see my other comment). ts_rank takes too long to rank them, and the server times out, leaving you with the previous query's table because I didn't take the time to program a correct response to this issue. "Cecil Graham. What is a cynic?" returns Lady Windermere's Fan almost instantly.
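To see the effect of the stop words directly (output hedged; it depends on the dictionary configuration):

```sql
SELECT phraseto_tsquery('english', 'what is a cynic');
-- 'what', 'is' and 'a' are stop words in the english configuration,
-- so only 'cynic' should survive in the resulting tsquery
```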

The workarounds I've thought of would be to cache these queries (assuming I've seen them before, and after I've set up logging), buy a larger server, or pay Second Quadrant to speed up ts_rank... I'd love any suggestions from more experienced Postgres engineers!



Edit to your edit, re parsing. The subset of rows returned follows:

  where language = %s::regconfig and textsearchable_index_col @@ phraseto_tsquery(%s::regconfig, %s)
and relevance is determined by:

  ts_rank_cd(textsearchable_index_col, phraseto_tsquery(%s::regconfig, %s), 32)
with the %s being language and paragraph respectively.


I have a few suggestions regarding search performance.

As others have mentioned, there isn't one "right answer" but there are a few nifty tricks you can use to boost performance.

You could try a trigram index[0] as the primary search index. You'd lose stemming, but gain broader language support and indexed regular-expression search, which is a nice power-user feature.
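A sketch of the trigram approach, reusing the table and column names that appear later in this thread (untested against the actual schema):

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Index the raw text; this lets ILIKE '%...%' and regex matches (~, ~*)
-- use the index instead of a sequential scan.
CREATE INDEX paragraphs_trgm_idx
    ON gutenberg.paragraphs
    USING gin (paragraph gin_trgm_ops);

SELECT num, paragraph
FROM gutenberg.paragraphs
WHERE paragraph ILIKE '%cynic%'
LIMIT 20;
```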

Looking at your where clause, did you index language and textsearchable_index_col together? If not, you are giving the planner the opportunity to skip your full-text index.

If you order by ts_rank_cd then the query will sort all rows, including those that don't match. A better pattern is to take the ranking score as a column and then sort it in a subquery; [1] from Stack Overflow has an example. (As an aside, from PG 12, CTEs are no longer an optimization fence, so you can write the query out more cleanly with a CTE and still get the desired performance.)
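The subquery pattern from [1], sketched against the schema described elsewhere in the thread (a hypothetical shape, not tested here):

```sql
SELECT num, paragraph, rank
FROM (
    SELECT p.num, p.paragraph,
           ts_rank_cd(p.textsearchable_index_col, q, 32) AS rank
    FROM gutenberg.paragraphs p,
         phraseto_tsquery('english', 'what is a cynic') AS q
    WHERE p.textsearchable_index_col @@ q  -- index-assisted match first
) ranked
ORDER BY rank DESC  -- rank only the matching rows
LIMIT 10;
```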

You should experiment with GIN vs GiST indices. GIN indexes are larger, so on a large dataset the index might not fit in memory. You could pay more to have more memory, but it's worth trying a GiST index to see if that makes things faster simply because it fits in memory.
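One hypothetical way to compare the two on a copy of the data (index names made up):

```sql
CREATE INDEX paragraphs_gin_idx  ON gutenberg.paragraphs USING gin  (textsearchable_index_col);
CREATE INDEX paragraphs_gist_idx ON gutenberg.paragraphs USING gist (textsearchable_index_col);

-- Compare on-disk sizes, then EXPLAIN ANALYZE the same query against each.
SELECT indexrelname, pg_size_pretty(pg_relation_size(indexrelid))
FROM pg_stat_user_indexes
WHERE indexrelname IN ('paragraphs_gin_idx', 'paragraphs_gist_idx');
```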

A final frontend comment: I'm a fan of infinite scroll for this kind of thing. You already have pagination effectively set up; you could spoil me as a user with infinite scroll. react-virtualized[2] is a great library for that.

Hope that helps.

[0] https://www.postgresql.org/docs/9.6/pgtrgm.html

[1] https://dba.stackexchange.com/questions/4054/ts-rank-massive...

[2] https://github.com/bvaughn/react-virtualized


Thanks for the suggestions! I will try them next.

> Looking at your where clause, did you index language and textsearchable_index_col together? If not, you are giving the planner the opportunity to skip your full-text index.

Here is the index:

  update gutenberg.paragraphs
  set textsearchable_index_col = to_tsvector(b.cfgname::regconfig, coalesce(paragraph, ' '))
  from
  gutenberg.all_data a 
  inner join pg_ts_config b on lower(a.language) = b.cfgname 
  where gutenberg.paragraphs.num = a.num;
  create index textsearch_paragraph_idx on gutenberg.paragraphs using gin (textsearchable_index_col);
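For a combined index on both columns, one option I haven't tried would be btree_gin, casting the regconfig column so it gets a GIN-compatible opclass (hypothetical, untested):

```sql
CREATE EXTENSION IF NOT EXISTS btree_gin;

CREATE INDEX textsearch_lang_paragraph_idx
    ON gutenberg.paragraphs
    USING gin ((language::text), textsearchable_index_col);

-- Queries would then need to filter on language::text = 'english'
-- for the planner to consider this index.
```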
> If you order by ts_rank_cd then the query will sort all rows, including those that don't match. A better pattern is to take the ranking score as a column and then sort it in a subquery; [1] from Stack Overflow has an example. (As an aside, from PG 12, CTEs are no longer an optimization fence, so you can write the query out more cleanly with a CTE and still get the desired performance.)

Ranking was already done as a column in a CTE and ranked further down:

  with paragraphs as (
                select 
                num
                , paragraph
                , ts_rank_cd(textsearchable_index_col, phraseto_tsquery(%s::regconfig, %s), 32) as rank 
                , ts_headline(%s, paragraph, phraseto_tsquery(%s::regconfig, %s), 'MaxFragments=1000, StartSel=**, StopSel=**') as highlighted_result 
                from gutenberg.paragraphs 
                where language = %s::regconfig and textsearchable_index_col @@ phraseto_tsquery(%s::regconfig, %s)
            ) 
I did rewrite the query as per your [1] but EXPLAIN ANALYZE found no difference.

Pagination is a tricky one. I was in a rush and hit an issue where building large Dash tables was very expensive and prone to crashing. I initially set up pagination as per the Dash examples, but that didn't play nice with the markdown I needed for the URLs and highlighting (giving these up for plain-text tables made it HN-fast, but it confused my trial users). So the quick and dirty solution "to ship now" was to set up custom pagination in the SQL and show the results.

I think that if there is enough interest in this project, the front end will have to move off Dash.


Full-text searching is contentious; it's a rabbit hole of holy wars.

I don't believe I'm going to summon cranks with the following:

There are more specialized tools for your use case than Postgres, and you should be looking into "n-gram indexing".

Lucene-based systems such as Elasticsearch are quite popular; there are also Sphinx and Xapian, which are fairly widespread.

You need to read the documentation and configure them; they are flexible and need to be tuned to this use case.

In the end, there is no "correct" way to do things. For instance, sometimes stemming words is the right way to go, but if you are, say, doing a medical system where two very different medicines could be spelled similarly and stem to the same word, a mistake could lead to death, so no, stemming is very bad here.

Sometimes stop words are the way to go, but if people are looking for quotes, such as "to be or not to be", well, now you have the empty string. Splendid.

So yeah, configure configure configure.

This may bring out the cranks:

If you want to roll your own, NoSQL systems like Redis, Mongo, or Couch seem to work really well (I've rolled my own in all three on separate occasions).

I guarantee there are advanced features of MariaDB and Postgres that aren't widely used, and some people reading this will confidently claim they are superior, but I assure you that is a minority opinion. Most people go with the other options.

If you ever doubt it, ask the actual developers of Postgres or MariaDB on chat. They are extremely nice and way more honest about the limitations than their hardcore fans are. The databases are under constant development with advanced features and you'll learn a lot (really, they are both super chill dev communities, impressively so).

Perhaps your solution (as mine has been) is a hybrid. You can store existence, for instance, in one system and the offsets and chunks in another, so you get a parallelizable, asynchronous worker pipeline; it's impressively fast when you horizontally scale it. <0.5 s for multiple terabytes of text (and yes, I'm talking NVMe, hundreds of GB of RAM per node, >=10 Gb networking). I've legitimately just run random queries to marvel at the speed.

I really hope I save myself from the rock throwing.


> For instance, sometimes stemming words is the right way to go, but if you are, say, doing a medical system where two very different medicines could be spelled similarly and stem to the same word...

FWIW lemmatization may be a good alternative to stemming. Stemming is algorithmic and can generate errors, as you point out; "caring" for example might naively be stemmed to "car". Lemmatization uses a dictionary of senses and their root words to avoid this. For common English, there's Princeton's Wordnet (https://wordnet.princeton.edu/). Lemmatizing technical niches, like medicine, would require an additional dictionary.


> there's also sphinx and xapian, also fairly widespread.

Sphinx is now Manticore, and as luck has it, a Manticore dev is in this thread, offering support: https://news.ycombinator.com/item?id=25890828


Thanks! I really appreciate the pointers. I already had planned to explore some of these and you've expanded and directed the search nicely.


Oh, so Project Gutenberg is still a thing? I used to use it until they blocked German users (context: they were asked to prevent access to certain pieces for German users but they decided to go nuclear instead).

Nowadays I just go to LibGen when I want to have a look into a book. That LibGen doesn’t limit itself to works in the public domain is rather a feature ;)


And this is why the server is in Amsterdam, even though I have had good experiences with Hetzner in the past.

I was quite sad to read about the case in 2018, and it is unfortunate that it is still not resolved.


Hetzner has locations in Finland too, if that would work.


What's the German story?


Work that’s PD in the US but not in Germany. German rights-holders took PG to court (or rather, a poor German sysadmin for PG), court demanded PG pull the titles for Germany, PG refused saying that it was down to the users to determine PD status for their home country, and because of HTTPS the German court’s only option was to block PG at the domain level.


Project Gutenberg themselves decided to block all access from Germany instead of just the items in question.

"On February 9 2018, the Court issued a judgment granting essentially most of the Plaintiff's demands. The Court did not order that the 18 items no longer be made available by Project Gutenberg, and instead wrote that it is sufficient to instead make them no longer accessible to German Internet (IP) addresses.

PGLAF complied with the Court's order on February 28, 2018 by blocking all access to www.gutenberg.org and sub-pages to all of Germany.", from https://cand.pglaf.org/germany/index.html


Helpful page, with much useful information.


Never ceases to amaze me how one can be 'into' a subject and miss huge 'in plain sight' news.

When I lived in London I stopped buying Time Out to find out what's on, and instead simply paid closer attention to Tube posters.


I love this. So often when I search for passages I get bombarded with links to quote websites.

This is much closer to what I’m looking for.

EDIT: and it’s going to be open sourced? I love it


Thanks! I had the exact same problem and eventually it got me to do something about it. It is particularly bad with writers from antiquity or with a lot of popular appeal.

I've begun adding to this repository, it'll come in piece by piece as I clean up the code: https://github.com/cordb/gutensearch


I'd love to see this dataset used as a performance and relevance benchmark for different search engines!


That was definitely part of the original plan! I spotted two other attempts [1] [2] here using BERT and ElasticSearch respectively.

The main performance issue with the Postgres FTS approach (possibly also the others?) is ranking. Matching results uses the index, but ts_rank cannot use it.

Most of the time, few results are returned and the front end gets its answer in ~300ms including formatting the text for the front end (~20ms without).

However, a reasonably common sentence will return tens or hundreds of thousands of rows, which takes a minute or more to get ranked. In production, this could be worked around by tracking and caching such queries if they are common enough.

I'd love to hear from anyone experienced with the other options (Lucene, Solr, ElasticSearch, etc.) whether and how they get around this.

[1] https://news.ycombinator.com/item?id=19095963

[2] https://news.ycombinator.com/item?id=6562126 (the link does not load for me)


I suggest having a look at https://github.com/postgrespro/rum if you haven't yet. It solves the issue of slow ranking in PostgreSQL FTS.
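For reference, the RUM approach looks roughly like this (a sketch based on the extension's README, not tested here):

```sql
CREATE EXTENSION rum;

CREATE INDEX paragraphs_rum_idx
    ON gutenberg.paragraphs
    USING rum (textsearchable_index_col rum_tsvector_ops);

-- RUM stores positional information in the index, so ordering by the
-- <=> distance operator avoids a separate ts_rank pass over every match.
SELECT num, paragraph
FROM gutenberg.paragraphs
WHERE textsearchable_index_col @@ phraseto_tsquery('english', 'what is a cynic')
ORDER BY textsearchable_index_col <=> phraseto_tsquery('english', 'what is a cynic')
LIMIT 10;
```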


What kind of hardware are you using to host the Postgres instance?


Same place as the app: a Start-2-M-SSD from online.net in their AMS1 DC (Amsterdam).

Subset of sudo lshw --short:

  Class          Description
  ======================================================
  processor      Intel(R) Atom(TM) CPU  C2750  @ 2.40GHz
  memory         16GiB System Memory
  disk           256GB Micron_1100_MTFD


This is really cool. Something like this should exist.

It seems like you could do it more easily, include all recent additions, and have faster search responses:

1. Mirror the current gutenberg archive (e.g. rsync -av --del aleph.gutenberg.org::gutenberg gutenberg)

2. Install recoll-webui from https://www.lesbonscomptes.com/recoll/pages/recoll-webui-ins... or using docker-recoll-webui: https://github.com/sunde41/recoll

3. Run the recoll indexer

4. Each week, repeat steps #1 and #3


I am sure the Postgres full text search is pretty good by now. But I always expect projects like these to eventually reluctantly switch over to Elasticsearch in order to handle 10-20% of the queries better.


Seems like a nice project. However, I am currently receiving PR_CONNECT_RESET_ERROR on Firefox. Bringing it to your attention.


Interesting! I don't recall seeing this before, and the app otherwise held up to the HN hug of death which was unexpected given nothing is optimised.

Can you recall the steps that led to the error?


I was not able to access it until I checked last night. I was very interested in implementing this project myself, but did not have enough SQL skills to go about it, hence I was naturally curious. It was working for a brief period when I accessed it, but it seems to be down for me again.

It is an amazing site for sure. I am very happy one can just search for any phrase and it shows up with results in a matter of seconds! Allowing fuzzier results is a very useful feature I must say.

The site could use a few improvements on the CSS side, as you may have already noticed (search boxes get smaller while typing, a dark theme, cleaning up the results table, etc.). Also, the "Discovery" section was not displaying the correct results although the query executed for some time (I searched "days of our lives" in place of "call me Ismael"). With the same parameters, the search engine shows results anyway, so it's not a big deal.

Please release the pg_dump some day. It would be extremely useful.


Thanks for the details! I tried your search string but could not replicate the exact bug (either app not working at all, or your error message). What is your setup and connection speed? Please feel free to email me instead at contact@ if you would prefer to preserve your privacy.

"Days of our lives" returns fine on the Search tab and times out (as I assumed it would, with such common words) on the Discovery tab because the broader plainto_tsquery is returning too many rows to rank in time.

...which made me think, Discovery ranks by random(), so I do not need to calculate ts_rank... time to push a quick fix!
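Concretely, the quick fix amounts to dropping the ranking function from the Discovery query, something of this shape (a sketch, not the exact production SQL):

```sql
SELECT num, paragraph
FROM gutenberg.paragraphs
WHERE language = 'english'::regconfig
  AND textsearchable_index_col @@ plainto_tsquery('english', 'days of our lives')
ORDER BY random()
LIMIT 10;
```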

Regarding pg_dump, I originally wanted to do this, but at 60GB a pop the bandwidth would be quite expensive if the thing got popular at all; I recommend you head to the repository [1] and build it locally instead.

[1] https://github.com/cordb/gutensearch


I enjoy being able to grep the 200,000 books I have downloaded as part of the Pile that was on HN a while back. It allowed me to do things like "show me all the paragraphs where 'Edmund Burke' and 'patriotism' appear close to each other." This seems to be a similar thing, just with fewer books.


Great! It is not far from one of my dream projects: digitising as many old texts as possible and letting historians do research on them using state-of-the-art tools (that work across synonyms and languages), with parameters to restrict by time of publication and location, obviously.


Have you considered working with the Internet Archive on this across their corpus? They are open to such work being done. And if some of the material you need isn’t in the archive, let’s get it in there.


I have not but I am going to file the idea, it would indeed be a good starting point.


I cannot choose another language; some weird CSS/HTML issue hides all the other fields. Looks like an overflow:hidden hides the possible selections.


Thanks for the bug report!

It has to do (I think) with Dash's columnar layout, which unfolds the menu over the next few columns, at least in Chrome.

The quick workaround I found was to type out the language until it appears below and click on it or finish typing, then press enter. This should select it.

I'd love to hear from other Dash developers who've had, and solved this issue.


Hm. I cannot do language selection on iOS Safari. Not even with the mentioned workaround of first typing the language-name prefix.


This is indeed very useful. But I think there are some bugs: for example, when I click on the dropdown to choose a language it breaks (malfunctions).


Thanks!

There's a workaround to the dropdown issue, see this other comment thread: https://news.ycombinator.com/item?id=25890458


Nice project! You might want to ask a designer to give it a look. The UI is a little cluttered.


Thanks! I would love for an experienced designer to join the project (and DevOps, and front end, and...). My email is in the profile.



