
Making the Internet Archive’s full text search faster - danso
http://gio.blog.archive.org/2017/05/31/making-the-internet-archive-full-text-search-faster/
======
superkuh
This is pretty cool. I just wish these recent "improvements" to the
archive.org site weren't also associated with a massive decrease in
accessibility and usability.

The switch to a JavaScript-rendered site frontend has made the Wayback
Machine pretty much inaccessible to all but the most modern and bloated
browsers. Emails to support about keeping the old site going at an alternate
URL, or about fixing fall-back rendering/operation for no-JS, have been met
with silence.

I guess I'm just not in the demographic archive.org is targeting now despite
being a heavy user for more than a decade. It's not the backend guy's fault
but it does mean lots of us won't benefit from any backend improvements.

------
gus7
15-25TB to provide search for the entire Internet Archive. This is a pretty
small number if you think about it. Your avg undergrad classroom would be
able to match it if they pooled all their devices together. So why do we need
Google and Bing, ladies and gents? If we could get 100 devs in every city in
every corner of the planet to donate a hard drive/CPU and some time,
distributed open search could well do to Google what Linux did to Microsoft.
I think it's going to happen. There is a lot of duplication of the above work
going on in little silos like Wikipedia, Khan Academy, Stack Overflow,
Zealdocs etc. It just has to get hooked up.
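
Back-of-envelope, with made-up but plausible numbers:

```python
# Back-of-envelope check of the classroom claim; these numbers are
# assumptions, not from the article.
students = 100              # one big undergrad lecture
spare_tb_per_device = 0.25  # ~250 GB of free disk per laptop
pooled_tb = students * spare_tb_per_device
print(f"pooled capacity: {pooled_tb:.0f} TB")  # 25 TB, the top of the 15-25TB range
```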

~~~
btian
You know Google is free, right? No need to donate resources. And latency
would be horrible if you got 100 devs in every city to donate resources.

~~~
kontai
The main point about search is valid. Centralized search might have been
required as the internet exploded; not anymore. I work on Wall St and there
is a lot of curated, distributed search index sharing going on.

~~~
btian
Any links to what's being shared?

How is the content crawled?

------
gioark
Hi, I am the author of the article.

@devhead: To import the data into ES we used a custom application to extract
the text from the OCR'd documents. This is required to support our bookreader
software. A complete ingestion takes a few days; we rate-limit indexing so as
not to overload the cluster and to maintain reasonable search performance.
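
Very roughly, the ingestion loop has this shape (a simplified sketch, not
our actual code; the index name, schema, and the load_ocr_texts() stand-in
are illustrative only):

```python
# Rate-limited bulk indexing with elasticsearch-py (simplified sketch).
import time
from elasticsearch import Elasticsearch
from elasticsearch.helpers import streaming_bulk

es = Elasticsearch("http://localhost:9200")

def load_ocr_texts():
    # Stand-in for the custom OCR text extraction step.
    yield "full text of one scanned document"

def actions():
    # One indexing action per extracted document.
    for i, text in enumerate(load_ocr_texts()):
        yield {"_index": "fulltext", "_id": i, "_source": {"body": text}}

indexed = 0
for ok, _ in streaming_bulk(es, actions(), chunk_size=500):
    indexed += ok
    if indexed and indexed % 5000 == 0:
        # Crude throttle: pause periodically so the cluster keeps
        # serving searches with reasonable latency during ingestion.
        time.sleep(1.0)
```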

~~~
mdellabitta
Hey, I'm wondering why you didn't consider using stopwords to prevent bloated
inverted index entries for words like 'the'?

~~~
gioark
We don't use stopwords because we want to find all the best and most complete
matches. We don't want to ignore any of the words that are part of the search
query.
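
For example, an index whose analyzer applies no stop filter at all looks
like this (a simplified sketch in elasticsearch-py 8.x style; the index and
analyzer names are illustrative, not our real config):

```python
# Create an index whose analyzer keeps every token, including "the".
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
es.indices.create(
    index="fulltext",
    settings={
        "analysis": {
            "analyzer": {
                "no_stop": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase"],  # note: no "stop" filter here
                }
            }
        }
    },
    mappings={
        "properties": {"body": {"type": "text", "analyzer": "no_stop"}}
    },
)
```

(ES's standard analyzer actually ships with stopword removal disabled by
default, so this is mostly about not opting in.)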

~~~
aisofteng
You do use stopwords. Your most common unigrams are not in the index, by
design. You just use your own stopwords.

------
gioark
If you are having problems loading the page, try this url:
https://medium.com/@giovannidamiola/making-the-internet-archives-full-text-search-faster-30fb11574ea9

~~~
onurcel
Making the Internet Archive’s blog faster

------
devhead
Awesome job IA, glad to see you were able to push through it and get your
performance to an acceptable level!

It'd be awesome if you could share how you imported the data into ES and how
long it took to fully process that large dataset.

------
kunthar
I think you may not have heard of Riak with a Solr backend. You could reach
all of your goals with very little effort. PM me if you'd like some help with
installation and setup; I'd be glad to help the web archive folks.

------
wmu
A nice article, and very good final result. (I love the cow pictures :))

