Hacker News
Making the Internet Archive’s full text search faster (gio.blog.archive.org)
127 points by danso on June 9, 2017 | 21 comments



This is pretty cool. I just wish these recent "improvements" to the archive.org site weren't also associated with a massive decrease in accessibility and usability.

The switch to a javascript-rendered site frontend has made using the wayback machine pretty much inaccessible to all but the most modern and bloated of browsers. Emails to support about keeping the old site going at an alternate URL or fixing fall-back rendering/operation for no-JS have been met with silence.

I guess I'm just not in the demographic archive.org is targeting now despite being a heavy user for more than a decade. It's not the backend guy's fault but it does mean lots of us won't benefit from any backend improvements.


15-25TB to provide search for the entire Internet Archive. This is a pretty small number if you think about it. Your average undergrad classroom would be able to match it if they pooled all their devices together. So why do we need Google and Bing, ladies and gents? If we can get 100 devs in every city in every corner of the planet to donate a hard drive/CPU and some time, distributed open search could well do to Google what Linux did to Microsoft. I think it's going to happen. There is a lot of duplication of the above work going on in little silos like Wikipedia, Khan Academy, Stack Overflow, Zeal docs, etc. It just has to get hooked up.


I'd love to see fulltext someday on their Wayback Machine WWW corpus, not just their side projects. I've used it heavily in researching various things but the limitation to querying by domain or URL is highly restrictive: I'm certain that many of the pages I've failed to find are present in the Wayback Machine at other URLs, either after moving or being quoted or mirrored.

(For example, Google's Blogspot has this amazingly horrible thing of providing the same blog across dozens of national TLDs like 'blog.com', 'blog.co.uk', 'blog.fr' etc - so there may be a full copy of that blog in the WM except it's split across dozens of domains and the Wayback Machine forces you to check every possible permutation of domain name since it can't do a fulltext search.)
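The manual workaround described above (checking every national domain by hand) can at least be scripted. A minimal sketch, assuming the blog lives under a `blogspot.<suffix>` hostname; the function name `candidate_urls` and the short suffix list are illustrative assumptions, not a complete list of Blogspot domains and not anything the Wayback Machine provides:

```python
def candidate_urls(blog_name, path="/", suffixes=("com", "co.uk", "fr", "de")):
    """Return one candidate URL per country suffix for a Blogspot-style blog.

    The suffix tuple is a small illustrative sample (an assumption); extend
    it with whatever national domains matter for the blog being researched.
    """
    if not path.startswith("/"):
        path = "/" + path
    return [f"http://{blog_name}.blogspot.{suffix}{path}" for suffix in suffixes]


# Each returned URL would still have to be looked up individually,
# which is exactly the limitation the comment complains about.
for url in candidate_urls("example", "2017/06/some-post.html"):
    print(url)
```

This only enumerates where to look; it does not replace a real full-text search over the corpus.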


You know Google is free right? No need to donate resources. And latency will be so horrible if you get 100 devs in every city to donate resources.


The main point about search is valid. Centralized search might have been required as the internet exploded. Not anymore. I work on Wall St and there is a lot of curated distributed search index sharing going on.


Any links to what's being shared?

How is the content crawled?


Note that this search is in everything the Archive backs up except the Web.


hi, I am the author of the article.

@devhead: To import the data into ES we used a custom application to extract the text from the OCR'd documents. This is required to support our bookreader software. A complete ingestion takes a few days; we rate-limit indexing in order not to overload the cluster, and maintain reasonable search performance.
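For readers curious what "rate-limit indexing" can look like in practice, here is a minimal sketch. It is not the Archive's ingestion application, which is not public. The `send_batch` callable stands in for whatever actually writes a batch to the cluster, and `batch_size`, `max_docs_per_sec`, `clock` and `sleep` are hypothetical parameters chosen for illustration:

```python
import time


def batched(items, size):
    """Yield successive lists of at most `size` items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def ingest(docs, send_batch, batch_size=500, max_docs_per_sec=1000.0,
           clock=time.monotonic, sleep=time.sleep):
    """Send docs in batches, pausing so throughput stays under the cap.

    `send_batch` is a placeholder (assumption) for the real indexing call.
    """
    # Minimum wall-clock time one full batch is allowed to take.
    min_interval = batch_size / max_docs_per_sec
    sent = 0
    for batch in batched(docs, batch_size):
        started = clock()
        send_batch(batch)
        sent += len(batch)
        elapsed = clock() - started
        if elapsed < min_interval:
            # The batch finished early: wait out the rest of the interval
            # so the cluster keeps headroom for search traffic.
            sleep(min_interval - elapsed)
    return sent
```

Passing `clock` and `sleep` as parameters keeps the throttle easy to test without real waiting. A production ingester would also need retries and backpressure handling, which are left out here.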


Hey, I'm wondering why you didn't consider using stopwords to prevent bloated inverted index entries for words like 'the'?


We don't use stopwords because we want to find all the best, most complete matches. We don't want to ignore any of the words that are part of the search query.


You do use stopwords. Your most common unigrams are not in the index, by design. You just use your own stopwords.
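The distinction being argued here (a fixed stopword list versus dropping the most common terms found in your own corpus) can be made concrete with a toy inverted index. This is a sketch under stated assumptions, not the Archive's Elasticsearch configuration: the whitespace tokenizer, the `max_doc_fraction` threshold and the three sample documents are all made up for illustration.

```python
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; a deliberately naive tokenizer."""
    return text.lower().split()


def build_index(docs, max_doc_fraction=0.8):
    """Build an inverted index, leaving out terms that appear in more than
    `max_doc_fraction` of the documents (a corpus-derived stopword list)."""
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for text in docs.values():
        doc_freq.update(set(tokenize(text)))

    cutoff = max_doc_fraction * len(docs)
    stopwords = {term for term, df in doc_freq.items() if df > cutoff}

    # Posting lists: term -> set of document ids containing it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            if term not in stopwords:
                index[term].add(doc_id)
    return dict(index), stopwords


docs = {
    1: "the cow jumped over the moon",
    2: "the archive stores the book",
    3: "a cow in the archive",
}
index, stopwords = build_index(docs)
print(stopwords)     # {'the'}
print(index["cow"])  # {1, 3}
```

Here 'the' is never given a posting list even though nobody wrote it into a stopword list, which is the sense in which a frequency cutoff amounts to "your own stopwords".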


Great article, man. I'm always super excited when I find articles that talk about information retrieval systems. I have a lot of questions for you. I've been working on a search engine project, www.cognifly.com, for a year and its inverted index is still very small, like 4 GB now. So is it OK if I send you an email?


sure, feel free to contact me by email or DM on twitter. gio archive org


It was a great article. On a side note, holy shit is it rare to even hear of other guys named Giovanni, much less as similar a last name (I'm Giovanni d'Amelio).


Wow those ASCII tables look terrible on the iPhone. If I rotate they clean up but when vertical they are unreadable.


cool, thanks for sharing; maybe one day you can release your ingestion app to the world.


If you are having problems loading the page, try this url: https://medium.com/@giovannidamiola/making-the-internet-arch...


Making the Internet Archive’s blog faster


awesome job IA, glad to see you were able to push through it and get your performance to an acceptable rate!

It'd be awesome if you could share how you imported the data into ES and how long it took to fully go through that large dataset?


I think you haven't heard of Riak with a Solr backend. You can achieve all of your goals with very little effort. PM me if you'd like some help with installation and setup. I'd be glad to help the web archive guys.


A nice article, and very good final result. (I love the cow pictures :))



