The Evolution of Marginalia's Crawling (marginalia.nu)
108 points by marginalia_nu on Aug 23, 2022 | 22 comments



I've started to build a search engine as a hobby side project. (I'm three weekends in. It's very much inspired by Marginalia, but with different content, curation, and UX goals, and I have beefy hardware at home that can see it through.) My early crawl architecture plans actually mirror this new scheme, so it's delightful to have a bit of confirmation that I'm on a mildly correct track.

One of my long-running side projects has been heavy web-scraping of a site with continually evolving content, related to the economics of a side business, and yeah, we hit a lot of the growing pains of a monolithic scraper (poor debugging, unexplained re-crawling, waiting too long for a crawl plan, rare race conditions with multiple scrapers) that this model avoids.

I kinda want to know how the big G does their crawling. I watch all the different googlebots hit my many websites, and I'm curious about their architecture for prioritization and analysis, but a lot of that analysis is definitely the secret sauce that should be kept somewhat proprietary.


I've noticed my search engine index starts to get noticeably stale after about two months, and right now it takes two or so weeks to crawl. Extrapolating, that suggests the current approach is sustainable up to about 4x the size of my index, at which point a full crawl would take roughly as long as the index stays fresh.

Right now a lot of this work is manual. Kinda not long-term viable, but for now it's not too bad: kick off a script every now and again, come back a week later, and see how it went.

After that, there needs to be some sort of prioritization. Maybe a job that randomly probes URLs and flags dead/changed links for priority re-crawling. I think RSS feeds could also be a useful tool for detecting new links.
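Something like this rough probe job is what I have in mind (a sketch in Python with requests; the index shape, sample size, and labels are just illustrative):

    # Sketch of the "random probe" idea: sample known URLs, compare a content
    # hash against what was stored at crawl time, and flag anything dead or
    # changed for priority re-crawling. Names and structures are illustrative.
    import hashlib
    import random
    import requests

    def probe_urls(index, sample_size=100, timeout=10):
        """index: dict mapping url -> content hash stored at crawl time."""
        flagged = []
        for url in random.sample(list(index), min(sample_size, len(index))):
            try:
                resp = requests.get(url, timeout=timeout)
                if resp.status_code >= 400:
                    flagged.append((url, "dead"))
                    continue
                digest = hashlib.sha256(resp.content).hexdigest()
                if digest != index[url]:
                    flagged.append((url, "changed"))
            except requests.RequestException:
                flagged.append((url, "unreachable"))
        return flagged  # feed these into the re-crawl queue first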


Ah yeah, that makes sense. And there's a balance here. IMO, Google puts far too much weight on recent content (and makes it impossible to find that one post you read in 2007 and only remember vague keywords for), but staleness of content is also an issue.

I've built native support for RSS and JSON feeds for discovery on high-quality sites, and I'm also tracking estimated content refresh rate to know when to re-fetch.
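The refresh-rate tracking is nothing sophisticated; roughly this kind of estimate (a sketch only, the names and bounds are made up):

    # Rough sketch of estimating a per-page re-fetch interval from observed
    # change history. Entirely illustrative; the real signals are in flux.
    from datetime import datetime, timedelta

    def next_fetch(change_times, now=None, floor_days=1, ceiling_days=60):
        """change_times: sorted datetimes at which the page was seen to change."""
        now = now or datetime.utcnow()
        if len(change_times) < 2:
            return now + timedelta(days=ceiling_days)   # no evidence it changes often
        gaps = [b - a for a, b in zip(change_times, change_times[1:])]
        mean_gap = sum(gaps, timedelta()) / len(gaps)   # average time between changes
        interval = max(timedelta(days=floor_days),
                       min(mean_gap / 2, timedelta(days=ceiling_days)))
        return now + interval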

(I'm at the point where I've given myself too many signals to weigh, and finding the right tuning is hard while I'm still building the index, actively evolving the schema, and figuring out how to prioritize the content I return in SERPs. My current thinking is a series of differentiated rule sets that populate the crawl queue for different reasons, seeking appropriate breadth, but then I'll definitely hit the re-crawl issue, particularly around sites I'm scoring as high quality.)


> Ah yeah, that makes sense. And there's a balance here. IMO, Google puts far too much weight on recent content (and makes it impossible to find that one post you read in 2007 and only remember vague keywords for), but staleness of content is also an issue.

There's actually a paradox with new content that should make it less interesting to crawl, which is that the odds that content will vanish or change are inversely proportional to its age. If something has been around for 10 years, it's a fairly safe bet it will still be there tomorrow. If something has been around for 10 hours, it's a coin toss.

Aggressively seeking out fresh content is probably a waste. If anything, fresh content should be regarded with suspicion. Maybe probe it a few days later to make sure it's still there before adding it to the index.
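Concretely, the heuristic could be as simple as scaling the revisit interval with observed age and holding brand-new URLs in a short probation queue before indexing. A sketch (the constants are arbitrary):

    # Illustrative only: revisit interval proportional to how long the document
    # has been observed to exist, with a probation delay for brand-new finds.
    from datetime import datetime, timedelta

    PROBATION = timedelta(days=3)      # re-check new content before indexing it

    def revisit_interval(first_seen, now=None, k=0.1, min_days=1, max_days=90):
        """Older documents are revisited less often: interval grows with age."""
        now = now or datetime.utcnow()
        age = now - first_seen
        days = min(max(age.days * k, min_days), max_days)
        return timedelta(days=days)

    def ready_to_index(first_seen, still_there, now=None):
        """Only index content that has survived its probation period."""
        now = now or datetime.utcnow()
        return still_there and (now - first_seen) >= PROBATION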


I'm beginning to think that there's no right answer, because the ideal crawling rules depend a lot on what type of content you're seeking, which obviously can't be known up front.

I'm working on a federated search tool. I've built a few different sectors (I've called them "realms") that I care about (programming, automotive, racing, fediverse) and found that I need pretty different heuristics depending on the realm. Example: car forums circa ~2005 are a treasure trove of valuable information, but 17-year-old posts about programming are (in general) less interesting. That informs a lot about which URLs need, or do not need, to be re-crawled.
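In practice that ends up as per-realm knobs, something along these lines (the values are made up and still being tuned):

    # Purely illustrative per-"realm" crawl settings. max_age_years: how old a
    # page can be and still be worth indexing; recrawl_days: how often to
    # revisit pages in that realm.
    REALM_RULES = {
        "programming": {"max_age_years": 5,  "recrawl_days": 14},
        "automotive":  {"max_age_years": 25, "recrawl_days": 90},  # old forum gold
        "racing":      {"max_age_years": 25, "recrawl_days": 90},
        "fediverse":   {"max_age_years": 1,  "recrawl_days": 3},
    }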


Yeah, probably. Could also be that the answer isn't more complicated than offering the ability to control the search a bit more. Especially with forums it's easy to figure out the post time, since there are only some half-dozen forum software packages in common use. Maybe just adding the option to filter by year or something would go a long way.

I do think offering different filters like that is probably a better option than having the search engine try to mind-read what you want based on spying on your historical queries and machine-learning haruspicy.


Yeah, I don't think the idea has any real value. It kind of falls in between a bookmarking tool and a search engine. But forming search engines from trees of other search engines is kind of cool, and I have stumbled onto knowledge that I wouldn't have otherwise found. I'm not sure what the right balance is between mind reading and offering too many knobs to the user, though.


Kinda wish there were more software architecture discussions.

It's hard to get right, and it has a huge effect on what you're able to do with your software, but it feels like many of us (myself very much included) are just sort of winging it as we go along based on fads, hunches, and whatever random assortment of experiences we've had professionally, which can't amount to very many large projects no matter who you are or how long you've worked anywhere.


I feel like there are already more software architecture discussions out there than a single person could ever read. They're just not indexed well ;)


Out of curiosity, what's the crawl speed of both Marginalia crawlers?

I was inspired to spin up my own crawler after reading some other posts on Marginalia Search. It runs very dumbly, just pulling links from an ever-increasing in-memory set. On a single thread with asynchronous web requests and a massive pool of async workers (10k; RAM is cheap on a personal machine), I've been able to reach around 300-400 requests per second: pulling the page, parsing for <a> tags, and throwing the hrefs onto the stack to crawl next. I find the use of that many dedicated threads really surprising, both because of the increased complexity of threads over async code, and because of my (possibly naive) expectation that network traffic will always out-bottleneck CPU-bound tasks like HTML parsing/lexing/tagging etc.
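For reference, the core loop is roughly this shape (a simplified sketch with aiohttp and BeautifulSoup standing in for whatever you'd actually use; robots.txt handling, politeness delays, and most error handling are omitted):

    # Simplified sketch of a single-threaded async crawl loop: pop a URL from
    # an in-memory frontier, fetch it, extract <a> hrefs, repeat. The real
    # thing would run far more workers and respect robots.txt and rate limits.
    import asyncio
    from urllib.parse import urljoin

    import aiohttp
    from bs4 import BeautifulSoup

    async def worker(session, frontier, seen):
        while frontier:
            url = frontier.pop()
            try:
                async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as resp:
                    if resp.status != 200 or "html" not in resp.headers.get("Content-Type", ""):
                        continue
                    html = await resp.text()
            except Exception:
                continue
            for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
                link = urljoin(url, a["href"])
                if link not in seen:
                    seen.add(link)
                    frontier.add(link)

    async def crawl(seeds, n_workers=100):
        seen, frontier = set(seeds), set(seeds)
        async with aiohttp.ClientSession() as session:
            await asyncio.gather(*(worker(session, frontier, seen) for _ in range(n_workers)))

    # asyncio.run(crawl({"https://example.com/"}))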

I'll admit that I've been dragging my feet on implementing any proper parsing of my own, so I don't have any comparison to draw from. (I tried SQLite, but it clogged up my async code too much with blocking ops, and I'm not excited to try a second time yet.)


In practice, maybe 40-50rps (peaking at 100) for the first design, and 300rps for the second.

Although I'm serving search engine traffic from the same machine, so I'm trying to leave ample bandwidth for that. If I go too fast the NAT starts dropping packets and refusing connections, and that's not great for crawling or serving.


Is there any value in starting with Common Crawl?

https://hn.algolia.com/?query=author%3Amarginalia_nu%20commo... > https://news.ycombinator.com/item?id=32205535#32211292

> It's simply too unwieldy. It's far easier (and cheaper) to do my own crawling at a manageable scale, than it is to work with CC's datasets.

Is there any way to contribute to Common Crawl beyond donating?

https://commoncrawl.org/big-picture/what-you-can-do/


Yeah I stand by that.

I just don't see what Common Crawl would actually help me with, other than making my own data more stale given it would take about as long to download the CC dataset as my own crawl takes to perform (i.e. ~200h).

As it stands, crawling isn't the hard part of building a search engine. Don't get me wrong, if you're doing data science and want to access a crawl data set, Common Crawl is amazing.


@marginalia, can you talk more about how you prioritize the URL frontier?


I use a combination of incoming links and the average ranking of these linking sites to add new sites to the crawl queue. It's not super sophisticated, and I think it matters less the bigger the crawl is.
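In spirit it's something like this (a sketch of the idea only, not the actual implementation; the weighting is made up):

    # Sketch of prioritizing new sites by who links to them: a combination of
    # how many known sites link in and how well those linking sites rank.
    # Illustrative only; not the real weighting.
    def frontier_priority(domain, incoming_links, site_rank):
        """incoming_links: dict new_domain -> set of known domains linking to it.
           site_rank: dict known_domain -> rank score in [0, 1]."""
        linkers = incoming_links.get(domain, set())
        if not linkers:
            return 0.0
        avg_rank = sum(site_rank.get(d, 0.0) for d in linkers) / len(linkers)
        return len(linkers) * avg_rank   # more, better-ranked linkers -> crawl sooner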


How do you keep it "indie"? Even fairly small sites probably link to Forbes or the Atlantic once in a while. Do you have a specific "block" list to keep large, commercial sites out of the results?


I'll crawl those sites too. Most of them will be weeded out in the processing stage, where I exclude websites that have too much heavy duty javascript and tracking and so on. I only index about 20% of the documents I fetch.

Some still slip by, but the ranking algorithm takes care of the rest. I'm using personalized pagerank[1], biased toward a seed set of real human websites, which turns out to rather aggressively promote that kind of site.

[1] http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf (see ch. 6)
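For anyone curious, "personalized" here just means the random surfer's restart jumps land on a hand-picked seed set instead of being uniform over all pages. A toy power-iteration version (not the production implementation) looks roughly like this:

    # Toy power-iteration personalized PageRank: the restart ("teleport")
    # probability mass goes only to a hand-picked seed set, so scores diffuse
    # outward from those known-good sites. Dangling-node mass is simply
    # dropped in this toy version.
    def personalized_pagerank(links, seeds, damping=0.85, iters=50):
        """links: dict node -> list of outgoing neighbours.
           seeds: non-empty set of nodes to bias the ranking toward."""
        nodes = set(links) | {n for outs in links.values() for n in outs}
        restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
        rank = dict(restart)
        for _ in range(iters):
            nxt = {n: (1.0 - damping) * restart[n] for n in nodes}
            for n, outs in links.items():
                if outs:
                    share = damping * rank[n] / len(outs)
                    for m in outs:
                        nxt[m] += share
            rank = nxt
        return rank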


Something about this search is a winner for me; it's killed it for a few terms I've thrown in, like "applicative languages" and "low level".


I'd be curious whether the new architecture can be adapted to content types other than HTML. I think the Fediverse (even if it can be crawled like normal web pages) would benefit from a custom crawler that can jump from inbox to inbox and understands (at least some) ActivityPub JSON-LD.


Yeah, that should be fairly doable.

The beauty of the design is that, since the steps are mediated by a portable language of JSON objects, you can in principle replace or extend any of them, including adding support for other protocols (like gemini://?) or content types.
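As a purely hypothetical illustration (this is not the actual schema), the kind of intermediate record that makes the steps swappable might look like:

    # Hypothetical intermediate crawl record (NOT Marginalia's actual schema),
    # just to illustrate why a JSON hand-off lets the fetcher and the processor
    # be replaced independently, e.g. to add a gemini:// fetcher.
    crawl_record = {
        "url": "gemini://example.org/gemlog/post.gmi",
        "protocol": "gemini",              # pick a fetcher per protocol
        "contentType": "text/gemini",      # pick a processor per content type
        "fetchedAt": "2022-08-23T12:00:00Z",
        "status": 20,                      # gemini success status
        "body": "# A gemlog post ...",
    }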


It's very simple to write a basic ActivityPub crawler (I have done this), but you'd go from outbox to outbox, not inbox to inbox. Regardless, it's all just JSON, so it is more straightforward to crawl than the web.
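The core of it really is just following JSON links: actor document, its outbox collection, then the pages. Roughly (a sketch; HTTP signature handling, rate limiting, and error handling are omitted):

    # Minimal sketch of walking an ActivityPub outbox: fetch the actor
    # document, follow its "outbox" collection, and page through orderedItems.
    # Some servers additionally require HTTP signatures; that is omitted here.
    import requests

    AP_ACCEPT = {"Accept": "application/activity+json"}

    def fetch_json(url):
        resp = requests.get(url, headers=AP_ACCEPT, timeout=10)
        resp.raise_for_status()
        return resp.json()

    def crawl_outbox(actor_url, max_pages=5):
        actor = fetch_json(actor_url)
        outbox = fetch_json(actor["outbox"])
        page_url = outbox.get("first")
        activities = []
        while page_url and max_pages > 0:
            page = fetch_json(page_url if isinstance(page_url, str) else page_url["id"])
            activities.extend(page.get("orderedItems", []))
            page_url = page.get("next")
            max_pages -= 1
        return activities   # Create activities wrap the actual posts ("object")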

I tried writing a crawler to map the Fediverse (to discover homeservers) but I discovered quickly that most of the content in the Fediverse is extremist content, both left-wing and right-wing -- it's basically all tankies and Nazis, with a few techies in a bubble using mastodon.social, who think Fediblock is a solution to this problem.

Oh right, and there are a ton of pedo instances too, mostly based in Japan where "loli" is legal. Due to the way ActivityPub publishes federated content, I didn't feel comfortable running a homeserver, because a pedo instance could federate illegal content to my homeserver and get me arrested and charged for possession of content I did not request and do not want to store on my computers, and if I'm not monitoring what content is getting federated to my timeline, I might not even know it's there. Too risky.

Demoralized, I abandoned the project, and I don't think much about ActivityPub anymore.


> because a pedo instance could federate illegal content to my homeserver and get me arrested and charged for possession of content I did not request and do not want to store on my computers, and if I'm not monitoring what content is getting federated to my timeline, I might not even know it's there. Too risky.

I also work in the ActivityPub space, and my solution to avoiding unlawful content is to make the "federating" relationship between two servers a mutual one: one asks to follow, the other replies favourably (or not). This puts a damper on the "wild west" of everyone federating with everyone, but it ensures content is curated at a basic level by a SysOp, so there are no surprises.
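In ActivityPub terms that's a Follow/Accept handshake that a human approves before anything from the other instance gets stored. Schematically (a sketch, not a complete implementation; the helpers at the bottom are placeholders):

    # Sketch of the mutual-federation gate described above: inbound activities
    # are only stored if the sending instance was explicitly approved by the
    # operator after a Follow request. Illustrative only.
    from urllib.parse import urlparse

    approved_instances = set()        # filled in when the SysOp accepts a Follow

    def handle_inbox(activity):
        sender_host = urlparse(activity.get("actor", "")).netloc
        if activity.get("type") == "Follow":
            queue_for_operator_review(sender_host, activity)   # a human decides
            return "pending"
        if sender_host not in approved_instances:
            return "rejected"         # nothing from unapproved instances hits disk
        store(activity)
        return "accepted"

    def queue_for_operator_review(host, activity):   # placeholder notification
        print(f"federation request from {host}")

    def store(activity):                             # placeholder persistence
        pass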



