Notes on Writing Web Scrapers (cushychicken.github.io)
134 points by cushychicken on Dec 2, 2021 | 67 comments



> Be nice to your data sources.

Very much this. If I know I'm possibly about to consume a non-negligible amount of resources from a server by indexing a website with lots of subdomains, I typically send the webmaster an email asking if this is fine, telling them what I use the data for and how often my crawler re-visits a site, and asking whether I should set any specific crawl delays or crawl at particular times of day.

In every case I've done this, they have been fine with it. It goes down a lot better than just barging in and grabbing data. This also gives them a way of reaching me if something should go wrong. I'm not just some IP address grabbing thousands of documents, I'm a human person they can talk to.

If I got a no, then I would respect that too, not try to circumvent it like some entitled asshole. No means no.


I wouldn't do that. It's better to stay under the radar IMO. They usually won't notice the traffic if it's just for personal reasons. And who says the internet can't be used for automatic retrieval?

I'm not going to pound websites with requests every minute of course but I think this falls under legitimate use. Whether I click the button myself or just schedule a script to do it shouldn't matter so much.

But if you ask, you draw attention to it and get their legal dept involved (after all, most websites are not run by a single webmaster in their bedroom), who will most likely say no, because legal people are hesitant to commit to anything.

But maybe my use case is different. I just scrape stuff to check if something I want is back in stock, to download my daily pdf newspaper I pay for, to archive forum posts I've written, stuff like that. I don't index whole sites.

But yeah I do make sure I don't bombard them with requests, though this is more from a "staying under the radar" point of view. And indeed to avoid triggering stuff like cloudflare.

But if you're scraping to run your own search engine and offer the results to the public the situation is much more complex of course, both technically and legally.


I've never actually reached out to a webmaster to ask permission, but I think that's a great idea. (They may even have some suggestions for a better way to achieve what I'm doing.)

How do you typically find contact info for someone like that?

I'm running very generous politeness intervals at the moment to try and ensure I'm not a nuisance - one query every two seconds.
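In code, that interval is nothing fancy - roughly a sleep between requests (a minimal Python sketch; the two-second figure is just the interval above):

  import time

  import requests

  POLITENESS_INTERVAL = 2.0  # seconds between requests
  _last_request_at = 0.0

  def polite_get(url, **kwargs):
      """GET a URL, sleeping long enough to stay at one query every two seconds."""
      global _last_request_at
      wait = POLITENESS_INTERVAL - (time.monotonic() - _last_request_at)
      if wait > 0:
          time.sleep(wait)
      _last_request_at = time.monotonic()
      return requests.get(url, timeout=30, **kwargs)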


If you can't find the address on the website, it's sometimes in robots.txt. You could also try support@domain, admin@domain, or webmaster@domain.

Contact forms can work as well; they usually seem to get forwarded to the IT department if they look like they have IT stuff in them, and I've had reasonable success getting hold of the right people that way.


I'll give that a shot!

Do you know if this is generally an in-house position for companies that use third party platforms?

I ask because Workday has been the absolute bane of my indexing existence, and I suspect they make it hard so they can own distribution of jobs to approved search engines. (Makes it easier to upcharge that way, I suppose.)

If the administrator for the job site is the Workday customer (i.e. Qualcomm or NXP or whoever is using Workday to host their job ads), I'd suspect I'd have a chance at getting a better way to index their jobs. (My god, I'd love API access if that's a thing I can get. I'd be a fly on the wall in most cases - one index a day is plenty for my purposes!)


If their security is any good, their firewall may tarpit you anyway if they can see you are spidering links quicker than a human can read and your user agent and/or IP address (range) offers no clues.


This is a nice approach. I generally leave a project specific email in the request headers with a similar short summary of my goals.


Yeah, my User-Agent is "search.marginalia.nu", and my contact information is not hard to find on that site. Misbehaving bots are very annoying and it's incredibly frustrating when you can't get hold of the owner to tell them about it.


I've written scrapers over the years, mostly for fun, and I've followed a different approach. Re. "don't interrupt the scrape": whenever URLs are stable, I keep a local cache of downloaded pages and have a bit of logic that checks the cache first when retrieving a URL. This way you can restart the scrape at any time, and most accesses will not hit the network until the point where the previous run was interrupted.

This also helps with the "grab more than you think you need" part - just grab the whole page! If you later realize you needed to extract more than you thought, you have everything in the local cache, ready to be processed again.
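The cache-first logic is only a few lines (a minimal sketch, assuming stable URLs and hashed file names):

  import hashlib
  import pathlib

  import requests

  CACHE_DIR = pathlib.Path("page_cache")
  CACHE_DIR.mkdir(exist_ok=True)

  def fetch(url):
      """Return the page HTML, hitting the network only on a cache miss."""
      key = hashlib.sha256(url.encode("utf-8")).hexdigest()
      cached = CACHE_DIR / (key + ".html")
      if cached.exists():
          return cached.read_text(encoding="utf-8")
      response = requests.get(url, timeout=30)
      response.raise_for_status()
      cached.write_text(response.text, encoding="utf-8")
      return response.text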


You're not the first person in this thread to suggest grabbing the whole page text. I've never tried, just because I assumed it would take so much space as to be impractical, but I don't see the harm in trying!


My current favorite cache for whole pages is a single sqlite file with the page source stored with brotli compression. Additional columns for any metadata you might need (URL, scraping sessionid, age). The resulting file is big (but brotli for this is even better than zstd), and having a single file is very convenient.
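Roughly like this (a sketch, assuming the brotli Python package; the metadata columns are just examples):

  import sqlite3
  import time

  import brotli  # pip install brotli

  db = sqlite3.connect("pages.sqlite")
  db.execute(
      "CREATE TABLE IF NOT EXISTS pages ("
      " url TEXT PRIMARY KEY, session_id TEXT, fetched_at REAL, body BLOB)"
  )

  def store_page(url, html, session_id):
      compressed = brotli.compress(html.encode("utf-8"))
      db.execute("INSERT OR REPLACE INTO pages VALUES (?, ?, ?, ?)",
                 (url, session_id, time.time(), compressed))
      db.commit()

  def load_page(url):
      row = db.execute("SELECT body FROM pages WHERE url = ?", (url,)).fetchone()
      return brotli.decompress(row[0]).decode("utf-8") if row else None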


It is not advised to use SQLite with several threads. Are you running only a single thread?


You can't use SQLite from multiple threads if you are in single-thread mode [0]. If you are in multi-thread mode, you can use multiple connections to the DB from different threads; if you want to share a connection, then you have to serialize access over it yourself. Or you can use SERIALIZED mode.

This is if you do not want to write any extra logic. If you do, there are other possibilities, for example using a queue and a dedicated thread for handling database access, but I personally do not think there are many advantages to this more complicated approach.

[0] https://www.sqlite.org/threadsafe.html
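For completeness, the queue-plus-dedicated-thread variant would look roughly like this (a sketch; the table and column names are made up):

  import queue
  import sqlite3
  import threading

  write_queue = queue.Queue()

  def db_writer(path="pages.sqlite"):
      """The only thread that touches the SQLite connection; others enqueue writes."""
      conn = sqlite3.connect(path)
      conn.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, body TEXT)")
      while True:
          item = write_queue.get()
          if item is None:  # sentinel: shut down
              break
          conn.execute("INSERT OR REPLACE INTO pages VALUES (?, ?)", item)
          conn.commit()
      conn.close()

  writer = threading.Thread(target=db_writer, daemon=True)
  writer.start()

  # Worker threads just do: write_queue.put((url, html))
  # Shutdown: write_queue.put(None); writer.join()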


Thanks, I wasn't aware of this. I'm not sure if it was old info or I "imagined" it but in any case it's good to know.

I hope it wasn't true in 2019, since I painstakingly wrote a multi-threaded app that was accessing SQLite on only 1 thread!


I would use SQLite between each of the stages so each process can be restarted, run concurrently and observed with queries.


But how do you deal with dynamic pages, i.e. content that changes each time - would you need to pattern-match?


Hey, posted on your other comment asking for advice, so thought I'd return the favor. I haven't built a scraper for workday yet, but I looked a little bit at a workday board to figure out the process before writing this comment.

1.) Navigate to a workday page (we'll say https://broadinstitute.wd1.myworkdayjobs.com/broad_institute... for example)

2.) Open up your developer console in Chrome (Ctrl+Shift+J on Windows) and navigate to the Network tab.

3.) Change the filter to Fetch/XHR

4.) Refresh the page

5.) You should see a few requests pop up, the one you care about is the clientRequestId request

6.) Take a look at the response payload of that request (throw it in http://jsonprettyprint.net/ for readability)

7.) You get a json payload that gives you the job positions you're looking for

8.) In addition to that, go back to the original web page and scroll down. You'll see a new request pop up, giving you the format for how you'll traverse through the next positions.

Hope this helps!


It is helpful! Annoyingly, there's some site-to-site variation in how companies structure results in their Workday instance. I get similar (but not identical) results when I look at NXP's Workday site, for example:

https://nxp.wd3.myworkdayjobs.com/en-US/careers

I'm going to try this technique with individual posting results - it's been challenging to get them to render as well, but I think that's more a Javascript thing than a requests thing.


Ah okay, I see what you mean with that one. I think the way I would approach Workday is categorizing the different companies that use Workday into certain buckets. So the example I gave would be one bucket, and the one you gave would be a different bucket. I would create a script for each of these buckets, instead of trying to use a one-size-fits-all approach. The approach I'd use for the website you linked would be something like:

Create a request mimicking this curl command:

  curl 'https://nxp.wd3.myworkdayjobs.com/wday/cxs/nxp/careers/jobs' \
    -H 'Connection: keep-alive' \
    -H 'sec-ch-ua: " Not A;Brand";v="99", "Chromium";v="96", "Google Chrome";v="96"' \
    -H 'Accept: application/json' \
    -H 'Content-Type: application/json' \
    -H 'Accept-Language: en-US' \
    -H 'sec-ch-ua-mobile: ?1' \
    -H 'User-Agent: Mozilla/5.0 (Linux; Android 5.0; SM-G900P Build/LRX21T) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Mobile Safari/537.36' \
    -H 'sec-ch-ua-platform: "Android"' \
    -H 'Origin: https://nxp.wd3.myworkdayjobs.com' \
    -H 'Sec-Fetch-Site: same-origin' \
    -H 'Sec-Fetch-Mode: cors' \
    -H 'Sec-Fetch-Dest: empty' \
    -H 'Referer: https://nxp.wd3.myworkdayjobs.com/en-US/careers?p=4' \
    --data-raw '{"limit":20,"offset":80,"searchText":"","appliedFacets":{}}' \
    --compressed

Change the offset by +20 (second to last row) each time until you reach the desired number of jobs. May need some changes but that's the general approach!
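Or, the same thing in Python with requests (a rough sketch; I've trimmed the headers to the ones that usually matter, and the response keys can vary between Workday tenants, so inspect the JSON before relying on any of them):

  import requests

  URL = "https://nxp.wd3.myworkdayjobs.com/wday/cxs/nxp/careers/jobs"
  HEADERS = {
      "Accept": "application/json",
      "Content-Type": "application/json",
      "Origin": "https://nxp.wd3.myworkdayjobs.com",
      "Referer": "https://nxp.wd3.myworkdayjobs.com/en-US/careers",
      "User-Agent": "Mozilla/5.0 (compatible; job-scraper-sketch)",  # use something honest
  }

  def fetch_postings(max_jobs=200, page_size=20):
      """Walk the Workday JSON endpoint 20 jobs at a time, as in the curl above."""
      pages = []
      for offset in range(0, max_jobs, page_size):
          payload = {"limit": page_size, "offset": offset,
                     "searchText": "", "appliedFacets": {}}
          resp = requests.post(URL, json=payload, headers=HEADERS, timeout=30)
          resp.raise_for_status()
          pages.append(resp.json())  # inspect the JSON to find the postings list
      return pages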


Thanks for going down the Workday scraping rabbit hole with me. :)

Did you pull this from the browser console's "Copy as cURL" function?

I tried this with some success (there's even a utility for translating cURL to Python - imagine that! https://curlconverter.com/) but I had some issues after a while, probably because the cookie/session token expired.


> When you’re reading >1000 results a day, and you’re inserting generous politeness intervals, an end-to-end scrape is expensive and time consuming. As such, you don’t want to interrupt it. Your exception handling should be air. Tight.

Seems like it would be more robust to just persist the state of your scrape so you can resume it. In general, I try to minimize the amount of code that a developer _MUST_ get right.


> Seems like it would be more robust to just persist the state of your scrape so you can resume it.

Say more about this. I'm not a software engineer by training, so I don't really know what "persist your state" would look like in this case.


Generally this means that you should be writing all the important information about the program to disk so in the event of a crash, you can read it back and continue where you left off. Alternatively, your state could BE on disk (or whatever persistent store, like a DB) and your program should process it in atomic chunks with zero difference between a "from scratch" and "resumed" run.

I work on web scrapers at [day job] and can say the latter approach is far better, but the former is far more common. An implementation of the former could be as simple as "dump the current URL queue to a CSV file every minute".

As for doing this "properly" in a way that works at scale, my preferred way is with a producer-consumer pipeline and out-of-process shared queues in between each pair of stages. So, for example, you have four queues: URL queue, verify queue, response queue, item queue, and four stages: fetch (reading from URLQ and writing to verifyQ), response check (reading from verifyQ, writing good responses to responseQ and bad-response URLs back to URLQ), parse (reading from responseQ and writing to itemQ), and publish (reading from itemQ and writing to a database or whatever).

This can be both horrendously overcomplicated and beautifully simple. I've implemented this both with straight up bash scripts and with dedicated queues and auto-scaling armies of docker containers.
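A stripped-down, single-process version of that pipeline (a sketch using in-memory queues and threads; the out-of-process version swaps these for real shared queues, and a real pipeline would cap retries):

  import queue
  import threading

  import requests

  url_q, verify_q, response_q, item_q = (queue.Queue() for _ in range(4))

  def fetch():
      while True:
          url = url_q.get()
          try:
              verify_q.put((url, requests.get(url, timeout=30)))
          except requests.RequestException:
              url_q.put(url)  # naive retry; cap this in real life
          url_q.task_done()

  def check():
      while True:
          url, resp = verify_q.get()
          if resp.status_code == 200:
              response_q.put((url, resp.text))
          else:
              url_q.put(url)  # bad response: send the URL back to fetch
          verify_q.task_done()

  def parse():
      while True:
          url, html = response_q.get()
          item_q.put({"url": url, "length": len(html)})  # real parsing goes here
          response_q.task_done()

  def publish():
      while True:
          print(item_q.get())  # write to a database or whatever in real life
          item_q.task_done()

  for stage in (fetch, check, parse, publish):
      threading.Thread(target=stage, daemon=True).start()

  url_q.put("https://example.com/")
  for q in (url_q, verify_q, response_q, item_q):
      q.join()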


Write the pages you are scraping to a cache. The simplest way would be to write each one to a folder. Check if the page you are going to scrape is cached before trying to request it from the server.


If you are scraping 1000 pages and your program crashes after the 800th page, you are going to be much happier if you have the data for the 800 pages saved somewhere vs having to start all over.


Got it. Yeah, I can give this approach a shot.

One of the things I've been doing with this scraping project is "health checking" job posting links. There's nothing more annoying than clicking an interesting looking link on a job site, only to find it's been filled. (This is one of the lousiest parts of Indeed and similar, IMO.) I wrote some pretty simple routines to handle that.

Caching solves the problem of potentially missing data while the scraper is running, but it doesn't really alleviate the network strain of requesting pages to see that they are actually still posted, valid job links.


It just means to write to disk where you left off, like saving your progress in a video game.


Author heeyah. Would love your feedback, either here or at @cushychicken on the tweet street.


There's a project called woob (https://woob.tech/) that implements python scripts that are a little bit like scrapers but only 'scrape' on demand from requests from console & desktop programs.

How much of this article do you think would apply to something like that? e.g. something like 'wait a second (or even two!) between successive hits' might not be necessary (one could perhaps shorten it to 1/4 second) if one is only doing a few requests followed by long periods of no requests.


Interesting question. My first instinct is to say that woob seems closer in use case to a browser than a scraper, as it seems largely geared towards making rich websites more easily accessible. (If I'm reading the page right, anyway.) A scraper is basically just hitting a web page over and over again, as fast as you can manage.

The trick, IMO, is to keep the load you put on a server closer to browser-level than scraper-level. Make sense?


I can truly relate to this article, especially where you mentioned trying to extract only the specific contents of the elements that you need, without bloating your software. To me, that seemed intuitive with the minimal experience I have in web scraping. However, I ended up fighting the frameworks. Me being stubborn, I did not try your approach and kept trying to be a perfectionist about it LMAO. Thank you for this read, glad I am not the only one who has been through this. Haha...


Yeah it's an easy thing to get into a perfectionist streak over.

Thinking about separation of concerns helped me a lot in getting over the hump of perfectionism. Once I realized I was trying to make my software do too much, it was easier to see how it would be much less work to write as two separate programs bundled together. (Talking specifically about the extract/transform stages here.)

Upon reflection, this project has been just as much self-study of good software engineering practices as it has been learning how to scrape job sites XD


Thank you for your reply, and your feedback, man! I will be sure to take this knowledge with me on my next web scraping journey! I appreciate your time XD


Can you elaborate on what you mean by not interrupting the scrape and instead flagging those pages?

Let's say you're scraping product info from a large list of products. I'm assuming you mean handling strange, one-off errors by flagging those pages, and stopping altogether if too many fail? Otherwise you'd just be DoS'ing the site.


> Can you elaborate on what you mean by not interrupting the scrape and instead flagging those pages?

Sure! I can get a little more concrete about this project more easily than I can comment on your hypothetical about a large list of products, though, so forgive me in advance for pivoting on the scenario here.

I'm scraping job pages. Typically, one job posting == one link. I can go through that link for the job posting and extract data from given HTML elements using CSS selectors or XPath statements. However, sometimes the data I'm looking for isn't structured in a way I expect. The major area I notice variations in job ad data is location data. There are a zillion little variations in how you can structure the location of a job ad. City+country, city+state+country, comma separated, space separated, localized states, no states or provinces, all the permutations thereof.

I've written the extractor to expect a certain format of location data for a given job site - let's say "<city>, <country>", for example. If the scraper comes across an entry that happens to be "<city>, <state>, <country>", it's generally not smart enough to generalize its transform logic to deal with that. So, to handle it, I mark that particular job page link as needing human review, so it pops up as an ERROR in my logs, and as an entry in the database that has post_status == 5. After that, it gets inserted into the database, but not posted live onto the site.

That way, I can go in and manually fix the posting, approve it to go on the site (if it's relevant), and, ideally, tweak the scraper logic so that it handles transforms of that style of data formatting as well as the "<city>, <country>" format I originally expected.

Does that make sense?

I suspect I'm just writing logic to deal with malformed/irregular entries that humans make into job sites XD
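In code terms, the extract-or-flag step boils down to something like this (a rough sketch; 5 is the actual "needs human review" status, the other names and values are made up for illustration):

  NEEDS_REVIEW = 5  # real post_status value for "needs human review"
  ACTIVE = 1        # made-up value standing in for a normal, postable entry

  def parse_location(raw):
      """Expect '<city>, <country>'; anything else gets kicked to a human."""
      parts = [p.strip() for p in raw.split(",")]
      if len(parts) == 2:
          city, country = parts
          return {"city": city, "country": country, "post_status": ACTIVE}
      # '<city>, <state>, <country>' or something weirder: flag it so it shows
      # up as an ERROR in the logs and a post_status == 5 row in the database.
      return {"raw_location": raw, "post_status": NEEDS_REVIEW}

  print(parse_location("Eindhoven, Netherlands"))
  print(parse_location("San Diego, CA, USA"))  # flagged for review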


I've had a lot of success just saving the data into gzipped tarballs, like a few thousand documents per tarball. That way I can replay the data and tweak the algorithms without causing traffic.
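The pattern, sketched in Python with the tarfile module (file names here are just hashed URLs):

  import hashlib
  import io
  import tarfile
  import time

  def write_batch(documents, batch_id):
      """Pack a batch of (url, html) pairs into one gzipped tarball."""
      with tarfile.open(f"crawl-{batch_id}.tar.gz", "w:gz") as tar:
          for url, html in documents:
              data = html.encode("utf-8")
              name = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html"
              info = tarfile.TarInfo(name=name)
              info.size = len(data)
              info.mtime = int(time.time())
              tar.addfile(info, io.BytesIO(data))

  def replay(path):
      """Yield the stored documents again without touching the network."""
      with tarfile.open(path, "r:gz") as tar:
          for member in tar:
              yield tar.extractfile(member).read().decode("utf-8")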


Is that still practical even if you're storing the page text?

The reason I don't do that is because I have a few functions that analyze the job descriptions for relevance, but don't store the post text. I mostly did that to save space - I'm just aggregating links to relevant roles, not hosting job posts.

I figured saving ~1000 job descriptions would take up a needlessly large chunk of space, but truth be told I never did the math to check.

Edit: I understand scrapy does something similar to what you're describing; I've considered using that as my scraper frontend but haven't gotten around to doing the work for it yet.


Yeah, sure. The text itself is usually at most a few hundred Kb, and HTML compresses extremely well. Like it's pretty slow to unpack and replay the documents, but it's still a lot faster than downloading them again.


And it's friendlier to the server you're getting the data from.

As a journalist, I have to scrape government sites now and then for datasets they won't hand over via FOIA requests ("It's on our site, that's the bare minimum to comply with the law so we're not going to give you the actual database we store this information in.") They're notoriously slow and often will block any type of systematic scraping. Better to get whatever you can and save it, then run your parsing and analysis on that instead of hoping you can get it from the website again.


First of all, thanks for marginalia.nu.

Have you considered storing compressed blobs in a sqlite file? Works fine for me, you can do indexed searches on your "stored" data, and can extract single pages if you want.


The main reason I'm doing it this way is because I'm saving this stuff to a mechanical drive, and I want consistent write performance and low memory overhead. Since it's essentially just an archive copy, I don't mind if it takes half an hour to chew through looking for some particular set of files. Since this is a format designed for tape drives, it causes very little random access. It's important that writes are relatively consistent since my crawler does this while it's crawling, and it can reach speeds of 50-100 documents per second, which would be extremely rough on any sort of database based on a single mechanical hard drive.

These archives are just an intermediate stage that's used if I need to reconstruct the index to tweak say keyword extraction or something, so random access performance isn't something that is particularly useful.


Have you thought about pushing the links onto a queue and running multiple scrapers off that queue? You'd need to build in some politeness mechanism to make sure you're not hitting the same domain/ip address too often but it seems like a better option than a serial process.


Why 5, exactly? This struck me as odd in the article. Perhaps I missed something. Are there other statuses? Why are statuses numeric?


It's arbitrary.

I have a field, post_status, in my backend database, that I use to categorize posts. Each category is a numeric code so SQL can filter it relatively quickly. I have statuses for active posts, dead posts, ignored links, links needing review, etc.

It's a way for me to sort through my scraper's results quickly.


I think you have a case of premature optimisation there, as I wrote in a recent comment[0].

[0]: https://news.ycombinator.com/item?id=29430281


Not sure what's premature here. The optimization is to allow me, a human, to find a certain class of database records quickly. I also chose a method that I understand to be snappy on the SQL side.

What would you suggest as a non-optimized alternative? That might make your point about premature optimization clearer.


There is indeed a trade-off, and the direction I would have chosen is to use meaningful status names as opposed to magic numbers. My reasoning is that keeping the system self-explanatory is worth more to me, economically, than obscuring the meaning behind some of the code/data for a practically non-existent performance benefit.

After all, hardware is cheap, but developer time isn't.

For a more concrete example, I might have chosen the value `'pending'` (or similar) instead of `5`. Active listings might have status `'active'`. Expired ones might have status `'expired'`, etc.


Integer columns are significantly faster and smaller than strings in a SQL database. It adds up quickly if you have a sufficiently large database.

I use the following scheme:

   1 - exhausted
   0 - alive
  -1 - blocked (by my rules)
  -2 - redirected
  -3 - error


The author is scraping fewer than 1,000 records per day, or roughly 365,000 records per year.

On my own little SaaS project, the difference between querying an integer and a varchar like “active” is imperceptible, and that’s in a table with 7,000,000 rows.

It would take the author 19 years to run into the scale that I’m running at, where this optimisation is meaningless. And that’s assuming they don’t periodically clean their database of stale data, which they should.

So this looks like a premature optimisation to me, which is why it stood out as odd to me in the article.


I'd put it closer to the category of best practices than premature optimization. It's pretty much always a good idea. It's not that skipping it will break things; it's that the alternative is slower and uses more resources in a way that affects all queries, since larger datatypes elongate the records, and locality is tremendously important in all aspects of software performance.


I disagree. I think a better "best practice" is to make the meaning behind the code as clear as possible. In this case, the code/data is less clear, and there is zero performance benefit.


There is absolutely a performance benefit to reducing your row sizes. It both reduces the amount of disk I/O and the amount of CPU cache misses and in many cases also increases the amount of data that can be kept in RAM.

You can map meaning onto the column in your code, as most languages have enums that are free in terms of performance. It does not make sense to burden the storage layer with this, as it lacks this feature.


> You can map meaning onto the column in your code, as most languages have enums that are free in terms of performance. It does not make sense to burden the storage layer with this, as it lacks this feature.

Was just looking at how to do this with an enum today! Read my mind. :)
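Something like this, using the status scheme from a few comments up (a sketch with Python's IntEnum; the database column stays a small integer, the code gets readable names):

  from enum import IntEnum

  class PostStatus(IntEnum):
      EXHAUSTED = 1
      ALIVE = 0
      BLOCKED = -1    # blocked by my rules
      REDIRECTED = -2
      ERROR = -3

  row_value = int(PostStatus.ALIVE)  # 0 -- what actually goes into the column
  print(PostStatus(row_value).name)  # "ALIVE" -- what the code gets to read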


The performance benefit is negligible at the scale the author of the article is operating at. You already alluded to this point being context-dependent earlier when you said:

> if you have a sufficiently large database

Roughly 360,000 rows per year is not sufficiently large. It's tiny.


It's arbitrary.


I just want to add that it's very important to look at GET/POST and AJAX calls within the site. When properly understood, they can be worked in your favor, taking away a lot of the complexity of your scrapers.


Agreed - frequently the necessary data is available as nicely formatted JSON. Sometimes you can also modify the query: rather than getting 10 results per request, you can get 100.


The problem with waiting 1-2 seconds between requests is that if you’re trying to scrape on the scale of millions of pages, the difference between 30 parallel requests / sec and a single request every 1-2 seconds is the difference between a process that takes 9 hours and a month.

So I think there’s a balance to be struck - I’d argue you should absolutely be thoughtful about your target - if they notice you or you break them, that could be problematic for both of you. But if you’re TOO conservative, the job will never get done in a reasonable timeframe.


> The problem with waiting 1-2 seconds between requests is that if you’re trying to scrape on the scale of millions of pages, the difference between 30 parallel requests / sec and a single request every 1-2 seconds is the difference between a process that takes 9 hours and a month

Fortunately for me, I'm almost assuredly never going to have to do this on the scale of millions of pages. If time proves me wrong, I suspect I'll be hiring someone with more expertise to take over that part of the project.

I'm definitely biasing towards a very conservative interval. Optimizing the runtime is more to help with tightening the iteration cycles for me, the sole developer, instead of limiting the job size to a reasonable timeframe.


Ah the sanitization links were great, thanks!

Do you plan on handling sanitization of roles so people can search by that? I ended up using a LONG case when statement to group roles into buckets, probably not ideal

Doing something similar to you, but focused on startups and jobs funded by Tier 1 investors: https://topstartups.io/


> Do you plan on handling sanitization of roles so people can search by that? I ended up using a LONG case when statement to group roles into buckets, probably not ideal

Probably not. I get the impression that job titles are one of the ways that recruiters and hiring managers "show their shine", so to speak, in terms of marketing open roles. Job titles can convey explicit and important information about the role's focus - things like "FPGA", or "New College Grad", or "DSP", for example. It's an opportunity for them to showcase what's special about the role in a succinct way. Sanitizing that away would reduce the amount of information given to all sides in the transaction. It also seems like a really broad task; there are way more niche specialties in this field than there are succinct buckets to place them in, job-title-wise.

I've found it more useful to tag roles based upon title and contents of the job description. It's a way to get the same info across without obscuring the employer's intent.


Do you think there's a lot of job boards lately? Or have they always been there? It feels like I've seen a lot of them popping up this year.


Just checked out your website - the "Buy the full database" idea is genius! I'm working on a job board and just assumed that I would never be able to charge the users at all, but your idea circumvents that. Nice work!


What do you mean, "buy the full database"?


There's a link at the bottom of the page that says "Tired of scrolling? Buy the full database organized in a spreadsheet" and it links to a Stripe page that lets a user buy all of the data as a CSV file for $49.


Scraping 1000 results a day is really not any kind of web scraping scale. There are barely any of the same considerations as systems that scrape tens of millions a day.

You could easily store those kinds of results in a local DB for offline processing and resuming.


Even at my cutesy, boutique scale, there is plenty there to obsess about. XD



