
Maybe I am being picky, but is traversing JSON files truly "web scraping"?



No, you're correct...these exercises were deliberately kept programmatically simple -- e.g. single loops and conditional statements -- ...not every student had much CS experience, never mind web scraping. In cases where JSON is being parsed, it's usually because that's the easiest way to access the data...but the "skill" in the exercise is recognizing when a website feeds from such an API...and then going directly to that source.

For example, usajobs.gov is a consumer-friendly jobs search site. You could find the number of librarian jobs by manipulating the web form...or you could do a little looking around and see that there's an API:

https://data.usajobs.gov/Rest

And just as importantly, there's an official taxonomy for federal jobs: https://www.opm.gov/policy-data-oversight/classification-qua...

So being able to look at a website and deduce what might be behind it is good enough...and is actually what I would do in a real-world situation rather than just trying to reverse engineer a site.

And there's the increasingly common situation in which the website loads data client-side, such as analytics.usa.gov...and so inspecting the network traffic and working with the JSON files is the only way to collect the data displayed on the website.
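To make that concrete, here is a minimal sketch of the "single loop and a conditional" style of exercise, assuming you've already spotted a JSON endpoint in the browser's network tab. The record shape, the `title` field, and the sample data are all made up for illustration; they are not the actual schema of usajobs.gov or analytics.usa.gov.

```python
import json
from urllib.request import urlopen


def count_matching(payload_text, field, keyword):
    """Count records whose `field` contains `keyword` (case-insensitive).

    Assumes the payload is a JSON array of objects (a hypothetical shape).
    """
    records = json.loads(payload_text)
    count = 0
    for record in records:
        if keyword.lower() in str(record.get(field, "")).lower():
            count += 1
    return count


def fetch_text(url):
    """Download a URL and return its body as text (assumes UTF-8)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


# Made-up sample standing in for a real response body.
SAMPLE = (
    '[{"title": "Librarian"},'
    ' {"title": "Supervisory Librarian"},'
    ' {"title": "Archivist"}]'
)

print(count_matching(SAMPLE, "title", "librarian"))
```

Against a live site you would pass `fetch_text(url)` in place of `SAMPLE`, with whatever URL shows up in the network traffic, and adjust the field name to match what the endpoint actually returns.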


One of the most important skills a web scraper can have is being able to take the easier path and use an API where it's available. APIs, after all, are just really really nicely formatted webpages that follow an additional set of generally agreed-upon standards. If you were interviewing a web scraper for your company, and they didn't know how to parse JSON, you'd probably think twice about giving them an offer. I think it's entirely appropriate -- and necessary -- to include JSON parsing in the list of exercises. (Also, I recently wrote "Web Scraping with Python" (O'Reilly), and had a whole CHAPTER on APIs.)
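A rough illustration of the "nicely formatted webpage" point, using only the standard library: the same (invented) job list pulled once out of HTML and once out of JSON. The markup, the `job` class name, and the `jobs` key are assumptions for the sake of the example, not taken from any real site.

```python
import json
from html.parser import HTMLParser

# Two hypothetical representations of the same data.
HTML_PAGE = "<ul><li class='job'>Librarian</li><li class='job'>Archivist</li></ul>"
JSON_BODY = '{"jobs": ["Librarian", "Archivist"]}'


class JobListParser(HTMLParser):
    """Collect the text of every <li class="job"> element."""

    def __init__(self):
        super().__init__()
        self.in_job = False
        self.jobs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "li" and ("class", "job") in attrs:
            self.in_job = True

    def handle_data(self, data):
        if self.in_job:
            self.jobs.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_job = False


def jobs_from_html(text):
    """Scrape job titles out of the HTML version."""
    parser = JobListParser()
    parser.feed(text)
    parser.close()
    return parser.jobs


def jobs_from_json(text):
    """Read job titles out of the JSON version."""
    return json.loads(text)["jobs"]


print(jobs_from_html(HTML_PAGE))
print(jobs_from_json(JSON_BODY))
```

Both functions return the same list, but the JSON path is one line and doesn't break when the page layout changes, which is the argument for checking for an API first.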



