
I run the News Sniffer[1] project, which has to parse BBC News pages. I noticed this rollout a few weeks ago, when the HTML changed format completely and my parsers broke.

As a side note, the new HTML is way more complicated and much harder to parse than before - I know the aim isn't to help parsing for content, but I was still saddened to see how it's ended up (a bit of a mess imo - hard to distinguish actual article content from other things).

If anyone knows a reliable and public way to access the content before the "web rendering" layer, that'd be very handy!

[1] https://www.newssniffer.co.uk




The BBC was huge on the semantic web and RDF a few years ago. Once you start using React and the tools and techniques common with it, or even just compose your site following today's common methods, the semantic web becomes extremely difficult, especially with elements being created ad hoc at runtime in the browser.

We've kind of gone full circle, back to where we started in the early 2000s. CSS was created so that a site's intended style no longer had to be crafted through the structure of its content. We now have CSS frameworks that dictate how you must structure the content for the layout to take effect.

Is CSS Zen Garden still even a thing these days?


If they have a mobile app, looking at the requests it makes can often be enlightening. The same goes if they have a mobile version of the site, as it tends to have less "fluff" and more content.

I also know they have a moderately public Nitro API for their media programming (the iPlayer offerings), so it's possible they have a similar one for their web content.


You may find the AMP pages slightly easier to scrape.

For example, this: https://www.bbc.co.uk/news/amp/health-54795657 vs. this: https://www.bbc.co.uk/news/health-54795657

I'm working on a similar thing at the moment (BBC HTML -> Markdown), so I'm also exploring the best way to do it.
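Going only by the two example URLs above, the AMP address looks like the normal article address with "amp/" inserted after "/news/". A minimal sketch of that rewrite follows. The function name amp_url is made up for illustration, and the URL pattern is an assumption inferred from this single pair of links, not documented BBC behaviour.

```python
from urllib.parse import urlsplit, urlunsplit


def amp_url(article_url: str) -> str:
    """Guess the AMP URL for a BBC News article URL.

    Assumption: AMP pages live at /news/amp/<rest-of-path>, as in the
    single example pair quoted in this thread.
    """
    parts = urlsplit(article_url)
    prefix = "/news/"
    if not parts.path.startswith(prefix):
        raise ValueError(f"not a /news/ article URL: {article_url}")
    rest = parts.path[len(prefix):]
    if rest.startswith("amp/"):
        # Already an AMP URL, leave it untouched.
        return article_url
    # Rebuild the URL with "amp/" inserted after the /news/ prefix.
    return urlunsplit(parts._replace(path=prefix + "amp/" + rest))


if __name__ == "__main__":
    print(amp_url("https://www.bbc.co.uk/news/health-54795657"))
```

Whether the resulting page actually exists still has to be checked per article.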


Nice, thanks. That looked promising, but I've checked a number of older articles and they have no AMP version (even though the site's HTML references one, the URL returns a 404!).

I'll see if I can figure out whether all newer pages will have permanent AMP versions, or whether AMP versions drop away over time.
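One way to test this across many articles is to read the AMP reference out of each page and then probe it. The sketch below assumes the reference is the standard <link rel="amphtml" href="..."> tag and that the server answers HEAD requests sensibly. The names AmpLinkFinder, find_amp_link and amp_exists are hypothetical helpers, not part of any existing tool.

```python
import urllib.error
import urllib.request
from html.parser import HTMLParser
from typing import Optional


class AmpLinkFinder(HTMLParser):
    """Records the href of the first <link rel="amphtml"> tag seen."""

    def __init__(self) -> None:
        super().__init__()
        self.amp_href: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.amp_href is not None:
            return
        attr = dict(attrs)
        if (attr.get("rel") or "").lower() == "amphtml":
            self.amp_href = attr.get("href")


def find_amp_link(html: str) -> Optional[str]:
    """Return the AMP URL a page advertises, or None if it has none."""
    finder = AmpLinkFinder()
    finder.feed(html)
    return finder.amp_href


def amp_exists(url: str, timeout: float = 10.0) -> bool:
    """Return True if the URL answers 200, False on an HTTP error such as 404.

    Assumption: the server supports HEAD. Network failures (DNS errors,
    timeouts) are not caught here and will propagate to the caller.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except urllib.error.HTTPError:
        return False
```

Running find_amp_link over stored article HTML and amp_exists over the results, grouped by publication date, would show whether the dead links cluster among older articles.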


My side project rss-proxy [0] might be interesting for you. It analyzes the DOM and extracts the articles, so ideally you would not need to write a parser manually.

[0] https://github.com/damoeb/rss-proxy


They still have a full set of RSS feeds, do they not?


Yes, though the page linking to them hasn't been updated since 2011. And notably, it doesn't have the new HTML: https://www.bbc.co.uk/news/10628494

But the RSS feeds carry just the headlines. They also don't contain every article ever published, only the latest ones up to a limit. So they're not much use to News Sniffer.
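To illustrate what a headline-only feed gives you, here is a small sketch that reads the items out of an RSS 2.0 document using only the standard library. The sample feed is invented for the example and is not real BBC data. The point is that each item carries a title, a short description and a link, but no article body.

```python
import xml.etree.ElementTree as ET

# Invented sample in RSS 2.0 shape; not taken from any real feed.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>Example headline</title>
      <description>One-line summary.</description>
      <link>https://www.example.com/news/example-1</link>
    </item>
  </channel>
</rss>"""


def feed_items(xml_text: str) -> list:
    """Return title, description and link for each <item> in the feed."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title"),
            "description": item.findtext("description"),
            "link": item.findtext("link"),
        })
    return items


if __name__ == "__main__":
    for entry in feed_items(SAMPLE_FEED):
        print(entry["title"], "->", entry["link"])
```

The links could still serve as a discovery list of new articles to fetch, but the full text has to come from the article page itself.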


This looks pretty active. If you have any luck, let me know.

https://newspaper.readthedocs.io/en/latest/



