I run the News Sniffer[1] project, which has to parse BBC News pages. I found out about this rollout a few weeks ago, when the HTML changed format completely and my parsers broke.
As a side note, the new HTML is way more complicated and much harder to parse than before - I know the aim isn't to help parsing for content, but I was still saddened to see how it's ended up (a bit of a mess imo - hard to distinguish actual article content from other things).
If anyone knows a reliable and public way to access the content before the "web rendering" layer, that'd be very handy!
The BBC was huge on the semantic web and RDF a few years ago. Once you start using React and the tools and techniques common with it, or even just build your site following today's common methods, the semantic web becomes extremely difficult, especially with elements being created ad hoc at runtime in the browser.
It feels like we've gone full circle back to where we started in the early 2000s. CSS was created so that the structure of the content would no longer dictate the styling of the site. We now have CSS frameworks that dictate how you must structure the content for the layout to take effect.
If they have a mobile app, looking at the requests it makes can often be enlightening. The same goes if they have a mobile version of the site, as it tends to have less "fluff" and more content.
I also know they have a moderately public Nitro API for their media programming (the iPlayer offerings) so it's possible they have a similar one for their web content
Nice, thanks. That looked promising, but I've checked a number of older articles and they have no AMP version (the site's HTML references an AMP version, but that URL returns a 404!)
I'll see if I can figure out if all newer pages will have permanent amp versions, or whether all amp versions drop away over time.
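For anyone wanting to check the same thing in bulk, here is a rough Python sketch of how it could be done. It assumes the article HTML advertises its AMP version the usual way, through a `<link rel="amphtml" href="...">` tag, and that a HEAD request is enough to tell a live page from a 404. The names `AmpLinkFinder`, `find_amp_url` and `amp_url_resolves` are made up for illustration, and none of this is specific to the BBC.

```python
from html.parser import HTMLParser
from typing import Optional
from urllib.error import HTTPError
from urllib.request import Request, urlopen


class AmpLinkFinder(HTMLParser):
    """Remembers the href of the first <link rel="amphtml"> tag it sees."""

    def __init__(self) -> None:
        super().__init__()
        self.amp_url: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Only <link> tags matter, and only until the first match is found.
        if tag != "link" or self.amp_url is not None:
            return
        attr_map = dict(attrs)
        # rel is a space-separated token list; compare case-insensitively.
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        if "amphtml" in rel_tokens:
            self.amp_url = attr_map.get("href")


def find_amp_url(html: str) -> Optional[str]:
    """Return the advertised AMP URL, or None if the page has none."""
    finder = AmpLinkFinder()
    finder.feed(html)
    return finder.amp_url


def amp_url_resolves(url: str, timeout: float = 10.0) -> bool:
    """Return True for a 2xx response, False for HTTP errors such as 404.

    Assumption: the server answers HEAD requests sensibly. Network failures
    (DNS errors, timeouts) are not caught here and will raise.
    """
    request = Request(url, method="HEAD")
    try:
        with urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except HTTPError:
        return False
```

Running `find_amp_url` over a sample of article pages from different years, and then `amp_url_resolves` on each result, would show whether the 404s correlate with article age.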
Maybe my side project rss-proxy [0] might be interesting for you. It analyzes the DOM and extracts the articles, so ideally you would not need to write a parser manually.
yes, though the page linking to them hasn't been updated since 2011. And notably, it doesn't have the new HTML: https://www.bbc.co.uk/news/10628494
But the RSS feeds are just the headlines. They also don't contain every article ever, only the latest ones up to a limit. So not much use to News Sniffer.
[1] https://www.newssniffer.co.uk