Requests-HTML: HTML Parsing for Humans (github.com)
467 points by ingve on Feb 25, 2018 | 90 comments



+1 to making nicer APIs! It is always good to have more high-quality API designs to look at.

That said, it looks more like an API experiment, not a practical solution for a day job, at least in its current state:

* response body encoding detection is wrong, as it doesn't take meta tags or BOM into account;

* base URL detection is wrong, as it doesn't take the <base> tag into account;

* URL parsing (joining, etc.) is implemented using string operations instead of the stdlib, so a careful inspection is required to make sure it works in edge cases. For example, I can see right away that .absolute_links is wrong for protocol-relative URLs (e.g. "//ajax.microsoft.com/ajax/jquery/jquery-1.3.2.min.js"); see the urljoin sketch after this list;

* html2text, used for .markdown, is GPL. I know people have different opinions on this, but in my book, if you import from a GPL package, your package becomes GPL as well;

* each .xpath call parses a chunk of HTML again, even if a tree is already present;

* shortcuts are opinionated and have no clear behavior. For example, .links deduplicates URLs by default, and it deduplicates them using string matches (so, e.g., URLs whose GET arguments differ only in order are considered distinct); it checks for .startswith('#'), which looks arbitrary (what if these links are used in a headless browser? what if someone wants to fetch them using _escaped_fragment_, which many sites still support? why filter out such URLs if they are relative, but not if they are absolute?)
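
As a quick illustration of the URL-joining point, the stdlib already handles protocol-relative URLs and relative paths correctly; this is a minimal sketch, not code from requests-html:

  from urllib.parse import urljoin  # Python 3 stdlib

  base = "https://www.python.org/about/"

  # Protocol-relative URL: keeps the scheme of the base URL.
  print(urljoin(base, "//ajax.microsoft.com/ajax/jquery/jquery-1.3.2.min.js"))
  # -> https://ajax.microsoft.com/ajax/jquery/jquery-1.3.2.min.js

  # Relative URL: resolved against the base path, not string-concatenated.
  print(urljoin(base, "../downloads/"))
  # -> https://www.python.org/downloads/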


Would you mind posting these as issues?


TBH I don't see myself using this package: in its current state it is very little code, and almost every method has an issue either with edge cases or with the API; also, it is tied to the requests library, unnecessarily IMHO, and in my opinion it is GPL even if setup.py says it is MIT.

Because there is nothing usable code-wise in requests-html from my point of view (it is no better than existing alternatives), I don't feel like raising these issues, advocating for fixing them, or discussing alternative solutions with the goal of improving requests-html. Of course, everyone is free to raise these issues in the repo.

I appreciate the work put into the requests-html API design; the design is very nice overall. This might be the way to go: create a nice API design, attract people, and fix the implementation over time. But this battle is not mine, sorry :(


GPL dependency removed.

I'd like all of these improvements to be made to the software. It's all about getting a nice API in place first, then making it perfect second.


I addressed most of your issues (like not using urlparse) in the latest release.

With libraries like these, it's all about getting the API right first, then optimizing for perfection second. :)


<base> tag is now implemented as well. Thanks for bringing that to my attention — I wasn't aware of it!


:thumbs up:

A second iteration of review:

* encoding detection from <meta> tags doesn't normalize encoding names; Python doesn't use the same names as HTML;

* I'm still not sure encoding detection is correct, as it is unclear what the priorities are in the current implementation. It should be: 1) Content-Type header; 2) BOM marks; 3) encoding in meta tags (or the XML-declared encoding, if you support it); 4) content-based guessing (chardet, etc.) or just a default value. I.e. encoding in meta should have lower priority than the Content-Type header, but higher priority than chardet; if I understand it properly, response.text is decoded using both the Content-Type header and chardet. (See the sketch below.)

* lxml's fromstring handles XML (XHTML) encoding declarations, and it may fail on unicode data (http://lxml.de/parsing.html#python-unicode-strings), so passing response.text to fromstring is not good. At the same time, relying on lxml to detect the encoding is not enough, as HTTP headers should have a higher priority. In parsel we re-encode the text to UTF-8 and force a UTF-8 parser for lxml to solve it: https://github.com/scrapy/parsel/blob/f6103c8808170546ecf046....

* when extracting links, it is not enough to use raw @href attribute values, as they are allowed to have leading and trailing whitespace (see https://github.com/scrapy/w3lib/blob/c1a030582ec30423c40215f...)

* absolute_links doesn't look correct for base URLs which contain a path. It may also have issues with URLs like tel:1122333 or mailto: links.

For encoding detection we're using https://github.com/scrapy/w3lib/blob/c1a030582ec30423c40215f... in Scrapy. It works well overall; its weakness is that it doesn't require an HTML tree and doesn't parse one, extracting meta information only from the first 4 KB using a regex (the 4 KB limit is not good). Other than that, it does all the right things AFAIK.
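
To make the intended priority concrete, here is a minimal, hypothetical sketch (the helper names are mine, not from requests-html or w3lib), including the parsel-style trick of re-encoding to UTF-8 and forcing a UTF-8 parser for lxml:

  import codecs
  import re

  import chardet                      # content-based guessing, lowest priority
  from lxml import etree

  def detect_encoding(content_type, body):
      """Hypothetical helper implementing the priority order described above."""
      # 1) Content-Type header, e.g. "text/html; charset=ISO-8859-1"
      m = re.search(r"charset=([\w-]+)", content_type or "", re.I)
      if m:
          return m.group(1)
      # 2) BOM marks
      for bom, enc in [(codecs.BOM_UTF8, "utf-8"),
                       (codecs.BOM_UTF16_LE, "utf-16-le"),
                       (codecs.BOM_UTF16_BE, "utf-16-be")]:
          if body.startswith(bom):
              return enc
      # 3) <meta charset=...> declaration (simplified regex; the name should
      #    also be normalized to a Python codec name, per the point above)
      m = re.search(rb'<meta[^>]+charset=["\']?([\w-]+)', body[:4096], re.I)
      if m:
          return m.group(1).decode("ascii")
      # 4) content-based guessing as a last resort
      return chardet.detect(body)["encoding"] or "utf-8"

  def parse_html(content_type, body):
      # parsel-style trick: decode once, then re-encode to UTF-8 and tell lxml
      # explicitly, so encoding declarations inside the document can't conflict.
      text = body.decode(detect_encoding(content_type, body), errors="replace")
      parser = etree.HTMLParser(encoding="utf-8")
      return etree.fromstring(text.encode("utf-8"), parser=parser)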


Thanks for the feedback, integrated w3lib!


Boo, you should have just made it GPL.


Oooo... I just finished writing a script with BeautifulSoup. While it wasn't all that bad (it works ;) ), I'm sure the "Kenneth Reitz experience" would be much better. I won't be rewriting the script now, but I can't wait to find an excuse to try this. :)

EDIT: first commit 22 hours ago - goes to show that when you have thought about the idea and know what you're doing, it doesn't take long to produce the first version. :)


Would not recommend BeautifulSoup for this type of thing.

lxml.html is much better in my experience. If you want to use CSS selectors, there's pyquery.


I prefer lxml.etree even for HTML on account of the parser. Either way, I don't understand what's not "for humans" about lxml. It provides a ton of simple and useful abstractions, is easy to learn, and is crazy robust and fast under the hood.
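
For what it's worth, a minimal example of that style (placeholder markup, XPath instead of CSS):

  from lxml import etree

  html = b"<html><body><a class='ext' href='https://example.com'>Example</a></body></html>"

  # etree.HTML uses lxml's forgiving HTML parser and returns the root element.
  root = etree.HTML(html)

  for a in root.xpath("//a[@class='ext']"):
      print(a.get("href"), a.text)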


I remember years ago I needed to parse some HTML (about 2-3 million characters), and after a fair bit of time I had it up and running with BeautifulSoup. Now, my use case was likely quite atypical compared to most HTML parsing, but my god was it ever slow! I forget the exact numbers, but I think it was taking about 150 seconds to complete. So then I wrote it using lxml, which was an improvement, but that was still taking around 100 seconds.

Now, I very rarely have any need to scrape and parse HTML data, and I was scratching my head at how these parsers could take so long on a 3.5 MiB HTML page. I mean, it should be able to go through that and get what I want in under a second, right?

So I said screw it and wrote some regexes. 10-15 seconds was how long the whole thing now took; it actually took only 1.5 or so seconds to parse the HTML, and the rest was waiting for the webpage to download.

Ironically, implementing the regexes was actually quicker than figuring out how to use those HTML parsers and write the code. Of course, that's assuming you know how to craft regexes. Since it was set up to run every 5 minutes, I wanted something that could do it without spending half the time parsing the data (amongst the other tasks the processors were needed for).

YMMV


I actually had the same experience. I was scraping a large number of pages and, upon profiling my script, found out that bs4 was really slow. Changing the parser from the default to lxml helped things a bit, but I decided I would just try a regex to quickly check whether things could be better. Lo and behold, it was much faster. It's true that it's impossible to parse HTML in its entirety with regex, but if you're looking to extract only a portion of data from a page with a known structure, a bit of regex might be the way to go.
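
For a concrete (and deliberately simplified) example of that trade-off: when the structure is known and stable, a precompiled regex pulls a field out without building a tree at all, but it breaks the moment the markup changes.

  import re

  # Assumes a known, stable page structure; purely illustrative.
  PRICE_RE = re.compile(r'<span class="price">\s*\$([\d.]+)\s*</span>')

  html = '<div><span class="price"> $19.99 </span></div>'
  match = PRICE_RE.search(html)
  if match:
      print(float(match.group(1)))  # 19.99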


You're using regex to parse html? Have you not read: https://stackoverflow.com/a/1732454/1090568 ?


If you have an HTML document you want to extract information from, regexes are fast and easy.

It's when you don't have guarantees about the structure of the HTML you're working with that regex will come up short.


You can also use css selectors with lxml. Works great, in my experience.

http://lxml.de/cssselect.html
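
For example (a minimal sketch; the markup and selector are made up):

  import lxml.html

  doc = lxml.html.fromstring(
      '<ul><li class="item"><a href="/a">A</a></li>'
      '<li class="item"><a href="/b">B</a></li></ul>'
  )

  # .cssselect() needs the cssselect package installed; it compiles the
  # CSS selector to XPath under the hood.
  for link in doc.cssselect("li.item > a"):
      print(link.get("href"), link.text_content())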


> lxml.html is much better in my experience.

lxml.html has a terrible parser.

> If you want to use CSS selectors, there's pyquery.

CSS selectors are built into lxml through cssselect[0], which is used to convert CSS3 selectors to XPath 1.0 expressions.

[0] http://lxml.de/cssselect.html
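
You can see the translation directly (a small sketch using cssselect on its own; the selector is arbitrary):

  from cssselect import GenericTranslator

  # Compiles a CSS3 selector down to an XPath 1.0 expression, something
  # along the lines of descendant-or-self::li[...class test...]/a
  print(GenericTranslator().css_to_xpath("li.item > a"))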


lxml is C, so it's annoying to install in some environments. Also, the API is inconsistent and verbose.

I’ve found more luck with html5lib and cssselect2.


> Kenneth Reitz Experience

I heard KRE was going to headline at the upcoming PyCon.


For old UNIX nerds, this could cause some confusion: KRE is Robert Elz, originator of timezone support in BSD UNIX.


And the quota system, and the existence of '.oz' as a domain name, which ISO 3166 regrettably refused to allocate.


Really looking forward to using this. BeautifulSoup is great, but it's counterintuitive (I always need to refer to the documentation even for very basic things that I've used dozens of times before), it's often slow, and it has some really weird XML bugs.

Also check out newspaper3k if you haven't seen it. It is high level, but really useful for a bunch of simple scraping-related use cases.


> BeautifulSoup is great, but it's counterintuitive (I always need to refer to the documentation even for very basic things that I've used dozens of times before), it's often slow, and it has some really weird XML bugs.

I won't argue the slowness and the occasional bugs, but unlike you I find bs4 to be very intuitive. And this is mainly why I use it, despite its faults. Maybe our use cases are different, but with a basic knowledge of HTML, I only rarely find myself reading the documentation. Care to give some examples of what you find counterintuitive?


Same! Python requests is one of my favorite libraries of all time. Kenneth Reitz is a treasure.


A similar library is https://github.com/tryolabs/requestium

It adds parsel (which has a really nice API) as the parser on top of requests, though. It also integrates with Selenium.


> Render an Element as Markdown:

I prefer my dependencies to be orthogonal and lightweight, i.e. to do one thing well. Maybe this is better for interactive use in the REPL.


I consider it a nice-to-have, but we can definitely remove it if deemed unnecessary. Want to open an issue about it?


That’s just my preference. Keep the features that are best for your specific use cases. It’s your project after all. I think that’s better than design by committee. Linus didn’t write Linux for other people.


If you’re doing the markdown conversion well, keep it. APIs are not about doing the smallest thing; they're about designing a clean abstraction of the problem you're trying to solve.


Sure, if APIs existed within a vacuum. In the real world we have to worry about ease of packaging, ease of maintenance, and various other practical costs.


For what it is worth, I strongly prefer leaving it in. Sometimes I want to generate documentation for different things based on different sources, and being able to clean up the HTML by rendering it to Markdown would be great.


I personally find it useful, although I can see why it feels out of place.

If you do decide to remove it, I think it's worth adding as an example in the README.


It could be made an optional feature in setup.py.
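
Something along these lines, presumably (a hypothetical sketch, not requests-html's actual setup.py; the extra's dependency name is a placeholder):

  # setup.py (hypothetical sketch)
  from setuptools import setup

  setup(
      name="requests-html",
      version="0.0.0",  # placeholder
      install_requires=["requests", "pyquery", "parse"],
      # the markdown renderer only gets pulled in via
      #   pip install requests-html[markdown]
      extras_require={"markdown": ["some-html2markdown-lib"]},
  )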


Definitely shouldn’t be there.


Nice. It would be pretty cool to have a "run JavaScript and wait for network idle" option for scraping JS-requiring websites. Can Selenium do this? Headless Chrome?

I personally use bs4 for web scraping and it works pretty well, but if there was an option to also do js with a sane API, I'd switch in a heartbeat.


We use puppeteer for our smoke tests: ensure all network requests load and there are no JavaScript errors, take a screenshot of the page, and run simple validations so we know our deploys aren't borked.

I really love puppeteer over selenium. Much deeper control.


Puppeteer has "networkIdle0" (0 network connections for 500ms) and "networkIdle2" (no more than 2 network connections for 500ms). My experiences with it have been very positive.


Selenium can absolutely be used for this, the caveat being that it is much slower than using a regular HTML parser. In my experience, it's best to milk as much as you possibly can out of a site's plain HTML and APIs, only resorting to Selenium where it's absolutely necessary.


The output of r.html.absolute_links on the home page looks like it contains errors. For example,

  'https://www.python.org//docs.python.org/3/tutorial/'

  'https://www.python.org//docs.python.org/3/tutorial/controlflow.html#defining-functions'

I think it's important to remember, for every single new library that comes out, that you are trading apparent usability for unknown issues that haven't surfaced yet because the library is so new.


That bug has been fixed, but the docs hadn't been updated. Fixed now :)


Looks like there's still some room for improvement though!


There is always room for improvement. I think it's easy to underestimate the time required to dwell in a thing before we really understand it. This is true for runtimes, libraries, even entire programming paradigms. We want to take shortcuts using abstract reasoning, and that works well usually, but sometimes you just gotta use something for years. (Alas, one lifetime may be too short to dwell in all the various paradigms the way they each require. But I digress...)


Always, but very nice work. Thanks for pipenv also; it makes working with venvs tolerable.


Interesting! How does this compare with MechanicalSoup (which seems to be the current best-in-class solution for scraping in Python)?


Very different use cases, imo. MechanicalSoup emulates a web browser experience. This is more for scraping.


I love Kenneth’s work.

He’s absolutely focused on user experience – in this case the developer experience – and it absolutely shows.

I’m sure I’ll end up using this utility at some point in the future.


Same functionality for .NET: http://html-agility-pack.net/


There is overlap, but this isn't the same as Requests-HTML. It is nice that this library has been developed further over the years.


> Select an element with a jQuery selector.

What is a 'jQuery' selector? Is it the same as CSS selectors, or does jQuery support non-standard syntax?


For what it's worth, yes, jQuery does support non-standard syntax/selectors.


updated :)


Kenneth Reitz comes out with yet another good UI (in library form) for accomplishing mundane daily tasks with joy.


That's a neat little library. The problem is that the web is rapidly moving away from having HTML as the main information carrier to HTML merely being an envelope to deliver a bunch of JavaScript (and if you're even more unlucky: WebAsm).

Scraping the web will become a lot harder in the future.


That said, JavaScript is just as parsable, and these "JavaScript sites" are probably loading structured data via a JSON API, which will likely be easier to scrape than a bunch of layout HTML.


I'm a lot less worried about that thanks to Chrome Headless and the puppeteer library: https://github.com/GoogleChrome/puppeteer

puppeteer makes scripting headless Chrome for scraping-style tasks trivial, and it's supported by the Chrome development team so it's likely to keep on working long into the future.


Is there a difference between Chrome+puppeteer vs Chrome+selenium?

Is there anything special that Chrome headless has that ordinary Chrome wouldn't have?


Chrome headless is just that: Chrome that's headless. Selenium is a cross-browser API that connects to the browser via some port, but it's usually a high-level API since it's a common denominator of browser APIs. Selenium sometimes uses middleware WebDriver libs and is usually quite bloated.

Puppeteer uses the Chrome remote debugging protocol, which is the same protocol DevTools uses. It's just a simple JSON-RPC over WebSockets. Puppeteer creates a nice library abstraction over this API.

The advantage of puppeteer over Selenium is that you have a lot more control over Chrome: network, perf, screenshots, coverage, DOM and style traversal, etc. It's a very reliable API too, since the Chrome DevTools team maintains the backend.

Selenium on the other hand surprises me with all sorts of quirks.


I've been using Selenium and WebDriver with Ruby for the past five years; there is pretty much nothing it cannot do. Performance is a big hit, though.


I really love Selenium actually.. makes me feel.. powerful, somehow, like an evil genius or something


My evil-genius moment with Selenium happened when I ran a headless client to visit the login page of a bank, wait for the 2FA SMS code from my Android application, then use the code to log in and place some automated stock orders.


I know plenty of people who dislike working with Selenium. I haven't met anyone who's had enough experience with Puppeteer to hate it yet.


> Scraping the web will become a lot harder in the future.

Meh, if you're lucky you can grab the raw data from API requests. Otherwise, just let the JavaScript execute in a headless browser and continue scraping.


Or easier? You just need to look at the endpoints the JavaScript is using to get the data as JSON.
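
E.g., once you've spotted the XHR in the browser's network tab, it's often just this (the endpoint and field names here are made up):

  import requests

  # Hypothetical JSON endpoint found via the network tab.
  resp = requests.get(
      "https://example.com/api/v1/products",
      params={"page": 1},
      headers={"Accept": "application/json"},
  )
  for item in resp.json():
      print(item["name"], item["price"])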


Even on JS-heavy sites, the rendered output is still HTML. You just need to make sure you add a pre-render step. Page.REST, a micro-service I wrote a couple of months back, follows this strategy.


Then again, things keep becoming more API-driven, with the delivered JS just containing the views. Whether it gets easier or harder remains to be seen, I think.


Looks like this is built on top of lxml and parse. I built an adapter for the bs4 interface on top of lxml, which was much faster than using bs4 with an lxml backend.

This is great news, as this space was dominated by bs4 in the Python ecosystem.

Can't wait to use this in the future :)


Also available in PHP with the Symfony DomCrawler (https://symfony.com/doc/current/components/dom_crawler.html), or Goutte (https://github.com/FriendsOfPHP/Goutte), an easy-to-use web scraper that uses DomCrawler.


This is a nice wrapper around requests, pyquery (https://github.com/gawel/pyquery/), and parse (https://github.com/r1chardj0n3s/parse), of which only requests is Kenneth Reitz's. Let's give credit where it's due.


To give even more credit where it's due, requests is a nice wrapper around urllib3, which is the work of Andrey Petrov, Cory Benfield and contributors. While requests provides good user-friendly defaults and API semantics, urllib3 does a lot of the heavy lifting.


It wasn't at first, actually. It was originally a wrapper around urllib2 — Andrey and I collabed early on in both projects' histories to make them what they are today.


requests does much more on top of urllib3 than this new requests-html does on top of requests + pyquery.

It greatly simplifies the most common patterns in doing HTTP requests: things like authentication, passing headers, retry logic, etc.
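
For instance, these are the kinds of patterns it wraps up neatly (the retry wiring still leans on urllib3 under the hood; URL and credentials are placeholders):

  import requests
  from requests.adapters import HTTPAdapter
  from urllib3.util.retry import Retry

  session = requests.Session()
  # Retry transient failures: urllib3 does the heavy lifting, requests exposes it.
  session.mount("https://", HTTPAdapter(max_retries=Retry(total=3, backoff_factor=0.5)))

  resp = session.get(
      "https://example.com/api",
      auth=("user", "pass"),               # basic auth in one argument
      headers={"Accept": "application/json"},
      timeout=10,
  )
  resp.raise_for_status()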


Which is written in Python, so better give credit for that.

Wow, and now Python is written in C, so line up your credit books; we are in for a long night tonight.

Thoughts & prayers for all contributors.


To quote Carl Sagan, "If you wish to make an apple pie from scratch, you must first invent the universe."


...I know you're trolling, but I seriously wonder where it would stop if we were to go down all that way.


Gotta give credit to whichever cavemen/cavewomen discovered fire, and all the civilizations that invented the wheel, Ben Franklin for harnessing the power of electricity, and then maybe Ken Thompson and Dennis Ritchie for inventing Unix.


I think most people around here know this, but it doesn't really matter. We love `requests` because of its API, which is what Kenneth Reitz contributed. There are many competing HTTP client libs, but only one (that I know of) with intuitive syntax; I hope others follow suit. APIs are not just something that you slam on top of your lib; they should be developed the same way every other interface is. UX is usually much more important than pure performance.


What do you mean, "give credit where it's due"? It's not common (in my experience) to thank every dependency's creator, right?


In about half of the examples, the methods are more user-friendly wrappers around the underlying libraries. Great! Love it! In the other half of the examples, though, the niceness of the API is in fact due to the user-friendly design of the underlying PyQuery/Parse libraries. So I want to give credit to their good APIs, too.


Is there a shell interface, or do I have to call it through Python? It would be nice to be able to use it in a Ruby project.


I don't get the point; it's just a nicer alternative to BeautifulSoup. HTML pages are always changing, so instead of writing the parsing logic in code, I think we should put XPath or CSS expressions in config files.
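
Something like this, presumably (a hypothetical sketch; the "config" is inlined as a dict for brevity, but it could just as well live in YAML or JSON):

  import lxml.html

  # Hypothetical config contents: field name -> CSS selector.
  SELECTORS = {
      "title": "h1.product-title",
      "price": "span.price",
  }

  def scrape(html, selectors):
      doc = lxml.html.fromstring(html)
      out = {}
      for field, css in selectors.items():
          nodes = doc.cssselect(css)
          out[field] = nodes[0].text_content().strip() if nodes else None
      return out

  page = '<h1 class="product-title">Widget</h1><span class="price">$5</span>'
  print(scrape(page, SELECTORS))  # {'title': 'Widget', 'price': '$5'}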


I'm sure some pages could be described by CSS expressions in config files, but in general it seems like scraping is about working around unanticipated idiosyncrasies (e.g. X% of the pages have a different structure for mysterious reasons).

I haven't used XPath much, though (and it seems pretty beefy!).


So stoked this popped up. Literally woke up this morning debating about moving away from bs4 and wrapping the functionality I needed in lxml. I was just thinking I wish there was a requests equivalent for parsing...


Is there any useful feature in bs4 that is not available in lxml.html?


This looks awesome; can't wait to try it. Last time I used pyquery, though, it was considerably more limited than full CSS/jQuery syntax, and I reverted to XPath. Has it improved recently?


HTML requests are plain text and self-explanatory ("Content-length", "charset", etc.). What exactly is unhuman about that?


HTML request bodies are arbitrary bytes. There are nice ways to do useful things with those arbitrary bytes and less nice ways. This library purports to collect some of the nice ones.

"XYZ for Humans" is a trope for the author of this library; for example, requests' tagline is "HTTP for humans".


That's HTTP, not HTML.


Yikes, major misread on my part. Whoops.


That’s HTTP, but both it and HTML share a lot of things which are tedious for humans to deal with – encodings, error handling, handling things like attributes which may but usually do not have multiple values, etc. – and most HTML tooling has user-hostile APIs dating back to the XML era, when developer convenience was seen as coddling the weak.



