That said, it looks more like an API experiment than a practical solution for day-to-day work, at least in its current state:
* response body encoding detection is wrong, as it doesn't take <meta> tags or the BOM into account;
* base URL detection is wrong, as it doesn't take the <base> tag into account;
* URL parsing (joining, etc.) is implemented with string operations instead of the stdlib, so careful inspection is required to make sure it works in edge cases. For example, I can see right away that .absolute_links is wrong for protocol-relative URLs (e.g. "//ajax.microsoft.com/ajax/jquery/jquery-1.3.2.min.js"); see the urljoin sketch right after this list;
* html2text, used for .markdown, is GPL - I know people have different opinions on this, but in my book if you import from a GPL package, your package becomes GPL as well;
* each .xpath call parses a chunk of HTML again, even if a tree is already present;
* the shortcuts are opinionated and their behavior is not clearly defined. For example, .links deduplicates URLs by default, and it deduplicates them by string matching (so e.g. a different order of GET arguments means the URLs are considered unique); it checks for .startswith('#'), which looks arbitrary (what if these links are used in a headless browser? what if someone wants to fetch them using _escaped_fragment_, which many sites still support? why filter out such URLs when they are relative, but not when they are absolute?).
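To illustrate the URL-joining point: a minimal sketch of how the stdlib handles these cases (the base URL below is made up; the jQuery URL is the one quoted above):

    from urllib.parse import urljoin

    base = "https://example.com/docs/page.html"  # the page's own URL

    # protocol-relative URLs keep the scheme of the base URL
    urljoin(base, "//ajax.microsoft.com/ajax/jquery/jquery-1.3.2.min.js")
    # -> 'https://ajax.microsoft.com/ajax/jquery/jquery-1.3.2.min.js'

    # relative URLs are resolved against the base path
    urljoin(base, "other.html")
    # -> 'https://example.com/docs/other.html'

    # if the document contains <base href="...">, that href (not the response
    # URL) is what should be passed as the first argument here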
Because there is nothing usable code-wise in requests-html from my point of view (it is no better than the existing alternatives), I don't feel like raising these issues, advocating for fixing them, or discussing alternative solutions with the goal of improving requests-html. Of course, everyone is free to raise these issues in the repo.
I appreciate the work put into the requests-html API design; the design is very nice overall. This might be a way to go: create a nice API, attract people, fix the implementation over time - but this battle is not mine, sorry :(
I'd like all of these improvements to be made to the software. It's all about getting a nice API in place first, then making it perfect second.
With libraries like these, it's all about getting the API right first, then optimizing for perfection second. :)
A second iteration of review:
* encoding detection from <meta> tags doesn't normalize encoding names - Python doesn't use the same names as HTML;
* I'm still not sure encoding detection is correct, as it is unclear what the priorities are in the current implementation. They should be: 1) Content-Type header; 2) BOM marks; 3) encoding in meta tags (or the XML-declared encoding, if you support it); 4) content-based guessing - chardet, etc. - or just a default value (a rough sketch of this order follows after this list). I.e. the encoding in meta should have lower priority than the Content-Type header, but higher priority than chardet, and if I understand it properly, response.text is decoded using both the Content-Type header and chardet.
* lxml's fromstring handles XML (XHTML) encoding declarations, and it may fail on unicode data (http://lxml.de/parsing.html#python-unicode-strings), so passing response.text to fromstring is not good. At the same time, relying on lxml to detect the encoding is not enough either, as HTTP headers should have higher priority. In parsel we re-encode the text to UTF-8 and force a UTF-8 parser for lxml to solve this: https://github.com/scrapy/parsel/blob/f6103c8808170546ecf046...;
* when extracting links, it is not enough to use the raw @href attribute values, as they are allowed to have leading and trailing whitespace (see https://github.com/scrapy/w3lib/blob/c1a030582ec30423c40215f...);
* absolute_links doesn't look correct for base URLs which contain a path. It may also have issues with URLs like tel:1122333 or mailto: links.
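To make the priority order above concrete, here is a rough stdlib-plus-chardet sketch (this is not the requests-html or Scrapy code; the regexes are deliberately simplistic):

    import codecs
    import re

    def normalize(name):
        # HTML uses labels like "windows-1252" or "latin1"; codecs.lookup maps
        # them to canonical Python codec names when a codec exists
        try:
            return codecs.lookup(name).name
        except LookupError:
            return None

    def detect_encoding(content_type_header, body):
        # 1) charset from the Content-Type header
        m = re.search(r"charset=([\w.-]+)", content_type_header or "", re.I)
        if m and normalize(m.group(1)):
            return normalize(m.group(1))
        # 2) BOM marks
        for bom, enc in ((codecs.BOM_UTF8, "utf-8"),
                         (codecs.BOM_UTF16_LE, "utf-16-le"),
                         (codecs.BOM_UTF16_BE, "utf-16-be")):
            if body.startswith(bom):
                return enc
        # 3) <meta charset=...> / http-equiv declaration in the document itself
        m = re.search(rb'<meta[^>]+charset=["\']?([\w.-]+)', body[:4096], re.I)
        if m and normalize(m.group(1).decode("ascii", "replace")):
            return normalize(m.group(1).decode("ascii", "replace"))
        # 4) content-based guessing as a last resort
        import chardet
        return chardet.detect(body)["encoding"]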
For encoding detection in Scrapy we're using https://github.com/scrapy/w3lib/blob/c1a030582ec30423c40215f.... It works well overall; its weakness is that it doesn't require an HTML tree and doesn't parse one, extracting the meta information only from the first 4 KB using a regex (the 4 KB limit is not good). Other than that, it does all the right things AFAIK.
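For reference, a small usage sketch of that helper, combined with the UTF-8 re-encoding trick from the parsel link above (w3lib and lxml assumed installed; the response_* values are placeholders):

    import lxml.etree
    from w3lib.encoding import html_to_unicode

    response_content_type = "text/html; charset=iso-8859-1"                # placeholder
    response_body_bytes = "<html><body>caf\u00e9</body></html>".encode("iso-8859-1")  # placeholder

    # headers win over in-document declarations; content-based guessing is only the fallback
    encoding, utext = html_to_unicode(response_content_type, response_body_bytes)

    # re-encode to UTF-8 and force a UTF-8 parser, so lxml ignores any
    # conflicting XML/XHTML encoding declaration inside the document
    parser = lxml.etree.HTMLParser(encoding="utf-8", recover=True)
    tree = lxml.etree.fromstring(utext.encode("utf-8"), parser=parser)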
EDIT: first commit 22 hours ago - goes to show that when you have thought the idea through and know what you're doing, it doesn't take long to produce the first version. :)
lxml.html is much better in my experience. If you want to use CSS selectors, there's pyquery.
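A quick sketch of both, for anyone who hasn't tried them (the markup is made up):

    import lxml.html
    from pyquery import PyQuery

    html = "<div id='menu'><a href='/about'>About</a></div>"

    # plain lxml.html with XPath
    doc = lxml.html.fromstring(html)
    print(doc.xpath("//div[@id='menu']/a/@href"))  # ['/about']

    # pyquery: jQuery-style CSS selectors on top of lxml
    d = PyQuery(html)
    print(d("#menu a").attr("href"))               # '/about'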
Now, I very rarely have any need to scrape and parse HTML data, and I was scratching my head at how these parsers could take so long to parse a 3.5 MiB HTML page. I mean, it should be able to go through that and get what I want in under a second, right?
So, I said screw it and wrote some regexes. The whole job was now taking 10-15 seconds: it actually took only 1.5 or so seconds to parse the HTML, and the rest was waiting for the page to download.
Ironically, implementing the regexes was actually quicker than figuring out how to use those HTML parsers and writing the code. Of course, that assumes you know how to craft regexes. Since it was set up to run every 5 minutes, I wanted something that could do the job without spending half its time parsing the data (the processors were needed for other tasks as well).
It's when you don't have guarantees about the structure of the HTML you're working with that regexes come up short.
lxml.html has a terrible parser.
> If you want to use CSS selectors, there's pyquery.
CSS selectors are built into lxml through cssselect, which translates CSS3 selectors into XPath 1.0 expressions.
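A small illustration of that translation (output shown approximately):

    import lxml.html
    from cssselect import GenericTranslator

    print(GenericTranslator().css_to_xpath("div.entry > a[href]"))
    # roughly: descendant-or-self::div[contains(... ' entry ' ...)]/a[@href]

    # lxml wires cssselect in for you via .cssselect()
    tree = lxml.html.fromstring("<div class='entry'><a href='/x'>x</a></div>")
    print(tree.cssselect("div.entry > a[href]")[0].get("href"))  # '/x'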
> If you want to use CSS selectors, there's pyquery.
I’ve found more luck with html5lib and cssselect2.
I heard KRE was going to headline at the upcoming PyCon.
Also check out newspaper3k if you haven't seen it. It is high level, but really useful for a bunch of simple scraping-related use cases.
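If I remember its API correctly, typical usage looks roughly like this (the URL is a placeholder):

    from newspaper import Article

    article = Article("https://example.com/some-news-story")
    article.download()
    article.parse()
    print(article.title)
    print(article.text)   # extracted article body, boilerplate stripped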
I won't argue about the slowness and the occasional bugs, but unlike you I find bs4 to be very intuitive. And this is mainly why I use it, despite its faults. Maybe our use cases are different, but with a basic knowledge of HTML, I only rarely find myself reading the documentation. Care to give some examples of what you find counter-intuitive?
Though it adds parsel as a parser (which has a really nice API) to requests. It also integrates with selenium.
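For anyone who hasn't used parsel, a tiny sketch of that API (the markup is made up):

    from parsel import Selector

    sel = Selector(text="<ul><li><a href='/a'>A</a></li><li><a href='/b'>B</a></li></ul>")
    print(sel.css("a::attr(href)").extract())  # ['/a', '/b']
    print(sel.xpath("//a/text()").extract())   # ['A', 'B']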
I prefer my dependencies to be orthogonal and lightweight, i.e. to do one thing well. Maybe this is better for interactive use in the REPL.
If you do decide to remove it, I think it's worth adding as an example in the README.
I personally use bs4 for web scraping and it works pretty well, but if there were an option to also run JS with a sane API, I'd switch in a heartbeat.
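For comparison, the kind of bs4 code I mean (the markup is made up):

    from bs4 import BeautifulSoup

    soup = BeautifulSoup("<div class='post'><a href='/1'>first</a></div>", "html.parser")
    for link in soup.select("div.post a[href]"):
        print(link["href"], link.get_text())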
I really love puppeteer over selenium. Much deeper control.
He’s absolutely focused on user experience – in this case the developer experience – and it absolutely shows.
I’m sure I’ll end up using this utility at some point in the future.
What is a 'jQuery' selector? Is it the same as CSS selectors, or does jQuery support non-standard syntax?
Scraping the web will become a lot harder in the future.
puppeteer makes scripting headless Chrome for scraping-style tasks trivial, and it's supported by the Chrome development team, so it's likely to keep working long into the future.
Is there anything special that Chrome headless has that ordinary Chrome wouldn't have?
Puppeteer uses the Chrome DevTools Protocol (the remote debugging protocol), which is the same protocol DevTools itself uses. It's just simple JSON RPC over websockets. Puppeteer creates a nice library abstraction over this API.
The advantage of puppeteer over selenium is that you get a lot more control over Chrome: network, perf, screenshots, coverage, DOM & style traversal, etc. It's a very reliable API too, since the Chrome DevTools team maintains the backend.
Selenium on the other hand surprises me with all sorts of quirks.
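To make the "JSON RPC over websockets" point concrete, here is a rough Python sketch that talks to Chrome directly rather than through puppeteer. It assumes Chrome was started with --remote-debugging-port=9222 and that the requests and websockets packages are installed; the method names are real DevTools Protocol methods, everything else is illustrative:

    import asyncio
    import itertools
    import json

    import requests
    import websockets

    async def main():
        # every tab/target exposes its own websocket debugger URL
        target = requests.get("http://localhost:9222/json").json()[0]
        ids = itertools.count(1)
        async with websockets.connect(target["webSocketDebuggerUrl"]) as ws:
            await ws.send(json.dumps({"id": next(ids), "method": "Page.navigate",
                                      "params": {"url": "https://example.com"}}))
            await ws.recv()         # response to the navigate command
            await asyncio.sleep(1)  # crude; real clients wait for lifecycle events
            await ws.send(json.dumps({"id": next(ids), "method": "Runtime.evaluate",
                                      "params": {"expression": "document.title"}}))
            print(json.loads(await ws.recv()))

    asyncio.run(main())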
This is great news, as this space was dominated by bs4 in the Python ecosystem.
Can't wait to use this in the future :)
It greatly simplifies the most common patterns in making HTTP requests: things like authentication, passing headers, retry logic, etc.
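For instance, a session with default headers, basic auth, and retries wired up takes just a few lines (hostnames and credentials here are placeholders):

    import requests
    from requests.adapters import HTTPAdapter
    from urllib3.util.retry import Retry

    session = requests.Session()
    session.headers.update({"User-Agent": "my-scraper/0.1"})
    session.auth = ("user", "secret")
    session.mount("https://", HTTPAdapter(max_retries=Retry(total=3, backoff_factor=0.5)))

    resp = session.get("https://example.com/api/items")
    resp.raise_for_status()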
Wow and now python is in C so line up your credit books, we are in for a long night tonight.
Thoughts & prayers for all contributors
I haven't used xpath much though (and it seems pretty beefy!).
"XYZ for Humans" is a trope for the author of this library; for example, requests' tagline is "HTTP for humans".