
Requests-HTML: HTML Parsing for Humans - ingve
https://github.com/kennethreitz/requests-html
======
kmike84
+1 to make nicer APIs! It is always good to have more high-quality API designs
to look at.

That said, it looks more like an API experiment than a practical solution for
a day job, at least in its current state:

* response body encoding detection is wrong, as it doesn't take meta tags or BOM into account;

* base url detection is wrong, as it doesn't take the <base> tag into account;

* URL parsing (joining, etc.) is implemented using string operations instead of the stdlib, so careful inspection is required to make sure it works in edge cases. For example, I can see right away that .absolute_links is wrong for protocol-relative urls (e.g. "//ajax.microsoft.com/ajax/jquery/jquery-1.3.2.min.js") - see the sketch after this list;

* html2text used for .markdown is GPL - I know people have different opinions on this, but in my book, if you import from a GPL package, your package becomes GPL as well;

* each .xpath call parses a chunk of HTML again, even if a tree is already present;

* shortcuts are opinionated and their behavior is unclear, e.g. .links deduplicates URLs by default, and it deduplicates them by string matching (so e.g. a different order of GET arguments means the URLs are considered unique); it checks for .startswith('#'), which looks arbitrary (what if these links are used in a headless browser? what if someone wants to fetch them using _escaped_fragment_, which many sites still support? why filter out such URLs if they are relative, but not if they are absolute?)
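
A minimal sketch (illustrative, not requests-html code) of the protocol-relative problem: naive string joining keeps the wrong host, while the stdlib's urljoin resolves the href against the base URL's scheme.

    from urllib.parse import urljoin

    base = 'https://www.python.org/'
    href = '//ajax.microsoft.com/ajax/jquery/jquery-1.3.2.min.js'

    # Naive string joining produces a URL on the wrong host:
    print(base.rstrip('/') + href)
    # -> https://www.python.org//ajax.microsoft.com/ajax/jquery/jquery-1.3.2.min.js

    # urljoin resolves the protocol-relative href correctly:
    print(urljoin(base, href))
    # -> https://ajax.microsoft.com/ajax/jquery/jquery-1.3.2.min.js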

~~~
EmilStenstrom
Would you mind posting these as issues?

~~~
kmike84
TBH I don't see myself using this package: in its current state it is very
little code, and almost every method has an issue either with edge cases or
with the API; it is also tied to the requests library, unnecessarily IMHO, and
in my opinion it is GPL even if setup.py says it is MIT.

Because there is nothing usable code-wise in requests-html from my point of
view (it is no better than existing alternatives), I don't feel like raising
these issues, advocating for fixing them, or discussing alternative solutions
with the goal of improving requests-html. Of course, everyone is free to raise
these issues in the repo.

I appreciate the work put into the requests-html API design; the design is
very nice overall. This might be the way to go: create a nice API design,
attract people, and fix the implementation over time. But this battle is not
mine, sorry :(

~~~
kenneth_reitz
GPL dependency removed.

I'd like all of these improvements to be made to the software. It's all about
getting a nice API in place first, then making it perfect second.

~~~
kenneth_reitz
I addressed most of your issues, like the one about not using urlparse, in the
latest release.

With libraries like these, it's all about getting the API right first, then
optimizing for perfection second. :)

~~~
kmike84
:thumbs up:

A second iteration of review:

* encoding detection from <meta> tags doesn't normalize encodings - Python doesn't use the same names as HTML (see the sketch after this list);

* I'm still not sure encoding detection is correct, as it is unclear what the priorities are in the current implementation. It should be: 1) Content-Type header; 2) BOM marks; 3) encoding in meta tags (or the XML-declared encoding, if you support it); 4) content-based guessing - chardet, etc. - or just a default value. I.e. the encoding in meta should have lower priority than the Content-Type header, but higher priority than chardet; yet if I understand it properly, response.text is decoded using both the Content-Type header and chardet.

* lxml's fromstring handles XML (XHTML) encoding declarations, and it may fail on unicode data ([http://lxml.de/parsing.html#python-unicode-strings](http://lxml.de/parsing.html#python-unicode-strings)), so passing response.text to fromstring is not good. At the same time, relying on lxml to detect the encoding is not enough, as HTTP headers should have higher priority. In parsel we solve this by re-encoding the text to utf8 and forcing the utf8 parser for lxml: [https://github.com/scrapy/parsel/blob/f6103c8808170546ecf046...](https://github.com/scrapy/parsel/blob/f6103c8808170546ecf046b9f4ea6dead94de189/parsel/selector.py#L38).

* when extracting links, it is not enough to use raw @href attribute values, as they are allowed to have leading and trailing whitespace (see [https://github.com/scrapy/w3lib/blob/c1a030582ec30423c40215f...](https://github.com/scrapy/w3lib/blob/c1a030582ec30423c40215fcd159bc951c851ed7/w3lib/html.py#L325))

* absolute_links doesn't look correct for base urls which contain a path. It may also have issues with urls like tel:1122333 or mailto:.
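
A minimal sketch (illustrative) of the label-normalization problem from the first bullet - HTML charset labels and Python codec names don't line up one-to-one:

    import codecs

    print(codecs.lookup('utf8').name)        # 'utf-8': Python's aliases cover this
    print(codecs.lookup('iso-8859-1').name)  # 'iso8859-1', although per the
                                             # WHATWG encoding spec this HTML
                                             # label should map to cp1252
    try:
        codecs.lookup('x-user-defined')      # a valid HTML label...
    except LookupError:
        print('...with no Python codec at all')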

For encoding detection in Scrapy we're using
[https://github.com/scrapy/w3lib/blob/c1a030582ec30423c40215f...](https://github.com/scrapy/w3lib/blob/c1a030582ec30423c40215fcd159bc951c851ed7/w3lib/encoding.py#L187).
It works well overall; its weakness is that it doesn't require an HTML tree
and doesn't parse one, extracting meta information only from the first 4Kb
using a regex (the 4Kb limit is not good). Other than that, it does all the
right things AFAIK.
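
A minimal usage sketch (assuming the w3lib API linked above): html_to_unicode applies the priority order described earlier - HTTP header first, then BOM, then <meta> declarations, then a fallback guess.

    from w3lib.encoding import html_to_unicode

    content_type = 'text/html; charset=utf-8'   # from the HTTP response headers
    body = b'<html><head><meta charset="cp1251"></head><body>...</body></html>'

    encoding, text = html_to_unicode(content_type, body)
    print(encoding)  # the header charset wins over the <meta> declaration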

~~~
kenneth_reitz
Thanks for the feedback, integrated w3lib!

------
amenod
Oooo... I just finished writing a script with BeautifulSoup. While it wasn't
all that bad (it works ;) ), I'm sure the "Kenneth Reitz experience" would be
much better. I won't be rewriting the script now, but I can't wait to find an
excuse to try this. :)

EDIT: first commit 22 hours ago - goes to show that when you have thought
about the idea and know what you're doing, it doesn't take long to produce the
first version. :)

~~~
halflings
Would not recommend BeautifulSoup for this type of thing.

lxml.html is much better in my experience. If you want to use CSS selectors,
there's pyquery.
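
A minimal sketch (illustrative) of both approaches on the same snippet - XPath via lxml.html, CSS selectors via pyquery:

    from lxml import html
    from pyquery import PyQuery

    snippet = '<div><a href="/about">About</a></div>'

    doc = html.fromstring(snippet)
    print(doc.xpath('//a/@href'))    # ['/about'] - XPath with lxml.html

    d = PyQuery(snippet)
    print(d('a').attr('href'))       # '/about' - CSS selector with pyquery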

~~~
Vindicis
I remember years ago I needed to parse some html (about 2-3 million
characters), and after a fair bit of time I had it up and running with
beautifulsoup. Now, my use case was likely quite atypical compared to most
html parsing, but my god was it ever slow! I forget the exact numbers, but I
think it was taking about 150 seconds to complete. So then I wrote it using
lxml, which was an improvement, but it was still taking around 100 seconds.

Now, I very rarely have any need to scrape and parse html data, and I was
scratching my head at how these parsers could take so long on a 3.5 MiB html
page. I mean, they should be able to go through that and get what I want in
under a second, right?

So, I said screw it and wrote some regexes. The whole run now took 10-15
seconds, of which only about 1.5 seconds was actually parsing the html; the
rest was waiting for the webpage to download.

Ironically, implementing the regexes was actually quicker than figuring out
how to use those html parsers and writing the code. Of course, that's assuming
you know how to craft regexes. Since it was set up to run every 5 minutes, I
wanted something that wouldn't spend half its time parsing the data (amongst
other tasks the processors were needed for).

YMMV

~~~
jcadam
You're using regex to parse html? Have you not read
[https://stackoverflow.com/a/1732454/1090568](https://stackoverflow.com/a/1732454/1090568)?

~~~
wodenokoto
If you have an HTML document you want to extract information from, regexes are
fast and easy.

It's when you don't have guarantees about the structure of the html you are
working with that regex will come up short.
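
A minimal sketch (illustrative) of that trade-off - a regex is quick to write when the markup is guaranteed, and brittle when it isn't:

    import re

    page = '<span class="price">$19.99</span><span class="price">$4.50</span>'
    print(re.findall(r'<span class="price">\$([\d.]+)</span>', page))
    # ['19.99', '4.50'] - works until an attribute is added or reordered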

------
sixhobbits
Really looking forward to using this. BeautifulSoup is great, but it's
counterintuitive (I always need to refer to the documentation, even for very
basic aspects that I've used dozens of times before), and it's often slow and
has some really weird xml bugs.

Also check out newspaper3k if you haven't seen it. It is high-level, but
really useful for a bunch of simple scraping-related use cases.

~~~
Momquist
> _BeautifulSoup is great, but it's counterintuitive (I always need to refer
> to the documentation, even for very basic aspects that I've used dozens of
> times before), and it's often slow and has some really weird xml bugs._

I won't argue the slowness and the occasional bugs, but unlike you I find bs4
to be very intuitive. And this is mainly why I use it, despite its faults.
Maybe our use cases are different, but with a basic knowledge of html I only
rarely find myself reading the documentation. Care to give some examples of
what you find counterintuitive?

------
realhamster
A similar library is
[https://github.com/tryolabs/requestium](https://github.com/tryolabs/requestium)

Though it adds parsel (which has a really nice API) as a parser to requests.
It also integrates with selenium.

------
friendlydude12
> Render an Element as Markdown:

I prefer my dependencies to be orthogonal and lightweight, i.e. to do one
thing well. Maybe this is better for interactive use in the REPL.

~~~
kenneth_reitz
I consider it a nice-to-have, but we can definitely remove it if deemed
unnecessary. Want to open an issue about it?

~~~
friendlydude12
That’s just my preference. Keep the features that are best for your specific
use cases. It’s your project after all. I think that’s better than design by
committee. Linus didn’t write Linux for other people.

~~~
nojvek
If you’re doing the markdown conversion well, keep it. APIs are not about
doing the smallest thing; they’re about designing a clean abstraction of the
problem you’re trying to solve.

~~~
friendlydude12
Sure, if APIs existed in a vacuum. In the real world we have to worry
about ease of packaging, ease of maintenance, and various other practical
costs.

------
ivan_ah
Nice. It would be pretty cool to have a "run javascript and wait for network
idle" option for scraping js-requiring websites. Can selenium do this?
Headless Chrome?

I personally use bs4 for web scraping and it works pretty well, but if there
was an option to also do js with a sane API, I'd switch in a heartbeat.

~~~
nojvek
We use puppeteer for our smoke tests: ensure all network requests load and
there are no JavaScript errors, take a screenshot of the page, and run simple
validations so we know our deploys aren’t borked.

I really love puppeteer over selenium. Much deeper control.
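
For the "run javascript and wait for network idle" part of the question above, a minimal sketch assuming pyppeteer (an unofficial Python port of puppeteer); 'networkidle0' resolves the navigation once no network connections remain open:

    import asyncio

    from pyppeteer import launch

    async def render(url):
        browser = await launch()
        page = await browser.newPage()
        await page.goto(url, waitUntil='networkidle0')  # wait for network idle
        content = await page.content()                  # HTML after js has run
        await browser.close()
        return content

    html = asyncio.get_event_loop().run_until_complete(
        render('https://example.com'))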

------
javajosh
The output of _r.html.absolute_links_ on the home page looks like it contains
errors. For example,

      'https://www.python.org//docs.python.org/3/tutorial/'
      'https://www.python.org//docs.python.org/3/tutorial/controlflow.html#defining-functions'

I think it's important to remember, for every single new library that comes
out, that you are trading apparent usability for unknown issues that haven't
surfaced yet because the library is so new.

~~~
kenneth_reitz
That bug has been fixed, but the docs hadn't been updated. Fixed now :)

~~~
kenneth_reitz
Looks like there's still some room for improvement though!

~~~
javajosh
There is always room for improvement. I think it's easy to underestimate the
time required to _dwell_ in a thing before we really understand it. This is
true for runtimes, libraries, even entire programming paradigms. We want to
take shortcuts using abstract reasoning, and that usually works well, but
sometimes you just gotta use something for _years_. (Alas, one lifetime may be
too short to dwell in all the various paradigms the way they each require. But
I digress...)

------
ameliaquining
Interesting! How does this compare with MechanicalSoup (which seems to be the
current best-in-class solution for scraping in Python)?

~~~
kenneth_reitz
Very different use cases, imo. MechanicalSoup emulates a web browser
experience. This is more for scraping.

------
nbrempel
I love Kenneth’s work.

He’s absolutely focused on user experience – in this case the developer
experience – and it absolutely shows.

I’m sure I’ll end up using this utility at some point in the future.

------
Const-me
Same functionality for .NET:
[http://html-agility-pack.net/](http://html-agility-pack.net/)

~~~
zerkten
There is overlap, but this isn't the same as Requests-HTML. It is nice that
this library has been developed further over the years.

------
nishs
> Select an element with a jQuery selector.

What is a 'jQuery' selector? Is it the same as CSS selectors, or does jQuery
support non-standard syntax?

~~~
madeofpalk
For what it's worth, yes, jQuery does support non-standard selectors - e.g. :eq(), :contains(), and :visible are jQuery extensions that are not part of the CSS spec.

------
ecthiender
Kenneth Reitz comes out with yet another good UI (in the form of a library)
for accomplishing daily mundane tasks with joy.

------
jacquesm
That's a neat little library. The problem is that the web is rapidly moving
away from having HTML as the main information carrier to HTML merely being an
envelope for delivering a bunch of JavaScript (and, if you're even more
unlucky, WebAssembly).

Scraping the web will become a lot harder in the future.

~~~
madeofpalk
That said, JavaScript is just as parsable, and these 'javascript sites' are
probably loading structured data via a JSON API, which will probably be easier
to scrape than a bunch of layout HTML.
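
A minimal sketch (the endpoint here is hypothetical - find the real one in the browser's network tab) of that point: when a page hydrates itself from a JSON API, hitting the API directly skips HTML parsing entirely.

    import requests

    # Hypothetical endpoint discovered by watching the site's XHR traffic:
    data = requests.get('https://example.com/api/items').json()
    print(data)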

------
darpa_escapee
Looks like this is built on top of lxml and parse. I once built an adapter
exposing the bs4 interface on top of lxml; it was much faster than using bs4
with an lxml backend.

This is great news, as this space has been dominated by bs4 in the Python
ecosystem.

Can't wait to use this in the future :)

------
coolgoose
Also in PHP with the Symfony DomCrawler component:
[https://symfony.com/doc/current/components/dom_crawler.html](https://symfony.com/doc/current/components/dom_crawler.html),
or Goutte
[https://github.com/FriendsOfPHP/Goutte](https://github.com/FriendsOfPHP/Goutte),
an easy-to-use web scraper that uses DomCrawler.

------
brilee
This is a nice wrapper around requests, pyquery
[https://github.com/gawel/pyquery/](https://github.com/gawel/pyquery/), and
parse
[https://github.com/r1chardj0n3s/parse](https://github.com/r1chardj0n3s/parse),
of which only requests is Kenneth Reitz's. Let's give credit where it's due.

~~~
ak217
To give even more credit where it's due, requests is a nice wrapper around
urllib3, which is the work of Andrey Petrov, Cory Benfield and contributors.
While requests provides good user-friendly defaults and API semantics, urllib3
does a lot of the heavy lifting.

~~~
tty7
Which is written in Python, so better give credit for that.

Wow, and now Python is written in C, so line up your credit books; we are in
for a long night tonight.

Thoughts & prayers for all contributors

~~~
d33
...I know you're trolling, but I seriously wonder where it would stop if we
were to go all the way down that path.

~~~
patneedham
Gotta give credit to whichever cavemen/cavewomen discovered fire, and all the
civilizations that invented the wheel, Ben Franklin for harnessing the power
of electricity, and then maybe Ken Thompson and Dennis Ritchie for inventing
Unix.

------
vinceguidry
Is there a shell interface, or do I have to call it through python? It would
be nice to be able to use it in a Ruby project.

------
ospider
I don't get the point; it's just a nicer alternative to beautifulsoup. HTML
pages are always changing, so instead of writing the parsing logic in code, I
think we should put XPath or CSS expressions in config files.
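
A minimal sketch (illustrative) of the config-driven idea - the selectors live in data rather than code, so they can change without touching the program:

    from lxml import html

    # In practice these would be loaded from a JSON/YAML config file:
    SELECTORS = {
        'title': '//h1/text()',
        'links': '//a/@href',
    }

    doc = html.fromstring('<h1>Hello</h1><a href="/x">x</a>')
    record = {name: doc.xpath(xp) for name, xp in SELECTORS.items()}
    print(record)  # {'title': ['Hello'], 'links': ['/x']}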

~~~
closed
I'm sure some pages could be handled by css expressions in config files, but
in general it seems like scraping is about working around unanticipated
idiosyncrasies (e.g. X% of the pages have a different structure for mysterious
reasons).

I haven't used xpath much though (and it seems pretty beefy!).

------
vapemaster
So stoked this popped up. Literally woke up this morning debating moving away
from bs4 and wrapping the functionality I needed in lxml. I was just thinking
I wish there was a requests equivalent for parsing...

~~~
halflings
Is there any useful feature in bs4 that is not available in lxml.html?

------
Lxr
This looks awesome; can’t wait to try it. Last time I used pyquery, though, it
was considerably more limited than full CSS/jQuery selector syntax, and I
reverted to XPath. Has it improved recently?

------
fao_
HTML requests are plain text and self-explanatory ("Content-length",
"charset", etc.). What exactly is unhuman about that?

~~~
dullgiulio
That's HTTP, not HTML.

~~~
fao_
Yikes, major misread on my part. Whoops.

