Cool library, I might use this for a side project of mine that also parses HTML data from HackerNews.
What are your impressions of scraper and html5ever? When I initially looked at HTML/XML parsing libraries for Rust, there didn't seem to be a standout library such as serde_json for JSON data. I was also considering using scraper + html5ever. However, I'm curious if scraper adds enough to warrant the additional dependency as opposed to directly using html5ever.
I haven't used scraper too much. I personally find the predicate approach of select.rs [0] easier to use. However in this case the selector approach just made more sense.
Standalone html5ever can be a bit cumbersome to work with directly. scraper is basically an implementation of html5ever's `TreeSink` trait, whereas select.rs uses html5ever's `RcDom` to parse the document but stores it in a more convenient way. If you're looking for a minimal approach, you should look at select.rs, which basically depends only on html5ever.
Hey, just FYI: when you build the hackernews and explore examples without enabling the tokio feature flag, you get compilation errors about undeclared types for all the tokio stuff. I think these examples just need entries in the Cargo.toml requiring the tokio feature, like the reddit example has.
Once I add the tokio feature, they all run as expected.
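For what it's worth, Cargo supports declaring per-example feature requirements, so the fix could look something like this in Cargo.toml (the example and feature names here are assumptions based on this thread, not the crate's actual manifest):

```toml
# Hypothetical manifest entry: with required-features, `cargo build --example hackernews`
# fails with a clear message unless the tokio feature is enabled, instead of
# producing compile errors about undeclared types.
[[example]]
name = "hackernews"
required-features = ["tokio"]
```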
I tried to build a scraper in Rust just a few days ago and got stuck trying to limit the concurrency of my calls (the website I was scraping, appropriately, had a rate limit). I couldn't figure out how to get tokio::stream/tokio_stream to work. Does this fix that problem?
How do you set the concurrency limit? I'm down to use your framework (thanks!), just curious how you implement it. I couldn't get Stream::buffered to work correctly.
You can use `futures::StreamExt::buffer_unordered` for this. [1] is an example where I used it for a benchmark that creates a certain number of QUIC connections at a time.
But you can also spawn all tasks upfront via `tokio::spawn`, and let them wait on a `tokio::sync::Semaphore` before making the request. The drawback of this is that you might allocate more memory for tasks upfront - but if you don't have an extremely high number it might not matter.