Hacker News
Show HN: Voyager – write a web crawler/scraper as a state machine in Rust (github.com/mattsse)
110 points by matsche on Dec 30, 2020 | 11 comments



Cool library, I might use this for a side project of mine that also parses HTML data from Hacker News.

What are your impressions of scraper and html5ever? When I initially looked at HTML/XML parsing libraries for Rust, there didn't seem to be a standout library such as serde_json for JSON data. I was also considering using scraper + html5ever. However, I'm curious if scraper adds enough to warrant the additional dependency as opposed to directly using html5ever.


I haven't used scraper much. I personally find the predicate approach of select.rs [0] easier to use, but in this case the selector approach just made more sense. Standalone html5ever can be a bit cumbersome to work with directly. scraper is basically an implementation of html5ever's `TreeSink` trait, whereas select.rs uses html5ever's `RcDom` to parse the document but stores it in a more convenient way. If you're looking for a minimal approach, you should look at select.rs, which basically only depends on html5ever.

[0] https://github.com/utkarshkukreti/select.rs


Hey, just FYI: when you run the hackernews and explore examples without enabling the tokio feature flag, you get compilation errors about undeclared types for all the tokio stuff. I think these examples just need entries in the Cargo.toml requiring the tokio feature, like the reddit example.

Once I add the tokio feature, they all run as expected.


Thanks, you're right. I fixed it just now by also requiring the tokio feature for all the tests in the Cargo.toml.


I tried to build a scraper in Rust just a few days ago and got stuck trying to limit the concurrency of my calls (the website I was scraping, appropriately, had a rate limit). I couldn't figure out how to get tokio::stream/tokio_stream to work. Does this fix that problem?


I assume the website has something like a "requests per second" quota, in which case you'd want the `governor` crate [0].

It was recently used to implement rate-limiting middleware for both tide [1] and actix [2].

[0] https://docs.rs/governor/0.3.1/governor/_guide/index.html

[1] https://github.com/ohmree/tide-governor

[2] https://github.com/AaronErhardt/actix-governor


Yes, you can either limit how many requests can be sent concurrently or enforce a fixed or random delay between consecutive requests.

Basically this is just a futures_timer::Delay [0] that is reset after each request, which is non-blocking.

[0] https://docs.rs/futures-timer/3.0.2/futures_timer/


How do you set the concurrency limit? I'm down to use your framework (thanks!), just curious how you implement it. I couldn't get Stream::buffered to work correctly.


You can use `futures::StreamExt::buffer_unordered` for this. [1] is an example where I used it for a benchmark that creates a certain number of QUIC connections at a time.

But you can also spawn all tasks upfront via `tokio::spawn` and have them wait on a `tokio::sync::Semaphore` before making the request. The drawback is that you might allocate more memory for the tasks upfront, but unless you have an extremely high number of tasks it might not matter.

[1] https://github.com/quinn-rs/quinn/blob/de627437bc7d836564c36...


This is a late comment, but this helped me out a lot. Thank you for the detailed explanation and code example, I appreciate it.


You can limit the number of concurrent calls using a semaphore, acquiring it before making an HTTP request.



