
Scrapinghub/Scrapyrt: Scrapy Realtime - r_singh
https://github.com/scrapinghub/scrapyrt
======
lpellis
I'm using scrapinghub extensively for
[https://pagewatch.dev](https://pagewatch.dev). Is this project something you
can use as a self-hosted replacement? It's not very clear how it works. What
does the "Realtime" mean?

~~~
bdcravens
Looks like it just puts an API in place. I think they are taking creative
license with the term "realtime" (since I presume all scraping actions are
queued and async).

~~~
mdaniel
_and async_

I don't get that impression from the description of the response to a `POST`:
[https://scrapyrt.readthedocs.io/en/0.11.0/api.html#success-r...](https://scrapyrt.readthedocs.io/en/0.11.0/api.html#success-response).
Since it does not return a job-id that requires polling, it appears to block
until your scrape request is completed or it times out.

Which component does the blocking is likely an implementation detail. Either
the _scrapyrt_ component is the one which blocks, but otherwise uses queuing
and asynchronous invocations when interacting with Scrapy, or, as the custom
CrawlManager
([https://scrapyrt.readthedocs.io/en/0.11.0/api.html#crawl-man...](https://scrapyrt.readthedocs.io/en/0.11.0/api.html#crawl-manager))
implies, scrapyrt actually takes over and makes the entire CrawlManager ->
Scheduler -> Spider call stack synchronous, and is thus able to respond to a
POST with the actual Items within
[https://scrapyrt.readthedocs.io/en/0.11.0/api.html#timeout-l...](https://scrapyrt.readthedocs.io/en/0.11.0/api.html#timeout-limit)
seconds.
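
From the client's side, the blocking behaviour described above would look roughly like the sketch below. It is only an illustration, not taken from the project: the `localhost:9080` address, the `crawl.json` path, the `spider_name` / `request` body fields and the `items` key in the reply are my reading of the linked docs and should be checked against them, and the spider name `myspider` and both helper functions are hypothetical.

```python
import json
import urllib.request

# Assumed address of a locally running scrapyrt instance (verify against the docs).
SCRAPYRT_URL = "http://localhost:9080/crawl.json"


def build_crawl_payload(spider_name, url):
    """Build the JSON body for the POST (field names assumed from the API docs)."""
    return {"spider_name": spider_name, "request": {"url": url}}


def crawl_blocking(spider_name, url, endpoint=SCRAPYRT_URL, timeout=60):
    """POST a crawl request and wait for the reply on the same connection."""
    body = json.dumps(build_crawl_payload(spider_name, url)).encode("utf-8")
    req = urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # No job id and no polling loop: urlopen returns only once the server
    # answers, or raises if the client-side timeout expires first.
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # "myspider" is a placeholder; it must name a spider in your Scrapy project.
    result = crawl_blocking("myspider", "http://example.com")
    print(result.get("status"), len(result.get("items", [])))
```

If the server really were queued and async, the reply would presumably carry some kind of job identifier to poll instead of the scraped items themselves.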

