
Show HN: Transistor, a Python web scraping framework for intelligent use cases - bobjordan
https://github.com/bomquote/transistor
======
lapnitnelav
Looks interesting, but I'm struggling to see (at a quick glance) what makes it
unique / better than the alternatives out there.

~~~
bobjordan
Compared to a mature framework like Scrapy, Transistor is a lot lighter and
its entire codebase is easier to grok, while having less magic. I
wanted a scraping framework with useful classes/abstractions which I could
subclass/override, customize to my specific needs, and then run tightly
integrated with our gevent-based Flask web app.

Bottom line is, Scrapy's codebase is so big, and it also runs on Twisted,
which I'm not familiar with. So I kind of threw my hands in the air on that
integration and instead decided to take a few weeks to write my own framework
with only what I needed, while also using gevent. I learned a lot, and overall
it was a great exercise that will serve us well for a long time.

The Transistor module itself has ~6,000 LOC, including full support for the
Splash headless browser/JavaScript rendering service and the Crawlera service,
while the base Scrapy framework repo alone has ~30,000 LOC, with further
middleware repos required to integrate Splash/Crawlera.

That said, the current Transistor implementation doesn't really compare with
Scrapy as a crawler, in that Transistor is like a surgical knife to get the
specific data you are after, while Scrapy is better suited to cataloging and
following all the links.

Where Transistor shines right now is spinning up a few hundred workers, each
with a scrape task, with each task being a term (like a part-number) which is
searched on a website. Transistor gets the job done well in this case.
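
To illustrate the one-task-per-term pattern, here is a rough sketch. It is not
Transistor's actual API: `scrape_term` and `run_scrape_jobs` are hypothetical
names, the task body is a placeholder that does no network I/O, and a
standard-library thread pool stands in for gevent greenlets so the example
runs without extra dependencies.

```python
from concurrent.futures import ThreadPoolExecutor


def scrape_term(term):
    """Hypothetical task: search one term and return the extracted fields.

    A real task would request the site's search page for `term` and parse
    the response; this placeholder only returns a dict so the sketch runs
    offline.
    """
    return {"term": term, "status": "ok"}


def run_scrape_jobs(terms, max_workers=200):
    """Run one scrape task per term on a bounded pool of workers."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so results line up with `terms`.
        return list(pool.map(scrape_term, terms))


# Hypothetical part numbers, one worker task each.
results = run_scrape_jobs(["PN-1001", "PN-1002", "PN-1003"])
```

With gevent, the same shape would presumably use a greenlet pool in place of
the thread pool, with each worker blocking only on its own network request.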

~~~
lapnitnelav
Hey thanks for the reply.

I've done a few things with Scrapy but never really poked under the hood, so
I'll take your word for it.

If I get you right, it's more targeted towards precise extraction than website
crawling?

Last point: why the tight integration with an app? Monolith approach?

