
Show HN: Sukhoi – A flexible and extensible Webcrawler in Python - iogf
https://github.com/iogf/sukhoi
======
gear54rus
Name seems to reference a prominent Russian aerospace engineer or maybe that's
just wishful thinking.
[https://en.wikipedia.org/wiki/Pavel_Sukhoi](https://en.wikipedia.org/wiki/Pavel_Sukhoi)

~~~
doubleplusgood
Also it literally means "dry".

------
dguo
Interesting timing! I just started using Scrapy today for a project, and I'm
trying to figure out how to elegantly piece together information from
different sources. I'm glad to see that that problem is the focus of your
README example.

------
vosper
How useful are scrapers that don't execute JavaScript these days? I find
Selenium + PhantomJS (now Chrome Headless, I guess) is pretty easy to drive
from Python, and it works everywhere because it's a real browser.

~~~
dhruvkar
Phantom still gets blocked, as it reveals itself in the header.

I've had success with a headless Chrome instance in a virtual display (xvfb)
driven with Selenium, backed by Postgres. It's as close as you can get to
scripting a real browser.

~~~
rectangletangle
You can set the user-agent with Phantom:

    var webPage = require('webpage');
    var page = webPage.create();
    page.settings.userAgent = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36';

[http://phantomjs.org/api/webpage/property/settings.html](http://phantomjs.org/api/webpage/property/settings.html)

~~~
dhruvkar
While that's true, user-agent isn't the only thing in the header that reveals
PhantomJS[0].

You could take the time to build in spoofs for these issues. But for testing
(and scraping), you're going to be better off if your headless browser is the
same as your GUI browser.

0: [https://blog.shapesecurity.com/2015/01/22/detecting-phantomjs-based-visitors/](https://blog.shapesecurity.com/2015/01/22/detecting-phantomjs-based-visitors/)

------
tedmiston
Pretty cool project. It looks more enjoyable to use than BeautifulSoup.

How does it approach throttling or rate limiting? I didn't see this mentioned
in the readme examples. Would be nice if there were some simple config to kick
requests back into a queue to be re-run once limits aren't exhausted.
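Sukhoi doesn't document a rate limiter yet, but the "kick requests back into a queue" idea can be sketched in plain Python. This is a hypothetical illustration, not sukhoi's API: a token bucket gates each request, and anything over the limit is pushed back onto the queue to be retried.

```python
import time
from collections import deque

class TokenBucket:
    """Refills `rate` tokens per second up to `capacity` (burst size)."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Add the tokens accrued since the last check, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

def drain(urls, bucket, fetch):
    """Fetch every URL, re-queueing any that exceed the rate limit."""
    pending = deque(urls)
    results = []
    while pending:
        url = pending.popleft()
        if bucket.allow():
            results.append(fetch(url))
        else:
            pending.append(url)          # kick it back into the queue
            time.sleep(1.0 / bucket.rate)  # wait for the bucket to refill
    return results
```

Real crawlers would add per-host buckets and a retry cap, but the queue-and-defer shape is the same.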

Minimal support for caching / ETag / etc would be a nice addition.

~~~
iogf
Throttling can be set directly from the untwisted reactor (I'm planning to
implement that soon, once I get untwisted on py3). I think support for caching
would be really good too; I plan to implement it this week.

~~~
tedmiston
Awesome. It looks like you're reusing your own dependencies which is cool. Can
you explain how untwisted relates to twisted a little more? I read the repo
readme, but not sure I'm following.

~~~
iogf
Untwisted is meant to solve all the problems twisted solves, but it does so in
quite a different way: they are two different tools that solve the same
problems using different approaches, and untwisted doesn't share code or
architecture with twisted. In untwisted, sockets are abstracted as event
machines; they are sort of "super sockets" that can dispatch events. You map
handles onto Spin instances, the handles are bound to events, and when those
events occur your handles get called. The handles can in turn spawn events
inside the Spin instances, so you can better abstract all kinds of internet
protocols, consequently achieving a better level of modularity and
extensibility. That is one of the reasons sukhoi's code is fairly short: it
comes from the underlying framework it was written on.
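The handles-mapped-to-events pattern described above can be shown with a toy dispatcher. This is not untwisted's actual API, just a minimal sketch of the idea: handlers are registered against named events, and a handler may spawn further events on the same machine.

```python
from collections import defaultdict

class EventMachine:
    """Toy event machine: maps event names to handler callables."""

    def __init__(self):
        self.handlers = defaultdict(list)

    def add_map(self, event, handler):
        self.handlers[event].append(handler)

    def spawn(self, event, *args):
        # Each handler receives the machine so it can spawn more events.
        for handler in self.handlers[event]:
            handler(self, *args)

log = []
spin = EventMachine()
# A raw READ event is refined into a higher-level LINE_FOUND event,
# which is how protocol layers can be stacked on one "super socket".
spin.add_map("READ", lambda m, data: m.spawn("LINE_FOUND", data.strip()))
spin.add_map("LINE_FOUND", lambda m, line: log.append(line))
spin.spawn("READ", "hello\n")
```

The layering is the point: each protocol level only listens for the events the level below it spawns.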

~~~
sandGorgon
I'm not able to figure out the dependencies. Is this pure Python, or are you
using one of gevent, libev, uvloop, etc.?

Since it is py2, I suppose asyncio is out of the picture.

~~~
iogf
untwisted is pure Python. It uses select or epoll for multiplexing sockets.
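That select/epoll approach can be demonstrated with the stdlib alone. The sketch below (my illustration, not untwisted's code) uses the `selectors` module, which picks epoll on Linux and falls back to select elsewhere, to multiplex a socket on one thread.

```python
import selectors
import socket

def echo_once():
    """Wait for readiness on one end of a socket pair, then read."""
    sel = selectors.DefaultSelector()  # epoll on Linux, else select/kqueue
    a, b = socket.socketpair()
    a.setblocking(False)
    b.setblocking(False)
    sel.register(b, selectors.EVENT_READ)

    a.send(b"ping")
    data = b""
    # select() blocks until a registered socket is readable.
    for key, _mask in sel.select(timeout=1):
        data = key.fileobj.recv(4096)

    sel.close()
    a.close()
    b.close()
    return data
```

An event loop like untwisted's just runs this readiness check in a loop and dispatches each ready socket to its handlers.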

~~~
sandGorgon
This is very interesting. Did you consider using libev/uvloop, which are
generally considered battle-tested async frameworks?

Is there anything missing that prompted you to reimplement?

~~~
iogf
It seems a good thing to do, indeed. I'll consider that.

------
ldng
Depending on your needs, sometimes it might be more interesting to start from
there:

[https://about.commonsearch.org/](https://about.commonsearch.org/)

and then scrape whatever is missing or not fresh enough. The scraping process
can be quite intense on servers.

------
pryelluw
Is this Python 3 compatible? I searched, but the wiki is empty and the README
has examples in Python 2.

~~~
iogf
It is py2 now; however, I'm going to port it to py3 soon. I'm planning to
write some better docs for it tomorrow.

~~~
pryelluw
That's great. I'll check it out. Thanks!

------
monksy
How does this differ from Scrapy?

~~~
iogf
Try to imagine how to solve the second example in sukhoi's README.md using
scrapy: you'll notice you end up with some kind of obscure logic to achieve
the JSON structure that's output by that example.

~~~
bbernoulli
FWIW, I don't believe this would be overly convoluted in scrapy. I'd probably
scrape the tags and quotes in one pass...

Also, generator expressions would make the examples more readable, IMO.

    self.extend((tag, QuoteMiner(self.geturl(href)))
                for tag, href in self.acc)

~~~
iogf
I would like to see that in scrapy. I think you may have a point about the
generators, yeah.

