Show HN: Sukhoi – A flexible and extensible Webcrawler in Python (github.com)
131 points by iogf 98 days ago | 34 comments

Name seems to reference a prominent Russian aerospace engineer or maybe that's just wishful thinking. https://en.wikipedia.org/wiki/Pavel_Sukhoi

Also it literally means "dry".

Interesting timing! I just started using Scrapy today for a project, and I'm trying to figure out how to elegantly piece together information from different sources. I'm glad to see that that problem is the focus of your README example.

How useful are scrapers that don't execute Javascript these days? I find Selenium + PhantomJS (now Chrome Headless I guess) is pretty easy to drive from Python, and it works everywhere because it's a real browser.

Phantom still gets blocked, as it reveals itself in the header.

I've had success with a headless Chrome instance in a virtual display (xvfb) driven with Selenium, backed by Postgres. It's as close as you can get to scripting a real browser.

You can set the user-agent with Phantom

    var webPage = require('webpage');
    var page = webPage.create();
    page.settings.userAgent = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36';
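
For scrapers that skip the browser entirely, the same idea can be sketched with Python's standard library. This is only an illustration: the user-agent string and the `build_request` helper are made up for the example, and nothing is sent over the network.

```python
import urllib.request

# Hypothetical user-agent string, for illustration only.
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) ExampleScraper/0.1"


def build_request(url, user_agent=USER_AGENT):
    """Return a Request carrying a custom User-Agent header (no network I/O)."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


req = build_request("https://example.com/")
# urllib stores header names via str.capitalize(), hence "User-agent" here.
print(req.get_header("User-agent"))
```

As with Phantom, this changes only one header; it does nothing about the other signals a site may check.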

While that's true, user-agent isn't the only thing in the header that reveals PhantomJS[0].

You could take the time to build in spoofs for these issues. But for testing (and scraping), you're going to be better off if your headless browser is the same as your GUI browser.

0: https://blog.shapesecurity.com/2015/01/22/detecting-phantomj...

I think I read recently that Chrome / Chromium is now able to run without having to use xvfb, so now truly headless.

We scrape a significant amount of highly structured data from a large number of websites (in our case, inventory from ecommerce sites), and have yet to find a site where we needed to use a headless browser. So far we've managed with just lxml. That includes driving some checkout processes as well (although we don't do this for everyone).

We used to use Selenium + Firefox for the checkout processes (run manually), but it was too much maintenance overhead so we switched to requests+lxml.

We generally find the more "single page app" a website is, the easier it is to scrape, because we can just use the API that's backing the SPA directly, rather than parsing data out of the HTML.

In my experience when the page is rendered with javascript, there is often a json "API", which is even easier to use. Web browsers are often too slow and load content I am not interested in.
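
A rough sketch of what reading such a JSON "API" directly can look like, using only the standard library. The payload shape (`items`, `name`, `price`) and both helper names are assumptions invented for this example; a real endpoint has to be found in the browser's network tab and will look different.

```python
import json
import urllib.request


def fetch_json(url):
    """Fetch a URL and decode its JSON body (performs network I/O)."""
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read().decode("utf-8"))


def extract_items(payload):
    """Pull (name, price) pairs out of a decoded payload.

    The {"items": [{"name": ..., "price": ...}]} shape is made up for
    this sketch; adapt the keys to whatever the real endpoint returns.
    """
    return [(item["name"], item["price"]) for item in payload.get("items", [])]


# Offline demonstration with a canned response body instead of a live request.
sample = '{"items": [{"name": "widget", "price": 9.5}, {"name": "gadget", "price": 12}]}'
print(extract_items(json.loads(sample)))
```

No HTML parsing is involved, which is why this tends to be less fragile than scraping the rendered page.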

I've done some scraping recently and 99% of sites don't have the json api

These are slower in most cases; it seems that for some situations the ones that don't execute JS would do the job better.

They're still surprisingly useful, however it depends a lot on your use case. In my case I've scraped quite a bit from Wikipedia (not everything is available in a clean API) and other sources this way.

Instead of scraping Wikipedia have you had a look at http://wiki.dbpedia.org/ ? They provide a SPARQL endpoint for querying the knowledge graph on Wikipedia.

Depending on your use case you can also download full database dumps (https://dumps.wikimedia.org/).

Pretty cool project. It looks more enjoyable to use than BeautifulSoup.

How does it approach throttling or rate limiting? I didn't see this mentioned in the readme examples. Would be nice if there were some simple config to kick requests back into a queue to be re-run once limits aren't exhausted.
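
To make the request concrete, here is a minimal sketch of that kind of queue in plain Python. It is not sukhoi's or untwisted's API; `ThrottledQueue` and its parameters are hypothetical names. URLs over the limit simply stay queued until the window has room again.

```python
import time
from collections import deque


class ThrottledQueue:
    """Release at most `limit` URLs per `window` seconds; the rest stay queued."""

    def __init__(self, limit, window, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock      # injectable so the sketch can be tested offline
        self.pending = deque()  # URLs waiting for a free slot
        self.sent = deque()     # timestamps of recently released URLs

    def put(self, url):
        self.pending.append(url)

    def ready(self):
        """Return the URLs that may be fetched right now."""
        now = self.clock()
        # Forget release timestamps that have left the window.
        while self.sent and now - self.sent[0] >= self.window:
            self.sent.popleft()
        released = []
        while self.pending and len(self.sent) < self.limit:
            released.append(self.pending.popleft())
            self.sent.append(now)
        return released


# Demonstration with a fake clock so no real waiting is needed.
fake_now = [0.0]
q = ThrottledQueue(limit=2, window=1.0, clock=lambda: fake_now[0])
for url in ("a", "b", "c"):
    q.put(url)
first = q.ready()    # two slots are free
second = q.ready()   # limit exhausted, "c" keeps waiting
fake_now[0] = 1.0
third = q.ready()    # window has passed, "c" is released
print(first, second, third)
```

A crawler loop would call `ready()` periodically and hand the returned URLs to its fetcher.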

Minimal support for caching / ETag / etc would be a nice addition.
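
For illustration, minimal ETag support could look roughly like the sketch below. `ETagCache` and the `fetch(url, headers)` callable are hypothetical and not part of sukhoi; the only real protocol pieces are the `ETag` and `If-None-Match` headers and the 304 status.

```python
class ETagCache:
    """Remember (etag, body) per URL and revalidate with If-None-Match."""

    def __init__(self, fetch):
        # `fetch(url, headers)` must return (status, response_headers, body).
        self.fetch = fetch
        self.entries = {}

    def get(self, url):
        headers = {}
        cached = self.entries.get(url)
        if cached:
            headers["If-None-Match"] = cached[0]
        status, resp_headers, body = self.fetch(url, headers)
        if status == 304 and cached:
            return cached[1]  # server says our copy is still fresh
        etag = resp_headers.get("ETag")
        if etag:
            self.entries[url] = (etag, body)
        return body


# Fake fetcher standing in for a real HTTP client.
calls = []


def fake_fetch(url, headers):
    calls.append(dict(headers))
    if headers.get("If-None-Match") == '"v1"':
        return 304, {}, b""
    return 200, {"ETag": '"v1"'}, b"hello"


cache = ETagCache(fake_fetch)
first = cache.get("http://example.com/")
second = cache.get("http://example.com/")
print(first, second, calls)
```

The second call sends the stored tag and reuses the cached body, so an unchanged page costs only a header exchange.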

Throttling can be set directly from the untwisted reactor (I'm planning to implement it soon, once I get untwisted on py3). I think support for caching is a really good idea too; I plan to implement it this week.

Awesome. It looks like you're reusing your own dependencies which is cool. Can you explain how untwisted relates to twisted a little more? I read the repo readme, but not sure I'm following.

Untwisted is meant to solve all the problems twisted solves, but it does so in quite a different way. They are two different tools that solve the same problems using different approaches. Untwisted shares neither code nor architecture with twisted. In untwisted, sockets are abstracted as event machines; they are sort of "super sockets" that can dispatch events. You map handles to Spin instances; these handles are mapped to events, and when those events occur your handles get called. The handles can spawn events inside the Spin instances; in this way you can better abstract all kinds of internet protocols, consequently achieving a better level of modularity and extensibility. That is one of the reasons sukhoi's code is fairly short: it is due to the underlying framework on which it was written.
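
A toy illustration of the "handles mapped to events" idea, in plain Python. This is not untwisted's API: `EventMachine`, `add_map` and `drive` are hypothetical names chosen for the sketch, and no sockets are involved.

```python
from collections import defaultdict


class EventMachine:
    """Toy stand-in for the "super socket" idea: an object that dispatches events."""

    def __init__(self):
        self.handles = defaultdict(list)

    def add_map(self, event, handle):
        """Map a handle (callback) to an event."""
        self.handles[event].append(handle)

    def drive(self, event, *args):
        """Fire an event: call every handle mapped to it."""
        for handle in list(self.handles[event]):
            handle(self, *args)


log = []


def split_lines(machine, data):
    # A handle may spawn higher-level events on the same machine.
    for line in data.split("\n"):
        machine.drive("LINE", line)


def record(machine, line):
    log.append(line)


m = EventMachine()
m.add_map("DATA", split_lines)
m.add_map("LINE", record)
m.drive("DATA", "GET /\nHost: example.com")
print(log)
```

Layering a "LINE" event on top of raw "DATA" is the kind of protocol stacking the description above refers to.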

I'm not able to figure out the dependencies. Is this pure Python? Or are you using one of gevent, libev, uvloop, etc.?

Since it is py2, I suppose asyncio is out of the picture.

untwisted is pure Python. It uses either select or epoll for scaling sockets.
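
For readers unfamiliar with select/epoll, here is a small sketch of readiness-based socket handling using Python 3's standard `selectors` module, which picks epoll, kqueue, poll or select depending on the platform. It does not show untwisted's internals; the `on_readable` callback is an assumption of the example.

```python
import selectors
import socket

sel = selectors.DefaultSelector()  # best available mechanism on this platform
a, b = socket.socketpair()         # connected pair, stands in for a network peer
a.setblocking(False)
b.setblocking(False)

received = []


def on_readable(sock):
    """Called when the selector reports that `sock` has data to read."""
    received.append(sock.recv(1024))


# Ask to be told when `b` becomes readable; the callback rides along as data.
sel.register(b, selectors.EVENT_READ, on_readable)
a.send(b"ping")

# One turn of the event loop: dispatch every ready socket to its callback.
for key, _mask in sel.select(timeout=1):
    key.data(key.fileobj)

sel.unregister(b)
a.close()
b.close()
sel.close()
print(received)
```

A single thread can watch many sockets this way, which is what lets a pure-Python crawler keep many connections open at once.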

This is very interesting. Did you consider using libev/uvloop, which are generally considered battle-tested async frameworks?

Is there anything missing that prompted you to reimplement?

It seems a good thing to do, indeed. I'll consider that.

> It looks more enjoyable to use than BeautifulSoup.

I don't believe BS is a full scraping solution, it's only the HTML parsing/querying isn't it? In that case, this project actually uses lxml for that part - a relatively well known alternative to BS.

I highly recommend lxml. The API isn't perfect, but in my experience it's much more powerful than BS, and significantly faster as well. We run custom scrapers for a large number of websites, and apart from a few where we use JSON feeds, the majority use lxml; it has been very useful.

Yes, BeautifulSoup is focused on parsing not crawling (BS supports the lxml parser out of the box). Scrapy is more of an opinionated scraping framework whereas BS is a parsing library for scraping. I think the choice depends on what exactly you're trying to build and scale. I like both personally, though I'd use BS for simple MVPs and Scrapy if I wanted to crawl thousands of pages.

Depending on your needs, sometimes it might be more interesting to start from there:


and then scrape whatever is missing or not fresh enough. The scraping process can be quite intense on servers.

Is this Python 3 compatible? Searched but the wiki is empty and the readme has examples in Python 2.

It is py2 now; however, I'm gonna port it to py3 soon. I'm planning to write some better docs for it tomorrow.

That's great. I'll check it out. Thanks!

How does this differ from Scrapy?

Try to imagine how to solve the second example of the sukhoi README.md using scrapy; you'll notice you end up with some kind of obscure logic to achieve the JSON structure that's output by that example.

FWIW I don't believe this would be overly convoluted in scrapy. I'd probably scrape the tags and quotes in one pass...

Also, generator expressions would make the examples more readable IMO.

  self.extend((tag, QuoteMiner(self.geturl(href))) for tag, href in self.acc)

I would like to see that in scrapy. I think you may have a point about the generators, yea.

The way you construct your JSON structures in scrapy is different, and scrapy has a longer learning curve too. It seems sukhoi gets better results in performance too.
