Python 3 comes to Scrapy (scrapinghub.com)
249 points by ddebernardy 624 days ago | 76 comments



The breakup between Python 2 and 3 has been very slow and painful. Python devs know that, and that's why they won't break compatibility in such a big and drastic way ever again.

I'm glad that we're starting to see light at the end of the tunnel. I find myself using Python 3 in most of my projects. Sometimes I still have to resort to Python 2 when some dependency is not ready, but those cases are rarer every day. It's also frustrating to use Python 2 when many cool features are now Python 3 only. It will take some more time, but I'm sure that the transition will eventually be completed.


I'm happy that we finally updated a lot of our code at work to be Python 3 compatible last year. The conversion was not that bad in the end. However, under Python 3 a number of benchmarks are slower, which we do need to look into some more. Has anyone else experienced this, specifically with scientific computing code migrating from 2 to 3?

Ultimately, we test in 2 and 3 via CI, but still primarily develop using 2.


For scientific code I've seen slowdowns, but in the end they turn out to be unrelated to Python 3 - e.g. numpy compiled against different BLAS libraries in Python 2.x and 3.x.

There was also a gotcha with replacing cStringIO with io.BytesIO: the latter copied the data even in read-only cases (when you want a file-like interface over a binary object), so it can be much slower.

To make things worse, Python 3.x didn't have a cStringIO alternative that could provide a file-like interface without overhead, and most porting guides suggest using io.BytesIO instead of cStringIO in Python 3. So libraries like tornado, django and flask are affected, and likely some scientific libraries as well.

This problem is fixed in Python 3.5: io.BytesIO is as efficient as cStringIO for read-only cases as of that release.
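As a minimal sketch of the read-only pattern being discussed (the 16 MB payload is an arbitrary example, not a benchmark from the thread):

```python
import io

# A large binary payload we only want to *read* through a file-like interface.
data = b"x" * (16 * 1024 * 1024)

# On Python < 3.5, io.BytesIO(data) eagerly copies the whole buffer here;
# since 3.5 the copy is deferred until the first write, so read-only use is cheap.
buf = io.BytesIO(data)
header = buf.read(4)
print(header)  # b'xxxx'
```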


You sound like a good person to ask. I use numpy and I haven't looked into the BLAS/LAPACK details but I understand there are different ways of setting up numpy for optimal performance. Any advice on the best way to approach this?

At the moment I more or less:

    apt-get install liblapack-dev libopenblas-dev
    pip install numpy


Honestly, it is very dependent on exactly what you are doing. For general purpose computing on a general purpose machine you're not going to do much better on average than an up-to-date OpenBLAS. In some cases, especially the parallel case, Intel's MKL BLAS is slightly faster (but in some cases it is also slower).

There is also scikit.cuda which wraps Nvidia's cuBLAS and which can be very fast in certain cases, but isn't in any way a drop-in replacement for openblas.

Then there's NumbaPro (a commercial product) from Continuum Analytics, which is an LLVM-backed JIT that attempts to automatically speed up your numpy code and can automatically make your code use cuBLAS where it makes sense to do so.
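As an aside, one quick way to check which BLAS/LAPACK a given numpy build is actually linked against (a hypothetical check, assuming numpy is importable; the output format varies between numpy versions):

```python
# numpy may not be installed; guard the import so the check degrades gracefully.
try:
    import numpy as np
    # Lists the BLAS/LAPACK libraries this numpy build was compiled against
    # (e.g. openblas vs MKL), which is what actually drives the benchmarks.
    np.__config__.show()
except ImportError:
    print("numpy is not installed in this environment")
```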


Ok. Think I'll leave well enough alone then!


A data point: `scrapy bench` is about 1.5x - 2x slower on Python 3.5 than on Python 2.7. This command starts a simple HTTP server which serves HTML pages linked to arbitrary depth, and crawls it using Scrapy. Profiling shows URL parsing is much slower in Python 3 for some reason; I'm not sure why and haven't dug further.

Python 3 code also has to encode/decode strings more often, but the overhead was not significant for `scrapy bench`.
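A rough stdlib-only sketch for timing URL parsing yourself (the URL and iteration count are illustrative, not taken from `scrapy bench`):

```python
import timeit

# Portable import: urllib.parse on Python 3, urlparse on Python 2.
setup = ("try:\n"
         "    from urllib.parse import urlparse\n"
         "except ImportError:\n"
         "    from urlparse import urlparse\n")
stmt = "urlparse('http://example.com/some/path?q=scrapy#frag')"

# Run the same snippet under 2.7 and 3.5 and compare wall-clock times.
elapsed = timeit.timeit(stmt, setup=setup, number=100000)
print("100k urlparse calls: %.3fs" % elapsed)
```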


I think you should search the python.org bug tracker for the function names that seem to exhibit a performance regression according to your profiling session. If you don't find any existing issue, please open a new one with a minimal reproduction script that demonstrates the regression.


Yeah, totally agree; I did it for the io.BytesIO vs cStringIO performance issue (https://news.ycombinator.com/item?id=11037213), and the Python core devs fixed it very fast even though the fix was far from trivial. You have to wait a year or so for the next release to use the fix, but that's expected.

For URL parsing it was just a quick profiling session. Also, stdlib URL parsing was a bottleneck even in Python 2. It takes time to benchmark it properly and submit a meaningful issue to the bug tracker or explain the problem on python-dev. It is on my todo list, but you only have so much time :)


"The breakup between Python 2 and 3 has been very slow and painful. Python devs know that, and that's why they won't break compatibility in such a big and drastic way ever again."

Fork, don't break, cf Pillow (PIL fork) ~ http://python-pillow.github.io/

What are the most commonly used Py2 packages that need to be Py3?


> What are the most commonly used Py2 packages that need to be Py3?

https://python3wos.appspot.com http://py3readiness.org


For most of the ones that aren't compatible I've found an alternative or monkey patched a method to support python 3 (I'm looking at supervisor -> circus and flask-session -> monkey patch because the author seems to have abandoned it).


For me it's supervisor and beanstalkc and fabric but you can run your fabric scripts with Python2 easily enough.


PaiMei was a pretty good toolkit, but it doesn't even work with anything after 2.4, at least not out of the box.

So maybe that's not even an example of the 2/3 split, it's just old.


Many congrats on the release & also thanks to the Scrapy team for the effort involved.

As far as I am concerned, this was the last package I used heavily that still had not made the upgrade.

For the Python community: which packages are you still waiting for/working on?


Still waiting on Ansible to make the jump. As far as I can tell it's not even on their roadmap, which is a bit crazy given that _almost_ everyone else has solved this by now.

This may not be a blocker for the rest of our stuff though, since ansible can install its own python 2 on servers automatically.


I've been waiting for Fabric to make the migration to Python 3. Fortunately I no longer use Python that much (I jumped on the hype train to Go and I've been very satisfied with it) so I no longer need Fabric nowadays.


Mechanize would be nice, but it's practically abandonware at this point.


same


A wrapper for the exiv2 library (for editing metadata of picture files) that works on Windows


Esri's ArcPy module and ArcGIS Desktop's "built-in" python environment. :(

ArcGIS Pro is using 3.4, so the wrappers for ArcObjects are updated, but who knows if they will put that work back into ArcGIS Desktop...


This is great news for Python 3. We're almost there! -> http://py3readiness.org


Hm, why is opencv not there? I was playing around with it last week, and it seems it works well for only py2 right now (from what I could gather). OpenCV should be there in that list, a lot of people really need it.


OpenCV 3 supports Python 3[1], but it's not distributed through PyPI.

[1] http://opencv.org/opencv-3-0.html


My guess: because OpenCV isn't installable through pip, its download statistics on PyPI end up artificially low.


It's only the 360 most downloaded packages, perhaps OpenCV is not one of them?


[deleted]


As the site clearly states, this is the top 360 projects in terms of downloads on PyPI


"This site shows Python 3 support for 360 most downloaded packages on PyPI"


Hahahaha. 7 years later. Python 3 is almost there!


Your response to lots of people's programming effort is "Hahahaha"? Get lost.


You have grossly misunderstood. That "Hahahaha" is directed at Guido (creator of python and self-declared benevolent dictator for life) and his short-sighted arrogant decision to break backward compatibility moving from Python 2 to 3. He recognized it as a mistake but out of some stubborn pride resolved not to fix it. Ever. So damn skippy my response to him is "Hahahahaha". The programmers that have put up with his poor decision making obviously deserve respect. I am sad that you assumed it was directed at the programmers having to deal with Guido.

Fortunately there is a simple solution -- fork Python 2.7 to make the Python 2.8 that Guido refuses to make. A python version that doesn't have compatibility issues with Guido's playground programming. This version would not have the poor performance of Python 3. This version is essentially already made with PyPy. I just wish everyone would have told Guido to get lost 7 years ago and switched to a development group with sane deprecation of features. If we had then all of the programmer effort that was poured into making their code Guido approved would have been saved. Can you imagine how much better Python development would have been without a 7 year lag that included BREAKING the code base of everyone who had previously programmed in Python 2.x?

Guido is the self proclaimed benevolent dictator for life. However, that doesn't describe a leader who breaks the programming efforts of every developer using the language prior to Py3k. He then has the stubborn arrogance to deny a transition version and instead requires 7 years of programming effort. He certainly deserves a "Hahahahaha" that it took 7 years to get "almost there". At least he is not still shocked that py3k wasn't greeted with open arms. Hopefully before 2020 he will swallow his pride and agree to a transition version -- a 4.0 which works with 2.7 or a 2.8 that works with 3.x. If not, a fork will happen.

Hopefully that clarifies the previous comment, and congratulations to the Scrapy developers for powering through and putting up with a language that was purposely broken by the language developers.


"simple solution" - simple to propose, certainly. Many have conjectured about a third-party (non-core developers) who might take on a 2.8. Any takers? No.

"self proclaimed benevolent dictator for life" - As a bit of minutia, it was Ken Manheimer who proclaimed him thus. This is a minor point, of course.

"a 4.0 which works with 2.7 or a 2.8 that works with 3.x. If not, a fork will happen." - I look forward to a solution to enable chained exceptions and asyncio to work under a 2.x environment with a __future__ flag. I wonder who will pay for the hard work.


Cython works with 2.7 and 3.x, and supports both chained exceptions and asyncio.


Thanks for expressing the unpopular opinion


I'll also add that Julia has some of the killer features that py3 should... Metaprogramming in particular.


I agree he's not being a great sport, but I sort of see where he's coming from. Having worked with python in the last few months for computer vision problems, I've been really frustrated with how things are. I was expecting better. At this point I don't use python for my work anymore, I use matlab. It's superbly well-documented, and things just /work/. And of course, nothing like the mess you see with this py2/py3 problem is to be seen anywhere near matlab. Yes, it's not free, but I've decided that the amount of time I had to spend, the headaches I had to endure, they were not worth the free price of python (and matlab for its great support was in fact worth its price). ymmv.


This is mostly true until you need new data that is not a .mat or image, or to deploy to a server for production (licensing Matlab for multi-core servers is pretty rough), or deal with strings, or send stuff over the network, or one of the other thousand things the Python ecosystem does better. For algorithmic experiments, Matlab has an edge but the "with batteries" / overall utility approach has always been the primary strength of Python for science.


In my experience Matlab compatibility problems are much more common, because many Matlab users don’t update to the latest version (this costs money), and changes to common packages which work on the latest version tend to break on older Matlab versions. This is exacerbated by a lot of research prototype quality code in the Matlab ecosystem; grad students don’t necessarily have the bandwidth to worry about regression tests or compatibility workarounds.

More generally: (a) Most Matlab packages have much messier APIs than the equivalent python packages, and anything complicated written in “object oriented Matlab” tends to have much weirder failure modes in edge cases than Python, because the basic data types and control flow mechanisms are much less robust and much less capable than Python/Ruby/Lua/Javascript/Lisp/etc. equivalents. (b) Python has much broader library support for doing anything that isn’t pure number crunching, which makes it a much more flexible tool overall.

If you can do your work sticking to only the packages directly provided by Mathworks, and you are comparing those to third-party Python libraries for equivalent functionality, then you might be right though.


What?

Py2/3 shouldn't impact CV code at all. The major issue should be installation, which is generally a solved problem. OpenCV and the scipy stack fully support python3, and are some of the best-documented libraries I've ever encountered (better than the bit of matlab I've used for robotics, at least).

I'm curious as to what problems you encountered?


I want to like Python 3 but the print statement change really kills it for me. Years later it's still frustrating on the occasions I write a python script. If I ran a project like one of those, I might not use Python just for this reason.


This is the most trivial of all the changes introduced in Python3.

If you'd said something about string encodings, you might have got sympathy from someone...


Really? Really! Because every other output method was a function, the statement is annoying as all hell.

Swapping between print and fh.write(), log.error(), stringio.write(), sys.stderr.write(), mycustom_thing_that_suppresses_print_when_quiet_arg_supplied() in python 2.x is annoying as all fuck, and a simple edit with python 3.x.
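To illustrate the point: once print is a function, it can be swapped for any other callable. A small sketch, where `make_writer` is a made-up helper rather than anything from the thread:

```python
from __future__ import print_function  # no-op on Python 3, enables print() on 2.x
import sys

def make_writer(quiet=False, stream=sys.stderr):
    # Because print is a plain function, it can be replaced wholesale by any
    # other callable with the same shape -- no syntax change at the call sites.
    if quiet:
        return lambda *args, **kwargs: None  # suppress all output
    return lambda *args, **kwargs: print(*args, file=stream, **kwargs)

log = make_writer(quiet=False)
log("retrying request")          # goes to stderr
log = make_writer(quiet=True)
log("this line is suppressed")   # no-op
```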


Look up 'q' on pypi. Thank me later. :)


I resolved to use print debugging less and pdb more. This makes that even harder, thanks a lot. :)


q.d() ;)


I really don't see how the print statement can be that big a deal. After a week at most you're used to it, and `print(x)` is valid syntax on Python 2 (so you don't have to worry about which way to write it).


I write Python 2 code, and just always use `print("...")` in both. It makes a lot more sense, matches other languages' syntax, and is compatible with both. I honestly don't see how it is in any way worse than `print "...."`.


Agreed. Still on Python 2.7 but at some point last year I just decided I'd always:

  from __future__ import print_function
when I needed to print, and it's just habit now.


Been using most of __future__ for years to make the transition seamless for me. This is what Py 2.x devs should be doing now. It's not like the path forward hasn't been paved for us for many years.
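For reference, a typical compatibility preamble along those lines might look like:

```python
# All four are no-ops on Python 3; on Python 2.7 they opt in to Python 3 semantics.
from __future__ import absolute_import, division, print_function, unicode_literals

print("division:", 3 / 2)            # 1.5 on both 2.7 and 3.x (true division)
print("literal type:", type("text"))  # unicode on 2.7, str on 3.x
```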


I empathize. Makes me want a parenthesis-optional language, though that has its own headaches.

I find switching to log() as an output function helps psychologically - it's shorter to type than print, and it's something I'm more likely to keep in production.


It is a major change to ask, but I am wondering if they would consider switching to asyncio instead of using Twisted. Twisted is a great library, but it is a huge dependency to maintain.


We're seriously considering this option (maybe for scrapy 2.0?), but no concrete plans yet. It can't be just asyncio - we'll need e.g. aiohttp and other packages.

Some links:

* POC for aiohttp as a http handler: https://github.com/scrapy/scrapy/pull/1455

* some thoughts about how to make an async/await API for Scrapy; it is not all roses: https://github.com/scrapy/scrapy/issues/1144#issuecomment-14...


There have been some discussions and a proof-of-concept with asyncio + aiohttp: https://github.com/scrapy/scrapy/pull/1455

However, as you said, it'd be a major change and it would affect the whole ecosystem (plugins and extensions), so it's complicated. We'll see what happens. :)


Can you elaborate? What makes it a "huge dependency to maintain"? Is there anything that the Twisted project can do to make it easier? If this is actually a problem I'd really like to hear from users on the Twisted mailing list and bug tracker.


Twisted is a general purpose library/framework with lots of features. This is the "huge" part. In my previous projects I have used it a lot and appreciated it.

What I was trying to say is that if Scrapy uses only a small part of the library, it may be possible for the developers to use similar constructs from Python's standard library. In any case, a dependency is a dependency and it is always better to minimize the code footprint.


I've seen developers refuse to use Python 3 because of not being able to use Scrapy. Hopefully this gets some more devs to finally make the switch.


I've been refusing to use Scrapy because of not being able to use Python 3 :P So this is a very exciting announcement for me.


Believe me, we've been hearing this too. We're super excited as well. :-)


This is great! I wrote "Web Scraping with Python" (O'Reilly) and did everything with Python 3... except for the oddball section on Scrapy. Glad to know I can update that for the second edition!


Nice! Every time I created a new Scrapy project I kept forgetting it was not python3 and had to recreate the virtualenv. Great news!


I've been holding out for a while on moving to 3. I have tried every 3 release and always found issues or performance regressions. Like others have said here it's a concern about the performance issues, Python is already slow. It's an even harder sell when

- all major (and equally important, the long tail of minor) libraries support 2

- CPython2 has the performance advantage in most (if not all) applications

- virtually every 3rd party implementation supports it very well (PyPy in particular)

- Python2 already supported unicode, so that gets old to hear about

- Most of the new features are available as backports

- Some new features are absurd, like the new 4th string formatting method in 3.6

People shouldn't openly wonder why someone uses 2 instead of 3 if you really look at it.

I just started a new 3.5 project because while I gave 3.0-3.4 a shot, 3.5 hasn't had its runthrough yet. Most people in my shoes have more than likely moved on from Python to Go. I'd like to have this be the one and stop going back to 2.7. Admittedly patience is running on fumes after ~8 years of testing CPython3 releases.

It wasn't just a bad break, it seems like it was a sloppy break. Instead of feature bloat, I'd like to see Python3 focus on performance.


Also, Jython py3 dev lagged far behind py2.


I'm glad so many libraries support Python 3 now. I just made the switch (from 2) a couple of months ago.


Try it out via conda!

  conda install -c scrapinghub/label/dev scrapy


I mentioned some of the issues I had with scrapy ages ago on reddit: http://www.reddit.com/r/Python/comments/g112q/installing_and...

Never got a reply, but I'll reproduce it here; I wonder how much of this is still true?

" The docs were pretty good, but it was unclear sometimes how to proceed; There was a lot of structure to understand in order to get started.

When I used it, I wanted to scrape a site until certain conditions were met; when the last page scraped returned no objects. I wanted all results initially returned from a page to be dropped if they were older than a certain date; Thus I wanted Scrapy to keep scraping until no new items were found. Also, I wanted the latest date of the items returned so I could use this the next time I scrape.

I created the 'DropElderMiddleware' middleware to do this. I couldn't see any other way of making calculations based on items returned from a particular page.

I could never figure out what the difference between input and output processors was, or when I should use one or the other.

The MapCompose function flattens objects by default, so I had to be careful sometimes when returning lists that represented structure I wanted to retain.

The way the html match object worked was sometimes confusing; If I wanted to match multiple items, then match items within each of those, I wanted a list of lists (group matches together based on what matches they were found in). I can't remember the details of why I found this hard, but I can try to come up with an example if you like?

In the end I figured I was having to learn the structure of Scrapy for everything that I wanted it to do, but many of Scrapy's features I didn't need e.g. I didn't want command-line control (I would actually prefer not to use the interface, though didn't discover how I could write a python script to apply the spider directly).

Now I prefer to use mechanize + PyQuery; PyQuery is at least as good at processing web pages as Scrapy's object, and if I need something more for opening a page e.g. complicated login, I can use mechanize. I find this a more modular approach, and think that I better understand what's going on in my scripts. "


> The docs were pretty good, but it was unclear sometimes how to proceed; There was a lot of structure to understand in order to get started.

Yeah, docs used to be a problem; they improved a lot in the 1.0 and 1.1 releases though.

> When I used it, I wanted to scrape a site until certain conditions were met; when the last page scraped returned no objects. I wanted all results initially returned from a page to be dropped if they were older than a certain date; Thus I wanted Scrapy to keep scraping until no new items were found. Also, I wanted the latest date of the items returned so I could use this the next time I scrape.

The easiest way is to raise a CloseSpider exception in a callback if no new items are scraped - see http://doc.scrapy.org/en/1.0/topics/exceptions.html#closespi...

> I could never figure out what the difference between input and output processors was, or when I should use one or the other.

> The MapCompose function flattens objects by default, so I had to be careful sometimes when returning lists that represented structure I wanted to retain.

I also have trouble understanding ItemLoader details. They are totally optional though, and they are no longer in the Scrapy tutorial (http://doc.scrapy.org/en/latest/intro/tutorial.html). Item loaders provide features very similar to https://github.com/Suor/funcy or https://github.com/kachayev/fn.py.

> The way the html match object worked was sometimes confusing; If I wanted to match multiple items, then match items within each of those, I wanted a list of lists (group matches together based on what matches they were found in). I can't remember the details of why I found this hard, but I can try to come up with an example if you like?

I'm not sure what problems you had. The Scrapy selectors library (https://github.com/scrapy/parsel) is quite similar to PyQuery (esp. when CSS selectors are used), and nothing prevents you from using PyQuery with Scrapy. In the future we may add PyQuery (and BeautifulSoup?) support to parsel and provide PyQuery selectors as response.pq (like response.css and response.xpath); +1 to do that.

> In the end I figured I was having to learn the structure of Scrapy for everything that I wanted it to do, but many of Scrapy's features I didn't need e.g. I didn't want command-line control (I would actually prefer not to use the interface, though didn't discover how I could write a python script to apply the spider directly).

Yeah, the library interface used to be a problem. It was improved in the 1.0 release (there is an official API for integrating Scrapy with Twisted apps and running spiders from user scripts), but there is still more to do. See http://doc.scrapy.org/en/1.0/topics/practices.html#run-scrap....

It probably won't be as easy to integrate with regular Python scripts as mechanize, because Scrapy is async. On the other hand, Scrapy is easier to integrate with async frameworks like Twisted or Tornado.

> Now I prefer to use mechanize + PyQuery; PyQuery is at least as good at processing web pages as Scrapy's object, and if I need something more for opening a page e.g. complicated login, I can use mechanize. I find this a more modular approach, and think that I better understand what's going on in my scripts.

You may want to check the new 'Scrapy at a glance' page (http://doc.scrapy.org/en/latest/intro/overview.html). The main advantage of Scrapy over mechanize is that it handles parallel downloads and has a wide range of built-in extensions you won't have to implement yourself.


Hmm, the matching issue might have been something like wanting to do "[i.match(tag='foo') for i in body.match(tag='bar')]" and getting a list-of-lists back, but this was a long time ago :-)

Incidentally, I've since gone off pyQuery as it doesn't always keep up with jquery. I now prefer lxml or BS4..

BTW, I love ScrapingHub. I bashed out a few spiders with portia, but ultimately I'll probably start scripting instead. Do you know if portia actually generates script code? Might be easier for fast scraping to get 60% of the way with portia, then manually write the rest of the script.

One last thing - looking at this page

> http://stackoverflow.com/questions/6261714/inferring-templat...

there is mention of a "wrapper induction library"; I can't find any more mention of it though. Does the class/functionality still exist?


The wrapper induction library is separate from Scrapy: https://github.com/scrapy/scrapely. It is used in Portia under the hood. Portia can be seen as a tool to annotate scrapely templates and define crawling rules and post-processing rules.

I'm not a Portia developer/user myself, but I think it is possible to get script code from Portia; it exports a Scrapy spider to some folder. But I don't really know what I'm talking about; it is better to ask at https://groups.google.com/forum/#!forum/portia-scraper or on stackoverflow (use the tag 'Portia').


Thanks for your help :-)


Not related to Scrapy, but what are some things you scrape the web for?


I once scraped every October posting from Slashdot to see long-term trends. Short story: it's dying. I project the active userbase will be gone by 2020. Curiously, the bulk of the posters were in the 100k to 300k UID range. There was also evidence of shenanigans with UID assignment, where they were skipping even numbers and odd numbers at various times, possibly to inflate their numbers.

This will get you IP banned BTW but I did get a full data set before their script caught me.


ScrapingHub (the guys behind Scrapy) offers Crawlera, which provides automatic proxying and throttling so you can scrape away while avoiding getting banned.


I wonder if Netcraft already confirmed that Slashdot is dying ;)


This is great news. I often read about the stick developers give Py3, not wanting to upgrade from 2 to 3, citing code bases still using Py2.


Minor point of clarification re: the title. Python 3 has been here; it's Scrapy that has (finally, and good for them) come to Python 3.




