I'm glad that we're starting to see light at the end of the tunnel. I find myself using Python 3 in most of my projects. Sometimes I still have to resort to Python 2 when some dependency isn't ready, but those cases are rarer every day. It's also frustrating to use Python 2 when many cool features are now Python 3 only. It will take some more time, but I'm sure the transition will eventually be completed.
Ultimately, we test in 2 and 3 via CI, but still primarily develop using 2.
There was also a gotcha with replacing cStringIO with io.BytesIO: the latter copies the data even for read-only cases (when you just want a file-like interface over an existing binary object), so it can be much slower.
To make things worse, Python 3.x didn't have a cStringIO alternative which could provide a file-like interface without that overhead, and most porting guides suggest using io.BytesIO instead of cStringIO in Python 3. So libraries like tornado, django and flask are affected, and likely some scientific libraries as well.
This should be fixed in Python 3.5: io.BytesIO is as efficient as cStringIO for read-only cases since 3.5.
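For illustration, the pattern in question (a minimal sketch; the actual overhead depends on buffer size and interpreter version):

    # Wrapping an existing byte string in a read-only file-like object.
    # On Python 2, cStringIO.StringIO(data) does this without copying;
    # the usual Python 3 replacement, io.BytesIO(data), copies the buffer
    # on construction (optimized for this read-only case in CPython 3.5).
    import sys

    data = b"some large binary blob" * 100000

    if sys.version_info[0] == 2:
        from cStringIO import StringIO as BytesIO  # zero-copy wrapper
    else:
        from io import BytesIO                     # copies `data` before 3.5

    fileobj = BytesIO(data)   # file-like, read-only view of the bytes
    assert fileobj.read(4) == b"some"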
At the moment I more or less:
apt-get install liblapack-dev libopenblas-dev
pip install numpy
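If it's useful, a quick way to check that numpy actually picked up OpenBLAS after those two commands (just a sanity check I'd add, not part of the recipe):

    # Prints the BLAS/LAPACK libraries numpy was built against;
    # look for 'openblas' in the output.
    import numpy
    numpy.show_config()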
There is also scikit.cuda, which wraps Nvidia's cuBLAS and can be very fast in certain cases, but isn't in any way a drop-in replacement for openblas.
Then there's NumbaPro (a commercial product) from Continuum Analytics, which is an LLVM-backed JIT that attempts to automatically speed up your numpy code and can automatically make your code use cuBLAS where it makes sense to do so.
Python 3 code also has to encode/decode strings more often, but the overhead was not significant for `scrapy bench`.
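For context, this is the kind of extra boundary-crossing meant here (a toy sketch, not Scrapy code):

    # In Python 3 the bytes/str split is strict: data from the network arrives
    # as bytes and must be decoded before text processing, then re-encoded on output.
    raw = b"<title>hello</title>"        # what a socket or response body gives you
    text = raw.decode("utf-8")           # explicit decode step
    out = text.upper().encode("utf-8")   # and back to bytes for writing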
For URL parsing it was just a quick profiling session. Also, stdlib URL parsing was a bottleneck even in Python 2. It takes time to benchmark it properly and submit a meaningful issue to the bug tracker or explain the problem on python-dev. It's on my todo list, but you only have so much time :)
Fork, don't break, cf Pillow (PIL fork) ~ http://python-pillow.github.io/
What are the most commonly used Py2 packages that still need to be ported to Py3?
So maybe that's not even an example of the 2/3 split, it's just old.
As far as I am concerned, this was the last package I used heavily that still had not made the upgrade.
For the Python community: which packages are you still waiting for/working on?
This may not be a blocker for the rest of our stuff though, since ansible can install its own python 2 on servers automatically.
ArcGIS Pro is using 3.4, so the wrappers for ArcObjects are updated, but who knows if they will put that work back into ArcGIS Desktop...
Fortunately there is a simple solution -- fork Python 2.7 to make the Python 2.8 that Guido refuses to make. A python version that doesn't have compatibility issues with Guido's playground programming. This version would not have the poor performance of Python 3. This version is essentially already made with PyPy. I just wish everyone would have told Guido to get lost 7 years ago and switched to a development group with sane deprecation of features. If we had then all of the programmer effort that was poured into making their code Guido approved would have been saved. Can you imagine how much better Python development would have been without a 7 year lag that included BREAKING the code base of everyone who had previously programmed in Python 2.x?
Guido is the self proclaimed benevolent dictator for life. However, that doesn't describe a leader who breaks the programming efforts of every developer using the language prior to Py3k. He then has the stubborn arrogance to deny a transition version and instead requires 7 years of programming effort. He certainly deserves a "Hahahahaha" that it took 7 years to get "almost there". At least he is not still shocked that py3k wasn't greeted with open arms. Hopefully before 2020 he will swallow his pride and agree to a transition version -- a 4.0 which works with 2.7 or a 2.8 that works with 3.x. If not, a fork will happen.
Hopefully that clarifies the previous comment, and congratulations to the Scrapy developers for powering through and putting up with a language that was purposely broken by the language developers.
"self proclaimed benevolent dictator for life" - As a bit of minutia, it was Ken Manheimer who proclaimed him thus. This is a minor point, of course.
"a 4.0 which works with 2.7 or a 2.8 that works with 3.x. If not, a fork will happen." - I look forward to a solution to enable chained exceptions and asyncio to work under a 2.x environment with a __future__ flag. I wonder who will pay for the hard work.
If you can do your work sticking to only the packages directly provided by Mathworks, and you are comparing those to third-party Python libraries for equivalent functionality, then you might be right though.
Py2/3 shouldn't impact CV code at all. The major issue would be installation, which is generally a solved problem. OpenCV and the scipy stack fully support Python 3, and they are some of the best documented libraries I've ever encountered (better than the bit of Matlab I've used for robotics, at least).
I'm curious as to what problems you encountered?
If you'd said something about string encodings, you might have got sympathy from someone...
Swapping between print and fh.write(), log.error(), stringio.write(), sys.stderr.write(), or mycustom_thing_that_suppresses_print_when_quiet_arg_supplied() in Python 2.x is annoying as all fuck, and a simple find-and-replace on 'print(' with Python 3.x.
from __future__ import print_function
I find switching to log() as an output function helps psychologically - it's shorter to type than print, and it's something I'm more likely to keep in production.
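A minimal sketch of that pattern (log() here is just an illustrative stand-in, not any particular library's API):

    # With print as a function (Python 3, or Python 2 with the __future__ import),
    # switching output targets is a matter of swapping one callable for another.
    from __future__ import print_function
    import sys

    def log(*args, **kwargs):
        # route everything to stderr; drop messages when quiet=True is passed
        if not kwargs.pop("quiet", False):
            print(*args, file=sys.stderr, **kwargs)

    log("fetching page", 42)           # goes to stderr
    log("noisy details", quiet=True)   # suppressed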
* POC for aiohttp as a http handler: https://github.com/scrapy/scrapy/pull/1455
* some thoughts about how to make an async/await API for Scrapy; it's not all roses: https://github.com/scrapy/scrapy/issues/1144#issuecomment-14...
However, as you said, it'd be a major change and it would affect the whole ecosystem (plugins and extensions), so it's complicated. We'll see what happens. :)
What I was trying to say is that if Scrapy uses only a small part of the library, it may be possible for the developers to use similar constructs from Python's standard library. In any case, a dependency is a dependency, and it is always better to minimize the code footprint.
- all major (and equally important, the long tail of minor) libraries support 2
- CPython2 has the performance advantage in most (if not all) applications
- virtually every 3rd party implementation supports it very well (PyPy in particular)
- Python2 already supported unicode, so that gets old to hear about
- Most of the new features are available as backports
- Some new features are absurd, like the new 4th string formatting method in 3.6
If you really look at it, people shouldn't openly wonder why someone uses 2 instead of 3.
I just started a new 3.5 project because while I gave 3.0-3.4 a shot, 3.5 hasn't had its runthrough yet. Most people in my shoes have more than likely moved on from Python to Go. I'd like to have this be the one and stop going back to 2.7. Admittedly patience is running on fumes after ~8 years of testing CPython3 releases.
It wasn't just a bad break, it seems like it was a sloppy break. Instead of feature bloat, I'd like to see Python3 focus on performance.
conda install -c scrapinghub/label/dev scrapy
Never got a reply, but I'll reproduce here; I wonder how much of this is still true?
The docs were pretty good, but it was sometimes unclear how to proceed; there was a lot of structure to understand in order to get started.
When I used it, I wanted to scrape a site until certain conditions were met; when the last page scraped returned no objects. I wanted all results initially returned from a page to be dropped if they were older than a certain date; Thus I wanted Scrapy to keep scraping until no new items were found. Also, I wanted the latest date of the items returned so I could use this the next time I scrape.
I created the 'DropElderMiddleware' middleware to do this. I couldn't see any other way of making calculations based on items returned from a particular page.
I could never figure out what the difference between input and output processors was, or when I should use one or the other.
The MapCompose function flattens objects by default, so I had to be careful sometimes when returning lists that represented structure I wanted to retain.
The way the html match object worked was sometimes confusing; if I wanted to match multiple items, then match items within each of those, I wanted a list of lists (group matches together based on what matches they were found in). I can't remember the details of why I found this hard, but I can try to come up with an example if you like?
In the end I figured I was having to learn the structure of Scrapy for everything that I wanted it to do, but many of Scrapy's features I didn't need e.g. I didn't want command-line control (I would actually prefer not to use the interface, though didn't discover how I could write a python script to apply the spider directly).
Now I prefer to use mechanize + PyQuery; PyQuery is at least as good at processing web pages as Scrapy's objects, and if I need something more for opening a page, e.g. a complicated login, I can use mechanize. I find this a more modular approach, and I think I better understand what's going on in my scripts.
Yeah, docs used to be a problem; they improved a lot in 1.0 and 1.1 releases though.
> When I used it, I wanted to scrape a site until certain conditions were met; when the last page scraped returned no objects. I wanted all results initially returned from a page to be dropped if they were older than a certain date; Thus I wanted Scrapy to keep scraping until no new items were found. Also, I wanted the latest date of the items returned so I could use this the next time I scrape.
The easiest way is to raise a CloseSpider exception in a callback if no new items are scraped - see http://doc.scrapy.org/en/1.0/topics/exceptions.html#closespi...
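Roughly what that looks like inside a callback (the CSS selector and item fields are placeholders):

    # Stop the crawl from inside a spider callback when a page yields nothing new.
    from scrapy.exceptions import CloseSpider

    def parse(self, response):          # a method on your Spider subclass
        new_items = response.css(".item")              # placeholder extraction
        if not new_items:
            raise CloseSpider("no new items found")    # shuts the spider down cleanly
        for item in new_items:
            yield {"title": item.css("::text").extract_first()}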
> I could never figure out what the difference between input and output processors was, or when I should use one or the other.
> The MapCompose function flattens objects by default, so I had to be careful sometimes when returning lists that represented structure I wanted to retain.
I also have trouble understanding ItemLoader details. They are totally optional though, and they are no longer in the Scrapy tutorial (http://doc.scrapy.org/en/latest/intro/tutorial.html). Item loaders provide features very similar to https://github.com/Suor/funcy or https://github.com/kachayev/fn.py.
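On the MapCompose point, a tiny illustration of the flattening behaviour (the lambda is just an example processor):

    # MapCompose applies its functions to each input value and flattens list
    # results, which is why nested structure does not survive.
    from scrapy.loader.processors import MapCompose

    split_words = MapCompose(lambda value: value.split())
    print(split_words(["a b", "c d"]))   # ['a', 'b', 'c', 'd'], not [['a', 'b'], ['c', 'd']]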
> The way the html match object worked was sometimes confusing; if I wanted to match multiple items, then match items within each of those, I wanted a list of lists (group matches together based on what matches they were found in). I can't remember the details of why I found this hard, but I can try to come up with an example if you like?
I'm not sure what problems you had. Scrapy's selectors library (https://github.com/scrapy/parsel) is quite similar to PyQuery (esp. when CSS selectors are used), and nothing prevents you from using PyQuery with Scrapy. In the future we may add PyQuery (and BeautifulSoup?) support to parsel and provide PyQuery selectors as response.pq (like response.css and response.xpath); +1 to do that.
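For reference, using PyQuery from a Scrapy callback already works today; a rough sketch (the selector is made up, and response.pq does not exist yet):

    # Build a PyQuery document from a Scrapy response inside a callback.
    from pyquery import PyQuery

    def parse(self, response):              # a method on your Spider subclass
        doc = PyQuery(response.body)        # lxml handles the raw bytes/encoding
        for node in doc("h2.title"):
            yield {"title": PyQuery(node).text()}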
> In the end I figured I was having to learn the structure of Scrapy for everything that I wanted it to do, but many of Scrapy's features I didn't need e.g. I didn't want command-line control (I would actually prefer not to use the interface, though didn't discover how I could write a python script to apply the spider directly).
Yeah, the library interface used to be a problem. It was improved in the 1.0 release (there is an official API for integrating Scrapy with Twisted apps and running spiders from user scripts), but there is still more to do. See http://doc.scrapy.org/en/1.0/topics/practices.html#run-scrap....
It probably won't be as easy to integrate with regular Python scripts as mechanize because Scrapy is async. On the other hand, Scrapy is easier to integrate with async servers like Twisted or Tornado.
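For completeness, the script API from the practices page looks roughly like this (MySpider and the project layout are placeholders):

    # Run a spider from a plain Python script, per the practices doc linked above.
    from scrapy.crawler import CrawlerProcess
    from myproject.spiders import MySpider   # hypothetical project layout

    process = CrawlerProcess({"USER_AGENT": "my-crawler (+http://example.com)"})
    process.crawl(MySpider)
    process.start()   # blocks here until the crawl is finished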
> Now I prefer to use mechanize + PyQuery; PyQuery is at least as good at processing web pages as Scrapy's objects, and if I need something more for opening a page, e.g. a complicated login, I can use mechanize. I find this a more modular approach, and I think I better understand what's going on in my scripts.
You may want to check the new 'Scrapy at a glance' page (http://doc.scrapy.org/en/latest/intro/overview.html). The main advantage of Scrapy over mechanize is that it handles parallel downloads and has a wide range of built-in extensions you won't have to implement yourself.
Incidentally, I've since gone off PyQuery as it doesn't always keep up with jQuery. I now prefer lxml or BS4.
BTW, I love ScrapingHub. I bashed out a few spiders with Portia, but ultimately I'll probably start scripting instead. Do you know if Portia actually generates script code? It might be easier for fast scraping to get 60% of the way with Portia, then manually write the rest of the script.
One last thing - looking at this page, there is mention of a "wrapper induction library"; I can't find any more mention of it though. Does the class/functionality still exist?
I'm not a Portia developer/user myself, but I think it is possible to get script code from Portia; it exports a Scrapy spider to some folder. But I don't really know what I'm talking about; it is better to ask at https://groups.google.com/forum/#!forum/portia-scraper or on Stack Overflow (use the 'Portia' tag).
This will get you IP banned BTW but I did get a full data set before their script caught me.