
Python 3 comes to Scrapy - ddebernardy
https://blog.scrapinghub.com/2016/02/04/python-3-support-with-scrapy-1-1rc1/
======
ceronman
The breakup between Python 2 and 3 has been very slow and painful. Python devs
know that, and that's why they won't break compatibility in such a big and
drastic way ever again.

I'm glad that we're starting to see light at the end of the tunnel. I find
myself using Python 3 in most of my projects. Sometimes I still have to
resort to Python 2 when some dependency is not ready, but those cases are
rarer every day. Also, it's frustrating to use Python 2 when many cool features
are now Python 3 only. It will take some more time, but I'm sure that the
transition will eventually be completed.

~~~
bootload
_" The breakup between Python 2 and 3 has been very slow and painful. Python
devs know that, and that's why they won't break compatibility in such big and
drastic way ever again."_

Fork, don't break; cf. Pillow (the PIL fork) ~ [http://python-
pillow.github.io/](http://python-pillow.github.io/)

What are the most commonly used Py2 packages that still need to be ported to Py3?

~~~
jacobolus
> _What are the most commonly used Py2 packages that still need to be ported to Py3?_

[https://python3wos.appspot.com](https://python3wos.appspot.com)
[http://py3readiness.org](http://py3readiness.org)

~~~
nerdwaller
For most of the ones that aren't compatible, I've found an alternative or
monkey-patched a method to support Python 3 (I'm looking at supervisor ->
circus, and flask-session -> a monkey patch, because the author seems to have
abandoned it).

------
djm_
Many congrats on the release & also thanks to the Scrapy team for the effort
involved.

As far as I am concerned, this was the last package I used heavily that still
had not made the upgrade.

For the Python community: which packages are you still waiting for/working on?

~~~
crdoconnor
Mechanize would be nice, but it's practically abandonware at this point.

~~~
kevinwang
same

------
jstoiko
This is great news for Python 3. We're almost there! ->
[http://py3readiness.org](http://py3readiness.org)

~~~
jimmaswell
I want to like Python 3 but the print statement change really kills it for me.
Years later it's still frustrating on the occasions I write a python script.
If I ran a project like one of those, I might not use Python just for this
reason.

~~~
njharman
Really? Really! Because every other output method was a function, the
statement is annoying as all hell.

Swapping between print and fh.write(), log.error(), stringio.write(),
sys.stderr.write(), or
mycustom_thing_that_suppresses_print_when_quiet_arg_supplied() in Python 2.x
is annoying as all hell; with Python 3.x it's a simple swap of one function
call for another.
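
The function form is what makes that swap mechanical. A minimal stdlib-only sketch (the `quiet` flag is an illustrative stand-in for a real command-line argument):

```python
import io

# Because print is a plain function in Python 3, the output target can be
# swapped without rewriting call sites: any writable object works via file=,
# and the name used for printing can be rebound to any callable.
buf = io.StringIO()
print("hello", file=buf)        # goes to the buffer instead of stdout
assert buf.getvalue() == "hello\n"

quiet = True                    # illustrative: would come from a --quiet arg
out = (lambda *args, **kw: None) if quiet else print
out("this line is suppressed")  # one call site, swappable behavior
```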

~~~
scrollaway
Look up 'q' on pypi. Thank me later. :)

~~~
njharman
I resolved to use print debugging less and pdb more. This makes that even
harder, thanks a lot. :)

~~~
scrollaway
q.d() ;)

------
bbayer
It is a major change to ask, but I am wondering if they have considered
switching to asyncio instead of using Twisted. Twisted is a great library, but
it is a huge dependency to maintain.

~~~
glyph
Can you elaborate? What makes it a "huge dependency to maintain"? Is there
anything that the Twisted project can do to make it easier? If this is
actually a problem I'd really like to hear from users on the Twisted mailing
list and bug tracker.

~~~
bbayer
Twisted is a general-purpose library/framework with lots of features. This is
the "huge" part. In my previous projects I have used it a lot and appreciated
it.

What I was trying to say is that if Scrapy uses only a small part of the
library, it may be possible for the developers to use similar constructs from
Python's standard library. In any case, a dependency is a dependency, and it
is always better to minimize the code footprint.

------
engi_nerd
I've seen developers refuse to use Python 3 because of not being able to use
Scrapy. Hopefully this gets some more devs to finally make the switch.

~~~
squeaky-clean
I've been refusing to use Scrapy because of not being able to use Python 3 :P
So this is a very exciting announcement for me.

~~~
ddebernardy
Believe me, we've been hearing this too. We're super excited as well. :-)

------
erroneousfunk
This is great! I wrote "Web Scraping with Python" (O'Reilly) and did
everything with Python 3... except for the oddball section on Scrapy. Glad to
know I can update that for the second edition!

------
Keats
Nice! Every time I created a new Scrapy project I kept forgetting it was not
Python 3 and had to recreate the virtualenv. Great news!

------
BuckRogers
I've been holding out for a while on moving to 3. I have tried every 3.x
release and always found issues or performance regressions. Like others have
said here, the performance issues are a concern; Python is already slow. It's
an even harder sell when

\- all major (and equally important, the long tail of minor) libraries support
2

\- CPython2 has the performance advantage in most (if not all) applications

\- virtually every 3rd party implementation supports it very well (PyPy in
particular)

\- Python2 already supported unicode, so that gets old to hear about

\- Most of the new features are available as backports

\- Some new features are absurd, like the new 4th string formatting method in
3.6

If you really look at it, people shouldn't openly wonder why someone uses 2
instead of 3.

I just started a new 3.5 project because while I gave 3.0-3.4 a shot, 3.5
hasn't had its runthrough yet. Most people in my shoes have more than likely
moved on from Python to Go. I'd like to have this be the one and stop going
back to 2.7. Admittedly patience is running on fumes after ~8 years of testing
CPython3 releases.

It wasn't just a bad break, it seems like it was a sloppy break. Instead of
feature bloat, I'd like to see Python3 focus on performance.

~~~
Chris2048
Also, Jython py3 dev lagged far behind py2.

------
bobby_9x
I'm glad so many libraries support Python 3 now. I just made the switch (from
2) a couple of months ago.

------
darkrho
Try it out via conda!

    conda install -c scrapinghub/label/dev scrapy

------
Chris2048
I mentioned some of the issues I had with scrapy ages ago on reddit:
[http://www.reddit.com/r/Python/comments/g112q/installing_and...](http://www.reddit.com/r/Python/comments/g112q/installing_and_using_scrapy_web_crawler_to_search/)

Never got a reply, but I'll reproduce here; I wonder how much of this is still
true?

" The docs where pretty good, but it was unclear sometimes how to proceed;
There was a lot of structure to understand in order to get started.

When I used it, I wanted to scrape a site until a certain condition was met:
when the last page scraped returned no objects. I wanted all results initially
returned from a page to be dropped if they were older than a certain date;
thus I wanted Scrapy to keep scraping until no new items were found. Also, I
wanted the latest date of the items returned so I could use it the next time
I scraped.

I created the 'DropElderMiddleware' middleware to do this. I couldn't see any
other way of making calculations based on items returned from a particular
page.

I could never figure out what the difference between input and output
processors was, or when I should use one or the other.

The MapCompose function flattens objects by default, so I had to be careful
sometimes when returning lists that represented structure I wanted to retain.

The way the HTML match object worked was sometimes confusing; if I wanted to
match multiple items, then match items within each of those, I wanted a list
of lists (matches grouped based on which parent match they were found in). I
can't remember the details of why I found this hard, but I can try to come up
with an example if you like.

In the end I figured I was having to learn the structure of Scrapy for
everything that I wanted it to do, even though many of Scrapy's features I
didn't need, e.g. I didn't want command-line control (I would actually prefer
not to use the interface, though I never discovered how I could write a Python
script to run the spider directly).

Now I prefer to use mechanize + PyQuery; PyQuery is at least as good at
processing web pages as Scrapy's selectors, and if I need something more for
opening a page, e.g. a complicated login, I can use mechanize. I find this a
more modular approach, and I think I better understand what's going on in my
scripts. "

~~~
kmike84
> The docs were pretty good, but it was unclear sometimes how to proceed;
> There was a lot of structure to understand in order to get started.

Yeah, docs used to be a problem; they improved a lot in the 1.0 and 1.1
releases though.

> When I used it, I wanted to scrape a site until certain conditions were met;
> when the last page scraped returned no objects. I wanted all results
> initially returned from a page to be dropped if they were older than a
> certain date; Thus I wanted Scrapy to keep scraping until no new items were
> found. Also, I wanted the latest date of the items returned so I could use
> this the next time I scrape.

The easiest way is to raise a CloseSpider exception in a callback if no new
items are scraped - see
[http://doc.scrapy.org/en/1.0/topics/exceptions.html#closespi...](http://doc.scrapy.org/en/1.0/topics/exceptions.html#closespider)

> I could never figure out what the difference between input and output
> processors was, or when I should use one or the other.

> The MapCompose function flattens objects by default, so I had to be careful
> sometimes when returning lists that represented structure I wanted to
> retain.

I also have trouble understanding ItemLoader details. They are totally
optional though, and they are no longer in the Scrapy tutorial
([http://doc.scrapy.org/en/latest/intro/tutorial.html](http://doc.scrapy.org/en/latest/intro/tutorial.html)).
Item loaders provide features very similar to
[https://github.com/Suor/funcy](https://github.com/Suor/funcy) or
[https://github.com/kachayev/fn.py](https://github.com/kachayev/fn.py).

> The way the html match object worked was sometimes confusing; If I wanted to
> match multiple items, then match items within each of those, I wanted a list
> of lists (group matches together based on what matches they were found
> in). I can't remember the details of why I found this hard, but I can try to
> come up with an example if you like?

I'm not sure what problems you had. Scrapy's selector library
([https://github.com/scrapy/parsel](https://github.com/scrapy/parsel)) is
quite similar to PyQuery (esp. when CSS selectors are used), and nothing
prevents you from using PyQuery with Scrapy. In the future we may add PyQuery
(and BeautifulSoup?) support to parsel and provide PyQuery selectors as
response.pq (like response.css and response.xpath); +1 to do that.

> In the end I figured I was having to learn the structure of Scrapy for
> everything that I wanted it to do, but many of Scrapy's features I didn't
> need e.g. I didn't want command-line control (I would actually prefer not to
> use the interface, though didn't discover how I could write a python script
> to apply the spider directly).

Yeah, the library interface used to be a problem. It was improved in the 1.0
release (there is an official API for integrating Scrapy with Twisted apps and
running spiders from user scripts), but there is still more to do. See
[http://doc.scrapy.org/en/1.0/topics/practices.html#run-
scrap...](http://doc.scrapy.org/en/1.0/topics/practices.html#run-scrapy-from-
a-script).

It probably won't be as easy to integrate with regular Python scripts as
mechanize, because Scrapy is async. On the other hand, Scrapy is easier to
integrate with async servers like Twisted or Tornado.

> Now I prefer to use mechanize + PyQuery; PyQuery is at least as good as
> processing web pages as Scrapy's object, and if I need something more for
> opening a page e.g. complicated login, I can use mechanize. I find this a
> more modular approach, and think that I better understand what's going on in
> my scripts.

You may want to check the new 'Scrapy at a glance' page
([http://doc.scrapy.org/en/latest/intro/overview.html](http://doc.scrapy.org/en/latest/intro/overview.html)).
The main advantage of Scrapy over mechanize is that it handles parallel
downloads and has a wide range of built-in extensions you won't have to
implement yourself.

~~~
Chris2048
Hmm, the matching issue might have been something like wanting to do
"[i.match(tag='foo') for i in body.match(tag='bar')]" and getting a list-of-
lists back, but this was a long time ago :-)

Incidentally, I've since gone off pyQuery as it doesn't always keep up with
jquery. I now prefer lxml or BS4..

BTW, I love ScrapingHub. I bashed out a few Spiders with portia, but
ultimately, I'll prob start scripting instead. Do you know if Portia actually
generates script code? It might be easier, for fast scraping, to get 60% of
the way with Portia, then manually write the rest of the script.

One last thing - looking at this page

> [http://stackoverflow.com/questions/6261714/inferring-
> templat...](http://stackoverflow.com/questions/6261714/inferring-templates-
> from-a-collection-of-strings)

there is mention of a "wrapper induction library"; I can't find any more
mention of it though. Does the class/functionality still exist?

~~~
kmike84
The wrapper induction library was split out of Scrapy:
[https://github.com/scrapy/scrapely](https://github.com/scrapy/scrapely). It
is used in Portia under the hood. Portia can be seen as a tool to annotate
scrapely templates and define crawling rules and post-processing rules.

I'm not a Portia developer/user myself, but I think it is possible to get
script code from Portia; it exports Scrapy spider to some folder. But I don't
really know what I'm talking about, it is better to ask at
[https://groups.google.com/forum/#!forum/portia-
scraper](https://groups.google.com/forum/#!forum/portia-scraper) or at
stackoverflow (use tag 'Portia').

~~~
Chris2048
Thanks for your help :-)

------
penetrarthur
Not related to Scrapy, but what are some things you scrape the web for?

~~~
kevin_thibedeau
I once scraped every October posting from Slashdot to see long term trends.
Short story: it's dying. I project the active userbase will be gone by 2020.
Curiously the bulk of the posters were in the 100k to 300k UID range. There
was also evidence of shenanigans with UID assignment where they were skipping
even numbers and odd numbers at various times possibly to inflate their
numbers.

This will get you IP banned BTW but I did get a full data set before their
script caught me.

~~~
hatchoo
ScrapingHub (the company behind Scrapy) offers Crawlera, which provides
automatic proxying and throttling so you can scrape away without getting
banned.

------
bootload
This is great news. I often read about the stick developers give Py3, not
wanting to upgrade from 2 to 3 and citing code bases still using Py2.

------
njharman
Minor point of clarification re title; Python 3 has been here, it's Scrapy
that has (finally and good for them) come to Python 3.

~~~
kmike84
that's an example of
[https://en.wikipedia.org/wiki/Galilean_invariance](https://en.wikipedia.org/wiki/Galilean_invariance)

