
Finding the best ticket price – Simple web scraping with Python - danielforsyth
http://www.danielforsyth.me/finding-the-best-ticket-price-simple-web-scraping-with-python/
======
jknupp
A shorter, more comprehensible version:

        import requests
        from bs4 import BeautifulSoup
        from urlparse import urljoin

        URL = 'http://philadelphia.craigslist.org/search/sss?sort=date&query=firefly%20tickets'
        BASE = 'http://philadelphia.craigslist.org/cpg/'

        response = requests.get(URL)
        soup = BeautifulSoup(response.content)

        # Each search result is a <p class="row">; only rows containing a
        # price span are listings we can filter on.
        for listing in soup.find_all('p', {'class': 'row'}):
            if listing.find('span', {'class': 'price'}):
                price = int(listing.text[2:6])
                if 100 < price <= 250:
                    print listing.text
                    print urljoin(BASE, listing.a['href']) + '\n'

~~~
danielforsyth
Thanks for posting this, I am still very new to python and your website has
taught me a lot. Appreciate the feedback.

------
motoboi
Some months ago I found [https://import.io/](https://import.io/) and it just
blew my mind.

I remember what a pain it was to write a custom scraper every time (I used to
do it with Perl, btw).

They have a custom browser with a nice interface, but the biggest feature is
the so-called "Connectors": you teach the system how to query a site and
parse the results, and Import.IO gives you an API endpoint for that query, now
automated.

One can, say, create a "connector" that queries Airbnb and parses the
results, then create another that queries booking.com. It then becomes
possible to use the API to run a query for Boa Vista, Roraima (my city) and
get the dataset.
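
For the curious, here is a minimal sketch of what querying such a connector
from Python might look like. The endpoint path, connector ID, and API key
below are hypothetical placeholders, not real import.io values:

        import requests

        # Hypothetical endpoint and credentials: import.io issues the real
        # connector ID and API key when you build a connector in their UI.
        API = 'https://api.import.io/store/connector/CONNECTOR_ID/_query'
        params = {'input': 'Boa Vista, Roraima', '_apikey': 'YOUR_API_KEY'}

        # The 'results' key is likewise an assumption about the response shape.
        for row in requests.get(API, params=params).json().get('results', []):
            print(row)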

I am not affiliated with them in any way, just a very happy old-school
scraper.

Nice walkthrough:
[http://www.youtube.com/watch?v=_16O10Wx2W4](http://www.youtube.com/watch?v=_16O10Wx2W4)

UPDATE:

Unsurprisingly, import.io has come up on Hacker News before:
[https://news.ycombinator.com/item?id=7582858](https://news.ycombinator.com/item?id=7582858)

~~~
dmn001
Other browser-based screen scrapers in this space include 80legs, Kimono
Labs, Mozenda, and OutWit Hub; I'm sure there are more. Last time I checked,
import.io was a fairly lightweight browser wrapper.

I also write web scrapers in Perl and Python, though recently I have been
gravitating toward Python because the code reads better. I don't use
browser-based scrapers: the sites I scrape are usually complex enough that it
is easier to write my own code, the hosted tools lack functionality and
control over the data, and there is the overhead of learning their
terminology and how they work.

------
tst
I can recommend scrapy[0] if you are working on a somewhat bigger problem.
Even for small jobs, once you are familiar with scrapy it is incredibly fast
to write a simple scraper and have your data neatly exported as .json.

[0]: [http://scrapy.org/](http://scrapy.org/)
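
For a sense of what that looks like, here is a minimal spider sketch; the
selectors and field names are illustrative, not taken from the article:

        import scrapy

        class TicketSpider(scrapy.Spider):
            # Illustrative spider mirroring the article's Craigslist search.
            name = 'tickets'
            start_urls = [
                'http://philadelphia.craigslist.org/search/sss?sort=date&query=firefly%20tickets',
            ]

            def parse(self, response):
                # Each yielded dict becomes one record in the exported feed.
                for row in response.css('p.row'):
                    links = row.css('a::attr(href)').extract()
                    yield {
                        'text': row.css('::text').extract(),
                        'link': links[0] if links else None,
                    }

Save it as tickets.py and run "scrapy runspider tickets.py -o tickets.json"
to get the results as JSON.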

~~~
crdoconnor
I don't recommend scrapy. It is a classic example of a framework that should
have been a library: it works up to a point, then it railroads your app, and
you have a really painful time breaking out of the 'scrapy' way of doing
things. The classic 'framework' problem.

I prefer a combination of celery (distributed task management), mechanize
(pretend web browser) and pyquery (jquery selectors for python).
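
For a concrete sense of that stack, a minimal sketch; the broker URL and task
name are placeholders:

        from celery import Celery
        import mechanize
        from pyquery import PyQuery

        # Placeholder broker URL; point this at your own Redis or RabbitMQ.
        app = Celery('scrape', broker='redis://localhost:6379/0')

        @app.task
        def fetch_listings(url):
            # mechanize is the "pretend web browser".
            br = mechanize.Browser()
            br.set_handle_robots(False)
            html = br.open(url).read()
            # pyquery provides jQuery-style selectors over the fetched page.
            return [PyQuery(row).text() for row in PyQuery(html)('p.row')]

Calling fetch_listings.delay(url) then queues the fetch onto whatever pool of
celery workers you have running.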

~~~
kmike84
I'm not sure how you would design a _library_ for event-loop based website
navigation when the event loop is explicit. Scrapy (which is a wrapper over
Twisted) is already quite close to this, IMHO. You can plug anything else into
the same event loop if needed (think Twisted web services, etc.).

You can parallelize synchronous mechanize/requests scripts via celery, but it
is less efficient in terms of resource usage when the bottleneck is I/O, and
it has larger fixed costs per task.

Running N Scrapy processes, each handling 1/N of the total URLs, is an easy
enough way to distribute load; if that is not enough, a shared queue like
[https://github.com/darkrho/scrapy-redis](https://github.com/darkrho/scrapy-redis)
is also an option.
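
The 1/N split itself is a one-liner with Python slicing; the worker index
would come from however you launch the processes:

        # Worker i of N takes every Nth URL, starting at offset i.
        def shard(urls, worker_index, num_workers):
            return urls[worker_index::num_workers]

        urls = ['http://example.com/page/%d' % i for i in range(100)]
        print(shard(urls, 0, 4))  # worker 0 gets pages 0, 4, 8, ...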

I think it is not the 'scrapy' way of doing things that causes the problems;
it is the inherent complexity of concurrency. You either give up some
concurrency or build your solution around it.

------
dai_pole
I had a go "just for fun" using curl, grep, sed, and tr. Probably too much
regex?

        #!/bin/sh
        #
        # tickets.sh - A "no BS" ticket price scraper. Output in CSV format.
        #              Uses standard issue Unix utilities only.
        #              No soup for you!
        
        
        URL="http://philadelphia.craigslist.org"
        QUERY="firefly+tickets"
        
        RESULTS=`curl -s -m 10 "$URL/search/sss?sort=date&query=$QUERY" \
                | grep '<p class=\"row' \
                | sed 's!^[ \t]*!!; \
                       s!>[ \t]*<!><!g; \
                       s![,:]! !g; \
                       s!<p class=\"row[^/]*\"\([^\"]*\)\" class=\"[^#]*\">&#x0024;\([0-9]\{1,\}\)</span>[^.]*>\([A-Z]\{1\}[a-z]\{2\} \{1,\}[0-9]\{1,2\}\)[^.]*<a h[^>]*\.html">\([^<]*\)</a>\([^.]*</p>\)!\1,$\2,\3,\4:!g; \
                       s!   *! !g; \
                       s!,  *!,!g' \
                | tr ':' '\n'`
        
        echo "$RESULTS"

------
josegonzalez
Shameless Plug: I work for an NYC-based startup - SeatGeek.com - that is
basically this[1]. We used to do forecasting but found that wasn't really
useful[2] or worth the time it took to maintain, so we nixed it.

- [1]: As an example, here is the Firefly event the OP was scraping:
[https://seatgeek.com/firefly-music-festival-tickets](https://seatgeek.com/firefly-music-festival-tickets)

- [2]: We haven't included Craigslist because the data is much less
structured and inexperienced users may have a Bad Time™. YMMV

- [3]: It was also a royal pain in the ass to maintain. I know because I had
to update the underlying data provided to the model, and also modify it
whenever the available data changed :( . Here is a blog post on why we
removed it from the product in general:
[http://chairnerd.seatgeek.com/removing-price-forecasts](http://chairnerd.seatgeek.com/removing-price-forecasts)

------
jtokoph
Combine this with Pushover[0] to get alerted whenever there is a new lowest
price. I had to resort to scraping+pushover to snatch a garage parking spot in
SF.

[0] [https://pushover.net/](https://pushover.net/)
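
Pushover's HTTP API is a single POST; here is a minimal sketch, with
placeholder credentials:

        import requests

        # Placeholder token/user key -- Pushover issues real ones per app/user.
        requests.post('https://api.pushover.net/1/messages.json', data={
            'token': 'APP_TOKEN',
            'user': 'USER_KEY',
            'message': 'New lowest price spotted: $150',
        })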

------
tomaisthorpe
Useful article. I use lxml myself, and find this a good resource:
[http://jakeaustwick.me/python-web-scraping-resource/](http://jakeaustwick.me/python-web-scraping-resource/)
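
For comparison with the BeautifulSoup version, the same kind of extraction
with lxml might look like this (selectors illustrative):

        import requests
        from lxml import html

        url = ('http://philadelphia.craigslist.org/search/sss?'
               'sort=date&query=firefly%20tickets')
        tree = html.fromstring(requests.get(url).content)

        # XPath equivalent of soup.find_all('p', {'class': 'row'}).
        for row in tree.xpath('//p[@class="row"]'):
            print(row.text_content())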

------
gjreda
I did this recently when trying to get tickets to a sold-out Cloud Nothings
show. I'd scrape Craigslist for postings every 10 minutes, then send myself a
text if any of the posts were new. I ended up getting tickets the day before
the show.

Since the show was at a very small venue (capacity of maybe 500), I didn't
have to worry about a constant stream of false positives. I would have needed
to handle these if I were searching for tickets to a sold out <popular band>
show, since ticket brokers just spam Craigslist constantly with popular terms.
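
A minimal sketch of that poll-and-alert loop; the notify function is a
placeholder for whatever SMS or email service you'd wire in:

        import time
        import requests
        from bs4 import BeautifulSoup

        URL = ('http://philadelphia.craigslist.org/search/sss?'
               'sort=date&query=cloud%20nothings')
        seen = set()

        def notify(text):
            # Placeholder: hook up an SMS gateway or email here.
            print('NEW LISTING: %s' % text)

        while True:
            soup = BeautifulSoup(requests.get(URL).content)
            for listing in soup.find_all('p', {'class': 'row'}):
                link = listing.a['href']
                if link not in seen:  # only alert on unseen posts
                    seen.add(link)
                    notify(listing.text)
            time.sleep(600)  # poll every 10 minutes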

------
buro9
This reminds me of something I knocked up back in 2006. It's not a scraper,
it's not Python, but here you are:

[http://giggr.com/?q=klaxons](http://giggr.com/?q=klaxons)

Searches multiple UK ticket sites and returns the artist page matching the
query.

Clicking a header label (i.e. Ticketweb) switches to that provider.

Double-clicking the header re-searches based on the value of the search box.

I use it for the 9am scramble for newly released tickets.

Oh, it seems Ticketmaster has broken. Maybe I'll fix that one day... I haven't
used it in a while.

~~~
zo1
If you don't mind me asking: who pays for the bandwidth cost of running
"giggr"? It doesn't look like you have any ads running. Or are you monetizing
it in some other way?

~~~
buro9
It's a static web page on a Linode I use for other projects and purposes.

Even _if_ millions of people decided to suddenly use it, the cost would be
almost nothing.

I _might_ even consider putting the free CloudFlare in front of it to ensure
the cost is nothing (one static HTML file cached forever).

Heh, just looked at the source code again... it's a single request web page,
not even an external CSS or JavaScript file.

You can't get cheaper really.

------
nivertech
Doesn't work for me. Which Python version is required?

        Traceback (most recent call last):
          File "./tickets.py", line 20, in <module>
            for listing in soup.findall('p', {'class': 'row'}):
        TypeError: 'NoneType' object is not callable

~~~
danielforsyth
I am using 2.7.6

~~~
nivertech
Maybe the wrong BeautifulSoup version?

        $ sudo pip install BeautifulSoup
        Downloading/unpacking BeautifulSoup
          Downloading BeautifulSoup-3.2.1.tar.gz
          Running setup.py (path:/tmp/pip_build_root/BeautifulSoup/setup.py) egg_info for package BeautifulSoup
            
        Installing collected packages: BeautifulSoup
          Running setup.py install for BeautifulSoup
            
        Successfully installed BeautifulSoup
        Cleaning up...

~~~
danielforsyth
Ah yes, that's the problem: I am using beautifulsoup4==4.3.2.

Try pip install beautifulsoup4
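
The two packages also import differently, which is a quick way to tell which
one a script expects:

        # BeautifulSoup 3.x (the 'BeautifulSoup' package on PyPI):
        from BeautifulSoup import BeautifulSoup

        # BeautifulSoup 4.x (the 'beautifulsoup4' package on PyPI):
        from bs4 import BeautifulSoup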

~~~
nivertech
I found the problem. I think your listing ate the underscores: it should be
'soup.find_all' instead of 'soup.findall' and 'link_end' instead of
'linkend'.

~~~
danielforsyth
Good find! Fixed it, thanks!

------
rakoo
You should integrate this in weboob [0]

[0] [http://weboob.org/](http://weboob.org/)

~~~
bshimmin
That _really_ isn't a good name.

~~~
wingerlang
Seems to be their thing

QHandJoob

QFlatBoob

~~~
bshimmin
Wow. Spurred on by your discoveries, I found this:
[http://weboob.org/applications/qhavedate](http://weboob.org/applications/qhavedate)

"QHaveDate is a graphical application able to interact with dating websites,
and help you manage your numerous conquests."

I, uh, don't really know where to start with that.

~~~
lumpypua
Lol, there's some ridiculous stuff on that page: "Management of the meeting
places, with for each some stats for the number of contacts met, your success
rate, which seduction methods have been used, if they worked or not, if you’ve
slept with the contact, which sexual positions, etc."

------
mjhea0
This is not very good code. Here's a slightly better refactor:
[https://github.com/realpython/interview-questions/blob/master/refactor_me/after1.py](https://github.com/realpython/interview-questions/blob/master/refactor_me/after1.py)

------
Hilyin
There is also ifttt.com, which can poll a specific CL search and email you
when something hits.

