
Dead simple Python crawler for extracting structured data from a website to CSV - CarolineW
http://blog.webhose.io/2015/08/16/dead-simple-for-devs-python-crawler-script-for-extracting-structured-data-from-any-almost-website-into-csv/
======
echochar
"No external imports required..."

I convert HTML to CSV regularly (multiple times a day). But I do not use
Python; I use C. Actually I use flex to make filters which I compile as static
binaries that read from stdin. This is in fact how I read HN. The HTML is
converted to CSV and then the CSV is imported into a database.
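
For comparison, here is a rough Python analogue of that kind of stdin filter. It is only a sketch and not the flex filter described above: the names `LinkExtractor` and `html_to_csv` are made up for illustration, and it assumes the only data wanted from the page is each link's `href` and text. It uses only the standard library.

```python
import csv
import io
import sys
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (href, text) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace inside the link text.
            text = " ".join("".join(self._text).split())
            self.rows.append((self._href, text))
            self._href = None


def html_to_csv(html):
    """Return CSV text (header plus one row per link) for an HTML string."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["href", "text"])
    writer.writerows(parser.rows)
    return out.getvalue()


def main():
    # Filter mode: HTML on stdin, CSV on stdout.
    sys.stdout.write(html_to_csv(sys.stdin.read()))
```

Calling `main()` from a script turns this into a filter that could sit in a shell pipeline, with the resulting CSV then loaded into whatever database is in use.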

Prior to using flex I primarily used sed. For many sites I still do; it's
faster than having to compile, test, recompile.

If anyone has a website they want in CSV, and needs something faster than
Python or Ruby, just post the URL. I like to think I am reasonably good at
this, but I only do it for personal use on sites I'm interested in, so who
knows. For me HTML conversion to CSV and plain text is an art - I practice it
every day.

~~~
catmanjan
>This is in fact how I read HN.

Out of curiosity, why?

~~~
echochar
1\. Practice with new programming language and database.

2\. Turn unstructured, difficult-to-parse data adorned with HTML and other
window dressing into structured data that is easier to parse.

------
brachi
urlparse().netloc, not netlock. It would be nice to clarify that this is for
Python 2, or even better to just do it in Python 3. It's maybe not as dead
simple for beginners if you use regular expressions.
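
A minimal sketch of the attribute in question, written so it imports under either Python version (the helper name `domain_of` is just for illustration):

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2


def domain_of(url):
    """Return the network location (host, plus port if given) of a URL."""
    # The attribute is spelled `netloc`; `netlock` raises AttributeError.
    return urlparse(url).netloc


print(domain_of("http://tomforb.es/scraping-websites-with-cyborg"))
```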

------
blacksqr
Uses one weird trick. Ruby users HATE this!

------
sethherr
This is why I recommend Ruby over Python for web development: the difference
between a code snippet posted on a blog and the wealth of mature,
multi-contributor projects like
[https://github.com/propublica/upton](https://github.com/propublica/upton)

~~~
orf
The author has just decided to roll their own half-baked script for some
reason; there are plenty of libraries to help. Python is good for scraping. I
released my own library recently[1] that I think hits a sweet spot, but there
are many others.

Python and Ruby have their places; I wouldn't say one is clearly better than
the other.

1\. [http://tomforb.es/scraping-websites-with-cyborg](http://tomforb.es/scraping-websites-with-cyborg)

~~~
rangeva
The script is intended to be a copy/paste script, with no imports and just
"fill in the blanks" parameters. There are A LOT of other options that are
much better, but require more setup:
[http://scrapy.org/](http://scrapy.org/)
[https://pypi.python.org/pypi/spider.py](https://pypi.python.org/pypi/spider.py)

