
Ask HN: What info do you web scrape for? - cblock811
I have been web scraping for several months and am starting to teach it at Meetups. I'm lucky enough to work for a company that has a few pre-crawled copies of the web that I can query against and a distributed processing platform to speed up any scraping I do.

I'm running out of ideas of what to build though. I build scrapers to produce content for the company based on the data and insights I find. They are usually marketing verticals though, such as finding all websites using feedback tools (I search based on their javascript widgets), and doing analysis on that info.

So if you had these resources, what would you be looking for? I love building tools that help people so any feedback/ideas would be great!

I'm also open to hearing what you would scrape for on the live web. I find that if I'm doing broad analysis then the pre-crawled copies are best, and for specific sites/information I use the live web.
======
chollida1
There is a cottage industry for scraping any sort of data that can move
markets: Fed, crop, weather, employment, etc.

Anything that is released at a certain time on a fixed calendar, you can bet
that multiple parties are trying to scrape it as fast as possible.

If you can scrape this data (the easy part), put it in a structured format
(somewhat hard), and deliver it in under a few seconds (this is where you get
paid), then you can almost name your price.

It's an interesting niche that hasn't been computerized yet.

If you can't get the speed then the first 2 steps can still be useful to the
large number of funds that are springing up using "deep learning" techniques
to build a portfolio over timelines of weeks to months.

To answer the question: "Wouldn't this require a huge network of various
proxy IPs to constantly fetch new data from the site without being flagged and
blacklisted?"

This is why I gave the caveat of only looking at data that comes out at
certain times. That way you only have to hit the server once, when the data
comes out, or at least a few hundred times in the seconds leading up to the
data's release. :)
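
For anyone curious about the mechanics, here is a minimal Python sketch of that timed-release pattern. The URL and release time are hypothetical placeholders; the real work is in sourcing feeds, structuring the payload, and delivering it fast:

    # Poll a scheduled release as fast as politely possible.
    # DATA_URL and RELEASE_TIME are hypothetical placeholders.
    import datetime
    import time
    import requests

    DATA_URL = "https://example.gov/report.csv"
    RELEASE_TIME = datetime.datetime(2014, 8, 1, 8, 30, 0)

    # Sleep until just before the scheduled release...
    while datetime.datetime.now() < RELEASE_TIME - datetime.timedelta(seconds=2):
        time.sleep(0.5)

    # ...then poll tightly until the payload appears.
    while True:
        resp = requests.get(DATA_URL)
        if resp.status_code == 200 and resp.content:
            break
        time.sleep(0.05)

    # Structuring the data and delivering it in seconds is the paid part.
    print(resp.content[:200])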

~~~
jyu
That is a fairly surprising opportunity. I have experience monitoring/scraping
thousands of government websites for a different purpose. Considering some
government sites have a round trip of well over 5 seconds, it seems like it'd
be a fun challenge to parse, format, and deliver it that fast.

What types of data formatting are you talking about here? Would it require a
unique template for each individual site?

------
fnbr
I work as a research analyst for a Canadian provincial opposition party. Most
government data is in terrible HTML tables, often dynamically generated, and
almost none of it is in an easily machine readable format. I spend a lot of
time downloading PDF files of data and converting them to JSON formats.

I have two main recurring scrapes:

- Political donations. Every donation to a political party in my province
above ~$300 is posted publicly on a gov't website (in a PDF). I use the data
to run machine learning algorithms to predict who is most likely to want to
donate to my party.

- Public service expenses. My province has a "sunshine list" which publishes
the salaries and contracts for all senior government officials. We grab it
weekly (as once someone quits the gov't, their data disappears).

One tool that you could consider building is an easily accessible expense
website, where people can enter the name of a public official and see all
their expenses, including a summary of the total amount spent. There have been
a number of massive expense scandals here in Canada [1, 2].

[1] [http://news.nationalpost.com/tag/alison-redford/](http://news.nationalpost.com/tag/alison-redford/)
[2] [http://en.wikipedia.org/wiki/Canadian_Senate_expenses_scandal](http://en.wikipedia.org/wiki/Canadian_Senate_expenses_scandal)

~~~
dataminer
This is very interesting work. Regarding your last point, I came across this
site, which is searchable by employee name or government organization:
[http://canada.landoffree.com/](http://canada.landoffree.com/).

I think the salary and expense disclosure is only for Ontario, based on the
sunshine list.

~~~
fnbr
I don't know much about the civil service, but I know that most provincial
assemblies have expense disclosures, at least for the elected officials, e.g.
[1], [2]. Federal cabinet ministers also have to disclose all of their office
expenses (can't find a link, but I know it exists).

[1] BC:
[http://www.leg.bc.ca/Mla/remuneration/index.htm](http://www.leg.bc.ca/Mla/remuneration/index.htm)
[2] Alberta:
[http://alberta.ca/travelandexpensedisclosure.cfm](http://alberta.ca/travelandexpensedisclosure.cfm)

------
lumpypua
I've had three primary uses of web scraping. The hard part for me has never
been speed. Getting the results structured is somewhere between easy and
hideously complicated.

1. Reformatting and content archival (lag times of hours to days are no
prob).

As an example, I put together a site to archive comments of a ridiculously
prolific commenter on a site I follow. I needed the content of his comments,
as well as the tree structure to shake out all the irrelevant comments leaving
only the necessary context. Real time isn't an issue. Up until recently it ran
on a weekly cron job. Now it's daily.

2. Aggregating and structuring data from disparate sources (real time can
make you money).

I work in real estate. Leasing websites are shitty, and the information
companies are expensive and also kinda shitty. Where possible we scrape the
websites for building availability, but a lot of the time that data is buried
in PDFs. For a lot of business domains, being able to scrape data in a
structured way from PDFs would be killer if you could do it! I guarantee the
industries chollida1 mentioned want the hell out of this too. We enter the
PDFs manually. :(

Updates go in monthly cycles, so timeliness isn't a huge issue. Lag times of
~3-5 business days are just fine, especially for the things that need to be
manually entered.

This is exactly the sort of scraping that Priceonomics is doing [1]. They
charge $2k/site/month. Hopefully y'all are making that much.

3. Bespoke, one-shot versions of #2.

One-shot data imports, typically to initially populate a database. I've done a
ton of these and I hate them. An example is a farmers' market project I worked
on. We got our hands on a shitty national database of farmers markets; I ended
up writing a custom parser that worked in ~85% of cases and we manually
cleaned up the rest. The thing that sucks about one-shot scrape jobs from bad
sources is that they almost always mean manual cleanup. It's just not worth it
to write code that works 100% of the time when it will only be used once.

Make any part of structuring scraped data easier and you guys are awesome!

[1] [http://priceonomics.com/data-services/](http://priceonomics.com/data-services/)

~~~
samcrawford
There are services that cover at least part of what you mentioned. These
effectively provide you a tool to visually build a scraper and then they
automate the scraping in the background, creating an API or spreadsheet of the
data.

Import.io is one example, and I think there's another more recent YC-backed
one. I tried using import.io a little while back, but without much joy.

~~~
lumpypua
We have a non-tech intern and import.io looks like a great tool to get
him chewing up data. I'm playing with it now. Why didn't it work out for you?
Beyond the wrapped browser interface being a little funky, lol. (Edit: eugh,
selecting data for import is really clunky.)

Ask HN: Anybody got a visual scraping service they like?

~~~
samcrawford
It was the data extraction and selection process I couldn't get to work. I was
trying to scrape a particular search on autotrader.co.uk (I wanted more up to
date results than their daily emails provide, and I wanted to filter out cars
that had been written off). I don't remember all the details, but I followed
the tutorial video and got to the stage where you select a single item that
matches your criteria and it's supposed to extrapolate from there. However, I
just seemed to be stuck in an infinite loop of it asking me to do this.

~~~
mattmanser
I found you often have to select two, then it figures it out. I assumed it was
probably because of alternating odd/even row CSS classes.

------
Smerity
Partial plug, but very related to the topic: if you're doing large scale
analysis on the web and you don't want to have to actually run a large scale
crawl, use the Common Crawl dataset[1]! Common Crawl is a non-profit
organization that wants to allow anyone to use big web data.

I'm one of the team behind the crawl itself. Last month (July) we downloaded 4
billion web pages. Thanks to Amazon Public Datasets, all of that data is
freely distributed via Amazon S3, under a very permissive license (i.e. good
for academics, start-ups, businesses, and hobbyists). If your hardware lives
on EC2, you can process the entire thing quickly for free. If you have your
own cluster and many many terabytes of storage, you can download it too!

People have used the dataset to generate hyperlink graphs[2], web table
content[2], microdata[2], n-gram and language model data (à la Google
N-grams)[3], NLP research on word vectors[4], and so on, so there's a lot that
can be done!

[1]: [http://commoncrawl.org/](http://commoncrawl.org/) [2]:
[http://webdatacommons.org/](http://webdatacommons.org/) [3]:
[http://statmt.org/ngrams](http://statmt.org/ngrams) [4]:
[http://nlp.stanford.edu/projects/glove/](http://nlp.stanford.edu/projects/glove/)
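
For a feel of what processing looks like, here is a hedged Python sketch that streams a single WARC file and extracts page URLs. It assumes the warcio library, and the WARC path below is a placeholder (real paths are listed in each crawl's warc.paths file):

    # Stream one Common Crawl WARC file and print the URL and size of
    # each captured page. The WARC path is a placeholder.
    import requests
    from warcio.archiveiterator import ArchiveIterator

    WARC_URL = ("https://data.commoncrawl.org/crawl-data/"
                "CC-MAIN-2014-23/.../example.warc.gz")  # placeholder path

    resp = requests.get(WARC_URL, stream=True)
    for record in ArchiveIterator(resp.raw):
        if record.rec_type == "response":
            url = record.rec_headers.get_header("WARC-Target-URI")
            body = record.content_stream().read()
            print(url, len(body))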

~~~
rgrieselhuber
What would it take in terms of resources, etc. to get to the point where
Common Crawl was doing web scale crawls on a regular basis?

~~~
Smerity
We're aiming to do monthly crawls from this point on. The main holdup was
automating the intensive manual steps of our crawl process. Now we have
scripts that make running our 100-node EC2 cluster and processing the
terabytes of web data relatively trivial.

If anyone wants to discuss sourcing well distributed crawl lists for billions
of pages per month, we'd love to chat. We want to make sure we cover a diverse
variety of languages and domains. Given that we're trying to get a good sample
of the web, that's a difficult proposition!

------
michaelt
A lot of services with online billing refuse to send bills by e-mail, instead
requiring users to log into their websites.

No doubt the companies would justify this by saying e-mail isn't secure
enough. The side-effect that it'll stop many users bothering to look at their
bill isn't why they do it at all, no sir.

I've been considering making a web scraper that goes to the phone company,
electricity company, gas company, broadband company, electronic payslips,
bank, stockbroker, AWS and so on; logs in with my credentials; downloads the
PDF (or html) statements; and sends them by e-mail.
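
A hedged sketch of what one of those fetchers might look like in Python; every URL and form field below is a hypothetical placeholder, since real login flows differ per provider (and often involve CSRF tokens or two-factor steps this ignores):

    # Log in with a session, grab the latest statement PDF, and mail it.
    # All URLs, form fields, and addresses here are placeholders.
    import smtplib
    from email.message import EmailMessage
    import requests

    session = requests.Session()
    session.post("https://utility.example.com/login",
                 data={"username": "me", "password": "secret"})
    pdf = session.get("https://utility.example.com/statements/latest.pdf")

    msg = EmailMessage()
    msg["Subject"] = "Latest statement"
    msg["From"] = msg["To"] = "me@example.com"
    msg.add_attachment(pdf.content, maintype="application",
                       subtype="pdf", filename="statement.pdf")
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)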

Of course, such a web scraper would need my online banking credentials, so I'm
not in the market for a software-as-a-service offering.

~~~
xur17
I started working on a tool like this over the weekend to pull down my bills,
and pay the balance.

I think there is a market for something like Mint for bill paying - it's a
bit of a pain to remember when I have to pay all of my bills each month, make
sure to go through each one, and pay the balance on time.

------
jawns
I scrape about 60 blogs and news sites that deal with a niche topic and
examine all the hyperlinks. If more than one of them links to the same page, I
assume that it's a page that's generating some buzz, so I send it out in an
email. It's proved to be a generally reliable assumption.

~~~
Torgo
Would love it if you posted a blog article sometime about the technical
details.

~~~
jawns
There's really not much to it. Scrape each site/feed every X minutes, find all
the hyperlinked URLs on the page, add them to a database table ... and if
they're already in the table, send out an email with links to the "buzzed
about" URL, as well as all of the sites/feeds that mention it. I keep the
links in the table for about a month.
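
A rough Python sketch of that loop, assuming requests, BeautifulSoup, and SQLite (the site list is a placeholder, and the email step is reduced to a print):

    # Collect links per source; a URL seen from 2+ sources is "buzz".
    import sqlite3
    import requests
    from bs4 import BeautifulSoup

    SITES = ["https://niche-blog-1.example.com",
             "https://niche-blog-2.example.com"]  # placeholders

    db = sqlite3.connect("links.db")
    db.execute("CREATE TABLE IF NOT EXISTS links "
               "(url TEXT, source TEXT, UNIQUE(url, source))")

    for site in SITES:
        soup = BeautifulSoup(requests.get(site).text, "html.parser")
        for a in soup.find_all("a", href=True):
            db.execute("INSERT OR IGNORE INTO links VALUES (?, ?)",
                       (a["href"], site))
    db.commit()

    buzzed = db.execute("SELECT url FROM links GROUP BY url "
                        "HAVING COUNT(DISTINCT source) > 1").fetchall()
    print(buzzed)  # email these instead, and expire rows after a month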

------
allegory
I scrape Gumtree and eBay hourly using a python script for certain things I
want under a certain price. The script sends me an email with the link in it
and I get on top of it sharpish.
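
The shape of such a script is roughly this (a hedged Python sketch; the search URL and CSS selectors are hypothetical, and it would run from an hourly cron job):

    # Find listings under a price cap and email the links.
    # The URL and selectors below are made up; inspect the real page.
    import smtplib
    from email.message import EmailMessage
    import requests
    from bs4 import BeautifulSoup

    SEARCH_URL = "https://www.gumtree.com/search?q=road+bike"  # placeholder
    MAX_PRICE = 150

    soup = BeautifulSoup(requests.get(SEARCH_URL).text, "html.parser")
    hits = []
    for ad in soup.select(".listing"):  # hypothetical class name
        price = float(ad.select_one(".price").text.strip("£ "))
        if price <= MAX_PRICE:
            hits.append(ad.select_one("a")["href"])

    if hits:
        msg = EmailMessage()
        msg["Subject"] = "Cheap listings spotted"
        msg["From"] = msg["To"] = "me@example.com"
        msg.set_content("\n".join(hits))
        with smtplib.SMTP("localhost") as smtp:
            smtp.send_message(msg)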

Managed to bag a lot of stuff over the last couple of years for not much
money.

If someone bags this up as a service I'd pay for it.

~~~
eddywebs
@allegory could you share the script with us?

~~~
allegory
Would love to, but not at the minute because it has hard-coded credentials for
the eBay API in it. It's on my TODO list to tidy it up. Will stick it on
GitHub and post a Show HN on it soon :)

I've got one that monitors Amazon prices for sudden lows as well.

~~~
ToastyMallows
I'll second a want for this script, let us know when it's up!

------
jpetersonmn
I do a lot of scraping for my day job. We have a business intelligence team
that will build us reports from the data we have. However, I find that this
process is incredibly slow, and sometimes we only need to compile the data
for a one-off project. I used to use VB.NET for this, as that's what I started
learning programming with. Now I use python/requests/bs4 for all my scraping
scripts.

I've started working on a new website that will use data scraped from several
vBulletin forums. I've found that even two vBulletin forums running the same
version may have completely different HTML to work with. I'm assuming it's the
templates they use that change it so much.

I'm setting up the process so that the web scraping happens from different
locations than the server where the site is hosted. The scraping scripts
upload to the webserver via an API I've built for this. I mostly did this
because for now I'm just using a free PythonAnywhere account and their
firewall would block all of this without a paid account, and also so that none
of these sites would see the scraping traffic coming from my website.
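
The split looks roughly like this (a sketch; the forum selector, endpoint, and token are all hypothetical):

    # Scrape on one machine, then push rows to the site's own API.
    import requests
    from bs4 import BeautifulSoup

    API_ENDPOINT = "https://mysite.example.com/api/threads"  # placeholder
    API_TOKEN = "secret-token"                               # placeholder

    html = requests.get("https://forum.example.com/forumdisplay.php?f=1").text
    soup = BeautifulSoup(html, "html.parser")
    rows = [{"title": a.get_text(strip=True), "url": a["href"]}
            for a in soup.select("a.title")]  # selector varies per skin

    requests.post(API_ENDPOINT, json=rows,
                  headers={"Authorization": "Bearer " + API_TOKEN})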

~~~
miket
At Diffbot, we have an automated discussion thread parser, currently in beta
testing, that might be exactly what you need. Send me a note at
mike@diffbot.com and I'd be glad to hook you up.

(disclosure: I work there)

------
Cyranix
When I worked at MyEdu, I didn't actually sign on with the dev team originally
— I worked on "the scraper team". We scraped college and university websites
to get class schedule information: which classes were being taught, broken
down by department and course number; by which professors; at which times on
which days. If you're ever looking for an interesting challenge, I would
encourage you to try getting this data.

Well-formed HTML is the exception rather than the rule and page navigation is
often "interesting". Sometimes the school's system will use software from
companies like Sungard or PeopleSoft, but there's customization within that...
and of course, there's no incentive for the schools to aggregate this
information in a common format (hence MyEdu's initiative), so there are plenty
of homegrown systems. In short, there's no one-size-fits-all solution.

* NOTE: If you do attempt this, I insist that you teach throttling techniques from the very start. Some schools will IP block you if you hit them too hard; other schools have crummy infrastructure and will be crushed by your traffic. Scrape responsibly!
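
In that spirit, a minimal throttling helper one might teach first (plain Python; the delay values are arbitrary starting points):

    # Space out requests and back off exponentially when the server
    # errors or blocks you. Identify yourself in the User-Agent.
    import time
    import requests

    def polite_get(url, delay=2.0, retries=3):
        headers = {"User-Agent": "course-scraper (contact: me@example.com)"}
        for attempt in range(retries):
            resp = requests.get(url, headers=headers)
            if resp.status_code == 200:
                time.sleep(delay)  # fixed gap between successful fetches
                return resp
            time.sleep(delay * 2 ** attempt)  # back off on trouble
        resp.raise_for_status()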

~~~
vital101
I've attempted to do this before for a side project (RateMyProfessor, but for
textbooks!) and it's incredibly hard to do accurately. One of the (many)
issues I ran into was that some schools still have all of their course data in
PDF format, in addition to the problems Cyranix listed above.

Much respect to any person or team that has to wade through this stuff.

------
finkin1
We do a lot of live web scraping of product information from retail sites for
[http://agora.sh](http://agora.sh). We basically scrape all of the essential
product info and offer it to the user in an optimized view (we call it the
'product portal') that can be accessed without having to load a new page. This
reduces tab sprawl and provides a general improvement to a lot of shopping
workflows, such as just wanting to see larger/more images of the product
(likely clothing) and being able to do so with an overlay on the existing
page.

------
richardbrevig
My first scraping project was well over 10 years ago in college. I was a
member of the education club and we wanted to get funding, so I convinced the
college of education to allow us to charge $10 automatically to students in
their school. But then the administration dragged their feet on giving us a
list to submit to the accounting office for billing. One of the many
professors emailed me a list via Outlook that they had copied off the site, so
I was able to look at the HTML structure of their list. The university used
basic security (htaccess) and didn't verify that you had permission for a task
once you were in; I had access because I worked for the dean of men. So I
scraped all the faculty's student lists and then used another system behind
the htaccess point to get all the relevant information on each student. I
compiled a list of 300 students and submitted it, getting the club $3,000 in
funding. The college of ed office staff were freaked out because they had no
clue how I came up with the student roster (no one in their office gave it to
me), but nothing came of it.

Been scraping a lot lately, but mostly:

- government websites for license holders

- creating lists of businesses for different segments (market
research/analysis)

- using those lists to scrape individual sites and do analysis (how many
use Facebook/YouTube/etc.)

------
jasallen
I am currently scraping for brand product and nutrition data. Having to build
custom scrapers per brand is hell.

I have a dream to use something closer to OCR against a rendered page, rather
than parsing DOM. That way it would be less custom, and I could say, for
instance, "find 'protein', the thing to the right of that is the protein
grams".

I, personally, don't know how to do this, but I'd be willing to pay for a more
generic way to scrape nutrition data (email in profile :) )

~~~
imns
There are companies out there that already have this info. Check out
[http://kwikeesystems.com](http://kwikeesystems.com).

------
oz
I once wrote a scraper for a Yellow Pages site in Python. It pulled down the
business category, name, telephone and email for every entry, and returned a
nicely formatted spreadsheet. The hours I spent learning the ElementTree API
and XPath expressions have paid for themselves several times over, now that I
have a nicely segmented spreadsheet of business categories and email
addresses, which I target via email marketing.
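
The approach looks roughly like this with lxml (a sketch; the URL and XPath expressions are hypothetical and would come from inspecting the real pages):

    # Pull category/name/phone/email per listing into a CSV.
    import csv
    import requests
    from lxml import html

    page = html.fromstring(
        requests.get("https://yellowpages.example.com/plumbers").text)

    with open("businesses.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["category", "name", "telephone", "email"])
        for listing in page.xpath("//div[@class='listing']"):  # hypothetical
            writer.writerow([
                listing.xpath("string(.//span[@class='category'])"),
                listing.xpath("string(.//h2)"),
                listing.xpath("string(.//span[@class='phone'])"),
                listing.xpath("string(.//a[@class='email']/@href)"),
            ])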

~~~
soundjack
As someone responsible for search on a yellow pages company, I can confirm
that most YP websites have little to no protection against this. Company
information is usually public anyway. We just make it very easy for you to get
it :)

------
dennybritz
I'm working on a startup that has web scraping at its core. The vision is a
bit larger and includes fusing data from various sources in a probabilistic
way (e.g. the same people, products, or companies found on different sites
with ambiguous names and information; this is based on the research I've done
at uni). However, I found that there are no web crawling frameworks out there
that allow for large-scale and continuous crawling of changing data. So the
first step has become to actually write such a system myself, and perhaps even
open-source it.

In terms of use cases, here are some I've come across:

- Product pricing data: Many companies collect pricing data from e-commerce
sites. Latency and temporal trends are important here. Believe it or not,
there are still profitable companies out there that hire people to manually
scrape websites and input data into a database.

- Various analyses based on job listing data: Similar to what you do by
looking at which websites contain certain widgets, you can start understanding
job listings (using NLP) to find out which technologies are used by which
companies. Several startups are doing this. Great data for bizdev and sales.
You can also use job data to understand technology hiring trends, understand
the long-term strategies of competitors, or use them as a signal for the
health of a company.

- News data + NLP: Crawling news data and understanding facts mentioned in
the news (using Natural Language Processing) in real time is used in many
industries. Finance, M&A, etc.

- People data: Crawl public LinkedIn and Twitter profiles to understand when
people are switching jobs/careers, etc.

- Real-estate data: Understand pricing trends and merge information from
similar listings found on various real estate listing websites.

- Merging signals and information from different sources: For example, crawl
company websites, Crunchbase, news articles related to the company, and the
LinkedIn profiles of employees, and combine all the information found in the
various sources to arrive at a meaningful structured representation. This
isn't limited to companies; you can probably think of other use cases.

In general, I think there is a lot of untapped potential and useful data in
combining the capabilities of large-scale web scraping, Natural Language
Processing, and information fusion / entity resolution.

Getting changing data with low latency (and exposing it as a stream) is still
very difficult, and there are lots of interesting use cases as well.

Hope this helps. Also, feel free to send me an email (in my profile) if you
want to have a chat or exchange more ideas. Seems like we're working on
similar things.

~~~
fnbr
How do you use a probabilistic approach to scraping data? Were you able to get
a low number of false positives?

~~~
dennybritz
Sorry for the confusion. They are used for "merging" scraped data from various
sources, not in the scraping process itself. For example, they help in
figuring out if similar-sounding listings on related websites refer to the
same "thing".

If interested, take a look at this paper (and related ones):
[http://www.cs.ubc.ca/~murphyk/Papers/kv-kdd14.pdf](http://www.cs.ubc.ca/~murphyk/Papers/kv-kdd14.pdf)

~~~
fnbr
That makes more sense. Thanks! I'll check out the paper. I was hoping you had
some revolutionary new scraping method.

------
andy_ppp
YQL is surprisingly brilliant:

[https://developer.yahoo.com/yql/](https://developer.yahoo.com/yql/)

------
jaequery
I'm currently scraping data such as the tweets, comments, and likes a website
gets each day so I can graph them over time.

One thing I'm having a hard time with is scraping backlinks to websites.
Currently I use Bing, but they charge after around 5,000 queries. I really
wonder how companies like SEOmoz do this daily against millions of websites.

------
contingencies
1. Monitoring competitors. By monitoring product/service offerings close to
my own operations, I can get bizdev people on the phone and speak to partners
when I see indications in the public marketplace that someone has a better
sourcing deal than I do. Haven't done this in five years or so.

2. Gathering basic data that should be freely available anyway (like currency
exchange rates, global weather, etc.). This is always done carefully and with
a light touch, with maximum respect for the load imposed on targeted systems.
Again, haven't bothered in about five years.

3. Automating content acquisition. For search engines, media libraries, etc.
This is more like ten years ago. These days there's so little call for it...
maybe if I ran a boutique hotel chain in a copyright-isn't-respected
jurisdiction and wanted to provide a fat library of in-room entertainment...

------
viggity
I just started getting into scraping (mostly using import.io), mostly because
it is a complement to what I really care about - data visualization. I've
gotten a ton of interest in my side project, and despite that (I haven't
opened the beta yet), I'm still worried it won't be as lucrative as creating
some niche reporting services for various verticals (real estate, auto, etc.)
- essentially data that is very tabular and not hierarchical or qualitative.
You can think of my work as pivot charts on crack. If someone already
pre-compiled this data, I'd much rather pay for it than do it myself. My value
add is the analysis/viz done on top of the data. If you want to chat, feel
free to email me; contact info is in my profile.

~~~
sogen
Machete.io is awesome!

~~~
viggity
Thanks!

------
pbowyer
In terms of ideas: how to scrape JavaScript-heavy sites. This one has broken
me and the Import.io helpdesk:
[http://www.usrentacar.co.uk/](http://www.usrentacar.co.uk/). I'm now trying
CasperJS/PhantomJS, but no joy there either.
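
One blunt workaround for JS-heavy pages is to drive a real browser and scrape the rendered DOM. A minimal Selenium sketch (an alternative to CasperJS/PhantomJS; the fixed sleep is crude):

    # Let the browser execute the page's JavaScript, then grab the DOM.
    import time
    from selenium import webdriver

    driver = webdriver.Firefox()
    driver.get("http://www.usrentacar.co.uk/")
    time.sleep(5)  # crude wait for scripts to render content
    rendered = driver.page_source  # feed this to bs4/lxml as usual
    driver.quit()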

I'm looking to buy a house, and not all local estate agents post to Rightmove
(or some post with a 24-hour delay). Trying to submit the search form on each
agent's own hideous website, parse the results, and get a standard structure
across them all is painful - I gave up in the end.

Once I have the data, the next challenge is analysing it (geolocation, commute
times, distance to amenities, etc.), which is a separate problem in itself.

------
hallz
I made a JavaScript-based automated scraper in a Win7/Vista desktop gadget. It
was originally for displaying the remaining credit on my mobile. You put in
search terms and a website, and it scrapes the site and tries to return what
it thinks you are looking for (weather, stock prices, remaining balance,
etc.). It works OK. I think there is definitely demand for a well-made
scraper/alert app/service though.

The app is here:
[http://robotification.com/creditile/](http://robotification.com/creditile/)

Also, didn't Yahoo make a thing for scraping - "Yahoo Pipes" or something?

------
mushfiq
I scraped hundreds of major US universities for one of my clients, who used
the data to build a mobile platform for students that integrated different
services. I mainly grabbed courses, class schedules, bus routes, and the email
addresses of both professors and students. I still have around one and a half
million email addresses of academics.

I also did some e-commerce information scraping.

One of the most interesting jobs was for a data-selling company. They asked me
to collect geo, disaster, finance, and tweet data, and we applied ML and
statistics to produce forecasts from the historical data.

------
sayangel
What about [https://www.kimonolabs.com/](https://www.kimonolabs.com/)? It
makes it pretty easy to collect data and presents it in a structured format
(JSON).

------
kohanz
I have a side-project which scrapes play-by-play data from NBA games to gain
more insights into these games.

Here is an example of the (un-finished) side-project:
[http://recappd.com/games/2014/02/07](http://recappd.com/games/2014/02/07)

I'm far from the only person scraping this data. Look at sites like
[http://vorped.com](http://vorped.com) and
[http://nbawowy.com](http://nbawowy.com) for even better examples.

------
aruggirello
Scraping is really quite a complex process, and not everybody does it right.
Do you employ a (distributed?) crawler pool? What if a scraped page goes
offline (404/410)? How do you handle network errors and 403s / getting caught
(and possibly blocked) - if at all? Do you conceal the scraping by employing a
fake user agent? Do you (sometimes?) request permission to scrape from the
relevant webmasters? These are the things that can make it or break it, IMHO.

~~~
aruggirello
BTW, I write tailor-made PHP+MySQL scraper scripts targeting English- or
Italian-language sites; contact me for more info :)

------
jerhinesmith
A while ago, I had the idea of creating a travel site that catered to the
group of people that enjoy traveling but aren't bound by time (i.e. I want to
go to X, but I don't care when -- just show me the cheapest weekend for the
next 3 months).

Anyway... it turns out that flight APIs are ridiculously non-existent. I ended
up scraping two different airline sites, but since it was against their terms,
I never took the site any further.

~~~
cblock811
The hospitality and travel industries are very slow to update their
technologies. I used to work with Ritz Carlton and St. Regis and even those
brands are practically in the stone age, so I can't imagine how scraping for
flight info would go.

I've thought of even building a simple event aggregator for some friends in
the industry and they are blown away that it's possible. Then I remember how
many venues are in cities like Charlotte and San Francisco and realize why
these industries lag in technology. There just isn't a large pool of
developers who want to solve their problems.

Do you have any projects you are currently working on?

~~~
jerhinesmith
Completely agreed on the "lag in technology" -- couple that with the fact
that even if you do manage to book a flight as an affiliate, most providers
only give you a flat fee (as opposed to hotel affiliates, which tend to give a
percentage).

Disclaimer: Those claims were relevant the last time I investigated (>1 year
ago). It could have changed by now.

On the plus side, building out the flight tool really got me hooked on hacking
on things - which eventually led me to leave my last job and co-found
something new. :)

------
hatethis
Frankly, I'm annoyed to see this topic here. Most people who have taken to
scraping are low-life scum. They see content that others have spent months or
years producing, simply set up a site that aggregates all of that information,
and then sit back and collect revenue from the ads or reseller links they
paste everywhere.

People who put in a few hours of work to take advantage of other people's hard
work piss me off. :/

~~~
davidy123
Are you against hyperlinks too? Seriously, computers naturally make content
reusable. Try to imagine the next level when we don't depend on hoarding. The
semantic web imagined this, though it was too complicated. But the idea still
has huge benefits; every web site a linked database, with content precisely
described. But I think many orgs are too afraid they don't really have
something to offer in the big picture (that's what so much of business is
about).

I do a lot of work in scraping, but it's for non profit healthcare, academic,
and general knowledge augmentation. It's painful, but the only way to get to
the next level without waiting a thousand years for everyone to make their own
consistent API and metadata descriptions.

------
murukesh_s
I used to scrape the web for a daily-deals search engine I wrote for a client
in 2010. But we scraped in real time, as the number of sites was really low
(in the tens).

Pre-crawled copies with a distributed processing platform could be cool. You
could come up with a better search engine with programmable rules that are
edited collaboratively (like Wikipedia).

------
TomBeckman
I've used Mozenda for web scraping. They have a free trial and can scrape some
complex formats, like drilling down several levels in a website or database.
They can also parse PDFs.

See [https://www.mozenda.com](https://www.mozenda.com)

------
pknerd
I love scraping and even made a subreddit for the purpose, where I showcase a
few of my public projects. Any scraping lover can join in.

[http://www.reddit.com/r/scrapingtheweb/](http://www.reddit.com/r/scrapingtheweb/)

------
jhonovich
We scrape to find new pages on websites within our industry. Often the new
information is not formally announced, or only weakly so. We regularly uncover
valuable new info about company developments and changes.

------
hpagey
I currently scrape my Lending Club account to automatically trade loan notes
on the secondary market. This way I can buy/sell notes that satisfy my
criteria. If anyone is interested in this, I can send you the scripts.

------
keviv
I scrape the Google Play Store for app data, top-ranked apps, etc. Unlike
iTunes, it doesn't have a public API/feed, so in order to get the data I need,
I scrape the Play Store on a pretty regular basis.

------
amitagarwal
I scrape Google to save search results.

[http://www.labnol.org/internet/google-web-scraping/28450/](http://www.labnol.org/internet/google-web-scraping/28450/)

------
thinkcomp
PlainSite ([http://www.plainsite.org](http://www.plainsite.org)) uses about 20
different scrapers/parsers to download and standardize legal materials.

------
skanga
I scrape Craigslist for side-by-side comparisons of stuff I want to buy there,
e.g. cars, motorcycles, etc. Maybe real estate would be a good target too.

------
kurrent
Sports scores, statistics, etc. are always in high demand for scraping, and
great for getting people interested when learning scraping techniques.

~~~
jeffclark
Yeah, very much agreed. I have to scrape schedules together to populate games
for my ticket site ([http://www.boxrowseat.com](http://www.boxrowseat.com)).

The popular sites are wising up to scraping, so it can feel like an exercise
in futility. But when it works, it's super rewarding.

------
rtcoms
I am looking for a dataset listing all universities and their associated
colleges. I haven't found anything like it anywhere.

~~~
cblock811
So like dmoz.org, but for higher education?

~~~
rtcoms
Yes, sort of like that. I found a good list on
[http://www.4icu.org/](http://www.4icu.org/), and Wikipedia has lists of many
colleges and universities too.

But information about all the colleges associated with a university is
generally available only on that university's own site, and many times it's
in PDFs.

Having said that, my requirement above (colleges associated with a university)
is itself secondary. I have found lists of universities in many places, but
not lists of colleges.

So for now, just a list of all colleges with locations is enough for me.

------
cperciva
I scrape the Hacker News website to get links and numbers of points. These
allow me to produce my "top 10" lists.
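
For anyone wanting to try the same, a small sketch (the CSS classes reflect HN's current markup and may change):

    # Scrape HN front-page links and points, then print a top 10.
    import requests
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(requests.get("https://news.ycombinator.com/").text,
                         "html.parser")
    stories = []
    for row in soup.select("tr.athing"):
        title = row.select_one(".titleline > a")
        score = row.find_next_sibling("tr").select_one(".score")
        if title and score:  # job posts have no score
            stories.append((int(score.text.split()[0]), title.text))

    for points, title in sorted(stories, reverse=True)[:10]:
        print(points, title)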

------
Mikeb85
I regularly scrape financial data - historical prices, live quotes, company
information, quarterly reports, etc...

------
bussiere
Reddit and Twitter accounts, for spotting trends.

Twitter is so vast, though, that you may want to categorize accounts.

Reddit is a good source for a lot of info.

~~~
jwcrux
Why not use PRAW? It's a very mature, useful library for the Reddit API.
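
For reference, fetching posts with PRAW is only a few lines (the credentials below are placeholders; you get real ones by registering an app with Reddit):

    # List the hottest posts in a subreddit via the Reddit API.
    import praw

    reddit = praw.Reddit(client_id="...",      # placeholder
                         client_secret="...",  # placeholder
                         user_agent="trend-scraper by /u/me")

    for post in reddit.subreddit("python").hot(limit=10):
        print(post.score, post.title)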

~~~
bussiere
I will dig into this, thanks.

------
dochtman
I used to scrape a bunch of webcomics to turn them into RSS feeds. I still
have one or two running, actually.

~~~
JoshTriplett
Try [https://www.comic-rocket.com/](https://www.comic-rocket.com/) and see if
it has all the comics you want to read. It supports either RSS, a web-based
reader, or an Android app.

------
catshirt
I'm building a database of games and scores. Web scraping has been very
helpful.

Maybe instead of trying to change up the content, try changing up the method,
i.e. do a talk on running crawlers/scrapers on an interval to seed your
database (instead of just "scraping").

------
yutah
If you could publish a price list for items sold at major grocery chains, I am
sure that many people could use it (bonus if it includes aisle numbers).

~~~
_neil
Similar to [http://www.aislefinder.com](http://www.aislefinder.com) and
[http://www.supermarketapi.com](http://www.supermarketapi.com)

------
Daviey
Online personal banking

