
Ask HN: Any data scraping project ideas you can share? - dchuk
I've done a good amount of scraping over the years, but haven't done much
recently. I'm getting an itch to do some side projects in this area, so I'm
interested to hear if anyone has a need for data that they can't currently
get, or can't get in a clean, structured way.

One example I've thought of recently (because of my own 9-5 job needs) is to
scrape all the heavy-duty truck companies' sites and expose make and model
data (and images) via an API paired with a VIN decoder. Each OEM obfuscates
its vehicle data in one way or another (JS widgets, only in PDFs, etc.) and as
far as I can tell, there aren't any API-based data sources for
heavy-duty/commercial vehicles.

Any other ideas?
======
staticautomatic
Aggregate data on weather and soil composition to identify areas of land
around the world most similar to famous wine producing regions.
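
A rough sketch of how such a comparison could work, assuming each region is
reduced to a few numeric features. The feature choice, the region names and
every number below are made-up placeholders, not real measurements; z-score
scaling plus Euclidean distance is just one simple way to measure similarity.

```python
import math

# Hypothetical features per region:
# (mean growing-season temperature in C, annual rainfall in mm, soil pH).
# All values are illustrative placeholders.
REFERENCE = (17.0, 800.0, 6.5)
CANDIDATES = {
    "candidate_a": (16.5, 780.0, 6.4),
    "candidate_b": (25.0, 300.0, 8.0),
    "candidate_c": (18.0, 900.0, 6.0),
}


def zscore_columns(rows):
    """Scale each feature to zero mean and unit variance so no unit dominates."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # Fall back to 1.0 when a column has no spread, to avoid dividing by zero.
    stds = [
        math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
        for c, m in zip(cols, means)
    ]
    return [tuple((x - m) / s for x, m, s in zip(row, means, stds)) for row in rows]


def rank_by_similarity(reference, candidates):
    """Return (name, distance) pairs, most similar candidate first."""
    names = list(candidates)
    scaled = zscore_columns([reference] + [candidates[n] for n in names])
    ref, rest = scaled[0], scaled[1:]
    distances = {n: math.dist(ref, v) for n, v in zip(names, rest)}
    return sorted(distances.items(), key=lambda kv: kv[1])


ranking = rank_by_similarity(REFERENCE, CANDIDATES)
```

With real data you would probably want many more features (and per-season
values), but the ranking step would look much the same.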

~~~
gyvastis
Wow! That's an awesome idea.

~~~
JJarrard
[http://www.agrimetrics.co.uk/](http://www.agrimetrics.co.uk/) There are a few
others using land records for farming, which is not to say there isn't still
room for more.

------
richardknop
Well, the obvious scraping business ideas would be hotel rooms and airplane
tickets. You might think this area is already saturated, but I think there’s
still room for disruption and you can capture niches that giants like Expedia
/ Skyscanner / Agoda / Booking don’t handle well. Or you could do B2B with
these companies.

Also, what about scraping restaurant menus and offering a food search engine?

~~~
dchuk
Can you share any examples/ideas for niches that those sites don't already
handle well?

~~~
anywherenotes
I just booked a vacation on Expedia. What I really wanted was to find a list
of rooms by price (airfare included) which can comfortably sleep 4 people - so
I needed 4 actual beds. I tried looking at bigger rooms, but it looked like
booking 2 cheap rooms was less expensive, but I'm still not sure if that's
true. So basically you could see if you can accommodate large parties. Also,
when I booked the rooms, I think it made me select one type, but in theory, I
might have wanted one oceanfront room and one without the view.

------
dchuk
Here's another idea I just thought of randomly:

Scrape and monitor a company's competitor's job listings for them. Some of
that data might be difficult to get given the nature of job sites and
craigslist and such, but could be interesting to accumulate all of that
(including from the company's own site) so you can get an idea when they are
hiring.

Maybe.

~~~
gyvastis
Sorry if I've missed the main idea behind this, but why would that be
relevant?

~~~
BjoernKW
Looking at companies' hiring data is a great way to monitor competition.

If they try to hire for more positions than in the past they're probably
growing, conversely they might be stagnating if it's the other way round.

If they hire people with specific skills it might also tell you what they're
up to right now like going public or working on a new, supposedly secret
project. Take Apple for instance. A notoriously secretive company, previous
new projects like the iPhone, the Apple Watch and most notably a self-driving
car have first been revealed by their own job postings.

~~~
peternicky
This is in my opinion the same as buying at the height of a bubble; by the
time you get this data on your competitor, you will be way behind. Why not
spend resources on improving the offering?

~~~
BjoernKW
True. Still, keeping tabs on the competition is big business.

------
remyp
Am I the only one that worries about licensing and legal issues when it comes
to web scraping? I'd be terrified to build a product around it since one
lawsuit would threaten the core business.

~~~
gyvastis
Everything that you can open in the browser you can scrape without any
problem. Keep in mind, though, that the number of requests you send to those
parties shouldn't differ greatly from what a regular user would send. A user
doesn't open 1000 pages in 60 seconds.
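
One way to keep the request pace closer to a human's is to enforce a minimum
delay between requests. This is only a minimal sketch: the `Throttle` class
and its parameter names are hypothetical, and the right interval depends on
the site.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval_s, clock=time.monotonic, sleep=time.sleep):
        self.min_interval_s = min_interval_s
        # clock and sleep are injectable so the class can be tested without waiting.
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval_s has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval_s - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

You would call `throttle.wait()` right before each page fetch, e.g. with
`Throttle(min_interval_s=5.0)` to stay at roughly one page every five seconds.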

~~~
anywherenotes
Don't most sites claim they own the data? Like, could you legally scrape
reddit and make your own site?

------
zapperdapper
Some of the scraping projects I've done in the past have been where the
article content I wanted was on a large site with great content but the site
was awful to read - due to horrible combinations of pop-ups, colour schemes,
adverts etc etc. I would spider and download content, process, and build my
own database/simple CMS to make reading the content offline a much better
experience.

If you are just looking for a personal project, something like that might work
for you...
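
For the "process and store" part, here is a minimal standard-library sketch.
It assumes the pages are already downloaded as HTML strings; the table layout
and the function names are made up for illustration, and a real site would
likely need smarter extraction than "all visible text".

```python
import sqlite3
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the page's visible text as one space-separated string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def save_article(conn, url, html):
    """Store the cleaned text, replacing any earlier copy of the same URL."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles (url TEXT PRIMARY KEY, body TEXT)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO articles VALUES (?, ?)", (url, html_to_text(html))
    )
    conn.commit()
```

A single SQLite file is usually enough for offline reading, and you can put
whatever simple viewer you like on top of it.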

------
joshribakoff
It's a huge undertaking. You'd be competing with ACES/PIES, those guys charge
about $10k a year last I checked. So there's definitely room to undercut them,
if you can somehow get all that data.

~~~
dchuk
Thanks for the reply. I was thinking of just using the scraped data to create
an API that can take a VIN and give you the specs of that heavy duty vehicle
(engine, weight, body type options, etc). I couldn't even imagine how crazy it
would be to try and collect all of the parts data for every vehicle.
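
The VIN side of such an API could start with something like the sketch below.
It assumes the North American 17-character layout (check digit in position 9,
model year code in position 10, plant code in position 11); the field names
are my own, and mapping the fields to actual specs would still need the
scraped OEM data.

```python
# Letter values used by the check digit calculation (I, O and Q never appear).
TRANSLIT = {
    **{str(d): d for d in range(10)},
    **dict(zip("ABCDEFGH", range(1, 9))),
    **dict(zip("JKLMN", range(1, 6))),
    "P": 7,
    "R": 9,
    **dict(zip("STUVWXYZ", range(2, 10))),
}
# Position weights; position 9 (the check digit itself) has weight 0.
WEIGHTS = (8, 7, 6, 5, 4, 3, 2, 10, 0, 9, 8, 7, 6, 5, 4, 3, 2)


def check_digit(vin):
    """Compute the expected position-9 character for a 17-character VIN."""
    remainder = sum(TRANSLIT[c] * w for c, w in zip(vin, WEIGHTS)) % 11
    return "X" if remainder == 10 else str(remainder)


def split_vin(vin):
    """Split a VIN into its basic fields and verify the check digit."""
    vin = vin.strip().upper()
    if len(vin) != 17 or any(c not in TRANSLIT for c in vin):
        raise ValueError("VIN must be 17 characters and exclude I, O and Q")
    return {
        "wmi": vin[:3],             # world manufacturer identifier
        "descriptor": vin[3:8],     # manufacturer-specific vehicle attributes
        "check_digit_ok": vin[8] == check_digit(vin),
        "model_year_code": vin[9],
        "plant_code": vin[10],
        "serial": vin[11:],
    }


info = split_vin("1M8GDM9AXKP042788")  # a sample VIN whose check digit is X
```

The `wmi` and `descriptor` fields are what you would key the scraped spec
tables on, since their meaning is defined by each manufacturer.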

------
howscrewedami
Scrape product information from eBay and other auction sites. Have a machine
learning model that compares auction price vs. real price (or usual auction
price). If the auction price is good... buy the products and flip them. In
other words, you're basically building a system to help you find the best
possible products to flip.
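
As a starting point before any machine learning, a plain median comparison
already captures the idea. This sketch swaps the model for a median of past
sale prices; the dictionary keys, the 30% discount threshold and the minimum
sample size are all assumptions for illustration.

```python
from statistics import median


def find_deals(listings, price_history, discount=0.30, min_samples=5):
    """Return listings priced at least `discount` below the median past price."""
    deals = []
    for item in listings:
        past = price_history.get(item["product"], [])
        if len(past) < min_samples:
            continue  # too little history to estimate a usual price
        usual = median(past)
        if item["price"] <= usual * (1 - discount):
            deals.append({**item, "usual_price": usual})
    return deals
```

Each listing is a dict with `product` and `price` keys, and `price_history`
maps a product name to a list of past sale prices. Fees and shipping would
need to be subtracted before deciding whether a flip is actually profitable.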

------
dhruvkar
Searching for flights using rewards miles.

I think United & American used to have APIs that were shut down, so you'd need
to scrape account data and flights. It would work best as a desktop app. Other
airlines have APIs, but not sure how deep they are.

Huge pain point, especially when trying to combine different rewards programs.

~~~
ezekg
I had something similar a while back but it was eventually shut down by the
airline’s legal department. If it’s not provided via a public API, I doubt
scraping will turn out any different than my project.

~~~
dhruvkar
If it's a desktop app and the crawling is not centralized, I doubt they'd be
able to do much about it.

A lot of crawl-heavy SEO tools work this way.

~~~
ezekg
Idk, I had a free open source command line project that scraped flight data
and that got shut down. It might have been the particular airline, though,
because they specifically disallow scraping from _all_ third parties,
including e.g. Google Flights, SkyScanner, etc.

~~~
dhruvkar
Interesting. As a consumer, I'd love to see something in that space. I hate
paying for most things, but a desktop app that allows me to search for rewards
miles would be something I would pay a yearly fee for.

------
speps
Scrape websites like TripAdvisor, Amazon, etc. for the ratings and compute an
actual rating not based on averages. I've seen a few articles on how ratings
are shown on those websites recently and they never seem to actually reflect
the truth.
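
One common alternative to a plain average is a Bayesian average, which pulls
items with few ratings toward a prior. This is just one possible reading of
"not based on averages", and the prior mean and prior weight below are
arbitrary assumptions you would tune per site.

```python
def bayesian_average(ratings, prior_mean=3.0, prior_weight=10):
    """Shrink the mean rating toward prior_mean; fewer ratings, more shrinkage.

    prior_weight acts like that many imaginary ratings at prior_mean.
    """
    return (prior_mean * prior_weight + sum(ratings)) / (prior_weight + len(ratings))


few_perfect = bayesian_average([5, 5])        # two 5-star ratings
many_good = bayesian_average([4.5] * 200)     # two hundred 4.5-star ratings
```

Here the item with two hundred 4.5-star ratings ends up ranked above the item
with only two 5-star ratings, which a plain mean would get the other way round.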

------
gandutraveler
I have been trying to scrape travel places and recommendations data from
TripAdvisor and other travel sites. Read about Scrapy in another HN post
yesterday and have been trying to get it running. Would appreciate any help on
this.

sam.xenai [at] gmail [dot] com

~~~
gyvastis
Did you check out GitHub for open-source solutions? I'm sure the biggest names
in the market are already covered when it comes to scrapers.

------
skate22
Streaming sources for music from popular artists that are not available on
Spotify (like mixtapes).

------
uptownfunk
Good quality, accurate, high resolution stock / options / futures data

------
rosha
I recently had a similar idea that I ended up putting into practice: I built a
small search engine for used vehicles, "UkookU", from scratch. It did not take
me long to get the data and maintain it, because I did not need to build any
scrapers or care about sites blocking/abuse: I used a third-party scraping API
called ProxyCrawl [https://proxycrawl.com](https://proxycrawl.com), which let
me make a proof of concept of the idea for free. That shortened the time I
needed to build the Hadoop engine, etc.

I am now thinking of building a recruiting talent pool service based on
aggregated data from LinkedIn, Google, Bing, Yahoo, Facebook and many other
sites, and I am pitching something around it since I can get all the data
with ProxyCrawl.

I am also thinking of doing something with keyword ontologies for small
markets around me, using Google/Yandex data, and offering it as a free,
helpful tool in the form of a mobile app. How do you normally get the data for
your projects?

