
Ask HN: Do you scrape data? What do you use it for? - jmaccabee
I've got a theory that in 2018, most people or companies can use web scraping
tools to make some aspect of their lives easier.

For example, I wrote a script a few months back to scrape transit data and
send me a text when the subway is delayed for my morning commute. Maybe you're
helping your company scrape competitor prices to know when they change?

So - are you scraping data today? As part of a hobby project or for
professional purposes? What do you use it for? And if you don't, why not?
======
klez
I do scrape. We are making a sort of meta-search engine for long-term car
rental, so to bootstrap it we are scraping offers from various sites and
directing users to the original websites to actually go through with the
offer.

If anyone is interested in more detail, I can explain further.

~~~
deadcoder0904
Go on

~~~
klez
Ok, here it goes.

The scrapers are written in Python 2, because that's what the guys who started
the project were familiar with.

Most of the scraping is done by hand with XPath queries on the pages we
fetched, so no BeautifulSoup or anything like that. Again, I think that's
mostly because those who started the project weren't familiar with those
tools. It's not even that bad when I need to modify something (because a page
changes, etc.), as the code is very well written.
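For the curious, hand-rolled XPath scraping looks roughly like this. This is a minimal sketch, not their actual code: the markup, class names, and queries are all invented, and it uses the standard library's ElementTree (which supports a subset of XPath) instead of lxml so it runs anywhere.

```python
# Hand-rolled XPath-style scraping, sketched with the standard library.
# The sample markup and the queries are hypothetical examples.
import xml.etree.ElementTree as ET

def parse_offers(page_source):
    """Extract car-rental offers from a fetched page with XPath-style queries."""
    tree = ET.fromstring(page_source)
    offers = []
    for node in tree.findall(".//div[@class='offer']"):
        offers.append({
            "model": node.find("h2").text.strip(),
            "price": node.find("span[@class='price']").text.strip(),
        })
    return offers

sample = """
<html><body>
  <div class="offer"><h2>Fiat Panda</h2><span class="price">199 EUR/month</span></div>
  <div class="offer"><h2>VW Golf</h2><span class="price">299 EUR/month</span></div>
</body></html>
"""

for offer in parse_offers(sample):
    print(offer["model"], "->", offer["price"])
```

When a page changes, only the query strings need updating, which is presumably why maintenance stays manageable.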

The problems started when the CEO and CTO proposed (mandated?) to use
something made by a guy who is supposed to be a web scraping expert in the
same domain we're working in (I don't doubt that, but still...). The software
it gave us is written in Ruby (which no-one here ever even saw a line of) and
Rails, and works with recipes instead of the imperative code + xpath we used
at the beginning. It works flawlessly until it doesn't. Mainly because there's
a big logical error (if an offer disappears from the original website we
should mark it as deleted on our db, but the scrapers tells us it's still
there) and I don't have time to learn ruby, rails and the whole system to fix
this. And the original dev is not available anymore. So we're phasing that out
and going back to our nice land of python :)

Anyway, the process goes like this:

1. The scraper fetches the page, scrapes the data, and generates a JSON file
with the info on all the offers it finds.

2. Those JSON files are uploaded to S3.

3. A trigger on S3 calls the "writer" on an EC2 instance, which downloads the
JSON file, unpacks the content, and writes the data to a Postgres database.
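The "writer" step can be sketched as a pure function that turns one scraped JSON payload into parameterized INSERT statements. This is only an illustration of the shape of the data flow: the table name, columns, and payload fields are made up, and the real S3 trigger and Postgres connection are omitted.

```python
# Sketch of the writer: scraped JSON payload -> parameterized SQL rows.
# Table and column names are hypothetical; in the real pipeline this runs
# on an EC2 instance when an S3 trigger fires, and writes to Postgres.
import json

INSERT_SQL = "INSERT INTO offers (source, model, price_eur) VALUES (%s, %s, %s)"

def payload_to_rows(raw_json):
    """Unpack one scraper output file into (sql, params) pairs."""
    offers = json.loads(raw_json)
    return [
        (INSERT_SQL, (o["source"], o["model"], o["price_eur"]))
        for o in offers
    ]

sample = json.dumps([
    {"source": "rentalsite.example", "model": "Fiat Panda", "price_eur": 199},
])
rows = payload_to_rows(sample)
print(rows[0][1])  # ('rentalsite.example', 'Fiat Panda', 199)
```

Keeping the unpacking separate from the database write makes each scraped file easy to replay if an insert fails.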

Current problem: scraping arbitrary strings representing car options and
categorizing them. We have something like 15,000 strings that need to be put
into a category. Manually.
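One cheap way to shrink that manual pass is to pre-bucket the strings by keyword and hand-review only what's left. A toy sketch, with invented categories and keywords:

```python
# Keyword pre-bucketing for free-text car option strings.
# Categories and keyword lists are invented examples, not the real taxonomy.
CATEGORY_KEYWORDS = {
    "climate": ["air conditioning", "climate", "a/c"],
    "safety": ["abs", "airbag", "esp"],
    "comfort": ["leather", "heated seats", "cruise"],
}

def categorize(option):
    """Return the first category whose keywords match, else flag for review."""
    text = option.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return category
    return "uncategorized"  # left for the manual pass

print(categorize("Dual-zone Climate Control"))  # climate
print(categorize("Driver airbag"))              # safety
```

Even a crude keyword table tends to clear a large fraction of such strings, leaving a much smaller "uncategorized" pile for humans.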

------
xstartup
Ad scraping can be very profitable.

See: [https://adplexity.com/](https://adplexity.com/)

~~~
is_true
I don't understand something. How do they know how many impressions ads get?
Are they just scraping, or do they have access to the ad networks' stats?

~~~
xstartup
It's determined by guesswork.

If you scrape a website, you'll know what % of the time each ad shows up.

Now, if you know the distribution for the top N ads, you can skip everything
and go straight to the ad network and check their total impression volume for
a particular website/GEO. You can pretend to be an advertiser and easily get
access to this info.

Now that you know the distribution and the total, can't you work out the
impression count for each ad?
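The arithmetic being described is just share-of-slots times total volume. A toy illustration with invented numbers:

```python
# Estimate per-ad impressions from scraped ad shares and the network's
# total impression volume for a site/GEO. All numbers are invented.
def estimate_impressions(ad_counts, total_impressions):
    """ad_counts: how often each ad appeared across our scrapes."""
    observed = sum(ad_counts.values())
    return {
        ad: round(total_impressions * count / observed)
        for ad, count in ad_counts.items()
    }

estimates = estimate_impressions({"ad_a": 60, "ad_b": 30, "ad_c": 10}, 1_000_000)
print(estimates)  # {'ad_a': 600000, 'ad_b': 300000, 'ad_c': 100000}
```

The accuracy obviously depends on how uniformly the scraper samples the site's ad slots.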

~~~
is_true
ty

