
How we scrape 300k prices per day from Google Flights - gusgordon
https://medium.com/brisk-voyage/how-we-scrape-300k-flight-prices-per-day-from-google-flights-79f5ddbdc4c0
======
jsnell
> This isn’t an astronomical number, but it’s large enough that we (at least,
> as a bootstrapped company) have to care about cost efficiency.

... by externalizing the costs to a third party.

In general, I'm really surprised that they published this article. It's like
they described exactly the data that somebody working on preventing scraping
would need to block this traffic, in a totally unnecessary level of detail.
(E.g. telling exactly which ASN this traffic would be arriving from,
describing the very specific timing of their traffic spikes, the kind of
multi-city searches that probably see almost no organic traffic).

I just don't get it. It's like they're intentionally trying to get blocked so
that they can write a follow-up "how Google blocked our bootstrapped business"
blog post.

~~~
fxtentacle
Or they just don't understand that what they are doing is illegal.

I'm always surprised by the level of ignorance, but I've seen more than one
startup burn because the founders didn't understand which taxes were due and,
thus, failed to account for them in their pricing.

~~~
aardvark291
Unethical? Yes. Illegal? How?

~~~
digsy
Unethical to build a business scraping data from a company that makes money
scraping data?

~~~
aardvark291
The Google Flights data is not "scraped." They interface directly to the
airline reservations systems.

------
cortesoft
> The crawl function reads a URL from the SQS queue, then Pyppeteer tells
> Chrome to navigate to that page behind a rotating residential proxy. The
> residential proxy is necessary to prevent Google from blocking the IP Lambda
> makes requests from.

I am very interested in what a 'rotating residential proxy' is. Are they
routing requests through random people's internet connections? Are these
people willing participants? Where do they come from?
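
For anyone unfamiliar with the setup in that quote, here is a minimal sketch of the general idea, not Brisk Voyage's actual code. It assumes a client-side list of proxy endpoints (the `*.example` hosts are made up) and uses a `deque` as a stand-in for the SQS queue. `--proxy-server` is a real Chrome command-line flag; the function names are hypothetical, and the actual browser launch through Pyppeteer is left out.

```python
import itertools
from collections import deque

# Hypothetical proxy endpoints; a real provider would supply these.
PROXIES = [
    "proxy-a.example:8000",
    "proxy-b.example:8000",
    "proxy-c.example:8000",
]


def chrome_args_for(proxy):
    """Build Chrome command-line flags that route traffic through one proxy."""
    return ["--headless", f"--proxy-server={proxy}"]


def plan_crawl(urls, proxies):
    """Pair each queued URL with the next proxy in round-robin order."""
    queue = deque(urls)                  # stand-in for the SQS queue
    rotation = itertools.cycle(proxies)  # wraps around when proxies run out
    plan = []
    while queue:
        url = queue.popleft()
        plan.append((url, chrome_args_for(next(rotation))))
    return plan


if __name__ == "__main__":
    for url, args in plan_crawl(["https://example.com/a", "https://example.com/b"], PROXIES):
        print(url, args)
```

Commercial residential proxy services often do the rotation on their side behind a single gateway address, so the client-side round-robin here is only for illustration.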

~~~
nsgf
Yes, to all your questions.

[https://luminati.io/](https://luminati.io/)

Providers of the 'free' Hola vpn.

~~~
ac29
How awful.

"80M+ Monthly devices hosting Luminati's SDK" & "100% Peers chose to opt-in to
Luminati's network" ([https://luminati.io/network-details](https://luminati.io/network-details))

There is a 0% chance that 80M+ are agreeing "I am OK with Luminati selling
access to my home internet connection to any party able to pay", which seems
like an honest description of their business model. More likely Luminati is
paying unscrupulous app developers to include this SDK in their apps, and some
put some legalese into 10,000 word install-time agreements that no one reads.

~~~
sdinsn
People who use the network also participate in the network themselves.

------
randombytes6869
To those lamenting that they're scraping... Google is the biggest scraper of
them all. Facebook, Amazon, Google, Microsoft. All the big boys scrape
voraciously, yet try their best to block themselves from being scraped.
Scraping is vital for the functionality of the internet. The narrative that
scraping is evil is what big companies want you to think.

When you block small scrapers from your site but permit giants like Googlebot
and Bing, all you're doing is locking in a monopoly that's bad for everyone.

~~~
occamrazor
Google has the (often implicit) permission of the website owner to scrape.
OTOH, Google Flights explicitly disallows scraping results.

~~~
sdinsn
No, Google's scraping is opt-out only, which they offer to be friendly.

Google does not need anyone's permission to scrape publicly accessible data,
and they are not required to follow any opt-out requests.

------
cleansy
It's ironic writing an article like that, while their ToS states:

> As a user of the Site, you agree not to:

> 1\. Systematically retrieve data or other content from the Site to create or
> compile, directly or indirectly, a collection, compilation, database, or
> directory without written permission from us.

~~~
Frost1x
Irony is even deeper when you look on the other side, which is Google who made
most of their money off scraping data from people in different forms.

It's data scraping/middlemen all the way down... I wonder if Google indexes
their scrape results to throw some loops in the mix.

~~~
namdnay
Google have respected non-scrape headers for decades, no?
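
The opt-out in question is usually expressed in a site's robots.txt. As a rough sketch of how a well-behaved crawler checks it, the snippet below uses Python's standard `urllib.robotparser`. The robots.txt content, the `ExampleBot` user agent, and the `example.com` URLs are all made up for illustration and do not reflect any real site's rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(is_allowed(ROBOTS_TXT, "ExampleBot", "https://example.com/search?q=x"))
    print(is_allowed(ROBOTS_TXT, "ExampleBot", "https://example.com/about"))
```

Nothing enforces this check; a crawler honors it only if its author chooses to.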

~~~
Frost1x
Google also curates and republishes data from a lot of sites, including news
sites and informational sites, which significantly reduces traffic to those
sites. There's a lot of data Google scraped, outside its page crawler, that it
wasn't necessarily given explicit permission to use. They chose "beg for
forgiveness" over "ask for permission" in many cases.

The point being that there's irony in every direction: the proverbial "pot
calling the kettle black."

------
dmortin
It's strange they write about this so openly. Aren't they wary that someone at
Google Flights will read it and they will try blocking them? (E.g. by
scrambling the page's code)

~~~
meritt
Google doesn't need to block them on a technical level, they just need to send
a simple C&D. If Brisk keeps scraping without permission after that, they can
look forward to a financially ruinous legal battle [1]. Or they could just not
blog about what they're doing and fly under the radar for years and years
without any concern.

[1] [https://www.eff.org/deeplinks/2019/09/victory-ruling-hiq-v-l...](https://www.eff.org/deeplinks/2019/09/victory-ruling-hiq-v-linkedin-protects-scraping-public-data)

------
dlhavema
Interesting. A scraper scraping a scraper. I don't get what the value add is
over clients just searching Google Flights directly. Not trying to be mean,
just trying to understand.

~~~
namdnay
Google Flights isn’t a scraper, it’s an evolution of ITA matrix from what I
remember, directly connected to that GDS. They aren’t piggybacking on someone
else’s servers.

Which is what this guy could have done, instead of behaving like pond scum.
It’s not like it’s particularly complicated to get programmatic access to a
GDS API, that’s what they’re there for.

~~~
gusgordon
It’s expensive to get access to a GDS API and, from what I’ve heard, the data
they provide is quite difficult to work with. There’s a reason Google bought
ITA for $700m, right? If this project ever grows, it could make sense to pull
from a GDS.

~~~
namdnay
> It’s expensive to get access to a GDS API and, from what I’ve heard, the
> data they provide is quite difficult to work with.

Well, it’s expensive to provide live answers for flight search queries across
hundreds of airlines and thousands of airports... Some of the old booking
interfaces are ugly, but for simple searching most of them provide relatively
sane REST/JSON

I don’t understand your attitude, steal it until you make it?

~~~
james412
That's the attitude of just about every successful company in history. Once
large enough, some of them (e.g. YouTube) even force industrial changes to
accommodate all the theft that made them successful.

Meanwhile on the topic of attitudes, referring to a startup as 'pond scum'
simply because they scrape an extremely expensive data set, especially
regarding an industry with a long and controversial history of strategies
designed to avoid price transparency... hmm.

------
nunez
Flights isn't really the best way of getting cheap flights. They pepper the
results, especially if they think you're scraping (which they probably do).
Matrix is more accurate. Using a GDS is even more accurate but that costs
money.

------
dandanio
Hey Gus, you might be interested in
[https://pricelinepartnernetwork.com/](https://pricelinepartnernetwork.com/)
(take a look at the API part for example)

(Disclaimer: I work for Priceline).

------
founderling
The way I read it, they scrape 25k pages per day?

I wonder if that could already put them on Google's radar. If so, Google
would probably send a cease and desist letter and this startup would simply
give up.

I wonder if Google would also demand their legal expenses? Probably a couple
thousand dollars?

I know, nobody would go to court against Google - but what would happen if
this _did_ go to court? Which laws would Google cite to deem this illegal?

------
BaitBlock
Reader mode in case you don't prefer Medium:
[https://baitblock.app/read/medium.com/brisk-voyage/how-we-sc...](https://baitblock.app/read/medium.com/brisk-voyage/how-we-scrape-300k-flight-prices-per-day-from-google-flights-79f5ddbdc4c0)

------
mongodbhater
All the (AWS) technologies used are totally unnecessary: SQS, DynamoDB, Lambda.
I can buy a laptop at Walmart for $500 and do all the scraping on Starbucks
wifi.

~~~
nunez
Lambda is needed to get rotating IPs and scale while avoiding browser
fingerprinting. SQS takes the results of those scrapes and puts them into a
database, DynamoDB. It's a straightforward web scraping pipeline.

~~~
toddh
Lambda isn't enough. You'll get blocked in a heartbeat. You still need a proxy
service.

------
nojito
You state that you care about costs but you end up using some of the most
expensive cloud offerings out there?

~~~
heipei
I'm torn about their account. It's true that you could easily scrape 25k pages
per day on a small VPS that costs less than the $50 Lambda costs they
mentioned. And in order to scrape from that VPS you wouldn't have to engineer
this much with getting Chrome to run in Lambda, batching URLs, and you
wouldn't worry about Lambda timeouts because you could run the whole scrape in
one session more or less. So you could say that the engineering effort they
spent building this was a waste of money. On the other hand, if they ever do
need to scale up for whatever reason (information spread across more pages, or
they need to scrape more services, or need multiple attempts per URL), all
they have to do is push a button, at which point the upfront engineering
effort will have paid off. Either way, their current Lambda costs are
definitely eclipsed by the costs of paying for the residential proxy IPs. My
two cents.
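
To put the numbers in perspective, here is a quick back-of-the-envelope calculation using only the 25k pages per day quoted in this thread and the 300k prices per day from the article title. It assumes the load is spread evenly over 24 hours, which the article's mention of traffic spikes suggests is not actually the case, so treat it as an average rather than a peak.

```python
PAGES_PER_DAY = 25_000      # figure quoted in the thread
PRICES_PER_DAY = 300_000    # figure from the article title
SECONDS_PER_DAY = 24 * 60 * 60


def seconds_per_page(pages_per_day):
    """Average gap between page loads if spread evenly over a day."""
    return SECONDS_PER_DAY / pages_per_day


def prices_per_page(prices_per_day, pages_per_day):
    """Average number of prices extracted from each scraped page."""
    return prices_per_day / pages_per_day


if __name__ == "__main__":
    print(seconds_per_page(PAGES_PER_DAY))                 # 3.456
    print(prices_per_page(PRICES_PER_DAY, PAGES_PER_DAY))  # 12.0
```

One page load roughly every 3.5 seconds on average is well within reach of a single headless browser, which is the basis for the small-VPS argument above.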

------
ykevinator
This is awesome

------
tpmx
The Internet is not a series of tubes. It's a series of leeches...

