After looking at those:

What’s "big" about any of them?

That’s literally a standard database size, and can be easily done with any postgres install.

All of that data combined can be stored and quickly queried on just 10 dedicated servers, for an overall price of 200€/month.

And by "all" I mean "all 20TB".

And by "quickly" I mean "faster than a network request to Google".

________________

Due to rate limiting, I can't answer you directly at the moment, vgt, so my answer goes here, inline:

> Well, like this:

> If someone wants to decide whether to build a business on BigQuery, they want to evaluate its performance on an average dataset beforehand, for free.

> Assuming the test data is intended for similar use cases, it has to be of similar size to the average dataset used with BigQuery.




Apologies, but where do you find the implication that these datasets are "big"? There is "big" in BigQuery, so I apologize if that caused a misunderstanding.

We try to keep datasets reasonable, so that folks can get the most out of the BigQuery free pricing tier :)

If you want bigger, here's an example of a 10TB dataset:

https://cloud.google.com/genomics/data/1000-genomes


Hi Janne!

I'm sure you'll enjoy a challenge.

Let's talk about Wikidata. Can you download this 8 GB compressed file?

https://dumps.wikimedia.org/wikidatawiki/entities/latest-all...

I want to know the id of the JSON row whose length is 102 bytes.

It took me 4 seconds with BigQuery - how can we improve this with "any postgres install"?

https://lists.wikimedia.org/pipermail/wikidata/2016-March/00...
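
For concreteness, here is roughly what that would have to look like on a plain Postgres install. This is purely a sketch: the table name is invented, and it assumes the dump has already been decompressed and split into one JSON entity per line, with the array brackets and trailing commas stripped.

    -- Hypothetical staging table: one Wikidata entity (one JSON document) per row.
    CREATE TABLE wikidata_raw (doc text);

    -- Load the pre-split file, e.g. with \copy in psql; the exact options depend
    -- on how the file was prepared, since COPY's text format escapes backslashes.
    -- \copy wikidata_raw (doc) FROM 'latest-all.jsonl'

    -- Id of the entity whose serialized JSON is exactly 102 bytes long.
    SELECT doc::jsonb ->> 'id' AS entity_id
    FROM   wikidata_raw
    WHERE  octet_length(doc) = 102;

Whether that beats 4 seconds depends mostly on already having the data loaded and, ideally, a precomputed length column or index; the sketch deliberately leaves the load step out.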


What you just did isn't how data is usually processed: a single 8 GB JSON file never occurs in real-life cases, except maybe as a transitional medium while you migrate from one database to another.

In most cases, you'll have it stored in a relational or graph database, easily accessible.

Especially in the sciences.

But sure, I can go through the data and tell you the number of cats with specific properties, or something similar.

I currently use pgsql to analyse data from Android crash reports.

Like, "list me all exceptions which have reports from more than 50 unique users who do not all use Samsung phones".
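
In SQL that comes out as something like the following. It's a sketch against a hypothetical schema; the real table and column names will obviously differ.

    -- Hypothetical schema, invented for illustration:
    CREATE TABLE exceptions (
        id   bigint PRIMARY KEY,
        name text
    );
    CREATE TABLE reports (
        id                  bigint PRIMARY KEY,
        exception_id        bigint REFERENCES exceptions (id),
        user_id             bigint,
        device_manufacturer text
    );

    -- Exceptions with reports from more than 50 unique users,
    -- where not every reporting device is a Samsung.
    SELECT e.id, e.name
    FROM   exceptions e
    JOIN   reports r ON r.exception_id = e.id
    GROUP  BY e.id, e.name
    HAVING COUNT(DISTINCT r.user_id) > 50
       AND NOT bool_and(r.device_manufacturer = 'Samsung');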

Or, my favourite: since I store for each report the exception, for each exception the stack trace elements, and for those the methods, files, line numbers and classes, I can ask:

Show me a breakdown of the operating systems of users affected by exceptions occurring in this method which have sent more than 40 fatal crashes each in the past month.

It manages that in less time than the round-trip latency to BigQuery alone would take: 18 ms over a test dataset of several gigabytes by now.
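
Written out against the same kind of hypothetical schema as in the sketch above, extended with invented stack-frame and method tables, that query might look roughly like this (the method name is a placeholder):

    -- Additional hypothetical tables:
    --   stack_frames(exception_id, method_id, file, line, class)
    --   methods(id, name)
    -- reports is assumed to also carry user_id, os_version, is_fatal, created_at.
    WITH exceptions_in_method AS (
        SELECT DISTINCT sf.exception_id
        FROM   stack_frames sf
        JOIN   methods m ON m.id = sf.method_id
        WHERE  m.name = 'com.example.SomeClass.someMethod'  -- "this method" (placeholder)
    ),
    hot_exceptions AS (
        SELECT r.exception_id
        FROM   reports r
        JOIN   exceptions_in_method em ON em.exception_id = r.exception_id
        WHERE  r.is_fatal
          AND  r.created_at >= now() - interval '1 month'
        GROUP  BY r.exception_id
        HAVING COUNT(*) > 40
    )
    SELECT r.os_version,
           COUNT(DISTINCT r.user_id) AS affected_users
    FROM   reports r
    JOIN   hot_exceptions h ON h.exception_id = r.exception_id
    GROUP  BY r.os_version
    ORDER  BY affected_users DESC;

Keeping the "hot exceptions" filter in its own CTE separates it from the per-OS aggregation and makes it straightforward to support with an index on reports(exception_id, created_at).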

________________

Seriously, @dang, can you maybe consider deactivating downvotes unless people also post a comment explaining why they consider a comment not constructive?

Getting mass downvotes within seconds of posting is very annoying, and just destroys the discussion culture.


> Especially in the sciences.

Take a look at how Stanford is leveraging BigQuery for their genomics analysis:

http://www.eventbrite.com/e/interactive-cloud-analytics-extr...

> It manages that in less time than the round-trip latency to BigQuery alone would take: 18 ms over a test dataset of several gigabytes by now.

Yes!!! Exactly! That's the whole point. If something is taking you less than a second, don't bring BigQuery into the mix. But when you start hitting processes that take hours or days to run... try BigQuery. Going from hours to seconds changes your life.

> I personally use pgsql currently to analyse data from Android crash reports.

Cool! Guess what Motorola uses...

http://www.slideshare.net/PatrickDeglon/predictive-analytics...


> If something is taking you less than a second, don't bring BigQuery into the mix. But when you start hitting processes that take hours or days to run... try BigQuery

But that's my entire point: why are all the test datasets small enough that pgsql handles them faster than the latency to Google alone would take?

I mean, if I wanted to showcase my software, I'd use a huge dataset with a complex problem (say protein folding, or superconductor molecular analysis) and compare how it runs on BigQuery versus a standard local database.

Although these two are poor examples, as I know from my university that they can be solved locally in the same time and for less money than with BigQuery.


Look... if you don't have problems that take hours to solve in your current environment, you haven't found big data problems (yet). And that's OK. Not everyone works with big data.

But if one day you do, please ping me; it will be fun to do a follow-up.


Yeah, that's why I had hoped the test datasets would be big data problems — so I could see what that actually looks like ;)



