After looking at those:

What’s "big" about any of them?

That’s literally a standard database size, and can be easily done with any postgres install.

All of that data combined can be stored and quickly queried on just 10 dedicated servers, for an overall price of 200€/month.

And by "all" I mean "all 20TB".

And by "quickly" I mean "faster than a network request to Google".

________________

Due to rate limiting, I can't answer you directly at the moment, vgt, so my answer goes here, inline:

> Well, like this:

> If someone wants to decide whether to build a business on BigQuery, they want to evaluate its performance on an average dataset beforehand, for free.

> Assuming the test data is intended for similar use cases, it has to be of similar size to the average dataset used with BigQuery.




Apologies, but where do you find the implication that these datasets are "big"? There is "big" in BigQuery, so I apologize if that caused a misunderstanding.

We try to keep datasets reasonable, so that folks can get the most out of the BigQuery free pricing tier :)

If you want bigger, here's an example of a 10TB dataset:

https://cloud.google.com/genomics/data/1000-genomes


Hi Janne!

I'm sure you'll enjoy a challenge.

Let's talk about Wikidata. Can you download this 8 GB compressed file?

https://dumps.wikimedia.org/wikidatawiki/entities/latest-all...

I want to know the id of the JSON row whose length is 102 bytes.

It took me 4 seconds with BigQuery - how can we improve this with "any postgres install"?

https://lists.wikimedia.org/pipermail/wikidata/2016-March/00...
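
For concreteness, here is roughly what that would have to look like on a plain Postgres install. This is purely a sketch: the table name is invented, and it assumes the dump has already been decompressed and split into one JSON entity per line, with the array brackets and trailing commas stripped.

    -- Hypothetical staging table: one Wikidata entity (one JSON document) per row.
    CREATE TABLE wikidata_raw (doc text);

    -- Load the pre-split file, e.g. with \copy in psql; the exact options depend
    -- on how the file was prepared, since COPY's text format escapes backslashes.
    -- \copy wikidata_raw (doc) FROM 'latest-all.jsonl'

    -- Id of the entity whose serialized JSON is exactly 102 bytes long.
    SELECT doc::jsonb ->> 'id' AS entity_id
    FROM   wikidata_raw
    WHERE  octet_length(doc) = 102;

Whether that beats 4 seconds depends mostly on already having the data loaded and, ideally, a precomputed length column or index; the sketch deliberately leaves the load step out.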


What you just did isn't how data is usually processed: a single 8 GB JSON file never occurs in real-life cases, except maybe as a transitional medium while you migrate from one database to another.

In most cases, you'll have it stored in a relational or graph database, easily accessible.

Especially in the sciences.

But sure, I can go through the data and tell you the number of cats with specific properties, or something similar.

I currently use pgsql to analyse data from Android crash reports.

Like, "list me all exceptions which have reports from more than 50 unique users who do not all use Samsung phones".
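
In SQL that comes out as something like the following. It's a sketch against a hypothetical schema; the real table and column names will obviously differ.

    -- Hypothetical schema, invented for illustration:
    CREATE TABLE exceptions (
        id   bigint PRIMARY KEY,
        name text
    );
    CREATE TABLE reports (
        id                  bigint PRIMARY KEY,
        exception_id        bigint REFERENCES exceptions (id),
        user_id             bigint,
        device_manufacturer text
    );

    -- Exceptions with reports from more than 50 unique users,
    -- where not every reporting device is a Samsung.
    SELECT e.id, e.name
    FROM   exceptions e
    JOIN   reports r ON r.exception_id = e.id
    GROUP  BY e.id, e.name
    HAVING COUNT(DISTINCT r.user_id) > 50
       AND NOT bool_and(r.device_manufacturer = 'Samsung');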

Or, my favourite: since I store for each report the exception, for each exception the stack trace elements, and for those the methods, files, line numbers and classes, I can ask:

Show me a breakdown of the operating systems of users affected by exceptions occurring in this method which have sent more than 40 fatal crashes each in the past month.

It manages that in less time than the round-trip latency to BigQuery alone would take: 18 ms over a test dataset of several gigabytes by now.
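
Written out against the same kind of hypothetical schema as in the sketch above, extended with invented stack-frame and method tables, that query might look roughly like this (the method name is a placeholder):

    -- Additional hypothetical tables:
    --   stack_frames(exception_id, method_id, file, line, class)
    --   methods(id, name)
    -- reports is assumed to also carry user_id, os_version, is_fatal, created_at.
    WITH exceptions_in_method AS (
        SELECT DISTINCT sf.exception_id
        FROM   stack_frames sf
        JOIN   methods m ON m.id = sf.method_id
        WHERE  m.name = 'com.example.SomeClass.someMethod'  -- "this method" (placeholder)
    ),
    hot_exceptions AS (
        SELECT r.exception_id
        FROM   reports r
        JOIN   exceptions_in_method em ON em.exception_id = r.exception_id
        WHERE  r.is_fatal
          AND  r.created_at >= now() - interval '1 month'
        GROUP  BY r.exception_id
        HAVING COUNT(*) > 40
    )
    SELECT r.os_version,
           COUNT(DISTINCT r.user_id) AS affected_users
    FROM   reports r
    JOIN   hot_exceptions h ON h.exception_id = r.exception_id
    GROUP  BY r.os_version
    ORDER  BY affected_users DESC;

Keeping the "hot exceptions" filter in its own CTE separates it from the per-OS aggregation and makes it straightforward to support with an index on reports(exception_id, created_at).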

________________

Seriously, @dang, can you maybe consider deactivating downvotes unless people also post a comment explaining why they consider a comment not constructive?

Getting mass downvotes within seconds of posting is very annoying, and just destroys the discussion culture.


> Especially in the sciences.

Take a look at how Stanford is leveraging BigQuery for their genomics analysis:

http://www.eventbrite.com/e/interactive-cloud-analytics-extr...

> It manages that in less time than the round-trip latency to BigQuery alone would take: 18 ms over a test dataset of several gigabytes by now.

Yes!!! Exactly! That's the whole point. If something is taking you less than a second, don't bring BigQuery into the mix. But when you start hitting processes that take hours or days to run... try BigQuery. Going from hours to seconds changes your life.

> I personally use pgsql currently to analyse data from Android crash reports.

Cool! Guess what Motorola uses...

http://www.slideshare.net/PatrickDeglon/predictive-analytics...


> If something is taking you less than a second, don't bring BigQuery into the mix. But when you start hitting processes that take hours or days to run... try BigQuery

But that's my entire point: why are all the test datasets small enough that pgsql handles them faster than the latency to Google alone would take?

I mean, if I wanted to showcase my software, I'd use a huge dataset with a complex problem (say protein folding, or superconductor molecular analysis) and compare how it runs on BigQuery versus a standard local database.

Although these two are poor examples, as I know from my university that they can be solved locally in the same time and for less money than with BigQuery.


Look... if you don't have problems that take hours to solve in your current environment, you haven't found big data problems (yet). And that's OK. Not everyone works with big data.

But if one day you do, please ping me; it will be fun to do a follow-up.


Yeah, that's why I had hoped the test datasets would be big data problems — so I could see what that actually looks like ;)



