That’s literally a standard database size, and can be easily done with any postgres install.
All of that data combined can be stored and quickly queried on just 10 dedicated servers, for an overall price of 200€/month.
And with "all" I mean "all 20TB".
And with "quickly" I mean "faster than a network request to Google".
________________
Due to rate limiting, I can't answer you at the moment, vgt. So my answer will happen here, inline:
> Well, like this:
> If someone wants to decide whether to build a business on BigQuery, they want to evaluate its performance on an average-sized data set for free first.
> Assuming the test data is meant for similar cases, it has to be of similar size to the average dataset used with BigQuery.
What you just did isn't how data is usually processed: a single 8 GB JSON file never occurs in real-life cases, except maybe as a transitional medium while you migrate from one database to another.
In most cases, you'll have it stored in a relational or graph database, easily accessible.
Especially in the sciences.
But sure, I can go through the data and tell you the number of cats with specific properties, or similar.
I personally use pgsql currently to analyse data from Android crash reports.
Like, "list all exceptions which have reports from more than 50 unique users, not all of whom use Samsung phones".
Or, my favourite: since I store for each report the exception, for each exception its stack trace elements, and for each of those the methods, files, line numbers, and classes, I can ask:
"Show me a breakdown of the operating systems of users affected by exceptions occurring in this method which have sent more than 40 fatal crashes each in the past month."
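The "more than 50 unique users, not all of them on Samsung phones" query above can be sketched against a toy schema. This is a minimal, illustrative sketch: the table and column names are my assumptions, not the poster's actual schema, and sqlite3 stands in for PostgreSQL so the snippet is self-contained (the threshold is lowered to 2 users for the demo data).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE exceptions (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE reports (
    id INTEGER PRIMARY KEY,
    exception_id INTEGER REFERENCES exceptions(id),
    user_id TEXT,
    manufacturer TEXT
);
""")

# Tiny sample data: one exception seen by three users, two of them on Samsung.
conn.execute("INSERT INTO exceptions VALUES (1, 'NullPointerException')")
conn.executemany(
    "INSERT INTO reports VALUES (?, ?, ?, ?)",
    [
        (1, 1, "u1", "Samsung"),
        (2, 1, "u2", "Samsung"),
        (3, 1, "u3", "Google"),
    ],
)

# Exceptions reported by more than N distinct users, where at least one
# of those users is NOT on a Samsung phone.
query = """
SELECT e.name
FROM exceptions e
JOIN reports r ON r.exception_id = e.id
GROUP BY e.id, e.name
HAVING COUNT(DISTINCT r.user_id) > 2
   AND COUNT(DISTINCT r.user_id) > COUNT(DISTINCT CASE
        WHEN r.manufacturer = 'Samsung' THEN r.user_id END)
"""
result = [name for (name,) in conn.execute(query)]
print(result)  # ['NullPointerException']
```

The `COUNT(DISTINCT CASE …)` trick counts only the Samsung users per group, so comparing it against the total distinct-user count expresses "not all of them use Samsung" in a single aggregation pass.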
It manages that faster than the latency to BigQuery alone would be: 18 ms over a test dataset of several gigabytes by now.
________________
Seriously, @dang, can you maybe consider deactivating downvotes unless people also post a comment explaining why they consider a comment not constructive?
Getting mass-downvoted within seconds of posting is very annoying, and it just destroys the discussion culture.
> It manages that faster than the latency to BigQuery alone would be: 18 ms over a test dataset of several gigabytes by now.
Yes!!! Exactly! That's the whole point. If something is taking you less than a second, don't bring BigQuery into the mix. But when you start hitting processes that take hours or days to run... try BigQuery. Going from hours to seconds changes your life.
> I personally use pgsql currently to analyse data from Android crash reports.
> If something is taking you less than a second, don't bring BigQuery into the mix. But when you start hitting processes that take hours or days to run... try BigQuery
But that's my entire point! Why are all the test datasets small enough that pgsql answers them faster than the network round trip to Google alone would take?
I mean, if I wanted to showcase my software, I'd use a huge dataset with a complex problem (say protein folding, or superconductor molecular analysis) and show it side by side on BigQuery and on a standard local database.
Although these two are poor examples, since I know from my university that they can be solved in the same time, for less money, locally rather than on BigQuery.
Look... if you don't have problems that take hours to solve in your current environment, you haven't found big data problems (yet). And that's OK. Not everyone works with big data.
But if one day you do, please ping me, it will be fun to do a follow up.
What’s "big" about any of them?