Hacker News new | past | comments | ask | show | jobs | submit login

Legit question: I have a dataset that's a terabyte in size spread over multiple tables, but my queries often involve complex self joins and filters; for various reasons, I'd prefer to be able to write my queries in SQL (or spark code) because it's the most expressive system I've seen. What tool should I use it to load this dataset on RAM and run these queries?

There are a few steps to consider before you are loading data into ram.

Can you partition the data in any useful way? For example if queries use separate ranges of dates, then you can partition data so that queries only need to touch the relevant date range. Can you pre-process any computations? Sometimes tricky things done within the context of multiple joins can be done once and written to a table for later use. Can you materialize any views? Do you have the proper indexes set up for your joins and filters? Are you looking at execution plans for your queries? Sometimes small changes can speed up queries by many orders of magnitude.

Smart queries + properly structured data + a well tuned postgres DB is an incredibly powerful tool.

Can I set up efficient indexes on parquet data to use with Spark, or is it necessary to use a DB?

Most DB engines will use what RAM is available and even if they don't, your OS's page cache will make sure stuff is fast anyways.

> What tool should I use it to load this dataset on RAM and run these queries?

The question should really be: What tool should I use to make this fast?

Postgres can be pretty fast when used correctly and you can make your data fit.

Exactly this. DBs are really good at utilizing all the memory you give them. The query planners might give you some fits when they try and use disk tables for complicated joins, but you can work around them.

Both mysql and pgsql bypass the page cache if they can and maintain their own page caches. You have to do this, otherwise you’re double caching! That is, you’d have your own page cache, which you need to manage calls to read() and to know when to flush pages, while the OS would also have the same pages in its own cache.

(mongodb I believe uses direct mmap access instead of a pagecache, and lmdb does this as well)

You can get a box with 4TB of ram on EC2 for $4/hr spot, so copy your data into /dev/shm and go hog wild.

For lots of databases, most of their time is spent locking and copying data around, so depending on your workload you might getsignificant speedups in Pandas/Numpy if it's just you doing manipulations, and there are multicore just-in-time compilers for lots of Pandas/Numpy operations (like Numba/Dask/etc).

If you have lots of weird merging criteria and want the flexibility of SQL I'd say use a modern Postgresql with multicore selects on that 4TB box.

How long does it take to copy your data in? And what’s the bandwidth cost involved?

People like to talk about the elasticity or compute, but startup is not free (or even cheap in most cases).

If your data is in S3, my experience is that you can push ~20-40MB/core/sec on most instances.

OP is probably talking about an x1e.32xlarge. According to Daniel Vassalo's S3 benchmark [1], it can do about 2.7GB/sec.

So your 4TB DB might take ~30min to fetch.

Bandwidth is free, you'd pay $2 for the 30 min of compute, and some fractions of pennies for the few hundred S3 requests.

[1]: https://github.com/dvassallo/s3-benchmark

It's to note that any data exported out of AWS will be billed at $0.09/GB, or $90/TB

Currently I have it loaded on redshift with as much optimization as possible, and the queries are far more analytical than end-user like (often having to self join on the same dataset). This works okay, but doesn't scale with more than a handful users at a time. I'll probably run some tests with the postgres suggestion but curious if this is still a better alternative or not

[Disclaimer: I worked on BigQuery a couple lives ago]

I'd give Google BigQuery a shot. Should work fast [seconds] and scale seamlessly to [much] larger datasets and [many] more users. For a 1 TB dataset, I have a hard time imagining crafting a slow query. Maybe something outlandish like 1000[00?] joins. They also have an in-memory "BI Engine" offering, alas limited to 50GB max.

On premise, there is Tableau Data Engine. I don't think they offer a SQL interface, you have to buy into the entire ecosystem.

Long shot: I've been working on "most expressive query system over multiple tables" as an offshoot of some recent NLP work. Your use case piqued my interest. I'd love to help / understand it better. My contact is in my profile.

"My contact is in my profile"

This makes for a good laugh: Queries for the masses. Contact: f'info@${user}.com'

Every major database will load the hottest data into RAM, where the scope of "hottest" broadens to whatever amount will fit in RAM. A small percentage of them require you to confirm how much RAM it can use for this cache.

Putting the data on a ramdisk just becomes entirely redundant because it's still going to create a second memory cache that it uses.

Many operations do a local cache warming by running the common queries over the database before they it is brought online for processing. As a secondary note, people often under-estimate the size of their data because they don't account for all of the keys, indexes and relationships that also would be memory cached in an ideal situation.

You don't need to store everything into RAM to get fast results. Data warehouse relational databases are designed exactly for this kind of fast SQL analysis over extremely large datasets. They use a variety of techniques like vectorized processing on compressed columnar storage to get you quick results.

Google's BigQuery, AWS Redshift, Snowflake (are all hosted), or MemSQL, Clickhouse (to run yourself). Other options include Greenplum, Vertica, Actian, YellowBrick, or even GPU-powered systems like MapD, Kinetica, and Sqream.

I recommend BigQuery for no-ops hosted version or MemSQL if you want a local install.

Some of those aren't designed for data that doesn't fit in RAM. They even have it in their name (MemSQL for example).

All of those support datasets that don't fit in RAM, or else they would be useless at data warehousing.

MemSQL uses rowstores in memory combined with columnstores on disk. Both can be joined together seamlessly and the latest release will automatically choose and transition the table type for you as data size and access patterns change.

You can just write Spark SQL, set the executor memory to whatever the machine is and not worry about whether it's in RAM or not.

Spark will naturally use RAM first and then disk as needed.

You should use the cache or persist call on the spark data frame/dataset. Persist gives you more control.

This is what most people in my org do, this is orders of magnitude slower than running queries on the same dataset on redshift with optimized presorting and distribution. Redshift doesn't scale for tens or hundreds of parallel users though, so looking for options

I'm about 75% joking: restore a new cluster from a snapshot for horizontal scaling. They did launch a feature along those lines, because my joke suggestion probably doesn't scale organizationally: take a look at Concurrency Scaling. This is also a fundamental feature of Snowflake with the separation of compute and storage.

Exactly the same game plan for us now, evaluating snowflake, but wanted to check if there's a fundamentally different paradigm that can be much more faster and scalablr

I'll be That Guy and ask: what are you doing with the data, and can you change your processing or analysis to reduce the amount of data you need to touch?

In my experience, it's nearly always the case that pulling in all data is not necessary, and that thinking through your goals, data, and processing can often reduce both the amount of data touched and the processing run on it massively. Look up the article on why GNU grep is so fast for a bunch of general tricks that can be employed, many of which may apply to data processing generally.


1. Random sampling. The Law of Large Numbers applies and affords copious advantages. There are few problems a sample of 100 - 1,000 cannot offer immense insights on, and even if you need to rely on larger samples for more detailed results, these can guide further analysis at greatly reduced computational cost.

2. Stratified sampling. When you need to include exemplars of various groups, some not highly prevalent within the data.

3. Subset your data. Divide by regions, groups, accounts, corporate divisions, demographic classifications, time blocks (day, week, month, quarter, year, ...), etc. Process chunks at a time.

4. Precompute summary / period data. Computing max, min, mean, standard deviation, and a set of percentiles for data attributes (individuals, groups, age quintiles or deciles, geocoded regions, time series), and then operating on the summarised data, can be tremendously useful. Consider data as an RRD rather than a comprehensive set (may apply to time series or other entities).

Creating a set of temporary or analytic datasets / tables can be tremendously useful. As much fun as it is to write a single soup-to-nuts SQL query.

5. Linear scans typically beat random scans. If you can seek sequentially through data rather than mix-and-match, so much the better. With SSD this advantage falls markedly, but isn't completely erased. For fusion type drives (hybrid SSD/HDD) there can still be marked advantages.

6. Indexes and sorts. The rule of thumb I'd grown up with in OLAP was that indexes work when you're accessing up to 10% of a dataset, otherwise a sort might be preferred. Remember that sorts are exceedingly expensive.

If at all possible, subset or narrow (see below) data BEFORE sorting.

6. Hash lookups. If one table fits into RAM, then construct a hash table using that (all the better if your tools support this natively -- hand-rolling hashing algorithms is possible, but tedious), and use that to process larger table(s).

7. "Narrow" the data. Select only the fields you need. Most especially, write only the fields you need. In SQL this is as simple as a "SELECT <fieldlist> FROM <table>" rather than "SELECT * FROM <table>". There are times you can also reduce total data throughput by recoding long records (say, geocoded names, there are a few thousands of place names in the US, using Census TIGER data, vs. placenames which may run to 22 characters ("Truth or Consequences", in NM), or even longer for international placenames. You'll need a tool to remap those later. For statistical analysis, converting to analysis variables may be necessary regardless.

The number of times I've seen people dragging all fields through extensive data is ... many.

Some of this can be performed in SQL, some wants a more data-related language (SAS DATA Step and awk are both largely equivalent here).

Otherwise: understanding your platforms storage, memory, and virtual memory subsystems can be useful. Even as simple a practice as running "cat mydatafile > /dev/null" can often speed up subsequent processing.

in memory sqlite is pretty fast

Yes, but SQLite is probably the least expressive SQL dialect there is. If you're choosing SQL because of its expressiveness, you probably aren't thinking of a dialect with only 5 types (including NULL).

I tried this once, creating an index on a 20 billion row table isn't fast :/

I haven't tried this and don't know if it would work -- but depending on the shape of your data and queries, you might not need certain indices. That is, for some workloads (especially if you're thinking of spot instances), it might be overall faster to skip the indexing and allow the query to do a full table scan. It sounds like maybe you never tried the query without the index, so I'm curious to know if there's any weight behind this theory.

If you run on MySQL/InnoDB you can set innodb_buffer_pool_size=1000Gb and it should cache your data after the first query.


I am not a big-data guy but wouldn't it be along the lines of A) get a big honking server B) fire up "X" SQL server C) Allocate 95-98% of the RAM to DB cache?

A single terabyte is a few magnitudes from what you need big-data-anything for. You could probably work with that just fine on your average 64GB ram desktop with an SSD.

Another poster already replied with a decent refutation of this claim, but a single pass over a TB of data is often not enough for 'big data' use cases and at tens of minutes per pass, it may very well be infeasible to operate on such at dataset with only 64GB of memory.

In the machine learning world, some of the algorithms that are industrial workhorses will require you to have your dataset in memory (ie: all the common GBM libraries), and will walk over it lots of times.

You may be able to perform some gymnastics and allow the OS to swap your terabyte+ dataset around inside your 64GB of RAM, but the algorithms are now going to take forever to complete as you thrash your swap constantly while the training algorithm is running.

tl;dr - a terabyte dataset in the machine learning context may very well need that much RAM plus some overhead in terms of memory available to be able to train a model on the dataset.

A small computer with 1 SSD will take at least 10-20 minutes to make a pass over 1TB of data, if everything is perfectly pipelined.

Samsung claims their 970 Pro NVMe can read 3.5GB/s sequentially. That's about 300 seconds or 5 minutes per TB.

It can't though.

It can, and their fastest enterprise SSD can write at that speed too, or do sequential reads at 7-8GB/s, or random reads at over 4 GB/s.

I just ran `time cp /dev/nvme0n1 /dev/null` on the 1TB 970 Pro. The result:

  real    4m50.724s
  user    0m2.001s
  sys     3m10.282s
So with literally zero optimization effort, we've hit the spec (and saturated a PCIe 3.0 x4 link).

Impressive performance for a $345 consumer grade SSD.


That's impressive and all, but any fragmentation or non-linear access and performance will fall off a cliff

You'd probably be surprised. For reads, there are tons of drives that will saturate PCIe 3.0 x4 with 4kB random reads. Throughput is a bit lower because of more overhead from smaller commands, but still several GB/s. Fragmentation won't appreciably slow you down any further, as long as you keep feeding the drive a reasonably large queue of requests (so you do need your software to be working with a decent degree of parallelism).

What will cause you serious and unavoidable trouble is if you cannot structure things to have any spatial locality. If you only want one 64-bit value out of the 4kB block you've fetched, and you'll come back later another 511 times to fetch the other 64b values in that block, then your performance deficit relative to DRAM will be greatly amplified (because your DRAM fetches would be 64B cachelines fetch 8x each instead of 4kB blocks fetched 512x each).

Option A is the best imo, I worked on many sql db's that the rule was to fit it into ram. Option c will bite you in the ass eventually. The kernel and your other processes need some space to malloc, and you dont want to page in/out.

Having 4/16TiB servers or "memory db servers" as I thought of them solved a lot of problems outright. Still need huge i/o but less of it depending on your workload.

I'm pretty sure that was supposed to be a list of steps, not a list of options.

> The kernel and your other processes need some space to malloc, and you dont want to page in/out.

Some space, like "most of 20-50 gigabytes"?

You want to take into account how exactly the space used by joins will fit into memory, but 2-5% of a terabyte is an extremely generous allocation for everything else on the box.

Applications are open for YC Summer 2023

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact