Hacker News new | past | comments | ask | show | jobs | submit login

I'll be That Guy and ask: what are you doing with the data, and can you change your processing or analysis to reduce the amount of data you need to touch?

In my experience, it's nearly always the case that pulling in all data is not necessary, and that thinking through your goals, data, and processing can often reduce both the amount of data touched and the processing run on it massively. Look up the article on why GNU grep is so fast for a bunch of general tricks that can be employed, many of which may apply to data processing generally.


1. Random sampling. The Law of Large Numbers applies and affords copious advantages. There are few problems a sample of 100 - 1,000 cannot offer immense insights on, and even if you need to rely on larger samples for more detailed results, these can guide further analysis at greatly reduced computational cost.

2. Stratified sampling. When you need to include exemplars of various groups, some not highly prevalent within the data.

3. Subset your data. Divide by regions, groups, accounts, corporate divisions, demographic classifications, time blocks (day, week, month, quarter, year, ...), etc. Process chunks at a time.

4. Precompute summary / period data. Computing max, min, mean, standard deviation, and a set of percentiles for data attributes (individuals, groups, age quintiles or deciles, geocoded regions, time series), and then operating on the summarised data, can be tremendously useful. Consider data as an RRD rather than a comprehensive set (may apply to time series or other entities).

Creating a set of temporary or analytic datasets / tables can be tremendously useful. As much fun as it is to write a single soup-to-nuts SQL query.

5. Linear scans typically beat random scans. If you can seek sequentially through data rather than mix-and-match, so much the better. With SSD this advantage falls markedly, but isn't completely erased. For fusion type drives (hybrid SSD/HDD) there can still be marked advantages.

6. Indexes and sorts. The rule of thumb I'd grown up with in OLAP was that indexes work when you're accessing up to 10% of a dataset, otherwise a sort might be preferred. Remember that sorts are exceedingly expensive.

If at all possible, subset or narrow (see below) data BEFORE sorting.

6. Hash lookups. If one table fits into RAM, then construct a hash table using that (all the better if your tools support this natively -- hand-rolling hashing algorithms is possible, but tedious), and use that to process larger table(s).

7. "Narrow" the data. Select only the fields you need. Most especially, write only the fields you need. In SQL this is as simple as a "SELECT <fieldlist> FROM <table>" rather than "SELECT * FROM <table>". There are times you can also reduce total data throughput by recoding long records (say, geocoded names, there are a few thousands of place names in the US, using Census TIGER data, vs. placenames which may run to 22 characters ("Truth or Consequences", in NM), or even longer for international placenames. You'll need a tool to remap those later. For statistical analysis, converting to analysis variables may be necessary regardless.

The number of times I've seen people dragging all fields through extensive data is ... many.

Some of this can be performed in SQL, some wants a more data-related language (SAS DATA Step and awk are both largely equivalent here).

Otherwise: understanding your platforms storage, memory, and virtual memory subsystems can be useful. Even as simple a practice as running "cat mydatafile > /dev/null" can often speed up subsequent processing.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact