
Ask HN: Recommend tools to manipulate massive (5 million line) csv files. - api_or_ipa
Hi HN,

I've been tasked with analyzing a massive dataset and checking for inconsistencies. The data set contains over 5 million lines, which makes traditional tools (Excel, OO spreadsheet) utterly useless and brings more specialized tools (CSVed) to their knees.

I understand this is well within db territory. On the off chance that there are tools out there that can sort and visualize large datasets, I would greatly appreciate your recommendations.
======
wikwocket
MS Access is often a great tool for this. It works well for datasets that are
too big for Excel but too simple (one table, only 5 million rows) for a true
database app.

Plus, if your comfort level is closer to Excel than MySQL, or you have
experience writing Excel macros/userforms, you can do well in Access.

Just promise you'll move to a proper database if your needs continue to grow; I
have seen too many Frankensteinian monstrosities evolve from once-orderly
Access databases!

------
akg_67
You may want to consider R for analysis and visualization (with RHIPE -- R on
Hadoop MapReduce -- for large datasets). R/RHIPE can easily handle 5 million
observations; I have used RHIPE with a 500 GB dataset, and I am currently
setting up a multi-node RHIPE cluster to handle larger datasets for a personal
project.

You may be able to use Tableau Public, but your dataset may exceed the storage
limits of the free Public account.

Another option is Hadoop-as-a-Service (no visualization) such as Qubole or
Treasure Data. I recently tried Treasure Data -- a decent service if your data
has timestamps. You can run Hive queries to filter and narrow your data, then
export the results for use in visualization software.
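
As a rough sketch of that filter-then-export step, here is what it might look
like from Python using the PyHive client against a generic Hive endpoint. The
host, table, and column names are made up, and Treasure Data ships its own
client, so treat this purely as the shape of the workflow:

```python
# Sketch: narrow a large table with a Hive query, then dump the
# result to a local CSV for visualization tools. Hostname, table,
# and column names are placeholders.
import csv
from pyhive import hive  # pip install pyhive

conn = hive.connect(host="hive.example.com", port=10000)
cur = conn.cursor()
cur.execute(
    "SELECT id, ts, value FROM events "
    "WHERE ts BETWEEN '2013-01-01' AND '2013-01-31'"
)

with open("narrowed.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "ts", "value"])
    for row in cur:  # iterate instead of fetchall() to keep memory flat
        writer.writerow(row)
```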

------
phiggy
Have you heard of OpenRefine, formerly Google Refine?
[https://github.com/OpenRefine/OpenRefine](https://github.com/OpenRefine/OpenRefine)

~~~
msantos
I use OpenRefine fairly often and find it great when I need to normalise
(millions of rows of) raw data. But it's not a data analysis tool, and you'll
inevitably need another tool to create charts, for example.

------
pbrumm
You could load the data into a database and run queries and sorts against it.
Postgres has a COPY ... FROM command that accepts CSV input, which greatly
reduces load time compared to row-by-row INSERTs.
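
A minimal sketch of that load, assuming the psycopg2 driver; the connection
string, table definition, and file path are all placeholders:

```python
# Sketch: bulk-load a CSV into Postgres with COPY via psycopg2.
# Connection details, table name, and columns are placeholders.
import psycopg2

conn = psycopg2.connect("dbname=scratch user=me")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS raw_data (
        id    text,
        ts    text,
        value text
    )
""")

# COPY streams the whole file in one round trip -- far faster
# than millions of individual INSERT statements.
with open("big.csv") as f:
    cur.copy_expert("COPY raw_data FROM STDIN WITH (FORMAT csv, HEADER)", f)

conn.commit()
```

From there, sorting and filtering 5 million rows is routine SQL.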

------
delinquentme
Maybe I'm missing something, but can't you just run it as a batch job, reading
line by line with Hadoop or Python and the csv lib? (Rough sketch below.)
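
A minimal sketch of that per-line pass with the standard csv module, assuming
the inconsistency you care about is something structural like a wrong field
count (the filename and expected width are made up):

```python
# Sketch: stream a 5M-line CSV and flag structural inconsistencies
# without ever holding the file in memory. Filename and the
# expected column count are placeholders.
import csv

EXPECTED_COLS = 12  # assumed width of a well-formed row

with open("big.csv", newline="") as f:
    reader = csv.reader(f)
    header = next(reader)
    for lineno, row in enumerate(reader, start=2):
        if len(row) != EXPECTED_COLS:
            print(f"line {lineno}: expected {EXPECTED_COLS} fields, got {len(row)}")
```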

Shoot me an email (details in profile), as I'd love to poke at something just
for the line on my resume.

------
sithu
Not sure what you mean by checking for inconsistencies, but a good statistical
package would handle this pretty well -- you could use Stata, or R with RStudio
(free).

------
fsk
Set up a database (MySQL, MS SQL, or whatever you use) and import the CSV.
From there, use standard SQL queries to look for duplicates, missing values,
etc.
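
A minimal sketch of those checks with Python's built-in sqlite3, standing in
for MySQL/MS SQL; the filename and column names are placeholders:

```python
# Sketch: import the CSV into SQLite, then use plain SQL to find
# duplicate and missing values. Filename and column names are
# placeholders, and every row is assumed to have three fields.
import csv
import sqlite3

conn = sqlite3.connect("scratch.db")
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS raw (id TEXT, name TEXT, value TEXT)")

with open("big.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)  # skip header
    cur.executemany("INSERT INTO raw VALUES (?, ?, ?)", reader)
conn.commit()

# Duplicate ids:
for row in cur.execute(
    "SELECT id, COUNT(*) FROM raw GROUP BY id HAVING COUNT(*) > 1"
):
    print("duplicate:", row)

# Rows with a missing value:
(count,) = cur.execute(
    "SELECT COUNT(*) FROM raw WHERE value IS NULL OR value = ''"
).fetchone()
print(count, "rows with empty value")
```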

------
x0x0
Depending on what you're doing: awk / sort / uniq / specialized scripts in
Ruby / SQLite / R.

Provide more details.

------
stadeschuldt
I would start with a lightweight database like SQLite.

