

Ask HN: What should I learn if I want to manage and parse lots of data? - sixQuarks

I have some large CSV files that I'd like to parse for information, sort, and generally dig interesting things out of.  I don't really know how to code, but I'd like to be able to create a searchable web interface for this data.  What are the best programming languages or frameworks I should learn to accomplish this?
======
JustARandomGuy
Can you define how much "lots of data" is?

If you have the data in CSV files, it's probably not that large. Consider
using Google Fusion Tables
(<https://developers.google.com/fusiontables/docs/v1/sql-reference>).
You can set up simple queries using SQL syntax (it's easy to learn).

After that, you can set up a simple interface by connecting with Google Apps
Script, which is easy to learn and makes it straightforward to build a simple
interactive web site.

~~~
sixQuarks
This looks promising, but one of the things I need to do is remove duplicate
rows, and I can't figure out how to do that easily.

~~~
bmelton
Parse the rows in any programming language you desire.

Insert the results of the parsing into any popular database.

Google "<my_chosen_database> delete duplicates"

~~~
sixQuarks
That's the thing: I don't know any programming languages.

~~~
bmelton
I completely overlooked that part. My apologies.

------
pcowans
Can you be more specific about how large 'large' is?

Edit: without knowing the specifics, you might like to look at Elasticsearch
(<http://www.elasticsearch.org/>). That'll give you an HTTP/JSON API into your
data, so you might be able to do all the UI work you need client side, e.g.
with Backbone.js (<http://backbonejs.org/>) or similar.
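
To give a feel for that HTTP/JSON API, here is a minimal sketch in Perl using
only core modules, assuming a local Elasticsearch node on the default port;
the index, type, and field names are made up:

    use strict;
    use warnings;
    use HTTP::Tiny;
    use JSON::PP qw(encode_json decode_json);

    my $http = HTTP::Tiny->new;
    my $base = "http://localhost:9200";

    # Index one CSV row as a JSON document; refresh=true makes it
    # searchable immediately.
    my $doc = { name => "Ada Lovelace", email => "ada\@example.com" };
    $http->request('PUT', "$base/myindex/row/1?refresh=true", {
        headers => { 'Content-Type' => 'application/json' },
        content => encode_json($doc),
    });

    # Full-text search over the indexed rows.
    my $res  = $http->request('GET', "$base/myindex/_search?q=name:ada");
    my $hits = decode_json($res->{content})->{hits}{hits};
    print scalar(@$hits), " matching rows\n";

A browser-side framework like Backbone.js would hit the same _search endpoint
directly, so the server can stay almost entirely out of the picture.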

~~~
sixQuarks
The CSV files are not that big, a few thousand rows, but there are lots of
duplicate rows I need to get rid of.

~~~
pcowans
Exact duplicate rows? If you have access to standard Unix tools, try this on a
command line:

    cat input_file.csv | sort | uniq > output_file.csv

------
lsiebert
Well, you can treat CSVs as a SQL database directly with Perl, and it's
fairly easy to use.

You need to say how many entries you have, though, to give us an idea of the
sort of tools you need.

~~~
draegtun
For the OP's reference, here are a few CSV modules on CPAN (Perl) that may be handy:

* DBD::CSV - Treats your CSV file as if it were an RDBMS, so you can use SQL to inspect/change it (see the sketch after this list) - <https://metacpan.org/module/DBD::CSV>

* Parse::CSV - For parsing large CSV files - <https://metacpan.org/module/Parse::CSV>

* Text::CSV_XS - Fast CSV parser - <https://metacpan.org/module/Text::CSV_XS>
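
A minimal sketch of the DBD::CSV approach, assuming the file is
input_file.csv in the current directory with a header row containing
hypothetical "name" and "email" columns:

    use strict;
    use warnings;
    use DBI;

    # DBD::CSV exposes each CSV file in f_dir as a table named after the file.
    my $dbh = DBI->connect("dbi:CSV:", undef, undef, {
        f_dir      => ".",       # directory holding the CSV files
        f_ext      => ".csv",    # so table "input_file" maps to input_file.csv
        RaiseError => 1,
    }) or die $DBI::errstr;

    # SELECT DISTINCT drops exact duplicate rows on the way out.
    my $sth = $dbh->prepare(
        "SELECT DISTINCT name, email FROM input_file ORDER BY name"
    );
    $sth->execute;
    while (my @row = $sth->fetchrow_array) {
        print join(",", @row), "\n";
    }

It won't be fast on huge files, but for a few thousand rows it lets you write
plain SQL without setting up a real database first.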

