
Ask HN: Best Language for Programmatically Transforming Large Datasets? - tylershuster
As part of my job I often find myself creating tools to transform large datasets. Some examples include restructuring GTFS feeds for structured consumption, or migrating CRM data. I come from web development, so my tools of choice generally revolve around PHP and MySQL in a Drupal wrapper (for security, utilities, &c.). As you might expect, this approach seems slow, and I know those tools aren't well suited to the task. What languages, frameworks, or apps do you use for similar tasks?
======
davismwfl
I've used a variety of languages: plain SQL plus some bash scripting, Node.js,
Python, C++, C, and C#/.NET, to name a few. Hell, I've even used bash scripts
with some sed/awk/grep to get it done, but that isn't ideal most of the time.
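When the sed/awk one-liners get hairy, a few lines of Python do the same job
and are easier to grow later; something like this (the delimiter and field
position are made up for illustration):

    #!/usr/bin/env python3
    # Stream a tab-delimited file from stdin to stdout, rewriting one
    # field per line -- roughly what a short awk program would do.
    import sys

    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > 2:
            fields[2] = fields[2].upper()  # hypothetical transform
        print("\t".join(fields))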

The factors I'd consider when picking a language:

1. Source datasource

2. Destination datasource

3. Number and type of transformations

4. The language that created the dataset (typically more relevant if you're
pulling flat files or custom data structures, but always on the radar).

5. Data size, both in terms of storage and in number of records/transactions.

6. Data location: are my source and destination co-located, or across the
state, country, or world from each other?

7. Data growth: do I need to keep running this process to keep up, or is it
one-and-done?

In general, the lower-level the language, the longer it takes to code, but the
result is usually more performant. I usually opt for the easiest solution
first and work from there, because while trying the easy one you'll hit
challenges that push your decision a specific way. Today I'd also look at Rust
if I were handling a large dataset, working with binary data, or needed more
performance. I'd generally avoid PHP for this as it really isn't the right
match, but if I had a small dataset that was created via PHP, I'd probably be
tempted to just knock it out in PHP and be done with it (assuming it wasn't
going to be around long).
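To make "easiest solution first" concrete: for something like the CRM
migration mentioned in the question, the easy version is often just a
streaming field remap over a CSV export. A rough sketch (the column names are
hypothetical):

    # Stream rows from a CSV export, remap the columns, and write the
    # new shape back out. DictReader is lazy, so this handles files
    # much bigger than memory.
    import csv
    import sys

    FIELD_MAP = {"cust_name": "full_name", "cust_email": "email"}

    reader = csv.DictReader(sys.stdin)
    writer = csv.DictWriter(sys.stdout, fieldnames=list(FIELD_MAP.values()))
    writer.writeheader()
    for row in reader:
        writer.writerow({new: row.get(old, "")
                         for old, new in FIELD_MAP.items()})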

Also, in the last 10 years or so, when I do these types of projects I almost
always put a mediator in the middle: a cache like Redis or memcached, plus a
queue. That gives me fast lookups for common values, and it lets me process
data in steps through the queue, survive failures, and separate the code into
clean stages, which is nice for a lot of reasons.
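A rough sketch of that shape with redis-py (the queue names and the transform
are made up, and a production version would use LMOVE into a per-worker
processing list so a crash mid-transform doesn't drop the record):

    # Stages connected by Redis lists used as work queues. Each stage
    # pops a record, transforms it, and pushes it to the next queue,
    # so the pipeline can restart partway through after a failure.
    import json
    import redis

    r = redis.Redis()

    def run_stage(in_queue, out_queue, transform):
        while True:
            # BRPOP blocks until an item arrives; returns (queue, value)
            # or None on timeout, which we treat as "queue drained".
            item = r.brpop(in_queue, timeout=5)
            if item is None:
                break
            record = json.loads(item[1])
            r.lpush(out_queue, json.dumps(transform(record)))

    # e.g. one step that normalizes email addresses
    run_stage("stage:raw", "stage:clean",
              lambda rec: {**rec, "email": rec.get("email", "").lower()})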

I just wrote all that and didn't actually give you a direct answer, but that's
because I don't think there is one correct or simple answer. I'd default to
the easiest option and work toward the more complex based on at least the
factors above, though many others exist.

There are tools that do a lot of this, but they are usually database-focused,
so if you're dealing with zip archives, text files, etc., it's hard to find
COTS software that handles it cleanly.

