

Ask HN: Concise and performant languages for database loading? - bsg75

I have a couple of cases where I need to load data from delimited files once per minute - they arrive at a one minute interval. The source files average 150MB in size, with a fixed length of 20 fields.<p>These are simple workflows where writing a simple script is quicker and easier to maintain than using an ETL system. However, my go-to language Python seems unable to keep up, taking about 3 minutes to parse and insert. An example pattern can be seen at https://gist.github.com/2914601 (I have tried a few  tuning patterns, this is the fastest I have come up with). The example uses Redis, as an experiment, but not the only target.<p>I am interested in looking into languages that will be better performing, but come with some of Python's strengths - concise, easy to write, read and maintain code, and a library of relevant libraries and drivers.<p>I have been reading a bit about FP languages (Haskell/GHC), because of their performance, but I am not sure if the FP paradigm is appropriate for data integration tasks.<p>Opinions / Advice ?
======
nandemo
I'm not knowledgeable on Python and Redis, but very likely you're doing
something wrong with either of them.

Parsing fixed-length fields should take no time at all, so the bulk of the
time should be taken by reading the file and inserting the parsed data into
the data store. Reading and writing a 150MB file from disk should take only a
couple of seconds. So before bothering with another language, you should try
to pinpoint where the problem is. Python has a profiler, use it. Also, you
appear to be using multiple threads. That probably won't help you much since
your task is IO-bound, so try making it single-threaded.

If you cannot find the problem then I suggest asking for advice on
StackOverflow (or CodeReview).

~~~
bsg75
Removing the single call to update Redis takes execution time from 3 minutes
to ~5 seconds. Profiling the execution shows the most of the time is taken up
inside the Redis driver, but I see similar times if I use the MongoDB Python
driver.

The drivers are open-source, so I could invest some time in looking for
bottlenecks there, but I believe the authors of those drivers are better
Python developers than I.

Interest in a different language comes with some thought that drivers written
in faster languages will perform faster.

------
bsg75
Pipelining the server calls in my example above to limit round trip traffic
greatly improves performance, but the nature of my question is not to a
specific language or database.

I am interested in exploring platforms for ETL that offer performance for both
machine and developer.

------
smoyer
I had similar issues with Python and Oracle and found that it was the indices
on the target table that was slowing down the process. I was only importing
once so for me the answer was creating the index after the table was fully
loaded.

~~~
bsg75
The test case using Redis is not affected by indexing, as Redis does not
support indexing.

