
Ask HN: Cost-effective hardware for text processing? - jason_slack
I need to process 15,000 text files each day: un-raring, manipulating data, creating a CSV, and so on. It's a lot. Each file is from 1 MB to over 100 MB. I can't change how I get this data.

I currently use a combination of rar, GNU parallel, Python, pandas, rename, and sed in a bash script, on Debian 9.6.

With a 60 GB backlog plus new data each day, my laptop can't keep up. I need a dedicated setup for this.

Ideas on doing this cheaply but still pretty quickly? What are some hardware ideas? Would a Pi cluster be effective? A used PC would be fine, but there are so many processors and motherboards, all slightly different.

Edit: I could switch to C++ using system() calls, then eventually replace those calls with native code where that makes sense.
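
For reference, the per-file fan-out can also be driven from Python directly instead of bash + parallel. A minimal sketch, where the unrar flags, the `incoming`/`out` paths, and the whitespace-split transform are placeholders for the real steps:

```python
import csv
import subprocess
import tempfile
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

def extract(archive: Path, dest: Path) -> None:
    # Shell out to unrar, same as the rar step in the bash pipeline.
    subprocess.run(["unrar", "x", "-inul", str(archive), str(dest)], check=True)

def to_rows(text: str) -> list[list[str]]:
    # Placeholder transform: split whitespace-delimited lines into columns.
    # The real manipulation (the sed/rename/pandas work) goes here.
    return [line.split() for line in text.splitlines() if line.strip()]

def process_one(archive: Path, out_dir: Path) -> Path:
    out_csv = out_dir / (archive.stem + ".csv")
    with tempfile.TemporaryDirectory() as tmp:  # temp space freed per file
        extract(archive, Path(tmp))
        with out_csv.open("w", newline="") as f:
            writer = csv.writer(f)
            for txt in sorted(Path(tmp).glob("*.txt")):
                writer.writerows(to_rows(txt.read_text()))
    return out_csv

if __name__ == "__main__":
    archives = sorted(Path("incoming").glob("*.rar"))
    Path("out").mkdir(exist_ok=True)
    with ProcessPoolExecutor() as pool:  # one worker per core by default
        for done in pool.map(process_one, archives, [Path("out")] * len(archives)):
            print("wrote", done)
```
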
======
bradknowles
Myself, I would probably use a virtual machine in the cloud somewhere, maybe
AWS, maybe some other provider. I'd want to make sure the instance storage is
SSD, and that temporary disk space gets cleaned up once each run is done.

Pick an instance size that has enough RAM and CPU for your requirements, and
then configure it to boot at specific times, run and do its thing, then shut
down again.
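
On a self-managed box, that boot-run-shutdown pattern is roughly a cron entry that runs the job and powers off when it finishes (the script path here is a placeholder; on a cloud VM you would pair the shutdown with a scheduled instance start):

```
# /etc/cron.d/nightly-batch: run at 02:00, power off when the run finishes
0 2 * * *  root  /usr/local/bin/process_backlog.sh && /sbin/shutdown -h now
```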

So long as it's not running except when you need it to, even a fairly beefy
hardware configuration can still be pretty cheap.

If you can’t do that, then I’d run a separate machine at my home, and not try
to do it on my laptop.

~~~
jason_slack
Thanks. The cloud could work. I was hoping to use any idle time for building
models of the data.

Any thoughts what physical hardware would be beefy yet cost effective on my
wallet?

~~~
bradknowles
Your largest dataset is ~100 MB, right?

I would think that any VM that has at least 2-3 times this amount of memory
free after the OS is loaded (and your other critical apps are loaded) should
be a good place to start. But your data and your own testing will be your best
guide — try something you think might work, and if you’re not happy with the
speed or the system seems to be thrashing too much, then try the next size up.
Or maybe try a different instance type that has a different mix of permanent
storage versus CPU performance versus RAM.
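
One way to put a number on that RAM headroom: measure what a worst-case file actually costs in memory once loaded, e.g. with pandas. The filename and whitespace-delimited layout below are assumptions:

```python
import io
import sys
import pandas as pd

def mem_mib(path_or_buf) -> float:
    # Load one worst-case file and report its in-memory footprint in MiB.
    df = pd.read_csv(path_or_buf, sep=r"\s+", header=None, engine="python")
    return df.memory_usage(deep=True).sum() / 2**20

if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python mem_check.py biggest_file.txt
    # Budget roughly 2-3x this figure per concurrent worker.
    print(f"{mem_mib(sys.argv[1]):.0f} MiB in memory")
```

Note that a parsed DataFrame is often several times larger than the file on disk, so measuring beats guessing from file size.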

But you will need to do some testing here on your own, and see what feels
right for you and the amount of money you’re willing to spend. In that regard,
I would start off with some of your biggest datasets and use those as your
worst case for testing. Then keep an eye on how your datasets change over
time, to see if you need to redo your testing.
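
To make that worst-case test concrete, you can time the per-file step over the biggest files and compare against the daily budget: 15,000 files/day is about one file every 5.8 seconds per core. The glob pattern and the placeholder function below are assumptions; plug in the real per-file routine:

```python
import time
from pathlib import Path

def bench(fn, files) -> float:
    # Time fn over the given files; return average seconds per file.
    t0 = time.perf_counter()
    for f in files:
        fn(f)
    return (time.perf_counter() - t0) / max(len(files), 1)

if __name__ == "__main__":
    # Ten largest archives as the worst case; "incoming" is a placeholder path.
    files = sorted(Path("incoming").glob("*.rar"), key=lambda p: p.stat().st_size)[-10:]
    if files:
        avg = bench(lambda p: print("would process", p), files)  # swap in the real step
        print(f"{avg:.1f} s/file; budget is {86400 / 15000:.1f} s/file per core")
```
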

