Ask HN: Cost-effective hardware for text processing?
3 points by jason_slack 5 days ago | 3 comments
I need to process 15,000 text files each day. This includes un-raring archives, manipulating the data, creating CSVs, and so on. It's a lot. Each file is between 1 MB and over 100 MB. I can't change how I receive this data.

I currently use a combination of rar, GNU parallel, Python, pandas, rename, and sed in a bash script, on Debian 9.6.
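The per-file part of a pipeline like that can be moved into Python itself, with a process pool playing the role of GNU parallel. A minimal sketch (the whitespace-split transformation and the file paths are placeholders, not the poster's actual logic; extraction from rar would still happen beforehand):

```python
import csv
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

def process_file(txt_path: str, out_dir: str) -> str:
    """Parse one already-extracted text file and write a CSV alongside it."""
    out_path = Path(out_dir) / (Path(txt_path).stem + ".csv")
    with open(txt_path) as src, open(out_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        for line in src:
            # Placeholder transformation: split whitespace-delimited fields.
            writer.writerow(line.split())
    return str(out_path)

def process_all(txt_files, out_dir, workers=4):
    # One process per core, same idea as `parallel` in the shell version.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_file, txt_files, [out_dir] * len(txt_files)))
```

Keeping everything in one long-lived Python process also avoids the per-file startup cost of launching a fresh interpreter 15,000 times a day.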

With a 60 GB backlog plus new data arriving each day, my laptop can't keep up. I need a dedicated setup for this.

Any ideas on doing this cheaply but still reasonably fast? What hardware would you suggest? Would a Pi cluster be effective? A used PC would be fine, but there are so many slightly different processors and motherboards.

Edit: I could switch to C++ using system() calls, then eventually replace those calls with native code where that makes sense.
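That two-stage pattern, shell out first and replace calls with native code later, can be sketched in Python with subprocess just as well as in C++ (the sed invocation here is a stand-in for whatever the script actually calls; both functions should produce identical output):

```python
import subprocess

def strip_commas_shell(path: str) -> str:
    # Stage 1: shell out to sed, exactly like system("sed ...") would in C++.
    return subprocess.run(
        ["sed", "s/,//g", path], capture_output=True, text=True, check=True
    ).stdout

def strip_commas_native(path: str) -> str:
    # Stage 2: the same transformation in-process, no subprocess spawned.
    with open(path) as f:
        return f.read().replace(",", "")
```

Writing the native version against the subprocess version as a reference makes it easy to verify the replacement behaves identically before dropping the external tool.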

Myself, I would probably use a virtual machine in the cloud somewhere, maybe AWS, maybe some other provider. I'd want to make sure the instance storage is SSD, and that temporary disk space gets cleaned up once each run is done.

Pick an instance size with enough RAM and CPU for your workload, then configure it to boot at specific times, run its job, and shut down again.
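One way to wire that up, assuming an EC2-style instance where shutting down stops the compute billing (the script path is hypothetical):

```
# crontab on the instance itself: run the batch job at every boot,
# then power off so you stop paying for idle time.
@reboot /home/admin/run_batch.sh && /sbin/shutdown -h now
```

Starting the stopped instance on a schedule then happens from outside, e.g. an AWS EventBridge scheduled rule, or `aws ec2 start-instances` from any small always-on machine.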

So long as it's not running except when you need it to, even fairly beefy hardware configurations can still be pretty cheap.

If you can’t do that, then I’d run a separate machine at my home, and not try to do it on my laptop.

Thanks. The cloud could work. I was hoping to use any idle time with building models of the data.

Any thoughts what physical hardware would be beefy yet cost effective on my wallet?

Your largest dataset is ~100MB, right?

I would think that any VM that has at least 2-3 times this amount of memory free after the OS is loaded (and your other critical apps are loaded) should be a good place to start. But your data and your own testing will be your best guide — try something you think might work, and if you’re not happy with the speed or the system seems to be thrashing too much, then try the next size up. Or maybe try a different model that has a different mix of permanent storage versus CPU performance versus RAM.

But you will need to do some testing here on your own, and see what feels right for you and the amount of money you’re willing to spend. In that regard, I would try starting off with some of your biggest data sets and use those as your current worst case for testing. Then keep an eye on how your data sets change over time, to see if you need to go re-do your testing.
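A cheap way to put numbers on that testing, assuming a Linux box (where `ru_maxrss` is reported in kilobytes), is to check the process's peak resident memory after running a worst-case file through the pipeline:

```python
import resource

def peak_rss_mb() -> float:
    """Peak resident set size of this process so far, in MB (Linux reports KB)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

# Hypothetical usage: process your biggest file, then read the high-water mark.
# data = open("biggest_file.txt").read()
# print(f"peak RSS: {peak_rss_mb():.0f} MB")
```

If the peak on your worst-case input sits comfortably below the instance's RAM, you have headroom; if it's close, the machine will thrash and the next size up is worth trying.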
