Help me build a powerful data cruncher for a few thousand dollars.
8 points by ceesai on Oct 23, 2007 | 11 comments
I am quite the novice when it comes to building computers, and in this case I am not even sure if I need one or can substitute it with a bunch of PS3s.

Here are my two main requirements:

1) Crunch heaps of data. I am talking about, say, 100 million high-dimensional data points, and I may need a large share of them handy (which probably means in memory) to run some machine learning (ML) algorithms on them.

2) A storage device to store up to a terabyte of data.

And some tertiary requirements:

While I'll probably be running hand-crafted algorithms on my data, it would be great to know of existing database technologies that do data management well and also have bundled ML code.

On the coding side, if you suggest Python, please also suggest a good resource for Python-based ML scripts or packages (PyML?).

Thanks for any help, pointers, and/or suggestions!




Lucky for you, Amazon EC2 announced their "extra large" instances just recently with 1.7TB of storage. EC2 and a TB of data on S3 may be a pain to set up, but it has so many advantages that it's a huge win. I would at least take that as the presumptive choice which other options are weighed against.

I know a lot of people will line up to beat the drum for Amazon Web Services but it really is one of the most fantastic resources for startups since Linux.


I wasn't aware of the recent upgrade and unlimited beta, thanks. This will indeed be an option I'll weigh seriously; I do not foresee sudden spikes in my usage patterns, so I am not sure if EC2 is also economical for steady and constant crunching. I'll check the numbers on this...


If you're looking for a big data cruncher on the cheap, I think your choice of programming language becomes important.

Ditch Python for something a little snappier. There's a discussion of languages for Machine Learning at http://hunch.net/?p=230.


Thanks; I now remember seeing this discussion on Hunch a while back and tagging it on delicious. I also remember why this didn't spring to mind; it left me more informed and befuddled :)


How much is "a few thousand"? You can buy machines like you're describing off the shelf for a few thousand these days. If "a few" means six, Apple will sell you a 3 GHz 8-core machine with a terabyte of storage and 8 gigs of RAM.

You can probably save 20-30% building it yourself. I haven't built any high-end workstations, so I'll refrain from making specific recommendations for components or retailers. A good general strategy would be to use the Mac Pro as a template - look for a motherboard with similar features.


Have you looked at Orange? http://www.ailab.si/orange/

It's implemented with Python and C++.


How much memory does each data point take, on average? Are we talking number crunching only? I mean, no analysis on strings?


Strings come from a limited range and can be mapped to numbers. A data point can have up to 350 dimensions, with an average of 5 bits per dimension.
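For a rough sense of scale, here is a back-of-the-envelope sketch in Python using only the figures from this thread (100 million points, up to 350 dimensions, about 5 bits per dimension). It assumes every point uses all 350 dimensions, so it is an upper bound, and it ignores any overhead from indexes, Python objects, or the ML algorithm's working memory. The function names and the three storage layouts compared are illustrative assumptions, not anything the poster specified.

```python
# Back-of-the-envelope memory estimate. All layouts are hypothetical;
# only the three input figures come from the thread.

N_POINTS = 100_000_000   # "100 million" data points
MAX_DIMS = 350           # "up to 350 dimensions" (worst case assumed for all points)
AVG_BITS = 5             # "average of 5 bits per dimension"


def packed_size_bytes(n_points, dims, bits_per_dim):
    """Bytes needed if values are bit-packed back to back with no padding."""
    total_bits = n_points * dims * bits_per_dim
    return (total_bits + 7) // 8  # round up to whole bytes


def unpacked_size_bytes(n_points, dims, bytes_per_value):
    """Bytes needed if each value gets a fixed-width slot."""
    return n_points * dims * bytes_per_value


packed = packed_size_bytes(N_POINTS, MAX_DIMS, AVG_BITS)
one_byte = unpacked_size_bytes(N_POINTS, MAX_DIMS, 1)   # one byte per value
float64 = unpacked_size_bytes(N_POINTS, MAX_DIMS, 8)    # 8-byte float per value

for label, size in [("bit-packed", packed),
                    ("1 byte/value", one_byte),
                    ("float64", float64)]:
    print(f"{label:>13}: {size / 1e9:.1f} GB")
```

Under these assumptions, even the tightly packed layout comes to roughly 22 GB, which is more than the 8 gigs of RAM on the workstation mentioned earlier in the thread. So the whole set would not fit in memory at once on such a machine, though a large subset could.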


You might find our Google group helpful.

http://groups.google.com/group/fireseed-fs3-omecc

We have been collecting information for a year about using GPUs and PS3s for Artificial Intelligence and Physics research.


Applied for membership; any reason why even reading through the posts needs membership-level access?


That's a google group? You have to seek and obtain permission to even read it.



