

Help me build a powerful data cruncher for a few thousand dollars. - ceesai

I am quite the novice when it comes to building computers, and in this case I
am not even sure if I need one or can substitute it with a bunch of PS3s.

Here are my two main requirements:

1) Crunch heaps of data. I am talking about, say, 100 million high-dimensional
data points, and I may need quite a large number of those handy (that probably
means in memory) to run some machine learning (ML) algorithms on them.

2) A storage device to store up to a terabyte of data.

And some tertiary requirements:

While I'll probably be running hand-crafted algorithms on my data, it would be
great to know of existing database technologies that do data management well
and also have bundled ML code.

On the coding side, if you suggest Python, please also suggest a good resource
for Python-based ML scripts or packages (pyML?).

Thanks for any help, pointers, and/or suggestions!
======
zach
Lucky for you, Amazon EC2 announced their "extra large" instances just
recently with 1.7TB of storage. EC2 and a TB of data on S3 may be a pain to
set up, but it has so many advantages that it's a huge win. I would at least
take that as the presumptive choice which other options are weighed against.

I know a lot of people will line up to beat the drum for Amazon Web Services
but it really is one of the most fantastic resources for startups since Linux.

~~~
ceesai
I wasn't aware of the recent upgrade and unlimited beta, thanks. This will
indeed be an option I'll weigh seriously; I do not foresee sudden spikes in my
usage patterns, so I am not sure if EC2 is also economical for steady and
constant crunching. I'll check the numbers on this...

------
viergroupie
If you're looking for a big data cruncher on the cheap, I think your choice of
programming language becomes important.

Ditch Python for something a little snappier. There's a discussion of
languages for Machine Learning at <http://hunch.net/?p=230>.

~~~
ceesai
Thanks; I now remember seeing this discussion on Hunch a while back and
tagging it on delicious. I also remember why this didn't spring to mind; it
left me more informed and befuddled :)

------
Zak
How much is "a few thousand"? You can buy machines like you're describing off
the shelf for a few thousand these days. If "a few" means six, Apple will sell
you a 3 GHz 8-core machine with a terabyte of storage and 8 gigs of RAM.

You can probably save 20-30% building it yourself. I haven't built any high-
end workstations, so I'll refrain from making specific recommendations for
components or retailers. A good general strategy would be to use the Mac Pro
as a template - look for a motherboard with similar features.

------
travisbrady
Have you looked at Orange? <http://www.ailab.si/orange/>

It's implemented with Python and C++.

------
rglullis
How much memory does each data point take, on average? Are we talking number
crunching only? I mean, no analysis on strings?

~~~
ceesai
Strings come from a limited range and can be mapped to numbers. A data point
can have up to 350 dimensions with an average of 5 bits per dimension.
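
As a rough check on those figures, here is a minimal back-of-the-envelope
sketch of how much memory the full data set would need. It assumes every one of
the 100 million points uses all 350 dimensions, so it is an upper bound; the
function name `dataset_bytes` and the one-byte-per-dimension alternative are
illustrative assumptions, not something stated in the thread.

```python
# Upper-bound memory estimate for the data set described in this thread.
# Figures taken from the thread: 100 million points, up to 350 dimensions,
# about 5 bits per dimension on average.

def dataset_bytes(n_points, n_dims, bits_per_dim):
    """Return the storage needed, in bytes, at the given bits per dimension."""
    total_bits = n_points * n_dims * bits_per_dim
    return total_bits / 8

N_POINTS = 100_000_000
N_DIMS = 350
GB = 10**9  # decimal gigabytes

packed = dataset_bytes(N_POINTS, N_DIMS, 5)    # tightly bit-packed
as_bytes = dataset_bytes(N_POINTS, N_DIMS, 8)  # assumption: one byte per dimension

print(f"bit-packed:   {packed / GB:.1f} GB")    # 21.9 GB
print(f"byte per dim: {as_bytes / GB:.1f} GB")  # 35.0 GB
```

Either way the whole set would not fit in 8 GB of RAM, so only a subset could
be held in memory at once unless the points are much sparser than the maximum.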

------
jdavid
You might find our Google group helpful.

<http://groups.google.com/group/fireseed-fs3-omecc>

We have been collecting information for a year about using GPUs and PS3s for
artificial intelligence and physics research.

~~~
ceesai
Applied for membership; any reason why even reading through the posts needs
membership-level access?

