

Ask HN: Linux rig for data mining and machine learning - big_data

Here's the scenario: if you were asked to build out three Linux machines that would be used together in a cluster to perform data mining and machine learning tasks, with the occasional MapReduce job thrown in, how would you spec the machines out? What distro would you use? Any must-have software installs?

With regard to the hardware, what is your preference for manufacturer? How much would you expect to pay per machine?

Your thoughts and suggestions are appreciated!
======
burgerbrain
I hate to say it, but I'm not sure you have the skills required to actually do
what you're looking to do if these are the kinds of questions you have. A
better question might be _"what are good resources to read to get into data
mining and machine learning?"_

~~~
big_data
Or even better, I'll ask a guidance counselor for Linux advice. She'll know
the answer. Thanks!

------
turbojerry
You have a requirement; now you need a specification. Until you can specify
the needs accurately, it is impossible to design a solution. So you need to
ask questions about the algorithms that will be used: what hardware can they
run on, CPUs or GPUs? What size are the datasets? What sort of speed is
needed? What constraints are there, such as cost? Etc. As for hardware
manufacturers, you might look at Supermicro and Appro; it really depends on
your needs.

~~~
big_data
I totally agree with your point. I am working on this in parallel with the
hardware spec because my main financial stakeholder is pushing to create the
project budget.

~~~
turbojerry
Is there any possibility the project could be split into two parts: an R&D
part, for which you can specify a development machine, and a production part
that can be specified separately once you have sufficient data? If not, you
could search for mailing lists and newsgroups that deal with the algorithms
you are using and ask for advice and real-world data on what other people are
using to tackle similar problems. You might also find relevant data in
published papers; try searching ACM, IEEE, and Springer.

~~~
big_data
That's a really good point, thanks! It may be easier to get funding for the
R&D portion as a POC for the overall endeavor. Well worth asking at this
point.

------
bobf
Use AWS until you have a reasonable grasp of your dataset and real
requirements. Then buy whatever servers provide the best bang for your buck.
That will probably mean getting six mid-range servers rather than three
servers with the absolute fastest CPU and most memory available. Use either
Red Hat (CentOS) or Debian, and you'll almost certainly be using Hadoop. Dell
servers are fine, although you can sometimes save significantly by going with
something like Supermicro servers from Newegg. In terms of cost, you'll want
to order the bulk of your servers' memory from a third party rather than
having it included in the build.

~~~
big_data
Excellent, thank you! Since I use AWS now for other stuff, this approach makes
the most sense.

------
bayareaguy
A former employer of mine in the financial sector used Scalable Informatics[1]
and Dell[2] servers for that sort of thing.

1- <http://scalableinformatics.com/>

2- <http://www.dell.com/us/business/p/poweredge-cloud-servers>

~~~
big_data
Thanks, I'll take a look at these.

