
The Best Free and Open Source Data Mining Software - fogus
http://www.junauza.com/2010/11/free-data-mining-software.html
======
tibbon
If I may plug 140kit (<http://140kit.com>): it is an open source Twitter
mining and analytics tool that uses the streaming API to make downloading
millions of tweets in easily accessible formats quick and easy for anyone.

(Doh, it looks like it's down for some reason. Well, when it's back up, it's
there. I'm sure you could find the GitHub repo and run it yourself, since it is
open source.) Written in Rails.

~~~
mark_l_watson
This subdomain is up: <http://hackfest.140kit.com/>

~~~
tibbon
Oh, thank you! I should have known that.

------
kanak
I guess R is a bit more DIY than these frameworks, but it has a very large
collection of tools. I've found libraries for everything from CART
(classification and regression trees), to SVM, to HMM learning, to clustering,
to EM. R with libraries from CRAN is my go-to tool for statistical learning.

~~~
mcgin
I am amazed R didn't make it to the list.

~~~
ez77
Something like SAS or DAP would be better suited for large data sets, as R
tends to load everything into RAM.

From <http://www.gnu.org/software/dap/> : "Because Dap processes files one
line at a time, rather than reading entire files into memory, it can be, and
has been, used on data sets that have very many lines and/or very many
variables."
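
To illustrate the one-line-at-a-time idea from that quote, here is a minimal Python sketch. It is not Dap's implementation, just the general streaming technique: it computes a running mean and sample standard deviation with Welford's update, so memory use stays constant no matter how many lines the input has. The function name `streaming_mean_std` and the one-number-per-line input format are assumptions made for the example.

```python
import io
import math


def streaming_mean_std(lines):
    """Return (count, mean, sample std) while holding only one line in memory.

    `lines` is any iterable of strings with one number per line, for example
    an open file object. Blank lines are skipped.
    """
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for line in lines:
        line = line.strip()
        if not line:
            continue
        x = float(line)
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)  # Welford's update
    std = math.sqrt(m2 / (count - 1)) if count > 1 else 0.0
    return count, mean, std


if __name__ == "__main__":
    # A small in-memory stand-in for a large file on disk.
    sample = io.StringIO("2\n4\n4\n4\n5\n5\n7\n9\n")
    print(streaming_mean_std(sample))
```

With a real file you would pass the open file object directly, since iterating over it yields one line at a time.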

~~~
kanak
This is definitely a problem with R, although the biggest problem IMO is that
a lot of libraries aren't multicore capable. Fixing the memory problem was
just a matter of adding lots of RAM to our workstation; we can't fix the
"can't use more than one core at a time" problem as easily.

~~~
nivertech
Yes you can: just run it in many single-core VMs. This is what a vendor of
legacy single-threaded software recommended to me. They were the best in their
field, so they never really tried to port it to multicore ;(
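
A lighter-weight variant of the same idea, sketched in Python under some assumptions: instead of full VMs, launch several independent single-threaded worker processes, give each a slice of the input, and combine their results. This only helps when the job can be split into independent pieces. The `WORKER` script (a sum of squares) and the `run_partitioned` helper are hypothetical stand-ins for the real single-threaded program.

```python
import subprocess
import sys

# Hypothetical single-threaded job: sums i*i over the range [lo, hi)
# given as two command-line arguments, and prints the result.
WORKER = (
    "import sys; lo, hi = int(sys.argv[1]), int(sys.argv[2]); "
    "print(sum(i * i for i in range(lo, hi)))"
)


def run_partitioned(n, workers):
    """Split range(n) into chunks and run one worker process per chunk."""
    step = max(1, -(-n // workers))  # ceiling division, at least 1
    procs = []
    for lo in range(0, n, step):
        hi = min(lo + step, n)
        procs.append(
            subprocess.Popen(
                [sys.executable, "-c", WORKER, str(lo), str(hi)],
                stdout=subprocess.PIPE,
                text=True,
            )
        )
    # All workers are already running; now collect and combine their output.
    total = 0
    for proc in procs:
        out, _ = proc.communicate()
        total += int(out)
    return total


if __name__ == "__main__":
    print(run_partitioned(1000, 4))
```

The operating system is free to schedule each worker on its own core, which is the effect the many-VMs setup is after.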

------
bluedevil2k
Weka isn't limited to the GUI if you want to mine your data. It's a regular
JAR file you can drop in your server-side web application and make calls to it
like any other Java library. I've used it in some of my apps for some
clustering algorithms (the easy stuff, since it can get complicated).

I've also written a few articles on Weka if you want to read a few nice
tutorials on how to use it. I'm not a Data Mining Expert, but I've had a few
semesters of it in grad school.

[http://www.ibm.com/developerworks/opensource/library/os-
weka...](http://www.ibm.com/developerworks/opensource/library/os-weka1/)

[https://www.ibm.com/developerworks/opensource/library/os-
wek...](https://www.ibm.com/developerworks/opensource/library/os-weka2/)

[https://www.ibm.com/developerworks/opensource/library/os-
wek...](https://www.ibm.com/developerworks/opensource/library/os-weka3)

------
Swizec
How cool is that! I'm studying Computer Science at the University that makes
Orange (the first on the list). And the professor who originally came up with
it is an advisor for my startup.

~~~
ams6110
What is your professor advising you on, specifically? I'm curious because the
computer science profs I know would not have the first clue about running a
startup. Not that there's anything wrong with that...

~~~
Swizec
Actually a lot of my startup is based on the idea that we can create the
machine-learning/data-mining algorithms. He's the lead of the faculty
department that deals primarily with data mining.

So essentially, he's advising on algorithm stuff.

------
earle
Mahout should clearly be on this list!

------
mark_l_watson
Good list, but I would add NLTK.

~~~
fogus
This is precisely why I posted this to HN. While it's a nice list, I can't
help but think that HN can fill in the missing pieces.

------
gtani
From my bookmarks:

<http://gate.ac.uk/>

<http://mallet.cs.umass.edu/download.php>

<http://alias-i.com/lingpipe/index.html>

<http://incubator.apache.org/uima/>

<http://elefant.developer.nicta.com.au/>

~~~
lenley
I don't think LingPipe is open source.

~~~
uxp
The Royalty Free version is open source. However, it appears to be a highly
restricted version of the GPL v1 license, where any application that uses
LingPipe, by API calls or even by separately using the output of LingPipe,
must comply with an Open Source Initiative license.

It's a somewhat interesting clause that may restrict "freedom", but it
nonetheless still allows it to be open source.

<http://alias-i.com/lingpipe/licenses/lingpipe-license-1.txt>

------
elblanco
I've seen some great stuff done with Rapid Miner. Really cool package -- plus
I've heard it supports all the Weka components.

------
thingsilearned
Chart.io (YCS10) is building something like these as a service. If you're
interested in getting on the private beta email me at dave@chart.io.

<http://chart.io>

------
NHQ
Google bought a company within the last couple of years that had made a really
smart open source data app that ran in the browser or something. Anybody know
what it was called?

~~~
albahk
Did it become Google Refine? It's a downloadable app that creates a local
web server to run in a browser.

<http://code.google.com/p/google-refine/>

~~~
NHQ
Yeah that's it. It was Freebase before.

