

Performing Linear Regression Using Ruby - ucsd_surfNerd
http://www.sharethrough.com/2012/09/linear-regression-using-ruby/

======
cheald
We do regressions with rb-gsl.

    
    
        require 'gsl'

        x = GSL::Vector.alloc(array_of_x_values)
        y = GSL::Vector.alloc(array_of_y_values)
        # c0 is the intercept, c1 the slope; the cov* values are the
        # covariance terms and chisq is the sum of squared residuals.
        c0, c1, cov00, cov01, cov11, chisq, status = GSL::Fit::linear(x, y)
    

It's not nearly as much work, and it's much faster than doing it in pure Ruby.
:)

(It also does weighted regressions and exponential fitting, among a host of
other things. That wheel's gone done been invented already.)

~~~
ucsd_surfNerd
I did look at GSL :). I decided to write it by hand because I wanted to help
people understand the underlying math. I find that too many people use
libraries without understanding the math, which can become problematic,
especially when performing statistical analysis.

Thanks for pointing out GSL. I appreciate the feedback.

~~~
cheald
It's definitely good to understand the underlying math, and it's a good
article on "this is how to translate a formula into Ruby code", but given that
the article is positioned as "We needed to solve this problem, and this is how
we solved it", it seems like it'd make more sense to focus on solving it with
the least work and the best performance, which is why I mentioned GSL.

It's great to see other people doing statistical work in Ruby, though, so
please don't let my criticism keep you from continuing to do it! :)

~~~
ucsd_surfNerd
I appreciate the feedback, especially when it is constructive criticism. It
helps me understand how other people interpreted the post. Perhaps I got the
positioning slightly wrong, as I wanted it to be more about teaching the basic
math and how to translate it into Ruby. I will try to get the positioning
right next time.

While we do some stats in Ruby, it definitely doesn't represent the entire
"this is how we solved it". In fact, we use a large amount of R and Java to
solve our statistics problems.

------
EzGraphs
Ruby is great for data prep, basic calculations, web app development, and
scraping/aggregating data, with R for visualization. I find them to be a
joyful combination: great for small data sets, quick estimations, and various
small projects.

Larger data sets and performance-intensive operations are better handled in
Python (or Java, C++, etc.). Lots of statistical analysis is way below the
threshold of Hadoop and company.

~~~
thauck
I mostly agree with this; I do like Ruby, even though I don't use it.

The missing link here, and the reason Python gets more love from the data
community, is that Python scales down to the smaller data sets as well as it
handles big ones. (Not sure if you meant it couldn't, but the distinction you
make implies that.)

~~~
textminer
Python is surprisingly heavy-duty. But my kingdom for a seamlessly distributed
or parallelized version of NumPy/SciPy! How nice would it be to just enter "C
= A * B", with A living as a sparse CSC matrix across many nodes?

~~~
msellout
Would Disco (<http://discoproject.org/>) work for you?

~~~
cdavid
I don't think MapReduce is a good abstraction for implementing linear algebra,
and I expect the overhead to be too high (although I don't have numbers to
back that up). For large problems (much more than a couple of machines' worth
of RAM), you use big-iron HPC solutions, or you avoid 'exact' linear algebra
altogether in favor of one-pass algorithms.

For example, instead of computing an exact SVD, you use something like a
Hebbian algorithm to compute the SVD in a streaming manner (that's what
Mahout implements, for example).
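
To make the streaming idea concrete, here's a rough sketch (my own, not
Mahout's code) of Oja's rule, which refines an estimate of the leading
principal direction one sample at a time; the plain-Array representation and
the oja_update name are just illustrative:

        # One step of Oja's rule: update the estimate w of the leading
        # principal direction from a single sample x, without ever
        # holding the full data matrix in memory.
        def oja_update(w, x, eta)
          y = w.zip(x).sum { |wi, xi| wi * xi }   # projection y = w . x
          # w <- w + eta * y * (x - y * w); the -y^2 * w term keeps w
          # close to unit length.
          w.each_index.map { |i| w[i] + eta * y * (x[i] - y * w[i]) }
        end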

------
sunkencity
Math in Ruby can be greatly sped up by using the NArray library
<http://narray.rubyforge.org/> Method list here:
<http://narray.rubyforge.org/SPEC.en>
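
For instance, the sums in a least-squares fit vectorize nicely (a minimal
sketch; the data values are made up, and I'm only assuming NArray's
elementwise operators and sum):

        require 'narray'

        x = NArray[1.0, 2.0, 3.0, 4.0, 5.0]
        y = NArray[2.0, 4.1, 5.9, 8.2, 9.9]
        n = x.size

        # Slope and intercept of the least-squares line, computed with
        # vectorized elementwise products instead of Ruby loops.
        slope = (n * (x * y).sum - x.sum * y.sum) /
                (n * (x * x).sum - x.sum**2)
        intercept = (y.sum - slope * x.sum) / n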

------
gammarator
...and at the end he shows you how to do the same thing with two lines of R.

It's very useful to code basic statistical algorithms yourself so you
understand how they work, but for any serious analysis you'll get more
reliable and performant results with a library.

~~~
ucsd_surfNerd
That is exactly what I was going for: implement the basic algorithm yourself
to help people understand the math, but don't use this code in a production
system.

In fact I perform the vast majority of my statistical analysis in R.

Ruby is just a fun language to implement basic statistical algorithms in, and
the Ruby community as a whole hasn't put a lot of emphasis on stats.

------
zmjones
And this is why we don't do statistical programming in Ruby.

------
gergles
I think it is especially important to note that linear regression assumes that
the relationship between the variables is, well, _linear_, and that in the
real world it very rarely actually is.

At best, a big asterisk should accompany any of these results if you didn't
have someone with actual experience validate your design/proposed analyses
first.

~~~
rohitarondekar
You can use the same techniques used in linear regression (with multiple
features) to do polynomial regression. For example, suppose you have two
features x1 and x2; you can add higher-order features like x1*x2, x1^2, x2^2,
or a combination of these. While doing linear regression you treat these terms
as individual features, so x1*x2 becomes a feature, say x3. This way you can
fit non-linear data with a non-linear curve. However, there is a problem of
overfitting: your curve may try to be too greedy and fit the data perfectly,
but that's not what you want. So regularization is used to lower the
contributions of the higher-order terms.
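
As a hypothetical illustration using only Ruby's stdlib Matrix class (the
data values here are made up), you can fit a quadratic by treating 1, x, and
x^2 as three "linear" features and solving the normal equation:

        require 'matrix'

        xs = [1.0, 2.0, 3.0, 4.0, 5.0]
        ys = [2.1, 4.8, 10.2, 17.5, 27.1]

        # Design matrix with a constant column (x0 = 1), x, and x^2.
        x_mat = Matrix[*xs.map { |x| [1.0, x, x**2] }]
        y_vec = Matrix.column_vector(ys)

        # Normal equation: theta = (X^T X)^(-1) X^T y
        theta = (x_mat.transpose * x_mat).inverse * x_mat.transpose * y_vec
        p theta.to_a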

Wikipedia has an article on Polynomial Regression:
<http://en.wikipedia.org/wiki/Polynomial_regression>

P.S. I'm doing this course <https://www.coursera.org/course/ml>, so my
knowledge may not be entirely correct; take everything I've said with a pinch
of salt. :)

------
niggler
A cool followup post would be an incremental solution that uses Ruby blocks.

Here: <http://news.ycombinator.com/item?id=4508837>
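
Something like this rough running-sums sketch could be a starting point
(untested; the class name IncrementalRegression is made up):

        # Incremental simple linear regression: each new (x, y) pair
        # updates a handful of running sums, so the fit is O(1) per sample.
        class IncrementalRegression
          def initialize
            @n = 0
            @sum_x = @sum_y = @sum_xx = @sum_xy = 0.0
          end

          def add(x, y)
            @n += 1
            @sum_x  += x
            @sum_y  += y
            @sum_xx += x * x
            @sum_xy += x * y
            self
          end

          # Closed-form least-squares slope and intercept from the sums.
          def fit
            slope = (@n * @sum_xy - @sum_x * @sum_y) /
                    (@n * @sum_xx - @sum_x**2)
            [(@sum_y - slope * @sum_x) / @n, slope]
          end
        end

        # Blocks fit naturally into the feeding step:
        reg = IncrementalRegression.new
        [[1.0, 2.0], [2.0, 3.9], [3.0, 6.1]].each { |x, y| reg.add(x, y) }
        intercept, slope = reg.fit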

------
sciboy
B = (X^T X)^(-1) X^T Y, anyone?

~~~
rohitarondekar
Computing (X^T X)^(-1) can be slow if the number of features is large, since
inverting a matrix is expensive. Also, unless you use the pseudo-inverse (pinv
in Octave), you need to take care of degenerate cases. However, you can use
regularization, i.e. replace (X^T X)^(-1) with (X^T X + lambda*W)^(-1), where
lambda is the regularization parameter and W is a matrix of the form:

    
    
      |0 0 0|
      |0 1 0|
      |0 0 1|

i.e. the identity matrix with the (0,0) entry set to 0.

This ensures that the matrix is now invertible. Regularization takes care of
overfitting.
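
In Ruby, with the stdlib Matrix class, the regularized normal equation could
look like this (a minimal sketch; ridge_fit is a made-up helper name):

        require 'matrix'

        # theta = (X^T X + lambda * W)^(-1) X^T y, where W is the identity
        # with W(0,0) zeroed so the bias term (for the constant x0 = 1
        # feature) is not penalized.
        def ridge_fit(x_mat, y_vec, lambda_)
          n = x_mat.column_count
          w = Matrix.diagonal(*([0.0] + [1.0] * (n - 1)))
          (x_mat.transpose * x_mat + w * lambda_).inverse *
            x_mat.transpose * y_vec
        end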

P.S. I'm an ML n00b doing the Machine Learning course on Coursera, so I might
be unaware of the more practical aspects of the above. :D

~~~
beagle3
All regularization work I'm aware of uses W = I (an identity matrix). Where
did you find this zero-origin matrix?

Note that your W does not guarantee invertibility; e.g., if your original
(0,0) entry is already 0.

~~~
rohitarondekar
This was shown by Professor Andrew Ng in the Coursera ML class that's
happening right now.

Given n features x1 to xn, we introduce a feature x0 that is always set to 1.
During the regularization lectures the professor said that we don't need to
regularize theta0 (the parameter for x0) because it doesn't make a difference.
I believe this is why W(0,0) is set to 0.

The lectures are a little light on the maths, i.e. the professor explains only
enough maths to motivate the techniques, so I'm not aware of more details. I'm
planning on watching some linear algebra lectures to fill in the gaps. :)

Re: invertibility, according to the professor, if lambda > 0 then the matrix
will be invertible. Again, I'm not 100% sure whether this is true.

~~~
beagle3
Ok, that clears it up:

He doesn't need to set W(0,0) to 1 specifically _because_ he sets x0 to 0
(which guarantees a non-zero value in the covariance matrix).

But the standard way to do L2 regularization (also known as "ridge
regression") is to add a scaled identity matrix (the entire diagonal set to be
nonzero).

~~~
rohitarondekar
You mean set x0 to 1, right?

People who do linear regression at work don't add an x0 feature? During the
lecture the prof. only said that adding x0 = 1 for all m samples is a
convention and helps simplify the computation. Unless I missed something
during the lecture, that's the only explanation that was given.

~~~
beagle3
Yes, I did, thanks.

> People who do linear regression at work don't add a x0 feature?

Sometimes they do; sometimes the data already has a subset known to sum to 1
(e.g., if you have binary variables that reflect "one of n choices", one of
which must be set), and in this case adding x0 = 1 makes things worse (from a
numerical perspective) for many algorithms.

Regardless, I've always seen regularization theory stated with lambda*identity
matrices.

