Hacker News new | past | comments | ask | show | jobs | submit login
Performing Linear Regression Using Ruby (sharethrough.com)
64 points by ucsd_surfNerd on Sept 11, 2012 | hide | past | web | favorite | 40 comments

We do regressions with rb-gsl.

    require 'gsl'
    x = GSL::Vector.alloc(array_of_x_values)
    y = GSL::Vector.alloc(array_of_y_values)
    c0, c1, cov00, cov01, cov11, chisq, status = GSL::Fit::linear(x, y)
It's not nearly as much work, and it's much faster than doing it in pure Ruby. :)

(It also does weighted regressions and exponential fitting, among a host of other things. That wheel's gone done been invented already.)

As pointed out by Tony Arkles in blog post comments, GSL is a bit restrictive in it's licensing:

  GSL can be used internally ("in-house") without restriction, but only redistributed in other software that is under the GNU GPL.

I did look at GSL :). I decided to write it by hand because I wanted to help people understand the underlying math. I find that too many people use libraries without understand the math which can become problematic especially when performing statistical analysis.

Thanks for pointing out GSL. I appreciate the feedback.

It's definitely good to understand the underlying math, and it's a good article on "this is how to translate a formula into Ruby code", but given that the article is positioned as "We needed to solve this problem, and this is how we solved it", it seems like it'd make more sense to focus on solving it with the least work and the best performance, which is why I mentioned GSL.

It's great to see other people doing statistical work in Ruby, though, so please don't let my criticism keep you from continuing to do it! :)

I appreciate the feedback, especially when it is constructive criticism. It helps me understand how other people were interpreting the post. Perhaps I got the positioning slightly wrong as I wanted it to be more about teaching the basic math and how to translate that into Ruby. I will try and get the positioning better next time.

While we do some stats in Ruby it definitely doesn't represent the entire "this is how we solved it". In fact we use a large amount of R and Java to solve all our statistics problems.

Ruby is great for data prep, basic calculations, web app development, and scraping aggregating data and R for visualization. I find them to be a joyful combination. Great for small data sets, quick estimations, and various small projects.

Larger data sets and performance intensive operations are better handled in Python (or Java, C++, etc). Lots of statistical analysis is way below the threshold of Hadoop and company.

I mostly agree with this, because I do like Ruby even though I don't use it.

The missing link here, and the reason Python gets more love from the data community, is that Python scales down to the smaller data sets as well as it handles big ones. (Not sure if you ment it couldn't, but the distinction you make implies that.)

Python is surprisingly heavy-duty. But my kingdom for a seamlessly distributed or parallelized version of NumPy/SciPy! How nice would it be to just enter "C = A * B", with A living as a sparse CSC across many nodes?

Would Disco (http://discoproject.org/) work for you?

I don't think MR is a good abstraction for implementing linear algebra, and I expect the overhead to be too high (although I don't have numbers to back that up). For large problems (>> couple of machines worth of RAM), you use big iron HPC solutions, or you avoid 'exact' linear algebra altogether to focus on one-pass algorithms.

For example, instead of computing an exact SVD, you will use something like Hebbian algorithm to compute the SVD in a streaming manner (that's what Mahaout implements for example).

No, the sparse matrix code in SciPy is plain C (not even multi-core, let alone distributed).

EDIT: or did you mean Disco offers distributed sparse CSC operations?

we, http://continuum.io/, are working on this.

Agreed, Python scales down and so is also good for small tasks. What I was saying is that Ruby - as much as I like it - does not generally scale up beyond a certain point.

Both R and Ruby have had issues with large data sets which have been addressed to some degree in more different distributions and more recent releases. Python is ready out of the box for large data sets. So what I meant to communicate is that if you know that you are going to be dealing with a large data set, you might as well go straight to python.

I totally agree that Ruby + R makes a great data toolbox. I use Ruby + R for almost all of my day to day exploratory data analysis work.

As far as python goes I think python has become more popular simply because it has more community around data applications. Unfortunately a lot of people view ruby as just Rails. I think academia's adoption of python has also helped it grow into a data analysis language.

Really any MR you do with python you could do with Ruby as it all uses hadoop streaming.

Math stuff in ruby can be greatly speeded up by using the NArray library http://narray.rubyforge.org/ Method list here: http://narray.rubyforge.org/SPEC.en

...and at the end he shows you how to do the same thing with two lines of R.

It's very useful to code basic statistical algorithms yourself so you understand how they work, but for any serious analysis you'll get more reliable and performant results with a library.

That is exactly what I was going for. Basic algorithm yourself to help people understand the math but you wouldn't want to use this code in a production system.

In fact I perform the vast majority of my statistical analysis in R.

Ruby is just a fun language to implement basic statistical algorithms in and it the Ruby community as a whole doesn't hasn't put a lot of emphasis on stats.

Totally agree. And if you want to call R from ruby to export the analytics to a web page for example, there are many options, such as this one: https://github.com/clbustos/Rserve-Ruby-client

And this is why we don't do statistical programming in Ruby.

I think it is especially important to note that linear regression assumes that the relationship between the variables is, well, linear, and that in the real world, it very rarely actually is.

At best, a big asterisk should come from any of these results if you didn't have someone with actual experience validate your design/proposed analyses first.

You can use the same techniques used in Linear Regression (with multiple features) to do Polynomial Regression. For example suppose you have two features x1 and x2, you can add higher order features like x1x2 or x1^2 or x2^2 or a combination of these. While doing linear regression you treat these terms as individual features so x1x2 is a feature say x3. This way you can fit non-linear data with a non-linear curve. However there is a problem of overfitting, i.e your curve may try to be too greedy and fit the data perfectly, but that's not what you want. So Regularization is used to lower the contributions of the higher order terms.

Wikipedia has an article on Polynomial Regression: http://en.wikipedia.org/wiki/Polynomial_regression

P.S I'm doing this course https://www.coursera.org/course/ml so my knowledge may not be entirely correct so take everything I've said with a pinch of salt. :)

Someone once said that all models are wrong, but some are more useful than others (paraphrasing).

What would be a cool followup post is an incremental solution that used ruby blocks.

Here: http://news.ycombinator.com/item?id=4508837

B = (X^TX)^(-1)X^TY anyone?

In computing (X^TX)^(-1) if the number of features is large then it can be slow as computing the inverse of a matrix is slow. Also unless you use pseudo inverse (pinv in octave) you need to take care of degenerate cases. However if you use Regularization i.e replace the (X^TX)^(-1) with (X^TX + lambda*W)^(-1), where lambda is the regularization parameter and W is a matrix of the form:

  |0 0 0|
  |0 1 0|
  |0 0 1|
i.e identity matrix with (0,0) set to 0

This ensures that the matrix is now invertible. Regularization takes care of overfitting.

P.S I'm a ml n00b doing Machine Learning course on Coursera so I might be unaware of more practical knowledge of the above. :D

All regularization work I'm aware of uses W=I (an identity matrix). Where did you find this zero origin matrix?

Note that your W does not guarantee invertability - e.g., if your original (0,0) is already 0.

This was shown by Professor Andrew in the Coursera ML class that's happening right now.

Given n features x1 to xn we introduce x0 feature which is always set to 1. During the Regularization lectures the professor said that we don't need to control (or regularize) the theta0 (the parameter for x0) because it doesn't make a difference. I believe this is the reason W(0,0) is set to 0.

The lectures are a little light on the maths, i.e the professor explains only enough maths to explain the techniques so I'm not aware of more details. I'm planning on watching some Linear Algebra lectures to fill in the gaps. :)

Re: Invertability, according to the professor, if lambda is > 0 then the matrix will be invertable. Again I'm not 100% sure if this is true or not.

Ok, that clears it up:

He doesn't need to set W(0,0) to 1 specifically because he sets x0 to 0 (which guarantees a non-zero value in the covariance matrix).

But the standard way to do L2 regularization (also known as "ridge regression") is to add a scaled identity matrix (the entire diagonal set to be nonzero)

You mean set x0 to 1, right?

People who do linear regression at work don't add a x0 feature? During the lecture the prof. only said that adding a x0=1 for all samples m, is by convention and helps simplify the computation. Unless I missed something during the lecture that's the only explanation that was given.

Yes , I did, thanks.

> People who do linear regression at work don't add a x0 feature?

Sometimes they do that; sometimes the data already has a subset known to have sum 1 (e.g., if you binary variables that reflect "one of n choices" which must be set), and in this case adding x0=1 makes things worse (from a numerical perspective) for many algorithms.

Regardless, I've always seen regulation theory stated with lambda*identity matrices.

DO NOT EVER use Penrose pseudoinverse in numerical computation. It is guaranteed to diverge.

Can you please explain why this is? Or maybe point to an explanation somewhere? Thanks.

Implementation tip - you don't need to invert that matrix! [1]. Whenever you see an equation of the form Ab = c and you want to find b, you should use a function equivalent to lu_solve (in the GNU Scientific Library) or the left-divide operator in Octave/Matlab (b = A\c), which will use the LU factorization algorithm [2] without ever computing A^(-1).

This is normally faster, more numerically stable and more space efficient. Even better, once you've computed the LU factorization of A once (which takes O(n^3) operations) you can then solve Ab = c for many different values of b and c in O(n^2) operations, by caching the factorization of A.

[1] http://www.johndcook.com/blog/2010/01/19/dont-invert-that-ma... [2] http://en.wikipedia.org/wiki/LU_decomposition

According to Wikipedia[1]Matrix Inversion is also O(n^3) and according to the note can be even better by inverting smaller matrices within and multiplying[2].

My linear algebra isn't very good so I'll have to look into LU Factorization but is there vast difference between the computational performance of the two operations? (Assuming you don't need solve Ab=c for different values of b and c)

Also is LU Factorization used often in machine learning instead of inverse?

[1]: http://en.wikipedia.org/wiki/Computational_complexity_of_mat...

[2]: Because of the possibility to blockwise invert a matrix, where an inversion of an n×n matrix requires inversion of two half-sized matrices and 6 mulitplications between two half-sized matrices, and since matrix multiplication has a lower bound of Ω(n2 log n) operations[17], it can be shown that a divide and conquer algorithm that uses blockwise inversion to invert a matrix runs with the same time complexity as the matrix multiplication algorithm that is used internally. source is [1]

The reason you want to avoid matrix inversion is not computational performance - it is precision.

You can generally "multiply by the inverse" without actually computing the inverse, in a way that needs less intermediate floating point precision.

If your matrices have a large spread of eigenvalues, it makes a lot of difference - double precision often doesn't have enough precision for direct inversion in the real world.

(And even if you do want to invert, it is more numerically stable to do that as a solution of Ax=e, for each 'e' a basis element, as long as you compute A^-1 \* e indirectly using a numerically stable method)

Thank you for explaining, I think I get it now. The ML course on Coursera is a little light on the Maths. As in the professor only explains as much is needed to perform the techniques. So I think I'll be watching Strangs Linear Algebra lectures on OCW soon. :)

Did somebody say B = (X'X)^(-1)X'Y?

Here is one of my favorite OLS-related abuses of linear algebra: y = XB + e ---> e = y - XB ---> e = y - X(X'X)^(-1)X'Y ---> e = [I - X(X'X)^(-1)X'Y]y = My

Now use the two idempotent matrices to compute the SSR: e'e = (My)'(My) ---> e'e = y'M'My ---> e'e = y'MMy ---> e'e = y'My

Definitely. The jump from univariate regression to multivariate regression is a non-obvious step because the univariate approach does not generalize in a straightforward fashion.

Put a pseudoinverse in there :)

Registration is open for Startup School 2019. Classes start July 22nd.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact