
Surprise – a simple recommender system library for Python - danso
http://surpriselib.com/
======
bedros
GPLv3 will kill this project.

This is a library, and it should have a library license: LGPLv3 at the least.

~~~
Niourf
Hi, I'm Nicolas, Surprise's author. Thanks for the feedback!

You're not the first person to tell me GPL may not be a good idea and to be
fair I never really thought it would make such a difference. Apart from LGPL,
what other license would you recommend?

~~~
bedros
Most Python-based web applications use Django, and Django is BSD-licensed.

[https://github.com/django/django/blob/master/LICENSE](https://github.com/django/django/blob/master/LICENSE)

A recommendation engine is useful for all kinds of web apps, and licensing it
under BSD would make it easier for Django web apps to adopt.

Thanks,

~~~
Niourf
Thanks for the input, I changed the license to BSD!

~~~
zeveb
Be aware that if you license your work under the BSD license, someone else can
take your work, change it, improve it, and sell it, and never share their
changes with you.

So Netflix could take your recommender system, use it to make millions and do
nothing but mention you in a footnote somewhere on their website.

~~~
oliwarner
And they wouldn't even need to do that.

------
nkozyra
What is the general approach to doing this in realish time?

The benchmarks on this make that untenable. You could probably roll your own
KNN and get into the low seconds, but that still won't work.

Is the approach to build generalized profiles ahead of time and then compare
against those in real time?
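
To make the "profiles ahead of time" idea concrete, here is a minimal sketch in plain Python of one common way to split the work: compute item-to-item neighbors offline, then do only cheap lookups per request. This is not Surprise's API. The function names `build_item_neighbors` and `recommend`, the cosine similarity, and the `user -> {item: rating}` layout are all assumptions made for illustration.

```python
from collections import defaultdict
from math import sqrt


def build_item_neighbors(ratings, k=2):
    """Offline step: for each item, precompute its k most similar items.

    ratings: dict mapping user -> {item: rating} (hypothetical layout).
    Returns a dict mapping item -> list of (similarity, other_item).
    """
    # Invert to item -> {user: rating} so each item is a sparse vector.
    by_item = defaultdict(dict)
    for user, items in ratings.items():
        for item, rating in items.items():
            by_item[item][user] = rating

    def cosine(a, b):
        common = set(a) & set(b)
        if not common:
            return 0.0
        dot = sum(a[u] * b[u] for u in common)
        norm_a = sqrt(sum(v * v for v in a.values()))
        norm_b = sqrt(sum(v * v for v in b.values()))
        return dot / (norm_a * norm_b)

    neighbors = {}
    for i, vec_i in by_item.items():
        # Brute-force all pairs; this is the slow part, so it runs offline.
        sims = [(cosine(vec_i, vec_j), j)
                for j, vec_j in by_item.items() if j != i]
        sims.sort(reverse=True)
        neighbors[i] = [(s, j) for s, j in sims[:k] if s > 0]
    return neighbors


def recommend(user_ratings, neighbors, n=3):
    """Online step: score unseen items using only precomputed neighbors."""
    scores = defaultdict(float)
    for item, rating in user_ratings.items():
        for sim, other in neighbors.get(item, []):
            if other not in user_ratings:
                scores[other] += sim * rating
    return sorted(scores, key=scores.get, reverse=True)[:n]
```

The request-time cost depends only on how many items the user has rated and on `k`, not on the total number of users, which is why this kind of split is usually what makes neighborhood methods usable at serving time. The neighbor table goes stale as new ratings arrive, so it has to be rebuilt periodically.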

------
abetusk
Are there any actually free/libre recommendation data sets? All that I can
find, including the ones used by Surprise (MovieLens, etc.), look to be under
restrictive licenses.

~~~
danso
My first instinct was to point you to the page of a colleague, Justin Grimmer,
who does analysis of free-form text such as Congressional press releases (and
other statements). I swear he had direct links to zip files of this bulk text
data, but I can't find them at a quick glance. Here's his research listing just
to give an idea of what he's able to analyze from this public data:
[http://www.justingrimmer.org/research.html](http://www.justingrimmer.org/research.html)

As for why there aren't newer public datasets, or at least a wider variety of
them, as canonical as iris, diamonds, Titanic survivors, MovieLens, etc., I'm
baffled too (being a recent newcomer to academia).

If scraping is difficult for you, Socrata has made finding data far easier
than it was a few years ago. Here's a great place to start:
[http://www.opendatanetwork.com/](http://www.opendatanetwork.com/)

~~~
abetusk
I can program so scraping is pretty easy. Finding data sets that aren't
restricted is the hard part.

Thanks for the links.

~~~
danso
If you have an interest in U.S. politics, check out the github/unitedstates
repo, which is full of data and assets, particularly about legislators and
their votes:

[https://github.com/unitedstates](https://github.com/unitedstates)

Sunlight Labs, the tech arm of the Sunlight Foundation, is sadly closing down,
but they passed many of their projects to ProPublica. Here's a writeup of that
to give an idea of what the projects contained (in terms of data):
[https://www.propublica.org/nerds/item/sunlight-labs-takeover...](https://www.propublica.org/nerds/item/sunlight-labs-takeover-update)

Just in case you don't care about the U.S., it must be mentioned that the UK
has the best country-level data repository:
[https://data.gov.uk/](https://data.gov.uk/)

------
amelius
Perhaps HN can use this to recommend stories (?)

~~~
Bombthecat
Or something for WordPress :)

~~~
amelius
Or something for web browsers in general :)

But then you'd need to share all your URLs, and I'm not sure everybody would be
comfortable with that.

