Hacker News

I studied a fair amount of NLP (a true passion of mine) at school and after I graduated I spent several months working on tech which did this (and other things). That was intended to be a startup, but sadly, at the time, my business sense sucked and I couldn't decide on a good product to fit the tech to (the fact that I was developing tech before I had a strong sense of my product is already telling).

I have since started a completely different (and profitable) company, and the code has just been bit-rotting. I'm not sure what I should do with it. Keep it around in case I ever decide to pursue a business model like some of these companies (I probably don't have time for that)? Open source it (time-consuming to clean up the code, and what do I stand to gain from that)? I guess I could use the open-sourced stuff to help me find freelance contracts, but I just don't see a lot of remote NLP work being offered.

Still, I hate seeing the code rot...

Dump it on GitHub. Someone will come along and clean it up for you.

What is it written in?

Mostly Python. Some C. Some R to play around with datasets and prototype algorithms.

Dump it on a hub and provide a link, or just provide a link to a gz or zip version...

What does the tech do exactly?

The bulk of it was:

News article title extraction.
News article relevant thumbnail extraction.
News article text body extraction.
Generating publicly traded stock symbols from business news articles.
Some Techmeme-style document clustering.
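For anyone wondering what the title and body extraction pieces look like in their simplest form, here is a minimal sketch using only the Python standard library. It is not the code described above, and it is far cruder than Readability-style scoring: it prefers an `og:title` meta tag over `<title>`, and it treats any `<p>` with at least `min_words` words as body text. The names `ArticleSketch`, `extract_article`, and `min_words` are made up for this illustration.

```python
from html.parser import HTMLParser


class ArticleSketch(HTMLParser):
    """Collect <title>, og:title, and the text of each <p> element."""

    SKIP = {"script", "style"}  # never treat their contents as text

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title = ""
        self.og_title = ""
        self.paragraphs = []
        self._in_title = False
        self._skip_depth = 0
        self._p_buf = None  # text chunks collected while inside a <p>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("property") == "og:title":
            self.og_title = (attrs.get("content") or "").strip()
        elif tag == "p":
            self._flush()  # tolerate unclosed <p> tags
            self._p_buf = []

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False
        elif tag == "p":
            self._flush()

    def handle_data(self, data):
        if self._skip_depth:
            return
        if self._in_title:
            self.title += data
        elif self._p_buf is not None:
            self._p_buf.append(data)

    def _flush(self):
        if self._p_buf is not None:
            text = " ".join("".join(self._p_buf).split())
            if text:
                self.paragraphs.append(text)
            self._p_buf = None


def extract_article(html, min_words=8):
    """Return (title, body) using the naive heuristics described above."""
    parser = ArticleSketch()
    parser.feed(html)
    parser.close()
    title = parser.og_title or " ".join(parser.title.split())
    # Assumption: short paragraphs are navigation or share widgets.
    body = [p for p in parser.paragraphs if len(p.split()) >= min_words]
    return title, "\n\n".join(body)
```

Calling `extract_article(page_html)` returns a `(title, body)` tuple. The word-count cutoff is the weakest part: it will drop short but legitimate paragraphs and keep long boilerplate, which is exactly the kind of case the real extractors spend most of their effort on.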

I am working on a project (more of a public service than a startup) that needs this. I've looked through all of the resources linked in the articles above and nothing works as well as I need it to. The best performer is Readability, so I will probably be going with the Python port of that.

If your code works well, I also think you should put it up on GitHub. You can see what I intend to use this technology for by reading this text snippet: https://github.com/sbuss/revisionews/blob/develop/web/index....

Something similar in Ruby that I worked on last year (currently in the middle of a re-architecture/rewrite due to its flaws): http://github.com/peterc/pismo

This looks pretty great, I'll definitely keep an eye on it. Thanks :)

Which Python port are you using? Last time I looked, all the Python Readability code I could find was either incomplete, old, or buggy.

I've experimented with https://github.com/gfxmonk/python-readability but it's extremely slow. There's a decent fork called decruft that is a couple of orders of magnitude faster: http://www.minvolai.com/blog/decruft-arc90s-readability-in-p...

Decruft also has a couple of bug fixes to python-readability. They both need a lot of work, though. You'll have to do some spelunking to figure out how to actually call the libraries correctly.
