

Show HN: Hacker News, automagically organized - bravura
http://metaoptimize.com/projects/autotag/hackernews/

======
tsycho
Awesome stuff! I tried a couple of searches, and the search ranking looks
decently good (at first glance, at least).

I also like the fading of text as the results become less relevant.

What kind of relevance algorithms are you using? Since this site is targeted
at hackers, who tend to like more control, you could expose some of the
parameters and allow users to tweak their relevance.

For instance, sliders that let you determine the importance of article age, #
of comments in the article, karma points of article, avg karma points of
readers, and of course the pattern match counts...
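
A minimal sketch of what slider-driven ranking could look like, assuming each article carries the signals listed above. The `Article` fields, the weight names, and the 30-day age scale are all hypothetical; nothing here reflects how the site actually ranks results, and average reader karma is left out for brevity.

```python
import math
from dataclasses import dataclass


@dataclass
class Article:
    title: str
    age_days: float      # time since submission
    num_comments: int    # size of the comment thread
    points: int          # karma points of the article
    match_count: int     # how often the query matched the text


# Hypothetical slider positions; each weight scales one signal.
DEFAULT_WEIGHTS = {"age": 0.25, "comments": 0.25, "points": 0.25, "matches": 0.25}


def relevance(article, weights=DEFAULT_WEIGHTS):
    """Combine the signals into one score using user-chosen weights."""
    features = {
        # Newer articles score closer to 1, older ones decay toward 0.
        "age": 1.0 / (1.0 + article.age_days / 30.0),
        # log1p dampens very large counts so one signal cannot dominate.
        "comments": math.log1p(article.num_comments),
        "points": math.log1p(article.points),
        "matches": math.log1p(article.match_count),
    }
    return sum(weights[name] * value for name, value in features.items())


def rank(articles, weights=DEFAULT_WEIGHTS):
    """Return articles sorted from most to least relevant."""
    return sorted(articles, key=lambda a: relevance(a, weights), reverse=True)
```

Moving a slider then amounts to changing one entry in the weights dict and calling `rank` again.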

~~~
bravura
_I tried a couple of searches, and the search ranking looks decently good (at
first glance, at least)._

It's based solely upon the text in the comment thread. I was actually pretty
surprised this worked as well as it did. I am currently crawling outlinks,
which should hopefully improve relevancy even more, as well as discover more
topics.

_Since this site is targeted at hackers, who tend to like more control, you
could expose some of the parameters and allow users to tweak their relevance._

Rather than drilling specifically into Hacker News, I'm more interested in
exposing functionality to other hackers by building an API that will allow
them to auto-organize their sites too. The one issue is that indexing is a
batch, off-line process, and most APIs are built in a real-time, on-demand
setting.

------
petercooper
The other day I thought how cool it would be to have a Web service that could
crawl your site and auto categorize all of your pages (or at least help you to
do it). As ever, turns out someone is on the case ;-) Nice work! I think
there's definitely a wider audience for this technology.

~~~
bravura
_I think there's definitely a wider audience for this technology._

What audiences do you see for this technology?

Also, how would you expand the audience for this technology?

Possible options:

* Auto-crawl content and automagically organize it, without involving the content owner. (The Google approach.)

* Build a turn-key solution where people can upload their content and get the index returned to them. (An API approach.)

* Talk to businesses directly, and make one-on-one deals. (An enterprise/B2B approach.)

~~~
hendler
There are lots of uses for this, but my main advice is to not lose these
three things:

1\. advantage of relevancy within specific domains. PageRank was a huge
value-add to relevancy over other search engines, but Internet-wide is now too
ambitious. HN is a great corpus because the content is already vetted by a
community. The work of integrating other specialized communities' content can
give density and relevancy.

2\. ease-of-use in integration. The less configuration to use this API, the
better. Autotagging, done well, is very useful. I have a lot of ideas around
this if you'd like to chat some time.

3\. ease-of-use in the interface. Combining browsable, faceted search with NLP
is, I think, the sweet spot between getting lots of relevant results and
allowing for discovery.

~~~
techbio
Mostly #1, but agreed with all. Especially so as to leverage a managed topic
domain into a transferable form of domain knowledge.

------
hendler
Using Python's NLTK and Lucene can produce results like this. I wrote
something similar using WordNet, PHP/Zend Lucene, and Freeling (C++ NLP)
for NewsCup.

I think what makes this project interesting to me is the interface and quality
of search results. They show a really good understanding of how to use NLP and
search in conjunction.

Nice work.

------
jcroberts
This is not a complaint, but simply a bug report.

With javascript disabled, if you type something in the provided text box and
hit enter, you end up with an error message:

    
    
      Not Found
      The requested URL /projects/autotag/php/search.php was not found on this server.
    

I just figured you'd want to know.

~~~
adammichaelc
I have JS enabled and am getting the same error.

------
locopati
Very cool - can you add dates to the lists of articles? It would be nice to see
how old an article is (or maybe color code the list to provide two axes of
relevance).

~~~
pudquick
Agreed on this point - it was the first thing I looked for.

I was hoping to use this site as an alternate view into HN to find semi-recent
submissions covering specific topics.

In addition, it might be nice to expose the popularity of the article somehow.
Something that got 24+ votes is probably going to be more relevant to me than
something with only 2.

------
techbio
I love this. I am working on a related project (the result, not HN) inspired
by Paul Graham's Naive Bayes Spam Filter.

If you have a moment, I would like to hear more about your architecture and
interface. It is responsive, clean, accurate, and multi-device ready, and
clearly an implementation to be replicated.

~~~
bravura
Wow, you seriously have a lot of side-projects (<http://techbio.org/>). Drop
me an email, and we'll talk.

~~~
techbio
Your contact page quickly led me to here:
<http://pypi.python.org/pypi/topia.termextract/>

and: <http://www.metaoptimize.com/qa>

Great stuff.

------
hipcat
I appreciate your work on this (in the past I've just used google with the
site option). As a nitpick, one thing I'd do is remove the fluff from the
bottom of the page and give an indicator that the large box is for a search
term (as opposed to several lines of text which is what it looks like). In
fact, a normal sized box would work just fine.

I realize that there is a line that instructs you to enter text in the box,
but to me it got lost in the noise of the page (another reason for getting rid
of the extra details at the bottom). On the results page, it'd be nice if you
showed the results foremost and the related topics as a sidebar. I want my
results, and only if I don't find what I'm looking for do I want to know the
code's opinion of where I should try to go next.

Anyways, like I said those are just nitpicks. Nice job.

------
nck4222
Very very cool.

A couple problems though. For terms with spaces in them it creates a tag for
each word. For example "stack overflow" has tags for "stack" and "overflow",
which aren't all that useful (although yes the tag stackoverflow is created):
<http://metaoptimize.com/projects/autotag/hackernews/term/5a/stack-overflow.html>?

What would be a solution to this? Parse the text so each token includes the
first space it encounters and see if that multi word token occurs frequently?
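
One way to read that suggestion, as a rough sketch: count adjacent word pairs across all documents and keep the pairs that recur. The function name and the `min_count` threshold are made up for illustration, and this is not the site's actual method.

```python
import re
from collections import Counter


def frequent_bigrams(documents, min_count=2):
    """Return {"word1 word2": count} for adjacent word pairs seen at least min_count times."""
    counts = Counter()
    for text in documents:
        # Lowercase and split into simple alphanumeric tokens.
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        # Pair every token with its right-hand neighbour.
        counts.update(zip(tokens, tokens[1:]))
    return {" ".join(pair): n for pair, n in counts.items() if n >= min_count}
```

Raw counts will also surface filler pairs such as "of the" on a real corpus, so a stopword filter or an association score (for example pointwise mutual information) would likely be needed on top of this.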

The other problem is stripping out punctuation. The tag c++ doesn't exist,
neither does c#. Not sure how you'd include relevant punctuation and strip out
the rest.
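
A possible approach, sketched under the assumption that a small hand-maintained allow-list is acceptable: match the known punctuated terms first and fall back to plain alphanumeric tokens for everything else. `SPECIAL_TERMS` and `tokenize` are hypothetical names.

```python
import re

# Hypothetical allow-list of terms whose punctuation must survive tokenizing.
SPECIAL_TERMS = ["c++", "c#", "f#"]

# Longest terms first so a longer term is never shadowed by a shorter prefix.
_special = "|".join(
    re.escape(term) for term in sorted(SPECIAL_TERMS, key=len, reverse=True)
)

# Try an allow-listed term that stands on its own, otherwise a plain word.
TOKEN_RE = re.compile(rf"(?<![a-z0-9])(?:{_special})(?![a-z0-9+#])|[a-z0-9]+")


def tokenize(text):
    """Lowercase the text and split it, keeping allow-listed terms intact."""
    return TOKEN_RE.findall(text.lower())
```

For example, `tokenize("I prefer C++ and C#, not F#.")` keeps `c++`, `c#`, and `f#` as single tokens while still dropping the comma and the final period. The trade-off is that every such term has to be listed by hand.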

------
pchristensen
Awesome! Gabriel, could you please get these search results into Duck Duck Go?

~~~
kno
Nice, some folks will demote you here for any reason.

------
jackfoxy
Nice work. This gets a place on my bookmark bar.

When can you get it current?

Minor note: found that F# does not index.

------
naner
Pretty cool. The only goof I found was that it thinks that Emacs is plural for
'Emac'.

------
rgrieselhuber
Very nicely done. Would love to see some stats about how long this took, in
terms of crawling time, indexing time, etc.

Also, would love to hear more about the tools behind it.

------
r11t
Feature request: Ability to browse list of existing tags would be great apart
from the auto-complete feature.

------
bigbang
Cool. How do you find the related topics to a given topic? What api do you use
for that? Google suggestions?

------
HNer
Could be really interesting for creating silos within sites: automagically
creating navigation bars where all the nav links are relevant to the page
currently being viewed, removing unrelated clutter and making site navigation
much more relevant.

------
fabiandesimone
Wow! Excellent stuff!

Just added it to my Utilities bookmark folder.

Congrats!

------
ifesdjeen
great! i've been working on the same exact thing for a month already! good
job! :) glad to know that my idea existed in someone else's mind.

------
ceejayoz
No minecraft tag?

~~~
seancron
The index is out of date. Minecraft wasn't around in October 2009.

~~~
Grouper
It would be interesting to see what new tags pop up over the years. And maybe
trends in tag usage etc.

For example, to see a Google Trends-style chart for Ruby vs Python tags.
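
The data behind such a chart could be computed roughly like this, assuming each story is available as a `(year, tags)` pair. That input shape and the `tag_trends` name are assumptions for the sketch, not something the site exposes.

```python
from collections import Counter, defaultdict


def tag_trends(stories):
    """Map each tag to {year: share of that year's stories carrying the tag}."""
    per_year = defaultdict(Counter)   # year -> tag -> number of stories
    totals = Counter()                # year -> number of stories
    for year, tags in stories:
        totals[year] += 1
        # set() so a tag repeated on one story is only counted once.
        per_year[year].update(set(tags))

    trends = defaultdict(dict)
    for year, counts in per_year.items():
        for tag, n in counts.items():
            trends[tag][year] = n / totals[year]
    return dict(trends)
```

Using shares rather than raw counts keeps years with different submission volumes comparable, which is what a Ruby vs Python comparison would need.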

------
webXL
Nice work. Why does it stop at Oct. 13, 2009?

~~~
bravura
Mike Cheng (<http://searchyc.com>) gave me this data dump a year ago. I am
currently crawling the rest of Hacker News, as well as outlinks, to fill out
this index.

Consider the site right now a proof-of-concept. I'm trying to gauge people's
interest level, and get feedback.

------
finemann
Awesome work mate :)

