

Ask HN: Please review our startup Euraeka.com - haidut

Euraeka.com is an artificial intelligence search, discovery, and recommendation engine for news. The site is entirely algorithmic, but its ranking mechanisms are based on human preferences. Euraeka uses a massive machine learning model trained on the preferences of millions of users about which news is worth reading. However, it also includes objective metrics of content quality based on natural language processing. The objective metrics are used to dampen the crowd's tendency to be swayed by controversial fads. In fact, we can measure controversial topics directly, and you can use the sorting mechanism to get Controversial, Engaging or Popular news. One of the most important features of Euraeka is its deception detection mechanism. Again using natural language processing techniques, the site can detect deceptive intent in news articles. We think this feature is long overdue in the news market - i.e. essentially a barometer of lying. Along the same lines, the site can also detect political bias in news (liberal/conservative). Finally, the site can also learn a user's preferences and will recommend news based on those preferences.
Any feedback on features or user interface will be appreciated. More detailed info is available at http://www.euraeka.com/faq
======
aristus
Where do you get this data on millions of users?

The name is impossible to spell.

The misleading, engaging, etc filters are interesting, but kind of hand-wavy.
It's not apparent why and how "Miami Heat's Dwyane Wade sues ex-business
partner for libel" or "The SEVEN SECRETS of SMART PARENTS" are "misleading"
(and compared to what?).

You need an information/interaction designer to go over the site and make the
important things important. Right now nothing really catches the eye.

on /faq: " _ulterior_ motives", not "alterior"

Good luck!

~~~
haidut
We've been collecting top-rated news from Yahoo News, NYT, Washington Post,
etc. for the last 3 years. Almost all major sources have sections like "Most
viewed", "most read", "most emailed", etc. So as we have been collecting the
top-ranked news we also kept track of topics that have been top ranked at
multiple sources. For instance, if a news article on Iran's riots gets top
ranked at both NYT and Yahoo News, it gets more points in our training set.
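
A minimal sketch of the cross-source scoring described above. The rule of one
point per source, the function name, and the sample topics are assumptions
made for illustration; the thread does not say how points are actually
assigned.

```python
from collections import defaultdict


def cross_source_points(top_lists):
    """Count, for each topic, how many sources list it as top ranked.

    top_lists maps a source name to the topics in its "most viewed" style
    list. One point per source is an assumed rule, not the site's formula.
    """
    points = defaultdict(int)
    for topics in top_lists.values():
        # set() so a topic repeated within one source counts only once
        for topic in set(topics):
            points[topic] += 1
    return dict(points)


# Made-up sample data, for illustration only.
top_lists = {
    "NYT": ["iran riots", "recession report"],
    "Yahoo News": ["iran riots", "celebrity gossip"],
}
points = cross_source_points(top_lists)
```

Here the topic that is top ranked at both sources ends up with more points
than the topics that appear at only one.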

As far as your other point, "misleading" is really a less legally loaded word
than "deceptive". You are right, it's not very clear why the articles are
misleading, but the bottom line is that the language used in the article has
a higher rate of "deception markers" than other articles from that day. So
when you sort by Misleading, it's not really that the articles are deceptive
beyond a doubt, it's just the ones with a higher rate of deception markers
(i.e. content and structural indicators associated with deception). The
science behind it is pretty solid and comes from forensic psychiatry - i.e.
interviews/interrogations with criminals and analyzing their statements for
deception hints. So for lack of a better term, Euraeka essentially implements
a linguistic polygraph. Thanks for the other comments, we'll work on fixing
the issues.
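
To make "higher rate of deception markers" concrete, here is a rough sketch of
ranking articles by marker rate. The marker words below are placeholders
invented for this example and are not taken from any forensic research or from
Euraeka; the function names are hypothetical too.

```python
import re

# Placeholder cue words, invented for illustration only. They are NOT real
# forensic deception markers and not the site's actual list.
HYPOTHETICAL_MARKERS = frozenset({"honestly", "allegedly", "supposedly"})


def marker_rate(text, markers=HYPOTHETICAL_MARKERS):
    """Return the fraction of words in text that are in the marker set."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    return sum(word in markers for word in words) / len(words)


def sort_by_misleading(articles):
    """Sort (title, text) pairs so the highest marker rate comes first."""
    return sorted(articles, key=lambda a: marker_rate(a[1]), reverse=True)
```

The ranking is relative, which matches the point above: an article at the top
only has a higher marker rate than the others in the same batch.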

~~~
aristus
Are you conflating/clustering articles? ie what constitutes the "same story"
in your system?

Those "most viewed", etc boxes are often placed by editors, not by impartial
algorithms. How do you control for that?

~~~
haidut
Yes, we definitely have clustering. In fact, it's more extensive than simple
word distances like cosine b/c we also take into account synonyms and word
relationships (set membership, etc). For instance, in our system a sentence
like "Tiger chases antelope down the river" is very closely "related" to the
sentence "Lion is pursuing a buffalo by the lake" b/c both sentences
essentially say that a large cat is pursuing bovine prey near a water source.

In terms of the "most viewed" boxes and how we control for that - like I said,
we cross-track news on multiple news sites and weight the cross-posted ones
more heavily. We also cross-validated the most important tags for an article
using Google Trends data. Basically we tracked topics on multiple sites and
then performed some statistical analysis to see how those topics did over time
based on their presence on the web (topic momentum and longevity). We also run
a partial search engine in house that crawls a subset of the web so we can
ensure that the numbers we get from Google/Yahoo are legit.

Finally, there is a linguistic theory of topic popularity and how memes
propagate over time. We use some of that theory to control for the crowd
effect - i.e. sometimes people pick up and spread topics that are of no real
importance to the world. Example: Paris Hilton's latest escapades may be
widely discussed online and appear to be important news, but the latest report
on recession estimates and projections is of much higher "impact" to society.
So we try to account for/estimate some of that "impact".

Combining all factors gives an article a composite score. No two articles
really have the same score but a lot of articles cluster close to each other
in terms of their "importance" scores. We fed the articles into a machine
learning algorithm that is a combination of Support Vector Machine, Neural
Network, and Naive Bayes, and when a new article is fetched by our crawler the
model "predicts" its various scores (controversial, engaging, popular) based
on the data set that it has already learned.

Deception detection is much trickier and is almost entirely analysis based -
i.e. no machine learning there. There is quite a bit of research on deception
detection published online. Just search Google for "deception detection
ext:pdf" and it will come back with a lot of results.
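
The tiger/lion example from the start of this comment can be sketched roughly
as follows. This is only an illustration: the concept map and stopword list
are hand-written assumptions for these two sentences, and a real system would
presumably draw synonyms and set membership from a lexical database instead.

```python
import math
from collections import Counter

# Hypothetical hand-written map from words to a shared broader concept.
CONCEPTS = {
    "tiger": "big_cat", "lion": "big_cat",
    "chases": "pursue", "pursuing": "pursue",
    "antelope": "bovid_prey", "buffalo": "bovid_prey",
    "river": "water", "lake": "water",
}
STOPWORDS = {"is", "a", "the", "down", "by"}


def vectorize(sentence, concepts=None):
    """Bag of words; if a concept map is given, words are replaced by concepts."""
    concepts = concepts or {}
    tokens = [t.strip('.,"').lower() for t in sentence.split()]
    return Counter(concepts.get(t, t) for t in tokens if t and t not in STOPWORDS)


def cosine(a, b):
    """Cosine similarity between two Counter vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


s1 = "Tiger chases antelope down the river"
s2 = "Lion is pursuing a buffalo by the lake"

plain = cosine(vectorize(s1), vectorize(s2))
conceptual = cosine(vectorize(s1, CONCEPTS), vectorize(s2, CONCEPTS))
```

With plain word vectors the two sentences share no terms, so `plain` is zero.
Once each word is mapped to its broader concept the vectors coincide, which is
the sense in which the sentences are closely "related".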

------
ujjwalg
I think the concept is very intriguing and if what you say is what is on your
site and you have actually made it by collecting all the information in the
last 3 years from all the major news websites, I think you will end up being
bought pretty soon. Rather, what you should do is patent your process asap, if
you haven't done it already and then license it. Amazing and good luck.

~~~
haidut
The patent application is in the works. As far as the data set - we
definitely have it. In fact we were thinking of releasing it under some type
of open license (e.g. Creative Commons) after the site gets some traction. In
terms of search engines - we ARE in fact a search engine. Just type something
in the box at the top and you can also use the 4 available scores to filter
and rank search results. So the search works just like Google but you get to
sort the results by Controversial, Popular, Engaging, and Deceptive. Some
pretty interesting combinations can be created using the scores. For instance,
say you search for "paris hilton" but you are interested in her scandalous
"achievements" rather than her community work. Well, then you search for
"paris hilton" and sort by Controversial. Google can't give you that - i.e.
sort results based on what impact they are likely to have on people. Thanks
for the suggestions!
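
A rough sketch of the search-then-sort behaviour described above. The
`Article` type, the title-substring match, and the sample scores are all
assumptions made for illustration, not the site's actual data model or
retrieval method.

```python
from dataclasses import dataclass, field


@dataclass
class Article:
    """Hypothetical article record: a title plus named scores."""
    title: str
    scores: dict = field(default_factory=dict)


def search(articles, query, sort_by="popular"):
    """Return articles whose title contains the query, highest score first.

    sort_by is one of the score names, e.g. "controversial", "popular",
    "engaging" or "deceptive". Missing scores are treated as 0.0.
    """
    q = query.lower()
    hits = [a for a in articles if q in a.title.lower()]
    return sorted(hits, key=lambda a: a.scores.get(sort_by, 0.0), reverse=True)


# Made-up sample data, for illustration only.
articles = [
    Article("Paris Hilton charity visit", {"controversial": 0.1, "popular": 0.6}),
    Article("Paris Hilton scandal", {"controversial": 0.9, "popular": 0.5}),
    Article("Recession projections report", {"controversial": 0.3, "popular": 0.4}),
]
results = search(articles, "paris hilton", sort_by="controversial")
```

The same query sorted by "popular" returns the same two articles in the
opposite order, which is the kind of combination mentioned above.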

~~~
ujjwalg
I tried it for a couple of keywords with different ways of sorting and it
seems to be working great.

My personal feedback for this would be: to make it sticky you need to have
something similar to what the Google News page looks like, with 3 sub-sections
(controversial, popular, engaging) in every section. And then you should have
similar features (keyword news and number of news articles in every section)
and I will make it my homepage, no kidding. Currently, you are not using
the available page space very efficiently.

~~~
haidut
Yes, this is what we were thinking of having eventually - separate sections
based on score. The current design is something we threw together very quickly
to get it out the door. Btw, the article tags are clickable and run a
search for that tag in the background. Also, you can filter news by domain. As
we accumulate more data, we can start ranking domains based on the articles
they have produced so far. Kinda like PageRank but not based on links.
Finally, the algorithm can guess authorship and cluster articles based on
author as well. So again, we can rank authors/people after some time. This
seems to be a much needed feature, as this link discusses:
[http://threeminds.organic.com/2009/06/docs_are_old-school_we...](http://threeminds.organic.com/2009/06/docs_are_old-school_we_need_pa.html)

I think the above article came up on HN today.

------
jackdempsey
Just wanted to say thanks for the comments. We've worked hard on this, and
there's obviously still a ways to go....but we definitely appreciate the
constructive criticism, and the time taken to reply.

jack

------
ujjwalg
Another point I want to add: if this is possible, your algorithm should
definitely be added to any search engine's page-ranking system, not only to
get rid of click fraud but also to improve search results.

