

Ask HN: What NLP features would you find useful (pay for)? - haidut

I am a co-founder of a news startup called Euraeka. The site can be found at http://www.euraeka.com and I have posted about it here:
http://news.ycombinator.com/item?id=670844<p>The bottom line is that, given the nature of the startup, I had to implement a number of text processing routines using NLP techniques. I am thinking of developing an NLP API with those features for people to build upon. Here is the list I currently have in production:<p>1) Text extraction: i.e. given a URL or the actual HTML, strip all unnecessary junk and return the raw English text. This routine can also parse text into sentences and word/frequency hashes. Stemming is also available.<p>2) Tagging: i.e. extract the top N representative words.<p>3) Text summarization: i.e. automatically extract the top N most representative sentences of a given article.<p>4) Cognitive fluency: i.e. a numerical description (range 0 to 1) of how difficult it is to comprehend a given text.<p>5) Author intelligence: i.e. a numerical description (range 0 to 1) of the estimated verbal IQ of the author of the text.<p>6) Named Entity Recognition and Extraction: i.e. given a text, mark all entities in it that represent a person, place, or event. An example would be this sentence: "Secretary <person>Clinton</person> is expected to <event>visit</event> <place>Pakistan</place> to discuss ongoing military cooperation".<p>7) Semantic similarity between two texts: i.e. a process that, given two texts, returns a measure of how semantically close they are. This is NOT a simple word overlap comparison. It takes context similarity into account, which is much more powerful. For example, a simple word similarity measure would find little or no similarity between the texts "Senator X visited Pakistan" and "Senator Y visited Afghanistan".
However, the semantic similarity would return a high number, based on the context that both sentences are about similar political figures visiting neighboring countries that are related in terms of the war on terrorism.<p>8) Typing effort: i.e. a physiological measure that estimates how much effort went into typing the text. For instance, words that require keys that are far from each other on the keyboard are harder to type than words whose keys are close to each other.<p>9) Text mood classifier: i.e. automatic classification of the positive/negative mood of the text.<p>10) Text style classifier: i.e. automatic classification of the subjectivity/objectivity of the text.<p>11) Topic classifier: i.e. automatic classification of text into general topics like business, politics, science, technology, etc.<p>So my question is: which of these do you think would be worth exposing in an API? Do any of them have potential as a premium service? Do you have any additional ones to suggest that I could implement?
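
To make the typing-effort idea in item 8 a bit more concrete, here is a rough sketch of one way such a measure could be computed. It is not the production routine: the QWERTY key coordinates, the row offsets, and the function name `typing_effort` are all assumptions made for illustration, and the score is left in key widths rather than scaled to any fixed range.

```python
from math import hypot

# Approximate QWERTY layout: three letter rows, each shifted right by a
# fraction of a key width. The offsets are assumed, not measured.
ROWS = [("qwertyuiop", 0.0), ("asdfghjkl", 0.25), ("zxcvbnm", 0.75)]
KEY_POS = {
    ch: (col + offset, row)
    for row, (letters, offset) in enumerate(ROWS)
    for col, ch in enumerate(letters)
}


def typing_effort(word: str) -> float:
    """Mean distance, in key widths, between consecutive keys of a word.

    Characters that are not on the three letter rows are skipped.
    Words with fewer than two known keys score 0.0.
    """
    keys = [KEY_POS[ch] for ch in word.lower() if ch in KEY_POS]
    if len(keys) < 2:
        return 0.0
    steps = [
        hypot(x2 - x1, y2 - y1)
        for (x1, y1), (x2, y2) in zip(keys, keys[1:])
    ]
    return sum(steps) / len(steps)
```

Under these assumed coordinates a word such as "were", whose keys sit next to each other, scores 1.0, while a word such as "pal", which jumps across the keyboard, scores several times higher. A real measure would presumably also account for which finger and hand is used, which this sketch ignores.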
======
vitovito
I've seen a need for both ad hoc (submitting text to you one complete posting
at a time) and real-time (chat streams, high-volume forums) handling of text
for many of these, plus censorship: both detection (does this text contain
banned words) and processing (manipulate the banned words) in multiple
languages.
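
As a minimal sketch of the two censorship modes mentioned above, detection and processing, something like the following might serve as a starting point. The function names (`compile_banned`, `contains_banned`, `mask_banned`) and the example word list are hypothetical, and the whole-word matching assumes languages that separate words with spaces or punctuation.

```python
import re


def compile_banned(words):
    """Build one case-insensitive pattern that matches any banned word
    as a whole word. Assumes a non-empty collection of words."""
    # Longest words first, so a longer entry wins over a shorter prefix.
    ordered = sorted(words, key=len, reverse=True)
    alternatives = "|".join(re.escape(w) for w in ordered)
    return re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)


def contains_banned(text, pattern):
    """Detection: does the text contain any banned word?"""
    return pattern.search(text) is not None


def mask_banned(text, pattern, mask="*"):
    """Processing: replace each banned word with mask characters
    of the same length."""
    return pattern.sub(lambda m: mask * len(m.group(0)), text)


banned = compile_banned({"darn", "heck"})
print(contains_banned("Well, DARN it", banned))
print(mask_banned("Well, DARN it", banned))
```

The same compiled pattern can be applied to one complete posting at a time or to each message in a stream. Languages written without word separators would need a different tokenization step, which this sketch does not attempt.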

