
The most important detail noted is that training on more data gets better results than using a better algorithm. One of the recent pushes in machine learning and natural language processing has been to bootstrap larger training corpora from smaller initial sets.

As an example, one of my friends did her thesis on using simpler sentences (ones you're either confident you have parsed correctly or have gold standard, i.e. correct, training data for) to parse more complex but related sentences (see [1]). This is useful because, even if you don't have a huge amount of gold standard training data, the parser is statistically far more likely to get the derivation correct for shorter sentences than for longer ones. You can then use those shorter sentences to help parse the longer ones.
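To make the general bootstrapping idea concrete, here's a rough self-training loop in Python. It is not the method from the thesis, just a toy sketch: the "model" is a made-up nearest-mean classifier over 1-D points with exactly two labels, and `train`, `predict`, `self_train` and the 0.8 threshold are all hypothetical names and values picked for illustration. In a parsing setting the items would be sentences and the confidence would come from the parser, with shorter sentences tending to clear the threshold first.

```python
from collections import defaultdict


def train(labeled):
    """Toy model: the mean of the 1-D points seen for each label."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for x, label in labeled:
        sums[label] += x
        counts[label] += 1
    return {label: sums[label] / counts[label] for label in sums}


def predict(model, x):
    """Return (nearest label, confidence in [0.5, 1.0]); assumes exactly two labels."""
    (near_d, near_label), (far_d, _) = sorted(
        (abs(x - centre), label) for label, centre in model.items()
    )
    total = near_d + far_d
    # Confidence is high when x is much closer to one mean than to the other.
    confidence = 0.5 if total == 0 else 1.0 - near_d / total
    return near_label, confidence


def self_train(gold, unlabeled, threshold=0.8, max_rounds=10):
    """Grow the training set with the model's own confident predictions."""
    labeled = list(gold)
    pool = list(unlabeled)
    for _ in range(max_rounds):
        model = train(labeled)  # retrain on everything accepted so far
        still_unsure = []
        added = False
        for x in pool:
            label, confidence = predict(model, x)
            if confidence >= threshold:
                labeled.append((x, label))  # trust the easy cases
                added = True
            else:
                still_unsure.append(x)  # keep the hard cases for a later round
        pool = still_unsure
        if not added:
            break  # nothing new was learned, so stop
    return labeled, pool


# Hypothetical data: two gold points and four unlabeled ones.
gold = [(0.0, "a"), (10.0, "b")]
unlabeled = [1.0, 9.0, 4.0, 6.0]
labeled, leftover = self_train(gold, unlabeled)
```

The easy points near the gold examples get absorbed into the training set, while the ambiguous ones in the middle stay unlabeled rather than being guessed at. The obvious risk is that a confident but wrong prediction gets baked into later rounds, which is why the threshold matters.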

That's why Google is so powerful. I spent a summer internship there and they have two really powerful things: data, and the tools and techniques to handle it. In one afternoon a single employee could run through more data than entire companies would use in months.

[1] "Mozart was born in Salzburg in 1756" vs "Wolfgang Amadeus Mozart was born on the 27th of January 1756 at 9 Getreidegasse in Salzburg" (the latter is a slightly modified example from Wikipedia)
