Unfortunately, I haven't seen a good software platform that actually allows you to build a good curation site. The ones that exist want you to build content for them; I want one I can run/own/brand on my own. I suspect there might be some in the library space though (haven't searched _very_ hard yet).
Emacs and HTML work fine, and have been a pretty good solution for the last 20 years.
Including Emacs, which you could use with a bit of elisp :-)
But thank you for persevering. :-)
The Internet contains so much information on any given topic that if you have a question, it probably has already been answered. If we could build better search engines, we could learn anything in a fraction of the time.
The web had a chaotic growth in the first decades but now it looks as if on one end, the larger websites have killed smaller ones, and on the other it has grown so large that search is no longer enough.
You need organization.
(sorry for the offtopic)
Looking for ready-to-use NLP software, you can go with Solr and that's it. What I mean is that there are way too many NLP libraries. That said, it might be because there are many ways to do it. Anyway, I really think we need a scikit-learn for NLP.
organization != knowledge.
The original Word2Vec is missing too. While Gensim and GloVe are nice, the original Word2Vec still outperforms them both in some circumstances.
Surely there is a good LSTM language-modelling project somewhere too? I can't think of one off the top of my head though. There's some code in Keras, but maybe Karpathy's char-rnn would be better because of the documentation.
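For anyone wondering what an LSTM language model actually computes per character: here is a minimal single-time-step LSTM cell in NumPy. All names and sizes are invented for illustration; a real char-rnn stacks many of these steps and trains the weights.

```python
# One LSTM time step in NumPy, to show what a char-level LM computes
# at each step. Weights are random; names are illustrative only.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step. W stacks the four gate weight matrices row-wise."""
    z = W @ np.concatenate([x, h_prev]) + b   # pre-activations for all gates
    H = h_prev.size
    i = sigmoid(z[0:H])         # input gate
    f = sigmoid(z[H:2*H])       # forget gate
    o = sigmoid(z[2*H:3*H])     # output gate
    g = np.tanh(z[3*H:4*H])     # candidate cell update
    c = f * c_prev + i * g      # new cell state
    h = o * np.tanh(c)          # new hidden state (feeds the next char's softmax)
    return h, c

# Tiny demo: input dim 3 (e.g. a char embedding), hidden dim 4.
rng = np.random.default_rng(0)
X, H = 3, 4
W = rng.standard_normal((4 * H, X + H)) * 0.1
b = np.zeros(4 * H)
h, c = lstm_step(rng.standard_normal(X), np.zeros(H), np.zeros(H), W, b)
print(h.shape, c.shape)  # (4,) (4,)
```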
Perhaps naively, it seems a big part of deducing meaning could be done with ordinary dictionary lookups on terms like 'bedroom', 'apartments', "Capitol Hill", "Seattle", etc.
Is this indeed naive, or is this 'dictionary lookup'-technique part of the bag of tricks used? If so, any good references to use this in combination with other techniques described here?
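That dictionary-lookup idea is essentially a gazetteer, a common baseline for entity spotting. A minimal sketch, with an invented term list (real systems use large curated gazetteers plus statistical taggers):

```python
# Minimal gazetteer (dictionary) lookup tagger. The term list is an
# invented example, not a real resource.
GAZETTEER = {
    "capitol hill": "LOCATION",
    "seattle": "LOCATION",
    "bedroom": "HOUSING_TERM",
    "apartment": "HOUSING_TERM",
}

def tag(text, gazetteer=GAZETTEER, max_len=3):
    """Greedy longest-match lookup over lowercased token n-grams."""
    tokens = text.lower().split()
    found, i = [], 0
    while i < len(tokens):
        for n in range(max_len, 0, -1):          # prefer longer matches
            phrase = " ".join(tokens[i:i + n])
            if phrase in gazetteer:
                found.append((phrase, gazetteer[phrase]))
                i += n
                break
        else:
            i += 1
    return found

print(tag("2 bedroom apartment on Capitol Hill Seattle"))
# → [('bedroom', 'HOUSING_TERM'), ('apartment', 'HOUSING_TERM'),
#    ('capitol hill', 'LOCATION'), ('seattle', 'LOCATION')]
```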
Highly interested in this topic, but looking for a nice introduction to get used to the terminology of the field.
The general idea of NLP is no different from general computer science, i.e. 1) narrow the problem, 2) solve it, 3) try to solve a bigger problem.
The tower of sentence structure in NLP is:
- bag of words
- part of speech + named entity tagging
- dependency tagging/framing
- semantic tagging
The idea is to create templates for the most common questions. Then, parsing questions and recognizing named entities like "Capitol Hill", "Seattle" and common nouns like "apartment", you can resolve the question. It's not an ordinary dictionary hash lookup, since a given template has several "keys". The value of the dictionary is the correct search method. It reminds me of multiple method dispatch, with support for dispatch by value.
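A sketch of that dispatch-by-value idea: the "key" is the combination of slot types recognized in the question, and the "value" is the search method to run. Handlers and slot types below are all invented for illustration:

```python
# Template dispatch: key = set of typed slots found in the question,
# value = search handler. All names here are invented examples.

def search_rentals(slots):
    return f"searching rentals in {slots['LOCATION']} with {slots['HOUSING_TERM']}"

def search_places(slots):
    return f"general search around {slots['LOCATION']}"

# Dispatch on the combination of slot types, not a single key.
DISPATCH = {
    frozenset({"LOCATION", "HOUSING_TERM"}): search_rentals,
    frozenset({"LOCATION"}): search_places,
}

def answer(slots):
    handler = DISPATCH.get(frozenset(slots))
    if handler is None:
        return "sorry, no template matches"   # fall back / ask for confirmation
    return handler(slots)

print(answer({"LOCATION": "Capitol Hill, Seattle", "HOUSING_TERM": "2 bedroom"}))
```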
Also something to take into account: in the "assistant" example you give, the assistant can ask for confirmation. You don't explicitly state that you are looking to "rent" something, so the system might not recognize the question, but just guess that you are talking about renting because it's the most popular search around Capitol Hill, Seattle. You can implement a "suggest this question" feature that feeds back into the "question dispatch" algorithm so it later recognizes this question.
This is mostly a Dynamic Programming approach. Advanced NLP pipelines use logic, probabilistic programming, graph theory or all of them ;)
The other big problems of NLP are:
- summary generation
- automatic translation
Important to note is that, like other systems, it must be goal driven. You can start from the goal and go backward, inferring the previous steps, or start from the initial data and go forward. Again, it's very important to simplify. Factorize by recognizing patterns. That's the main idea regarding the theory of mind.
Have a look at this SO question where I try to fully explain an example QA. The Coursera NLP course is a good start.
OpenCog doesn't deal solely with NLP but gives an example of what a modern artificial cognitive assistant can be made of.
Beware that NLP is kind of a rabbit hole.
Above you said: > The idea is to create templates for most common questions.
I assume here that a template would be an abstract phrase where things like Named Entities (Seattle, Capitol Hill), Adjectives (2 bedroom), etc. are removed and substituted by variables. Correct?
Could supervised learning then be used to map natural language questions to templates? After all, there's only so many ways in which you can ask a particular abstract question (i.e.: template) in a limited domain.
What I'm thinking then are the following steps:
- 1. Source questions that cover the domain. (e.g.: Mechanical Turk)
- 2. Manually come up with abstract templates that cover these questions. (Although somehow I feel it must be possible to semi-automate this using Wrapper Induction or something)
- 3. Manually label a test set <question -> template>
- 4. Have the system learn/classify the remaining questions and test for accuracy (what classifiers would you use here?)
Flow of new question:
1. If coverage in step 2 was big enough, the system should be able to infer the template.
2. A template should be translatable to a bunch of queries (e.g.: GraphQL format). Not the hard part I believe.
Out pops your answer in machine form. Bonus points to transform that answer into a Natural Language answer using some generative grammar.
Of course the devil is in the details but from 10,000 feet does this look solid? Suggestions/glaring omissions? Thanks again.
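From 10,000 feet, steps 3–4 could be sketched with a toy stand-in for a real classifier: nearest template by word overlap (Jaccard similarity). A real system would use TF-IDF features and something like scikit-learn, and would substitute entities with variables first; the templates below are invented:

```python
# Toy question -> template matcher by word overlap (Jaccard similarity),
# a stand-in for a trained classifier. Templates are invented examples.
TEMPLATES = {
    "FIND_RENTAL": "find a NUM bedroom apartment in LOCATION",
    "GET_PRICE": "how much does a NUM bedroom apartment cost in LOCATION",
}

def normalize(q):
    # A real pipeline would run NER first and replace entities with
    # variables (NUM, LOCATION); here that's done by hand.
    return set(q.lower().split())

def match_template(question):
    words = normalize(question)
    def score(item):
        t = normalize(item[1])
        return len(words & t) / len(words | t)   # Jaccard similarity
    return max(TEMPLATES.items(), key=score)[0]

print(match_template("find a NUM bedroom apartment in LOCATION"))     # FIND_RENTAL
print(match_template("how much does an apartment cost in LOCATION"))  # GET_PRICE
```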
2. Semi-manually come up with templates (a grammar for the questions). You have to analyse the dataset in an unsupervised way to find the common patterns and sanitize the results.
3. maybe step 2 is enough.
4. Markov networks are useful in this context, but I could be wrong.
> A template should be translatable to a bunch of queries (e.g.: GraphQL format). Not the hard part I believe.
Yes, once you have the templates with typed variables (named entities, adjectives, etc.) as you describe, you can write the code to search for the results. I doubt GraphQL is a good solution for that problem. You can't translate the templates into a search on the fly; it's a mapping that you need to build manually or automatically.
I think in your case SQL will be fine. Have a look at https://github.com/machinalis/quepy
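The "mapping you build manually" can be as simple as a dict from template to a parameterized SQL query, in the spirit of quepy's template-to-query translation. A sketch with an invented schema, using stdlib sqlite3:

```python
# Manually built template -> parameterized SQL mapping.
# Schema and data are invented for illustration.
import sqlite3

TEMPLATE_SQL = {
    "FIND_RENTAL": "SELECT address FROM listings "
                   "WHERE neighborhood = ? AND bedrooms = ?",
}

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE listings (address TEXT, neighborhood TEXT, bedrooms INT)")
db.executemany("INSERT INTO listings VALUES (?, ?, ?)", [
    ("123 Pine St", "Capitol Hill", 2),
    ("456 Oak Ave", "Ballard", 1),
])

def run_template(template, slots):
    """Fill the template's query with the typed slot values."""
    return db.execute(TEMPLATE_SQL[template], slots).fetchall()

print(run_template("FIND_RENTAL", ("Capitol Hill", 2)))
# → [('123 Pine St',)]
```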
say -v Alva "Tomten dricker julmust på tomten" (Swedish: "Santa is drinking julmust in the yard")
"Tomten" can mean either "Santa Claus" or "the yard"/"the plot" depending on context, and apparently they're able to detect this properly.
(The irony of this misunderstanding being kicked off by a comment about the text-to-speech engine understanding the context of a word amuses me)
The cmudict isn't under the text-to-speech subheading in this list, but I think the folks at Carnegie Mellon may have considered text-to-speech applications, like a talking GPS navigator, when they compiled the dictionary. I recall the cmudict containing lots of US city names.