
Ask HN: How to extract the topic of a child’s question? - krschacht
How can I identify the kid-friendly topic of a sentence? Google NLP API and Stanford NLP are helpful for finding nouns & verbs, but I need to identify the subset of these that are meaningful “topics” from a kid’s perspective. How can I do this?

Context:
My company has become popular for explaining science to kids. We now see how hard it is for kids to explore their questions, so we asked elementary teachers: if your students ask a question you can’t answer, submit it to us.

We received 600K well-formed questions! It’s fascinating to browse. We’ve been so inspired by this that we now plan to answer every question and build a video-Wikipedia-for-kids (living at mystery.org). E.g., here are some random questions:
https://tinyurl.com/y43bwtek

First we wonder: what’s the most popular question? This is tough! “Synonymous” questions look so different. Take a look at these; you’d give the same explanation to all three kids:

Why is there sand on beaches?
How is sand made?
Where does sand come from?

Ugh. Before we embark on de-duping, we’re first trying to find the topic(s) of each question. Organizing topics into a kid’s conceptual hierarchy will be useful:

Living things > Birds > Penguins
Man-made things > Food > Junk food > French fries

NLP libraries help us get parts of speech, and the “topics” we want are among the nouns and verbs of the sentence, but not all of them are meaningful topics. E.g.:

Why do spoons and forks go in a certain place when setting the table?

“Place” is a noun, but it’s too vague here to be a topic. We would not want to let kids “Browse all questions about place.” So we need the subset of nouns which are meaningful topics.

> How did people make glue?

“Make” is a verb. In most questions, “make” is not a topic. But here it’s being used in a significant way. We’d want to list it under:

All questions about _making/invention_
All questions about _glue_

Any advice on how to filter down the nouns & verbs in a sentence to the kid-friendly topics?
======
Eridrus
This is a very hard task, not least because your definition of what
is a good "topic" probably has more to do with the use case in your head than
anything generic about language. E.g. what does it mean to "make"? Does
cooking count? You will have an answer, but it is not a universal answer.

I would personally avoid trying to categorize these questions into discrete
topics since that does not actually seem to be your final goal.

My suggestion would be to do some sort of clustering of your questions, and
then go through the clusters in decreasing size. It will be coarse, but should
be sufficient to support what I assume is your goal of trying to answer the
most frequent questions first.

To do the clustering, I would encode the questions into vectors with some
pretrained model, e.g.
[https://www.dlology.com/blog/keras-meets-universal-sentence-...](https://www.dlology.com/blog/keras-meets-universal-sentence-encoder-transfer-learning-for-text-data/)

And then run a few clustering algorithms on it to see if you can get "good
enough" clusters for your purposes. You should be able to hack up something in
a week to see if this is a workable approach.
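
A rough sketch of the encode, cluster, and sort-by-size loop described above might look like the following. Everything in it is an assumption for illustration: `embed` is a stand-in that uses bag-of-words counts in place of a pretrained sentence encoder, the stopword list and the `threshold` value are made up, and the greedy single-pass clustering is just one simple algorithm to try first. Word overlap will miss paraphrases that share no words, which is the gap a real encoder is meant to close.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list, only here so the toy vectors focus on content words.
STOPWORDS = {"why", "is", "there", "on", "how", "where", "does", "from",
             "the", "do", "a", "an", "what", "are", "of"}


def embed(question):
    """Stand-in encoder: bag-of-words counts.

    A pretrained sentence encoder would go here instead and return a
    dense vector; this placeholder only captures word overlap.
    """
    words = re.findall(r"[a-z']+", question.lower())
    return Counter(w for w in words if w not in STOPWORDS)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(questions, threshold=0.4):
    """Greedy single-pass clustering.

    Each question joins the first cluster whose seed question is similar
    enough; otherwise it starts a new cluster. Returns the clusters as
    lists of questions, largest first.
    """
    clusters = []  # list of (seed_vector, member_questions)
    for question in questions:
        vector = embed(question)
        for seed, members in clusters:
            if cosine(vector, seed) >= threshold:
                members.append(question)
                break
        else:
            clusters.append((vector, [question]))
    return sorted((members for _, members in clusters), key=len, reverse=True)


if __name__ == "__main__":
    sample = [
        "Why is there sand on beaches?",
        "How is sand made?",
        "Why is the sky blue?",
        "Where does sand come from?",
    ]
    for group in cluster(sample):
        print(len(group), group)
```

Working through the output from the top gives the most frequently asked question groups first, which is the order you would want to answer them in.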

There are other short text clustering/topic modeling approaches that people
have published papers on, but it's hard to know a priori whether they will
work for your data, so I would avoid sinking too much time into something that
isn't your actual core goal.

Once you actually have the answers you are making for these questions, I think
it makes much more sense to categorize those into whatever ontology/knowledge
graph you think is good for kids. Trying to do it on the questions is just
going to be a world of pain for little gain IMO.

Once you have documents (or transcripts of your videos), you're at least in
the realm of traditional search / Q&A systems which have established ideas on
how to determine if the query matches any documents you may have. You can even
use this system to start weeding out duplicates by putting subsequent
questions into your search system to see if answers pop up.
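
To make the "put subsequent questions into your search system" idea concrete, here is a minimal sketch of a toy inverted index with IDF-weighted term overlap. It is not a real search engine: the `TinyIndex` class, the `best_match` helper, the `min_score` cutoff, and the two sample transcripts are all hypothetical, and a production setup would use an established search or Q&A system instead.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


class TinyIndex:
    """Toy inverted index scored by IDF-weighted term overlap."""

    def __init__(self, docs):
        self.docs = docs  # {doc_id: transcript text}
        self.postings = defaultdict(set)  # term -> ids of docs containing it
        for doc_id, text in docs.items():
            for term in set(tokenize(text)):
                self.postings[term].add(doc_id)

    def idf(self, term):
        """Smoothed inverse document frequency: rarer terms weigh more."""
        df = len(self.postings.get(term, ()))
        return math.log((1 + len(self.docs)) / (1 + df)) + 1.0

    def search(self, query):
        """Return (doc_id, score) pairs, best match first."""
        scores = Counter()
        for term in set(tokenize(query)):
            for doc_id in self.postings.get(term, ()):
                scores[doc_id] += self.idf(term)
        return scores.most_common()


def best_match(index, question, min_score=1.0):
    """Return the id of an existing answer, or None if nothing scores high enough."""
    hits = index.search(question)
    if hits and hits[0][1] >= min_score:
        return hits[0][0]
    return None


if __name__ == "__main__":
    # Hypothetical transcripts of already-published answers.
    answered = {
        "sand": "Sand is made when rocks and shells are broken down by waves over a long time",
        "rainbow": "A rainbow appears when sunlight is bent and split into colors by raindrops",
    }
    index = TinyIndex(answered)
    print(best_match(index, "Where does sand come from?"))
    print(best_match(index, "How do volcanoes erupt?"))
```

A question that comes back with a match is a candidate duplicate for a human to confirm; one that comes back with `None` is still unanswered.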

~~~
krschacht
Thanks Eridrus for this. Your clustering approach looks promising.

I've been chipping away at this with a more straightforward heuristic based on
parts of speech, just examining a hundred questions at a time. It turns out
that filtering things down to the Direct Object and Noun-Subject (as
identified by Google's NLP) filters out a lot of the noise. And then, the
vocabulary of children under 10 years old is finite enough that with a short
blacklist, I'm getting quite a good hit rate for the first few hundred
questions I've gone through. With a little more testing and refinement, I
think this might get me what I want but I'm going to dig into sentence
clustering if I hit a dead end.
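
For anyone who wants to try the same heuristic, a minimal sketch is below. It assumes each question has already been run through a dependency parser and reduced to `(lemma, label)` pairs; the `NSUBJ` and `DOBJ` label names, the hand-written parses, and the contents of `BLACKLIST` are illustrative assumptions, not my actual code or list.

```python
# Dependency labels to keep: noun subject and direct object.
# The exact label names depend on the parser and are assumed here.
KEEP_LABELS = {"NSUBJ", "DOBJ"}

# Hypothetical blacklist of words too vague to browse by.
BLACKLIST = {"place", "thing", "people", "you", "it", "they", "kind", "way"}


def extract_topics(tokens):
    """Keep subject/object lemmas that are not blacklisted.

    `tokens` is a list of (lemma, dependency_label) pairs. Order is
    preserved and repeated lemmas are dropped.
    """
    topics = []
    for lemma, label in tokens:
        word = lemma.lower()
        if label in KEEP_LABELS and word not in BLACKLIST and word not in topics:
            topics.append(word)
    return topics


if __name__ == "__main__":
    # Hand-written parses, standing in for real parser output.
    glue = [("how", "ADVMOD"), ("do", "AUX"), ("people", "NSUBJ"),
            ("make", "ROOT"), ("glue", "DOBJ"), ("?", "P")]
    rainbow = [("why", "ADVMOD"), ("do", "AUX"), ("rainbow", "NSUBJ"),
               ("always", "ADVMOD"), ("have", "ROOT"), ("the", "DET"),
               ("same", "AMOD"), ("color", "DOBJ"), ("?", "P")]
    print(extract_topics(glue))
    print(extract_topics(rainbow))
```

Note that this version drops "make" from the glue question, so verbs that matter as topics would still need separate handling.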

In case you're curious, here are the last few questions I processed as an
example:

How is an eraser made? > Topics: _eraser_

Why do rainbows always have the same colors? > Topics: _rainbow_, color

How do electromagnets work in devices like loudspeakers and microphones? >
Topics: _electromagnet_, device, loudspeaker, microphone

What is frostbite and why does it make your skin turn black? > Topics:
frostbite, _skin_

How can you see yourself in a mirror? > Topics: seeing, _mirror_

