
Ask HN: Can Python NLP categorization sort varying text files with no metadata? - segfaltz
I need to sort a collection of text files by content and thought Python NLP categorization might be the right tool for the task. My aim is to organize each file by subject matter.<p>For example, I want to take a directory of journal entries as input, sort them, and end up with a set of sub-directories based on subject (e.g., Software, Hardware, Astronomy, Painting). Some manual interaction and tweaking is expected; anything that helps sort my collection is highly appreciated.<p>The text files:<p>- Do not include any metadata and are not tagged for content. *<p>- Are free form and can cover any subject or topic. There are no limitations.<p>- Vary in length.<p>Am I correct in my understanding that NLP is up to the task? Or is this a square-peg-in-a-round-hole way of applying the technology? Does a better approach (or even algorithm) exist? If so, could you point me to this (ideally mature) tool?<p>I'm currently working up a PoC via Python to see if this is viable. I'm not being lazy ;-) Thank you for your help and time.<p>----------------------------------------<p>* Perhaps a simple solution is as easy as:<p>A script could count the occurrences of each word to generate a tag list, i.e., the most used words minus any stop words. The tags could then associate one file with another. Conversely, any file that does not share a similar percentage or spread of content words could be marked accordingly as covering different subject matter.
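
The word-count idea in the footnote can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the stop-word list is a tiny placeholder, the helper names `top_tags` and `tag_overlap` are hypothetical, and Jaccard overlap is one possible way to compare tag sets, not the only one.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; a real list would be far longer).
STOP_WORDS = {
    "the", "a", "an", "and", "or", "of", "to", "in", "on", "is",
    "it", "i", "my", "for", "with", "this", "that", "was",
}


def top_tags(text, n=5):
    """Return the n most frequent non-stop-words in text as a set of tags."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS and len(w) > 2)
    return {word for word, _ in counts.most_common(n)}


def tag_overlap(tags_a, tags_b):
    """Jaccard similarity between two tag sets: |A & B| / |A | B|."""
    if not tags_a and not tags_b:
        return 0.0
    return len(tags_a & tags_b) / len(tags_a | tags_b)
```

Two files whose tag sets overlap strongly are candidates for the same sub-directory; a file with near-zero overlap against everything else could be flagged for manual review.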
======
dfraser992
Off the top of my head... what do you think NLP means? Parsing sentences,
understanding the grammar, POS... are not needed here. Named entity
recognition would probably be useful... NLP is such a broad term that it's
almost useless, IMO.

There are a lot of research papers out there on web page classification. This
sounds much like that.

Classification based on keywords sounds like the easiest, most straightforward
way - read up on tf-idf [1]
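
A minimal tf-idf sketch, using only the standard library, might look like the
following. It assumes the plain `tf * log(N / df)` weighting (one of several
variants) and cosine similarity for comparing files; the function names are
made up for illustration, and a mature library would be the better choice for
real use.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Map each document to a sparse {term: tf-idf weight} dict."""
    token_lists = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for tokens in token_lists for term in set(tokens))
    vectors = []
    for tokens in token_lists:
        tf = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty files
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(index, vectors):
    """Index of the other document closest to vectors[index]."""
    others = [j for j in range(len(vectors)) if j != index]
    return max(others, key=lambda j: cosine(vectors[index], vectors[j]))
```

One way to use this for sorting: hand-label a few seed files per subject, then
assign every other file to the subject of its most similar seed.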

Topic models (e.g. LDA) are another approach [2].
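
To give a feel for what a topic model does, here is a rough sketch of LDA
fitted by collapsed Gibbs sampling in plain Python. The function name, the
hyperparameter defaults (`alpha`, `beta`, `n_iters`) and the fixed seed are
all assumptions chosen for illustration; it is far too slow for a real
collection, where an established library would be the sensible route.

```python
import random


def lda_gibbs(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling.

    docs is a list of token lists. Returns (doc_topic, vocab, topic_word),
    where doc_topic[d] is the estimated topic distribution of document d and
    topic_word[k][i] counts how often vocab[i] was assigned to topic k.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_vocab = len(vocab)

    n_dk = [[0] * n_topics for _ in docs]            # document-topic counts
    n_kw = [[0] * n_vocab for _ in range(n_topics)]  # topic-word counts
    n_k = [0] * n_topics                             # tokens per topic
    z = []                                           # topic of every token

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][word_id[w]] += 1
            n_k[k] += 1
        z.append(z_d)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                k = z[d][i]
                # Remove the token's current assignment from the counts.
                n_dk[d][k] -= 1
                n_kw[k][wid] -= 1
                n_k[k] -= 1
                # p(topic t | everything else) is proportional to
                # (n_dk + alpha) * (n_kw + beta) / (n_k + V * beta).
                weights = [
                    (n_dk[d][t] + alpha)
                    * (n_kw[t][wid] + beta)
                    / (n_k[t] + n_vocab * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][wid] += 1
                n_k[k] += 1

    doc_topic = [
        [(n_dk[d][t] + alpha) / (len(doc) + n_topics * alpha)
         for t in range(n_topics)]
        for d, doc in enumerate(docs)
    ]
    return doc_topic, vocab, n_kw
```

Each file could then be filed under its highest-weight topic, with the caveat
that the topics come out unlabeled and someone still has to decide which one
means "Astronomy".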

[edit] And I forgot: if you're using Python, use spaCy [3]. There are quite a
number of Python/Java toolkits out there for NLP-type work, and spaCy is the
one I keep using as the base for a project.

[1]
[https://en.wikipedia.org/wiki/Tf%E2%80%93idf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)

[2]
[https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation)

[3] [https://spacy.io/](https://spacy.io/)

