
Ask HN: Non-tech guy looking for unstructured data startup advice - bedhead
I have an investment background but am interested in creating a new finance product for internal use. My idea is to take all historical filings that public companies submit to the SEC (as well as a live feed of new filings) and basically create a Bloomberg for mostly qualitative information that the big financial software companies do not analyze.

For example, I would like to take a given company's entire document history (almost 20 years of text files) and create an application that can determine whether the company has ever received a material weakness in its internal controls notice, the dates received/outstanding, the reasons provided, and when/if it was finally resolved or not. There are about 50-80 other conditions like this I would like to discover through an application, but what's tricky is that companies can use different language to describe these things.

I have absolutely no idea where to start with a project like this, but am excited to learn some new tricks. Am wondering about how to best extract the unstructured data I want, best ways to populate and organize a DB (presumably NoSQL), etc. I realize this might sound comically general and naive, but figured there might be some folks out there with good experience working with unstructured data and document-oriented databases. Any thoughts are GREATLY appreciated! Thanks.
======
rahimnathwani
You could do it step by step. The first step is to get a decent amount of
material together. Without that, you will have nothing to analyse.

If the data is available as a bulk download, you're all set to move to the
next step. If not, next best is if they offer an API. In that case, learn how
to use Python or something to pull each document through that API, and store
it somewhere (either the filesystem, or in a database). If not, get the book
'Web Scraping with Python' and use that.
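For the SEC specifically, EDGAR publishes quarterly index files and serves each filing at a predictable URL, so the pull-and-store step can be sketched like this (the helper names are mine, and the exact URL layout is worth double-checking against EDGAR's documentation):

```python
# Sketch of pulling filings from SEC EDGAR, assuming its published URL
# layout (quarterly master indexes under /Archives/edgar/full-index/).
# Function names are illustrative, not from any particular library.
import urllib.request

def index_url(year, quarter):
    """URL of EDGAR's master index of all filings for one quarter."""
    return (f"https://www.sec.gov/Archives/edgar/full-index/"
            f"{year}/QTR{quarter}/master.idx")

def filing_url(cik, accession):
    """URL of a complete filing, given the company's CIK and the
    filing's accession number (e.g. '0000320193-23-000106')."""
    return (f"https://www.sec.gov/Archives/edgar/data/"
            f"{int(cik)}/{accession.replace('-', '')}/{accession}.txt")

def fetch(url):
    # EDGAR asks for a descriptive User-Agent; put your contact info here.
    req = urllib.request.Request(url, headers={"User-Agent": "you@example.com"})
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

From there it's a loop: fetch an index, parse out the accession numbers, fetch each filing, and write it to disk or a database.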

Once you have the stuff together, Udacity has a gentle introduction on
cleaning document-ish data (using JSON/Python and Mongodb):
[https://www.udacity.com/course/data-wrangling-with-mongodb--ud032](https://www.udacity.com/course/data-wrangling-with-mongodb--ud032)
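To make the "cleaning" step concrete, here's a toy version of it: turning a raw filing header into a flat record ready for a document store (the field names are just whatever the header contains, and the MongoDB line is indicative only):

```python
# Toy wrangling step: parse 'KEY: value' header lines into a clean dict
# that can be serialized to JSON or inserted into a document database.
import json
import re

def clean_filing(raw_header):
    """Parse 'KEY: value' header lines into a flat dict."""
    record = {}
    for line in raw_header.splitlines():
        m = re.match(r"\s*([A-Z][A-Z \-]+):\s*(.+)", line)
        if m:
            key = m.group(1).strip().lower().replace(" ", "_").replace("-", "_")
            record[key] = m.group(2).strip()
    return record

raw = """COMPANY CONFORMED NAME: EXAMPLE CORP
CENTRAL INDEX KEY: 0000123456
CONFORMED PERIOD OF REPORT: 20151231"""
doc = clean_filing(raw)
print(json.dumps(doc))
# Once cleaned, inserting into MongoDB is one call, e.g.:
#   pymongo.MongoClient().sec.filings.insert_one(doc)
```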

Then the analysis starts. If it were me, I might start by splitting documents
into chunks (paragraphs?), and classifying them somehow. Maybe use NLTK:
[http://www.nltk.org/book/ch06.html](http://www.nltk.org/book/ch06.html)
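A dependency-free sketch of that chunk-and-classify idea, with a crude keyword test standing in for the trained classifier NLTK would give you:

```python
# Split a filing into paragraph chunks, then flag the ones that look
# relevant to a condition. The keyword test is a deliberate toy; a real
# NLTK classifier (see the link above) would replace flag().
import re

def paragraphs(text):
    """Split on blank lines into non-empty paragraph chunks."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def flag(paragraph, keywords=("material weakness", "internal control")):
    """Crude stand-in for a trained classifier: keyword presence."""
    lowered = paragraph.lower()
    return any(k in lowered for k in keywords)

text = ("Revenue grew 4% year over year.\n\n"
        "Management identified a material weakness in internal control "
        "over financial reporting.")
hits = [p for p in paragraphs(text) if flag(p)]
```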

Let us know how you get on :)

~~~
achompas
The first couple of steps are actually the easiest. The last paragraph --
which is also your shortest -- is the hardest part, by far.

OP, you want to basically extract entities from documents without necessarily
knowing them a priori ("companies can use different language to describe these
things"). This might be even tougher if companies aren't required to report
material weaknesses (are they? I'm unsure). If they don't report, then you'd
also likely need to manually code training data for companies for which you
have filings.

If companies are required to report material weaknesses, then you can likely
rely on NLP for this job. Start by building a classifier using material
weakness data as your labels and tf-idf vectorized documents as your features
(check out parent's link, and pick up a general NLP book while you're at it).
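To show the shape of that pipeline without any dependencies, here's a from-scratch toy: tf-idf weighting plus nearest-neighbour matching against a tiny invented training set. In practice scikit-learn's TfidfVectorizer and any of its classifiers do this properly.

```python
# Toy tf-idf + labeled-examples pipeline. The training texts and labels
# are invented; nearest-neighbour by cosine similarity stands in for a
# real trained classifier.
import math
from collections import Counter

def tfidf_vectors(docs):
    """One tf-idf weighted dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter(w for toks in tokenized for w in set(toks))
    n = len(docs)
    return [{w: tf[w] * math.log(n / df[w]) for w in tf}
            for tf in (Counter(toks) for toks in tokenized)]

def cosine(a, b):
    dot = sum(a[w] * b.get(w, 0.0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Tiny labeled set (invented): 1 = reported a material weakness.
train = [("we identified a material weakness in internal controls", 1),
         ("a material weakness in our controls was disclosed", 1),
         ("revenue increased due to strong product demand", 0),
         ("operating expenses declined versus the prior year", 0)]
texts = [t for t, _ in train]
query = "the company disclosed a material weakness"
vecs = tfidf_vectors(texts + [query])
qv = vecs[-1]
best = max(range(len(train)), key=lambda i: cosine(vecs[i], qv))
label = train[best][1]
```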

For each of the 50-80 things you mention in your post, OP, you'll need to
repeat this: (1) obtain labels and (2) figure out a way to train a machine
learning algorithm to handle that task for you. The alternative is to rely on
heuristics (e.g. "I know from past experience that companies using such-and-
such word in filings experienced X, so I'll encode that in business rules"),
which is an ugly way to build something but likely faster than learning about
machine learning or NLP.
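That heuristic route can be as simple as a table of regex "business rules", one per condition, covering the phrasing variants you know about (the patterns below are illustrative guesses, not a vetted rule set):

```python
# Rule-based extraction: each condition gets a regex covering known
# phrasing variants. Patterns here are illustrative, not exhaustive.
import re

RULES = {
    "material_weakness": re.compile(
        r"material weakness(es)? in (our |its |the company's )?"
        r"internal controls?", re.IGNORECASE),
    "going_concern": re.compile(
        r"substantial doubt about (our|its|the company's) ability "
        r"to continue as a going concern", re.IGNORECASE),
}

def apply_rules(text):
    """Return the set of condition names whose rule fires on the text."""
    return {name for name, pat in RULES.items() if pat.search(text)}
```

The upside is transparency (you can explain every hit); the downside is exactly what OP noted: companies phrase these things differently, so the rules need constant maintenance.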

------
jrowley
Sounds like a cool project! I'm not sure it's necessary for you to use a NoSQL
db - in fact, I'd avoid one until you have a good reason to use it. Postgres
is great, and you might already be familiar with it.
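To illustrate that point: the extracted findings are quite relational. Here's a minimal schema sketch using stdlib sqlite3 as a stand-in for Postgres (table and column names are invented):

```python
# Relational sketch: filings plus one row per extracted condition.
# sqlite3 (stdlib) stands in for Postgres; the schema is illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE filings (
    id        INTEGER PRIMARY KEY,
    cik       TEXT NOT NULL,
    form_type TEXT NOT NULL,
    filed_on  TEXT NOT NULL
);
CREATE TABLE conditions (           -- one row per extracted finding
    filing_id   INTEGER REFERENCES filings(id),
    condition   TEXT NOT NULL,      -- e.g. 'material_weakness'
    resolved_on TEXT                -- NULL while still outstanding
);
""")
conn.execute("INSERT INTO filings VALUES (1, '0000123456', '10-K', '2015-02-27')")
conn.execute("INSERT INTO conditions VALUES (1, 'material_weakness', NULL)")
open_issues = conn.execute(
    "SELECT f.cik, c.condition FROM conditions c "
    "JOIN filings f ON f.id = c.filing_id "
    "WHERE c.resolved_on IS NULL").fetchall()
```

Queries like "which companies have an unresolved material weakness, and since when" fall out of plain SQL joins; the raw filing text can live on the filesystem or in a text column alongside.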

------
sheraz
Interesting problem. I would start by looking at things like natural language
processing, machine learning, and maybe even Solr / Elasticsearch.

Also, I would look into the various SaaS offerings out there (MonkeyLearn, and
others).

