
Ask HN: What are the best tools for analyzing large bodies of text? - CoreSet
I'm a researcher in the social sciences, working on a project that requires me to scrape a large amount of text and then use NLP to determine things like sentiment analysis, LSM compatibility, and other linguistic metrics for different subsections of that content.

The issue: After weeks of work, I've scraped all this information (a few GB's worth) and begun to analyze it using a mixture of Node, Python and bash scripts. In order to generate all of the necessary permutations of this data (looking at Groups A, B, and C together, A & C, A & B, etc.), I've generated an unwieldy number of text files (the script generated > 50 GB before filling up my pitiful MBP hard drive), which I understand is no longer sustainable.

The easiest way forward is loading this all into a database I can query to analyze different permutations of populations. I don't have much experience with SQL, but it seems to fit here.

So how do I put all these .txt files into a SQL or NoSQL database? Are there any tools I could use to visualize this data (IntelliJ, my editor, keeps breaking)? And where should I do all this work? I'm thinking either an external hard drive, or a VPS I can just tunnel into.

Thanks in advance for your advice, HN!
======
drallison
It seems to me that you are coming at this from the wrong direction. Given that
you have a large body of text, what is it you want to learn about, or from, the text?
Collecting and applying random tools and making measurements without some
inkling about what you want or expect to discover makes no sense. Tell us more
about the provenance of your corpus of text and what sort of information you
want to derive from the data.

~~~
danso
If I could upvote your comment 10 times, I would. It doesn't seem that the OP
knows what exactly they want, and so the question of whether they should be
doing things in Node/Python/Bash/SQL/NoSQL is putting the cart way before the
horse. It's quite possible grep -- and some regex, and maybe some scripting
logic for tokenizing -- is all that's needed, if the hypothesis is clear
enough. But right now, it just sounds like OP has a ton of text and is hoping
that some tool can somehow produce an unanticipated, _useful_ insight...in my
experience, that rarely (OK, never, actually) happens.

~~~
ganeshkrishnan
I was wondering what exactly OP is looking for. Most probably he forgot to put
his problem in.

If he just wants to parse the text, his bare-minimum Ubuntu install is enough,
with grep, cat, etc.

------
rasengan0
>a project that requires me to scrape a large amount of text and then use NLP
to determine things like sentiment analysis, LSM compatibility, and other
linguistic metrics for different subsections of that content.

I ran into a similar project and found this helpful working with the
unstructured data:
[https://textblob.readthedocs.org/en/dev/](https://textblob.readthedocs.org/en/dev/)
[https://radimrehurek.com/gensim/](https://radimrehurek.com/gensim/)

------
cmarciniak
More information would be helpful. In terms of a data store that you can easily
query text from, I would recommend Elasticsearch. Kibana is a dashboard built
on Elasticsearch for performing analytics and visualization on your texts.
Elasticsearch also has a RESTful API, which would play nicely with your Python
scripts, or any scripting language for that matter. I would also recommend the
Python package gensim for your NLP.
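
From Python that could look roughly like this (just a sketch, not a full setup: the index and field names are made up, and it assumes a recent elasticsearch-py client and a default dynamic mapping):

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    # One document per post; Elasticsearch analyzes the body for full-text search.
    es.index(index="posts", document={"city": "chicago", "group": "A", "body": "..."})

    # Pull back one subpopulation, with a full-text match on the body field.
    results = es.search(index="posts", query={
        "bool": {
            "filter": [{"term": {"group.keyword": "A"}}],   # exact match on the group label
            "must": [{"match": {"body": "coffee"}}],
        },
    })
    for hit in results["hits"]["hits"]:
        print(hit["_source"]["city"], hit["_score"])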

~~~
ashley
Ah, somehow I skipped your comment when reading through them; otherwise I
would've just upvoted instead of posting :)

------
lsiebert
What social science?

You shouldn't be generating the text in advance and then processing it. You
should be dynamically generating the text in memory, so you basically only
have to worry about the memory for one text file at a time.
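
Roughly something like this (hypothetical helper names, just to show the shape; only one grouping is ever held in memory):

    from itertools import combinations

    def texts_for(group):
        # hypothetical loader: in the real project this would read one group's posts
        return ["placeholder post for group " + group]

    def analyse(text):
        # hypothetical metric: stands in for the sentiment / LSM / frequency code
        return {"tokens": len(text.split())}

    groups = ["A", "B", "C"]
    for r in range(1, len(groups) + 1):
        for combo in combinations(groups, r):
            # build the grouping in memory, analyse it, then let it be garbage-collected
            merged = "\n".join(t for g in combo for t in texts_for(g))
            print(combo, analyse(merged))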

As for visualizations, R and ggplot2 may work (R can handle text and data
munging, as well as sentiment analysis, etc.). It may be worth learning as a
social scientist.

ggplot2 has a Python port.

That said, you are probably using nltk, right? There are some tools in
nltk.draw. There is probably also a users' mailing list for whatever package or
tool you are using; consider asking this there.

~~~
fole
I worked for an NLP research think tank for a while and we _always_ created
text files as intermediate steps for each part of our system. It was basically
a cache of each step, and you could restart the system at whatever step _did_
work.

Hard drive space is cheap. Use as much as you want.

~~~
lsiebert
Making it clear that he doesn't need to buy equipment is a good thing. I agree
with you that logging results as you go is worthwhile, but for data munging, I
think it's better to keep your data in its original source, document in code
how you get your data into the system, and not require somebody reproducing
your results to have a huge HD or buy something.

As an aside, I was also a social scientist originally. My first degree was in
Psychology. The first time I felt like a programmer was taking supplied R code
that would have taken 8+ days to finish (2,400 Rasch scores at 5 minutes each)
and getting the whole thing to run in less than a minute, by moving from a
sequential search of every possibility to a probing strategy that found the
score that best fit the curve. Learning how to be more efficient in your code,
to use less space or time through a better algorithm to handle your data, is
both useful in its own right and intellectually rewarding.

------
nutate
Right now the fastest alternative to NLTK is spaCy
([https://honnibal.github.io/spaCy/](https://honnibal.github.io/spaCy/));
definitely worth a look. I don't know what you're trying to do with the
permutations part, but it seems like you can generate those on the fly through
some reproducible algorithm (such that some integer seed describes the
ordering in a reproducible way) then just keep track of the seeds, not the
permuted data.
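
Something like this (a sketch, not spaCy-specific): a fixed integer seed always reproduces the same ordering, so you log the seed instead of the shuffled copy of the data:

    import random

    def permuted_ids(doc_ids, seed):
        rng = random.Random(seed)      # deterministic for a fixed seed
        ids = list(doc_ids)
        rng.shuffle(ids)
        return ids

    # The same seed always yields the same ordering, so storing "seed=42"
    # is as good as storing the permuted corpus itself.
    assert permuted_ids(range(10), seed=42) == permuted_ids(range(10), seed=42)
    print(permuted_ids(range(10), seed=42))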

------
mark_l_watson
One approach is to put text files in Amazon S3 and write MapReduce jobs that
you can run with Elastic MapReduce. I did this a number of years ago for a
customer project and it was inexpensive and a nice platform to work with.
Microsoft, Google, and Amazon all have data warehousing products you can try
if you don't want to write MapReduce jobs.

That said, if you are only processing 2 GB of text, you can often do that in
memory on your laptop. This is especially true if you are doing NLP on
individual sentences, or paragraphs.

------
ChuckMcM
Well if you're willing to relocate to the Bay Area I could set you up in an
office with a variety of tools to analyze and classify text and do all sorts
of analysis on it, I'd even pay you :-) (yes I've got a couple of job openings
where this pretty much describes the job)

That said, converting large documents into data sets is examined in a number
of papers; you may find yourself getting more traction by splitting the
problem up that way: step 1) pull apart the document into a more useful form,
then step 2) do analysis on those parts. They are interrelated of course, and
some forms of documents lend themselves to disassembly better than others
(scientific papers, for example, are easy to unpack; random blog posts, less so).

As for "where" to do it, the ideal place is a NoSQL cluster. This is what
we've done at Blekko for years (and continue to do post acquisition) which is
put the documents we crawl from the Internet into a giant NoSQL data base and
then run jobs that execute in parallel across all of those servers to analyze
those documents (traditionally to build a search index, but other modalities
are interesting too.

------
koopuluri
What tools exactly are you using in Node and Python? Python has a nice data
analysis tool, pandas ([http://pandas.pydata.org/](http://pandas.pydata.org/)),
which would help with your task of generating multiple permutations of your
data. Check out plot.ly to visualize the data (it integrates well with a pandas
pipeline, in my experience). It would also help if you mentioned exactly what
kind of visualizations you're looking to create from the data.
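
As a rough sketch (the column names are made up), the group permutations can just be boolean masks over one DataFrame instead of copied files:

    import pandas as pd

    df = pd.DataFrame({
        "group": ["A", "B", "A", "C", "B"],
        "text":  ["post one", "post two", "post three", "post four", "post five"],
    })

    # One subset per combination of groups, computed on the fly rather than copied to disk.
    subset_ab = df[df["group"].isin(["A", "B"])]
    print(len(subset_ab), "posts in groups A and B")

    # A simple per-group metric (mean word count) in one pass.
    words = df["text"].str.split().str.len()
    print(df.assign(words=words).groupby("group")["words"].mean())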

With regards to your issue of scale, this might help:
[http://stackoverflow.com/questions/14262433/large-data-
work-...](http://stackoverflow.com/questions/14262433/large-data-work-flows-
using-pandas)

I had similar issues when doing research in computer science, and I feel a lot
of researchers working with data have this headache of organizing, visualizing,
and scaling their infrastructure, along with versioning data and coupling their
data with code. Adding more collaborators to this workflow would also be very
time-consuming...

------
ashley
>> _I 'm thinking now either an external hard drive, or on a VPS I can just
tunnel into._ Consider setting up an ElasticSearch cluster somewhere, like
with AWS, which takes plugins for ElasticSearch. Once you've indexed your data
with ES, then queries are pretty easy (JSON-based). This would also solve your
other problem with data visualization. ElasticSearch has an analytics tool
called Kibana. Pretty useful and doesn't require too much effort to set up or
use. I'm using this setup for a sentiment analysis project myself.

You didn't mention the libraries in your NLP pipeline (guessing NLTK bc of the
Python?), but if you're doing LSM compatibility, I'm guessing you might be
interested in clustering or topic-modelling algorithms and such...Mahout
integrates easily with ElasticSearch.

~~~
hodwik
Are there libraries you would suggest for someone using nltk & ElasticSearch
to get started doing sentiment analysis?

------
pvaldes
Don't know if that is what you need or not, but Common Lisp has the package
'cl-sentiment' available, specifically aimed at sentiment analysis of text.

[https://github.com/RobBlackwell/cl-
sentiment](https://github.com/RobBlackwell/cl-sentiment)

Other packages that you could find useful are cl-mongo, for MongoDB (NoSQL)
databases, cl-mysql, postmodern (PostgreSQL), and cl-postgres (PostgreSQL).

And for Perl you also have Rate_Sentiment:

[http://search.cpan.org/~prath/WebService-
GoogleHack-0.15/Goo...](http://search.cpan.org/~prath/WebService-
GoogleHack-0.15/GoogleHack/Examples/Rate_Sentiment.pl)

------
skadamat
The first immediate thing I would recommend is moving all of your files into
AWS S3: [http://aws.amazon.com/s3/](http://aws.amazon.com/s3/)

Storage is super cheap, and you can get rid of the clutter on your laptop. I
wouldn't recommend moving to a database yet, especially if you don't have any
prior experience working with them. S3 has great connector libraries and good
integrations with things like Spark, Hadoop, and other 'big data' analysis
tools. I would start down that path and see which tools might be best for
analyzing text files from S3!
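
A minimal boto3 sketch (the bucket and key names are made up) of pushing a file up and streaming it back:

    import boto3

    s3 = boto3.client("s3")

    # Push a local scrape result up to the bucket.
    s3.upload_file("scrapes/chicago.txt", "my-corpus-bucket", "raw/chicago.txt")

    # Later (e.g. from an EC2 box), stream it back without a permanent local copy.
    obj = s3.get_object(Bucket="my-corpus-bucket", Key="raw/chicago.txt")
    text = obj["Body"].read().decode("utf-8")
    print(len(text), "characters")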

~~~
nhaehnle
AWS is also a new set of tools that first has to be learned, and frankly it
sounds like overkill for this kind of problem (not to mention that it probably
comes with its own kind of overhead).

lsiebert's comment about not generating such large text files in the first
place is a good one: If you already have a tool that generates the text file,
and you already have the tool that analyzes it, it should be possible to just
run them simultaneously via some form of pipe, so that the intermediate result
never has to be stored.
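
For example (a sketch; the script names are hypothetical), the analysis side can read from stdin, so you can run `python generate.py | python analyse.py` and the intermediate text never touches the disk:

    # analyse.py
    import sys

    total_docs = 0
    total_words = 0
    for line in sys.stdin:             # one generated document per line, for simplicity
        total_docs += 1
        total_words += len(line.split())

    print(total_docs, "docs,", total_words, "words")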

Finally, if this is not possible for some unclear reason and depending on the
kind of space and throughput that really is required, simply getting an
external disk may be the better way to go.

------
CoreSet
EDIT:

I’m astounded by the number and quality of responses for appropriate tools.
Thank you HN! To shed a little more light on the project:

I’m compiling research for a sociology / gender studies project that takes a
national snapshot of romantic posts from classifieds sites / sources across
the country, and then uses that data to try and draw meaningful insights about
how different races, classes, and genders of Americans engage romantically
online.

I’ve already run some basic text-processing algorithms (tf-idf, certain term
frequency lists, etc.) on smaller txt files that represent the content for a
few major U.S. metros and discovered some surprises that I think warrant a
larger follow-up. So I have a few threads I want to investigate already, but I
also don’t want to be blind to interesting comparisons that can be drawn
between data sets now that I have more information (that’s why I’m asking for
a bit of a grab-bag of text-processing abilities).

My problem is that the techniques from the first phase (analyzing a few
metros) didn’t scale with the larger data set: the entire data set is only 2GB
of text, but it started maxing out my storage as I recopied the text files
over and over again into different groupings. Starting with a datastore from
the beginning would also have worked, but it just wasn’t necessary at that
stage of the project.

My current setup:

- Python’s Beautiful Soup + CasperJS for scraping (which is done)
- Node, relying primarily on the excellent NLP package “natural,” for analysis
- Bash to tie things together
- My personal MBP as the environment

So given the advice expressed in the thread (and despite my love of Shiny New
Things), a combination of shell scripts and awk (a command-line language
specifically for structured text files!), which I had heard about before but
thought was a networking tool, will probably work best, backed up by a 1TB or
similar external drive, which I could use anyway (and would be more secure). I
have the huge luxury, of course, that this is a one-time research-oriented
project, and not something I need to worry about being performant, etc.

I will of course look into a lot of the solutions provided here regardless, as
something (especially along the visualizations angle) could prove more useful,
and it’s all fascinating to me.

Thanks again HN for all of your help.

~~~
syllogism
You don't need to copy around your data into different groupings. Just group
the doc IDs and rerun the analysis pipeline each time. If a part of the
analysis pipeline is slow, it can be cached. But that's that module's
business. Don't let it disrupt the interface.

If you want the analysis to be fast, here are some speed benchmarks for my
software, spaCy: [http://honnibal.github.io/spaCy/#speed-
comparison](http://honnibal.github.io/spaCy/#speed-comparison)

For what you want to do, the tokenizer is sufficient --- and it runs at 5,000
documents per second. You can then efficiently export to a numpy array of
integers, with tokens.count_by. Then you can base your counts on that, as
numpy operations. Processing a few GB of text in this way should be fast
enough that you don't need to do any caching. I develop spaCy on my MacBook
Air, so it should run fine on your MBP.
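
With a current spaCy release that step looks roughly like this (the API has moved around since the version linked above; the blank English pipeline here is just the tokenizer):

    import spacy
    from spacy.attrs import ORTH

    nlp = spacy.blank("en")            # tokenizer only, no statistical model needed
    doc = nlp("the quick brown fox jumps over the lazy dog the fox")

    counts = doc.count_by(ORTH)        # {token hash: count}, as integers
    for orth_id, freq in sorted(counts.items(), key=lambda kv: -kv[1]):
        print(nlp.vocab.strings[orth_id], freq)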

As a general tip though, the way you're copying around your batches of data is
definitely bad. It's really much better to structure the core of your program
in a simple and clear way, so that caching is handled "transparently".

So, let's say you really did need to run a much more expensive analysis
pipeline, so it was critical that each document was only analysed once.

You would still make sure that you had a simple function like:

    
    
        def analyse(doc_id):
            <do stuff>
            return <stuff>
    

So that you can clearly express what you want to do:

    
    
        def gather_statistics(batch):
            analyses = []
            for doc_id in batch:
                analyses.append(analyse(doc_id))
            <do stuff>
            return <stuff>
    

If the problem is that analyse(doc_id) takes too long, that's fine --- you can
cache. But make sure that's something only analyse(doc_id) deals with. It
shouldn't complicate the interface to the function.
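
For example (a sketch, not meant as the real pipeline), memoizing by doc_id inside analyse keeps the cache invisible to callers:

    from functools import lru_cache

    @lru_cache(maxsize=None)
    def analyse(doc_id):
        # <expensive NLP work would go here>
        return {"doc_id": doc_id, "sentiment": 0.0}

    def gather_statistics(batch):
        # callers never know a cache exists; analyse() runs once per unique id
        return [analyse(doc_id) for doc_id in batch]

    print(gather_statistics([1, 2, 2, 3]))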

------
bkin
No answer to your question per se, but Software Engineering Radio has a nice
episode about working with and extracting knowledge from larger bodies of
text: [http://www.se-radio.net/2014/11/episode-214-grant-
ingersoll-...](http://www.se-radio.net/2014/11/episode-214-grant-ingersoll-on-
his-book-taming-text/)

------
machinelearning
We're building a database specifically to solve this problem. We're almost
production-ready and will be going open source in a couple of months. Send us
an email at textedb@gmail.com if you'd like to try it out.
[http://textedb.com/](http://textedb.com/)

~~~
ninebrows
Is it a standalone tool or a web based offering?

~~~
machinelearning
It's a standalone tool, which can be deployed on a remote host just like any
other database.

------
quizotic
There are commercial products that do a decent job of what you want. I have
experience with Oracle Endeca, and have heard that Qlik is even better. Both
have easy ways to load and visualize.

There are research frameworks that are quite good. One is Factorie from UMass
Amherst (factorie.cs.umass.edu), which supports latent Dirichlet allocation and
lots more. Stanford also has wonderful tools.

Yes, you can dump text into SQL, and postgres has some text analytics. But my
guess is that you'll soon want capabilities that RDBs don't have. Mongo has
some support for text, but not nearly as much as postgres. I think both SQL
and NoSQL are currently round pegs in square holes for deep text analytics.
They're barely ok for search.

------
hudibras
Here's a good place to start: [http://www.matthewjockers.net/text-analysis-
with-r-for-stude...](http://www.matthewjockers.net/text-analysis-with-r-for-
students-of-literature/#comment-39991)

------
chubot
This sounds like an algorithmic issue. How many permutations are you
generating? Are you sure you can scale it with different software tools or
hardware, or is there an inherent exponential blowup?
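
For n groups there are 2^n - 1 non-empty combinations, so materializing a file per combination stops scaling quickly:

    from itertools import combinations

    def all_groupings(groups):
        # every non-empty combination of groups: 2**n - 1 of them
        return [c for r in range(1, len(groups) + 1) for c in combinations(groups, r)]

    print(len(all_groupings("ABC")))         # 7
    print(len(all_groupings("ABCDEFGHIJ")))  # 1023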

Are you familiar with big-O / computational complexity? (I ask since you say
your background is in the social sciences.)

A few GB's of input data is generally easy to work with on a single machine,
using Python and bash. If you need big intermediate data, you can brute force
it with distributed systems, hardware, C++, etc. but that can be time
consuming, depending on the application.

------
jaz46
I'd have to know a little more about your setup to be sure, but Pachyderm
(pachyderm.io) might be a viable option. Full disclosure, I'm one of the
founders. The biggest advantage you'd get from our system is that you can
continue using all of those python and bash scripts to analyze your data in a
distributed fashion instead of having to learn/use SQL. If it looks like
Pachyderm might be a good fit, feel free to email me joey@pachyderm.io

------
cafebeen
Might be worth trying a visual analytics system like Overview:

[https://blog.overviewdocs.com/](https://blog.overviewdocs.com/)

There's also a nice evaluation paper:

[http://www.cs.ubc.ca/labs/imager/tr/2014/Overview/overview.p...](http://www.cs.ubc.ca/labs/imager/tr/2014/Overview/overview.pdf)

------
Rainymood
Interesting. I recently wrote my thesis on Latent Dirichlet Allocation (LDA);
it's worth checking out. Without going into too much technical detail, LDA is a
'topic model'. Given a large set of documents (a corpus), it estimates the
'topics' of the corpus and gives a breakdown of each document in terms of how
much of topic 1, topic 2, etc. it contains.
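
A minimal sketch with gensim (which others in the thread have already suggested; the toy corpus and num_topics=2 are placeholders, not tuned choices):

    from gensim import corpora
    from gensim.models import LdaModel

    docs = [
        ["coffee", "date", "downtown", "coffee"],
        ["hiking", "outdoors", "dog", "trail"],
        ["movie", "date", "dinner", "downtown"],
    ]
    dictionary = corpora.Dictionary(docs)
    bow = [dictionary.doc2bow(doc) for doc in docs]

    lda = LdaModel(bow, num_topics=2, id2word=dictionary, passes=10, random_state=0)
    print(lda.print_topics())
    # The per-document topic mixture: the "breakdown" described above.
    print(lda.get_document_topics(bow[0]))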

------
SQL2219
SQL Server has a feature called FileTables. Basically it's the ability to dump
a bunch of files into a folder; you can then query the contents using semantic
search.

[https://msdn.microsoft.com/en-
us/library/ff929144%28v=sql.11...](https://msdn.microsoft.com/en-
us/library/ff929144%28v=sql.110%29.aspx)

------
kaa2102
Looking at some of the comments, it would be helpful if you took a step back
and clearly defined the data, data markers, and relationship hypotheses you are
looking for.

I've done some text file parsing and analysis with just C++ and Excel. You
could possibly simplify the analytical process by clearly defining what you
need from the text file.

------
tedchs
Have you considered Google Bigquery? It's a managed data warehouse with a SQL-
like query language. Easy to load in your data, run queries, then drop the
database when you're done with it.

------
effnorwood
[https://www.chrisstucchio.com/blog/2013/hadoop_hatred.html](https://www.chrisstucchio.com/blog/2013/hadoop_hatred.html)

------
wigsgiw
You might be interested in [http://entopix.com](http://entopix.com); it could
be ideal.

------
timmins
I used a Windows app called TextPipe. It has a large feature set for wrangling
text, but the visualization aspect may need another tool.

------
halayli
It's not clear what your objective is. I could have a 1KB text file and end up
with a 1TB file after "analyzing" it if I don't have a goal in mind.

------
gt565k
Apache Solr or Elasticsearch

------
kitwalker12
A Java application with SQL adapters and Apache Tika might work.

------
bra-ket
Apache Spark

~~~
johan_stenberg
I would not use Apache Spark when the amount of data is this small. I have
made this mistake myself when processing _only_ a few gigabytes.

Take a look at this: [http://aadrake.com/command-line-tools-can-
be-235x-faster-tha...](http://aadrake.com/command-line-tools-can-
be-235x-faster-than-your-hadoop-cluster.html)

IMHO people abuse Spark, and I would be truly impressed if anybody could write
a Spark program that's faster than just a regular Scala program for processing
this.

------
nodivbyzero
grep, sed

------
codeonfire
If you want high performance and simplicity, why not use flat files, bash,
grep (maybe parallel), cut, awk, wc, uniq, etc.? You can get very far with
these, and if you have a fast local disk you can get huge read rates. A few GB
can be scanned in a matter of seconds. Awk can be used to write your queries.
I don't understand what you are trying to do, but if it can be done with a SQL
database and doesn't involve lots of joins, then it can be done with a
delimited text file. If you don't have a lot of disk space you can also work
with gzipped files, zcat, zgrep, etc. I would not even consider distributed
solutions or NoSQL until I had at least 100GB of data (more like 1TB), and I
would not consider any sort of SQL database unless I had a lot of complex
joins.

~~~
cactusface
This is definitely the most straightforward way to go. It is almost certainly
not the way he will go. In particular Awk is just great for this job and very
easy to learn.

------
developer1
Does the NSA allow its employees to look for help from the general public like
this? Seems odd for such a secretive organization to post publicly asking for
help on how to parse our conversations.

~~~
huskyr
Where do you see the NSA in the OP's question?

