Ask HN: What are the best tools for analyzing large bodies of text?
83 points by CoreSet on June 17, 2015 | 52 comments
I'm a researcher in the social sciences, working on a project that requires me to scrape a large amount of text and then use NLP to determine things like sentiment analysis, LSM compatibility, and other linguistic metrics for different subsections of that content.

The issue: After weeks of work, I've scraped all this information (a few GB's worth) and begun to analyze it using a mixture of Node, Python and bash scripts. In order to generate all of the necessary permutations of this data (looking at Groups A, B, and C together, A & C, A & B, etc), I've generated an unwieldy number of text files (the script generated > 50 GB before filling up my pitiful MBP hard drive), which I understand is no longer sustainable.

The easiest way forward is loading this all into a database I can query to analyze different permutations of populations. I don't have much experience with SQL, but it seems to fit here.

So how do I put all these .txt files into a SQL or NoSQL database? Are there any tools I could use to visualize this data (IntelliJ, my editor, keeps breaking)? And where should I do all this work? I'm thinking now either an external hard drive, or on a VPS I can just tunnel into.

Thanks in advance for your advice HN!

It seems to me that you are driving from the wrong direction. Given that you have a large body of text, what is it you want to learn about/from the text? Collecting and applying random tools and making measurements without some inkling about what you want or expect to discover makes no sense. Tell us more about the provenance of your corpus of text and what sort of information you want to derive from the data.

If I could upvote your comment 10 times, I would. It doesn't seem that the OP knows what exactly they want, and so the question of whether they should be doing things in Node/Python/Bash/SQL/NoSQL is putting the cart way before the horse. It's quite possible grep -- and some regex, and maybe some scripting logic for tokenizing -- is all that's needed, if the hypothesis is clear enough. But right now, it just sounds like OP has a ton of text and is hoping that some tool can somehow produce an unanticipated, useful insight...in my experience, that rarely (OK, never, actually) happens.
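
A minimal sketch of that grep-plus-regex-plus-tokenizing approach, using only the Python standard library. The token pattern and the terms of interest are made up for illustration; a real study would choose them from its hypothesis:

```python
import re
from collections import Counter

# Hypothetical pattern: runs of letters, optionally with an internal apostrophe.
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")

def tokenize(text):
    """Lowercase the text and return its word tokens."""
    return TOKEN_RE.findall(text.lower())

def term_counts(lines, terms):
    """Count how often each term of interest occurs across an iterable of lines."""
    wanted = set(terms)
    counts = Counter()
    for line in lines:
        counts.update(tok for tok in tokenize(line) if tok in wanted)
    return counts

sample = ["I don't love long walks.", "Love, love, LOVE hiking!"]
counts = term_counts(sample, ["love", "hiking"])
print(counts["love"], counts["hiking"])  # 4 1
```

Because term_counts accepts any iterable of lines, an open file handle can be passed in directly and the file is never held in memory as a whole.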

I was wondering what exactly is OP looking for. Most probably he forgot to put his problem in.

If he just wants to parse the text, his bare-minimum Ubuntu install is enough, with grep, cat, etc.


Success really depends on the conception of the problem, the design of the system, not in the details of how it's coded. - Leslie Lamport

You're not going to come up with a simple design through any kind of coding techniques or any kind of programming language concepts. Simplicity has to be achieved above the code level before you get to the point which you worry about how you actually implement this thing in code. - Leslie Lamport

You're not going to find the best algorithm in terms of computational complexity by coding. - Leslie Lamport

Sometimes the problem is to discover what the problem is. - Gordon Glegg, 'The Design of Design' (1969)

The besetting mistake of expert designers is not designing the thing wrong, but designing the wrong thing. - Frederick P. Brooks, 'The Design of Design: Essays from a Computer Scientist' (2010)

... then again ...

In practice, designing seems to proceed by oscillating between sub-solution and sub-problem areas, as well as by decomposing the problem and combining sub-solutions. - Nigel Cross

... but did you want re-use? ...

A general-purpose product is harder to design well than a special-purpose one. - Frederick P. Brooks, 'The Design of Design: Essays from a Computer Scientist' (2010)

Design up front for reuse is, in essence, premature optimization. - AnimalMuppet

... lines of thought from my fortune clone @ https://github.com/globalcitizen/taoup

Man, this comment is simultaneously both SO right and SO wrong. I guess the tools we have today are set up for knowing what you want to learn. Named Entity Retrieval? Topic classification? Sentiment Extraction?

OTOH, why should OP have to know what's interesting a-priori? The mantra of big data is "listen to the data, let it tell you, leave your preconceived notions at the door." The fault is with our current tool chain. We need something that tells us what about the text is interesting before we dive in for a closer look.

I'm imagining a tool that told me: "this text seems to have a lot of opinions and sentiment," "it's about a product that was returned," "a number of people's names are mentioned," "this text was loaded along with some structured data that appears to reference a price, a location, and a date."

Why is it such a stretch to combine the tools we already have to generate and push summarizations? Maybe it's just the cost of computation? If you know you're looking for topic and don't care about sentiment, then you can avoid paying for it?

I'd like to understand a bit more about your data structures. Raw text should live in a file system, but if you're creating highly structured data, a SQL database could be a good fit. I'd also suggest taking a look at Spark; you can run it locally on your machine (it does not need a Hadoop cluster) and it can interact with flat files. If you're finding that your machine is lacking in horsepower or space, AWS could be your friend.

>a project that requires me to scrape a large amount of text and then use NLP to determine things like sentiment analysis, LSM compatibility, and other linguistic metrics for different subsections of that content.

I ran into a similar project and found this helpful working with the unstructured data: https://textblob.readthedocs.org/en/dev/ https://radimrehurek.com/gensim/

More information would be helpful. In terms of having a data store that you can easily query text I would recommend Elasticsearch. Kibana is a dashboard built on Elasticsearch for performing analytics and visualization on your texts. Elasticsearch also has a RESTful api which would play nicely with your Python scripts or any scripting language for that matter. I would also recommend the Python package gensim for your NLP.

Ah, somehow I skipped your comment when reading through them; otherwise I would've just upvoted instead of posted :)

What social science?

You shouldn't be generating the text in advance and then processing it. You should be dynamically generating the text in memory, so you basically only have to worry about the memory for one text file at a time.
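
As a rough sketch of what "one text file at a time" could look like in Python: a generator walks the chosen groups and yields each file's text lazily, so no combined file is ever written. The directory layout (root/<group>/*.txt) and the function names are assumptions for the example, not anything the OP described:

```python
from pathlib import Path

def iter_group_texts(root, groups):
    """Yield (group, path, text) one file at a time; nothing is copied to disk.

    Assumes a hypothetical layout of root/<group>/*.txt.
    """
    for group in groups:
        for path in sorted(Path(root, group).glob("*.txt")):
            yield group, path, path.read_text(encoding="utf-8")

def total_words(root, groups):
    """Example analysis: whitespace-separated word count across the chosen groups."""
    return sum(len(text.split()) for _, _, text in iter_group_texts(root, groups))
```

Calling total_words(root, ["A", "C"]) and then total_words(root, ["A", "B"]) reads the same source files twice but never duplicates them, which trades some repeated I/O for constant disk usage.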

As for visualizations, R and ggplot2 may work (R can handle text and data munging, as well as sentiment analysis etc.) It may be worth using it as a social scientist.

ggplot2 has a python port.

That said, you are probably using nltk, right? There are some tools in nltk.draw. There is probably also a user's mailing list for whatever package or tool you are using, consider asking this there.

I worked for an NLP research think tank for a while and we always created text files as intermediate steps to each part of our system. It was basically a cache of each step, and you could restart the system at whatever step did work.
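
A small sketch of that "file per step" caching idea, assuming each step's result can be serialized as JSON. The helper name cached_step and its parameters are invented for the example:

```python
import json
from pathlib import Path

def cached_step(name, cache_dir, compute):
    """Return the saved result of a pipeline step, computing and saving it if missing.

    `name`, `cache_dir`, and `compute` are illustrative; results must be JSON-serializable.
    """
    path = Path(cache_dir) / f"{name}.json"
    if path.exists():
        # A previous run already finished this step, so reuse its output.
        return json.loads(path.read_text(encoding="utf-8"))
    result = compute()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(result), encoding="utf-8")
    return result
```

Deleting one step's file forces just that step to be recomputed on the next run, which is what makes it possible to restart from the last step that worked.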

Hard drive space is cheap. Use as much as you want.

Making it clear that he doesn't need to buy equipment is a good thing. I agree with you that logging results as you go is worthwhile, but for data munging, I think it's better to keep your data in its original source, document in code how you get your data into the system, and not require somebody reproducing your results to have a huge HD or buy something.

As an aside, I was also a social scientist originally. My first degree was in Psychology. The first time I felt like a programmer was when I took supplied R code that would have taken 8+ days to finish (2400 Rasch scores at 5 minutes each) and got the whole thing to run in less than a minute by moving from a sequential search of every possibility to a probing strategy to find the score that best fit the curve. Learning how to be more efficient in your code, using less space or time through a better algorithm to handle your data, is both useful in its own right and intellectually rewarding.

R is going to give you some headaches as it relies heavily on the local machine's memory. Using RStudio on a beefed up AWS instance might help make calculation time a bit more palatable.

I totally agree, but if you want to use R, also consider this:


Package for fast access to Large ASCII Files. Can be used with big files that don't fit in the available memory.

> You shouldn't be generating the text in advance and then processing it.

That depends on how long the preprocessing takes and how often you need to manipulate the processed text.

Right now the fastest alternative to nltk is spaCy https://honnibal.github.io/spaCy/ definitely worth a look. I don't know what you're trying to do with the permutations part, but it seems like you can generate those on the fly through some reproducible algorithm (such that some integer seed describes the ordering in a reproducible way) then just keep track of the seeds, not the permuted data.
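
The seed idea could be sketched like this with Python's standard random module. The function name is hypothetical, and the exact ordering a seed produces is only guaranteed to repeat on the same Python version, so the seed should be recorded together with the interpreter version:

```python
import random

def permutation_from_seed(items, seed):
    """Return a reproducible ordering of `items` determined only by `seed`."""
    order = list(items)
    # A private Random instance keeps this independent of the global random state.
    random.Random(seed).shuffle(order)
    return order

doc_ids = list(range(10))
first = permutation_from_seed(doc_ids, seed=42)
again = permutation_from_seed(doc_ids, seed=42)
print(first == again)  # True: the same seed gives the same ordering
```

Storing the integer 42 is then enough to regenerate that ordering later; the permuted data itself never needs to be saved.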

One approach is to put text files in Amazon S3 and write map reduce jobs that you can run with Elastic MapReduce. I did this a number of years ago for a customer project and it was inexpensive and a nice platform to work with. Microsoft, Google, and Amazon all have data warehousing products you can try if you don't want to write MapReduce jobs.

That said, if you are only processing 2 GB of text, you can often do that in memory on your laptop. This is especially true if you are doing NLP on individual sentences, or paragraphs.

Well if you're willing to relocate to the Bay Area I could set you up in an office with a variety of tools to analyze and classify text and do all sorts of analysis on it, I'd even pay you :-) (yes I've got a couple of job openings where this pretty much describes the job)

That said, converting large documents into data sets is examined in a number of papers; you may find yourself getting more traction by splitting the problem up that way: step 1) pull apart the document into a more useful form, then step 2) do analysis on those parts. They are interrelated of course, and some forms of documents lend themselves to disassembly better than others (scientific papers, for example, are easy to unpack; random blog posts, less so).

As for "where" to do it, the ideal place is a NoSQL cluster. This is what we've done at Blekko for years (and continue to do post acquisition), which is put the documents we crawl from the Internet into a giant NoSQL database and then run jobs that execute in parallel across all of those servers to analyze those documents (traditionally to build a search index, but other modalities are interesting too).

What tools exactly are you using in Node and Python? Python has a nice data analysis tool, Pandas (http://pandas.pydata.org/), which would help with your task of generating multiple permutations of your data. Check out plot.ly to visualize the data (it integrates well with a pandas pipeline, from experience). It would also help if you mentioned exactly what kind of visualizations you're looking to create from the data.

With regards to your issue of scale, this might help: http://stackoverflow.com/questions/14262433/large-data-work-...

I had similar issues when doing research in computer science, and I feel a lot of researchers working with data have this headache of organizing, visualizing and scaling their infrastructure along with versioning data and coupling their data with code. Also adding more collaborators to this workflow would be very time consuming...

>>I'm thinking now either an external hard drive, or on a VPS I can just tunnel into.

Consider setting up an ElasticSearch cluster somewhere, like with AWS, which takes plugins for ElasticSearch. Once you've indexed your data with ES, then queries are pretty easy (JSON-based). This would also solve your other problem with data visualization. ElasticSearch has an analytics tool called Kibana. Pretty useful and doesn't require too much effort to set up or use. I'm using this setup for a sentiment analysis project myself.

You didn't mention the libraries in your NLP pipeline (guessing NLTK bc of the Python?), but if you're doing LSM compatibility, I'm guessing you might be interested in clustering or topic-modelling algorithms and such...Mahout integrates easily with ElasticSearch.

Are there libraries you would suggest for someone using nltk & ElasticSearch to get started doing sentiment analysis?

Don't know if this is what you need or not, but Common Lisp has the package 'cl-sentiment' available, specifically aimed at doing sentiment analysis on text.


Other packages that you could find useful are cl-mongo, for mongo no-sql databases, cl-mysql, postmodern (postgresql) and cl-postgres (postgresql)

And for Perl you have also Rate_sentiment


The first immediate thing I would recommend is moving all of your files into AWS S3: http://aws.amazon.com/s3/

Storage is super cheap, and you can get rid of the clutter on your laptop. I wouldn't recommend moving to a database yet, especially if you don't have any experience working with them before. S3 has great connector libraries and good integrations with things like Spark and Hadoop and other 'big data' analysis tools. I would start to go down that path and see which tools might be best for analyzing text files from S3!

AWS-stuff is also a new tool that first has to be learned, and frankly it sounds like overkill for this kind of problem (not to mention that it probably comes with its own kind of overhead).

lsiebert's comment about not generating such large text files in the first place is a good one: If you already have a tool that generates the text file, and you already have the tool that analyzes it, it should be possible to just run them simultaneously via some form of pipe, so that the intermediate result never has to be stored.

Finally, if this is not possible for some unclear reason and depending on the kind of space and throughput that really is required, simply getting an external disk may be the better way to go.


I’m astounded by the number and quality of responses for appropriate tools. Thank you HN! To shed a little more light on the project:

I’m compiling research for a sociology / gender studies project that takes a national snapshot of romantic posts from classifieds sites / sources across the country, and then uses that data to try and draw meaningful insights about how different races, classes, and genders of Americans engage romantically online.

I’ve already run some basic text-processing algorithms (tf-idf, certain term frequency lists, etc) on smaller txt files that represent the content for a few major U.S metros and discovered some surprises that I think warrant a larger follow-up. So I have a few threads I want to investigate already, but I also don’t want to be blind to interesting comparisons that can be drawn between data sets, now that I have more information (that’s why I’m asking for a bit of a grab-bag of text-processing abilities).

My problem is that the techniques from the first phase (analyzing a few metros) didn’t scale with the larger data set: The entire data set is only 2GB of text, but it started maxing my memory as I recopied the text files over and over again into different groupings. Starting with a datastore from the beginning would also have worked, but it just wasn’t necessary at the beginning of the project.

My current setup:

Python’s Beautiful Soup + CasperJS for scripting (which is done)

Node, relying primarily on the excellent NLP package “natural,” for analysis

Bash to tie things together

My personal MBP as the environment

SO given the advice expressed in the thread (and despite my love of Shiny New Things), a combination of shell scripts and awk (a CL language specifically for structured text files!), which I had heard about before but thought was a networking tool, will probably work best, backed up by a 1TB or similar external drive, which I could use anyway (and would be more secure). I have the huge luxury of course that this is a one-time research-oriented project, and not something I need to worry about being performant, etc.

I will of course look into a lot of the solutions provided here regardless, as something (especially along the visualizations angle) could prove more useful, and it’s all fascinating to me.

Thanks again HN for all of your help.

You don't need to copy around your data into different groupings. Just group the doc IDs and rerun the analysis pipeline each time. If a part of the analysis pipeline is slow, it can be cached. But that's that module's business. Don't let it disrupt the interface.
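
For illustration, grouping doc IDs instead of copying text might look like the sketch below. The GROUPS index and the stand-in analyse function are hypothetical; the point is only that each grouping is a list of IDs, not a new set of files:

```python
from itertools import combinations

# Hypothetical index: group label -> IDs of the documents in that group.
GROUPS = {"A": [1, 2], "B": [3], "C": [4, 5]}

def analyse(doc_id):
    """Stand-in for the real per-document analysis."""
    return {"doc_id": doc_id}

def batches(groups):
    """Yield (labels, doc_ids) for every combination of one or more groups."""
    labels = sorted(groups)
    for size in range(1, len(labels) + 1):
        for combo in combinations(labels, size):
            yield combo, [doc_id for label in combo for doc_id in groups[label]]

for combo, doc_ids in batches(GROUPS):
    results = [analyse(doc_id) for doc_id in doc_ids]
    print("+".join(combo), len(results))
```

With three groups this runs the pipeline over seven batches (A, B, C, A+B, A+C, B+C, A+B+C) while the underlying documents stay where they are.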

If you want the analysis to be fast, some speed benchmarks with my software, spaCy: http://honnibal.github.io/spaCy/#speed-comparison

For what you want to do, the tokenizer is sufficient --- which runs at 5,000 documents per second. You can then efficiently export to a numpy array of integers, with tokens.count_by. Then you can base your counts on that, as numpy operations. Processing a few gb of text in this way should be fast enough that you don't need to do any caching. I develop spaCy on my MacbookAir, so it should run fine on your MBP.

As a general tip though, the way you're copying around your batches of data is definitely bad. It's really much better to structure the core of your program in a simple and clear way, so that caching is handled "transparently".

So, let's say you really did need to run a much more expensive analysis pipeline, so it was critical that each document was only analysed once.

You would still make sure that you had a simple function like:

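    def analyse(doc_id):
        # Placeholder for the real per-document work (load, tokenize, count, ...).
        result = {"doc_id": doc_id}
        return result

So that you can clearly express what you want to do:

    def gather_statistics(batch):
        analyses = []
        for doc_id in batch:
            analyses.append(analyse(doc_id))
        # Combine the per-document results however the study requires.
        return analyses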
If the problem is that analyse(doc_id) takes too long, that's fine --- you can cache. But make sure that's something only the analyse(doc_id) deals with. It shouldn't complicate the interface to the function.
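
One way to keep the caching inside analyse(doc_id) is the standard library's functools.lru_cache decorator. This is only a sketch: the body of analyse is a placeholder, the CALLS list exists purely to make the caching visible, and the cache lives in memory for a single run (an on-disk cache would be needed to survive restarts):

```python
from functools import lru_cache

CALLS = []  # records real computations, only to make the caching visible

@lru_cache(maxsize=None)
def analyse(doc_id):
    """Stand-in for an expensive per-document analysis; callers never see the cache."""
    CALLS.append(doc_id)
    return len(str(doc_id))  # placeholder result

def gather_statistics(batch):
    """Analyse every document in the batch; repeats are served from the cache."""
    return [analyse(doc_id) for doc_id in batch]

gather_statistics([1, 2, 3])
gather_statistics([2, 3, 4])
print(len(CALLS))  # 4: documents 2 and 3 were only analysed once
```

gather_statistics is identical with or without the decorator, which is the sense in which the cache does not complicate the interface.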

No answer to your question per se, but Software Engineering Radio has a nice episode about working with and extracting knowledge from larger bodies of text: http://www.se-radio.net/2014/11/episode-214-grant-ingersoll-...

We're building a database specifically to solve this problem. We're almost production ready and will be going OpenSource in a couple of months. Send us an email at textedb@gmail.com if you'd like to try it out. http://textedb.com/

Is it a standalone tool or a web based offering?

standalone tool, which can be deployed on a remote host identical to any other database

There are commercial products that do a decent job of what you want. I have experience with Oracle Endeca, and have heard that Qlik is even better. Both have easy ways to load and visualize.

There are research frameworks that are quite good. One is Factorie from U.Mass Amherst (factorie.cs.umass.edu) that supports latent dirichlet allocation and lots more. Stanford also has wonderful tools.

Yes, you can dump text into SQL, and postgres has some text analytics. But my guess is that you'll soon want capabilities that RDBs don't have. Mongo has some support for text, but not nearly as much as postgres. I think both SQL and NoSQL are currently round pegs in square holes for deep text analytics. They're barely ok for search.
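
For a sense of how little is involved in dumping text into SQL, here is a sketch with SQLite via Python's built-in sqlite3 module. The table name, column names, and sample rows are made up; a real load would read the scraped .txt files instead:

```python
import sqlite3

def build_db(rows, db_path=":memory:"):
    """Create a table of posts and load (group_name, body) rows into it.

    The table and column names are made up for this sketch.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts ("
        "id INTEGER PRIMARY KEY, group_name TEXT, body TEXT)"
    )
    conn.executemany("INSERT INTO posts (group_name, body) VALUES (?, ?)", rows)
    conn.commit()
    return conn

def posts_for(conn, groups):
    """Return the bodies of all posts belonging to any of the given groups."""
    placeholders = ",".join("?" for _ in groups)
    query = f"SELECT body FROM posts WHERE group_name IN ({placeholders}) ORDER BY id"
    return [body for (body,) in conn.execute(query, list(groups))]

conn = build_db([("A", "first post"), ("B", "second post"), ("C", "third post")])
print(posts_for(conn, ["A", "C"]))  # ['first post', 'third post']
```

Each grouping then becomes a query rather than a new copy of the text. Passing a file path instead of ":memory:" keeps the database on disk between runs.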

This sounds like an algorithmic issue. How many permutations are you generating? Are you sure you can scale it with different software tools or hardware, or is there an inherent exponential blowup?
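
To put a number on the blowup, assume "permutations" here means every grouping of one or more populations (A, B, C, A & B, and so on), which is a guess based on the OP's description. The count of such groupings doubles with each added population:

```python
from itertools import combinations

def count_groupings(n_groups):
    """Closed form: number of non-empty subsets of n_groups groups."""
    return 2 ** n_groups - 1

def count_by_enumeration(n_groups):
    """Same number obtained by listing every subset; only feasible for small n."""
    labels = range(n_groups)
    return sum(1 for size in range(1, n_groups + 1)
               for _ in combinations(labels, size))

for n in (3, 10, 20):
    print(n, count_groupings(n))
# 3 7
# 10 1023
# 20 1048575
```

Each group appears in 2**(n - 1) of those subsets, so writing out a separate copy of the text for every grouping multiplies the corpus size by that factor. No faster tool or bigger disk changes that growth rate.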

Are you familiar with big-O / computational complexity? (I ask since you say your background is in social sciences.)

A few GB's of input data is generally easy to work with on a single machine, using Python and bash. If you need big intermediate data, you can brute force it with distributed systems, hardware, C++, etc. but that can be time consuming, depending on the application.

I'd have to know a little more about your setup to be sure, but Pachyderm (pachyderm.io) might be a viable option. Full disclosure, I'm one of the founders. The biggest advantage you'd get from our system is that you can continue using all of those python and bash scripts to analyze your data in a distributed fashion instead of having to learn/use SQL. If it looks like Pachyderm might be a good fit, feel free to email me joey@pachyderm.io

Might be worth trying a visual analytics system like Overview:


There's also a nice evaluation paper:


Interesting. I recently wrote my thesis on Latent Dirichlet Allocation (LDA), it's worth checking it out. Without going into too much technical detail, LDA is a 'topic model'. Given a large set of documents (a corpus), it estimates the 'topics' of the corpus and gives a breakdown of each document, in terms of how much it contains of topic 1, topic 2, etc.

SQL Server has a feature called file tables. Basically it's the ability to dump a bunch of files into a folder, you can then query the contents using semantic search.


Looking at some of the comments, it would be helpful if you took a step back and clearly defined the data, the data markers, and the relationship hypotheses you are looking for.

I've done some text file parsing and analysis with just C++ and Excel. You could possibly simplify the analytical process by clearly defining what you need from the text file.

Have you considered Google Bigquery? It's a managed data warehouse with a SQL-like query language. Easy to load in your data, run queries, then drop the database when you're done with it.

You might be interested in http://entopix.com, could be ideal.

I used a Win app called TextPipe. It has a large feature set to wrangle text but the visualization aspect may need another tool.

It's not clear what your objective is. I can have a 1kb text file and end up with a 1TB file after "analyzing" if I don't have a goal in mind.

apache solr or elasticsearch

a java application with SQL adapters and Apache Tika might work

Apache Spark

I would not use Apache Spark when the amount of data is this small. I have made this mistake myself when processing only a few gigabytes.

Take a look at this: http://aadrake.com/command-line-tools-can-be-235x-faster-tha...

IMHO I think people abuse Spark and I would be truly impressed if anybody could write a Spark program faster than just a regular Scala program for processing this.

grep, sed

If you want high performance and simple why not use flat files, bash, grep (maybe parallel), cut, awk, wc, uniq, etc. You can get very far with these and if you have a fast local disk you can get huge read rates. A few GB can be scanned in a matter of seconds. Awk can be used to write your queries. I don't understand what you are trying to do, but if it can be done with a SQL database and doesn't involve lots of joins then it can be done with a delimited text file. If you don't have a lot of disk space you can also work with gzipped files, zcat, zgrep, etc. I would not even consider distributed solutions or nosql until I had at least 100GB of data (more like 1TB of data). I would not consider any sort of SQL database unless I had a lot of complex joins.
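
The same "query a delimited, gzipped flat file" idea can be sketched in Python for anyone who would rather stay in one language than pick up awk. The tab delimiter, the header row, and the column names used in the usage note are assumptions for the example:

```python
import csv
import gzip

def filter_rows(path, column, value, delimiter="\t"):
    """Stream a gzipped, delimited text file and yield rows where column == value.

    Assumes a header row; `column` and `value` are whatever the file actually uses.
    """
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        for row in csv.DictReader(handle, delimiter=delimiter):
            if row[column] == value:
                yield row
```

For a hypothetical posts.tsv.gz with a "city" column, list(filter_rows("posts.tsv.gz", "city", "NYC")) plays the role of a simple WHERE clause. The file is decompressed as it is read, so only one row is in memory at a time.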

This is definitely the most straightforward way to go. It is almost certainly not the way he will go. In particular Awk is just great for this job and very easy to learn.

Does the NSA allow its employees to look for help from the general public like this? Seems odd for such a secretive organization to post publicly asking for help on how to parse our conversations.

Where do you see NSA in the question of the OP?
