What social science? You shouldn't be generating the text in advance and then pr...

fole · on June 18, 2015

I worked for an NLP research think tank for a while and we always created text files as intermediate steps to each part of our system. It was basically a cache of each step, and you could restart the system at whatever step did work.

Hard drive space is cheap. Use as much as you want.

lsiebert · on June 18, 2015

Making it clear that he doesn't need to buy equipment is a good thing. I agree with you that logging results as you go is worthwhile, but for data munging, I think it's better to keep your data in it's original source, and document how you get your data into the system in code, and not require somebody reproducing your results to have a huge HD or buy something.

As an aside, I was also a social scientist originally. My first degree was in Psychology. The first time I felt like a programmer was taking supplied R code that would have taken 8+ days to finish (2400 Rausch scores at 5 minutes each), and got the whole thing to run in less than a minute by moving from sequential search of every possibility to a probing strategy to find the score that best fit the curve. Learning how to be more efficient in your code, to use less space, or time through a better algorithm to handle your data, is both useful in it's own right, and intellectually rewarding.

abannin · on June 18, 2015

R is going to give you some headaches as it relies heavily on the local machine's memory. Using RStudio on a beefed up AWS instance might help make calculation time a bit more palatable.

pvaldes · on June 19, 2015

I totally agree, but If you want to use R consider also this:

install.packages("LaF")

Package for fast access to Large ASCII Files. Can be used with big files that don't fit in the available memory.

wodenokoto · on June 18, 2015

> You shouldn't be generating the text in advance and then processing it.

That depend on how long the preprocessing take and how often you need to manipulate the processed text.