Today I'm working on a 1 GB dataset, which fits in memory. But that is not enough. If a variable is a category/factor, you need to introduce dummy variables, and your dataset starts gaining weight. Next: do you want to apply an ML algorithm in parallel? Oops, you need more memory. Done that? Now please use the test dataset for prediction. My point is that "data in memory" is just the beginning...
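To put a rough number on the dummy-variable step, here is a back-of-the-envelope sketch in Python. The row count, the number of levels, and the byte sizes are all hypothetical assumptions, not figures from my dataset: it assumes the factor is stored as 4-byte integer codes and that each dummy becomes a dense 8-byte float column, with one level dropped as the baseline.

```python
def dummy_expansion_bytes(n_rows, n_levels, code_bytes=4, dummy_bytes=8):
    """Return (bytes as integer codes, bytes as dense dummy columns)
    for a single categorical column. All sizes are assumptions."""
    as_codes = n_rows * code_bytes
    # One level serves as the baseline, so n_levels - 1 indicator columns.
    as_dummies = n_rows * (n_levels - 1) * dummy_bytes
    return as_codes, as_dummies


# Hypothetical example: 10 million rows, one factor with 50 levels.
codes, dummies = dummy_expansion_bytes(n_rows=10_000_000, n_levels=50)
print(f"integer codes: {codes / 1e6:.0f} MB")
print(f"dummy columns: {dummies / 1e6:.0f} MB")
print(f"growth factor: {dummies / codes:.0f}x")
```

A sparse matrix representation would shrink this a lot, but many modelling routines expand to a dense matrix internally, which is where the memory goes.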

