
The Hidden Cost of Using Laptops for Data Science - gk1
https://blog.dominodatalab.com/cost-data-science-laptops/?r=2
======
et2o
Frankly, I disagree. I think the conclusions are domain-specific.

In my field (machine learning/informatics in a medical context), simple
models are very often better because they generalize to the next set of
patients, even if a more complex model fits the current test dataset better.
In some contexts, it takes a lot of hubris to think that your model really
describes reality and is free from the idiosyncrasies of your particular
dataset.

To generalize: complex models tend to overfit on the small datasets we
typically have. One example is the difficulty of transferring transcriptomic
models for risk prediction or other biological characteristics between
clinical trials; a model with a LOOCV AUC of 0.9 can be applied to a
different clinical trial and perform quite poorly. Deep learning certainly
has its place, but there is nothing inherently superior about a more complex
model.
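
To make that concrete, here's a toy sketch (everything below is fabricated
noise, not real trial data): selecting features on the full cohort before
running LOOCV, a common pitfall in small transcriptomic studies, yields an
inflated AUC, while an independent cohort drawn from the same noise sits at
chance.

    import numpy as np
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import LeaveOneOut, cross_val_predict
    from sklearn.pipeline import make_pipeline

    rng = np.random.default_rng(0)
    n, p = 40, 5000                            # small cohort, many "genes"
    X = rng.normal(size=(n, p))                # pure noise features
    y = rng.integers(0, 2, size=n)             # random labels

    # Pitfall: choose the 20 most "predictive" genes using ALL samples,
    # then cross-validate. The selection step has already seen the labels.
    X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
    probs = cross_val_predict(LogisticRegression(), X_leaky, y,
                              cv=LeaveOneOut(), method="predict_proba")[:, 1]
    print("optimistic LOOCV AUC:", roc_auc_score(y, probs))  # often ~0.9

    # The "next clinical trial": new noise from the same distribution.
    X_new, y_new = rng.normal(size=(n, p)), rng.integers(0, 2, size=n)
    model = make_pipeline(SelectKBest(f_classif, k=20), LogisticRegression())
    model.fit(X, y)
    print("external AUC:",
          roc_auc_score(y_new, model.predict_proba(X_new)[:, 1]))
    # External AUC hovers around 0.5, despite the glowing LOOCV number.

Keeping the feature selection inside the cross-validation loop (as the
pipeline does for the external check) removes most of the optimism.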

~~~
rubatuga
Could not agree with your statements more. Doing statistical modelling with a
small sample or too many independent variables leads to problems, which seems
similar to what you are describing. Although I know little about machine
learning, it would make sense that the same holds there. This is off topic,
but I'm an undergrad in bioinformatics and was wondering if you have any tips
on where new research may head. I will have to partner with a PI later this
fall and was wondering what kind of project I should pursue.

~~~
et2o
If you put your email in your profile, I will email you.

~~~
rubatuga
Awesome, I have updated my profile.

------
fphhotchips
Agree with the main thrust, but I think it misses a couple of additional key
costs: lack of reproducibility and lack of collaboration. Both of these are
actually covered by the product that Domino sells, too!

Essentially, work done on laptops often doesn't make it past the laptop.
Perhaps the modelling code makes it into Git, but the test data gets lost,
the process that led to it gets lost, and so on. Reproducing and potentially
modifying work that has already been done becomes laborious, so it gets left
behind or completely redone. That's to say nothing of all the differences in
environment and setup that can manifest between different laptops.
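
Even short of a full platform, capturing a per-run manifest goes a long way.
A minimal sketch (file names are hypothetical) that pins the exact test data,
code revision, and environment next to the results:

    import hashlib, json, subprocess, sys
    from pathlib import Path

    def sha256(path: Path) -> str:
        # Content hash of the dataset, so "which test data?" has an answer.
        return hashlib.sha256(path.read_bytes()).hexdigest()

    manifest = {
        "data_sha256": sha256(Path("data/test.csv")),       # hypothetical
        "git_commit": subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True).strip(),
        "environment": subprocess.check_output(
            [sys.executable, "-m", "pip", "freeze"], text=True).splitlines(),
    }
    Path("run_manifest.json").write_text(json.dumps(manifest, indent=2))

Commit the manifest alongside the results and the "it ran on my laptop once"
problem at least becomes diagnosable.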

~~~
dwhitena
Reproducibility and provenance are essential for data science workflows, and
it's important to maintain them at scale. Take a look at Pachyderm some time
if you get a chance (full disclosure: I work on the project). We version data
in addition to code for full reproducibility, even for distributed,
multi-stage, multi-language pipelines.

------
mark_l_watson
I agree with the article's main premise. I have a huge VPS on GCP that I can
quickly spin up, make a few runs on, and then halt. The cost is about
$0.60/hour.
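
For anyone curious, here's a minimal sketch of that spin-up/run/halt loop
using the google-cloud-compute Python client; the project, zone, and instance
names are placeholders.

    from google.cloud import compute_v1

    client = compute_v1.InstancesClient()
    PROJECT, ZONE, NAME = "my-project", "us-central1-a", "big-vps"

    # Boot the instance and block until the operation completes.
    client.start(project=PROJECT, zone=ZONE, instance=NAME).result()
    # ... make a few runs against the instance ...
    # Halt it again; vCPU/RAM billing stops (disks still bill).
    client.stop(project=PROJECT, zone=ZONE, instance=NAME).result()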

------
clircle
The post states that the GLM's accuracy is reported on a test set
(out-of-sample data), but I cannot tell whether the other models' accuracies
are reported on a test set or on the training set.

Also, the statement "Decreased model complexity leads to less accurate
models" isn't necessarily true.
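
A toy illustration of that last point (all numbers fabricated): on a small
noisy sample, a modest polynomial typically beats a far more flexible one out
of sample.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    rng = np.random.default_rng(1)
    x_train = rng.uniform(-3, 3, size=(30, 1))          # tiny sample
    y_train = np.sin(x_train).ravel() + rng.normal(scale=0.4, size=30)
    x_test = rng.uniform(-3, 3, size=(500, 1))
    y_test = np.sin(x_test).ravel() + rng.normal(scale=0.4, size=500)

    for degree in (3, 15):                              # simple vs. complex
        fit = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        fit.fit(x_train, y_train)
        mse = mean_squared_error(y_test, fit.predict(x_test))
        print(f"degree {degree}: test MSE {mse:.3f}")
    # The degree-3 model usually has the lower test error: less
    # complexity, *more* out-of-sample accuracy.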

------
setgree
Their interface has nice fonts. I could also see their work being useful in
statistics classes.

~~~
pvaldes
Nice, but the semitransparent and thin fonts are unreadable.

------
dkobran
100% agree with this article. The cloud is where heavy computational work
belongs, but it's not trivial to leverage these resources today given the
complexity of existing offerings. We built https://paperspace.com to make
this easy and affordable. Ping me if you want to test out the platform or if
you have any questions!

