
Ask HN: Synthetic data for machine learning - simsim
I just started working on a machine-learning project involving interaction with a home appliance; think of it as something like a Nest Thermostat. We don't have data to work with, so I'm developing a model of human behavior to generate synthetic data and bootstrap the process. I'm now wondering how many of my fellow data scientists would use a service that generates synthetic data sets with well-known statistical properties or a well-known underlying physical process, to test, calibrate, and validate models. Would you use it, or do you always prefer to cook it up yourself?
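
For concreteness, here's a minimal sketch of the kind of generator I have in mind. It's not our actual behavior model; the occupancy pattern, setpoints, and override probability below are all illustrative assumptions:

    import numpy as np

    # Toy human-behavior model for thermostat interactions over one day.
    rng = np.random.default_rng(seed=42)
    hours = np.arange(24)

    # Assumed occupancy: home overnight and in the evening, away during work hours.
    occupied = ((hours < 8) | (hours >= 18)).astype(float)

    # Preferred setpoint (deg C): cooler at night, with some person-to-person noise.
    base_setpoint = np.where((hours >= 22) | (hours < 6), 18.0, 21.0)
    setpoint = base_setpoint + rng.normal(0.0, 0.5, size=24)

    # Manual overrides: occupants occasionally nudge the thermostat when home.
    overrides = rng.random(24) < 0.15 * occupied
    setpoint[overrides] += rng.choice([-1.5, 1.5], size=overrides.sum())

    for h, s, o in zip(hours, setpoint, overrides):
        print(f"{h:02d}:00  setpoint={s:5.1f}C  override={o}")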
======
GFK_of_xmaspast
My initial take is that if you're generating synthetic data to train
something, then either you'll need to spend some time making sure your fake
data has the critical properties of the real data, or you'll be doing
something really simple and easy to code up.

Either way, it's not really something to outsource. If there's a lot of work
necessary to understand the data, you need to do it yourself rather than
offload it to some remote service (or worse, expose your data/metadata to a
third party). If there's not much work necessary, just do it yourself; it'll
be faster (there's a reason nobody's putting out things like
[A Million Random Digits with 100,000 Normal Deviates](https://en.wikipedia.org/wiki/A_Million_Random_Digits_with_100,000_Normal_Deviates)
anymore).
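
To make that parenthetical concrete: what once filled a RAND book is now a one-liner with any modern numerics library, e.g. (assuming NumPy):

    import numpy as np

    # 100,000 standard normal deviates, generated on demand.
    deviates = np.random.default_rng().normal(loc=0.0, scale=1.0, size=100_000)
    print(deviates[:5])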

~~~
simsim
Good point, thanks!

