
Synthetic Data for Deep Learning (2019) - arseny-n
https://arxiv.org/abs/1909.11512
======
x86ARMsRace
Interesting for sure. I actually contributed to developing tools for making
synthetic data recently. We developed this Python module:
[https://github.com/artemis-analytics/dolos/tree/master](https://github.com/artemis-analytics/dolos/tree/master).
We hit a snag in our initial research: scaled synthetic data generators are
generally not open-source. We put this together to fix that.

It's pretty cool: both branches let you create custom generators. One approach
we tinkered with was using T-Digests to profile large datasets and produce
synthetic data, rather than just fake data. Basically, we're working on
something that takes in real data and spits out data with the same statistical
properties. One thing you can use this for is stripping out confidential data
(in theory at least). Another use case is expanding dataset sizes.
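The core idea (profile a distribution with a compact sketch, then sample from the profile) can be sketched in a few lines. This is not dolos's actual API; the function name is made up, and plain NumPy quantiles stand in for the streaming T-Digest sketch:

```python
import numpy as np

def profile_and_synthesize(real, n_samples, n_quantiles=100, rng=None):
    """Profile a numeric column by its empirical quantiles, then draw
    synthetic values by inverse-transform sampling from that profile."""
    rng = rng or np.random.default_rng(0)
    # Profiling step: summarize the distribution as a quantile sketch.
    # (A T-Digest does this in one streaming pass over huge data;
    # np.quantile is an in-memory stand-in for illustration.)
    probs = np.linspace(0, 1, n_quantiles)
    sketch = np.quantile(real, probs)
    # Synthesis step: draw uniform probabilities and map them through the
    # sketch, so the output matches the profiled distribution without
    # reproducing any individual real record.
    u = rng.uniform(0, 1, n_samples)
    return np.interp(u, probs, sketch)

real = np.random.default_rng(42).lognormal(mean=1.0, sigma=0.5, size=10_000)
fake = profile_and_synthesize(real, 10_000)
# The synthetic column has (approximately) the same quantiles as the
# real one, e.g. its median lands close to the real median.
```

The same pattern applies per column; capturing cross-column dependence needs more machinery than this sketch shows.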

~~~
Random_ernest
Having worked extensively on synthetic data, what's your "verdict" on the
topic? People seem to be very divided about it.

~~~
x86ARMsRace
I think it definitely has its uses. Is it an effective drop-in replacement for
sensitive data in all scenarios? I never really got that impression. My
biggest takeaway was that it is excellent for development and early
refinement.

Having access to synthetic data like this would let you give lower-level
analysts, or people with lower clearances, data similar to the confidential
data you're working with. That's a good way to reduce costs while also
developing skills that would otherwise have been difficult to build without
granting access to the data itself.

What I found is that it's a valuable tool for bringing something up to a state
where it can be applied to real data and refined further. So, long story
short: I think it certainly has uses, but those uses eventually lead back to
real data.

------
Erlich_Bachman
In some sense, one could think of human dreams as synthetic data generation.
Surely looks a lot like one part of the brain (one network) generates samples
to train some other networks in the brain. Perhaps to consolidate the
knowledge across all systems, or transfer it from one high-level, flexible
slow network to another - a fast, responsive, dumb but highly customized one,
both to be used in different situations or for different aspects of life.

~~~
Fishysoup
As a neuroscientist: sleep + dreaming seem to be a lot more complex than that
(and nobody really knows what either of them does). But in terms of learning,
and bridges to regression (DL) and other machine learning, my suspicion would
be that it helps with generalization, relating experiences to one another, and
consolidation.

~~~
ricksharp
I’ve often wondered if nightmares serve to prepare us to act quickly in
worst-case scenarios: our mind responding to stress by saying, OK, if this
goes really bad, what are you gonna do about it?

------
erichocean
We're doing synthetic data generation to predict complex 3D character
deformations—basically, a full anatomical bone/muscle/fat/skin simulation is
computed in thousands of poses (each pose is determined by comparatively
minuscule numbers of inputs), and then training on that data set so we can
predict high-quality deformations in real-time from live mo-cap data.

Photo-real 3D worlds are particularly appropriate for generating high-quality
synthetic ML training data sets—I know a bunch of autonomous driving companies
are doing it with great success. (We also use Houdini to generate our 3D data
sets.)[0]

[0]
[https://www.youtube.com/watch?v=GKb8ZL3bUbw](https://www.youtube.com/watch?v=GKb8ZL3bUbw)

~~~
thruflo22
Very cool!

We’re using GANs to generate synthetic transactional data that preserves
temporal and causal correlations [0].

[0] friends link to avoid paywall: [https://medium.com/towards-artificial-
intelligence/generatin...](https://medium.com/towards-artificial-
intelligence/generating-synthetic-sequential-data-using-
gans-a1d67a7752ac?source=friends_link&sk=1ab69626e19b95556cc3b2db83e64cf9)

------
citilife
I've actually done work on synthetic data development --

[https://medium.com/capital-one-tech/why-you-dont-
necessarily...](https://medium.com/capital-one-tech/why-you-dont-necessarily-
need-data-for-data-science-48d7bf503074)

Generally, we use it to avoid touching 'real' data. Accuracy is usually the
same with either synthetic or real data, though there are edge cases where one
fails, e.g. because of how synthetic data suppresses outliers.
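The outlier-suppression edge case is easy to demonstrate with a toy example (this is an illustration of the general effect, not the method from the linked article): a naive synthesizer that fits a single parametric distribution smooths away rare extreme values.

```python
import numpy as np

rng = np.random.default_rng(0)
# "Real" data: mostly standard Gaussian, plus a tiny cluster of
# extreme outliers near 50 (e.g. rare fraud-like events).
real = np.concatenate([rng.normal(0, 1, 9_990), rng.normal(50, 1, 10)])

# A naive synthesizer fits one Gaussian to the whole column and samples.
synthetic = rng.normal(real.mean(), real.std(), 10_000)

# Bulk statistics roughly match, but the outlier cluster vanishes: the
# fitted Gaussian almost never produces values anywhere near 50, so any
# model trained on the synthetic column never sees that regime.
print(real.max())       # well above 45
print(synthetic.max())  # far below the outlier cluster
```

A model trained downstream on `synthetic` would have no chance to learn the outlier regime, which is exactly the kind of edge case where real data still wins.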

------
vladoh
It seems that we still don't have a real breakthrough in training a machine
learning algorithm on synthetic camera images alone and using it in real
applications. This has already been done successfully for depth images by
Microsoft for the Kinect [1], but I haven't seen something like that for
normal images. The GTA V dataset is close, but not the real thing... [2]

It is difficult to create high-quality, photo-realistic images at large scale
with enough variance. It would be interesting to see if one can train a
network that transfers images (both synthetic and real) to some intermediate
representation and then train some detector/classifier/semantic segmentation
on it...

I had a lot of fun playing around with an open-source game engine called
VDrift to generate ground truth for optical flow, depth and semantic
segmentation. I think the video with the ground truth is nice [3], but the
game's graphics weren't that good. All the code is open-sourced on GitHub if
somebody feels like playing around... [4]

[1] [https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/main-39.pdf](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/main-39.pdf)

[2] [http://vladlen.info/papers/playing-for-data.pdf](http://vladlen.info/papers/playing-for-data.pdf)

[3] [https://vimeo.com/haltakov/synthetic-dataset](https://vimeo.com/haltakov/synthetic-dataset)

[4] [https://github.com/haltakov/synthetic-dataset](https://github.com/haltakov/synthetic-dataset)

~~~
w_t_payne
The GTA V dataset doesn't really have enough modes of variation, in my
opinion.

To my knowledge, the best public attempt at this is represented by the OpenAI
Rubik's Cube / Shadow Robotics dexterous hand demo.
[https://openai.com/blog/solving-rubiks-
cube/](https://openai.com/blog/solving-rubiks-cube/)
[https://arxiv.org/abs/1910.07113](https://arxiv.org/abs/1910.07113)

NVIDIA are also doing some interesting work in this area, but again, I'm not
really sure they put enough different modes of variation into it.
[https://research.nvidia.com/publication/2018-04_Training-
Dee...](https://research.nvidia.com/publication/2018-04_Training-Deep-
Networks).

CVEDIA also get really impressive results using similar techniques:-
[https://www.cvedia.com/](https://www.cvedia.com/)

------
amcoastal
Synthetic data is huge right now! I'm currently doing work on synthetic
generation with photo-realistic physical models to do inversion problems with
CNNs (particularly in the surf zone and nearshore area; paper coming next week
;) ), and eventually domain adaptation. If you want to get your project
funded, make sure one of your keywords is "synthetic data"!

~~~
seron
I'm also interested in potentially looking at inversion problems using CNNs in
coastal regions. Is there a way I could reach out to you and chat some time?

~~~
amcoastal
Sure! I've had a couple of conference papers published on it, but I'm always
doing work to improve what we're doing. I added my email to my profile, with a
slight typo to avoid spam ;)

------
bordercases
"Anyone who considers arithmetical methods of producing random digits is, of
course, in a state of sin." - John von Neumann

One might argue that the point of synthetic data _is_ to produce pseudo-random
samples, but they are only ever going to reflect the biases of the interpreter
and so the critique will still be worth holding for its cautionary
significance.

~~~
w_t_payne
The point is to be able to _control_ the correlations that you introduce - so
you can therefore control what your learning algorithm learns.

~~~
bordercases
Nice in industrial settings, poor in scientific ones.

------
jhonatan08
I'm not familiar with the topic. How do you measure the quality of synthetic
data? That is, how close are the synthetic samples to the real ones? Moreover,
can you control this quality while generating synthetic samples?

~~~
thruflo22
You have similarity measures like the mutual information score, and more
generally you compare correlations and distributions.

You can also A/B test for specific use cases, for example train a model on the
real and the synthetic data and compare relevant metrics.

You can see some of these illustrated on eg:
[https://hazy.com/blog/2020/03/23/synthetic-scooter-
journeys](https://hazy.com/blog/2020/03/23/synthetic-scooter-journeys)
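A minimal sketch of the distribution/correlation comparison idea (the function name and thresholds here are made up for illustration; in practice you'd add mutual information scores, e.g. sklearn's `mutual_info_score`, and the train-on-synthetic A/B test):

```python
import numpy as np

def fidelity_report(real, synth):
    """Compare real vs. synthetic tabular data on two simple measures:
    per-column marginal distributions and the correlation structure."""
    report = {}
    # Marginal check: largest gap between empirical quantiles per column.
    q = np.linspace(0.05, 0.95, 19)
    for i in range(real.shape[1]):
        report[f"quantile_gap_col{i}"] = float(
            np.abs(np.quantile(real[:, i], q) - np.quantile(synth[:, i], q)).max()
        )
    # Dependence check: largest entry-wise gap between correlation matrices.
    report["corr_gap"] = float(
        np.abs(np.corrcoef(real.T) - np.corrcoef(synth.T)).max()
    )
    return report

rng = np.random.default_rng(0)
cov = [[1.0, 0.8], [0.8, 1.0]]
real = rng.multivariate_normal([0, 0], cov, 5_000)
good = rng.multivariate_normal([0, 0], cov, 5_000)  # same generating process
bad = rng.normal(0, 1, (5_000, 2))                  # independent columns
print(fidelity_report(real, good))  # small gaps everywhere
print(fidelity_report(real, bad))   # corr_gap near 0.8: dependence lost
```

The second synthesizer matches every marginal perfectly yet fails the correlation check, which is why you need both kinds of measure.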

~~~
w_t_payne
There's also some really interesting research from last year on the local
dimensionality of a manifold that I really, really want to try out.

~~~
tsbertalan
You have a citation on that for me to read?

------
jakublangr
Typically we find that, without further adjustments, there tends to be a
substantial domain gap between the synthetic data and the real-world data.

At Creation Labs we're building something that helps narrow or completely
eliminate the gap between synthetic and real data, and we'll have some
exciting things to show on our website (creationlabs.ai) in the next few
weeks.

------
sutor
Submitted a lit review for this recently, such an amazing topic that could
have huge implications for learning algorithms.

------
w_t_payne
For machine vision problems, synthetic data with domain randomisation
_definitely_ works, provided that you are able to generate a large enough
number of different modes of variation.

In fact, it works so well it feels like cheating.
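The pattern itself is simple to sketch. Every parameter name below is made up for illustration; the point is only the shape of the technique: render each training image under aggressively randomised nuisance factors, so the network is forced to ignore them and latch onto the variation you control.

```python
import random

def sample_scene_params(rng):
    """Draw one randomised scene configuration for the renderer.
    All names and ranges here are illustrative, not from any real tool."""
    return {
        "light_intensity": rng.uniform(0.2, 3.0),   # wildly varied lighting
        "light_azimuth_deg": rng.uniform(0, 360),
        "camera_fov_deg": rng.uniform(40, 90),
        "texture_id": rng.randrange(1000),          # random distractor textures
        "object_yaw_deg": rng.uniform(0, 360),      # pose variation
        "noise_sigma": rng.uniform(0.0, 0.05),      # simulated sensor noise
    }

rng = random.Random(0)
batch = [sample_scene_params(rng) for _ in range(4)]
# Each dict would be handed to the renderer (Houdini, a game engine, ...)
# together with the ground-truth labels the scene was built from.
```

The more independent modes of variation you randomise, the less the network can rely on any of them, which is exactly the "enough modes of variation" condition above.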

------
larme
Synthetic data always reminds me of the practice of feeding cattle the meat
and bone meal of other cattle. Will this cause our generation's mad cow
disease?

~~~
SirLuxuryYacht
Maybe, since garbage in is garbage out. But if there's no grass in the first
place, what are you going to feed your cattle?

------
joshgel
I'm working on this, generating synthetic EMR data for researchers to
potentially use to study various interventions for Covid-19. Early stages...

------
Sybth
I have not been able to create synthetic data for 3D bounding boxes. Is there
something available that can help?

------
ponker
Why does the process of synthesizing data and then training a model on it work
well, when hard-coding an algorithm never does?

~~~
w_t_payne
In my mind it's because you have a more direct route to the underlying physics
of the problem than either hand-coding features or trying to learn them from
real data (at least when that real data is as poorly managed and controlled as
it usually is).

Our ML algorithms are _really_ good at finding correlations -- but we don't
necessarily know if the correlations in our data are actually the ones we
_want_ our system to learn. When we're using synthetic data, we have many more
levers at our disposal to ensure that this is the case.
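One of those levers in concrete form: with synthetic data you can fix the correlation structure by construction, something you can rarely certify about scraped real-world data. A toy numpy sketch (the 0.9 target is arbitrary, chosen just for illustration):

```python
import numpy as np

# Decide exactly which correlations exist in the training data: features
# 0 and 1 correlate at 0.9 by design, feature 2 is guaranteed independent.
target_corr = np.array([
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

rng = np.random.default_rng(0)
z = rng.standard_normal((100_000, 3))          # independent Gaussian noise
data = z @ np.linalg.cholesky(target_corr).T   # impose the chosen structure

# np.corrcoef(data.T) recovers target_corr to within sampling noise, so
# you know precisely which correlations the learner can and cannot exploit.
```

With real data you can only measure the correlations after the fact; here you choose them, so a probe showing the model uses feature 2 tells you something is wrong.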

------
abriosi
I live in a higher dimension and my mood always collapses

------
person_of_color
How does this work in an information theoretic sense?

------
ultimaterr
very cool

