
Simulated data: the great equalizer in the AI race? - cbrun
https://medium.com/@charlesbrun/is-simulated-data-the-great-equalizer-in-the-ai-race-9ed30f9076db
======
cfusting
The author misunderstands how simulated data is created by GANs, VAEs, and
other non-physics-based simulations. Say you have a dataset and would like to
create synthetic data from it with a GAN. You are then asking the GAN to
estimate the distribution D of the data. To do so, the GAN learns the joint
distribution P(X1, X2, ..., Xn) (where in the image case each X is usually a
pixel) such that one may sample from D and obtain a new, synthetic image.
Indeed, one will generate novel data, but the distribution D that was
estimated is at best merely a description of the original data, and in
practice a little bit (or a lot) off.
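
To make this concrete, here's a toy sketch, with a one-dimensional Gaussian fit standing in for the GAN (the estimator differs, the point doesn't): every synthetic sample is a function of statistics computed from the original data, nothing more.

```python
import random
import statistics

random.seed(0)

# "Real" data drawn from some unknown distribution D.
real = [random.gauss(5.0, 2.0) for _ in range(1000)]

# A generative model can only estimate D from the sample. Here a
# Gaussian fit stands in for the GAN's learned P(X1, ..., Xn).
mu = statistics.mean(real)
sigma = statistics.stdev(real)

# Sampling the fitted model yields novel points, but every one of them
# is derived from (mu, sigma) -- statistics of the original data.
synthetic = [random.gauss(mu, sigma) for _ in range(1000)]

print(round(mu, 2), round(sigma, 2))
```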

Now turn to the machine learning problem we sought to solve with the new
synthetic data: what is P(y|X1, X2, ..., Xn), where y is usually a class
like "bird"? In other words, given an image, predict its label. Since the
data was generated knowing only the statistics of the original data, it can
add no value beyond plausible examples derived from the original data itself.

Will this improve the accuracy of a model by providing additional edge case
examples and filling in gaps? Somewhat. Will it understand data not
represented by the original data and substitute for more thorough, diverse
datasets? Absolutely not.

In terms of model improvement, yes, synthetic data can help. In terms of the
arms race? No. True examples provide knowledge that is unique. If one uses a
physics engine (GTA is popular for self-driving cars) one can gather truly
novel data; this is not the case for GANs.

It's concerning how willing people are to write articles on this subject
without understanding the mathematics underlying the technology.

Do your homework and RTFM.

~~~
EsssM7QVMehFPAs
You are ignoring the fact that generative AI is not a closed-loop algorithm.
You can synthesize expected features in a data set and feed them to the
detector, outside the bounds of the generative neural network, which rather
serves the purpose of mapping into (a subset of) the proper input space.

The power of synthesis is not within the GAN or VAE, it is in the outside
mechanism that guides the creation of content with specific domain knowledge
about the feature space.
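
A minimal sketch of what I mean, with a trivial function standing in for the trained generator (names and numbers are made up): the steering logic lives entirely outside the network.

```python
import random

random.seed(1)

def generator(z):
    # Stand-in for a trained GAN/VAE decoder: maps a latent z to a
    # "sample" with one measurable feature (here just a number).
    return 10.0 * z

def domain_filter(sample):
    # Outside mechanism: domain knowledge about which region of the
    # feature space we want covered (e.g. an under-represented case).
    return sample > 8.0

# Guided synthesis: keep only samples the domain logic asks for.
augmented = []
while len(augmented) < 50:
    z = random.random()   # latent draw
    x = generator(z)      # generative mapping into input space
    if domain_filter(x):  # the steering happens outside the network
        augmented.append(x)

print(len(augmented), round(min(augmented), 2))
```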

This might not replace the value of real data, but it will let you accelerate
bootstrapping, improve coverage (at the cost of accuracy), or provide free
environments for auxiliary processes like CI/CD in many deep learning
applications.

There is a lot of published material on synthetic data augmentation if you
actually look for it.

~~~
missosoup
Everything you said doesn't dispute the above comment and agrees with its core
premise:

"In terms of model improvement, yes synthetic data can help. In terms of the
arms race? No. True examples provide knowledge that is unique. "

~~~
EsssM7QVMehFPAs
I was rather commenting on the first part, which implies that training a
neural network with the statistical distribution that comes out of a GAN or
VAE does not add value beyond that generative model's capabilities.

I do not agree with that because, as I explained, with domain knowledge it is
very much possible to shape the generated data for augmented learning, beyond
the plain statistical variations of GANs and similar, which are obviously of
very limited value in training.

------
throwawaymath
Information-theoretically speaking, how do you generate a "synthetic" dataset
(as the article calls it) with the same fidelity as an original dataset
without having access to a critical basis set of the original? What would you
do to obtain that fidelity? Extrapolate from sufficiently many independent
conclusions drawn from the original?

And as a followup, if you can generate a synthetic dataset by extrapolating
from sufficiently many independent conclusions drawn from the original (as
opposed to having access to the original itself), would you still need to use
such a dataset for training?

Things like Monte Carlo simulation can be used to _approximate_ real world
conditions, but they can't typically capture the full information density of
organic data. For example, generating a ton of artificial web traffic for
fraud analysis or incident response only captures a few dimensions of what
real world user traffic captures.

The author talks about simulating data to focus on edge cases or avoid
statistical bias, but I don't see how simulated data actually achieves that.

~~~
gilbaz
Cool points -

"...original dataset without having access to a critical basis set of the
original?"

I think they're not trying to copy existing datasets but are trying to
generate new datasets that solve various computer vision use cases. Looks like
they're using 3D photorealistic models and environments to generate 2D data.
It is a cool idea: if they had the ability to synthesize a large number of 3D
people and objects, insert them into 3D environments in ways that made sense,
and then run motion simulation, they could hypothetically create an incredible
amount of high-quality data. Sounds pretty hard to do, honestly...

I think Monte Carlo is used for something very different than computer vision
/ machine learning. Monte Carlo is usually used to estimate an average result
given many random inputs and a simplified model of the problem. So if I want
to estimate how far my paper airplane will fly and I have a simulator, I
would vary the paper thickness, folds, and wind. Each time I would run the
simulator and get a result, and then I can estimate the average distance the
paper airplane would go! (Actually sounds like a fun project, lol.) Anyway,
this is just different.
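
The airplane example in a few lines of code (the surrogate "simulator" and all of its numbers are invented for illustration):

```python
import random

random.seed(42)

def flight_distance(thickness, folds, wind):
    # Made-up surrogate for the simulator: distance in meters.
    return 8.0 / thickness + 0.5 * folds + wind

# Monte Carlo: randomly vary the inputs, run the "simulator" many
# times, and average the results to estimate the expected distance.
trials = [
    flight_distance(
        thickness=random.uniform(0.8, 1.2),  # paper thickness
        folds=random.randint(4, 8),          # number of folds
        wind=random.gauss(0.0, 1.0),         # wind effect
    )
    for _ in range(10_000)
]

estimate = sum(trials) / len(trials)
print(round(estimate, 2))
```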

Simulation is good for edge cases because you can simulate them
disproportionately to their prevalence in the real world. So let's say we're
in a smart store and we want to recognize when an elderly person falls on the
floor, to send human help to the correct location. This happens maybe once in
five years in a given store. If we were to gather data we might get 10
examples. If they can simulate this, they could simulate 100k elderly people
falling and then train models to recognize it! Kind of crazy, really.
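
A toy sketch of that rebalancing idea (all the frequencies here are invented): simulation lets the rare event dominate the training set even though it's vanishingly rare in collected data.

```python
import random

random.seed(7)

# Real-world collection: the rare event ("fall") shows up ~1 in 10,000
# frames, so even 100k collected frames yield only a handful.
real = [("fall" if random.random() < 1e-4 else "normal")
        for _ in range(100_000)]
n_real_falls = real.count("fall")

# Simulation lets us oversample the rare class far beyond its
# real-world prevalence -- e.g. 100k synthetic fall examples.
synthetic_falls = ["fall"] * 100_000

train = real + synthetic_falls
fall_fraction = train.count("fall") / len(train)
print(n_real_falls, round(fall_fraction, 2))
```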

~~~
deehouie
But then how do you simulate, or imagine, all the possible ways of falling and
all the possible places it could happen? You have one sample, that's all.
Ultimately, you have to use domain knowledge, but domain knowledge comes from
observed data. High fidelity comes from having a lot of data. This takes you
back to square one.

------
omarhaneef
This one time we wanted to partner with a medical research team at a
university. They had data on a particular disease and wanted to present it to
users visually.

Now this was for the public good and we were going to fund the technology to
display the data, and they would provide the data. This way people could
assess how much various drugs could help them and what the outcomes were.

It was also thought that other researchers could find patterns in the data.

Suddenly the not-for-profit institute got cold feet because they would be
"giving away" the data they had spent millions to acquire. Meanwhile we, a
for-profit institute, were happy to fund our share as a public good.

They decided that, instead of giving away their data, they would give away
simulated data. This, it was felt, would benefit the patients and
_researchers_ who might draw conclusions from the data.

Now these are PhDs at the top of their field. But, you know, it's sort of
obvious that all they would do is reproduce their biases and make it so that
no one else could challenge those biases. I mean, for you data science types,
this is 101.

Ever since that experience, I have a distrust of simulated data.

~~~
cbrun
Sounds like a pretty bad experience indeed. Surprised that was their
recommendation vs. completely anonymizing the data. You didn't share whether
you went ahead with it and saw any results; if so, it sounds like they were
not good. Either way, I don't think you can let one bad experience cast doubt
on a whole field. There are plenty of examples of medical research institutes
using synthetic data in combination with real patient data to improve their
neural nets. I'm no medical expert, but data augmentation or full simulation
works when it's used in the right context. Having said that, creating biased
algorithms that generate biased data is certainly a reality as well.

~~~
omarhaneef
We didn't go ahead with it because we kept having calls with different people
at the university, and they kept putting off the decision. This was years ago
and, as far as I know, their process may still be going on.

------
vonnik
There are a couple points not generally made in discussions of data and the
great AI race.

Most data is crap. So the mountains of data that are supposedly an advantage,
whether in China or in the vaults of the large corporations, are not fit for
purpose without a tremendous amount of pre-processing, and even then... That
means the real chokepoint is data science talent, not data quantity. In other
words, in many cases, the premises of this statement should be questioned.

Secondly, a lot of research is focused on few-shot, one-shot, or zero-shot
learning. That is, the AI industry will make this constraint increasingly
obsolete.

Thirdly, synthetic data is only as good as the assumptions you made while
creating it. But how did you reach those assumptions? By what argument should
they be treated as a source of truth? If you are making those assumptions
based on your wide experience with real-world data, well then, we run into
the same scarcity, mediated by the mind of the person creating the synthetic
data.

~~~
ggggtez
Exactly this. Amazing, a picture of a white man's arm holding a box of orange
juice. But what if the person is a woman, or dark-skinned, or has prosthetics,
or is a child, or it's a bag/bottle instead of a box, or the lighting is
different, or the camera is low-resolution, or someone is standing in the
way...

Anyone in this space is well aware that the benefit of big data isn't just the
_amount_ of data, but that it's a real, representative sample of the type of
data you are actually going to need to work with. Big data solves the problem
of people being bad at creating simulations. To suggest simulation as a
solution to big data gets the relationship backwards.

~~~
cbrun
Respectfully disagree. Again, I'm not suggesting SD will solve all problems.
Big data is critical and will remain so. However, using a combination of SD
and real data will make AI algorithms more robust than big data alone. I do
agree that the world is messy and it's hard to recreate the chaos and
weirdness of the world. However, to think that at some point we won't be able
to completely mimic the real world and all the variations out there is
strange. Re: your example, it's actually pretty easy to spawn millions of
humans varied by ethnicity, age, body mass, etc. It's just a matter of time
until this problem gets solved.

~~~
notahacker
> However, to think that at some point we won't be able to completely mimic
> the real world and all the variations out there is strange.

Why? I think it's strange to believe the opposite: that something as simple as
a computer program, designed by something as simple as the human mind, should
definitely be able to adequately simulate the complexity of the real world.

~~~
cbrun
Really? I think if we were to bring back our close ancestors (4-5 generations
away) they'd look at our world the way we see Harry Potter's: pure magic. I
mean, flying metal birds, fire that instantly turns on and off, machines that
move around like ghosts, small boxes that talk back, musicians on demand in a
box? I think you get my point: you're selling humanity short. There's no limit
to human ingenuity, and there's a reason Elon Musk suspects we're living in a
simulation.

~~~
notahacker
There's a massive difference between 'develop technologies which are
indistinguishable from magic to people who don't know how they work' and
'completely simulate the informational complexity of the world using human
minds and computers with comparatively limited information-processing
capabilities.'

I'm not sure the second is even a logical possibility, never mind a practical
one.

~~~
hackernews65
Dude, if you're referencing information theory, you clearly don't understand
how to create tech. Obviously they're not trying to simulate the universe
accurately, lol. Simulating the real world for specific use cases to solve
computer vision for practical applications is super interesting. If they have
customers, it makes sense.

------
kory
The answer is no, since when you generate data, you either:

* Have a set of data as a "basis":

      1. This diminishes the "equalization" factor, since you
      need a lot of data to get a good approximation of the
      distribution anyway.

      2. You need to create a model based off of that set, which
      mathematically should be close to the same problem as just
      building the target model.

* Have no (or only a small) training set to use:

      1. You need to create a generating model, probably based
      on some statistical distribution. Your target model will
      just learn that distribution.

      2. Your initial assumptions create a distribution, and
      that is not going to be the same distribution as the
      real-world data. Maybe painfully off-base. I've worked on
      this problem for months, and it's fairly difficult to get
      right even in an easy scenario (one modeled by simple
      statistical distributions).

There are problems where generating data can work, but they're specific
problems, or it can only be used for rare edge cases that don't show up often
enough in a dataset. For the most difficult problems it is probably just as
difficult to generate "correct" data as it is to build a model without
real-world data.
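
A quick sketch of point 2 in the small-training-set case, with a unimodal Gaussian as the (wrong) assumed distribution; the real process is a made-up two-mode mixture:

```python
import random
import statistics

random.seed(3)

# Real-world process: a two-mode mixture we don't actually know.
def real_sample():
    return random.gauss(0, 1) if random.random() < 0.8 else random.gauss(6, 1)

real = [real_sample() for _ in range(10_000)]

# Generator built from an assumption ("it's probably one Gaussian"),
# fitted from the small sample we do have.
small = real[:200]
mu, sigma = statistics.mean(small), statistics.stdev(small)
synthetic = [random.gauss(mu, sigma) for _ in range(10_000)]

# The assumed distribution dumps much of its mass into the trough
# between the two real modes: plausible-looking, painfully off-base.
def gap_fraction(xs):
    return sum(1 for x in xs if 2.0 < x < 4.0) / len(xs)

print(round(gap_fraction(real), 3), round(gap_fraction(synthetic), 3))
```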

~~~
gilbaz
Yeah, but you're assuming that you know how they're creating this data. Just
throwing an option out there: what if they created a latent space of 3D
people and iteratively expanded it with GANs and 2D real-image datasets? That
would generalize.

Just a thought, not sure what's really going on there, I just know that they
probably have something interesting they're cooking up!

This is a really crazy vision

~~~
kory
Training a GAN to generate people without a significantly large dataset (if
that's even possible) is probably just as difficult a problem as building the
model you want in the end without sufficient data.

Assuming those image sets are small, they will create a model with a large
bias. If you're talking about fine-tuning an existing model with small
datasets, this is done already and works fairly well if not overused.

It all comes down to: to create the first "data-generating" model you need a
lot of data and compute. Expanding it is a different story, but that isn't
where the problem lies. We come full-circle back to the same problem as what
we started with: big players can afford to build these models and small
players can't.

------
zwieback
I read this because it's an interesting question, but the article is just an
advertisement for the poster's company. I didn't really learn anything, and
halfway through I felt like a sucker.

------
ggggtez
Should you write an article that can be summed up by "no"?

>For a lot of tasks the performance works well, but for extreme precision it
will not fly — yet.

But we all knew that going in, didn't we?

~~~
cbrun
Disagree, otherwise I obviously wouldn't have written this piece. ;) Synthetic
data (SD) is not a silver bullet that will solve all problems, but it opens up
a lot of opportunities. I'm seeing cool startups using SD to accelerate their
R&D efforts and launch products in production in ways I didn't see 2 years
ago. Imho, the quality of SD is reaching a tipping point and the sim2real gap
is starting to disappear.

~~~
throwlaplace
This is not a personal attack, but I would just like to point out the insane
number of weasel words/marketing-speak phrases (i.e., intentionally imprecise
but rhetorically powerful terms) in your response:

>is not a silver bullet

>opens up a lot of opportunities

>accelerate their R&D efforts

>Imho

>reaching a tipping point

>gap

>starting to disappear.

This is what it reads like:

"we're getting to being able to approach the cusp of potentially honing in on
the vicinity of the realization of this technology in the future"

So your entire response doesn't commit to any strong claim at all but it sure
sounds like it does!

Incidentally I wonder if you're aware you're doing this or if it comes,
subconsciously, from reading lots of writing that's like this.

------
hansdieter1337
If you use computer-generated images to train your networks, you get a nice
network for computer-generated images. It might be good for pre-training a net
(instead of random weights), but you still need labeled real-world images.
E.g., a guy once trained a self-driving car model in GTA 5. I'm sure that
algorithm won't do great in the real world. But I already see an industry
forming, promising to get rid of all labeled data. And there will be idiots
believing it. That's how a lot of Silicon Valley startups work: unicorns on
PowerPoint slides and an empty basket in reality. (src: living and working in
the valley)

~~~
cbrun
I read about the guy who used GTA to train a neural net. I think he was trying
to make the point that, although obviously imperfect, using simulated data
could actually work. I'm not saying simulated data (SD) should be the be-all
and end-all for training neural nets, but we're seeing algorithms perform
better when they're trained on a combination of real labelled data and SD
rather than real data alone. I hear your point, though, about hype cycles and
the tunnel vision SV can often fall into.

------
vardump
Just had a thought: is this part of the reason we sleep? Simulation of data to
train our neural networks.

~~~
EsssM7QVMehFPAs
Highly interesting thought! Based on psychology's understanding of dreams, I
would guess that sleep is actually more of a replay of real experience to
stabilize our neural network, though most probably augmented with synthetic
variations of the experienced data.

------
jeromebaek
Simulated data is provably more vulnerable to adversarial attacks. It only
gives the illusion of more data (quite literally), and should not be used for
mission-critical debiasing. A NN trained on mostly simulated minority faces
and mostly real non-minority faces is a nightmare, the worst of both worlds: a
plausible deniability that the algorithm is unbiased, and extreme
susceptibility to adversarial attacks that take advantage of this illusory
unbiasedness.

------
natoucs
I think some people here are confused because they imagine financial/customer
synthetic data, where the pattern to simulate is unclear, instead of computer
vision, where the pattern to replicate is obvious because we see it before our
eyes. This company seems to be focused on specific use cases of computer
vision synthetic data, so it makes sense imo.

------
gilbaz
I would definitely say that this is extremely hard to accomplish on the one
hand, and on the other, if it works, it would be a game changer!

A good simulation is the holy grail of AI. It solves the data bottleneck,
provided the generated data generalizes to the real world. Let's see them
prove that!

------
overlords
Domain randomization works well for robotics tasks.
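
Roughly, the recipe looks like this (the parameter names are made up, not from any particular framework): every training episode gets a randomly perturbed simulator, so the policy learns features that transfer to the unseen real parameters.

```python
import random

random.seed(0)

def randomized_sim_config():
    # Domain randomization: perturb visuals and physics each episode,
    # so the real world ends up looking like just one more domain.
    return {
        "light_intensity": random.uniform(0.2, 2.0),
        "texture_id": random.randrange(1000),
        "camera_jitter_deg": random.uniform(-5.0, 5.0),
        "friction": random.uniform(0.5, 1.5),
        "object_mass_kg": random.uniform(0.1, 1.0),
    }

# One fresh domain per training episode.
episodes = [randomized_sim_config() for _ in range(1000)]
print(len(episodes))
```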

~~~
deehouie
references?

~~~
yshcht
[https://lilianweng.github.io/lil-log/2019/05/05/domain-randomization.html](https://lilianweng.github.io/lil-log/2019/05/05/domain-randomization.html)

~~~
deehouie
Good and convincing example. Thanks for the robotics perspective.

------
fooker
This sounds very wrong.

If you train a model using simulated data, the result you can obtain will be a
slightly worse simulation.

------
antoinea
Really interesting read, thanks!

------
sjg007
Sure if you have a generative model and you can add in noise then why not.

