Synthetic Data for Deep Learning (2019) (arxiv.org)
95 points by arseny-n 33 days ago | 48 comments



Interesting for sure. I actually contributed to developing tools for making synthetic data recently; this Python module => https://github.com/artemis-analytics/dolos/tree/master is what we developed. We hit a snag in our initial research: scaled synthetic data generators are generally not open source. We put this together to fix that.

It's pretty cool, because both branches let you create custom generators. One we tinkered with used T-Digests to profile large datasets and produce synthetic data, rather than just fake data. Basically, we're working on something that takes in data and spits out something with identical statistical information inside. One thing you can use this for is stripping out confidential data (in theory at least). Another use case is expanding dataset sizes.
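The profile-then-sample idea can be sketched roughly like this. This is a toy stand-in only: it uses stdlib `statistics.quantiles` instead of a real streaming T-Digest, and the income numbers are made up.

```python
import random
import statistics

def profile(column, bins=100):
    """Summarize a numeric column by its quantile cut points
    (a crude stand-in for a streaming T-Digest)."""
    return statistics.quantiles(column, n=bins)

def sample(quantile_profile, n):
    """Draw synthetic values: pick a random inter-quantile bin
    (each holds ~equal probability mass) and interpolate
    uniformly within it."""
    out = []
    for _ in range(n):
        i = random.randrange(len(quantile_profile) - 1)
        lo, hi = quantile_profile[i], quantile_profile[i + 1]
        out.append(random.uniform(lo, hi))
    return out

# Made-up "confidential" income column to profile.
real = [random.gauss(50_000, 12_000) for _ in range(10_000)]
prof = profile(real)
synthetic = sample(prof, 10_000)
```

The synthetic column never contains a real record, but its distribution tracks the profile, which is the property you want for handing data to lower-clearance analysts.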


Having worked extensively on synthetic data, what's your "verdict" on the topic? People seem to be very divided about it.


I think it definitely has its uses. Is it an effective drop-in replacement for sensitive data in all scenarios? I never really got that impression. My biggest takeaway was that it is excellent for development and early refinement.

Having access to synthetic data like this would let you give lower-level analysts, or people with lower clearances, data similar to the confidential data you're working with. That's a good way to reduce costs while building skills that would otherwise be difficult to develop without access to the data itself.

What I found is that it's a valuable tool for bringing something up to a state where it can be applied to real data and refined further. So, long story short: I think it certainly has uses, but those uses eventually lead back to real data.


When would it be inappropriate to use synthetic data to increase your study's statistical power?


That's not something I'm really equipped to comment on all that much. My role was mostly software developer, less data scientist. I can talk about how to improve the statistical power of the data you are using, though, since there are a few ways to make synthetic data. Maybe something in here can answer your question.

Dummy data, like what the Faker packages (available in various languages) produce, has little utility beyond testing systems and developing prototypes against a schema similar to the real thing. It can be a starting point for making synthetic data, though, and that's what we did.

Getting into synthetic data proper, there's sequential and non-sequential synthetic data. Sequential generation produces a single datum, such as age, then uses it as the starting point to produce the rest. For instance:

age = 29 => income for age bracket (20 < x < 30) drawn from distribution [30, 70]; then for incomes in that distribution... etc.

Here you get genuinely high-utility data and can use it to build basic models that you then apply to the real data. Non-sequential generation, on the other hand, creates each field independently according to its own rule, but ignores the interdependence between the rules. For example, where a sequential dataset may contain less than 1% retirees among people aged 20 to 30, a non-sequential dataset may draw retirement status from the group average, leading to a skewed number of retired 20-to-30-year-olds.
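To make the difference concrete, here's a toy sketch of both styles. All the brackets and rates below are invented for illustration, not taken from any real dataset.

```python
import random

def sequential_record():
    """Each field conditions on the fields drawn before it."""
    age = random.randint(20, 70)
    # Income depends on the age already drawn (made-up brackets).
    if age < 30:
        income = random.randint(30_000, 70_000)
    elif age < 60:
        income = random.randint(50_000, 120_000)
    else:
        income = random.randint(20_000, 60_000)
    # Retirement depends on age, preserving the interdependence.
    retired = age >= 60 and random.random() < 0.7
    return {"age": age, "income": income, "retired": retired}

def non_sequential_record():
    """Each field is drawn independently from its marginal
    distribution. Cross-field structure is lost, so you get
    25-year-old retirees at the population-average rate."""
    return {
        "age": random.randint(20, 70),
        "income": random.randint(20_000, 120_000),
        "retired": random.random() < 0.15,  # group-average rate
    }
```

Run a few thousand of each and the skew the parent comment describes shows up immediately: the sequential generator produces essentially no retirees under 60, while the non-sequential one retires young people at the overall average rate.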


In some sense, one could think of human dreams as synthetic data generation. It sure looks like one part of the brain (one network) generates samples to train other networks in the brain. Perhaps to consolidate knowledge across all systems, or to transfer it from one high-level, flexible, slow network to another: a fast, responsive, dumb but highly customized one, each to be used in different situations or for different aspects of life.


As a neuroscientist, my sense is that sleep + dreaming are a lot more complex than that (and nobody really knows what either of them does). But in terms of learning, and bridges to regression (DL) and other machine learning, my suspicion would be that it helps with generalization, relating experiences to one another, and consolidation.


I’ve often wondered if nightmares serve to prepare us to act quickly in worst-case type scenarios. Our mind responding to stress by saying: ok, if this goes really bad, what are you gonna do about it?


Never thought of it this way; great perspective.


We're doing synthetic data generation to predict complex 3D character deformations—basically, a full anatomical bone/muscle/fat/skin simulation is computed in thousands of poses (each pose is determined by comparatively minuscule numbers of inputs), and then training on that data set so we can predict high-quality deformations in real-time from live mo-cap data.

Photo-real 3D worlds are particularly appropriate for generating high-quality synthetic ML training data sets—I know a bunch of autonomous driving companies are doing it with great success. (We also use Houdini to generate our 3D data sets.)[0]

[0] https://www.youtube.com/watch?v=GKb8ZL3bUbw


Very cool!

We’re using GANs to generate synthetic transactional data that preserves temporal and causal correlations [0].

[0] friends link to avoid paywall: https://medium.com/towards-artificial-intelligence/generatin...


I've actually done work on synthetic data development --

https://medium.com/capital-one-tech/why-you-dont-necessarily...

Generally, we use it to avoid utilizing 'real' data. Accuracy is usually the same whether you use synthetic or real data. There are edge cases where one fails due to particular issues, e.g. how synthetic data suppresses outliers.


It seems that we still don't have a real breakthrough training a machine learning algorithm on synthetic camera images alone and using it in real applications. This is already done successfully for depth images by Microsoft for the Kinect [1], but I haven't seen something like that for normal images. The GTA V dataset is close, but not the real thing... [2]

It is difficult to create high-quality, photo-realistic images at scale with enough variance. It would be interesting to see if one can train a network that transfers images (both synthetic and real) into some intermediate representation and then train a detector/classifier/semantic segmentation on it...

I had a lot of fun playing around with an open-source game engine called VDrift to generate ground truth for optical flow, depth, and semantic segmentation. I think the video with the ground truth is nice [3], but the game's graphics weren't that good. All the code is open-sourced on GitHub if somebody feels like playing around... [4]

[1] https://www.microsoft.com/en-us/research/wp-content/uploads/... [2] http://vladlen.info/papers/playing-for-data.pdf [3] https://vimeo.com/haltakov/synthetic-dataset [4] https://github.com/haltakov/synthetic-dataset


The GTA V dataset doesn't really have enough modes of variation, in my opinion.

To my knowledge, the best public attempt at this is represented by the OpenAI Rubik's Cube / Shadow Robotics dexterous hand demo. https://openai.com/blog/solving-rubiks-cube/ https://arxiv.org/abs/1910.07113

NVIDIA are also doing some interesting work in this area, but again, I'm not really sure they put enough different modes of variation into it. https://research.nvidia.com/publication/2018-04_Training-Dee....

CVEDIA also get really impressive results using similar techniques: https://www.cvedia.com/


Synthetic data is huge right now! I'm currently working on synthetic generation with photo-realistic physical models to do inversion problems with CNNs (particularly in the surf zone and nearshore area; paper coming next week ;) ), and eventually domain adaptation. If you want to get your project funded, make sure one of your keywords is "synthetic data"!


I'm also interested in potentially looking at inversion problems using CNNs in coastal regions. Is there a way I could reach out to you and chat some time?


Sure! I've had a couple of conference papers published on it, but I'm always working to improve what we're doing. I added my email to my profile, with a slight typo to avoid spam ;)


I'm very interested in synthetic imagery in the maritime domain and would love to talk to you, if at all possible.


Sure! I've had a couple of conference papers published on it, but I'm always working to improve what we're doing. I added my email to my profile, with a slight typo to avoid spam ;)


I can't see your email address in your profile. :-( Mine is my username (without the underscores, all one word) at gmail dot com.


As the others in this reply thread, I'm interested in nearshore synthetic data. Would you be open to chatting some time?


Sure! I've had a couple of conference papers published on it, but I'm always working to improve what we're doing. I added my email to my profile, with a slight typo to avoid spam ;)


I couldn't find it in your profile :(


"Anyone who considers arithmetical methods of producing random digits is, of course, in a state of sin." - John von Neumann

One might argue that the point of synthetic data is to produce pseudo-random samples, but those samples will only ever reflect the biases of the interpreter, so the critique is still worth keeping in mind for its cautionary significance.


The point is to be able to control the correlations that you introduce - so you can therefore control what your learning algorithm learns.


Nice in industrial settings, poor in scientific ones.


I'm not familiar with the topic. How do you measure the quality of the synthetic data? That is, how close are the synthetic samples to the real ones? Moreover, can you control this quality while generating synthetic samples?


From the perspective of using synthetic imagery to train machine vision systems, I think that the idea of fidelity (i.e. how similar synthetic images are to real images) is less than half the story, and has the potential to be dangerously misleading.

Of greater concern are quality measures that look across the entire dataset. Here are some hypothetical metrics which (although impossible to compute in practice) will help get you thinking in the right way.

- How does the synthetic image manifold compare to the natural image manifold?

- Are there any points on the synthetic image manifold where the local number of dimensions is significantly less than at the corresponding point on the natural image manifold? (Would indicate an inability to generalise across that particular mode of variation in that part of feature space).

- Are there any points where the distance between the synthetic image manifold and the natural image manifold is large AND the variance of the synthetic image manifold in the direction of that difference is small? (Would indicate an inability to generalise across the synthetic-to-real gap at that point on the manifold.)

- Does your synthetic data systematically capture the correlations that you wish your learning algorithm to learn?

- Does your synthetic data systematically eliminate the confounding correlations that may be present in nature but which do not necessarily indicate the presence of your target of interest?

Engineering with synthetic data is not data mining. It is much more akin to feature engineering.


You have similarity measures like mutual information score, and you can generally compare correlations and distributions.

You can also A/B test for specific use cases, for example train a model on the real and the synthetic data and compare relevant metrics.

You can see some of these illustrated at, e.g., https://hazy.com/blog/2020/03/23/synthetic-scooter-journeys
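The correlation-comparison idea is simple enough to sketch with a hand-rolled Pearson coefficient. Everything here is illustrative; a real evaluation would use a proper stats library and more measures than this one.

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two columns."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def correlation_gap(real_cols, synth_cols):
    """Max absolute difference between pairwise correlations in
    the real vs. synthetic datasets -- a crude fidelity score.
    Both arguments are dicts mapping column name -> list of values."""
    gaps = []
    names = list(real_cols)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            gaps.append(abs(pearson(real_cols[a], real_cols[b])
                            - pearson(synth_cols[a], synth_cols[b])))
    return max(gaps)
```

A gap near zero says the synthetic data preserved the pairwise linear structure; the A/B approach in the parent comment (train on each, compare model metrics) catches what this misses.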


There's also some really interesting research from last year on the local dimensionality of a manifold that I really want to try out.


Do you have a citation on that for me to read?


The naive way to do it would be to determine the (approximate) distribution and parameters of your data, then generate similar data which conforms to the same distribution under the same parameters, to a very high level of confidence (ideally over 99%). The confidence interval would then also give you the error bars to control and tune the quality of the synthetic data.

But that's not perfect, and you'd want to make sure you're conforming to other important features which are particular to your data (like sparsity and dimensionality).

There's also a common pitfall: in many cases where you'd like to use synthetic data, you're doing it because you lack sufficient real data. This is very dangerous, because that might also mean you have a fundamental misunderstanding of the distribution and parameters of the real data (or those might simply be unknown). This is tantamount to extrapolating from limited data.
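For roughly normal data, that fit-then-generate loop might look like the sketch below. The tolerance check is a crude stand-in for a proper goodness-of-fit test, and the whole thing assumes the normality that the pitfall above warns about.

```python
import random
import statistics

def fit_and_generate(real, n):
    """Fit a normal distribution to the real sample, then draw a
    synthetic sample conforming to the same parameters. Only
    sensible if the real data really is roughly normal."""
    mu = statistics.fmean(real)
    sigma = statistics.stdev(real)
    return [random.gauss(mu, sigma) for _ in range(n)]

def within_tolerance(real, synthetic, rel=0.05):
    """Crude acceptance check: synthetic mean and stdev land
    within a relative tolerance (in units of the real stdev)."""
    s = statistics.stdev(real)
    return (abs(statistics.fmean(synthetic) - statistics.fmean(real)) <= rel * s
            and abs(statistics.stdev(synthetic) - s) <= rel * s)
```

Tightening `rel` is the "error bars to control and tune quality" knob: reject and regenerate until the synthetic sample passes.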

What another commenter said about how synthetic data is useful for providing analysts with good quality dummy data instead of confidential real data is correct. I think that's a great use case for synthetic data. But in general, I disagree with using synthetic data to augment a dearth of real world data unless you have reasonable certainty your data conforms to a certain distribution with certain features and parameters.

One such area is financial simulation. You can generally be reasonably certain that price data will conform to a lognormal distribution. So it's okay to generate synthetic lognormal price data in place of real price data for certain types of analysis. But again, I would still stress that you can't use that to measure (for example) how profitable an actual trading strategy would be. You need real data for that (to analyze order fills, counter survivorship bias, etc).
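A minimal sketch of that kind of synthetic price series: a geometric random walk where each daily return factor is lognormally distributed. The drift and volatility numbers are illustrative, not calibrated to any market, and per the caveat above this is for distributional analysis, not for backtesting a real strategy.

```python
import random

def synthetic_prices(start, n, drift=0.0002, vol=0.01):
    """Generate a synthetic daily price series: each day's price
    is the previous price times a lognormally distributed factor,
    so log-returns are normal and prices stay positive."""
    prices = [start]
    for _ in range(n - 1):
        prices.append(prices[-1] * random.lognormvariate(drift, vol))
    return prices

path = synthetic_prices(100.0, 252)  # one synthetic trading year
```

The missing microstructure (order fills, spreads, survivorship) is exactly why this can't stand in for real data in profitability analysis.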

Another area is computer vision. As others have pointed out, since our understanding of roads is very good it's very effective to generate synthetic data for training self-driving vehicle models. But it's still tricky and it can be extremely confounding if misused.


I don't actually think you want to mirror the natural data distribution, but rather to provide a distribution which has a sufficiently high variance in the right directions so that the resulting NN polytope has a chance of being approximately 'correct'.

Because you have this piecewise-linear sort of warping of the feature space going on, the NN is basically a whole bunch of lever-arms. The broader the support that you can give those lever arms, the less they will be influenced by noise and randomness ... hence my obsession with putting enough variance into the dataset along relevant dimensions.

To put this another way, I think that the synthetic data manifold has to be 'fat' in all the right places.


You have a good point, and I probably should have been clearer. When I said same distribution and same parameters, the parameters I was thinking of were things like mean and variance. Though to be fair, mean and variance aren't formal parameters of every distribution.

Can you give an example of successful synthetic data generation which doesn't need to map to the same distribution? I'm surprised at that idea.


Well, in a sensing-for-autonomous-vehicles type problem, it's actually more important to have simple and easy to specify data distributions than ones which map to reality, which in any case may be so poorly or incompletely understood that it's impossible to write the requirement for.

So, as a simple example, the illumination in a real data-set might be strongly bimodal, with comparatively few samples at dawn and dusk, but we might in a synthetic dataset want to sample light levels uniformly across a range that is specified in the requirements document.

Similarly, on the road, the majority of other vehicles are seen either head-on or tail-on, but we might want to sample uniformly over different target orientations to ensure that our performance is uniform, easily understood, and does not contain any gaps in coverage.

Similarly, operational experience might highlight certain scenarios as being a particularly high risk. We might want to over-sample in those areas as part of a safety strategy in which we use logging to identify near-miss or elevated-risk scenarios and then bolster our dataset in those areas.

In general, the synthetic dataset should cover the real distribution .. but you may want it to be larger than the real distribution and focus more on edge-cases which may not occur all that often but which either simplify things for your requirements specification, or provide extra safety assurance.
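A scene-parameter sampler in that spirit might look like the following sketch. The ranges, the oversampling rate, and the high-risk scenario names are all invented stand-ins for whatever an actual requirements document would specify.

```python
import random

# Hypothetical high-risk scenarios surfaced by operational logging.
HIGH_RISK = ["child_near_road", "sun_glare_at_crossing", "occluded_cyclist"]

def sample_scene(oversample_risk=0.2):
    """Sample scene parameters uniformly over requirement-specified
    ranges (rather than matching nature's bimodal distribution),
    and oversample high-risk scenarios for safety assurance."""
    scene = {
        "light_level_lux": random.uniform(1, 100_000),  # dusk through noon
        "target_heading_deg": random.uniform(0, 360),   # all orientations
        "scenario": "nominal",
    }
    if random.random() < oversample_risk:
        scene["scenario"] = random.choice(HIGH_RISK)
    return scene
```

The point of the uniform draws is exactly the coverage argument above: performance becomes easy to specify and audit per range, with no gaps where nature happened not to provide samples.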

Also, given that it's impossible to make synthetic data that's exactly photo-realistic, you also want enough variation in enough different directions to ensure that you can generalize over the synthetic-to-real gap.

Also, I'm not sure how much sense the concepts of mean and variance make in these very very high dimensional spaces.


In the physical sciences there are plenty of domains where accurate measurements are sparse. In a case close to home for me, it's measurements of water depth off coasts (accurate to centimeters on a grid of size meters). The places where you have these measurements in the real world can be counted on one hand. But now you want to train an ML algorithm to guess water depth in environments all over the world, so in this case you need your data to be representative of a bunch of possible cases outside the real data. This differs slightly from the GP, who I think is talking about creating data that isn't represented in the real world at all but that would help an algorithm predict real-world data anyway. Still, they are closely related topics.


Typically we find that without further adjustments, there tends to be a substantial domain gap between the synthetic data and the real world data.

We're building something that helps us narrow or completely eliminate the gap between synthetic data and real data at Creation Labs and will have some exciting things to show in the next few weeks on our website (creationlabs.ai)


Submitted a lit review for this recently, such an amazing topic that could have huge implications for learning algorithms.


For machine vision problems, synthetic data with domain randomisation definitely works, provided, that is, you are able to generate a large enough number of different modes of variation.

In fact, it works so well it feels like cheating.


Synthetic data always reminds me of the practice of feeding cattle with the meat and bone meal of other cattle. Will this become our generation's mad cow disease?


Maybe, since garbage in is garbage out. But if there's no grass in the first place, what are you going to feed your cattle?


I'm working on this: generating synthetic EMR data that researchers could potentially use to study various interventions on Covid-19. Early stages...


I have not been able to create synthetic data for 3D bounding boxes. Is there something available that can help?


Why does the process of synthesizing data and then training a model on it work well, when hard-coding an algorithm never does?


In my mind it's because you have a more direct route to the underlying physics of the problem than either hand-coding features or trying to learn them from real data (at least when that real data is as poorly managed and controlled as it usually is).

Our ML algorithms are really good at finding correlations -- but we don't necessarily know if the correlations in our data are actually the ones we want our system to learn. When we're using synthetic data, we have many more levers at our disposal to ensure that this is the case.


I live in a higher dimension and my mood always collapses


How does this work in an information theoretic sense?


very cool



