I once took a call with an ex-NSA guy who was a CEO selling email security software: your MX would point to them, they'd scan your incoming email for exploits, then deliver it on to you. When I spoke to him, I expressed my concern, having worked for a large multinational corporation whose fiber optic lines were tapped by UK intelligence to sidestep the laws barring the NSA from spying on Americans, that if I couldn't inspect his software, I couldn't feel confident in the security and integrity of the scanning system.
I said, in good faith, that I would consider his product if I could inspect the running system and the code.
He said several things:
1) the NSA never did anything illegal
2) the software was too large to audit
3) it was an insult to his employment in the NSA that I was even asking these questions.
The NSA would never do anything illegal; if you have a problem with highly misleading and unethical actions being undertaken under flimsy pretenses established by classified memos citing dubious legal justification, then that's obviously your problem, not theirs.
Also, didn’t the director of the NSA perjure himself when he lied under oath during his sworn testimony before SSCI? No, sorry, he actually gave the “least untruthful answer” and then changed his answer when contradictory facts became public. I would have called that illegal but obviously the NSA has a legal theory and justification as to when they need to provide “untruthful” “facts” to the institutions exercising oversight.
The difference is that I knew he wasn't on the level when he said it, because the NSA has been sued and it's come out they've done things that were illegal, as found by a court of law, in public. That's just one example that we know of.
Neither is objective. There's no pretense of objectivity in the legal system. This is why you see people say things like "we'll find out if this was legal when they rule on the case".
Found to be illegal at the judicial level:
https://www.nytimes.com/2010/04/01/us/01nsa.html
then overturned by the 9th Circuit. Then Congress stepped in and changed the law to make the situation "clearer".
"In partnership with the British agency known as Government Communications Headquarters, or GCHQ, the N.S.A. has apparently taken advantage of the vast amounts of data stored in and traveling among global data centers, which run all modern online computing, according to a report Wednesday by The Washington Post. N.S.A. collection activities abroad face fewer legal restrictions and less oversight than its actions in the United States."
Note there's a fair amount of speculation on the specific details of how and what data is collected and shared.
There have been public claims about it, and honestly, thinking that they wouldn't do it once they have the power to seems naive.
https://www.zdnet.com/article/thatcher-ordered-echelon-surve...
> Ex-spy Mike Frost told the CBS 60 Minutes programme that Thatcher had ordered surveillance on two cabinet colleagues according to excerpts released on Thursday. The allegation comes in the same week that a European Parliament report said Echelon, a surveillance system run by the United States, Canada, Britain, Australia and New Zealand, was used for industrial espionage.
"pushed back"? Like how the director of the NSA "pushed back" on congressional questions of whether the NSA was broadly collecting any data from American citizens?
"Please don't post insinuations about astroturfing, shilling, brigading, foreign agents and the like. It degrades discussion and is usually mistaken. If you're worried about abuse, email us and we'll look at the data."
No it doesn't (at least not to me or the dictionary). Conspiracy denotes two or more people planning with each other to commit illegal, wrongful, or subversive acts. Whether or not it's hidden doesn't enter into it.
A lot of people are talking about a state of apartheid in Israel; let's not throw every such theory into the corner of anti-Semitism. And to be crystal clear, I'm saying this from a POV of Jewish heritage.
Not "more data than any of the others" - just a specific kind of access that usually comes with much closer intelligence relationships (e.g. Five Eyes), rather than the more wary relationship the Israeli and US intelligence communities generally have with each other.
Since when do spy agencies avoid doing things just because they're "not allowed" to do them? An activity being disallowed only means that they'll avoid telling people that they're doing it.
Auditing is a negotiation between the auditor and the auditee. The auditor rarely gets to dictate absolute terms (and in my experience, will often listen to well-reasoned and prepared arguments and plans from auditees).
Since I was effectively the CTO of a startup that cared about the security of its messaging, I think I made a reasoned judgment about the nature of the security of their product, and offered a way he could increase my confidence that he wasn't just sending a copy of my unencrypted email (the email has to be unencrypted for their scanner to work) off to who knows where.
I don't really find that rude. A cloud customer certainly can go to a cloud provider, say "you know, it's possible you have rogue internal actors, I've read articles that said you've fired SREs before who snooped on user data, can I see your audits that show you deal with insider risk properly?"
Yes, he sells himself on that experience (I had already done due diligence on a previous company he founded, sqrrl, which was organized around open source software, but touted the NSA creds):
Oren Falkowitz
CEO and Co-founder
Oren Falkowitz co-founded Area 1 Security to discover and eliminate targeted phishing attacks before they cause damage to organizations. Previously, he held senior positions at the National Security Agency (NSA) and United States Cyber Command (USCYBERCOM), where he focused on Computer Network Operations & Big Data. That’s where he realized the immense need for preemptive cybersecurity.
This post could turn into +5 informative if names and contact details were added for this unidentified "ex-NSA guy" and the "security software."
As-is, this post is becoming popular because people are replying with random experiences and hatred they have for the NSA and insecure systems. Instead, it could be helpful by damaging the reputation of a specific person who is peddling insecure systems.
When you hear a company talk about the promises of synthetic data, you should run far, far away. The fundamental problem is that in order for synthetic data to be useful for model training, the generative synthetic model must have already solved the problem at hand. Training on the synthetic data is just a charade: you would be better off simply extracting the model you need from the source generative model. For generative models with densities this is as simple as P(Y | X) = P(X, Y) / P(X).
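To make that concrete, here is a toy sketch (my own illustration, not any vendor's tooling) of why training on synthetic samples is redundant when the generator exposes densities: the conditional model you wanted is already one line of arithmetic away.

```python
import numpy as np

# Toy joint distribution P(X, Y) over 3 values of X and 2 values of Y,
# standing in for whatever generative model supplies densities.
joint = np.array([[0.10, 0.20],
                  [0.25, 0.15],
                  [0.05, 0.25]])

p_x = joint.sum(axis=1, keepdims=True)   # P(X), marginalizing out Y
p_y_given_x = joint / p_x                # P(Y | X) = P(X, Y) / P(X)

print(p_y_given_x.sum(axis=1))           # each row sums to 1, as a sanity check
```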
So, what do you get out of it compared to simply training models as a service for people? Almost nothing useful. All you get are:
- Much worse privacy guarantees. (People selling synthetic data love to talk about improved privacy, but it's actually the reverse. The privacy guarantees for synthetic data are much, much worse than selling direct models to people.)
- Much worse model performance. See the previous notes about how a synthetic data generating procedure must have already solved the problem at hand.
- A much more complicated setup with much more expensive model training. Training generative models is hard and requires a lot of data and compute due to the difficulty in learning such a complex outcome space. This can easily cost 100-1000x as much as simply training a straightforward xgboost model.
The reality is, you can guarantee certain values are not passed. That's fairly easy. What you can't do is easily block general trends in synthetic data, or else the model can't learn. So you have to be willing to accept that "leakage" when using synthetic data.
If you accept that leakage, then you can actually improve model performance in some domains. See Uber's architecture search blog; there's a lot of material (also from Uber) showing this.
Regarding the cost of training: yes, model training costs could increase, although I'd suggest by much, much less than 10x, more like 50% or something. Is that worth it for privacy?
> The reality is, you can guarantee certain values are not passed. That’s fairly easy
Nonsense. There's a fundamental requirement that for a model to have utility, it must have access to relevant data. If you truly find a leak-proof way to block certain values (which is borderline impossible in itself), then you make the model significantly weaker or outright useless if those values aren't uniformly distributed across the input domain.
And if you do have leakage (and you almost certainly do), then direct re-identification becomes a trivial problem and we're back to square one.
Synthetic data generated from statistical analysis of real data is either worthless or leaky. One of those two conditions is formally guaranteed to be true.
GP is correct that a viable synthetic data generator basically has to have already solved your problem for you, and in that sense it just becomes your model which is trained on real data. Training an additional model on top of that model doesn't add any privacy or mitigate reidentification.
What specific privacy guarantees are you hoping to provide and how do you know you are satisfying them? That is the key to basically all of it and most work on this subject falls flat (except for differential privacy, but current evidence indicates that it doesn't work very well for generative models).
Guaranteeing that certain values are not passed is both useless and trivial. I can easily satisfy that by simply adding 0.1 and a space character to all values in my dataset even though that doesn't remove any of the sensitive data.
The idea that the synthetic generator model must have solved the actual modelling problem is an attractive one, but it doesn't correspond to what people want data for: they want to eyeball it, see what's in it, test some algorithms, and figure out how they might approach the problem. You can do that very well on realistic synthetic data (much better than with any other privacy tech), even if the synthetic data has lost some utility through its statistical approximations.
The idea that training on synthetic data is a “charade” misunderstands the usefulness of having realistic, “drop in compatible” data that works with your existing code or models.
The ideas of training models as a service, and of working directly with synthetic data generators to "extract" from them, are great but incompatible with (a) the complexity of real-world DS workflows in regulated industries and (b) data scientists' current code/workflows/techniques.
If I’m a bank, I’m not going to give you my fraud rules and you’re not going to solve my problems with xgboost. Access to models is just as locked down as access to data.
This is why it’s useful having an intermediary. Like ... a generative model that you can train where the data is and then copy over to where the modelling is happening.
Assuming the problem at hand is analyzing the data.
I don't know enough about what they're building to understand what use cases they have in mind. (Their blog says "developing new features or exploring insights", which is a bit vague. See https://medium.com/gretel-ai/gretel-readme-fd0c4eff8a09 .)
But if this allows people to take their production database, run a command, and generate fake data they can run integration tests against, then that's not analyzing the data. It's making it easier to run tests in an environment that is as close as possible to production, because data is part of an environment.
Aside from that, I'd argue that some kinds of analysis may even still be possible. Let's say you have an e-commerce site, with one database table for user accounts and another table for shipping addresses. And let's guess that Gretel anonymizes this data by creating a user table with different usernames and an address table with made-up shipping addresses, but the overall structure is isomorphic: for every row in the real data, there is a corresponding row in the fake data, just with different values in it. And the (synthetic) keys aren't private (they're data you generated, not collected from the user), so they can be preserved. Then you can, on the fake data, run a query and find out what percentage of users have given a shipping address.
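Assuming the keys really are preserved one-to-one, a minimal pandas sketch of that kind of query on made-up tables might look like this (table and column names are hypothetical):

```python
import pandas as pd

# Hypothetical synthetic tables that mirror the real schema and keep the keys.
users = pd.DataFrame({"user_id": [1, 2, 3, 4],
                      "username": ["ab12", "cd34", "ef56", "gh78"]})
addresses = pd.DataFrame({"user_id": [1, 3],
                          "shipping_address": ["123 Fake St", "456 Synthetic Ave"]})

joined = users.merge(addresses, on="user_id", how="left")
pct = joined["shipping_address"].notna().mean() * 100
print(f"{pct:.0f}% of users have a shipping address")  # 50% on this toy data
```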
Of course it's just a guess that that's what they're aiming to do since so little detail is available. It's entirely possible that what I described isn't what they're planning to do. Maybe it doesn't sufficiently protect privacy.
But the point is, until we know what they're building, I don't know if we can conclude that people should run away.
I got your point and it sounds good, but let's take your example of the shipping addresses.
Let's say you want to develop a model that helps find the best route for a UPS driver. The real data may contain addresses all based in New York because your customer's business is located there. The fake data contains addresses scattered all around New York.
I guess the model you train with the fake data won't be any help, because instead of the driver doing 5 locations in an hour, he has to drive 5 hours for the same amount.
Gretel needs to know the context of the data and the importance of the relationships between the data rows. Pure anonymisation won't be enough.
While generally I agree with your conclusion (synthetic data doesn’t have better privacy guarantees, probably will hurt training if you use it naively), I wouldn’t be so pessimistic.
At the risk of digressing from TFA, Candes’ knockoffs, for instance, are an example of a (theoretically) successful use of synthetic data for model robustness. Still need original data, of course.
Basically, the broader point is that you don’t need to solve the full problem of joint likelihood estimation to use generative models effectively, e.g., GANs are another example.
If you take the MNIST dataset and generate synthetic data that has variations on the original images, e.g. rotations, added noise, different background images, etc. you will get a higher model accuracy than the baseline model in exchange for a longer training time.
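For example, a minimal torchvision-style augmentation pipeline along those lines (rotations, small shifts, added noise; the exact transforms and magnitudes here are just assumptions) could be:

```python
import torch
from torchvision import datasets, transforms

augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),                      # small random rotations
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),   # small translations
    transforms.ToTensor(),
    transforms.Lambda(lambda x: (x + 0.1 * torch.randn_like(x)).clamp(0.0, 1.0)),  # additive noise
])

train_set = datasets.MNIST(root="./data", train=True, download=True, transform=augment)
```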
Maybe I misunderstand something, but if x -> f(x) is easy to compute but you want to learn f(x) -> x, then isn't synthetic data exactly what you want to be using?
Example: Training an image upscaling algorithm by feeding it downscaled images. In this case you don't even need to train a generative model (the algorithm is known), but it should illustrate that the generative task can be extremely easy compared to the target task. You can't just handwave that away with "just divide by P(X)".
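A rough sketch of that setup, assuming a folder of high-resolution photos and using Pillow for the (known, cheap) downscaling step:

```python
from pathlib import Path
from PIL import Image

def make_pair(path, factor=4):
    """Build one (low-res input, high-res target) pair for training an upscaler."""
    hi = Image.open(path).convert("RGB")
    lo = hi.resize((hi.width // factor, hi.height // factor), Image.BICUBIC)
    return lo, hi

# "photos/" is a hypothetical directory of real high-res images.
pairs = [make_pair(p) for p in Path("photos").glob("*.jpg")]
```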
What about synthetic data that is correct? Example: using the output of a physics simulation to train... The output is synthetic (as in, it's not from the real world) but it is governed by the equations of motion...
Going from the parent comment's assertion, the answer would be that if you want to train something to predict physics, the original physics simulation that's generating the data is either:
1. Already doing that, in which case it's already the superior model.
or
2. Not doing that correctly, in which case the synthetic data output is equally poor.
Consider computer graphics and computer vision. If you had a photorealistic renderer, it could be useful for generating training data for a vision algorithm. However, you can't just run your rendering pipeline backwards to create a scene out of an image.
Yes. A typical example would be the use of statistical emulators in the environmental sciences, emulating process-based (physical) models that are computationally complex.
This is the one that seems to be useful today. Running simulations with physics being modeled across objects seems to be useful for reinforcement learning algorithms that interact in the real world. You see some of this with autonomous vehicle training, robot manipulation training, etc. I think there are also emergent interactions that can occur that you wouldn't have modeled directly unless you let the simulation run.
Right, but presumably we don't want the model to learn those, since they're an impractically low-level model. If you're training an AI for flying a plane, you want it to learn "the simplest model that works"—presumably something like aerodynamics—not a far more computationally-intensive (and data-intensive!) model, like e.g. quantum chromodynamics.
The example is fundamentally different as physics simulations have closed form solutions or can be numerically approximated for various initial and boundary conditions. Therefore, by definition the generation is not synthetic.
Other examples I could think of fall in the category the top comment is referring to.
> When you hear a company talk about the promises of synthetic data, you should run far far away. The fundamental problem is that in order for synthetic data to be useful for model training the generative synthetic model must have already solved the problem at hand.
I've heard this group is pretty good, and they seem to be proud enough of their synthetic data to publish papers on it:
I'm being pedantic by posting this though, because their approach involved learning the synthetic data generator simultaneously with the classifier/whatever that you're training. That is not relevant for a static synthetic data source.
@lalaland1125 you're right that privacy is a hard problem. We're really excited about making techniques like synthetic data, data labeling, and analyzing data sets for privacy risks (re-identification risk, etc.) available to all developers. We'll be open sourcing some of our work in the next few weeks; feel free to jump into the code, and we'd love to hear your thoughts!
I don't know, if your generative model is something like getting the knn of each data point, then fuzzing each point by the local statistics of each point (from the knn's), I don't think you will miss the important trends and you get a bit of differential privacy.
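Something like this, for example (a rough numpy/scikit-learn sketch of the kNN fuzzing idea; k and the noise scale are arbitrary, and this is not formal differential privacy):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_fuzz(X, k=10, seed=0):
    """Jitter each point of a 2-D array X by the spread of its k nearest neighbours."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k).fit(X)
    _, idx = nn.kneighbors(X)               # neighbour indices, shape (n, k)
    local_std = X[idx].std(axis=1)          # per-feature spread of each local neighbourhood
    return X + rng.normal(scale=local_std)  # noise scaled to the local statistics
```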
Might you or someone else say what this specific area of ML generative models/synthetic data is called? Are there any introductory references you could share. Thanks.
You’re coming at this from a theoretical approach, which isn’t appropriate for deep learning because nobody has any idea how or why these neural networks actually work. Generating a bunch of rotated and stretched versions of your labeled images DOES work, and very effectively. Why? The neural network is an inscrutable God, we just have no idea.
I started a company with the same premise 9 years ago, during the prime "big data" hype cycle. We burned through a similar amount of investor money only to realize that there was not a market opportunity to capture. That is, many people thought it was cool - we even did co-sponsored data contests with The Economist - but at the end of the day, we couldn't find anyone with an urgent problem that they were willing to pay to solve.
I wish these folks luck! Perhaps things have changed; we were part of a flock of 5 or 10 similar projects and I'm pretty sure the only one still around today is Kaggle.
Sometimes an idea has a time. It may seem obvious now but it isn't like Instagram and Snapchat were the first image sharing applications developed. Slack was nowhere near the first chat app.
I happened to have a long discussion on the topic of data businesses last night with a friend. We brainstormed datasets that would be both hard/expensive to obtain and resellable to thousands of customers willing to pay a high price for them. I don't want to get involved in datasets that are easy to obtain (too many competitors, no bar to entry) or datasets specific to a particular company (too much dependence on a small set of customers, cost of acquiring new customers also includes cost of acquiring data, no economy of scale).
It's easy to start with the tech problem: how to collect, clean and analyze data. But reasoning backwards from the business side is much more difficult. Expensive data I can sell once feels easy. Cheap data I can sell frequently feels like a race to the bottom. Expensive data I can re-sell 1000s of times to a niche audience feels like a perfect middle ground ... I just can't think of any examples.
I think the machine learning problem is not data. It's not models. It's not compute.
It's annotation. That is a workforce problem. You want to automate contracts? You need attorneys. You want to automate radiology? You need radiologists. You want to automate driving? You need drivers.
This makes ML less like a SaaS business, more like a mining business. There's tons, literally tons, of data/ore for any interesting problem. That's why it's an interesting problem. There are buyers of iron, gold, and marble. There are buyers of driverless cars, physician decision support systems, and contract automation solutions. But recovering the data from the mine (digitization) and enriching it (annotation) cost money. So much that the market variation may make it lucrative at some times and not others. If you are near peak employment, the value of a model goes up, but the cost of annotators is high also.
I'm not sure how finance guys capture that problem: how do you make a profit when there's high demand on both sides at one time, and low demand on both sides at other times? I submit that when both sides are low is the time to do annotation, and when both sides are high is the time to sell models.
But then you need an investor who can ride out the market.
Isn't it more like "nearly always"? It's pretty hard to find examples of things that haven't been tried multiple times before in some variant. You can argue if it was timing or execution that worked "this time" of course, but almost nothing happens in isolation.
Maybe? Hard to see past survivorship bias. My intuition says some ideas will never see their time. Hard to quantify how often that is the case as a percentage of all ideas.
I imagine black box data from plane crashes, or in general data that comes out of a tragic event that no one can or would want to replicate, but is otherwise extremely valuable.
Of course, if you can sell it to one person, they can just pass it off to others, so this will quickly turn into a DRM business profiting off tragedy. Probably not a good idea.
I had a friend tell me recently about a client using commercial real estate data for lead gen. He mentioned https://compstak.com/
Basically, identifying companies that are doing well / expanding by how big the space is they leased. This sort of data is apparently very hard to get, but gives users a competitive advantage.
Real estate data, and companies like compstak are exactly the kind of niche markets I'm talking about. Agents are willing to spend large sums to get access to this data and it can be resold multiple times. Unfortunately it is also a market full of existing competition with some established players.
What other markets for data are similar? In general, data that leads to prospect generation is desirable because sales agents are willing to spend money to make money. Are there any other markets like that?
So it sounds like the salient aspect here isn't necessarily the type of data, but the manner in which that data is collected. Looks like compstak's success is a result of creating a platform that facilitates crowdsourced data points that are difficult to acquire using traditional data collection approaches...that scarcity is what makes the data valuable, especially since that data can be used for leverage in a negotiation. Also, they appear to prop up the overall scarcity by only granting access of existing data to users who provide new data.[1]
I'm curious how they figure out how much to charge companies for this data? And also how they stop real estate insiders from gaining access without sharing new data?
The medical markets, but I don't want to go any further because that's what I'm doing right now :p
SMART on FHIR is a newish standard for medical applications that is getting a HUGE push from large companies like Cerner, Epic, along with all the tech giants. Hospitals are itching for more FHIR apps that can integrate directly into their Electronic Health Record system (and web apps be delivered directly on a doctor's web portal within the hospital's IT system).
So that might be a good place to start poking around...
A really good example of this is labeled medical imaging data.
Some key contributing factors: multiple stakeholders and consent/approval issues, legal and technical constraints on access, and, depending on the application, labeling that may only be possible using very expensive experts. Lots of human interaction.
I agree it is expensive to gather but is it something that can be sold at high cost to 1000s of customers? It seems the market for purchasers of that data might be limited to a small number of companies, probably hoping to build ML models.
The question I responded to was what made it hard and/or expensive to obtain, which I think I answered.
Commercial viability of doing so for profit is a different issue, but I see that's the other part of your original comment. It's not an obvious answer, partially because there are a lot of different scenarios within that blanket "medical imaging", and what the putative customer might want to do with it.
Yes, I should have followed the thread better. My mind is focused on a particular kind of commercial viability which is niche markets (in the 1000s to 10,000s) willing to pay for access to data.
One example is that it could require a significant amount of human legwork - e.g. Google street view. Another example is it might require significant dev effort to clean and combine several raw data sets into a refined output data set.
My first thought is specific business industry analysis data. I've often been an hour into an online deep-dive only to hit a paywall related to this. However, I'd think it would be hard to acquire the valuable aspects of this data without some kind of insider access (compared to web scraping, creative api mining, etc).
New data source needs seem to pop up out of nowhere - what about building a platform that would facilitate the "collect, clean and analyze data" aspect of this for non-technical business owners?
> what about building a platform that would facilitate the "collect, clean and analyze data" aspect of this for non-technical business owners?
One of the problems with this is how custom each data set for each client would be. My mind has been on this topic since the a16z article on "The new business of AI ..." [1] which was posted to HN in the last couple of days. The key idea is the question of how to decouple the process of collecting, cleaning and analyzing data from the process of acquiring customers for that data, and not how to solve the problem of collecting, cleaning and analyzing data. Developers want to solve the technical challenge (how to build the processes) but not the business challenge (how to find customers willing to buy the resultant data).
I do believe there is a market for start-ups to partner with existing companies to help them wrangle their data. It just isn't the market I'm thinking of.
Sounds like you're thinking of creating a gold-mine business model for data....do the hard searching/digging up/processing of rare and valuable data, then sell it at a premium.
A few questions: What types of businesses would be "customers of that data"? Brainstorming all potential customers separated by industry would be a good start. Are there any "data purchasing" trends you've seen lately?
> What types of businesses would be "customers of that data"?
That is the exact question! That is the wall we hit. If I could consistently answer that question then there is a business to be had.
Who is willing to pay for that kind of data? I actually considered getting together a larger group of friends to do that exact brainstorm. But even then I'm not sure it is such an easy question to answer.
Also a bit funny you called it a gold-mine business. I called data that meets that criteria Goldilocks data.
Not OP but I think he means market research data. I've thought about this as well. You pay some researcher some money to write a report on growth of a particular segment of an industry. Trade groups often do this and you hear stats like "Mobile usage expected to grow by X% in developing countries over the next Z years". But it is a multi-page report, probably including graphs, on some particular topic.
It matches roughly the kind of data I was talking about. It is expensive to generate since you have to pay a researcher some amount of money to write the report. The resultant report generally can be re-sold multiple times.
My problem with this kind of data is that you will be competing against AIs pretty soon which will drive the cost to generate such reports down. And the price you can charge per report will be tied to how good a report you are capable of generating. It is also a saturated market already so the real play is driving the cost of generation down, not what I want.
Instead of selling reports, would it make more sense to create dashboards that let users slice + dice data and view the insights?
In other words, instead of being a Gartner, focus on being a Crunchbase. That way, you can sell to both the end users of these insights (the companies in these industries) as well as the market research companies, themselves.
Yea this idea has been tried and never found a market at least a dozen times over. It's not impossible for the market or execution to change (e.g. Dropbox) but given the recent software is eating the world boom, it's hard to imagine that the market has changed dramatically so recently or their execution is going to be so much better than other teams who have tried and failed.
I can almost guarantee you that they are building this to address some specific NSA problem and their entire business strategy hinges on getting a massive DoD contract
I'm a bit unclear on the goals for this startup. When you say "Github for Data" I'm thinking of a repository of datasets used for ML training or for more traditional research. But this:
> This so-called “synthetic data” is essentially artificial data that looks and works just like regular sensitive user data.
So it's like a Lorem Ipsum generator for data? What's the use case here besides building apps with sample data? Notwithstanding potential privacy concerns, how am I confident that this is realistic if you literally say it's generated?
General data repository for research with some mechanism to ensure cleanliness or integrity sounds much more useful to me.
As you have noticed, "Github for Data" is not what they're building. At all.
Our company is actually building Git for Data and Github for Data. We have an open source database called Dolt which combines the commit graph of git with the relational tables and SQL of MySQL:
Dolt lets you version, branch, and merge your dataset so that you can collaborate on it with others. Dolthub lets you share your dataset with the world, submit PRs, fork other people's repos, and lots of other analogous features to Github.
> A mental retard who is clueless not only about current events, but also has the IQ level of a rock. "Dolt" may be the most sophisticated insult in the English language. Dolts commonly populate such stereotypes as jocks, nerds, fruits, bookworms, and dorks.
True, but the name `git` came from the open source SCM tool, written by Linus for the Linux source code - and now everyone is sort of stuck with that name. This is a commercial product, deciding on a name for themselves.
> How am I confident that this is realistic if you literally say its generated?
This is a particularly good question since it's recently been shown that even neural nets trained on real data often pick up substantial, predictable dataset biases.
Practically every single-dataset-trained CNN seems to pick up stylistic quirks in the photos or labels it's trained on. The most visible result is that the CNNs perform better on same-dataset test examples than they do in the wild, sometimes vastly better. More startlingly, it's possible to work backwards from this: the training source of a "finished" CNN can be discerned by looking for certain types of error, and adversarial examples can be predictably constructed based on training source.
Tagged imagesets undoubtedly have stronger and harder-to-remove 'fingerprints' than text data like addresses, but I'd be shocked if the problem was nonexistent for text. My first reaction to "synthetic sensitive user data" for ML is to worry about winding up with systematic errors coming from the generation scheme.
The concept is that you're carrying over the same general topology of the real data but in a way that is effectively non-sense. This allows you to build ML models that are representative of the parameters of the true data which you can then use for inference systems in production.
It's a technique we've used in the DoD for a long time, and it works OK when everything is perfect. There are a lot of boundary problems, like being able to troubleshoot bad data if you did your initial analysis with the transformed data, having data scientists actually grok the problem set since it's abstracted, etc...
Edit: It's worth noting that this is a technical solution to a policy/legal roadblock. As organizations mature into better data governance, they are pushing more fundamental changes to governance that gets at these problems where solutions like this will no longer be necessary. For example, hiring data scientists into the groups that have access to the raw data (in our case hiring data scientists and giving them security clearances).
Except that if their models have already inferred sufficient structure from your data to do this in a way useful for training, then their models have already solved the problem your models are trying to learn.
In health care AI there is some tendency to use generated data for training. The idea might be that org A has real patient data but for privacy reasons cannot share it; if they create a sufficiently strong generator, they can share that instead, and org B can train their classifier without ever accessing sensitive data. Alternatively, it can also be used if you simply have too little data and need augmentation.
This just seems like something that will catastrophically fail. If you can build a good enough generator you can just build the ML model internally. And if you can't the statistics of what you provide are going to be off enough that any strong model is going to be wrong in strange ways.
This is an incredibly important point: in order for your synthetic data to be useful, your simulator must have already solved the problem at hand. In theory there is no need to even fool around with generating the synthetic data and going through the charade of training a model on it; simply extract the outcome model from your simulator directly, as that's implicitly what you are doing. For example, if you have a generative model that provides densities, you can simply compute P(Y | X) = P(X, Y) / P(X).
But this is not how generators work. They generally produce samples in the form
G: Q -> (X,Y)
where Q is some prior from which you are sampling. If they are not invertible, then you straight up cannot get P(X, Y) out of the generator. Even if it is invertible, getting P(X) requires integrating out the Y, which might be infeasible (since the model is not integrable in closed form and is sufficiently fast-changing that you need very, very many samples).
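Written out (my notation, just restating the comment above): the generator only gives you sampling access, while the argument needs densities, and even with the joint in hand the marginal still requires an integral over Y.

```latex
% Sampling access vs. the densities the argument needs:
G : Q \to (X, Y), \qquad q \sim Q, \; (x, y) = G(q)
% Even with an invertible G yielding P(X, Y), the marginal still requires
P(X) = \int P(X, Y) \, \mathrm{d}Y
```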
Very true. If you've solved the labeling/extraction problem using a means other than ML, you can use that means to generate synthetic data. The situation at my company is exactly this.
Say you use regular expressions to extract sensitive data from standardized, but numerously varied, form documents. The pieces of information extracted are very common classes of data: first name, last name, dates, physical locations.
During the extraction process you can save the complement of the extraction (the "leftovers") and insert generated data at the extraction points. Also, because you've extracted the actual sensitive data, you can exclude that from the set of values used for generation, if it's practical.
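A stripped-down sketch of that replace-at-the-extraction-points step (the patterns and fake value lists here are toy placeholders, purely illustrative):

```python
import re
import random

# Toy patterns; a real system would have one per field in the standardized forms.
DATE = re.compile(r"\b\d{2}/\d{2}/\d{4}\b")
NAME = re.compile(r"\b(?:Alice|Bob) [A-Z][a-z]+\b")   # stand-in for a real name matcher

FAKE_DATES = ["01/01/2001", "02/02/2002", "03/03/2003"]
FAKE_NAMES = ["Jane Doe", "John Roe"]

def synthesize(text):
    """Keep the 'leftovers' intact and drop generated values in at the extraction points."""
    text = DATE.sub(lambda m: random.choice(FAKE_DATES), text)
    text = NAME.sub(lambda m: random.choice(FAKE_NAMES), text)
    return text

print(synthesize("Alice Smith visited on 03/14/2019."))
```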
Sometimes people get so caught up in the math and theory that they fail to see the practical solutions.
I agree that this is very tricky. I think the most interesting synthetic healthcare data generation I saw used causal inference (where SMEs can bake in a bunch of expert knowledge during skeleton construction) and then generated data by getting the weights on the edges from a smaller dataset. At the same time, it is very hard to ensure that your synthetic dataset actually reflects the real world. On one hand, SME knowledge might give extra oomph to synthetic data generation (as this knowledge is equivalent to some highly abstracted training), but if the "expert knowledge" is wrong then it's a recipe for disaster.
> In health care AI there is some tendency to use generated data for training.
Which is part of the reason for the high failure rate.
Good governance and data access for health data is a very hard problem. Good labeling is also hard/expensive in this space.
So there is an incentive for people wanting to do ML/AI without solving above to try any kind of shortcut they can think of. This incentive doesn't help solve any real problems.
The classic solution to "too little data" is use a simpler and/or less discriminating model. It's still the only one with a good track record.
Transfer learning is nothing like a silver bullet. True, it has become an important work around but it's no panacea and the track record is at best mixed.
People already use it quite a lot. More importantly they misuse it a lot. I'd be less concerned with increasing the usage, and more concerned that the people using it understand the implications and trade offs.
Gartner estimates that by 2022, 40% of AI/ML models will be trained on synthetic data.
It may have some drop in utility today compared to real data, but there are all sorts of scenarios where that's outweighed by the speed and ease of working with fake data rather than real data tied up in red tape.
The "mays" in this statement are doing a lot of work.
There are a few areas of current practice that have this feature: a) the arguments & evidence for it being worse are pretty simple and b) the arguments & evidence for potential benefits are either weak or very convoluted. This is never a good sign.
I think this happens mostly because the reasons these things are being done are, for the most part, not technical, but the technically oriented people involved don't like to think about it that way and would rather talk about technical solutions - which is operating at the wrong level.
The business & cost cases behind not doing this "right" in some abstract sense are pretty clear too, though. I wish more people would just be clear about this, and spend less effort obfuscating and more in clearly quantifying the cost of these workarounds.
Any time you hear someone starting off by saying things like "we don't really need good labels", "this synthetic data will be better, actually", "we'll use transfer from X because it's already done most of the work", etc., well what follows is quite likely to be good fertilizer.
Note, I'm not saying these approaches don't have value, just that there is an awful lot of magical thinking going on around it, and a lot of failures due to that.
There’s only one “may” there but yes, it masks “potentially losing crucial information”. The post I linked to is pretty clear on that.
I totally agree that the business need is the driver and that people miss the imperatives if they look at it purely from a technical or mathematical lens.
In a sense, synthetic data is the least bad actually viable solution. The democracy of privacy / data agility :p
I was referring to both "it may have some drop in utility today... " and "in future it may well perform better than ..."
My issue is not that people just miss the imperatives, but that they also misapply effort because of it. Accept there is a cost and try and quantify the impact. Make intelligent risk management decisions based on that. Sometimes that decision is "this is unlikely to work, what else can we do".
I wonder if "synthetic data" could be used to proof ML? Ie, rather than start from a huge data set that you _think_ is clean enough to know what a Car is (or w/e), you could use a data set that is perfectly clean and the ML set could be tweaked to give the expected result.
Of course, that would still be hairy as you'd have to ensure your ML still performed on real data sets. All the clean data would do is allow you to write unit tests of sorts for your ML with more confidence than the all-too-common unclean real data.
No idea, just making stuff up. Interesting thought.
It seems to be targeting pre-production startups/companies who need data that is structured and appears exactly like the real APIs they intend to roll out in production, from which to build their products. Then when it goes live, they switch to the real data feeds, which this company's API directly mimics.
That's what I'm guessing. It could be used for AI stuff but also other useful datasets that are closed off or require special access.
I'm guessing a lot of defence contractors and other heavily regulated industries (ie, healthcare, insurance, pharma, etc) have similar problems in the dev process of not having access to real data. This was the leading pitch:
>> Data is valuable for helping developers and engineers to build new features and better innovate.
I would be really interested in a "github for data", just a place you can upload structured data of any kind with a bit of meta tagging for public consumption.
SDR captures
PCAPs
Microphone captures
Public tax records
Road data
Flight plans
etc...
Just anything and everything on one service with the limitation that it's all open data. No license agreements or legal restrictions.
It would probably never fly, but it would be amazing.
I don't want a github for data, I want a git for data. The one reason that POS Perforce is used in game development is that it can handle binary files, especially large ones. LFS is... OK.
And this company isn't really even aping GitHub given that they are generating the content too.
Although, having said that, when you say data to me I do imagine some level of immutability. Logging scientific results using git would be pretty good if required, in the sense that there is a habit of just using folders and text files, which is fine at the time but is really, really hard to take over sometimes (like code written by scientists with their single-letter variable names and hatred of functions, in my experience with Fortran - yuck).
We’ve released our branched versioning system for data at http://dvid.io. It’s entirely open source and uses a science HTTP API with pluggable data types and a versioned key-value backend. I’m currently developing a new backend for it: DAGStore, a lower-level ordered key-value store that will have an explicit immutable store and is tailored for distributed versioning of big data.
Recording outputs of analytical runs or deep learning models; the ability to say "this is the output of program X v <git hash> run on the output of program Y <git hash> whose source data was Z <data-git hash>" for all the reasons of repeatability and auditability needed for actual science.
I’ve met several ex-NSAers that used their time there to build their brand successfully.
At the end of the day, the dragnet surveillance decisions came from the highest levels of the Bush and Obama administrations, not the boots on the ground.
"Historically, the plea of superior orders has been used both before and after the Nuremberg Trials, with a notable lack of consistency in various rulings."
Most people who were members of socially bad organizations throughout history weren't bad people or doing bad things, but willingly joining and serving a support role for such an organization is itself morally dubious. It's actively helping enable their activities.
For example, if someone got a job on a shark finning boat maintaining the knives, they're still complicit in shark finning even if they've never touched or seen a shark in their life.
And even if their actual role was spying on and helping detain Americans for arbitrary reasons, nobody's going to slap that on their resume. They'd say they were a data aggregating administrative officer.
Google, Facebook, ... every time we talk about these things the US position seems to be Germans are privacy nuts. It shouldn't be surprising that we think people in the US don't see privacy invasions as a serious problem / accept them in the name of a "higher purpose".
They should have lesser, though still real, baggage attached to them. There are definitely a few types of people in computer science who refuse to touch FAANG employees, but it's not as widespread as I wish it were.
There's so much they're responsible for that it seems like it'd be excessive to post all of them, and HN has a character limit, anyway. Even the most well-known ones would fill a couple of comments.
And 125,000,000,000 globally, along with almost a hundred billion internet requests per month in a time when the Internet was much smaller, years ago, under a program that hasn't stopped:
This definitely damages everyone, and I think more fundamentally than adtech companies. You can escape adtech companies: use different websites, install an adblocker, block trackers, whatever. There's no obvious way outside of strong encryption and hermitry to avoid NSA getting its paws on every single public or private communication you ever make, given their record. When privacy doesn't and cannot exist, human behavior changes deeply. I think it's the greater of two evils.
Adtech companies definitely aren't good, by any stretch, but I think it's the difference between Genghis Khan and a common killer: sure, they both want to kill you, but the scale and the level of cruelty are much different.
Sure, but if one group is operating in an unacceptable fashion along the lines of "We're stealing candy from babies!" and the other is "We're not only a risk to your business, your customers, and you, but also to the entire world, and also we have actually pretty legitimate ties to murder and also coups and also maybe war crimes a little," the first is child's play in comparison.
If I ever am, the NSA in someone's history is a solid red flag. Be nice or go home. Willingness to participate in all that makes me wonder about your character.
Facebook is almost as bad, depending on when and for how long; the more recent or the longer the stay, the worse. Again, be nice or go home.
Google is a little more reasonable, especially if they were only ever on the app/project side rather than the main service.
Adtech experience, I'd have to ask if they'd washed their minds out and won't even consider any of those tactics ever again... on second thought, almost as bad a flag as NSA crap.
Even if it's only a small drop of resistance in an ocean of "yeah, ok, whatever", I feel like we should signal that certain behaviors _against_ your fellow humans are poorly regarded.
Speaking generally, I view anyone who works for a very problematic company or agency in a worse light for it, regardless of whether they're at the bottom or top of the ladder.
Obviously, the higher up the ladder a person, the stronger this effect is. However, that someone is willing to work in any capacity at an organization that behaves badly (in my view) says something about that person.
That's not what I said, but perhaps you have to define what you mean by "hold them responsible". As I interpret that phrase, no, I wouldn't hold them responsible as such, but that they are willing to work there in any capacity says something about them.
The market opportunity is very large, but also insanely difficult to tap.
For one, you have an uphill battle on trust. Customers have to trust your data is secure, and btw it’ll never be 100% secure by many standards.
On the other hand, people have to trust the synthetic data is good enough to use practically.
So you have to both convince management and convince engineers. Arguably management is easier to convince, but.. best of luck on the endeavor!
That’s not even discussing the technical challenges - I’ve implemented this all technically and have had it deployed to production systems. Building a robust system that is secure and produces valid synthetic data is a challenge.
I see 'Github for data' but I'm reading services to hide PII. I think a distinction needs to be made here. Is the primary goal to enable Change Data Capture on data (Github does CDC on code), or is the primary goal to manage PII?
Hey, thanks for the great question. So the "Github for data" is referring to the ability to collaborate on data. By streaming in data, you can view discoveries we made on the data (entity recognition, etc.) and then essentially make a new version of that data with automatic transforms, anonymizations, etc. So you're absolutely right: managing PII is part of it, but really it's about enabling entity A to share data with entity B with a high level of confidence that the sensitive data is stripped out.
We'll be releasing some of the packages to do the analysis and transformations as time goes on, so stay tuned for those so you can take them for a test drive yourself.
The problems with a 'Github for data' are the 7 'V's
Volume: too much data to have usable diffs and merges
Veracity: How do you know which branch of data to commit?
Velocity: The data coming in is a stream - batch processing does not cut it
Most companies end up creating a federation, not a true Data Mart.
A lot of the discussion here seems to be about the methodology and value of the "synthetic data" concept, and squaring that approach with the analogy to github.
It feels like we can tease those two things apart:
1. Is a github-style website/service of forkable data sets useful?
2. Are anonymized, synthetic versions of those data sets, created via ML, useful?
Feels like the answer to both is "yes"?
(Also makes me wonder if there's a "rebase" equivalent for data in this sort of world...)
Great question. Frankly, you shouldn't, and your developers should do it for you. While we'll be offering a service to enable this type of obfuscation via REST APIs (where we don't store or write any of your data to disk), we'll also be releasing some of these techniques as open source packages so your devs can kick the tires themselves without shipping data into someone else's service.
the same reason you trust external companies to do all sorts of other stuff for you: specialization. in the average case, they can probably do it better than your developers because it's their primary business.
Sure, but we are talking about a twofold trust problem here. Trust in an entity (external company vs internal developer) and trust in the work itself.
While I do see your point about trusting an external company which is specialized in the problem I’m trying to solve more than my own developers, I still have to transfer my highly sensitive data to them for which I have to trust them even more.
"How do the CIA, NSA, DHS, and FBI all share critical data without breaking the law with regard to specific details about certain types of collection targets (like citizens?)"...
FEDRAMP it. Now run that as a service inside a secure cloud environment. You need enough runway for the FEDRAMP / engineering / sales process plus a contract win and then I'd imagine the income gets pretty steady.
Commercially? Unsure of the use case. I'd imagine as those sharing data are typically in competition. Not so in the government / intel community / finance space and I'd imagine you have to write a metric ton of policy and sign a bunch of MOUs to do this kind of stuff properly. People in government do care a great deal about these policies, believe it or not.
This is also a huge problem to solve in the "Know Your Customer" / Anti-Money Laundering space for financial institutions, where sharing data between companies or government and companies is often prohibited and/or really difficult. See the recent FCA TechSprint for more on this: https://www.fca.org.uk/events/techsprints/aml-financial-crim...
Lots of talk of "Homomorphic Encryption" and "Encrypted Cloud Runtimes" as options, but if you don't really need to share all of the data to get to an outcome (but rather synthetic data - though synthetic identity seems to be the hard bit here...), that could be interesting!
Hey all- we just released code on GitHub and a research post on Medium demonstrating how to generate synthetic datasets from models trained using differential privacy guarantees, based on rideshare datasets. Please take a look and let us know what you think!
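For readers new to the topic, the building block behind many such guarantees is noise calibrated to sensitivity and epsilon. A textbook Laplace-mechanism sketch (not Gretel's actual pipeline, just an illustration of the idea) looks like this:

```python
import numpy as np

def laplace_count(true_count, epsilon, sensitivity=1.0, seed=None):
    """Release a count with epsilon-differential privacy via the Laplace mechanism."""
    rng = np.random.default_rng(seed)
    return true_count + rng.laplace(scale=sensitivity / epsilon)

# e.g. a noisy count of rides starting in one grid cell of a rideshare dataset
print(laplace_count(true_count=1423, epsilon=0.5))
```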
A little off-topic, since we’re not talking about true synthetic data here (vs. wiping PII), but the future of synthetic data is no-data due to differentiable programming. Instead of a program outputting vast amounts of synthetic data, it is written with a library or language that is differentiable and whose gradients can be integrated straight into the training of a model. A few PyTorch libraries dealing with 3D modeling have been released lately that accomplish that, and a good deal of work in Julia is making promising advances. I’m curious to see how overfitting will be addressed, but there may come a time when large datasets become a thing of the past as low-level generation of data becomes just another component of a model’s architecture.
Synthetic data is a tool in a system of privacy-preserving analytics. I don't think a lot of companies would be comfortable with shipping their sensitive data into a model sitting not on their premises. The model, needs to generate good enough results that would make sense in a training/analysis scenario (doubtful) and preserve the privacy of that data (difficult).
People in the Privacy-preserving industry try to find the silver bullet in one technology, but the real solution is a combination of different technologies. Trusted delegation of computation + privacy-preserving techniques together solve this issue, but separately provide marginal value.
Honest question: does the NSA have good engineers? Like, FAANG good? It seems like a funny "hype line" for a startup, because I think of government engineers as ok, but not like, rockstars.
Most of the top engineers would probably be contractors. The government pay scale pretty much forces you to become a manager if you want to move past GS-12. Including locality pay, GS-12 salary tops out at a little over $100k a year [1].
I just looked at their website, and they don't mention anything about NSA, Amazon, or Google on the landing page, so I think it's just techcrunch adding it in as clickbait, even though the founders aren't going out of their way to advertise it.
The majority of the best engineers don't even work for a FAANG. FAANGs don't offer remote for the most part, and many of the best decided they wanted more. The world is a big place.
I think we're pretty good. Feel free to review the code and contribute as we release our open source packages over time. We all get better with time and collaboration.
Not entirely sure what this is yet, but I was looking for a resource of scientific data in nice formats and was very surprised I couldn't really find any. If this could do that in a way that doesn't upset any licensing agreements (I don't really know what I'm talking about in that regard), that would be pretty cool.
I needed a resistivity-temperature dataset for a tungsten alloy and ended up having to manually type up a series out of the book. Not fun!
I think this is a great idea. Perhaps the "Github for data" copy could use some work, but the concept of obfuscating real data for use in building systems without the overhead or concern of operating on real customer data is valuable. Of course, the degree to which the obfuscated data represents real data is important to ensuring the systems built like this are robust, but this seems possible.
I've been using Snowflake for a while now and the Data Sharing feature they have is weirdly close to what I'm reading here. Anonymization, masking a column etc. Anyone with similar experience as mine? Or anyone who's knowledgeable enough to compare this product with Snowflake for me?
I see "Laszlo Bock" is one of the founding members. He was industry veteran an exec at GM and Google (as a head of HR, as far as I know). He then started an HR company Humu (https://humu.com).
Wouldn't a data repository be considered more of a wiki than a VCS? I don't understand the desire to associate the concept with Github instead of Wikipedia, which I would propose is more appropriate.
I like this idea, but the article doesn't do a great job of spelling out why it's valuable.
Consider a SAAS with 25,000 active customers and tons of structured (database) data.
You have a bunch of people that need to work on the dev system and the closer dev looks to prod the better you are.
- Contractors in another country
- A team working on basic compliance with GDPR and CCPA
- Sysadmin team trying to manage backups/restores
When the contractors pull a version of the DB it needs to not have any customer data (emails, addresses, etc.) so there's a process that wipes all those out and fills them in with fake data.
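In practice that wipe-and-fill step is often just a small script using something like the `faker` package (the column names here are hypothetical):

```python
from faker import Faker

fake = Faker()

def scrub_customer_row(row):
    """Replace PII columns with plausible fakes before handing a dump to contractors."""
    return {
        **row,
        "name": fake.name(),
        "email": fake.email(),
        "address": fake.address().replace("\n", ", "),
    }

print(scrub_customer_row({"id": 42, "name": "Real Person",
                          "email": "x@y.com", "address": "1 Real Rd", "plan": "pro"}))
```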
When the GDPR team gets a data deletion request ("Please delete all my data my email is X@y.com") on Monday and the Sysadmin team restores from a backup from Sunday what happens?
Right now both of these actions are one-off things done with a mish-mash of scripts, organizational knowledge, and half-remembered processes.
So wouldn't this be better with a service that could talk to your DB and let you fork out versions that "knew" the current DB structure, that you could mark as purged of sensitive data, and that you could apply (and re-apply) transforms to for structure or data addition/removal?
It's not Github for data, but if it were, then one of the main issues is handling GDPR requests. Git is by construction averse to deletions in its history, so incompatible with hosting sensitive data. On the other hand it could be great to have deduplication in storage and versioning for datasets.
Well, Git itself is not a good tool for handling large dataset files. In most cases, you're not interested in deltas between commits. The size of your repo can also grow out of control pretty quickly.
As a dirty workaround, you have Git-LFS to do that for you. People tend to use it in repos with a lot of multimedia assets. This works well in many cases, but it has its own pitfalls as well.
@freeone3000 GitHub is amazing for building and collaborating on source code; we're building services to enable safe and private collaboration on streaming data (whether text, images, video, or your own format).
GitHub itself works fine for science. I see no particularly compelling reason to use a science-oriented service given that people are more likely to be familiar with GitHub.
There's code in that repository too. The code merges a variety of different data sources and performs some analyses. Nothing particularly fancy, and the code is probably not much better than average as far as academic code goes (which is not good), but I'm slowly adding tests and improving the code otherwise.
“We’re building right now software that enables developers to automatically check out an anonymized version of the data set,” said Watson. This so-called “synthetic data” is essentially artificial data that looks and works just like regular sensitive user data. Gretel uses machine learning to categorize the data — like names, addresses and other customer identifiers — and classify as many labels to the data as possible. Once that data is labeled, it can be applied access policies. Then, the platform applies differential privacy — a technique used to anonymize vast amounts of data — so that it’s no longer tied to customer information.
Anyone taking bets on how long it's going to be before these idiots end up leaking the SSN of every US citizen because their categorizer failed?
You don't have to be a machine learning expert to understand that no classifier is going to be correct 100% of the time. The laws against divulging PII don't contain exceptions for classifiers goofing.
Why would you need "to anonymize vast amounts of data — so that it’s no longer tied to customer information" or "appl[y] access policies" if the data contain no PII? Presumably the ML is anonymizing the data and the access policies are necessary because the data contain PII.
Yup, people tend to confuse concepts and refer to synthetic data as anonymised data. They are very different things.
Anonymised data or redacted data are transformations of a data set that one _hopes_ do not leak too much PII / sensitive data. People don't use ML to anonymise, but they do use ML to classify as a first step before splatting or generalising.
In that case, it's absolutely right that the ML classifier not being 100% accurate results in PII leaking.
This is a key reason why anonymisation and redaction are widely seen as problematic and are being replaced by synthetic data and, maybe in future, homomorphic encryption.
Homomorphic encryption, and any encryption-in-use technology, is no guarantee of privacy on its own. Synthetic data has the same dilemma of utility vs. anonymity as any other anonymization tech.
Synthetic data uses the distribution(s) of the underlying dataset(s) to generate a totally new dataset that in theory has the same statistical properties of the original dataset. The rub is in making sure that you're actually synthesizing a valid statistical representation of the original dataset, including joint distributions. Otherwise, you wind up with models that won't generalize back to the original data.
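As a minimal illustration of what "same statistical properties" can mean (and how limited it is), a Gaussian synthesizer that only preserves means and covariances might look like the sketch below; real products need far richer models to capture the full joint structure.

```python
import numpy as np

def gaussian_synthesizer(real, n_samples, seed=0):
    """Fit a multivariate normal to the real data and sample a brand-new dataset from it."""
    rng = np.random.default_rng(seed)
    mean = real.mean(axis=0)
    cov = np.cov(real, rowvar=False)          # captures second-order joint structure only
    return rng.multivariate_normal(mean, cov, size=n_samples)
```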
> I said, in good faith, that I would consider his product if I could inspect the running system and the code.
> He said several things: 1) the NSA never did anything illegal 2) the software was too large to audit 3) it was an insult to his employment in the NSA that I was even asking these questions.
Then he hung up.