I once took a call with an ex-NSA guy who was a CEO selling email security software: your MX would point to them, they'd scan your incoming email for exploits, then deliver it on to you. When I spoke to him, I expressed my concern, having worked for a large multinational corporation whose fiber optic lines were tapped by UK intelligence to sidestep the laws barring the NSA from spying on Americans, that if I couldn't inspect his software, I couldn't feel confident in the security and integrity of the scanning system.
I said, in good faith, that I would consider his product if I could inspect the running system and the code.
He said several things:
1) the NSA never did anything illegal
2) the software was too large to audit
3) it was an insult to his employment in the NSA that I was even asking these questions.
The NSA would never do anything illegal; if you have a problem with highly misleading and unethical actions being undertaken under flimsy pretenses established by classified memos citing dubious legal justification, then that's obviously your problem, not theirs.
Also, didn’t the director of the NSA perjure himself when he lied under oath during his sworn testimony before SSCI? No, sorry, he actually gave the “least untruthful answer” and then changed his answer when contradictory facts became public. I would have called that illegal but obviously the NSA has a legal theory and justification as to when they need to provide “untruthful” “facts” to the institutions exercising oversight.
The difference is that I knew he wasn't on the level when he said it, because the NSA has been sued and it's come out they've done things that were illegal, as found by a court of law, in public. That's just one example that we know of.
Neither is objective. There's no pretense of objectivity in the legal system. This is why you see people say things like "we'll find out if this was legal when they rule on the case".
Found to be illegal at the judicial level:
https://www.nytimes.com/2010/04/01/us/01nsa.html
then overturned by the 9th Circuit. Then Congress stepped in and changed the law to make the situation "clearer".
"In partnership with the British agency known as Government Communications Headquarters, or GCHQ, the N.S.A. has apparently taken advantage of the vast amounts of data stored in and traveling among global data centers, which run all modern online computing, according to a report Wednesday by The Washington Post. N.S.A. collection activities abroad face fewer legal restrictions and less oversight than its actions in the United States."
Note there's a fair amount of speculation on the specific details of how and what data is collected and shared.
There have been public claims about it, and honestly, thinking that they wouldn't do it once they have the power to seems naive.
https://www.zdnet.com/article/thatcher-ordered-echelon-surve...
> Ex-spy Mike Frost told the CBS 60 Minutes programme that Thatcher had ordered surveillance on two cabinet colleagues according to excerpts released on Thursday. The allegation comes in the same week that a European Parliament report said Echelon, a surveillance system run by the United States, Canada, Britain, Australia and New Zealand, was used for industrial espionage.
"pushed back"? Like how the director of the NSA "pushed back" on congressional questions of whether the NSA was broadly collecting any data from American citizens?
"Please don't post insinuations about astroturfing, shilling, brigading, foreign agents and the like. It degrades discussion and is usually mistaken. If you're worried about abuse, email us and we'll look at the data."
No it doesn't (at least not to me or the dictionary). Conspiracy denotes two or more people planning with each other to commit illegal, wrongful, or subversive acts. Whether or not it's hidden doesn't enter into it.
A lot of people are talking about a state of apartheid in Israel; let's not throw every such theory into the corner of anti-Semitism. And to be crystal clear, I'm saying this from a POV of Jewish heritage.
Not "more data than any of the others" - just a specific kind of access that usually comes with much closer intelligence relationships (e.g. Five Eyes), rather than the more wary relationship the Israeli and US intelligence communities generally have with each other.
Since when do spy agencies avoid doing things just because they're "not allowed" to do them? An activity being disallowed only means that they'll avoid telling people that they're doing it.
Auditing is a negotiation between the auditor and the auditee. The auditor rarely gets to dictate absolute terms (and in my experience, will often listen to well-reasoned and prepared arguments and plans from auditees).
Since I was effectively the CTO of a startup that cared about the security of its messaging, I think I made a reasoned judgment about the nature of the security of their product, and offered a way he could increase my confidence that he wasn't just sending a copy of my unencrypted email (the email has to be unencrypted for their scanner to work) off to who knows where.
I don't really find that rude. A cloud customer certainly can go to a cloud provider, say "you know, it's possible you have rogue internal actors, I've read articles that said you've fired SREs before who snooped on user data, can I see your audits that show you deal with insider risk properly?"
Yes, he sells himself on that experience (I had already done due diligence on a previous company he founded, sqrrl, which was organized around open source software, but touted the NSA creds):
Oren Falkowitz
CEO and Co-founder
Oren Falkowitz co-founded Area 1 Security to discover and eliminate targeted phishing attacks before they cause damage to organizations. Previously, he held senior positions at the National Security Agency (NSA) and United States Cyber Command (USCYBERCOM), where he focused on Computer Network Operations & Big Data. That’s where he realized the immense need for preemptive cybersecurity.
This post could turn into +5 informative if names and contact details were added for this unidentified "ex-NSA guy" and the "security software."
As-is, this post is becoming popular because people are replying with random experiences and hatred they have for the NSA and insecure systems. Instead, it could be helpful by damaging the reputation of a specific person who is peddling insecure systems.
When you hear a company talk about the promises of synthetic data, you should run far, far away. The fundamental problem is that in order for synthetic data to be useful for model training, the generative synthetic model must have already solved the problem at hand. Training on the synthetic data is just a charade: you would be better off simply extracting the model you need from the source generative model. For generative models with densities this is as simple as P(Y | X) = P(X, Y) / P(X).
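To make that concrete, here is a toy sketch (my own illustration, not any vendor's tooling) of why training on synthetic samples is redundant when the generator exposes densities: the conditional model you wanted is already one line of arithmetic away.

```python
import numpy as np

# Toy joint distribution P(X, Y) over 3 values of X and 2 values of Y,
# standing in for whatever generative model supplies densities.
joint = np.array([[0.10, 0.20],
                  [0.25, 0.15],
                  [0.05, 0.25]])

p_x = joint.sum(axis=1, keepdims=True)   # P(X), marginalizing out Y
p_y_given_x = joint / p_x                # P(Y | X) = P(X, Y) / P(X)

print(p_y_given_x.sum(axis=1))           # each row sums to 1, as a sanity check
```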
So, what do you get out of it compared to simply training models as a service for people? Almost nothing useful. All you get are:
- Much worse privacy guarantees. (People selling synthetic data love to talk about improved privacy, but it's actually the reverse. The privacy guarantees for synthetic data are much, much worse than selling direct models to people.)
- Much worse model performance. See the previous notes about how a synthetic data generating procedure must have already solved the problem at hand.
- A much more complicated setup with much more expensive model training. Training generative models is hard and requires a lot of data and compute due to the difficulty in learning such a complex outcome space. This can easily cost 100-1000x as much as simply training a straightforward xgboost model.
The reality is, you can guarantee certain values are not passed. That's fairly easy. What you can't do is easily block general trends in synthetic data, or else the model can't learn. So you have to be willing to accept that "leakage" when using synthetic data.
If you accept that leakage, then you can actually improve model performance in some domains. See Uber's architecture search blog; there's a lot of material (also from Uber) showing this.
Regarding the cost of training: yes, model training costs could increase, although I'd suggest by much, much less than 10x, more like 50% or something. Is that worth it for privacy?
> The reality is, you can guarantee certain values are not passed. That’s fairly easy
Nonsense. There's a fundamental requirement that for a model to have utility, it must have access to relevant data. If you truly find a leak-proof way to block certain values (which is borderline impossible in itself), then you make the model significantly weaker or outright useless if those values aren't uniformly distributed across the input domain.
And if you do have leakage (and you almost certainly do), then direct re-identification becomes a trivial problem and we're back to square one.
Synthetic data generated from statistical analysis of real data is either worthless or leaky. One of those two conditions is formally guaranteed to be true.
GP is correct that a viable synthetic data generator basically has to have already solved your problem for you, and in that sense it just becomes your model which is trained on real data. Training an additional model on top of that model doesn't add any privacy or mitigate reidentification.
What specific privacy guarantees are you hoping to provide and how do you know you are satisfying them? That is the key to basically all of it and most work on this subject falls flat (except for differential privacy, but current evidence indicates that it doesn't work very well for generative models).
Guaranteeing that certain values are not passed is both useless and trivial. I can easily satisfy that by simply adding 0.1 and a space character to all values in my dataset even though that doesn't remove any of the sensitive data.
The idea that the synthetic generator model must have solved the actual modelling problem is an attractive one, but it doesn't correspond to what people want data for: they want to eyeball it, see what's in it, test some algorithms, and figure out how they might approach the problem. You can do that very well on realistic synthetic data (much better than with any other privacy tech), even if the synthetic data has lost some utility through its statistical approximations.
The idea that training on synthetic data is a “charade” misunderstands the usefulness of having realistic, “drop in compatible” data that works with your existing code or models.
The ideas of training models as a service, and of working directly with synthetic data generators to "extract" from them, are great but incompatible with (a) the complexity of real-world DS workflows in regulated industries and (b) data scientists' current code/workflows/techniques.
If I’m a bank, I’m not going to give you my fraud rules and you’re not going to solve my problems with xgboost. Access to models is just as locked down as access to data.
This is why it’s useful having an intermediary. Like ... a generative model that you can train where the data is and then copy over to where the modelling is happening.
Assuming the problem at hand is analyzing the data.
I don't know enough about what they're building to understand what use cases they have in mind. (Their blog says "developing new features or exploring insights", which is a bit vague. See https://medium.com/gretel-ai/gretel-readme-fd0c4eff8a09 .)
But if this allows people to take their production database, run a command, and generate fake data they can run integration tests against, then that's not analyzing the data. It's making it easier to run tests in an environment that is as close as possible to production, because data is part of an environment.
Aside from that, I'd argue that some kinds of analysis may even still be possible. Let's say you have an e-commerce site, with one database table for user accounts and another table for shipping addresses. And let's guess that Gretel anonymizes this data by creating a user table with different usernames and an address table with made-up shipping addresses, but the overall structure is isomorphic: for every row in the real data, there is a corresponding row in the fake data, just with different values in it. And the (synthetic) keys aren't private (they're data you generated, not collected from the user), so they can be preserved. Then you can, on the fake data, run a query and find out what percentage of users have given a shipping address.
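Assuming the keys really are preserved one-to-one, a minimal pandas sketch of that kind of query on made-up tables might look like this (table and column names are hypothetical):

```python
import pandas as pd

# Hypothetical synthetic tables that mirror the real schema and keep the keys.
users = pd.DataFrame({"user_id": [1, 2, 3, 4],
                      "username": ["ab12", "cd34", "ef56", "gh78"]})
addresses = pd.DataFrame({"user_id": [1, 3],
                          "shipping_address": ["123 Fake St", "456 Synthetic Ave"]})

joined = users.merge(addresses, on="user_id", how="left")
pct = joined["shipping_address"].notna().mean() * 100
print(f"{pct:.0f}% of users have a shipping address")  # 50% on this toy data
```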
Of course it's just a guess that that's what they're aiming to do since so little detail is available. It's entirely possible that what I described isn't what they're planning to do. Maybe it doesn't sufficiently protect privacy.
But the point is, until we know what they're building, I don't know if we can conclude that people should run away.
I got your point and it sounds good, but let's take your example of the shipping addresses.
Let's say you want to develop a model that helps find the best route for a UPS driver. The real data may contain addresses all based in New York because your customer's business is located there. The fake data contains addresses scattered all around New York.
I guess the model you train with the fake data won't be any help, because instead of the driver doing 5 locations in an hour, he has to drive 5 hours for the same amount.
Gretel needs to know the context of the data and the importance of the relationships between the data rows. Pure anonymisation won't be enough.
While generally I agree with your conclusion (synthetic data doesn’t have better privacy guarantees, probably will hurt training if you use it naively), I wouldn’t be so pessimistic.
At the risk of digressing from TFA, Candes’ knockoffs, for instance, are an example of a (theoretically) successful use of synthetic data for model robustness. Still need original data, of course.
Basically, the broader point is that you don’t need to solve the full problem of joint likelihood estimation to use generative models effectively, e.g., GANs are another example.
If you take the MNIST dataset and generate synthetic data that has variations on the original images, e.g. rotations, added noise, different background images, etc. you will get a higher model accuracy than the baseline model in exchange for a longer training time.
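For example, a minimal torchvision-style augmentation pipeline along those lines (rotations, small shifts, added noise; the exact transforms and magnitudes here are just assumptions) could be:

```python
import torch
from torchvision import datasets, transforms

augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),                      # small random rotations
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),   # small translations
    transforms.ToTensor(),
    transforms.Lambda(lambda x: (x + 0.1 * torch.randn_like(x)).clamp(0.0, 1.0)),  # additive noise
])

train_set = datasets.MNIST(root="./data", train=True, download=True, transform=augment)
```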
Maybe I misunderstand something, but if x -> f(x) is easy to compute but you want to learn f(x) -> x, then isn't synthetic data exactly what you want to be using?
Example: Training an image upscaling algorithm by feeding it downscaled images. In this case you don't even need to train a generative model (the algorithm is known), but it should illustrate that the generative task can be extremely easy compared to the target task. You can't just handwave that away with "just divide by P(X)".
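A rough sketch of that setup, assuming a folder of high-resolution photos and using Pillow for the (known, cheap) downscaling step:

```python
from pathlib import Path
from PIL import Image

def make_pair(path, factor=4):
    """Build one (low-res input, high-res target) pair for training an upscaler."""
    hi = Image.open(path).convert("RGB")
    lo = hi.resize((hi.width // factor, hi.height // factor), Image.BICUBIC)
    return lo, hi

# "photos/" is a hypothetical directory of real high-res images.
pairs = [make_pair(p) for p in Path("photos").glob("*.jpg")]
```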
What about synthetic data that is correct? Example: using the output of a physics simulation to train... The output is synthetic (as in, it's not from the real world) but it is governed by the equations of motion...
Going from the parent comment's assertion, the answer would be that if you want to train something to predict physics, the original physics simulation that's generating the data is either:
1. Already doing that, in which case it's already the superior model.
or
2. Not doing that correctly, in which case the synthetic data output is equally poor.
Consider computer graphics and computer vision. If you had a photorealistic renderer, it could be useful for generating training data for a vision algorithm. However, you can't just run your rendering pipeline backwards to create a scene out of an image.
Yes. A typical example would be the use of statistical emulators in the environmental sciences, emulating process-based (physical) models that are computationally complex.
This is the one that seems to be useful today. Running simulations with physics being modeled across objects seems to be useful for reinforcement learning algorithms that interact in the real world. You see some of this with autonomous vehicle training, robot manipulation training, etc. I think there are also emergent interactions that can occur that you wouldn't have modeled directly unless you let the simulation run.
Right, but presumably we don't want the model to learn those, since they're an impractically low-level model. If you're training an AI for flying a plane, you want it to learn "the simplest model that works"—presumably something like aerodynamics—not a far more computationally-intensive (and data-intensive!) model, like e.g. quantum chromodynamics.
The example is fundamentally different as physics simulations have closed form solutions or can be numerically approximated for various initial and boundary conditions. Therefore, by definition the generation is not synthetic.
Other examples I could think of fall in the category the top comment is referring to.
> When you hear a company talk about the promises of synthetic data, you should run far far away. The fundamental problem is that in order for synthetic data to be useful for model training the generative synthetic model must have already solved the problem at hand.
I've heard this group is pretty good, and they seem to be proud enough of their synthetic data to publish papers on it:
I'm being pedantic by posting this though, because their approach involved learning the synthetic data generator simultaneously with the classifier/whatever that you're training. That is not relevant for a static synthetic data source.
@lalaland1125 you're right that privacy is a hard problem. We're really excited about making techniques like synthetic data, data labeling, and analyzing data sets for privacy risks (re-identification risk, etc.) available to all developers. We'll be open sourcing some of our work in the next few weeks; feel free to jump into the code, and we'd love to hear your thoughts!
I don't know, if your generative model is something like getting the knn of each data point, then fuzzing each point by the local statistics of each point (from the knn's), I don't think you will miss the important trends and you get a bit of differential privacy.
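Something like this, for example (a rough numpy/scikit-learn sketch of the kNN fuzzing idea; k and the noise scale are arbitrary, and this is not formal differential privacy):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_fuzz(X, k=10, seed=0):
    """Jitter each point of a 2-D array X by the spread of its k nearest neighbours."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k).fit(X)
    _, idx = nn.kneighbors(X)               # neighbour indices, shape (n, k)
    local_std = X[idx].std(axis=1)          # per-feature spread of each local neighbourhood
    return X + rng.normal(scale=local_std)  # noise scaled to the local statistics
```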
Might you or someone else say what this specific area of ML generative models/synthetic data is called? Are there any introductory references you could share. Thanks.
You’re coming at this from a theoretical approach, which isn’t appropriate for deep learning because nobody has any idea how or why these neural networks actually work. Generating a bunch of rotated and stretched versions of your labeled images DOES work, and very effectively. Why? The neural network is an inscrutable God, we just have no idea.
I started a company with the same premise 9 years ago, during the prime "big data" hype cycle. We burned through a similar amount of investor money only to realize that there was not a market opportunity to capture. That is, many people thought it was cool - we even did co-sponsored data contests with The Economist - but at the end of the day, we couldn't find anyone with an urgent problem that they were willing to pay to solve.
I wish these folks luck! Perhaps things have changed; we were part of a flock of 5 or 10 similar projects and I'm pretty sure the only one still around today is Kaggle.
Sometimes an idea has a time. It may seem obvious now but it isn't like Instagram and Snapchat were the first image sharing applications developed. Slack was nowhere near the first chat app.
I happened to have a long discussion on the topic of data businesses last night with a friend. We brainstormed datasets that would be both hard/expensive to obtain and resellable to thousands of customers willing to pay a high price for them. I don't want to get involved in datasets that are easy to obtain (too many competitors, no bar to entry) or datasets specific to a particular company (too much dependence on a small set of customers, cost of acquiring new customers also includes cost of acquiring data, no economy of scale).
It's easy to start with the tech problem: how to collect, clean and analyze data. But reasoning backwards from the business side is much more difficult. Expensive data I can sell once feels easy. Cheap data I can sell frequently feels like a race to the bottom. Expensive data I can re-sell 1000s of times to a niche audience feels like a perfect middle ground ... I just can't think of any examples.
I think the machine learning problem is not data. It's not models. It's not compute.
It's annotation. That is a workforce problem. You want to automate contracts? You need attorneys. You want to automate radiology? You need radiologists. You want to automate driving? You need drivers.
This makes ML less like a SaaS business, more like a mining business. There's tons, literally tons, of data/ore for any interesting problem. That's why it's an interesting problem. There are buyers of iron, gold, and marble. There are buyers of driverless cars, physician decision support systems, and contract automation solutions. But recovering the data from the mine (digitization) and enriching it (annotation) cost money. So much that the market variation may make it lucrative at some times and not others. If you are near peak employment, the value of a model goes up, but the cost of annotators is high also.
I'm not sure how finance guys capture that problem: how do you make a profit when there's high demand on both sides at one time, and low demand on both sides at other times? I submit that when both sides are low is the time to do annotation, and when both sides are high is the time to sell models.
But then you need an investor who can ride out the market.
Isn't it more like "nearly always"? It's pretty hard to find examples of things that haven't been tried multiple times before in some variant. You can argue if it was timing or execution that worked "this time" of course, but almost nothing happens in isolation.
Maybe? Hard to see past survivorship bias. My intuition says some ideas will never see their time. Hard to quantify how often that is the case as a percentage of all ideas.
I imagine black box data from plane crashes, or in general data that comes out of a tragic event that no one can or would want to replicate, but is otherwise extremely valuable.
Of course, if you can sell it to one person, they can just pass it off to others, so this will quickly turn into a DRM business profiting off tragedy. Probably not a good idea.
I had a friend tell me recently about a client using commercial real estate data for lead gen. He mentioned https://compstak.com/
Basically, identifying companies that are doing well / expanding by how big the space is they leased. This sort of data is apparently very hard to get, but gives users a competitive advantage.
Real estate data, and companies like compstak are exactly the kind of niche markets I'm talking about. Agents are willing to spend large sums to get access to this data and it can be resold multiple times. Unfortunately it is also a market full of existing competition with some established players.
What other markets for data are similar? In general, data that leads to prospect generation is desirable because sales agents are willing to spend money to make money. Are there any other markets like that?
So it sounds like the salient aspect here isn't necessarily the type of data, but the manner in which that data is collected. Looks like compstak's success is a result of creating a platform that facilitates crowdsourced data points that are difficult to acquire using traditional data collection approaches...that scarcity is what makes the data valuable, especially since that data can be used for leverage in a negotiation. Also, they appear to prop up the overall scarcity by only granting access of existing data to users who provide new data.[1]
I'm curious how they figure out how much to charge companies for this data? And also how they stop real estate insiders from gaining access without sharing new data?
The medical markets, but I don't want to go any further because that's what I'm doing right now :p
SMART on FHIR is a newish standard for medical applications that is getting a HUGE push from large companies like Cerner, Epic, along with all the tech giants. Hospitals are itching for more FHIR apps that can integrate directly into their Electronic Health Record system (and web apps be delivered directly on a doctor's web portal within the hospital's IT system).
So that might be a good place to start poking around...
A really good example of this is labeled medical imaging data.
Some key contributing factors: multiple stakeholders and consent/approval issues, legal and technical constraints on access, and, depending on the application, labeling that may only be possible using very expensive experts. Lots of human interaction.
I agree it is expensive to gather but is it something that can be sold at high cost to 1000s of customers? It seems the market for purchasers of that data might be limited to a small number of companies, probably hoping to build ML models.
The question I responded to was what made it hard and/or expensive to obtain, which I think I answered.
Commercial viability of doing so for profit is a different issue, but I see that's the other part of your original comment. It's not an obvious answer, partially because there are a lot of different scenarios within that blanket "medical imaging", and what the putative customer might want to do with it.
Yes, I should have followed the thread better. My mind is focused on a particular kind of commercial viability which is niche markets (in the 1000s to 10,000s) willing to pay for access to data.
One example is that it could require a significant amount of human legwork - e.g. Google street view. Another example is it might require significant dev effort to clean and combine several raw data sets into a refined output data set.
My first thought is specific business industry analysis data. I've often been an hour into an online deep-dive only to hit a paywall related to this. However, I'd think it would be hard to acquire the valuable aspects of this data without some kind of insider access (compared to web scraping, creative api mining, etc).
New data source needs seem to pop up out of nowhere - what about building a platform that would facilitate the "collect, clean and analyze data" aspect of this for non-technical business owners?
> what about building a platform that would facilitate the "collect, clean and analyze data" aspect of this for non-technical business owners?
One of the problems with this is how custom each data set for each client would be. My mind has been on this topic since the a16z article on "The new business of AI ..." [1] which was posted to HN in the last couple of days. The key idea is the question of how to decouple the process of collecting, cleaning and analyzing data from the process of acquiring customers for that data, and not how to solve the problem of collecting, cleaning and analyzing data. Developers want to solve the technical challenge (how to build the processes) but not the business challenge (how to find customers willing to buy the resultant data).
I do believe there is a market for start-ups to partner with existing companies to help them wrangle their data. It just isn't the market I'm thinking of.
Sounds like you're thinking of creating a gold-mine business model for data....do the hard searching/digging up/processing of rare and valuable data, then sell it at a premium.
A few questions: What types of businesses would be "customers of that data"? Brainstorming all potential customers separated by industry would be a good start. Are there any "data purchasing" trends you've seen lately?
> What types of businesses would be "customers of that data"?
That is the exact question! That is the wall we hit. If I could consistently answer that question then there is a business to be had.
Who is willing to pay for that kind of data? I actually considered getting together a larger group of friends to do that exact brainstorm. But even then I'm not sure it is such an easy question to answer.
Also a bit funny you called it a gold-mine business. I called data that meets that criteria Goldilocks data.
Not OP but I think he means market research data. I've thought about this as well. You pay some researcher some money to write a report on growth of a particular segment of an industry. Trade groups often do this and you hear stats like "Mobile usage expected to grow by X% in developing countries over the next Z years". But it is a multi-page report, probably including graphs, on some particular topic.
It matches roughly the kind of data I was talking about. It is expensive to generate since you have to pay a researcher some amount of money to write the report. The resultant report generally can be re-sold multiple times.
My problem with this kind of data is that you will be competing against AIs pretty soon which will drive the cost to generate such reports down. And the price you can charge per report will be tied to how good a report you are capable of generating. It is also a saturated market already so the real play is driving the cost of generation down, not what I want.
Instead of selling reports, would it make more sense to create dashboards that let users slice + dice data and view the insights?
In other words, instead of being a Gartner, focus on being a Crunchbase. That way, you can sell to both the end users of these insights (the companies in these industries) as well as the market research companies, themselves.
Yea this idea has been tried and never found a market at least a dozen times over. It's not impossible for the market or execution to change (e.g. Dropbox) but given the recent software is eating the world boom, it's hard to imagine that the market has changed dramatically so recently or their execution is going to be so much better than other teams who have tried and failed.
I can almost guarantee you that they are building this to address some specific NSA problem and their entire business strategy hinges on getting a massive DoD contract
I'm a bit unclear on the goals for this startup. When you say "Github for Data" I'm thinking of a repository of datasets used for ML training or for more traditional research. But this:
> This so-called “synthetic data” is essentially artificial data that looks and works just like regular sensitive user data.
So it's like a Lorem Ipsum generator for data? What's the use case here besides building apps with sample data? Notwithstanding potential privacy concerns, how am I confident that this is realistic if you literally say it's generated?
General data repository for research with some mechanism to ensure cleanliness or integrity sounds much more useful to me.
As you have noticed, "Github for Data" is not what they're building. At all.
Our company is actually building Git for Data and Github for Data. We have an open source database called Dolt which combines the commit graph of git with the relational tables and SQL of MySQL:
Dolt lets you version, branch, and merge your dataset so that you can collaborate on it with others. Dolthub lets you share your dataset with the world, submit PRs, fork other people's repos, and lots of other analogous features to Github.
> A mental retard who is clueless not only about current events, but also has the IQ level of a rock. "Dolt" may be the most sophisticated insult in the English language. Dolts commonly populate such stereotypes as jocks, nerds, fruits, bookworms, and dorks.
True, but the name `git` came from the open source SCM tool, written by Linus for the Linux source code - and now everyone is sort of stuck with that name. This is a commercial product, deciding on a name for themselves.
> How am I confident that this is realistic if you literally say its generated?
This is a particularly good question since it's recently been shown that even neural nets trained on real data often pick up substantial, predictable dataset biases.
Practically every single-dataset-trained CNN seems to pick up stylistic quirks in the photos or labels it's trained on. The most visible result is that the CNNs perform better on same-dataset test examples than they do in the wild, sometimes vastly better. More startlingly, it's possible to work backwards from this: the training source of a "finished" CNN can be discerned by looking for certain types of error, and adversarial examples can be predictably constructed based on training source.
Tagged imagesets undoubtedly have stronger and harder-to-remove 'fingerprints' than text data like addresses, but I'd be shocked if the problem was nonexistent for text. My first reaction to "synthetic sensitive user data" for ML is to worry about winding up with systematic errors coming from the generation scheme.
The concept is that you're carrying over the same general topology of the real data but in a way that is effectively non-sense. This allows you to build ML models that are representative of the parameters of the true data which you can then use for inference systems in production.
It's a technique we've used in the DoD for a long time, and it works OK when everything is perfect. There are a lot of boundary problems, like being able to troubleshoot bad data if you did your initial analysis with the transformed data, having data scientists actually grok the problem set since it's abstracted, etc...
Edit: It's worth noting that this is a technical solution to a policy/legal roadblock. As organizations mature into better data governance, they are pushing more fundamental changes to governance that gets at these problems where solutions like this will no longer be necessary. For example, hiring data scientists into the groups that have access to the raw data (in our case hiring data scientists and giving them security clearances).
Except that if their models have already inferred sufficient structure from your data to do this in a way useful for training, then their models have already solved the problem your models are trying to learn.
In health care AI there is some tendency to use generated data for training. The idea might be that org A has real patient data but for privacy reasons cannot share it; if they create a sufficiently strong generator, they can share that instead, and org B can train their classifier without ever accessing sensitive data. Alternatively, it can also be used if you simply have too little data and need augmentation.
This just seems like something that will catastrophically fail. If you can build a good enough generator you can just build the ML model internally. And if you can't the statistics of what you provide are going to be off enough that any strong model is going to be wrong in strange ways.
This is an incredibly important point: in order for your synthetic data to be useful, your simulator must have already solved the problem at hand. In theory there is no need to even fool around with generating the synthetic data and going through the charade of training a model on it; simply extract the outcome model from your simulator directly, as that's implicitly what you are doing. For example, if you have a generative model that provides densities, you can simply compute P(Y | X) = P(X, Y) / P(X).
But this is not how generators work. They generally produce samples in the form
G: Q -> (X,Y)
where Q is some prior from which you are sampling. If they are not invertible, then you straight up cannot get P(X, Y) out of the generator. Even if it is invertible, getting P(X) requires integrating out the Y, which might be infeasible (since the model is not integrable in closed form and is sufficiently fast-changing that you need very, very many samples).
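Written out (my notation, just restating the comment above): the generator only gives you sampling access, while the argument needs densities, and even with the joint in hand the marginal still requires an integral over Y.

```latex
% Sampling access vs. the densities the argument needs:
G : Q \to (X, Y), \qquad q \sim Q, \; (x, y) = G(q)
% Even with an invertible G yielding P(X, Y), the marginal still requires
P(X) = \int P(X, Y) \, \mathrm{d}Y
```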
Very true. If you've solved the labeling/extraction problem using a means other than ML, you can use that means to generate synthetic data. The situation at my company is exactly this.
Say you use regular expressions to extract sensitive data from standardized, but numerously varied, form documents. The pieces of information extracted are very common classes of data: first name, last name, dates, physical locations.
During the extraction process you can save the complement of the extraction (the "leftovers") and insert generated data at the extraction points. Also, because you've extracted the actual sensitive data, you can exclude that from the set of values used for generation, if it's practical.
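A stripped-down sketch of that replace-at-the-extraction-points step (the patterns and fake value lists here are toy placeholders, purely illustrative):

```python
import re
import random

# Toy patterns; a real system would have one per field in the standardized forms.
DATE = re.compile(r"\b\d{2}/\d{2}/\d{4}\b")
NAME = re.compile(r"\b(?:Alice|Bob) [A-Z][a-z]+\b")   # stand-in for a real name matcher

FAKE_DATES = ["01/01/2001", "02/02/2002", "03/03/2003"]
FAKE_NAMES = ["Jane Doe", "John Roe"]

def synthesize(text):
    """Keep the 'leftovers' intact and drop generated values in at the extraction points."""
    text = DATE.sub(lambda m: random.choice(FAKE_DATES), text)
    text = NAME.sub(lambda m: random.choice(FAKE_NAMES), text)
    return text

print(synthesize("Alice Smith visited on 03/14/2019."))
```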
Sometimes people get so caught up in the math and theory that they fail to see the practical solutions.
I agree that this is very tricky. I think the most interesting synthetic healthcare data generation I saw used causal inference (where SMEs can bake in a bunch of expert knowledge during skeleton construction) and then generated data by getting the weights on the edges from a smaller dataset. At the same time, it is very hard to ensure that your synthetic dataset actually reflects the real world. On one hand, SME knowledge might give extra oomph to synthetic data generation (as this knowledge is equivalent to some highly abstracted training), but if the "expert knowledge" is wrong then it's a recipe for disaster.
> In health care AI there is some tendency to use generated data for training.
Which is part of the reason for the high failure rate.
Good governance and data access for health data is a very hard problem. Good labeling is also hard/expensive in this space.
So there is an incentive for people wanting to do ML/AI without solving above to try any kind of shortcut they can think of. This incentive doesn't help solve any real problems.
The classic solution to "too little data" is use a simpler and/or less discriminating model. It's still the only one with a good track record.
Transfer learning is nothing like a silver bullet. True, it has become an important work around but it's no panacea and the track record is at best mixed.
People already use it quite a lot. More importantly they misuse it a lot. I'd be less concerned with increasing the usage, and more concerned that the people using it understand the implications and trade offs.
Gartner estimates that by 2022, 40% of AI/ML models will be trained on synthetic data.
It may have some drop in utility today compared to real data, but there are all sorts of scenarios where that's outweighed by the speed and ease of working with fake data rather than real data tied up in red tape.
The "mays" in this statement are doing a lot of work.
There are a few areas of current practice that have this feature: a) the arguments & evidence for it being worse are pretty simple and b) the arguments & evidence for potential benefits are either weak or very convoluted. This is never a good sign.
I think this happens mostly because the reasons these things are being done are, for the most part, not technical, but the technically oriented people involved don't like to think about it that way and would rather talk about technical solutions - which is operating at the wrong level.
The business & cost cases behind not doing this "right" in some abstract sense are pretty clear too, though. I wish more people would just be clear about this, and spend less effort obfuscating and more in clearly quantifying the cost of these workarounds.
Any time you hear someone starting off by saying things like "we don't really need good labels", "this synthetic data will be better, actually", "we'll use transfer from X because it's already done most of the work", etc., well what follows is quite likely to be good fertilizer.
Note, I'm not saying these approaches don't have value, just that there is an awful lot of magical thinking going on around it, and a lot of failures due to that.
There’s only one “may” there but yes, it masks “potentially losing crucial information”. The post I linked to is pretty clear on that.
I totally agree that the business need is the driver and that people miss the imperatives if they look at it purely from a technical or mathematical lens.
In a sense, synthetic data is the least bad actually viable solution. The democracy of privacy / data agility :p
I was referring to both "it may have some drop in utility today... " and "in future it may well perform better than ..."
My issue is not that people just miss the imperatives, but that they also misapply effort because of it. Accept there is a cost and try and quantify the impact. Make intelligent risk management decisions based on that. Sometimes that decision is "this is unlikely to work, what else can we do".
I wonder if "synthetic data" could be used to proof ML? Ie, rather than start from a huge data set that you _think_ is clean enough to know what a Car is (or w/e), you could use a data set that is perfectly clean and the ML set could be tweaked to give the expected result.
Of course, that would still be hairy as you'd have to ensure your ML still performed on real data sets. All the clean data would do is allow you to write unit tests of sorts for your ML with more confidence than the all-too-common unclean real data.
No idea, just making stuff up. Interesting thought.
It seems to be targeting pre-production startups/companies who need data that is structured and appears exactly like the real APIs they intend to roll out in production, from which to build their products. Then when it goes live, they switch to the real data feeds, which this company's API directly mimics.
That's what I'm guessing. It could be used for AI stuff but also other useful datasets that are closed off or require special access.
I'm guessing a lot of defence contractors and other heavily regulated industries (ie, healthcare, insurance, pharma, etc) have similar problems in the dev process of not having access to real data. This was the leading pitch:
>> Data is valuable for helping developers and engineers to build new features and better innovate.
I would be really interested in a "github for data", just a place you can upload structured data of any kind with a bit of meta tagging for public consumption.
SDR captures
PCAPs
Microphone captures
Public tax records
Road data
Flight plans
etc...
Just anything and everything on one service with the limitation that it's all open data. No license agreements or legal restrictions.
It would probably never fly, but it would be amazing.
I don't want a github for data, I want a git for data. The one reason that POS Perforce is used in game development is that it can handle binary files, especially large ones. LFS is... OK.
And this company isn't really even aping GitHub given that they are generating the content too.
Although, having said that, when you say data to me I do imagine some level of immutability. Logging scientific results using git would be pretty good if required, in the sense that there is a habit of just using folders and text files, which is fine at the time but is really, really hard to take over sometimes (like code written by scientists with their single-letter variable names and hatred of functions, in my experience with Fortran - yuck).
We’ve released our branched versioning system for data at http://dvid.io. It’s entirely open source and uses a science HTTP API with pluggable data types and a versioned key-value backend. I’m currently developing a new backend for it: DAGStore, a lower-level ordered key-value store that will have an explicit immutable store and is tailored for distributed versioning of big data.
Recording outputs of analytical runs or deep learning models; the ability to say "this is the output of program X v <git hash> run on the output of program Y <git hash> whose source data was Z <data-git hash>" for all the reasons of repeatability and auditability needed for actual science.
I’ve met several ex-NSAers that used their time there to build their brand successfully.
At the end of the day, the dragnet surveillance decisions came from the highest levels of the Bush and Obama administrations, not the boots on the ground.
"Historically, the plea of superior orders has been used both before and after the Nuremberg Trials, with a notable lack of consistency in various rulings."
Most people who were members of socially bad organizations throughout history weren't bad people or doing bad things, but willingly joining and serving a support role for such an organization is itself morally dubious. It's actively helping enable their activities.
For example, if someone got a job on a shark finning boat maintaining the knives, they're still complicit in shark finning even if they've never touched or seen a shark in their life.
And even if their actual role was spying on and helping detain Americans for arbitrary reasons, nobody's going to slap that on their resume. They'd say they were a data aggregating administrative officer.
Google, Facebook, ... every time we talk about these things the US position seems to be Germans are privacy nuts. It shouldn't be surprising that we think people in the US don't see privacy invasions as a serious problem / accept them in the name of a "higher purpose".
They should have lesser, though still real, baggage attached to them. There are definitely a few types of people in computer science who refuse to touch FAANG employees, but it's not as widespread as I wish it were.
There's so much they're responsible for that it seems like it'd be excessive to post all of them, and HN has a character limit, anyway. Even the most well-known ones would fill a couple of comments.
And 125,000,000,000 globally, along with almost a hundred billion internet requests per month in a time when the Internet was much smaller, years ago, under a program that hasn't stopped:
This definitely damages everyone, and I think more fundamentally than adtech companies. You can escape adtech companies: use different websites, install an adblocker, block trackers, whatever. There's no obvious way outside of strong encryption and hermitry to avoid NSA getting its paws on every single public or private communication you ever make, given their record. When privacy doesn't and cannot exist, human behavior changes deeply. I think it's the greater of two evils.
Adtech companies definitely aren't good, by any stretch, but I think it's the difference between Genghis Khan and a common killer: sure, they both want to kill you, but the scale and the level of cruelty are much different.
Sure, but if one group is operating in an unacceptable fashion along the lines of "We're stealing candy from babies!" and the other is "We're not only a risk to your business, your customers, and you, but also to the entire world, and also we have actually pretty legitimate ties to murder and also coups and also maybe war crimes a little," the first is child's play in comparison.
If I ever am, the NSA in someone's history is a solid red flag. Be nice or go home. Willingness to participate in all that makes me wonder about your character.
Facebook is almost as bad, depending on when and for how long; the more recent or the longer the stay, the worse. Again, be nice or go home.
Google is a little more reasonable, especially if they were only ever on the app/project side rather than the main service.
Adtech experience, I'd have to ask if they'd washed their minds out and won't even consider any of those tactics ever again... on second thought, almost as bad a flag as NSA crap.
Even if it's only a small drop of resistance in an ocean of "yeah, ok, whatever", I feel like we should signal that certain behaviors _against_ your fellow humans are poorly regarded.
Speaking generally, I view anyone who works for a very problematic company or agency in a worse light for it, regardless of whether they're at the bottom or top of the ladder.
Obviously, the higher up the ladder a person, the stronger this effect is. However, that someone is willing to work in any capacity at an organization that behaves badly (in my view) says something about that person.
That's not what I said, but perhaps you have to define what you mean by "hold them responsible". As I interpret that phrase, no, I wouldn't hold them responsible as such, but that they are willing to work there in any capacity says something about them.
The market opportunity is very large, but also insanely difficult to tap.
For one, you have an uphill battle on trust. Customers have to trust your data is secure, and btw it’ll never be 100% secure by many standards.
On the other hand, people have to trust the synthetic data is good enough to use practically.
So you have to both convince management and convince engineers. Arguably management is easier to convince, but.. best of luck on the endeavor!
That’s not even discussing the technical challenges - I’ve implemented this all technically and have had it deployed to production systems. Building a robust system that is secure and produces valid synthetic data is a challenge.
I see 'Github for data' but I'm reading services to hide PII. I think a distinction needs to be made here. Is the primary goal to enable Change Data Capture on data (Github does CDC on code), or is the primary goal to manage PII?
Hey, thanks for the great question. So the "Github for data" is referring to the ability to collaborate on data. By streaming in data, you can view discoveries we made on the data (entity recognition, etc.) and then essentially make a new version of that data with automatic transforms, anonymizations, etc. So you're absolutely right: managing PII is part of it, but really it's about enabling entity A to share data with entity B with a high level of confidence that the sensitive data is stripped out.
We'll be releasing some of the packages to do the analysis and transformations as time goes on, so stay tuned for those so you can take them for a test drive yourself.
The problems with a 'Github for data' are the 7 'V's
Volume: too much data to have usable diffs and merges
Veracity: How do you know which branch of data to commit?
Velocity: The data coming in is a stream - batch processing does not cut it
Most companies end up creating a federation, not a true Data Mart.
A lot of the discussion here seems to be about the methodology and value of the "synthetic data" concept, and squaring that approach with the analogy to github.
It feels like we can tease those two things apart:
1. Is a github-style website/service of forkable data sets useful?
2. Are anonymized, synthetic versions of those data sets, created via ML, useful?
Feels like the answer to both is "yes"?
(Also makes me wonder if there's a "rebase" equivalent for data in this sort of world...)
Great question. Frankly, you shouldn't, and your developers should do it for you. While we'll be offering a service to enable this type of obfuscation via REST APIs (where we don't store or write any of your data to disk), we'll also be releasing some of these techniques as open source packages so your devs can kick the tires themselves without shipping data into someone else's service.
the same reason you trust external companies to do all sorts of other stuff for you: specialization. in the average case, they can probably do it better than your developers because it's their primary business.
Sure, but we are talking about a twofold trust problem here. Trust in an entity (external company vs internal developer) and trust in the work itself.
While I do see your point about trusting an external company which is specialized in the problem I’m trying to solve more than my own developers, I still have to transfer my highly sensitive data to them for which I have to trust them even more.
"How do the CIA, NSA, DHS, and FBI all share critical data without breaking the law with regard to specific details about certain types of collection targets (like citizens?)"...
FEDRAMP it. Now run that as a service inside a secure cloud environment. You need enough runway for the FEDRAMP / engineering / sales process plus a contract win and then I'd imagine the income gets pretty steady.
Commercially? Unsure of the use case. I'd imagine as those sharing data are typically in competition. Not so in the government / intel community / finance space and I'd imagine you have to write a metric ton of policy and sign a bunch of MOUs to do this kind of stuff properly. People in government do care a great deal about these policies, believe it or not.
This is also a huge problem to solve in the "Know Your Customer" / Anti-Money Laundering space for financial institutions, where sharing data between companies or government and companies is often prohibited and/or really difficult. See the recent FCA TechSprint for more on this: https://www.fca.org.uk/events/techsprints/aml-financial-crim...
Lots of talk of "Homomorphic Encryption" and "Encrypted Cloud Runtimes" as options, but if you don't really need to share all of the data to get to an outcome (but rather synthetic data - though synthetic identity seems to be the hard bit here...), that could be interesting!
Hey all- we just released code on GitHub and a research post on Medium demonstrating how to generate synthetic datasets from models trained using differential privacy guarantees, based on rideshare datasets. Please take a look and let us know what you think!
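For readers new to the topic, the building block behind many such guarantees is noise calibrated to sensitivity and epsilon. A textbook Laplace-mechanism sketch (not Gretel's actual pipeline, just an illustration of the idea) looks like this:

```python
import numpy as np

def laplace_count(true_count, epsilon, sensitivity=1.0, seed=None):
    """Release a count with epsilon-differential privacy via the Laplace mechanism."""
    rng = np.random.default_rng(seed)
    return true_count + rng.laplace(scale=sensitivity / epsilon)

# e.g. a noisy count of rides starting in one grid cell of a rideshare dataset
print(laplace_count(true_count=1423, epsilon=0.5))
```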
A little off-topic, since we’re not talking about true synthetic data here (vs. wiping PII), but the future of synthetic data is no-data due to differentiable programming. Instead of a program outputting vast amounts of synthetic data, it is written with a library or language that is differentiable and whose gradients can be integrated straight into the training of a model. A few PyTorch libraries dealing with 3D modeling have been released lately that accomplish that, and a good deal of work in Julia is making promising advances. I’m curious to see how overfitting will be addressed, but there may come a time when large datasets become a thing of the past as low-level generation of data becomes just another component of a model’s architecture.
Synthetic data is a tool in a system of privacy-preserving analytics. I don't think a lot of companies would be comfortable with shipping their sensitive data into a model sitting not on their premises. The model, needs to generate good enough results that would make sense in a training/analysis scenario (doubtful) and preserve the privacy of that data (difficult).
People in the Privacy-preserving industry try to find the silver bullet in one technology, but the real solution is a combination of different technologies. Trusted delegation of computation + privacy-preserving techniques together solve this issue, but separately provide marginal value.
Honest question: does the NSA have good engineers? Like, FAANG good? It seems like a funny "hype line" for a startup, because I think of government engineers as ok, but not like, rockstars.
Most of the top engineers would probably be contractors. The government pay scale pretty much forces you to become a manager if you want to move past GS-12. Including locality pay, GS-12 salary tops out at a little over $100k a year [1].
I just looked at their website, and they don't mention anything about NSA, Amazon, or Google on the landing page, so I think it's just techcrunch adding it in as clickbait, even though the founders aren't going out of their way to advertise it.
The majority of the best engineers don't even work for a FAANG. FAANGs don't offer remote for the most part, and many of the best decided they wanted more. The world is a big place.
I think we're pretty good. Feel free to review the code and contribute as we release our open source packages over time. We all get better with time and collaboration.
Not entirely sure what this is yet, but I was looking for a resource of scientific data in nice formats and was very surprised I couldn't really find any. If this could do that in a way that doesn't upset any licensing agreements (I don't really know what I'm talking about in that regard), that would be pretty cool.
I needed a resistivity-temperature dataset for a tungsten alloy and ended up having to manually type up a series out of the book. Not fun!
I think this is a great idea. Perhaps the "Github for data" copy could use some work, but the concept of obfuscating real data for use in building systems without the overhead or concern of operating on real customer data is valuable. Of course, the degree to which the obfuscated data represents real data is important to ensuring the systems built like this are robust, but this seems possible.
I've been using Snowflake for a while now and the Data Sharing feature they have is weirdly close to what I'm reading here. Anonymization, masking a column etc. Anyone with similar experience as mine? Or anyone who's knowledgeable enough to compare this product with Snowflake for me?
I see "Laszlo Bock" is one of the founding members. He was industry veteran an exec at GM and Google (as a head of HR, as far as I know). He then started an HR company Humu (https://humu.com).
Wouldn't a data repository be considered more of a wiki than a VCS? I don't understand the desire to associate the concept with Github instead of Wikipedia, which I would propose is more appropriate.
I like this idea, but the article doesn't do a great job of spelling out why it's valuable.
Consider a SAAS with 25,000 active customers and tons of structured (database) data.
You have a bunch of people that need to work on the dev system and the closer dev looks to prod the better you are.
- Contractors in another country
- A team working on basic compliance with GDPR and CCPA
- Sysadmin team trying to manage backups/restores
When the contractors pull a version of the DB it needs to not have any customer data (emails, addresses, etc.) so there's a process that wipes all those out and fills them in with fake data.
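In practice that wipe-and-fill step is often just a small script using something like the `faker` package (the column names here are hypothetical):

```python
from faker import Faker

fake = Faker()

def scrub_customer_row(row):
    """Replace PII columns with plausible fakes before handing a dump to contractors."""
    return {
        **row,
        "name": fake.name(),
        "email": fake.email(),
        "address": fake.address().replace("\n", ", "),
    }

print(scrub_customer_row({"id": 42, "name": "Real Person",
                          "email": "x@y.com", "address": "1 Real Rd", "plan": "pro"}))
```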
When the GDPR team gets a data deletion request ("Please delete all my data my email is X@y.com") on Monday and the Sysadmin team restores from a backup from Sunday what happens?
Right now both of these actions are one-off things done with a mish-mash of scripts, organizational knowledge, and half-remembered processes.
So wouldn't this be better with a service that could talk to your DB and let you fork out versions that "knew" the current DB structure, that you could mark as purged of sensitive data, and that you could apply (and re-apply) transforms to for structure or data addition/removal?
It's not Github for data, but if it were, then one of the main issues is handling GDPR requests. Git is by construction averse to deletions in its history, so incompatible with hosting sensitive data. On the other hand it could be great to have deduplication in storage and versioning for datasets.
Well, Git itself is not a good tool for handling large dataset files. In most cases, you're not interested in deltas between commits. The size of your repo can also grow out of control pretty quickly.
As a dirty workaround, you have Git-LFS to do that for you. People tend to use it in repos with a lot of multimedia assets. This works well in many cases, but it has its own pitfalls as well.
@freeone3000 GitHub is amazing for building and collaborating on source code; we're building services to enable safe and private collaboration on streaming data (whether text, images, video, or your own format).
GitHub itself works fine for science. I see no particularly compelling reason to use a science-oriented service given that people are more likely to be familiar with GitHub.
There's code in that repository too. The code merges a variety of different data sources and performs some analyses. Nothing particularly fancy, and the code is probably not much better than average as far as academic code goes (which is not good), but I'm slowly adding tests and improving the code otherwise.
“We’re building right now software that enables developers to automatically check out an anonymized version of the data set,” said Watson. This so-called “synthetic data” is essentially artificial data that looks and works just like regular sensitive user data. Gretel uses machine learning to categorize the data — like names, addresses and other customer identifiers — and classify as many labels to the data as possible. Once that data is labeled, it can be applied access policies. Then, the platform applies differential privacy — a technique used to anonymize vast amounts of data — so that it’s no longer tied to customer information.
Anyone taking bets on how long it's going to be before these idiots end up leaking the SSN of every US citizen because their categorizer failed?
You don't have to be a machine learning expert to understand that no classifier is going to be correct 100% of the time. The laws against divulging PII don't contain exceptions for classifiers goofing.
Why would you need "to anonymize vast amounts of data — so that it’s no longer tied to customer information" or "appl[y] access policies" if the data contain no PII? Presumably the ML is anonymizing the data and the access policies are necessary because the data contain PII.
Yup, people tend to confuse concepts and refer to synthetic data as anonymised data. They are very different things.
Anonymised data or redacted data are transformations of a data set that one _hopes_ do not leak too much PII / sensitive data. People don't use ML to anonymise, but they do use ML to classify as a first step before splatting or generalising.
In that case, it's absolutely right that the ML classifier not being 100% accurate results in PII leaking.
This is a key reason why anonymisation and redaction are widely seen as problematic and are being replaced by synthetic data and, maybe in future, homomorphic encryption.
Homomorphic encryption, and any encryption-in-use technology, is no guarantee of privacy on its own. Synthetic data has the same dilemma of utility vs. anonymity as any other anonymization tech.
Synthetic data uses the distribution(s) of the underlying dataset(s) to generate a totally new dataset that in theory has the same statistical properties of the original dataset. The rub is in making sure that you're actually synthesizing a valid statistical representation of the original dataset, including joint distributions. Otherwise, you wind up with models that won't generalize back to the original data.
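As a minimal illustration of what "same statistical properties" can mean (and how limited it is), a Gaussian synthesizer that only preserves means and covariances might look like the sketch below; real products need far richer models to capture the full joint structure.

```python
import numpy as np

def gaussian_synthesizer(real, n_samples, seed=0):
    """Fit a multivariate normal to the real data and sample a brand-new dataset from it."""
    rng = np.random.default_rng(seed)
    mean = real.mean(axis=0)
    cov = np.cov(real, rowvar=False)          # captures second-order joint structure only
    return rng.multivariate_normal(mean, cov, size=n_samples)
```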
> I said, in good faith, that I would consider his product if I could inspect the running system and the code.
> He said several things: 1) the NSA never did anything illegal 2) the software was too large to audit 3) it was an insult to his employment in the NSA that I was even asking these questions.
Then he hung up.