CCPA Will Hit Dev Teams Harder Than GDPR (tonic.ai)
199 points by icoe 45 days ago | 167 comments

Counting an IP address as PII is kind of crappy; you need a court order to turn an IP alone into PII.

Operators should be free to log traffic at the network level; PII should only come into play once you're asking someone to provide personal information.

Lots of comments here about how IPs aren't PII b/c they can change, etc. I'm not arguing that, but consider that there is an entire _industry_ around using IPs to specifically target people, companies and households that is effective enough for businesses to write large checks to them.

Household IP Targeting - https://www.vicimediainc.com/ip-targeting-direct-mail-intern...

Or even just your ISP (who for sure know your IP addr and your address) - https://arstechnica.com/information-technology/2017/03/how-i...

The larger issue is that we (HN tech people) treat IPs as fallible because we're thinking of them in absolute terms. The advertising side of the Internet looks at them like a goldmine b/c even a 75% correlation to "truth" can still make their ads reach the people they're trying to reach much more cheaply.

The amount of information that can be found using your IP address: https://clearbit.com/attributes (see only the Reveal API)

I just signed up for the trial and Reveal turns up nothing on my home IP address.

Yeah, it is odd. You decided to hit my server; I should be able to record the occurrence. How am I supposed to deflect DoS attacks if I can't maintain a list of nefarious IPs? I know that's a fairly low-tech attack, but they still happen constantly. Is Fail2Ban no longer compliant?

I wouldn't be surprised if some policies pertaining to record keeping in some sectors contradict that requirement as well.

Not sure about this law, but that sounds completely fine under GDPR. You need to keep your log files secure and not keep them longer than necessary for what you're doing, though.


You absolutely can still maintain that list under CCPA. What you can't do is sell your list of nefarious IP addresses. You could sell (or buy) the service of checking various IP addresses against a proprietary list of nefarious IP addresses.

To deflect a DoS attack you should not need the records for an extended amount of time. There is no reason why you cannot specify that you are keeping records for security purposes and get rid of them when no longer pertinent.
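A minimal sketch of that kind of retention policy (the record shape and the 30-day window are illustrative assumptions, not anything the law prescribes):

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=30)  # assumed window; pick whatever your policy justifies

def purge_expired(records, now=None):
    """Keep only log records newer than the retention window.
    Records are assumed to be (timestamp, ip) tuples; the shape is
    illustrative, not any particular log format."""
    now = now or datetime.now(timezone.utc)
    return [(ts, ip) for ts, ip in records if now - ts < RETENTION]

now = datetime(2024, 6, 1, tzinfo=timezone.utc)
records = [
    (datetime(2024, 5, 25, tzinfo=timezone.utc), "203.0.113.7"),  # 7 days old: kept
    (datetime(2024, 3, 1, tzinfo=timezone.utc), "198.51.100.9"),  # ~3 months old: purged
]
kept = purge_expired(records, now=now)
print([ip for _, ip in kept])
```

Run on a schedule, this is the "getting rid of them when no longer pertinent" part; the security-purpose justification is whatever you document alongside it.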

You can do all those things under GDPR, as they are required for the running of the service.

CCPA will probably be amended at least once more before it goes into effect. If you feel that it shouldn't apply to non-membership website operators who merely log IP address and requested URL... consider writing to your California State Assemblymember and California State Senator, and possibly to the California Attorney General who will be publishing guidance regarding CCPA.

Amusingly enough, California consumers will not have privacy rights regarding any written comments sent to the California Attorney General.

Could you salt and perform a one-way hash on the IP address and store that? It would alleviate a large amount of leakage issues while still giving you uniqueness counts.

I built an Nginx plugin to do something like this https://github.com/masonicboom/ipscrub

IPv4 addresses are only 32 bits, which makes building rainbow tables almost trivial.

Yeah, though a salt would at least mean you'd have to rebuild the table for each site/database/whatever. However I'm having a hard time seeing how to really protect against this.

The IP is an identifier, so unlike a password salt (where the user is the identifier), you need a way to know what the salt is before you can hash the IP, and it needs to be consistent.

You can do a lookup table of IP-to-salt, but this either gives away your list of addresses (if only containing IPs you've seen) or is huge (entire ipv4 range), and either way doesn't prevent rainbow tables.

You can have a static salt for the entire site, but again this is not really helping much against rainbow tables (beyond requiring recalculating the table, once).

Is there a mitigation I'm not thinking of?

You could encrypt instead of hash, and then have some policy (e.g. the decryption library/service/piece will only allow decrypting ciphertext newer than 30 days).

If you need the ability to group ciphertexts without decrypting them, you could create a scheme which will make cryptographers cringe, but could be justified in this specific case.

For large sites, you also have the risk that you might be able to say something statistically useful about the plaintext.

For instance, you can probably assert things like which IP blocks are likely to comprise most of the entries in the table or which IP blocks or addresses cannot be in the table.

That just makes me all sorts of uncomfortable.
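A toy sketch of the encrypt-instead-of-hash idea: a 4-round Feistel cipher over the 32-bit IPv4 space, with HMAC-SHA256 as the round function. Everything here is illustrative; a real deployment would use a vetted format-preserving scheme such as NIST's FF1 rather than hand-rolled crypto. The point is only that a key you can destroy on schedule makes old pseudonyms unrecoverable while fresh ones stay usable and groupable:

```python
import hashlib
import hmac
import ipaddress

def _round(key, r, half):
    # HMAC-SHA256 as the Feistel round function, truncated to 16 bits
    mac = hmac.new(key, bytes([r]) + half.to_bytes(2, "big"), hashlib.sha256)
    return int.from_bytes(mac.digest()[:2], "big")

def encrypt_ip(key, ip):
    """Map an IPv4 address to a 32-bit pseudonym, reversibly."""
    n = int(ipaddress.IPv4Address(ip))
    left, right = n >> 16, n & 0xFFFF
    for r in range(4):
        left, right = right, left ^ _round(key, r, right)
    return (left << 16) | right

def decrypt_ip(key, token):
    """Invert encrypt_ip, as long as the key still exists."""
    left, right = token >> 16, token & 0xFFFF
    for r in reversed(range(4)):
        left, right = right ^ _round(key, r, left), left
    return str(ipaddress.IPv4Address((left << 16) | right))

key = b"rotate-me-and-destroy-after-30-days"  # illustrative key material
token = encrypt_ip(key, "203.0.113.7")
print(decrypt_ip(key, token))  # round-trips while the key is held
```

Destroying `key` after the retention window turns every stored token into noise, which is the auditable guarantee the parent comments are after.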

I thought salt was supposed to be unique per hashed value. Rainbow tables don't work in that case.

No matter how complex your scheme is, if the IP address is the only input, it's a (mathematical) function f: IP → hash. Since the IPv4 space is 32-bit (in practice, slightly less), if you know the function f, you can trivially enumerate all inputs.

From a security point of view, if you use a fixed (unrelated-to-input) salt, the attacker will have a harder time discovering the function f (unless you store the salt next to your IP hashes). But from a privacy point of view, in the relationship between me (user) and you (service provider), you are the attacker. And you know your function f. Hashing IPv4 addresses, salted or not, gives me no privacy protection, since you can trivially reverse the hash just due to the small domain size. With IPv6, this problem will resolve itself somewhat; till then, I'd prefer if you encrypted those IPs with keys that have a finite and short lifetime, in a way that a third party could audit if need be.
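To make the "trivially reverse" point concrete, here is a sketch of brute-forcing a salted SHA-256 IP hash. For speed it only sweeps one /16 (65,536 addresses), but the full 32-bit space is just as feasible offline; the salt and addresses are made up:

```python
import hashlib
import ipaddress

SALT = b"per-site-secret"  # illustrative; whoever holds it can reverse every hash

def hash_ip(ip):
    return hashlib.sha256(SALT + ip.encode()).hexdigest()

stored = hash_ip("203.0.113.77")  # what ends up in the "anonymized" log

def reverse(target):
    # The service provider (or anyone who steals the salt) just enumerates
    # candidates. One /16 here for speed; a GPU covers all 2^32 in seconds.
    for host in ipaddress.IPv4Network("203.0.0.0/16"):
        if hash_ip(str(host)) == target:
            return str(host)

print(reverse(stored))  # recovers 203.0.113.77
```

No rainbow table needed: with the salt in hand, enumeration of the tiny input domain is the whole attack.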

That only works if you have two pieces of information. Username and password works because you can find the salt value associated with that username and then use it for the password hash. An IP would still require an unhashed value to look up the salt if you salted per IP address. You might be able to get away with a single salt value for all IP addresses, but even then, if you get hacked, it would be trivial to write a script that computes the rainbow table once the salt value is stolen.

For passwords, yes, this is generally best practice. Also, the salt is normally stored with the hashed password, as it’s not regarded as a secret.

Modern GPUs can manage several thousand million SHA256 hashes/sec, so even with a salt per hash it's not going to take long to get a given entry, given the 32-bit address space of IPv4.

You can use bcrypt or argon2 to make it much slower than that.
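Rough back-of-envelope numbers (the hash rates are order-of-magnitude assumptions, not benchmarks): a GPU doing ~10^10 SHA-256/s sweeps the whole IPv4 space in under a second, while bcrypt at a high cost factor stretches a per-salt sweep into years. Note that checking one specific person's known IP still costs a single hash either way:

```python
SPACE = 2 ** 32  # the entire IPv4 address space

sha256_rate = 10_000_000_000  # ~1e10 hashes/s on one GPU (rough assumption)
bcrypt_rate = 10              # ~10 hashes/s/core at a high cost factor (rough assumption)

sha256_seconds = SPACE / sha256_rate
bcrypt_years = SPACE / bcrypt_rate / (3600 * 24 * 365)

print(f"SHA-256 sweep of all of IPv4: ~{sha256_seconds:.2f} s")
print(f"bcrypt sweep of all of IPv4:  ~{bcrypt_years:.0f} years per salt")
```

So a slow hash raises the cost of a full-space sweep substantially, but it does nothing against the "did this known IP visit?" query the privacy argument upthread is about.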

but why?

If I'm hit by a DoS attack or spam, I need the IP to find out to whom I should file an abuse complaint.

Do we need to sanitize SMTP headers too? How about shutting down DNSBLs?

It's not possible to one-way hash a 32-bit IP address. A hash of a 32-bit value can always be reversed because the search space is so small.

Store only the first 16 bits of the hash maybe?

Google Analytics is supposedly GDPR compliant when they store only the first 3 octets, un-hashed.

However, I'm not sure myself that it makes sense. Some people will be identifiable by just a partial IP or even a partial hash.
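For reference, the octet-truncation approach is a one-liner with the standard `ipaddress` module (the function name and mask widths here are illustrative):

```python
import ipaddress

def anonymize(ip, keep_bits=24):
    """Zero the low-order bits of an address. keep_bits=24 matches the
    Google Analytics-style 'keep the first 3 octets' truncation."""
    net = ipaddress.ip_network(f"{ip}/{keep_bits}", strict=False)
    return str(net.network_address)

print(anonymize("203.0.113.77"))      # -> 203.0.113.0
print(anonymize("203.0.113.77", 16))  # -> 203.0.0.0
```

The same masking also covers the "mask off a few low-order bits" suggestion below: smaller `keep_bits` trades identifiability for less network-management precision.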

Who cares if it’s trivially hackable; we’re talking about a legal checkbox that you have to tick.

A reversible hash "could reasonably be linked" with the plaintext. You can't get around the law on technicalities. Judges are not computers.

> You can't get around the law on technicalities.

Simply out of curiosity, what do you mean by that?

All my life experience and knowledge tells me that's exactly how you get around the law, unless the court has its own agenda or strong bias.

I do. People treating privacy protections as "legal checkbox that you have to tick" are the reason regulations like this show up in the first place.

Aren't IP addresses used as PII by companies to track users that have profiles but aren't logged in?

I’d hope not. From the company’s perspective, there’s never any guarantee at all that an IP is going to be 1:1 to a real identity. IPs will be dynamically reassigned to new consumers constantly, and there are many situations where you’ll have many (sometimes very many) users sitting behind the same IP. The only situation I’ve come across where some level of PII has been retrieved from an IP is services that are able to link an IP to a particular company’s office. I’ve seen that used in Account Based Marketing funnels, where you can get information that ‘somebody at ACME Corp viewed these pages on your website’.

Trackers don't care if they're wrong some of the time. The prediction problems they're using the data to build models for are pretty noisy anyway. If using inexact identifiers improves their model, they'll get used. Many technically dynamic IPs change only rarely... I think my home Comcast IP has changed once in the last 2.5 years. So the correlation between a Comcast IP and a perfect household identifier is going to be pretty good. If you have a dataset that's got search history timestamped and labeled with IP, it's probably pretty easy to figure out the physical address that goes with the IP from map searches. Cross-reference an address-to-name database and now you've got a dataset with each household's (labeled by name and address, with some error) search history.

From a company’s perspective a person uses only a handful of IPs most of the time: home and work.

Combine that with cross-site tracking and phone companies selling your info...

From a company's perspective, almost all global mobile users are behind CGNAT, a huge portion of homes are too, and offices have hundreds or thousands of people exiting from a single public IP or a few of them.

You can probably mask off a few low order bits and still get most of the value for network management applications.

Another note... Per 1798.140(c)(1)(B), CCPA applies to a business that receives PII of 50k or more consumers for the business' commercial purposes. Which might not apply to access logs kept purely for diagnostic purposes.

A commercial purpose of ours is keeping the web site up.

Right now it maybe isn't but that could quickly change if newer protocol versions get more common.

If IP addresses were as anonymous as claimed, there would be little incentive to save them in long-term storage.

Especially considering many home connections don't even have static IPs anymore. Websites can't tell whether the IP is static or dynamic; it would be pretty silly for them to rely on it anyway.

There's been a lot of FUD surrounding the logging of IP addresses for network diagnostic and abuse purposes as a violation of GDPR (and now CCPA), but I'm not aware of any cases where that alone was sufficient to cripple a business.

Until I hear otherwise, I'm going to gamble that for now that's not the kind of reckless mishandling of personal information that regulators are trying to crack down on.

> Until I hear otherwise, I'm going to gamble that for now that's not the kind of reckless mishandling of personal information that regulators are trying to crack down on.

And you're probably right until they do otherwise.

The problem with badly-drafted laws is that they can be used to attack people who are annoying but who haven't done anything wrong... except for technically violating a law which is "supposed to" mean something else but which can be read to penalize some harmless activity the gadfly happened to engage in.

So, maybe you'll be patient when I'm not comforted by people telling me to not worry about it.

GDPR gives regulators a lot of leeway on how to crack down on things.

And that’s problematic for someone trying to understand if their business operations are legal.

Courts are not run by robots, judges are generally smart people. I agree - I think most people overthink the whole IP == PII nonsense. I think it’s more likely that IP + other factors, and your USE (or misuse) is where things become more gray.

I think the whole point of the rule of law (versus rule of authority) is to remove some of the massive ambiguity about enforcement and make the courts a bit more “robotic” and regular. You don’t want a situation where it’s luck of the draw on a judge, or where the ambiguity allows selective enforcement against people one judge or prosecutor particularly dislikes.

I agree with you ideologically. I'm not defending this law, or bad laws, or laws applied unevenly. I've been a vocal opponent of all those.

But we also have to have a certain pragmatism when deciding how to behave in a society with an impossible legal system. How much effort should I, as a developer or as a consultant to business owners or as a systems administrator, spend on purging IP addresses versus all the other things that need attention?

For that we look to how the law is applied in practice.

I was active on Slashdot back when the DMCA was first proposed and then fought its way into becoming law. There is no topic about which HN is as rancorous as Slashdot was about the DMCA. What does the situation look like now, twenty years later? Yes, there are and have been and continue to be abuses of the DMCA, but not at the internet-destroying scale that Slashdot predicted.

So I'm not going to tell you to ignore IP addresses in your log files. That's up to your judgement. But I'm going to ignore them in mine, until I see a reason to do otherwise, and when it's a topic of discussion with others, I'll tell them that according to a strict reading of the law, logged IP addresses may be a liability, but that there have been exactly 0 cases to date which have been only about some business having IP addresses in its logs for abuse and diagnostic purposes.

Agreed. On the other hand, you can't make courts fully robotic. The absurdly large size of existing laws is the consequence of trying to make them more like computer code, and having to patch countless vulnerabilities and corner cases in the process. In general, writing good laws as computer code is an AI-complete problem. That's why all laws leave some space for human judgment.

The advice we were given, and my general understanding is that you absolutely have the right to use IP addresses for network diagnostic and [anti-] abuse purposes. What you can’t do is leave those IP addresses lying around unsecured, share them with anyone who doesn’t have a legitimate requirement for access, or otherwise use them for random purposes. Also, you probably need a lifecycle policy so you don’t hang onto that data indefinitely.

The GDPR means you need a lawful basis for processing the data. Not that you can't process it at all.

There's lots of talk about consent as a basis for processing. For lots of purposes "Legitimate Interests" is likely a better basis. You'll have to perform a legitimate interests assessment and be able to justify that the potential negative impact of your processing is outweighed by the benefits.

The ICO has an interactive tool for selecting a basis for processing https://ico.org.uk/for-organisations/resources-and-support/l... with links to more information.

> if a data breach occurs, the law permits consumers to recover up to $750 per incident

This is great!

Simple, you just add this to clickwrap agreement:

The Parties mutually agree that any and all disputes arising from or relating to this Agreement, including the interpretation or application of this Agreement, will be submitted exclusively to final and binding arbitration pursuant to the Federal Arbitration Act. The arbitration will be conducted in the state of Delaware or such other location as the Parties may agree, by a single arbitrator, in accordance with the substantive laws of the State of Delaware.

Boom. No more pesky California law.

Setting aside potential flaws in your thesis, theoretically the Federal Arbitration Act (FAA) can be circumvented by making the state the real party in interest but permitting a victim to sue and recover on behalf of the state. Because the state wouldn't be a party to any contract (and also because it's a state), the FAA wouldn't apply.

California does this for labor violations through its Private Attorneys General Act (PAGA): https://www.dir.ca.gov/Private-Attorneys-General-Act/Private...

Glancing at the Wikipedia page for CCPA, it's possible that the CCPA is structured similarly--"Companies ... can be ordered in civil class action lawsuits ... subject to an option of the California Attorney General's Office to prosecute the company instead of allowing civil suits to be brought against it."

That said, I don't think California's PAGA has ever been tested vis-a-vis the FAA in the Supreme Court because it was only recently that they decided to strictly apply the FAA to employment contracts.

Section 1798.192 covers that:

    Any provision of a contract or agreement of any kind that purports to waive or limit in any way a consumer’s rights under this title, including, but not limited to, any right to a remedy or means of enforcement, shall be deemed contrary to public policy and shall be void and unenforceable
Edit: I suppose that I should say that I'm not a lawyer. This isn't legal advice. And it's completely possible that I have misunderstood this section of the law.

I dislike how the minute someone mentions a legal hack, the responses are "oh, are you a lawyer?"

Why not consider this reply on its merits?

Because it is super-risky to consider these things on their own merits if you are not the kind of person who regularly interacts with judges and juries. Laws are something that are applied within a particular kind of, ah, culture. You have to be familiar with the body of work of that culture and how they will likely interpret the law. Trying to interpret laws in ignorance of that culture is likely to lead to interpretations contrary to those with the power to enforce the laws, and land you in a lot of trouble.

In other words, laws aren't code or mathematics. They're not pure exercises of abstract thought to be considered in isolation. Trying to treat them that way is going to lead to trouble.

Does everyone downvote all medical speculation in the numerous health threads on this site?


It's fine for people to speculate about medical ideas, legal ideas, etc. Especially on a forum like this where there is no pretense that people are offering genuine legal advice.

Or maybe we should express less confidence in our assertions about medicine?

After all, most of the time, people are writing about things they don't know all that much about.

To be honest, one or the other should be the case.

Either wild speculation on medicine and law should be fine (this is my position).

Or, people should fear medical speculation as much as they do legal speculation (I think this is the more pathetic option).


Personal attacks are not ok on HN. I appreciate your concern for the quality of the site, but doing this is one of the worst ways to destroy it. So if you'd please review https://news.ycombinator.com/newsguidelines.html and follow the rules when posting here, we'd be grateful.

carbocation is right about the underlying point, btw. This is an internet forum, the purpose is good conversation, and speculation is a normal part of conversation. It can of course be dumb and low-information, but it needn't be.

EDIT: Taking the L on this one.

You might take a second to click on 'carbocation's name and see what his background is. I think you've missed some subtext.


"Legal hacks" are rarely, if ever, as clever as their proponents think. Scepticism is natural and warranted.

Judges aren't complete morons and will take a dim view of "hacks". There could be loopholes somewhere but you'd need a lawyer to spot them.

One of the most famous "legal hacks", Richard Stallman's copyleft, had to be rewritten by a lawyer. rms wrote GPLv1 by himself and you should never use it. GPLv2 is the version that was actually vetted by a lawyer.

A similar thing happened with Perl's Artistic License. Its version 2 is basically also a lawyer-approved rewrite.

In other words, hackers, don't try this at home. There are professionals who can do this for you.

I find it somewhat sad that law is basically a guild where arcane language is used to gatekeep what should be a much more straightforward exercise.

It's not. It's the equivalent of saying "I can do this better" and producing unreliable, buggy code. Sure you can, but a more experienced professional can point out all the corner cases you missed.

Then when it fails, you blame the programming language rather than your experience in programming.

I find it somewhat sad that programming is basically a guild where arcane language is used to gatekeep what should be a much more straightforward exercise.

Hahaha, actually at some point in the future, I suspect 25 years or so, our programming guild will likely have taken over and replaced the legal guild.

This would make an interesting entry on http://longbets.org/

I mean, if you pay attention to the names of the kernel API functions, you'd probably end up with the same conclusion.

Very relevant xkcd: https://xkcd.com/1494/.

Probably because anyone who isn't a lawyer has no hope of considering this reply "on its merits".

My gut feeling is this "legal hack" wouldn't work, because if it did someone would have used it by now against some other law that provides for damages, and someone else would have figured out how to neuter the hack. Which is to say, there's probably an existing law that prevents this hack from working. But you'd need a lawyer to be able to say whether that's true or not.

I'd love to see them try this in the EU.

Oh are you a lawyer?

OP needs to be not just a lawyer, but your lawyer. I.e. someone who is accountable to you if their advice is wrong.

There shouldn't be a cap to liability, this reeks of tort reform-esque legislation.

If my identity gets stolen, there is much more than $750 at stake on my end.

It's up to $750 or actual damages if greater.

good time to make a bot that signs up for things

Only if you don't value your PII. If you don't use PII in your bot then you can't claim.

"up to"

Are there any guidelines for determining actual compensation?

The full sentence is this fwiw:

>Additionally, if a data breach occurs, the law permits consumers to recover up to $750 per incident (or actual damages, if greater).

So that might just be $750 as part of a punitive fee.

It sounds more like a statutory damages thing, although note I have not read the law.

The idea with statutory damages is that determining the actual damages can be difficult and uncertain, so some laws allow plaintiffs to elect to ask for damages from a standard range, and the court will decide where damages should fall in that range based on the circumstances. It's basically saying "just give me about what is typical for cases like this one".

Presumably it's a scale from "leaked (e-mail) addresses" to "leaked nude photographs".

I don't mind my nude photographs leaking. I mind if somebody takes out a loan in my name and a dumb bank sends it to collections.

Someone else might mind your nudes. E.g. your employer, the school your kids go to, the parents of your kids' friends, etc.

At least in the US; the rest of the world isn't that shocked by our natural form.

Presumably the upper end of the scale would be closer to identity theft with total asset loss and fraudulent lines of debt, which would likely occur if e.g. Google got hacked.

It would be decided in a civil court, most likely.

"It defines de-identified as “information that cannot reasonably identify, relate to, describe, be capable of being associated with, or be linked, directly or indirectly, to a particular consumer.”"

I'd love to know what they mean by reasonable... I've seen some demos of tech that can do some pretty amazing things at de-de-identifying.

So, huge caveat (I'm NOT a lawyer), but right now most interpretations seem to suggest that masking and synthesizing would constitute appropriate deidentification even if a motivated adversary could reverse engineer given appropriate time and resources. Again, this is something that will likely be clarified over time.

Great article, until the end.

Who uses PII in test data derived from real customers? That's just an absurd practice to begin with, and no one who takes security seriously would even consider doing this.

Hi, this is Adam. I'm a founder at Tonic.

As others have said, we've found a lot of smaller companies will test with production data because of their need/desire to move quickly. But we've also seen much much larger companies use production data in their dev/staging environments. Sometimes there will be production-like safeguards and security measures in place but not always. People shy away from practices that slow down development and testing.

We think synthetic data is the right solution for a few reasons. Most importantly, we believe it provides the right level security, while still allowing your team to be productive, i.e., your business logic and test cases still work. It also allows you to scale really easily since you effectively have a ruleset for generating data of any size. Finally, it’s a great way to share data throughout your organization and can help facilitate sales and partnerships. If you’re curious about scaling, check this post out: https://www.tonic.ai/blog/condenser-a-database-subsetting-to...

Funny you should ask. This just popped up the other day over here in Norway:


In summary, when doctors were testing a new electronic patient journal system, they used real social security numbers (our version of them). And just for kicks they tested in production, so the people whose numbers were used got all kinds of prescriptions for stuff they didn't need, etc.

I have never seen a “dev” instance of a DB that wasn’t just a snapshot of the prod DB from earlier. I admit haven’t seen many - but I have seen zero of any other kind (e.g. anonymized or synthetic)

Just going to throw out there that I’ve never seen a dev database that was anything other than fake data, or internal dogfood data. Have worked at major public tech companies and late-stage startups.

I think one reason might be that this was never sensitive personal data. Phone numbers, emails and addresses were mostly corporate. But real passwords (hashes) from real users, on 50+ laptops with unencrypted drives, was pretty normal.

I think culturally there may be a difference since I'm in a place where some data (addresses, phone numbers, ...) is public info, i.e. given your name I can get your address and phone number from a public DB anyway.

I've done this plenty of times, but you can anonymize the data fairly easily and get the benefits of both.
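One common anonymization trick is deterministic pseudonymization: hash each PII value with a dev-only secret so the fake values are stable across tables and joins still work. A sketch, with made-up field names and secret:

```python
import hashlib

def pseudonym(value, field, secret=b"dev-only-secret"):
    """Deterministically replace a PII value: the same input always maps to
    the same fake value, so joins and uniqueness counts still work in dev."""
    digest = hashlib.sha256(secret + field.encode() + value.encode()).hexdigest()[:8]
    return f"{field}_{digest}"

row = {"id": 42, "email": "alice@example.com", "name": "Alice Jones"}
scrubbed = {
    "id": row["id"],  # non-PII keys kept intact for referential integrity
    "email": pseudonym(row["email"], "email"),
    "name": pseudonym(row["name"], "name"),
}
print(scrubbed)
```

Caveat from the hashing discussion above: for small input domains this is pseudonymization rather than true anonymization, so treat the scrubbed copy as still somewhat sensitive.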

The flip side of the coin is dev databases not being representative of production, this can cause performance issues. "It works with 10 rows on localhost, why doesn't it work with a million in production?".

Many small companies take this approach. Usually it’s lower-risk PII.

Some small companies will refuse to use generated data if it takes even a minute more to generate it vs import it from production. In the consulting world I've seen multiple examples of companies complaining bitterly about other security-minded consultants' efforts to improve security and privacy through even small amounts of additional development time.

I have seen it done in a small company to check if a query will run too slow in production. Take a copy of the biggest database. Run query, see what happens, delete copy.

It's probably more often that the query is just run against production in the first place.

Making a copy is probably more effort than most developers out in the wild are going to make.

Not true. If you were to throw up a slow locking query in production, you could take down the site. Restoring a backup should be fast.

It's very easy for a small company to get in the habit of using data cloned from prod for testing. This practice, easy as it is to start, gets progressively more difficult to move away from as the application and service grow in complexity.

As a result, you get shockingly mature companies that do exactly this obviously absurd thing because it's a ton of work to stop. Work with no obvious reason to do this instead of feature work.

We used a similar trick, not for testing, but the ability to download the prod database and debug things locally. We hit scaling issues before PII, so coworker built a system to generate real-ish DBs with only one customer's data. And then in a future version, sensitive fields were filtered or replaced with mock data. Not sure if there are better, less engineering effort ways of doing this, but it was a great tool when debugging.

Makes a lot of sense. I actually think leveraging your prod data to create a test environment is one of the best approaches, as long as you're mindful of privacy. Full disclosure: I'm a founder of tonic.ai and we make tools to make it easier to create synthetic staging instances from production environments.

Small companies that are just starting out may use real data in test environments since it's a bit easier than using mocked data... Honestly, this really only holds for companies that also avoid unit/integration tests (which generally require that the data supporting the tests be explicitly mocked in some manner).

Since this involves computers nothing above is a hard rule, but it goes along with my experience.

The $25 million revenue limit would be a pretty good guide for 'small'. Typically there are a lot of changes around that mark, one of which should be to stop using customer data insecurely.

Except that revenue limit is just one term of an OR clause. If you hit any of those three listed points, CCPA comes down on you. No revenue at all but 50k unique visitors, and it applies.

Yea, but the $25mil portion of the clause is the only one I see an excuse for: if you're saying that -all- businesses generating X revenue or higher need to comply with a regulation, then it's good to make sure X is high enough that businesses in unrelated fields will be able to afford the cost of compliance without going bankrupt.

The other two categories specifically target companies that really should comply with this law. I assume the $25mil clause is there to make sure large companies can't loophole themselves out of this somehow (offload PII responsibility onto a subsidiary or a "third party" that is incorporated in Bermuda by the owner of the company).

People who have been hurt by the fact that synthesized data often doesn't exactly match real data.

TSA. My buddy was on a contract to modernize the system behind the supposedly secret no-fly list in the United States. The sample data was a sample of the production data.

Throwaway for obvious reasons.

When the system involves some not-well-understood edge cases (e.g. pre-Unicode, non-Latin person names; historical dates and times during calendar and timezone switches) and other underdocumented business rules.

Sadly it's much more common than you might imagine.

A huge part of the problem is that many companies don't take security seriously.

It would surprise you.

I've seen that at an insurer...

Going into effect in a year? Seems like a business opportunity. Someone let me pay them $X and review my systems every so often and give me a seal saying I'm compliant with all these laws, and include some insurance up to $Y. Especially given the selective enforcement, there's money to be made from the chill alone. Compliance audit companies can probably just roll this into their package.

Also, I'm a bit annoyed at laws only affecting companies of a certain size. At some point right at crossing the line, there's a negative effect to having 50,001 users. (really I'm annoyed at how these data protection laws are implemented in general and I wish the discussion would be about that instead of being idealistic and only looking at the supposed intent)

> how these data protection laws are implemented in general and I wish the discussion would be about that instead

Let’s do that, shall we?

Before GDPR there were laws in each European country protecting private data (GDPR is basically Sweden’s data protection law in that regard).

Not a single “poor company that will need to comply” gave a damn.

Then GDPR was introduced, discussed, amended. Quite publicly. Not one of the “poor devs that would be hit by it” gave a damn.

GDPR was passed and companies were given two years to adjust their software/systems/business practices to comply. Hardly any of the “let’s have a discussion shall we” devs gave a damn until the last few months of the transition period.

And only when they realized that they had to actually do something, something they should have done literally years ago, we had (and still have) this fake outcry of “boohoo these laws make us work hard and do right things and we don’t wanna”.

Cry me a river.

As a top engineer at an EU-headquartered company, I can be one instance saying this was not true of us. We started our preparations almost a year and a half in advance of the May 2018 deadline. Once we engineers and our GC were done interpreting the extent of what we believed we needed to do and the resources to do it, we were basically ordered by the CEO to do as little as possible as late as possible, automate as little as possible, and just wait to see if anything came of it. I left the company a few months after GDPR-day so cannot say how it worked out, but it was the CEO’s company and his choice to do it in a way that it then became my responsibility to implement.

Compliance/legal is a company risk and as I indicated in the challenger article here a few days ago, as an engineer I can advise on what the risks are and the potential consequences of bad outcomes, as well as the costs to reduce them. The business decides what level of risk to take. I personally would have preferred a robust response to GDPR and thorough internal procedures, but it was not my call to make.

Of course, I personally believe that we humans should own our data and digital footprints, so I agree with a lot of the concepts behind GDPR and CCPA even if I do not agree with all and as an engineer may think some are ... silly/overzealous/misguided or what have you. Case in point: the IP tracking discussion above. If I hit your network, that's on me (barring externalities or bad actors, etc.). Retention periods and use definitions are fine, but a requirement to treat it as PII or other super sensitive data seems a bit much to the engineer in me.

Yes, true. In the end it comes to business decision. My focus on devs is mostly because it's devs who comment and complain on HN, so my comments are mostly geared towards them.

It's true, businesses (or people who run them) will in the end judge the direction where the company will go, and their judgment is often worse than that of developers.

So yes, I would replace "devs" etc. with just "companies" in my comment.

I wouldn't say "worse" judgement in general. Just "different" in general. I have had both "worse" and "better" cases.

However, the better integrated and communicated the company's goals and rationales are, the more aligned the judgements become.

>Going into effect in a year? Seems like a business opportunity.

GDPR was law two years before it came into effect and everyone left it until the last ~month.

When using personal data is outlawed, only the outlaws will use personal data.

What about all of the state actors (and 'hackers') who are cracking corporations for data and building a massive database on everyone?

> When using personal data is outlawed, only the outlaws will use personal data.

This argument only works if you feel the thing being outlawed is good (it is most commonly used in the context of privacy). To your statement I would respond the same way as I would respond to "When shooting people is outlawed, only the outlaws will shoot people": sounds good to me!

I meant, in jest, that if laws get tougher on the private sector, I hope that the government also throws a lot more money at data crime, too.

So, devil's advocate here: why not just require your ToS to state that if the user is from the state of California, that they are to not use the service and find a local alternative?

It is a state law, they can't hassle you if you're not Californian and do not service their target market. Most of America doesn't live there, and California seemingly doesn't want you to do business there.

Because there are a huge number of users in California, and it’s also the fifth largest economy in the world. Ignoring California is probably throwing away a big market.

You could say that about Europe too wrt GDPR, but you should note that almost everyone is becoming GDPR compliant too because it’s a big market.

>So, devil's advocate here: why not just require your ToS to state that if the user is from the state of California, that they are to not use the service and find a local alternative?

Silently redirect them to a similar-enough site run by a partner company that's based in another state/country.

That would be considered an anticompetitive behaviour.

Does this mean another swarm of privacy popups everywhere again?

If we're lucky.

ot: What's wrong with this website? It loads super slow and behaves very weirdly on my iPhone.

Apologies. We're using wix right now. We'll be moving off shortly.

Ah. yeah, that's normal for wix :| It works for making sites, but it never works well.

I'm on a PC and the site was also behaving strangely for me. The site doesn't display a scroll bar, so I could not scroll down and read the article. It worked using a different browser.

This seems like a good discussion to ask this in: have any of you used a tool like pg_anonymizer [0] to mask your data when building test/dev databases? I see several tools on there, but have no idea where to begin with them.

[0] https://pgxn.org/dist/postgresql_anonymizer/0.0.3/
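I haven't used pg_anonymizer specifically, but the core idea is easy to sketch outside the database: replace direct identifiers with deterministic pseudonyms so joins and foreign keys still line up, while no real PII survives into the dev copy. A minimal Python illustration (the salt, column names, and hashing scheme here are my own choices for the example, not the extension's API):

```python
import hashlib

def pseudonymize(value: str, salt: str = "dev-env-salt") -> str:
    """Deterministic pseudonym: the same input always yields the same
    token, so relationships between rows/tables are preserved."""
    return hashlib.sha256((salt + value).encode()).hexdigest()[:12]

def mask_row(row: dict) -> dict:
    """Mask direct identifiers; leave non-PII columns intact."""
    masked = dict(row)
    for col in ("email", "name", "ip"):  # example identifier columns
        if col in masked:
            masked[col] = pseudonymize(str(masked[col]))
    return masked

row = {"id": 42, "email": "jane@example.com", "plan": "pro"}
print(mask_row(row)["plan"])             # non-PII untouched: pro
print("jane" in mask_row(row)["email"])  # False: identifier replaced
```

Tools like pg_anonymizer do roughly this inside Postgres itself (plus realistic fake values instead of hashes), which avoids ever exporting the raw data to the machine doing the masking.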

Already starting to deal with this where I work. It’s gonna be interesting...

If a company distributes a program or app that processes your address, personal data, geolocation, etc. _on your device only_ and the sensitive data never leaves your device, are they subject to the CCPA?

So are we going to have websites blacklisting CA IPs and answering back with some vague "this content is not available in your region?"

Hopefully not, unless we speak about some hyper-local businesses.

It's now GDPR + CCPA, so you are cutting off EU and California. Probably, more to come.

For example, seems like LA Times does not block EU anymore.

The way this came into existence is what scares me.

The "if they knew what we know" part or the you can (kind of) buy policy part? If the latter is shocking to you, I have some very bad news for you. (The process would have been more difficult and more expensive if the law wasn't genuinely benefiting the people, but still possible.)

Also, Mr. Mactaggart, what a guy!

The part that bothers me is how hastily the "compromise" was drafted, without any public debate. I don't like the idea of an individual holding the legislative process hostage.

There are limits on campaign contributions, perhaps there should be limits on individual contributions for these signature drives, which are essentially just large marketing efforts.

And just because this guy got x number of signatures, I don't see why he should now have the power to make compromise deals with the government.

This isn't my area of expertise so I may be missing something here.

IMO the biggest difference between CCPA and GDPR is that GDPR does not distinguish between large and small companies. Everyone needs to comply. At least with CCPA you can bootstrap a company and not have this be another thing you need to worry about on day 0.

No, just once you reach 50k visitors to your site.

Can we stop talking about how privacy laws are hitting devs, and start talking how they will benefit people?

Boohoo, poor devs need to finally pay attention to people’s private data.

You need to talk about both costs and benefits when discussing public policy. Otherwise, you end up with a ton of terrible policy that looks good due to an obvious tangible benefit, but nets out to more harm than good.

For example, minimum bedroom sizes for rental units. Seems nice to have enough space to live comfortably, right? End result though is the $20M apartment complex has 35 units instead of 40, and is only built later when rents have gone up to make the project make sense financially, exacerbating a housing shortage.

Let’s look at the cost, shall we?

Invasive and pervasive surveillance. Private and sensitive data sold wholesale not even to the highest bidder, but to anyone.

Hell, when news about NSA surveillance broke, it was a huge scandal that was the focus of attention of all media for more than a year. Now Facebook alone is reported to have the same level of maliciousness and willful ignorance on a monthly basis, and it’s business as usual.

So yes, I don’t give a rat’s ass about the “poor developers” who couldn’t get their shit together and provide privacy and security to the common people. And who now pretend they are being unfairly punished by governments.

And yes, I’m a developer myself.

"these costs fall on people who I feel deserve it" isn't a good reason to completely ignore the size of the costs being imposed. Especially since these costs are sublinear with respect to organization size, causing the tech behemoths you complain about to get a free competitive advantage against upstarts threatening their business model.


I'm rather confused as to why you're harping on about whether or not developers "deserve sympathy". I'd be making the same points about pretty much any business regulation - that they impose costs, and that we need to be cognizant of them in order to make sure it's a net positive. If the benefits outweigh the costs, then the regulation is a good thing. If they don't, but you advocate for it anyway because you don't care about hurting a specific class of people that don't "deserve sympathy", that makes you quite a mean-spirited person.

>If anything, startups benefit: they have less data and systems.

It's not about the absolute costs of regulatory compliance, which are relatively small. It's about the relative costs of compliance compared to the economic value of the regulated activity. Google has roughly a million times more revenue than a ten-person start-up. Privacy compliance is not a million times more expensive for Google than it is for the start-up. If it costs a startup a day of engineering effort to comply, and it costs Google ten million dollars, this is a relative business advantage for Google.

This is a pretty general pattern; established businesses get a competitive advantage from regulation, since it prevents competition from arising. If it costs $400 to get your setup inspected before you can sell lemonade that you make, this helps Nestle sell more bottled lemonade at the cost of your kids' lemonade stand.

> If they don't, but you advocate for it anyways because don't care about hurting a specific class of people that don't "deserve sympathy", that makes you quite a mean-spirited person.

Let's not gloss over the fact that the specific class of people who are "hurt" are the ones causing the hurt. If they only collected data they needed and secured the data they did collect, the regulation wouldn't be needed in the first place.

It's not mean-spirited to expect people who have widely profited from collecting bulk data to foot the bill for securing that data.

Once again, I've yet to see a compelling study addressing any of the emotional points about economic burden you make.

I see one clear net benefit: without regulation companies had zero care for private data. Well, time to suffer for it. I won't shed a tear.

I like the general idea and I like that it specifically applies only to organizations with more than $25 million in revenue. Give small startups a break.

Does the GDPR also have a lower limit like this? It should.

The criteria are "one or more of the following", not a combination of them all.

So if you make more than $25 million, OR you have more than 50k users or devices, OR you make more than 50% of your money selling data.
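As a sketch, that "one or more" structure is just a boolean OR over the three thresholds as described in this thread (not legal advice; the statute's exact definitions of "consumer", "sell", etc. are more involved):

```python
def ccpa_likely_applies(annual_revenue_usd: float,
                        ca_consumers_households_devices: int,
                        pct_revenue_from_selling_pii: float) -> bool:
    """Rough sketch of the CCPA applicability test as described above.
    Any ONE criterion triggers coverage -- they are OR'd, not AND'd."""
    return (annual_revenue_usd > 25_000_000
            or ca_consumers_households_devices > 50_000
            or pct_revenue_from_selling_pii > 50.0)

# A small shop under every threshold is out of scope...
print(ccpa_likely_applies(1_000_000, 10_000, 0.0))   # False
# ...but crossing just the user/device count alone pulls it in.
print(ccpa_likely_applies(1_000_000, 60_000, 0.0))   # True
```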

Seems like the second one is the real problem. "50K users or devices" is less than 0.02% market share, even if you have only US customers, and for businesses with margins in the $1/user/year range it doesn't even cover one full time employee.

You can end up with that many users on a side project all of a sudden if it gets posted to the front page of a site like this one.

And it doesn't even have to be users in the signed-up sense if you simply have access logging turned on for your web server; 50k unique IPs would be enough.
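To gauge how easily raw access logs cross that line, here's a quick sketch that counts distinct client IPs in a common-log-format file (I'm assuming the IP is the first whitespace-separated field, as in Apache/nginx defaults; adjust for your server's format):

```python
def count_unique_ips(log_lines):
    """Count distinct client IPs, assuming the IP is the first
    whitespace-separated field (common/combined log format)."""
    return len({line.split(None, 1)[0] for line in log_lines if line.strip()})

sample = [
    '203.0.113.5 - - [10/Oct/2019:13:55:36] "GET / HTTP/1.1" 200 512',
    '203.0.113.5 - - [10/Oct/2019:13:55:40] "GET /a HTTP/1.1" 200 100',
    '198.51.100.7 - - [10/Oct/2019:13:56:01] "GET / HTTP/1.1" 200 512',
]
print(count_unique_ips(sample))  # 2
```

A site with modest traffic retaining logs for a year could plausibly accumulate 50k unique IPs without ever having a signup form.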

50K California customers.

So less than 0.13% market share then.

Assuming you have any way to reliably identify which state your users are in -- which means we're back to "privacy regulations" encouraging companies to collect more data on their users.

“He started worrying about data privacy after talking with a Google engineer and spent nearly $3.5 million in 2017/2018 to place an initiative on California's November ballot.”

That Google employee must be somewhat nervous these days...

>Process personal information of >50k consumers, households or devices

>Derive >50% of revenue from selling PII

So if I forward all of the data to another company outside of CA, does my company count as processing data?

What if the code that forwards that data is written by another company and I'm just hosting it on my site? Everything goes through their code and I'm paid to just setup a website to host their code.

Maybe I do collect info in CA but I sell the data for $1, but the company also buys some consulting services for the actual price of that data that I'm selling them?

> So if I forward all of the data to another company outside of CA, does my company count as processing data?

You are still processing that data. Part of processing that data involves you shipping it off...

> What if the code that forwards that data is written by another company and I'm just hosting it on my site? Everything goes through their code and I'm paid to just setup a website to host their code.

You are as responsible, if not more, in making sure that compliance is met. You are the one hosting the code. The data is moving through your servers.

> Maybe I do collect info in CA but I sell the data for $1, but the company also buys some consulting services for the actual price of that data that I'm selling them?

That's just being a jerk. But better hope you don't pass the 50k mark...

>You are still processing that data. Part of processing that data involves you shipping it off...

>The data is moving through your servers.

So if a random company gets breached, everyone involved from cloud providers to ISPs are also responsible because they facilitated moving and storing the data and they are just hosting code?

This is problematic. Cloud providers give you permission to publish code. I could position myself to allow another company to publish code on my popular website to collect data and my role is basically no different than a cloud provider. We don't have to agree that is what it's specifically for, I just need to give them access to upload their own code for whatever expensive fee.

>So if a random company gets breached, everyone involved from cloud providers to ISPs are also responsible because they facilitated moving and storing the data and they are just hosting code?

ISPs aren't (supposed to be) "storing" that data. They are transferring bits between computers. You, on the other hand, are hosting a website with some sort of form that people input PII into. You are accepting that PII; whether or not it gets forwarded is irrelevant. You are processing it. So do your due diligence, contact your users and let them know what is going on, and speak with a lawyer for more information.

>You on the other hand are hosting a website with some sort of form that people input PII into.

That's what cloud providers do! If there's a spirit-of-the-law that is supposed to protect them, this would be a good time to write that in!

Do they specifically mention rental cars in the code of law, when they say that the driver can't drive over the speed limit?

"Process PII" is incredibly vague. You could define that in a hilarious amount of ways with the amount of complexity we introduce to our software products, especially with code we don't even write ourselves that widens your security surface.

This is especially true if you use a service that allows others to inject code into your code base. If NPM has a security failure that leads to a breach at a company, who is at fault? Both? Or only the company that chose to use the code? An NPM package might be processing PII after all. Does that mean NPM can never be held responsible for security breaches?

Secondly, your example would be backed up by historical cases and this law is brand new, so it is not clear. I'm not even sure how you guys can confidently argue that the new law ISN'T outright vague.

>> You could define that in a hilarious amount of ways with the amount of complexity we introduce to our software products, especially with code we don't even write ourselves that widens your security surface.

You could define a hilarious amount of ways in which your chef can pee in the broth you ordered at a local diner. But it generally doesn't happen, does it?
