Operation Rosehub – patching thousands of open-source projects (googleblog.com)
727 points by fhoffa on Mar 1, 2017 | 117 comments

So many questions... What does this say about Google's hiring, about its employees' values, about values across the tech community? I can remember a time when management would have shut this down, when employees would have said, "not my problem", when entire industries would have buried their heads in the sand.

Is it the lack of liability and regulation that clears the way for this kind of corporate citizenship? Is it cultural?

When your company is the web and the web is your company

When you openly celebrate really nerdy technical achievements

When you incentivize and promote side projects and open source

There grows a camaraderie with the projects and tools that make the web what it is, and the volunteerism that open-source projects require

Really inspiring.

It may be, in part, Google's giant cash machine, which makes it possible to be altruistic. In a world of fierce/commoditized competition it is much harder to expend resources on 'side' projects.

Google also thrives in a healthy internet.

"They were happy to see employees spontaneously self-organizing to put their 20% time to good use."

Reminded me that I heard Google's 20% time generally had something like 10% participation and has had a number of conditions, including manager approval, for the last 4 years. Is this article breathing new life into the myth of 20% time, or does it reflect a revival of the process?

Either way, incredible accomplishment on the patch army!

20% time has always meant different things depending on who you talk to. It's likely that you're just talking to different people.

I took copious amounts of 20% time in my time at Google (2009-2014). Usually I'd start a 20% project with neither my manager's knowledge nor his approval; if it looked like it had legs, I'd let him know about it and ask what he thought. I never had a manager outright forbid me from working on a 20% project; responses ranged from "You should consider this your main project now; it's critically important that we understand this area" (along with a spot bonus for delivering on it) to "Well, you can work on it, but you are unlikely to get credit for it come promo time." In general, as long as I got my work done, my managers didn't care what else I was working on.

Was your workload such that you could put in a reasonable number of hours on your 'work' and still have 20% of your time for other projects? Or was 20% time just putting in an extra 20% on top of what other developers (who didn't do 20% projects) would do?

Over my career there, yes, I could put a reasonable number of hours on my work and still do 20% projects. There were short periods of time when I was asked to "bank" the time and focus entirely on my main project, e.g. I might work straight for 5 weeks on my main project and then take a week straight of 20% time once it launched.

This suited me fine - usually the way I took 20% time was to fit it into time periods when I didn't really have much else to do or I was bored with my main project. And splitting it up like this let me focus more intently on both of them, which helped in delivering.

Cool, sounds like it worked out for you about the way I would expect. I suppose it may depend some on your team, I have heard some people say it is just 20% extra work.

Googler here. I recently had a 20% project that sometimes had bursts of 100% busyness (but still ~20% on average). It also was not relevant to my team and was for the benefit of "everyone". No explicit manager approval was required.

It certainly depends on the particular manager or even org, but in general 20% projects are still a thing for those who want them (not many).

> in general 20% projects are still a thing for those who want them (not many).

Why is that? Any more insight on that?

It's not uncommon for people to put the additional effort (that they could spend on a 20% project) into fixing stuff in the primary project - like voluntarily adding themselves to a build cop rotation, doing code/readability reviews as a service, and so on. And sometimes the primary project is tight on deadlines and people want to stay focused.

And even when a person is enjoying the 20% opportunity, most of the time there is no good project to take on. During my time at Google, I've done many 20% projects (>10), but only had a 20% project maybe a third of the time.

People enjoy their work in different ways. Some like being 100% on their main project, some like involving themselves in many smaller efforts (both across their main project and across 20%). Also, the default state is to not have a 20% project, the only people that will work 20% on something else are those that made a specific effort to do so.

Googler here, 20% projects are very real. It's a flexible tool between employee and manager that gets used in a lot of ways. The only recent guidance was that 20% projects should benefit the company. And, mileage will vary depending on the relationship between employee and manager. Some employees and managers use the program well, others don't do as good of a job, but it's still very alive and well.

IMHO, this is a good example of 20%.

20% time never went away. There's especially a lot of code health 20% projects, and like the article says, this is applying that spirit to open source.

As other Googlers have said (and yes, I'm also one), 20% time is a bit misunderstood and always has been. Suffice it to say that it never died, and different groups have different levels of participation - sometimes, some teams are focused on a launch so the percentage of time spent goes up or down. But generally speaking, yes, this is exactly the sort of healthy thing that we encourage.

Not sure I follow your first point but the latter one I think hits the nail on the head.

Google benefits greatly from a healthy code ecosystem. Any exploit could easily come back and bite them in the ass. As big as they are and as much code as they pump out, they still rely on third-party code, and fewer exploits in the wild is a big win for them.

Google is primarily a capitalist corporation and by definition not altruistic. It's accountable to its shareholders; if the CEO couldn't make a case for e.g. how employees spending part of their time on side projects somehow benefits the bottom line, this liberty wouldn't be allowed.

Google employees own >50% of voting rights of the company stock. They aren't subject to external interests in the same way most other public companies are.

On a tangent, why shouldn't we celebrate projects that benefit both private and public interests? IMO, society would be quite a bit better if the two were more often aligned (like you suggest they may be in this case).

And employees want to see the stock grow. Right?

I wasn't implying that it's bad when private and public interests are aligned. What I'm saying is that this is not the same as altruism, and if private and public interests are at odds, the private ones will likely win as they have before at Google. This is just the nature of a capitalist company. Let's not have illusions about that.

While obviously, yes, we are a public company and aligned with shareholder value, you are vastly underestimating the core of the altruism in the Googlers doing this work (and working on Elections, and G4NP and so on)...and who are also a large amount of the shareholders involved.

Google is also a big part of the reason for the poor health of the web.

You're going to have to explain that one a bit. Or perhaps you think Tim Berners-Lee is responsible for popunder ads, and Jon Postel for unsolicited Viagra spam?

Their browser market dominance (which they got in part through shady marketing strategies) is a big part of it.

How do you think we got here: https://news.ycombinator.com/item?id=13763759

> Their browser market dominance (which they got in part through shady marketing strategies) is a big part of it.

I have to point out that Microsoft are still the biggest supplier of desktop Web browsers.

Of course, we all know that Tim makes money by selling ads. And Google, on the other hand, has nothing to do with it /s

Hey ho.

So I'm on Google's open source team.

Rosehub was driven entirely by rank-and-file Googlers who saw the MUNI hack and thought "This sucks and it's so simple to fix and it shouldn't happen." I remember Justine sending out some messages to various mailing lists with the identified vulnerable repos in a spreadsheet, and then people volunteered to send pull requests to repos they claimed off the spreadsheet.

It's the sort of thing that happens because Googlers are actually passionate about software in the large; they just happen to work at Google. The only things I think that Google as a corporate entity does to foster that is hire the same sort of passionate people, and let them decide how best to use their work hours. This is the sort of effort that naturally arises from that.

It's the same sort of thing that causes Googlers to want to open source their code (and get the director approval to do so). Googlers do it because they want to give back to software in the large, not for any perceived benefit of getting a community to work on the code (we have thousands of repos on https://github.com/google that never get any external contributions).

To clarify the timeline a little bit here... Rosehub wasn't actually motivated by the MUNI hack, as it predated it by 8 months (Rosehub started in March 2016, MUNI hack was announced in November). As noted in the blog post, it was instigated by Justine seeing that open source packages weren't updating their dependencies to protect themselves, then doing some digging and realizing just how widespread the problem was.

However, the MUNI hack certainly did motivate being more public about the project and writing this blog post, since it really helped underscore the severity of this vulnerability in very real, concrete terms.

Well, one thing is that Google engineers are the highest paid in the software industry. This allows anyone to move away from the "grow my career by preparing for interviews in my free time" mindset to coding for pure fun. This might not be the only reason, but it is definitely one of them, and it's why you won't find this attitude at, say, Microsoft (which has the lowest pay among the Big 4/5).

> "grow my career by preparing for interviews in my free time" mindset

That must be a sad life. Is this common? I work as a game programmer, and I and almost every other game programmer I know code for fun, and we are not paid very well.

This attitude is why game development is such a terrible career. You get pennies for doing a difficult job under the guise of "yay fun", as though it is not a multi-billion-dollar business. You will be hard pressed to find a bigger mug than a mid-level game developer. I remember working on the physics of Resistance at Sony for the PS3, and I was picking up 34 bags a year. Pretty much every other programming role I have had since is not only orders of magnitude easier, but pays way, way more. It makes no sense to me at all.

Not that I would turn down a higher salary, but I don't think it's pennies. It's well above the average salary in general, and probably average in the software industry. Not many game companies earn billions (except some outliers like GTA, COD, Minecraft, ...); the average game company is on the verge of bankruptcy all the time. It's well worth it compared to a higher-paid programmer whose job does not bring him happiness.

Well, it's a filter bubble. You can find enough people in both mindsets. It's mostly not related to pay -- related to job satisfaction, career growth options etc.

It is a showcase for BigQuery, and also a big ad for Google aimed at any software engineer or security engineer who actually cares about doing something good. I guess it is a win, win, win situation. :)

It's great advertising.

I mean, yeah, it kind of is, but do you really think Google needs advertising for its brand?

I think Google is doing some of the most impactful security work I have ever seen from a company. Within the past few years, Google and Microsoft have been instrumental in moving the bar and the industry. They both catch flak for privacy issues (which is a leg of the InfoSec triad), but I think most people should be grateful for the work they do.

I always find it funny that people think like this. Do you think only struggling companies should strive to grow their brand? Market leaders stay market leaders with PR/branding and of course, their products.

Products like Coke are internationally recognized, but recognition != likeability, which is why they continue to pour big $$$ into advertising.

> I mean yeah it kinda is, but you really think google needs advertising for its brand??

This is advertising BigQuery[0] which is a Google Product people may be less familiar with. This nicely shows off what can be done using a large dataset.

I'm grateful for the work they've done here, and I have no doubt individual employees donated either their free time or 20% time (or w/e it is now). But that doesn't negate the fact that this is an advert for BigQuery.

[0] https://cloud.google.com/bigquery/

I think you're wrong. Folks used the most convenient tool at their disposal that is based on the internal tool.

Sure, but so is anyone making big donations to charity or doing anything big really. Does that mean it's a bad thing? I'd much rather companies do things like this for advertising than something that only purely benefits themselves.

Widespread community trust in the internet is what keeps Google financially viable.

This is one of the most impactful projects I've seen built using the GitHub source on BigQuery dataset (since we published it).

If you want to see other use cases - I've collected plenty of other stories from multiple parties at:

- https://medium.com/google-cloud/github-on-bigquery-analyze-a...

Disclosure: I'm Felipe Hoffa and I work for Google Cloud (https://twitter.com/felipehoffa)

<meta> Title change by mods --

I submitted this post as "Googlers used BigQuery and GitHub to patch thousands of vulnerable projects". After it got to #1 on the front page, mods silently changed the title to "Operation Rosehub – patching thousands of open-source projects"

I wish HN had a more transparent way to show that the mods changed a title and why. Since HN does not, the least I can do is add this info for transparency.

(related https://news.ycombinator.com/item?id=6572466 https://news.ycombinator.com/item?id=4102013)


IMHO the title coming from mods sounds better. Your title seemed like you're explicitly advertising for BigQuery.

I got a bit confused by your post. I gather that you are simply recording the fact that the title was changed in accordance with the posting guidelines, not that you are complaining about it. Just in case other people were similarly confused...

I think he wishes that here was an "edited" annotation next to changed titles.

I've built something like this for Python projects.

You add your repo and a bot is constantly checking for insecure and/or outdated packages and sends you a pull request if you need to update.

It's free for open source projects at https://pyup.io
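At its core, a bot like this compares each pinned requirement against a database of known-bad releases. A minimal sketch of that idea (the vulnerability data and the `audit` function here are illustrative assumptions, not pyup's actual code or database):

```python
# Hypothetical vulnerability database for illustration only -- these
# package/version pairs are made up, not real advisories.
KNOWN_VULNERABLE = {
    "requests": {"2.5.3"},
    "django": {"1.8.0"},
}

def audit(requirements_txt: str) -> list:
    """Return pinned requirements that match a known-vulnerable version."""
    findings = []
    for line in requirements_txt.splitlines():
        line = line.split("#")[0].strip()  # drop comments and whitespace
        if "==" not in line:
            continue  # only exact pins can be matched against the database
        name, _, version = line.partition("==")
        if version.strip() in KNOWN_VULNERABLE.get(name.strip().lower(), set()):
            findings.append(f"{name.strip()}=={version.strip()}")
    return findings

print(audit("requests==2.5.3\ndjango==1.9.0\nflask\n"))
# ['requests==2.5.3']
```

The real service does far more (resolving latest releases, opening PRs, handling ranges), but the pin-versus-advisory comparison is the heart of it.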

We've been using pyup for a few months now and it's awesome. Thank you.

I was initially concerned that constant PRs for dependencies would be too noisy for our team, but it turned out that it's configurable enough and gracefully handles us ignoring or closing PRs that we evaluate and decide to wait on.

It's a great service that all python developers should be using. (I say that because I want as many pyup users as possible so it never goes away)

Bookmarking this. I recently was alerted to a CRLF injection vulnerability on the service I manage because an outdated dependency was vulnerable. This could have nipped that in the bud.

Is there something like this for all languages / os / etc? Seems like an amazing tool to poll against.

Just signed up and discovered a vulnerability. Thanks!

A quick suggestion, consider adding a full sample configuration along with your config docs: https://pyup.io/docs/configuration/

It's a lot easier to see how it all sits together with a sample.

CircleCI has a great example: https://circleci.com/docs/1.0/config-sample/

Similar to Greenkeeper.io for npm?

Why does your service ask for write access to my repos?

For the pull request. The bot creates a branch, commits the changes and sends you a pull request on your repo.

This is how it looks: https://github.com/pydanny/cookiecutter-django/pull/1065

Thanks for the reply. I guess forking each repo wouldn't scale well?

I like the "bank teller" analogy used in the article.

> it would be like hiring a bank teller who was trained to hand over all the money in the vault if asked to do so politely, and then entrusting that teller with the key. The only thing that would keep a bank safe in such a circumstance is that most people wouldn’t consider asking such a question.

This doesn't just apply to deserialization issues.

It is a great analogy for a huge class of IT security issues!

Maybe we should use that one when communicating with the media. It works much better than the usual burglary analogy. I like how it points out that this is about stupid and/or malicious behaviour (code), where the attacker (hacker) just needs curiosity, and may even find it out by accident. The attacker did not have to break anything, and did not damage anything, to get in. In particular, this makes clear that the problem is caused by irresponsible behaviour of the organization and/or other entities to whom they delegate trust.

Even for more complicated scenarios, I like the bank teller analogy more than the classic burglary analogy. In that case, the attacker observes multiple bank tellers, and notices, e.g., that if you ask the first teller for form A and put in certain words, another bank teller will accept it and give you a stamped form B, which you can show to a third teller in another branch office who will look a bit confused, but finally accept it and hand over all the money to you.

We need to get over blaming the messengers[1], buying zerodays and declaring cyberwar. What we really need to do is to finally make our[2] computer systems secure and trustworthy, at least up to a certain minimum-level of sanity: no exec, no injection (i.e. typing/tagging), no overflows (i.e. static analysis), input validation, testing, fuzzing, you name it.

And this cannot work by just adding more and more complex security measures on the outside; more important is simplifying and cleaning up the inside. Although rewriting software from scratch is very risky, radical refactoring is not! And every good software engineering course tells you how to do it correctly.

[1] security researchers, but also "amateur" hackers, or just someone running into it by accident because the security issue became so large it finally had to be noticed by someone.

[2] in the sense of: everyones!

I know it is not correct to add "me too" comments here, but I came here for the same quote as you. It is the best quote I have ever seen about security, because it does not depict an evil hacker breaking through an insufficiently secure wall. It depicts a clever client walking through a stupid security hole. That's a better analogy for 99% of security hacks.

Is https://libraries.io not a more comprehensive and community-focused response to the same problem?

libraries.io did make it to the front page a few months ago, but I think its underlying vision might not have been driven home from just glancing at its home page. It supports 33 package managers (not just Java, though I'm sure Rosehub doesn't just do that either) and Github/Gitlab/Bitbucket, not just Github. And it provides both email notifications and auto PRs.

But that's just the overlap with Rosehub. On top of that it offers the means to discover libraries based on a Dependency Rank (think Page Rank but using dependencies instead of hyperlinks). Which in turn allows it to surface projects with a high "Bus Factor" -- projects maintained by few committers, but depended on by many (so they'd be more affected by said committers getting run over by a bus). AND it mines the licenses for a project, notifying if any of the dependent licenses are incompatible with the parent license. What's more it's a non-profit organisation receiving enough funding to employ 2 full time devs.

I think libraries.io is Rosehub and more. To quote the about page:

  Our goal is to raise the quality of all software,
  by raising the quality and frequency of contributions
  to free and open source software; the services,
  frameworks, plugins and tools we collectively refer
  to as libraries.
To take the liberty of extrapolating from the libraries.io vision: open source security isn't just about fixing patches, but about supporting the environment, people, conditions and tools that contribute to open source software.

I see nothing on the libraries.io website that explains how it would be used to solve the problem described in the OP.

OP here, yeah I agree, like I said the home page could be more explicit.

Here are some links:




I am extremely sad that this turns into an argument for making certain that all source code in the world is at least indirectly accessible specifically via GitHub (at which point people will find it there and expect the developers to respond and generally track everything going on there, even projects which are much happier using more open tools). It isn't sufficient that your code is "open"; it actively has to be part of the unified GitHub empire.

Your gitweb [0] has always worked perfectly fine for me! I agree, I don't see the need for Github.

[0]: http://gitweb.saurik.com

In their query they do:

   FROM (SELECT id,content
      FROM (SELECT id,content
         FROM [bigquery-public-data:github_repos.contents]
         WHERE NOT binary)
     WHERE content CONTAINS 'commons-collections<')
Why the subquery? Why not WHERE NOT binary AND content CONTAINS...? Is this a BigQuery thing?

This appears to be "legacy SQL" in BigQuery, which did not have cost-based query optimization - query planning was entirely rule-based. The query is indeed a little inefficient.

BigQuery has since released ANSI 2011 "standard SQL", which does have an optimizer and pushes predicates down.

(work on GCP and worked on BQ until recently)

Good catch. It seems like an artifact of working and editing a query until they got what they wanted - certainly not a BigQuery restriction.

(like this https://www.youtube.com/watch?v=cO1a1Ek-HD0)

Note that the published query scans 2.25 TB of data. That's impressive, but for better workflow and cost management I would split it into a 2-step process:

- First extract all the files I'm interested in to a separate table (all pom.xmls?).

- Then run whatever analysis you want over those files.

It should work fine either way. Using standard SQL in BigQuery, though, you can do:

    SELECT pop, repo_name, path
    FROM (
      SELECT id, repo_name, path
      FROM `bigquery-public-data.github_repos.files` AS files
      WHERE path LIKE '%pom.xml' AND
        EXISTS (
          SELECT 1
          FROM `bigquery-public-data.github_repos.contents`
          WHERE NOT binary AND
            content LIKE '%commons-collections<%' AND
            content LIKE '%>3.2.1<%' AND
            id = files.id))
    JOIN (
      SELECT
        difference.new_sha1 AS id,
        ARRAY_LENGTH(repo_name) AS pop
      FROM `bigquery-public-data.github_repos.commits`
      CROSS JOIN UNNEST(difference) AS difference)
    USING (id)
    ORDER BY pop DESC;
Better yet, it runs faster than the legacy SQL query :)

As a disclosure, I work on the project to support standard SQL in BigQuery.
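Incidentally, the odd-looking substrings in these queries ('commons-collections<' and '>3.2.1<') work because they match against the angle brackets of the surrounding pom.xml tags. A rough Python sketch of the same heuristic (my own illustration, not code from the project):

```python
def looks_vulnerable(pom_content: str) -> bool:
    # Mirrors the query's substring checks: 'commons-collections<' catches
    # the end of an <artifactId> tag, and '>3.2.1<' catches the body of a
    # <version> tag. It's a crude heuristic, not a real XML parser.
    return ("commons-collections<" in pom_content
            and ">3.2.1<" in pom_content)

pom = """
<dependency>
  <groupId>commons-collections</groupId>
  <artifactId>commons-collections</artifactId>
  <version>3.2.1</version>
</dependency>
"""
print(looks_vulnerable(pom))  # True
```

The trailing `<` is what keeps `commons-collections4` or `>3.2.1.1<` from matching by accident.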

Maybe they want to make sure binary data is filtered out first? I'm not exactly sure how rigid the evaluation order is in SQL or BigQuery.

Wow. I wonder how much a query that searches the content of all of Github costs (if you're not Google). This page says the dataset is 3TB+ https://cloud.google.com/bigquery/public-data/github and presumably most of that is content.

The contents table [0] which they ran the query on is 1.8TB. Assuming you only need to do a single pass (seems reasonable given that it's a simple regex), the price should be about $9 [1]. Free quota covers 1TB so the remainder would be $4.

[0]: https://bigquery.cloud.google.com/table/bigquery-public-data...

[1]: https://cloud.google.com/bigquery/pricing
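The arithmetic above, as a quick sketch (assuming the $5 per TB on-demand rate and 1 TB monthly free tier quoted at the time; check the pricing page [1] for current numbers):

```python
# Assumed 2017-era BigQuery on-demand pricing -- verify before relying on it.
PRICE_PER_TB = 5.00
FREE_TB = 1.0

def query_cost(scanned_tb: float) -> float:
    """Dollar cost of one on-demand query scan after the free tier."""
    return round(max(scanned_tb - FREE_TB, 0.0) * PRICE_PER_TB, 2)

print(query_cost(1.8))                # 4.0 -- the 1.8 TB contents table, with free tier
print(round(1.8 * PRICE_PER_TB, 2))   # 9.0 -- the same scan with no free tier
```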

The data is stored in BigQuery and updated from GitHub weekly. I imagine they apply a date-range filter when they update from GitHub.

You can do it yourself for free, I believe. https://github.com/blog/2298-github-data-ready-for-you-to-ex...

Nice! That's some good citizenry.

Interesting fact: Justine was the founder of occupywallst.org, which was the highest-trafficked publisher/web hub for the Occupy Wall Street movement before she worked for Google.

"Patches were sent to many projects, avoiding threats to public security for years to come."

Are these pull requests that the project would still need to approve/merge or were they just pushed in?

They were PRs that required approval and merge from the Github project maintainers. Here is a search to see some of their work:


It's actually incredibly interesting to read how the developers individually responded to each of these PRs.

It would have been great to see a count of how many PRs have been accepted.

edutechnion's link says 1108 open PRs and 999 closed.

Interesting that 2100 of the PRs are "Upgrade Apache Commons Collections to v3.2.2" and just 7 were "Upgrade Apache Commons Collections to v4.1".

Probably v3.2.2 was lower-hanging fruit for most projects than v4, which would have required code changes.

For me, if a project had a bunch of open PRs for security issues, it would discourage me from using it for new work. It would also help break a tie in my head between keeping an old library and replacing it with something that has legs.

So even if they don't get merged, they still serve a purpose, even if it's just for a few people who behave like I do.

I'm sure they're requests. They'd need to be deployed to production too.

Still pretty awesome.

As scary as Google's massive size and power is, it's pretty awesome that they're incentivized to do things like this to help the internet because they are the internet.

I read so many of these kinds of articles out of curiosity and rarely understand them.

Thank you for adding in the part about the bank teller.

For reference: "it would be like hiring a bank teller who was trained to hand over all the money in the vault if asked to do so politely, and then entrusting that teller with the key."

> But unlike big businesses, open source projects don’t have people on staff

To read that from Google is frankly disappointing. While this is true of many open-source projects, it doesn't have to be that way. Red Hat (and Google!) are brilliant proofs of this.

More generally, if a company uses software X (open-source or not), they need to:

a) make a contract with a company that takes responsibility for X, or

b) hire somebody who takes responsibility for X, or

c) take responsibility for X on their own

It doesn't help to "buy" closed-source software X from another company if you can't count on them in case of emergency, i.e. if they vanish, go bankrupt or put their lawyers onto you.

Then it's better to use open-source software for which you can take responsibility yourself, and it may help to hire one or more of the lead developers.

Really cool, kudos to people helping with this. I wonder if this could have been done in a way that non-Googlers could have pitched in too, given that this is for a public good -- but it's tricky with security issues.

How does it work for transitive dependencies? If you use a package that uses a vulnerable Apache Commons, is a PR sent to update the package when it is updated?

If I understand it right, this bug involves code pulling in old buggy libraries, sometimes indirectly via other libraries. It seems that there is a reference to a specific bad version, not the actual inclusion of cut-and-paste code.
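A scanner that answers the transitive-dependency question has to walk the whole dependency graph, not just direct pins. A toy sketch of that idea (the graph data and function names are made up for illustration; real tools resolve the graph from pom.xml or lockfiles):

```python
# Hypothetical dependency graph: my-app pulls the bad version only
# indirectly, through web-framework.
DEPS = {
    "my-app": ["web-framework", "json-utils"],
    "web-framework": ["commons-collections:3.2.1"],
    "json-utils": [],
}
VULNERABLE = {"commons-collections:3.2.1"}

def vulnerable_chains(root, deps, vulnerable):
    """Return every dependency path from root down to a vulnerable artifact."""
    chains = []

    def walk(node, path):
        path = path + [node]
        if node in vulnerable:
            chains.append(" -> ".join(path))
            return
        for child in deps.get(node, []):  # toy data is acyclic; real code needs cycle checks
            walk(child, path)

    walk(root, [])
    return chains

print(vulnerable_chains("my-app", DEPS, VULNERABLE))
# ['my-app -> web-framework -> commons-collections:3.2.1']
```

The chain output also shows which intermediate package has to release a fix before the root project can be clean.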

Eh, why not just get rid of the bad version? Alternately, release a bug-fixed copy with the same version number.

Any breakage is a case of "oh well, you're safe now". Leaving the security hole is probably worse breakage.

In theory someone could be relying on the bug, and not be vulnerable - e.g. if they don't allow any external access to the system.

If you just re-publish the old version, it's difficult to know whether you've taken the change.

If you are going to reissue the same version number - why bother having version numbers at all?

The bug is not the inclusion of this library. This library is 100% secure [I mean, if this is a vuln in this library, then a large proportion of libraries are insecure, because they could be leaked to untrusted code and used to break the JVM trust model]. It just so happens this library used to contain a really, really useful gadget for exploiting another security problem. However, removing this gadget doesn't mean the security problem is fixed. There are other fun libraries. In fact, classes similar to the Mad Gadget have been used in the JDK to escape the sandbox in the past. Yes, stuff like this exists or has existed in the JDK [https://github.com/jenkinsci/jenkins/blob/96a9fba82b85026750...].

And this work is very useful insofar as I'm sure the benefits it provides will massively outweigh the cost. However, if you have a naked ObjectInputStream#readObject in your code, then you probably still have an exploitable security issue. Have a look at how well Jenkins' strategy for fixing this issue worked; it was basically the same strategy as Operation Rosehub, i.e. removing the ability to access classes that were known to be used in gadget chains. Surprise, surprise, it didn't last very long and people just found new gadgets.

And if you read this blog post, you might be misled into thinking that removing commons-collections from your classpath, or upgrading commons-collections to the 'safe' version, would make object deserialization safe, but this is not the case. If you have a naked ObjectInputStream#readObject in your code, then you are vulnerable to remote code execution.

Author here. There are individuals in the infosec industry who agree with you. They've stated on many occasions that the problem isn't gadgets, but rather that programming practices in general need to change. This might have something to do with the fact that no one told Apache about this weakness until nearly a year after it was presented at an infosec conference.

While gadgets may not be the root weakness, the gadgets certainly help. We may never be able to have perfect security. Hopefully the systemic paradigm shift infosec professionals are advocating will come some day. But until that day arrives, we can make people so much safer, with minimal effort, by simply disabling these gadgets.

Almost no one uses them. Out of all the projects I found, I was only able to identify one or two that were legitimately using the gadgets in question.

Thanks for your reply. This is Thursday in the UK, so I'm going to pre-emptively apologise for this rant. But we in the infosec community informed the wider community of the problems of Java serialization in 2008. That is 8 years ago. Sami Koivu, peace be upon him, showed in December 2008 that arbitrary deserialisation in Java was a security risk (http://slightlyrandombrokenthoughts.blogspot.co.uk/2008/12/c...). Not to mention that SERIAL-5 (http://www.oracle.com/technetwork/java/seccodeguide-139067.h...) of the Java Security Guidelines for Java SE has this to say:

Guideline 8-5 / SERIAL-5: Understand the security permissions given to serialization and deserialization Permissions appropriate for deserialization should be carefully checked. Additionally, deserialization of untrusted data should generally be avoided whenever possible.

And do you want to have a guess at how many times serialization was used to bypass the Java sandbox between when Sami Koivu made his blog post and someone gave a con talk about Apache? Hint: it is greater than 1. [https://tyranidslair.blogspot.co.uk/2013/02/fun-with-java-se...]

We have also demonstrated numerous times to the programming community that deserialization of user data is dangerous. For example, Stefan Esser has repeatedly shown that PHP deserialization is dangerous, both because PHP deserialization is a source of bugs in itself and because it interacts with application code in unexpected ways. We have seen the same thing in both Python with pickle and Ruby with YAML.

I'm going to let you in to a secret within the infosec community. You can find bugs by just applying existing research in new and novel ways because developers do not follow security research.

I feel like I'm falling into some rationalist fallacy by ranting at you, because you are doing something useful to improve security. But you could be doing much, much more. You have a voice, and people will actually read your blog, as compared to Sami :( You could have mentioned that people should stop doing ObjectInputStream#readObject(), or you could have pushed for updating the JavaDoc to say: THIS IS A BAD THING, DO NOT DO IT.

EDIT: apologies to anyone who realized that Java serialization was bad before the Sami post. I wouldn't be surprised if this was part of the Java secure coding guidelines before then, or if someone had exploited the issue before then. It just so happens that Sami's post was my introduction to Java serialization vulnerabilities.

As one of the people who did the talk at AppSec Cali that built on all the work outlined by benmurphy... our goal was to reach security-minded developers and talk about a repeated anti-pattern that puts software at risk in many different languages. Both Chris and I have a development background, and have seen the same issue show up in Ruby, Python, PHP, and basically anything that has an object serialization capability. We hoped to change the focus from a specific library or gadget to the idea that deserialization is inherently dangerous.

The core problem really stems from the idea that OO models encapsulate data and behaviors. Behaviors mean code execution, so anything that will deserialize objects is giving the person who serialized them the ability to control the execution flow. If this is a listener on the network, then things are really bad :-)
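A tiny Java sketch of that point (class and field names invented for the demo): any Serializable class may define a private readObject method, and ObjectInputStream runs it as a side effect of parsing, so whoever produced the bytes decides which such code executes.

```java
import java.io.*;

public class ReadObjectHookDemo {
    static class Hook implements Serializable {
        private static final long serialVersionUID = 1L;
        static boolean hookRan = false;  // demo flag: proves the hook executed

        private String payload = "attacker-chosen data";

        // Runs automatically inside ObjectInputStream#readObject. In a real
        // gadget chain, hooks like this are wired together until one of them
        // reaches something like Runtime.exec.
        private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
            in.defaultReadObject();
            hookRan = true;  // stand-in for attacker-controlled behavior
        }
    }

    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
            out.writeObject(new Hook());
        }
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()))) {
            in.readObject();  // merely parsing the bytes runs Hook#readObject
        }
        System.out.println("hook ran: " + Hook.hookRan);  // prints "hook ran: true"
    }
}
```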

So, it's great that a set of gadgets has been removed, and it's neat to see the application of resources to make that happen. I have to agree with Ben that any system that relies on object deserialization from untrusted sources (in any language) is still vulnerable; it just might require a more specific gadget chain. Too many vendors have fixed their products by just updating the library and not removing the dependency on dangerous object deserialization.

Why did no one tell Apache?


"So replacing your installations with a hardened version of Apache Commons Collections will not make your application resist this vulnerability."

Ok well you can tell your cohort that this narrative isn't going to fly anymore.

What evs

I agree. Developers should consider alternatives to object deserialization, which is a risky technology.

Google has already shown leadership in this regard, by making Protocol Buffers open source. Protobuf is a library that has served our company well. We use it at all layers of the stack. BigTable stores them. gRPC transmits them. Business logic operates on them. Closure Templates render them. Many developers outside Google have chosen to embrace this technology. We hope that it has helped them keep their users secure, just as it helps keep Google users secure.
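The safety argument for protobuf is that a message is pure schema-defined data: parsing it resolves no classes and runs no readObject-style hooks, so the sender can only fill in fields, never choose behavior. A hypothetical schema (names invented for illustration) looks like:

```proto
syntax = "proto3";

// The parser generated from this can only ever produce a UserEvent
// struct; there is no mechanism for the sender to name arbitrary
// classes or trigger code on parse.
message UserEvent {
  string user_id = 1;
  int64 timestamp_millis = 2;
}
```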

We considered mentioning this in the blog post. We decided against it. The goal of Operation Rosehub was to take simple steps that will keep people safer in an imperfect world. Suggesting that people change, or that they should adopt our way of doing things, seemed orthogonal to our mission.

However there are developer evangelists working in the company who try their best to communicate the benefits of using Google development technologies. We support their efforts too.

I wish you could do the same thing with mental illness.. massively send pull request to correct everyone's bad brain code.. <sorry>

This is a genuinely cool idea. Seriously.

It raises a lot of questions about what sort of transformative spectrum (excuse the pun) would be applied here, though. It is incredibly abstract as presented.

But even at the abstract level, the one thing I know would absolutely happen for sure is that the fixes that made the biggest difference would be hand-waved out of existence by infecting them with viruses, creating scare-campaigns, etc.

Source: I've learned a lot about Big Pharma over the past 10 years as I've quietly found real solutions to my own mental health issues. I'm sadly too scared to share what I've found and I keep seeing products disappear off the market or suddenly attract customs/overseas shipping issues. Suffice it to say that the medical industry is opposed to anything they can't patent - and that, as an industry, it must ensure its own survival. Interpret that any way you see fit.

I was thinking about doing something similar with BigQuery and the GitHub data, to search for uses of strncpy in C code. But I am not that good with the query language, and BigQuery also didn't support multiple users properly (this adds friction).
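For what it's worth, the query can stay quite simple. A sketch against the public bigquery-public-data.github_repos dataset (treat the regex as a rough first pass, not a real C parser):

```sql
-- Find C files in the GitHub public dataset that call strncpy.
SELECT f.repo_name, f.path
FROM `bigquery-public-data.github_repos.files` AS f
JOIN `bigquery-public-data.github_repos.contents` AS c
  ON f.id = c.id
WHERE f.path LIKE '%.c'
  AND c.binary = FALSE
  AND REGEXP_CONTAINS(c.content, r'\bstrncpy\s*\(')
```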

I still think it's a good idea. It would be even better to search for a few more C pitfalls, but strncpy is probably the easiest to search for.

It's interesting that this type of initiative, which is admirable, will spike up some java "popularity" metrics on GitHub.

I think this is one good concrete example of why the npm style of private dependencies for each lib is not the greatest thing ever, while the non-recursive style in python (or C) is overall more manageable (if you are actually managing your dependencies instead of ignoring them).

We have been doing this for a while now: https://www.sourceclear.com/blog/millions-of-program-builds-...

Got a 404 on the above, looks like it should be: https://www.sourceclear.com/blog/millions-of-program-builds-...

But I don't see where it discusses sending PRs to affected repos, only detecting them.

Ah yeah, I had the old link, thanks for fixing. Actually we privately disclose the problem to the developers and get it fixed following responsible disclosure, rather than posting PRs directly.

So, Google does ... something and is showered with praise.

Thousands of volunteers work in the saltmines and get nothing.

Business as usual. Myths like "Google sponsored Python!!!" propagate when they do nothing at all.


That's outstanding news. Hats off to the volunteers doing the work on this.

Mad thank-yous to Google for this!

Awesome! Congrats to the team!

What is in it for google?

Disclaimer: I work for Google and Justine (this blog post's author) was an immediate coworker at the time of Operation Rosehub.

I think you're asking the wrong question. This wasn't some top-down directive from some VP trying to come up with ways to make Google look good in the open source community. This was a bottom-up effort, that happened simply because Justine wanted to do it, and the easiest way to get it done was to recruit other like-minded engineers to help her rather than having to do it all by herself. She would have done it at any other company that allowed her (though, knowing her, she would've done it regardless).

As engineers we have agency. The decisions that I make in my day-to-day work are mine, not my employer's. I can directly impact and affect lots of things, and the only motive you need inquire about to explain it is mine.

Employee satisfaction.

And a healthier internet

Operation Rosebud

Wouldn't a graph database be a more suitable tool for that kind of task?

Why would it be a more suitable tool? What can a graph database tool do that BigQuery lacks?

If you've already got the data conveniently preloaded into a SQL database for you, and all you need is a very simple SELECT statement with two WHERE clauses ... why would you use anything else? Spinning up an entire graph database unnecessarily seems like over-engineering.

Author here. There's some truth to what he's saying. One thing I've been meaning to do is get my hands on all the Maven pom.xml files that exist, so I can load them into a Guava Multimap (my graph database of choice) and figure out every single artifact that will transitively inherit vulnerable collections on the class path.

Try Neo4j?

Why use Neo4j for a program you're going to run once and not store the data? Multimap<T, T> works great.
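A stdlib sketch of that approach, with Map<String, Set<String>> standing in for Guava's SetMultimap and with made-up artifact names: given a reverse-dependency map (artifact to the artifacts that depend on it), a simple worklist traversal finds everything that transitively inherits a vulnerable artifact.

```java
import java.util.*;

public class TransitiveDeps {
    // Walk the reverse-dependency graph from a vulnerable root and collect
    // every artifact that ends up with it on the classpath.
    static Set<String> transitiveDependents(Map<String, Set<String>> reverseDeps, String root) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> work = new ArrayDeque<>(reverseDeps.getOrDefault(root, Set.of()));
        while (!work.isEmpty()) {
            String artifact = work.pop();
            if (seen.add(artifact)) {
                work.addAll(reverseDeps.getOrDefault(artifact, Set.of()));
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        // Toy data: app-x and lib-b depend on lib-a, which depends on the
        // vulnerable artifact; app-y depends on lib-b.
        Map<String, Set<String>> reverseDeps = Map.of(
            "commons-collections:3.2.1", Set.of("lib-a"),
            "lib-a", Set.of("app-x", "lib-b"),
            "lib-b", Set.of("app-y"));
        // Prints the four transitively affected artifacts (order may vary).
        System.out.println(transitiveDependents(reverseDeps, "commons-collections:3.2.1"));
    }
}
```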
