
Operation Rosehub – patching thousands of open-source projects - fhoffa
https://opensource.googleblog.com/2017/03/operation-rosehub.html
======
fhoffa
This is one of the most impactful projects I've seen built using the GitHub
source on BigQuery dataset (since we published it).

If you want to see other use cases - I've collected plenty of other stories
from multiple parties at:

\- [https://medium.com/google-cloud/github-on-bigquery-
analyze-a...](https://medium.com/google-cloud/github-on-bigquery-analyze-all-
the-code-b3576fd2b150)

Disclosure: I'm Felipe Hoffa and I work for Google Cloud
([https://twitter.com/felipehoffa](https://twitter.com/felipehoffa))

~~~
fhoffa
<meta> Title change by mods \--

I submitted this post as "Googlers used BigQuery and GitHub to patch thousands
of vulnerable projects". After it got to #1 on the front page, mods silently
changed the title to "Operation Rosehub – patching thousands of open-source
projects"

I wish HN had a more transparent way to show that the mods changed a title and
why. Since HN does not, the least I can do is add this info for transparency.

(related
[https://news.ycombinator.com/item?id=6572466](https://news.ycombinator.com/item?id=6572466)
[https://news.ycombinator.com/item?id=4102013](https://news.ycombinator.com/item?id=4102013))

</meta>

~~~
mikekchar
I got a bit confused by your post. I gather that you are simply recording the
fact that the title was changed in accordance with the posting guidelines, not
that you are complaining about it. Just in case other people were similarly
confused...

~~~
ec109685
I think he wishes that there was an "edited" annotation next to changed titles.

------
jayfk
I've built something like this for Python projects.

You add your repo and a bot is constantly checking for insecure and/or
outdated packages and sends you a pull request if you need to update.

It's free for open source projects at [https://pyup.io](https://pyup.io)

~~~
nkuttler
Why does your service ask for write access to my repos?

~~~
jayfk
For the pull request. The bot creates a branch, commits the changes and sends
you a pull request on your repo.

This is what it looks like: [https://github.com/pydanny/cookiecutter-
django/pull/1065](https://github.com/pydanny/cookiecutter-django/pull/1065)

~~~
nkuttler
Thanks for the reply. I guess forking each repo wouldn't scale well?

------
rrggrr
So many questions... What does this say about Google's hiring, about its
employees' values, about values across the tech community? I can remember a
time when managements would have shut this down, when employees would have
said, "not my problem", when entire industries would have buried their heads
in the sand.

Is it the lack of liability and regulation that clears the way for this kind
of corporate citizenship? Is it cultural?

~~~
ISL
It may be, in part, Google's giant cash machine. It makes it possible to be
altruistic. In a world of fierce/commoditized competition it is much harder to
expend resources on 'side' projects.

Google also thrives in a healthy internet.

~~~
lloydde
"They were happy to see employees spontaneously self-organizing to put their
20% time to good use."

This reminded me that I'd heard Google's 20% time generally had about 10%
participation and has had a number of conditions, including manager approval,
for the last 4 years. Is this article breathing new life into the myth of 20%
time or does this reflect the revival of the process?

Either way, incredible accomplishment on the patch army!

~~~
nostrademons
20% time has always meant different things depending on who you talk to. It's
likely that you're just talking to different people.

I took copious amounts of 20% time in my time at Google (2009-2014). Usually
I'd start a 20% project with neither my manager's knowledge nor his approval;
if it looked like it had legs, I'd let him know about it and ask what he
thought. I never had a manager outright forbid me from working on a 20%
project; responses ranged from "You should consider this your main project
now; it's critically important that we understand this area" (along with a
spot bonus for delivering on it) to "Well, you can work on it, but you are
unlikely to get credit for it come promo time." In general, as long as I got
my work done, my managers didn't care what else I was working on.

~~~
ensignavenger
Was your workload such that you could put in a reasonable number of hours on
your 'work' and still have 20% of your time for other projects? Or was 20%
time just putting in an extra 20% on top of what other developers (who didn't
do 20% projects) would do?

~~~
nostrademons
Over my career there, yes, I could put a reasonable number of hours on my work
and still do 20% projects. There were short periods of time when I was asked
to "bank" the time and focus entirely on my main project, eg. I might work
straight for 5 weeks on my main project and then take a week straight of 20%
time once it launched.

This suited me fine - usually the way I took 20% time was to fit it into time
periods when I didn't really have much else to do or I was bored with my main
project. And splitting it up like this let me focus more intently on both of
them, which helped in delivering.

~~~
ensignavenger
Cool, sounds like it worked out for you about the way I would expect. I
suppose it may depend some on your team, I have heard some people say it is
just 20% extra work.

------
vog
I like the "bank teller" analogy used in the article.

 _> it would be like hiring a bank teller who was trained to hand over all the
money in the vault if asked to do so politely, and then entrusting that teller
with the key. The only thing that would keep a bank safe in such a
circumstance is that most people wouldn’t consider asking such a question._

This does not only work for deserialization issues.

It is a great analogy for a huge class of IT security issues!

Maybe we should use that one when communicating with the media. This
works much better than the usual burglary analogy. I like how it points out
that this is about stupid and/or malicious behaviour (code), where the
attacker (hacker) just needs curiosity, and may find this out even by
accident. The attacker did not have to break something, and did not damage
anything, to get into something. In particular, this makes clear that this is
caused by irresponsible behaviour of the organization and/or other entities
to whom they delegate trust.

Even for more complicated scenarios, I like the bank teller analogy more than
the classic burglary analogy. In that case, the attacker observes multiple
bank tellers, and notices e.g. that if you ask the first teller for form A and
put in certain words, another bank teller will accept it and give you a
stamped form B, which you can show to a third teller in another branch office
who will look a bit confused, but finally accept it and hand over all money to
you.

We need to get over blaming the messengers[1], buying zerodays and declaring
cyberwar. What we really need to do is to finally make our[2] computer systems
secure and trustworthy, at least up to a certain minimum-level of sanity: no
exec, no injection (i.e. typing/tagging), no overflows (i.e. static analysis),
input validation, testing, fuzzing, you name it.

And this cannot work by just adding more and more complex security measures
outside, but more importantly simplifying and cleaning up inside. Although
rewriting software from scratch is very risky, radical refactoring is not! And
every good software engineering course tells you how to do it correctly.

[1] security researchers, but also "amateur" hackers, or just someone running
into it by accident because the security issue became so large it finally
_had_ to be noticed by someone.

[2] in the sense of: everyone's!

~~~
reacweb
I know it is not correct to add "me too" comments here, but I came here for the
same quote as you. It is the best quote I have ever seen about security
because it does not depict an evil hacker that breaks a not secure enough wall.
It depicts a clever client that goes through a stupid security hole. That's a
better analogy for 99% of security hacks.

------
tombh
Is [https://libraries.io](https://libraries.io) not a more comprehensive and
community-focused response to the same problem?

libraries.io did make it to the front page a few months ago, but I think its
underlying vision might not have been driven home from just glancing at its
home page. It supports 33 package managers (not just Java, though I'm sure
Rosehub doesn't just do that either) and Github/Gitlab/Bitbucket, not just
Github. And it provides both email notifications and auto PRs.

But that's just the overlap with Rosehub. On top of that it offers the means
to discover libraries based on a Dependency Rank (think Page Rank but using
dependencies instead of hyperlinks). Which in turn allows it to surface
projects with a high "Bus Factor" -- projects maintained by few committers,
but depended on by many (so they'd be more affected by said committers getting
run over by a bus). AND it mines the licenses for a project, notifying if any
of the dependent licenses are incompatible with the parent license. What's
more it's a non-profit organisation receiving enough funding to employ 2 full
time devs.

I think libraries.io is Rosehub and more, to quote the about page;

    
    
      Our goal is to raise the quality of all software,
      by raising the quality and frequency of contributions
      to free and open source software; the services,
      frameworks, plugins and tools we collectively refer
      to as libraries.
    

To take the liberty of extrapolating from the libraries.io vision: open source
security isn't just about fixing patches, but about supporting the
environment, people, conditions and tools that contribute to open source
software.
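
The "Page Rank but using dependencies instead of hyperlinks" idea above can be sketched in a few lines. This is only an illustration of plain PageRank run over a dependency graph, not libraries.io's actual Dependency Rank; the function name, the damping factor and the package names are all assumptions made for the example.

```python
def dependency_rank(depends_on, damping=0.85, iterations=50):
    """PageRank-style score: a package ranks higher the more it is depended on.

    depends_on maps a package name to the set of packages it depends on.
    All names and parameters here are hypothetical, for illustration only.
    """
    nodes = set(depends_on)
    for deps in depends_on.values():
        nodes.update(deps)
    nodes = sorted(nodes)
    count = len(nodes)

    rank = {node: 1.0 / count for node in nodes}
    for _ in range(iterations):
        # Every package starts each round with the same small base score.
        new_rank = {node: (1.0 - damping) / count for node in nodes}
        for node in nodes:
            deps = depends_on.get(node, ())
            if deps:
                # A package passes its score on to the packages it depends on.
                share = damping * rank[node] / len(deps)
                for dep in deps:
                    new_rank[dep] += share
            else:
                # A package with no dependencies spreads its score evenly.
                share = damping * rank[node] / count
                for other in nodes:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical graph: three apps share one library, one app also uses another.
example_graph = {
    "app1": {"collections", "logging"},
    "app2": {"collections"},
    "app3": {"collections"},
}
ranks = dependency_rank(example_graph)
most_depended_on = max(ranks, key=ranks.get)
```

With this toy graph the shared library ends up with the highest score, which is the kind of signal that would flag a widely used project for extra attention.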

~~~
gpawl
I see nothing on the libraries.io website that explains how it would be used
to solve the problem described in the OP.

~~~
tombh
OP here, yeah I agree, like I said the home page could be more explicit.

Here's some links:

[https://libraries.io/about](https://libraries.io/about)

[https://libraries.io/bus-factor](https://libraries.io/bus-factor)

[https://github.com/librariesio/lib2issues](https://github.com/librariesio/lib2issues)

------
saurik
I am extremely sad that this turns into an argument _for_ making certain that
all source code in the world is at least indirectly accessible specifically
via GitHub (at which point people will find it there and expect the developers
to respond and generally track everything going on there, even projects which
are much happier using more open tools); like: it isn't sufficient that your
code is "open", it actively has to be part of the unified GitHub empire.

~~~
jevinskie
Your gitweb [0] has always worked perfectly fine for me! I agree, I don't see
the need for Github.

[0]: [http://gitweb.saurik.com](http://gitweb.saurik.com)

------
orf
In their query they do:

    
    
       FROM (SELECT id,content
          FROM (SELECT id,content
             FROM [bigquery-public-data:github_repos.contents]
             WHERE NOT binary)
         WHERE content CONTAINS 'commons-collections<')
    

Why the subquery? Why not WHERE NOT binary AND content CONTAINS...? is this a
bigquery thing?

~~~
vgt
This appears to be "legacy SQL" in BigQuery, which did not have query
optimization - entirely rule-based query planning. The query is a little
inefficient indeed.

BigQuery has since released ANSI 2011 "standard SQL", which does have an
optimizer and would push predicates down.

(work on GCP and worked on BQ until recently)

------
tlrobinson
Wow. I wonder how much a query that searches the content of all of Github
costs (if you're not Google). This page says the dataset is 3TB+
[https://cloud.google.com/bigquery/public-
data/github](https://cloud.google.com/bigquery/public-data/github) and
presumably most of that is content.

~~~
Veratyr
The contents table [0] which they ran the query on is 1.8TB. Assuming you only
need to do a single pass (seems reasonable given that it's a simple regex),
the price should be about $9 [1]. Free quota covers 1TB so the remainder would
be $4.

[0]: [https://bigquery.cloud.google.com/table/bigquery-public-
data...](https://bigquery.cloud.google.com/table/bigquery-public-
data:github_repos.contents?pli=1&tab=details)

[1]:
[https://cloud.google.com/bigquery/pricing](https://cloud.google.com/bigquery/pricing)
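
The arithmetic above can be written out as a short sketch. The $5 per TB rate is an assumption inferred from the "1.8TB is about $9" figure in this comment, and the function name is made up for the example; check the pricing page for current rates.

```python
# Assumption: on-demand rate implied by the comment (1.8 TB costs about $9).
PRICE_PER_TB_USD = 5.00
# From the comment: the free quota covers 1 TB.
FREE_QUOTA_TB = 1.0


def query_cost_usd(scanned_tb, free_tb_remaining=0.0, price_per_tb=PRICE_PER_TB_USD):
    """Estimate the cost of one full scan, after subtracting any unused free quota."""
    billable_tb = max(0.0, scanned_tb - free_tb_remaining)
    return round(billable_tb * price_per_tb, 2)


full_price = query_cost_usd(1.8)                       # no free quota left
with_free_quota = query_cost_usd(1.8, FREE_QUOTA_TB)   # whole free quota unused
```

Under these assumptions the two values come out to 9.0 and 4.0, matching the estimate in the comment.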

------
cypherpunks01
Nice! That's some good citizenry.

Interesting fact: Justine was the founder of occupywallst.org, which was the
highest-trafficked publisher/web hub for the Occupy Wall Street movement
before she worked for Google.

------
markcerqueira
"Patches were sent to many projects, avoiding threats to public security for
years to come."

Are these pull requests that the project would still need to approve/merge or
were they just pushed in?

~~~
edutechnion
They were PRs that required approval and merge from the Github project
maintainers. Here is a search to see some of their work:

[https://github.com/search?q=%22Upgrade+Apache+Commons+Collec...](https://github.com/search?q=%22Upgrade+Apache+Commons+Collections%22&type=Issues&utf8=%E2%9C%93)

~~~
fudged71
It's actually incredibly interesting to read how the developers individually
responded to each of these PRs.

It would have been great to see a count of how many PRs have been accepted.

~~~
cpeterso
edutechnion's link says 1108 open PRs and 999 closed.

Interesting that 2100 of the PRs are "Upgrade Apache Commons Collections to
v3.2.2" and just 7 were "Upgrade Apache Commons Collections to v4.1".

~~~
therealdrag0
Probably v3.2.2 was lower-hanging fruit for most projects, since it didn't
require code changes.

------
luhn
As scary as Google's massive size and power is, it's pretty awesome that
they're incentivized to do things like this to help the internet because they
_are_ the internet.

------
mrgrowth
I read so many of these kinds of articles out of curiosity and rarely
understand them.

Thank you for adding in the part about the bank teller.

For reference: "it would be like hiring a bank teller who was trained to hand
over all the money in the vault if asked to do so politely, and then
entrusting that teller with the key."

------
joelthelion
> But unlike big businesses, open source projects don’t have people on staff

To read that from Google is frankly disappointing. While this is true of many
open-source projects, it doesn't have to be that way. Red Hat (and Google!)
are brilliant proofs of this.

~~~
vog
More generally, if a company uses software X (open-source or not), they need
to:

a) make a contract with a company that takes responsibility for X, or

b) hire somebody who takes responsibility for X, or

c) take responsibility for X on your own

It doesn't help to "buy" closed-source software X from another company if you
can't count on them in case of emergency, i.e. if they vanish, go bankrupt or
put their lawyers onto you.

Then, better take open-source software where you can take responsibility on
your own, for which it may help to hire one or more of the lead developers.

------
bla2
Really cool, kudos to people helping with this. I wonder if this could have
been done in a way that non-Googlers could have pitched in too, given that
this is for a public good -- but it's tricky with security issues.

------
hokkos
How does it work for transitive dependencies? What if you use a package that
uses a vulnerable Apache Commons library? Is a PR sent to update the package
when it is updated?

------
tropo
If I understand it right, this bug involves code pulling in old buggy
libraries, sometimes indirectly via other libraries. It seems that there is a
reference to a specific bad version, not the actual inclusion of cut-and-paste
code.

Eh, why not just get rid of the bad version? Alternately, release a bug-fixed
copy with the same version number.

Any breakage is a case of "oh well, you're safe now". Leaving the security
hole is probably worse breakage.

~~~
benmmurphy
The bug is not including this library. This library is 100% secure [i mean if
this is a vuln in this library then a large proportion of libraries are
insecure because they could be leaked to untrusted code and used to break the
JVM trust model]. It just so happens this library used to contain a really-
really useful gadget for exploiting another security problem. However,
removing this gadget doesn't mean the security problem is fixed. There are
other _fun_ libraries. In fact classes similar to the Mad Gadget have been
used in the JDK to escape the sandbox in the past. Yes, stuff like this exists
or has existed in the JDK
[[https://github.com/jenkinsci/jenkins/blob/96a9fba82b85026750...](https://github.com/jenkinsci/jenkins/blob/96a9fba82b850267506e50e11f56f05359fa5594/test/src/test/java/jenkins/security/security218/ysoserial/payloads/Jdk7u21.java)].

And this work is very useful in so far as I'm sure the benefits it provides are
going to massively outweigh the cost. However, if you have a naked
ObjectInputStream#readObject in your code then you probably still have an
exploitable security issue. Have a look at how well Jenkins' strategy for
fixing this issue worked, which was basically the same strategy as Operation
Rosehub, i.e. removing the ability to access classes that were known to be used in
gadget chains. Surprise, surprise it didn't last very long and people just
found new gadgets.

And if you read this blog post then you might be misled into thinking that
removing commons-collections from your classpath or upgrading
commons-collections to the 'safe' version would make object deserialization
safe, but this is not the case. If you have a naked
ObjectInputStream#readObject in your code then you are vulnerable to remote
code execution.

~~~
jart
Author here. There are individuals in the infosec industry who agree with you.
They've stated on many occasions that the problem isn't gadgets, but rather
that programming practices in general need to change. This might have
something to do with the fact that no one told Apache about this weakness
until nearly a year after it was presented at an infosec conference.

While gadgets may not be the root weakness, the gadgets certainly help. We may
never be able to have perfect security. Hopefully the systemic paradigm shift
infosec professionals are advocating will come some day. But until that day
arrives, we can make people so much safer, with minimal effort, by simply
disabling these gadgets.

Almost no one uses them. Out of all the projects I found, I was only able to
identify one or two that were legitimately using the gadgets in question.

~~~
benmmurphy
Thanks for your reply. This is Thursday in the UK so I'm going to pre-
emptively apologise for this rant. But we in the infosec community informed
the wider community in 2008 of the problems of Java Serialization. That is 8
years ago. Sami Koivu, peace be upon him, showed in December 2008 that
arbitrary deserialisation in Java was a security risk
([http://slightlyrandombrokenthoughts.blogspot.co.uk/2008/12/c...](http://slightlyrandombrokenthoughts.blogspot.co.uk/2008/12/calendar-
bug.html)). Not to mention that SERIAL-5
([http://www.oracle.com/technetwork/java/seccodeguide-139067.h...](http://www.oracle.com/technetwork/java/seccodeguide-139067.html))
of the Java Security Guidelines for Java SE has this to say:

 _Guideline 8-5 / SERIAL-5: Understand the security permissions given to
serialization and deserialization Permissions appropriate for deserialization
should be carefully checked. Additionally, deserialization of untrusted data
should generally be avoided whenever possible._

And do you want to have a guess as to how many times Serialization was used to
bypass the Java Sandbox between when Sami Koivu made his blog post and when
someone gave a con talk about Apache? Hint: it is greater than 1.
[[https://tyranidslair.blogspot.co.uk/2013/02/fun-with-java-
se...](https://tyranidslair.blogspot.co.uk/2013/02/fun-with-java-
serialization-and.html)]

We have also demonstrated numerous times to the programming community that
deserialization of user data is dangerous. For example Stefan Esser has shown
numerous times that PHP deserialization is dangerous both because PHP
deserialization is a source of bugs in itself and because it is a source of
bugs because it interacts with application code in unexpected ways. We have
seen the same thing in both python with unpickle and ruby with YAML.

I'm going to let you in to a secret within the infosec community. You can find
bugs by just applying existing research in new and novel ways because
developers do not follow security research.

I feel like I'm falling into some rationalism fallacy by ranting at you
because you are doing something useful to improve security. But you could be
doing much much more. You have a voice and people will actually read your blog
as compared to Sami :( You could have mentioned that people should stop
doing ObjectInputStream#readObject() or you could have pushed for updating the
JavaDoc to say: THIS IS A BAD THING DO NOT DO IT.

EDIT: apologies to anyone that realized that java serialization was bad before
the Sami post. I wouldn't be surprised if this was part of the Java secure
code guidelines before then or if someone had exploited the issue before then.
It just so happens that Sami's post was my introduction to Java Serialization
vulnerabilities.

~~~
gebl
As one of the people who did the talk at Appsec Cali that was building on all
this work outlined by benmmurphy... our goal was to reach security minded
developers and talk about a repeated anti-pattern that put software at risk
that impacts things written in many different languages. Both Chris and I have
a development background, and have seen the same issue show up in ruby,
python, php, basically anything that has an object serialization capability.
We hoped to change the focus from a specific library or gadget to the idea
that deserialization is inherently dangerous.

The core problem really stems from the idea that OO models encapsulate data
and behaviors. Behaviors mean code execution - so, anything that will
deserialize objects is giving the person who serialized them the ability to
control the execution flow. If this is a listener on the network, then things
are really bad :-)

So, it's great that a set of gadgets have been removed, it's neat to see the
application of resources to make that happen. I have to agree with Ben, that
any system that relies on object serialization from untrusted sources (in any
language) is still vulnerable, it just might require a more specific gadget
chain. Too many vendors have fixed their products by just updating the library
and not removing the dependency on dangerous object deserialization.
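
The point that whoever serializes an object controls execution flow can be shown with Python's pickle, one of the non-Java cases mentioned in this thread. This is a minimal sketch, not the Java fix discussed above: the class names `Gadget` and `AllowlistUnpickler` and the helper `safe_loads` are made up for the example, and the payload only evaluates a harmless arithmetic expression. The restricted loader follows the "override `find_class`" approach described in the Python pickle documentation.

```python
import io
import pickle


class Gadget:
    """Stand-in for attacker-controlled data: it tells pickle what to call on load."""

    def __reduce__(self):
        # On load, pickle will call eval("6 * 7"). The serializer picked the call.
        return (eval, ("6 * 7",))


payload = pickle.dumps(Gadget())

# A naked load runs the call chosen by whoever produced the bytes.
result = pickle.loads(payload)


class AllowlistUnpickler(pickle.Unpickler):
    """Refuses to resolve any global that is not explicitly allowed."""

    ALLOWED = {("builtins", "set"), ("builtins", "frozenset")}

    def find_class(self, module, name):
        if (module, name) in self.ALLOWED:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"global {module}.{name} is forbidden")


def safe_loads(data):
    """Load pickled data while rejecting references to non-allowlisted globals."""
    return AllowlistUnpickler(io.BytesIO(data)).load()


# Plain data still round-trips; the gadget payload is rejected instead of executed.
plain = safe_loads(pickle.dumps({"name": "rosehub", "ids": [1, 2, 3]}))
try:
    safe_loads(payload)
    rejected = False
except pickle.UnpicklingError:
    rejected = True
```

Note the design difference from removing known gadgets: an allowlist names what may be loaded, so a newly discovered gadget is rejected by default. It still does not make it a good idea to accept serialized objects from untrusted sources.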

~~~
jart
Why did no one tell Apache?

~~~
gebl
[https://blogs.apache.org/foundation/entry/apache_commons_sta...](https://blogs.apache.org/foundation/entry/apache_commons_statement_to_widespread)

"So replacing your installations with a hardened version of Apache Commons
Collections will not make your application resist this vulnerability."

~~~
jart
Ok well you can tell your cohort that this narrative isn't going to fly
anymore.

~~~
gebl
What evs

------
make3
I wish you could do the same thing with mental illness.. massively send pull
request to correct everyone's bad brain code.. <sorry>

~~~
i336_
This is a genuinely cool idea. Seriously.

It raises a lot of questions about what sort of transformative spectrum
(excuse pun) would be applied here though. It's incredibly abstract as
presented.

But even at the abstract level, the one thing I know would absolutely happen
for sure is that the fixes that made the biggest difference would be hand-
waved out of existence by infecting them with viruses, creating scare-
campaigns, etc.

Source: I've learned a lot about Big Pharma over the past 10 years as I've
quietly found real solutions to my own mental health issues. I'm sadly too
scared to share what I've found and I keep seeing products disappear off the
market or suddenly attract customs/overseas shipping issues. Suffice it to say
that the medical industry is opposed to anything they can't patent - and that,
as an industry, it must ensure its own survival. Interpret that any way you
see fit.

------
mirekrusin
It's interesting that this type of initiative, which is admirable, will spike
up some java "popularity" metrics on GitHub.

------
hawski
I was thinking about doing something similar with bigquery and github data to
search for uses of strncpy in C code. But I am not that good with the query
language and also bigquery didn’t support multiple users properly (this adds
friction).

I still think it’s a good idea. It would be even better to search for a few C
pitfalls more, but strncpy is probably the easiest to search for.

------
ploxiln
I think this is one good concrete example of why the npm style of private
dependencies for each lib is not the greatest thing ever, while the non-
recursive style in python (or C) is overall more manageable (if you are
actually managing your dependencies instead of ignoring them).

------
codelion
We have been doing this for a while now:
[https://www.sourceclear.com/blog/millions-of-program-
builds-...](https://www.sourceclear.com/blog/millions-of-program-builds-
vulnerable-to-man-in-the-middle-attacks/)

~~~
hcs
Got a 404 on the above, looks like it should be:
[https://www.sourceclear.com/blog/millions-of-program-
builds-...](https://www.sourceclear.com/blog/millions-of-program-builds-
vulnerable/)

But I don't see where it discusses sending PRs to affected repos, only
detecting them.

~~~
codelion
Ah yeah I had the old link, thanks for fixing. Actually we privately
disclose the problem to the developers and get it fixed following
responsible disclosure, and do not post PRs directly.

------
11928311
So, Google does ... something and is showered with praise.

Thousands of volunteers work in the saltmines and get nothing.

Business as usual. Myths like "Google sponsored Python!!!" propagate when they
do nothing at all.

Disgusting.

------
Dem0stheneS
That's outstanding news. Hats off to the volunteers doing the work on this.

------
rburhum
Mad thank-yous to Google for this!

------
lvlds
Awesome! Congrats to the team!

------
snambi
What is in it for google?

~~~
idlewords
Employee satisfaction.

~~~
theDoug
And a healthier internet

------
muzster
Operation Rosebud

------
lolive
Wouldn't a graph database be a more suitable tool for that kind of task?

~~~
CydeWeys
Why would it be a more suitable tool? What can a graph database tool do that
BigQuery lacks?

If you've already got the data conveniently preloaded into a SQL database for
you, and all you need is a very simple SELECT statement with two WHERE clauses
... why would you use anything else? Spinning up an entire graph database
unnecessarily seems like over-engineering.

~~~
jart
Author here. There's some truth to what he's saying. One thing I've been
meaning to do is get my hands on all the Maven pom.xml files that exist, so I
can load them into a Guava Multimap (my graph database of choice) and figure
out every single artifact that will transitively inherit vulnerable
collections on the class path.
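
The multimap idea above can be sketched without any graph database. This is a rough Python illustration, assuming the dependency lists have already been extracted from the pom.xml files; the function names and every artifact name except commons-collections are hypothetical.

```python
from collections import defaultdict, deque


def build_reverse_index(depends_on):
    """Invert 'artifact -> its dependencies' into 'dependency -> its users'."""
    used_by = defaultdict(set)
    for artifact, deps in depends_on.items():
        for dep in deps:
            used_by[dep].add(artifact)
    return used_by


def transitive_dependents(depends_on, vulnerable):
    """Return every artifact that pulls in `vulnerable`, directly or indirectly."""
    used_by = build_reverse_index(depends_on)
    affected = set()
    queue = deque([vulnerable])
    while queue:
        current = queue.popleft()
        for dependent in used_by.get(current, ()):
            if dependent not in affected:  # also guards against dependency cycles
                affected.add(dependent)
                queue.append(dependent)
    return affected


# Hypothetical data standing in for parsed pom.xml files.
poms = {
    "example:web-app": {"example:report-lib"},
    "example:report-lib": {"commons-collections:3.2.1"},
    "example:cli-tool": {"example:json-lib"},
    "example:json-lib": set(),
}
affected = transitive_dependents(poms, "commons-collections:3.2.1")
```

For a one-off run like this, a plain in-memory multimap plus a breadth-first walk is enough, which is the argument made in the reply below.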

~~~
CydeWeys
Try Neo4j?

~~~
jart
Why use Neo4j for a program you're going to run once and not store the data?
Multimap<T, T> works great.

