
Building bots to mend badges, or how to get your GitHub account suspended - movermeyer
http://movermeyer.com/2018-03-08-building-bots-to-mend-badges/
======
guessmyname
I did this 3 years ago.

I created a bot that would scan for private SSH keys used to connect to AWS
and other services; it also warned about leaked software licenses for
Sublime Text and other popular programs at the time. While many people
appreciated the initiative, it was not taken well by others. Ultimately,
GitHub suspended my account and I had to explain what it was all about.

One year later, through my employer, I created another bot to scan for
security vulnerabilities in projects written in Ruby, Python, PHP and Node.js;
this time I knew I would need to contact GitHub beforehand to find out the
limits of the _"automation"_. They simply stated that, at the time, no
automation was allowed, which was quite surprising because CI is automation.
Travis and other services are allowed to do things there, so I didn't
understand why my bot was different.

I reported to my employer that we would need to shut down that project and
move on to something different. One year later, I found that GitHub had
implemented a (semi) vulnerability scanner for a select group of programming
languages, warning repository owners about problems with their software
dependencies. I can't be mad about this (it's their service), but it still
made me a bit angry.

~~~
michaelmior
Assuming you're talking about a bot similar to the OP's that scans random
projects you don't own, that's very different from CI which is explicitly
configured by projects.

~~~
d0lph
But if they are maintaining "no automation" as a rule, CI should not really
be allowed either.

~~~
michaelmior
This really depends on how you define automation. CI on GitHub depends on
webhooks, which are an officially supported part of the API. So there isn't
anything unsolicited happening.

I just don't think there's any meaningful argument that CI explicitly
configured by maintainers using officially supported channels should be
lumped into the same category as automatically scraping repositories to
create PRs.

Sure, there are many definitions of automation that would include both of
these things, but I think it's obvious what GitHub intends in practice.

~~~
d0lph
The automation wouldn't be doing anything unsolicited either, since it's only
doing things that have been defined in the UI.

Perhaps they didn't mean automation per se, but CI is certainly automation:
automatic merging, as in not merged manually by a human.

Perhaps a better rule would have been: no automation on repos not controlled
by yourself.

~~~
movermeyer
The actual rule is no "excessive automated bulk activity".

[https://help.github.com/articles/github-terms-of-service/](https://help.github.com/articles/github-terms-of-service/)

~~~
d0lph
That being the case, it seems like the OP's bot probably should have been
allowed, especially since it seemed like they were only automating their own
repos.

~~~
movermeyer
This bot was exclusively applied to repos that were not mine.

------
vinceguidry
The answer to this is to do human rate limiting.

Build your pipeline so that a human has to approve each automated action being
taken. The difference between a bot making 5000 network requests in a day and
a human making 100-200 semi-automated requests isn't a whole lot in terms of
throughput, but makes an enormous difference in terms of quality control and
not stepping on toes.
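The shape of that pipeline can be sketched in a few lines. This is only a
minimal illustration (all names hypothetical): the bot proposes actions, and
nothing executes without a human approving it.

```python
# Minimal sketch of "human rate limiting": automated actions go into a
# queue, and only the ones a human explicitly approves get executed.
from dataclasses import dataclass, field


@dataclass
class Action:
    description: str
    approved: bool = False


@dataclass
class ApprovalQueue:
    pending: list = field(default_factory=list)
    executed: list = field(default_factory=list)

    def propose(self, description):
        """The bot side: suggest an action, never perform it."""
        self.pending.append(Action(description))

    def review(self, approve_fn):
        """The human side: approve_fn models a person deciding per action."""
        still_pending = []
        for action in self.pending:
            if approve_fn(action):
                action.approved = True
                self.executed.append(action)
            else:
                still_pending.append(action)
        self.pending = still_pending
```

The point is that throughput is bounded by how fast a human can click
"approve", which is exactly the rate limit being argued for.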

I really wish more companies would do this. Fully automated business
procedures that make demands on human attention are just plain awful. Humans
should be interacting with humans, machines should interact with machines.
Machines can help the human, and the human can help the machines, but
interaction points should only be between two of the same types of entity.

Every time I've suggested human rate limiting, though, I get looked at like
I'm a moron. Even when I build the entire workflow myself and tune it so that
it takes only a relatively tiny amount of time to clean massive amounts of
data, people just don't want to do it. It's beneath them. Even when it makes
a massive difference in the quality of the product or service you're
offering.

~~~
icebraining
That's an interesting point. I'm not sure I would trust myself to do a better
job than automated tests after a (short) while. It's not really a matter of
being beneath me; for example, I don't mind doing repetitive manual labor
(once in a while).

Just out of curiosity, what's the biggest job of this type you've personally
handled?

~~~
vinceguidry
I made a brief, aborted attempt at a restaurant recommendation service. We
wanted to hydrate our data with existing pictures of dishes from the
restaurants sourced from Yelp and/or Google Image Search. After looking at
that data, I realized that a human touch to picking the right images would
make a huge difference in the service.

We're talking thousands of restaurants that we wanted pictures of food from,
and each of the restaurants had dozens of images we could pull. So tens of
thousands of images needed to be sifted through. I figured that with the
right tooling, my cofounder and I could put together something really nice
that would only need an hour or so of maintenance a day to keep up.

So I built a pipeline that used very basic, easy-to-build-and-maintain
'dumb' Rails asset pipeline pages to present data for sifting. Go to the
endpoint and it shows you the name of the restaurant and a bunch of images;
you select one, type in a name for the dish, and it saves the pick to the
database and puts up another page of images.
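The core select-and-advance loop behind those pages is simple. Here's a
minimal sketch in Python (the original was Rails, and every name here is
hypothetical): show candidates, record the human's pick, move on.

```python
# Toy version of the sifting workflow: for each restaurant, a human picks
# one image and names the dish; the picks accumulate into a "database".
def sift(restaurants, pick):
    """restaurants: {name: [image_urls]}.
    pick: the human-choice step, returning (chosen_url, dish_name).
    Returns a dict mapping restaurant name to the recorded pick."""
    db = {}
    for name, images in restaurants.items():
        url, dish = pick(name, images)
        db[name] = {"image": url, "dish": dish}
    return db
```

In the real tool, `pick` is a web page and `db` is a database table, but the
throughput math is the same: one human decision per restaurant.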

It took me bitching up a storm to get him to even look at it. He complained
about how long he thought it would take, while I just got to work. Took maybe
three weeks to prototype our app. One thing I learned in the process is that
if you're looking at a bunch of Southern food, for some reason the picture of
shrimp and grits always looks the most appetizing.

I was well on my way to classifying and figuring out novel ways to present the
data when I had to make the determination that there wasn't good cofounder
fit. So now I work with CNN.

But now all my side projects revolve around ways to get human attention to
improve automated tasks. I suppose one of these days I'll get the right idea
and/or the right cofounder and I'll give it another go.

There's a wealth of usable information out there on the web that one can build
businesses on top of if one only wants to apply a little elbow grease to clean
it and turn it into data. It's far easier to scrape data with a regular web
browser with a custom browser extension than to try to build out headless
infrastructure. But no one wants to do it.

~~~
MrLeap
Your story reminds me of a tool I wrote for helping lawyers classify a feed
of texts as a test set for a project.

Our main initiative was creating a heuristic-based classifier (think lots of
regex). On my own initiative, I trained ML classifiers while we worked on it.
As development went on, the ML classifiers were rapidly catching up with the
heuristic-based one. Unfortunately, it was kind of a one-off data processing
task, and when time ran out the regex machine was still in the lead.

I was modestly proud of the legalese DSL generator I wrote. The lawyers
didn't even know they were writing CoffeeScript as they typed out what the
documents were, what the key dates were, etc. :D

That CoffeeScript formed the basis of our accuracy testing suite. It was as
fundamental as it was huge. That team ended up creating a couple thousand
tests in less than a month.

~~~
ruairidhwm
I'd be interested in hearing more about this. Fancy dropping me an email
(address is in profile)?

------
nathantotten
I actually really like this idea. There are so many random things on GitHub
that get broken over time. The implementation, though, is clearly
problematic, and GitHub has no choice but to block this behaviour.

I could imagine, though, a system with some sort of community-managed GitHub
bot. Developers could submit pull requests to the community service to fix
common issues. GitHub would then run the service nightly themselves, and
developers could opt out of the service if they wanted. Something like this
could be very handy for many things: security issues, typos, broken links,
etc.

~~~
masklinn
GitHub Applications exist for that: basically a bot you specifically opt
into.

The main issue is one of discovery.

Though I imagine you could build an application to notify users of new fixer
applications. Maintainers would opt into that for their accounts/repositories,
it would then match repositories & applications submitted to it and ping
submitters when an application looks… applicable.
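That matching step could be as simple as checking each opted-in repository
against each registered fixer application's applicability predicate. A toy
sketch (all names hypothetical, nothing here is a real GitHub API):

```python
# Match opted-in repos against registered fixer apps and collect the
# notifications to send. Each app is a (name, predicate) pair, where the
# predicate inspects repo metadata and says whether the app applies.
def match(repos, apps):
    """repos: list of dicts of repo metadata (e.g. full_name, language).
    apps: list of (app_name, applies_fn) pairs.
    Returns (repo_full_name, app_name) pairs to notify about."""
    notifications = []
    for repo in repos:
        for app_name, applies in apps:
            if applies(repo):
                notifications.append((repo["full_name"], app_name))
    return notifications
```

Discovery then reduces to keeping the predicates honest, so maintainers only
hear about applications that actually look applicable.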

> Github would then run the service nightly themselves. Developers could opt-
> out of the service if they wanted.

It would be just as bad as TFA's.

------
adrianN
Maybe GitHub should provide a checkbox somewhere in my account settings where
I can give general consent to bots making PRs in my repositories.

------
joejev
Great write-up. I am a maintainer of a project you opened a PR on, but the
diff is large and seems to include unrelated changes. Maybe you based it off
a fork and then opened the PR against upstream? If the bot had worked as
intended I would have thought it was cool, but given that it opened a large
PR with an incorrect description, I found it harmful. I guess the problem
with these bots is that it is easy for you to make a mistake that takes far
longer for people to deal with than it took you to make, so GitHub needs to
ban them to prevent this.

~~~
movermeyer
Thanks for reporting this. Firstly, I am sorry for the inconvenience of
receiving a broken pull request.

I have looked into the pull request and discovered that this is a variant of
"Bug #4" from the blog post. It happens when the third party renames their
forked repo. At that point, the names don't line up and my bot doesn't
realize that the two repos share the same upstream.

I have manually fixed my pull request for your repo and will be writing a
script to look for others that might have had a similar experience.
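For anyone writing a similar bot: a robust check can't rely on repo names at
all. The GitHub REST API's response for a repository includes a `source`
field on forks pointing at the root repository, so the comparison can be
sketched like this (helper names are mine; the JSON shape follows the public
API docs):

```python
# Resolve a repo to its root repository using the "source" field that the
# GitHub REST API (GET /repos/{owner}/{repo}) includes on forked repos.
# Working on the parsed JSON keeps this testable without network access.
def resolve_upstream(repo_json):
    """Return (owner_login, repo_name) of the root repository, falling
    back to the repo itself when it isn't a fork."""
    source = repo_json.get("source")  # only present when the repo is a fork
    if source:
        return (source["owner"]["login"], source["name"])
    return (repo_json["owner"]["login"], repo_json["name"])


def same_upstream(repo_a, repo_b):
    """Two repos share the same upstream iff their roots match, regardless
    of what either fork has been renamed to."""
    return resolve_upstream(repo_a) == resolve_upstream(repo_b)
```

With this, a renamed fork still resolves to the original repository, which
is exactly the case the name-matching version missed.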

Sorry once again.

------
edf13
I can see why this would end in suspension: you mention yourself that when
launching the bot (even on a limited number of repos) you had bugs which
messed up READMEs...

This sounds like a terrible idea! I wouldn't want an automated bot trying to
auto-correct my work in this manner.

~~~
Matt3o12_
I would actually really appreciate a bot that does that. There are so many
badges that are broken, which is just annoying, and many maintainers do not
care about the readme once they have written it. They accept pull requests
but don't update it manually.

And since the bot is only creating pull requests, I don't see any harm. Worst
case for my repo, it would break the readme, but I double-check it just like
any other pull request, realize that it messed up, and fix it myself (and I
would be thankful that the bot noticed the broken link and gave me motivation
to fix it).

What would be a bit problematic about this bot is if, due to a bug, it
started spamming (creating thousands of pull requests, flagging false
positives, etc.).

Furthermore, it is also important where to draw the line. A bot that notices
something is broken and offers me a fix is fine. A bot that notices I use a
working service X and offers a pull request to use service Y could be
problematic: it might be useful, but it might also be annoying (because it is
advertising, and service X might be good enough for me).

~~~
daveFNbuck
The issue with bots is never a particular bot. GitHub is just trying to avoid
a situation where a large percentage of pull requests are coming from bots.
This would drive people off of the platform even if every individual bot were
reasonable and justified.

------
ada1981
If you still are interested in solving the badge problem, you could also
register the domain pypip.in again and redirect the URLs to the other service
you found (it looks like the domain is available).

Maybe the other service would do it if you emailed them?

~~~
movermeyer
That's a neat idea. Looks like someone beat me to it though:
[https://www.namecheap.com/domains/whois/results.aspx?domain=...](https://www.namecheap.com/domains/whois/results.aspx?domain=pypip.in)

I've reached out to them to try and see whether they want help.

[UPDATE: they're benevolent and I'll be working with them on this. Cheers]

~~~
sitkack
That's if they are benevolent, and not going to serve spam or malware over
those badges.

~~~
amenod
Well, if they are not benevolent, at least there will be strong motivation to
finally solve those broken badges. ;)

I must say this made my day... It's one of those occurrences where one spends
many, many hours on a project that in the end could be solved with a small
amount of money ($10) and a few e-mails. I must admit I didn't think of it
either when I was reading the blog post. :)

------
benatkin
Removing already-merged pull requests from display seems adjacent to a data
integrity issue. GitHub should probably be doing more QA in this area.
They've done a lot to improve their security and protect against DDoS, so
I'm sure they can. I hope someone there will notice it (through this comment
or through the blog post) and this will be a wake-up call.

------
troymc
We let Greenkeeper update our JavaScript repos. We still have to accept and
merge its pull requests. So there are ways to do this where GitHub won't shut
you down (if you are Greenkeeper)...

Maybe a solution would be for someone to create an app like Greenkeeper that
promises to start by doing five things but adds more over time, informing you
of each new thing and letting you opt out at any time: a list of checkboxes.

------
alpb
This is exactly why GitHub will never be the monorepo (as described in
Google's paper [https://cacm.acm.org/magazines/2016/7/204032-why-google-stores-billions-of-lines-of-code-in-a-single-repository/fulltext](https://cacm.acm.org/magazines/2016/7/204032-why-google-stores-billions-of-lines-of-code-in-a-single-repository/fulltext)).
The world needs the ability to do code repairs and refactoring at GitHub
scale, and yet some people probably just blocked this person's bot as spam.

GitHub doesn't seem to have any ambition of being the world's monorepo
either; the features they've been building are not usually in line with that.
I think GitHub should consider creating a team of people who think about the
next 10-20 years of open source development and how THEY can carry the flag
in terms of innovation as a world-scale code repository.

------
nukeop
The guy running the show at the AppImage organization routinely runs a bot
that automatically creates thousands of issues asking maintainers of random
repositories to create AppImage builds, or, if there already is an AppImage
build, asking them to comply with good AppImage practices, to fix paths or
icons, and so on. Why can they do that with no problems while this
lesser-known but more useful bot can't? According to the description here,
the bot was useful because it fixed minor issues in an automated,
easy-to-integrate way, with no reasonable downside for maintainers (other
than having to eyeball the pull request and merge it).

~~~
jamiedbennett
Is there any information on this?

Whilst fixing problems is fine, spamming developers is not. It would be
interesting to find out more about this 'bot'.

~~~
nukeop
[https://github.com/probonopd](https://github.com/probonopd)

Check out the few times he has over 200 contributions in a single day; most
of those are issues opened with a bot and so on.

------
andrew_
Honest and noble intentions, to be sure. But the author should have foreseen
the consequences, or at least investigated similar services to see how they
behave (and what GitHub and users typically consider acceptable).
Greenkeeper.io, for example, provides a very similar service for Node.js
package dependencies, but it's opt-in, as GitHub support was quoted as
mentioning in the article.

All one needs to do is take a step back and take stock of how few real-life
situations there are in which we find unsolicited anything acceptable, and
the potential for pitfalls would have been clear.

------
olivierlacan
This is a great opportunity to remind everyone that
[http://shields.io/](http://shields.io/) has been providing high quality SVG
badges for all your repo metadata needs for 5 years now.

Even if Shields doesn't support the specific third-party service integration
you're looking for you can generate a badge using the incredibly simple image
API:

\- [https://img.shields.io/badge/hacker-news-orange.svg](https://img.shields.io/badge/hacker-news-orange.svg)

\- [https://img.shields.io/badge/rate-limited-red.svg](https://img.shields.io/badge/rate-limited-red.svg)

The API is open source:
[https://github.com/badges/shields](https://github.com/badges/shields)

I started the project, although it's maintained by Thaddée Tyl and Paul
Melnikow these days. Here's a bit of backstory:
[http://olivierlacan.com/posts/an-open-source-rage-diamond/](http://olivierlacan.com/posts/an-open-source-rage-diamond/)

------
shusson
Did you investigate deploying the bot as a GitHub App?

------
Eun
That's why you should never ever use your personal account for botting...

------
wffurr
At Google they use a system called Rosie to do something similar:
[https://m.cacm.acm.org/magazines/2016/7/204032-why-google-stores-billions-of-lines-of-code-in-a-single-repository/fulltext](https://m.cacm.acm.org/magazines/2016/7/204032-why-google-stores-billions-of-lines-of-code-in-a-single-repository/fulltext)

------
Bedon292
Would it also violate the terms of service to create an issue automatically,
explaining what is broken, and include a link in the issue text to
automatically create a PR via the bot? Then it would be an opt-in situation.
Or is even that frowned upon?

~~~
detaro
IMHO it wouldn't make sense to allow this if PRs are off-limits, since it
creates at least the same amount of noise.

------
adambowles
The link to your personal site is broken; it points to
[http://movermeyer.com/2018-03-08-building-bots-to-mend-badges/movermeyer.com](http://movermeyer.com/2018-03-08-building-bots-to-mend-badges/movermeyer.com)

~~~
movermeyer
Thanks. I've fixed it.

------
GogoAkiraThe2nd
Hey Michael, your blog post isn't finished. What is the "much more
complicated (and useful) bot that I was working on" supposed to be? If it's
not a secret, will you let me know? It's like watching a movie with no
ending.

------
Alir3z4
I wrote my own alternative in Python when I noticed pypip.in:
[https://github.com/SavandBros/badge](https://github.com/SavandBros/badge)

------
OtterCoder
You absolutely deserve to be banned, regardless of the ToS. You're wasting
thousands of other people's man-hours to demand, however politely, that they
fix a minuscule error that likely only bothers you. Even if you provide the
fix with the request, it will still take a colossal amount of time to read
and review the PRs. This is mechanised pedantry on a despicable scale.

~~~
movermeyer
The total amount of time to review one of these pull requests can be estimated
at < 30 seconds per repo. 30s * 1000 repos = 8h20m. Hardly thousands of hours.
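For the record, the arithmetic on that estimate checks out:

```python
# 30 seconds of review per repo, across 1000 repos.
total_seconds = 30 * 1000          # 30000 seconds in total
hours, remainder = divmod(total_seconds, 3600)
minutes = remainder // 60          # 8 hours, 20 minutes
```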

But yes, this was a particularly trivial problem to tackle. It was meant to be
a stepping stone to a truly useful bot that I was working on. However, that
has been put on hold.

FWIW, the maintainers themselves gave lots of positive feedback on the
project.

------
BugsJustFindMe
I think that the author is taking away the wrong lesson, because I think they
started with some wrong ideas about communication. If you read the GitHub
ToS, the relevant policy statement says "excessive automated bulk activity",
not just "automated bulk activity". I bet someone complained about them, and
I think that happened because they thought it would be bad to make the bot
act like a person.

If your bot has output, always make your bot act like a person. That means
messaging, and that means timing. Even in the best case, if your bot uses few
resources and always perfectly does the right thing, people don't like bots.

> _There are four very important things that any automated message needs to do
> in order to help avoid aggravating people: Be Accurate and Useful, Be Honest
> /Open about being a bot, Have a mechanism for feedback, Be Friendly_

No. God no. There are two important things that you need in a PR: don't act
entitled (OP did a great job there) and don't waste my time (OP failed hard
at this). Everything else is bad. Telling someone that your pull request is
coming from a bot only hurts your goal. In the absolute best case, they treat
your PR like any other. In many cases, though, knowing that a message was
automated will get you instantly reported for spamming, regardless of how
helpful you were.

> _Automated messages should describe themselves as such._

This is off topic and therefore violates the "don't waste my time" principle.
It also has a tendency to engage the gag reflex.

> _It should be the opening line._

Having multiple lines for something so small violates the "don't waste my
time" principle. And definitely don't start your message with something that
is off topic.

> _announcing it as automated helps explain why they are receiving the pull
> request_

They are receiving the pull request because something is broken and you are
fixing it for them.

> _Have a mechanism for feedback_

They can put feedback on the PR. This violates the "don't waste my time"
principle.

> _I ended up settling on the following message for the pull requests:..._

Holy crapballs that's verbose. This definitely violates the "don't waste my
time" principle. "Fix broken badge by pointing to working URL foo [see:
bug_report_link]". Boom. Done. It's easy to read, easy to understand, and easy
to approve.

> _Note that the last paragraph is only included in the message if the README
> includes the “download count” badges. I debated working out a system to
> delete these badges automatically_

You should have either skipped them or maybe filed an issue instead: "The
download count badge in the README is broken because the foo API no longer
exists." Not a whole paragraph.

> _Do not make automatic unsolicited pull requests._

Most pull requests are unsolicited, and GitHub has an automation API for pull
requests, and their ToS doesn't prohibit unsolicited automation, just
"excessive" such, so this is probably the wrong takeaway.

