
Open source collaboration across agencies to improve HTTPS deployment - konklone
https://18f.gsa.gov/2017/01/06/open-source-collaboration-across-agencies-to-improve-https-deployment/
======
bertil
This is a very small detail in that post but it captures quite well what
officialdom is to me, what separates GSA and 18F from other digital efforts:
the inclusion of the “tribal” scale in the list of levels of authority. 18F
makes things so that many people can use the Internet including, explicitly,
the administration of First Nations.

I’ve complained a lot about how US-based companies don’t think about non-US
users enough (that common rant obviously doesn’t apply to GSA, although
Americans abroad, immigrants, and foreign visitors probably qualify), but in
that rant I had forgotten the original Americans. Shame on me. I have never
heard of any start-up asking “What about First Nations? Do we support the
Cherokee alphabet? Is there a Sioux exception for the law that we are
enforcing in that form?”

------
garrettr_
pshtt (the HTTPS scanning tool) also powers the results for Freedom of the
Press Foundation's recently launched Secure The News project:
[https://securethe.news](https://securethe.news). (Full disclosure: I work for
FPF, and worked on Secure the News).

It's a promising project, and could use more contributors if anyone here is
interested: [https://github.com/dhs-
ncats/pshtt/issues](https://github.com/dhs-ncats/pshtt/issues) for ideas!

------
discreditable
I was happy to notice not long ago that apod.nasa.gov is now served over HTTPS
with a Let's Encrypt certificate. Even the OP link is!

~~~
hmft
Yeah! Lots of NASA sites use Let's Encrypt certs. Some examples here
[[https://crt.sh/?Identity=%25nasa.gov&iCAID=16418](https://crt.sh/?Identity=%25nasa.gov&iCAID=16418)].

------
alpb
One thing I noticed going through the list linked in the page is that many of
these .gov pages host _both_ www and no-www versions, making them essentially
two different websites with the same content. Example:
[http://abilityone.gov/](http://abilityone.gov/) and
[http://www.abilityone.gov/](http://www.abilityone.gov/) It looks like clear
guidelines around this are missing. I know of certain countries whose .gov
domains are almost 99% www and don't serve no-www at all.

~~~
konklone
You're right, this is (unfortunately) very common. I wish there were clearer
guidelines about this.

The White House Office of Management and Budget publishes IT policies, and
they ask for specific URLs with www in front:
[https://www.whitehouse.gov/sites/default/files/omb/memoranda...](https://www.whitehouse.gov/sites/default/files/omb/memoranda/2017/m-17-06.pdf)

But I don't think they or anyone would care if the www redirected to the root,
or vice versa, as long as it eventually got you there.
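A quick way to check whether a site does that canonicalization is to look at
the first redirect hop. A minimal sketch (the helper names and the timeout are
mine, not part of any official tooling):

```python
import http.client
from urllib.parse import urlparse

def first_hop(host, path="/"):
    """Fetch one HTTPS response without following redirects."""
    conn = http.client.HTTPSConnection(host, timeout=10)
    try:
        conn.request("GET", path, headers={"User-Agent": "redirect-check"})
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()

def canonicalizes(host, status, location):
    """True if the first hop redirects www.<host> to the bare
    domain, or the bare domain to www.<host>."""
    if status not in (301, 302, 307, 308) or not location:
        return False
    target = urlparse(location).hostname or ""
    bare = host[4:] if host.startswith("www.") else host
    return target in (bare, "www." + bare) and target != host
```

So `canonicalizes("www.abilityone.gov", *first_hop("www.abilityone.gov"))`
would tell you whether the www form bounces to the root (that call hits the
network, of course).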

------
randomdrake
Thanks for the work that you're doing on this and answering questions. I had
never seen many of the neat things mentioned in the blog post.

While the article did a good job explaining how pshtt works and how it
generates data for the reporting, it didn't dive too much into the scanning
itself. Since this is posted on Hacker News, I'd love to hear more about the
nitty gritty of the data collection itself.

Can you talk about what sort of setup you run, and what sort of technical and
interdepartmental challenges you run into scanning, storing, and obtaining
data for 1,143 government websites?

~~~
hmft
Hi there. First, you've got to begin with the understanding that no one is
maintaining a holistic list of federal .gov websites (or at least not one I
can get hold of). So, before scanning, we source several public datasets to
gather potential .gov hostnames. This was recently described in depth by 18F
[[https://18f.gsa.gov/2017/01/04/tracking-the-us-
governments-p...](https://18f.gsa.gov/2017/01/04/tracking-the-us-governments-
progress-on-moving-https/)]. In addition to Censys, GSA's DAP, and the End of
Term Web Archive data, our team performs authorized scans of federal agency
networks
[[https://www.whitehouse.gov/sites/default/files/omb/memoranda...](https://www.whitehouse.gov/sites/default/files/omb/memoranda/2015/m-15-01.pdf)]
and so we mine that data too. This currently nets ~90k hostnames, of which
only about a third are responsive.
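
To give a flavor of the gathering step: merging candidate hostnames from
several CSV datasets is mostly normalization and de-duplication. A rough
sketch (the column name and details are hypothetical, not our actual code):

```python
import csv
import io

def gather_hostnames(csv_texts, column="Domain Name"):
    """Merge candidate .gov hostnames from several CSV exports,
    normalizing case and trailing dots and de-duplicating."""
    seen = set()
    for text in csv_texts:
        for row in csv.DictReader(io.StringIO(text)):
            host = (row.get(column) or "").strip().lower().rstrip(".")
            if host.endswith(".gov"):
                seen.add(host)
    return sorted(seen)
```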

For both hostname gathering and HTTPS scanning, we use 18F's domain-scan
[[https://github.com/18F/domain-scan](https://github.com/18F/domain-scan)],
which orchestrates the scan and provides parallelization. We use the pshtt
scanner to ping each hostname at the root and www for both http and https--
this typically takes 36-48 hours to burn through. Once the scanning is
finished, we throw the data from the CSV into mongodb, then generate the
report via LaTeX. The trickiest part is probably report delivery, which is a
mostly manual process for Very Government reasons.
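
For the curious, "root and www for both http and https" just means each bare
hostname expands to four candidate endpoints before probing. A simplified
sketch of that expansion (not pshtt's actual code; the probing helper is my
own illustration):

```python
import urllib.error
import urllib.request

def endpoints(hostname):
    """Expand a bare hostname into the four endpoints a pshtt-style
    scan probes: root and www, over both http and https."""
    return [f"{scheme}://{prefix}{hostname}/"
            for prefix in ("", "www.")
            for scheme in ("http", "https")]

def probe(hostname, timeout=10):
    """Try each endpoint; map URL -> HTTP status (None if unreachable)."""
    results = {}
    for url in endpoints(hostname):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                results[url] = resp.status
        except (urllib.error.URLError, OSError):
            results[url] = None
    return results
```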

Most of the bureaucratic challenge is overcome because we've already been
doing scans against these executive branch agencies for the past several
years, so we're a known quantity, though we do modify our user-agent to
clearly point back to us. On the whole, agencies have been very supportive--
the data on Pulse bears that out. Agencies really do want to do the right
thing for citizens.

~~~
randomdrake
I appreciate you taking the time for an insightful and detailed response. The
link you provided, "Tracking the U.S. government's progress on moving to
HTTPS[1]" gave a lot of the details I was looking for.

You might consider mentioning it in this blog post as it does offer
interesting background information and technical details.

As a specific example, the actual Python scripts used to generate the data[2]
and the data itself[3], give a great deal of insight into the question I had.

[1] - [https://18f.gsa.gov/2017/01/04/tracking-the-us-
governments-p...](https://18f.gsa.gov/2017/01/04/tracking-the-us-governments-
progress-on-moving-https/)

[2] -
[https://github.com/GSA/https/tree/master/compliance](https://github.com/GSA/https/tree/master/compliance)

[3] -
[https://github.com/GSA/https/tree/master/compliance/data](https://github.com/GSA/https/tree/master/compliance/data)

------
ycmbntrthrwaway
I like how [https://pulse.cio.gov/](https://pulse.cio.gov/) does not work
because its certificate is issued for cloudfront.net

~~~
ycmbntrthrwaway
Looks like it was fixed just now or there is some round-robin balancing behind
it.

~~~
konklone
As it happened, we were migrating production infrastructure to a new service
tonight, and had a few minutes of time where the cert was invalid. Sorry about
that.

------
hmft
Heyo, ^ blogger here. Happy to chat.

~~~
konklone
And 18F/GSA employee and open source collaborator here. =) Can definitely help
answer any questions folks have.

~~~
newman314
Are the HTTP report generation/assembly tools available/open-sourced too?

I'd love to be able to use this as a starting point. Thanks.

~~~
hmft
No, the code for report generation hasn't been opened up yet, mostly because
it won't work without dependencies that aren't yet public. I think that will
change in the next few months; open-sourcing is definitely an intention. It
will live at [https://github.com/dhs-ncats](https://github.com/dhs-ncats) when
released.

------
DyslexicAtheist
this combines some really important checks. I might be able to remove my
.bashrc hack ...

    
    
      function certchain() {
          # Display the PKI chain-of-trust for a given host
          # GistID: https://gist.github.com/joshenders/cda916797665de69ebcd
          if [[ "$#" -ne 1 ]]; then
              echo "Usage: ${FUNCNAME} <ip|domain[:port]>"
              return 1
          fi
    
          local host_port="$1"
    
          # Default to port 443 if no port was given
          if [[ "$host_port" != *:* ]]; then
              host_port="${1}:443"
          fi
    
          openssl s_client -connect "${host_port}" </dev/null 2>/dev/null | grep -E ' (s|i):'
      }

------
eeZah7Ux
How mature is pshtt?

~~~
konklone
We (18F/GSA) have been using DHS's tool in production for a few months now,
and have fixed various bugs as they've come up.

Before that, pshtt's methodology was replicated in a Ruby tool (site-
inspector) that we grafted HTTPS/HSTS detection logic onto, and had that
running in production for a year or so.

So in terms of business logic, I think it's pretty mature. If you mean things
like having it formally audited or having a dedicated development team, it
hasn't gotten there yet. But the more people that use it, the more mature it
will get.

