
How a 20-year-old kernel feature helped USDS improve VA’s network - Matt_Cutts
https://medium.com/the-u-s-digital-service/how-a-20-year-old-kernel-feature-helped-usds-improve-vas-network-33109cbcb2e6
======
js2
I probably would've started at the TCP layer only because I've been bitten at
that layer many times and it always has these sorts of strange symptoms. Some
examples:

1) Connections hanging over a frame relay network that one day started
dropping packets over a certain size. Work-around was adjusting the MTU until
I was able to convince the frame relay network operator that something was
broken in their network. Initially it was confusing because an interactive
telnet session over the network would work fine till you did something like
"ls -l" or tried to read a man page which generated enough text to send a full
size packet, then the connection would hang.

2) Unable to reach a Verizon e-mail paging gateway but only when connecting
from a Linux box. An OS X box on the same network as the Linux box could reach
the gateway fine. Turned out Verizon had a firewall rejecting connections
where the ECN bit was set. Linux was setting ECN, OS X was not.

3) Solaris box A could initiate a connection to box B, but not the other way.
After A talked to B, B could then talk to A, but only for a short period.
Someone had deleted A's own MAC from A's ARP table, so A wasn't replying to
ARP requests for itself. But if A connected to B, B would keep A's MAC in its
own table till it timed out after which B couldn't initiate connection to A
any more.

4) All manner of misconfigurations over the years where you learn to recognize
the symptoms: misconfigured netmask size; misconfigured duplex; duplicate IP
address on same network. You rarely see these any more.

5) The infamous 500-mile e-mail. :-)

6) And my favorite -
[https://www.pagerduty.com/blog/the-discovery-of-apache-zooke...](https://www.pagerduty.com/blog/the-discovery-of-apache-zookeepers-poison-packet/)
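
The first example (large packets vanishing while small ones pass) is the kind of fault you can narrow down by searching for the largest payload that still gets through. A minimal sketch, assuming a hypothetical `probe(size)` callable that returns True when a packet with that payload size makes it across (for instance by wrapping a don't-fragment ping). None of this comes from the original setup, and the 1400-byte link is simulated:

```python
from typing import Callable

IP_HEADER = 20    # bytes, IPv4 header without options
ICMP_HEADER = 8   # bytes, ICMP echo header


def largest_passing_payload(probe: Callable[[int], bool],
                            lo: int = 0, hi: int = 1472) -> int:
    """Binary-search the largest payload size for which probe() succeeds.

    Assumes sizes behave monotonically: if one size fails, every larger
    size fails too. Returns -1 if even the smallest size fails.
    """
    if not probe(lo):
        return -1
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if probe(mid):
            lo = mid       # mid got through, look higher
        else:
            hi = mid - 1   # mid was dropped, look lower
    return lo


def path_mtu_from_payload(payload: int) -> int:
    """Convert an ICMP echo payload size back to the on-the-wire packet size."""
    return payload + IP_HEADER + ICMP_HEADER


def fake_probe(size: int) -> bool:
    """Simulated link that silently drops anything above a 1400-byte packet."""
    return size + IP_HEADER + ICMP_HEADER <= 1400


best = largest_passing_payload(fake_probe)
print(best, path_mtu_from_payload(best))
```

Injecting the probe keeps the search logic testable without touching a real network.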

~~~
brazzledazzle
I agree. I'd probably look at it from TCP layer shortly after initial failures
to diagnose if not from the start. Especially when dealing with communication
between a cloud provider and on-prem gear and infrastructure. However, it's
tempting to exhaust all other avenues depending on how likely the on-prem ops
folks are to punt the issue.

~~~
askldjd
I actually did look at the TCP layer early on. However, I didn't pay close
attention to the TS Val. From the packet dumps, it just appeared that the TCP
window had stopped sliding. I couldn't conclude that NSOC's router was at
fault.

Getting NSOC on-board is a big deal. After all, they deal with the entire VA
network with 100,000+ employees. If you think about it from their perspective,
why are USDS' TCP connections so special?

~~~
kevin_nisbet
Network level troubleshooting is incredibly difficult, especially for
individuals who don't have a networking background. Even showing someone how
to read Wireshark often isn't enough.

I just wanted to politely point out though, in this case, I think there should
have been an indication of a network failure in this analysis early on, from
the standpoint that TCP frames were sent to the server which were not
acknowledged. This would depend on the point where you capture the traffic
naturally, but the lack of acknowledgement would be a strong indicator that
traffic is not reaching the server, or that replies are not reaching your
capture point.

So while the TS Val may be the cause of the drops, I think the packet drops
should have stood out when seeing the traffic being black holed, and likely
the same segments getting re-transmitted continuously.
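
What that looks like in a capture can be sketched roughly: the same sequence range is sent again and again while the peer's acknowledgement number never moves past it. A minimal sketch over a simplified, hypothetical record format (`(seq, length)` tuples for outgoing segments). This is not how Wireshark's own analysis works, just the idea:

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Segment = Tuple[int, int]  # (sequence number, payload length), hypothetical format


def find_retransmits(sent: Iterable[Segment]) -> Dict[Segment, int]:
    """Return {segment: times_sent} for data segments seen more than once."""
    counts = Counter(seg for seg in sent if seg[1] > 0)  # skip pure ACKs
    return {seg: n for seg, n in counts.items() if n > 1}


def looks_black_holed(sent: Iterable[Segment], highest_ack: int,
                      threshold: int = 3) -> bool:
    """True if some still-unacknowledged segment was sent `threshold`+ times."""
    for (seq, length), n in find_retransmits(sent).items():
        if n >= threshold and seq + length > highest_ack:
            return True
    return False


# One segment acknowledged, the next one resent four times with no ACK.
capture = [(1000, 500), (1500, 500), (1500, 500), (1500, 500), (1500, 500)]
print(find_retransmits(capture))
print(looks_black_holed(capture, highest_ack=1500))
```

The threshold of three is arbitrary; a single retransmit is normal on a lossy link, while a run of them with no ACK progress is the pattern worth chasing.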

And for anyone out there who thinks this is easy to catch, I'd say this is
very easy to miss, because you need to have a good understanding of how TCP
works in the first place, to know what not working looks like.

~~~
dboreham
True, but Wireshark will highlight dodgy TCP frames (retransmits, dups, etc)
which should give a small clue to look further. I agree that it is necessary
to understand how TCP works (or have access to someone who does) in order to
run Internet services.

------
ceworthington
Some people might not have realized USDS is still around since it was best
known for the Healthcare.gov rescue under Obama. But it's still here, and
still hiring people to work on problems like this www.usds.gov/join

~~~
F00Fbug
Hmm.... I apply every 6 months or so and get the thumbs down. Not sure what
they're looking for. I've got 30 years of every kind of experience (dev, DBA,
network, security, product mgmt, analytics/data science, business mgmt, and
more) with good credentials and they never bite. I wish I knew more what the
ideal profile was; I'd love to help out!

~~~
steven777400
I wonder if the environment of experience is significant? USDS positions
itself like a startup (even their page has a section on "dress code" which
mentions being like "any other startup"). Someone whose experience is
primarily enterprise or BigCo might be less appealing. It would be interesting
to see a roster of current USDS FTEs and their backgrounds (I didn't see a
"Who's Who" on their page, but didn't look extensively).

~~~
noir_lord
I think that startup mentality might bite them in the arse.

I saw "React on Ruby" and winced.

There is nothing wrong with that platform as a "We are in a market where
things will change radically in two years" but for the VA? Where things might
change once a decade, that's a recipe for pain.

Look at where the Web was 5 years ago (hell React didn't exist) never mind 10.

Angular is 7 years old, KnockoutJS is 7, jQuery is the grandaddy at 11, React
is 4.

Not a criticism (they are clearly doing important impactful work) more a
concern.

If someone said to me "You will have to support this for at least 10 years"
the choices I made would be extremely conservative.

~~~
commandar
To me, this looks like the bigger potential problem:

>U.S. Digital Service members join us for what we call a tour of duty. We are
seeking candidates interested in joining the U.S. Digital Service fulltime,
ideally for at least 12 months. In some cases, we can accommodate candidates
who can only commit to a shorter amount of time. Three months is the minimum
time commitment we can accommodate. All members of the U.S. Digital Service
hold "term-limited" positions, which means that at the end of a prescribed
term, the candidate's employment with that agency must end.

You have to move to DC -- without relocation assistance -- knowing that you're
only going to work for USDS with an expiration date? Seems like that'd really
shrink the net of candidates to me. I know it kind of kills my interest,
personally.

~~~
kelnos
What also concerns me about that is maintenance. You're constantly bringing in
new people to build new things who have no knowledge of what people in
previous "tours" built. The overhead of all the handoffs and knowledge
transfers that need to happen seems unfortunately high.

~~~
dkhenry
While the USDS does build things, the model is to have them partner with
career civil servants and contractors and get them to implement industry best
practices. So there is still overhead when handing off between tours, but the
bigger problem is finding capable contractors and vested partners at the
agencies who can champion the new way of doing things.

------
UnoriginalGuy
I really hope USDS can help introduce interdepartmental digital transfers
within the federal government.

To give an example of how frustrating it can be... I went through the visa ->
green card -> citizenship process. No two departments talk to one another, and
when they do they seemingly transmit information on paper which is then
transcribed by hand introducing errors/typos.

For example USCIS does not talk to the SSA digitally at all. I filled in a
single form which was used for both my Visa/Green Card and to apply for a
Social Security Card on my behalf, my name was spelt correctly on the visa,
but got typo-ed during entry into the SSA's system (then the onus was placed
on me to prove their error, even though other US government departments don't
have the typo, including any official ID I hold or naturalisation
certificate).

Additionally when you earn citizenship the USCIS won't tell anyone. You have
to get your piece of paper and physically go tell each department one by one
about the change, otherwise nothing will happen.

Why doesn't the federal government just have a big database? Or failing that,
why does one department not electronically transfer records to another
department? Why are people still hand re-entering information already held
digitally?

~~~
vogelke
> Why doesn't the federal government just have a big database?

Because that tends to end very badly. I've been in Fed MIS systems since I
joined the USAF in 1981, and it's the same just about everywhere except for a
few of the research labs.

Remember the OPM data breach, where the background check info for over 20
million people went walkabout? I was one of the lucky winners, and it's a
result of the management-by-spreadsheet mentality that's everywhere in the
government. On paper, they were fine. In actual fact, not so much.

If the security checklist says "you must have an audit system in place", and
you have the system installed and running, you pass. Nothing is said about
ever looking through the logs for atypical behavior that might indicate a
breach.

You want to buy new software? If it's not Oracle or Microsoft, good luck. If
there's not a contractual vehicle in place to use for the purchase, forget it.
Whether or not the software is fit for purpose has no bearing on the matter.

------
kevin_nisbet
While I found this article very interesting, I feel like something is missing
here.

So linking this issue to a Cisco bug is very interesting, that dropping
connections would cause the application to lock up / crash, while all the
connections to the database were dead.

My question is why would the application lock up and the servers would crash?

I don't see it very often, but when striving for high availability and strong
resiliency (which isn't reasonable for everyone), issues need to be looked at
in great detail. So I would be trying to look at the second side of the story,
which is why were there crashes encountered under these circumstances, and are
there other plausible triggers that could cause a similar set of
circumstances.

Disabling timestamps does avoid the Cisco bug, but a similar set of triggers
could be encountered anytime the VPN connection dropped, or if the firewall
failed over without the state tables in sync, or any number of other network
conditions.

And don't get me wrong, I don't know if the OP did this, but based on the
article, I would lean towards disabling timestamps as a workaround, and this
might still be an indicator that something in the app isn't behaving correctly
when the database is unavailable.

~~~
askldjd
You are dead on. We do have a bug where we are not recovering the Oracle
connectivity correctly. It is on our radar to address the issue.
[https://github.com/department-of-veterans-affairs/caseflow-m...](https://github.com/department-of-veterans-affairs/caseflow-monitor/issues/15)

However, there is actually another 50% of the story that I never posted.
VACOLS is a really old Oracle DB (from the 80s) that is out of our control.
Somehow, it has a "feature" where you can only make one TCP connection to it
every 2-3 seconds. So if we lose connection to the database, it will take many
seconds to recover. At that point, our ELB health-check would've fired and
restarted our EC2 instances. This is why recoverability of the database
connection is not an immediate priority.

Here's how we preallocate the VACOLS connection pool to work around this
throttling feature.
[https://github.com/department-of-veterans-affairs/caseflow/b...](https://github.com/department-of-veterans-affairs/caseflow/blob/master/config/initializers/warmup_vacols.rb)
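
The general idea of pre-opening connections against a server that only accepts one new connection every few seconds can be sketched like this. It is not the linked initializer (which is Ruby); `connect`, the pool size, and the 3-second spacing are all placeholders:

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def warm_pool(connect: Callable[[], T], size: int, min_interval: float = 3.0,
              sleep: Callable[[float], None] = time.sleep) -> List[T]:
    """Open `size` connections up front, spacing attempts `min_interval` s apart."""
    pool: List[T] = []
    for i in range(size):
        if i > 0:
            sleep(min_interval)  # respect the server-side throttle between attempts
        pool.append(connect())
    return pool
```

The point of doing this at boot is to pay the slow connection setup before the instance starts taking traffic, rather than while a health check is counting down.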

The infrastructure we operate in is very challenging (and interesting)
because of legacy systems. That's why common sense engineering often may not
apply in USDS.

~~~
kevin_nisbet
I also bet, those challenging legacy systems in many cases are way better built
than what "modern" systems would provide. Sure there will be whacky things to
work around, but I've seen my share of whacky engineering in brand new systems
too. Common sense engineering seems to be few and far between these days.

Kudos on having something interesting to work on.

------
dkhenry
It's amazing to see how "solving" the problems can often not solve the problem.
Immediately when faced with an error that happened after five minutes I might
just put a sleep(301) in the startup script, but that totally would have
masked the issue for others. Also amazing foresight by the kernel team to
think ahead and make this wrap explicit.

~~~
askldjd
Author here. Completely agreed. My jaw dropped when I saw the
INITIAL_JIFFIES. The kernel developers really saved our butt.

I could not imagine debugging this problem if INITIAL_JIFFIES was randomized.
It may take days/weeks/months for this bug to appear.
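
For anyone wondering why the wrap lands at five minutes: the kernel defines `INITIAL_JIFFIES` as `-300*HZ` truncated to 32 bits, so the low 32 bits of the counter roll over 300 seconds after boot regardless of the tick rate. The arithmetic, as a quick sketch (the `HZ` values below are just common configurations, not necessarily what the affected machines ran):

```python
def initial_jiffies(hz: int) -> int:
    """Emulate (unsigned int)(-300 * HZ): the counter's 32-bit starting value."""
    return (-300 * hz) & 0xFFFFFFFF


def seconds_until_wrap(hz: int) -> float:
    """Seconds from boot until the low 32 bits of jiffies wrap to zero."""
    return (2**32 - initial_jiffies(hz)) / hz


for hz in (100, 250, 1000):
    print(hz, hex(initial_jiffies(hz)), seconds_until_wrap(hz))
```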

~~~
lfowles
Similarly, Unreal Engine 4 offsets platform time (a double) by some large
value so if it's stored in a float, accuracy errors will be exposed almost
immediately. Looking it up, the offset starts out large enough that the
epsilon is two seconds.
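
The float gotcha is easy to reproduce outside the engine. A small sketch that measures the gap between adjacent 32-bit floats at a given magnitude; the 2**24 offset here is an assumption picked because it yields the two-second step mentioned above, not a value quoted from the engine source:

```python
import struct


def to_float32(x: float) -> float:
    """Round a Python float (a double) to the nearest 32-bit float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def float32_spacing(x: float) -> float:
    """Gap between float32(x) and the next float32 above it (x > 0, finite)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    nxt = struct.unpack("<f", struct.pack("<I", bits + 1))[0]
    return nxt - to_float32(x)


offset = 16_777_216.0  # 2**24 seconds (assumed offset, roughly 194 days)
print(float32_spacing(offset))            # step size once stored in a float
print(to_float32(offset + 0.9) - offset)  # 0.9 s of elapsed time after truncation
```

Anything timed in a float at that magnitude either stands still or jumps in whole steps, which is hard to miss in testing.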

~~~
vageli
Do you have a link with more info? I'd love to read more about this.

~~~
lfowles
Sorry, no. It's not something documented other than a cryptic comment in the
source code ( FPlatformTime::Seconds() ) assuming some knowledge of floating
point number gotchas.

Edit: Here's a more detailed post I made about the specific gotcha if you're
interested:
[https://community.gamedev.tv/t/why-is-fplatformtime-seconds-...](https://community.gamedev.tv/t/why-is-fplatformtime-seconds-already-past-6-months/9701)

------
jacquesm
I love bugs like these. Make it crash is often the hardest part of solving any
bug and without this you'd have never known. There is one nasty bit to this
story though: the NSOC was running outdated firmware on their Ciscos and
wouldn't have known about it if an outside party had not alerted them to this
fact. That's pretty sloppy on their end.

~~~
askldjd
Thanks. I think the fact Cisco routers fail to route TCP packets bothers me
even more.

/you had one job

~~~
fartbagxp
But it did route those TCP packets over.

For exactly 5 minutes.

It just means you need to route everything faster, and then kill your
connection, and restart it. :)

------
heywire
I just wanted to say that I love these types of postmortem stories. Thanks for
sharing!

------
ChicagoBoy11
It's like a 2017 version of the 500 mile email bug

[https://www.ibiblio.org/harris/500milemail.html](https://www.ibiblio.org/harris/500milemail.html)

------
firebones
The Jiffies root cause leads to an interesting idea: a Glossary of Magic
Constants where all kinds of important constants, limits, and overflows are
tracked to aid in debugging. You could imagine a search engine where "tcp
connection drops after 5 minutes" lists every piece of software and firmware
with 5 minute and 300 second constants.

~~~
gwern
So OEIS for programming? Largely seems covered by Google: punch in an oddly
specific number and someone will probably have discussed it on Stack Exchange.

------
rlucas
Reminds me of trying to debug long-lived SSH tunnels which would fail every
2:11:15 hours. Right down to the hardcoded value in net.ipv4.tcp _
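
For what it's worth, 2:11:15 is exactly what Linux's default keepalive settings add up to (`tcp_keepalive_time` 7200 s, then `tcp_keepalive_probes` 9 probes spaced `tcp_keepalive_intvl` 75 s apart). Whether that is the setting the truncated name above refers to is a guess. The arithmetic:

```python
def keepalive_death_time(idle: int = 7200, interval: int = 75, probes: int = 9) -> int:
    """Seconds of silence before a dead peer is declared with these settings."""
    return idle + interval * probes


def hms(seconds: int) -> str:
    """Format a duration in seconds as H:MM:SS."""
    h, rem = divmod(seconds, 3600)
    m, s = divmod(rem, 60)
    return f"{h}:{m:02d}:{s:02d}"


print(hms(keepalive_death_time()))
```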

------
sydney6
Would this (disabling TCP Timestamps) affect TCP Performance with other OSes
in regard of their respective TCP Window Auto Scaling Implementations? I
believe Linux uses DRS (1) and doesn't necessarily depend on TCP Option TS for
TCP Window Auto Scaling and FreeBSD has got this (2) commit ~ 2 Months ago.

(1)
[http://public.lanl.gov/radiant/pubs.html#DRS](http://public.lanl.gov/radiant/pubs.html#DRS)
(2)
[https://svnweb.freebsd.org/base?view=revision&revision=31667...](https://svnweb.freebsd.org/base?view=revision&revision=316676)

~~~
askldjd
Yup, it would. Disabling the TS option was just a stopgap measure to make our
deployments stable for the time being.

~~~
sydney6
Of course, setting priorities.. I was just wondering how different OSes would
behave under these circumstances. For instance AWS S3 also doesn't support TCP
Timestamps and this had a rather big impact on e.g. FreeBSD's TCP Performance
until recently.

------
equalunique
A team at Veterans Affairs is my customer. They have been tasked to integrate
with some AWS hosted intranet system. Other than points of integration, we
know very little about it. This article seems to be a big clue.

------
jyz
I had the pleasure of meeting and working with many amazing USDS engineers.
Lots of talent, many are truly dedicated to the higher purpose and truly
believe in the mission of serving our country. It's a shame that because of
the current administration, people are less and less interested in the
government.

~~~
askldjd
The government's current IT infrastructure crisis is not caused by any one
administration. The root cause goes back decades. Things like "Improving
Veterans' lives so they don't have to wait 5-10 years for an Appeals decision"
shouldn't be political.

I can honestly say that the projects I've been involved with in USDS are the
most impactful and meaningful projects I've worked on in my entire life.

