
Three bugs in the Go MySQL driver - farslan
https://github.blog/2020-05-20-three-bugs-in-the-go-mysql-driver/
======
bgrainger
While the second and third problems sound specific to Go's `database/sql` API,
the first problem (server-closed connections) is an issue any MySQL client
library (that implements connection pooling) has to deal with.

My .NET MySQL connector ([https://github.com/mysql-
net/MySqlConnector](https://github.com/mysql-net/MySqlConnector)) "solves" it
at the MySQL level by sending a PING packet, but as the post points out, this
adds extra latency.

The TCP-based approach of performing a non-blocking read sounds much better;
I'm glad the author shared it, and I want to see if the technique can also be
implemented in MySqlConnector:
[https://github.com/mysql-
net/MySqlConnector/issues/821](https://github.com/mysql-
net/MySqlConnector/issues/821).

~~~
wahern
The non-blocking read is a bad hack to an ill-defined problem.

For starters, if a read would fail, so would a write. Your write path has to
handle errors anyhow, so the non-blocking read is a contrivance. I understand
the reason for the contrivance--it's because the typical SQL driver API
requires first dequeueing a connection from a pool, passing ownership to a
caller, and the caller sending a query. The proper and better API is to
combine the "dequeue connection" and "send query" steps, returning a handle
back for the response, which may include (as an optimization) assigning the
connection to the caller for any additional requests. Additionally, any hack
that prevents the majority of write failures, but not all, leaves time bombs
around by not stressing the write failure paths of the caller.

Secondly, IME a far more common reason in the wild for a connection
disappearing is either a) the server crashing or b) NAT association timeout.
In either case it's typical for the remote end to simply disappear, and unless
TCP keepalive is enabled, a read won't return any error until you first
attempt a write and wait for the TCP stack to figure out that the remote end
is gone. In the best case that costs one round trip; in the worst case, some eager firewall
rules in the middle or on the remote peer silently drop packets for
unassociated connections and/or the ICMP response and you're stuck waiting for
TCP timeout.

Thirdly, and related to #2, absent TCP keepalive or an application ping (not
ICMP ping), connections are far more likely to get dropped on the floor while
sitting in the pool. An application-level ping is more reliable, because some
firewalls, application gateways, and NAT gateways drop or ignore TCP
keepalive.

I've written client libraries with connection pooling for HTTP, MySQL, SMTP,
RTSP and generic TCP. I learned all this the hard way.

~~~
dub
The article says the non-blocking read resolved the problem they were having
while only introducing five microseconds of latency per borrow from the
connection pool.

For a consumer-facing, latency-sensitive application, that sounds like a much
easier-to-stomach hack than introducing a network round trip at the start of
every connection borrow.

~~~
user5994461
It's funny you say it resolves the problem, because if you read the bug report
the author opened back in 2017 ([https://github.com/go-sql-
driver/mysql/issues/657](https://github.com/go-sql-driver/mysql/issues/657)):

It's repeatedly pointed out by one maintainer that his server and application
are misconfigured and should be fixed. Yet the author doesn't consider that,
and instead sets out to redo the connection pool (the library can certainly
benefit from better error detection, but he would be better off fixing the
root cause of the dropped connections).

Ergo the issue is not fixed. If you look at the chart provided, it's still
erroring every few minutes instead of every minute. That's an improvement, but a
far cry from zero. [https://user-
images.githubusercontent.com/42793/57318792-6e3...](https://user-
images.githubusercontent.com/42793/57318792-6e3b6800-70fb-11e9-87f7-f9d69e8a3c7b.png)

~~~
yencabulator
There's a fundamental unfixable race there: a connection can be picked from
the pool while a FIN is already on its way to the client.

This is "caused" (as in, not mitigated) by the intersection of a naive
protocol, as pretty much all SQL protocols are, and the nature of distributed
programming.

~~~
user5994461
You realize the database server was misconfigured to drop connections
transparently after 30 seconds? Were it not for that, this TCP race condition
would be extremely rare.

------
JulienSchmidt
Bug #3 (The race) was silently introduced by a _semantic change_ in Go's
database/sql in January 2018:
[https://github.com/golang/go/commit/651ddbdb5056ded455f47f9c...](https://github.com/golang/go/commit/651ddbdb5056ded455f47f9c494c67b389622a47)

It took until December of the same year before we got the first bug report and
figured out what was going on. While the semantic change might look subtle, it
certainly wasn't from our (the driver maintainers') perspective. We were quite
disappointed that such a change was made 1) without informing the driver
maintainers and 2) without making sure the driver-side changes were in place
before this change made it into a Go release.

We regularly test against Go's master (now using a Travis CI cron job), but
that only helps if the existing tests fail. We don't have the time to
constantly monitor all changes in the Go repo.

If there is a need to make such changes (not just in database/sql and not just
in Go), PLEASE actively communicate early with the community / the direct
users.

~~~
nemetroid
A comment on the original issue[1] promises to "[open] some tickets for
various drivers". Did this not happen?

1:
[https://github.com/golang/go/issues/23519#issuecomment-36057...](https://github.com/golang/go/issues/23519#issuecomment-360572590)

~~~
JulienSchmidt
I really appreciate the work that is done to improve the database/sql package.
It is still rather young and not as mature as for example JDBC. I would rather
thank the people who work on it for the work they do, than to point fingers
for anything that is not done. The original design of database/sql was
sometimes a bit too simplistic and will probably require some more ugly
changes or workarounds in the future.

This specific change was a bit unfortunate. It tried to fix another bug caused
by the introduction of Context cancellation support, but it unfortunately
caused some other major bugs due to the changed semantics. Like the very same
comment already indicates, other drivers also required changes. In retrospect,
I think it would have been better to try to handle the original issue entirely
in database/sql, instead of changing the "contract" with the drivers.

Go MySQL Driver was, and I believe still is, developed mostly by some random
programmers in their free time (I originally started it as a side-project
during high school and have never been paid for any work on it). It probably
makes everyone who worked on it proud that major companies, not just GitHub,
now use it. But like many such projects, it is not a perfect
or finished product. If you use community-driven open source software at work,
try to convince your management to set some work hours aside for actually
contributing back to those, like GitHub did.

~~~
networkimprov
Go is still undergoing intensive R&D, especially in the runtime. Its releases
are quietly tested against internal Google projects, but not formally tested
with widely-used open source projects.

Any org that deploys Go apps should designate someone to watch the Go issue
tracker, and subscribe to all issues that might affect them.

[https://github.com/golang/go/issues/](https://github.com/golang/go/issues/)

------
kardianos
FYI: Go 1.15 will have idle timeout:
[https://tip.golang.org/pkg/database/sql/#DB.SetConnMaxIdleTi...](https://tip.golang.org/pkg/database/sql/#DB.SetConnMaxIdleTime)

The driver interface also separates out the session resetter and the
connection validator concepts.
[https://tip.golang.org/pkg/database/sql/driver/](https://tip.golang.org/pkg/database/sql/driver/)

Please do have drivers implement the Connector.

------
EmielMols
The work-around for the first issue might work well in practice, but due to
the asymmetry of TCP streams it is still open to a race where the server
idle-closes a connection while a new (non-idempotent!) query is in flight.

The more correct solution would be to let the client manage the idle timeout
and disconnect once it is reached. Depending on how much control you have over
the client, this might be a good or a bad idea.

Note that this problem generalizes to HTTP/1 upstream servers. If you need to
support non-idempotent requests and want persistent connections to your
upstream, it's not a good idea to have the upstream manage the idle timeout
(and disconnect when it is reached).

In practice, I would have the client manage an idle timeout of 30s, with a
server timeout of 40s as extra protection against misbehaving clients.

~~~
bgrainger
Some MySQL clients offer this; for example MySqlConnector has a
`ConnectionIdleTimeout` setting that controls how long an idle connection can
remain in the pool: [https://mysqlconnector.net/connection-
options/](https://mysqlconnector.net/connection-options/)

This is generally useful to "prune" the number of open connections in the
connection pool after a burst of activity has happened, so I'm surprised more
clients don't offer it; I couldn't find an equivalent setting on go-sql-
driver/mysql or MySQL Connector/J.

------
user5994461
>>> Keep in mind, we run our production MySQL clusters with a pretty
aggressive idle timeout (30s)

The number one source of connectivity issues with MySQL. That setting alone is
the source of every bug they encountered.

It's marvelous how many applications have a low timeout and then wonder why
their (idle) connections are dropped all the time.

The timeout works fantastically when software prepares a connection to the
database, spends maybe 30 seconds initializing and doing other stuff, then
sends the query: the carefully prepared connection is guaranteed to be broken,
consistently. But the most impressive case is connection pools, whose very
purpose is to maintain idle connections, getting butchered every 30 seconds.

MySQL should remove this fatal setting and enable TCP keepalive by default.

At least have mercy on developers: set a sane default and cap the minimum
value at 5 minutes.

~~~
throwdbaaway
It looks like they don't need the aggressive idle timeout anymore, now that
they have ProxySQL for connection pooling:

> ... ProxySQL, which is responsible for connection pooling ...

[https://github.blog/2020-03-26-february-service-
disruptions-...](https://github.blog/2020-03-26-february-service-disruptions-
post-incident-analysis/)

More worryingly, it looks like they just rewrote the authz functionality from
Ruby to Go, and have both the monolith and the microservice accessing the
same database?

------
kccqzy
Cancellation is always such a big issue regardless of language or environment,
and whether it's due to a timeout or some other action.

In C, when using threads and thread cancellation there are a lot of subtleties
involved: [https://ewontfix.com/2/](https://ewontfix.com/2/)

Even in Haskell where there's generally a focus on well designed abstractions,
asynchronous cancellation can fail due to badly written code that wants to
catch all exceptions.

------
sargun
These bugs are not specific to MySQL. The bug where context expiry closes the
connection, leaving a txn without an explicit successful abort / rollback, is
a problem as well, because conn.Close gets called in a goroutine when the
context expires.

We've experienced this in pq as well.

~~~
todotask
Related to this as well: a recent fix for this scenario

[https://github.com/golang/go/issues/34775](https://github.com/golang/go/issues/34775)

------
kevinherron
Bug #1 doesn't seem solved. They just narrowed the window for the race
condition.

I don't see how you can fix this without the client first calling db.Ping()
whenever the connection is close to its estimated expiration.

------
nitwit005
> A quick Code Search for rows.Close() on GitHub shows that pretty much nobody
> is explicitly checking the return value of rows.Close() in their Go code.

Seems to be a pattern in all languages: plenty of C code ignores the result of
the close system call, and plenty of Java code wraps exceptions from close
calls in try/catch blocks that do nothing with the exception.

~~~
PeterisP
Well, what _should_ the code do there?

At that point, the status is uncertain, there's potentially a problem but IMHO
there's nothing the code can do to reliably recover from it.

~~~
sudhirj
The code should throw away the data it read from the connection just before it
closed it, because there's a high chance of it being corrupt.

~~~
lstodd
That's just crazy. How do you define "just before" ?

~~~
sudhirj
The last iteration before the current check. But I'd assume the foolproof way
would be to error out the entire read whenever Close() returns an error. I
don't know if a partial read is ever a good thing.

------
sudhirj
Why aren’t these problems applicable to Postgres? Is the connection model
different or do the libraries already do all this?

~~~
hilbertseries
Why is Postgres relevant to this post at all?

~~~
donatj
Because Go uses the same _provided_ abstraction layer for all SQL databases.
If the errors were in the abstraction layer and not the driver, it would be
odd for them to be MySQL-specific.

~~~
wolco
Not really that odd. The abstraction layer provides functionality common to
all. Perhaps something common didn't work for mysql.

~~~
csharptwdec19
Entirely possible.

If you've ever looked at an ORM, or even a complex micro-ORM, that supports
multiple databases, you get an appreciation for how many 'little' differences
everyone has in the SQL standard itself.

And if that's the "industry standard" (Yeah, I know it's not, but it's at
least a spec/guideline) with that much deviation, I can only imagine how
dissimilar each DB's nonstandard protocol might be.

------
nerdbaggy
I’m sure there are a lot of reasons to hate it but I like how Go does the
different database types and the base type

~~~
nkozyra
A lot of languages do this under the hood, because SQL abstraction layers need
it to paper over network and query discrepancies.

------
jijji
I encountered this same issue a few years ago when using the Go mysql driver,
and the simple fix was calling db.Ping() before doing a query, which checks
the connection and allows the query to proceed.

~~~
RhodesianHunter
So you have to make a round trip to the database before making a query? Yikes.

~~~
kevingadd
You don't need a round-trip, you can push a ping request into the pipe
immediately followed by your actual query. You don't care about the actual
result of the ping unless your query fails - "if the ping also failed, this
was a dead connection so I can retry the query, but if the ping was OK and the
query was not, the query failed"

------
yuribro
I really don't understand the discussion around the first bug. Either
something is oversimplified, or it's just an issue of a bad abstraction.

This issue (the other side closing the connection) is so fundamental to all
networking code that I don't see how this specific use case (MySQL idle
timeout) is special. The connection pool also doesn't sound relevant: the
same would happen if the caller kept the same connection (socket) and used it.

For this class of issues, this is the easiest case to deal with: the other
side sent a FIN and we got it! Our kernel knows that this socket is closed (it
doesn't matter for this case whether it's fully or only half closed).

So if you did a select (or equivalent) call on the socket, it would not report
it as writable (and would probably report the exception case), and you'd
reconnect.

If you don't use select, the write will fail immediately, and I would guess
that in the worst case you can see that 0 bytes were written.

So the problem here was just that the wrong error was returned through the
multiple layers of abstraction? Why not propagate the error correctly? Or
handle the reconnection at a lower level? Trying to "Ping" at the application
level is such overkill and a misuse of resources. Do we really want an extra
round trip for every action we do on a DB?

Of course, as someone else pointed out, there are much harder issues to
consider - silent disconnects, proxied connections where the proxy didn't
propagate the error, failures on the DB level, and so on.

EDIT: I see now that another comment mentions that there is a specific error
code in MySQL clients for lost server.

------
zkirill
>> Instead, make sure that every SQL operation in your app is using the
QueryContext / ExecContext interfaces

I've been wondering if this is indeed a best practice or something that should
be used only when necessary. Should every database query really abort if the
connection to the client fails?

In addition, if you follow this advice you now need to check whether the
returned error is context.Canceled, which warrants an HTTP 400 response.

~~~
dullgiulio
Of course it is up to the application to decide case-by-case.

Just keep in mind that if your query is started by a user request, but does
not need to terminate in the context of the user request that triggered it,
you can move the query to a separate goroutine and decouple it completely from
the triggering user request.

Also, aborting a query is safe: the transaction gets rolled back, removing
unwanted side effects. You cannot do harm by cancelling, but you can do harm if
you don't cancel.

------
markdog12
Dumb question, but you can't listen for a close socket event in Go?

~~~
tptacek
A TCP half-close, with a FIN? I think that's a limitation of the socket system
call API itself.

~~~
markdog12
Yeah, I wasn't sure. I was looking at a Dart MySQL connection pool
implementation, and that's what's done.

    
    
    socket.listen(onData, onDone: onSocketDone);

    onSocketDone() {
      _closed = true;
    }

    write(Buffer buffer) {
      if (_closed)
        throw StateError('Cannot write to socket, it is closed');

      // ...write
    }

    onData(RawSocketEvent event) {
      if (event == RawSocketEvent.READ_CLOSED) {
        _closed = true;
      }

      // Other events (READ, CLOSED, WRITE) are handled as well.
    }

But I don't know what's going on under the hood there.

------
Traubenfuchs
One consequence of reinventing the wheel (go) is going through the same
problems others stumbled over years ago.

~~~
zkirill
As someone who is currently transitioning from Go to Java/Kotlin I agree with
you to an extent. However, I was surprised that I wasn't able to find anything
in the Java/Kotlin universe that is comparable to Go's web request context
cancellation.

~~~
apta
Loom will have context cancellation and deadline handling superior to
golang's.
[https://cr.openjdk.java.net/~rpressler/loom/loom/sol1_part1....](https://cr.openjdk.java.net/~rpressler/loom/loom/sol1_part1.html)

~~~
Traubenfuchs
I didn't read all of it, but I was wondering: In a virtual loom thread, am I
allowed to call currentThread().sleep(...); or will it block the system
thread? Or is there some magic like currentThread().sleepAndThen(..., ()->{});
?

------
frobisher
Just curious - are any of the these bugs possible in Rust?

~~~
The_rationalist
Rust SQL drivers (where they exist) are mostly not production-ready; they
simply don't have the human resources. For example, the leading driver
(diesel) does not support elementary SQL such as GROUP BY.

~~~
couchand
I thought Diesel was just an ORM, but investigating your comment I discovered
that they've implemented their own drivers in-tree. That surprises me, and
also seems like a mistake. Steven Fackler's PostgreSQL driver, for instance,
is widely considered to be quite solid.

------
The_rationalist
The justification given here for moving to Go is very weak. Firstly, if they
really cared about performance they would move to JRuby or TruffleRuby;
secondly, most of a language's performance lies in its library ecosystem
rather than in the language itself. Ruby has had decades of server-side
optimisation that the nascent Go has to reinvent.

------
PunksATawnyFill
Go: Yet another Google language.

What's it going to be next week?

~~~
nkozyra
They have Dart, which is alive and well after 9 years and Go, which is alive
and well after 11 years.

What point were you trying to make?

------
erikrothoff
As someone running MySQL with Go in production and seeing random MySQL
connection-loss exceptions: is there a summary of how they fixed this? It's an
insanely long article, and I appreciate the detail and effort, but if someone
has read it all and has a gist, that would be insanely helpful.

~~~
sethammons
You have production issues related to the article, but you don't have the time
to read the article?

~~~
erikrothoff
Sorry if I came across as lazy. It’s a sporadic issue that’s more easily fixed
with a service restart. Did not mean to come off as entitled. My intent was
just to ask for some help from the community.

