

Upgrading to Ruby 2.1 (and why complex regexes will inevitably hurt you) - lylo
http://engineering.freeagent.com/2014/05/23/upgrading-to-ruby-2.1/

======
quarterto
Whoever wrote the title: the article doesn't mention regexes even once.

~~~
Yver
That's what I thought so I did a second take. This is the relevant excerpt:

> This tree of strings is then converted to YAML tokens using the class
> Psych::ScalarScanner that internally uses regular expressions to identify
> YAML types. The regular expression used to identity time objects changed in
> the new version of Psych, which meant our nanosecond precision time was now
> correctly converted to a YAML Time type rather than being left as a string.

Apparently, the "hurt" mentionned in the title refers to a bug being _fixed_.

~~~
makomk
It may technically be a bug fix, but the "fixed" behaviour is kind of gnarly,
especially in the way that it interacts with older Ruby versions. Basically,
they're serializing _a string containing a timestamp_ as YAML, but because
that string matches a regex deep in Ruby 2.1's YAML code it actually gets
serialised as a YAML Time object. Ruby 1.9 then deserialises the YAML Time
object to a Ruby time object rather than a string, causing round-tripping
issues.

~~~
jrochkind1
Yeah, this is actually pretty crazy:

In ruby 1.9, a String containing a timestamp gets serialized as a String --
and parsed as a String in 1.9 or 2.1.

In ruby 2.1, a String containing a timestamp gets serialized as a Time (I'm
not sure I'd consider this correct behavior), -- and parsed in again as a
String anyway in ruby 2.1 (due to new YAML-safety behavior), but as a Time in
ruby 1.9.

The ruby 2.1 behavior of serializing out as a Time but then parsing back in to
a String anyway -- seems particularly sketchy, and ripe for roundtrip bugs
where the string it gets turned into on parse isn't _quite_ the same one that
was original before serialization. (Say, winds up with a different timezone or
something).

I'm not sure what the lesson here is, but I don't think it's about regexen. It
may be about how too much cleverness/magic will get you -- just keep it
simple. Shared gem/library code should be predictable and understandable with
simple mental models, and gems used as core dependencies should be very stable
from version to version (yaml and yaml serialization rules are neither).

It may be, yet again, about how using YAML the way we use YAML is a big
mistake -- either use ruby marshal if you actually want exact roundtrip ruby
objects, or serialize only to JSON-compatible datatypes (whether using JSON or
maybe that's what YAML should have done all along). This middle ground of YAML
where it's hard to predict what is round trippable and what isn't, what is
safe and what isn't, and it can change from version-to-version -- is just
asking for trouble. But it's hard to decide to stop doing it, as so many of
your gem dependencies might be.

~~~
tenderlove
> The ruby 2.1 behavior of serializing out as a Time but then parsing back in
> to a String anyway -- seems particularly sketchy, and ripe for roundtrip
> bugs where the string it gets turned into on parse isn't _quite_ the same
> one that was original before serialization. (Say, winds up with a different
> timezone or something).

It's not serializing it out as a time, but serializing it as an "implicit
string" (a string with no quotes). When Psych goes to dump the string out, it
checks to see if the string could be interpreted as something else. Since the
string being dumped doesn't match a YAML timestamp, it doesn't add explicit
quoting on the string.

Here's an example. From the blog post, they are doing this:

    
    
        Psych.dump Time.now.utc.strftime("%Y-%m-%d %H:%M:%S.%6N %Z")
    

Output:

    
    
        "--- 2014-05-23 15:42:29.882127 UTC\n...\n"
    

The `strftime` adds the string "UTC" to the timestamp. But "UTC" isn't part of
the [yaml timestamp
spec]([http://yaml.org/type/timestamp.html](http://yaml.org/type/timestamp.html)).
Since it doesn't match a YAML timestamp, Psych dumps it out as an implicit
string.

If we change the `strftime` to produce a string that _does_ look like a YAML
timestamp like this:

    
    
        Psych.dump Time.now.utc.strftime("%Y-%m-%d %H:%M:%S.%6N %z")
    

Output:

    
    
        "--- '2014-05-23 15:46:31.197338 +0000'\n"
    

Then the output is quoted so that when we load it isn't ambiguous (quoted
strings are _always_ considered strings, and are never candidates for
coercion). If we were to manually modify that YAML and remove the quotes, it
is indeed converted to a Time object:

    
    
        irb(main):012:0> Psych.load "--- '2014-05-23 15:46:31.197338 +0000'\n"
        => "2014-05-23 15:46:31.197338 +0000"
        irb(main):013:0> Psych.load "--- 2014-05-23 15:46:31.197338 +0000\n"
        => 2014-05-23 08:46:31 -0700
    

The article says:

>The same would happen in Psych 2.0.5 if it were not for a new feature called
Safe Load which was introduced after the recent Rails YAML deserialisation
security vulnerabilities.

I have no idea where they got this. You have to explicitly call `safe_load`,
and it doesn't sound like they're doing that. If your format the timestamp
correctly in your YAML, it will happily deserialize as a Time object. :-/

~~~
jrochkind1
So, the fact that there's such thing as an 'implicit string' in the YAML data
model, and that I didn't know this, and that I still don't completely
understand how it works even after reading your comment (thanks though!)...

...is leading me back to suggesting YAMLs data model is overly complex, and
not understandable with a simple developer mental model, and that's the root
of a lot of problems.

I previously thought leaving quotes off string literals in YAML was simply a
convenience for hand-writing YAML and producing YAML with less noise for human
readability; I didn't realize it resulted in an 'implicit string' which would
be de-serialized by YAML parsers differently than quoted strings.

(Are these patterns by which 'implicit strings' are recognized to be
coerceable to certain types part of the YAML standard, or up to the particular
YAML parsing implementation as to how/whether to do it? If the latter... woah,
just asking for trouble, no wonder we got it.)

 _update_ Oh wait, I just realized I totally misunderstood what you were
saying. Strings are Strings, but on serializing, the serializer is supposed to
ensure quotes are used if the string, were it unquoted would be misinterpreted
as something other than a string. Okay, seems reasonable -- but OP is
reporting this is failing somehow, right? And in some cases something meant to
be a string is winding up unquoted, and being interpreted as something else on
load?

Or I'm probably still not quite getting it. Okay, yeah, tldr, we can stick
with: YAML's data/processing model ends up being way too complex making it's
behavior hard to predict and hard to maintain consistent between
parsers/versions.

~~~
tenderlove
> And in some cases something meant to be a string is winding up unquoted, and
> being interpreted as something else on load?

In this particular case, a timestamp with "UTC" in it _should not_ be
interpreted as a timestamp, and older versions of psych _would_ interpret it
as a timestamp even though the spec says it's not valid.

> Okay, yeah, tldr, we can stick with: YAML's data/processing model ends up
> being way too complex making it's behavior hard to predict and hard to
> maintain consistent between parsers/versions.

Absolutely. If we eliminated implicit strings, it would make life lots easier.
OTOH, you could just use JSON. :-D

~~~
jrochkind1
Me, I'd keep implicit Strings (meaning optional to use quotes, for
convenience), but stick to JSON-compatible data types. For most cases where
YAML is used, what's really wnated is something that is pretty much exactly as
expressive as json (no more), but easier and neater to write/read by hand.

Of course, I realize the devil is in the details, and that may be what those
developing YAML thought they were doing, but always adding just one more tiny
thing that would be useful too.

Complexity vs flexibility, one of the fundamental tensions of software design.
I see how we got here through good intentions, but in retrospect YAML made
some decisions on this balance that most of us would do differently. Or, did
those designing YAML realize what a fundamental part of the ruby ecosystem it
was to become, and what the implications of that were? Maybe part of the
lesson is that some things really do need to be carefully designed, not just
iteratively designed "you can always fix it later" by an ever-changing
community of volunteers; is that part of what happened with YAML spec and
implementations and uses?

Also, thanks for explaining the bug-fix as part of the story, that makes a lot
more sense. I understand that bug fixes will inevitably sometimes be backward-
incompat. I think it has maybe been consistently under-estimated how
important/useful it would to allow upgrades of ruby itself while keeping the
exact same YAML processor and version. The psych/syck stuff ended up being one
of the biggest pain points for many people way back in 1.8.7->1.9.3 too. Most
gems ended up compatible with multiple ruby versions, so you could upgrade
ruby versions while keeping your gem dependencies identical. YAML
serializing/parsing, so fundamental, where minor differences in implementation
can have disastrous consequences, was a painful exception.

------
josephlord
Isn't the real risk running different versions simultaneously (both of Ruby
and the libraries) in production and having them sharing state via a
serialisation format.

~~~
jrochkind1
Except if that serialization format was Json, we'd both be shocked if there
were any problems, wouldn't we? It's possible, but pretty darn unlikely.

And if that serialization format were ruby marshal... there's actually a
'marshal version' embedded with marshaled serializations, which changes
independently of ruby version, can change to indicate backwards incompat, but
I don't believe has changed in quite some time, not between 1.9.3 and 2.1.x.
So again, if the marshal version is the same, we'd be shocked (and it would
unambiguously be a bug) if there were anything handled differently between
versions. (Okay, that's not entirely true, I can think of some counter-
examples too).

I think it really is about YAML, and the mistaken design decisions in the way
much of rubydom has historically used it, which continue to pile up.

Note that I think ActiveRecord `serialized` uses YAML in the db, so that's
another place you might find yourself "sharing state" via YAML between diff
ruby versions -- if you upgrade to ruby 2.1 and have it reading a database
created with an earlier ruby version. Oops.

Maybe the lesson is really that serialization is way harder to get right than
you think, in terms of both security and forwards compatibility (future-
proofness).

~~~
josephlord
By serialisation I meant something that dumps objects like Python's pickle.
Maybe marshal is a better equivalent. If there was version checking as you say
there is in marshal it would still have failed in production but the errors
would be better.

Are you sure Activerecord uses YAML in the DB? Everything I've used it for is
stored as the native database types. What is the need to serialize the object
into the DB rather than create a new one based on the record in the DB?

Nowhere did I say serialisation was easy. My comment was that running multiple
versions of platform and libraries at once sharing state was brave. To do that
I would want to be completely explicit about the state format and have tests
to ensure it is correct even though it is an internal API.

Better not to share anything between the servers and just have them all
accessing the same service that owns the state format (probably the DB but
possibly also caching layers[risky needs managing] and other services).

~~~
byroot
> Are you sure Activerecord uses YAML in the DB

ActiveRecord use YAML (by default) to serialize non scalar values like hashes
and arrays.

~~~
josephlord
OK, I don't actually use them, my data model is pretty pure relational (edit:
and normalised) although I might start using the Postgres array and
hstore/json column types at some point I guess that boundary would prevent
serialization issues troubling me.

------
viraptor
Oh, this gets even better when passing yaml between versions.

In ruby 2.1:

    
    
        irb(main):018:0> "2014-99-99 99:99:99.999999 UTC".to_yaml
        => "--- 2014-99-99 99:99:99.999999 UTC\n...\n"
    

Send it to ruby 1.9:

    
    
        irb(main):006:0> YAML.load "--- 2014-99-99 99:99:99.999999 UTC\n...\n"
        ArgumentError: argument out of range
        	from /usr/lib/ruby/1.9.1/psych/scalar_scanner.rb:111:in `utc'
    

Why did they even try to guess at stuff that's in a string, I can't
understand. At least the new version does not explode. (returns a string)

PS: If you think at this point "who would send an incorrect timestamp?", think
about some definition of a field validator for example. "format: 9999-99-99
99:99:99.999999 UTC"

~~~
tenderlove
> Why did they even try to guess at stuff that's in a string, I can't
> understand.

Because that's what the YAML spec says to do. I will try to write better code.
:-(

------
matthewmacleod
Good stuff. It's a testament to a good process that you managed to flip it
live without more major problems. Obviously there's not as big a difference as
1.8.x -> 1.9.x, but I've still seen some weird errors in production…

~~~
lylo
It was a much more painful story going from 1.8.7 to 1.9.3 in 2011. It didn't
help that we decided to go from Rails 2.3 to Rails 3.0 at the same time :-/

------
lectrick
> (and why complex regexes will inevitably hurt you)

This is a needless bash on Regex (which is not even mentioned in the article
at all, except as a YAML footnote!). I have a tip, though- If you DO need to
write a complex regex (hint: it will run much faster than doing the same thing
in code), document it using things like the /x switch! Here is an example I
wrote of a complex password validator- I don't think most developers would
have a problem understanding what's going on:

[https://gist.github.com/pmarreck/4c5f1076498da1a86062](https://gist.github.com/pmarreck/4c5f1076498da1a86062)

~~~
lylo
I agree in hindsight it's an overly provocative title. If I could change it I
would (I have changed the actual blog post title FWIW). Sorry if it offended
you or regular expressions in general.

~~~
lectrick
Hey, I get that regexes can be a bit "dense" to understand. I've actually had
to scale back my use of them at a prior job for the sake of the team (I was
the most expert regex guy by far, but you can't have code that only 1 person
is able to modify...)

I love them for some reason because 1) they're like puzzles, 2) they're
extremely fast if you know what you're doing, vs. doing the same things in
interpreted code.

------
airblade
Good to see real-world data on the difference upgrading Ruby makes.

