While this might have been an honest mistake, these types of bugs can be (and are) abused by hackers.
The post also links to this video:
A shitty Belkin KVM in certain configurations can allow this to happen.
There's a bug which keeps generating chr(32) characters when you activate the keyboard shortcut (scroll lock twice), and try to switch to another machine.
It will keep pumping out those spaces on whatever field was selected at the time, so if you take your time before you switch back, you are going to be in for a lot of fun.
Haven't read the link yet, but wanted to share this sooner rather than later.
Of course I've got checks in place that warn about trailing spaces, and my Git hooks outright refuse a commit with trailing spaces. But it would be nice to catch that at the insert level.
You know, it may be a cat who's typing.
autocmd BufWritePre * :%s/\s\+$//e
Then copy the entire post and paste it a couple dozen times.
See the edit: http://stackoverflow.com/revisions/38484433/2
> The malformed post contained roughly 20,000 consecutive
> characters of whitespace on a comment line that started
> with -- play happy sound for player to enjoy. For us,
> the sound was not happy.
Mine does that on the Control key all the time.
eBay insists on something being entered, and when it's a routine transaction with a vendor I seldom have anything useful to say.
But I do know that tools often make it easy for people to do incredibly stupid things by accident. Like the 'sudo remove file' command followed by '-rf' in the wrong place.
I rimraf a lot; it's very useful. It's a miracle I haven't wiped out any computers doing so...
This is very timely, because minutes ago, I made a link to Russ Cox's articles in my Kernighan awk repo:
If you are not familiar with this issue, basically Perl popularized bad computer science... "regexes" no longer correspond to regular languages.
They say that this particular case triggered quadratic behavior, not exponential, but the point is that there is a linear time algorithm to do this.
The file b.c in the awk repo implements the linear time algorithm:
(and rsc's site has some nice sample code too, as well as caveats with regard to capturing and so forth)
"Regular expressions are one of computer science's shining examples of how using good theory leads to good programs ..."
"Today, regular expressions have also become a shining example of how ignoring good theory leads to bad programs. The regular expression implementations used by today's popular tools are significantly slower than the ones used in many of those thirty-year-old Unix tools."
The alarming thing is that regex are supposed to be compiled before use, and the class of expressions that need quadratic or exponential time is distinct from the class that needs linear time. Why don't these popular, widely-used regex implementations perform any optimizations?
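For a concrete sense of the gap, here's a quick sketch using Python's standard re module (timings will vary; the point is the shape of the curve, and the string layout mirrors the outage case):

import re
import time

# The pathological shape from the outage: a long run of spaces that is NOT
# at the very end, so \s+$ fails and the engine retries from every position.
s = " " * 20000 + "x"

t0 = time.perf_counter()
re.search(r"\s+$", s)          # quadratic with a backtracking engine
print("regex  :", time.perf_counter() - t0)

t0 = time.perf_counter()
s.rstrip()                     # a single linear pass over the string
print("rstrip :", time.perf_counter() - t0)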
At least they keep you warm. Sometimes.
How can it not be a feature that you can heat more space the more spaces you feed it?
You just made my day. Thank you.
"This strategy is no longer practical: users have come to rely on backreferences for at least occasional use, and backreferences are part of the POSIX standard for regular expressions."
What better excuse is there for a poor implementation than standards compliance? In many ways, using regex with backtracking by default is like programming in Lisp without tail-call optimizations. If the POSIX standard is going to require backreferences, then it should also require a non-backtracking implementation for regular expressions without backreferences, just like the Scheme specification requires implementations to support tail-call optimization.
The comparison is valid because they both can create a quadratic runtime from a linear time algorithm.
At least in some regexp engines, this is possible. There is a concept of possessive (as opposed to reluctant, or as it's often called, "non-greedy") quantifiers. Given the string "abb", consider the following regexes:
/[ab]+b/ -- matches "abb"
/[ab]+?b/ -- matches "ab"
/[ab]++b/ -- DOES NOT MATCH!! (Because there is no back-tracking.)
More info: http://www.regular-expressions.info/possessive.html
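For what it's worth, Python 3.11's re module also grew possessive quantifiers, so the same three cases can be reproduced there (a quick sketch):

import re  # possessive quantifiers require Python 3.11+

s = "abb"
print(re.search(r"[ab]+b", s))    # greedy, backtracks:  matches "abb"
print(re.search(r"[ab]+?b", s))   # reluctant/lazy:      matches "ab"
print(re.search(r"[ab]++b", s))   # possessive:          None (no backtracking)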
One regexp engine which implements this (see section 4): https://raw.githubusercontent.com/k-takata/Onigmo/master/doc... -- this is the engine used by modern Ruby versions.
Sorry I don't have the exact reference handy. Apparently the author of PCRE tried to implement an NFA/DFA engine for some regexes as well. And I think RE2 might also have different strategies for different regexes as well, but I forget the details (it might be related to capturing).
The problem is that the Friedl O'Reilly book uses the terms NFA and DFA in a completely wrong and boneheaded way. I was confused by that book for a while.
Perhaps we ought to call them irregular expressions - this kind of behavior is the literal antithesis of what regular means in automata/parsing theory.
Python adopted back references as well as forward and back assertions, but not the other stuff.
Perl really turned regular expressions into something else entirely...
/* lasciate ogne speranza, voi ch'intrate. */
("Abandon all hope, you who enter here.")
In particular there don't seem to be crazy optimizations and non-portable stuff like you see in other production regex implementations and in interpreters. It's pure ANSI C.
The functions are all short -- no 1000 line monsters like I've seen in a lot of old C code lately.
The rsc example code is actually a great example of using pointers in C. People rightly avoid lots of raw pointers for application code, but for this use case, they yield compact and elegant code.
Not a lot of developers use this trick anymore, but it's quite powerful. It should get more consideration. Not to mention the real solution of choosing tools or algorithms with predictable, reliable behavior. Those definitely exist for the task they're performing, as other commenters noted.
EDIT to add modern example from high-assurance field that shows how awesome it can be (esp see Github link):
The program was written in Python, and as it turns out the GIL prevents a thread from being able to interrupt another thread that's doing a re.match() call - the entire regex engine is in C, so the GIL is held for the entire call.
My solution was to use SIGALRM to set a timer before calling, and have the signal handler raise an exception. This causes the regex call to raise that exception, which I then catch and display an appropriate error.
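Roughly, the approach looks like this (a sketch, Unix-only and main-thread-only; the names are mine, not the original code):

import re
import signal

class RegexTimeout(Exception):
    pass

def _alarm_handler(signum, frame):
    raise RegexTimeout("regex took too long")

def match_with_timeout(pattern, text, seconds=2):
    # Arm SIGALRM before calling into the C-level regex engine, and make
    # sure the alarm is cancelled again no matter how the match ends.
    old_handler = signal.signal(signal.SIGALRM, _alarm_handler)
    signal.alarm(seconds)
    try:
        return re.match(pattern, text)
    finally:
        signal.alarm(0)
        signal.signal(signal.SIGALRM, old_handler)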
That's not even the issue here, though. The issue is that the input is processed via an algorithm that is easy to hang and/or a format that's hard to parse. On top of that, there's no isolation or monitor to catch the fault. Each of these can be done differently... and is in many programs... to avoid or better mitigate such risks. Developers often don't.
The whole awk implementation is 958 lines, so it can't be that bad. It supports a fairly featureful regex language (though no capturing AFAICT).
Maybe regexes are just fast enough, and few bother to build something faster?
I have found some starter code at https://t0yv0.blogspot.com/2011/02/home-made-regular-express.... The thing is that I don't know how much else is necessary to have a well-made library...
But yes it looks like it uses a struct re_dfa_t for matching, and some special flags for backreferences? That seems to indicate that they are using linear time maybe with a fallback? Hm.
I think the general turn of events was that libc supported BRE at first, which was implemented using the DFA algorithm. Then Perl implemented a backtracking engine and Perl syntax for REs, and Python/Ruby/etc. wanted to be compatible with Perl, so they also implemented backtracking as well. Because Perl has other features like forward and back assertions that require backtracking.
And then perhaps libc got EREs with backreferences which they bolted onto the existing infrastructure?
Anyway, thanks for the pointer... I may look into these more in the future.
musl libc also uses the TRE regex engine, which apparently uses the DFA algorithm.
This should be the title of a book on software engineering.
My current stance is: avoid regex if at all possible. It turns out many of the things we use regex for are possible without it. Often, .Substring, .IndexOf, and using LINQ over strings are sufficient.
Whoever came up with them has surely been given the same honorary place in hell with Jon Postel, who invented the utterly disastrous "be liberal in what you accept", that has plagued all web developers for the last 20 years.
Regular expressions are just a notation for expressing a finite state automaton that recognizes a regular language: if that's the problem that you have, regular expressions are definitely the tool you want to be using. For instance, I recently made myself a small tool to compute the min/max/expected value of a dice roll; the notation for such a thing forms a regular language that can be expressed as follows:
non_zero_digit ::= '1' | ... | '9'
digit ::= '0' | non_zero_digit
integer ::= non_zero_digit digit*
modifier ::= '+' integer | '-' integer
roll ::= integer 'd' integer modifier?
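A minimal sketch of the corresponding regex in Python (the names are mine; the grammar is the one above):

import re

INTEGER = r"[1-9][0-9]*"                      # integer ::= non_zero_digit digit*
ROLL = re.compile(rf"^{INTEGER}d{INTEGER}(?:[+-]{INTEGER})?$")

for text in ("3d6", "1d20+5", "2d10-1", "d6", "03d6"):
    print(text, bool(ROLL.match(text)))       # the last two are rejected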
I agree that regular expressions are often used in contexts where they should not, especially when people start using back-references to recognize non-regular languages, but don't throw out a perfectly good tool because sometimes it is ill-suited.
It's quite an amazing problem to have, honestly, because it really shouldn't be a problem to create a proper implementation. You can learn the theory behind regular expressions in a few hours and know pretty much everything there is to know.
Without fuzzing, it's going to be pretty difficult to come up with enough test cases to thoroughly test a regex.
As proof, I submit some email header parsing code which I rewrote as a well-commented Regex which was something like 300x faster than using the Mail gem: https://github.com/pmarreck/ruby-snippets/blob/master/header...
Once you know what to look for re: catastrophic backtracking, you know how to avoid it. This is called programmer skill.
... is a part of the so-called Unix philosophy, the CAUSE of the internet. The fact that many web developers forgot to become programmers is just a fun fact that changes nothing.
It dates back to August 12, 1997!
(Perhaps someone should encourage jwz to have a 20th birthday party for that at DNA?)
This regular expression has been replaced with a substring function.
> While I can't speak for the original motivation from many moons ago, .Trim() still doesn't trim \u200c. It's useful in most cases, but not the complete strip we need here.
This would have probably been my train of thought (assuming that I consider regex to be a valid solution):
- Trim() would have been the correct solution, were it not for that behavior.
- Substring is therefore the correct solution. Problem is, IndexOf only accepts a char array (not a set of some form, i.e. HashSet), so you'd need to write the <Last>IndexOfNonWhitespace methods yourself.
- So: use a regex, and make sure that it doesn't backtrack, because it's expressive and regex "is designed to solve this type of problem."

The real problem/solution here isn't substring, it's finding where to substring.
I consider regex too dangerous to use in any circumstance, but I can certainly see why someone would find it attractive at first.
The regexp for this is simply \s+
As to why they're not using String.Trim (https://msdn.microsoft.com/en-us/library/t97s7bs3(v=vs.110)....), maybe it's because String.Trim doesn't seem to know about the 200c whitespace character.
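If you do want to strip it without a regex, the fix is just to widen the character set you trim on. Here's an illustrative sketch in Python rather than .NET; the exact set of "extra" characters is an assumption:

import string

# U+200C (zero-width non-joiner) is not Unicode whitespace, so a plain
# strip()/Trim() leaves it behind; listing it explicitly handles that.
EXTRA = "\u200c"
TRIM_CHARS = string.whitespace + EXTRA

def trim(s: str) -> str:
    return s.strip(TRIM_CHARS)

print(repr(trim("  hello\u200c \t ")))   # 'hello'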
Stack Overflow is a programmers' site; you must expect that a programmer might go, "Hmm, they're trimming whitespace, wonder what happens if I put 20,000 Unicode whitespace characters in there instead of normal whitespace"?
Comparing this to checking the home page, what is the best way to set up a health check for your load balancers?
The danger is that the endpoint becomes public knowledge and comes under a DDoS attack. Putting an IP address filter on that endpoint is usually enough to stop that.
I'm always up for hearing about other ways people solve health checks.
Emphatically agree, but it's important in the first place to design and deploy your infrastructure such that basic increases of scale are accounted for - prevention is the most important piece of the puzzle. Get that right and an overloaded server is symptomatic of something else, in which case taking down access to the unruly resource is first priority.
IMO, the big takeaway here is that their load balancer's health check simply hit the top-level page - more selective checks are somewhat tedious to build but worth it in the long run.
Why is there even direct external IP connectivity to the real server, sidestepping the load balancer?
1) If you do have a back-end failure, this setup can cloud the root cause during recovery because your downstream web servers are down as well.
2) Transitory back-end failures can cascade, and your health checks can make this worse as things flap around. You need to be VERY careful with your timers and retry logic, and tune them appropriately. eg, how does `/status` handle a single TCP reset from the DB? When does it decide to timeout if the DB doesn't respond? It's hard to get this right - network errors vs protocol errors vs timeouts are often handled differently at different layers of the controller, and the load balancer has its own retries and timeouts.
3) If this health check is being used for an API, recovery can be more difficult because of thundering herds. eg, 5/5 API servers go down because of a temporary DB failure. Queues and requests backup. 1/5 API servers is restored to service a bit earlier than the others, and immediately goes down again because it's not degrading gracefully under heavy load. Wash/rinse/repeat until you figure out how to load shed.
4) Are you absolutely positive that your app is worthless without a DB/redis/backend-baz, and every endpoint is broken? Ideally, the app will degrade gracefully if certain requests can be served from cache, or don't require the failed backend.
5) The failure case where this type of thing might be useful (network partition affecting DB connectivity from a subset of your webservers) isn't very common in the environments I've been involved with.
A far more productive approach if you absolutely need this sort of thing is to implement circuit breakers at the app level, and explicitly make certain tripped breakers a hard failure for the /status endpoint. This has the advantage of not depending solely on synthetic queries, and having very explicit restoration logic.
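A rough sketch of that idea (illustrative names, not from the post): one breaker per hard dependency, and /status only fails when a breaker for a truly critical dependency has tripped.

import time

class CircuitBreaker:
    """Minimal illustrative breaker: opens after `max_failures` consecutive
    failures and allows another try after `reset_after` seconds."""

    def __init__(self, max_failures=5, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = time.monotonic()

    def is_open(self):
        if self.opened_at is None:
            return False
        return time.monotonic() - self.opened_at < self.reset_after

# Hypothetical breakers for the service's hard dependencies.
db_breaker = CircuitBreaker()
cache_breaker = CircuitBreaker()

def status():
    """What a /status handler might return: only a tripped breaker for a
    dependency the app truly cannot live without fails the health check."""
    if db_breaker.is_open():
        return 503, "database circuit breaker open"
    return 200, "ok"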
For load balancer health checks, I prefer ones that hit the controller and expect a 200, and nothing more. (The DB and overall health of the service cluster are monitored through more organic instrumentation.)
I was laughing to myself when I read the Stack Overflow post mortem because I have a healthy pile of healthcheck-too-expensive war stories, and their experience resonated.
I've also had the opposite problem, where the health check would answer 200 OK but the real app itself had a corrupted internal state because of a poor design/threading bug. If the health check had been the home page, the node would have been pulled from the LB. While a more in-depth health check would have helped here, I think it's better to alert/monitor/kill any node with a higher-than-normal error rate and leave the health check simple.
> 4) Are you absolutely positive that your app is worthless without a DB/redis/backend-baz, and every endpoint is broken?
Yes; if that's the case, you should not respond "unable to serve". However, you should probably fail the setup check and not start serving traffic if you're a newly provisioned node.
I wouldn't call this cascading failure so much as error propagation. If a critical dependency is down, making the node essentially worthless, it should be removed. This can be caused by things like network partitions and whatnot - as you say in 5. You say "it's not likely", which is sadly a common statement, but when it does happen it can be nasty - and the bigger the environment, the more often it happens.
Cascading failure usually (in my experience anyway) refers to having the actual failover cause more outages. In the case of failing /status I can only see this happening if the architecture of the system is broken anyway - or the load is genuinely too high for the failure that just occurred.
You say: "The DB and overall health of the service cluster are monitored through more organic instrumentation". What is acting on that? What happens when you do get that rare network partition and half of your web nodes cannot reach the database?
A very sound strategy, IMO. I subscribe to this KISS practice. If response/status != 200, issue an alert and/or call a deeper health check routine.
You live and learn.
I actually got to talk to Raymond Hettinger (Python core team) about why re uses a potentially exponential-time algorithm for regexes when there is a famous linear-time algorithm, and (I suspect) most programmers would assume linear complexity. As it turns out, there was an attempt to re-write re to fix this, but the re-write never managed to present exactly the same (extremely large) API as the existing module. He advised me that "the standard library is where code goes to die."
Edit: it's "regex" - https://pypi.python.org/pypi/regex. I have no idea if it behaves better with backtracking, a cursory glance at the documentation doesn't indicate any changes in that area.
$ python -mtimeit -s "s=' '*20000; import regex as re" "re.search(r'\s+$', s)" # ends with spaces is fast
10000 loops, best of 3: 201 usec per loop
$ python -mtimeit -s "s=' '*20000+'non-space'; import regex as re" "re.search(r'\s+$', s)" # ends with a non-space is slow
10 loops, best of 3: 3.39 sec per loop
I try to write my regexes such that they anchor at the front of the string or the back, or they describe the whole string; never an either-or anchoring situation like this example.
Spaces at beginning of string (100,000 iterations):
Rate onestep twostep
onestep 62500/s -- -2%
twostep 63694/s 2% --
Rate twostep onestep
twostep 55249/s -- -9%
onestep 60976/s 10% --
Rate onestep twostep
onestep 7.11/s -- -100%
twostep 16667/s 234333% --
It's likely that Perl has an optimization for `\s+$` but not `^\s+|\s+$` (the former regex only ever matches at the end of the string, which is amenable to optimizations).
I haven't done much Python but the documentation for re.search() and re.match() is very clear: use search to find an expression anywhere in a string, use match to find an expression at the beginning of a string. It appears to ignore anchors in both cases? Left undetermined then is how to anchor an expression at the end of a string. You say re.match() works, but this is pretty confusingly described, and it's easy to see how the ambiguity can lead to problems for even the most experienced programmers.
But yes, this is a very subtle semantic and different regex engines handle it differently. My favorite approach is to always use `.*?theregex.*?` and don't provide any additional methods. Instead, if the user wants to match from the beginning to the end, they just need to use the well known anchors. (Go's regexp package does this, and Rust borrowed that API choice as well.)
a) what part of the regex am I currently trying to match
b) what point in the string am I currently starting at
c) how much of the string has this piece of the regex consumed so far
Then the state machine basically has three transitions:
* if (b+c) terminally matches (a), increment both (a) and (b) and reset (c)
* if (b+c) matches (a), but more could be consumed, increment (c) and check again
* if (b+c) doesn't match (a), increment (b) and reset (c)
So yeah, you could put in short cuts for things like "well this didn't match because the next piece of the regex is matching on a line ending", but keeping it simple is straightforward and generally smiled upon.
(Rust's regex engine does this.)
You are very likely right. Though it looks like they permanently deleted those characters, because they don't show up in the edit history.
If only she were there to witness the effect of a 34-minute downtime on an open space full of mobile/back/front-end developers.
(:greedy-repetition 1 nil :whitespace-char-class))
(:greedy-repetition 0 1 :non-whitespace-char-class)))))
(let ((length 40000))
  (defparameter *match* (make-string length :initial-element #\space))
  (let ((string (make-string length :initial-element #\space)))
    (setf (char string (1- (length string))) #\+)
    (defparameter *no-match* string)))  ; ends in a non-space: the no-match case

(defun end-white-match (string)
  (ppcre:do-scans (ms me rs re *scanner* string)
    (when (and ms
               (= (aref re 1) (aref rs 1)))
      (return (values ms me)))))
(time (end-white-match *match*))
;; Evaluation took:
;; 0.000 seconds of real time
;; 0.000000 seconds of total run time (0.000000 user, 0.000000 system)
;; 100.00% CPU
;; 25,139,832 processor cycles
;; 0 bytes consed
(time (end-white-match *no-match*))
;; Evaluation took:
;; 0.000 seconds of real time
;; 0.000000 seconds of total run time (0.000000 user, 0.000000 system)
;; 100.00% CPU
;; 11,105,364 processor cycles
;; 0 bytes consed
Most of what you want to do with a regex can be done with an NFA or DFA based engine. That which can't be done with an NFA or DFA based engine is generally better handled with a parser than a regex.
There are plenty of good DFA based regex matchers out there; RE2, the Rust regex crate, GNU grep, etc. At a glance, it even looks like glibc uses a DFA, though it supports POSIX REs which support backreferences so it must use backtracking at least for REs that contain backreferences.
Predictable hash collisions were a big source of DoS attacks in web scripting languages, which use hash tables a lot, until they started rolling out randomized hashing algorithms to prevent easily predictable collisions. It seems like it would be best for languages and libraries to move to DFA-based regexps, at least for anything that doesn't contain backreferences, to keep these kinds of issues from being easy to exploit.
I'm impressed they were able to do this so quickly.
Something like that.
If time pressure is not a factor and the production server has Visual Studio or WinDbg installed, attaching the debugger to the process lets you see the data related to the post. But the symbol file and source files might be needed; it's just more hassle to set up. For a stressful situation, simple steps and simple tools are more useful. Whatever works.
If it was malicious they would have made a bunch of them, not just one. I personally have seen many files with unreal amounts of whitespace at the end.
Probably went like this: 'web server crashed. so did another. what page did they crash on? ok, let's take a look at the post on that page. what in the....'
> 10 minutes to roll out the fix
That seems very slow to me. 30% of their downtime was because their deploy process is slow.
FWIW here's a write up on their process
Also I imagine that 10 minutes included dev and testing, not just the deployment part of "rolling out"
Maybe "deploy" means the "Deploy" section of this article: http://highscalability.com/blog/2014/7/21/stackoverflow-upda...
Seems to target only being able to "deploy 5 times a day". I guess maybe the build time is the limiting factor.
It's just interesting to me to see the implications of what folks optimize for, and that this is considered fast. We have very minimal deploy testing and optimize for being able to revert quickly when there are problems, because performance issues like this are very hard to predict. It probably means we create many smaller, shorter hiccups, though (that generally are not a full site crash).
- log in to a server
- dump all threads' stack traces
- see that something like Regexp.match is present in all the stacks
- find the function that called this regexp.
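For a Python service, one low-tech way to get that thread-dump step is the standard library's faulthandler module (a sketch; the choice of signal is arbitrary, and this is Unix-only):

import faulthandler
import signal

# Register at process startup; later, `kill -USR1 <pid>` on the server
# prints every thread's current stack trace to stderr.
faulthandler.register(signal.SIGUSR1, all_threads=True)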
"At one stage, we decided to try to avoid having to be woken for some types of failure by using Heartbeat, a high availability solution for Linux, on our frontend servers. The thing is, our servers are actually really reliable, and we found that heartbeat failed more often than our systems - so the end result was reduced reliability! It's counter-intuitive, but automated high-availability often isn't."
One of these days we'll finish our new system and I'll blog about that, which is that the automated systems are allowed to take ONE corrective action without paging, at which point they flag that the system is in compromised state. Any further test failures trigger an immediate wake of the on-call.
Simple regexes (as in formal language theory) are matched in O(n) time by a finite automaton.
Extended regexes like PCRE are more powerful, but most of the time they are implemented by backtracking engines, where a really bad regex pattern might go exponential, and even a simple pattern like the one in the postmortem can go O(n^2).
Do implementations optimize simple regex patterns to O(n) matching? I even wrote an x86 JIT regex compiler for fun some time ago. Compilation time was really bad, but matching was O(n).
* regex pattern is controlled by site, while regex input is external
* regex pattern is compiled once, while it is being run for every input
> You can do that with a pair of substitutions:
> You can also write that as a single substitution,
> although it turns out the combined statement is
> slower than the separate ones. That might not
> matter to you, though:
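If you want to sanity-check that claim outside Perl, here's a rough Python equivalent (illustrative only; which form wins depends on the engine and the input):

import re
import timeit

s = "  \t  some user input  \t  "

def two_steps(text):
    text = re.sub(r"^\s+", "", text)       # strip leading whitespace
    return re.sub(r"\s+$", "", text)       # then strip trailing whitespace

def one_step(text):
    return re.sub(r"^\s+|\s+$", "", text)  # combined substitution

print("two steps:", timeit.timeit(lambda: two_steps(s), number=100000))
print("one step :", timeit.timeit(lambda: one_step(s), number=100000))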
The trick would be to only allow expressions that can be translated easily to a state automaton.
Good regex: "Phone number [0-9]* "
Bad regex: ";Name=.;" as . can also match ";" and it can lead to bad backtracking. You should rewrite this regex to ";Name=[^;];"
RE2 is probably the best implementation so far, but because it tries so hard to preserve backward compatibility with all regular expressions, it is not that fast in the average case:
Never mind that it looks like all web servers were taken out of rotation - even one server down could cause a cascading effect (more traffic directed to the healthy ones, which then end up dying, in a traffic-based failure). One option, for example, after n servers have gone down (besides bringing up other m servers) is to put (at least some) servers in a more basic mode (read-only/static, some features disabled). Not guaranteed, but that could have prevented this and other types of downtime.
Sure, it makes sense to cut malfunctioning servers out of the production pool when discovered. But at the point where you've dumped out every production server, you've just created the worst possible situation - you're absolutely guaranteed not to be able to serve any traffic. At the point you get to zero functioning servers, a better response would be to restore all servers to the pool and just hope. That would be bad, but less so.
What we've done on our internal systems is to set things up so that when the last server is dropped, it fails over to a passive pool that is really just the normal production pool with no health checks. But I'm not aware of any LB that supports this directly, we had to glue together some ugly configurations to get this to work.
Pasting that content into hipchat will probably lock up your browser and webview based clients. Beware.
Lesson learned: don't parse user input with a regex.
Anybody know if any regex engines attempt this?
Obviously you can still shoot yourself in the foot, but it's somewhat more difficult to do so in a situation like this where the regex in question "looks" cheap.
Obviously. If the "regex" includes a backreference, it requires backtracking. If it includes only regular operations (I can't call to mind any other nonregular operations that people might expect), it doesn't. This is information that can be trivially surfaced by a parser.
Here's a state machine that recognizes /^.*\s+$/:
1: initial state (non-matching)
2: matching state
all states transition to 2 on \s, and to 1 on \S. Done.
Taking this a little further, we can observe that since the transition table is identical for every state, there's no real need to have states at all -- you get the same effect by just checking the last character.
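In code, that two-state machine (and its collapse into a last-character check) might look like this - a sketch, not anyone's production matcher:

WHITESPACE = set(" \t\r\n\f\v")

def matches_trailing_ws(s: str) -> bool:
    r"""Literal two-state DFA for /^.*\s+$/: in state 2 iff the last
    character seen so far was whitespace."""
    state = 1                          # 1: non-matching, 2: matching
    for ch in s:
        state = 2 if ch in WHITESPACE else 1
    return state == 2

def matches_trailing_ws_short(s: str) -> bool:
    # Same effect: since every state has the same transition table,
    # only the last character matters.
    return bool(s) and s[-1] in WHITESPACE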
In my current project, it found two of those in Google Zxing library.
1. Rendering a page fails/does not terminate if some non essential subtask (rendering a single code block) fails/does not terminate.
2. They do not try to detect bad data (the way they certainly try to detect bad code)
3. Load balancing based on the rendering time of a single page
Code bugs triggered by bad data will happen again, with or without regular expressions.
It was actually the regex in the sanitization function that became unresponsive. So they killed the page by trying to stop bad data from coming in.
Agreed on your other points though.
Also, the "homepage" of StackOverflow does not show any 'comments' - it is just the top questions? Why was the page loading any comments in the first place?
To clarify the "comment" was not a Stack Overflow comment, but rather a comment in a code block inside a question.
It would be interesting to find out if something like this goes into their automated regression tests. :)
The postmortem here should probably be "why are you reimplementing trim".
% perl -v
This is perl 5, version 18, subversion 2 (v5.18.2) built for x86_64-linux-gnu-thread-multi
% ls -l testfile
125380092 Jul 20 16:41 testfile
% time ./regextest testfile
./regextest testfile 0.36s user 0.03s system 99% cpu 0.398 total
(Remove the a at the end)
% time ./regextest testfile
./regextest testfile 0.13s user 0.03s system 99% cpu 0.156 total
There are many, many modules that use /\s+$/, including at least one form validator.
It's true you need them to implement backreferences. But I've never used such a thing. If I were creating a runtime for some new language, I would simply ignore that part of the POSIX standard.
Maybe I'm less experienced (1.5 yrs), but it would have taken me significantly more than the 15 minutes (was it?) that it took them to diagnose the problem.
Props to the team, and Thank All 0.33G Gods I was not on call!
> It took 10 minutes to identify the cause,
> cause was a malformed post that caused one of our
> regular expressions to consume high CPU ... called on
> each home page view ... Since the home page is what our
> load balancer uses for the health check, the entire site
> became unavailable since the load balancer took the
> servers out of rotation.
The trick is that there's no guarantee, in general, that a match failing at character N is due to character N, so the regex engine backtracks.
"Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems."
> 20,000+19,999+19,998+…+3+2+1 = 199,990,000
Wouldn't regular users trying to access the homepage have yielded the same effect?