I liked this article a lot. A programmer found a bug in something he uses every day, and learned enough to fix it.
I think articles like this are very useful to beginner/intermediate developers. Everyone in software says "write open source code", "make pull requests to code you use" etc, but there's a very big gap between knowing how to program and knowing how to track down, fix, and submit a PR for a bug in a program (and language!) you've never worked on before.
This is a good little tutorial on ways to attack a bug in a program. I especially like that he starts with two print statements - there's no wizardry here, just a programmer digging into a bug.
Absolutely. I think articles like this are the best way to demystify some of the stuff that seems a bit magical in working software. Things that look complex are rarely the product of genius, but rather of a long string of practical, sometimes painstaking, fundamentally logical problem-solving.
I agree. I was recently working with an npm package that wasn't behaving as expected. After some debugging on my end, I was convinced I was using the library as intended. So I started digging through the source code, wrote a few console.log()s, and found the issue. The check for a property in the code didn't match the property name in the README. A simple issue to fix but one that seems much more attainable if you know how someone worked their way to that point. A few years ago and I doubt I would have taken the time to dig.
Maybe I should write up a blog post like this one...
First of all, this is why people should stop adding stupid features to regular expressions. A sane regular expression implementation has no pathological cases. DFA generation can be done in O(n^2) if memory serves (that's the absolute worst case; O(n) is the average), and matching can't be worse than O(m) or so (n is the size of the regex and m the size of the string). When you add features like back references and recursive matches, you bring the worst-case complexity to O(a^n), which is exponential.
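Here's a minimal sketch of the blow-up in plain JS (nothing Atom-specific, just the classic pathological shape):

    // (a+)+b is the textbook catastrophic pattern: on a string of n 'a's
    // with no trailing 'b', a backtracking engine tries roughly 2^n ways to
    // split the run of 'a's between the inner and outer '+' before it can
    // conclude that the match fails.
    const pathological = /^(a+)+b$/;

    for (const n of [10, 15, 20, 25]) {
      const input = 'a'.repeat(n);        // no 'b', so matching must fail
      const start = Date.now();
      pathological.test(input);
      console.log('n=' + n + ': ' + (Date.now() - start) + 'ms');
    }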
And why? So you could write some dirty hack rather than writing a simple recursive descent parser to create an AST of the code (which takes O(n)). What did you gain by ruining your regular expression implementation and writing shitty code?
EDIT: Matching might be O(m), but I'm wondering about maximally linked graphs with many epsilon edges. The conversion from NFA to DFA would require O(n^2) in that case. Implementing it as an NFA evaluation might be faster than full conversion, but then you run the risk of O(n^2) dominating the matching.
EDIT: The author is somewhat correct on what "catastrophic backtracking" is. While you could argue that it is due to bad implementations of greedy matching, it's a more endemic problem of how you have to implement a regular expression engine that supports back references. If you have to support back references, then you will always have a class of pathological cases which cause exponential time complexity. There's a useful graph about this in my toy regular expression engine (which performs much better than Python's implementation even though it's written in Python and not C): https://github.com/cyphar/redone.
I don't always want to spend insane amounts of time writing full AST parsers. Sometimes it is a lot easier to write a simple, hacky, throwaway regex. It's a bit much to claim that backtracking regular expressions, a very useful tool sometimes, should never ever be used and everybody should waste loads of time writing careful code even in situations where it isn't required.
The problem is not the tool. The problem is using the tool for the wrong job - although who knows, maybe if they hadn't just chucked out a quick hacky implementation, then Atom simply wouldn't have the feature at all - in which case it is a trade-off between having the feature at all and an evidently tiny corner case bug that was easily fixed by someone who isn't even a regular developer of the project.
So I don't see the reason to be outraged or judgemental here.
"I don't always want to spend insane amounts of time writing full AST parsers."
This is an effect of the language, not the problem space. Backtracking REs make it very easy to write bad code, and then most languages make writing the parsing code hard.
If you're in a language that makes parsing code easier, like Haskell, then the tradeoff isn't anywhere near so bad. I don't mention Haskell just because it's the trendy thingy, I mention it because parser combinators really do make for some very easy parsers, and while parser combinators can exist in other languages they tend to be very syntactically heavyweight. Perl 6 is supposed to make parsing perhaps even easier, but I haven't played with it to know.
Parsing isn't really that hard, it just plays very poorly with Algol-inspired syntax.
To be clear, since this is the Internet and the presumption of disagreement is strong, this post is not an explanation of why you're wrong... it's an explanation of why you're right. It is absolutely a true statement that in most languages you're way better off bashing out a dangerous RE than writing a proper parser even for something as simple as this.
Interesting. Can you point to a piece of Haskell code that does some parsing of roughly the same level of complexity of that in the OP, so that we can compare how long it takes to write such a parser compared to a regex?
Haskell is, in fact, a really good way to write parser code. That's the main thing I use Haskell for.
But how are you suggesting that Haskell could solve the problem of letting CoffeeScript plugins express which text they should apply to in a CoffeeScript text editor?
He's suggesting that the correct solution (recursive descent parser) is not inherently hard, but CoffeeScript makes it hard enough that people reach for less good approaches.
> Sometimes it is a lot easier to write a simple, hacky, throwaway regex.
I mean, yes, but I'd argue that "I am writing software which will spend a significant amount of its time parsing code" is not one of the times when that is an appropriate instinct.
> I don't always want to spend insane amounts of time writing full AST parsers.
Then don't use languages whose grammars require insane amounts of effort to parse.
One can write a decent S-expression parser in an hour or two, and a full parser in a weekend. S-expressions can represent everything anyone ever needs to work with, so why use anything else?
I strongly doubt the AST would be needed. You could probably do this in a trivial loop state machine. It can't even be called laziness: that regex took time to work out - far longer than a switch-based state machine would have taken to write in the first place.
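Something like this rough sketch would already be linear time (hand-rolled JS, not Atom's actual code, and ignoring escape sequences and comments for brevity):

    // Net bracket count for one line, skipping brackets inside single- or
    // double-quoted strings. One pass, O(length of the line).
    function bracketDelta(line) {
      let depth = 0;
      let quote = null;                   // current string delimiter, or null
      for (const ch of line) {
        if (quote) {
          if (ch === quote) quote = null; // string closed
        } else if (ch === '"' || ch === "'") {
          quote = ch;                     // string opened
        } else if (ch === '(' || ch === '[' || ch === '{') {
          depth++;
        } else if (ch === ')' || ch === ']' || ch === '}') {
          depth--;
        }
      }
      return depth;                       // > 0: indent next line; < 0: dedent
    }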
Aside: out of curiosity it would have been interesting to see how a Thompson NFA[1] would have dealt with this.
I miscommunicated the question: "how well would a Thompson NFA have dealt with this?" I.e. how many orders of magnitude faster would it have been, if it was faster at all?
Fun with automata: Constructing the minimal DFA for a given NFA is PSPACE-hard. There are families of NFAs with n states so that the powerset automaton has 2^n states, but the minimal DFA has 1 state. Example: Let Σ={a,b} be the alphabet, Q={q1, ..., qn} be the states. q1 has a self-loop with both a and b and a transition to q2 with a. q2 to q(n-1) have transitions with a and b to the next state (q2 -> q3 etc). qn has no transitions. q1 and qn are the only accepting states. This is basically the "n-th-to-last letter was an a" automaton with the modification that the starting state is accepting.
> Constructing the minimal DFA for a given NFA is PSPACE-hard.
This isn't really a problem if you incur that cost only once, by having the regex compiled when the script is parsed. However, idiomatic JS usually puts regex literals in the closure where they are used - decreasing performance, code reuse and clarity. Why? Probably for the same reasons that regex is being used in the first place.
Well, PSPACE-hard is pretty damn hard, so this really depends on the size and complexity of your NFA ;) Of course most of the time you don't hit the pathological cases like the one I described above. My comment was not meant to provide a guideline on how to fix this, consider it a fun observation on automata theory.
Indeed. What I failed to communicate was that there is no reason to incur that cost multiple times per process/application. For some reason idiomatic JS encourages incurring that cost multiple times. Your comment shows that this idiomatic form is even more absurd than I originally thought.
Actually, the O(n^2) thing is correct for an NFA simulation that tracks states not paths. While there might be O(2^n) paths, there are n states you can be in at any one point.
Because regular expressions such as that one can be applied line by line, which becomes important for files with thousands of lines, and because you can adapt them more easily between languages: they are looser. This is the reason why all text editors use them (IDEs can choose to use an AST).
You don't need to throw out your favorite regex features—backreferences aren't stupid, they're useful and usually harmless! Rather, a good regex engine should start with recursive backtracking (ideally JIT'd) and fall back to the Thompson NFA, when the regex is expressible as one, if the recursive backtracking approach is taking too long. Think of it like the adaptive sorting algorithms that switch between insertion sort and quicksort based on the size of the collection.
This approach gives you the best of both worlds: speedy operation in the common cases in which recursive backtracking beats the Thompson NFA, all the regex features you know and love, and worst case polynomial running time for regexes that the Thompson NFA supports.
I don't think it would have prevented this bug, though, because I don't think the Thompson NFA supports the recursive backreference thing that this regex uses.
I don't know of any that can actually switch during regex execution, but TRE at least will choose Thompson NFA if it can and fall back to recursive backtracking if backreferences are used.
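The decision itself can be sketched in a few lines. pickEngine and the two engine objects are made-up names, and the feature test is a rough approximation rather than a real pattern parser:

    // Stand-ins for the two implementations a TRE-style library would have.
    const backtrackingEngine = { name: 'backtracking' };
    const thompsonEngine = { name: 'thompson-nfa' };

    function pickEngine(patternSource) {
      // Backreferences (\1..\9) and lookaround can't be expressed as a plain
      // NFA, so they force the backtracking engine; everything else can run
      // on the worst-case-linear Thompson simulation.
      const needsBacktracking = /\\[1-9]|\(\?<?[=!]/.test(patternSource);
      return needsBacktracking ? backtrackingEngine : thompsonEngine;
    }

    pickEngine('(a+)+b').name;      // 'thompson-nfa'
    pickEngine('(\\w+) \\1').name;  // 'backtracking' (uses a backreference)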
The support of fancy features like backtracking does not cause any degradation of performance for simple regular expressions. You are not obliged to use them, but they do not hurt you when you do not use them. Some other people may love them despite their poor complexity.
Actually, this is false. A lot of regex implementations use a backtracking approach in all cases, causing pathological behavior even on regexes which match a regular language (and so should never take any significant amount of time to process).
You are right. The support of fancy features like backtracking should not cause any degradation of performance for simple regular expressions. You are not obliged to use them, but they should not hurt you when you do not use them. I am surprised this issue is not fixed in Perl.
If I understand the article correctly, the Perl regex evaluator, if given a string consisting of only the minimum required number of 'a's, first matches them to the optional 'a's, and then has to backtrack to match the required 'a's.
I misspoke a little. The point is that that expression doesn't use backreferences, and therefore doesn't need to use backtracking at all. The fact that it does use backtracking anyway is the focus of the complaint.
> The support of fancy features like backtracking does not cause any degradation of performance for simple regular expressions.
Well, backtracking is how you implement your engine. So "simple" regular expressions that don't use any of the features backtracking provides (like backreferences) will still cause exponential complexity on the classic pathological patterns, e.g. (a+)+b run against a long string of nothing but a's.
I got to the bottom and couldn't believe that the solution chosen was to modify the regexp instead of using a for loop. He did such a great job explaining that the code is really just trying to count the number of parentheses or braces to see if they're imbalanced that it felt like the next step should have been "so I wrote a really simple set of loops that is fast enough on short strings and way less crazy otherwise".
Same here. Also, I don't use Atom, but looking at the expression I'm pretty sure it fails to account for strings with parentheses. The way the matching is done, it looks like it will happily count brackets that appear inside string literals as real nesting.
Sometimes I feel like the difference between junior, intermediate, and senior developers is that the junior hasn't yet figured out how to use regexps, the intermediate dev has, and the senior dev has figured out not to.
I kid, but...really. The number of times I've seen people burnt by non-trivial regexps in production code is absurd.
"Oh, I need to mangle this CSV that's in the wrong format"? Sure, write a one-off regexp.
"Hm, there might be a security vulnerability in this URL parameter, I need some way to filter out bad inputs..." Oh hell no.
Literally two weeks ago I tracked down a crazy bug to a regexp someone had written to try and correct mistyped email addresses in a signup form (?!) that deleted "invalid" characters. Like a "+". So many facepalms for one short line of code...
The rule I've developed for myself is never clean up data for security purposes. There are some cases where it is valid to examine the data for validity; for instance, there's many places where you'd like to filter null characters or "all of the ASCII control characters except the whitespace", and it's valid to look for those things. But if you find them, reject the input. In theory you'd think you can clean it up, but in practice it turns out to be just too dangerous. Too many concrete instances of real-world failure, and reality always trumps theory.
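In code the rule stays pleasantly boring: validate and reject, never rewrite. A sketch of the control-character case I mentioned:

    // Allow tab (\u0009), LF (\u000A) and CR (\u000D) as whitespace; reject
    // any other ASCII control character, plus DEL, instead of stripping it.
    function assertNoControlChars(input) {
      if (/[\u0000-\u0008\u000B\u000C\u000E-\u001F\u007F]/.test(input)) {
        throw new Error('input contains control characters');
      }
      return input;  // returned unchanged: we never "clean up" suspicious data
    }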
The famous "Gruber URL regexp", before an update in 2014, would run in exponential time given some URLs that contained parens.
I noticed this when a web app I worked on froze on certain pages. Runaway regexp matching is one of the easiest ways to really lock a JavaScript thread.
I completely agree. Honestly, regular expressions are hard to get a handle on anyway. I've done software development for about 12 years now, I run into regular expressions pretty regularly, and I still find them hard to read and understand. Almost every single time they do things that can easily be accomplished with a loop and a tiny bit of string manipulation, and honestly, even if the regex were a little faster, it's unlikely to be so much faster that you need the micro-optimization of using regex instead of some looping.
I try to avoid regex where possible and in the event that I do need to use it I try to document it the best I can and to make sure it's as simple as possible to avoid weird issues like this.
Well, it's easy to see why people are led astray. In this case, RFC 3986, which defines what a URI actually is, itself actually proposes a regex[0]. It even claims, and I quote, 'the "first-match-wins" algorithm is identical to the "greedy" disambiguation method used by POSIX regular expressions'. Of course, it doesn't actually validate anything... it just performs rudimentary field extraction. Trying to do anything beyond that with regex is madness.
The ABNF grammar is fairly simple though and, I think, free of ambiguities. I've had luck converting it straight into PEG form. It's not a trivial or wholly useful endeavour though, and, if you do so, remember to check the errata. For HTTP you'll also have to add in the changes from RFC 7230[1].
Oh, and of course, none of this validates DNS names, their labels, etc. for length, the "LDH rule", or the public suffix list[2], or IP addresses to check whether they have publicly routable prefixes.
Bottom line is, if you want to validate a URL, the best thing to do, much like e-mail, is to just try and GET it.
The general rule is that any tricky security code should have as many eyeballs on it as possible, so I'd probably start by seeing if my framework had solid support for this, or failing that, see what popular libraries for my platform exist.
In this particular case, the Google Caja project[1] is a good starting place for most HTML/JS/CSS sanitization needs (although the project has a much larger scope than just that); the 'sanitizer' package on npm is a fairly popular wrapper/port of its basic sanitizing code, and I believe Ruby/PHP/Python have their own, but I couldn't name them offhand. But it would depend on the exact attack vector you're trying to stop, e.g. XSS, remote shell via filename params, etc.
If I had to write it myself, I'd probably go for something as braindead as possible; probably a bunch of nested loops backed by some thorough tests. Regexps are great for magic one liners, but magic one liners are antithetical to good security.
I've never used a language mode that worked properly in all cases. That's why I've stopped using them, aside from very clever and extraordinarily useful ones like paredit.
It terrifies me to think about how these language modes contain regexps for matching regexp syntax. Of course those are wrong.
> I've never used a language mode that worked properly in all cases.
You need a full parser for that to happen, most mode authors don't bother[0]. JS2-mode (https://github.com/mooz/js2-mode/) does that (or attempts to).
Most syntactic colorisers are souped-up tokenisers, not actual parsers. It suffices in 99% of cases, but breaks badly elsewhere.
[0] it would help if more languages made fast parsing machinery externally available as a convenient API. Alas, that's rarely the case, even less so for fast parsing machinery which doesn't throw the information you need for syntactic coloration out the window.
I'm fairly sure that Emacs's go-mode, which doesn't use a full-blown parser, won't make the mistake of treating a paren inside a string as if it were outside either. Or if it did, that would be easy to fix.
Emacs has a generic lightweight parsing facility that can tell you if you're inside a string, or a comment, or how to skip over a string, or a pair of matching parens/braces/brackets, and so on.
The goal just shifts from having a semi-accurate regexp tokenizer, to having the language mode's complex parser match the complex parser of the language it targets—which will pretty much never happen, unless the language is exquisitely simple, like maybe restricted dialects of Scheme... or Brainfuck.
js2-mode supports ES6 fairly accurately. But of course, that's not enough, people keep asking for new things: JavaScript mixed with HTML, Facebook Flow support, JSX, new ES7 (still unstable) syntax extensions, and so on.
It would be easier with a more stable language. I.e. not JavaScript, which is still going through growing pains.
Even so, "25 Open, 204 Closed" seems pretty good to me.
fundamental-mode hasn't failed me even once, is what I mean. :) I'm not criticizing js2-mode, it's just that I have personally given up hope for accurate language modes in general, and realized that they are a lot of effort and IMO relatively little use.
With anything less than a full-featured lexer and parser of the latest edition of the language in question, I'd say. Including understanding of whatever metaprogramming features or preprocessors that are in play.
It's even more than full-featured for a code editor, really: you want a language parser that can handle partial statements more gracefully than even the core language parser itself can. von Neumann help you if your language was complicated to parse even before you added that requirement.
Visual Studio doesn't even do a full-blown AST - even though it has a full-blown internal parser. It frequently messes up auto-indenting when using macros involving blocks.
Lexing and producing tokens with a loop is probably enough given that auto-indentation works on a subset of states of the parser, if any. As my comment alluded to, using regex is harder to both write and read than a simple loop; it's also most likely slower.
This could be solved by using regexes to tokenise instead, then detecting brackets. Tokenising is something regular expressions can actually do well. Directly parsing source code is not.
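A sketch of that approach, with a deliberately tiny token set (a real language mode would need more string and comment forms than this):

    // The regex only tokenises; the counting happens in ordinary code.
    // Alternatives: double-quoted string, line comment, open/close bracket.
    const token = /("(?:[^"\\]|\\.)*")|(\/\/[^\n]*)|([([{])|([)\]}])/g;

    function nestingDelta(line) {
      let depth = 0;
      let m;
      while ((m = token.exec(line)) !== null) {
        if (m[3]) depth++;        // bracket outside any string or comment
        else if (m[4]) depth--;
      }
      return depth;
    }

Because each alternative is simple, the matching stays linear, and because strings and comments are consumed as whole tokens, a ( inside "foo(" never reaches the bracket cases.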
There's a comment by 'dave' - presumably, an Atom dev - explaining why that's not as quick a fix as you'd hope. But, yes, as soon as I got to the part in the article that read something like '... using backtracking to handle an unknown level of nesting ...' I reached pretty much the same conclusion: stop using regexps for this!
Edit: I think it might actually be the same 'dave' who authored the article, which demonstrates how much he's learnt just by fixing this one bug.
I think the reason is that all the actual JS code is in Atom core and not in this Go plugin, which inherited the whole indentation and syntax highlighting machinery from TextMate and Sublime. You can't change the code there without breaking all the language plugins (which can only use regexes).
"There are horrible people who, instead of solving a problem, tangle it up and make it harder to solve for anyone who wants to deal with it. Whoever does not know how to hit the nail on the head should be asked not to hit it at all."
You must not have finished reading. He originated the version that says "regular expressions", but not the "two problems" part or the sentiment about textmatching:
>As cute as the “now you have two problems” quote is, it seems that Jamie wasn't the first to come up with the idea. The same quote (but with AWK rather than regular expressions as the punch line) shows up in the sig of a John Myers post from 1988, where he credits a “D. Tilbrook” for it:
“Whenever faced with a problem, some people say `Lets use AWK.'
Now, they have two problems.” -- D. Tilbrook
The worst part is that as soon as he mentioned regular expressions I knew exactly what the problem was.
Regexes are powerful and useful but also dangerous. People who don't thoroughly understand them and try to get fancy often run into problems like this. In general you shouldn't be using them to parse a computer language anyway, it is something you should be using a tokenizer/parser for.
Hmm, if I remember right, on our compiler course we were taught to build tokenizers with regexps. As far as I understand, they are quite a valid tool for that.
Yeah, I was disappointed when he got to the crazy regex. He realised it was an unreadable mess, but then didn't realise that the real fix is not to use a regex at all.
Completely unrelated to the issue at hand... I had no idea Atom was written in CoffeeScript. I thought it was written in JavaScript, which made me positive about the idea of hacking on it.
But this code? Definitely giving me a headache, and my interest went down to zero. There seems to be a meme and unchallenged claim in hacker circles that CoffeeScript is somehow more "readable" and easier to understand.
Allow me to disagree, and I suspect there's quite a few like me. This code is harder to read. I think I understand what the code does, but I can no longer be certain about the language semantics. That introduces needless uncertainty.
And it just broke all my pre-configured and pre-setup JS-tooling. Can't use any of that for this codebase. Just great.
So what's up with people writing applications and projects in NodeJS, a prime JS environment which supports "all" modern ECMAScript features, classes included, and then deciding to go use a non-standard language for their app?
And for what gains? How did Coffeescript make this code more readable or easier to debug? It didn't. And now you need to debug code compiled from the actual code you wrote. How does that do anything except make everything harder?
TLDR: CoffeeScript seems like a bad choice for just about everything and I can't see why any big project with a desire for contributors would even consider using it.
First, good news: the Atom team is (slowly) moving away from CoffeeScript towards modern JS, and I think they'd agree (unofficially) that the choice proved to be a mistake.
Beyond that...
1. The JS world moves stupidly fast, and CoffeeScript is a relic of a now-vanished age. It was born, it evolved, and it died. Back in those long-ago days of, um, five years ago, there was no ES6, and CoffeeScript looked a lot more attractive. So much so, in fact, that ES6 stole a bunch of CoffeeScript's better features.
2. If you're familiar with CoffeeScript, it's terse and expressive and very readable. If you're not, it looks like gibberish. But that's true of any language. The proper critique of CoffeeScript should be "hey, not a lot of potential collaborators know it, so it'll seem unreadable to them", not "hey, CoffeeScript is generally unreadable". JS is pretty confusing and unreadable if you don't know it too.
Mind you, Atom was released 2 years ago, when Coffeescript was already starting to look dated. And it was an open source project looking for contributors so...yeah. Bad, bad choice. :)
In fairness, I built a fairly large app using CoffeeScript and it still took a lot of brainpower to read it compared to JS. All the typical complaints about ambiguous-looking syntax match my experiences exactly.
I ended up with a general workflow of having my Coffeescript source and the generated JavaScript sitting side-by-side so I could check that the output was what I was expecting. I've only had the urge to do that with ES6 and Babel a couple of times.
I built a pretty large app using CoffeeScript, and I was about to say that I disagreed and never felt like I needed to look at the generated output, but...
...now that you say that, I did end up going to the CoffeeScript REPL on their website to test a bit of syntax a fair number of times. I feel like I got used to it pretty fast, but I agree, a big chunk of CoffeeScript's learning curve is getting past its ambiguity, and there were some features I expressly avoided just because they were too confusing. E.g., I could never remember the difference between 'in' and 'of' in CS, so I just used underscore's equivalents. And then there were the comprehensions, and the weird scoping rules, and... yeah.
I'll be honest, switching from CS to a modern Babel/ES6+ configuration, I thought I'd really miss CS, but I really haven't. I maintain that if you know CS, good CS code is easy to understand, but good CS code takes way too much work to write. :)
Interesting. Comprehensions are probably the main thing I miss. I liked the automatic returns and the @ shorthand, but whenever I find myself using Python for whatever odd reason, having comprehensions again is like encountering a long lost friend.
I wrote a couple of (simple) apps in LiveScript, which is similar to CoffeeScript, but is still way ahead of ES6 in terms of features[1]. I had to look at the generated source only once: when I was debugging and ultimately fixing a bug in the compiler. That's expected: had my C compiler had a bug, I'd have to look at asm output, too.
Normally, pre-source maps, I'd also look at generated JS when debugging in the browser, but this is no longer the case. Other than the mentioned compiler bug I didn't have to look at the generated JS even once in my 2-3 years of using LiveScript.
I think a good question to ask would be why did you have to look at the generated source. What did that give you, and what could replace it? Is looking at the generated source the most efficient way to achieve your goals?
[1] The "killer feature" of LiveScript, for me, is its support for functional programming on par with support for OO. It reminds me of Scala or F# in that regard.
> If you're not, it looks like gibberish. But that's true of any language.
That's not true though. I'm not familiar with C++, for example, but it doesn't look like gibberish when I look at it. Nor does Go, as another example, even though I've never written a line of it.
CoffeeScript, on the other hand, still is hard for me to parse—even as someone who has written thousands of lines for it.
There are very real differences in readability between languages.
CoffeeScript is supposed to be this nice Ruby/Python-esque sugar for JS. It does resemble those languages, but it's not nice, it's very... stabby. Indentation is unintuitive and makes code do completely different things (whereas in Python it just delimits blocks, and if you get it wrong the compiler will moan at you). Functions implicitly return the last value they produce. You can omit brackets in function calls - but only if there are arguments; otherwise you've accidentally made a NOP.
It's horrible. It's a nice idea, but the implementation leaves so much to be desired. It's bitten me before and so I avoid it now.
CoffeeScript is not perfect; however, nowadays people have gone crazy for ES6/ES2015, and suddenly nobody complains about the readability of Babel output.
I've worked with both. ES6 is definitely a big improvement on CoffeeScript, but it takes like 80% of its things from Coffee IMO, except the indentation syntax and list comprehensions, and adds an `import` keyword.
My point is: don't complain about coffeescript while cheer at ES6, they are mostly the same.
Throughout its life, CoffeeScript has been disparaged by those who just seem to really hate its existence for some reason. At first it was "you can't debug generated code!"; now that everybody is using generated code it's "CoffeeScript is the past, you don't want to live in the past do you?"
Except... CoffeeScript does all sorts of nice things that Babel does not. List comprehensions. Lack of == operator. Correct modulo operator. Block regexes. Triple-quoted strings. The @ syntax.
There is a strong cult-like vibe to the JS community, and I really find it off-putting. They have taken a technical limitation "Browsers only support JS" and turned it into a socially enforced rule "You may only use JS". Fuck 'em.
> My point is: don't complain about coffeescript while cheer at ES6, they are mostly the same.
No, they are really not.
ES6 does not change the semantics of the language. It adds a couple of new useful features, but it still maintains the basic readability of JS.
I've written production apps in both. With Coffeescript, I regularly encounter problems where the generated output is not what I expected given the input and the only way I discovered that was examining the output JS.
With ES6/Babel, I have literally never inspected the generated output. It's never produced code contrary to what I expect.
The semantics are the same as JavaScript's (with a couple of minor exceptions), so it's just sugar really. I found it very easy to reason about; it often comes down to whether or not you are comfortable reading significant whitespace.
I liked CoffeeScript but haven't used it in years. IMO CoffeeScript is probably a bad choice for a modern dev environment.
>TLDR: CoffeeScript seems like a bad choice for just about everything and I can't see why any big project with a desire for contributors would even consider using it.
You're not familiar with CoffeeScript and you are used to the many JavaScript warts, so your perspective is biased.
But CoffeeScript makes many things easier and has good tooling (just another loader in webpack). While not perfect, it is a reasonably efficient tool for front-end dev.
If you can't read CoffeeScript then just compile it to JS. The result is more readable than what most people produce in JS by hand.
For me CoffeeScript is just shorthand syntax for well structured, fast JS although I agree that ES6 is good CoffeeScript replacement (destructuring FTW) if you can suffer through braces, colons and `function` keyword.
Isn't arbitrarily-long nesting or matching any sort of palindrome, where you count up and down, the classic case of something you should never do with a regex, because they're finite automata?
People don't build parsers because of masochism, but because regular expressions are provably insufficient to capture things like nesting. You need to go at least one level up on the Chomsky hierarchy[1] to pushdown automata for that.
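For reference, the standard pumping-lemma argument that balanced parentheses aren't regular fits in a few lines:

    \text{Assume the balanced-paren language } L \text{ is regular with pumping length } p.\\
    \text{Take } w = \texttt{(}^{p}\,\texttt{)}^{p} \in L. \text{ Any split } w = xyz \text{ with } |xy| \le p,\ |y| \ge 1\\
    \text{forces } y = \texttt{(}^{k} \text{ for some } k \ge 1, \text{ so } xy^{2}z = \texttt{(}^{p+k}\,\texttt{)}^{p} \notin L,\\
    \text{contradicting the pumping lemma. Hence } L \text{ is not regular, and no DFA/NFA recognises it.}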
More importantly, shouldn't SOMEBODY working on a text editor know and recognise this sort of thing? I'm all for the hacker mentality of shipping, reducing developer time instead of machine time, using what you know, but this is the sort of hacky fix that just pushes the problem further down the line.
> Isn't arbitrarily-long nesting or matching any sort of palindrome, where you count up and down, the classic case of something you should never do with a regex, because they're finite automata?
Yes, but features like backtracking and backreferences (or recursive referencing) make regex implementations non-regular regular expressions. That comes with a whole heap of issues (pathologically exponential complexities in any circumstance where backtracking is involved).
I guess NFA is the classic regex, but most implementations are vastly more powerful through the introduction of backtracking and back references.
In any case, a regex is hilariously unsuitable for the purpose here. It's obtuse, it has terrible corner case performance (as noticed here), it probably took vastly longer to write than a simple loop and apparently it isn't even correct.
I am still unsure why anyone uses Atom when it consumes so many resources. The emacs instance I have had open for some time now is using 92Mb. Atom instances have been reported in the hundreds of megabytes [1]. Of course, this is because it is actually a web application running in an entire browser, renderer and helper processes included.
I understand the need for people to be able to "easily" hack on it (easily in quotes here because really that just means "people who know web languages"), but that goal could still be accomplished in a native application that embedded a JS runtime for plugins.
> but that goal could still be accomplished in a native application that embedded a JS runtime for plugins.
The easy-to-hack bit is not about the specific scripting language but about the UI being completely hackable.
No other editor (except emacs) lets you hack the interface as well as Atom.
And any attempt at exposing enough hooks to allow a native text editor UI to be as hackable would, in a kind of Greenspun's tenth rule, probably result in a bug-ridden, poor implementation of an HTML engine.
Because it has a friendly interface, while Vim and Emacs have all these "weird" keybinds and whatnot. These are people switching from e.g. SublimeText to Atom.
I am not trying to harp on Vim or Emacs, I myself am a Vim user and use it for C++ code.
> Because it has a friendly interface, while Vim and Emacs have all these "weird" keybinds and whatnot.
vim and emacs are friendly and welcoming to people who already know them; new users in 2016 have some weird expectations due to growing up using insufficiently-powerful UIs, which means that they have quite a learning curve when picking up a powerful UI.
> vim and emacs are friendly and welcoming to people who already know them; new users in 2016 have some weird expectations due to growing up using insufficiently-powerful UIs, which means that they have quite a learning curve when picking up a powerful UI.
Having first used both vi and emacs (and other text-mode, and even line-mode, editors), though only casually then, I disagree; vi and emacs are, like almost anything, friendly and welcoming to people who have become deeply familiar with them, but they simply aren't as accessible as tools that benefit from the intervening years of advances in making UIs easy to use for people who haven't invested huge amounts of time in familiarity.
This is really a different issue than UI power (though if vim and emacs didn't have both powerful UIs and substantial bodies of users with long investments, they wouldn't stick around in the face of their disadvantages in terms of onramp).
Older users just didn't, when they were new, have alternatives with a simpler learning curve. So, there really was no trade off for the power vi and emacs offered (less powerful alternatives of the time were still just as inaccessible), where now there is.
> So, there really was no trade off for the power vi and emacs offered (less powerful alternatives of the time were still just as inaccessible), where now there is.
There's simply nothing out there as good as emacs. Nothing. Eclipse, IntelliJ, Atom, SublimeText, all those pale in comparison. vim has its positive points (it's an excellent way for a human to edit line-oriented text), but ultimately it too falls down in the general case. emacs is simply the best way for a human being to use a computer to edit data.
There still isn't an alternative to vim and emacs. I kinda wish there were, but there ain't. Someday I'd like to use a modern emacs-like editor, implemented in Common Lisp (or a successor language), but right now emacs is the pinnacle of editor evolution.
> vim has its positive points (it's an excellent way for a human to edit line-oriented text), but ultimately it too falls down in the general case
I think the only thing that Vim doesn't do that it should is use a proper language for scripting. Emacs has elisp, but Vim has VimScript, which is a horribly stunted language.
Aside from that, I much prefer vim to emacs. Just because everything is mode-based, and the keybindings don't require 10 hands to do anything.
> I am still unsure why anyone uses Atom when it consumes so many resources. The emacs instance I have had open for some time now is using 92Mb. Atom instances have been reported in the hundreds of megabytes [1].
It works well on the size of files I mostly want to use it with, has a strong community producing support for the languages I want to use it with, doesn't want to impose its own structure and management artifacts on projects the way IDEs tend to, and imposes less cognitive overhead than Emacs [0]. It might use more memory, but that's never been a real practical problem, and even on my cheap 6GB RAM laptop, the difference between a programming editor taking up ~100MB and "hundreds of MB" just isn't an issue in most cases.
[0] I recognize that this is largely due to familiarity, in that Atom follows closer to currently-popular UX paradigms and Emacs is its own unique beast that predates most of the things that Atom is following, but I've been using Emacs intermittently for many years, and Atom was still instantly more accessible. But even so, there are some languages where the available Emacs packages are so good that I prefer Emacs. Just as there are some things for which I prefer Visual Studio or an IntelliJ-based IDE. It's not like one tool is ideal for every person for every task.
The syntax will be broken 99% of the time and making a parser that recovers gracefully under any circumstance is very hard.
You could make an editor that only allows valid programs but that opens up a lot of UI problems, it was tried many times and it never took off in practice.
Yet Eclipse and IntelliJ IDEA (and most likely a lot of other IDEs) successfully use ASTs for their code editors. The syntax will only be broken locally, so it's not that hard to recover.
I tried Eclipse many times over the last decade. I always used it for a few weeks before code highlighting and completion drove me insane with crashes, hangups and slowdowns.
Trouble with elaborate 'correct' solutions starts when you need another thing. I don't think support in Eclipse for .cjsx or PureScript or whatever is coming soon. I won't wait 10 years till they get it 'right'. I'll use Atom to get the 95% right in a few months.
It doesn't need a full AST. It needs a tokenizer and a nesting level or potentially a stack of different classes of token nesting.
This parser would not be difficult to write; the only "recovery" is in choosing how to react to mismatched token pairs. The simplified problem means heuristics could potentially be used.
Not just broken, but broken and changing. Efficiently updating an AST as the user types, especially when their changes could trigger dramatic alterations to the structure of the tree (e.g., typing /*) is a very hard problem. A responsive editor is pretty much always going to have to take some shortcuts.
Here's another one that needs to get fixed with Atom. Try writing this in the editor with syntax set to Go:
expected:

    func someFunc() {
        aSlice := []string{}{
        }
    }

actual:

    func someFunc() {
        aSlice := []string{}{
    }
    }
The end bracket on the slice's initializer never indents correctly when you type it and hit <enter>. It always defaults to the first character of the next line. It seems insertNewLine somehow is not able to grok the idea of more than one set of matching brackets.
Thanks for the tip -- yea, I have go-plus installed. It is helpful, but still isn't ideal to have to save the file in the middle of trying to initialize a slice.
Would switching to standard JS regular expressions substantially improve Atom's performance, even though it would mean losing some advanced regex features? I remember hearing that V8 JIT-compiles regular expressions to native code.
I wonder if this is why Atom feels so snappy now. It's my main editor as of a few weeks ago because the speed is now good. It used to choke on a 1000 line rails controller, but now it edits that file just perfectly.
Congratulations to the author, great commit! Small payload, enormous benefits.
It seems there's a need for a regex linter (eslint plugin?) that could warn on pathologically complex regexes. Is building this kind of plugin feasible?
Having said that, probably few people would explicitly opt in to using such a plugin unless it's bundled by default to some linter.
Well, regex101 [1], linked in the article, detects it, so it's definitely possible. Whether that code is open, or it's easy to replicate, is another matter.
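A crude version of the check is itself just a pattern match on the pattern source, something like this (it misses plenty and has false positives; whatever regex101 does is certainly more thorough):

    // Flag the classic catastrophic shape: a quantified group whose body
    // also ends in a quantifier, e.g. (a+)+, (\d*)* or (x+)*y.
    function looksCatastrophic(source) {
      return /\([^()]*[+*]\)[+*{]/.test(source);
    }

    looksCatastrophic('(a+)+b');   // true
    looksCatastrophic('a+b*c?');   // false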
Technically, yes.
But I wouldn't be surprised if deciding on "best practices" is hard, given that regexes are very often used to get something "just to work".
Dude, you don't need to be so semantic. The headline had "atom" in the title, it had to do with Atom the editor, what more do you want? Do you think every headline should be excruciatingly specific and exact? Might as well just put the whole article in the headline.
I fixed my car the other day by changing a tire - well, technically, I fixed a problem in the car/going-on-the-highway package. But no one on earth would talk or write like that.
Yes it does, and that was my biggest annoyance with this article.
The Chrome debugger with breakpoints, profilers, and a whole slew of other goodies is great for this sort of debugging.
Yeah, sometimes I need to avoid it because turning it on can actually slow the code down significantly, but the profiler would have been perfect to see where the time was spent here with just a few clicks.
Perhaps a core contributor or someone who regularly works on the code-base would have approached it with the full suite of debugging tools available as you mention.
However, I get the impression that the author of this article is not one of those and simply wanted to crack open his editor to fix this one problem he saw (and happened to learn more about Atom, Coffeescript and regexps along the way).
Being upset over a great open-source contribution because they used "non-ideal" methods to arrive at a solution is just silly.
I probably could have worded it differently, but I'm not upset about it, it just makes me sad that people might not know that these tools exist!
I often use the same kind of thing for debugging (console.log('got here')) because it's sometimes too much work to leave the code just to get an understanding of whether something is hit or not.
When I started reading the article I was really hoping he would use the profiling tools. I just feel sad that they are overlooked a lot of the time, and this is a textbook-perfect use case for them!
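For anyone reading along, the console flavour of those tools is only a couple of lines (in Atom the devtools are under View > Developer > Toggle Developer Tools, if I remember right; suspectCall is a made-up stand-in for whatever you think is slow):

    // Stand-in for the call under suspicion.
    function suspectCall() {
      /^(a+)+b$/.test('a'.repeat(24));   // same pathological shape as the bug
    }

    console.profile('indent-regex');     // start a CPU profile
    suspectCall();
    console.profileEnd('indent-regex');  // stop; inspect in the Profiles panel

    // Cruder, but works anywhere:
    console.time('indent-regex');
    suspectCall();
    console.timeEnd('indent-regex');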
Yet more proof that anyone who uses regular expressions to attempt to parse a context-free grammar is in a state of sin.
Balanced parentheses form a context-free grammar, and thus cannot be parsed by regular expressions. There are extensions to regexps which make them irregular — and hence capable of parsing CFGs — but they often lead to poor performance, as here.
I'm not familiar with the way syntax highlighting and code indentation logic in text editors is implemented, but using regular expressions to try and parse a CFG like the Go programming language seems like a phenomenally bad idea.
Maybe everyone does it and everybody is happy with it working only in 90% of cases, but it just seems like the wrong approach to me.
The lesson is: don't parse with regexes. At all. Not even for a first draft of a plugin adding editor support for some obscure language should regexes be used for this kind of job.
Surprised to see that mistake in code that otherwise looks pretty high quality.
I really really want to love this editor, it's beautiful, but it burns me every time I use it and I end up going back to sublime text. Slow, buggy, hangs, 100% cpu usage, huge amounts of memory usage...
For those who want a quick look at the state of Atom today and its packages, I made a video covering the top frontend and backend packages for Atom.
https://www.youtube.com/watch?v=cFAzqvYoHJs
Grrr, that's not a regular expression at all. I think you'd be better off looping over the string and counting the number of ('s and )'s, guaranteed linear time.
I wish people would stop abusing regular expressions.
Performance-wise, I never understood why Atom is even used. It is lackluster compared to Notepad++ and there is a delay in every action: loading the software, clicking on a tab, on a menu, on an option. It seems to me like a very wrong idea to push the web into desktop software.
It's definitely not the fastest opening editor, and it does have performance issues, but they tend to be more rare than common.
The 2 that bite me are:
* the update/install screens in the settings tend to be a bit slow
* opening "large" files that have 1000+ characters per line will either hang or crash the editor, depending on the size. If it's a "normal" looking source file, it runs like a dream with files up to 5GB+ (the largest I've used it for), but if it's something like a minified JavaScript file, a few KB is enough to hang the editor.
Outside of those 2 issues, I don't see any lag or stuttering in my normal use, and I've got about 70 plugins on top of the defaults.
But to answer your question, it's the customizability and the massive number of plugins that draw me. Plus it's fucking beautiful, and I know this isn't the most popular opinion, but if I'm going to stare at this thing for 6+ hours a day, I want it to look good!
It's by far the slowest compared to Sublime Text 3 and Visual Studio Code. VSCode and Atom obviously suffer from being Electron based compared to ST3, so the comparison there isn't exactly fair, but VSCode is noticeably faster to load and snappier to work with even loaded up with third party plugins. I have a similar amount of plugins for each of the three (10-15) too, so no real difference there.
And honestly I don't see a marked difference in visuals between those three. After installing my zenburn colour scheme they're virtually identical with similar if not identical UI features.
I don't expect the Electron based editors to match ST3 for performance (at least not yet), but honestly it's kind of embarrassing how slow Atom is compared to VSCode, particularly when it comes to things like checking for package updates and just starting up.
This is all based on using all three editors within the last three months (I've been swapping around trying to find what I like).
Totally agree. The performance is atrocious. Most plugins don't work. It can't handle files bigger than a few hundred lines. Update fails often requiring a reinstall. It's such garbage, I don't even have it installed anymore. I thought it'd compete with Sublime. Was I ever wrong! It doesn't compete with TextPad or Notepad, let alone Sublime. I couldn't find a single redeeming feature.
Notepad++'s search feature cannot be matched by any other text editor I've come across. I love the intuitive search results window, the direct link to the source document...
"How I fixed a bug in Atom that affected almost no one, and then spent quite a lot of time writing an article about it, then wrote a title that attempted to give me more credit than I deserved"
... which is his main pastime if you look at his other posts, rather than spending his effort on fixing bugs that actually affect a lot of people. I am sorry, but I don't really appreciate it and have trouble getting over the misrepresentation in the titles of his blog posts.
Sure the title is a bit sensationalist, but there was a legitimate issue to fix and he did fix it.
Plus, it is a good write up that explains the problem, solution, and the steps taken well.
It would be a shame to see others put off from fixing bugs, and posting commentary on the solution, in open source software for fear of being put down.
Others should be put off from taking more credit than due in the title of their posts and from taking on bugs that aren't important just for that purpose.
No, nobody should be put off taking on any bug. I hugely appreciate the fact that different people are up for different challenges and, therefore, with enough people, bugs should be fixable. Your ethos appears to be 'some bugs are just boring and not important enough; never bother fixing them'.
Wow, talk about "haters gonna hate". This dude went into great detail about debugging the issue, which is very interesting. Your negativity isn't really welcome here.
I've never written any Go and I don't use Atom. I also found the article quite interesting. Do we really need a comment on every HN post asking why this is written or created or upvoted?
Whether or not the core of your criticism is legitimate, I have a certain distaste for this kind of negativity and I don't feel that it adds to the quality of discussion on this site.
If you had restrained your point to the misrepresentative title then you might have built consensus. Instead you led with a creditworthiness argument, which was your weaker and more sensational point.
Start debugger, cause 'freeze' to happen, pause debugger, look at call stack. I touch JavaScript as little as possible, but that's what I would do in any decent language.