Most languages don't have ASTs that work well in "degenerate" cases like unfinished, work-in-progress, or sample code. These are things you often want to have in source control. There are also very few AST standards shared between languages, which means a lot more language-dependent work for an SCM.
(For what it's worth, I toyed with what I think is a useful compromise: using syntax-highlighting tokenizers, which perform well and interoperate well with character-based diffs: https://github.com/WorldMaker/tokdiff)
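A minimal sketch of the idea, using Python's stdlib tokenizer rather than tokdiff's actual highlighting tokenizers: two versions that differ only in spacing look different to a character or line diff, but produce identical token streams.

```python
import difflib
import io
import tokenize

def tokens(src: str):
    """Reduce source to (type, text) pairs, dropping layout-only tokens."""
    skip = {tokenize.NL, tokenize.NEWLINE, tokenize.INDENT, tokenize.DEDENT}
    return [
        (tok.type, tok.string)
        for tok in tokenize.generate_tokens(io.StringIO(src).readline)
        if tok.type not in skip
    ]

old = "total = price * quantity\n"
new = "total   =   price*quantity\n"

# A character/line diff sees a change; a token diff sees none.
print("lines differ:", old != new)
print("tokens differ:", tokens(old) != tokens(new))
```

A real tool would then run the usual diff algorithm over the token streams instead of over lines, so whitespace-only edits vanish from the diff entirely.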
I agree that degenerate cases are one of the major problems with trying to move to higher-level editing abstractions (and it's one of the big issues I faced with my own application), but I don't see how it's a problem here.
If it doesn't have a valid AST, then it's not a valid program, either. If you're using a formatting program (like prettier or gofmt) and pass it an invalid program, you're either going to get an error, or undefined behavior. And anyone else opening it in a different editor is going to have their language-mode interpret the invalid program differently than yours, too. Source code tools like structured search won't work predictably, either.
This sounds to me like saying "An XML structure editor? But what if I want to put an invalid XML structure in a file called foo.xml and commit it to the repository?" Or my first boss complaining that visual text editors didn't let you see every byte. Or Mel needing to know drum addresses so he could use them for constants.
As time goes by, we move to higher-level abstractions, and (thanks to tools like gofmt) we're already at the point where you probably shouldn't be pushing source code that doesn't even have a valid AST and expecting all your tools to work perfectly. There are plenty of ways to write and commit WIP code without needing an invalid AST on disk.
I thought similarly for many years. What's the use of code in source control that doesn't compile anyway?
Then I started paying attention to all the reasons why we naturally might want to check in "invalid" code.
All the cases where I'm never going to finish an entire refactor in a single commit, and the whole refactor makes far more sense in documented steps where many of the intermediate steps will never compile.
All the cases where sending broken code to a source control server at the end of the day is both the best way to back it up and the easiest way to get fresh eyeballs on it in the morning. Even with CI systems in place, sometimes the errors the CI bot can tell you are as useful as the ones your own machine's build environment can tell you. (Especially in those weird cases where it turns out that maybe it's the build environment on your own machine that's the problem, and you've been beating yourself up over a bad install of something that should be unrelated, and the build errors on the remote machine lead you to the real problem.)
All the cases where I might build tests first (and the compiler is a test, especially in a statically typed language) before working backwards to make the tests compile, run, and pass.
A lot of people like to think of programming code as some purely logical construct, but good code is poetry, even in its mistakes. Sometimes we need drafts to tell our stories right. Sometimes we need bad poetry in source control as a warning to others to help them realize what good poetry can be.
It's interesting to want code editors to keep us from ever writing bad poetry to disk, but it's also somewhat inhumane. People write bad poetry all the time, it's a natural skill. Saving bad poetry for posterity is sometimes the only way we get better poetry.
I once saw a PhD thesis that argued that the best way to go was to use AST diffing at the top level (classes, method definitions, etc.) but line-based diffing for the inner parts (method bodies, etc.).
Small textual changes can lead to large changes in the AST, which results in confusing diffs. Merging is also non-trivial.
On the other hand, for top level structures following the AST keeps things saner.
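A rough sketch of that hybrid scheme, assuming Python source and the stdlib `ast` module (the thesis's actual algorithm is surely more sophisticated, and this ignores imports and module-level code): match top-level units structurally, then fall back to ordinary line diffing inside each unit.

```python
import ast
import difflib

def top_level_units(src: str):
    """Map top-level function/class names to their source lines.
    (A simplification: imports and module-level statements are ignored.)"""
    lines = src.splitlines()
    return {
        node.name: lines[node.lineno - 1 : node.end_lineno]
        for node in ast.parse(src).body
        if isinstance(node, (ast.FunctionDef, ast.ClassDef))
    }

def hybrid_diff(old_src: str, new_src: str):
    """AST-level matching of top-level units, line-level diff inside each."""
    a, b = top_level_units(old_src), top_level_units(new_src)
    report = []
    for name in sorted(a.keys() | b.keys()):
        if name not in b:
            report.append(f"removed {name}")
        elif name not in a:
            report.append(f"added {name}")
        elif a[name] != b[name]:
            report.append(f"changed {name}:")
            report.extend(difflib.unified_diff(a[name], b[name], lineterm=""))
    return report

old = "def f():\n    return 1\n\ndef g():\n    return 2\n"
new = "def f():\n    return 1\n\ndef g():\n    return 2 + 2\n"
for line in hybrid_diff(old, new):
    print(line)
```

Because units are matched by name rather than by position, moving a function within the file produces no diff at all, while the inner line diff stays as simple and mergeable as what git already does.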
It's also similar to the thought path that led me to trying a tokenizer for diffs. The tool, as is, essentially gives you character-based diffing everywhere, which keeps the inner parts sane/mergeable.
One of the things that I never got around to doing was exploring the higher-level opportunities, but they are there. Just as the tokenizer is the first pass in AST building, tokenized diffs should contain enough information that if you wanted to try to do higher level diff analysis you could do some interesting things.
Storing the AST works for a class of problems, namely whitespace issues, but a refactoring as simple as renaming a token still makes merges a mess. (And if you store a whitespace / comment decorated AST, you can still have a messy merge.)
That said, you could address those and other issues; you're just working with something several steps removed from an AST. Or you could call the AST the 90% solution.
And how do you decide, given two versions of a source file, if the token was renamed or it was a new one?
This looks "trivial" for trivial cases, but I can imagine lots of difficult corner cases there. Depending on how you define what constitutes a "rename", this may even be undecidable.
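To make the corner cases concrete, here's a toy heuristic (entirely hypothetical, not from any real tool): treat two token streams as a rename only if they have the same length and differ by one consistent name-for-name substitution. Anything else, like reordering, partial renames, or two names collapsing into one, defeats it, which is exactly the point.

```python
def looks_like_rename(old_tokens, new_tokens):
    """Crude rename detector over flat token streams.
    Returns the substitution mapping, or None if the change
    can't be explained as one consistent rename."""
    if len(old_tokens) != len(new_tokens):
        return None
    mapping = {}
    for a, b in zip(old_tokens, new_tokens):
        # Any token that maps to two different replacements breaks the theory.
        if a != b and mapping.setdefault(a, b) != b:
            return None
    return mapping or None

print(looks_like_rename(["def", "area", "(", "r", ")"],
                        ["def", "area", "(", "radius", ")"]))
```

Note that even this trivial version can't distinguish "`r` was renamed to `radius`" from "`r` was deleted and an unrelated `radius` was added"; past that, the definition of a rename is a judgment call, not a computation.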
Actually, even a basic AST-aware diffing algorithm would be a nice thing. I write a lot of Lisp code - Lisp source is, quite literally, an AST serialized to text - and regular diffs in SCMs are really annoying. Like, I remove some AST node in the middle, and git diff will think most of the function changed, just because indentation shifted a bit.
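The complaint reproduces in a few lines. This Python sketch (a hypothetical example, splitting Lisp-ish text into parens and atoms) shows a whitespace-only rewrap that a line diff flags across most of the body, while the token stream is untouched:

```python
import difflib
import re

old = "(defun area (r)\n  (* pi\n     (* r r)))\n"
new = "(defun area (r)\n  (* pi (* r r)))\n"  # same tree, just re-wrapped

def toks(s):
    """Split S-expression text into parens and atoms, ignoring whitespace."""
    return re.findall(r"[()]|[^\s()]+", s)

# Line diff: the re-indentation makes most of the body look changed.
line_diff = [l for l in difflib.unified_diff(old.splitlines(), new.splitlines())
             if l.startswith(("-", "+")) and not l.startswith(("---", "+++"))]
print(line_diff)

# Token diff: nothing changed at all.
print("tokens equal:", toks(old) == toks(new))
```

For Lisp in particular, even this paren-and-atom tokenization gets you most of the way to an AST diff, since the text already is the tree.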
Only for a tiny, tiny portion of what constitutes a coding style (indentation depth), and only if the original code consistently used tabs for indentation and spaces everywhere else; otherwise everything looks like crap when you change the tab width. In my experience, most tab-indented projects end up mandating a tab width in their coding style, and using anything else leads to chaos. See the Linux kernel, for example. Besides, if you really value flexible tab width, you can no longer impose a maximum line length, since line length becomes tab-dependent.
Beyond that it's quite weird to make tab width configurable but nothing else. What if I prefer to have opening { on the same line as the function prototype? What if I prefer my identifiers to be in CamelCase? What if I prefer not having spaces around arithmetic operators? All that could be done relatively easily if code was stored as an abstract AST and formatted on demand.
Tab indentation is theoretically superior to spaces if used correctly but in practice it's probably not worth the trouble. At least as far as I'm concerned I gave up on tabs a long time ago.
It's less ridiculous than fighting over this stuff in code review.
It's usually teams or projects agreeing on a coding standard, and those standards dictate every other trivial detail. As a developer, you are an author, and you're writing with a "voice." On a team, it's a courtesy to the reader to write with the team's voice rather than your own.
Doesn't work as well in languages where indentation is significant. But even in languages where indentation is insignificant you'll run into issues doing things like line splitting.
> Doesn't work as well in languages where indentation is significant.
The point is that the code on disk needs to work when it's compiled. That's all. You could write an editor that displays the code in a circle so long as it writes properly indented code when you save. Nothing about the IDE is actually important.
FWIW Black ( https://github.com/ambv/black ) is a disciplined Python formatter that's reasonably successful, so I'm not convinced that the "significant indentation" thing actually matters too much; it probably just means that formatters need to be more careful to not introduce bugs, but the same principle of code -> AST -> code seems just as valid.
What line splitting issues are you referring to? I've been pretty happy with the approach used by Prettier (and I believe Black) of fitting it on one line if possible, and doing one per line otherwise. I'd say that's my favorite part of Prettier: I don't need to think about wrapping anymore, it figures it out, so it's sad to see that gofmt doesn't do that (from my testing just now).
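That wrapping rule is simple enough to sketch. This is a hypothetical helper, not Prettier's or Black's actual code: render the call on one line if it fits within the width, otherwise put one argument per line with a trailing comma (as Black does).

```python
def format_call(name, args, width=40, indent=4):
    """Fit-or-explode wrapping: one line if it fits under `width`,
    otherwise one argument per line with a trailing comma."""
    one_line = f"{name}({', '.join(args)})"
    if len(one_line) <= width:
        return one_line
    body = "".join(f"{' ' * indent}{a},\n" for a in args)
    return f"{name}(\n{body})"

print(format_call("add", ["a", "b"]))
print(format_call("configure",
                  ["retries=3", "timeout_seconds=30", "verbose=True"]))
```

The real formatters apply this recursively (an argument that is itself too long gets exploded too), but the core decision is exactly this binary fit test, which is why the output feels so predictable.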
I think you're only considering a one-way transform, what's being proposed above is two-way. i.e. Store the output of Black as the 'canonical' version, but translate it back into your preferred style when looking at the code in your editor.
> but the same principle of code -> AST -> code seems just as valid
The question is really the reversibility of the transform in a reasonable way. If you are translating all your source to an AST there is technically no reason multiple programmers even need to be working in the same language. However in practice, even within a single language, there are many possible transforms from an AST back into code. Some will make sense, some will be nearly indecipherable. In practice you either limit the allowed transforms such that there is little benefit over a simple linting tool or the problem explodes in complexity.
> Doesn't work as well in languages where indentation is significant.
Store the bytecode; indents and dedents are present as <indent> and <dedent> tokens, which are no different than e.g. <brace, open> and <brace, closed>.
Well, technically those are in the tokeniser; the AST or bytecode would reify the structures they delimit, and if you stored the AST or bytecode the separators would be regenerated from there.
But the point is that, for the machine, indentation is no different from any other token. (In fact, Haskell is defined using braces and semicolons; the common indentation-based version is an augmentation of the base language which semantically just inserts braces and semicolons.)
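Python's own tokenizer demonstrates the point directly: indentation changes arrive in the token stream as explicit INDENT/DEDENT tokens, on par with names and operators.

```python
import io
import tokenize

src = "if ready:\n    launch()\ncleanup()\n"

# tokenize only lexes the source; nothing here is executed.
names = [tokenize.tok_name[t.type]
         for t in tokenize.generate_tokens(io.StringIO(src).readline)]
print(names)
```

The printed stream contains INDENT right after the `if` line's NEWLINE and DEDENT before `cleanup`, so a tool working at this level really does see block structure as tokens, not as whitespace.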
The typesetting is used to convey information, though. I use indentation to convey information about connected concepts. The indentation of my in-code comments conveys how important they are and whether we're at the start of a new paragraph.
I'm more explicit than most, but everyone does this to some degree. The most basic example is indenting code blocks.
Generally I find it harder to debug auto-formatted code. I lose information about who the author(s) were, and it makes it harder to see where breaks in ideas happen.
Back when I was in Excel, every section had its own style and unwritten rules, inevitable with 30-year-old software. Once the code is auto-formatted, all information about those unwritten rules is lost.
Good point; some meaningful typesetting may be lost. Then again, if such a system were used, programmers would avoid writing code that made significant use of non-syntactically-relevant aspects of code. Furthermore, it might influence language designers to make documentation part of the language syntax.
It's not really documentation per se. It's flaws programmers are prone to. Everyone has their own set. Knowing the author, in an abstract sense, speeds up the process.
Isn't that currently the case? Colors and styles are not saved in the code but are applied by syntax highlighting when displayed. The display of elastic tabs could probably fit in there somewhere.
Not if you make documentation part of the syntax (which is often sort of the case today, to make it easier to generate docs from code). It would force you to fix the documentation format, though, but that wouldn't necessarily be a bad thing.
There are tools to take a well defined format and generate pretty documentation from code. What happens in real life is that people write the bare minimum that describes the interface for the function, but not how it works or how to use it beyond a few words. And then you end up with very pretty autogenerated docs that don't help you at all.
Okay but that's kind of a different problem isn't it? If people can't be bothered to comment their code properly I doubt they're going to write high quality LaTeX documentation on the side.
I find that Rust deals with that mostly well: the doc format is part of the language, and documentation generation is part of the toolchain. You can even validate example code from the comments. You can enable warnings for missing documentation on methods. In my experience, documentation for Rust crates is generally at least okay-ish; it usually saves me a trip into the code itself. That being said, Rust prototypes are also a lot more expressive than their C counterparts (no questions about "should I free that pointer?"), so that probably helps as well.
My point is that the harder you make it for people to comment their code, the less useful comments you'll get. Freeform text (where you can even whip up a ghetto table or graph with a fixed size font in a few minutes) gives you the best chance of meaningful comments.
I'm not convinced; I don't find the Rust format particularly hard to grok. It's Markdown, so it looks mostly fine when viewed as plaintext. A Markdown table is pretty ghetto, but on top of that it'll display nicely in the generated HTML. Also, if you really want to do ASCII art, you can always use ``` to escape everything: https://github.com/simias/pars/blob/ce820b5af0be3091d47104cf...
I guess it does add a little bit of friction, but if somebody tells me they don't write docstrings because they don't want to bother with the Markdown syntax, I'm going to assume they're just looking for excuses at this point.
Seems to me text would not be an issue at all; examples might be. And you could integrate the documentation format with the language itself, such that code examples in the host language can be recognised and properly formatted by the editor or renderer.
A simple, if inelegant, way to achieve this: Use a standardised format on save. On read, pretty-print based on user-defined settings.
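The "standardize on save" half can be sketched with Python's `ast` module, assuming Python 3.9+ for `ast.unparse`. Note that `ast.unparse` drops comments and original layout, so a real tool would need a lossless concrete syntax tree, but it shows the idea: stylistic variants collapse to the same bytes on disk.

```python
import ast

def canonicalize(src: str) -> str:
    """On save: render the parsed AST in one standard form.
    (Lossy: comments and hand layout are discarded.)"""
    return ast.unparse(ast.parse(src)) + "\n"

# Two stylistic variants normalize to identical bytes on disk;
# the "pretty-print on read" half would apply the reader's own formatter.
print(repr(canonicalize("x=( 1+ 2)\n")))
print(repr(canonicalize("x = (1 + 2)\n")))
```

The read-side half is then just another formatter, driven by the reader's settings instead of the project's, with the repository never seeing either style.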