Code as text is a problem (dohgramming.com)
31 points by smowledge on May 6, 2017 | 26 comments

Oh this perennial idea. That our editors could produce plain text, but represent everything as a rich data structure.

This sails happily along until someone realizes that they _already do that_. Code is structured. It compiles to an AST which can be worked on by any number of IDEs and editors.

You want a super rich structure in that regard? Something that makes refactoring a breeze? Well, you're asking for a strong, expressive type system.


Nothing to see here.

Came here to say this. Yes, use an IDE.

Obviously, there are huge benefits of using text as storage. What if your team wants to start using Java, but git does not support it yet? Or Joe's editor?

What are good tools for running scripts on an AST? Are you talking about like an Eclipse plugin or something?

Roslyn is basically a compiler API for .NET. Take a look at this walkthrough [0] to see how to implement custom diagnostics and quick fixes that operate on AST level.

[0]: https://social.technet.microsoft.com/wiki/contents/articles/...

I will repeat my comment. See Projectional Editors and Nim macros.

Maybe the real question here is why we don't use standard parsers everywhere all the time? Well, for some data formats we usually do - take JSON for example. (You can do a lot with jq.)

I think the problem is that for most languages, the AST is very complex, and writing a transformation pass isn't something you do as a one-off hack.
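
To give a feel for the size of even a tiny pass, here is a rough sketch in Python using the standard `ast` module (Python 3.9+ for `ast.unparse`). The `RenameFunction` class and `rename` helper are made up for illustration. It also shows why this is rarely a one-off hack: it ignores scoping, so any variable that happens to share the name gets renamed too, and comments and formatting are lost on the round trip.

```python
import ast


class RenameFunction(ast.NodeTransformer):
    """Rename a function definition and every plain-name reference to it."""

    def __init__(self, old: str, new: str) -> None:
        self.old = old
        self.new = new

    def visit_FunctionDef(self, node: ast.FunctionDef) -> ast.AST:
        if node.name == self.old:
            node.name = self.new
        self.generic_visit(node)  # keep walking into the function body
        return node

    def visit_Name(self, node: ast.Name) -> ast.AST:
        # Caveat: no scope analysis, every identifier with this name matches.
        if node.id == self.old:
            node.id = self.new
        return node


def rename(source: str, old: str, new: str) -> str:
    """Parse, transform, and print the source back out as text."""
    tree = RenameFunction(old, new).visit(ast.parse(source))
    return ast.unparse(tree)


SOURCE = '''
def add(a, b):
    return a + b

print(add(1, 2), "add")
'''

RESULT = rename(SOURCE, "add", "plus")
```

The string literal `"add"` is left alone, which a plain text search and replace would not manage without extra care.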

Lisp famously keeps things simple. I think we could do better than that on syntax, but designing a language with an AST designed for hacking still seems like a good idea.

But there are other ways a language can be easier to hack on. The gofmt tool can do a search and replace on Go expressions and also write the output in Go's standard format. This builds on the language's strengths: (1) having a standard format (2) mostly unambiguous references to imported symbols, so you don't need type-checking to disambiguate. This makes it much easier to find all references to a function.

Take a look at how Smalltalk does this (try Pharo, for example). You don't edit code as a whole: you add classes, methods, fields and edit them separately.

This has several benefits: changes are isolated - easier to reason about and track in version control system, refactoring is easier.

Interesting. Is Pharo image-based like Smalltalk used to be, or does it use something else?

Programmers are just about the last people who work with filesystems directly. I expect git will be used for sharing for a while, but I wonder what it will take to move beyond checking out code into local working directories to a more cloud notebook-style approach? (There are lots of experiments but it's not mainstream.)

Image-based, as Pharo is a close fork of Smalltalk.

I've been thinking about a source control system that tracks functions and classes instead of bytes. You could track who changed that helper method and not be distracted by classes moving between files.
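
A very rough sketch of the core idea, in Python with only the standard library: fingerprint each function by its parsed structure rather than its bytes, then compare two versions. The names `function_fingerprints` and `changed_symbols` are hypothetical, and keying on the bare function name is a simplifying assumption (two methods with the same name in different classes would collide).

```python
import ast
import hashlib


def function_fingerprints(source: str) -> dict[str, str]:
    """Map each function name to a hash of its AST."""
    prints = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # ast.dump() leaves out line/column info by default, so moving
            # or reformatting a function does not change its fingerprint.
            digest = hashlib.sha256(ast.dump(node).encode()).hexdigest()
            prints[node.name] = digest
    return prints


def changed_symbols(old: str, new: str) -> dict[str, list[str]]:
    """Report added, removed, and modified functions between two versions."""
    before = function_fingerprints(old)
    after = function_fingerprints(new)
    common = before.keys() & after.keys()
    return {
        "added": sorted(after.keys() - before.keys()),
        "removed": sorted(before.keys() - after.keys()),
        "modified": sorted(n for n in common if before[n] != after[n]),
    }


OLD = "def helper(x):\n    return x + 1\n\ndef main():\n    return helper(1)\n"
# Same functions in a different order; only helper's body actually changed.
NEW = "def main():\n    return helper(1)\n\ndef helper(x):\n    return x + 2\n"

REPORT = changed_symbols(OLD, NEW)
```

Here only `helper` is reported as modified, even though a line-based diff would show `main` as moved.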

>If the primary representation of every programming language was data, [...]. When the programmer wants to modify code, render it into a textual representation. When they’re done, parse it back into nice, inspectable data.

I see several problems with this:

1) human behaviors: you want the ultimate "Source Of Truth" to be the code-as-data structure instead of the text but programmers would eventually just shift the SOT to the "rendered text" because it's easier to work with. Friction for editing is a big deal. A similar behavior happens with database Stored Procedures where programmers use external "schemachanges.sql", "tabletrigger.sql", etc for source code revision control. The db engine's "source of truth" of that code is an internal blob but that's not visible to Git/emacs/vi/Notepad++/etc.

2) work-in-progress code is often broken and can't be parsed: e.g. the parser stops at a missing semicolon or unbalanced quote, and you end up storing a large unparsed string for the text with parsing errors. If you have a 500 LOC source file where the first 20 lines were parsed but the last 480 were not, you might as well have left all 500 lines in the "rendered" text as the SOT. The ideal of maintaining a perfect round-trip from code-as-data --> code-as-text --> code-as-data ... is an illusion in this case.

Based on current macro trends of compilers exposing their parsing engine (e.g. clang, C# Roslyn), it looks like programmers would rather use "parsing as a library API to process normal text" rather than "code-stored-as-parsed-data to be rendered to text for editing".
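
Point 2 is easy to reproduce. A small sketch in Python, assuming a hypothetical `store` function standing in for the code-as-data storage layer: as soon as the text does not parse, the only thing left to store is the text itself. (Python's own parser rejects the whole file rather than keeping the first valid lines, so the fallback is even blunter than in the 20/480 example.)

```python
import ast
from dataclasses import dataclass
from typing import Optional


@dataclass
class StoredSource:
    text: str                    # always kept, since it is the only safe fallback
    tree: Optional[ast.Module]   # None when the text could not be parsed
    error: Optional[str]


def store(text: str) -> StoredSource:
    """Try to store code as data; fall back to raw text on a parse error."""
    try:
        return StoredSource(text=text, tree=ast.parse(text), error=None)
    except SyntaxError as exc:
        return StoredSource(text=text, tree=None, error=f"line {exc.lineno}: {exc.msg}")


COMPLETE = store("x = 1\ny = (2 + 3)\n")
WORK_IN_PROGRESS = store("x = 1\ny = (2 +\n")  # unbalanced parenthesis
```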

> it looks like programmers would rather use "parsing as a library API to process normal text"

Yeah. That code is stored as text hasn't turned out to be that much of a problem, when parsing it is usually quite cheap. Rust's `cargo check` (parse + type check + borrow check basically) runs in under a second even for large codebases.

For ease of implementing editing/refactoring features, the author seems to want a standardised, poly-lingual storage representation for code. That sounds like... text. There is little else in common between all languages. The winning abstraction so far isn't the storage format, it's something like Language-Server which lives above a parser/semi-compiler and just works with symbols and filenames/line numbers/char spans. These are pretty much common to all languages.
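
As a toy illustration of that "symbols plus positions" layer (not the actual Language Server Protocol, just the shape of the data), here is a sketch in Python. `Symbol` and `index_symbols` are made-up names; the only real API used is the standard `ast` module.

```python
import ast
from typing import NamedTuple


class Symbol(NamedTuple):
    name: str
    kind: str
    file: str
    line: int     # 1-based, as reported by the parser
    column: int   # 0-based offset of the definition on that line


def index_symbols(source: str, filename: str) -> list[Symbol]:
    """List every function and class defined in a file with its position."""
    symbols = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            kind = "function"
        elif isinstance(node, ast.ClassDef):
            kind = "class"
        else:
            continue
        symbols.append(Symbol(node.name, kind, filename, node.lineno, node.col_offset))
    return sorted(symbols, key=lambda s: (s.line, s.column))


EXAMPLE = "class A:\n    def f(self):\n        pass\n\ndef g():\n    pass\n"
INDEX = index_symbols(EXAMPLE, "m.py")
```

Nothing in `Symbol` is specific to Python, which is the point: an editor consuming this list does not need to know the language's grammar.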

If you wanted all the benefits described, like one-line renaming operations in your SCM, you'd need to completely normalise the storage format into relational data structures. This would only really work for static languages, and would open you up to a heap of distributed systems problems: What is the true representation of a particular symbol? What if we both commit two symbols with the same ID, or two IDs for the same symbol? Nobody wants to need transaction support and SQL constraints for a file in Git. And think of how a one-line change like this:

    // main.rs
    + pub fn gimme_seven() -> Option<i32> { Some(7) }

would add some entries in database tables all over the place:

    // symbols.json
    + { "kind": "function", "id": "d3c42d4d-9df6-43a0-97a0-147fe8e52a2d", "name": "gimme_seven" ... },
    // main.json // { exports: [ ...
    + { "id": "d3c42d4d-9df6-43a0-97a0-147fe8e52a2d", "visibility": "public" }

Using fully-denormalised text is the only reason SCM works in the first place. Programmers suffer boundless pain just trying to get computer-generated (usually XML) config files to play nice in shared repositories, and I'd never wish the same upon anyone trying to write actual code.

Working on an AST is hard unless the language is a Lisp.

Additionally, the tooling to work with ASTs for many languages is poor.

However, I was pleasantly surprised when I discovered clang-query (C++). Eli has a nice introduction: http://eli.thegreenplace.net/2014/07/29/ast-matchers-and-cla...

I wish someone would build more refactoring tools with proper semantic analysis

> the diff shows 5 billion changed lines of code

This is either hyperbole, or exactly the problem: It's no wonder that a behemoth of that size is hard to refactor.

I'd be curious what the HN community thinks of a text editor plugin I've been considering. This wouldn't necessarily solve the problem described, but would mitigate it to some degree I think.

The idea is to write code in blocks. So you have a raw file and the following operations:

INSERT (create a new block of text, either top-level or nested inside another)

DELETE (Selected block or all nested blocks)


Each block would effectively be a node consisting of its text, optionally split into a LEFT block of text and a RIGHT block. Each line would be stored along with a line number. Then when an insert or update occurred, all line numbers would be updated based on which lines were already to the left and right of the insert, at which point the file would re-render.

I think if you added tagging and selection / editing by tagging you'd get the benefits of something structured without getting down to the AST level or breaking all of the other useful add-ins built on the assumption that code will be text.
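
To make that a bit more concrete, here is a rough sketch of the data structure in Python. All names (`Block`, `render`, `find_by_tag`) are hypothetical, and it takes one shortcut compared to the description above: line numbers are recomputed on every render instead of being stored and updated.

```python
from dataclasses import dataclass, field
from typing import Iterable, Optional


@dataclass(eq=False)  # compare blocks by identity, not by content
class Block:
    text: str = ""
    tags: set[str] = field(default_factory=set)
    children: list["Block"] = field(default_factory=list)

    def insert(self, text: str, tags: Iterable[str] = (),
               index: Optional[int] = None) -> "Block":
        """INSERT: create a new block nested inside this one."""
        child = Block(text, set(tags))
        if index is None:
            self.children.append(child)
        else:
            self.children.insert(index, child)
        return child

    def delete(self, child: "Block") -> None:
        """DELETE: drop a nested block together with everything inside it."""
        self.children.remove(child)

    def lines(self, depth: int = -1) -> list[str]:
        """Flatten to text lines; the root (depth -1) is just a container."""
        out = ["    " * depth + line for line in self.text.splitlines()]
        for child in self.children:
            out.extend(child.lines(depth + 1))
        return out


def render(root: Block) -> str:
    """Re-render the whole file with freshly computed line numbers."""
    return "\n".join(f"{n:>3} {line}" for n, line in enumerate(root.lines(), start=1))


def find_by_tag(block: Block, tag: str) -> list[Block]:
    """Select blocks by tag, searching nested blocks as well."""
    found = [block] if tag in block.tags else []
    for child in block.children:
        found.extend(find_by_tag(child, tag))
    return found


root = Block()
fn = root.insert("def greet():", tags={"greeting"})
body = fn.insert('print("hi")')
root.insert("greet()")
BEFORE = render(root)

fn.delete(body)
AFTER = render(root)
```

The rendered output is still ordinary text, so existing text-based tooling keeps working on it.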

Also as an aside to my previous large comment I just want to add that I'm disappointed by the dismissive tone of some of the other comments and the "this is how our fathers have always done it" attitude.

The "IDE" that a number of people talk about, let's face it, is anything but integrated.

It is laughable to call them "integrated development environment". They are glorified text editors that go to extreme measures just to give you an ounce of useful information which is often unreliable.

Such as trying to "parse" your literal strings and use heuristics to determine if it's a SQL query. That is not integrated, it is a sad hack.

My point is we can set our bar of expectation from our programming languages and environments so much higher. Let's not just settle on what we have.

In early Mac days I worked at an angel funded startup that was building a graphical programming language called Benjamin. It was amazing, had all the features of a compiled language, but was just too early. Still surprised no one has done anything close since.

Programmers that don't understand how to effectively edit text are a problem.


I guess you could dump to AST and manipulate that. There is a utility to dump PHP code to AST, or vice-versa[1]. It was used to make a PHP7 to PHP5 converter: https://murze.be/2016/03/converting-php-7-code-equivalent-ph...

[1] https://github.com/nikic/PHP-Parser

While I don't really like this idea (in 1985 it may have changed the course of programming but in 2017 with advanced IDEs it smells like a solution looking for a problem) it has one interesting benefit...

If two languages have similar grammars and APIs you could potentially allow the programmer to choose the language they want and each programmer can work in their own language. Even one of those graphical block-based languages.

In the entire history of languages, this would probably only work for prehistoric-C# and Visual Basic. Compile-to-JS languages wouldn't work, they're all too different. Not even Lisps. The idea doesn't really have legs.

See Projectional Editors like Intentional Workbench and Nim macros.

Why do we not consider code generation/editing APIs/tools as an expected feature of a language instead of an editor?

No it isn't.

Amen. I have been thinking about this a lot as well.

In fact I was thinking of writing a blog post about it along the lines of "The plain-text tragedy".

A program is so much more than a flat collection of lines of text stored on a filesystem.

Our tools force us to "reduce" a program and dumb it down to a bunch of flat files.

This in turn increases the cost of representing information (syntax clutter).

It also adds cognitive overhead for naming things and so on.

As a result we get ourselves into a fight with the compiler. On one hand we want to give it as much information as possible so it can help us solve the problem.

On the other hand we don't want to overwhelm ourselves so we awkwardly compromise somewhere in between (terse syntax, short weird variable names, "def" for define "fn" for function, symbols and so on).

As a result, and believe me I could write a book of rants about this, what we call "IDE" is anything but integrated.

Imagine an editor that allowed you to easily view and modify "objects" in your programs.

Instead of the project tree representing the files on a filesystem it would be representing components of your project.

You could view a function, select it in the editor and right click on its name and click "Write documentation", or "View entities calling this function", "View tests involving this function", "View authors".

In a way similar to, for example, how stored procedures are stored as objects in a database that you can fiddle around with.

And you could have custom attributes attached to these objects. You could tag them with domain-specific stuff.

Imagine the possibilities for version control. If the units of our programs were tracked and edited as first-class-citizen objects in a smart database you would automatically get "git on steroids".

We already "have" some of this stuff but they are not integrated. We first throw away information then come up with parsers that try to make sense of it again like fancy auto-complete intellisense stuff. As a result they become expensive to write and maintain and they become fragile and unreliable. You can't trust them blindly because their accuracy varies wildly. Good luck getting intelligent intellisense in a large project with dynamic languages and transpilers and so on.

We need a development environment that is not hostile towards the humans. It should actively encourage the programmer to provide and specify things about the domain.

We need programming languages that allow and encourage the programmer to provide information. The flat-file ASCII syntax that we are so used to punishes that.

In the "real world" there's so much meta-data associated to each object/entity/unit of the program.

Consider a simple function that adds two numbers together.

While from a mathematical standpoint it's nice to be able to write "function add(a, b) { return a + b; }", from an industrial, real-world, team-effort software engineering point of view that is incomplete.

Ideally we would want documentation (purpose, arguments, return value, etc), tests, history/version-control and some more context-specific things such as logging that can be turned on-off depending on the context.

We need languages that recognise these concepts and editors that allow us to keep these things in a way that doesn't get reduced down to meaningless text (like documentation crammed in comments that later needs to be pulled out and parsed awkwardly).

If we don't "throw away" the information suddenly so much of the stuff that we work so hard to do today becomes available for free (such as much more accurate documentation that stays up to date with the project, version control that operates at the program level not lines of text).

You could "tag" the components of your program with various attributes.

Then your IDE could auto-generate a report that says "Hey 57.2% of your functions related to authentication currently do not have a test case associated with them. Would you like to create a test case for checkUserLogin function now? click here".

As I said, right now we have a lot of this stuff, but it is not integrated and falls apart easily. It relies on fragile mechanisms to stay accurate, so you can never let your guard down and trust what you see on the screen; you have to actively fight to keep things in sync with each other.

> A program is so much more than a flat collection of lines of text stored on a filesystem.

> Our tools force us to "reduce" a program and dumb it down to a bunch of flat files.

Agreed with a reservation: text sucks, but we have nothing better, and your post does not suggest an improvement. If you have something more concrete in mind, I have some specific questions here that I think we'll need good answers to before we jump the text-based ship.

> It also adds cognitive overhead for naming things and so on.

I take it you advocate going without textual names. Presumably, if we didn't name things in our programs, programming (both "writing" and "reading" them) would be easier. If this is actually your point:

* How are our tools going to present "source code" to us?

* How are we going to refer to pieces of our programs, in spoken and written communication?

These things have great impact: right now I can get into a discussion of execle(), execl(), execlp(), execve(), execv() and execvp() (these are pretty average regarding mutilation in the interest of brevity, at least they're pronounceable[1]).

> Imagine an editor that allowed you to [...] view a function, select it in the editor and right click on its name and click "Write documentation", or "View entities calling this function", "View tests involving this function", "View authors".

* What would "view a function" mean in a nontextual programming language?

* "Write documentation" implies more text. Doesn't the argument against text in programming languages apply to documentation as well, just more forcefully? What is your proposed solution to keep documentation in sync with the code?

* What does "View entities calling this function" mean in a universe without names?

[1] cf. wcsnrtombs(), wcspbrk(), wcsrchr(), wcsrtombs().

A better alternative is theoretically possible; we just haven't devoted significant time and resources to it.

It is possible to create an "editor" that edits the objects of the application but not as text.

It can look almost identical to a text editor but under the hood it knows more about each "node" that you are interacting with.

One challenge is that sometimes these applications are not flexible enough to deal with "temporarily invalid" programs. So they become rigid and feel less flexible than a normal text editor.

But those challenges can be overcome.

This suddenly opens the door to so many wonderful possibilities. I know because I have worked on an experimental prototype for such a thing.

For example text-based version control will be automatically obsolete. Your IDE will have the history of every single piece of your program (program is stored as a rich object in the database).

Your IDE can suddenly present different "layers" of information to you in a clean way.

For example you can "hide" an entire block of code and just label it with "Check if user has sufficient privileges".

Then the IDE can allow you to expand and explore that section of the program if you wish. But otherwise you can collapse it and just scan over it with your eyes.

The one-dimensional flat nature of a text-based system discourages us from attaching additional information to the pieces of the program.

We need a way to be able to "navigate" through multiple dimensions of information in relation to the objects in the program.

For keeping documentation in sync, the IDE allows you to view/create/modify the documentation for a given object of the program therefore it can guarantee certain things.

Your environment is fully aware of the relationship between an object and its corresponding documentation. It's not just a piece of string that gets ignored.

So when you change the body of a function it can pop up a warning and say "Hey you have modified this function since revision #48452 but the documentation has not been changed. Click here to update the documentation.".
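
A minimal sketch of how that check could work, in Python with the standard `ast` module. This is not how my prototype does it; the names `code_fingerprint` and `docs_are_stale` are hypothetical, and the docstring stands in for the documentation object. The idea is to fingerprint the code without its documentation and compare it to the fingerprint recorded when the docs were last written.

```python
import ast
import hashlib


def code_fingerprint(func_source: str) -> str:
    """Hash a function's signature and body, ignoring its docstring."""
    func = ast.parse(func_source).body[0]
    if not isinstance(func, (ast.FunctionDef, ast.AsyncFunctionDef)):
        raise ValueError("expected a single function definition")
    body = list(func.body)
    if ast.get_docstring(func) is not None:
        body = body[1:]  # the docstring is always the first statement
    dumped = ast.dump(func.args) + "".join(ast.dump(stmt) for stmt in body)
    return hashlib.sha256(dumped.encode()).hexdigest()


def docs_are_stale(recorded_fingerprint: str, current_source: str) -> bool:
    """True when the code changed since the documentation was last updated."""
    return recorded_fingerprint != code_fingerprint(current_source)


V1 = 'def check(user):\n    """Return True for admins."""\n    return user.role == "admin"\n'
# Only the documentation was reworded: not stale.
V1_DOC_EDIT = 'def check(user):\n    """True if the user is an admin."""\n    return user.role == "admin"\n'
# The behaviour changed but the documentation did not: stale.
V2 = 'def check(user):\n    """Return True for admins."""\n    return user.role in ("admin", "owner")\n'

DOCUMENTED_AT = code_fingerprint(V1)
```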

Your environment will be aware of, for example, the relationship between tests and objects (such as functions).

So it can tell you in a 100% reliable way, not based on file name conventions and error-prone parsing, which functions across your project have no tests associated with them.

There's enormous inertia against an initiative like this because of our dependence on existing technologies. But we cannot advance software engineering to the next stage unless we move beyond the "a program is a bunch of files that we view and edit like a book" mentality.

We need environments that are much more aware and in-sync with the broader context of a program/project as a whole and in a reliably-structured manner.

Check this out as a basic example: https://isomorf.io/#!/demos

(I'm not affiliated with them)
