
Code as text is a problem - smowledge
http://dohgramming.com/post/code-as-text-is-a-problem/
======
weeksie
Oh this perennial idea. That our editors could produce plain text, but
represent everything as a rich data structure.

This sails happily along until someone realizes that they _already do that_.
Code is structured. It compiles to an AST which can be worked on by any number
of IDEs and editors.

You want a super rich structure in that regard? Something that makes
refactoring a breeze? Well, you're asking for a strong, expressive type
system.

Shrug.

Nothing to see here.

~~~
erikpukinskis
What are good tools for running scripts on an AST? Are you talking about like
an Eclipse plugin or something?

~~~
hdhzy
Roslyn is basically a compiler API for .NET. Take a look at this walkthrough
[0] to see how to implement custom diagnostics and quick fixes that operate at
the AST level.

[0]:
[https://social.technet.microsoft.com/wiki/contents/articles/...](https://social.technet.microsoft.com/wiki/contents/articles/26954.writing-diagnostic-with-code-fix-using-roslyn-net-compiler-platform.aspx)
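
For a feel of what a script over an AST looks like without any IDE plugin, here
is a minimal sketch that uses Python's standard `ast` module instead of Roslyn.
The sample source and the `find_calls` helper are made up for illustration; it
only finds direct calls by name, with no type or scope analysis.

```python
import ast

SOURCE = """
def helper(x):
    return x + 1

def main():
    print(helper(1))
    return helper(helper(2))
"""

def find_calls(source: str, func_name: str) -> list[tuple[int, int]]:
    """Return (line, column) for every direct call to `func_name`."""
    tree = ast.parse(source)
    hits = []
    for node in ast.walk(tree):
        # Only plain `name(...)` calls are matched; `obj.name(...)` is not.
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id == func_name):
            hits.append((node.lineno, node.col_offset))
    return sorted(hits)

calls = find_calls(SOURCE, "helper")
for line, col in calls:
    print(f"helper() called at line {line}, column {col}")
```

The same walk-and-match shape is roughly what a custom diagnostic does, just
with a richer tree and semantic information available.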

------
skybrian
Maybe the real question here is why we don't use standard parsers everywhere
all the time? Well, for some data formats we usually do - take JSON for
example. (You can do a lot with jq.)

I think the problem is that for most languages, the AST is very complex, and
writing a transformation pass isn't something you do as a one-off hack.

Lisp famously keeps things simple. I think we could do better than that on
syntax, but designing a language with an AST designed for hacking still seems
like a good idea.

But there are other ways a language can be easier to hack on. The gofmt tool
can do a search and replace on Go expressions and also write the output in
Go's standard format. This builds on the language's strengths: (1) having a
standard format (2) mostly unambiguous references to imported symbols, so you
don't need type-checking to disambiguate. This makes it much easier to find
all references to a function.
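
As a rough analogue of that kind of expression-level search and replace, here is
a sketch in Python rather than Go, using the standard `ast` module. The
`RenameCall` class and `rewrite` function are hypothetical names, and unlike
gofmt this drops comments and the original layout.

```python
import ast

class RenameCall(ast.NodeTransformer):
    """Rewrite direct calls to `old` so they call `new` instead."""

    def __init__(self, old: str, new: str) -> None:
        self.old, self.new = old, new

    def visit_Call(self, node: ast.Call) -> ast.AST:
        self.generic_visit(node)  # rewrite nested calls first
        if isinstance(node.func, ast.Name) and node.func.id == self.old:
            replacement = ast.Name(id=self.new, ctx=ast.Load())
            node.func = ast.copy_location(replacement, node.func)
        return node

def rewrite(source: str, old: str, new: str) -> str:
    tree = RenameCall(old, new).visit(ast.parse(source))
    # ast.unparse (Python 3.9+) emits one canonical layout for the tree.
    return ast.unparse(tree)

result = rewrite("total = add(add(1, 2), other.add(3))", "add", "plus")
print(result)
```

Note that `other.add(3)` is left alone because it is an attribute access, not a
bare name, which is the kind of distinction a textual search and replace misses.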

~~~
open_bear
Take a look at how Smalltalk does this (try Pharo, for example). You don't
edit code as a whole: you add classes, methods, fields and edit them
separately.

This has several benefits: changes are isolated - easier to reason about and
track in version control system, refactoring is easier.

~~~
skybrian
Interesting. Is Pharo image-based like Smalltalk used to be or does it use
something else?

Programmers are just about the last people who work with filesystems directly.
I expect git will be used for sharing for a while, but I wonder what it will
take to move beyond checking out code into local working directories to a more
cloud notebook-style approach? (There are lots of experiments but it's not
mainstream.)

~~~
open_bear
Image-based, as Pharo is a close fork of Smalltalk.

I've been thinking about a source control system that would track
functions/classes instead of bytes. You could track who changed that helper
method and not be distracted by classes moving between files.
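
One way such tracking might work, sketched in Python under the assumption that
a function's identity is its name and its content is its AST. The helper names
below are hypothetical, and a real system would also need to handle renames,
methods, and nested scopes.

```python
import ast
import hashlib

def function_fingerprints(source: str) -> dict[str, str]:
    """Map each top-level function name to a hash of its AST structure."""
    prints = {}
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # ast.dump omits line/column info by default, so moving or
            # reformatting a function does not change its fingerprint.
            digest = hashlib.sha256(ast.dump(node).encode()).hexdigest()
            prints[node.name] = digest
    return prints

def changed_functions(before: str, after: str) -> set[str]:
    """Names that were added, removed, or structurally modified."""
    old, new = function_fingerprints(before), function_fingerprints(after)
    return {name for name in old.keys() | new.keys()
            if old.get(name) != new.get(name)}

BEFORE = "def a():\n    return 1\n\ndef b():\n    return 2\n"
# Same functions in a different order, plus a comment and a new function.
AFTER = ("def b():\n    return 2\n\n\n"
         "def a():\n    return 1  # moved\n\n"
         "def c():\n    return 3\n")
AFTER_EDIT = AFTER.replace("return 2", "return 20")

print(changed_functions(BEFORE, AFTER))
print(changed_functions(BEFORE, AFTER_EDIT))
```

Reordering `a` and `b` produces no change here, while a byte-level diff would
show most of the file as modified.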

------
jasode
_> If the primary representation of every programming language was data,
[...]. When the programmer wants to modify code, render it into a textual
representation. When they’re done, parse it back into nice, inspectable data._

I see several problems with this:

1) human behaviors: you want the ultimate "Source Of Truth" to be the code-as-
data structure instead of the text but programmers would eventually just shift
the SOT to the "rendered text" because it's easier to work with. Friction for
editing is a big deal. A similar behavior happens with database Stored
Procedures where programmers use external "schemachanges.sql",
"tabletrigger.sql", etc for source code revision control. The db engine's
"source of truth" of that code is an internal blob but that's not visible to
Git/emacs/vi/Notepad++/etc.

2) work-in-progress code is often broken and can't be parsed: E.g. the parser
stops at a missing semicolon or unbalanced quote and therefore, you end up
storing a large _unparsed_ string for text with parsing errors. If you have a
500 LOC source file where the first 20 lines were parsed but the last 480 were
not, you might as well have left all 500 lines in the "rendered" text as the
SOT. The ideals of maintaining a perfect round-trip from code-as-data -->
code-as-text --> code-as-data ... is an illusion in this case.
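
The second problem can be shown with a small sketch. It assumes Python's
standard `ast.parse`, which is all-or-nothing rather than stopping partway, and
a hypothetical `StoredUnit` record that falls back to raw text when parsing
fails.

```python
import ast
from dataclasses import dataclass
from typing import Optional

@dataclass
class StoredUnit:
    """Hypothetical storage record: a tree when parsing works, raw text otherwise."""
    tree: Optional[ast.Module]
    raw_text: Optional[str]
    error: Optional[str] = None

def store(source: str) -> StoredUnit:
    try:
        return StoredUnit(tree=ast.parse(source), raw_text=None)
    except SyntaxError as exc:
        # One unbalanced quote makes the whole file unparseable here,
        # so the entire text has to be kept as an opaque string.
        return StoredUnit(tree=None, raw_text=source,
                          error=f"line {exc.lineno}: {exc.msg}")

good = store("x = 1\ny = x + 1\n")
broken = store("x = 1\ny = 'unterminated\nz = 3\n")

print(type(good.tree).__name__, good.raw_text)
print(broken.tree, broken.error)
```

For the broken input the "data" representation degenerates to the same text the
editor already had, which is the round-trip failure described above.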

Based on current macro trends of compilers exposing their parsing engine (e.g.
clang, C# Roslyn), it looks like programmers would rather use _" parsing as a
library API to process normal text"_ rather than _" code-stored-as-parsed-data
to be rendered to text for editing"_.

~~~
cormacrelf
> it looks like programmers would rather use "parsing as a library API to
> process normal text"

Yeah. That code is stored as text hasn't turned out to be that much of a
problem, when parsing it is usually quite cheap. Rust's `cargo check` (parse +
type check + borrow check basically) runs in under a second even for large
codebases.

For ease of implementing editing/refactoring features, the author seems to
want a standardised, poly-lingual storage representation for code. That sounds
like... text. There is little else in common between all languages. The
winning abstraction so far isn't the storage format, it's something like
Language-Server which lives above a parser/semi-compiler and just works with
symbols and filenames/line numbers/char spans. These are pretty much common to
all languages.
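
To make "symbols and filenames/line numbers/char spans" concrete, here is a
minimal sketch of such an index built on top of a parser. It is not the
Language Server Protocol itself; the `Symbol` record and `index_symbols`
function are assumptions for illustration, using Python's standard `ast`
module as the parser.

```python
import ast
from dataclasses import dataclass

@dataclass(frozen=True)
class Symbol:
    name: str
    kind: str
    filename: str
    start: tuple[int, int]  # (line, column); lines are 1-based
    end: tuple[int, int]

def index_symbols(source: str, filename: str) -> list[Symbol]:
    """Collect function and class definitions with their text spans."""
    kinds = {ast.FunctionDef: "function",
             ast.AsyncFunctionDef: "function",
             ast.ClassDef: "class"}
    symbols = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        kind = kinds.get(type(node))
        if kind:
            symbols.append(Symbol(node.name, kind, filename,
                                  (node.lineno, node.col_offset),
                                  (node.end_lineno, node.end_col_offset)))
    return sorted(symbols, key=lambda s: s.start)

SRC = ("class Greeter:\n"
       "    def hello(self):\n"
       "        return 'hi'\n"
       "\n"
       "def main():\n"
       "    pass\n")
symbols = index_symbols(SRC, "demo.py")
for s in symbols:
    print(s.kind, s.name, s.filename, s.start, s.end)
```

Nothing in the `Symbol` record is specific to Python, which is why this layer
generalises across languages while the storage format stays plain text.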

If you wanted all the benefits described, like one-line renaming operations in
your SCM, you'd need to completely normalise the storage format into
relational data structures. This would only really work for static languages,
and would open you up to a heap of distributed systems problems: What is the
true representation of a particular symbol? What if we both commit two symbols
with the same ID, or two IDs for the same symbol? Nobody wants to need
transaction support and SQL constraints for a file in Git. And think of how a
one-line change like this:

    
    
        // main.rs
        + pub fn gimme_seven() -> Option<i32> { Some(7) }
    

would add some entries in database tables all over the place:

    
    
        // symbols.json
        + { "kind": "function", "id": "d3c42d4d-9df6-43a0-97a0-147fe8e52a2d", "name": "gimme_seven" ... },
        
        // main.json // { exports: [ ...
        + { "id": "d3c42d4d-9df6-43a0-97a0-147fe8e52a2d", "visibility": "public" }
    

Using fully-denormalised text is the only reason SCM works in the first place.
Programmers suffer boundless pain just trying to get computer-generated
(usually XML) config files to play nice in shared repositories, and I'd never
wish the same upon anyone trying to write actual code.

------
entelechy
Working on an AST is hard unless the language is a Lisp.

Additionally, the tooling to work with ASTs for many languages is poor.

However, I was pleasantly surprised when I discovered clang-query (C++). Eli has
a nice introduction:
[http://eli.thegreenplace.net/2014/07/29/ast-matchers-and-cla...](http://eli.thegreenplace.net/2014/07/29/ast-matchers-and-clang-refactoring-tools)

I wish someone would build more refactoring tools with proper semantic
analysis.

------
majewsky
> the diff shows 5 billion changed lines of code

This is either hyperbole, or exactly the problem: It's no wonder that a
behemoth of that size is hard to refactor.

------
ellius
I'd be curious what the HN community thinks of a text editor plugin I've been
considering. This wouldn't necessarily solve the problem described, but would
mitigate it to some degree I think.

The idea is to write code in blocks. So you have a raw file and the
following operations:

INSERT (create a new block of text, either top-level or nested inside another)

DELETE (Selected block or all nested blocks)

UPDATE

Each block would effectively be a node consisting of its text, optionally
split into a LEFT block of text and a RIGHT block. Each line would be stored
along with a line number. Then when an insert or update occurred, all line
numbers would be updated based on which lines were already to the left and
right of the insert, at which point the file would re-render.
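
A minimal sketch of how such a block structure might look, with some
simplifying assumptions: the `Block` class and its method names are
hypothetical, the LEFT/RIGHT split is modelled as an index into the block's own
lines, and line numbers are recomputed on every render instead of being stored
and shifted incrementally.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass(eq=False)  # identity comparison, so delete() removes the exact block
class Block:
    lines: list[str]
    # Children render between lines[:split] (LEFT) and lines[split:] (RIGHT).
    split: Optional[int] = None
    children: list["Block"] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.split is None:
            self.split = len(self.lines)

    def insert(self, child: "Block", index: Optional[int] = None) -> "Block":
        """INSERT: nest `child` at `index` among the children (default: last)."""
        position = len(self.children) if index is None else index
        self.children.insert(position, child)
        return child

    def delete(self, child: "Block") -> None:
        """DELETE: remove `child` together with everything nested inside it."""
        self.children.remove(child)

    def update(self, lines: list[str], split: Optional[int] = None) -> None:
        """UPDATE: replace this block's own text, leaving children in place."""
        self.lines = lines
        self.split = len(lines) if split is None else split

    def flatten(self) -> list[str]:
        left, right = self.lines[:self.split], self.lines[self.split:]
        inner = [line for child in self.children for line in child.flatten()]
        return left + inner + right

def render(root: Block) -> list[tuple[int, str]]:
    """Re-render the file, assigning fresh line numbers after any edit."""
    return list(enumerate(root.flatten(), start=1))

root = Block([])
func = root.insert(Block(["def f():", "    return total"], split=1))
func.insert(Block(["    total = 0"]))
root.insert(Block(["import os"]), index=0)
numbered = render(root)
for number, text in numbered:
    print(number, text)
```

The rendered output is still ordinary text, so existing text-based tooling keeps
working on it.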

I think if you added tagging and selection / editing by tagging you'd get the
benefits of something structured without getting down to the AST level or
breaking all of the other useful add-ins built on the assumption that code
will be text.

------
borplk
Also as an aside to my previous large comment I just want to add that I'm
disappointed by the dismissive tone of some of the other comments and the
"this is how our fathers have always done it" attitude.

The "IDE" that a number of people talk about, let's face it, is anything but
integrated.

It is laughable to call them "integrated development environments". They are
glorified text editors that go to extreme measures just to give you an ounce
of useful information which is often unreliable.

Such as trying to "parse" your literal strings and use heuristics to determine
if it's a SQL query. That is not integrated, it is a sad hack.

My point is we can set our bar of expectation from our programming languages
and environments so much higher. Let's not just settle on what we have.

------
valuearb
In early Mac days I worked at an angel funded startup that was building a
graphical programming language called Benjamin. It was amazing, had all the
features of a compiled language, but was just too early. Still surprised no
one has done anything close since.

------
tinix
Programmers that don't understand how to effectively edit text are a problem.

[https://news.ycombinator.com/item?id=14282600](https://news.ycombinator.com/item?id=14282600)

------
tyingq
I guess you could dump to AST and manipulate that. There is a utility to dump
PHP code to AST, or vice-versa[1]. It was used to make a PHP7 to PHP5
converter:
[https://murze.be/2016/03/converting-php-7-code-equivalent-ph...](https://murze.be/2016/03/converting-php-7-code-equivalent-php-5-code/)

[1] [https://github.com/nikic/PHP-Parser](https://github.com/nikic/PHP-Parser)

------
throwaway2016a
While I don't really like this idea (in 1985 it may have changed the course of
programming but in 2017 with advanced IDEs it smells like a solution looking
for a problem) it has one interesting benefit...

If two languages have similar grammars and APIs you could potentially allow
the programmer to choose the language they want and each programmer can work
in their own language. Even one of those graphical block-based languages.

~~~
cormacrelf
In the entire history of languages, this would probably only work for
prehistoric-C# and Visual Basic. Compile-to-JS languages wouldn't work,
they're all too different. Not even Lisps. The idea doesn't really have legs.

------
ilaksh
See Projectional Editors like Intentional Workbench and Nim macros.

------
haltingthoughts
Why do we not consider code generation/editing APIs/tools as an expected
feature of a language instead of an editor?

------
kwhitefoot
No it isn't.

------
borplk
Amen. I have been thinking about this a lot as well.

In fact I was thinking of writing a blog post about it along the lines of "The
plain-text tragedy".

A program is so much more than a flat collection of lines of text stored on a
filesystem.

Our tools force us to "reduce" a program and dumb it down to a bunch of flat
files.

This in turn increases the cost of representing information (syntax clutter).

It also adds cognitive overhead for naming things and so on.

As a result we get ourselves into a fight with the compiler. On one hand we
want to give it as much information as possible so it can help us solve the
problem.

On the other hand we don't want to overwhelm ourselves so we awkwardly
compromise somewhere in between (terse syntax, short weird variable names,
"def" for define "fn" for function, symbols and so on).

As a result, and believe me I could write a book of rants about this, what we
call "IDE" is anything but integrated.

Imagine an editor that allowed you to easily view and modify "objects" in your
programs.

Instead of the project tree representing the files on a filesystem it would be
representing components of your project.

You could view a function, select it in the editor and right click on its name
and click "Write documentation", or "View entities calling this function",
"View tests involving this function", "View authors".

In a way similar to, for example, how stored procedures are stored as objects
in a database that you can fiddle around with.

And you could have custom attributes attached to these objects. You could tag
them with domain-specific stuff.

Imagine the possibilities for version control. If the units of our programs
were tracked and edited as first-class-citizen objects in a smart database
you would automatically get "git on steroids".

We already "have" some of this stuff but they are not integrated. We first
throw away information then come up with parsers that try to make sense of it
again like fancy auto-complete intellisense stuff. As a result they become
expensive to write and maintain and they become fragile and unreliable. You
can't trust them blindly because their accuracy varies wildly. Good luck
getting intelligent intellisense in a large project with dynamic languages and
transpilers and so on.

We need a development environment that is not hostile towards the humans. It
should actively encourage the programmer to provide and specify things about
the domain.

We need programming languages that allow and encourage the programmer to
provide information. The flat-file ASCII syntax that we are so used to
punishes that.

In the "real world" there's so much meta-data associated to each
object/entity/unit of the program.

Consider a simple function that adds two numbers together.

From a mathematical standpoint it's nice to be able to write "function
add(a, b) { return a + b; }". From an industrial, real-world, team-effort
software engineering point of view, that is incomplete.

Ideally we would want documentation (purpose, arguments, return value, etc),
tests, history/version-control and some more context-specific things such as
logging that can be turned on-off depending on the context.

We need languages that recognise these concepts and editors that allow us to
keep these things in a way that doesn't get reduced down to meaningless text
(like documentation crammed in comments that later needs to be pulled out and
parsed awkwardly).

If we don't "throw away" the information suddenly so much of the stuff that we
work so hard to do today becomes available for free (such as much more
accurate documentation that stays up to date with the project, version control
that operates at the program level not lines of text).

You could "tag" the components of your program with various attributes.

Then your IDE could auto-generate a report that says "Hey 57.2% of your
functions related to authentication currently do not have a test case
associated with them. Would you like to create a test case for checkUserLogin
function now? click here".

As I said, right now we have a lot of this stuff, but it is not integrated and
falls apart easily. It relies on fragile mechanisms to stay accurate, so you
can never let your guard down and trust what you see on the screen; you have to
actively fight to keep things in sync with each other.

~~~
contras1970
> _A program is so much more than a flat collection of lines of text stored on
> a filesystem._
>
> _Our tools force us to "reduce" a program and dumb it down to a bunch of flat
> files._

Agreed, with a reservation: text sucks, but we have nothing better, and your
post does not suggest an improvement. If you have something more concrete in
mind, I have some specific questions here that I think need good answers
before we jump the text-based ship.

> _It also adds cognitive overhead for naming things and so on._

I take it you advocate going without textual names. Presumably, if we didn't
name things in our programs, programming (both "writing" and "reading" them)
would be easier. If this is actually your point:

* How are our tools going to present "source code" to us?
* How are we going to refer to pieces of our programs, in spoken and written communication?

These things have great impact: right now I can get into a discussion of
execle(), execl(), execlp(), execve(), execv() and execvp() (these are pretty
average regarding mutilation in the interest of brevity; at least they're
pronounceable[1]).

> _Imagine an editor that allowed you to_ [...] _view a function, select it in
> the editor and right click on its name and click "Write documentation", or
> "View entities calling this function", "View tests involving this function",
> "View authors"._

* What would "view a function" mean in a nontextual programming language?
* "Write documentation" implies more _text_. Doesn't the argument against text in programming languages apply to documentation as well, just more forcefully? What is your proposed solution to keep documentation in sync with the code?
* What does "View entities calling this function" mean in a universe without names?

[1] cf. wcsnrtombs(), wcspbrk(), wcsrchr(), wcsrtombs().

~~~
borplk
A better alternative is theoretically possible; we just haven't devoted
significant time and resources to it.

It is possible to create an "editor" that edits the objects of the application
but not as text.

It can look almost identical to a text editor but under the hood it knows more
about each "node" that you are interacting with.

One challenge is that sometimes these applications are not flexible enough to
deal with "temporarily invalid" programs. So they become rigid and feel less
flexible than a normal text editor.

But those challenges can be overcome.

This suddenly opens the door to so many wonderful possibilities. I know
because I have worked on an experimental prototype for such a thing.

For example text-based version control will be automatically obsolete. Your
IDE will have the history of every single piece of your program (program is
stored as a rich object in the database).

Your IDE can suddenly present different "layers" of information to you in a
clean way.

For example you can "hide" an entire block of code and just label it with
"Check if user has sufficient privileges".

Then the IDE can allow you to expand and explore that section of the program
if you wish. But otherwise you can collapse it and just scan over it with your
eyes.

The one-dimensional flat nature of a text-based system discourages us from
attaching additional information to the pieces of the program.

We need a way to be able to "navigate" through multiple dimensions of
information in relation to the objects in the program.

For keeping documentation in sync, the IDE allows you to view/create/modify
the documentation for a given object of the program therefore it can guarantee
certain things.

Your environment is fully aware of the relationship between an object and its
corresponding documentation. It's not just a piece of string that gets
ignored.

So when you change the body of a function it can pop up a warning and say "Hey
you have modified this function since revision #48452 but the documentation
has not been changed. Click here to update the documentation.".
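
That check could work roughly like the sketch below. It is an illustration
under assumptions, not the prototype mentioned above: the environment is
assumed to keep the source as it was when the documentation was last written,
and the helper names are hypothetical.

```python
import ast
import hashlib

def body_fingerprint(source: str, name: str) -> str:
    """Hash a function's body, ignoring its docstring and formatting."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef) and node.name == name:
            body = node.body
            if ast.get_docstring(node) is not None:
                body = body[1:]  # skip the docstring expression
            dumped = "".join(ast.dump(stmt) for stmt in body)
            return hashlib.sha256(dumped.encode()).hexdigest()
    raise KeyError(name)

def docs_are_stale(documented_source: str, current_source: str, name: str) -> bool:
    """True when the body changed since the documentation was written."""
    return (body_fingerprint(documented_source, name)
            != body_fingerprint(current_source, name))

V1 = 'def area(w, h):\n    """Return w times h."""\n    return w * h\n'
V2 = 'def area(w, h):\n    """Return w times h."""\n    return w * h / 2\n'
V3 = 'def area(w, h):\n    """Return the product of w and h."""\n    return w * h\n'

print(docs_are_stale(V1, V2, "area"))  # body changed, docstring did not
print(docs_are_stale(V1, V3, "area"))  # only the docstring changed
```

This only detects that the code moved while the text stood still; deciding
whether the documentation is actually wrong still takes a human.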

Your environment will be aware of, for example, the relationship between tests
and objects (such as functions).

So it can tell you in a 100% reliable way, not based on file name conventions
and error-prone parsing, which functions across your project have no tests
associated with them.

There's enormous inertia against an initiative like this because of our
dependence on existing technologies. But we cannot advance software
engineering to the next stage unless we move beyond the "a program is a bunch
of files that we view and edit like a book" mentality.

We need environments that are much more aware and in-sync with the broader
context of a program/project as a whole and in a reliably-structured manner.

Check this out as a basic example:
[https://isomorf.io/#!/demos](https://isomorf.io/#!/demos)

(I'm not affiliated with them)

