It appears the F* language was created primarily for verification: its flagship project, Project Everest, is a replacement technology stack for HTTPS with the cryptography embedded.
The challenge this paper attempts to address is embedding a low-level language in a high-level language, plus verification. In a general sense, language grammar embedding is a common problem. Examples include Boost templates in C++, React's JSX, which embeds XML-like syntax in JavaScript, and HTML templating languages (ERB, Twig, PHP, Elm, Handlebars, Dust.js).
I have been attempting to solve the problem of language embedding, in various ways, for years. The root cause of the complexity is that different parsing schemes run as a document passes through different applications, each ignoring everything in the document outside the delimiters of the syntax that particular parser cares about. That approach has consequences for performance, integrity, and structure. This became apparent when I encountered documents containing FreeMarker and Handlebars code interleaved in a way that did not respect the structure of the other template scheme, even though each made sense in isolation.
I was able to solve embedding by realizing that embedding is a data structure imposed upon the outer language. If a language is reduced to a primitive structure, such as a parse tree, then every language becomes equally representable as data. That does not make practical sense, though, unless the structure representing these languages is uniform. If there were a parse tree format common to all languages, so that each language is represented as data in the same structure, then the points of embedding are just child structures within the whole and the problem of embedding is solved.
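To make that concrete, here is a minimal sketch in TypeScript of the kind of uniform structure I mean. The field names here are invented for illustration and are not the parse-framework's actual API; the point is only that an embedded language is just a child node whose language field differs from its parent's.

    // Hypothetical uniform node shape: every language reduces to this record,
    // and an embedded language is just a child whose `language` differs.
    interface ParseNode {
        language: string;      // e.g. "html", "javascript", "fstar", "asm"
        types: string;         // token classification, e.g. "tag", "string", "embed"
        token: string;         // the literal text of the token
        children: ParseNode[]; // nested structures, including embedded languages
    }

    // An HTML document with embedded JavaScript: the embedding is only a child node.
    const sample: ParseNode = {
        language: "html",
        types: "root",
        token: "",
        children: [
            { language: "html", types: "tag", token: "<script>", children: [] },
            { language: "javascript", types: "embed", token: "console.log('hi');", children: [] },
            { language: "html", types: "tag", token: "</script>", children: [] }
        ]
    };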
Because everything is represented as simple structures within a set of like structures, recursion from nesting is accounted for without increased complexity. An example in this case would be embedding an instance of assembly in the outer F* grammar, which in turn contains an instance of F* as a string or comment inside that assembly.
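Using the same assumed node shape as the sketch above, that nesting case looks like this; the inner F* fragment is just one more child, so the recursion costs no extra machinery.

    // Outer F* embeds assembly; the assembly in turn quotes F* text in a comment.
    // Same hypothetical field names as before, written as a plain object literal.
    const nested = {
        language: "fstar",
        types: "root",
        token: "module Demo",
        children: [{
            language: "asm",        // assembly embedded in the outer F*
            types: "embed",
            token: "mov rax, 1",
            children: [{
                language: "fstar",  // F* appearing again, inside an asm comment
                types: "comment",
                token: "let x : nat = 1",
                children: []
            }]
        }]
    };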
A demo of the simplified and uniform parsing scheme is at https://prettydiff.com/parse-framework/runtimes/browsertest.... but at this time only a few languages are supported. I am still working on expanding support and proving the concept. Unlike the research paper, though, a data structure alone is not enough to provide verification, which requires a proof of some sort.
I was ready to shred your idea, but then I tried out your parser. It's interesting, actually. But the details are where things get really challenging. Escape tokens tend to be pretty quirky, like <?= or ```. And C++'s template declaration operators, along with C's parameter type declarations, will be super challenging, because 100% correct tokenizing requires grammar feedback into the tokenizer. But in general I like your idea of a global standard for tokenizing, leading to a universal tokenizer. Full parsing is another story, though. For a correct parse tree of C, you need actual feedback from the symbol table. And I suspect plenty of languages have one or two weird parsing tricks, so supporting them across all languages will end up being a major challenge. Good luck!
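To make the symbol-table point concrete, here is a rough sketch (TypeScript, with invented names) of the classic C ambiguity often called the lexer hack: the same characters tokenize differently depending on what the symbol table says.

    // In C, `(T)*x;` is a cast of `*x` when T names a typedef, but a
    // multiplication when T names a variable, so correct tokenization of
    // the parenthesized name needs feedback from the parser's symbol table.
    type SymbolKind = "typedef" | "variable";

    function classifyParen(name: string, symbols: Map<string, SymbolKind>): string {
        // without the symbol table, `(T)` cannot be labeled as a cast
        // or a parenthesized expression
        return symbols.get(name) === "typedef" ? "cast" : "expression";
    }

    console.log(classifyParen("T", new Map<string, SymbolKind>([["T", "typedef"]])));  // "cast"
    console.log(classifyParen("T", new Map<string, SymbolKind>([["T", "variable"]]))); // "expression"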
With F# being a practical language for most tasks, and F* meshing well with it, this makes for a great combination of languages to know for verifiable programs. F* extracts to F# (and OCaml), and to C (via Kremlin). Now you can have your low-level systems programming and your general-purpose PL, along with some code verification and dependent types. I code F# on my iPad Pro with a full, live IDE called Continuous[1]. The author of Continuous basically forked the Roslyn compiler for iOS. Idris was my other contender for this niche; however, the F#/F* combo seems more practical, especially if you want to target iOS/Android in addition to desktops.