
Show HN: Sparser – A Multilanguage Parser - austincheney
* [https://sparser.io](https://sparser.io)

* [https://github.com/Unibeautify/sparser](https://github.com/Unibeautify/sparser)

This is my attempt at creating a universal language parser. It attempts to
solve a few problems:

* Support multiple languages

* Recursively extend support to languages embedded within other languages

* Output a uniform format for all supported languages

This is a personal project, so any feedback would be helpful. Something
interesting I found after building it: this parser is not as fast at writing
output as many other JavaScript parsers, but its output is much faster to read
from because the format is simple and predictable.
======
jhpriestley
I think the emphasis on embedded languages is interesting. A lot of modern
code is stuff like js-in-jsx-in-html-in-yaml etc., a fact which isn't really
addressed by most tooling.

It seems like you're aiming to make a more flexible lexer, with support for
nested contexts. My immediate question is how this differs from a grammar-
driven parser. Does it handle types of embedding that can not easily be
encoded as CFGs? Or is it faster or simpler than e.g. an Earley or LR parser?
I couldn't find an answer to these questions on a casual read of the docs.

~~~
austincheney
The biggest shortcoming I have found with my approach is that I have not
systematically mapped out the language nesting identity. At the moment I am
utterly reliant upon rules in a given language to identify the thing that is
nested. For example I know that stuff in an HTML style tag will be CSS,
because the rules of HTML say so.
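
To make the "rules of the host language decide what is nested" idea concrete,
here is a minimal sketch. It is not Sparser's code or output format; the
`Token` shape, the `htmlEmbedding` table, and `lexMarkup` are hypothetical
names made up for illustration, and the tag handling is deliberately naive.

```typescript
// Hypothetical token record; NOT Sparser's actual output format.
interface Token {
    lexer: string; // language that owns this token
    text: string;
}

// Host-language rule table (illustrative): HTML says which elements
// contain which embedded language.
const htmlEmbedding: Record<string, string> = {
    style: "css",
    script: "javascript"
};

// Push a text token unless it is empty or whitespace only.
function pushText(tokens: Token[], lexer: string, text: string): void {
    const trimmed = text.trim();
    if (trimmed !== "") {
        tokens.push({ lexer: lexer, text: trimmed });
    }
}

// Scan markup; when an opening tag named in htmlEmbedding is seen, hand
// everything up to its closing tag to the embedded language.
function lexMarkup(source: string): Token[] {
    const tokens: Token[] = [];
    const tagPattern = /<\/?([a-zA-Z][a-zA-Z0-9]*)[^>]*>/g;
    const lower = source.toLowerCase();
    let last = 0;
    let match: RegExpExecArray | null;
    while ((match = tagPattern.exec(source)) !== null) {
        pushText(tokens, "markup", source.slice(last, match.index));
        tokens.push({ lexer: "markup", text: match[0] });
        last = tagPattern.lastIndex;
        const name = (match[1] || "").toLowerCase();
        const embedded: string | undefined = htmlEmbedding[name];
        const isClosing = match[0].charAt(1) === "/";
        if (!isClosing && embedded !== undefined) {
            const close = lower.indexOf("</" + name, last);
            const end = close === -1 ? source.length : close;
            pushText(tokens, embedded, source.slice(last, end));
            last = end;
            tagPattern.lastIndex = end; // resume scanning at the close tag
        }
    }
    pushText(tokens, "markup", source.slice(last));
    return tokens;
}
```

A real implementation would run a full CSS or JavaScript lexer over the
embedded span instead of emitting it as a single token, but the dispatch
decision is the same: the host grammar names the nested one.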

This approach has limitations where the relationship between the host grammar
and an embedded grammar is not clear. Is a markup grammar that contains
something like Liquid template tags HTML or XML? How do you specify the code
by language name when the embedded grammar is the Liquid tags, but you want to
distinguish those template tags from other template languages? What if a
markup language instance contains template tags from unrelated grammars?

I could auto detect the language each time the grammar changes, but that would
be slow and introduce unexpected results. The precise solution would be to
allow users a means to define nesting based upon syntax or structure and I
have not thought through this yet.

------
indentit
I like the uniform output, nice work!

I personally think it's easier to use something like the `sublime-syntax`
grammar format to define a lexer (with a project like syntect[1]), rather than
implementing each language's lexer as code, but that would need some extra
annotation to get such a nice structured output. The bonus, however, if
someone were to work on such an addition, would be that syntax highlighting
would use the same engine and deal with incomplete/incorrect code consistently.

[1]:
[https://github.com/trishume/syntect](https://github.com/trishume/syntect)

------
nemoniac
It would be helpful if you say that your focus is markup, scripting and
styling languages and not programming languages in general or even natural
languages.

This is far from a "universal language parser".

~~~
austincheney
I intend to extend support to other languages, such as PHP or Python.
I just haven't written that code yet. It mostly seems to work with Java and C#
right now, but I lack the tests necessary to guarantee such support.

------
qeyefre6
> [https://sparser.io](https://sparser.io)

Your home page eats my CPU and makes my notebook hot enough to boil water.

Whatever you're doing on that page, please stop it.

~~~
austincheney
The CPU burn is actually GPU burn. It is caused by a subtle animated effect in
the background and impacts Chrome more than all other modern browsers
combined. I will be turning this off in the next release.

~~~
qeyefre6
Thank you

------
samanator
Why is it that when the input is (excluding all quotation marks)

"obj = {};" the last token is ";"

but when the input is "obj = {}" the last token is "x;"?

Does "x;" mean an implicit ";"?

~~~
austincheney
I introduce pseudo tokens in place of missing syntax. Using the option
_correct_ converts the pseudo tokens into actual tokens. The pseudo tokens
always start with _x_ and can be easily ignored or discarded by any consuming
application.

I include the pseudo tokens for two reasons. The first reason is that they
allow the parser to reason about the code more closely to the language
specification. For example the specification says statements MUST be
terminated by a semicolon and if a semicolon is not provided one will be
inserted automatically (ASI). The second reason is that the pseudo tokens
eliminate certain ambiguities, which some advanced features require.
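
For a consuming application, handling the pseudo tokens could look roughly
like the sketch below. It assumes a plain array of token strings and assumes a
pseudo token is an `x` followed only by punctuation (as in the `x;` example
above); the helper names `isPseudo`, `dropPseudo`, and `materialize` are made
up here and are not part of the parser.

```typescript
// Assumption: a pseudo token is "x" followed by punctuation only, e.g. "x;".
// An identifier such as "xml" does not match because it contains letters.
const pseudoPattern = /^x[^A-Za-z0-9_$\s]+$/;

function isPseudo(token: string): boolean {
    return pseudoPattern.test(token);
}

// Ignore the inserted syntax entirely.
function dropPseudo(tokens: string[]): string[] {
    return tokens.filter(function (token) {
        return !isPseudo(token);
    });
}

// Turn each pseudo token into the real token it stands for, similar in
// spirit to what the "correct" option is described as doing.
function materialize(tokens: string[]): string[] {
    return tokens.map(function (token) {
        return isPseudo(token) ? token.slice(1) : token;
    });
}

// Hypothetical token list for the input: obj = {}
const example = ["obj", "=", "{", "}", "x;"];
const withoutPseudo = dropPseudo(example);   // ["obj", "=", "{", "}"]
const withSemicolon = materialize(example);  // ["obj", "=", "{", "}", ";"]
```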

------
otabdeveloper2
> The Universal Parser

> Doesn't actually parse, it's a lexer.

Good lord!

~~~
austincheney
A lexer is a scanner that creates a token list. A parser may contain lexers
and outputs a data structure.

* [https://en.wikipedia.org/wiki/Lexical_analysis](https://en.wikipedia.org/wiki/Lexical_analysis)

* [https://en.wikipedia.org/wiki/Parsing#Parser](https://en.wikipedia.org/wiki/Parsing#Parser)
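
The distinction can be shown with a toy example. This is a generic sketch of
the two roles, not a description of how Sparser is built; `lex`, `parse`,
`Group`, and `TreeNode` are illustrative names.

```typescript
// Lexer: scans characters and emits a flat token list.
function lex(source: string): string[] {
    return source.match(/[A-Za-z_]\w*|\d+|[^\s\w]/g) || [];
}

// Parser output: a nested structure rather than a flat list.
interface Group {
    open: string;          // the bracket that opened this group
    children: TreeNode[];
}
type TreeNode = string | Group;

// Parser: consumes tokens (here from the lexer above) and records nesting.
function parse(tokens: string[]): TreeNode[] {
    const root: TreeNode[] = [];
    const stack: TreeNode[][] = [];
    let current = root;
    for (const token of tokens) {
        if (token === "(" || token === "{" || token === "[") {
            const group: Group = { open: token, children: [] };
            current.push(group);
            stack.push(current);
            current = group.children;
        } else if (token === ")" || token === "}" || token === "]") {
            // Unbalanced closers are ignored in this sketch.
            const parent = stack.pop();
            if (parent !== undefined) {
                current = parent;
            }
        } else {
            current.push(token);
        }
    }
    return root;
}

const flat = lex("f(a, b)");   // ["f", "(", "a", ",", "b", ")"]
const tree = parse(flat);      // ["f", { open: "(", children: ["a", ",", "b"] }]
```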

