I'm not sure there's a need to do that, not for parsing anyway.

A well-designed regex system is enough without tokenization.

The parser for the Raku language, for example, is a collection of regexes composed into grammars. (A grammar is a type of class where the methods are regexes.)
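
A tiny sketch of what that looks like (a made-up grammar, not Raku's actual one):

    grammar Greeting {
        token TOP     { <hello> \s+ <subject> }
        token hello   { 'hello' | 'hi' }
        token subject { \w+ }
    }

    say Greeting.parse('hello world');
    # 「hello world」
    #  hello => 「hello」
    #  subject => 「world」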

We could probably do the token thing with multi functions, if we had to.

    multi parse ( 'if',  $ where /\s+/, …, '{', *@_ ) {…}
    multi parse ( 'for', $ where /\s+/, …, '{', *@_ ) {…}
    …
Or something like that anyway.

(Note that `if`, `for`, etc. are keywords only when they are followed immediately by whitespace.)
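
For example, writing `if` with no whitespace after it gets parsed as an ordinary sub call:

    my $answer = 42;
    if $answer == 42 { say 'keyword' }   # whitespace after `if`: the conditional
    # if($answer == 42) { say 'nope' }   # no whitespace: parsed as a call to a sub
                                         # named `if`; a compile error unless you
                                         # have actually declared such a sub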

I'm not sure how well that would work in practice, since conceptually Raku doesn't start with any keywords or operators. They are supposed to seem as if they were added the same way module authors add keywords and operators. (To bootstrap it we do, of course, need actual keywords and operators there to build upon.)

Since modules can add new things, we would need to update the list of known tokens as we are parsing, which means that even if Raku did do the tokenization thing, it would have to happen at the same time as the other steps.

Tokenization seems like an antiquated way to create compilers. It was needed when there wasn't enough memory to keep all of the compiler stages loaded at the same time.

---

Here is an example parser for JSON files using regexes in a grammar to show the simplicity and power of parsing things this way.

    grammar JSON::Parser::Example {

        token TOP       { \s* <value> \s* }

        rule object     { '{' ~ '}' <pairlist>  } # '~' pairs the delimiters around <pairlist>
        rule pairlist   { <pair> * % ','        } # zero or more <pair> separated by ','
        rule pair       { <string> ':' <value>  }

        rule array      { '[' ~ ']' <arraylist> }
        rule arraylist  {  <value> * % ','      } # zero or more <value> separated by ','

        proto token value {*}

        token value:sym<number> {
            '-'?                          # optional negation
            [ 0 | <[1..9]> <[0..9]>* ]    # no leading 0 allowed
            [ '.' <[0..9]>+ ]?            # optional decimal point
            [ <[eE]> <[+-]>? <[0..9]>+ ]? # optional exponent
        }

        token value:sym<true>    { <sym>    }
        token value:sym<false>   { <sym>    }
        token value:sym<null>    { <sym>    }

        token value:sym<object>  { <object> }
        token value:sym<array>   { <array>  }
        token value:sym<string>  { <string> }

        token string {
            「"」 ~ 「"」 [ <str> | 「\」 <str=.str_escape> ]*
        }

        token str { <-["\\\t\x[0A]]>+ }
        token str_escape { <["\\/bfnrt]> }
    }
A `token` is a `regex` with `:ratchet` mode enabled (no backtracking). A `rule` is a `token` with `:sigspace` also enabled (whitespace becomes the same as a call to `<.ws>`).
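
So the `array` rule above could equivalently have been written in any of these ways:

    rule  array {                    '[' ~ ']' <arraylist> }
    token array { :sigspace          '[' ~ ']' <arraylist> }
    regex array { :ratchet :sigspace '[' ~ ']' <arraylist> }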

The only one of those that really looks anything like traditional regexes is the `value:sym<number>` token. (Raku replaced non-capturing grouping `(?:…)` with `[…]`, and character classes like `[eE]` with `<[eE]>`.)
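
For comparison, a traditional PCRE-style regex for the same number syntax would be something like this (with the `x` flag set so whitespace is ignored, which Raku regexes do by default):

    -? (?: 0 | [1-9][0-9]* ) (?: \. [0-9]+ )? (?: [eE] [+-]? [0-9]+ )?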

This code was copied from https://github.com/moritz/json/blob/master/lib/JSON/Tiny/Gra... but some parts were simplified to be slightly easier to understand. Mainly I removed the Unicode handling capabilities.

It will generate a tree-based structure when you use it.

    my $json = Q:to/END/;
    {
      "foo": ["bar", "baz"],
      "ultimate-answer": 42
    }
    END
    my $result = JSON::Parser::Example.parse($json);
    say $result;
The above will display the resultant tree structure like this:

    「{
      "foo": ["bar", "baz"],
      "ultimate-answer": 42
    }
    」
     value => 「{
      "foo": ["bar", "baz"],
      "ultimate-answer": 42
    }
    」
      object => 「{
      "foo": ["bar", "baz"],
      "ultimate-answer": 42
    }
    」
       pairlist => 「"foo": ["bar", "baz"],
      "ultimate-answer": 42
    」
        pair => 「"foo": ["bar", "baz"]」
         string => 「"foo"」
          str => 「foo」
         value => 「["bar", "baz"]」
          array => 「["bar", "baz"]」
           arraylist => 「"bar", "baz"」
            value => 「"bar"」
             string => 「"bar"」
              str => 「bar」
            value => 「"baz"」
             string => 「"baz"」
              str => 「baz」
        pair => 「"ultimate-answer": 42
    」
         string => 「"ultimate-answer"」
          str => 「ultimate-answer」
         value => 「42」
You can access parts of the data using array and hash accesses.

    my @pairs = $result<value><object><pairlist><pair>;

    my @keys = @pairs.map( *.<string>.substr(1,*-1) );
    my @values = @pairs.map( *.<value> );

    say @pairs.first( *.<string><str> eq 'ultimate-answer' )<value>;
    # 「42」
You can also pass in an actions class to do your processing as the parse happens. That approach is also a lot less fragile than indexing into the match tree by hand.

See https://github.com/moritz/json/blob/master/lib/JSON/Tiny/Act... for an example of an actions class.
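
As a rough sketch, an actions class for the example grammar above might look like this (the class name is made up, and a real version would also need a `str_escape` method that turns escape letters like `n` and `t` into their actual characters):

    class JSON::Actions::Example {
        method TOP($/)       { make $<value>.made }

        method object($/)    { make $<pairlist>.made.hash }
        method pairlist($/)  { make $<pair>».made.flat }
        method pair($/)      { make $<string>.made => $<value>.made }

        method array($/)     { make $<arraylist>.made }
        method arraylist($/) { make [$<value>.map(*.made)] }

        method string($/)    { make $<str>».made.join }
        method str($/)       { make ~$/ }

        method value:sym<number>($/) { make +$/ }
        method value:sym<string>($/) { make $<string>.made }
        method value:sym<true>($/)   { make True }
        method value:sym<false>($/)  { make False }
        method value:sym<null>($/)   { make Any }
        method value:sym<object>($/) { make $<object>.made }
        method value:sym<array>($/)  { make $<array>.made }
    }

    my $data = JSON::Parser::Example.parse($json,
        :actions(JSON::Actions::Example)).made;
    say $data<ultimate-answer>; # 42
Each method gets the match object for its regex, and `make` attaches a plain data value to that match, so by the time `TOP` finishes you have ordinary hashes and arrays instead of a match tree.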

---

Note that things which you would historically think of as tokens, such as `true`, `false`, and `null`, are written using `token`. This is a useful association, as it will naturally lead you to write shorter, more composable regexes.

Since they are composable, we could do things like extend the grammar to add the ability to have non-string keys, or perhaps add `Inf`, `-Inf`, and `NaN` as values.

    grammar Extended::JSON::Example is JSON::Parser::Example {
        rule pair { <key=.value> ':' <value>  } # replaces existing rule

        # adds to the existing set of value tokens.
        token value:sym<Inf>    { <sym> }
        token value:sym<-Inf>   { <sym> }
        token value:sym<NaN>    { <sym> }
    }
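With that, input which the base grammar rejects now parses:

    say JSON::Parser::Example.parse('{ 42: NaN }');           # Nil (parse failed)
    say Extended::JSON::Example.parse('{ 42: NaN }').defined; # True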
This is basically how Raku handles adding new keywords and operators under the hood.


