
Pattern Matching Without Regex – Introducing the Rosie Pattern Language - philk10
https://spin.atomicobject.com/2018/11/14/rosie-pattern-language/#.W-xEPiyim0g.hackernews
======
greggyb
I get that PEGs are more powerful than regexes, but it seems the example of
componentizing is weak at best.

I can do this very easily on the command line or in a .<shell>rc file:

    
    
      ipv4component='[0-9]{1,3}'
      ipv4="$ipv4component"'\.'"$ipv4component"'\.'"$ipv4component"'\.'"$ipv4component"
    
      $ grep -E "$ipv4" ...
    

I am bad at regex, so I'm sure the example is a poor definition, but the idea
is there. I can make variables that hold components of a regex, and since a
regex is just a string I can compose these via concatenation.

If I did this a lot, I could build a small helper script (or probably just a
set of shell functions) to maintain a library file of regex components that I
can use in the shell with grep.
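
As a rough sketch of what such a component library could look like outside the shell, here is the same idea in Python. The `COMPONENTS` dict and the `find_all` helper are made-up names for illustration, and the pattern is the same loose one as above (it still accepts things like 999.999.999.5).

```python
import re

# Hypothetical component library: each entry is just a regex string,
# and larger patterns are built by string concatenation.
COMPONENTS = {
    "ipv4component": r"[0-9]{1,3}",
}
COMPONENTS["ipv4"] = r"\.".join([COMPONENTS["ipv4component"]] * 4)


def find_all(name, text):
    """Return every substring of text that matches the named component."""
    return re.findall(COMPONENTS[name], text)


print(find_all("ipv4", "gateway 192.168.0.1, dns 10.0.0.53"))
# prints ['192.168.0.1', '10.0.0.53']
```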

~~~
taeric
It isn't that hard to build functions around a lot of this, either. Emacs has
the wonderful macro "rx".
[https://www.emacswiki.org/emacs/rx](https://www.emacswiki.org/emacs/rx)

------
msoucy
The article would be helped by using a full regex for ipv4 addresses - the one
it uses would match invalid numbers (999.999.999.5 for instance), but the
proper one is more complex (and would probably make for a better example as a
result)

Also I think there's something wrong with this blog's formatting, it appears
to be replacing underscores with italics even within code samples.

~~~
setr
I feel like that's fine: syntactically valid, semantically not.

Ideally syntax vs semantics should be decoupled in most parsing (hence the
AST)

~~~
samatman
256 is either 1[0-9][0-9] or 2[0-5][0-6], for the three digit case; that which
can be syntactically detected, should be.

~~~
diggernet
Except you've missed 2x7-2x9...
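
A quick way to see the gap is to brute-force the three-digit range against the pattern from the parent comment. This is only a throwaway check in Python; the variable names are mine.

```python
import re

# The three-digit pattern proposed in the parent comment.
candidate = re.compile(r"1[0-9][0-9]|2[0-5][0-6]")

# Three-digit octets (100-255) that should match but do not.
missed = [n for n in range(100, 256) if not candidate.fullmatch(str(n))]

# Three-digit values above 255 that match anyway.
extra = [n for n in range(256, 1000) if candidate.fullmatch(str(n))]

print(missed)
# prints [207, 208, 209, 217, 218, 219, 227, 228, 229, 237, 238, 239, 247, 248, 249]
print(extra)
# prints [256]
```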

~~~
samatman
Ah, you're right, that was careless of me

here's the real deal courtesy of the URI spec:

IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet

    
    
          dec-octet   = DIGIT                 ; 0-9
                      / %x31-39 DIGIT         ; 10-99
                      / "1" 2DIGIT            ; 100-199
                      / "2" %x30-34 DIGIT     ; 200-249
                      / "25" %x30-35          ; 250-255
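
For anyone who wants that as a plain regex, here is one possible translation into Python. It is my own transcription of the five `dec-octet` alternatives, not something taken from the spec text, and `is_ipv4` is just an illustrative helper name.

```python
import re

# One regex alternative per dec-octet branch in the ABNF above.
# Listed longest-first; with fullmatch the order does not change the
# result, but it matters if the pattern is used for unanchored searches.
DEC_OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])"
IPV4 = re.compile(r"\.".join([DEC_OCTET] * 4))


def is_ipv4(s):
    """True if the whole string is a dotted quad with octets in 0-255."""
    return IPV4.fullmatch(s) is not None


print(is_ipv4("192.168.0.1"))    # prints True
print(is_ipv4("999.999.999.5"))  # prints False
```

Note that, like the ABNF, this rejects octets with leading zeros such as `01`.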

------
drivers99
Kind of looks like grok. Grok lets you name patterns, then build up larger
patterns by those names, and then also name the groups that it matches to
those sub-patterns so you can refer to them in the data. It's built on top of
regex, as each pattern can be defined by a mix of other patterns and/or regex.

For example this grok pattern (taken from [1] )

%{TIMESTAMP_ISO8601:timestamp} \[%{IPV4:ip};%{WORD:environment}\] %{LOGLEVEL:log_level} %{GREEDYDATA:message}

refers to a pattern called TIMESTAMP_ISO8601 and calls it "timestamp" in the
resulting output data structure.

In logstash, TIMESTAMP_ISO8601 is predefined in a patterns file, such as [2],
which is made up of a mix of regex and other patterns like YEAR, MONTHNUM,
etc.

TIMESTAMP_ISO8601 %{YEAR}-%{MONTHNUM}-%{MONTHDAY}[T ]%{HOUR}:?%{MINUTE}(?::?%{SECOND})?%{ISO8601_TIMEZONE}?

MONTHNUM is a regex (optional 0 followed by 1-9; or 10 through 12):

MONTHNUM (?:0?[1-9]|1[0-2])
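
To make the expansion idea concrete, here is a toy version in Python. It is a sketch of the general technique, not how logstash implements grok. MONTHNUM is the definition quoted above; YEAR and MONTHDAY are simplified stand-ins I wrote for the example, and DATE, `PATTERNS`, and `expand` are hypothetical names.

```python
import re

# Tiny pattern library in the spirit of a grok patterns file.
PATTERNS = {
    "YEAR": r"\d{4}",                                   # simplified stand-in
    "MONTHNUM": r"(?:0?[1-9]|1[0-2])",                  # as quoted above
    "MONTHDAY": r"(?:0[1-9]|[12][0-9]|3[01]|[1-9])",    # simplified stand-in
    "DATE": r"%{YEAR:year}-%{MONTHNUM:month}-%{MONTHDAY:day}",  # hypothetical
}

# Matches %{NAME} and %{NAME:field}.
REF = re.compile(r"%\{(\w+)(?::(\w+))?\}")


def expand(pattern):
    """Recursively replace %{NAME} / %{NAME:field} with regex source."""
    def repl(m):
        body = expand(PATTERNS[m.group(1)])
        field = m.group(2)
        # A field name becomes a named group; otherwise just group the body.
        return f"(?P<{field}>{body})" if field else f"(?:{body})"
    return REF.sub(repl, pattern)


match = re.fullmatch(expand("%{DATE}"), "2018-11-14")
print(match.groupdict())
# prints {'year': '2018', 'month': '11', 'day': '14'}
```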

I'm not sure what all of Rosie's base patterns are. This appears to be a valid
regex though, from the example: [:alpha:]+ (regex character class and "+"
meaning 1 or more). It's C instead of Java, which is useful in more/different
places.

[1] [https://www.elastic.co/blog/do-you-grok-grok](https://www.elastic.co/blog/do-you-grok-grok)

[2] [https://github.com/logstash-plugins/logstash-patterns-core/b...](https://github.com/logstash-plugins/logstash-patterns-core/blob/master/patterns/grok-patterns)

~~~
eutropia
Is there any way to make use of grok on the command line like the post does
with rosie patterns? I rather like the idea of `| grep 'net.ipv4'`

~~~
busterarm
I wish. I think the closest that you can get is to run Logstash locally with
the stdout output plugin. You'll have to do some work with the logstash config
to set up your grok input filters each time, but it should give you what you
want otherwise.

~~~
aequitas
Grok is implemented in other languages too. For example
[https://github.com/garyelephant/pygrok/](https://github.com/garyelephant/pygrok/)
It makes developing and unit testing Logstash Grok patterns a lot quicker
compared to spinning up a full Logstash instance for every line of code
changed.

------
neurotrace
This is a very cool tool but I feel like I'm missing something. This looks like
any other PEG parser generator. The only difference I see is that it will
automatically handle the case where a valid match starts somewhere other than
at the start of the stream. I'm not sure that this constitutes calling it a
whole language unto itself.

What separates this from tools like PEG.js[1] or pest[2]?

[1]: [https://pegjs.org/](https://pegjs.org/)

[2]: [https://github.com/pest-parser/pest](https://github.com/pest-parser/pest)

~~~
yAak
For the sake of discussion, here's what the author says:
[http://rosie-lang.org/blog/2018/02/25/why-rpl.html#why-not-u...](http://rosie-lang.org/blog/2018/02/25/why-rpl.html#why-not-use-one-of-the-many-existing-peg-libraries)

I guess pest is comparable then, but wasn't mature when the author started
work on Rosie?

(I'd be curious for a proper comparison, but I'm not really knowledgeable in
this area -- I had no idea there were so many alternatives to regex:
[https://en.wikipedia.org/wiki/Comparison_of_parser_generator...](https://en.wikipedia.org/wiki/Comparison_of_parser_generators#Parsing_expression_grammars,_deterministic_boolean_grammars))

------
KeyboardFire
The example they give isn't really convincing to me. I can see the use case
for this kind of language, but for e.g. searching for a pattern on the shell
that isn't just one of a few predefined special cases, it seems like it'd
still be a lot easier to compose regexes on the fly.

------
jmaa
I don't see the difference between this and any other Context-Free Grammar
specification language. Yacc is an industry standard, and even SNOBOL4 (1967)
had first-class CFG datatypes. Maybe he's just excited about being able to use
CFGs in the cmdline?

------
dblotsky
I was going to agree with everyone about how it’s not a language, but reading
into it more, I proved myself wrong.

This _is_ a different language insofar as it describes PEGs, not regexes,
which is fundamentally different and more powerful (it can parse more things).

The naming of patterns isn’t unique, since you can just put regexes in
variables in every other language too. However, the syntax in Rosie seems
nicer, and sharing is easier.

------
msla
This is nice within a specific use case: being able to make files with all the
pattern chunks you use repeatedly, so you can reuse them and add to them. If
you can't make files, it at least looks no _worse_ than composing regexes on
the command line, but it also doesn't look all that different.

Edit: OK, I was wrong. It _is_ strictly more powerful than regexes, in that it
can correctly match nested pairs.
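
For a sense of what "nested pairs" means here, this is a hand-written recognizer in Python for the PEG-style rule `group <- ("(" group ")")*`. It is my own illustration of the recursion a classic regex lacks, not Rosie syntax or Rosie's implementation.

```python
def balanced(s):
    """Recognize strings of properly nested parentheses, e.g. "(())()"."""
    def group(i):
        # Match zero or more "(" group ")" sequences starting at index i
        # and return the index just past the last one consumed.
        while i < len(s) and s[i] == "(":
            j = group(i + 1)
            if j >= len(s) or s[j] != ")":
                return i  # this "(" has no partner, so stop before it
            i = j + 1
        return i

    # The whole input must be consumed for a match.
    return group(0) == len(s)


print(balanced("(())()"))  # prints True
print(balanced("(()"))     # prints False
```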

------
ketralnis
This talk about Rosie
[https://www.youtube.com/watch?v=MkTiYDrb0zg](https://www.youtube.com/watch?v=MkTiYDrb0zg)
is also quite nice

------
AndrewOMartin
Can this be used to parse HTML?

~~~
neurotrace
From the third paragraph:

> Rosie has several benefits over traditional regexes, including the ability
> to parse recursive structures like HTML and JSON

