
Designing a Good DSL - zonotope
http://tonsky.me/blog/dsl/
======
jmts
Claiming that regular expressions are too terse is a bit much. There are only
three (!) fundamental operators in basic regular expressions (four if you
include parentheses), with all other non-language-specific operators being
derived from those (ignoring precedence rules):

1\. concatenation, to append regex B to regex A: AB

2\. alternation, to select between A or B: A | B

3\. kleene star, to repeat A zero or more times: A*

4\. parentheses, to specify a sub-expression: (A)

The following are all derived/syntactic sugar:

[ABCD] -> (A | B | C | D)

A+ -> AA*

A{2} -> AA

A{2,4} -> AA(|A|AA) or A(A|AA|AAA)

A? -> (A|)
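
As a rough sanity check of the derivations listed above, the sketch below compares each sugared form with its desugared form using Python's `re` module. It is a bounded brute-force comparison over short strings, not a proof, and the helper name `same_language`, the alphabet, and the length limit are all assumptions made for illustration. Non-capturing groups `(?:...)` stand in for the plain parentheses.

```python
import re
from itertools import product

# (sugared form, desugared form) pairs taken from the list above.
EQUIVALENCES = [
    ("[ABCD]", "(?:A|B|C|D)"),
    ("A+", "AA*"),
    ("A{2}", "AA"),
    ("A{2,4}", "AA(?:|A|AA)"),
    ("A{2,4}", "A(?:A|AA|AAA)"),
    ("A?", "(?:A|)"),
]


def same_language(sugared, desugared, alphabet="ABCD", max_len=5):
    """Return True if both patterns accept the same strings up to max_len."""
    for length in range(max_len + 1):
        for chars in product(alphabet, repeat=length):
            candidate = "".join(chars)
            accepted_a = re.fullmatch(sugared, candidate) is not None
            accepted_b = re.fullmatch(desugared, candidate) is not None
            if accepted_a != accepted_b:
                return False
    return True


for sugared, desugared in EQUIVALENCES:
    print(sugared, "vs", desugared, "->", same_language(sugared, desugared))
```

The same helper also catches a wrong expansion: a form that drops the empty alternative, such as `AA(?:A|AA)`, no longer accepts "AA" and so disagrees with `A{2,4}`.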

Just about everything else is implementation specific (if choice of special
characters and available operators isn't already). That means you either need
to be using the features regularly to remember them, or you have to look them
up anyway.

Regular expressions are terse not because they are badly designed, but because
by definition the description of regular languages is inherently minimal. It
is part of their beauty. Without this minimalism, every tidy little one liner
we have to perform some simple match becomes a multi-line specification in
Backus-Naur form.

The world needs to get over this fear of regular expressions, which is born of
ignorance and continued misinformation. They are not magic or impossible to understand.
They are an elegant description of a very simple state machine which steps
through a string one character at a time, nothing more.
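
To make the state-machine view concrete, here is a minimal hand-written sketch of a machine for `A{2,4}`. The transition table, the state numbering, and the function name `run_dfa` are illustrative assumptions; real regex engines build and run their automata differently.

```python
# States count how many A's have been read so far (0 to 4).
# Any (state, character) pair missing from the table means "reject".
TRANSITIONS = {
    (0, "A"): 1,
    (1, "A"): 2,
    (2, "A"): 3,
    (3, "A"): 4,
}
ACCEPTING = {2, 3, 4}


def run_dfa(text):
    """Step through text one character at a time; accept if we end in an accepting state."""
    state = 0
    for ch in text:
        state = TRANSITIONS.get((state, ch))
        if state is None:
            return False
    return state in ACCEPTING


for sample in ["", "A", "AA", "AAAA", "AAAAA", "AAB"]:
    print(repr(sample), run_dfa(sample))
```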

Edit: corrected derivation of A{2,4} a la Twisol and jbnicolai.

~~~
jbnicolai
Agreed. The problem is the risk of small mistakes that are relatively hard to
spot and nearly impossible to debug properly.

> A{2,4} -> AA(A|AA)

~~~
Twisol
And this is why we have the syntax sugar.

~~~
slavik81
That introduces problems too. If you try to use sugar like '+' with an
implementation that doesn't support it, you don't get any sort of error.
Instead you get a different expression.

Unfortunately, there's an inherent tradeoff between encoding efficiency and
error detection. Notice that with the VerbalExpressions it would be trivial to
return a useful error message if the 'at_least_one' pattern did not exist.
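
A small sketch of that tradeoff, under stated assumptions: `PatternBuilder` is a hypothetical builder written for this example (it is not the VerbalExpressions API), and the `A\+` pattern merely simulates, in Python, a dialect in which an unescaped `+` is an ordinary character, as it is in POSIX basic regular expressions.

```python
import re


class PatternBuilder:
    """Hypothetical verbose builder, loosely in the spirit of VerbalExpressions."""

    def __init__(self):
        self._parts = []

    def then(self, literal):
        # Match the given text literally.
        self._parts.append(re.escape(literal))
        return self

    def at_least_one(self, literal):
        # Match one or more repetitions of the given text.
        self._parts.append(f"(?:{re.escape(literal)})+")
        return self

    def compile(self):
        return re.compile("".join(self._parts))


# Terse form in a dialect where "+" is literal: no error, just a different
# expression that matches the two characters "A+".
literal_plus = re.compile(r"A\+")
print(literal_plus.fullmatch("AAA"))   # None: the repetition silently vanished
print(literal_plus.fullmatch("A+"))    # a match object

# Verbose form: a missing operation fails loudly instead.
try:
    PatternBuilder().at_least_two("A")  # deliberately not defined above
except AttributeError as error:
    print("error:", error)
```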

~~~
b2gills
Perl 6 regexes attempt to improve upon this situation by making regexes more like
a regular programming language. That is, it errs on the side of error detection
rather than encoding efficiency. (It also adds features that would be
difficult to add to the Perl 5/PCRE regex design.)

For a start, if it didn't support using `+`, then any attempt to use it would
generate a compiler error because it is not alphanumeric. (A regex is code in
Perl 6.)

All non-alphanumeric characters are presumed to be metasyntactic, and so must
be escaped in some way to match literally. Arguably the best way is to quote them
like a string literal. (This uses the same domain-specific sub-language that the
main language uses for string literals.)

    
    
        / "+" + /   # at least one + character
    
    

It really is a significant redesign.

    
    
        /A{2,4}/    # Perl 5/PCRE
        /A ** 2..4/ # Perl 6
    
        /A (?:BA){1,3}/x
        /A [BA] ** 1..3/ # Perl 6: direct translation
        /A ** 2..4 % B/  # Perl 6: 2 to 4 A's separated by B
    
        /A (?:BA){1,3} B?/x
        /A ** 2..4 %% B/   # Perl 6: %% allows trailing separator
    
        /\" [^"]* \"/x     # Perl 5/PCRE
        /\" <-["]>* \"/    # Perl 6: direct translation
        /｢"｣ ~ ｢"｣ <-["]>*/ # Perl 6: between two ", match anything else
                           # (can be used to generate better error messages)
    
        ---
    
        # Perl 5
        my $foo = qr/foo/;
        'abfoo' =~ /ab $foo/x;
    
        # Perl 6
        my $foo = /foo/;
        'abfoo' ~~ /ab <$foo>/;
        # or
        my token foo {foo}     # treat it as a lexical subroutine
        'abfoo' ~~ /ab <&foo>/;
    
        ---
    
        # Perl 5
        my $foo = 'foo';
        'abfoo' =~ /ab \Q $foo \E/x; # treat as string not regex
        # Perl 6
        my $foo = 'foo';
        'abfoo' ~~ /ab $foo/; # that is the default in Perl 6

------
EdCoffey
I'd add a few extra points:

1\. Don't (unless there's an extremely good reason to do so). Ask yourself: is
there a net benefit gained by forcing a developer to learn and use your DSL,
ignoring their likely familiarity with general purpose languages that could
solve the same problem? For starters, any application where humans won't be
reading/writing a large amount of the DSL is probably out.

2\. Avoid anything that resembles natural language. If you're writing a true
natural-language interpreter, you're not writing a DSL. If you're writing a
DSL that looks like natural language, people will be tempted to apply the
grammar rules they already know from that language, rather than the strict
grammar of your DSL, resulting in frustrating errors. It's a whole lot easier
to memorise the rules in a language of keywords and symbols, because you don't
have to first banish your existing knowledge of natural language.

3\. Don't try to accommodate "non-technical" users performing intrinsically
technical tasks. There's no point creating a "friendly" DSL over HTML and CSS
if the people authoring in the DSL still require an in-depth knowledge of the
box model, responsive web design etc. All you've done is kicked the can down
the road and created a false sense of capability.

~~~
bocklund
> 3\. Don't try to accommodate "non-technical" users performing intrinsically
> technical tasks. There's no point creating a "friendly" DSL over HTML and
> CSS if the people authoring in the DSL still require an in-depth knowledge
> of the box model, responsive web design etc. All you've done is kicked the
> can down the road and created a false sense of capability.

Should be #1 in the OP. So underrated.

------
cestith
Perl6 goes far beyond regular expressions.

[https://docs.perl6.org/language/grammar_tutorial](https://docs.perl6.org/language/grammar_tutorial)

[https://docs.perl6.org/language/grammars](https://docs.perl6.org/language/grammars)

Even with Perl5, if you're using too many regular expressions, you might want
to check CPAN.

[https://metacpan.org/search?p=1&q=parse&size=500](https://metacpan.org/search?p=1&q=parse&size=500)
(over 4k results searching for "parse")

In particular, Parse::RecDescent, Parse::Yapp, Parse::Lex, Parse::Flex,
Regexp::Common, Net::IP, NetPacket::IP, ... really most of the things you'd
want to parse from Apache::ParseLog to Parse::DNS::Zone

The inclusion of (not so) regular expressions in a language doesn't mean one
needs to abuse them.

------
NickBusey
Interesting article, though the title should perhaps be `How not to design a
Bad DSL` since the vast majority of the advice is apparently what NOT to do.

~~~
nerpderp83
If the no-go space is larger and easier to fall into, it makes sense to have a
more detailed danger map. It might stop a lot of folks from even taking the
journey, which, unlike with real journeys, is the proper choice. If you are making a DSL
for your own enjoyment, don't transmute your joy into someone else's pain. If
end users have real issues that your DSL will solve, by all means, make it.

------
greenyouse
Would anybody mind sharing links to good DSLs?

From Clojure I think hiccup is a good example.

[https://github.com/weavejester/hiccup](https://github.com/weavejester/hiccup)

~~~
9214
Embedded DSLs for:

* implementing other DSLs - [https://www.red-lang.org/2013/11/041-introducing-parse.html](https://www.red-lang.org/2013/11/041-introducing-parse.html)

* description of visual interfaces and specification of 2D drawing operations - [https://www.red-lang.org/2016/03/060-red-gui-system.html](https://www.red-lang.org/2016/03/060-red-gui-system.html)

* low-level programming - [https://static.red-lang.org/red-system-specs.html](https://static.red-lang.org/red-system-specs.html)

------
traverseda
I'd like to see a parser-generator built using these principles, as most of
the DSLs for building DSLs are "bad DSLs", by this definition.

~~~
wcrichton
Arguably parser combinators satisfy the author's requirements, e.g.
[https://github.com/Geal/nom/](https://github.com/Geal/nom/)
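
For readers who have not met the idea, here is a minimal sketch of parser combinators in Python. It is not nom's API; the names `literal`, `sequence`, and `many` are assumptions chosen for this example. Each parser is a plain function taking the text and a position and returning either `(value, new_position)` or `None`.

```python
def literal(expected):
    """Parser that matches one fixed string."""
    def parse(text, pos):
        if text.startswith(expected, pos):
            return expected, pos + len(expected)
        return None
    return parse


def sequence(*parsers):
    """Parser that runs each parser in order and fails if any of them fails."""
    def parse(text, pos):
        values = []
        for parser in parsers:
            result = parser(text, pos)
            if result is None:
                return None
            value, pos = result
            values.append(value)
        return values, pos
    return parse


def many(parser):
    """Parser that applies another parser zero or more times."""
    def parse(text, pos):
        values = []
        while True:
            result = parser(text, pos)
            # Stop on failure, or if the inner parser consumed nothing.
            if result is None or result[1] == pos:
                return values, pos
            value, pos = result
            values.append(value)
    return parse


# "A" followed by any number of "BA" pairs, i.e. A's separated by B's.
a_separated_by_b = sequence(literal("A"), many(sequence(literal("B"), literal("A"))))

print(a_separated_by_b("ABABA", 0))
```

The grammar is ordinary host-language code, so names, composition, and error reporting come from the host language rather than from a separate notation.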

------
Twisol
(EDIT: I replaced asterisks with <AST> in the pattern below, since HN doesn't
have a convenient way to escape asterisks last I checked.)

> First, most of its syntax beyond the very basics like "X+" or "[^X]" is
> impossible to remember. It’d be nice to know what "(?<!X)" does without
> having to look it up first.

I realize regex can be hard to remember, but I think this is a little
overblown. "(? ...)" is the general form of an extended group -- something for
which "special behavior" occurs. The character sequence following that
determines what the special behavior is. In this case, "<!" means "negative
lookbehind": "<" for "lookbehind" and "!" for "negative". Compact, but
mnemonic.
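
A short illustration of that mnemonic in Python's `re` module, which supports all four lookaround forms. The sample string is made up for the example.

```python
import re

text = "price: $100, qty: 20"

# "<" means lookbehind, "!" means negative, "=" means positive.
after_dollar = re.findall(r"(?<=\$)\d+", text)         # lookbehind, positive
not_after_dollar = re.findall(r"(?<!\$)\b\d+", text)   # lookbehind, negative
before_comma = re.findall(r"\b\d+\b(?=,)", text)       # lookahead, positive
not_before_comma = re.findall(r"\b\d+\b(?!,)", text)   # lookahead, negative

print(after_dollar)       # ['100']
print(not_after_dollar)   # ['20']
print(before_comma)       # ['100']
print(not_before_comma)   # ['20']
```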

> [a-z]{3,10}://([^/?#]<AST>)([^?#]<AST>)(?:\?([^#]<AST>))?(?:#(.<AST>))?

Yes, this is very dense. But it's not really meant to be read at a glance --
URLs do not have an exceptionally simple pattern, and they have multiple
parts, many of which are optional. You could write a recognizer for URLs
explicitly, but I doubt it would be as immediately recognizable in full. I
consider it more important to obtain a high-level understanding before a
low-level understanding, and (for me at least) I can see that this regex matches
URLs up front, setting my expectations for the details later.

I think a PEG or combinator parser would be more self-documenting, so I'm not
saying regex can't be pushed too far. But it isn't nearly as unstructured as
it looks.

> Many DSLs were designed to reduce amount of non-DSL code to the absolute
> zero. They try to help too much.

I 100% agree here. A DSL should be laser-focused on doing one thing well. Any
language should be composed of orthogonal features, each laser-focused; a DSL
is just a language with a very small handful of features geared toward a
specific domain. (Of course, ideals are rarely realized in full.) I
particularly like GraphQL as of recently; it has a really nice feel to it.

I found an interesting paper [0] on design principles for DSLs while composing
this comment. I've only skimmed it, but it looks quite nice.

[0] [https://arxiv.org/abs/1409.2378](https://arxiv.org/abs/1409.2378)

~~~
Twisol
Incidentally, here's how I might write that URL regex with comprehensibility
outweighing all else. Notice that I enable free-spacing [0], which can usually
be enabled by passing a flag rather than embedding it in the regex. In Python,
the re.X (or re.VERBOSE) flag does the job.

    
    
        (?x) # free-spacing
            (?P<protocol> [a-z]{3,10} ) ://  # http://
            (?P<domain>   [^/?#]* )          # google.com
            (?P<path>     [^?#]* )           # /path/to/resource
        (\? (?P<query>    [^#]* ))?          # ?id=123
        (\# (?P<fragment> .* ))?             # #section
    

[0] [https://www.regular-expressions.info/freespacing.html](https://www.regular-expressions.info/freespacing.html)
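
A sketch of how the free-spacing pattern above might be used from Python, passing `re.VERBOSE` as a flag instead of embedding `(?x)`. The helper name `split_url` is an assumption for this example, and the optional outer groups are written as non-capturing `(?:...)` so that only the named groups are captured.

```python
import re

URL_PATTERN = re.compile(
    r"""
        (?P<protocol> [a-z]{3,10} ) ://     # http://
        (?P<domain>   [^/?#]* )             # google.com
        (?P<path>     [^?#]* )              # /path/to/resource
    (?: \? (?P<query>    [^#]* ) )?         # ?id=123
    (?: \# (?P<fragment> .* ) )?            # #section
    """,
    re.VERBOSE,
)


def split_url(url):
    """Return the named parts of url as a dict, or None if it does not match."""
    match = URL_PATTERN.fullmatch(url)
    return match.groupdict() if match else None


print(split_url("http://google.com/path/to/resource?id=123#section"))
```

Parts that are absent from the input come back as `None` (for the optional query and fragment) or as an empty string (for the path).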

------
ellisv
> Another non-verbose DSL example is Java date and time format string:

> "YYYY-MM-DD'T'HH:mm:ss.SSSZ"

Oh that crazy Java formatting date times using ISO 8601!

Although the author didn’t really mention any good DSLs I do appreciate the
tips in the second half.

