
Crafting Interpreters: Strings - benhoyt
http://www.craftinginterpreters.com/strings.html
======
munificent
Author here! Happy to answer questions, take feedback, receive criticism.

 _Waves awkwardly, shuffles feet._

~~~
smueller1234
Hey Bob. Still a big fan of this project. Thank you!

This chapter introduces not only strings but the concept of more complex value
types that use an auxiliary struct. There's an anecdote related to that which
you might find amusing - and maybe a bit appalling.

In the implementation of perl, there are multiple kinds of integer value type in
order to support its complex implicit casting semantics and additional edge
cases. To pick the least wonky, there's plain integer values, "IVs", and those
which have a stringified representation of the int attached for caching the
expensive conversion, "PVIVs". The former store the int in the primary value
struct. The latter use an auxiliary struct with extra space.

Now, the reason for all this prep is that in order to make the two behave the
same, and not introduce a lot of branches to low level code, perl always uses
the indirect access to the integer value (essentially "x->aux.intval"). This
works because for basic integer values, it initializes the aux pointer in the
main struct to point to the right place in memory before the location of the
main struct such that this access always points at the right slot of the main
struct. Madness! So much fun!

(A better explanation is at [https://medium.com/booking-com-development/how-we-spent-two-...](https://medium.com/booking-com-development/how-we-spent-two-days-making-perl-faster-939457ef16a1))
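
A minimal sketch of the aux-pointer trick described in this comment, in C. All names here (`Body`, `Value`, `init_plain_int`, `init_cached_int`, `get_int`) are hypothetical and are not perl's real identifiers or layout. The sketch does the address arithmetic on `uintptr_t` rather than on raw pointers, since the fake body address may fall outside the `Value` object.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical auxiliary struct: cached string plus the integer. */
typedef struct {
    char  *str;
    size_t len;
    long   intval;
} Body;

/* Hypothetical main struct: body address plus an inline integer slot. */
typedef struct {
    uintptr_t body;       /* address of a Body, real or "fake" */
    long      inline_int; /* used only by plain integers */
} Value;

/* Plain integer: no allocation. Aim body so that the intval slot of
   the "fake" Body lands exactly on our own inline_int field. */
static void init_plain_int(Value *v, long n) {
    v->inline_int = n;
    v->body = (uintptr_t)&v->inline_int - offsetof(Body, intval);
}

/* Integer with room for a cached string: a real Body on the heap. */
static int init_cached_int(Value *v, long n) {
    Body *b = calloc(1, sizeof *b);
    if (b == NULL) return -1;
    b->intval = n;
    v->inline_int = 0;
    v->body = (uintptr_t)b;
    return 0;
}

/* One access path for both kinds, with no branch on the value type. */
static long get_int(const Value *v) {
    assert(v != NULL);
    return *(const long *)(v->body + offsetof(Body, intval));
}
```

One consequence of this layout: a plain `Value` stores an address derived from its own location, so it cannot be copied with a simple struct assignment without re-running `init_plain_int` on the copy.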

~~~
_old_dude_
Shenandoah, one of the GCs in Java, uses a similar trick. Each time a field is
accessed, instead of reading it directly, the VM goes through a pointer at the
top of the structure that initially points to the address just below itself, so
each field is read using a double indirection. This allows the application to
keep running while the GC moves objects (during the evacuation phase), because
each time the GC relocates an object, it also changes the top pointer in the
old region to reference the new region. The GC also rewrites the pointers
inside the stacks by replacing each pointer to the old region with the value of
the top pointer, so the pointers in the stack now reference the new region
too.

~~~
smueller1234
Ah, and I'd always wanted to read up on that and never prioritized it. Thank
you!

This might be a naive question, but how do you do all that redirection under
concurrent accesses? Clearly that amount of rewiring isn't going to be atomic?
Or am I misunderstanding, and the only change that needs to be atomic is the
top-of-structure redirection?

~~~
_old_dude_
As you said, only the top pointer needs a CAS.

If you want more info, there is a good presentation by Christine Flood about
Shenandoah at FOSDEM 2017
[https://archive.fosdem.org/2017/schedule/event/shenandoah/](https://archive.fosdem.org/2017/schedule/event/shenandoah/)
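
The forwarding-pointer idea discussed in this subthread can be sketched in C11 as follows. This illustrates the description given here and is not Shenandoah's actual code: `Obj`, `obj_new`, `resolve`, and `evacuate` are hypothetical names, the objects come from `malloc` rather than GC regions, and only reads are covered (concurrent writes during the copy would need their own barrier).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

typedef struct Obj Obj;
struct Obj {
    _Atomic(Obj *) fwd; /* forwarding pointer: the object itself until moved */
    int x;
    int y;
};

static Obj *obj_new(int x, int y) {
    Obj *o = malloc(sizeof *o);
    if (o == NULL) return NULL;
    atomic_init(&o->fwd, o); /* starts out pointing at itself */
    o->x = x;
    o->y = y;
    return o;
}

/* Read barrier: every field access goes through the forwarding pointer. */
static Obj *resolve(Obj *o) {
    assert(o != NULL);
    return atomic_load(&o->fwd);
}

/* Copy the object, then publish the copy with a single CAS on the old
   object's forwarding pointer. If another thread already moved it, the
   CAS fails, we discard our copy and return the winner's. */
static Obj *evacuate(Obj *old) {
    Obj *copy = malloc(sizeof *copy);
    if (copy == NULL) return NULL;
    atomic_init(&copy->fwd, copy);
    copy->x = old->x;
    copy->y = old->y;

    Obj *expected = old; /* only succeed if the object has not moved yet */
    if (atomic_compare_exchange_strong(&old->fwd, &expected, copy)) {
        return copy;
    }
    free(copy);
    return expected; /* on failure, expected holds the current target */
}
```

Readers that still hold the old address keep working after a move, because `resolve(old)->x` lands on the copy once the CAS has succeeded.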

------
fusspawn
I love this series.

Implemented my own interpreter off the back of the original series. Managed to
get a type system and all sorts added.

The author was even responsive via email, giving me hints on where to
investigate extending it further. (I believe I was trying to add lists at the
time.)

Having followed a ton of 'how to write a programming language / compiler / etc.'
tutorials, it's the first one to really make me grok what's going on and move
beyond just copy-pasting examples.

So thanks Bob. Keep up the awesome work!

~~~
munificent
Thank you! This means the world to me. :)

~~~
fusspawn
No, thank you!

I have some spare change kicking around in a bitcoin wallet gathering dust and
would like to buy you a coffee. Got an address I can send it to?

Couldn't see one on the site or your blog.

~~~
munificent
Give it to someone more needy than me. When the book is done and I have the
print edition out, the easiest way to say thanks is to buy a copy and/or write
a nice review.

Until then, just enjoy it. :)

------
c-smile
If strings are immutable in Lox then this string representation:

    
    
        struct sObjString {
          Obj obj;         
          int length;      
          char* chars;     
        };   
    

needlessly wastes 4 or 8 bytes for the chars pointer (among other heap
management overhead related to separate allocations).

Instead, I think, you should have

    
    
        struct sObjString {
          Obj  obj;         
          int  length;      
          char chars[0];     
        };   
    

and allocate that string object as one chunk with its characters.
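
A minimal sketch of that single-allocation layout, using a C99 flexible array member (`char chars[]`, the standard form of the zero-length array shown above). The one-field `Obj` and the `string_new` helper are hypothetical stand-ins for illustration, not the book's definitions.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the book's object header. */
typedef struct {
    int type;
} Obj;

typedef struct sObjString {
    Obj  obj;
    int  length;
    char chars[]; /* flexible array member: bytes follow the header */
} ObjString;

/* Allocate the header and the characters as one chunk. */
static ObjString *string_new(const char *src, int length) {
    assert(src != NULL && length >= 0);
    /* header + characters + terminating NUL */
    ObjString *s = malloc(sizeof *s + (size_t)length + 1);
    if (s == NULL) return NULL;
    s->obj.type = 0;
    s->length = length;
    memcpy(s->chars, src, (size_t)length);
    s->chars[length] = '\0';
    return s;
}
```

A single `free(s)` then releases both the header and the characters, and the separate `chars` pointer disappears from the struct.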

~~~
munificent
Take a look at the first challenge exercise at the end of the chapter. :)

~~~
c-smile
My apologies, I hadn't gotten to that part when I commented.

------
camgunz
Hey Bob, I super love this series. Everyone I know who's into PL likes it, so
great work :)

A little thing in your section about the different UTFs: I've heard UTF-16 is
a lot better than UTF-8 for East Asian languages because pretty much all those
code points take three bytes in UTF-8 but only two in UTF-16. I don't
_necessarily_ think anything you wrote was invalid, but just FYI.

~~~
ufo
However, this is only the case if you have just a big blob of text; UTF-8
might be better if the East Asian text is interspersed with ASCII characters.
For example, HTML with East Asian text is often smaller when encoded with
UTF-8, because a significant portion of the content is HTML markup.

~~~
camgunz
Right, well, again: if your average East Asian codepoint is 3 bytes in UTF-8
and your average markup codepoint is one byte, then as the percentage of
markup in your corpus rises, the average size per codepoint falls
asymptotically towards 1 byte. But not all
text contains markup ;) Consider any database storage for example, text in a
game, e-books, visual novels. I guess anything that isn't the web or JSON is
what I'm saying -- which is a lot.
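
To make the size trade-off in this subthread concrete, here is a small C sketch that counts encoded bytes per code point. The helper names (`utf8_len`, `utf16_len`, `encoded_size`) are hypothetical, and the input is assumed to be valid Unicode scalar values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes needed to encode one code point in UTF-8. */
static size_t utf8_len(uint32_t cp) {
    if (cp < 0x80)    return 1; /* ASCII */
    if (cp < 0x800)   return 2;
    if (cp < 0x10000) return 3; /* rest of the BMP, including most CJK */
    return 4;
}

/* Bytes needed to encode one code point in UTF-16. */
static size_t utf16_len(uint32_t cp) {
    return cp < 0x10000 ? 2 : 4; /* one code unit, or a surrogate pair */
}

/* Total encoded size of n code points under the given per-code-point rule. */
static size_t encoded_size(const uint32_t *cps, size_t n,
                           size_t (*len)(uint32_t)) {
    assert(cps != NULL || n == 0);
    size_t total = 0;
    for (size_t i = 0; i < n; i++) {
        total += len(cps[i]);
    }
    return total;
}
```

For the three code points U+65E5 U+672C U+8A9E on their own, this gives 9 bytes in UTF-8 versus 6 in UTF-16. Wrapped in `<p>` and `</p>` (seven ASCII characters), it gives 16 bytes in UTF-8 versus 20 in UTF-16, so the balance tips once enough markup is mixed in.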

------
freecodyx
>>Don't get "voilà" confused with "viola". One means "there it is" and the
other is a tiny string instrument. Yes, I did spend two hours drawing a viola
just to mention that

madness,

~~~
tomcam
Since we’re correcting people... Viola is not tiny. It’s larger than a violin,
though smaller than a cello.

------
rurban
Hi, why no RLE-encoded string length? You then cannot access the string
directly, but you can handle arbitrary string lengths and still save a lot of
space. Inlined strings for short strings, or interned strings for faster
comparison and sorting, are also important types. Ropes, maybe, also.

------
oldandtired
If you want to see strings implemented as immutable representations, look to
the Unicon/Icon VM representation. This has been in effective use for well
over 30 years and has all sorts of nice properties.

------
james-mcelwain
This book is the best!

