Hacker News
Fast Lua Serialization (2023) (artemis.sh)
87 points by synergy20 10 months ago | 50 comments



Kinda ironic that the article doesn't discuss it, but Lua was originally a data loading and simulation driving language for Fortran. Lua was the serialization format.

Lua itself can load millions of records per second from disk.


Lua was not originally a serialization format. It was a data entry format, written by hand, that generated the files fed into simulations. Load speed was very important, but they weren't generating Lua. It has since been used as a serialization format (for example, World of Warcraft saves UI state by generating Lua files), but that came along much later.

https://www.lua.org/history.html covers the early history.


I'm going to need to hear more about it because this is absolutely wild.



Good link. But for those who won't bother reading[1], that paper doesn't support the Fortran assertion. Fortran is only mentioned twice in the HOPL conference paper, once regarding syntax familiarity (lack of semicolons), and again in the context of FFI. In neither case is Fortran mentioned in a privileged manner, but rather alongside multiple other languages to highlight non-C, non-C++-specific considerations and motivations.

I know there are Fortran shops that make profitable use of Lua as both an embedded extension language and as a glue language. But Fortran users do so in the same way Lua is used with C, C++, and other languages. That merely reflects that the Lua creators had a good understanding of real-world software ecosystems, and were successful in shaping the design and implementation to minimize impedance mismatches between Lua and other environments, across syntax, semantics, API, and various implementation details that are invariably ignored or overlooked in comparable languages.

[1] For those who like reading, see https://www.lua.org/docs.html#papers


I second this (and see the link in the response). Cliché, I'm sorry, but I'll be looking for the reasoning in TFA about 1-based versus 0-based indexing. Better for clerks, without programming in mind?


What would be the source for that first sentence of yours?


These days, LuaJIT 2.1 also comes with a highly specialized serializer.

https://luajit.org/ext_buffer.html#serialize


LuaJIT buffer serialization is amazingly fast and also surprisingly space-efficient. This should really be at the top of the thread.


As another extremely simple approach: in my personal experience, just using regular Lua table syntax for serialization, pre-compiling it with luac, and loading it via dofile() produced load times just as fast as lua-protobuf. (I don't remember the exact difference between writing out Lua tables and lua-protobuf, because my use case read the data far more often than it generated it, but it can't have been big, or I would probably remember it.) I was loading gigabytes of data for large batch processing.
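A minimal sketch of that approach (hypothetical helper, handling only flat tables with string keys and string/number values):

```lua
-- Hypothetical sketch: emit a table as Lua source that returns the table.
-- Only handles flat tables with string keys and string/number values.
local function serialize(t)
  local parts = { "return {" }
  for k, v in pairs(t) do
    if type(v) == "string" then v = string.format("%q", v) end
    parts[#parts + 1] = string.format("  [%q] = %s,", k, v)
  end
  parts[#parts + 1] = "}"
  return table.concat(parts, "\n")
end

-- Write the result to data.lua, optionally pre-compile it with
-- `luac -o data.luac data.lua`, then load either file:
--   local t = dofile("data.luac")  -- dofile loads bytecode as well as source
local t = load(serialize({ a = 1, b = "x" }))()
assert(t.a == 1 and t.b == "x")
```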


We did exactly the first half at my previous job, because JSON was really painful to use as is (and none of JSON5 or HOCON or whatever were popular enough at the time). The second half was substantially different due to our idiosyncratic requirements, though.


For what it’s worth, luac has been deprecated now.

[EDIT] sorry folks! Must have misunderstood something I read a while back. Trying to dig it up now. But I could’ve sworn I’d read somewhere that this at least wasn’t suggested, and I thought it was also removed from the build in a 5.4 patch. Will circle back if I find what I’m looking for.


Where did you see that? I'm skeptical of that claim because I know some embedded uses of Lua strongly utilize pre-compiling to Lua bytecode so they can keep their hardware as slim as possible. I know some will even strip out the Lua compiler from the binary they ship to shrink the profile even more. Also, many video games like to pre-compile their shipped Lua scripts to reduce their load times.

I know that the Lua team has internal private repositories, and that luac is developed in a separate repo for them. I've seen occasional reports on the Lua mailing list that luac.c is forgotten or not updated in non-official releases. That is because those source drops didn't go through the full official release process which includes merging in from their luac repo. Maybe you are confusing these intermediate source drops with deprecation? If there is deprecation, I would like to see the details on that. I presume they would be introducing some kind of replacement that addresses all the real world use cases that rely on the abilities of pre-compiling to Lua bytecode.


> But I could’ve sworn I’d read somewhere that this at least wasn’t suggested

Maybe you’re thinking about the use of untrusted bytecode? Loading it is strongly discouraged because that’s insecure - Lua has nothing like a bytecode verifier.
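One concrete way to enforce this (a sketch; Lua 5.2+): pass mode "t" to load so binary chunks are refused before they ever run.

```lua
-- load(chunk, chunkname, mode, env): mode "t" accepts text-only chunks,
-- so precompiled bytecode is rejected outright.
local bytecode = string.dump(function() return 1 end)
local f, err = load(bytecode, "=untrusted", "t", {})
assert(f == nil and err ~= nil)  -- binary chunk refused in "t" mode

-- Plain source still loads (here with an empty table as a crude env sandbox):
local g = load("return 1 + 1", "=untrusted", "t", {})
assert(g() == 2)
```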


Are you sure? It doesn't look like it has to me.


Would that be safe even with untrusted data? (Assuming you were the one who serialized and compiled the table)


Lua should be robust in the face of untrusted source code. However untrusted bytecode is unsafe.


Are you asking if there are special numbers or strings or some other kind of plain data that would glitch out the interpreter when they're read back in? That would be an impressive failure of a programming language.

Yes it's safe as long as you're serializing correctly (which isn't very hard).


By definition you can't trust that "untrusted data" has been serialized correctly.

Lua has strong sandboxing capabilities (i.e. the ability to limit and control the environment visible from a chunk of code), but the Lua authors years ago explicitly disclaimed the ability to sandbox untrusted code. The compiler and runtime are not bug-free, and they don't have the resources to guarantee that, though compared to any other non-formally-verified implementation their track record is pretty decent, even against past and current efforts from Sun, Microsoft, Mozilla, and Google. If you want to run untrusted Lua code, Lua sandboxing should be just the first of multiple lines of defense, just as modern web browsers rely on various operating system mechanisms to constrain breakouts.


> By definition you can't trust that "untrusted data" has been serialized correctly.

Please read the comment again. They said the person loading the data is the same person that serialized it. The data came from an untrusted source prior to being serialized.

For example, consider a guestbook program. People send it untrusted text and then the program serializes it into a database. Reading the database back is safe.




Lua has its own built-in low-level serialization, in the form of string.pack() and string.unpack()[1]. This is akin to Perl's pack()[2].

Note that this is a tool for laying out bytes in a certain order (like a C struct), not a general "serialization framework" for traversing data structures and whatnot. But it can be the building block for one.

[1] https://www.lua.org/manual/5.4/manual.html#6.4.2

[2] https://perldoc.perl.org/functions/pack
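For example (Lua 5.3+):

```lua
-- Pack two little-endian 32-bit integers and a length-prefixed string,
-- then read them back; unpack also returns the position after the last item.
local packed = string.pack("<i4i4s4", 10, 20, "hello")
local a, b, s, pos = string.unpack("<i4i4s4", packed)
assert(a == 10 and b == 20 and s == "hello")
assert(pos == #packed + 1)
```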


As the author of org.conman.cbor, I'd be interested in seeing the test data. The only bit that's in C is the low-level bit packing of CBOR values. I might have to revisit the use of LPEG to validate UTF-8 strings (used because Lua 5.1/5.2 do not have UTF-8 support routines). This also supports references, which are nice (they allow one to encode a self-referential Lua table like "x = {} x[1] = x") but might slow things down as well.

Also included in the package is a stripped-down version of my CBOR encoder, about 1/3 the size, that doesn't support CBOR tagging, so I wonder how that compares against lua_cbor.


I used to use msgpack when it was schemaful, back when Avro, Protobuf, Thrift, and msgpack were competing against each other and minimum message size was the benchmark.

New versions are schemaless, and I simply stayed on the old one. 10 years using it happily.

I did have to tweak the Java version to handle strings over 32 bytes, though; it was failing from Flutter.


As a connoisseur of serialization formats, this is intriguing - I wasn't aware there was a schemaful version of msgpack once, and an ad-hoc search doesn't reveal any leads.

Can you point me to some information on how that older format worked?


> It correctly handles the strange array/map duality of lua tables in the most efficient way it can do safely (serializing both variants concurrently and only writing out the correct one at the end)

The article doesn’t expand on this - what are they referring to? Are Lua tables arrays?

Serializing something common like tables 2x seems like a massive overhead.


Lua doesn't have a separate "array" type, as arrays are just a special case of key->value where the key is always a positive integer. While this is nice (it simplifies the language yet Lua tables are incredibly versatile), it does indeed make it a little more work to translate structures to languages/serializations that distinguish between the two types, if the serializer needs to guess at runtime (e.g. if you don't have a predefined schema).

I can't comment on the efficiency of whatever this implementation is doing, I haven't read the code. It does sound a little expensive. Also it's not always 100% clear which the correct mapping is, unless the developer explicitly picks one. For example, a serializer has no way to determine whether '{}' is intended to be an empty array or an empty hash.

Another common gotcha, e.g. when translating to JSON, is that Lua does not restrict the types of keys (while JSON objects only support string keys), so boolean true/false are possible keys, for example, which cannot be translated to JSON directly.
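A naive runtime heuristic for the array-vs-map guess might look like this (my own sketch, not the article's implementation):

```lua
-- Treat a table as an array only if its keys are exactly 1..n with no holes.
local function is_array(t)
  local count = 0
  for _ in pairs(t) do count = count + 1 end
  for i = 1, count do
    if t[i] == nil then return false end
  end
  return true  -- note: {} passes, so empty array vs empty map stays ambiguous
end

assert(is_array({ "a", "b", "c" }) == true)
assert(is_array({ x = 1 }) == false)
assert(is_array({}) == true)  -- the ambiguous case mentioned above
```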


I ended up drawing the same conclusion while implementing my language. An insertion ordered hash table turned out to be just a normal array of values where the keys map to array indexes instead. Other than speed, I'm struggling to think of a good reason to keep the dedicated array type...


Note that the Lua VM still optimizes tables that look like arrays by storing their values in a separate array part rather than in the hash table [1].

[1]: https://www.lua.org/gems/sample.pdf


This is really interesting. So it doesn't hash integer keys, it uses them directly as indexes in a separate array. How is order maintained between the two arrays though?


Badly. In addition to https://news.ycombinator.com/item?id=41146240 Lua has two iterators:

  for k, v in pairs(t) do … end
  for i, v in ipairs(t) do … end
ipairs just counts from 1 to n until t[n] == nil. pairs returns all keys (array keys too) in an order that is undefined.

Merged dict and array looks neat on the syntactic level but really is a pain to deal with. It’s a questionable design in itself, and then you go outside and realize that you’re incompatible with everything else because {} and [] are two different things.


Yeah, I assumed as much... Looks like order just isn't maintained at all. Stopping at the first nil is pretty confusing and surprising to me, it implies a loop over an array full of nils would not iterate even once which makes no sense. Javascript does it right: array length is the greatest numerical index in the object plus one.

Abstraction seemed neat at first but completely fell apart in the details. Many such cases. Looks like a true vector type is a good idea after all.

> Merged dict and array looks neat on syntactic level but really is pita to deal with.

It doesn't have to be that way.

An insertion ordered hash table also contains an array part, it's just that it's completely internal to the implementation. It behaves just like a normal hash table, but internally the keys hash to indexes in the array part. The order of the hashes is undefined but the order of the key-value pairs in the array part is defined to be the order they were inserted into the table.

It's really neat and it seems to work really well. It behaves exactly like an array if used like one. It definitely isn't an array though: integer keys are not offsets, they are hashed like every other key type. This doesn't seem like a huge problem to me. The lack of this property seems to be the root cause of the complexity of Lua's tables.

It's tempting to think of these insertion ordered hash tables as some kind of superset of arrays. Just like it's tempting to think of hypergraphs as supersets of graphs, of graphs as supersets of trees, and of trees as supersets of linked lists. I haven't learned of any cracks in this abstraction, at least not yet. The reason to prefer less general data structures always seems to be performance, not correctness. Offsetting a pointer is faster than hashing some bytes, indexing two arrays, comparing keys for equality and then offsetting a pointer.


>array full of nils

which in Lua are indistinguishable, both in theory and practice, from a newly created empty array: {}


Interesting side-effect of newer Python dictionaries preserving insertion order too.

Explained by Hettinger here:

https://www.youtube.com/watch?v=npw4s1QTmPg


I learned it from his email to the Python mailing list!

https://mail.python.org/pipermail/python-dev/2012-December/1...

https://github.com/lone-lang/lone/commit/da13b67a61248b832b2...

I didn't know he had done a presentation about it, thank you for posting that video!


The biggest Lua gotcha is that accessing a missing element is not an error. Requires a hack to distinguish “really null”.


In my opinion this is actually kinda awesome, because it’s as if every element in the domain maps to an element in the range, so there are no disjoint or undefined mappings in a table. It’s not how we’re used to thinking but when the logic is designed correctly it’s pretty useful and easy.

Should accessing an absent record really be an _error_? Shouldn’t it just be a “None” or “Missing” option or something like that instead? It doesn’t seem to me like indexing a set with an absent key is really an error, it’s just not a value.


There are definitely different approaches with different tradeoffs. Lua tends to prioritize implementation simplicity so I would guess that's a factor here.

Ideally, yeah, you'd return the Nothing side of an Option type, but that's not representable in Lua's type system. Returning an {:ok, data} tuple Erlang-style is a pretty solid middle ground, and multiple return of ok, result is a common convention in Lua.

But throwing an error has advantages too. There is a potentially important semantic difference between an unused key and null data. The difference becomes especially important when you go mixing data types, IMO. I'm begrudgingly fine with a hashmap returning null for unset keys, but not with an array returning it for an out-of-bounds index. With Lua's approach you can't easily differentiate these things, and it does cause serious problems. Everyone hated this in PHP; I don't see why it's such a popular choice when Lua does it.
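The nil-plus-message return convention mentioned above looks like this (a sketch with a hypothetical helper name):

```lua
-- Idiomatic Lua error convention: return the value, or nil plus a message.
local function strict_get(t, k)
  local v = t[k]
  if v == nil then
    return nil, ("missing key: %s"):format(tostring(k))
  end
  return v
end

local v, err = strict_get({ a = 1 }, "b")
assert(v == nil and err == "missing key: b")
assert(strict_get({ a = 1 }, "a") == 1)
```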


The problem with bounds errors is that they bring little value for a correct program, and there are better ways to help with correctness than a runtime error (but that transcends Lua-like languages).

Looking at Python, it's more often irritating to have to convert to .get() after an error than it is to receive an error and realize it was helpful. Code that optimistically assumes non-null results rarely survives the next few lines after getting nulls anyway.


Why would it be an _error_ to evaluate a function which is not defined for the given element?


The Lua gotcha is not this.

There’s no difference between “nil” and “exists but nil”, because nil means “doesn’t exist” by definition. The gotcha is the # operator, which when applied to a table may return an incorrect “length” for its array part if that part has holes. For example, #{1,2,nil,4} can be 2 or 4.

It happens because Lua arrays are not “just dicts with integer keys”, as someone said above. A Lua table implementation has two parts: one for the dict and one for the array. # basically returns “impl(t).array.length”. The array part gets populated and depopulated heuristically, creating non-deterministic outcomes for #.

This has nothing to do with bounds or existence tests. “Really null” is an attempt to have nil in an array, which Lua doesn’t have by design.
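To make this concrete (the holey result is deliberately left unasserted, since the manual leaves it undefined):

```lua
-- # is the "border" operator: well-defined only for sequences without holes.
assert(#{1, 2, 3} == 3)

local holey = {1, 2, nil, 4}
-- #holey may legitimately be 2 or 4, depending on how the table was built
-- and which internal part (array or hash) ended up holding index 4.
local n = #holey
assert(n == 2 or n == 4)
```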


Most often seen with something like JSON serialization, where a JSON null results in the key being completely absent from the Lua table, so re-encoding into JSON drops that key entirely instead of emitting it with a null value.

It's why almost every serialization library has userdata values to represent "null", and similar sentinels for things like empty arrays versus empty objects.

Still love Lua though.
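A typical sentinel pattern looks like this (sketch with hypothetical names; real libraries expose their own, e.g. lua-cjson's cjson.null):

```lua
-- A unique table works as a sentinel because it compares equal only to itself.
local NULL = setmetatable({}, { __tostring = function() return "null" end })

local decoded = { name = "alice", age = NULL }  -- age was null in the source JSON
assert(decoded.age == NULL)               -- present, explicitly null
assert(rawget(decoded, "email") == nil)   -- genuinely absent key
assert(tostring(NULL) == "null")
```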


Actually, JavaScript is even worse: "undefined" and "not defined" are two different things, in addition to null.


It’s really “not exists”, although js’s own lexicon uses “define property” instead of “create property”. They don’t hesitate to “delete” it though.

Apart from names, I find explicit models easier to work with, and they allow for better designs. E.g. in Lua, if t has mt.__index = {x=1}, then t.x == nil is non-representable, because t.x = nil deletes t.x, and then .x proxies into __index, which returns 1. In JavaScript, since existence is a separate thing, you can assign literally any value to t.x and it will not be proxied to the prototype until you delete t.x. This has a whole cascade of consequences that make your meta life easier.

Undefined vs null is still a mess though, but not because the two exist, but because standard APIs use and return them mostly arbitrarily.
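The Lua behavior described here, in a few lines:

```lua
local proto = { x = 1 }
local t = setmetatable({}, { __index = proto })

assert(t.x == 1)  -- no own key, so lookup proxies to proto
t.x = 5
assert(t.x == 5)  -- own key shadows proto
t.x = nil         -- assigning nil deletes the own key...
assert(t.x == 1)  -- ...so t.x == nil is indeed non-representable here
```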


No matter what system you use, there are always unrepresentable states like that. If you have javascript-style properties, then t.x cannot represent "the object lacks property x".

I'm fine not having explicit property existence, and prefer throwing out the complexity needed to support it. Maybe I'm underestimating the usefulness, but I think in most situations I'd be happy with either setting to false or using a plain old initial value instead of __index, and in the rest of cases I could get the same effect by having __newindex disable __index for that key.


"t lacks x" is represented by !Object.hasOwn(t, ‘x’), which lives on an explicit plane with in, delete, etc. While there is room for further argument, in my own practice the JS style is absolutely superior to Lua's minimal style when it comes to meta, and also in general (although the latter is much less critical; I rarely make such distinctions in “user” code). The goal is not to find a universally infinite looped superidea, but to have one that is practical and no less.

The best Lua way to do it imo is to not do it at all and use plain tables for data. Reinventing something that was omitted by design, and broken by that same design, is futile. Lua is as it is. You use it as is, or you fight with it, or you choose something else.


> hasOwn

Okay, I guess I should have elaborated. I'm talking about the situation where you'd use "in" because you want properties on the prototype to count. HasOwn would not work there. There is no way to default via prototype to "yes it exists" but override that with "no it does not exist".

> The goal is not to find a universally infinite looped superidea, but to have one that is practical and no less.

I find nil plenty practical when I use Lua. Even when working in Javascript, I have never felt the need to override a prototype value with null or undefined.


A Lua table is a data structure that contains both an array part and a hash-map part, i.e. it accepts both indexed and keyed entries. E.g.:

  myTable[1] = "red"

  myTable.name = "Color Codes"


For the actual language, there are no "parts". Every table is a map, and the implementation could do several things behind the scenes.

The issue is not knowing if the table can be encoded as an array, since that's more compact in many encodings.


In my Lua/CBOR use, what I did was use a custom tag (I think the ID was 0x7AB1E, for "TABLE") whose content is an array composed of a sub-array and a map. The sub-array contains the array elements of the table up to the first nil, and the map contains the rest. Worked well enough for my use.
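That split can be sketched like this (my own sketch of the described scheme, not the actual org.conman.cbor code):

```lua
-- Split a table into its array prefix (indices 1..n, up to the first nil)
-- and a map holding every remaining key.
local function split_table(t)
  local arr, map, n = {}, {}, 0
  while t[n + 1] ~= nil do
    n = n + 1
    arr[n] = t[n]
  end
  for k, v in pairs(t) do
    local in_prefix = type(k) == "number" and k >= 1 and k <= n and k % 1 == 0
    if not in_prefix then map[k] = v end
  end
  return arr, map
end

local arr, map = split_table({ "a", "b", color = "red" })
assert(#arr == 2 and arr[1] == "a" and arr[2] == "b")
assert(map.color == "red")
```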



