I agree. Focusing on reading rather than updates should be the priority, since bulk scans are more common than random poking. E.g., an approach in Nim: https://github.com/c-blake/nio
I agree with the author that it's annoying that any change that alters the header length (even trivial stuff like renaming tensors) requires rewriting the entire file. Though I think there is an easier way to solve this: add some padding to the header.
More formally, I'd keep the existing safetensors format exactly as it is (i.e., first the header size, then the header, then the data), but also do the following:
- First, research how long the JSON headers typically are in practice (a KB? an MB? larger?).
- Pick a number that would be the "ideal header size", i.e. one that is convenient to handle and that 99% of all headers will fall under. Let's say 1024.
- When writing a new file, add the constraint that the header size must always be a multiple of that number, i.e. round the actual header size up to the next multiple and fill the remaining bytes with whitespace (a sketch of this follows the list).
- That should give you a situation where most files effectively have a fixed header size of 1024+8 bytes, a small portion have a header of 2048+8 bytes, and a very small number of huge files have a larger one.
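A minimal sketch of the write side in Python. The 8-byte little-endian size prefix is the standard safetensors framing; `HEADER_BLOCK` and the function name are hypothetical, not from any library:

```python
import json
import struct

HEADER_BLOCK = 1024  # the hypothetical "ideal header size" from the list above

def write_padded_safetensors(path, header, tensor_bytes):
    # Serialize the header and round its size up to a multiple of HEADER_BLOCK,
    # filling the slack with spaces (whitespace after a JSON document is legal).
    raw = json.dumps(header, separators=(",", ":")).encode("utf-8")
    padded = ((len(raw) + HEADER_BLOCK - 1) // HEADER_BLOCK) * HEADER_BLOCK
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", padded))          # 8-byte little-endian size
        f.write(raw + b" " * (padded - len(raw)))   # header + space padding
        f.write(tensor_bytes)                       # raw tensor data
```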
When updating a file, you can now distinguish two cases:
- Does the header size change, e.g. from 1024 to 2048? Tough luck, you have to rewrite the entire file.
- Does it stay the same? Then you can update the header in place or even append new tensors to the end of the file. This should be the common case (sketched below).
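Something like the following would distinguish the two cases; again a hedged sketch with made-up names, not an existing API:

```python
import json
import struct

def try_update_header_in_place(path, new_header, block=1024):
    # Returns True if the header could be rewritten in place, False if the
    # padded size changed and a full file rewrite is unavoidable.
    raw = json.dumps(new_header, separators=(",", ":")).encode("utf-8")
    new_padded = ((len(raw) + block - 1) // block) * block
    with open(path, "r+b") as f:
        (old_size,) = struct.unpack("<Q", f.read(8))
        if old_size != new_padded:
            return False                        # crossed a block boundary
        f.seek(8)
        f.write(raw + b" " * (new_padded - len(raw)))
    return True
```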
(It would also be useful to keep the header information in memory, separate from the data, so you can quickly calculate the size difference and write the new header without an additional pass over the data.)
Reading the file is the same as reading standard safetensors files; you just have to make sure your JSON parser is tolerant of trailing whitespace.
Padding is better and does solve many things, but it's still in the file. The external header/metadata approach of nio lets you separate concerns like slicing rows and columns (for 2D, anyway) in a shell pipeline. This is nice if you want to do relational-database-like things, but with a DB that is fully binary, avoiding ASCII -> bin (parse) and even more costly bin -> ASCII (print) cycles. The mental model is closer to a "text workflow", though.
To follow up on the general thought, another thing padding can help with, which you/the article don't mention, is "many time series in a matrix format". In one index order you can just add a row, but in the transposed index order you need to widen the matrix. To avoid a full rewrite you can instead pad with NA ahead of time. That still leaves O(nObjs) tiny updates to flush to disk, which could even be many 4k page writes, but for long time series it can still be more efficient. https://github.com/c-blake/nio/blob/ffb671ce23b1b77899c3bcff... has a `proc upstacks` command to do this and https://github.com/c-blake/nio/tree/main/demo/timeSeries has a fully worked example.
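A generic NumPy sketch of that idea (this is not how nio's `upstacks` is implemented; names and the memmap layout are assumptions):

```python
import numpy as np

def create_series_matrix(path, n_series, capacity):
    # One row per series, columns over-allocated to `capacity` and NA-padded,
    # so the matrix never has to be widened (and thus rewritten) per step.
    m = np.memmap(path, dtype=np.float64, mode="w+", shape=(n_series, capacity))
    m[:] = np.nan
    m.flush()
    return m

def append_step(m, t, values):
    # O(nObjs) tiny in-place writes; for long rows each lands on a different
    # 4k page, which is the flushing cost mentioned above.
    m[:, t] = values
    m.flush()
```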
You could also imagine a linked list of header chunks. If the header is not large enough you could then add another chunk and make the previous chunk point to it.
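One possible on-disk shape for that, purely hypothetical and not part of safetensors: each chunk carries its payload length and the offset of the next chunk, with 0 terminating the list.

```python
import struct

def read_chunked_header(f):
    # Each chunk is [u64 payload_len][u64 next_chunk_offset][payload];
    # next_chunk_offset == 0 marks the final chunk.
    parts, offset = [], 0
    while True:
        f.seek(offset)
        length, nxt = struct.unpack("<QQ", f.read(16))
        parts.append(f.read(length))
        if nxt == 0:
            return b"".join(parts)
        offset = nxt
```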
The problem of partial updates without rewriting a whole large file is old and has been solved many times. Some old, widely known examples: zip files, and the WAD and PK files from id Software games. Any of those approaches could have solved the problem of efficient updates while protecting against a short read. (Or served as a warning, if we look at the history of zip files.)
One of the lessons I received while growing as a software engineer is to assess if I could be the first person ever encountering a particular problem. Most often I am not, so existing solutions can be found and are worth looking at, and maybe directly reusing.
Not for writing. But you'll have fun reading this file.
Bonus challenges (a sketch addressing both follows the list):
- Read the file from STDIN or from a socket, where you only get one pass over the data without buffering.
- Read the file, but the file is truncated by one byte (i.e. the last byte is missing).
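For what it's worth, the length-prefixed header makes both challenges tractable in a single pass; here is a Python sketch, assuming the standard safetensors header fields (`data_offsets` per tensor, optional `__metadata__`):

```python
import json
import struct
import sys

def read_exact(stream, n):
    # Loop until exactly n bytes arrive; a truncated file (or byte-short
    # socket) surfaces here as EOFError instead of silently bad data.
    buf = b""
    while len(buf) < n:
        chunk = stream.read(n - len(buf))
        if not chunk:
            raise EOFError(f"short read: wanted {n} bytes, got {len(buf)}")
        buf += chunk
    return buf

def read_safetensors_stream(stream=sys.stdin.buffer):
    (header_len,) = struct.unpack("<Q", read_exact(stream, 8))
    header = json.loads(read_exact(stream, header_len))
    # The data section's length is the largest end offset in the header.
    end = max((v["data_offsets"][1] for k, v in header.items()
               if k != "__metadata__"), default=0)
    return header, read_exact(stream, end)
```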
With all due respect, this sounds like someone repeating the mistakes of the "end of central directory" record of zipfiles.