Show HN: BSON Extension for Postgres (github.com/buzzm)
115 points by buzzm 4 months ago | 27 comments
JSON support in postgres is superb but sometimes you really want decimal, date, and binary types, "carefree" UTF8 string handling (i.e. no escaping), and robust roundtrippability. So I made an extension for BSON.
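
A rough sketch of what usage could look like, going by the SQL script in the repo: the extension name pgbson is taken from pgbson--2.0.sql, but the bson column type and everything else here are assumptions rather than the documented API, so check the README for the real definitions.

    -- assumption: the extension registers a "bson" column type; see pgbson--2.0.sql for the actual objects it creates
    CREATE EXTENSION pgbson;

    CREATE TABLE orders (
        id  serial PRIMARY KEY,
        doc bson  -- decimal, date, and binary fields keep their types across round trips
    );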



The code here is really well designed. This project can serve as a tutorial on how to build a Postgres extension. Have a look at this:

https://github.com/buzzm/postgresbson/blob/main/pgbson--2.0....

and this:

https://github.com/buzzm/postgresbson/blob/main/pgbson.c

Really nice stuff.


That's the first time I've heard of BSON. How does it compare to CBOR and MessagePack?


For cross-platform transactional queues we found BSON rather reliable as a message body format in AMQP over SSL; we tag the protocol handler with the user UUID and API session key. The uptime on that system is nearing 4 years, and it's the one part of the system that has been low drama.

It is not meant for large reporting payloads like live charts, etc. Parsers can be unstable if they ignore type checks, precedence, and nesting depth.

The original libbson author wouldn't give me a donation link when I asked; they probably thought it was some sort of scam. The library merged into MongoDB years ago. I would not touch MongoDB nowadays given its current license and its propensity to implode key structures on upgrades. =)


It started as an internal MongoDB format, so it has a different history and context. But a comparison today might be interesting.


Why BSON and not JSONB, which is already supported in Postgres?


They have similar names but solve completely different problems. JSONB is just a binary representation of JSON, with the same limited set of data types. It doesn't provide "decimal, date, and binary types", which OP identified as part of the draw to BSON.


JSONB is optimized for traversal and compressibility. Is BSON so-optimized too?


It originated with MongoDB as their primary storage and interchange format, so I'd assume it at least was meant to be optimized for that purpose.

Either way, like I said, JSONB and BSON are solving slightly different problems—you probably would not choose BSON if you don't need the extra data types, and if you do need the extra data types JSONB can't help you any more than JSON can.


BSON is more like traditional JSON. All collection elements are terminated with a specific sub-element (just like ']' and '}' in JSON), so you don't need the complete document before outputting the first byte. That makes it streamable and memory-efficient on output.

But this also affects decoding. Not being prefixed with a size means the amount of memory you need to allocate is unknown when decoding starts, so you may have to over-allocate or grow the buffer on demand.

Besides this, BSON is also an exchange format. There is a full spec and you can use it outside of MongoDB. (For example, need to write an object that contains an ArrayBuffer to disk? Just serialize it as BSON and write it out.) JSONB in Postgres, by contrast, is just a DB-internal representation; you never actually see the binary format anywhere.


> All collection elements are ended with specific sub element (just like ']' '}' in JSON)

I don't think so. Documents and arrays are prefixed with their total size in bytes (a little-endian int32), so a reader knows the length up front rather than scanning for a closing marker. Only element names are NUL-terminated strings.


JSONB is not optimized for compression.


> It doesn't provide "decimal [...] types", which OP identified as part of the draw to BSON.

This part is misinformation, as json[b] works perfectly fine with decimals (well, numeric, which is an arbitrary-precision decimal type).


JSON[B] does work perfectly fine with decimal and float inside Postgres and pgsql, e.g. select (data->>'amount')::numeric. The issue is moving that data in and out of Postgres (or really, any other data manager) with precision. If the JSON[B] is {amount: 12.09} then you have no guarantee that someone picking it up outside of Postgres won't use float conversion, which could lead to 12.0899999. With BSON, amount is a decimal128 and stays that way, fromBytes() and toBytes().
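
To make the inline example concrete, here is a plain psql sketch with no extension involved: inside Postgres the jsonb number is held as numeric, so nothing is lost until a consumer outside the database parses the emitted text with a double-based JSON parser.

    -- exact inside Postgres: jsonb stores numbers as numeric
    SELECT ('{"amount": 12.09}'::jsonb ->> 'amount')::numeric;  -- 12.09

    -- and the text representation round-trips unchanged
    SELECT '{"amount": 12.09}'::jsonb::text;                    -- {"amount": 12.09}

The 12.0899999... only appears once a client decodes that text into a binary double; BSON sidesteps this by tagging the field as decimal128 in the bytes themselves.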


> If the JSON[B] is {amount: 12.09} then you have no guarantee that someone picking it up outside of Postgres won't use float conversion, which could lead to 12.0899999. With BSON, amount is a decimal128 and stays that way, fromBytes() and toBytes().

But how is that relevant? With BSON as the transfer format you still have no guarantee that the client won't misinterpret the data it receives by using a binary float as its own internal data format after deserialization.

All languages I know have libraries that can decode the numbers from JSON strings into their relevant arbitrary-precision decimal type. If the user doesn't use that, why would it be a failure of JSON[B]/PostgreSQL?


With BSON I am guaranteed that, for example, in Java, when I do `Document d = BSON.fromBytes()`, then `BigDecimal amount = d.getDecimal128('amount')` is non-lossy. `amount` is typed; `double amount = d.getDouble('amount')` is an error. Again, it's not a failure of JSON[B]/Postgres. It's what happens when the data carrier (JSON) is just a string with a small inferred type suite. How many times have we run into issues when `{"v": 1.3}` works and `{"v": 1.0}` works but `{"v":1}` gets turned into an `int` by accident?


Fair. A charitable interpretation would be that they're referring to the fact that in BSON you can specify the exact number type you want to use [0], whereas in JSONB you just have one shared number type for all number fields.

[0] https://www.mongodb.com/docs/manual/reference/bson-types/



It says in the posted comment.


This is really cool. A common use case for me is building JSON objects directly in my query, for example to return a list of JSON objects. Usually this means date columns lose their type. Is there a way of returning BSON, like with jsonb_build_object, that keeps these types?
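
For reference, this is the type loss being described, in plain Postgres with nothing extension-specific (whether the extension offers a BSON-building equivalent is exactly the question above):

    -- the date is serialized as an ISO-8601 JSON string; the fact that it was a date is gone
    SELECT jsonb_build_object('created', date '2024-01-15');
    -- {"created": "2024-01-15"}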


Perhaps this is an opportunity to ask somebody who might know about BSON performance. As a POC/stress test for work I added two multi-GB datasets to Postgres (as JSONB) and to Mongo (BSON). While trying to query approximately a hundred megabytes of data (a few hundred documents) from each I found that Postgres executed the query and decoded the JSON data in under a second, while it took Mongo a few seconds. Does this mean that BSON is slow to deserialize? Or perhaps it is not related to serialization? I was quite confused.


Did you do an EXPLAIN ANALYZE on the PostgreSQL query (and perhaps the equivalent explain() on the MongoDB side)?

It might help to determine whether the actual query or the deserialization was the bottleneck.
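
Something along these lines, with a made-up table, column, and predicate; EXPLAIN ANALYZE executes the query but discards the result set, so it isolates server-side planning and execution from the time the client spends receiving and decoding rows.

    EXPLAIN (ANALYZE, BUFFERS)
    SELECT data
    FROM   documents
    WHERE  data @> '{"region": "EU"}'::jsonb;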


The listed status is "experimental", but there have been only two relatively minor commits in the last two years. Maybe it's more stable than that implies, or the author is looking to pick it back up?


I've used the BSON Java ObjectId for distributed primary key generation for a long time. Really useful for distributed systems.


It's specifically advised NOT to use ObjectId for distribution. The type has a timestamp embedded in it, which results in monotonically increasing ObjectIds and therefore hotspots when keys are range-partitioned. Use a UUID or a hashed version of the ObjectId.


As I understand it, it was 'Tested using Postgres 14.4'.

I'm wondering if there are any plans to support Postgres versions 15 and 16?


Retested on OS X 13.2 for both 15.5.3 and 16.1; bug discovered thanks to 16! Fixed and new version pushed.


Considering a random brew install bumped my postgres to 14.10.1 and I didn't even notice until 45 mins ago, maybe I should.



