Show HN: Pg_jsonschema – A Postgres extension for JSON validation (github.com/supabase)
203 points by oliverrice on July 21, 2022 | 56 comments
pg_jsonschema is a solution we're exploring for enforcing more structure on json- and jsonb-typed postgres columns.

We initially wrote the extension as an excuse to play with pgx, the rust framework for writing postgres extensions. That let us lean on existing rust libs for validation (jsonschema), so the extension's implementation is only 10 lines of code :)

https://github.com/supabase/pg_jsonschema/blob/fb7ab09bf6050...

happy to answer any questions!
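For anyone who doesn't want to click through, usage boils down to a check constraint calling a validation function. A rough sketch using the json_matches_schema(schema, instance) function from the repo (the table and schema below are just an illustration):

```
create extension pg_jsonschema;

create table customer(
    id serial primary key,
    metadata json,
    -- json_matches_schema(schema, instance) returns true when instance conforms,
    -- so non-conforming rows are rejected by the check constraint
    check (
        json_matches_schema(
            '{
                "type": "object",
                "properties": {
                    "tags": {
                        "type": "array",
                        "items": { "type": "string", "maxLength": 16 }
                    }
                }
            }',
            metadata
        )
    )
);
```

With that in place, an insert whose metadata has e.g. a numeric tag is rejected by the constraint instead of landing in the table.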




This is awesome -- really excited that Supabase is picking this up with their commitment to open source in general and PG in particular.

Some prior art:

- https://github.com/gavinwahl/postgres-json-schema (mentioned in the repo)

- https://github.com/furstenheim/is_jsonb_valid

pgx[0] is going to be pretty revolutionary for the postgres ecosystem I think -- there is so much functionality that would benefit from happening inside the database and I can't think of a language I want to use at the DB level more than Rust.

[0]: https://github.com/tcdi/pgx


I was working on a custom YugabyteDB rollout last year. Part of that work was building some custom Postgres extensions for additional multi-tenancy enforcement in a distributed cluster.

I must admit that the whole extension ecosystem in Postgres is amazing, even without any additional language layer.

I don’t write C very often but writing Postgres extensions was so easy to approach, it blew my mind!


The `jsonschema` crate author here.

First of all, this is an exciting use case; I didn't even anticipate it when I started `jsonschema` (it was my excuse to play with Rust). I am extremely pleased to see such a Postgres extension :)

At the moment it supports Drafts 4, 6, and 7, and partially supports Drafts 2019-09 and 2020-12. It would be really cool if we could collaborate on finishing support for those partially supported drafts! What do you think?

If you have any bug reports on the validation part, feel free to report them to our issue tracker - https://github.com/Stranger6667/jsonschema-rs/issues.

Re: performance - there are a couple of tricks I've been working on, so if anybody is interested in speeding this up, feel free to join here - https://github.com/Stranger6667/jsonschema-rs/pull/373

P.S. As for the "Prior Art" section, I think that https://github.com/jefbarn/pgx_json_schema should be mentioned there, as it is also based on `pgx` and `jsonschema`.


thanks for the great work on the jsonschema crate!

I didn't know about pgx_json_schema (or is_jsonb_valid) and will add both of them to the README today


This is absolutely brilliant.

In windmill, https://github.com/windmill-labs/windmill (self-hostable AWS Lambda, OSS AGPLv3), we infer the jsonschema of your script by doing static analysis, but so far we were not validating the payload itself; if your script failed because of an incorrect payload, that was your problem. Now, without any additional effort, I will be able to add validation and great error reporting "for free".


That project looks awesome, there are so many opportunities to take entrenched cloud products that leave a lot to be desired and turn them into smaller businesses that better serve the customers that AWS largely ignores due to their preference for volume.


Thanks, the self-hostable AWS Lambda is actually just one part. On top of it you have a flow builder and platform aiming to provide an OSS, self-hostable alternative to solutions like Pipedream, Airplane, and AWS Step. I am not necessarily aiming to replace AWS per se, because a great way to host it is EC2; what I want to reduce is lock-in into proprietary services and architecture complexity. Not every lambda needs to scale to 1M req/s.

Supabase is actually a great comparable, complement, and inspiration. Ideally, windmill allows tech teams and startups to focus on the business logic of their internal tools, not the infra to run it.


I particularly appreciate the project structure of the linked repo. It's good to see actual production application code beyond a hello-world todo-style app for languages like Rust and toolkits like Svelte. I look forward to going through the various pieces and learning them better by example.


Nice work. However, I am structurally dubious of putting too much functionality onto a classical centralized RDBMS, since it can't be scaled out if performance becomes a problem. It's CPU load, and it's tying up a connection, which is a large memory load (as implemented in postgres, connections are "expensive"), and since this occurs inside a transaction it's holding locks/etc as well. I know it's all compiled native code, so it's about as fast as it can be, but it's just a question of whether it's the right place to do that as a general concern.

I'd strongly prefer to have the application layer do generic json-schema validation since you can spawn arbitrary containers to spread the load. Obviously some things are unavoidable if you want to maintain foreign-key constraints or db-level check constraints/etc but people frown on check constraints sometimes as well. Semantic validity should be checked before it gets to the DB.

I was exploring a project with JSON generation views inside the database for coupling the DB directly to SOLR for direct data import, and while it worked fine (and performed fine with toy problems), that was always my concern... even there, where it's not holding write locks/etc, how much harder are you hitting the DB for stuff that, ultimately, can be done slower but more scalably in an application container?

YAGNI, I know, cross the bridge when it comes, but just as a blanket architectural concern that's not really where it belongs imo.

In my case at least, it's probably something that could be pushed off to followers in a leader-follower cluster as a kind of read replica, but I dunno if that's how it's implemented or not. "Read replicas" are a lot more fleshed out in Citus, Enterprise, and the other commercial offerings built on raw Postgres, iirc.


If data validation doesn't belong in the database, then what does? At that point you're treating your RDBMS as a fancy filesystem and drastically underutilizing the tool.

Centralizing data validation removes many potential failure and inconsistency scenarios (w.r.t. different services validating things differently).

Worrying about CPU, without seeing if it's a real problem for your use case, is a premature optimization. Similar to worrying about foreign key constraint checks being too expensive. This is rarely the case, but if it winds up being a problem, you can relax it later and move the check elsewhere in your stack (or remove it entirely).


> If data validation doesn't belong in the database, then what does? At that point you're treating your RDBMS as a fancy filesystem and drastically underutilizing the tool.

Data storage and querying. Validation should be done in the application layer for security and scalability.

"Premature optimization" is the rallying cry of people standing in the corner of a room with a painted floor holding a paint brush.


If you are thinking of SQL injection when you say "security", yes, you need to sanitize the input before querying the database. But that is different from validating that the sanitized data is correct. The database is generally a better place to do that. If scalability is a legitimate concern and not a premature optimization, then look at one of the distributed SQL databases (Spanner, TiDB, CockroachDB, Yugabyte), which are getting pretty good.


> Data storage and querying

Storage + querying is vastly underutilizing a tool that is designed to help you safely manage the entire lifecycle of your data – from defining that data (DDL), to making it accessible (queries, indexes), to ensuring consistency (checks, constraints, types, etc...), to managing cascading data events (triggers), and more!

It's a fool's errand to try to (poorly) replicate these things elsewhere when there's already a battle hardened system with decades of investment that allows you to do these things declaratively and consistently for free.

And it's much safer! The DB will enforce that validation even when services are doing massive migrations or new clients are mutating the data in ways you didn't anticipate. It's far less error prone than ensuring that all of your services are validating data the same way.

> Validation should be done in the application layer for security and scalability.

Input validation, for sure. Sanitize as close to the perimeter as you can. But for consistency and integrity of data, let the database do what it's great at.

Your application server will probably not do it as efficiently as your highly-tuned database. There's no reason to assume that pushing that work into the database isn't a better scalability strategy. And it's certainly a more robust strategy from an organizational perspective.

Fixing slow things is frequently trivial compared to fixing bad data or tracking down subtle data bugs. For example, borrowing from a real-world scenario in an Elixir service that uses Ecto to handle data mapping and validation: our code assumed a key in a JSON blob was a string, but in a handful of old rows it was a number. This inconsistency existed in only ~0.0005% of rows, and only in old data, but it surfaced in all sorts of weird ways and was difficult to track down.

Using the RDBMS to handle validation would have never allowed this to happen.

> "Premature optimization" is the rallying cry of people standing in the corner of a room with a painted floor holding a paint brush.

That strikes me as an uninformed opinion. Optimizations usually work by making lots of assumptions on the nature of your problem and then implementing something that depends on those assumptions always holding. This leads to more convoluted and fragile code, but in return you get performance.

Writing optimized code frequently involves removing optionality, which is the equivalent of painting yourself into a corner.


Saying "Premature optimization" is fine, as long as you leave yourself a way out. And the way out is clear here - when your database no longer can handle the load, drop the DB check constraints and move them over to application layer.

Could you please elaborate on the "security" part of your comment? To me it seems database level checks would be more secure, as they would work with every possible way to update data, rather than just the blessed paths going through the app layer.

And as far as scalability goes... it depends. If your ambition is to take on twitter, you probably need to take scaling seriously on every step along the way. However, I've been building mainly B2B systems for 15 years now, and just about all of them have been purring away happily on the low end of AWS RDS offerings. There are orders of magnitude more performance available at the click of a button. Database scalability is not something that keeps me up at night.


All this does is make writing a CHECK constraint for a JSON blob relatively straightforward. And yes you should use CHECKs in the database. In practice, it is very easy for application code to either not check or screw it up. It is much harder to find and clean up bad data in the database than prevent it from being inserted. By making it a schema config, it is much easier to audit by other engineers and more likely to be correct.

If CHECKs are causing perf issues, then okay maybe put the problematic ones somewhere else.


You could cost stored-procedure deployments in cryptographically-signed-redundant-data-storage bytes and penalize unnecessary complexity by costing opcodes (in the postgres database ~"virtual machine runtime", which supports many non-halting programming languages, such as rust, which is great to see), paying for redundant (smart contract output) consensus where it's worth it.

TIL about https://github.com/mulesoft-labs/json-ld-schema: "JSON Schema/SHACL based validation of JSON-LD documents". Perhaps we could see if that's something that should be done in pgx or in the application versions deployed alongside the database versions?


Doing no processing at the DB very often means more queries, and consequently a higher workload on the DB, than just doing the processing there. That's evident when looking at typical ORM usage (usually very chatty applications).

Toon Koppelaars shows that in a few experiments done with Oracle ( https://www.youtube.com/watch?v=8jiJDflpw4Y&ab_channel=Oracl... )

Also very often doing the work in the DB means a simpler solution, fewer race-condition issues, and less work overall.

When talking about this particular case, the CHECK constraint, it is probably a matter of measuring the impact, understanding at which point it might become relevant, and considering the cost of the alternative implementation at the application layer.


Scaling isn't really a concern for everybody and every application.

It's good to be aware of additional strain you put on your DB, but I've worked with lots of applications where scaling simply wasn't a problem, and there I'd prefer validations closer to the data.


Very cool!

I remember, when kicking the tires on postgrest/postgraphile, finding validation and error handling to be one of the less intuitive areas. Not the actual field-level constraints, but how to adapt it to fit a fast-fail vs slow-fail model.

When I had attempted before, the only ergonomic option was fast-fail (the first check constraint violated would bubble the error upward) rather than slow-fail (collect all invalid fields and return the collection of errors, which IME is more common on average web forms or api requests).

Looking at the single code file and tests, I see only singular field errors. Has a more ergonomic approach to validation-error collection been developed other than writing a large function to iterate the new record fieldwise against the schema?


> fast-fail (the first check constraint violated would bubble the error upward) rather than slow-fail (collect all invalid fields and return the collection of errors, which IME is more common on average web forms or api requests)

... > Has a more ergonomic approach to validation-error collection been developed other than writing a large function to iterate the new record fieldwise against the schema?

That's an interesting problem that makes sense in the context of form validation.

re: have the ergonomics improved -> no

but you could hack that behavior together like this:

```
create or replace function handle_errors(errors text[])
    returns bool
    language plpgsql
as $$
declare
    -- strip the nulls contributed by checks that passed, so a fully valid
    -- row leaves an empty array rather than an array of nulls
    messages text[] := array_remove(errors, null);
begin
    if messages = array[]::text[] then
        return true;
    end if;
    raise exception using
        errcode = '22000',
        message = (select string_agg(x, ', ') from unnest(messages) v(x));
end;
$$;

create table account(
    id int primary key,
    email text,
    first_name text,
    check (
        handle_errors(
            array[]::text[]
            || case when email is null then 'email must not be empty' else null end
            || case when first_name is null then 'first_name must not be empty' else null end
        )
    )
);

insert into account(id, email, first_name)
values (1, null, null);
-- ERROR: email must not be empty, first_name must not be empty
```

not that I'd recommend it :)


Even in the fast fail case it would help if it told you which constraint failed. In the example from the readme it says the row failed validation, but not why. Would be nice if it said something like “items of `tags` must be type string, but integer was given”. Or even just that it was an invalid item type found in tags.


Heh I really wish they supported this natively in the DB. This would be super useful for crud stuff if you could just set a GUC and get the fail slow mode, and get a list of errors returned to the client.


This is really cool. This will make it much simpler to convert from firestore to supabase. I think that's the last missing feature that firestore provided which supabase didn't.

We are already running a sync process between firestore and postgres, so we can do aggregations on JSON data. At this point it's only a matter of time before we move to Supabase.


It would be valuable to know which JSON-Schema it supports, since there are currently 4 different versions that differ in their capabilities (as one might expect). Related to that, does it honor the "$schema" key allowing the schema to declare which version it is?

The postgres-json-schema alternative that's mentioned in the repo also ships with what appears to be a conformance test suite; does this carry the same, or was the focus more on speed?


> 4 different versions that differ in their capabilities

pg_jsonschema is a (very) thin wrapper around https://docs.rs/jsonschema/latest/jsonschema/ and it supports drafts 4, 6, and 7

> appears to be a conformance test suite; does this carry the same, or was the focus more on speed?

it was mostly an excuse to play with pgx that went surprisingly well. The performance is very good but safety was our main interest.


> It would be valuable to know which JSON-Schema it supports, since there are currently 4 different versions that differ in their capabilities (as one might expect).

There are 9 drafts, at least six of which seem to have support in implementations. (Plus there is a different variation [an “extended subset” of Draft 5] defined and used within OpenAPI 3.0.)


I have a very dumb question: why would you use this instead of a traditional schema? I thought the value of json columns was to be partially schemaless and flexible

What is the use case for this versus normal column definitions, if you’re looking to enforce schemas?


Not the author.

Developers sometimes really just want to dump the data as JSON. For them this means not having to write a lot of boilerplate ORM or SQL templates, and shipping trivial features quickly.

For example, user UI preferences are a good candidate for something like this. You probably don't want to add a new column just to remember the status of a checkbox, or when the user last logged in.

As a DBA, you probably still want to define a schema for this data, so as not to cause unexpected web app crashes. It ensures some level of data consistency without increasing the maintenance overhead.

Obviously you wouldn't use it for business critical data, in my opinion.


Ah gotcha, so the use case is when the frontend wants to send JSON and there is no backend dev around to munge things into a standard postgres schema in order to save it?

Honestly, it seems like a middleware layer would suffice in order to stick with more "standard" postgres features (not to indict this feature, but simply because it's more likely a maintainer would understand and have experience with it).


There’s a lot of value in keeping data in the same format everywhere. Why “munge” it into some other format if there’s no actual need?


“Flexibility” isn’t why nosql patterns are useful - at least not in the sense of fewer constraints on your data.

Imagine you’re writing an editing interface of some sort with images, text blocks etc. Each is conceptually a “node” but has different attributes. A single node table with a column per attribute means redundant columns. A table per node type introduces other problems because you need to query all your nodes together. Basically however you model this you end up with tradeoffs, and real apps can obviously get much more complex - next thing you have 50 tables and a complex query just to get the data out.

Contrast to the other extreme - storing the whole thing in a single hierarchical json document. There’s no restriction on the data shape, and you can just pass the whole thing around as json. Versioning becomes much simpler because you’re versioning a single document. Your export format is just the json.

There’s tradeoffs of course, and often a middle ground is the right approach - but json columns with validation definitely have their place.
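For instance (sketching very roughly, names are made up), a node schema could use oneOf so the column accepts either shape while each shape stays constrained:

    {
        "type": "object",
        "oneOf": [
            {
                "properties": {
                    "kind": {"const": "text"},
                    "content": {"type": "string"}
                },
                "required": ["kind", "content"]
            },
            {
                "properties": {
                    "kind": {"const": "image"},
                    "url": {"type": "string"}
                },
                "required": ["kind", "url"]
            }
        ]
    }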


I'm using a pg json column to store data as a second source of truth for data that is primarily held on a separate microservice.

The schema on that end is pretty intricate but to prevent hitting two services for certain types of data, we just dump it to a json column.

Furthermore, for a personal project of mine to help me with productivity / daily schedules, I'm using a json column for a to-do list in the schema of

{[some_todo_item]: boolean,}

which can't traditionally be represented in pg columns, as the to-do items are variable.
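If I'm reading JSON Schema right, that shape should be expressible with additionalProperties, keeping the keys free-form while constraining the values, roughly:

    {
        "type": "object",
        "additionalProperties": {"type": "boolean"}
    }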


Do you need the schema to be enforced? I guess my question was: "why use a JSON column with an enforced schema to save data, when perfectly good normal pg schemas are available?"

Totally get why you would want to just save data in whatever format you send it in, that's how I prototype as well. But a regular column has the advantage of familiarity with other devs, not to mention better syntax for querying.


Not experienced with Postgres and its ecosystem unfortunately, but all those Postgres extensions popping up on hn lately certainly make me envious. To someone with more insight: How risky is it to rely on those extensions? I guess rust handles the 'accidental data corruption or crashes' aspect. How difficult is it to continue to use such an extension once the original author walks away? Is the extension API somewhat (or perfectly?) stable? Given that this example probably mostly used in CHECK contraints, I guess it could be fairly easy removed or replaced from a running installation?


The most common constraint is whether the DB as a service offerings support a given extension, since they don't support installing custom ones. Naturally choosing to support an extension across a fleet of hundreds of thousands of instances (running dozens of different minor versions) isn't a decision made lightly, so it can take a while for new extensions to be supported.

Supported extensions on AWS RDS: https://docs.aws.amazon.com/AmazonRDS/latest/PostgreSQLRelea...

Supported extensions on Google Cloud SQL: https://cloud.google.com/sql/docs/postgres/extensions

Supported extensions on Azure Database: https://docs.microsoft.com/en-us/azure/postgresql/single-ser...


Shameless plug (StackGres team member here) but StackGres possibly has the largest selection of ready-to-use Postgres extensions [1].

Give it a quick try on any Kubernetes cluster, like k3s on your laptop (one command install), and install any extension from the Web Console or a 1-line in the SGCluster yaml.

[1]: https://stackgres.io/extensions/


I just love the work both Supabase & Hasura have done making people aware of how powerful Postgres is.


How does this validate data with a variable number of keys with the same value type, for example a to-do list?

My day-to-day to-do list varies in the number of tasks, but completion will always be a boolean:

    [
        {
            "task": "do Foo",
            "completed": false
        },
        {
            "task": "do Bar",
            "completed": false
        },
        {
            "task": "do Baz",
            "completed": false
        },
        ...
    ]
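For reference, I'd guess the schema for that shape would be roughly the following (with the number of items left open), but I'm not sure:

    {
        "type": "array",
        "items": {
            "type": "object",
            "properties": {
                "task": {"type": "string"},
                "completed": {"type": "boolean"}
            },
            "required": ["task", "completed"]
        }
    }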
Also, what is the issue with doing schema validation before inserting into the json column? That is what I'm doing with a small microservice with Redis.


This is awesome! I remember playing with Hasura (which uses Postgres) a few years ago. Hasura is great, but one of the things I really wanted was JSON schema validation at the database level so I could keep all the "raw" types together, if that makes sense. I did then, and still do, public-facing schema validation, but that doesn't necessarily validate how the data is going to be persisted into the database, so being able to have that closer to the database level now is great.


As an aside, I'm a long time backend developer writing my first mobile app with Dart/Flutter. I tried the popular backend recommendation in that ecosystem. After weeks wasted on it, I googled "{PopularBackend} alternatives" out of frustration and this thing called "Supabase" appeared in the results. What a breath of fresh air it's been. It uses Postgres as a backend (with postgREST), which means I can put all those skills to good use (you can go far with all the batteries Postgres comes equipped with: row-level security, triggers, functions, etc). It's open source and the free tier lets me do pretty much all I need for development (and probably for a production MVP). Which means I don't have to worry about "after questions" until after.

Supabase team keep doing what you're doing!


thank you for your kind words. I’ll share this with the team - they will really enjoy reading it

good luck building, and make sure to share the critical feedback with us too. it really helps to improve the product


> make sure to share the critical feedback with us too. it really helps to improve the product

Will do.


I was just looking into this for handling typescript integration with postgres. I think there's a lot of opportunity to make that work really well. Zapatos and pg-typegen are good steps but it could be even better.


If adding this check to an existing table with millions of records, will it scan the entire table checking all of the records, or just check records when they are inserted or updated?


if you add a check constraint to an existing table only new/updated records are checked. Existing records will not be checked


What you describe is not the default behaviour with PostgreSQL (though it is available, via ALTER TABLE ADD CONSTRAINT ... NOT VALID).
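e.g., a sketch with a hypothetical `account.metadata` column, assuming the extension's `jsonb_matches_schema(schema, instance)` function:

```
-- NOT VALID skips scanning the existing rows when the constraint is added;
-- only new/updated rows are checked from this point on
alter table account
    add constraint metadata_matches_schema
    check (jsonb_matches_schema('{"type": "object"}', metadata))
    not valid;

-- later: scan and validate the existing rows, while still allowing
-- concurrent reads and writes on the table
alter table account validate constraint metadata_matches_schema;
```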


Thank you!

Some suggestions for the next roadmap:

- a Dockerfile (a Dockerfile helps me a lot in trying out new technologies)

- info about the compatibility with new PG15

- CI/CD


Do you really need a Dockerfile for a library that consists of one file?


I do not have rust installed by default.

I mean I don't think it's just 1 line

- this is just a rust part: https://github.com/rust-lang/docker-rust/blob/master/Dockerf...

- plus "postgresql-server-dev-14"

- ???


[deleted]


So we’ve come full circle and now JSON is just XML with lighter syntax.


I don't view that as a loss. JSON by itself is just a string with no context information so either the context is in the programmer's head, in a human-readable document, or in a schema file.

Adding a schema to it seems natural to me (as someone who prefers strong-typing), and the JSONSchema implementation seems pretty lightweight and sensible.

I'd go even further down the route of mining XML's legacy and add better XPath 2.0 + XQuery-style support. XQuery was a great tool, so expressive. Though it did spawn XQueryX [1], which I'm still not sure wasn't a joke.

[1] https://www.oreilly.com/library/view/xquery/0596006349/ch22s...


Postgres does support jsonpath syntax which covers some of those cases:

https://www.postgresql.org/docs/current/datatype-json.html#D...
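e.g. (toy values, just to show the filter syntax):

```
-- filter array elements with a jsonpath predicate
select jsonb_path_query('{"a": [1, 2, 3, 4, 5]}', '$.a[*] ? (@ > 2)');
-- returns 3, 4 and 5, one row each

-- or just test whether any element matches
select jsonb_path_exists('{"a": [1, 2, 3, 4, 5]}', '$.a[*] ? (@ > 2)');
-- returns true
```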


It’s funny to think about where we’d be if someone had come up with a better xml syntax before json took off.


With the difference that it's not a markup language, and it doesn't have those pesky namespaces.


To a certain degree, yes. Agreed that it's funny and ironic. But also from an overall software-engineering and developer-experience perspective, this isn't the worst outcome...


Nah, we still don't have JSL (either JSLT or JSL:FO.) And, given that JSON isn't markup, I don't see that as being likely, at least not JSL:FO.



