This is one of those things that turns out to be surprisingly nuanced: it's not just about creating the right query to fetch data optimally, but also about making sure the data is handled with the right authorization context for the current end-user.
This becomes especially tedious when you're fetching objects and lists of objects, and each resolver needs to mix in its own authorization rules.
Hasura tackles this by providing a model-level authorization layer (similar to Postgres RLS, but at the Hasura/application layer), which is what makes GraphQL-to-SQL feasible for real apps: it's not just a translation layer. Postgraphile does this using Postgres RLS.
Those are the key benefits that make Hasura useful for data-fetching in practice. Over the last few years we've been working with users with billions of records per table, especially in enterprise environments. And it's great to see how smooth things have been.
Especially as the number of tables and models explodes, having a declarative system for specifying authorization at the model level, with a guarantee that an efficient SQL query will be generated, is a nice thing to have. It lets our users work with higher-level concepts: model data, set up authorization, make queries, look at query plans (which now show full SQL queries), add indexes, test, add to an allow-list, done.
 - https://hasura.io
(I'm from Hasura)
That largely depends on your existing architecture. Some of us have authorization services, others have a separate database to store policies, and others have authorization rules buried deep in their existing code base.
However, as I mention in the post, Hasura is great if you decide to use it early in a project, but that's not a luxury everyone has, and not a decision everyone will want to make even when starting a new project.
> because it's not just about creating the right query to fetch data optimally, but also about making sure the data is handled with the right authorization context for the current end-user.
You're absolutely right that it's tricky. (A good example of this is Facebook building a viewer-aware entity layer for some of these reasons.) Hasura handles this well too, but again, many are not in a position to "simply use Hasura". That's why I quickly mention that the post will focus on existing/mature/complex codebases.
Even though authorization systems or database policies live outside the "data" database, quite often that information is used during the data fetch, right? Say the user has certain properties; those properties of the user and attributes of the data are used to define a constraint, which is usually applied as the data-fetch query is made.
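As a stdlib-only illustration (the attribute names and the rule itself are invented for the example), such a constraint combines viewer properties with row attributes and is evaluated as part of the fetch rather than after it:

```python
# Hypothetical ABAC-style rule, invented for illustration: properties of the
# viewer and attributes of each row jointly decide visibility, and the check
# runs as part of the data fetch rather than as a post-filter.
def visible(row, viewer):
    # owners always see their own rows; agents see rows in their region
    if row["owner_id"] == viewer["id"]:
        return True
    return viewer.get("role") == "agent" and row["region"] == viewer.get("region")

rows = [
    {"owner_id": 1, "region": "eu"},
    {"owner_id": 2, "region": "us"},
]
viewer = {"id": 9, "role": "agent", "region": "eu"}
print([r for r in rows if visible(r, viewer)])  # -> [{'owner_id': 1, 'region': 'eu'}]
```

A declarative system would compile a rule like this into the WHERE clause of the fetch query instead of running it in application code.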
For example, our goal at Hasura is to allow developers to bring values from external systems into their rules as well, so that the authorization rule doesn't exclusively depend on values coming from the same database. This is what makes Hasura useful on existing systems (alongside your existing complex codebase), especially when you're looking for extremely high read performance, or subscriptions for example. I do want to emphasize that Hasura is far more valuable (in a business sense) for our users on existing systems than for new systems!
I'd love to hear examples from your experience about the ratio of authz that could be declarative vs. authz that was necessarily imperative in code. The "values" in the constraint specification can come from anywhere (the viewer entity, another database), but the constraint itself is declarative.
What's it been like for the GraphQL API at GitHub? Specifically, do you have a sense of how complex the authorization for the data models is, beyond "declarative" properties of the data and the viewer itself?
Why does Hasura need to know about the authorization model? GraphQL.org's best practices guide says authorization should live in the business layer and be a single source of truth. In other words, GraphQL responses should go through the same authorization module as REST or RPC responses.
Your comment makes me think I was maybe wrong about that? Could you confirm and/or link to some docs about this aspect?
However I think I’ve worked out a decent compromise:
- I start configuring the basics of my Hasura setup through the Admin GUI.
- I then use the CLI to download the metadata with the command `hasura metadata export`.
- Commit that to my repo.
- Using the existing metadata files as a template, make any changes when required.
- Use the cli-migrations docker image to apply my metadata on each deployment of Hasura (as detailed here: https://hasura.io/docs/1.0/graphql/manual/migrations/advance...)
Not sure if that’s the most efficient way to do it, but it seems to work and it was pretty quick to set up.
1. You use the GUI and then export/import the metadata
2. You can use the GUI and as you do, the GUI will keep incrementally updating the metadata configuration file and even create database migrations for you in your folder on your machine, kind of like an IDE
3. You can stop using the UI altogether, and just write the configuration by hand
Also, feel free to ask around on our discord if you need help: https://discord.gg/hasura
You can choose to bring in that configuration via a Kubernetes configmaps type thing, or by baking a new image for a release (like you would have done with your source code say).
That way you can leverage proper version control, merge feature branches, do copy paste of parts of models, reuse entire modules across projects, etc.
(I'm from Slicknode)
Internally we have a little tool that dumps the current state of the permissions into a txt file that we commit to the repo. This makes it easier to get a snapshot of the permissions as well as compare what’s changed. The Hasura internal tables are pretty self-explanatory so it only took an hour or so to knock up.
PS: love Hasura. I suspect we’ll end up running it in front of all our data layers eventually. If there was an easily hackable admin front end that sits over GraphQL, I’d be interested in hearing about it. Leaning towards Forest Admin at the moment, but I’m still not 100% sold.
I scratched my head over this problem for quite a while. I would love things like database-to-GraphQL or even Datalog solutions.
But eventually, I found I still needed to add another web endpoint on top of it to handle all the authorization logic, making things just as verbose as typical web applications and defeating the purpose of using database-to-GraphQL solutions.
RBAC is okay-ish for a lot of applications, but not sufficient for others. Usually ABAC or more expressive authorization rules are needed.
Do you have any thoughts on more expressive/powerful ways to tackle this issue?
I think at a minimum you need authentication, authorization, access controls (some of which are security filters), and of course auditing. You need to support RBAC and also ABAC, roles/security groups, and then the access controls on top.
The solution to row/cell-level security (RLS) in most databases comes through projection- and predicate-based filters and/or data obfuscation/masking/encryption. These are a form of access control, expressed (oversimplified) as read/write controls on top of permission objects that act as security filters.
These security filters may use attributes from a role or user, or be defined within a permission object itself and then granted to roles, users, and/or groups of users.
One simple example: if the ultimate interface to other systems is running a SQL query on a remote system, then RLS would add a condition to the WHERE clause of the query that filters the rows returned based on an attribute.
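A minimal sketch of that idea (table and column names are made up; a real implementation would use a SQL parser and bound parameters rather than string manipulation):

```python
# Hypothetical RLS rewrite: append a row-level predicate derived from a
# viewer attribute to every outgoing query. int() here guards the
# interpolation; production code would bind the value as a parameter.
def apply_rls(base_query: str, user: dict) -> str:
    predicate = f"org_id = {int(user['org_id'])}"
    if " where " in f" {base_query.lower()} ":
        return f"{base_query} AND {predicate}"
    return f"{base_query} WHERE {predicate}"

print(apply_rls("SELECT id, title FROM documents", {"org_id": 7}))
# -> SELECT id, title FROM documents WHERE org_id = 7
```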
To get to cell-level protection (i.e. row/column) in a relational result, you need projection limitations (e.g. filtering the columns returned in a SELECT, or obfuscating the data in the rows returned using data masking). The most advanced solutions encrypt the data in such a way that you can still use it for joins, but cannot see the source values, thanks to the security-filtering layer.
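A toy sketch of those projection-side controls (column names invented for the example): drop columns the caller may not see, and deterministically mask values that must stay joinable:

```python
import hashlib

# Illustrative only: filter columns and mask sensitive values. Using a
# deterministic hash as the mask keeps the column usable as a join key
# (equal inputs map to equal outputs) while hiding the source value.
def project(row, allowed, masked):
    out = {}
    for col, val in row.items():
        if col in masked:
            out[col] = hashlib.sha256(str(val).encode()).hexdigest()[:12]
        elif col in allowed:
            out[col] = val
    return out

row = {"name": "Ada", "ssn": "123-45-6789", "salary": 90000}
print(project(row, allowed={"name"}, masked={"ssn"}))
# salary is dropped, ssn is masked, name passes through
```

Real systems use keyed, reversible, or format-preserving encryption rather than a bare hash, but the joinability property is the same.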
Hasura seems to be going down the path (I worked for a company that did the same) of integrating "data access firewall" type features into their product. It makes a lot of sense - as it is an interface layer, a lot of additional security can be baked in that goes above and beyond that of the actual DBMS behind it. This is especially needed in federated situations where you have heterogeneous platforms that differ in their security capabilities (as is the case with GraphQL and Hasura, trying to span multiple data sources).
You can even add security rules like preventing cross joins or preventing long-running queries: basically you can address just about every concern DBAs have about letting people run queries directly against a DB.
Hasura's authz system is essentially row-level security that can span attributes and values across data sources and "viewer" properties. We're gradually expanding these capabilities; that's our roadmap.
Is there an ABAC policy that you tried with Hasura that didn't work? Would love to hear!
For example, here’s one user’s stack:
I have developed a Python library that does essentially this GraphQL-to-SQL pattern, leveraging the power of SQLAlchemy and some metaprogramming to create the four CRUD methods for each table automatically, all with filters, order-by, pagination, and sub-relationships/queries. And if you use SQLAlchemy's automap, you don't even need the database schema ready.
I needed this when a client with a huge Microsoft SQL Server database (2k+ tables) asked me to quickly export it 'to the web' with the least friction possible. I researched Hasura, but it was available only for PGSQL, so I created my own; by using SQLAlchemy I can connect to any supported database. I am using pyODBC with Microsoft's drivers on Linux and can easily connect to the database and export it.
I am gonna provide example code soon-ish on the library's repo. But basically, it goes:
1. Instantiate the library, passing your DB URL and your SQLAlchemy database schema (declarative_base() or automap_base()).
2. GraphQL objects, the schema, and the actual CRUD functions leveraging the passed connection string are dynamically generated.
3. Instantiate a GraphQLView and pass it to your Flask App().
4. That's it!
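A stdlib-only sketch of the metaprogramming part (this is not the library's actual API; a plain dict stands in for the database, and the names are invented): generate CRUD callables per table via closures, roughly how resolvers can be built without hand-writing one per table:

```python
# Hypothetical sketch: for each table discovered by introspection, generate
# CRUD callables via a closure. A real GraphQL-to-SQL layer would wrap these
# in resolvers and emit SQL through SQLAlchemy instead of touching a dict.
def make_crud(table: str):
    def list_rows(db):          # "read" operation
        return db.get(table, [])
    def create_row(db, row):    # "create" operation
        db.setdefault(table, []).append(row)
        return row
    return {"list": list_rows, "create": create_row}

crud = {t: make_crud(t) for t in ["users", "orders"]}  # introspection stand-in
db = {}
crud["users"]["create"](db, {"id": 1, "name": "ada"})
print(crud["users"]["list"](db))  # -> [{'id': 1, 'name': 'ada'}]
```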
- we introspect your SQLAlchemy models and generate Dataloader-based resolvers that emit small, independent SQL queries. This way the optimization composes naturally with other GraphQL types in your schema.
- we are a thin wrapper on top of Graphene, the main GraphQL Python library, which makes it very easy to extend your types if needed. We also added ways to rename a column/field or modify its type.
- for the SQL query generation we currently lean on SQLAlchemy to do the heavy lifting. It can emit efficient IN() and JOIN queries for us.
There is still room for performance improvements, but I'm confident we'll get there eventually (though I expect the optimizations to only work for DBs that support LATERAL JOINs). My main focus at the moment is coming up with an API that makes it possible to create performant types that aggregate multiple tables.
It's possible to use sqlalchemy with django, but you're basically on your own (don't expect to be able to use django-admin and other pluggable apps etc.)
> it didn't used to be very nice back in the day
Having used both django and sqlalchemy ORMs very heavily, I've got to say that django does what it does extremely well and intuitively. For what the vast majority of webapps need to do, the "obvious" way of constructing the query causes django to do more or less the right thing. sqlalchemy on the other hand I love for other reasons, but am frequently annoyed at how much faffing is required to do the simple things in a concise way.
I think a few people have thrown around the idea of creating a frontend to sqlalchemy to allow it to understand django model definitions and QuerySet semantics but I don't think anything's come of it.
> Besides the popular Dataloader approach, another very common way of tackling this problem is by taking a GraphQL query, and coming up with the exact SQL needed to resolve it
But why? Why not just use DataLoader? Is there something intrinsically wrong with it that makes it a bad solution? I want to know! My understanding was that DataLoader is extremely performant and works for 90% of situations where a join is needed to get values for a GraphQL query.
> Data fetching remains something to improve
Certainly this might be the case, but the article didn't address the most popular form of data fetching and explain why a graphql-to-sql compiler would be a better solution.
The Dataloader approach gives you a potentially (arguably likely) suboptimal baseline performance for simple projects, or ones where tight coupling with the database is acceptable. But it’s one that tends to scale better and more predictably with complexity (both codebase size and architecture). A hybrid model where you use Dataloader but include some database query optimisations driven by your selection sets is likely a winning approach, but will probably need to be fairly bespoke to your architecture.
One thing I will say is that we do make a lot of “IN” queries as a result. You fetch a bunch of rows with a data loader, then you want to fetch a bunch of rows from some other table based on those first results, so you end up with an IN query where you might have crafted a JOIN if you were writing it by hand.
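That pattern can be sketched in a few lines of stdlib Python (table and key names invented; real Dataloaders also dedupe across the request and cache results):

```python
# Toy Dataloader-style batcher: load() only records keys, and dispatch()
# issues one IN query for the whole batch instead of N single-row queries.
class TinyLoader:
    def __init__(self, table, run_sql):
        self.table, self.run_sql = table, run_sql
        self.pending = []

    def load(self, key):
        self.pending.append(key)

    def dispatch(self):
        keys = sorted(set(self.pending))  # dedupe the batch
        self.pending.clear()
        placeholders = ", ".join(str(k) for k in keys)
        return self.run_sql(f"SELECT * FROM {self.table} WHERE id IN ({placeholders})")

issued = []
loader = TinyLoader("authors", issued.append)
for post_author_id in [3, 1, 3, 2]:   # one load per resolved row
    loader.load(post_author_id)
loader.dispatch()
print(issued)  # -> ['SELECT * FROM authors WHERE id IN (1, 2, 3)']
```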
I don't know if it's common across other GraphQL implementations, but the graphql-java library lets you peek into the selections, so you can opt to limit the data you select, execute a join instead of using an IN query, etc.
I've used this with good results when making a prototype GraphQL interface to our billing software: anything that had a 1:1 relationship with a parent record would get JOINed by the resolver, while 1:N or N:M relationships would execute separate queries.
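The same decision sketched in Python (the relation names and the flattened selection set are invented; a real resolver would read the selections from the GraphQL AST, e.g. graphql-java's selection-set API):

```python
# Decide per request whether to JOIN a relation up front or defer it to a
# separate query, based on which relations the client actually selected.
ONE_TO_ONE = {"billing_account", "profile"}  # safe to JOIN: at most one row each

def plan(selected_relations):
    joins = sorted(r for r in selected_relations if r in ONE_TO_ONE)
    deferred = sorted(r for r in selected_relations if r not in ONE_TO_ONE)
    return {"join": joins, "separate_queries": deferred}

print(plan({"profile", "invoices"}))
# -> {'join': ['profile'], 'separate_queries': ['invoices']}
```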
However, another commenter pointed out above that some implementations allow peeking into the GraphQL query. That way if you know you're going to need the user anyway, you can already perform the join proactively when you fetch the ads.
In a few cases we've done this because we always want the joined results so we just do the join regardless of the query we get.
I still don't understand why this abstraction is useful, if not harmful.
I think the abstraction is somewhat useful if your GraphQL queries vary tremendously (one device asks for the user along with the ads, another device doesn't). Expand this out to a bunch of different data types and you'd find yourself writing a lot of variations of the queries. GraphQL with Data loader allows you to write these things exactly once albeit with potentially suboptimal queries.
All of that being said, we're finding that our queries vary very little and are somewhat questioning the value of data loaders and of GraphQL altogether in our case.
it kind of just breezes over it without explaining the reasons why DataLoader isn't a good long-term solution
If I had to guess: because it's something that you just turn on in your project and it just works (or doesn't).
I've used it in a project together with graphql-sequelize, which gives you "automatic" resolvers for your queries. Turn dataloader on, perform a query that uses any of those automatic resolvers and all underlying SQL queries to the same table are magically merged together. Now, replace those automatic resolvers with custom ones and you lose this magic. Now you have to write your own ORM query that groups anything that would end up being multiple queries into one, basically the case the author is pointing out.
This scaling could work a variety of ways. Records might be horizontally sharded across a bunch of database instances. Caching is layered above the database to protect against hot database queries, etc. The generation logic now needs to understand how to partition a single GraphQL query across the shards and caching layers.
It's definitely not impossible, and in some circumstances it may be easy. But it couples your GraphQL implementation very tightly to your current database system in a way that may make it difficult to rearchitect your database system at the next order of magnitude scaling problem - "We really need to redo how we think sharding and caching, and our GraphQL's direct interaction with the data layer is going to make this a challenge." Maybe it's easier if the company is new and is GraphQL-only. An older company that has a mix of a few data access technologies might wish the GraphQL left the data layer concerns to the data layer.
I know it's counter-intuitive, but we think of this as an opportunity to solve those problems at the Hasura layer. Sharding and caching are hard, but not impossible to make declarative if you know enough about how the data is modelled, how data ownership works (authorization), and how the cache is managed (caching hot queries, caching complex but infrequent queries, invalidating on time or on data changes).
I'm not trying to say the problem isn't hard, I'm trying to ask what it would take to solve the problem. :) We're also actively working on solving them. Imagine if those challenges got solved for a sufficient number of use cases!
Incidentally, databases are getting better at preventing users from outgrowing them too, while preserving the SQL API as much as possible or providing a transform. Citus / YugaByte / PlanetScale / Cockroach / Spanner, to name a few.
Have you guys thought about looking at some other tools that might be symbiotic? Hasura's interface would be an incredible companion to something like Dremio (also open source), since it is a fully-SQL engine across a number of heterogeneous systems, with probably the most advanced caching I have ever seen in a system like it (Apache Arrow in-memory and columnar storage). If you can make your query interface adapted to tools like Dremio, you shouldn't have any problem exploiting caching in those kinds of platforms - might be even more efficient than building into Hasura for some use cases.
Edit: I think there is a newer type of interface architecture forming here, and I think GraphQL represents one of the pieces of the puzzle, would be really interested to see more detail on Hasura's take on that.
Have to admit I haven't had this issue yet, and I think it's a nice problem to have when it comes, since it means too many paying customers :)
You can do this in PostgreSQL, it supports sharding natively through its FDW ("foreign data wrapper") feature.
Sometimes you are creating an application and building its underlying datastore at the same time. Hasura is great for this.
But sometimes you already have a disparate ecosystem of services and datastores and you want to provide a GraphQL API for that. This use case is better-served by other approaches.
Great tool. Personally I don't trust GraphQL completely.
I had to come up with a deliberately contrived UI to find a realistic query that was more than 4 relationships deep. Most were around 3 (though double-nested connections were fairly common within this). The more common problem was around overall data size, where the bottleneck was JSON serialisation overhead and network transfer size.
Query whitelisting is the winning approach for 1st party APIs. If you’re opening things up to third parties, e.g. the Github API, complexity analysis becomes the more valuable solution.
Scheduled for the 1.3 release this coming week!
Disclaimer, it's being used in low-volume production by us, but that is about it, so ymmv. The "docs" is the sample app used for integration tests/etc.
I mention it b/c I agree with the article that GraphQL is super-easy to N+1, but the Hasuras/Postgraphiles are a fairly different application architecture than "I'd like some ActiveRecord-style domain objects".
Prisma is also promising, but AFAIU more of "a really great query layer" than "how you'd write your domain objects", which is what I wanted to focus on.
I was originally a node/JS skeptic, but it is uncanny how well the JS "limitation"/architecture of an event-loop + async/await basically provides Haxl-style batching "for free" to the application programmer.
Went looking around a bit more and found MikroORM. The developer is fantastic, fixes issues and overall it is a better designed tool that just works the way you'd expect.
Edit: I just noticed that stephen is the op, an old friend. It is worth looking into Joist as well. =)
When looking into GraphQL, and having toyed with it a little, it's never made sense to me. It always came across as an extra layer of obfuscation and complexity when dealing with the data access layer. You have to write backing SQL/DB queries anyway to support the GraphQL layer, so it's a more efficient use of time to just write good DB queries.
Is there something I’m missing here or is the world finally catching up to reality?
The benefits for API consumers and frontend developers are quite clear: easier ergonomics for exploring the API, and the ability to colocate data fetching alongside UI components so a GraphQL client can automatically optimise the data fetching required to load a screen/page.
The benefits of building a GraphQL API/service do seem harder to evaluate.
You can use GraphQL with SQL. This is about that.
However, you can also use GraphQL with NoSQL. AWS Amplify uses GraphQL with DynamoDB, and that's a pretty common stack in the AWSiverse.
You can't use SQL with NoSQL, obviously.
So, if you're looking to learn more about GraphQL, just realize the lessons here aren't universally applicable, though GraphQL-to-SQL is probably the best implementation, imo.
What do you mean by that?