Skipping the boring parts of building a database using FoundationDB (tigrisdata.com)
75 points by ovaistariq on Sept 21, 2022 | 41 comments



It's a shame that FoundationDB went closed source when it did, since that was the key period (imo) for database exploration - the early 2010s. Had it been more popular then, I imagine most people would be using it now.

The nice thing about FDB is that once you have three or more nodes, you can simply add nodes using your cloud provider of choice and it scales pretty nicely while still giving you high availability and fault tolerance.

It's pretty funny to me, though, to see this - I've spent the past few days building a simple database on top of FDB that supports indexes, secondary indexes, and schema migrations backed by json-schema (very, very similar to this, totally independently!).

To get into it a little bit: it's not super difficult if you use FDB. FDB is a very bare key value store. It's incredibly low level. You don't even get a notion of collections. You have to implement everything yourself. What it does give you, however, is a giant hash map that will guarantee that items are in sorted order.

So building what I was describing is easy:

A collection can be a mapping from a tuple to a value:

   (your-app, your-collection, _id, your-model-id-number) => json
e.g.

   (hn-app, users, _id, 1) => { _id: 1, username: endisneigh }
   (hn-app, users, _id, 2) => { _id: 2, username: reader }
An index can be something like:

   (your-app, your-collection, your-field-to-index, index-value, _id, your-model-id-number) => json
e.g.

   (hn-app, users, username, endisneigh, _id, 1) => { _id: 1, username: endisneigh }
   (hn-app, users, username, reader, _id, 2) => { _id: 2, username: reader }
Because FDB gives you transactions, you maintain the index by populating the keys according to the pattern above on your create* and update* operations.
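
To make that concrete, here's a rough TypeScript sketch of maintaining an index inside a transaction. Nothing here is a real FDB API: `Txn` stands in for whatever transaction handle your binding exposes, and `pack` is a naive placeholder for the tuple-layer encoding.

    // Illustrative sketch only: `Txn` stands in for an FDB binding's transaction
    // handle, and `pack` is a naive stand-in for the tuple layer (the real
    // encoding is type-aware and order-preserving).
    interface Txn {
      set(key: Buffer, value: Buffer): void;
      clear(key: Buffer): void;
    }

    const pack = (parts: (string | number)[]): Buffer =>
      Buffer.from(parts.map(String).join('\x00'));

    interface User { _id: number; username: string; }

    // Write the primary record and its username index entry in one transaction,
    // so the index can never drift from the data.
    function createUser(tn: Txn, user: User): void {
      const json = Buffer.from(JSON.stringify(user));
      tn.set(pack(['hn-app', 'users', '_id', user._id]), json);
      tn.set(pack(['hn-app', 'users', 'username', user.username, '_id', user._id]), json);
    }

    // On update, clear the stale index entry and rewrite both keys, still
    // inside the same transaction.
    function updateUsername(tn: Txn, user: User, newUsername: string): void {
      tn.clear(pack(['hn-app', 'users', 'username', user.username, '_id', user._id]));
      user.username = newUsername;
      createUser(tn, user);
    }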

To do something like a schema migration, FDB gives you a get_range operation that you can use to find all keys that have a prefix. So what you'd do is store a value in the database indicating that you're doing a migration, iterate through the keys in batches (so each batch is a single transaction), update the value in the db saying what the last key you've migrated is, and continue until you've done all of the keys.
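
Here's a rough sketch of that loop, again against made-up binding interfaces (`Db`/`Txn`); the real range-read and transaction APIs vary by language, but the shape is the same.

    // Illustrative only: `Db` and `Txn` are hypothetical stand-ins for an FDB binding.
    interface Txn {
      set(key: Buffer, value: Buffer): void;
      getRange(begin: Buffer, end: Buffer, limit: number): Promise<{ key: Buffer; value: Buffer }[]>;
    }
    interface Db {
      doTransaction<T>(fn: (tn: Txn) => Promise<T>): Promise<T>;
    }

    const CHECKPOINT = Buffer.from('migrations/users/last-key');

    async function migrateAll(db: Db, begin: Buffer, end: Buffer, migrate: (v: Buffer) => Buffer) {
      let cursor = begin;
      for (;;) {
        // Each batch is one transaction: rewrite up to 500 records and persist the
        // checkpoint, so a crash resumes from where it left off, not from the start.
        const lastKey = await db.doTransaction(async tn => {
          const batch = await tn.getRange(cursor, end, 500);
          for (const { key, value } of batch) {
            tn.set(key, migrate(value));
          }
          if (batch.length === 0) return undefined;
          const last = batch[batch.length - 1].key;
          tn.set(CHECKPOINT, last);
          return last;
        });
        if (lastKey === undefined) break;
        // Continue strictly after the last migrated key.
        cursor = Buffer.concat([lastKey, Buffer.from([0x00])]);
      }
    }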

A lot of stuff is pretty trivial once you assume the underlying semantics are solved. I've seen some interesting projects involving things like using FDB as a virtual file system for SQLite, but the thing is, FDB's primitives are actually flexible, and so there are optimizations you can make if you build with those primitives from the beginning, as opposed to using FDB as a simple key value store without taking advantage of the transactions.

-------

On another note, one idea I've had (feel free to steal) is to reimplement IndexedDB using FoundationDB. IndexedDB is also a key value store which supports transactions, like FDB. Obviously IDB is not networked.

The idea is that if you can semantically map IDB with FDB, then you could use FDB as a store for IDB (scoped to the user, of course). And then any app that uses IDB for its storage (like an offline app) could use FDB as the backing without having to use a different set of data structures to actually represent the storage.
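
Purely as a sketch (all names made up), the mapping could be as simple as prefixing every IndexedDB key with the user and the IDB database it belongs to, and letting an IDB transaction ride on a single FDB transaction:

    // Hypothetical key scheme: each IndexedDB concept becomes a prefix in an
    // FDB tuple key, scoped to the user:
    //   (userId, idbDatabaseName, objectStoreName, recordKey) => serialized value
    function fdbKeyFor(
      userId: string,
      idbName: string,
      store: string,
      recordKey: string | number,
    ): (string | number)[] {
      return [userId, idbName, store, recordKey];
    }

    // An IndexedDB transaction spanning several object stores then maps onto one
    // FDB transaction, which is what preserves IDB's transactional semantics.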


We love FDB for all the reasons you mentioned and decided to build on top of it. It is quite flexible and provides a low enough primitive to be used as a building block for systems like Tigris.

However, if you are an application developer looking for a ready-made solution that you can plug in as your application's backend, then FDB does require heavy lifting. For example, you would have to implement an auth mechanism, a query layer, schema management, and indexing. This is where Tigris comes into play.

You have a very interesting idea about backing IndexedDB APIs with FDB.


Oh absolutely! I love that you're taking this on. I noticed there was no documentation on schema migrations - I assume you just haven't added it yet, but is it available or on the roadmap?

Once you get all of your core functionality completed, you should definitely look at the IndexedDB APIs with FDB. I see you're considering FDB as a service. You could definitely compete with Firebase if you had some admin primitives around ACLs and you reimplemented the IDB APIs with FDB.

For instance, you and I are obviously each on our own computers. We could each have a Tigris instance. If your app is down, we fall back to the regular IDB API and everything is saved. You could save entire transactions that aren't persisted to FDB and replay them when FDB comes back up.

More interestingly, as the admin, you could use all of the IDB tooling like LevelDB, PouchDB, absurdsql, etc., and only concern yourself with the user (you and I) and things like how many keys they can save on the free plan, premium, etc.


Schema migrations are supported, but there are some restrictions. Here are some docs; we will be adding more details: https://docs.tigrisdata.com/overview/datamodel

Assuming you have declared your schema as shown here https://docs.tigrisdata.com/typescript/getting-started, you can evolve it by updating your type definitions and deploying the new version of the application. Once `createOrUpdateCollection` is called, it will update the schema.

--

The IDB idea sounds very cool, let me dig into it more.


One thing I'd personally recommend from thinking about this last week regarding schema migrations is the following:

Unlike Postgres or other RDBMSs, FDB, being a NoSQL store, has some advantages with long-running operations like migrations. In particular, FDB can store things in an arbitrary manner.

I created a notion of a "preempt" for migrations. The way a preempt works is that when you define a migration, you also define a preempt which represents how the old value changes to the new value after the migration.

For example, if you have:

   { username: endisneigh, _id: 1 }
and you want:

   { username: endisneigh, _id: 1, lengthOfUsername: 10 }
You'd obviously run some code to modify everything. Lots of ways to do this: map-reduce, a batch job, etc. The problem is, if you happen to have 100 million of these rows, it will take a long time to modify all of them. There are a lot of ways to solve this - locking being a popular one.

I created a notion of a preempt so you can define the change in the migration and immediately have access to the change if you access the particular record prior to the migration job getting to it.

So in the above example, you could have a migration that looks like the following:

   class Migration {

     @up
     migrateUp(oldRecord) {
       oldRecord.lengthOfUsername = oldRecord.username.length;
       return oldRecord;
     }

     @down
     migrateDown(currentRecord) {
       delete currentRecord.lengthOfUsername;
       return currentRecord;
     }

   }
What's nice about this is that if you use "preempts", you don't have to have any conditions around the long-running jobs in your application code. You can treat the long job as already being completed as soon as you run it, regardless of the number of records. You can call it a just-in-time migration for new records, to be run as you access records. The reason I felt this to be necessary is to maintain the transactional (completed or not completed) semantics FDB gives you, because the code is easier to work with if you can assume things are either done or not. Eventual consistency is a huge pain and creates too many bugs imho. The other reason I like preempts with FDB is because it's literally something you can't do with an RDBMS (you couldn't treat a column as another type until the transaction has actually completed for an ALTER TABLE, for instance).
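
To illustrate the read path (names below are made up, not from any real codebase): any preempt whose batch job hasn't finished yet is applied to records as they are fetched, so callers only ever see the post-migration shape.

    // Illustrative sketch of just-in-time "preempts" applied on read.
    type Doc = { [field: string]: unknown };
    type Preempt = { up: (doc: Doc) => Doc };

    // Preempts for migrations whose background job hasn't finished yet, oldest first.
    const pendingPreempts: Preempt[] = [
      { up: doc => ({ ...doc, lengthOfUsername: String(doc.username).length }) },
    ];

    // Applied to every record as it comes out of FDB. The stored bytes are left
    // untouched, so the long-running job can still rewrite them later, and the
    // preempt is idempotent so applying it twice is harmless.
    function applyPreempts(raw: Doc): Doc {
      return pendingPreempts.reduce((doc, p) => p.up(doc), raw);
    }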

I would also not get too invested in your architecture such that you cannot change data types during schema migrations. I'd generalize it so it's always possible. For example, if you have an integer field with an index on it and you use $gte, it does what you expect. If you change it to a string, $gte still works, and uses lexicographic ordering instead of numeric ordering. You can imagine the equivalent for all of the operators.

The only caveat is that you'd either need to ensure all preempt code is idempotent, since your long-running job might run the code twice (once just in time, and again during the migration), or you'd need to save which records have already been processed via the just-in-time migration and skip those as necessary. The latter leads to issues, since you would need more storage space, and then you'd have to clean it up, and if you have a full disk that leads to more issues, etc.


Avoiding rebuilding the table is definitely possible for certain schema changes like adding a new field.

But supporting data type changes without rebuilding is not ideal. It will lead to data quality issues and complexity in the application. The integer -> string example is simple. But what about string -> integer? How are the consumers of the data supposed to handle the situation where the field has a string value in some records and an integer value in others? They will have to add type checking, which complicates each of these consumers.

Then some of the downstream consumers, such as data warehouses, depend on strict data validation. We went through this problem at Uber - I blogged about it here: https://www.uber.com/blog/dbevents-ingestion-framework/


What they are saying is that if you fetch the value, it will be converted into a string (as you'd expect it to be if you had locked and migrated everything) by a JIT migration as part of the fetch operation.

Solving the exact problem you stipulate.


Postgres has a version of this preempts idea with default values on columns. Postgres will fill in the value at query time without needing to backfill the data. Postgres is not a horizontally scalable database like FDB, so it's not a direct comparison. In practice this means the migration lock is much shorter and it becomes possible to actually have a default on large tables.


Allowing default values for columns is definitely doable and we can also implement it in a similar way by filling it in during the query. But changing the type of a field to an incompatible type is tricky and needs more constraints and external machinery to fix the history.


The IDB idea is very cool. We are doing something similar to manage our platform users, apps, etc. (internal metadata for our platform). But extending it for the use case you mentioned seems very interesting!


> reimplement IndexedDB using FoundationDB. IndexedDB is also a key value..

I did something pretty similar last month: https://rxdb.info/rx-storage-foundationdb.html

It supports indexes, MongoDB-style queries, etc., to store and query JSON documents via RxDB on top of FoundationDB.


Beautiful! I'll check it out. Heh, I knew it was a good idea; I'm glad someone else thought so as well.


Suppose I had an implementation of an (MT-safe) all-in-memory KV store based on tries, for example, where the memory overhead is low -- certainly lower than hashing. Is this something I can plug into FDB?

You write "giant hash map that will guarantee that items are in sorted order". Where can I get more info on that?


Both keys and values in FoundationDB are simple byte strings and keys are treated as members of a total order, the lexicographic order over the underlying bytes, in which keys are sorted by each byte in order.

You can see some more details about it here: https://apple.github.io/foundationdb/data-modeling.html
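
A tiny illustration of what that byte-wise ordering means in practice (Node's `Buffer.compare` does effectively the same comparison):

    // Keys sort by raw bytes, not by any numeric interpretation.
    const keys = [Buffer.from('users/10'), Buffer.from('users/2'), Buffer.from('users/02')];
    keys.sort(Buffer.compare);
    console.log(keys.map(k => k.toString()));
    // => [ 'users/02', 'users/10', 'users/2' ] - byte order, not numeric order,
    // which is why the tuple layer encodes integers in an order-preserving way.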


> It's a shame that FoundationDB went closed source when it did

This is worth correcting at source as well as down the comment tree. FoundationDB was 100% closed source until 2018, when it was open-sourced by Apple.


That’s fair - to clarify, I mean you couldn’t even use it without paying. Before, they offered a binary for free.


> It's a shame that FoundationDB went closed source…

For anyone else who's confused, FoundationDB went closed-source in 2015 but went open-source (Apache 2.0) again in 2018.


FoundationDB was never open source until 2018. They had a free binary version you could download from the website until they were acquired by Apple, though.


I guess I was still confused! Thank you for the correction.


Is there a 'limits' doc somewhere? How many objects can an array hold? What is the depth limit for nested data?


The limitations are listed here: https://docs.tigrisdata.com/apidocs/#section/Limitations

Apart from the size of the document, there is no limit on the size of the array or the depth of nested data. We plan on substantially increasing the document size limit.


"The maximum allowed document size is 100KB"

I suppose this is due to the FDB limitations, so this obviously isn't a blob store, nor will it ever be (?). For example, we need to store video and image files which are easily 100KB - 1GB in size. Tigris or FDB are great to store metadata in (and the metadata is just as important to us as anything), but the blob storage is a bit of a problem. It would be interesting to integrate with something like MinIO or S3.


Thank you for explaining your use case. This is one of the features we have in mind to support use cases similar to yours for storing videos, images, etc. But this may not happen this year. It would be great if you could add something here https://github.com/tigrisdata/tigris/issues that will help us track and monitor similar use cases.


Done, and thanks a lot for the invitation to add, happy to contribute

https://github.com/tigrisdata/tigris/issues/562


Thank you! We appreciate your contributions. Please do not hesitate to make future contributions in terms of feature requests, code contributions, etc.


There is no limit on array size or on nested data; the only limit is on the document size, which we will slowly relax. But if the object is deeply nested, then we need to extract all these nested keys (which may be slightly costly) and index them, as we also allow filtering on any key of the object. If the depth is not significant, then the cost is negligible.


Kudos! You guys tweaked the sales pitch in a single day!


I hope you are sold now :) I don't really have experience pitching our own product, so the feedback on HN has been enlightening.


Can TypeScript be limited to just creating the schema, with JavaScript then used for the rest of your application?


You can use our TypeScript client for everything, not only for schemas. Here is a section that has examples on how to use it: https://docs.tigrisdata.com/typescript/. Please let us know if anything is confusing.


I didn't explain myself well - I meant I want to use JavaScript only, with TypeScript maybe just for the schemas. I understand that Tigris exposes an HTTP endpoint and is all JSON, but I wondered how well the TypeScript SDK plays with pure JavaScript development.


I am the TypeScript client author here. Tigris exposes native HTTP APIs and we have an OpenAPI YAML available at https://github.com/tigrisdata/tigris-api/blob/main/server/v1...

The native JavaScript SDK is on our roadmap. In the meantime, if you would like, you can generate a JavaScript client using an OpenAPI generator, such as https://blog.logrocket.com/generating-integrating-openapi-se...

We would be happy to accept your contribution.


Building distributed database systems correctly from the ground up is a notoriously hard problem. We have seen this firsthand building Docstore at Uber.

This is one of the most confusing aspects of the modern data infrastructure industry: why does every new system have to completely rebuild (not even reinvent!) the wheel? Vendors spend so much time rebuilding existing solutions that they end up not solving the actual end users' problems, even though ostensibly that's why they decided to create a new data platform in the first place!

In this post we talk about our approach to building Tigris - the open source developer data platform. We talk about why we chose to build on top of FoundationDB, one of the most reliable distributed KV stores, with an amazing correctness story. We also go into detail about our experience using it.


Are you Himank or Yevgeniy?

Good luck with your project. FDB fucked its users back in 2015 when it abruptly closed shop and went closed source. Hopefully some good can come of it yet.


It has significant usage across large companies such as Apple, Snowflake, Epic Games, VMWare, etc. I don't see it going closed source. Besides that, here Tigris is taking the responsibility for the product as it is abstracting FoundationDB from the end-user.


I can assure you that FDB will not be closed-sourced again. But I agree that it was pretty bad for its [open source] users that it was made closed source. Terrible move for adoption.


> FDB fucked its users back in 2015 when it abruptly closed shop and went closed source

This is simply untrue - it was not open source prior to acquisition either. The first point at which FDB was open-source was 2018.


FDB screwed their users in 2015 when they abruptly closed shop with no further development, support, or even a migration plan.

Gotta get that Apple money.


Many of those users continued on, with source licenses, and are users to this day. If you had a commercial contract you were unaffected. Perhaps not if you were freeloading…


Ah yes, it was my own fault for believing in them.


I mean, that was Apple. It was no small feat to get them to later open source it.



