> JunoDB is unmatched when it comes to meeting PayPal’s extreme scale, security, and availability needs.
It would be nice to see some benchmarks, or just a mention of any kind of number. TiKV is a CNCF-donated project with roughly the same architecture and has been deployed in clusters larger than 200 nodes.
I've only ever used SQL and relational databases. What are the use cases of key-value stores? What's the canonical example of where they are clearly the right solution?
In a nutshell, you trade away the various guarantees relational databases provide, like transactions, relational integrity, and high levels of queryability, in exchange for higher performance on what a key/value store can do, simplicity of API, and some things that can be easier for the system to provide precisely because it is offering fewer guarantees. Particularly distribution... SQL is hard to distribute precisely because of the guarantees it offers.
When to use it? You must be sure you have a case where the additional guarantees relational databases provide are not necessary, and you need either simplicity of usage or simplicity of deployment. Or, in rare cases, speed; but whereas most people seem to act as if this is the main reason, I consider it a relatively poor one. A relational DB with its guarantees turned off (i.e., no relations, no transactions, tables with just a key and a value) performs fairly close to a key/value store for a wide range of scaling needs. There is a top end where this matters, but fewer programmers need this than think they need this. Still, it is a valid concern, and if you don't need relational guarantees it may let you scale down the instance size.
Most of the cases I see where key/value stores are a good choice relate more to the simplicity than performance. I love me my Postgres, but nothing compares to the simplicity of just tossing up a Memcache somewhere and solving my problem, if that's what my problem calls for. No schema, no migrations, an API so simple my local programming language may well simply integrate it with my native associative array syntax... if you are careful to use it only where you aren't going to need relational functionality the bang for the buck can be very nice.
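To make that concrete, here's roughly what that simplicity looks like in Python with a plain memcached client (a minimal sketch using pymemcache; the key name and TTL are made up):

```python
# A minimal sketch of the "no schema, no migrations" appeal.
# Assumes a memcached instance on localhost:11211 and pymemcache installed.
from pymemcache.client.base import Client

cache = Client(("localhost", 11211))

# Store and fetch a value as if it were a native associative array.
cache.set("user:42:display_name", "Ada Lovelace", expire=3600)
name = cache.get("user:42:display_name")  # b"Ada Lovelace", or None on a miss
```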
I'm a big fan of relational DBs. But I do still have a Redis instance running in my environment. I know that my relational DB can easily do what Redis is doing but there are two big reasons that I will disclose below for why I still use Redis.
1) The system that is reading the data from Redis is a front-end system. I have concerns with security. As a front-end system, it can be accessed from the Internet and potentially hacked. As such, I make sure this system does not have access to the DB.
2) Preserve DB connections, size (which affects the cost of storing backups and the time to hydrate), CPU, and memory. I would really like to keep my DB as trim and spry as possible. I am using a traditional relational DB, not a distributed one, so in an effort to keep my DB-management complexity down, I offload some work to Redis. I could have created a second relational DB with low security requirements and no relational object mapping, but I didn't think of that till now (I will explore this further in the next few days). In my case the data stored in Redis is accessed a lot and consists of data from multiple tables, so it was a good pick for moving to Redis. Performance-wise, I suspect the relational DB could perform just as fast, but again I want to offload traffic from the DB so I don't have to grow it or go distributed. If I were a better DB admin, I could probably have created views or used other relational DB features. But I felt it was easier to just let my API backend code (which has access to data that is not in the DB or might need to be formatted or calculated) construct the final data object and store a copy in Redis whenever the data is updated.
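That pattern looks roughly like this in Python with redis-py (a hedged sketch; the key scheme, fields, and function names are all hypothetical):

```python
# Hypothetical sketch: the API backend assembles data from multiple tables
# (plus computed fields), then writes the finished object to Redis so the
# front-end system never touches the relational DB directly.
# Assumes redis-py and a Redis instance on localhost.
import json
import redis

r = redis.Redis(host="localhost", port=6379)

def publish_profile(user_id: int, rows_from_db: dict) -> None:
    # Combine relational data with values computed in application code,
    # then store the finished object under a single key.
    profile = {**rows_from_db, "badge_count": len(rows_from_db.get("badges", []))}
    r.set(f"profile:{user_id}", json.dumps(profile))

def read_profile(user_id: int) -> dict | None:
    raw = r.get(f"profile:{user_id}")
    return json.loads(raw) if raw else None
```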
Great writeup. I use a KV store for work, but only because the alternative for our data size (>1PB, >1 trillion keys), sharded SQL, is totally awful. Distributed KV stores like FoundationDB and DynamoDB actually do offer transactions, which is a huge win and, I think, their main selling point.
I guess I'm like you - I really don't understand why someone who had another choice would opt in to a KV store (barring something like memcache, or something really high performance using an embedded KV store).
It'd be pretty silly to use a relational database for something that trivially shards across servers, often doesn't have any consistency requirements, and only needs 3 columns (request, response, expiry time).
I don't think it's even possible to use postgres as a traditional cache. E.g., this trigger looks extremely slow, and I would be unsurprised if a human holding ctrl+shift+r could single-handedly DoS a cache like this: https://stackoverflow.com/questions/26046816/is-there-a-way-...
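For contrast, the native-expiry version of that three-column cache is a one-liner in most KV stores; a hedged redis-py sketch (the key scheme and 300-second TTL are invented):

```python
# Sketch of a request/response cache with native expiry in Redis.
# Assumes redis-py; key naming and TTL are illustrative only.
import redis

r = redis.Redis()
r.set("cache:GET:/api/items", b"<serialized response>", ex=300)  # store handles expiry
cached = r.get("cache:GET:/api/items")  # None once the TTL lapses
```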
> I don't think it's even possible to use postgres as a traditional cache.
The link I posted shows that you can use Postgres as a traditional cache. Specifically, the SO link showing that Postgres cannot do caching (expiring old data) is explicitly addressed in the link I posted.
One would have to go out of their way to miss the point. And miss it entirely.
Theoretically possible is a long way from practical. That's my point. It would take quite a bit of work to do even a subset of what most other solutions offer out of the box, and at most likely much lower performance (the reason to use a cache in the first place).
You can put fucking text files on disk as a cache and run `find -atime X -delete` to clear them; that doesn't mean you should.
That link just says "Use Postgres for caching instead of Redis with UNLOGGED tables". That's not helpful and going one link deeper, it says "reports range from 10% to more than 90% in speed gain" which is not that fast.
> the SO link showing that Postgres cannot do caching (expiring old data) is explicitly addressed
Nope, UNLOGGED tables don't expire old data. They just fill up forever unless you use a trigger like the one in the answer I linked.
I also don't see anyone claiming they can saturate their network interface with postgres the same way they can with any KV store.
I'm sure there's a rate of concurrent writes where that wouldn't be true anymore, but if so, you're likely not going to be benefiting that much from the approach to caching.
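For concreteness, here is roughly what the minimum Postgres-as-cache setup involves; a sketch in Python with psycopg2, where the table name and the periodic-DELETE sweep (a stand-in for the trigger approach linked above) are assumptions:

```python
# Sketch of Postgres-as-cache: an UNLOGGED table skips WAL writes for speed,
# but nothing expires automatically; expiry must be done by hand (here, a
# periodic DELETE). Assumes psycopg2 and a reachable Postgres instance.
import psycopg2

conn = psycopg2.connect("dbname=app user=app")
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE UNLOGGED TABLE IF NOT EXISTS kv_cache (
            key        text PRIMARY KEY,
            value      bytea,
            expires_at timestamptz
        )
    """)

def sweep_expired() -> None:
    # The part Redis gives you for free: something has to run this regularly.
    with conn, conn.cursor() as cur:
        cur.execute("DELETE FROM kv_cache WHERE expires_at < now()")
```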
The ideal use-case is when there is one, and only one, property used for retrieval which is guaranteed unique by the system. Less ideal, but often very performant, is when retrieval always uses one property which may not be unique.
Once retrieval requires anything other than a single predefined property, querying key-value persistent stores degrades into linear searches.
Relational DBs are mostly keyval stores inside. An index is a keyval-store projection of another keyval store (keyed, say, on another subset of the value). B-tree indices and hashmaps are ways to represent keyval stores for quick look-up (b-trees are convenient as they allow range look-ups and are automatically sorted, while hashmaps aren't, but have lower overhead for lookup and storage).
In essence everything is keyval. A sparse array is an ordered keyval store with integer keys (also technically everything is ordered, too, but some orders are stable, and useful, while others aren't). A dense array is an adjacency-optimized version of a sparse array where the key is implicit based on computable offset within a larger dense array, your address space. RAM address space is also variations on that theme. Raw disk storage. And file systems. Everything is. Maybe I spend too much time messing with storage, but I can't see it any other way at this point.
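A toy Python illustration of that framing: a secondary index is just another key-value map whose keys come from a subset of the values (all names here are invented):

```python
# Toy sketch: an "index" is a second keyval store projecting the first.
# The primary store maps id -> record; the index maps email -> id.
users = {
    1: {"email": "a@example.com", "name": "Ann"},
    2: {"email": "b@example.com", "name": "Bob"},
}
email_index = {rec["email"]: uid for uid, rec in users.items()}

# Point lookup via the index instead of a linear scan over all values.
uid = email_index["b@example.com"]   # 2
record = users[uid]                  # {"email": "b@example.com", "name": "Bob"}
```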
A concrete example would be a user's shopping cart as they build it. You don't need the niceties of a fully ACID-compliant DB; you need write performance and extremely high availability.
That was at least a chief use case spotlighted in the original Dynamo paper by Amazon, which was the precursor to AWS's DynamoDB.
Not to say that couldn’t be done with Postgres but of course they were dealing with insane scale on Amazon Day.
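For a sense of the access pattern, a hedged boto3 sketch of a cart keyed only by its cart ID (the table name, key, and attributes are all made up):

```python
# Sketch of a cart keyed solely by cart_id, the single predefined lookup key.
# Assumes boto3, AWS credentials, and a DynamoDB table named "carts" with
# partition key "cart_id" (all names here are illustrative).
import boto3

carts = boto3.resource("dynamodb").Table("carts")

def save_cart(cart_id: str, items: dict) -> None:
    # Overwrite the whole cart object under its single key.
    carts.put_item(Item={"cart_id": cart_id, "items": items})

def get_cart(cart_id: str) -> dict:
    # The only retrieval path: a point lookup by cart_id.
    return carts.get_item(Key={"cart_id": cart_id}).get("Item", {})
```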
It's a common misconception that modern non-relational stores (such as DynamoDB) aren't ACID compliant. DynamoDB has offered ACID transactions, even across tables, for several years now.
Not that you're saying they don't, but some people might interpret your comment that way.
> DynamoDB offers ACID transactions, even across tables, as of several years ago.
It depends on how DynamoDB is used[0]:
> Transactional operations provide atomicity, consistency, isolation, and durability (ACID) guarantees only within the region where the write is made originally. Transactions are not supported across regions in global tables.
Granted, this likely handles most use-cases and the restriction enforced makes complete sense.
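For illustration, a hedged boto3 sketch of such a single-region, cross-table transaction (table names, keys, and attributes are invented):

```python
# Sketch of an ACID transaction spanning two tables within one region.
# Assumes boto3 and AWS credentials; all table/attribute names are illustrative.
import boto3

ddb = boto3.client("dynamodb")
ddb.transact_write_items(
    TransactItems=[
        {"Put": {
            "TableName": "orders",
            "Item": {"order_id": {"S": "o-123"}, "status": {"S": "placed"}},
        }},
        {"Update": {
            "TableName": "inventory",
            "Key": {"sku": {"S": "sku-9"}},
            "UpdateExpression": "SET stock = stock - :n",
            "ConditionExpression": "stock >= :n",
            "ExpressionAttributeValues": {":n": {"N": "1"}},
        }},
    ]
)  # both writes commit atomically, or neither does
```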
> A concrete example would be a user's shopping cart as they build it. You don't need the niceties of a fully ACID-compliant DB; you need write performance and extremely high availability.
Using a key-value store for shopping carts can work for a while, especially for the use case you describe, but fails when system functionality grows beyond retrieving only by a cart ID.
And when using a persistent store which does not provide ACID capabilities, the system will ultimately have to enforce at least atomicity and consistency via server logic.
> when system functionality grows beyond retrieving only by a cart ID.
Either you manage your system so it never ever uses any key other than a cart ID (services have been running for decades keeping the same unique ID; that's not some unreasonable thing).
Or you migrate your data to match the completely wild new requirement, and taking costly steps to deal with a fundamental business change would be seen as reasonable in most orgs.
At my old job we used DynamoDB in a microservice architecture and I always thought it was a perfect fit. Not sure about JunoDB, but DynamoDB is notoriously difficult to index for various querying patterns. This seemed like less of an issue for microservices because the models are generally very compact and easy to query.
I think FoundationDB could meet the "extreme scale, security, and availability needs" of PayPal. I'd bet Apple's needs are more extreme, and they've shown ~500-core clusters doing well into the millions of ops/s.
This is based on RocksDB, which is a "sorted key-value" store like LevelDB (HBase, Hypertable, etc.) that keeps sorting and merging keys while flushing to disk at certain points.
Redis, in comparison, is a different thing, if I'm not wrong.
It's a valid point - RocksDB can be thought of as the internal backend implementation of the DB. It's the front-end API that matters as to whether Redis could fulfill the same purpose.
Juno is a disk-based store - a closer comparison would be to Mongo. And back when Juno development started, Mongo would not have seemed like a very good option (if it even is today.)
MongoDB is a document database not a key-value store.
That distinction is massively important when you're talking about distributed systems, as key-value has far fewer edge cases to consider. Also, its architecture is quite different, as it doesn't have the concept of proxies.
Basically in the realm of databases the two are nothing alike.
Right and don’t you understand that Redis has the availability guarantees, scaling constraints, and memory architecture for all use cases?! Why would you ever need a different KV store when you have Redis!
If you're smarter than the hundreds of engineers designing and building the distributed systems powering the world's largest applications then well done.
Personally, whenever I see a system like this I try to look for why they didn't go with an existing solution. And in 99% of cases either an existing solution never existed or they had a unique requirement that necessitated building something from scratch.
Honestly, I work at a FAANG and, sadly, the answer is often a combination of people paying absolutely zero attention to tech that's NIH (e.g. SWEs working on DBs who have never used Postgres) and the fact that no one gets promoted for implementing a solution using 3rd-party software. The system is set up such that you need to show you've developed something of sufficient complexity, and using OSS just doesn't look good in that context.
I'd expect an engineer designing a new database, even for internal use, to be familiar with the competition's strengths/weaknesses even if they're going in a different direction.
>If you're smarter than the hundreds of engineers designing and building the distributed systems powering the world's largest applications then well done.
There are plenty of absolute code abominations powering the "world's largest applications".
Engineer competence is also only vaguely related to the quality of the infrastructure: yes, you need smart people to make big complex things, but you also need smart people to manage ungodly legacy enterprise spaghetti.
You're assuming a level of engineering/scientific "purity" that doesn't really exist within orgs to the idealistic extent that we would like to imagine. I'm not saying that I'm smarter than every FAANG engineer, but having met quite a few and worked in large orgs myself, it's easy to say some of these tools come from pride/arrogance/a need to have your name on something/a fundamental misunderstanding of existing tooling. Not necessarily because they're unraveling parts of the universe yet unexplored and need a bespoke weapon to tackle new issues.
Syria is a nation under heavy sanctions; "Alep" is the old name for Aleppo, apparently still in use. Governments around the world demand financial institutions prevent fraud, funding of terrorist organisations, money laundering, etc., and can levy heavy fines if they don't do enough, so you end up with this kind of stuff, for better or worse.
I had a friend that worked for ISIS: Innovative Solutions In Space. They had a lot of problems with all sorts of financial institutions once that other ISIS started becoming better known. They've since rebranded to ISISPACE for that reason.
Seems to be based on RocksDB. But I wonder if its persistence is like Redis's persistence (where persistence is just snapshot/txn-log style).
> JunoDB storage server instances accept operation requests from proxy and store data in memory or persistent storage using RocksDB. Each storage server instance is responsible for a set of shards, ensuring smooth and efficient data storage and management.
Redis essentially can’t store more data than what fits in RAM.
While it has persistence options, they’re for durability and backup, not to increase the storage available.
JunoDB appears to store data primarily on disk, limits storage by disk size, and then perhaps caches in memory as necessary. Quite different in behaviour and trade-offs from Redis.
Kansas is a state in the United States of America. While DBs and caching systems do operate in Kansas, the state of Kansas itself cannot easily be utilized as a key value store. Mainly because Kansas is not software but a piece of land with people and government. While it is reasonable to assume that the people and government of Kansas can create a key value store, it would require a law change to consider said software to be officially recognized as being part of Kansas.