Red Hat Satellite to standardize on PostgreSQL backend (redhat.com)
420 points by pplonski86 32 days ago | 285 comments

Rule #1: Always start with PostgreSQL unless you have a very compelling reason not to.

Always start with an in-memory data structure with straightforward persistence (i.e. load and save it all at once, using some popular format).

If you need ACID, then start with SQLite.

If you need scalability as well, then PostgreSQL. Most other features aren't worth the hassle of configuration compared to a single file on disk and a library to link to.

(But I suspect that most NoSQL apps these days would do just fine with SQLite, perf-wise. As an industry, we don't have a good sense of scale.)
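The load-it-all/save-it-all starting point from the comment above fits in a few lines of Python; this is a minimal sketch, with the file name and data shape invented for illustration:

```python
import json
from pathlib import Path

STATE_FILE = Path("state.json")  # hypothetical location

def load_state():
    """Load the entire application state at once."""
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())
    return {"todos": []}  # default empty state

def save_state(state):
    """Persist the entire state in one shot."""
    STATE_FILE.write_text(json.dumps(state, indent=2))

state = load_state()
state["todos"].append({"title": "ship it", "done": False})
save_state(state)
```

There is no schema, no driver, and no query language; the trade-off is that writes are not atomic unless you add an atomic-rename step, which the thread gets into below the fold.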

For most languages and frameworks, using Postgres as opposed to sqlite is barely a higher cost if you’ve got the ops skills. I’ve yet to see an app that straddled this local/web-scale line where SQLite was a better choice or Postgres wasn’t the obvious choice.

There isn’t some massive cost to using Postgres instead, is there some desktop divide I’m missing here?

From: https://www.sqlite.org/whentouse.html

Appropriate Uses For SQLite:

SQLite is not directly comparable to client/server SQL database engines such as MySQL, Oracle, PostgreSQL, or SQL Server since SQLite is trying to solve a different problem.

Client/server SQL database engines strive to implement a shared repository of enterprise data. They emphasize scalability, concurrency, centralization, and control. SQLite strives to provide local data storage for individual applications and devices. SQLite emphasizes economy, efficiency, reliability, independence, and simplicity.

SQLite does not compete with client/server databases. SQLite competes with fopen().

I like the way I recall someone (Richard Hipp?) putting it: SQLite is a file format.

SQLite is an efficient file format for relational data with excellent tooling.

Would you use Postgres for developing something like, say, iTunes? I’d think SQLite would be a much more “natural” fit there.

For an app developed with Postgres, you need a Postgres instance running as a prerequisite. Whereas with SQLite, you just need to use a small library, there is no server or daemon involved.

sqlite doesn't require a server, right? It's just "read the database file", whereas Postgres requires a running server as well as the client processes.

Granted, if your ops are at the stage where "run postgres myself" is a pain, using some managed DB (heroku?) probably makes more sense than sticking to the sqlite file, if only because you'll be paying for "real backups" and availability.

I can use SQLite for prototyping without db server.

It's a file or some in-memory buffer, I don't need docker or anything else to run some very simple unit/integration tests with the DB.
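In Python, for example, such a test needs nothing but the standard library. A sketch, with the table and assertions invented for illustration:

```python
import sqlite3
import unittest

class TodoStoreTest(unittest.TestCase):
    """Hypothetical integration test: the whole database lives in this
    process, so there's no server, container, or fixture file to manage."""

    def setUp(self):
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE todos (id INTEGER PRIMARY KEY, title TEXT NOT NULL)")

    def tearDown(self):
        self.db.close()

    def test_insert_and_query(self):
        self.db.execute("INSERT INTO todos (title) VALUES (?)", ("write tests",))
        (count,) = self.db.execute("SELECT count(*) FROM todos").fetchone()
        self.assertEqual(count, 1)
```

Every test gets a pristine database, and nothing has to be cleaned up afterwards because the whole thing vanishes with the connection.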

That seems like good advice for smaller-scale web development, and extremely good advice for single-user applications.

If you're working on a larger scale, or expect to be growing into a larger scale, though, then I think it's wise to skip straight to using a full-blown RDBMS. It's not about scalability, per se - very, very few projects need more performance than SQLite can deliver - so much as about manageability. A top-tier RDBMS will have a better story for resiliency and availability, it will have a more robust security model, it will handle concurrency better, it will have a more complete SQL dialect that allows you to handle more complex queries while keeping them more maintainable, etc.

This is one of the few spots where I think YAGNI doesn't apply. In expectation, it's cheaper to start on the RDBMS than it is to start small and grow as needed. The costs of switching databases are so high, the costs of outgrowing your data layer are even higher, and the costs of trying (and, usually, failing) to keep your data layer database-agnostic are higher still. And in that particular problem domain, the likelihood of eventually being able to benefit from an industrial-grade RDBMS is easy to anticipate.

I really like Firebird's model where you can either access a database file directly (like SQLite) or through a server process (like Postgres). Switch anytime you want. Best of both worlds. How lovely it would be if Postgres were to support something like that.

Yup this.

People don't realize just how performant SQLite can be. If you are running just one server and probably won't need more, then SQLite is more than enough.

2 application servers and up is where PostgreSQL can come into play.


After reading the SQLite when-to-use document, I transitioned some smaller marketing sites over from Postgres, because it was overkill for the specific situation. For multi-app-server applications, or ones that need to scale, SQLite might not be the best fit, but it still has its use cases, like copying over a subset of the original data and running queries locally instead of hitting a production database.

If only SQLite had a decimal type. I would be using it a lot more. I know all the workarounds but they are tedious and error prone.

But postgresql gives you not just the standard SQL database. It also does pubsub, timeseries, and lots more. It's fairly lean in itself and does not induce much overhead. Can run locally just with a unix socket and thus be just as easy as sqlite.

The biggest problem with SQLite is that it seems to be easy to use it wrong. If you are using SQLite as it is supposed to be used, it has much better performance than its reputation suggests.

In what way?

I think the easiest way to do it wrong is to close the file every time you finish an operation. With the default settings, closing will sync and wait until the filesystem completes its writes. So unless you won't touch the DB for a long time, keep the file open. If you want to access the DB in a multi-threaded environment, you'd better let one process manage the database.

I guess 90% of performance problems with SQLite come from the filesystem.

Not quite sure but Quora suggests

>[slow]...This is because the entire database was locked every time someone viewed the page because it contained updates/inserts.

So avoid loads of inserts I guess https://stackoverflow.com/questions/54998/how-scalable-is-sq...

There's plenty of use cases where you want to record analytics or metrics for each page load though -- so I guess SQLite simply isn't suitable for that?

Depends on how many concurrent requests you have.

Writes require an exclusive file handle/lock (only one write can happen at a time), so read heavy uses are usually a better match.
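Because only one write can happen at a time, batching many inserts into a single transaction keeps the exclusive-lock window short. A minimal Python sketch (table name and data invented):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE hits (path TEXT NOT NULL, status INTEGER NOT NULL)")

rows = [(f"/page/{i}", 200) for i in range(1000)]

# One transaction: the exclusive write lock is taken once for the whole
# batch, instead of once per INSERT as in autocommit mode.
with db:  # commits on success, rolls back on error
    db.executemany("INSERT INTO hits (path, status) VALUES (?, ?)", rows)

(count,) = db.execute("SELECT count(*) FROM hits").fetchone()
```

For write-heavy workloads like per-request analytics, this batching (or SQLite's WAL journal mode) is usually the difference between "unusable" and "fine".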

"Always start with an in-memory data structure with straightforward persistence (i.e. load and save it all at once, using some popular format)."

That sounds like a bad idea in many cases. Prone to many kinds of problems, like accidentally corrupting the stored state beyond recognition from a tiny application bug.

How exactly would you corrupt it? For all the stuff that's in-memory, it's as safe as your language is. If you're writing it to disk all at once, it's easy to implement this atomically with rename - this guarantees that you have either the old file or the new file, but not some weird in-between state.

And if you mean corrupting by the app going amok and asking to corrupt data because of logic errors in the app itself, that's just as doable with a database, and especially doable with local-file DBs (which usually don't try to limit what you can do).

Create some ID that's supposed to be unique, and due to a race condition, it's not. Then, before you know it, the data is a mess.

Unique keys, foreign keys, check constraints, and DDL changes were all meant to solve real problems. I guess some apps might not benefit as much, but it seems like asking for trouble.

Given how easy it is to just use postgres, why invite all the risk?

If your data is in-memory, why would you have IDs and foreign keys? You'd just have objects referencing other objects (or maps referencing maps, or whatever is idiomatic). I wasn't talking about re-implementing a relational database in-memory, but simply working with data in the way that is natural in your language.

If you do actually need advanced schema features, then I agree, just use PostgreSQL. But none of the things that you've listed are advanced features, and all of them have some trivial equivalent for in-memory data.

Side note: do you maybe have an image of a very particular kind of app in mind? Like, a moderately sized web app? Keep in mind that I'm talking in general here, about web/desktop/mobile, and about all sizes. And in practice, even for web apps, people often underestimate the actual usage and perf requirements.

If you make code changes, that usually means you need to adjust your data structures, too. That means you need to manage multiple on disk versions, or maybe migrate the old data to the new version.

I mean, you can do all of this stuff another way. But PostgreSQL represents a lot of lessons learned over decades with all kinds of applications.

Sure, there may be some level at which it's just easier to serialize the in-memory data. But that level is lower than most people realize and postgres is pretty darn easy to use given all the kinds of problems it helps you avoid.

Even an MVP of a trivial web app is probably fastest to deliver with postgres, and less risky.

And there are tons of things that are really convenient even for the tiniest app. For instance, being able to go in and browse your data to find out what's causing a minor bug.

I had a python script that was roughly:

read file, unpickle, do stuff in memory, pickle, write.

A bug in "do stuff in memory" could easily muck things up and I didn't get round to renaming the file for every edit. Yeah you could do some auto rename thing but it might be simpler to just use sqlite.

> I didn't get round to renaming the file for every edit.

That was your problem then, no? If you don't do that from the get go, you're essentially saying that you don't care about atomicity, consistency and durability.

> Yeah you could do some auto rename thing but it might be simpler to just use sqlite.

You mean atomic rename? The simplest way is to just use a library that does it for you. One does exist for Python. It takes 2 lines of code to safely dump your JSON data with it.

> Always start with an in-memory data structure with straightforward persistence (i.e. load and save it all at once, using some popular format).

Do you have any examples of this? It's rare I see suggestions like this.

I've done this for HTML mobile apps a few times before where the entire app (fairly simple) state is saved and loaded as JSON when you make any changes. It's super simple and very little can go wrong. This was as opposed to using Web SQL (which wasn't deprecated at the time) where it's much more complex to keep what's in-memory synchronised with what's in the database.

Edit: I'm talking about client side code to store local state here.

I mean, most apps that have configs treat them in this manner. It's just that people assume that configs and other metadata are "small enough", and the actual data they process is not. But when you look at what "small enough" really means on modern hardware - heck, even on 10 year old hardware! - that divide doesn't really follow.

The same thing goes for queries, by the way. When it's in-memory, you often don't even need indices, because linear scans will be "fast enough" on many datasets. And with list comprehensions etc, this can be written just as declaratively as SQL (for an extreme example, see LINQ, which even has outer joins).

And yeah, the "very little can go wrong" part is exactly what's so great about this. It's KISS at work. If you don't really need complexity, why pay the tax?
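The "just as declarative as SQL" claim above is easy to see in a few lines; here is a sketch with invented data and an invented query:

```python
from dataclasses import dataclass

@dataclass
class User:
    name: str
    age: int
    active: bool

users = [
    User("alice", 34, True),
    User("bob", 19, False),
    User("carol", 27, True),
]

# Roughly: SELECT name FROM users WHERE active AND age >= 21 ORDER BY age;
# a linear scan plus a sort -- no index needed at this scale.
names = [u.name for u in sorted(users, key=lambda u: u.age)
         if u.active and u.age >= 21]
```

For a few thousand records, a scan like this finishes in microseconds, which is the point: the query stays readable without any database in the picture.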

i struggle to see why in-memory data structure is the way to start can you please explain. do you create json for all your users?

You create a JSON (for example) for all your users when you need to store it. But in memory, it's just a collection of User objects - or whatever is idiomatic for your PL. And you query it with the same tools your language offers - e.g. sequence comprehensions. So there's no impedance mismatch, and no need for the vastly more complicated code that is needed to bridge it. Even if it's not your code - i.e. if you're using an ORM or something similar - it's still unneeded complexity, if you don't have too many users.

Think of it from the opposite perspective - if you can work directly with in-memory data, why would you prefer to run SQL queries instead? The latter is obviously more complex, but what's the advantage? There are many valid cases that have a good answer to that question, but there are many more that don't.

I can't help but think that we're conditioned to use DBMS (SQL or not) largely by inertia. When we learn about them, the examples are necessarily toy ones - tables with a dozen records, that sort of thing. But one side effect is that it subtly normalizes the notion that this much data warrants a dedicated DBMS - so people don't balk at even patently ridiculous setups, like a separate SQL server used to hold a grand total of a couple thousand records in all its tables (this is a real thing, something that I personally did in a LOB app many years ago; I didn't have a good justification, it was pure cargo cult, and the app could have done everything it did in-memory, faster, and easier to code).

And sometimes, when you ask, people say, "yeah, it's only 1000 records now, but we're going to grow later, and then we'll need a DBMS". But you'd need to grow by many orders of magnitude to get there - and if you get that opportunity, you'll also have the resources to adapt. But everybody dreams of being Google.

And then you have apps like, say, a todo list manager app. Is it ever going to deal with millions of todo records? I don't think so. Then why does it need an SQLite DB?

That sounds extremely complicated because you have to do the vast majority of work yourself.

> So there's no impedance mismatch, and no need for the vastly more complicated code that is needed to bridge it.

Serializing arbitrary data structures is far from trivial, especially because JSON doesn't support references natively, and therefore things like cycles are impossible without adding IDs yourself, which is incompatible across JSON parsers. The naive way to serialize objects to JSON also tends to cause duplicated objects for every reference.

Or you can just use any SQL Database and use an ORM, write your damn Domain Classes [0], call .save() when you want to save and then query the damn thing [1] and if you want eager loading you can configure that too. Honestly it's so frictionless I don't see why I should bother with whatever your preferred workflow is.

[0] http://gorm.grails.org/latest/hibernate/manual/index.html#do... [1] http://gorm.grails.org/latest/hibernate/manual/index.html#cr...

I'm sure most complaints against ORMs happen because newcomers, who don't have sufficient knowledge of relational databases, try to model arbitrarily complex objects that cannot be mapped to a relational database (and not to JSON without extra support by the parser either) or because they think they don't have to write queries at all. In other words: they are fundamentally trying to swim against the stream and then blame the tool. If you know of the limitations of relational databases then ORMs are just a tool that saves you time and lets you avoid boring work.

I don't think the number of records is the right benchmark.

The reason why data is better kept in a database system is because it often has a very different life cycle than applications. It often lives longer and gets used by more than one application. That's why modelling and storing data somewhat separately from applications often makes sense regardless of the amount of data you have.

Also, procedural code is often far more complicated than a SQL query regardless of the number of records being processed. But that obviously depends very much on the specific problem, on the programming language and on the developer's skill set.

> the reason why data is better kept in a database system is because it often has a very different life cycle than applications. It often lives longer and gets used by more than one application.

It sounds like you're thinking about web and enterprise apps that share data sources. For those, absolutely, use DBMS. You don't want to solve concurrency and consistency yourself.

I was talking more broadly, of apps of all kinds. Does a TODO list app for your phone need a DBMS? There's not going to be any concurrent data access there, nor huge amounts of data, nor complicated queries. But that describes most apps, and most of their data! Even for web apps, an exclusive data store is more common than shared (just think about all the WordPress blogs online!).

> Also, procedural code is often far more complicated than a SQL query regardless of the number of records being processed.

If your language doesn't have some kind of declarative query framework for sequences, you're probably better off with SQL. But these days, who doesn't have that? Even Java caught up.

(I'd single out C++/STL for offering non-composable primitives, until ranges get into the standard. But C++ is a wrong choice for a data-heavy app in general, IMO. )

Unfortunately, I've rarely seen such a simple thing as file saving implemented correctly by many developers. Missing fsync(), missing checking of errors on close(), in-place non-atomic data overwrites. One of the good things about using SQLite/Postgres/MySQL is that at least they save your data correctly.

> i struggle to see why in-memory data structure is the way to start can you please explain.

It’s much simpler. The idea is to start simple and avoid complexity. Do you need the extra complexity of a PostgreSQL instance or can you get away with doing it in-memory? If so you can save yourself a lot of complexity.

> If so you can save yourself a lot of complexity.

The complexity exists, you're just shifting it around some. The "complexity" of setting up a db server is non-zero, but generally a known quantity. The complexity of managing your own stuff "in-memory", with all the attendant issues around concurrency/corruption/performance/etc., would almost always end up being far more custom to your particular project.

If that were true, we wouldn't store any data in memory at all - everything would be in the database to avoid "the attendant issues about concurrency/corruption/performance/etc".

But we don't. Perhaps because the attendant issues often aren't issues in practice? There's no concurrency problem if it's just your app working with data, and it's logically single threaded (i.e. possibly async, but serialized) - which is by far the most common model for non-web apps. There's also no concurrency problem if you use immutable data structures. There's no corruption problem if you are using a memory-safe language, and your data structures enforce their invariants, as they should in any good design regardless of anything else. Performance is going to be better, often significantly so, until you have enough data that dumb queries over it are slower than the overhead of communicating with the database. And so on.

So, no, the complexity doesn't necessarily exist. True, the complexity of your data, and of the kind of processing that you need to do with it, is constant. But we're talking about the complexity of the pipeline you use to access that data - and there, you can absolutely make it more complicated than it needs to be, and I'm arguing that that is what usually happens in practice.

surely you are using angular, nodejs or react, right? why not just raw html and css? it's much simpler

To clarify, I was talking about client-side code where there isn't a server component e.g. so your choices are usually to store in files, SQLite or IndexedDB.

"popular format", sqlite, Postgres are quite sensible.

My personal mix also contains RocksDB (LMDB is interesting, haven't used it yet) and Redis, as potential alternatives to sqlite and Postgres, depending on the use case.

You can actually use postgresql for pubsub and thus instead of redis in some use cases.

In-memory or local disk databases aren't going to work for any real world use case, regardless of how small or simple your application is. Using Postgres or MySQL is barely more complex than SQLite, and it makes more sense to start with one of them than move everything over once you realize you aren't going to be running on your dev machine forever.

After developing with Postgres for several years and then moving to sqlite for dev, sqlite is a breath of fresh air. Now our policy is that everything goes into sqlite db unless the data requires multiple writers. The fact that you have to keep a process running and updated is reason enough to not use Postgres if you can help it.

For instance, we have processes that fetch and store data from APIs. It's trivial to share this with other developers for dev purposes. This data is read only, and it removes an annoying step of loading the data into Postgres.

This is coming from someone who started projects with only Postgres for many years, and I do love Postgres. SQLite is an underrated tool in the web dev world.

Really though, I would bet 95% of wordpress websites will make do with sqlite.

I'm using sqlite for a small project but I was kind of disappointed to learn how annoying it is to drop or rename columns. Especially when you write migration scripts with up/down statements, it's kind of a pain compared to postgres.

But otherwise it's been fine to use, a good starting point for internal projects.

How do you browse the data on the remote server with any GUI tools?

Last time I checked not even DataGrip could do it.

> In-memory or local disk databases aren't going to work for any real world use case, regardless of how small or simple your application is.

You'll have to provide some proof for that assertion, especially given that there are numerous examples of apps doing exactly that. For example, Firefox stores bookmarks and history in an SQLite database.

I'm pretty sure he's talking about web applications...

Sqlite is only really viable if you don't have an unbounded number of nodes pushing commits into the DB

That's a challenge desktop applications don't have

For many web apps it's true simply because of the expected usage, or sometimes because of access patterns (e.g. when most clients are read-only, or when most users don't author anything).

From https://www.sqlite.org/whentouse.html:

"The SQLite website (https://www.sqlite.org/) uses SQLite itself, of course, and as of this writing (2015) it handles about 400K to 500K HTTP requests per day, about 15-20% of which are dynamic pages touching the database. Dynamic content uses about 200 SQL statements per webpage. This setup runs on a single VM that shares a physical server with 23 others and yet still keeps the load average below 0.1 most of the time."

How can you build and live with a page that runs 200 queries for every access?

That's a website, not a webapp... It's literally the first sentence.

you need interactivity with writes for an actual webapp. sure, you could keep the writes in the browser cache, but then it wouldn't really be a webapp either, more like an app that opens in the browser.

I have numerous real-world use cases in production and making money that use in-memory and local disk databases. In fact, it's mostly hashmaps and JSON files.

There was a time when it was a shoo-in to also say "if you only need a key-value store, use Berkeley DB".

SQLite is really great, but as a datastore for a web app or a web service, it lacks a few important things to my eyes:

- Cannot rename or drop a column
- Cannot create an index concurrently (without blocking writes)
- No remote protocol (you have to use the CLI through SSH to run ad-hoc queries)

Column rename was added recently.

Thanks! I missed this.

Regarding your first point, even simpler could be to store the JSON as a string in sqlite, to avoid all the file handling stuff - but I would say that sqlite is already a popular, solid application format ;)

But the only file handling stuff you need to load JSON from a file is to open the file. And if it's in SQLite, then you'll need to open the database and query - more complicated, no?

I think it's more complex than that: when you modify some data structure in memory, you also want to write it to disk immediately, so as not to lose any data in case of a crash. That means writing the new file under a separate name, and then renaming it to replace the old file only after it has been successfully written, to provide safety against storage corruption.

I wouldn't recommend this to a beginner; the sqlite driver (of your language) would handle all this for you in a safe way.

> this means to write the new file under a separate name and then renaming the new file to replace the old file only after it has been successfully written as to provide safety against storage corruption

I agree, that's about the only bit of complexity in all this. But it's completely generic, so it can be made into a small library, and then reused without dealing with the complexity every time. For example:


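In Python the whole recipe fits in one small function using only the standard library; a sketch (function and file names invented):

```python
import json
import os
import tempfile

def atomic_save_json(path, data):
    """Write to a temp file in the same directory, flush it to disk,
    then atomically swap it into place with os.replace(). Readers see
    either the old file or the new one, never a half-written state."""
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(data, f)
            f.flush()
            os.fsync(f.fileno())  # make sure the bytes hit the disk
        os.replace(tmp_path, path)  # atomic on POSIX and Windows
    except BaseException:
        os.unlink(tmp_path)  # don't leave the temp file behind
        raise

atomic_save_json("data.json", {"todos": [{"title": "ship it"}]})
```

The temp file must live in the same directory as the target, because rename is only atomic within a filesystem.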
Open the file, parse its content, deal with any errors. SQLite is simpler.

Opening the file and parsing JSON in it is two lines of code in Python.

Errors can happen in both cases - you can have a corrupt SQLite DB, just as you can have invalid JSON. The way you handle them is pretty much identical.

"Very compelling reasons" means different things to different people. Which I suppose is a catch-22. The young but introspective dev may make the wrong call on a side project and learn to stick to psql in the future.

The dev who is convinced "web scale" will arrive before their first AWS bill and picks Mongo as their production database will have a worse fate. As will their team and company by proxy.

Rule #1: Always start with PostgreSQL. You don't have a compelling reason not to.

Is there a nice web interface that is made for PostgreSQL? I’m in love with the power that PHPMyAdmin brings to MySQL. Also there are DigitalOcean Ubuntu droplet images with it all pre-configured.

It always feels way snappier to use a local GUI connected through SSH than PhpMyAdmin and its interface also feels very cluttered to me.

Compelling reason #1: We don't need hot-warm HA and don't want to fight for a database with IT every time we write a homebrew app.

We use SQLite.

If you're writing apps for user computers that have no reason to sync to a central remote database, then SQLite is totally fine. If you're writing an application that runs in a cluster or on the web, Postgres is the way to go. There's almost never a reason to go with Mongo, unless you're just trying to get buzzwords into a presentation and trying to seem hip. That's going away now, though. My place exclusively uses Postgres on all of our webservers.

In 2019.

A LOT of people want MySQL over Postgres out of ignorance about how far Postgres has come, and they don't know/care. I try to explain why I like transactional DDL and the guardrails that come with Postgres.

With MySQL you end up having to use Percona to get a really robust DB setup with proper tooling that compares to Postgres in my experience.

Both, of course, have plenty of known and unknown knobs you can turn to your liking and can make either database appear to be better by any number of metrics for your use case.

> A LOT of people want MySQL over Postgres because of ignorance

or, you know, for multi-master clustering pretty much out of the box: Galera.

It's rarely as nice in reality as it sounds on paper though...

Galera setups aren't easy, and still have complex issues.

And I'm sure one day some new DB will pop up, and for most workloads it'll be better. People will still use PostgreSQL because it was the right choice in 2019. So the cycle will repeat...

Without a doubt. Lots of movement around specializations (ML/AI datasets) and decentralization...

When the tasks we're solving are fundamentally different, then I expect the tools to support those tasks will be fundamentally different, too.

Why not something even simpler like H2?

Java dependency.

A silly, genuine question: when I'm writing a NodeJS app and I need to store a JSON document in a DB that looks like this:

{ "a" : "b", "c" : { "ca":"cb", "cd":"ce" }, "d": { ... } }

How do I store it? To me, NoSQL seemed like the choice in the past.

This is not difficult in relational DBs, but it requires rather long SQL commands, and SQL table design. It seems to me that "MongoDB.insert(obj)" seems like the way to go, as it is much simpler.

How do experienced Node devs solve this issue? Is it simply relational all the way? Is it Postgres's JSON field?

> How do I store it? To me, NoSQL seemed like the choice in the past.

If you're using postgres, you could use a column of type jsonb. Postgres comes with many operators and functions (https://www.postgresql.org/docs/11/functions-json.html) to query into jsonb typed columns and many of them even allow index usage for very quick access.

However, the other option would be to use proper SQL tables and normalization to store your data. Then you can query very easily for your data and by making use of a real typed schema, you get free validation of your input and protection against corrupted data later down the line.

And aggregating your information to produce reports over various documents becomes very easy too.

Here's the above schema in SQL DDL, Postgres dialect. I'm not sure about the nullability and types of your values, but by making use of database types and constraints, you can describe the shape of your data much better, which further helps you prevent invalid data from being stored due to bugs:

    create table things (
        id bigserial primary key,
        a text not null
    );

    create table c_things (
        thing_id bigint not null 
            primary key 
            references things (id) on delete cascade,
        ca text not null,
        cd text not null
    );

    create table d_things (
        thing_id bigint not null 
            primary key 
            references things (id) on delete cascade
        -- ...
    );
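To get a feel for the jsonb option mentioned above, here is an illustrative sketch. It uses SQLite's built-in JSON functions (assuming the JSON1 extension is compiled in, as it is in most modern builds) so it runs without a server, but the idea mirrors Postgres's jsonb operators, where the extraction below would be written data->'c'->>'ca':

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
# A plain TEXT column here; in Postgres you would declare it as jsonb.
db.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, data TEXT)")

doc = {"a": "b", "c": {"ca": "cb", "cd": "ce"}}
db.execute("INSERT INTO docs (data) VALUES (?)", (json.dumps(doc),))

# Pull a nested value out of the stored document.
(value,) = db.execute(
    "SELECT json_extract(data, '$.c.ca') FROM docs WHERE id = 1"
).fetchone()
```

The insert really is one `MongoDB.insert(obj)`-style call; the difference is that querying into the document stays in SQL, alongside whatever relational tables you also have.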

Or you can store the jsonb in Postgres and use views to also present it in a normalized way. See example here: https://www.endpoint.com/blog/2016/02/29/converting-json-to-...

In .NET I use Marten to use PostgreSQL as a document database, gives you a typed schema with JSONB storage.

postgres has a jsonb datatype that works pretty much like mongodb.


And other regular RDBMS like MariaDB and MySQL have much the same functionality. I store JSON in RDBMS all the time.

Interesting - I never knew that. I use postgresql, so I don't have first-hand knowledge, but I saw this


> JSON support: PostgreSQL supports JSON and JSONB while MariaDB doesn’t. It supports an alias for JSON instead, which is a LONGTEXT column.

Even the proposed json column is not indexed; postgresql's jsonb can be indexed and queried fast.

Yeah MariaDB got 'dynamic columns' first, which iirc supported key values but not arrays, and then MySQL added a more complete 'json' data type, so the exact functionality differs by flavour and version.

The postgres support is simply nicer to use and more powerful, e.g. the indexing.

But if your RDBMS is any fairly-recent version of MariaDB or MySQL, you can use JSON too.

Off the top of my head, the thing that grates most about the MySQL syntax is you have to do

instead of postgres-style

Whereas what I really want to be able to do is

But most of the consumers of these modern 'dynamic columns' are apps and then it all really matters less if you don't spend all your days at the sql prompt writing stuff by hand.

Didn't PostgreSQL get hstore before MariaDB got dynamic columns?

Yeah, it seems what I wrote could be ambiguous, but in that first para I was meaning to compare MariaDB and MySQL, not claim that MariaDB beat PostgreSQL.

Note that MySQL's `JSON` type uses an efficient binary storage scheme which allows fast lookups etc. https://dev.mysql.com/doc/refman/8.0/en/json.html

My favorite MySQL JSON feature is probably JSON_TABLE, which converts a JSON document on the fly into a table which can then be processed using SQL features. https://dev.mysql.com/doc/refman/8.0/en/json-table-functions...
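For anyone who hasn't seen it, a minimal sketch (made-up document, MySQL 8.0):

    select jt.name, jt.qty
    from json_table(
        '[{"name": "a", "qty": 1}, {"name": "b", "qty": 2}]',
        '$[*]' columns (
            name varchar(10) path '$.name',
            qty  int         path '$.qty'
        )
    ) as jt;

Each array element becomes a row, so the result can be joined, filtered and aggregated like any ordinary table.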

(Disclaimer: I work for Oracle's MySQL Engineering team)

But will it perform as well as a dedicated NoSQL DB? Will the abilities match? The queries? There usually is a tradeoff when something is forced into something alien. Is that the case here too?

IIRC, Postgres uses a key-value store internally, so it does not really matter if you store records (i.e. tables) or objects (i.e. JSON documents).

There was an article on HN a couple of months ago that was comparing Mongo and PgSQL performance, and PgSQL ended up being as fast if not faster for most use cases.

I don't see how you'd describe postgres validly as a key-value store at the storage level / internally. It's a pretty traditional row-store (without support for index oriented tables), i.e. all columns of a table are stored together. At the storage level there's no designation of what the key of a row is. Of course you can have primary keys, but that's a higher layer concern.

Aren't tables a bit like key-value stores where the key is an object ID and the value is the column values? My understanding is very vague, but I think that came out of PostgreSQL's heritage as an "object-relational" database, even if it's vestigial today.

Since 8.1 PostgreSQL does not work this way anymore.

The oid type is currently implemented as an unsigned four-byte integer. Therefore, it is not large enough to provide database-wide uniqueness in large databases, or even in large individual tables. So, using a user-created table's OID column as a primary key is discouraged. OIDs are best used only for references to system tables.[1]

Besides, "object-relational" only means that it supports custom data types and inheritance, and at least custom data types are still a core feature of PostgreSQL.

[1] https://www.postgresql.org/docs/8.1/datatype-oid.html

And in PG 12 we've removed the magic oid columns (as in WITH OIDS), although you can still create one manually (but there'll not be any values assigned automatically).

Can I make queries into nested keys of those objects?


> There usually is a tradeoff when something is forced into something alien.

Exactly! My thoughts whenever I see people using nosql databases.

I suppose the answer, like all db answers are..it depends :)

My take on it is if the structure of "c" isn't likely to vary wildly over different entries and you have use cases that need queries like:

get rows where c.ca == x and/or c.<some other field> == x

...and either the frequency or table size makes it unfeasible to scan through all entries, I'd opt for using a relational storage solution and chuck "c" into its own table with a foreign key entry in the original data structure.

Otherwise a non-relational structure could suit it better, in which case you can still use Postgres or SQL Server or any other relational DBMS that supports json. I'd make the deciding factor in that case more about what I foresee myself using for other use cases, e.g. if I'm already using an RDBMS for other models in the same app then just keep using it, but if not and all my data are arbitrary blobs of JSON then use any ol' document store ¯\_(ツ)_/¯

Commenting from a SQL Server perspective as it's what I know best - there's built-in JSON support, so you're absolutely able to store JSON in columns and interact with it as such.

If you don't mind getting a bit more sophisticated..... Certainly with XML (the JSON tools weren't as mature as the XML ones, though they're hopefully improving) it isn't tricky to sling a document at the database and have it create a whole tree of very standard RDBMS-style table records. Nor is it that tricky to write code which, when given the ID of a record, returned it and its children as a document tree.

Whether it's the right approach for you is another matter, but it's very doable.

You could do it in a JSON field, and I've personally done that but mostly out of laziness. I think in the end it's better to actually come up with a normalized schema.

I typically reserve JSON fields for storing data whose schema I don't control. For example, when Stripe gives back a giant chunk of JSON after a transaction, I store it in a JSON field.
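In Postgres terms that pattern is just something like (hypothetical table; `payload` holds whatever Stripe sent):

    create table stripe_events (
        id         bigserial primary key,
        event_type text not null,
        payload    jsonb not null,
        created_at timestamptz not null default now()
    );

The few fields you actually query get promoted to real columns; the rest stays opaque.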

This sentence, when talking about support for old enterprise versions, is a bit suspicious:

> Satellite will not use newer versions of MongoDB that are licensed under SSPL.

So my read is that there is not only an intention of database consolidation, but also a desire to avoid the SSPL.

Nothing "suspicious" about it.

Red Hat also explicitly evicted Mongo from RHEL 8 because of the SSPL [0], and they have a longstanding policy of only shipping OSS.

It's no secret, and they've been very upfront about not shipping anything SSPL.

[0] Previous HN discussion: https://news.ycombinator.com/item?id=18919543

Indeed a sane choice

Does someone have the time to explain what NoSQL Mongo didn't do that PostgreSQL does?

I guess I don’t really understand the structured SQL vs NoSQL when you aren’t accessing directly but via a program. I also don’t understand why they would have supported a NoSQL and object-relational at the same time at any point.

From reading the post it sounds like they want transactions for some of the functionality they’re trying to build. I guess Mongo still doesn’t have those.

They recently added multi document transactions though.


This may just be my bias against NoSQL (I'll qualify this with: I don't believe it should never be used, but for 90% of use cases psql/MySQL/etc. is likely the better choice these days), but when it comes to choosing a DB engine when you want ACID transactions, I'd pick the one that was built to handle ACID transactions from the start, rather than one that just added them with this large caveat in the middle of the docs:

> In most cases, multi-document transaction incurs a greater performance cost over single document writes, and the availability of multi-document transaction should not be a replacement for effective schema design. For many scenarios, the denormalized data model (embedded documents and arrays) will continue to be optimal for your data and use cases. That is, for many scenarios, modeling your data appropriately will minimize the need for multi-document transactions.

So, to be able to have one of the core features of SQL, you lose the biggest feature of NoSQL, which is super-fast writes. And they try to remind you that if you model your data correctly, by creating denormalized documents with embedded data, you probably won't need them anyway. Which just makes me cringe at having to maintain these huge unstructured documents just so you get faster writes since you don't have to worry about ACID.

Or, you can use Postgres, which was built with ACID from the start, using technology and design patterns proven over decades, and has support for JSONB if you need unstructured/document storage that you can also (fairly) efficiently query.

Maybe I'm just woefully uninformed, but I just can't imagine a use case for Mongo unless you're dealing with Google-level data/traffic, and fortunately in their case, they have enough money to hire enough people that actually know how & when to use Mongo effectively, to the point that it's not just a fancy way to write to /dev/null.

I'd actually be really interested in some non-Google-scale use cases for Mongo, like where Postgres' performance was actually an issue, how much Mongo actually outperforms it, and the trade-offs/issues switching. Most posts I've seen are of the opposite migration, but I want to see what all the hype is about for a DB engine (what I've always seen as one of the less _sexy_ areas of CS, at least in marketability, compared to ML/AI or Programming Languages/Compilers).


The primary fact is, the data they're working with really lends itself to a relational data store more than a non-relational one.

They can leverage traditional SQL databases and achieve vast performance improvements by taking advantage of PostgreSQL's built-in relationship handling.

This is not unique to postgresql, but is more a property of traditional relational databases.

I’ve never enjoyed Mongo for a second, not conceptually, not practically.

But I don’t understand the people who act like Postgres is the end of everything. SQL never felt right for me either. It’s a fine language for business analysis, I guess, but is it actually pleasant for development? Do all of you who so enthusiastically post that it should be the first tool you reach for really never experience object-relational impedance? Do you really enjoy modeling like this?

Yes: what you’re really doing is documenting your data structures and access patterns, which is way less work to do consciously rather than trying to bolt it on ad hoc.

A really big thing is normalization which some people tend to downplay until they’ve had to write code to recurse complex structures enforcing consistency or making global changes. Similarly, atomicity and isolation are really useful characteristics to be able to take for granted without having to code around the problem everywhere.

The usual arc I've seen for document stores, ISAMs, etc. is that people say this is great based on the first 20 minutes and then about a year later realize that they've spent thousands of lines of code implementing a really clunky subset of what they'd get out of the box with a SQL database and the believed ease of use or performance benefits were far less dramatic than promised or even negative.

Right, Postgres is better than document stores, but is it actually pleasant to store your data in square tables and then process it to runtime objects from that? Even with battle tested ORMs, it’s never felt right for me.

I find it very pleasant, but I also tend to avoid ORMs. I can just write SQL queries to get me the data I need in any form I want, and it's not always useful to stuff that into objects when I can just process it as is.

Processing data from format to format is a total drag to me. I just can't do it.

ORM design definitely matters — I mostly use the Django ORM which generally feels right and has convenient escape hatches — but I've also seen a fair number of cases where people complaining about the ORM were really reflecting the fact that their project's data model was somewhat off of what it wanted to be and a migration was in order.

The real win for me is once you're doing anything more than just loading an object: once you need to do any sort of reporting, filtering, querying, etc. I appreciate the effort spent on normalization.

I'm so out of date with Satellite server that my first reaction was: Wow it took this long to get oracle out!

I am quite surprised that companies still use Red Hat Satellite Server. In 2010 while I was still at Red Hat, deploying and managing it was the mainstay of the consulting business, and it was based on Oracle RAC. It was to my mind already legacy software at that time.

Satellite of today is not Satellite of 2010 by any stretch.

Satellite 6 is based on foreman + katello + pulp

And, for anyone who has experience with Satellite 6.0/6.1... it has gotten much, much, much better since then.

What are the good alternatives for managing your Red Hat servers?

I am no fan of MongoDB, but I must admit it has gotten a lot better in the past few years. So, why drop it now?

New license that's not OSS (MongoDB Inc saying it is doesn't make it so, the OSD determines OSS, not some company).

The upstream project has been pretty vocal about their reasons to switch, and it's not because of the license. MongoDB just isn't the right tool for their use case, despite valiant attempts to utilize it.

This is very clearly because of licensing.

Any guesses on a timeline? Satellite 6.6? Or will it be in the next major version? 7.0?

Pulp 2 will continue using MongoDB until EOL. Pulp 3, which is still in development, is using Postgres -- but while the platform functionality is nearly done the plugins for e.g. RPM and Docker are not yet near feature parity.

Is there a word for a case of "I told you so" that went on for so long that it curdled from frustration, to despair, to cynicism, to a realignment of your understanding of the human project as something barely capable of tying its own shoes and making it out to the mailbox and back?

Smug self-righteousness

I'll allow it. Years later, I'm still miffed at Graylog (centralized logging engine) for having required MongoDB for a small bit of auth and meta storage that could've easily been done in MySQL or PostgreSQL (RDS even), forcing the need for that much more ops work for a small Mongo cluster for HA. Everyone deprecating the use of Mongo is a welcome turn of events.

I shall recall these dark days to the next generation as “NoSQL Madness”, or more colloquially, “my schema is my app layer”.

I once worked at a place that, at some point in the past, had developed some kind of semantic graph database that never gained market traction.

The author of it had ended up as CTO and kept seeking out uses for his work, finding all kinds of odd places for it to live, including as the auth database in a large-scale document analytics system and running part of the payroll.

We were constantly running into all kinds of scalability issues where this 12 year old component was often the source of our pain, but he'd never even entertain a conversation to eliminate it and consolidate or replace it with something better.

I worked for a company that was old but had a fancy new product. It was just about to be released when....

A competitor bought us for stupid money. They fired most folks (not my small department) and had their CTO decide between our fancy new product and the product he created. There was no question: our product was amazing, his Frankenstein was two pieces of equipment cabled together in three times the footprint... and still did less, and was quite a ways away from being "ready".

If there was a silver lining, it's that the Frankenstein product doomed that company and we got bought by the far more competent competitor they had.

Happy ending.

It's surprising how that works. I've been through numerous mergers, acquisitions, been in the company acquiring a company.

Lots of hair pulling, and I can say that it was really hard to predict the outcome for any individual, positive or negative, in every case until long afterward.

This is the definition of software engineering hubris distilled into a few paragraphs

I think I know the company. We were almost acquired by them until they hit the rocks and we took a hit. Interesting to hear how that database was a pain to manage.

There aren't many of these databases around... Which database was that? Is it still around?

I agree with the general sentiment just wanted to point out that "my schema is my app layer" has some valid use-cases. At my current job we deal with highly complex schemas (modelling insurance contracts). Correctness is paramount but at the same time you need flexibility in terms of change over time. Defining these schemas at the DB level would be painful, a bit like writing web apps in assembly. Languages like Haskell (which we use) help here with their rich and expressive typing capabilities, so we can model these complex domains expressively and use the DB only for persistence. Admittedly it does have downsides, like having to write your own migration layer, but for this use-case the benefits outweigh the pains.

PS.: We do use Postgres though ;) as it has 1st class json support and at the same time we have the luxury of using its relational capabilities where needed (think of a hybrid model).

My pet theory is that NoSQL took off purely because people were sick of having to manage schema changes.

Unfortunately, people reacted to being (justifiably) frustrated with schemas by throwing strict schemas out entirely, instead of making better schema management/migration tools.

Also, DBAs hate developers. Developers want to make changes to the database to support their classes, such as "I need to add a column", and the DBA response is "no", or, even worse, "fill out this ticket and it will get prioritized in the next scrum", meanwhile the developer is at a standstill.

I interviewed at Southwest Airlines years ago and I don't remember how it came up, but we were talking about bottlenecks or something and I brought up the fact that having to go to a DBA to get a column added to a table, no matter how trivial, is a great source of delay. The whole room just nodded and looked at the floor; it was obviously painful for them.

NoSQL took the DBA out of the loop, now the developers were in full control of what was persisted and what wasn't. If they needed a new field they just made it so. On the flip side, DBAs got really freaked out and cried to whoever would listen.

In my experience you either have a DBA report to Developers or Developers report to a DBA. Never give them equal footing (even implied) because they'll just fight.

I think you need to also look at it from the DBA's standpoint. If they did whatever the developers want and the system goes down, or more likely other parts become slow, it is the DBA who gets the call.

In a large company like SW, the developer requesting some change for their app may have no idea how else the db is being used. What if their requested changes took down the db and prevented reservations from working?

My examples are extreme, but I have seen similar things in my years as both a developer and a DBA at times.

Took me a while to get back here but I do understand your point and it's totally valid. That door swings both ways.

That's why it's hard for the two camps to work side by side.

NoSql gave power to the devs at the expense of the experience and wisdom of the database folks. I bet many many applications and systems were completely screwed datawise more than once because of devs and NoSql.

To be clear, I've known DBAs who act exactly like you described in your original comment. Very annoying.

The best I've seen it work is to have a DBA on the team building the application.

Tell that to AWS. They've banned relational databases for specific workloads because Dynamo (nosql) provides more consistent performance, and is easier to operate.

Tons of conflation of Mongo's problems with those of nosql in this thread.

Did a project with Dynamo last year. Hope never to see it again.

Compared with RDBMS tooling, it looks like a high school project.

DynamoDB has had major improvements in the last few months: e.g. you get dynamically provisioned capacity for tables (avoids read/write capacity exceeded exceptions caused by capacity planning uncertainty), and transactions, to name two. However, even a hosted RDBMS has an implicit read and write capacity throughput that you need to design for (e.g. hotspots in partitions), you just hit it a bit later in your project. The bounded latency at scale (throughput, and size of tables) is the main win for DynamoDB.

It couldn't possibly be because they're trying to push use of their own technology to force dogfooding.

That might make sense if they didn't also offer a plethora of their own built-in-house as well as managed OSS relational DBs.

None of those OSS relational DBs offer Amazon lock-in the way DynamoDB does - it's more reliable income if someone uses it, but it also takes more convincing for people to use it. What enterprise would use it if Amazon themselves don't?

This argument makes no sense at all. What does lock in have to do with Amazon dog fooding its services? They're... trying to lock themselves in? What?

Amazon owns Dynamo DB, and make it available via AWS. If customers use it for non-trivial applications, they have some degree of lock-in, because they can't just migrate to another (self-)hosted instance of e.g. Postgres, MySQL, Oracle, Mongo, what have you.

So Dynamo is an opportunity for Amazon to generate lock-in through their own proprietary software.

But big and likely even medium size businesses are less likely (compared to tiny companies that barely go above the free threshold) to use a new technology without any big well known users, or publicly documented use-cases etc.

One big way companies can provide some confidence to potential customers about their technology is by dogfooding: they use the thing they're trying to "sell" (regardless of whether it's a licence, a service, whatever).

Your average dev is not making decisions based on what might work best for one particular problem Amazon has.

Do you think all of the companies that chose Cassandra and Dynamo were wrong to do so? There's no use case for NoSQL? There were no lessons learned, value adds from NoSQL?

How do you explain the 'NewSQL' approach, which seems to be so clearly borne of what we've learned from NoSQL?

It should be obvious that NoSQL has value, regardless of the issues with one of the earlier NoSQL DBs.

I don't see a value other than fashion-driven development, especially when comparing the bare-bones browser GUI for Dynamo with something like SQL Server Management Studio, or that whole story with primary and secondary indexes, with prices being set by index usage.

The Cassandra design was always a bit of a frankenstein without clear upside to me, but the nosql craze started great conversations.

There is certainly merit beyond fashion to the dynamo architecture, and there are workloads where (for example) HBase is simply the correct type of tool despite the lack of polish of its management interface

I think it also has to do with the source of data. If you receive data from a third party it’s easy to insert the whole document and figure out what parts you need later. If your data comes from your own client interface it makes more sense to build up the data model over time.

You could just plonk the data in a JSONB, a BLOB, or just a plain old file on disk with a URL pointing to it while you figure it out. And not introduce another super-complex-to-support dependency...

Schema management and automated migration generation frameworks alleviate a lot of that headache. As long as the schema definitions are well structured and can be easily analyzed against a live db to find diffs and generate migration scripts. Django does this very well. You don't even need to use Django for the application, you can use it purely to define schemas and perform migrations on the DB. I'm sure there are alternatives for other languages.

People who got tired of dealing with schemas are now realizing that having zero schema is way more of a headache and way more work than the up front work of creating the schema.

> As long as the schema definitions are well structured and can be easily analyzed against a live db to find diffs and generate migration scripts. Django does this very well.

In my experience Alembic works better.

Well, happily, those days are gone for good. Who needs NoSQL when you have the blockchain!

Glorious, cursed comment.

Setting up graylog was one of the worst mistakes I made. It took forever to get all the required software installed and configured and then it was taking up all the ram on the server doing fuck all.

There were single-script 1-click installs for it in bash all over for me... but yeah, the JVM is a hog.

I think the "jvm is a hog" misattribution is actually part of the cause of the "jvm being a hog". It implies that you don't need to worry about memory management and efficiency in Java, because any hoggishness is the JVM's fault. With GC, you can just get away with having memory leaks everywhere and allocating millions of objects per second without any catastrophic consequences. JVM software written with knowledge of memory management, and of the fact that allocation isn't free, can perform just as well as any other platform. As Bryan Cantrill loves to say in every single talk: "GC is not your problem, allocation is your problem; GC just defers the cost".

My experience with those one-liner installs is that they usually work... strictly speaking. They don't scale, they don't deal with edge cases, they know nothing of your environment. They install one piece of software (in an "interesting" way that won't upgrade), and that's it.

That was just Java doing Java things. Business as usual.

This is a nonsense comment doing language troll things.

Can you explain "my schema is my app layer" in more detail?

EDIT: fixing autocorrect

MongoDB by default is (was? It’s been a while since I’ve used it) schemaless, which means all of your data validation must take place in your app instead of the database. Your data integrity is then only as good as your weakest validation.

Edit: scheme/schema autocorrect typos corrected. Thanks!

I've always preferred the terms "schema on write" and "schema on read" to schemaful/schemaless.

At some point, you are always going to have to get the data into some sort of consistent model, so that you can operate on it in a predictable and sane way. So there's no question of there being a schema, even if it's only implicit. The question is, do you apply the schema once, when you write to the data store, so that the data at rest is consistently structured? Or do you allow it to be inconsistent in the storage layer, and instead apply the schema and re-validate the data every time you read from it?

There are valid reasons why one might choose either approach.

Which is not to say that valid reasons always play in to the decision to choose one approach or the other.

I'm tired and haven't often dealt with database systems. I'm struggling to see significant benefits for schema on read style systems - maybe progressive migration? I'm not convinced...

When you want to do validation depends on when you can do something about it. I work with a NoSQL DB at work, and while it wouldn't be my choice for most things I would use a DB for, the lack of validation has some benefits. A good example is where you have no ability to validate input from a user, but where you need to store the data anyway. The last thing you want is your noisy data being kicked out by the DB because it doesn't follow a DB constraint. Sometimes you want to go in afterwards and say, "Show me all the data which is incorrect". This is also useful for dealing with important data sent by other systems which have been coded by people other than you. They get the data wrong (or are using older versions of specs, etc.) but you want to store what they sent you anyway. Then you can go in later and sort it out by hand.

I don't think that kind of thing is particularly common, but there are definite use cases. In our particular case we use it for financial data where we want the data we are given even if it is flawed. I think the OP is 100% correct. You have to write that validation somewhere or else you are in big trouble. Usually it is easier and more convenient to do it at the DB layer, but sometimes you choose to do it somewhere else.

It sounds more like an edge case though. I can't imagine that all the data you need to store may or may not be in the right format, so I wouldn't switch my database just because one or two entities need this.

Anyway this is 2019 so PostgreSQL JSONB fields have got you covered. You can even efficiently query the JSON objects within them.
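e.g., a sketch with a made-up table and keys:

    -- containment queries can use a GIN index
    create index on events using gin (doc);

    select * from events where doc @> '{"status": "failed"}';

    -- or drill into nested keys directly
    select doc #>> '{customer,email}' from events;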

I'll give a qualified "yes" to that. I agree there is no particular reason you can't use PostgreSQL. There are some advantages to the designs of some No SQL DBs if it fits your use case (immutable data, the ability to replicate easily). For our application eventual consistency was a really good fit. Also we wrote it 10 years ago :-) Even still, we often muse about replacing what we're using with PostgreSQL.

The main reason I wanted to reply to the question was that sometimes I see people who just can't get past not enforcing a schema at the DB layer for your whole data model. It really is crucial to understand that doing so means that bad data doesn't end up in your DB. This isn't always what you want. Like I said, not super common, but not unheard of either.

The underlying technology is pretty unimportant as long as you can do what you need to do. I've historically never really been a No SQL DB fan (there are very few downsides to relational data!!!) However, we've been using CouchDB for the odd thing and IMHO it has its place. Interestingly, I think it was my boss who originally selected it and he's gone very cold in that direction, where I've warmed to it while using it. I think the main thing is to understand exactly what benefit it is giving you (in our case easily replicated data with immutable change sets) and not give in to the hype of "OMG! You don't need a schema!", which is just not true. I've never asked him, but it is possible that my boss thought it would make life easier not to have to deal with schemas and DB migrations, and when it actually made things harder he got upset. I came into it knowing these things, but not really understanding the other benefits, which is why I warmed up to it.

If we were to start again, I think we would almost certainly go the PostgreSQL route, but I can see places where we would have some problems. It's probably a wash, really -- which is why we've not seriously tried to move away from CouchDB.

Unlogged jsonb tables in postgres have generally made nosql systems look pretty bad. I'm really happy the industry finally came up with Vitess so we could have a middle option between "my ACID database needs to scale writes so I'll roll my own fragile sharding layer" and "give up all attempts at schema and consistency and transactions".
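For reference, "unlogged" is a one-word change (hypothetical table):

    create unlogged table cache (
        key text primary key,
        doc jsonb not null
    );

Unlogged tables skip the WAL, so writes are much faster, but the table is truncated after a crash -- roughly the same durability trade many NoSQL stores made by default.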

Vitess is a really comfortable middle ground of fairly familiar database semantics within a partition.

Same with Citus for Postgres. Or CockroachDB / TiDB for a rebuilt natively-distributed modern RDBMS.

The default design pattern for storing potentially invalid data with an RDBMS is to (usually bulk) load the data into tables without constraints (loading tables), then do the validation in the database, and move valid records to their final tables.
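A minimal sketch of that pattern (hypothetical names, Postgres dialect):

    -- loading table: everything as text, no constraints
    create table orders_load (
        id  text,
        qty text
    );

    -- after the bulk load, validate in SQL and promote
    insert into orders (id, qty)
    select id::bigint, qty::int
    from orders_load
    where id ~ '^[0-9]+$'
      and qty ~ '^[0-9]+$';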

I do a lot of work with both traditional RDBMSes and NoSQL databases.

The main question I would ask is: Is your data schemaless? Often it is - especially when storing what we'd normally call a "document". Heavily polymorphic data is often better stored schemaless. And sometimes you don't necessarily have the schema in advance (common when storing "other people's JSON").

You can store schemaless data in Postgres via the JSONB type, so this isn't necessarily a "Mongo vs Postgres" issue, but more of a general data modeling issue.

As a point of reference, the folks that struggle with schemaless tend to be the ones using Javascript, Ruby, or other dynamically typed languages. Schemaless is less of a problem in Java and other languages where class structures enforce your schema.

Not having to know all / as many of the structural details up-front could be of value in some use-cases. It can translate to reducing time-to-start cutting code, which can (in some cases) be a business priority, and can lead to identifying critical dependency problems earlier in development.

I'd happily agree that's an inappropriate model in close to 99% of cases, and that even if it was the right model one could (and most likely should) still use a decent database for this anyway.

I can't speak to document stores very well, but one spot where schema-on-read makes sense is in data warehousing type applications. One of the potential troubles with the traditional ETL approach is that transforming the data to fit a fixed schema almost always involves some information loss that might make the data less suitable for answering certain questions.

That's fine if you can predict what questions your business intelligence or data science team will be asked ahead of time, but, realistically, you can't actually do that. Using a schema-on-read data warehouse instead is a more costly option, but also leaves you more able to respond to changing business demands.

One pretty cool use of schema on read is Splunk. It wants to take in all the data and let you search, transform and visualize it in a variety of ways some of which you may not know until you start exploring what data you have.

Excellent perspective. I'll start using this at work!

Right, but most people using it use model or data repository patterns to ensure correctness. It does offer nearly infinite flexibility provided you use it correctly. You can add fields without any sort of DB work, you just start adding fields to rows as needed and let it catch up organically.

There are use cases where mongo makes a lot of sense. It's very popular in the node.js / RAD world for sure. I certainly have never been a huge fan by any means. Only relatively recently did they solve distributed writes.

Unfortunately, I’d argue PostgreSQL gets you all the same benefits with JSON storage (fairly equivalent to Mongo docs), while also giving you all the goodness of a relational, transactional, schema enforcing RDBMS. PGSQL became Mongo faster than Mongo could become PGSQL.

This is the same thing that happened in Java.

Other languages started prototyping features... that eventually just end up being implemented in Java.

Java did it way too slow, and that is a significant contributor to it being relegated to "legacy" in many areas. If it waited for the other languages to prototype stuff, it might have not been the case. The problem is that it waited for them to prototype it, refine it, release it, popularize it, and for their community to adopt it, before even starting to work on it in Java - which means that by the time they had it, most people who needed it were already elsewhere (not necessarily off JVM, just another language).

Lambdas were a very good example - if you look at the closest competitor, C#, it got the first take on them back in 2005. Then a major refinement in 2008, adding type inference. By 2010, lambdas were idiomatic in C#. Java, in contrast, released the first version in 2014. And even then, they're still less powerful.

Looking around the office and my phone, Java is anything but "legacy".

As for alternative JVM languages, while they are cool and have brought many fresh ideas into the platform, they remain a very tiny portion of the Java developers' market.

Java takes a very long time, because backwards compatibility and cooperation among giant companies takes years.

C# basically has Microsoft deciding what the roadmap looks like and rebooting the platform multiple times.

C# 8 won't even be fully supported on the .NET Framework.

And yet Google is investing into Kotlin.

Yes, Java takes it slow because its target market - enterprises - want it, and they have good reason to. There's nothing wrong with that. All I'm saying is that there are many other niches where developers find it lagging too far behind other available options, and there's a noticeable decline in its use because of that.

And so the analogy with PostgreSQL in this case isn't working well.

Google is investing in Kotlin due to politics with Oracle, and JetBrains (makers of IntelliJ, which powers Android Studio) pushing Kotlin everywhere they can.

They even separate Kotlin/Native graphical debugging from IntelliJ, so that developers get to shell out for a CLion license as well.

The only variant where IntelliJ and CLion come together is Android Studio, thanks to us NDK users being vocal about how 3rd class it felt versus Eclipse CDT.

On platforms where developers can make use of standard Java there is less pressure to adopt alternative JVM languages.

Just like F# will always be a shadow of C#, never getting to play with all toys, with C# slowly picking up all features that matter.

Runtimes are different from languages. The C# language has never been rebooted and is fully backwards compatible, and there's no better example of long-term support than Microsoft. You can still run apps from the MSDOS era, and even upgrade MSDOS through to Windows 10 if you have all the CDs today.

Runtimes and languages go hand-in-hand, a runtime that doesn't support everything that one expects from the standard library breaks compatibility.

Code starts getting full of #ifdefs

Silverlight, .NET Core, WinRT, UWP, just to give a few reboots.

No support for dynamic APIs, appdomains, IL generation on the fly, reflection APIs done in a different way, ...

Actually, C# is not fully backwards compatible; variables declared in foreach statements changed their semantics in C# 5.


MS-DOS is only supported on 32-bit variants, a species nearing extinction.

> The C# language has never been rebooted and is fully backwards compatible

This is not true - there have been several breaking changes in C#-the-language since 1.0. For example:


So, like DOS 3.3 to 5 to 6.2, and Windows 3 to 3.1, to 95 to 98 to 98SE, to ME, to XP, to 7, to 8 to 8.1, to 10?

I think I was using floppies until Win 98?

Here's a 10 hour video from DOS to Win10: https://www.youtube.com/watch?v=l60HHWWo9z4

Huh. I didn't know they worked any differently in C#. In what ways are they less powerful in Java?

For example, in Java, lambdas cannot capture mutable variables from the outer scope - it must be final, or effectively final. C# always let you do that, from the very first implementation of lambdas.
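Python closures happen to behave like C#'s here, so the difference can be sketched in a few lines (make_counter is a made-up name; the equivalent Java lambda would be rejected because `count` is not effectively final):

```python
# A closure capturing and mutating an outer local - legal in C# and Python,
# rejected by javac because `count` is not effectively final.
def make_counter():
    count = 0
    def bump():
        nonlocal count  # mutate the captured variable
        count += 1
        return count
    return bump

counter = make_counter()
counter()
counter()
print(counter())  # 3
```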

C# lets you do that because C# doesn't have a way to declare a local readonly/final variable at all. I significantly prefer features that encourage the use of `final` variables everywhere that it is possible in Java.

I write C#, Java, and Kotlin in roughly equal measure. Each has its pluses. But the claim that Java's lambdas are worse because it doesn't let you--and this was a conscious design choice!--do something so potentially catastrophic and difficult to debug is an odd one.

C# could do the "effective final" rule, same as Java did - don't allow a variable to be closed over if it's mutated anywhere.

And yet it didn't - which made implementation that much more complex, since capturing mutable locals requires lifting them to extend lifetime.

I'd rather be able to do something than be prevented because someone somewhere thinks that it's "potentially catastrophic". That seems especially hyperbolic in this case.

To each their own, but I can second that most Java programmers find that ability horrifying, and if it came to a vote it would probably be ejected.

Do Java programmers find mutable variables in general horrifying? That was not my impression - certainly, the language doesn't make it easy to make everything immutable. If so, why the special distinction for lambdas?

The official rationale was that they expected lambdas to be mostly used for parallel sequence processing, and wanted to avoid race conditions. Of course, in practice, lambdas are very useful in many other places, where there's no concurrency issue at all - async continuation callbacks, for example, or pseudo-custom language constructs implemented as functions with a lambda for a body.

What’s the memory model for captured variables being mutated by multiple threads?

The same memory model as any other shared variable being mutated by multiple threads - you either explicitly synchronize, or you just don't do it. But given that lambdas are far from the only way to get there - and those other ways are already idiomatic in the language (e.g. statics) - why single out lambdas specifically for this?

The ideal solution is to make this a part of the function type - so that APIs that do intend to invoke lambdas concurrently can mark them as such, and then the language would enforce sharing, while other APIs that do not use lambdas in a concurrent context, can use their full power.

Same thing goes for lambdas that cannot escape vs those that can - if you reflect this in the type system, then you can also support safe nonlocal breaks and returns in the former, for example. One of the early Java lambda proposals, the one by Neal Gafter, did just that, and it was awesome.

> why single out lambdas specifically for this?

Because existing memory models have rules for object fields and rules for publishing objects.

These rules don’t apply to local variables.

They could do - but that’s what I’m asking.

Yea, that seems frightening. I like to think of my local variables as my inaccessible local state. I'd imagine it keeps my JIT happy too. What would it even mean to have a lambda reference to a variable whose stack frame has already popped?

> What would it even mean to have a lambda reference to a variable whose stack frame has already popped?

That bit is simple - the variable is kept alive as long as there is still a lambda that references it.

Don’t think about stacks - that’s an implementation detail and the compiler is free to use a combination of the stack and the heap to implement local variables.
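A sketch of exactly that, in Python, where closures already work this way (names are made up): the captured variable outlives the frame that created it, and every closure over it shares the one lifted slot.

```python
# The captured variable survives make_cell's return, and both closures
# see the same slot - it has effectively been lifted off the stack.
def make_cell():
    value = 0              # a "local" that will escape
    def read():
        return value
    def write(v):
        nonlocal value
        value = v
    return read, write

read, write = make_cell()  # make_cell has returned; no stack frame remains
write(42)
print(read())  # 42 - the shared captured variable is still alive
```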

The issue I’m talking about is if one thread writes a local and another reads it, what does that look like?

Yeah, or value types. Still waiting on those...

You can use the extensions in IBM J9 or Azul, just like some use GCC C while claiming it is C. :)

Also looking forward to them and disappointed they weren't there since the beginning.

A lot of useful features don't end up being implemented in Java and even if they do, libraries and frameworks have to update to take advantage of them. With a new language libraries are built from the ground up based on the new features.

For example, adding async to the language isn't necessarily going to make your programs async, because every widely used library has already adopted threads, doesn't support async yet, and often never will.

Pretty much all of the things you'd want in a relational database are now present in Mongodb too.

The real benefit of MongoDB at this point is the ability to easily scale beyond a single machine with shards and high availability using replica sets.

Postgres will get you pretty far, but beyond a certain point the scaling story breaks down and you have to hack some sort of user space sharding solution. At that point all the schema update and backups become a nightmare.

Schemas: https://docs.mongodb.com/manual/core/schema-validation/

Transactions: https://docs.mongodb.com/manual/core/transactions/

SQL: https://docs.mongodb.com/bi-connector/master/

> Pretty much all of the things you'd want in a relational database are now present in Mongodb too.

On the "relational" front, it has denormalized-only schema validation and limits transactions to replica sets (so no transactions with sharding), while also recommending single-document transactions over multi-document transaction via denormalization. (FWIW, transactions aren't available in any open source release).

On the "database" front, it has a history of misleading users and remorselessly dropping data.

> Postgres will get you pretty far

Postgres is an actual relational database, open source, battle-proven with a good design and a great team behind it. It never claimed to be good at, let alone capable of, doing things it could not actually do (well, or at all).

> but beyond a certain point the scaling story breaks down and you have to hack some sort of user space sharding solution.

Scaling IS a hard thing, and presents itself quite differently to different use-cases. Nevertheless, horizontally scaling Postgres — for when one truly hits the petabyte-scale — is a problem that has been solved correctly many times before (out of core). For a similar-to-MongoDB method, check out Citus, with the assurance that it only adds to an actual database; as opposed to the MongoDB way of doing things: make up and promise magic scaling solutions that "Just Work", then try to build a database on top of it.

Most databases will never need this kind of scaling.

RDBMS only provides a limited degree of 'validation'. It still must exist fairly comprehensively in the app.

On the contrary, RDBMS provides far more opportunity for validation, because it has all the data at its disposal, which can be queried as needed without the expense of crossing the boundary.

What would you say is the purpose of validation with modern computers? At one time, specifying the exact number of chars was good for squeezing out as much storage as possible, but less so today.

Its still absolutely critical for almost everything. Some use cases:

* My code depends on this value always existing, so make this not null

* My code is doing math on this value, so make sure it is always a number

* This record belongs to another record, so make sure the other record can not be deleted while this one still exists

Modern computers change next to nothing about the need to validate data. The world's fastest computer won't tell you how to add a number that doesn't exist.

Validation is almost always a function of business logic, not 'storage compression'.
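Those use cases map directly to constraints. A sketch with SQLite (via Python's stdlib sqlite3) standing in for any RDBMS; the table and column names are made up:

```python
# NOT NULL, CHECK, and a RESTRICTed foreign key covering the three cases.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite requires opting in
con.executescript("""
    CREATE TABLE accounts (
        id      INTEGER PRIMARY KEY,
        email   TEXT NOT NULL,      -- "depends on this value always existing"
        balance NUMERIC NOT NULL    -- "make sure it is always a number"
                CHECK (typeof(balance) IN ('integer', 'real'))
    );
    CREATE TABLE orders (
        id         INTEGER PRIMARY KEY,
        account_id INTEGER NOT NULL  -- "other record can not be deleted"
                   REFERENCES accounts(id) ON DELETE RESTRICT
    );
""")
con.execute("INSERT INTO accounts VALUES (1, 'a@example.com', 10.0)")
con.execute("INSERT INTO orders VALUES (1, 1)")

try:
    con.execute("DELETE FROM accounts WHERE id = 1")  # order 1 still exists
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```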

First, most 'noSQL' DB's (including Mongo) have data validations anyhow, rendering the discussion almost moot.

" RDBMS provides far more opportunity for validation"

This can't be true. The application layer, which ultimately contains all 'knowledge' of all aspects of the business, including data from all other resources, can obviously 'provide more opportunity' for validation than any DB possibly can.

Moreover, 'validation' generally implies aspects which are inherently application specific ergo, doing this purely in the data layer almost implies an intersection of concerns.

Validation in almost every case must be done on the app layer, so anything we get from the DB is an added benefit.

Also, data generally has to be validated when it enters into the business logic, long before it gets into the DB, moreover, there are usually data elements that are not persisted, and must be validated anyhow, again illustrating the requirement for validation above the DB.

> The application layer, which ultimately contains all 'knowledge' of all aspects of the business

I've rarely seen a codebase outlive its database, but I constantly see databases survive through multiple codebases.

It's extremely common to validate a piece of data not on its own, but how it relates to other pieces of stored data. Without transactional semantics, an application basically can't enforce these invariants w/ any reliability (or those semantics need to be ensured out of band, or w/ little data modeling tricks that tend not to scale well).

There certainly are invariants that are non-trivial or cumbersome to enforce strictly with a schema, but you can really only enforce them w/ a database that provides serializable transactions.

In many cases, schematization of data in the database is good for other reasons though (for instance, guaranteeing type-normalized data in the presence of multiple deployed versions of an app via accident or otherwise, ensuring your queries and updates are typesafe, etc.)
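A sketch of such a cross-row invariant ("no account goes negative") enforced atomically, using SQLite via Python's stdlib sqlite3; the balances table and transfer() helper are made up for illustration:

```python
# Either both updates commit or neither does; the CHECK constraint plus the
# transaction keeps the invariant across rows without trusting the app.
import sqlite3

con = sqlite3.connect(":memory:", isolation_level=None)  # manual transactions
con.executescript("""
    CREATE TABLE balances (
        name   TEXT PRIMARY KEY,
        amount INTEGER NOT NULL CHECK (amount >= 0)
    );
    INSERT INTO balances VALUES ('alice', 100), ('bob', 0);
""")

def transfer(con, src, dst, amount):
    con.execute("BEGIN")
    try:
        con.execute("UPDATE balances SET amount = amount - ? WHERE name = ?",
                    (amount, src))  # CHECK fires here on overdraft
        con.execute("UPDATE balances SET amount = amount + ? WHERE name = ?",
                    (amount, dst))
        con.execute("COMMIT")
        return True
    except sqlite3.IntegrityError:
        con.execute("ROLLBACK")
        return False

assert transfer(con, "alice", "bob", 60)      # fine
assert not transfer(con, "alice", "bob", 60)  # would overdraw: rolled back
print(con.execute("SELECT amount FROM balances ORDER BY name").fetchall())
# [(40,), (60,)]
```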

First of all, we were talking about validation of data in the database, specifically.

> 'validation' generally implies aspects which are inherently application specific

Not at all. Taking this at face value implies that some app can write data to the database that is valid according to that app, and then another app can read data that is invalid from its perspective, and have to deal with it. That doesn't make sense - data is data, it's either valid, or it's not. That's why the schema is about the data, not about the app.

> Validation in almost every case must be done on the app layer

For UX reasons, mostly, yes. But it's usually much more basic than what e.g. triggers would do in the DB itself.

I'm not saying that there's nothing to validate outside of the DB, either. But for the data that is in the DB, the DB itself can usually do a better job.

The general idea I've seen (for both SQL and NoSQL databases) is two apps never should write to the same database to ensure separation of concerns. Some API layer instead handles all write operations.

Disclaimer: MongoDB employee. All opinions are my own.

That's generally a good practice (though not always, many people do blue-green deployments, for instance), but it's rarely an assumption you want to make. Lots of deployment blunders can happen that render assumptions like that incorrect.

Even if your team executes perfectly and never runs into this, the biggest problem IMO is that you can't really enforce most of your guarantees w/ any degree of confidence w/o a typed schema. Even if you work within a typed language that perfectly validates all the invariants of your application before storing anything, the second you need to perform work that does not strictly funnel data through your application (i.e. an update query), you are effectively gambling on whether or not those invariants will hold. This kind of "read-modify-write" flow of data doesn't really perform well (or even hold validity) for a lot of common use cases, so in reality you need your database to ensure these things for you.

Also the two deployed apps problem is just a special case of two people interacting w/ a database who aren't working under the same assumptions as to what invariants should hold. That can happen in single code bases, even with a lot of care taken.

You only need a separate API layer if the database can't enforce constraints properly.

The database schema (along with stored procedures, views etc.) is an API and database engines are designed to have multiple concurrent writers. Multiple applications and users needing access to the same data is largely why databases exist in the first place.

It doesn't matter in this case - all that matters is that they share the same data store, regardless of how concurrent access to it is organized. The problem here isn't concurrency, but the semantics of data stored - if one app can change it such that the other app can later retrieve data that it considers invalid according to its business constraints, what is the other app supposed to do?

"data is data, it's either valid, or it's not. That's why the schema is about the data, not about the app."

This is not true.

The objective of the overall app/system (i.e. front/back/middle/DB/storage/services etc.) is to carry out some kind of business logic. A DB schema cannot fully validate stored data against the logic.

Otherwise we wouldn't write backend code, we'd just write a bunch of schemas and be done with it.

Let's use a crude example: a password. (Of course, we would never in reality store a password as a string in the clear, but just as an example ...). When a user sets a new password, we have to validate that it meets specific requirements in terms of format, and then some others rules which are more complicated such as: "can't be the same password as the last 5".

Those 'password rules', for example, cannot be encapsulated in the schema of the DB and yet must be applied in order for the data to be 'valid' from the perspective of the app, or 'overall system'.

The DB may only care that it's UTF and max 20 chars. But the system requires more validation than that.

Re: Your statement about 'one app writing data, and the other app not knowing what to do with it'. This is not true, because all apps operating on such data must understand that data in the context of the business logic for which it was designed. Even 3rd party users of such data, via APIs, must understand this data at the level of business logic - not merely 'schema validation'.

When you query data from Google Geolocation, the 'city' field may be a valid string of a certain length, but that's not very useful: it must actually be the name of a city! Any 'app' using this data must operate with the explicit understanding that this is in fact the name of a city - and not just a string that met a DB schema validation requirement.

>A DB schema cannot fully validate stored data against the logic.

Postgres actually lets you run triggers and similar that can validate data arbitrarily. You can even do web requests with the right extension.

If that is not enough, you can run Python code in your database instead and do the same thing with a slightly more powerful language for general purpose computation.

You could write the entire logic of any business app in a PG database and only use the app as a shiny view layer.
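As a concrete sketch, here is a simplified "no password reuse, ever" version of the rule from upthread, enforced entirely inside the database by triggers. SQLite (via Python's stdlib sqlite3) stands in for a Postgres trigger function, and plaintext passwords are used only for illustration:

```python
# Two triggers: one rejects a password found in the history, one records
# every password that gets set. The schema alone couldn't express this.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, password TEXT NOT NULL);
    CREATE TABLE password_history (user_id INTEGER, password TEXT);

    -- reject a new password that appears anywhere in the history
    CREATE TRIGGER no_password_reuse
    BEFORE UPDATE OF password ON users
    BEGIN
        SELECT RAISE(ABORT, 'password was used before')
        FROM password_history
        WHERE user_id = NEW.id AND password = NEW.password;
    END;

    -- remember every password that gets set
    CREATE TRIGGER remember_password
    AFTER UPDATE OF password ON users
    BEGIN
        INSERT INTO password_history VALUES (NEW.id, NEW.password);
    END;
""")
con.execute("INSERT INTO users VALUES (1, 'hunter2')")
con.execute("INSERT INTO password_history VALUES (1, 'hunter2')")

con.execute("UPDATE users SET password = 'correct horse' WHERE id = 1")  # ok
try:
    con.execute("UPDATE users SET password = 'hunter2' WHERE id = 1")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

The "last 5 only" variant would just add an ORDER BY / LIMIT subquery over the history.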

> The DB may only care that it's UTF and max 20 chars. But the system requires more validation than that.

Are you familiar with SQL constraints, triggers, user-defined functions, stored procedures?..

Yes, plus Postgres has Domains, which are very nice, especially if you use only functions for data insert (which I do). This gives you more granularity than a domain on a column. Domains are like dependent types, offering very fine-grained control, enforced by regex, functions, enums; even lookup functions are OK so long as lookup tables are stable.

lmao, “my schema is my app layer”.

> The application layer, which ultimately contains all 'knowledge' of all aspects of the business,

Data always outlives the application. You could argue that some app + data lives on together, but then you have just poorly reimplemented what an RDBMS does for you up front.

Databases are, as their name suggests, closest to the data.

Applications generally can't recreate ACID properties and specifically, they shouldn't be trying to.

"Applications generally can't recreate ACID properties" - why would they?

ACID and 'data validation' are generally separate issues.

Data generally has to be validated as it enters the business logic, before it gets stored in a DB. While a DB may in some cases ensure that data adheres to a schema, this usually does not fulfill all of the validation requirements.

Validation often requires examining a model beyond "is this an int?". That model needs to be self-consistent. That requires atomic movements from consistent state to consistent state.

You can do that yourself. Or let the database do it. For things where you can't express it in a database schema, sure. But you'd be surprised how far it gets you.

> my schema is my app layer ... MongoDB by default is (was? It’s been a while since I’ve used it) schemaless


Maybe I’m a rare exception, but I chose NoSQL early on when it was still “hot” and have never looked back. We’ve grown from a couple of megabytes of data to several dozen terabytes and have had countless issues, but scaling our database was never one of them.

It was just a discovery phase.

"my scheme is my app layer"

To be fair, you can use schema validators in Mongo. Not sure it's widespread in practice. And there are other distributed databases that aren't document stores that have schemas and various subsets of SQL implemented.
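For reference, a Mongo schema validator is a $jsonSchema document. A sketch written as a plain Python dict (field names are made up; with pymongo you would typically pass it as the validator option to create_collection):

```python
# The shape of a MongoDB $jsonSchema validator: required fields plus
# per-field type and range rules, expressed as data.
validator = {
    "$jsonSchema": {
        "bsonType": "object",
        "required": ["email", "created_at"],
        "properties": {
            "email": {"bsonType": "string",
                      "description": "must exist and be a string"},
            "created_at": {"bsonType": "date"},
            "score": {"bsonType": "int", "minimum": 0},
        },
    }
}
print(sorted(validator["$jsonSchema"]["required"]))  # ['created_at', 'email']
```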

Wish we had HN gold for comments like this.

Your comment makes no sense, since RH is moving for reasons other than "MongoDB is bad".

Not being worth the effort isn't much better.

You are assuming it’s the technology; I’m assuming it’s the licensing change.

As someone on the team, it's not the licensing change. The post is correct when it says we've been slowly working on this since 2016. The license change certainly validates the decision, but it was made for purely technical reasons.

The intensity and frequency of frustration are quite likely to subside if you are able to accept these:

- humans do not always act in their long term best interests

- humans do not always act in a rational manner

- often the former is indistinguishable from the cases where you are not aware of someone's incentives

- true change requires that people really accept it internally (and not just mere commitment to act differently)

Wow was MongoDB really that bad?

A way to think about that question is to ask whether typelessness in general is bad because a lot of the schemaless databases come from that side of the divide.

I’d personally answer yes, but that’s because I believe that typed, schema-generated structures should be pushed all the way from the database to the TypeScript code on the front end. Changes become something you can deal with with high confidence, and things like typos become impossible. Relational structures enforced in the database (and imho sadly lacking in the intervening layers, which is something I’d like to get the time to deal with; though the FoundationDB record layer certainly has an interesting if unexploited take on it) are a natural extension of this.

But there’s a whole other world where people just don’t want to be tied to constraints. I suspect it’s a personality thing.

Strong typing and schemas are a bit different. They're often conflated, but once you start thinking about more complicated examples, this makes sense.

Look at Clojure for a good example - it is strongly (but dynamically) typed, but then it also has schemas with automatic validation on top of it (Spec).

"I suspect it’s a personality thing."

I do too. And also probably a lack of experience dealing with outages and remediation of mistakes.

If it's always someone else's problem to clean up your mistakes then you can't appropriately appreciate how valuable it is that a machine can help you make less of them.

how do you do this? tie it all the way from the DB to the client?

incidentally I believe this is the difference between being an engineer and being a product person. I straddle the line. in theory I want a strong type system so I can be confident about the future. in practice I don't want to pay the upfront cost and I'm more interested in shipping now and taking on debt.

Look at GraphQL. It lets you enforce a single schema from the front end down to the data layer.

In my company's product, we specify the data models as JSON Schema, and then generate the necessary language code -- we generate the GraphQL schema from it as well as Go data types, with some database glue. Our front end code is currently JavaScript, but we hope to migrate to TypeScript, which will make everything statically typed all the way through.

(gRPC fills a similar role, though the web story is lacking. gRPC is brilliant for APIs between backend services, but browser support is not there yet. GraphQL is more convenient for the app-facing layers.)

Yes, you can make it work. But you have to run this special shim or proxy, because browsers still don't all support HTTP/2, header trailers, etc. Hence "lacking".

We wrote it. Everything is derived from proto for us. Some of that is hand-written tooling, some is off the shelf. Mostly you have to be aware at design time of what this approach entails.

Proto is not elegant and the data model and implementation (especially the very poor performance of maps in the Go implementation) can leave a lot to be desired, but it's a thing that pretty much supports everything. It has good enough performance without headaches. Lowest common denominator can be quite useful and so can constraints.

What upfront cost?

Possible alternative headline: "Company weighs the trade-offs of having two different database systems, decides to consolidate on a single one." :-)

They don't give too many reasons, but say that operationally it is gonna be easier for them to work with a single database than with two (no-brainer?)

As someone on the team, that didn't factor into the decision too much, although it's definitely a nice perk.

Switching to Postgres made a lot of sense even when looking at Pulp as a standalone project rather than as a part of Satellite.

Except, I'd argue that most large distributed systems these days are using much more than two types of data stores, especially when you start including name/value keystores (and in memory caching variants), big data pipelines, transactional data, "unstructured" data storage (like Mongo), time series, etc.

I work in FinTech and you'll see several backing our apps, and this isn't uncommon.

The licensing change surely has to have played a factor

Generally, if you’re going to standardize on a single data repository, a traditional RDBMS will be a better fit than a NoSQL solution, unless you really need polymorphic capabilities and have a document-centric conceptual model (such as a CMS or certain kinds of contact center solutions). So this is really less about MongoDB being bad than relational being more appropriate.

MongoDB has definitely come a long way from when they were dropping writes and hitting scaling limits, and their new tools ecosystem holds a lot of promise, but I treat them much like any other non-RDBMS store (Redis, Cassandra, Neo4J...) in that they’re very specific tools for specific needs, not something that can be kit-bashed for any general class of problem.

I'm not sure, but I've seen things like this: https://twitter.com/hackernewsonion/status/38257854706983731...

"Web scale"

This is a dead horse that's been beaten to death over and over again here on HN. MongoDB is garbage. MongoDB doesn't scale. MongoDB is $h!t says HN users.

It just feels like an echo chamber. Are you running the latest version of MongoDB in a replica set with journaling enabled and write concern set to one? MongoDB has worked great for my uses, up to moderate write/read scale. Sure, if you are running "big data" or enterprise things, it might not be the best choice, but it's not the steaming pile of horse excrement that some HN users try to make it.

Second, for those who still insist that MongoDB is crap, what is the best pure document store database then? I used to champion RethinkDB, but they failed and development has basically stopped. You wouldn't build a business on RethinkDB nowadays, unfortunately.

Modern Mongo still seems to corrupt itself on the Unifi Cloud Controller I have to deal with if there is no disk space left or hard power off. Postgres will be usable after disk space is freed, or when booted back up. This is a really basic reliability issue.

OT: Ubiquiti seem to be ... not good at software and only relatively good at hardware. I'm not sure why they're as popular as they are and I'm slightly ashamed at being suckered in by Troy Hunt's marketing.

Is it just a case of the rest of the home/SOHO WAP/router segment being cataclysmically bad, so they only have to be a little bit better? (The WAP is good, but not as fast as the Turris Omnia it replaced.)

> I'm not sure why they're as popular as they are

For me it's always been price.

Back when we adopted them Ubiquiti UAPs were pretty much the only thing close to enterprise access points (centralized management, radius support and all that jazz) but at 1/10th of the price of Cisco/Ruckus/Aruba.

I agree that they're a bit overhyped: we have 200+ deployed and they are not exactly the most reliable, some devices just disappear from the controller forever, some early generation ones would just overheat and die, CLI management is non-existent. But hey for $70 a pop you can just buy a truckload of them and replace them as needed.

I'm going to go out on a limb and say Unifi Cloud Controller is not running the latest version and does not have journaling enabled.

> is not running the latest version

MongoDB's not still in an early prototype phase or similar, is it? Basic reliability shouldn't really need the latest version any more, a few years in. :)

nodesocket said the latest version of Unifi Cloud Controller, not the latest version of MongoDB.

"Are you running the latest version of MongoDB?"

I don't use MongoDB, but I imagine the answer to that question for most people is: No, because I already have an existing product in production with actual customers, and I can't afford to upgrade to the latest version.

With a replica set you should be able to do the update with zero or little downtime with some planning. Of course you'll have to look and watch for breaking changes if you're running a really old version.

It's not just the downtime. It's the actual work involved in running all your regressions to ensure backward compatibility. i.e. do all your existing queries still work? Do all your current backup processes still work? Did that arcane config variable you rely on change its stinking name?

I was going to respond with “well, any database that can’t guarantee that kind of backwards compatibility shouldn’t be taken seriously,” but then I realized we are talking about MongoDB and I found myself giggling.

The best document store is probably either a filesystem or S3.

Use a relational database. Postgres or MySQL. MySQL used to be heinous 20 years ago, but it's alright now. That's about how long these things take.

The primary thing you want from your data storage layer is you want it to be boring. You want all of the foot-shooting to have happened to other people and to be painstakingly documented. This is what relational databases give you. Are they the most ideal thing in the world always? Heck no. Are they extremely well-understood, operationally and in terms of the performance envelope? Compared to everything else, yes.

When in doubt, do the boring thing.

If your data storage layer is interesting, you're a database company and you know to ignore me. Otherwise why are you spending complexity budget on something which isn't the problem you're actually trying to solve?

     CREATE TABLE pure_document_store(document jsonb NOT NULL);
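And that one-liner goes further than it looks: jsonb gives you containment queries and indexing out of the box. A hedged sketch against the table above (the document contents and index name are illustrative, but `@>` and GIN indexes on jsonb are standard Postgres):

```sql
-- Store arbitrary documents
INSERT INTO pure_document_store(document)
VALUES ('{"user": "alice", "tags": ["admin", "ops"]}');

-- A GIN index accelerates containment lookups
CREATE INDEX pure_document_store_gin
    ON pure_document_store USING gin (document);

-- Find every document containing this sub-object
SELECT document
FROM pure_document_store
WHERE document @> '{"user": "alice"}';
```

You get a document store plus transactions, backups, and every other boring relational guarantee for free.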

Why would you require the latest version of a database just to not have corrupt data?

I can install 10-year-old Postgres or MySQL and not get corrupt data.

No data corruption bugs in the last 10 years is hard to believe. There was a story here a few days ago about incorrect use of fsync.

The whole point of NoSQL was scaling to tons of writes. If you don't have "big data", or it can't actually handle it, then what's the real reason for these things to exist?!? :)

> what is the best pure document store database then?

That is like asking what's the best horse species to use for transporting goods.

Doesn't matter.

You want a diesel truck.
