
Someone else pointed out “databases”, which appears to be a vague, all-encompassing term for a bunch of features. Care to elaborate on which missing features you consider to be table stakes?

Yes, it's vague because it's quite a fundamental part of Notion. There's no way to describe that feature separately from Notion. Tables, Kanban boards, gallery views, subpages... everything you can't do in basic markdown is powered by a Notion database.

Most of what you mentioned exists in Obsidian.

With about a dozen plugins and hours of messing around? Sure. Out of the box? Absolutely not.

And I feel pretty confident saying that, knowing that this is by far the most popular page on my personal website: https://input.sh/replicating-notions-tables-with-obsidian-pl...

The level of quality and integration is very different. And Obsidian only has it through plugins that can die any day and become useless, which happens quite often. Notion has it as a fundamental part of its concept, and is constantly improving on it.

but Obsidian ain't open source though

at least it's free and files are stored locally

The issue indeed is that "workflow orchestration" is a broad problem space. I would argue that the solution is not this:

> I don't think that you can just spin up a startup to deliver this as a "solution". This needs to be solved with an open source ecosystem of good pluggable modular components.

But rather more specialized tools that solve specific issues.

What you describe just sounds like a better-implemented version of Airflow, or of the over 100 other systems actively trying to be this today (Flyte, Dagster, Prefect, Argo Workflows, Kubeflow, NiFi, Oozie, Conductor, Cadence, Temporal, Step Functions, Logic Apps, your CI system of choice has its own; need I continue? And that's not even scratching the surface). Most of those have some sort of "plugin" ecosystem for custom code, in varying degrees of robustness.

For what it is worth, everyone and their mom thinks they can build this orchestrator and wants to be the one everyone uses. It's a problem that is just so generic and casts such a wide net that you end up with annoying-to-use building blocks, because everyone wants to architecture astronaut themselves into being the generic workflow orchestration engine. The ultimate system design trap: something so fundamentally easy to grok and conceptualize that you can PoC one in hours or days, but with near-infinite possibilities of what you can do with it, resulting in near-infinite edge cases.

Instead, I'd rather companies just focus on the problem space their tool lends itself to. Instead of Dagster saying "Automate any workflow" and trying to capture that whole space, just make building blocks for data engineering workflows and get really good at that. Instead of GitHub Actions being a generic "workflow engine", just make it really good at CI workflow building blocks.

But we can't have it that way. Because then some architecture astronaut will come around and design a generic workflow engine for orchestrating your domain specific workflow engines and say that you no longer need those.

Actually, I think I just convinced myself that what you are suggesting actually IS the right way. If companies just said "we will provide an Airflow plugin" instead of building their own damn Airflow, this would be easy. But we won't ever have that either. What we really need is some standards around this. Like if CNCF got together, got tired of this, and said "This is THE canonical and supported engine for Kube workflows; bring your plugins here if you want us to pump you up." That might work. They've usually had better luck putting people in lockstep in the Kube ecosystem than Apache historically has for more general FOSS stuff. Probably because the problem space there is more limited.

Can you point me towards resources that help me understand the trade-offs being implied here? I feel like there is a ton of knowledge behind your statement that flies right past me, because I don't know the background behind why the things you are saying are important.

It's a huge field, basically distributed computing, burdened here with the glorious purpose of durable data storage. Any introductory text long enough becomes essentially a university-level computer science course.

RADOS is the underlying storage protocol used by Ceph (https://ceph.com/). Ceph is a distributed POSIX-compliant (very few exceptions) filesystem project that along the way implemented simpler things such as block devices for virtual machines and S3-compatible object storage. Clients send read/write/arbitrary-operation commands to OSDs (the storage servers), which deal internally with consistency, replication, recovery from data loss, and so on. Replication is usually leader and two followers. A write is only acknowledged after the OSD can guarantee that all later reads -- including ones sent to replicas -- will see the write. You can implement a filesystem or network block device on top of that, run a database on it, and not suffer data loss. But every write needs to be communicated to replicas, replica crashes need to be resolved quickly to be able to continue accepting writes (to maintain the strong consistency requirement), and so on.
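To make the write path concrete, here is a toy sketch of synchronous primary-copy replication, which is the general idea behind the guarantee described above. This is illustrative only, not Ceph's actual protocol or API: the primary acknowledges a write only after every replica has applied it, so any later read from any replica sees it.

```python
# Toy sketch of synchronous primary-copy replication (the rough idea
# behind RADOS's write path; NOT the actual Ceph protocol).

class OSD:
    """A storage daemon holding a simple key-value map."""
    def __init__(self, name):
        self.name = name
        self.store = {}

    def apply(self, key, value):
        self.store[key] = value

    def read(self, key):
        return self.store.get(key)

class PrimaryOSD(OSD):
    """Leader: acks a write only after every replica has applied it."""
    def __init__(self, name, replicas):
        super().__init__(name)
        self.replicas = replicas

    def write(self, key, value):
        self.apply(key, value)
        for r in self.replicas:   # synchronous fan-out; a crashed replica
            r.apply(key, value)   # must be resolved before writes continue
        return "ack"              # only now is the write acknowledged

followers = [OSD("osd.1"), OSD("osd.2")]
primary = PrimaryOSD("osd.0", followers)
primary.write("obj", b"data")
# After the ack, every replica serves the new value:
assert all(r.read("obj") == b"data" for r in followers)
```

The cost visible even in this toy is the one described above: every write blocks on every replica, so replica failures directly stall the write path.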

On the other end of the spectrum, we have Cassandra. Cassandra is roughly a key-value store where the value consists of named cells, think SQL table columns. Concurrent writes to the same cell are resolved by Last Write Wins (LWW) (by timestamp, ties resolved by comparing values). Writes going to different servers act as concurrent writes, even if there were hours or days between them -- they are only resolved when the two servers manage to gossip about the state of their data, at which time both servers storing that key choose the same LWW winner.
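The LWW merge rule is simple enough to sketch in a few lines. This is a minimal illustration of the resolution logic described above (real Cassandra cells carry more metadata):

```python
# Minimal sketch of Cassandra-style Last-Write-Wins cell merging:
# the highest timestamp wins; ties are broken by comparing values,
# so two gossiping replicas always converge on the same winner.

def lww_merge(cell_a, cell_b):
    """Each cell is a (timestamp, value) pair; return the winner."""
    ts_a, val_a = cell_a
    ts_b, val_b = cell_b
    if ts_a != ts_b:
        return cell_a if ts_a > ts_b else cell_b
    # Tie: both replicas deterministically pick the same winner.
    return cell_a if val_a >= val_b else cell_b

# A later write beats an earlier one, no matter which server took it:
assert lww_merge((100, "x"), (200, "y")) == (200, "y")
# Timestamp tie: the larger value wins on both replicas:
assert lww_merge((100, "b"), (100, "a")) == (100, "b")
```

Note the merge is commutative and deterministic, which is exactly why the servers can resolve writes hours or days later without any coordination.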

In Cassandra, consistency is a caller-chosen quantity, from weak to durable-for-write-once to okay. (They added stronger consistency models in their later years, but I don't know much about them so I'm ignoring them here.) A writer can say "as long as my write succeeds at one server, I'm good" which means readers talking to a different server might not see it for a while. A writer can say "my write needs to succeed at majority of live servers", and then if a reader requires the same "quorum", we have a guarantee that the write wasn't lost due to a malfunction. It's still LWW, so the data can be overwritten by someone else without noticing. You couldn't implement a reliable "read, increment, write" counter directly on top of this level of consistency. (But once again, they added some sort of transactions later.)
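The quorum guarantee mentioned above comes down to simple counting: with N replicas, if a write reaches W of them and a read consults R of them, then whenever W + R > N the read set must overlap the write set in at least one replica. A brute-force check of that pigeonhole argument (illustrative only, not Cassandra's implementation):

```python
# Why quorum reads see quorum writes: with N replicas, W write acks,
# and R replicas consulted on read, W + R > N forces at least one
# replica to appear in both sets.

from itertools import combinations

N = 3   # replicas
W = 2   # write quorum
R = 2   # read quorum

replicas = range(N)
overlap_always = all(
    set(ws) & set(rs)   # non-empty intersection => read sees the write
    for ws in combinations(replicas, W)
    for rs in combinations(replicas, R)
)
assert overlap_always   # holds because W + R = 4 > N = 3
```

With W = 1 (the "as long as my write succeeds at one server" setting), the same check fails, which is exactly the window where a reader talking to a different server misses the write.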

The grandparent was asking for content-addressed storage enabling a coordination-free data store. So something more along the lines of Cassandra than RADOS.

Content-addressed means that e.g. you can only store "Hello, world" under the key SHA256("Hello, world"). Generally, that means you need to store that hash somewhere to ever see your data again. Doing this essentially removes the LWW overwrite problem: assuming no hash collisions, only "Hello, world" can ever be stored at that key.
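A minimal content-addressed store fits in a few lines, which makes the "no overwrites" property easy to see:

```python
# Minimal content-addressed store: the key is the SHA-256 of the value,
# so a key can only ever map to one possible value (barring hash
# collisions) and concurrent writers cannot clobber each other.

import hashlib

class CAStore:
    def __init__(self):
        self.blobs = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        self.blobs[key] = data   # idempotent: same bytes, same key
        return key

    def get(self, key: str) -> bytes:
        return self.blobs[key]

store = CAStore()
key = store.put(b"Hello, world")
assert store.get(key) == b"Hello, world"
# Writing the same bytes again changes nothing -- no LWW conflict possible:
assert store.put(b"Hello, world") == key
```

Puts are idempotent and commutative, which is what makes replication of such a store coordination-free.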

I have a pet project implementing content-addressed convergent encryption to an S3 backend, using symlinks in a git repo as the place to store the hashes, at https://github.com/bazil/plop -- it's woefully underdocumented but basically a simpler rewrite of the core of https://bazil.org/ which got stuck in CRDT merge hell. What that basically gets me is that e.g. ~/photos is a git repo with symlinks to a FUSE filesystem that manifests the contents on demand from S3-compatible storage. It can use multiple S3 backends, though active replication is not implemented (it'll just try until a write succeeds somewhere; reads are tried wider and wider until they succeed; you can prioritize specific backends to e.g. read/write nearby first and over the internet only when needed). Plop is basically a coordination-free content-addressed store, with convergent encryption. If you set up a background job to replicate between the S3 backends, it's quite reliable. (I'm intentionally allowing a window of only-one-replica-has-the-data, to keep things simpler.)
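The convergent-encryption idea can be illustrated with a toy sketch. To be clear, this is not plop's actual code, and the sha256-counter keystream below is a toy, not a real cipher: the point is only that deriving the key from the plaintext makes identical plaintexts encrypt to identical ciphertexts, so they deduplicate while the storage backend sees only ciphertext.

```python
# Toy illustration of convergent encryption (NOT plop's code, and the
# hash-counter keystream is not a real cipher): the key is derived
# from the content itself, so identical files converge to one blob.

import hashlib

def keystream(key: bytes, n: int) -> bytes:
    out = b""
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:n]

def convergent_encrypt(plaintext: bytes):
    key = hashlib.sha256(plaintext).digest()     # key derived from content
    ct = bytes(a ^ b for a, b in
               zip(plaintext, keystream(key, len(plaintext))))
    addr = hashlib.sha256(ct).hexdigest()        # content address of ciphertext
    return addr, key, ct

a1, k1, c1 = convergent_encrypt(b"same photo bytes")
a2, k2, c2 = convergent_encrypt(b"same photo bytes")
assert (a1, c1) == (a2, c2)   # identical content -> identical stored blob
```

The hash you'd keep in the symlink plays the role of `addr` plus the key; anyone holding both can fetch and decrypt, while the S3 backend never sees plaintext.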

Here's some of the more industry-oriented writing from my bookmarks. As I said, it really is a university course (or three, or a PhD).

I upvoted this but I also wanted to say as well that this summary is valuable for me to gain a better groundwork for an undoubtedly complex topic. Thank you for the additional context.

I use SeaweedFS as well. It has some warts, as well as some feature incompleteness, but I think the simplicity of the project itself is a pretty nice feature. It's grokkable pretty quickly, since it's only one dev and the codebase is pretty small.

Not only is it delicious, but it's often a quarter of the price (or less) compared to other European countries. I'm not sure why beer is so cheap in the Czech Republic specifically, but every country that surrounds it is significantly more expensive.

The author must have just been trying to be cute, because yes, they were never referred to as energy beers. In fact, Four Loko was so popular that the brand itself suffered from genericization, where all caffeinated malt liquors were colloquially referred to as Four Loko

Beyond the concerns that have already been raised, this is also a substance that is pretty untested at these dosages for ingestion. Side effects are unknown. But who knows, maybe it's like vaping: we know it's probably bad for you to some extent, but the consensus is that it's probably not worse than smoking cigarettes.

But way more accessible and cool for under 30s

One thing I wish were a bigger thing in the States is more near-beer-like options. You basically have two choices for low-alcohol beer in the States: you either buy the most watered-down thing available, which is usually around 3.5%, or you buy N/A (<0.5%) or hop water (0%). I would love to just have a place where I can get 1%-2% beverages to easily pace myself without having to control my intake rate. The one-on, one-off solution of alternating N/A and regular works, but requires a fair amount of self-control. I have also tried naltrexone, which mostly just gives me side effects and makes me feel terrible for a couple of days, not really helping slow down intake.

Having lived in Sweden for a bit, which has a much better approach to this, I can say it's much nicer: Systembolaget has options at pretty much any percentage you like, and breweries like Mikkeller are making incredible low-ABV options. 2% is available at the grocery store, and bars are legally required to offer N/A options.



Kvass is pretty popular in many former USSR states, and can easily be made at home. I have no idea if you will like the taste; I'm used to it from childhood. It has 1-2% alcohol and gives a slight buzz if you drink liters of it (which I tend to do in summer).

Interesting. I don’t typically like the flavor profile of beverages like kombucha but do like sour beers so this is worth a shot. Thanks!

I didn't expect Sweden's Prohibition history to be so interesting.


> the most watered down thing available which is usually around 3.5%

There's many beers in the 6%+ range; they're just not at many grocery stores, and instead you have to go to a specialty store.

Perhaps you misunderstood the comment (and indeed, rereading it, I can see how it might be taken to mean that only 3.5% beers are available in the States, which is definitely not the case; I've edited it to be clearer). There are certainly tons of options in the States above 3.5%. I've seen 18% beers before. But there are almost no options between N/A (legally less than 0.5%) and 3.5%, which is about the ABV of your typical domestic lager.

I think kombucha used to have 1-ish percent or more but they were made to get rid of most of the alcohol.

I recall that a lot of the problem there was that regulators were treating kombucha as an alcoholic beverage, which makes distribution a lot more difficult.

Beyond the features that the sibling comment mentioned, this kind of data isn’t really for end users. It’s a way that you can package it up, “anonymize” it, and sell the data to interested parties.

For someone like Notion, they probably aren't selling this data. The primary use case is internally for analysis (eg product usage, business analysis, etc).

It can also be used to train AI models, of course.

That "probably" is doing a lot of heavy lifting. That said, whether they sell it or not, it's all that data that is their primary value store at the moment. They will either go public or sell, eventually. If they go public, it'll likely be similar to Dropbox; a single fairly successful product, but failing attempts to diversify.

"Selling" is a load-bearing word, too. They're probably not literally selling SQL dumps for hard cash. But there are many ways of indirectly selling data, that are almost equivalent to trading database dumps, but indirect enough that the company can say they're not selling data, and be technically correct.

Is that why they’re putting images in Postgres? I don’t understand that design decision yet.

Notion employee here. We don't put images themselves in Postgres- we use s3 to store them. The article is referring to image blocks, which are effectively pointers to the image.

I... Don't think they are? If you look at the URL for images in notion, you can see the S3 hostname.

They didn’t say the quiet part out loud, which is almost certainly that the Fivetran and Snowflake bills for what they were doing were probably enormous and those were undoubtedly what got management’s attention about fixing this.

Found this comment on the article (from Fivetran's CEO, so read it with that in mind) enlightening regarding the costs they were facing: https://twitter.com/frasergeorgew/status/1808326803796512865

Snowflake as a destination is very, very easy to work with on Fivetran. Fivetran didn't have S3 as a destination until late 2022, so it literally forced you to use one of BQ, Snowflake, or Redshift. So the Fivetran CEO's defence is pretty stupid.

They weren't that quiet about it:

> Moving several large, crucial Postgres datasets (some of them tens of TB large) to data lake gave us a net savings of over a million dollars for 2022 and proportionally higher savings in 2023 and 2024.

I'd like to see more details. 10s of TB isn't that large -- why so expensive?

Fivetran charges by "monthly active rows", which quickly adds up when you have hundreds of millions to billions of rows that are constantly changing.
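A back-of-the-envelope calculation shows how fast this adds up. The per-MAR rate below is purely hypothetical (Fivetran's real pricing is tiered and changes over time); the point is only the multiplication:

```python
# Back-of-the-envelope only: the rate below is invented for
# illustration; Fivetran's actual MAR pricing is tiered and negotiated.

monthly_active_rows = 2_000_000_000          # rows touched at least once a month
hypothetical_rate_per_million = 10.0         # dollars per million MAR (made up)

monthly_cost = monthly_active_rows / 1_000_000 * hypothetical_rate_per_million
assert monthly_cost == 20_000.0              # $20k/month at these assumed numbers
```

At billions of constantly-edited rows, even a small per-row price lands in the hundreds of thousands per year, which is consistent with the seven-figure savings the article quotes.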


yep, and Notion's data model is really bad for this pricing. Almost every line you type is a "block" which is a new row in their database.

They’re likely paying for egress from the databases as well.

DBA salaries, maybe?

Maybe cloud hosted

I thought the quiet part was that they are data mining their customer data (and disclosing it to multiple third parties) because it’s not E2EE and they can read everyone’s private and proprietary notes.

Otherwise, this is the perfect app for sharding/horizontal scalability. Your notes don’t need to be queried or joined with anyone else’s notes.

Also whether this data lake is worth the costs/effort. How does this data lake add value to the user experience? What is this “AI” stuff that this data lake enables?

For example, they mention search. But I imagine that is just searching within your own docs, which I presume should be fast and efficient if everything is sharded by user in Postgres.

The tech stuff is all fine and good, but if it adds no value, it's just playing with technology for technology's sake.

I too was surprised to read that they were syncing what reads, at a glance, as their entire database into the data lake. IIUC, the reason Snowflake prioritizes inserts over updates is that you're supposed to stream events derived from your data, not the data itself.

This ^. This switch from managed to in house is a good example of only building when necessary.
