Show HN: Hydra 1.0 – open-source column-oriented Postgres (hydra-so.notion.site)
269 points by coatue on Aug 3, 2023 | 77 comments
hi hn, hydra ceo here

hydra is an open source, column-oriented postgres. you can set up remarkably fast aggregates on your project in minutes to query billions of rows instantly.

postgres is great, but aggregates can take minutes to hours to return results on large data sets. long-running analytical queries hog database resources and degrade performance. use hydra to run much faster analytics on postgres without making code changes. data is automatically loaded into columnar format and compressed. connect to hydra with your preferred postgres client (psql, dbeaver, etc).
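
here's the minimal flow (a sketch; table and column names are illustrative):

  CREATE TABLE metrics (at timestamptz, series_id bigint, value float8) USING columnar;
  INSERT INTO metrics
    SELECT now(), s % 100, random() FROM generate_series(1, 1000000) s;
  -- the aggregate below only reads the two columns it touches
  SELECT series_id, avg(value) FROM metrics GROUP BY series_id;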

following 4 months of development on hydra v0.3.0-alpha, our team is proud to share our first major version release. hydra 1.0 is under active development, but ready for use and feedback. we’re aiming to release 1.0 into general availability (ga) soon.

for testing, try the hydra free tier to create a column-oriented postgres instance on the cloud. https://dashboard.hydra.so/signup




Ok so this is a Citus fork.

Where can I read about what the differences / trade offs are? I don’t see anything in the docs.


Congratulations!

Can we expect support for gist, gin, spgist, and brin indexes sometime in the near future?

Based on the source code, it appears that they are not supported:

https://github.com/hydradatabase/hydra/blob/96056312e7c0f413...

"... Columnar supports `btree` and `hash `indexes (and the constraints requiring them) but does not support `gist`, `gin`, `spgist` and `brin` indexes."


Do different kinds of indexes work better for columnar storage? Or is it the same principles for both?


Different principles of indexing, at least based on my experience with ClickHouse.

* Column-based stores have really fast scans due to compression and vectorization, so you'll generally always read down the column. The way to speed it up is to have "skip indexes" that allow you to skip blocks, e.g., don't even bother to read/decompress them.

* Commonly used indexes need to be very sparse, so they fit in memory even when tables run to hundreds of billions of rows.

* Finally highly compressed columns can be used as indexes to filter data rapidly. ClickHouse calls this PREWHERE processing.

Edit: clarify skip indexes
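
For the curious, here's roughly what that looks like in ClickHouse DDL (a sketch; table and index names are illustrative):

  CREATE TABLE events (
      ts      DateTime,
      user_id UInt64,
      url     String,
      -- minmax skip index: stores min/max of user_id per block of granules,
      -- so a scan can skip (never read/decompress) blocks that can't match
      INDEX user_skip user_id TYPE minmax GRANULARITY 4
  ) ENGINE = MergeTree ORDER BY ts;

  -- the sparse primary index follows ORDER BY, one entry per granule
  -- (8192 rows by default), so it stays in memory even for huge tables;
  -- PREWHERE filters on a column before the remaining columns are read
  SELECT count() FROM events PREWHERE url LIKE '%checkout%' WHERE user_id = 42;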



how does Hydra compare to Citus? https://www.citusdata.com


generally faster across the board; a lot of work was done to expand and speed it up, plus add updates, deletes, and vacuuming.

https://benchmark.clickhouse.com/#eyJzeXN0ZW0iOnsiQXRoZW5hIC...


Since benchmarks can be misleading, I want to point out that the differences between Hydra and the "tuned"[0] PostgreSQL (which is just some very basic settings) are a lot less convincing, with plain old PG coming out ahead on quite a few: https://tinyurl.com/eju9tht2

I also noticed quite a bit of parity between Hydra and Citus on data set size. Is Hydra a fork of Citus columnar storage?

0 - https://github.com/ClickHouse/ClickBench/blob/main/postgresq...


> plain old PG coming ahead on quite a few

I've found that's common among these types of databases (e.g. Citus, Timescale, etc.): they perform well under very specific conditions, and worse for many (most?) other things, sometimes significantly worse.

That said, Hydra does take up ~17.5G for that benchmark and "PostgreSQL tuned" about 120G, the insert time is ~9 times faster, and "cold run" is quite a bit faster too. It's only "hot run" that shows a fairly small difference. I think it's fair to say Hydra "wins" that benchmark.

> Is Hydra a fork of Citus columnar storage?

Yes: "Hydra Columnar is a fork from Citus Columnar c. April, 2022".


> Hydra does take up ~17.5G for that benchmark and "PostgreSQL tuned" about 120G

you can run pg on compressed filesystem


I'm sure you can, but AFAIK neither uses compression in that benchmark so it's a fair comparison. Even if filesystem compression would reduce that to 17.5G (doubtable), it won't be free in terms of CPU cycles, and no matter what it's still ~120G to load in memory, bytes to scan/update, etc.


my bet is that hydra uses compression inside already, otherwise it is hard to explain where the difference comes from.

> it won't be free in terms of CPU cycles

it can reduce IO traffic significantly, and it can be a very positive trade-off depending on the circumstances.
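
e.g. on zfs (one way to do it; pool/dataset names are illustrative):

  zfs create -o compression=zstd -o recordsize=16k tank/pgdata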


I had assumed that PostgreSQL is so much larger because it creates heaps of indexes (which is probably also why inserts are so much slower for it), but I don't really have a good way to confirm that quickly.


one can choose to not create "heaps of indexes".


At which point your performance will drop like a brick for these types of queries – I'm pretty sure these indexes weren't added for the craic.


it depends on your query obviously.

In general, I did very deep benchmarking of pg, clickhouse and duckdb, and I sure didn't make stupid mistakes like this: https://news.ycombinator.com/item?id=36990831

My dataset has 50B rows and 2 TB of data, and I think columnar DBs are very overhyped; I chose pg because:

- pg performance is acceptable, maybe 2-5x slower than clickhouse and duckdb on some queries, if pg is configured correctly and run on compressed storage

- clickhouse and duckdb start falling apart very fast because they are specialized for a very narrow type of query: https://github.com/ClickHouse/ClickHouse/issues/47520 https://github.com/ClickHouse/ClickHouse/issues/47521 https://github.com/duckdb/duckdb/discussions/6696


"2-5x times slower" can mean the difference from 2 seconds to 4 to 10 seconds. Two seconds is still (barely) acceptable for interactive usage, ten seconds: not so much. You're also going to need less beefy servers, or fewer servers.

I also "just" use PostgreSQL for all of this by the way, but the limitations are pretty obvious. You're much more limited in what you can query with good performance, unless you start creating tons of queries or pre-computed data and such, which have their own trade-offs. Columnar DBs are "overhyped" in the sense that everything in programming seems to be, but they do exist for good reasons (the reason I don't use it are because they also come with their own set of downsides, as well as just plain laziness).


ClickHouse can do large GROUP BY queries, not limited by memory: https://clickhouse.com/docs/en/sql-reference/statements/sele...


as explained in https://github.com/ClickHouse/ClickHouse/issues/47521#issuec... it can't; that parameter only applies to the pre-aggregation phase, not the aggregation itself.

The feature request is not implemented yet: https://github.com/ClickHouse/ClickHouse/issues/40588


ClickHouse uses "grace hash" GROUP BY with the number of buckets = 256.

It can handle aggregation state about 256 times larger than memory because only one bucket has to be in memory while merging. It works for distributed query processing as well and is enabled by default.

About the linked issue - it looks like it is related to some extra optimization on top of what already exists.
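
For anyone who wants to try it, the relevant knob looks like this (the setting exists in ClickHouse; the value is illustrative):

  -- cap in-memory aggregation state and spill to disk past ~10 GB:
  SET max_bytes_before_external_group_by = 10000000000;
  SELECT user_id, count() FROM events GROUP BY user_id;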


> only one bucket has to be in memory while merging.

it's hard for me to judge the implementation details, but per that person's reply, memory is also multiplied by the number of threads doing the aggregation.


Do you happen to have any documentation about your benchmarking? I'm also considering these options at the moment (currently using pg+timescaledb) and interested in what you found.


I don't have documentation.

I just created large tables; tried to join, group by, and sort them in pg, clickhouse, and duckdb; looked at what failed or was slow; and tried to resolve it.

I am happy to answer specific questions, but I didn't use timescaledb.


> 0 - https://github.com/ClickHouse/ClickBench/blob/main/postgresq...

that postgres config is very underpowered, it has only 8 workers per gather while the machine has 192 vCPUs.


Submit a PR?


I am not sure what the process is; who would rerun the benchmarks then?


Right when I was thinking URL shorteners were out of fashion... /S


It's just there to make it easier for mobile users to click it.


I don’t see why the GitHub link is any harder to click than the tiny url link in that post.

I'm pretty sure the only reason URL shorteners exist with a purpose is Twitter's limits (and software that doesn't visually hide egregiously long URLs), but they continue to be used outside those places due to cargo culting.


Congrats on the launch!

For those who have not experimented with columnar based databases, I would highly recommend toying around with them.

The performance improvements can be substantial. Obviously there are drawbacks involved with integrating a new database into your infrastructure, so it is exciting to see columnar format introduced to Postgres. Removes the hurdle of learning, deploying and monitoring another database.


Hydra Columnar is a fork from Citus Columnar c. April, 2022.


Citus: https://github.com/citusdata/citus

BTW, Citus license is GNU Affero General Public License (github lists “conditions: same license”) and hydra is Apache. How is that possible if the latter is a fork? There’s probably something about these licenses I’m not aware of and I’m curious.


hydra columnar inherits its license from citus: https://github.com/hydradatabase/hydra/blob/main/columnar/LI...

but, hydra itself is more than just the columnar extension.


Thanks for explaining. This is confusing to me as a github user, i.e. if I saw a license in the project’s description, I wouldn’t expect another license in a subdirectory.

GitHub now has UI for repos with multiple licenses: https://github.blog/changelog/2022-05-26-easily-discover-and... ; that would have been clearer for me.


They must not look at anything other than the top directory, then, because this repo's sidebar only says "Apache-2.0 license". That is, I grant, clarified down at the _bottom_ of the README, but if that multi-license thing really wants to be helpful it should at least look one subdirectory down.


Always awesome to see folks moving Postgres forward. Congrats on the launch!


Talked to OP last night and played around with it this morning. This is something I've wanted to see added to postgres for a long time, and couldn't have been done by a nicer and more accommodating founder. Very excited.


Congrats on the 1.0 milestone!

A few months ago, we worked with the team to bring Hydra to Jupyter, you can check out the tutorial here: https://docs.hydra.so/analyze/jupyter

JupySQL's GitHub: https://github.com/ploomber/jupysql


Nice, and thanks @edublancas, that was a useful tutorial; it's nice to be able to query hydra with SQL via Jupyter. Fan of your project.


awesome project - we tested Zing Data (http://www.zingdata.com) with Hydra to bring really fast analytical queries on postgres to analytics users on mobile, and so far have seen great results.


This looks really impressive, and I'm excited to see how it performs on our data!

P.S., I think the name conflicts with Hydra, the configuration management library: https://hydra.cc/


And also the password bruteforcing tool by THC.


Also with the OIDC server by Ory [1] and a certain defunct Russian darknet marketplace [2].

[1]: https://www.ory.sh/hydra/

[2]: Not today, tovarisch mayor.


The many heads of Hydra


Congrats on the launch.

Curious, why not contribute to Citus? Why create a fork and have duplicated effort?


Because then it would be much harder to bait investors. By forking, they can more confidently say things like, "We built Hydra to solve real problems affecting real people who will part with oodles of money to access our totally unique solution."


I would love to have the ability to define "column groups" in regular Postgres. For example most of the time I just leave the table row-oriented, but when trying to optimize performance of specific tables and queries I could tell postgres to store one or a handful of columns separately similar to a column oriented database. This would allow getting quite a bit of performance out of my database for the few cases in OLTP where I need to do some scans or searches. Even just breaking out some rarely used columns into a separate group could be quite nice for keeping the caches hot.
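
Something like this imaginary DDL, say (purely hypothetical, nothing like it exists in Postgres today):

  -- hypothetical syntax, not real Postgres:
  CREATE TABLE orders (
      id         bigint,
      status     text,
      audit_blob jsonb
  ) WITH (column_groups = '(id, status), (audit_blob)');  -- keep rarely used audit_blob out of the hot group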


From my initial 5 minutes of getting set up with an existing install, it's all fairly easy.

# Requirements

  - lz4
  - zstd
# Building

  git clone https://github.com/hydradatabase/hydra
  cd hydra/columnar
  ./configure
  make
  make install
# Install

  CREATE EXTENSION IF NOT EXISTS columnar;
The actual extension docs are at https://docs.hydra.so/concepts/using-hydra-columnar
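
From there it's plain SQL; on Postgres 15+ you should even be able to convert an existing heap table in place (a sketch, untested; table names illustrative):

  CREATE TABLE hits (ts timestamptz, url text) USING columnar;
  ALTER TABLE old_hits SET ACCESS METHOD columnar;  -- SET ACCESS METHOD is PG 15+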


Very out of touch with Postgres, but is there a native column-oriented table type option in Postgres, so that you choose either row-based or columnar in the CREATE TABLE DDL?


I don't believe Postgres has this natively, but an alternative to OP is Citus, a Postgres extension which allows this kind of syntax.

https://www.citusdata.com/blog/2021/03/06/citus-10-columnar-...

EDIT: per another comment, OP is a fork of Citus Columnar!


Congrats to the entire Hydra team on the launch! We (Artemis - https://www.artemisdata.io/) are stoked to be building with you as a partner to help data teams analyze data even faster!


super impressive performance improvements!

do most of your customers replicate their postgres database to Hydra for analytics jobs, or what's the typical setup?


Can you add the extension to an existing database?


That's also my question. Couldn't find anything for it in the docs provided.

Also, how do you migrate data from an existing database? (Is it the usual pg_dump/psql combo?)


Indeed, I am also wondering: if I have an existing Postgres DB, how do I leverage this?


For a Postgres extension, there's a strange lack of documentation on how to add Hydra to an existing PG installation.


This looks wild! Been looking for a good event-based logs DB and didn't want to go full clickhouse. This will do!


How does it compare to Timescaledb? Performance? Features? (TSDB is my current Postgres data lake setup)


This is really cool, just played around with it a bit but excited to do a deeper dive later. Nice work guys!


Thanks! Feel free to DM me in the hydra discord (or email) anytime


Such a game changer and useful alternative to legacy databases!


How are updates handled? Is it doing a merge-on-read?


First we added a bitmask to mark rows as deleted - these rows are filtered out on read. Then updates are implemented as deletions + inserts. We have also added vacuum functions to remove/rewrite stripes that have >20% of deleted rows in order to reclaim space and optimize those stripes.
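
Conceptually, then (a sketch of the behavior described above, not the actual internals):

  UPDATE hits SET url = 'y' WHERE id = 1;
  -- internally: flip id=1's bit in its stripe's deletion bitmask,
  -- then insert the new row version; reads filter masked rows out.
  -- vacuum rewrites stripes that are >20% deleted to reclaim space.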


Is hydra compatible with the PostGIS extension?


I wanted to say thank you for using actual Open Source licenses. It's gotten to where I treat any recent "Launch HN" or "Show HN" containing "open source" in the title as "well, I wonder which crazy license this project is using"


Can you point me to examples of Launch HNs using funky licenses? Show HNs are free-form but Launch HNs are curated by us, and I'd like to know what red flags to watch for.

(As this is offtopic for the current Show HN, it might be better to email hn@ycombinator.com if you, or anyone, would be willing to share that way.)


I sent email to avoid being a distraction, but I did want to follow up publicly and say that I apologize for lumping Launch HN into the same bucket as Show HN. For the most part the Launch ones are really Open Source and I apologize for the over generalization :-(


Congrats on the success!


can I ask if you guys take contributors?


of course :) drop by our Discord if there's something you'd like to contribute and want to chat about it beforehand, need help/have questions getting started, etc. https://hydra.so/discord


Are you funded?


Check out the "about us" section on the page for more details! "We are excited to share that Hydra raised a $3.1M seed round to drive development of columnar Postgres. We remain committed to sharing our upcoming releases to open source."


Congrats guys!


Really cool!


> to query billions of rows instantly

Rows? Rows?!? What's the point of having a columnar DB to query rows?


Columnar DBs often allow you to have tables consisting of multiple columns where values in a column are correlated with values in the other columns. All such correlated values belonging to different columns are commonly called a "row" despite not being stored contiguously.

Generally what you do is scan a column and evaluate a predicate on each value you encounter during the scan (possibly in parallel). For each value of that column that matches the predicate, you keep track of the "position" of the value in the column (a common technique is a sparse set data structure, such as a roaring bitmap). Then you scan through another column and select the values at the saved "positions".

As you can see, it's not a stretch to view values from different columns that belong to the same "position" as belonging to the same "row", and the "position" as the "row index" or "row id".
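
Concretely, for a query like the one below, the "rows" only get assembled at the very end (a sketch; table and column names are illustrative):

  SELECT url FROM hits WHERE user_id = 42;
  -- 1. scan the user_id column, testing the predicate on each value and
  --    collecting matching positions in a sparse set (e.g. a roaring bitmap)
  -- 2. scan the url column, materializing only the values at those positions
  -- the shared position is what gets reported back as a "row"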


Sure, you could think of a row as a "tuple of values returned from all columns at the same index". But why was it so important to measure and present row access for a primarily columnar store? What's the point of measuring and counting row access? Don't existing row-oriented DBs already work for that case?



