By the way, the 10 million number in the title is outdated. It's now about 30 million indexes.
Also, I've been publishing one blog post every weekday on my personal blog about a single Postgres feature or component. If you liked this post, you should check it out: http://malisper.me
1. How good is Citus really? It seems almost too good to be true, but this line on their product page gives me pause: "[Citus is not a good fit for] free form SQL by analysts. Because of the distributed nature of Citus some SQL primitives do not map seamlessly. If you have data scientists or end customers that are looking to run free-form SQL, not controlled by an application, then Citus is not an ideal fit for you." I'm seriously thinking of migrating to them for a project we have.
2. Do you use the cloud or the enterprise pricing for Citus? If enterprise, how much are we looking at here? I assume it's at least five digits in setup fees, but enterprise pricing is so hard to predict.
3. Before moving to Citus, what were some Postgres performance tactics, other than partial indexes, that made a real difference?
4. When did a single Postgres instance on a single server stop scaling? Was it a problem of writes per second or of querying? How long did it take to migrate to Citus?
5. What is your primary application written in? Do you split it off by endpoint and use something higher level for the low throughput endpoints or do you just write everything in, say, Java or Go?
6. How do you deal with cyber attacks? Even normal sites get targeted, but breaking into an analytics company seems like a cybercriminal's dream. How do you distinguish real from fake traffic on your clients' websites? I could imagine spammers exploiting this, or is that not a primary problem?
I feel tempted to ask more, but I'm afraid I'd be exploiting your offer to the point of being a chore.
1. We've got plenty of customers that rely on Citus as their system of record. How good it really is comes down to your use case. We're a good fit for certain transactional workloads that have a natural shard key (as is the case for many multi-tenant apps), or for hybrid transactional/analytical apps like Heap. In these cases you're usually powering an end-user-facing dashboard and need subsecond query response across a large dataset, much like Heap. We're less of a fit for traditional data warehousing, and if you only have 10 GB of data, single-node Postgres works just fine.
2. We have customers that come on board for both Enterprise and Cloud. In the year since we launched our database-as-a-service, we've seen a steeper curve of people ramping up on it. It's, of course, easier to press a button and get a scalable database than it is to set one up and manage it yourself. In general, if you're already running on AWS, Citus Cloud makes sense; if you can't run on AWS, we're always happy to try to make Enterprise work for you, and there's always the open source version.
4. This really varies by application. Lots of apps don't run into issues until they're at 1-2 TB, while others benefit from parallelism as early as 40-50 GB. That's not to say they couldn't keep throwing more hardware at it, but we've seen customers migrate at 100 GB and get a 2-3x or greater performance improvement over single-node Postgres.
As the rest are a bit more specific to Heap I'll let them chime in, though happy to answer any more questions from the Citus side where helpful.
I had a look at Citus for analytics workloads (typically full-table-scan queries), comparing it with Greenplum and Impala (which uses HDFS and the Hive metastore, but not much more from Hadoop). I found Citus not very competitive with Greenplum. Greenplum executed table scans quite a bit faster, but more importantly, Citus doesn't ship with a column store (you have to add the cstore_fdw extension manually), and column stores massively increase table scan performance, by multiple orders of magnitude. Greenplum in turn didn't perform as well as Impala as data scaled up; Impala could also deal with thousands of columns, whereas Greenplum and Citus inherit Postgres's limits. Greenplum is also tied together with ssh and scripting duct tape; operationally it's a lot less flexible than HDFS.
Postgres on its own isn't very good at analytics queries. It's far too slow, for similar reasons to why Citus without cstore was too slow for us, only with even less I/O parallelization.
I am a frontend developer who wants to start making (and deploying) full-stack apps. I've heard that Postgres is one of the most reliable relational databases out there, and I would love to use it in my apps.
Any suggestions on how to get started with Postgres? Or relational dbs in general? Books/courses/making projects/anything else?
Thanks in advance!
There are a lot of blog posts on HN about bending databases like Postgres to pretty insane workloads, and Heap's workload is a nice example of that. They're interesting blog posts, but don't let them distract you. Most databases really do work pretty well by default if you just use them in the simplest way possible. Most apps don't have insane workloads, and especially when you start out, yours won't.
If you just make a table for every "thing" in your system and link them together with foreign keys you're pretty much set. Most popular frameworks (Rails, Django, etc) do this for you without even touching SQL.
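To make the "table per thing, linked by foreign keys" idea concrete, here's a minimal sketch (the table and column names are hypothetical, not from the post):

```sql
-- One table per "thing", linked together with foreign keys.
create table users (
    id    serial primary key,
    email text not null unique
);

create table posts (
    id         serial primary key,
    user_id    integer not null references users (id),
    body       text not null,
    created_at timestamptz not null default now()
);

-- Frameworks like Rails and Django generate joins like this for you:
select p.*
from posts p
join users u on u.id = p.user_id
where u.email = 'alice@example.com';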
This approach is fine for nearly every case. I bet even Heap started like that. Famously, Twitter started like that as well, and the only reason that got them in trouble (remember the failwhale?) was because they were growing faster than nearly any platform ever before. Most sites don't 10x their user count every half year, and if you do, lucky you. Scaling problems are luxury problems.
Half of HN is jealous of people who have scaling problems. I sure am. The other half, well, they won't have scaling problems because they're not shipping fast enough. Their backends would scale though. They read all the books on Postgres and doing it right from the start!
Everything in moderation. You can often run a single master PostgreSQL system into the millions of transactions a second without having to resort to sharding and other complexities.
Django's ORM is very easy to use, but it is not especially powerful, and when you want to do advanced queries (multiple join conditions, for example) it often isn't possible. When someone asks me for some data from the database, it's usually easier for me to go directly to the database than to use the Django ORM.
Use the ORM to start with, but it's most definitely a good idea to learn proper SQL.
So, I think starting with PHP and raw SQL is a good thing for motivated learners.
And if you need motivation, try building a WordPress site with themes and plugins first. And maybe an Android app. And then something cute like a LED blinker with a raspberry.
Then something that I felt taught me a lot was endeavouring to use only a single SQL query per HTTP request. That meant some silly things like learning CTEs in much more depth (including doing updates and selects via multiple CTEs within a "single" query), and using more aggregation and window functions to create the JSON responses from the DB itself as much as possible. (I used some PL/pgSQL as well, although I tried not to reach for it every time, hence why I mention it in brackets rather than directly :-p )
It wasn't perfect, but it was a fun thing to do, and I felt like I learnt a lot more about Postgres in particular (and SQL/DBs on the whole) as I progressed.
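For anyone unfamiliar with data-modifying CTEs, here's a rough sketch of the "updates and selects in a single query" idea (table names are hypothetical):

```sql
-- Archive old rows and return a JSON summary, all in one statement.
with moved as (
    delete from events
    where created_at < now() - interval '90 days'
    returning *
), archived as (
    insert into events_archive
    select * from moved
    returning id
)
select json_agg(id) as archived_ids
from archived;
```

Each CTE runs exactly once per statement, so the delete, the insert, and the final JSON aggregation all happen in a single round trip to the database.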
(The reason I mention this is that Massive.js is really nice for Postgres, and as a frontend developer, using Node.js might be reasonably approachable?)
It's really worth checking out all his stuff - his explanations are concise and easy to follow: http://www.craigkerstiens.com/categories/postgresql/
Books-wise, Learning SQL is a great start. http://shop.oreilly.com/product/9780596007270.do
The postgres docs themselves are also pretty fantastic.
The O'Reilly books are a good reference.
For theory I really enjoyed reading E.F. Codd's papers on the relational model. He was a good writer!
I am also writing a book you might be interested in https://agentultra.com/postgres-book.html where I focus on treating PostgreSQL as an application server. It covers data design, server programming, extensions, and integration with external processes and data stores.
Outside of that, there's a lot that you'll just pick up through usage. Get comfortable with basic SQL and then everything else in Postgres is a bonus.
I attempted to write a high level capability overview for that exact purpose. :-)
You've got to consider that indexes can be viewed as cut-down tables that you can't directly address.
I would have thought that you were going to start leaning heavily on the query optimiser/index updaters, which may well not be designed for this kind of approach. With the caveat that it obviously seems to be working now :-)
We've come up with a schema/set of indexes that we think would allow us to have a fixed set of indexes. The downside is we would have to do a lot of the work ourselves that Postgres is currently doing.
Would it be possible to filter out only interesting events?
The Heap philosophy is to capture all of the data up front and then make it easy for our customers to specify and query the subsets of the data they care about. That way our customers never have to write any tracking code and they never have to worry about forgetting to track something.
- Are you storing each log event as a row in Postgres?
- How do you rotate logs, what happens when logs get indefinitely large?
- Can logs be exported to some common format for offline analysis?
We currently have a default three-year retention policy: we keep your old data for up to three years.
We have a Redshift integration which you can use to export your data from Heap into Redshift.
What sort of performance impact do these indexes have on writes? I'm guessing PostgreSQL has to evaluate each index predicate against the inserted data to see if the index needs an entry added.
When we first deployed partial indexes as our primary indexing strategy, we did notice significantly higher CPU. At the time our assumption was that evaluating the partial index predicates was the cause of the high CPU. I later wound up investigating the high CPU use and found the time wasn't spent on evaluating the partial index predicates, but on actually parsing them! I recently wrote up another blog post on how I found this out: https://blog.heapanalytics.com/basic-performance-analysis-sa...
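For context, a partial index of the kind described looks something like this (the table and predicate here are hypothetical, not Heap's actual schema):

```sql
-- Only rows matching the WHERE clause get index entries, so every
-- INSERT must evaluate the predicate to decide whether to add one.
create index idx_signup_clicks
    on events (customer_id, time)
    where type = 'click' and path = '/signup';

-- This row fails the predicate, so it skips the index entirely:
insert into events (customer_id, time, type, path)
values (42, now(), 'pageview', '/home');
```

The win is that non-matching rows cost only a predicate check on write, rather than a full index insertion.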
Also, our website is down due to modest traffic.
The current article (Nov 2016) is here:
select * from users where username_id = 5;
but this won't use the index:
select * from users where username_id = (select 5);
Basically, the query planner won't evaluate the subquery before deciding which strategy to use, so at plan time it can't prove the value matches the partial index's predicate.
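One way to see this for yourself, assuming a partial index along these lines (hypothetical names):

```sql
-- A partial index whose predicate pins username_id to a literal value.
create index users_five on users (username_id)
    where username_id = 5;

-- The planner can match this literal against the index predicate:
explain select * from users where username_id = 5;

-- Here the value is hidden behind a subquery, so the planner cannot
-- prove it satisfies the index predicate and falls back to another scan:
explain select * from users where username_id = (select 5);
```

Comparing the two EXPLAIN outputs shows whether the partial index was chosen.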
Your example is fundamentally different from parameters sent by a client in a parameterized query, i.e. $1, $2, etc. While your flawed example will never use a partial index, parameterized queries can and often will. With prepared statements (which are a slightly different concept from parameterized queries) the devil lies in the details, but even those haven't really prevented partial indexes from being used since Postgres 9.2, released years ago.
This makes partial indexes of EXTREMELY limited use for complex queries which often need to pass a non-literal or do a join on a partially indexed column.
In fact, it is this EXTREMELY inconvenient limitation that often forces people to rearchitect large tables around a set of smaller tables with inheritance and check constraints, so that the planner will do index scans on the smaller tables, rather than a huge index.
In fact, if partial indexes were properly architected and supported, it would eliminate 90% of the use cases for partitions in PSQL. Partitions, by the way, are also a sham in PSQL 10, merely putting some syntactic sugar over table inheritance and what is basically a table-routing insert/update trigger. The current planner support for them is a joke / non-existent / worse than roll-your-own inheritance with check constraints and triggers.
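The "inheritance plus check constraints" workaround mentioned above looks roughly like this (a sketch with hypothetical names, not a complete setup):

```sql
-- Parent table holds no data directly; children carry CHECK constraints.
create table events (
    id         bigint,
    created_at timestamptz,
    payload    jsonb
);

create table events_2017_01 (
    check (created_at >= '2017-01-01' and created_at < '2017-02-01')
) inherits (events);

create index on events_2017_01 (created_at);

-- With constraint exclusion on, the planner skips non-matching children:
set constraint_exclusion = partition;
explain select * from events where created_at >= '2017-01-15';
```

Routing inserts to the right child table is the part you'd normally handle with a trigger, which is the scripting the commenter is objecting to.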
Dan gave a pretty good talk about the high-level details of how and why Heap uses Postgres a couple of years back: https://www.youtube.com/watch?v=NVl9_6J1G60.
One reason to pick pg is that SQL is really powerful for building complex queries.
Treat all your application metadata as data at the DB level; the DB's own metadata sits one level higher.
To index those effectively, you'd need to create a trigram index in Postgres, which takes up a TON of space. Because 90%+ of rows won't satisfy any of the predicates defined by the customer, it's probably unnecessary to trigram index every row.
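For reference, the trigram index being discussed would look something like this (column name hypothetical):

```sql
-- pg_trgm indexes every three-character substring of the column,
-- which is why it takes so much space on long, redundant strings.
create extension if not exists pg_trgm;

create index events_hierarchy_trgm
    on events using gin (css_hierarchy gin_trgm_ops);

-- It supports substring searches like:
select * from events
where css_hierarchy like '%div.signup%';
```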
Out of interest, malisper, how much space do you think you save by using partial indices?
edit: removed a statement I couldn't support about taking unnecessary CPU
I actually ran some experiments using a naive trigram index and found it currently doesn't work for our use case. Most of our data is highly redundant (e.g., every CSS hierarchy contains `div`). This causes every query to read a large portion of the trigram index, which makes queries really slow. I've heard the developers of pg_trgm improved the performance of this use case in a recent version of pg_trgm. I haven't rerun my experiment yet because Heap is currently on Postgres 9.5, and the new version of pg_trgm is only available for 9.6+.
I believe partial indexes collectively account for ~5% of our disk storage. A trigram index alone would cause a roughly 30% increase in our disk usage. To eliminate partial indexes, we would also need full indexes on other commonly queried fields, such as the URL or the text of what was clicked on. Each of these indexes adds a solid amount of space.
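For anyone who wants to check numbers like these on their own database, standard Postgres catalog functions make it straightforward (table name in the second query is hypothetical):

```sql
-- Ten largest indexes in the current database:
select relname,
       pg_size_pretty(pg_relation_size(oid)) as size
from pg_class
where relkind = 'i'  -- 'i' = index
order by pg_relation_size(oid) desc
limit 10;

-- Total size of all indexes on one table:
select pg_size_pretty(pg_indexes_size('events'));
```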