I've used Joe quite a bit, and all I can say is that it's a great idea that solves a noticeable problem: troubleshooting your queries.
It may not look like a hard challenge, but the reality is that most people troubleshoot them in production, which means that:
a) you need access to production, when you shouldn't;
b) you run queries in production, potentially affecting production workloads, sometimes noticeably.
Joe provisions a database clone using a ZFS snapshot (I believe LVM support was added recently) and gives you a full copy of the database, with real data. You can test queries, examine explain plans, and create indexes to experiment if you need to. And when you are done, it goes away. All with a Slack interface, so you don't need to bother with connections, provisioning, credentials, or decommissioning.
Give it a try, it is a very good tool!
> but the reality is that most people troubleshoot them in production
I find this hard to believe. Maybe on the very rare occasion where the problem is tied to production and irreproducible elsewhere, but my experience has been that most troubleshooting happens on development, testing, or staging servers.
> and gives you a full copy of the database, with real data
Other than for small toy databases, I don't see how this would work. Not to mention the huge security issue.
You can test, optimize, etc. queries on non-production servers. If the problem is production-specific, it is most likely server-specific rather than query-optimization related.
> I find this hard to believe. Maybe on the very rare occasion where the problem is tied to production and irreproducible elsewhere, but my experience has been that most troubleshooting happens on development, testing, or staging servers.
Let me rephrase: many may troubleshoot in dev/staging, but that's not the right thing to do. Query execution plans can change completely based on data volume and even on the data itself. Databases collect detailed statistics about the data to drive the planner's decisions.
Hence, tuning a query in an environment without production data can lead to query optimizations that are either irrelevant or totally wrong in production. I have seen this countless times in my professional experience.
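To illustrate the statistics point, here is a minimal Postgres sketch (table, column, and values are hypothetical):

    -- on a tiny dev dataset the planner may favor a sequential scan,
    -- while production-scale data can flip the same query to an index scan
    EXPLAIN SELECT * FROM orders WHERE status = 'pending';

    -- the statistics driving that decision are collected by ANALYZE
    -- and are visible in pg_stats
    ANALYZE orders;
    SELECT attname, n_distinct, most_common_vals
    FROM pg_stats
    WHERE tablename = 'orders';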
> Other than for small toy databases, I don't see how this would work.
Since it uses cloning at the volume/fs layer, the size of the database is irrelevant -- only the rate of changes matters.
> You can test, optimize, etc. queries on non-production servers.
No, you cannot, as I mentioned above. It's a waste of effort and can only lead to the illusion that your query is good -- one that then bites you in production.
> Let me rephrase: many may troubleshoot in dev/staging, but that's not the right thing to do.
Maybe we have a different definition of "troubleshoot". Also, dev and/or staging should have data/specs/etc. similar to production.
> Query execution plans can change completely based on data volume and even on the data itself.
Yes. That's why you could have your production server generate/log actual query plans for the queries that are causing you problems. You could even set performance (CPU, IO, bandwidth, etc.) thresholds for the server to log against. Of course, that's in addition to the profiling/monitoring data you have on the server.
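In Postgres, auto_explain does exactly this; a minimal sketch (the threshold value is illustrative):

    -- load the extension (or list it in shared_preload_libraries
    -- in postgresql.conf to cover all sessions)
    LOAD 'auto_explain';
    -- log the plan of any query slower than 500 ms
    SET auto_explain.log_min_duration = '500ms';
    -- include actual row counts and buffer usage in the logged plans
    SET auto_explain.log_analyze = on;
    SET auto_explain.log_buffers = on;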
> Hence, tuning a query in an environment without production data can lead to query optimizations that are either irrelevant or totally wrong in production. I have seen this countless times in my professional experience.
Well, then your issue is that your dev/staging setup is poor. How, or on what, do you even develop, test, and troubleshoot? Might as well develop, test, and troubleshoot straight on production.
> Since it uses cloning at the volume/fs layer, the size of the database is irrelevant -- only the rate of changes matters.
Size still does matter. And as I mentioned, there is still the security problem.
> No, you cannot, as I mentioned above. It's a waste of effort and can only lead to the illusion that your query is good -- one that then bites you in production.
"waste of effort"? Developing, testing, troubleshooting, etc should be done off production. It's only on the rare cases that you should troubleshoot on production. This is best practices and basic security.
Unless you are working on an internal office setup that doesn't face the outside world and where security and uptime don't matter. Your comment reminds me of people saying that using the "admin/root" account for everything is fine because everything else is a "waste of effort". Do you even have a dev/staging setup? What do you use it for?
Everyone's professional experience is different, but I've never heard anyone claim "many may troubleshoot in dev/staging, but that's not the right thing to do". Not the right thing to do?
In my professional experience, testing/troubleshooting is done on dev/staging. Once we feel everything is up to snuff, code gets pushed to production. If there are issues on production, we try to replicate them on dev/staging/etc. and troubleshoot there. Most of the time, we find the bug/issue. Only on the rarest of occasions do we have to troubleshoot directly on production, which serves our company and especially our clients, who have SLAs with us. Your cavalier attitude about production is something I've yet to come across. But if it works for you, then I guess that's all that matters.
Ideally, you would be absolutely right, and I would not disagree with you. But I strongly do disagree, based on many years of experience in the field.
Out of all the customers we help with their Postgres infrastructure, and that includes several $B companies, I have found none that have either:
* Dev environments dimensioned similarly to production. Actually, this typically doesn't happen on staging either, and sometimes not even in QA.
* The same data as production anywhere except, at best, in QA.
So in practical terms, I haven't seen it be possible to troubleshoot on dev/test/staging/QA environments, much less to have a copy of the full database there. This is reasonable, too: how could all developers have 5TB of data on their laptops? (Leaving aside all privacy and data protection considerations.)
That's why a tool like Joe, which thin-provisions disposable, production-like environments where you can troubleshoot your queries, is to me a very nice approach to this problem, and the one I recommend.
If I'm reading this correctly, Joe uses Docker running on the same host as your regular production DB, provisions a snapshot, and starts a new Docker container to mount it. So yes, your production DB data is 'safe', but if your query uses lots of CPU or lots of disk IO, it's still going to fight the other containers on the system for resources. I.e., a full table scan of a 2TB table is going to affect the other Docker workloads running on this same server.
Joe uses Docker and ZFS (or LVM+ext4, alternatively) on a separate machine. Data is transferred there, normally from archives (or, optionally, transferred continuously, if there is a small Postgres "sync" server with restore_command configured), stored on ZFS, periodically snapshotted and prepared (at this point we can anonymize it, see https://gitlab.com/postgres-ai/database-lab/-/blob/master/sc...), and snapshotted again. Such snapshots of prepared PGDATA can be taken periodically. And from there, we do thin provisioning in a few seconds.
All this happens on a fully separate machine, without affecting production nodes in any way.
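For reference, the "sync" server mentioned above is just plain Postgres continuous archive recovery; a minimal sketch of the relevant setting (the archive path is hypothetical, and on older versions this goes into recovery.conf and may require a restart to take effect):

    -- continuously replay WAL segments from the archive as they arrive
    ALTER SYSTEM SET restore_command = 'cp /mnt/wal_archive/%f %p';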
In general, you are right. If, in your case, using the "raw" databases is unacceptable, it is better to anonymize the data first (of course, the physical layout will change in this case).
However, Joe doesn't reveal the data. Engineers see only EXPLAIN (ANALYZE, BUFFERS) plans.
It may be considered acceptable. Of course, there are ways to retrieve particular values with tricks like "explain analyze ... limit (select value from ...)", but this will be noticeable. Massive data leaks are not possible: Joe filters the output, presenting only metadata -- the whole idea was to provide the tool to backend engineers, many of whom don't have access to production data.
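For the curious, a sketch of that trick (table and column names are hypothetical): the subquery's result becomes the LIMIT value, so the plan's actual row count reveals the secret integer, one value per query.

    EXPLAIN ANALYZE
    SELECT * FROM generate_series(1, 1000000)
    LIMIT (SELECT salary FROM employees WHERE id = 1);
    -- the "rows=N" in the reported plan equals the hidden salary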
The question you're raising is broader: think about how often we send personal data to our colleagues in Slack. How is that controlled now? Are we okay with that? Emails, tokens, and so on.
At the same time, Joe does boost development speed, because it becomes easier to troubleshoot SQL performance and discuss it with colleagues while collecting reliable facts.
What would you say if it were:
1. A chatbot for on-prem Mattermost?
2. A specialized GUI, SaaS or standalone?
> The question you're raising is broader: think about how often we send personal data to our colleagues in Slack. How is that controlled now? Are we okay with that? Emails, tokens, and so on.
People's existing tendency to expose critical data via Slack shouldn't be reinforced, and it's disheartening to see this line of thinking coming from a product that handles sensitive data by design. I would hope your product makes it harder for me to compromise data, not easier.
> At the same time, Joe does boost development speed, because it becomes easier to troubleshoot SQL performance and discuss it with colleagues while collecting reliable facts.
Great! Now make that troubleshooting accessible via an authenticated URL and I'm in. Especially because I may want to revisit a particular explain sometime down the line, and a simple anchor-tag-based list of bookmarkable past queries is easier to navigate than Slack searches.
If I were to use this tool, I would envision the ideal version of it being self-hosted and able to function independently of any other service. If there were then an option to add a Slack hook that notified people, that would be good. But it shouldn't be central to the product, and the information should strictly travel from the self-hosted service to Slack via hooks, exposing only the URLs/titles of result pages. No need to expose the queries or DB structure.
I think the current Slack implementation is a cool proof of concept, but this isn't a product I would seriously consider until it is able to perform its tasks in isolation.
As for "improv[ing] the level of collaboration of developers and DBAs", great that you're thinking about that, but that's my problem to solve, not Joe's. And depending on my team's existing solutions, this may even rule out Joe simply due to our current tool set.
> People's existing tendency to expose critical data via Slack shouldn't be reinforced
I agree with you. However, depending on the company's policy, raw SQL (even with some concrete values) and EXPLAIN ANALYZE plans may be considered acceptable to use in Slack.
If it is considered harmful, we have a way to inject an anonymization step into the snapshot preparation tool -- see https://gitlab.com/postgres-ai/database-lab/-/blob/master/sc... (we're thinking about how to improve usability here without losing flexibility). For example, you can run PostgreSQL Anonymizer (https://gitlab.com/dalibo/postgresql_anonymizer). Of course, the physical layout will change, so troubleshooting of, say, VACUUM behavior would be affected. But row cardinality normally won't change, so for large enough tables it totally makes sense to go this route for SQL troubleshooting and optimization tasks.
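A minimal sketch of what such a step could look like with PostgreSQL Anonymizer (the table and column are hypothetical; see the project's docs for the exact setup):

    -- enable the extension and load its fake-data sets
    CREATE EXTENSION IF NOT EXISTS anon CASCADE;
    SELECT anon.init();
    -- declare a masking rule for a hypothetical column
    SECURITY LABEL FOR anon ON COLUMN users.email
      IS 'MASKED WITH FUNCTION anon.fake_email()';
    -- statically rewrite the data during snapshot preparation
    SELECT anon.anonymize_table('users');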
> Great! Now make that troubleshooting accessible via an authenticated URL and I'm in.
We are thinking about the GUI version actively, so stay tuned.
And on a CLI version as well, but the CLI will work in psql (of which I'm personally a big fan; I've spent 15 years or so there). This means that all the data will be present.
In the case of the GUI, of course, we are able to work at the meta level, similarly to the Slack version.
> If I were to use this tool, I would envision the ideal version of it being self-hosted and able to function independently of any other service.
I hear you, and it generally aligns with our team's vision. Thank you again for the awesome feedback!
> As for "improv[ing] the level of collaboration of developers and DBAs", great that you're thinking about that, but that's my problem to solve, not Joe's. And depending on my team's existing solutions, this may even rule out Joe simply due to our current tool set.
Could you provide some examples of such solutions?
Joe running in Slack is indeed a way to simplify and speed up the SQL optimization workflow for developers, as it takes seconds to get initial query execution plans and optimization recommendations.
> ... All with a Slack interface, so you don't need to bother with connections, provisioning, credentials, or decommissioning.
It's worth noting that Joe is a use case of Database Lab (https://gitlab.com/postgres-ai/database-lab), and it's still possible to use Database Lab features, like thin-clone provisioning of production-sized databases and fast data-state resets, with the Database Lab client CLI (https://postgres.ai/docs/database-lab/6_cli_reference) for SQL optimization without the Slack integration. If a user has sufficient access to the data, they can provision a thin clone with the Database Lab client CLI and use psql to work with it.
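Once connected to such a clone via psql, a typical session might look like this (table, column, and index names are hypothetical; since the clone is disposable, experiments are cheap):

    -- try an index idea directly against production-scale data
    CREATE INDEX idx_orders_status ON orders (status);
    -- check whether the planner actually uses it, with real timings
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT * FROM orders WHERE status = 'pending';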
Also, we have plans to add support for recommendations and statistics in the CLI and a REST API, in addition to support for various messaging platforms, in the future.
If you are using a PG client already, you are probably not the target audience.
It's easier to tell someone to copy-paste a query into Slack to get their weekly insights than to teach them how to set up pgcli or some other GUI app. You'd also have to create tightly scoped permissions for everyone who only needs it from time to time.
I do use command-line clients because I feel they are more available and convenient than GUI tools, but I'm not a PostgreSQL optimization wizard, so I could be in the target audience. My customers probably are.
A Slack interface is probably a more auditable environment out of the box. Strong auditability is arguably more important than hard security guarantees.
Well, you sure need a representative dataset to optimize your queries. Often, that means a replica of the production server, with real PII data in there.
I usually try to use mocked data as far as possible, but I guess the last validation step usually involves real data. Perhaps this can be used to optimize up to a certain point, and then you can validate against production data as a last step.
Semi-related: SQL Server has a Query Store feature which records query insights on the production DB, and in SSMS you can see query statistics across various metrics, like resource use and execution counts, along with the query plan that was applied.
I haven't dug deep into SQL Server yet, but Query Store suggestions are good enough for now.
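For reference, enabling Query Store is a one-liner in T-SQL (the database name is hypothetical), and the collected data is queryable directly:

    ALTER DATABASE MyDb SET QUERY_STORE = ON;
    -- e.g., the slowest plans on average
    SELECT TOP 10 qt.query_sql_text, rs.avg_duration
    FROM sys.query_store_query_text AS qt
    JOIN sys.query_store_query AS q ON qt.query_text_id = q.query_text_id
    JOIN sys.query_store_plan AS p ON q.query_id = p.query_id
    JOIN sys.query_store_runtime_stats AS rs ON p.plan_id = rs.plan_id
    ORDER BY rs.avg_duration DESC;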
We call it macro-analysis -- the analysis of the SQL workload as a whole, with call counts, timing, memory-related metrics, etc.
Currently, Joe helps with micro-analysis: you take just one SQL query, don't think about the others, and solve the task of optimizing that particular query.
For "macro", Postgres ecosystem has pg_stat_statements (and our tool postgres-checkup https://gitlab.com/postgres-ai/postgres-checkup for automated health checks builds reports on top of it), pgBadger to analyze logs, auto_explain to capture execution plans.
Building the "bridge" between macro- and macro-analyses is a very interesting topic. I'm going to give a talk about this particular topic at the next PGCon.
How hard would it be to implement the Discord Bot protocol? I'd be willing to help on this front, and it'd be a nice way for me to get my feet wet with Go too.