It may not look like a hard challenge, but the reality is that most people troubleshoot queries in production, which means that:
a) you need access to production, when you shouldn't;
b) you run queries in production, potentially affecting production workloads, sometimes noticeably.
Joe provisions a database clone using a ZFS snapshot (I believe LVM support was added recently) and gives you a full copy of the database, with real data. You can test queries, examine explain plans, and create indexes if you need to. And when you are done, it goes away. All through a Slack interface, so you don't need to bother with connections, provisioning, credentials, or decommissioning.
Give it a try, it is a very good tool!
I find this hard to believe. Maybe on the very rare occasion where the problem is tied to production and irreproducible elsewhere, but my experience has been that most troubleshooting happens on development, testing, staging, etc. servers.
> and gives you a full copy of the database for you, with real data
Other than for small toy databases, I don't see how this would work for large databases. Not to mention the huge security issue.
You can test, optimize, etc., queries on non-production servers. If the problem is production-specific, it is most likely server-specific and not related to query optimization.
Let me rephrase: many may troubleshoot in dev/staging, but that's not the right thing to do. Query execution plans can change completely based on data volume and even on the data itself. Databases use detailed statistics about the data to influence the planner's decisions.
Hence, tuning a query on an environment without production data can lead to query optimizations that are either irrelevant or totally wrong in production. I have seen this countless times in my professional experience.
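To make this concrete, here is a minimal sketch of how the planner's choice depends on statistics gathered from the actual data (the `orders` table and its columns are hypothetical):

```sql
-- Hypothetical example: the same query, planned against different data volumes.
-- On a dev database with a few hundred rows, the planner will likely pick
-- a sequential scan, because reading the whole tiny table is cheapest:
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;

-- On production, with millions of rows and the real value distribution,
-- the same query may use an index scan -- or a very different join order.
-- The planner's choice is driven by per-column statistics such as:
SELECT attname, n_distinct, most_common_vals
FROM pg_stats
WHERE tablename = 'orders';
```

Tuning against dev-sized data therefore optimizes for a plan production will never use.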
> Other than for small toy databases, I don't see how this would work for large databases.
Since it uses cloning at the volume/filesystem layer, the size of the database is irrelevant; only the rate of change matters.
> You can test, optimize, etc queries on non-production servers.
No, you cannot, as I mentioned above. It's a waste of effort and can only lead to the false illusion that your query is good, which then comes back to bite you in production.
Maybe we have a different definition of "troubleshoot". Also, dev and/or staging should have similar data/specs/etc to production.
> Query execution plans can change completely based on data volume and even with the data itself.
Yes. That's why you could have your production server generate/log actual query plans for queries that are causing you problems. You could even set performance (CPU, I/O, bandwidth, etc.) thresholds for your server to log against. Of course, that's in addition to the profiling/monitoring data you have on the server.
> Hence, tuning a query on an environment without production data can lead to query optimizations that are either irrelevant or totally wrong in production. I have seen this countless times in my professional experience.
Well then your issue is that your dev/staging setup is poor. How or what do you even develop, test and troubleshoot? Might as well develop, test and troubleshoot straight on production.
> Since it uses cloning at the volume/fs layer, the size of the database is irrelevant --only the rate of changes matter.
Size still matters. And as I mentioned, it still has a problem with security.
> No, you cannot, as I mentioned above. It's a waste of effort and can only lead to the false illusion that your query is good --and then beat you in production.
"waste of effort"? Developing, testing, troubleshooting, etc should be done off production. It's only on the rare cases that you should troubleshoot on production. This is best practices and basic security.
Unless you are working on internal office setup which isn't facing the outside world and where security and uptime doesn't matter. Your comment reminds me of people saying using "admin/root" account for everything is fine because everything else is "waste of effort". Do you even have a dev/staging setup? What do you use that for?
Everyone's professional experience is different, but I've never heard anyone claim "many may troubleshoot in dev/staging, but that's not the right thing to do". Not the right thing to do?
In my professional experience, testing/troubleshooting is done on dev/staging. Once we feel everything is up to snuff, code gets pushed to production. If there are issues on production, we try to replicate them on dev/staging/etc. and troubleshoot there. Most of the time, we find the bug/issue. Only on the rarest of occasions do we have to troubleshoot directly on production, which serves our company and especially our clients, who have SLAs with us. Your cavalier attitude about production is something I've yet to come across. But if it works for you, then I guess that's all that matters.
Out of all the customers we help with their Postgres infrastructure, and that includes several $B companies, I have found none to have either:
* Dev environments dimensioned similarly to production. Actually, this typically doesn't happen on staging either, and sometimes not even in QA.
* Environments, except possibly QA, with the same data as production.
So in practical terms, I haven't seen that it is possible to troubleshoot on dev/test/staging/QA environments, much less to have a copy of the full database there. This is reasonable, too: how would all developers fit 5TB of data on their laptops? (Leaving aside all privacy and data protection considerations.)
That's why having a tool, Joe, to quickly provision disposable, production-like environments where you can troubleshoot your queries is, for me, a very nice approach to this problem, and the one I recommend.
AFAICT, this product relies on LVM snapshots (copy-on-write), so it is ideal for diagnosing and trying out ideas on large databases.
All this happens on a fully separated machine, not affecting production nodes by any means.
That is a security nightmare. Hard pass.
Production is not affected, the chatbot works with thin clones provided by Database Lab https://gitlab.com/postgres-ai/database-lab/.
In general, you are right. If, in your case, using the "raw" databases is unacceptable, it is better to anonymize data first (of course, the physical layout will change in this case).
However, Joe doesn't reveal the data. Engineers see only EXPLAIN (ANALYZE, BUFFERS) plans.
It may be considered acceptable. Of course, there are ways to retrieve particular values with tricks like "explain analyze ... limit (select value from ...)", but this will be noticeable. Massive data leaks are not possible; Joe filters the output, presenting only metadata. The whole idea was to provide this capability to backend engineers, many of whom don't have access to production data.
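The trick mentioned above can be sketched as follows (table and column names are hypothetical):

```sql
-- The plan's actual row counts can leak a single scalar value:
EXPLAIN ANALYZE
SELECT *
FROM orders
LIMIT (SELECT age FROM users WHERE name = 'alice');
-- The "Limit" node reports how many rows were actually returned,
-- which equals alice's age -- one value leaked per (noticeable) query.
```

Since each such query extracts only one value and leaves an obvious trace, bulk exfiltration through this channel is impractical, which is the point being made.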
The question you're raising is broader: let's think about how many times we send personal data to our colleagues in Slack. How is that controlled now? Are we okay with it? Emails, tokens, and so on.
At the same time, Joe does boost the development speed because it becomes easier to troubleshoot SQL performance and discuss it with colleagues, collecting reliable facts.
What would you say if it were:
1. A chatbot for on-prem Mattermost?
2. A specialized GUI, SaaS or standalone?
People's existing tendency to expose critical data via Slack shouldn't be reinforced, and coming from a product that handles sensitive data by design, it's disheartening to see this line of thinking here. I would hope your product makes it harder for me to compromise data, not easier.
> At the same time, Joe does boost the development speed because it becomes easier to troubleshoot SQL performance and discuss it with colleagues, collecting reliable facts.
Great! Now make that troubleshooting accessible via an authenticated URL and I'm in. Especially because I may want to revisit a particular explain sometime down the line, and simple anchor-tag based list of bookmarkable past queries is easier to navigate than having to use Slack searches.
If I were to use this tool, I would envision the ideal version of it being self-hosted and able to function independent from any other service. If there was then an option to add a Slack hook that notified people, that would be good. But it shouldn't be central to the product, and the information should strictly travel from the self-hosted service to slack via hooks, exposing only the URLs/titles of result pages. No need to expose the queries or DB structure.
I think the current Slack implementation is a cool proof of concept, but this isn't a product I would seriously consider until it is able to perform its tasks in isolation.
As for "improv[ing] the level of collaboration of developers and DBAs", great that you're thinking about that, but that's my problem to solve, not Joe's. And depending on my team's existing solutions, this may even rule out Joe simply due to our current tool set.
> People's existing tendency to expose critical data via Slack shouldn't be reinforced
I agree with you. However, depending on the company's policy, raw SQL, even with some concrete values, and EXPLAIN ANALYZE plans may be considered acceptable to use in Slack.
If it is considered harmful, we have a way to inject an anonymization step into the snapshot preparation tool -- see https://gitlab.com/postgres-ai/database-lab/-/blob/master/sc... (we're thinking about how to improve usability here without losing flexibility). For example, you can run PostgreSQL Anonymizer (https://gitlab.com/dalibo/postgresql_anonymizer). Of course, the physical layout will change, so troubleshooting of, say, VACUUM commands would be affected. But row cardinality normally won't change, so for large enough tables it totally makes sense to go this route for SQL troubleshooting & optimization tasks.
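A rough sketch of what such an anonymization step could look like with PostgreSQL Anonymizer (the `customer` table and its columns are hypothetical; masking rules would be chosen per column):

```sql
-- Install the extension and load its built-in fake-data sets.
CREATE EXTENSION IF NOT EXISTS anon CASCADE;
SELECT anon.init();

-- Declare masking rules for sensitive columns.
SECURITY LABEL FOR anon ON COLUMN customer.email
  IS 'MASKED WITH FUNCTION anon.fake_email()';
SECURITY LABEL FOR anon ON COLUMN customer.name
  IS 'MASKED WITH FUNCTION anon.fake_last_name()';

-- Rewrite the data in place, before the snapshot is taken.
SELECT anon.anonymize_database();
```

Row counts stay the same, so cardinality-driven plan behavior is preserved even though the physical layout changes.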
> Great! Now make that troubleshooting accessible via an authenticated URL and I'm in.
We are thinking about the GUI version actively, so stay tuned.
And on a CLI version as well, but the CLI will work in psql (of which I'm personally a big fan; I've spent 15 years or so with it). This means that all the data will be present.
In the case of the GUI, of course, we are able to work at the meta level, similarly to the Slack version.
> If I were to use this tool, I would envision the ideal version of it being self-hosted and able to function independent from any other service.
I hear you well, and it generally aligns with our team's vision. Thank you again for the awesome feedback!
> As for "improv[ing] the level of collaboration of developers and DBAs", great that you're thinking about that, but that's my problem to solve, not Joe's. And depending on my team's existing solutions, this may even rule out Joe simply due to our current tool set.
Could you provide some examples of such solutions?
Well, if we're already doing our collaborating via Discord or Microsoft Teams or whatever then a Slack-based tool is at a significant disadvantage.
Theoretically, you could reveal certain information by writing specialized attack queries that check for specific conditions, for example:
WHERE username='joe' AND is_admin IS true
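Even with only plan metadata visible, the actual row counts answer such a question; a sketch building on the hypothetical predicate above:

```sql
-- The plan alone leaks the boolean answer:
EXPLAIN (ANALYZE, BUFFERS)
SELECT 1 FROM users WHERE username = 'joe' AND is_admin IS TRUE;
-- "rows=1" in the actual counts means joe is an admin; "rows=0" means he is not.
```

Each query leaks one bit, so this is slow and noisy, but it shows why plan output is not entirely data-free.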
>... With a Slack interface, so you don't need to bother with connection, provisioning, credentials and decommissioning.
It's worth noting that Joe is one use case of Database Lab (https://gitlab.com/postgres-ai/database-lab). It's still possible to use Database Lab features, such as thin-clone provisioning of production-sized databases and fast data-state reset, via the Database Lab client CLI (https://postgres.ai/docs/database-lab/6_cli_reference) for SQL optimization purposes, without the Slack integration. If a user has sufficient access to the data, they can provision a thin clone with the Database Lab client CLI and use psql to work with the clone.
Also, we have plans to add support for recommendations and statistics in the CLI and REST API, in addition to supporting various messaging platforms in the future.
It's easier to tell someone to copy-paste a query into Slack to get their weekly insights than to teach them how to set up pgcli or some other GUI app. You'd also have to create tightly scoped permissions for everyone who only needs it from time to time.
I'd probably set up something like https://github.com/getredash/redash for these cases though, it's also very user friendly.
It looks like users get their own disposable copy. Nothing's run on the prod copy.
Can we just pick sensible and unique names? A simple search for 'Joe Software' would have shown that the name is taken.
I haven't dug deep into SQL Server yet, but Query Store suggestions are good enough for now.
Currently, Joe helps with micro-analysis: you take just one SQL query, don't think about the others, and solve the task of optimizing that particular query.
For "macro", Postgres ecosystem has pg_stat_statements (and our tool postgres-checkup https://gitlab.com/postgres-ai/postgres-checkup for automated health checks builds reports on top of it), pgBadger to analyze logs, auto_explain to capture execution plans.
Building the "bridge" between macro- and macro-analyses is a very interesting topic. I'm going to give a talk about this particular topic at the next PGCon.
Thanks in advance.
Please join our Community Slack https://database-lab-team-slack-invite.herokuapp.com/ and ping me there -- let's discuss it.
We're refactoring the main code of Joe right now, to improve its extensibility. The main purpose is to simplify the process of adding new protocols.
Is there any particular interest? If so, I'd love to explore how it could work, let's connect.