
Joe, a Postgres query optimization bot - samokhvalov
https://postgres.ai/blog/joe-0.5/
======
ahachete
I've used Joe quite a bit, and all I can say is that it's a great idea that
solves a noticeable problem: troubleshooting your queries.

It may not look like a hard challenge, but the reality is that most people
troubleshoot them in production, which means that:

a) you need access to production, when you shouldn't; b) you run queries in
production, potentially affecting production workloads, sometimes noticeably.

Joe provisions a database clone using ZFS snapshots (I believe LVM support was
added recently) and gives you a full copy of the database, with real data. You
can test queries, check explain plans, and create indexes if you need to. And
when you are done, it goes away. All through a Slack interface, so you don't
need to bother with connections, provisioning, credentials, or
decommissioning.

Give it a try, it is a very good tool!

~~~
dntbnmpls
> but the reality is that most troubleshoot them in production

I find this hard to believe. Maybe on the very rare occasion where the problem
is tied to production and irreproducible elsewhere, but my experience has been
that most troubleshooting happens on development, testing, staging, etc
servers.

> and gives you a full copy of the database for you, with real data

Other than for small toy databases, I don't see how this would work for large
databases. Not to mention the huge security issue.

You can test, optimize, etc queries on non-production servers. If the problem
is production specific, it most likely is more server specific and not query
optimization related.

~~~
ahachete
> I find this hard to believe. Maybe on the very rare occasion where the
> problem is tied to production and irreproducible elsewhere, but my
> experience has been that most troubleshooting happens on development,
> testing, staging, etc servers.

Let me rephrase: many may troubleshoot in dev/staging, but that's not the
right thing to do. Query execution plans can change completely based on data
volume and even with the data itself. Databases use detailed statistics about
the data to influence the planner's decisions.

Hence, tuning a query on an environment without production data can lead to
query optimizations that are either irrelevant or totally wrong in production.
I have seen this countless times in my professional experience.

> Other than for small toy databases, I don't see how this would work for
> large databases.

Since it uses cloning at the volume/fs layer, the size of the database is
irrelevant -- only the rate of changes matters.

> You can test, optimize, etc queries on non-production servers.

No, you cannot, as I mentioned above. It's a waste of effort and can only lead
to the false illusion that your query is good -- which then bites you in
production.

~~~
dntbnmpls
> Let me rephrase: many may troubleshoot in dev/staging, but that's not the
> right thing to do.

Maybe we have a different definition of "troubleshoot". Also, dev and/or
staging should have similar data/specs/etc to production.

> Query execution plans can change completely based on data volume and even
> with the data itself.

Yes. That's why you could have your production server generate/log actual
query plans for queries that are causing you problems. You could even set
performance (cpu,io,bandwidth,etc) thresholds for your server to log. Of
course that's in addition to the profile/monitoring data you have on the
server.
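For Postgres specifically, this kind of production plan logging can be
sketched with the auto_explain module (the threshold below is illustrative,
not a recommendation):

```sql
-- Load auto_explain and log plans for queries slower than 500 ms.
-- Illustrative session-level settings; in practice they would go in
-- postgresql.conf (or be set by a superuser).
LOAD 'auto_explain';
SET auto_explain.log_min_duration = '500ms'; -- log queries above this duration
SET auto_explain.log_analyze = on;           -- include actual rows/timings
SET auto_explain.log_buffers = on;           -- include buffer usage
```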

> Hence, tuning a query on an environment without production data can lead to
> query optimizations that are either irrelevant or totally wrong in
> production. I have seen this countless times in my professional experience.

Well then your issue is that your dev/staging setup is poor. How or what do
you even develop, test and troubleshoot? Might as well develop, test and
troubleshoot straight on production.

> Since it uses cloning at the volume/fs layer, the size of the database is
> irrelevant --only the rate of changes matter.

Size still matters. And as I mentioned, it still has a problem with
security.

> No, you cannot, as I mentioned above. It's a waste of effort and can only
> lead to the false illusion that your query is good --and then beat you in
> production.

"waste of effort"? Developing, testing, troubleshooting, etc should be done
off production. It's only on the rare cases that you should troubleshoot on
production. This is best practices and basic security.

Unless you are working on an internal office setup which isn't facing the
outside world and where security and uptime don't matter. Your comment reminds
me of people saying using the "admin/root" account for everything is fine
because everything else is a "waste of effort". Do you even have a dev/staging
setup? What do you use it for?

Everyone's professional experience is different, but I've never heard anyone
claim "many may troubleshoot in dev/staging, but that's not the right thing to
do". Not the right thing to do?

In my professional experience, the testing/troubleshooting is done on
dev/staging. Once we feel everything is up to snuff, code gets pushed to
production. If there are issues on production, we try to replicate it on
dev/staging/etc and troubleshoot it. Most of the time, we find the bug/issue.
Only on the rarest of occasions do we have to troubleshoot directly on
production, which serves our company and especially our clients who have SLAs
with us. Your
cavalier attitude about production is something I've yet to come across. But
if it works for you, then I guess that's all that matters.

~~~
ahachete
Ideally, you would be absolutely right and I would not disagree with you. But
I strongly do, based on many years of experience in the field.

Out of all the customers we help with their Postgres infrastructure, and that
includes several $B companies, I have found none to have:

* dev environments dimensioned similarly to production (typically this doesn't
happen on staging either, and sometimes not even in QA);

* the same data as production in any environment except, at most, QA.

So in practical terms, I haven't seen it be possible to troubleshoot on
dev/test/staging/QA environments, let alone to have a copy of the full
database there. This is reasonable, too: how could all developers have 5TB of
data on their laptops? (Leaving aside all privacy and data protection
considerations.)

That's why having a tool, Joe, to quickly provision lightweight, disposable,
production-like environments where you can troubleshoot your queries is, for
me, a very nice approach to this problem, and the one I recommend.

------
cwp
So... this thing gives full read/write/ddl access to production data ...via
Slack?

That is a security nightmare. Hard pass.

~~~
samokhvalov
Hi, Postgres.ai founder here.

Production is not affected; the chatbot works with thin clones provided by
Database Lab
([https://gitlab.com/postgres-ai/database-lab/](https://gitlab.com/postgres-ai/database-lab/)).

In general, you are right. If, in your case, using the "raw" databases is
unacceptable, it is better to anonymize data first (of course, the physical
layout will change in this case).

However, Joe doesn't reveal the data. Engineers see only EXPLAIN (ANALYZE,
BUFFERS) plans.

It may be considered acceptable. Of course, there are ways to retrieve
particular values with tricks like "explain analyze ... limit (select value
from ...)", but this will be noticeable. Massive data leaks are not possible:
Joe filters the output, presenting only metadata. The whole idea was to
provide this data to backend engineers, many of whom don't have access to
production data.
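For illustration, a plan request like the following (the table and column
names are hypothetical) returns only plan metadata, not row contents:

```sql
-- Hypothetical table/columns, for illustration only
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE customer_id = 42;
-- The output consists of plan nodes, estimated vs. actual row counts,
-- timings, and shared buffer hits/reads -- no actual row values.
```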

The question you're raising is broader: let's think about how many times we
send personal data to our colleagues in Slack. How is that controlled now? Are
we okay with it? Emails, tokens, and so on.

At the same time, Joe does boost the development speed because it becomes
easier to troubleshoot SQL performance and discuss it with colleagues,
collecting reliable facts.

What would you say if it were: 1. a chatbot for on-prem Mattermost? 2. a
specialized GUI, SaaS or standalone?

~~~
ysavir
> The question you're raising is broader: let's think how many times we send
> some personal data to our colleagues in Slack, how is it controlled now? are
> we okay with that? Emails, tokens, and so on.

People's existing tendency to expose critical data via Slack shouldn't be
reinforced, and coming from a product that handles sensitive data by design,
it's disheartening to see this line of thinking here. I would hope your
product makes it harder for me to compromise data, not easier.

> At the same time, Joe does boost the development speed because it becomes
> easier to troubleshoot SQL performance and discuss it with colleagues,
> collecting reliable facts.

Great! Now make that troubleshooting accessible via an authenticated URL and
I'm in. Especially because I may want to revisit a particular explain sometime
down the line, and a simple anchor-tag-based list of bookmarkable past queries
is easier to navigate than having to use Slack searches.

If I were to use this tool, I would envision the ideal version of it being
self-hosted and able to function independent from any other service. If there
was then an option to add a Slack hook that notified people, that would be
good. But it shouldn't be central to the product, and the information should
strictly travel from the self-hosted service to Slack via hooks, exposing only
the URLs/titles of result pages. No need to expose the queries or DB
structure.

I think the current Slack implementation is a cool proof of concept, but this
isn't a product I would seriously consider until it is able to perform its
tasks in isolation.

As for "improv[ing] the level of collaboration of developers and DBAs", great
that you're thinking about that, but that's my problem to solve, not Joe's.
And depending on my team's existing solutions, this may even rule out Joe
simply due to our current tool set.

~~~
samokhvalov
Thank you for the very detailed feedback!

> People's existing tendency to expose critical data via Slack shouldn't be
> reinforced

I agree with you. However, depending on the company's policy, raw SQL (even
with some concrete values) and EXPLAIN ANALYZE plans may be considered
acceptable to use in Slack.

If it is considered harmful, we have a way to inject an anonymization step
into the snapshot preparation tool -- see
[https://gitlab.com/postgres-ai/database-lab/-/blob/master/scripts/create_zfs_snapshot.sh#L173](https://gitlab.com/postgres-ai/database-lab/-/blob/master/scripts/create_zfs_snapshot.sh#L173)
(we're thinking about how to improve usability here without losing
flexibility). For example, you can run PostgreSQL Anonymizer
([https://gitlab.com/dalibo/postgresql_anonymizer](https://gitlab.com/dalibo/postgresql_anonymizer)).
Of course, the physical layout will change, so troubleshooting of, say, VACUUM
commands would be affected. But row cardinality normally won't change, so for
large enough tables it totally makes sense to go this route for SQL
troubleshooting and optimization tasks.
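As I recall from the PostgreSQL Anonymizer docs, masking rules are declared
via security labels; a rough sketch (the table/column names are hypothetical,
and the exact API should be checked against the extension's documentation):

```sql
-- Sketch of a PostgreSQL Anonymizer masking setup; names are hypothetical
CREATE EXTENSION IF NOT EXISTS anon CASCADE;
SELECT anon.init();  -- load the fake-data dictionaries

-- Declare a masking rule for a sensitive column
SECURITY LABEL FOR anon ON COLUMN users.email
  IS 'MASKED WITH FUNCTION anon.fake_email()';

-- Rewrite masked columns in place (static masking)
SELECT anon.anonymize_database();
```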

> Great! Now make that troubleshooting accessible via an authenticated URL and
> I'm in.

We are thinking about the GUI version actively, so stay tuned.

And about a CLI version as well, but the CLI will work in psql (which I'm
personally a big fan of; I've spent 15 years there or so). This means that all
the data will be visible.

In the case of the GUI, of course, we are able to work at the meta level,
similarly to the Slack version.

> If I were to use this tool, I would envision the ideal version of it being
> self-hosted and able to function independent from any other service.

I hear you, and it generally aligns with our team's vision. Thank you
again for the awesome feedback!

> As for "improv[ing] the level of collaboration of developers and DBAs",
> great that you're thinking about that, but that's my problem to solve, not
> Joe's. And depending on my team's existing solutions, this may even rule out
> Joe simply due to our current tool set.

Could you provide some examples of such solutions?

~~~
lmm
> Could you provide some examples of such solutions?

Well, if we're already collaborating via Discord or Microsoft Teams or
whatever, then a Slack-based tool is at a significant disadvantage.

------
mpol
Joe the editor?

Can we just pick sensible and unique names? A simple search for 'Joe Software'
would have shown that the name is taken.

~~~
IAmEveryone
...and lo, thereby the world ended.

------
kalimatas
Just learned about Database Lab. Amazing! Thanks for sharing!

------
alexandernst
But why a Slack app instead of a PG module of some sort...?

------
akartasov
I've just tried the live demo in the Database Lab Community Slack. It looks
amazing! Keep it up!

------
m2mdas2
Semi-related: SQL Server has a Query Store feature which records query
insights on the production db, and in SSMS you can see query statistics on
various metrics like resource usage, execution counts, etc., along with the
query plan applied.

I haven't dug deep into SQL Server yet, but Query Store suggestions are good
enough for now.

~~~
samokhvalov
We call it macro-analysis -- the analysis of SQL workload as a whole, with
call counts, timing, memory-related metrics, etc.

Currently, Joe helps with _micro_ analysis: you take just one SQL query,
don't think about the others, and solve the task of optimizing this particular
query.

For "macro", Postgres ecosystem has pg_stat_statements (and our tool postgres-
checkup [https://gitlab.com/postgres-ai/postgres-
checkup](https://gitlab.com/postgres-ai/postgres-checkup) for automated health
checks builds reports on top of it), pgBadger to analyze logs, auto_explain to
capture execution plans.

Building the "bridge" between macro- and macro-analyses is a very interesting
topic. I'm going to give a talk about this particular topic at the next PGCon.

------
fb03
How hard would it be to implement the Discord bot protocol? I'd be willing to
help on this front, and it'd be a nice way for me to get my feet wet with Go
too.

Thanks in Advance.

~~~
samokhvalov
Cool!

Please join our Community Slack
([https://database-lab-team-slack-invite.herokuapp.com/](https://database-lab-team-slack-invite.herokuapp.com/))
and ping me there -- let's discuss it.

We're refactoring the main code of Joe right now, to improve its
extensibility. The main purpose is to simplify the process of adding new
protocols.

------
tarun_anand
Any plans to support Citus data?

~~~
samokhvalov
Not yet.

Is there any particular interest? If so, I'd love to explore how it could
work -- let's connect.

