Tracing Uncovers Half-Truths in Slack’s CI Infrastructure

frankchen · on July 27, 2021

Abstract:

Traditional monitoring tools like logs and metrics were necessary but not sufficient to debug how and where systems failed in CI, which relies on multiple, interconnected critical systems (e.g. GHE, Checkpoint, Cypress).

In this talk, Frank Chen shares how traces gave us a critical and compounding capability to better understand where, when, how, and why faults occur for our customers in CI. We share how shared tooling for high-dimensionality event traces (using SlackTrace and SpanEvents) could significantly increase our velocity to diagnose code in flight and to debug complex system interactions. We go from stories with early incidents that motivated further investment throughout Slack’s internal tooling teams to stories about gains in performance and resiliency throughout our infrastructure.
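To make "event traces" concrete, here is a minimal sketch of how a single CI run could be modeled as a tree of timed spans. The `SpanEvent` class, its field names, and the sample values are all hypothetical and chosen only for illustration; this is not SlackTrace's actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SpanEvent:
    """One timed unit of work in a CI run (hypothetical schema)."""
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None marks the root span of the trace
    name: str
    start_ms: int
    duration_ms: int
    tags: Dict[str, str] = field(default_factory=dict)


def children_of(spans: List[SpanEvent], parent_id: Optional[str]) -> List[SpanEvent]:
    """Return the direct children of a span, ordered by start time."""
    return sorted(
        (s for s in spans if s.parent_id == parent_id),
        key=lambda s: s.start_ms,
    )


def slowest_leaf(spans: List[SpanEvent]) -> SpanEvent:
    """Return the longest-running span that has no children."""
    parents = {s.parent_id for s in spans}
    leaves = [s for s in spans if s.span_id not in parents]
    return max(leaves, key=lambda s: s.duration_ms)


# Example trace for a single CI run; every value here is made up.
ci_run = [
    SpanEvent("t1", "a", None, "ci_run", 0, 900, {"branch": "feature-x"}),
    SpanEvent("t1", "b", "a", "git_checkout", 0, 100, {"service": "GHE"}),
    SpanEvent("t1", "c", "a", "e2e_tests", 100, 800, {"service": "Cypress"}),
    SpanEvent("t1", "d", "c", "login_spec", 100, 650, {"result": "flaky"}),
    SpanEvent("t1", "e", "c", "upload_spec", 750, 150, {"result": "pass"}),
]

print(slowest_leaf(ci_run).name)
```

Because each span carries arbitrary tags, the same data can be sliced by service, branch, or test result, which is roughly what "high-dimensionality" refers to.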

emptysea · on July 28, 2021

I'm curious how Slack does their e2e tests. Are they spinning up every dependent service for each PR, or are there some long-running services shared among the PRs?

Related: are the databases and storage (e.g., S3) populated with fake data for the test runs beforehand?

At my current place we have a couple separate services, but haven't built out e2e tests.

PaywallBuster · on July 28, 2021

At my company we're doing this for a Drupal application:

  - new server

  - new database copy (anonymized) (same db server)

  - shared s3 (dev)

  - shared microservices (dev)

Cypress runs against this feature branch environment

For other services we mostly only have dev/stage/prod so it's easier.

sambe · on July 28, 2021

A bit light on the technical details that matter (i.e. the actual metrics added, how they went from data to conclusions, exactly why Honeycomb helped). This reads a bit like an advert targeting managers with purchasing power. Is it a report from attending a talk? That might make sense at least.

seattleeng · on July 28, 2021

I worked on a similar system at Plaid, but we used Lightstep. The biggest challenge was onboarding engineers to the tool. It was also a bit of a workaround with respect to test flakiness, which we tackled separately. Still, I and many engineers (particularly new ones who weren't already familiar with the system architecture) found it valuable!

wdb · on July 28, 2021

I liked this article: https://logz.io/learn/opentelemetry-guide/

You also have the Observability podcasts

nefitty · on July 28, 2021

I found this good guide on observability for devs by lightstep: https://lightstep.com/observability

inshadows · on July 28, 2021

Even better: https://en.wikipedia.org/wiki/Observability

nefitty · on July 28, 2021

I don’t understand any of that. I understand programming concepts though.

paws · on July 28, 2021

Edit: If there's a better venue to ask please send me elsewhere.

Hi Omar, It seems like you're not at Plaid anymore, but if you don't mind there's something I've been trying to figure out for a while and would appreciate your opinion/thoughts.

It has to do with the way Plaid works. Seems to me when financial institutions have account security liability, such 'security guarantees' tend to be contingent on the account holder _never disclosing credentials_. Given the nature of how Plaid works [1], it would seem Plaid users could be taking considerable risk.

I'm wondering if my understanding is right -- do account holders forfeit protection 'guarantees' their financial institution might offer when they use Plaid? Do financial institutions consider Plaid an 'authorized user' in some special way? Quite possibly I'm missing something and was hoping to understand. Thanks!

[1] https://security.stackexchange.com/questions/198005/is-plaid...

phoenixy1 · on Aug 9, 2021

Hi, I'm a little late to the party, but I saw this and I do work at Plaid, so I thought I might be able to help answer. First disclaimer: I am definitely not a lawyer. However, the basic premise that banks' liability guarantees are contingent on account holders never disclosing credentials is, as far as I understand it, not correct.

Basically, your rights to be protected against unauthorized transfers are provided under Regulation E, which gives customers the right to address unauthorized transactions from their accounts. Under Reg E, once a consumer properly notifies their financial institution of an unauthorized transaction within a specific amount of time, the financial institution is obligated to limit the consumer’s liability for the unauthorized transaction.

If you provide proper notice to your bank under Reg E, your bank cannot waive their liability, even if you shared account information with a third party. The Consumer Financial Protection Bureau (CFPB) made this explicit in a recently published Compliance Aid. Quoting the relevant section from their FAQ below:

"Q: If a financial institution’s agreement with a consumer includes a provision that modifies or waives certain protections granted by Regulation E, such as waiving Regulation E liability protections if a consumer has shared account information with a third party, can the institution rely on its agreement when determining whether the electronic fund transfer was unauthorized and whether related liability protections apply?

A: No. The Electronic Fund Transfer Act (EFTA) includes an anti-waiver provision stating that “[n]o writing or other agreement between a consumer and any other person may contain any provision which constitutes a waiver of any right conferred or cause of action created by [EFTA].” 15 U.S.C. § 1693l. Although there may be circumstances where a consumer has provided actual authority to a third party under Regulation E according to 12 CFR § 1005.2(m), an agreement cannot restrict a consumer’s rights beyond what is provided in the law, and any contract or agreement attempting to do so is a violation of EFTA."

Source: https://www.consumerfinance.gov/compliance/compliance-resour...

paws · on Aug 16, 2021

Thank you, helpful!

user5994461 · on Aug 16, 2021

The loophole lies in the wording: the regulation only protects you against unauthorized transactions.

Plaid makes transactions on your behalf, so it's arguable whether they are unauthorized transactions.

To cite a similar case from real life: when your wife empties your bank account, you may not be able to get your money back because that doesn't constitute an unauthorized transaction. Also, nobody will flag it or alert you, because they don't think of it as abnormal.

wdb · on July 28, 2021

If I understand it correctly this does something similar to: https://github.com/honeycombio/buildevents ?

londons_explore · on July 28, 2021

Adding tracing to your own infrastructure is all well and good...

But so much infrastructure is a mishmash of in-house, open source and closed source. Even for the open source stuff, you probably don't want to maintain patches to integrate your tracing tool. In that case, you'll end up with big gaps in traces, making it less useful.
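A minimal sketch of what such a gap looks like in the data, assuming each span is a plain dict with hypothetical `span_id` and `parent_id` keys (not any particular tracing tool's format): a span whose parent was never reported is orphaned, and the trace tree breaks at that point.

```python
from typing import Dict, List, Optional


def orphaned_spans(spans: List[Dict[str, Optional[str]]]) -> List[Optional[str]]:
    """Return ids of spans whose parent span was never reported."""
    known = {s["span_id"] for s in spans}
    return [
        s["span_id"]
        for s in spans
        if s["parent_id"] is not None and s["parent_id"] not in known
    ]


# Made-up trace: "proxy" stands in for an uninstrumented third-party hop,
# so it propagates the request but never emits a span of its own.
trace = [
    {"span_id": "root", "parent_id": None},
    {"span_id": "api", "parent_id": "root"},
    {"span_id": "db", "parent_id": "proxy"},
]

print(orphaned_spans(trace))
```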

ilrwbwrkhv · on July 28, 2021

Slack still feels so sluggish, slow and clunky though. Opening the emoji drawer for example takes so long.

ithkuil · on July 28, 2021

Perhaps because first they had to fix their CI so they can fix their real problems.

It's often underrated how much an otherwise competent team can crumble under technical debt (of which CI and test flakiness is one specimen).

It sounds plausible to me that Slack's rapid growth curve put them in the perfect position to accrue massive amounts of debt that they are now struggling to wade through.

ilrwbwrkhv · on July 28, 2021

This is such a hard one. How do you do this as a startup? How do you keep delivering features fast without accruing crazy amounts of debt? Does it boil down to the quality and experience of the hackers on the team?

dwohnitmok · on July 28, 2021

I think one good threshold for most startups is product-market fit.

Before you have it, you should be willing to accumulate large amounts of tech debt both because your product might change enough that you'd be rapidly throwing out large amounts of code anyways and because having product-market fit is such an overwhelming need that almost everything should be traded against it because your startup is dead without it.

After you have it, you have to be much more diligent about paying down tech debt, because you switch from a more "exploratory" mode to a "refinement" mode where you now are unlikely to be throwing away large swaths of your code at the drop of a hat, and instead need to iteratively build upon what you already have and that becomes very difficult to do with large amounts of tech debt.

jrockway · on July 28, 2021

I think engineers are super bad at deciding where to take on technical debt. Often people will save themselves half a day at the cost of making some future feature nearly impossible. I don't really blame the engineers, but rather the system -- the customers demand new things, then there are biweekly "sprints" to set deadlines, and there are daily standups where nobody wants to say "I'm doing the same thing I did yesterday and it's not done yet." In that system, saving five minutes at any cost is the right thing to do, and then you get boiled alive by your technical debt.

I honestly think that the glacial pace of software engineering is mostly due to this phenomenon -- "where can we save a little bit of time today?" and then it snowballs into entire teams not getting anything done. The interest on the loan they took out is higher than their income. I guess that's why they call it technical debt. (Does that make Scrum Masters technical loan sharks?)

dboreham · on July 28, 2021

You can tell that this is occurring because you'll never hear someone say at the standup "I got nothing done since the last standup because [yarn|sbt|cargo] blew up spectacularly and I spent two days figuring out what was happening and that I had to revert to the previous point release that still works".

steveBK123 · on July 28, 2021

I think management induced agile/standups are there to produce this outcome. If your boss wants to see/hear daily updates, they don't want to see daily reports of non-progress.

Developers who don't get that intuitively quickly learn it when they get the wrong type of questions after the wrong type of updates.

Tail wagging the dog.

Akronymus · on July 28, 2021

Choose a better tech stack IMO. One that facilitates the pits of success rather than pits of failure.

http://www.paulgraham.com/avg.html

nanis · on July 28, 2021

> Does it boil down to the quality and experience of the hackers on the team?

Yes. But it also requires management who can understand and value what they have in that team.

loginatnine · on July 28, 2021

Yup, been using Slack since 2015, first in the web app and now in the "native" app, and usability and speed have gone downhill in the past few years. Focus problems after editing a message, wrong recommendations when typing /call, sluggish edit mode, sluggish recommendations window on /xxxx commands, the WYSIWYG editor first forced on everyone even though it was buggy, Ctrl+K getting slower and slower to switch between chats. Those are the things off the top of my head. I used to love the tool, but not so much anymore.

Their support is great and responsive though (I've reported every issue I've encountered).

the_gipsy · on July 28, 2021

Typing is extremely sluggish. Huge red flag for a chat app.

mgarciaisaia · on July 28, 2021

Don't even think about using Slack on a spotty internet connection.

vnorilo · on July 28, 2021

Recently I was trying to connect to a meeting on a train journey through a rural lake district with less than ideal connectivity. Zoom with screen share (cam off though) was more functional than text mode Slack, which kind of blows my mind. I miss IRC.

steveBK123 · on July 28, 2021

Will this free up the brain cycles for them to implement a competent vacation mode? Outlook handles this so much more gracefully: people can see I am out of office and when I am returning, and they get an out-of-office reply from me.

Meanwhile, on Slack there are multiple settings to toggle, to the point that it seems easier to just uninstall the app for a week.