Tracing Uncovers Half-Truths in Slack’s CI Infrastructure (frankc.net)
59 points by frankchen 52 days ago | 29 comments


Traditional monitoring tools like logs and metrics were necessary but not sufficient to debug how and where systems failed in CI, which relies on multiple, interconnected critical systems (e.g. GHE, Checkpoint, Cypress).

In this talk, Frank Chen shares how traces gave us a critical and compounding capability to better understand where, when, how, and why faults occur for our customers in CI. We share how shared tooling for high-dimensionality event traces (using SlackTrace and SpanEvents) could significantly increase our velocity to diagnose code in flight and to debug complex system interactions. We go from stories with early incidents that motivated further investment throughout Slack’s internal tooling teams to stories about gains in performance and resiliency throughout our infrastructure.
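To make the idea of event traces for a CI run concrete, here is a minimal sketch of nested spans that share one trace ID. It is not SlackTrace or the SpanEvents schema, which the summary does not describe; `Tracer`, `Span`, and the step names in the usage note are hypothetical and only illustrate how parent/child timing records let you see where a run spent its time or failed.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional


@dataclass
class Span:
    """One timed unit of work inside a trace (hypothetical schema)."""
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    name: str
    start: float
    duration_ms: float = 0.0
    tags: Dict[str, str] = field(default_factory=dict)


class Tracer:
    """Collects spans for a single CI run in memory."""

    def __init__(self, trace_id: Optional[str] = None) -> None:
        self.trace_id = trace_id or uuid.uuid4().hex
        self.finished: List[Span] = []
        self._stack: List[Span] = []  # currently open spans, innermost last

    @contextmanager
    def span(self, name: str, **tags: str) -> Iterator[Span]:
        # The innermost open span, if any, becomes the parent.
        parent_id = self._stack[-1].span_id if self._stack else None
        current = Span(
            trace_id=self.trace_id,
            span_id=uuid.uuid4().hex[:16],
            parent_id=parent_id,
            name=name,
            start=time.monotonic(),
            tags=dict(tags),
        )
        self._stack.append(current)
        try:
            yield current
        except Exception as exc:
            # Record the failure on the span, then let it propagate.
            current.tags["error"] = type(exc).__name__
            raise
        finally:
            current.duration_ms = (time.monotonic() - current.start) * 1000.0
            self._stack.pop()
            self.finished.append(current)
```

A CI runner could wrap each stage, for example `with tracer.span("ci_run", branch="feature-x"):` around inner `with tracer.span("checkout"):` and `with tracer.span("e2e_tests", runner="cypress"):` blocks, and then ship `tracer.finished` to whatever trace backend is in use.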

I'm curious how Slack does their e2e tests. Are they spinning up every dependent service for each PR, or are there some long-running services shared among the PRs?

Related: are the databases and storage (e.g., S3) populated with fake data for the test runs beforehand?

At my current place we have a couple separate services, but haven't built out e2e tests.

At my company we do this for a Drupal application. Each feature branch gets:

  - new server

  - new database copy (anonymized) (same db server)

  - shared s3 (dev)

  - shared microservices (dev)

Cypress runs against this feature branch environment.

For other services we mostly only have dev/stage/prod so it's easier.
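The feature-branch setup in the list above can be sketched as a small planning function: per-branch resources get names derived from the branch, shared resources stay fixed. Everything here is an assumption for illustration. The hostnames, bucket name, and `plan_environment` helper are hypothetical, and actually provisioning the server or copying the database is out of scope.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class BranchEnvironment:
    server_host: str         # new per branch
    database_name: str       # new anonymized copy per branch
    database_host: str       # shared DB server
    s3_bucket: str           # shared dev bucket
    services_base_url: str   # shared dev microservices
    cypress_base_url: str    # what Cypress is pointed at


def slugify(branch: str) -> str:
    """Turn a branch name into a DNS-safe label."""
    slug = re.sub(r"[^a-z0-9]+", "-", branch.lower()).strip("-")
    return slug[:40].rstrip("-") or "branch"


def plan_environment(branch: str) -> BranchEnvironment:
    """Describe (not create) the environment for one feature branch."""
    slug = slugify(branch)
    host = f"{slug}.preview.example.test"  # hypothetical domain
    return BranchEnvironment(
        server_host=host,
        database_name="app_" + slug.replace("-", "_"),
        database_host="db.dev.example.test",
        s3_bucket="app-dev-shared",
        services_base_url="https://services.dev.example.test",
        cypress_base_url=f"https://{host}",
    )
```

Two branches planned this way differ only in server and database name, which matches the split between "new" and "shared" resources described above.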

A bit light on the technical details that matter (i.e. the actual metrics added, how they went from data to conclusions, exactly why Honeycomb helped). This reads a bit like an advert targeting managers with purchasing power. Is it a report from attending a talk? That might make sense at least.

I worked on a similar system at Plaid, but we used Lightstep. The biggest challenge was onboarding engineers to the tool. It was also a bit of a workaround with respect to test flakiness, which we tackled separately. Still, I and many engineers (particularly new ones who weren't already familiar with the system architecture) found it valuable!

I liked this article: https://logz.io/learn/opentelemetry-guide/

You also have the Observability podcasts

I found this good guide on observability for devs by Lightstep: https://lightstep.com/observability

I don’t understand any of that. I understand programming concepts though.

Edit: If there's a better venue to ask please send me elsewhere.

Hi Omar, It seems like you're not at Plaid anymore, but if you don't mind there's something I've been trying to figure out for a while and would appreciate your opinion/thoughts.

It has to do with the way Plaid works. Seems to me when financial institutions have account security liability, such 'security guarantees' tend to be contingent on the account holder _never disclosing credentials_. Given the nature of how Plaid works [1], it would seem Plaid users could be taking considerable risk.

I'm wondering if my understanding is right -- do account holders forfeit protection 'guarantees' their financial institution might offer when they use Plaid? Do financial institutions consider Plaid an 'authorized user' in some special way? Quite possibly I'm missing something and was hoping to understand. Thanks!

[1] https://security.stackexchange.com/questions/198005/is-plaid...

Hi, I'm a little late to the party, but I saw this and I do work at Plaid, so I thought I might be able to help answer. First disclaimer: I am definitely not a lawyer. However, the basic premise that banks' liability guarantees are contingent on account holders never disclosing credentials is, as far as I understand it, not correct.

Basically, your rights to be protected against unauthorized transfers are provided under Regulation E, which gives customers the right to address unauthorized transactions from their accounts. Under Reg E, once a consumer properly notifies their financial institution of an unauthorized transaction within a specific amount of time, the financial institution is obligated to limit the consumer’s liability for the unauthorized transaction.

If you provide proper notice to your bank under Reg E, your bank cannot waive their liability, even if you shared account information with a third party. The Consumer Financial Protection Bureau (CFPB) made this explicit in a recently published Compliance Aid. Quoting the relevant section from their FAQ below:

"Q: If a financial institution’s agreement with a consumer includes a provision that modifies or waives certain protections granted by Regulation E, such as waiving Regulation E liability protections if a consumer has shared account information with a third party, can the institution rely on its agreement when determining whether the electronic fund transfer was unauthorized and whether related liability protections apply?

A: No. The Electronic Fund Transfer Act (EFTA) includes an anti-waiver provision stating that “[n]o writing or other agreement between a consumer and any other person may contain any provision which constitutes a waiver of any right conferred or cause of action created by [EFTA].” 15 U.S.C. § 1693l. Although there may be circumstances where a consumer has provided actual authority to a third party under Regulation E according to 12 CFR § 1005.2(m), an agreement cannot restrict a consumer’s rights beyond what is provided in the law, and any contract or agreement attempting to do so is a violation of EFTA."

Source: https://www.consumerfinance.gov/compliance/compliance-resour...

Thank you, helpful!

The loophole lies in the wording: the regulation only protects you against unauthorized transactions.

Plaid makes transactions on your behalf, so it's arguable whether they are unauthorized transactions.

To cite a similar case from real life: when your wife empties your bank account, you may not be able to get your money back because that doesn't constitute an unauthorized transaction. Also, nobody will flag it or alert you, because they don't think of it as abnormal.

If I understand it correctly this does something similar to: https://github.com/honeycombio/buildevents ?

Adding tracing to your own infrastructure is all well and good...

But so much infrastructure is a mishmash of in-house, open source, and closed source. Even for the open source stuff, you probably don't want to maintain patches to integrate your tracing tool. In that case, you'll end up with big gaps in traces, making them less useful.

Slack still feels so sluggish, slow and clunky though. Opening the emoji drawer for example takes so long.

Perhaps because first they had to fix their CI so they can fix their real problems.

It's often underrated how much an otherwise competent team can crumble under technical debt (of which CI and test flakiness is one specimen).

It sounds plausible to me that Slack's rapid growth curve put them in the perfect position to accrue massive amounts of debt that they are now struggling to wade through.

This is such a hard one. How do you do this as a startup? How do you keep delivering features fast without accruing crazy amounts of debt? Does it boil down to the quality and experience of the hackers on the team?

I think one good threshold for most startups is product-market fit.

Before you have it, you should be willing to accumulate large amounts of tech debt both because your product might change enough that you'd be rapidly throwing out large amounts of code anyways and because having product-market fit is such an overwhelming need that almost everything should be traded against it because your startup is dead without it.

After you have it, you have to be much more diligent about paying down tech debt, because you switch from a more "exploratory" mode to a "refinement" mode where you now are unlikely to be throwing away large swaths of your code at the drop of a hat, and instead need to iteratively build upon what you already have and that becomes very difficult to do with large amounts of tech debt.

I think engineers are super bad at deciding where to take on technical debt. Often people will save themselves half a day at the cost of making some future feature nearly impossible. I don't really blame the engineers, but rather the system -- the customers demand new things, then there are biweekly "sprints" to set deadlines, and there are daily standups where nobody wants to say "I'm doing the same thing I did yesterday and it's not done yet." In that system, saving five minutes at any cost is the right thing to do, and then you get boiled alive by your technical debt.

I honestly think that the glacial pace of software engineering is mostly due to this phenomenon -- "where can we save a little bit of time today?" and then it snowballs into entire teams not getting anything done. The interest on the loan they took out is higher than their income. I guess that's why they call it technical debt. (Does that make Scrum Masters technical loan sharks?)

You can tell that this is occurring because you'll never hear someone say at the standup "I got nothing done since the last standup because [yarn|sbt|cargo] blew up spectacularly and I spent two days figuring out what was happening and that I had to revert to the previous point release that still works".

I think management-induced agile/standups are there to produce this outcome. If your boss wants to see/hear daily updates, they don't want to see daily reports of non-progress.

Developers who don't get that intuitively quickly learn it when they get the wrong type of questions after the wrong type of updates.

Tail wagging the dog.

Choose a better tech stack IMO. One that facilitates the pits of success rather than pits of failure.


> Does it boil down to the quality and experience of the hackers on the team?

Yes. But it also requires management who can understand and value what they have in that team.

Yup, been using Slack since 2015, first in the web app and now in the "native" app, and usability and speed have gone downhill in the past years. Focus problems after editing a message, wrong recommendations when typing /call, sluggish edit mode, sluggish recommendations window on /xxxx commands, WYSIWYG editor first forced on everyone even though it was buggy, Ctrl+K getting slower and slower to switch between chats. Those are the things off the top of my head. I used to love the tool but not so much anymore.

Their support is great and responsive though (I've reported every issue I've encountered).

Typing is extremely sluggish. Huge red flag for a chat app.

Don't even think about using Slack on a spotty internet connection.

Recently I was trying to connect to a meeting on a train journey through a rural lake district with less than ideal connectivity. Zoom with screen share (cam off though) was more functional than text mode Slack, which kind of blows my mind. I miss IRC.

Will this free up brain cycles for them to implement a competent vacation mode? Outlook handles this so much more gracefully: people can see I am out of office, when I am returning, and get an out-of-office reply from me.

Meanwhile on Slack there are multiple settings to toggle, to the point that it seems easier to just uninstall the app for a week.
