Hacker News
For RoR, see every method call, parameter and return value in production (callstacking.com)
122 points by puuush on Nov 22, 2023 | 67 comments



The calculator at the bottom of the page is doing some weird calculations. If you have no incidents a year, it still costs you money. So do incidents that take zero minutes to resolve. I took apart the code, and this seems to be the equation in use:

    (revenueTarget / 8760) * resolutionTimeTarget * numIncidentsTarget * resolutionTimeTarget + numEmployeesTarget * avgEmployeeTargeRate
This means revenue lost is correlated to the square of the lost time and the cost from employees is a static yearly cost.

There are a couple of things wrong with it. The third * should be switched to a +, and the last term needs to be multiplied by the number of incidents.

    (revenueTarget / 8760) * resolutionTimeTarget * numIncidentsTarget + resolutionTimeTarget * numEmployeesTarget * avgEmployeeTargeRate * numIncidentsTarget
Which if anyone at Call Stacking is here, just means changing

    o = (n / 8760) * e * t * e + i * r;
to

    o = (n / 8760) * e * t + e * i * r * t;
or more succinctly

    o = t * e * (n / 8760 + i * r)
I'm assuming that's minified, so

    numIncidentsTarget * resolutionTimeTarget * (revenueTarget / 8760 + numEmployeesTarget * avgEmployeeRateTarget)
Edit: With the correct math, the example is wildly different. It should be $37,277.81, not $87,991.23.
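
For anyone following along, here's the corrected formula as a Ruby sketch. The variable names and units are my reading of the minified source: yearly revenue, incidents per year, resolution hours per incident, employees involved, and hourly rate.

```ruby
HOURS_PER_YEAR = 8760.0

# Corrected downtime-cost estimate: both lost revenue and employee
# time scale with the number of incidents and the resolution time.
def downtime_cost(revenue:, incidents:, resolution_hours:, employees:, rate:)
  incidents * resolution_hours * (revenue / HOURS_PER_YEAR + employees * rate)
end
```

With zero incidents, or incidents that take zero minutes to resolve, the cost now comes out to zero, as you'd expect.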


Maybe there is a tool that can help them trace where this bug is coming from in their cost calculator?


If someone is up for writing a Call Stacking client for the language platform of their choice, please email me.

This would be your reference implementation.

https://github.com/callstacking/callstacking-rails

I'm sure we could work out an arrangement.

jim@callstacking.com


Here's the client code, for those who would like to inspect it.

https://github.com/callstacking/callstacking-rails


How does this work on the backend? Does it only trace method calls when an exception is thrown, or does it profile the call stack of every request?

Something I've been interested in is the performance impact of using https://docs.ruby-lang.org/en/3.2/Coverage.html to find unused code by profiling production. Particularly using that to figure out any gems that are never called in production. Seems like it could be made fast.
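
A sketch of that idea using the built-in Coverage API. The reporting helper here is hypothetical, but `Coverage.start` and `Coverage.result` are the real entry points; `stop: false, clear: false` lets you peek at the counters in production without ending collection.

```ruby
require 'coverage'

# At boot, before the application is loaded:
#   Coverage.start(lines: true)
#   require_relative 'config/environment'
# Later, read the counters without stopping collection:
#   result = Coverage.result(stop: false, clear: false)

# Given a Coverage.result hash, list files where no line ever ran.
def unused_files(result)
  result.select { |_file, data|
    data[:lines].compact.all?(&:zero?)  # nil entries are non-executable lines
  }.keys
end
```

Run against the files a gem loads, this would surface gems that are required but never actually exercised in production.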


The idea is to turn it on for a given request when needed - via a parameter, feature flag, etc.

prepend_around_action :callstacking_setup, if: -> { params[:debug] == '1' }

Once the request completes, the instrumented methods are removed, eliminating the performance overhead.
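
For the curious, here's roughly how per-request tracing can be done in plain Ruby with TracePoint. This is a hypothetical sketch, not the gem's actual implementation; the Invoice and Tracer names are made up for illustration.

```ruby
# Stand-in class to trace.
class Invoice
  def total(price, qty)
    price * qty
  end
end

class Tracer
  attr_reader :events

  def initialize(klass)
    @events = []
    @hook = TracePoint.new(:call, :return) do |tp|
      next unless tp.defined_class == klass
      case tp.event
      when :call
        # record each parameter's name and value at call time
        args = tp.parameters.to_h { |_kind, name| [name, tp.binding.local_variable_get(name)] }
        @events << [:call, tp.method_id, args]
      when :return
        @events << [:return, tp.method_id, tp.return_value]
      end
    end
  end

  # The hook is live only inside the block, so steady-state
  # requests pay no instrumentation cost.
  def trace
    @hook.enable
    yield
  ensure
    @hook.disable
  end
end
```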


This is very cool and useful.

Also, it is only necessary because of the lack of type enforcement, which means no code can be relied on and all code has to be constantly inspected for new bugs. Ugh.


Types do not prove logistical correctness.

Imagine a million-line codebase. There are half a dozen suspicious methods with complex sets of if/else if/else statements. And each of those statements makes subsequent method calls.

Determining that code path is a nightmare. Types won't save you.


Do you mean logical?

Types absolutely do prove logical correctness. In fact that’s all they do. However they can only prove the correctness of logic that’s type encoded. If your program is a primitive soup, there’s not much logic for them to prove.


I can see this being useful in strongly-typed languages too. There is a massive class of logical bugs that will type-check correctly, that will still result in wrong results being returned.


I built a toy raytracer in Haskell for fun. I found all of these bugs. It turns out that when you implement dot product slightly wrong, the image output is very confusing.


Even just redundant call paths are useful to spot. I recently looked through the standard library of a strongly, statically typed language and found that in one pretty basic function a validation function was called over 8 times despite reliably returning exactly the same result every time; this tool would highlight that very easily.

That's not even mentioning the logic bugs you can spot more easily as well.


This is advertised as a tool for production. Seems like it would be useful in development too.


And new engineer onboarding.

You have a new engineer. Point them to the Call Stacking dashboard.

"Here's a list of all of our endpoints for our application.

Click on a trace, and you can see the relevant methods for that endpoint and the context in which they are called."

This will get your new engineers up to speed much more quickly.


wow!


Recently did something similar for a Java project using AOP. Basically adding an annotation to each method and logging the parameters before the method call and the return values after it. Whenever there is an exception, a mail is sent with the stacktrace along with the entire request path (including method calls, parameters, and return values). Extremely useful for debugging and for proactively fixing issues.
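
For comparison, a rough Ruby analogue of that annotation-driven AOP logging can be sketched with Module#prepend. CallLogger and Checkout are hypothetical names, not the commenter's Java code.

```ruby
module CallLogger
  # Wrap every instance method of klass; returns the shared log.
  def self.wrap(klass)
    log = []
    interceptor = Module.new do
      klass.instance_methods(false).each do |name|
        define_method(name) do |*args, &blk|
          log << [:call, name, args]
          result = super(*args, &blk)
          log << [:return, name, result]
          result
        rescue => e
          # on exception, the log still holds the request path so far
          log << [:raise, name, e.class]
          raise
        end
      end
    end
    klass.prepend(interceptor)
    log
  end
end

# Stand-in class to instrument.
class Checkout
  def tax(amount)
    amount * 0.2
  end
end
```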


Curious how you do this performantly for any non-trivial codebase. Like, consider a class whose logging representation is the data it contains, which can be arbitrarily large. Generally this is not an issue because it’s only logged rarely and on designated paths where people actually care about this and it was expected to be used in that fashion. How would this work for a large object that you pass to a function repeatedly, or in a deeply nested stack trace?


The project I worked on has a non-trivial codebase. So far I haven't seen any performance issues though I was worried initially. The idea is to use it during development and beta testing and switch it off later once the application is stable enough. Might keep it on for some more time if there are no performance issues.


After signup I see a typo in:

The trace URL will also be outut via the Rails log.

And the local usage section is hard to read: white text on a light blue background in Safari.


Good catches. Should be fixed now.


This is a product where the SaaS doesn't seem to add much that couldn't be done as easily locally, other than monetizing the process.


Disagree.

1) In a large-scale production scenario, you typically do not have the data, nor the interaction flow, to reproduce the bug locally. The idea is that you enable Call Stacking on the fly, when needed. Turn it off when not needed.

2) Having multiple runtime captures of the same endpoint across two different deployments or time periods allows you to quickly compare for logic or data changes (argument values and return values are visible).

3) Commenting on individual lines of execution allows for the team to have a specific discussion surrounding logic changes.


local is never really a production environment though


Pretty cool. Not sure what pricing is, seems like it's focused for enterprise.


Typo on front page: "What's does an hour of downtime cost you?"


Would this mean that any data I happened to have in memory during the flow now permanently lives in callstacking's data stores? How does it handle all the data flowing through from a security perspective?


It respects the normal RoR toolchain parameter filtering, so anything that you say is sensitive (or everything by default, if you'd like) also doesn't get sent to CallStacking.


The same filtering mechanism you have in place for your application logs is applied to the argument hash before being sent to the server.

https://github.com/callstacking/callstacking-rails/blob/599d...
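
For readers outside Rails, the idea behind that filtering can be sketched in plain Ruby. This is a hypothetical helper, not the gem's code; Rails itself drives this from config.filter_parameters.

```ruby
FILTERED = '[FILTERED]'.freeze

# Recursively mask any hash key matching one of the filter patterns,
# mirroring what Rails' parameter filtering does for logs.
def filter_params(params, patterns)
  params.to_h do |key, value|
    if patterns.any? { |pattern| key.to_s.match?(pattern) }
      [key, FILTERED]
    elsif value.is_a?(Hash)
      [key, filter_params(value, patterns)]
    else
      [key, value]
    end
  end
end
```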


Some of this already exists to some degree: https://github.com/MiniProfiler/rack-mini-profiler


Different emphasis.

The goal is to quickly be able to see just the important, executed methods for a given request.

E.g. you may have a 2,000-line User model, but Call Stacking allows you to pinpoint, "Oh, only these three methods are actually being called during authentication. And here are the subsequent calls that those methods make. And here's where the logic change occurred."


This seems ridiculously useful. What’s the catch?


The instrumentation has a performance overhead.

You enable the instrumentation with a prepend_around_action,

e.g.

prepend_around_action :callstacking_setup, if: -> { params[:debug] == '1' }

When the request is completed, the instrumented methods are removed (thus removing the overhead).

You have to enable it judiciously. But for a problematic request, it will give the entire team a holistic view of what is really happening: which methods are called, their parameters, and their return values are all made visible.

You no longer have to reconstruct production scenarios piecemeal via the rails console.


I'm not familiar with Rails, so sorry if your reply above inherently answered this. So you're enabling the tracing with a call in your controller method, but how is the tool capturing function params and return values for sub-calls in the respective controller method?

Is it waiting for execution to return to the controller method and polling the stack trace from there?


As far as I can tell, it only executes the trace when asked; it's not an APM like New Relic. Most likely the trace meaningfully slows down the individual request.

When I was at ScoutAPM, we built a version of this that was stochastic instead of 100% predictable. We sampled the call stack every 10-50ms. Much lower overhead, and it caught the slower methods, which is quite helpful on its own, especially since slow behavior often isn't uniform; it tends to hit only a small handful of your biggest customers. But it certainly missed many quickly executed methods.

Different approaches for sure, solve different issues.


These types of profiling gems usually kill performance


How does this product handle sensitive data? I'm guessing this is not a HIPAA compliant service.


Correct. The SaaS version is not HIPAA compliant.

On-premise is an option.


Does similar tool exist for Django / Python?


https://werkzeug.palletsprojects.com/en/3.0.x/debug/#using-t... ?

Not sure if Django could use it, but used an earlier version in Flask and Pyramid.


And by similar to the Rails tool, I mean the functionality, not the intended usecase. Werkzeug is intended for local dev mode only.


Does it work for background jobs as well?


I would love to see this for Laravel.


Let’s write the Laravel client together?

jim@callstacking.com


I don't feel confident enough in PHP yet (my first language is python)

I hope you find somebody to help! Thank you!


Does this handle multithreading?


The irony is that the issues this helps with could be solved far before production. Compile time, or some local runtime even. Just not in Ruby.

Nearly all the issues this shows you are issues that static typing would prevent at compile time, or that type hints would surface at dev time.

I've been doing full-time Rails for 12+ years now, PHP before that, C before that. But I've always had side gigs in Java, C#, and other typed languages, and now I've finally moved full-time to Rust. They solve this.

Before production. You want this solved before production. Really.


Of the bugs that I've experienced in large-scale, Rails production systems, typing is a small subset.

Manually reconstructing logistical errors that arise from a combination of user input and system data is the most time-consuming kind of issue to diagnose.

When your codebase is 500,000+ lines of code, which code paths are relevant for a given endpoint? What methods were called and under what context? How do we begin to reconstruct this bug?

These are the scenarios to which Call Stacking gives instant visibility.


> Of the bugs that I've experienced in large-scale, Rails production systems, typing is a small subset.

Really? Are you counting errors where a value turned out to be nil when it wasn’t expected to be? Because that’s a type error. It’s not a type error that many statically typed languages fix (Java is notorious for null pointer exceptions) but it’s a type error that can, in principle, be fixed with static typing.


I'm always surprised by that. I've worked on a 10-15 person team on a relatively large Rails codebase and, yes, the bugs we usually see are not type-related (including nil where it's not expected). I keep reading people saying that type systems eliminate a huge class of bugs, but that hasn't been my experience with languages that have poor type systems (Java, Rust...). Languages with much better type systems, like OCaml, are different, and I've had great luck with OCaml in particular (my favorite language to use when I can). But there's also the fact that devs who use OCaml tend to be much better than average (in the same way that, in my experience, devs working exclusively with PHP or Node.js tend to be much worse than average).

Note, before I'm downvoted for that last comment: there are exceptions, but compare the average candidate applying for a PHP or Node.js position with the average candidate applying for a Rails position, or an OCaml one.


In the app I work on a huge percentage (60% in the timeframe I analyzed) of the errors are errors a type system would catch. Nil references, long method chains where the middle call returns a different object than expected, bad refactorings changing the return type of methods…

It’s staggering really. These could be solved with better programming practices, they’re not errors usually experienced programmers make, but they exist nonetheless.

An exception/error system, with checked exceptions or handling of errors via result monad/errors like in Go, would solve the vast majority of other errors.

Type systems also encourage other things, like contracts on API endpoints and typed messages, serving as an early warning system when you're writing those bugs.

Purely business logic bugs are actually not common. I guess we haven’t had the opportunity to create those “higher order” bugs while wading through the others.


Do you have tests?

I feel like 60% of the errors I encounter is high for type errors.

I also feel like with testing, fixing those errors occupies less than 5% of my debugging time.


In this case, I guess you can create some ruby gem/library/convention/dsl/architecture that makes it harder to accidentally pass nils - and be more efficient than a popular language that has static typing.

I’m writing that to address the “just not in ruby” remark[1] from earlier.

[1]: https://news.ycombinator.com/item?id=38385890
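
As a tiny illustration of one such convention, a Null Object stands in for nil so callers never have to check. GuestUser and current_user here are hypothetical, not an existing gem.

```ruby
# GuestUser answers the same messages a real User does, with safe
# defaults, so current_user never returns nil.
class GuestUser
  def name
    'guest'
  end

  def admin?
    false
  end
end

User = Struct.new(:name) do
  def admin?
    true
  end
end

def current_user(session)
  session.fetch(:user) { GuestUser.new }  # explicit default instead of nil
end
```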


Depends on the type system. I would say for Java / Python level static types they catch a small but significant fraction of bugs (10-20% according to the only objective measurement I've seen, which is easily worth it). However some languages like Rust, Haskell and OCaml let you express much more in the type system.

Subjectively it feels like that catches more like 30-60% of bugs.

So this thing is still useful but Berkes is right that you need it a lot less if you use better static types.

> which code paths are relevant for a given endpoint?

This is exactly the question that static types can answer... statically. You don't need a runtime log to find out.

You do need a runtime log to see the actual values though. So it's not like a debugger is completely useless in Rust. But I definitely reach for it much less than in other languages.


I worked on a large financial transaction system for up to $7B per day in Haskell that used formal methods like Agda, TLA+, etc, with a lot of logic in the type level (i.e. Liquid Haskell), as a test engineer. The entire system was covered in property based tests, proofs, and then normal tests from unit -> system. We literally had two bugs categorised as P2/P1/P0 on release, both of which were design related, and both were fixed before users saw them. It was crazy effective (but took years).


Nice! Can you say what company you were working for?


I don’t want to totally dox myself but there’s a couple crypto companies doing that sort of stuff like Algorand, Tezos, Cardano, etc :)


> This is exactly the question that static types can answer... statically. You don't need a runtime log to find out.

This tool seems to be able to display the relevant code paths. That sounds super convenient and useful. Do statically typed languages have tools to do similar?


Yeah "find all references" will show you everything that can call a particular function. As I said it won't give you the actual values so this still seems useful.


I don’t know if a similar tool exists, but in principle Java and .NET have profiling APIs that let you look into method calls. They could be used as the basis for a similar tool.


The number of times I wish I had this tool for my production systems using Java, Kotlin, and Scala is…enormous.

Typing is great, I am a fan. But seeing the values that are running through those types is not solved at compile time.


I run a large Rails application (https://serpapi.com), and the issues that would be solved with the type system would be close to nil.


Any `nil` error, like the famous `undefined method 'users' for nil` is a type error.

Every time serialization gets the wrong value passed in but continues anyway, that's a type error. Every time a database record misses a value (NULL) but the app continues running over it, that's a type error. And so on.

When I look at my most stable apps' Rollbar or Sentry, the top 20 errors are nearly all errors that a type system (which does not allow |null, like Java's - ugh, useless) would've caught compile or dev-time. The very few non-typing errors that are then left are race-conditions and business-logic-bugs.

The latter are really the only bugs that I'm "fine" with, they come with the domain. Race-conditions are quite often deep down also typing issues - they surface as similar `Undefined method on nil` errors because some data is nil due to the racing threads. Something a typing system would partly fix - as we can see in Rust.


Nil errors are still there, but you are just forced to handle them. It doesn't mean your app is not broken.


[flagged]


I have not come across this sentiment before. Is there something specific about Ruby that makes you think this way or is this your general view of dynamic languages without a strong type system?


Well, if you never get around to releasing the feature, you'll never have production issues. Ruby embodies "perfect is the enemy of done" in a way that I tend to appreciate.


Haha - I can't abandon my first programming love. :D

Ruby is fantastic.


But it's so much more pleasant -- esp the testing and mocking support -- than all the other languages I've tried.



