Have you tried comparing with 3.7 via the API with a large thinking budget yet (32k-64k perhaps?), to bring it closer to the amount of tokens that o1-pro would use?
I think claude.ai’s web app in thinking mode is likely defaulting to a much much smaller thinking budget than that.
Every time I read the story of Therac-25 I feel incredibly frustrated AECL never faced real consequences or (criminal) liability for it.
Maybe I'm retroactively imposing modern day safety culture, but reading the timeline and history, it feels like AECL was completely negligent in waving off the issue as more and more fatalities kept piling up.
Can't believe the devices weren't pulled offline to definitively solve the issue after the first death. Instead, they basically went "can't repro, oh well".
They should have faced consequences for their response, as much as for their error-prone device. Multiple patients had complained of extreme burns during their treatment, and autopsies later confirmed the cause of death to have been radiation exposure, yet AECL was still saying thing like, "damage could not have been produced by any malfunction of the Therac or by any operator error."
Sure, they were laying under our radiation cannon and then died of extreme radiation exposure, but they probably got it somewhere else.
I think worse than other carriers. They are still positioning themselves as the underdog uncarrier, so it feels even less authentic. I don't see much of virtue-signaling from AT&T and Verizon.
This mirrors my experience. Not once did I have a Yellow LTL shipment arrive on time, and typical delays ranged from 2 days to 2 weeks or longer. Shipments would get lost in a terminal, down to staff insisting it never arrived on the trailer, and then spontaneously reappear after a loss claim was filed.
I would've loved to avoid them but vendors would often choose them due to their cost.
The problem isn't the unions. If you've interacted with Yellow you'd know that. It's that the company is disorganized and management is nowhere to be found.
Look at UPS - union shop, going strong. Management is organized. Company is doing fine.
It's amazing to me that in 2023 when the small-package industry (USPS, UPS, FedEx) is SO SO SO highly optimized that I have a choice of 3 carriers to get a box across the country in 1-2 days and by and large it works pretty darn well, and yet if a company wants to ship a pallet the same place, the logistics fall apart into garbage "it'll get there when it gets there."
I know FedEx and UPS offer freight services, but it doesn't sound like they're dominant or close to it in the industry.
Why didn't they put Yellow and the likes in the ground years ago?
There is a very big difference in price. Slow unreliable shipping is something that established companies know how to deal with, while many of them don’t know how to get a positive ROI out of paying more for freight.
Of course, the fact that Yellow has shut down suggests that may not be as true as it once was.
I had the same issue migrating an app off of Render (they also use Cloudflare for app routing).
Cloudflare would refuse to route to my new IP for hours on end. Incredibly frustrating and I almost pulled my DNS off of CF as a result.
I was able to work around it by disabling the orange cloud for the domain for a couple hours, then turning it back on, which must have reset some sort of cache on CF's end.
Ultimately, it's not a DNS issue, it's an internal CF routing issue - it only happens with CDN (orange cloud) on. It seems CF's just caching the orange cloud's original route (via the SaaS provider) way too long internally somewhere and it's not being cleared when the route is changed off of the SaaS.
Maybe a bit unorthodox, but I've been using SquashFS with xz for compression for long-term archival (I generally prefer zstd, but for long-term archival I don't mind waiting longer for better compression with xz).
SquashFS files have file-based deduplication, fast random access, and mountability, all of which are lacking from .tar.* archives without resorting to other indexing tools. And they're mountable on any Linux without installing anything.
Only downside is they're readonly (or more accurately append-only), but for my uses, that's totally fine.
Personally, I stick to Firefox because of multi-account containers, which Chrome doesn’t have (no, profiles don’t count). It’s a really sticky feature.
Agreed. Why is there no Tinder/online dating equivalent for making new friends locally?
Instead of dating-specific qualifiers, it'd ask for your interests, hobbies, values, age and other demos, then match you based on overlap.
Not Meetup - it's not quite solving the same problem, and so it solves things differently (focusing on shared interests and on discrete meetup events).
Because every time someone builds this it turns into a dating (charitably) app.
If there is a way to find sexual partners in a medium, people will do so.
Subscriptions still active at month x are represented as `l(x)` and subscriptions that "die" (cancel/expire) in a given month are represented as `d(x)`.
This gives you a "life expectancy" and a "mortality rate" (so, churn) for any given number of months that a customer has been subscribed. So I can project how long someone will stay subscribed when they're brand new (at month 0) and how long they likely still have when they're at month 8 (longer than at month 0, funnily enough).
With those subscriptions where the month-specific churn will largely decrease the longer someone's subscribed (after passing the initial high-churn first months), this allows measuring/projecting churn on a much more granular level.
Segmenting this data by signup date is useful as well.
If I'm not mistaken, you're aligning all time periods to the same imaginary start date. This gives you the grand perspective but ignores very real changes in your application, team, marketplace, advertising, and product-fit over time.
Yes, you should absolutely segment into cohorts. Ideally you can tag users such that you can slice them not just by time, but what funnel they used or what marketing they've been exposed to (emails, etc.)
Just measuring the grand perspective, as you say, can be a very poor indicator for what is working well, and what isn't.
I think claude.ai’s web app in thinking mode is likely defaulting to a much much smaller thinking budget than that.
reply