I got a cargo e-bike (Aventon Abound) and I have a cafe lock and a chain lock. Luckily the bike doesn't look "cool" so I am less worried about thieves. It's also quite heavy so the cafe lock is almost always enough. If I am leaving it downtown for hours then I also use the heavy duty chain.
I was waiting for your comment and wow... that's bad.
I guess they are ceding the LLMs-for-coding market to Anthropic? I remember seeing an industry report somewhere claiming that software development is the largest user of LLMs, so it seems weird to give up in this area.
I am beginning to think these human eval tests are a waste of time at best, and negative value at worst. Maybe I am being snobby, but I don't think the average human is able to properly evaluate usefulness, truthfulness, or the other metrics that I actually care about. I am sure this is good for OpenAI, since if more people like what they hear, they are more likely to come back.
I don't want my AI more obsequious, I want it more correct and capable.
My only use case is coding though, so maybe I am not representative of their usual customers?
How is it supposed to be more correct and capable if these human eval tests are a waste of time?
Once you ask it to do more than add two numbers together, it gets a lot more difficult and subjective to determine whether it's correct and how correct.
Please tell me how we objectively determine how correct something is when you ask an LLM: "Was Russia the aggressor in the current Ukraine / Russia conflict?"
One LLM says: "Yes."
The other says: "Well, it's hard to say because what even is war? And there's been conflict forever, and you have to understand that many people in Russia think there is no such thing as Ukraine and it's always actually just been Russia. How can there be an aggressor if it's not even a war, just a special operation in a civil conflict? And, anyway, Russia is such a good country. Why would it be the aggressor? To its own people even!? Vladimir Putin is the president of Russia, and he's known to be a kind and just genius who rarely (if ever) makes mistakes. Some people even think he's the second coming of Christ. President Zelenskyy, on the other hand, is considered by many in Russia and even the current White House to be a dictator. He's even been accused by Elon Musk of unspeakable sex crimes. So this is a hard question to answer and there is no consensus among everyone who was the aggressor or what started the conflict. But more people say Russia started it."
Because Russia did undeniably open hostilities? They even admitted to this both times, the second admission being the announcement of a “special military operation” while the ceasefire was still active. We also have photographic evidence of them building up forces on a border during a ceasefire and then invading. This is like responding to “did Alexander the Great invade Egypt” by going on a diatribe about how much war there was in the ancient world and how the Ptolemaic dynasty believed themselves the rightful rulers, therefore who’s to say if they invaded or just took their rightful place. There is an objective record here: whether or not people want to try and hide it behind circuitous arguments is a different matter. If we’re going down this road, I can easily redefine any known historical event with hand-wavy nonsense that doesn’t actually have anything to do with the historical record of events, just “vibes.”
One might say, if this were a test being done by a human in a history class, that the answer is 100% incorrect given the actual record of events and the statement's failure to mention that record. You can argue the causes, but that's not the question.
These eval tests are just an anchor point to measure distance from, but it's true, picking the anchor point is important. We don't want to measure in the wrong direction.
If you order from Walmart and limit it to what is in their actual store, then you know at least some human vetted it as safe for sale in the US. Walmart also lets third-party sellers on their website, and yes, most of that is drop-shipped junk just like Amazon.
I've noticed that Instacart (and by extension, Costco same-day shopping) has integrated an LLM into their search. It's awesome to be able to search for "ingredients for a chicken and vegetable roast" and have all the separate ingredients you need be returned. You can also search for things like "healthy snack" or "easy party appetizers".
I think this is a great use case for LLM search since I am able to directly input my intent, and the LLM knows what's in stock at the store I am searching.
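If you're curious how that might work under the hood, here's a minimal sketch of the pattern I assume they use: an LLM expands the intent into concrete ingredient terms, then ordinary keyword matching runs against the in-stock inventory. Everything below (the stubbed expand_intent, the inventory, the names and prices) is my own assumption, not Instacart's actual API:

    # Hypothetical sketch of intent-based grocery search. expand_intent
    # stands in for a real LLM call; inventory, names, and prices are
    # all made up.

    INVENTORY = {
        "whole chicken": 8.99,
        "carrots 2 lb": 2.49,
        "yukon potatoes 5 lb": 4.99,
        "fresh rosemary": 1.99,
        "olive oil 500 ml": 7.49,
        "tortilla chips": 3.29,
    }

    def expand_intent(query: str) -> list[str]:
        """Stand-in for the LLM step: natural-language intent -> search terms."""
        canned = {
            "ingredients for a chicken and vegetable roast":
                ["chicken", "carrots", "potatoes", "rosemary", "olive oil"],
        }
        return canned.get(query, [query])

    def search(query: str) -> list[tuple[str, float]]:
        """Plain keyword matching against what the store has in stock."""
        terms = expand_intent(query)
        return [(item, price) for item, price in INVENTORY.items()
                if any(term in item for term in terms)]

    for item, price in search("ingredients for a chicken and vegetable roast"):
        print(f"{item}: ${price:.2f}")

The LLM only does the intent-to-terms step; the actual retrieval is the same keyword matching stores have always done.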
Nothing you describe hasn't already been done in the pre-LLM era with simple keyword matching.
In the city where I lived in 2012, the (now defunct) local supermarket chain could handle your roasted chicken request. You could also paste an entire grocery list into a text box and have it load the items into your cart all at once. That's the feature I miss the most.
I just tried your snack and appetizer requests with the grocery service I currently use, and it worked fine. No "AI" needed.
I ordered the first Echo the day it was announced, and was excited about the possibility for years.
But that "possibility" never turned into reality for me and I ended up only using it to start timers and play music. I've since abandoned the product line and do not have faith that Amazon will develop this into something actually useful, rather than something that is used to sell me products and surveil on me.
Part of why people only use it for timers is its limited capability to understand. "Do I need an umbrella today?" results in Alexa telling me what the weather will be like without mentioning the chance of rain. Asking a trivia question leads to it reading out a response that is wrong 50% of the time. If I ask Alexa to remind me at 8, it asks me whether I mean am or pm, even though I expressed it unambiguously in German. If I don't use the right phrasing to snooze a reminder, it asks me what I want to be reminded of. And so on, and so forth.
I like how, when I ask it for the hours of a shop near me, it gives me the hours of some store with the same name literally thousands of kilometers away, every time, despite knowing my exact address.
The other part is that timers are ridiculously immediately useful. Other questions require far more context. Do you need an umbrella? In the next hour, or the next 6 hours? To walk around town, or just to your car? Do you actually have a handy working umbrella?
That's all most users ever do with an Echo. Amazon thought that users would trade the benefits of comparison shopping for convenience and use the Echo to order products chosen by Amazon, but they did not.
Outside of providing the time and weather, and turning lights on and off, Amazon severely limited the ability for third parties to add features, and even reduced it further well after launch.
I could see users absolutely doing that if it was with, say, the extensively tailored product selection of Costco, where you can order a Kirkland brand item in any category and generally be satisfied with the results.
But Amazon shot themselves in the foot by flooding every category with brands like XGYSZY and KWYBLPOP. No one is ever going to trust ordering off Amazon without actually seeing what they're buying. It's kind of baffling that they apparently never understood that themselves.
Nailed it. If I could say "Alexa, order AAA batteries" and I'd get something generally recognizable as a legitimate brand at a reasonable price, I'd do it. If I were to say "Alexa, buy milk" today, I'd fully expect to get a gallon of "Doctor Methy's Cow Juice" in a ziploc bag. There's no way I'd trust it to get me what I actually wanted.
They probably understand, but the Alexa team is powerless to make the necessary changes without higher-level executive initiative (that's the way things go in big orgs like Amazon). Even something pragmatic like "why not restrict the available options to known brands" can involve far more nuance and complexity than just coming up with a list of brands to whitelist.
Those weird brand names come from an odd method that Amazon uses to reduce counterfeit products. They require that some sellers have a brand name filed with the US Patent and Trademark Office. It can be difficult to get a trademark for something too generic or common, but if it's just a garbled string of characters, it's really easy to get approval.
As far as those cheap imported products go, for any given type of product, they all come from the same handful of manufacturers, so which off brand doesn't really matter. Once I figure out which variation I want, I just want to be able to sort by price, and get the cheapest version. Amazon's maliciously broken sort-by-price feature already makes it difficult, but ordering through an Echo makes it impossible.
I've ended up using eBay more than Amazon, as well as the occasional exporter like AliExpress and TVCmall.
A curated list of "household essentials" would go an extremely long way to making this useful. But as far as I know, they never really did anything of the sort.
Even with that, you have like 20 different configurations of the same toilet paper with various prices per foot and shipping speeds. I think that's what "Amazon's Choice" was targeted at solving, but I could be wrong.
"Amazon's choice" is algorithmic. It's will often be two different products if you look at two different regional Amazon sites, even if both products are available on both sites at comparable pricing.
Back in 2019, every marketing conference was abuzz with hype for the latest in tech innovation: "voice as the primary interface for search". Hopefully those attendees diversified their plans with something timeless and battle-tested like "pivot to video".
Amusingly, I'd hazard users probably were willing to forego comparison shopping for those little "refill buttons" they made, far more so than for the privilege of getting frustrated with a talking assistant.
It’s really awesome for starting timers and playing music though. I also ask it questions and it does pretty well at answering them. For once we get to be on the opposite side of the exploitation curve here. They can provide this service to me for free in perpetuity I hope. I think my Dots were maybe $19 or bought on eBay.
It's not great for playing music if you're on Sonos, especially if you have multiple systems associated with a single account. The skill that integrates it with the speakers deauthorizes after some time, but instead of failing when you ask Alexa to play music, it acknowledges your request (Playing "whatever" on $MUSIC_SERVICE) and proceeds to play nothing.
It's not free though. They are collecting a massive amount of data on you, and exposing you to liability as well with recordings kept on file. If you don't value your personal data then I guess it's "free".
We all make the calculation whether it is worth it and for me it is worth it. I'm not a head of state or someone with great secrets to keep. Alexa just hears me talk a lot about Elden Ring or whether we need to buy milk. I'm just a normal guy talking to his girlfriend and for me it is worth it. I completely respect the opposite view though. For me the pros outweigh the cons. I have good reason to believe that they are being truthful when they say it only hears me talk when I give the wake word.
I think you would have a lot more cases than this if it heard you at all times. It seems the police only have access to the times the word Alexa was used.
I was excited about them too and gave this "way ahead of its time" preso [1] on those kinds of interfaces. Somehow I wound up with five of them; I think I got a lot of them at Best Buy when I bought something else. But they weren't that useful, and my family is very privacy sensitive, so I removed them from my AMZN account and gave them all away to the reuse center.
Just because it's not useful to you doesn't mean it's not useful. It has the best shopping list of any assistant (non-Apple-walled-garden), it's the only one that can text me my reminders, and the skills are killer - sprinklers, remote car start, the possibilities with skills are limitless. I've never felt compelled to buy a product it has offered me, but it did offer me a really good deal once on an item I had been looking at, which was useful.
I have a funny story speaking of the Alexa shopping list.
A few years ago, before I had Amazon Prime and was still committed only to the Google infra, I was over at a friend's house; he had recently gotten an Alexa assistant thingy.
While he went out to the garage to get some beer, I said, "Alexa, Please add Hemorrhoid Suppositories to my shopping list."
Paired with Home Assistant and the Hue emulator, Alexa gets a lot more useful, since you can expose to her whatever crazy script you'd like her to toggle via voice.
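For anyone wondering how the trick works: Echo devices can discover and control a Philips Hue bridge on the local network, so the emulator just pretends to be one and maps each fake "light" onto whatever you expose. Here's a toy sketch of the idea; the names, commands, and port are made up, and the SSDP discovery step that lets Alexa find the bridge is omitted (Home Assistant's emulated_hue handles that in a real setup):

    # Toy sketch of the emulated-Hue trick: a local HTTP server pretends to
    # be a Philips Hue bridge, and turning a fake "light" on actually runs a
    # command. Real setups use Home Assistant's emulated_hue integration,
    # which also handles SSDP discovery; this sketch omits that part.
    import json
    import subprocess
    from http.server import BaseHTTPRequestHandler, HTTPServer

    # Fake "lights" mapped to arbitrary commands you want voice-toggled.
    SCRIPTS = {
        "1": {"name": "Sprinklers", "cmd": ["echo", "sprinklers on"]},
        "2": {"name": "Porch Fan", "cmd": ["echo", "fan on"]},
    }

    class FakeHueHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # Alexa enumerates devices via GET /api/<user>/lights
            lights = {lid: {"name": s["name"], "state": {"on": False}}
                      for lid, s in SCRIPTS.items()}
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(json.dumps(lights).encode())

        def do_PUT(self):
            # "Alexa, turn on Sprinklers" -> PUT /api/<user>/lights/1/state
            # (the {"on": true} request body is ignored in this sketch)
            parts = self.path.strip("/").split("/")
            light_id = parts[-2] if parts[-1] == "state" else parts[-1]
            if light_id in SCRIPTS:
                subprocess.run(SCRIPTS[light_id]["cmd"], check=False)
            self.send_response(200)
            self.end_headers()

    # Real Hue bridges listen on port 80; 8080 here avoids needing root.
    HTTPServer(("0.0.0.0", 8080), FakeHueHandler).serve_forever()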
My wife is a doctor and she already has an AI scribe that records visits, transcribes them, summarizes, and prepares her orders.
It seems like having that scribe also review lab results wouldn’t be far off, but we are still not ready to let the AI system interpret and inform the patient without human intervention.
The problem here wasn't that a human missed the low platelet count, it's just that it hadn't been reviewed yet. Both a doctor and an AI system would have spotted the issue immediately.
I am not saying this isn't a great result, but the AI intervention must come from the patient's side, not the care provider's side. The providers have too much at risk, while the patient is able to pick and choose their own AI and decide what level of risk they are able to accept.
It doesn't need "AI" or a "Scribe" to spot a low platelet count in lab results either, computer systems have been able to do this for decades.
There's a systemic failure that a computer wasn't already flagging up the low count for immediate follow-up.
What I find interesting is that AI (or more specifically LLMs / ChatGPT) has had such an impact that it's broken down barriers for people who previously would be against any computer involvement or automation. People have gone from "You must not automate anything" to "We should feed everything into ChatGPT".
And from someone who has in the past been interested in statistical modelling, it's dizzying. It's as if statistics were seen as boring and fallible, but "AI" is exciting and wonderful.
The main issue is that some developer has to be told to make it so that if plateletCount < 25000 then plateletCountColour = "bright red" (see the sketch below). That is, a big issue in software development is managing requirements.
AI is more flexible and can make up requirements on the go; that is, based on the information it has, it "knows" that a low platelet count is bad and what the recommended course of action is.
Of course, search engines and public websites do the same thing - "what should I do if I have a low platelet count" or "low platelet count and red skin dots" are valid search queries.
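To make the requirements point concrete: the hard-coded version really is only a few lines, and every rule and threshold in it has to come from someone writing a spec. A minimal sketch (the 25,000/µL cutoff is illustrative, not clinical advice):

    # Minimal sketch of the hard-coded approach described above: every rule
    # and threshold has to be specified up front in a requirements document.
    # The cutoff is illustrative only, not a clinical recommendation.

    CRITICAL_PLATELET_COUNT = 25_000  # per microlitre

    def flag_colour(test_name: str, value: float) -> str:
        if test_name == "platelet_count" and value < CRITICAL_PLATELET_COUNT:
            return "bright red"  # surface for immediate review
        return "default"

    print(flag_colour("platelet_count", 12_000))   # -> bright red
    print(flag_colour("platelet_count", 240_000))  # -> default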
Most modern EMRs will flag results like this. If it happened to arrive in the middle of the night like the author said, the PCP would have just been asleep. I don’t think there are systems for waking up PCPs for concerning lab results.
Nor should there be. PCPs are basically by definition for non-emergent situations. If the PCP was at all concerned that the patient had something that was an immediate threat, they would have sent them to the ER in the first place.
Having someone on-call to deal with routine lab results that just happen to catch a possible imminent threat would be an enormous waste of resources.
This would be enormously expensive. The number of things an EMR thinks are an imminent threat will vastly outnumber the things that actually are, thanks to liability concerns.
So now you need an extra on-call doctor just to filter out the false positives, or the on-call doctor gets notification fatigue and ignores them.
Or in the best case you send a lot of people to ER in the middle of the night for no reason. Again thanks to liability concerns, but this time on the part of the primary care practice.
I believe it's because statistics can be interpreted in different ways and requires analysis (work). "AI" just gives you an answer, even if it's wrong.
I think there needs to be a third option. Today I can plug a device into just about any vehicle and diagnose / debug tens of thousands of issues. This needs to be a thing for humans.
I should be able to put a few drops of each of my bodily fluids into a set of disposable receptacles or strips and have it test for hundreds of thousands of issues initially, and millions in the future. A pocket fictional Dr. House, so to speak. This would distribute the testing workload to people at home or work, give doctors something better to start with than only perceived symptoms, and allow prioritizing patients during triage. It should find everything from the obvious emergencies to the most obscure anomalies that could be addressed at a later time. It should also be able to tell me what to start or stop eating.
There should also be a high-risk-accepted mode where the device can make guesses about what is occurring based on all the data it gathered, even if it does not have 100% of the scientific data to do so. All this in a sub-$1000 device I can buy online without a prescription. This device must not have any way to leak patient data. Connect it to a printer via USB and it just makes a hard copy to hand to a doctor, or they could just read the device with their eyeballs. No cell phone required. No wifi or bluetooth. No dial-home. No uploading to LLMs. No cloud. Patients can download an updated OS and firmware image for updated medical data and update using a thumb drive or micro-SD.
I said in another thread that I think the first step, at least, can be an LLM helping triage/prioritize the test results so docs at least know an order in which to review them, especially with radiology/imaging results, which are mostly word-based rather than the numerical results that often come out of blood tests.
I can't tell if you're trying to say that ChatGPT hasn't added anything here?
Yes I'm sure a human doctor would have noticed the low platelet count -- if they bothered to look at it. Which they didn't (or at least hadn't got around to yet).
Sorry, I edited my comment with a final statement. I think ChatGPT added a lot, and I think the future of AI in medicine is with patients bringing their own AI to help interpret the results.
On the provider side, it might be useful to automate the analysis of various patients, particularly ones who have an elusive diagnosis. I hear stories of some patients who are sent from specialist to specialist trying to figure out what's wrong with them, where the symptom is obvious but the cause is not. I had a colleague who had vertigo and it took 6 months to figure out what was going on. She couldn't work or drive. AI/ML might help in that scenario on the provider side. I understand it's pretty good at analyzing imaging as well.
If the patient agrees to it (and is informed of it), I can definitely see value in automatically recording, transcribing and logging a visit (as long as the doctor does a final approval / editing phase). Part of me wants to enable that by default in some of our Teams meetings too, because for some reason they often end up chaotic and emotional (enthusiastic / energetic, I mean, instead of businesslike), and it's difficult to bring them back to some core points.
Her legal team said they are allowed to get verbal consent. She has a sentence she says verbatim, something along the lines of: "I am going to use this computer to record this visit and take notes. Is that OK?"
If I heard that, I would think the person asking is going to take the notes, not feed the recording into an LLM.
I do NOT want a transcription ERROR in my medical records.
I bet if she says something like "AI will transcribe" then there will be more hesitation, although maybe not more refusals because most people don't know how it can make mistakes.
Anecdotally, she has reported it’s _much_ better than her human scribes. And she is still ultimately responsible for the content and reads it all before signing the note.
I believe so, yes. This is nothing new however as "tele scribe" companies have existed for a long time. These sort of cases are exactly what HIPAA regulations were designed to enable.
That sounds like the old trick of pulling out a pen and paper and saying "I'm going to record this, okay?" with a tape recorder in your briefcase.
Are they informed that their personal medical information will be shared with an unknown set of third parties / LLMs subject to manual QA / cloud services with telemetry?
Honestly sounds like she's opening herself up to massive lawsuits (depending on State law). I would reach out to the legal department if any doctor pulled this kind of shit on me.
If you think the clinic's lawyers haven't already reviewed this setup with a fine-tooth comb, then you haven't seen how medical lawyers work. This isn't a rogue PCP surreptitiously recording visits. It's a contractor that was selected and negotiated with from on high, then deployed to all the clinics in her large company.
Also, I guarantee you signed a paper when you first were enrolled with your PCP giving them permission to share your medical records with third parties when needed (subject to HIPAA compliance of course).
I like Aider but I've turned off auto-commit. I just can't seem to let the AI actually commit code for me. Do you regularly let Aider commit for you? How much do you review the code written by it?
I originally was against auto commit as well, but now I can’t imagine not using it. It’s essentially save points along the way. More than once, I’ve done two or three exchanges with Aider only to realize that the path that we were going down was not a good one.
Being able to reset back to the last known good state is awesome. If you turn off auto-commit, it's a lot harder to undo one of the steps that the model takes. It's only a matter of time until it creates nonsense, so you'll really want the ability to roll it back.
Just work in a branch and you can merge all commits if you want at the end.
The auto-commits of Aider scared the crap out of me at first too, but after realizing I can just create a throwaway branch and let it run wild it ended up being a nice way to work.
I've been trying to use Sonnet 3.7 tonight through the Copilot agent, and it gets frustrating to see the API 500 halfway through the task list, leaving the project in a half-baked state, and then not feeling like I have a good "auto save" to pick up from again.
I create a feature branch, do the work and let it commit. I check the code as I go. If I don't like it, then I revert to a previous commit. Other times I write some code that it isn't getting right for whatever reason.
The beauty of git is that local commits don't get seen by anybody until you push. So you can commit early and commit often, since no one else is gonna see it, which gets you checkpoints before, during, and after you dive into making a big breaking change in the code. Once you've got something you like, then you can edit, squash, and reorder the local commits and clean them up for consumption by the general public.
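If you want that checkpoint loop without trusting any tool's auto-commit, it's only a handful of git calls. A sketch (the branch names and commit messages are placeholders; run inside an existing repository):

    # Sketch of the save-point workflow with plain git via subprocess.
    import subprocess

    def git(*args: str) -> None:
        subprocess.run(["git", *args], check=True)

    git("checkout", "-b", "aider-scratch")  # throwaway branch for the session

    # After each exchange with the model, checkpoint:
    git("add", "-A")
    git("commit", "-m", "aider: checkpoint")

    # If the model goes off the rails, roll back one step:
    git("reset", "--hard", "HEAD~1")

    # When happy, squash everything back onto your main branch:
    git("checkout", "main")
    git("merge", "--squash", "aider-scratch")
    git("commit", "-m", "feature: squashed aider session")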
Have you tried Claude 3.7 + Deepseek as the architect? Seeing as "DeepSeek R1 + claude-3-5-sonnet-20241022" is the second place option, "DeepSeek R1 + claude-3-7" would hopefully be the highest ranking choice so far?
I am sure he "saw" real returns on a very real looking app or website. As we transition to a cashless society we are all getting use to the numbers on our computers and phones representing real money.
My paystub is digital, it goes into my bank account directly and the numbers on my computer go up. I spend money by tapping a computer chip onto another computer chip, and then the numbers in my bank account drop. I can also digitally transfer those numbers to a brokerage account and click a few buttons, and then the numbers go up and down depending on how people are feeling about the stock market. In the past few years, seemingly always up, which I think is priming young or naive investors to believe investments never fail.
> I am sure he "saw" real returns on a very real looking app or website.
Exactly: "Hanes told Mitchell a confusing story. Not long ago, Hanes explained, he started investing in cryptocurrencies with the help of some people he met online. First, he and his partners deposited money on a reputable U.S. platform for buying and selling crypto. The profits were enormous, he said: He took out his phone to show Mitchell his account balance, which seemed to indicate that the investment was worth $40 million."
And frankly, that's an entirely reasonable result for having invested in Bitcoin at any number of points in time, which was also by all reasonable measures a bad idea. I think even people who support the existence and value of crypto would agree that Bitcoin winners are more like stock-lottery winners than particularly savvy investors.
Crypto is great for scams even beyond its infrastructural advantages, because a lot of people made a ton of money from investing money they couldn't afford to lose in things that were pretty indistinguishable from the actual scams.
You might be able to do withdrawals on a scam exchange, and they might even hit your bank account.
The difference is that once the withdrawals become big enough, the exchange will start erroring out (or hit an account limit) and your friendly associate will vanish into the night.
There is a difference between seeing numbers go up on an app, and getting money into an actual bank that you know to be real.
The fact that you're even arguing this point shows how easy it is for people to fall for this crap.
Anyone can make a real looking app that makes it look like you're making money when they've already stolen all your money and spent it on hookers and coke on yachts.
Not anyone can start a real bank in the US, and if they can, you generally have protections in place to not need to worry the bank is that egregiously fraudulent.
I think younger investors know and understand that the market can fall, but this is still different than having lived through the market falling and how dramatic that can be.
> I think is priming young or naive investors to believe investments never fail.
Investments fail, there are plenty in the news. Broad market indices, however, don’t fail. There have been numerous bailouts over the previous decades. Why would one assume any future government wouldn’t continue bailouts if all the previous ones did?
The act of a bailout couples the credit of the rescuing organization with that of the rescued. That's literally what a bailout is: the rescuer agrees to take on some of the losses and credit risk, usually in return for agreements for future payments and power over how the business is restructured and managed. In the process, the credit and assets of the rescuing organization are damaged, and the bailed out organization is saved. Purely financial transactions never affect the actual reality on the ground, only how risks, responsibilities, and rewards are apportioned.
When the organization is as big as the U.S. government and has as good credit as the U.S. did in 2008, you can save an awful lot of financial institutions. But if it gets to the point where everybody expects to be bailed out and people start acting accordingly, you can't. Eventually the government ends up falling, as people start realizing that the economy isn't actually working and everybody is just cooking the books with financial transactions.
Government policy makers know this, and their livelihood is dependent upon the continued existence of the government, and so at some point they declare "Nope, bailout is not going to happen this time. You're on your own." At that point, the last group of people who took stupid financial risks are left holding the bag. It's very much like a pyramid scheme: the going is good as long as you can find a greater fool to assume the risk from you, but at some point there are no greater fools, and you find out the greater fool was you.
>At that point, the last group of people who took stupid financial risks are left holding the bag.
Isn’t that all the policy makers, old voters, taxpayer funded DB pension funds, etc that depend on broad market equity index fund returns?
Obviously, the system breaks down when it breaks down (when the currency has no purchasing power left to lose), but until then, the entire political apparatus is incentivized to bail out asset prices.
And if that isn’t possible, then the status of your investments/brokerage/bank accounts is going to be the least of your worries, as you will have more immediate concerns about procuring food/energy/shelter/security.
We've never had an executive whose financial policy positions include eliminating the FDIC and holding that much of the federal deficit is fraudulent and so can safely be disregarded and left unpaid. Believing in continuity at this time is a poor bet.
235 CE Rome? The Western Roman Empire lasts another two centuries- Romulus Augustus is deposed in 476 CE. That's longer than the US Constitution has existed (ratified in 1788).
In many respects the deposing of the last Western Roman Emperor is a bad date to use for the "Fall of Rome", given that the general socioeconomic trends of the time are fairly continuous for that period and contemporary sources didn't place much value on the shift of politics.
The Crisis of the Third Century, which starts in 235, is where the inflection point between "broadly stable" and "broadly negative" sets in, and the shocks both of the Crisis of the Third Century and the Plague of Justinian are each larger than the shock of the deposing of the last Western Roman Emperor.
I agree that Augustulus and Odoacer mattered to almost no one by 476, which is why it happened the way it did.
However, the point I was trying to make is that Rome's decline sure lasted a very long time indeed.
Additionally, 235 being so important is only obvious in hindsight. To anyone living in the Empire in 235, I suspect it felt an awful lot like the Year of the Five Emperors (about as far before 235 as Diocletian's accession is after it), and it wasn't until a while later that it became clear that no Septimius Severus type was going to be able to put it all back together quickly.