Debunking Devin: "First AI Software Engineer" Upwork Lie Exposed [video] (youtube.com)
302 points by smukherjee19 57 days ago | 43 comments



An extremely solid and convincing rebuttal. Sad. I wonder what the Devin team will say in response, if anything. Summarizing the video:

• Devin is sold as being able to solve arbitrary Upwork tasks. In the video demo the problem it was asked to solve doesn't match the stated requirements of the customer (who asked for setup instructions, not code).

• Devin is shown fixing errors in the source of a GitHub repo, but the files it's shown editing don't actually exist in that repo, and some of the errors it's fixing are nonsensical, of the type that'd never be made by a human. Inference: Devin must be fixing bugs in files it has itself created, but that's not clearly indicated.

• There is no need to do any coding in the first place, because the README in the repository has all the instructions needed to achieve the task ready to go and they still work fine with only a one-line tweak, even though the repository is old. This is why the customer asked for instructions for how to run it on EC2 rather than for some coding. Devin didn't seem to read the README or understand that it only had to execute a couple of pre-existing Python scripts. The output in the video makes it look like the task was complex and sophisticated, with a long plan and many check boxes showing work completed, but the work was in fact pointless and redundant.

• Devin's code changes are bad, e.g. writing its own low-level file-read loop instead of using the standard library properly (see the sketch below this list).

• Although the video makes it look like Devin did the task quickly, and the video creator was able to do the requested task in ~30 minutes, the timestamps in the chat show the task stretching over many hours and even into the next day.

• Devin does nonsensical shell commands like `head -n 5 foo | tail -n 5`.
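
On those last two points, a concrete illustration may help. The pipe `head -n 5 foo | tail -n 5` is redundant: taking the last five of the first five lines just returns those same five lines, so `head -n 5 foo` alone does the same thing. And here is a rough sketch (my own illustration, not Devin's actual output) of the difference between the kind of hand-rolled file-read loop the video calls out and the idiomatic standard-library approach in Python:

```python
# Hand-rolled chunked read loop of the kind the video criticizes
# (illustrative only, not Devin's actual code): manually accumulating
# raw bytes and decoding them at the end.
def read_file_manually(path):
    chunks = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(4096)
            if not chunk:
                break
            chunks.append(chunk)
    return b"".join(chunks).decode("utf-8")

# The idiomatic standard-library equivalent: a single call.
def read_file(path):
    with open(path, encoding="utf-8") as f:
        return f.read()
```

Both return the same text for an ordinary UTF-8 file; the point is just that the second version is what a human engineer would normally write.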

The strange mistakes lead to questions about what underlying model it's using. I don't think GPT-4 would make mistakes like that.

The Internet of Bugs guy is an AI fan and uses coding AI himself, but points out that the company behind it says you can "watch Devin get paid for doing work", which isn't actually supported by their video evidence when watched carefully.


Hearing this just makes me sick.

Like, how fake did you want to be?

Also > Devin does nonsensical shell commands like `head -n 5 foo | tail -n 5`

Why is Devin executing this code, like why?


In other words, they managed to successfully recreate most people's experience on Upwork.


In other words, they managed to fake it until they make it! Like most visionaries in Silicon Valley: lie now, tweet about it, promote it through fake influencers with their mouths open on YouTube, get that VC money without any due diligence, hire smart people, and force them to do it!


You are supposed to make it before you get caught faking it.


a) Taking notes

b) On the topic of notes: what are the odds of this being your first comment, on a 2-year-old account, and me taking note? A little sus.


Underrated


It is confirmed to be using GPT-4, though (not sure which version).


I like hearing this balanced opinion. Generative AI is awesome, but demos around it should be honest and transparent. Looking at you, Google, as well. I don't know if I can trust a single Google demo for a while. They must do zero-edit, zero-cut demos for quite some time in order to build that trust. Also, this fake, happy-go-lucky, childlike communication after the edits/cuts is cringey. Unless you are doing it in real time like OpenAI did.

Based on the slogans around Devin, I decided to ignore it completely. So while I couldn't say it's BS for sure, I did feel the slogans were embellished and too good to be true.

Also, for some reason, I don't like the name at all. I don't understand how it could be so poorly chosen. Not that names can always give insight, but this name somehow was so off-putting to me.


From the youtube comment section:

> I really hate how normalized faking it in demos has become

I fully agree!


but that is the difference between raising millions and not

of course founders are going to lie and make up faked tech demos

feels like we are close to a Minsky moment for the AI bubble


What signals would you look for, and what delay would you expect from each to the actual pop?

PS: totally not asking to know when to sell my shovels, I mean NVIDIA stocks.


If you've ever attempted to extract a meaningful few-shot response to a non-trivial coding question from an LLM, this shouldn't be a surprise.

That said, I have worked with actual humans in the industry who perform this badly, and that is still a significant achievement for a software program.


I was quite skeptical of this. I've seen another company claiming to do 3D generation do the same with their demos: they outsourced the "3D generation" part to low-wage workers in third-world countries and claimed the models were generated by AI. I see this as a future trend: get investor money, then do the rug pull.


Attention is all you need, and faking it in a demo gets it.

Even if you only deliver a decent-enough product, it will sell now.


Oh ok I guess we are making a fuss about nothing then.


It was all hype. There's barely been an AI product that was hyped and didn't turn out to be subpar a few weeks later.


Great video. As a counterpoint to the video author's claims, it's worth pointing out that Devin doesn't have to be anywhere close to as fast as a human software engineer to be useful. Even if it turns an hour task into a day-long task, it's still going to cost a fraction of what it would cost to pay the engineer, so the bar it must meet is quite low.


I've often felt that reading and reviewing bad code is more cognitively challenging than writing my own code. Add to that the unreliability of not knowing when the code will be delivered (so now there's a chance I'll be blocked if Devin can't finish when I need it to), and I can't see how this would be useful. I'm sure there will come a point when the code that Devin writes is easy to review, just like ChatGPT is able to write quite meaningful English sentences, but I worry that it will make things even worse, because now you can have correct-looking code that's riddled with bugs.


The video doesn't claim that Devin isn't useful. The video is about the claims Devin's company is making: 1) lying outright about Devin completing and getting paid for Upwork jobs, and 2) declaring that Devin is an "AI Software Engineer", which is, at best, a huge exaggeration.


There's still the cost of reviewing bad code. If the task was "completed" but you have to spend 20 minutes looking over the code, or the whole hour-long process, to "understand" what it did, it still failed.


Honestly, I don't fully agree, at least for now. AI writes too much code that isn't optimized until you go in and ask it to edit a certain block in a certain way. At least that's been my experience so far.

So the cost of having to go through all the generated code afterward and do a ton of code reviews/edits is higher than the cost of giving a good engineer a good AI.

I'm sure though that in a year or two we'll just be doing reviews, and edits will be rare...


If you use it once, maybe. If you use it a thousand times it will pollute the code base to the point where it is useless.


For complex systems composed of long chains of black-box units with randomness, if we evaluate their output along the three dimensions of precision, stability, and size, we should not set our expectations too high.



Anyone have any tips to getting access? I would love to test some things first hand to evaluate some of the claims.


That's why open source is important; you can't fake it there. https://github.com/princeton-nlp/SWE-agent


The UI of Devin is quite nice. Does anyone know to what degree it's inspired by other tools on the market?


AI bros, is it over? Did we go too far?


Picking apart Devin based solely on the demo video, while ignoring all of the primary-source testimonials on Twitter as to Devin's effectiveness, seems somewhat intellectually dishonest... A demo video will of course cherry-pick impressive-looking moments, even if they're not really representative.


You can only report on facts.

They failed to provide any examples of facts with regard to Devin.

This is like arguing that it’s not fair to critique people claiming to have made superconductors because “some people said they are really superconductors” but no one can share samples with anyone for some reason.

A reasonable counter argument would be:

> Here is evidence of Devin actually doing things.

How, other than from the available evidence, was anyone supposed to evaluate Devin?

There is a broad opportunity for the developers to respond to this, but they haven’t.

Why is that?

It is because he’s right.

Regardless of what Devin can do, that video was deceptive and misleading. There are no two ways about it.


I don't trust anecdotes on Twitter, because every time I've tried an agent that's been hyped up, it's been more expensive and time-consuming than just using GitHub Copilot with Claude/ChatGPT and putting up a PR myself.

Hence I'm skeptical of people making claims about a product I can't try out myself. It's unclear if the tasks they are doing, and the way they are using agents, are relevant to the work I do, which is usually working on a team of engineers shipping code on a complex code base.

For AI I tend to put a lot more weight on benchmarks, such as SWE-bench, which is why I wrote an article about it:

https://www.stepchange.work/blog/why-do-ai-software-engineer...

SWE-bench is mostly small Python tasks, evaluated solely by unit tests, that require fewer than 15 lines of changes to a single file. Devin fails at most of those, and on the ones it gets right it ignores all sorts of libraries and conventions used in the rest of the code base.

I'm optimistic that agents will improve dramatically in a few years, but today Devin is not good at making larger changes that build on one another, like features.


The company said in the description of the demo video that Devin did something in the demo video that Devin clearly did not do in the demo video.

That's a lie, pure and simple, and no statements made elsewhere can make that lie any less a lie.


That's backwards; depending on "primary source testimonials on Twitter" is grossly intellectually dishonest.


"Primary source testimonials on"...Twitter? Are you serious?


Curious to know the top 3 products that 'mukherjee' finds good enough.


Faking it in a demo video is now a necessary evil to get virality, because social media is about virality. Everyone commenting on the video and in this thread is now curious to try Devin, to prove it works or to prove it doesn't.

So now, if it works, faking helped it get virality, more users, more demand for the product.

If it doesn't work well enough, it will still be good enough for some of the users who discovered it because of the virality.

The only worst case is if it is hopelessly bad, doesn't work at all, or shot for the moon and got nowhere. Hope the founders are smart enough not to be that bad.


You are mistaken. Lying about a product's capabilities, trying to pass something off as what it isn't, might work for a while, but eventually people find out (even with heavy marketing) and word gets around.

When you are selling something, you must be absolutely honest about what you are delivering. If you can't do it, don't claim it! Not delivering makes you lose trust.

Scott Wu's option here is to keep the lie going, or just throw in the towel and say: hey, AI was hype; it's good at summarizing text and is a decent code assistant, but it's not going to replace human software engineers for a long time.

Which do you think he's going to take? Whichever is going to result in $$$.


No one should do business, or collaborate academically, with anyone associated with the video or company again. To co-author with anyone involved in this is misconduct, and every university should be aware of that: they are liars; they cannot be trusted to do research. The VCs should retrieve all unspent cash and sue the founders.

They can reflect on what they did looking at the canvas of the inside of their tents in a homeless camp.


I disagree. There are no actual lies in the video itself, and I don't think that, for example, the engineer who is narrating the video is responsible for writing the description of the video (which is where the actual lie was).

The person (probably in marketing) that made the false claim is at fault, and any manager involved who did not stop the lie is at fault.

The software developers who are working on Devin's code likely had no control of or idea about how the video was going to be marketed.

There have been many times that I've been part of a team that built a product we were proud of, and had some business or sales person at our company, over our objections, make claims about it to a customer (or potential customer) that the sales person had been told were not true.


I had an example at work recently where this happened - so I agree with you. You objected, which was right.

I rang the customer and explained the misunderstanding. I discussed the problem with the sales folks and it was agreed that this shouldn't have been said.

I feel ok about it, there was an exaggeration, I corrected it, we moved on.

That's where the difference is. If the developers and the narrator were unaware that these were lies then they should have been aware. If they were aware then they should have objected. I see no evidence that they objected.


I'm not curious to try Devin. LLM agents suck and have sucked since the first faked demo two years ago. I'll leave it to the real labs (OpenAI, Anthropic, DeepMind, FAIR) to make real advances.


I liked Steve Yegge's quote on the Latent Space podcast late last year: "I'll believe in agents when somebody shows me one that works."

That's really all I require, just show me an agentic workflow that doesn't routinely implode at various stages and I'll buy in to some of the broader claims about the future of agents.




