
Hi, I lead the teams responsible for our internal developer tools, including AI features. We work very closely with Google DeepMind to adapt Gemini models for Google-scale coding and other software engineering use cases. Google has a unique, massive monorepo which poses a lot of fun challenges when it comes to deploying AI capabilities at scale.

1. We take a lot of care to make sure the AI recommendations are safe and have a high quality bar (regular monitoring, code provenance tracking, adversarial testing, and more).

2. We also do regular A/B tests and randomized controlled trials to ensure these features are improving SWE productivity and throughput.

3. We see similar efficiencies across all programming languages and frameworks used internally at Google, and engineers across all tenure and experience cohorts show similar gains in productivity.

You can read more on our approach here:

https://research.google/blog/ai-in-software-engineering-at-g...

I'm continually surprised by the amount of negativity that accompanies these sorts of statements. The direction of travel is very clear - LLM based systems will be writing more and more code at all companies.

I don't think this is a bad thing - if this can be accompanied by an increase in software quality, which is possible. Right now it's very hit and miss and everyone has examples of LLMs producing buggy or ridiculous code. But once the tooling improves to:

1. align produced code better to existing patterns and architecture

2. fix the feedback loop - with TDD, other LLM agents reviewing code, feeding in compile errors, letting other LLM agents interact with the produced code, etc. (a minimal sketch of such a loop follows below)

Then we will definitely start seeing more and more code produced by LLMs. Don't look at the state of the art now, look at the direction of travel.
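
To make point 2 concrete, here is a minimal, hedged sketch of such a feedback loop. Everything in it is hypothetical: `call_llm` is a stand-in for whatever model API you use, and the test command is assumed to be pytest.

    import subprocess

    def call_llm(prompt: str) -> str:
        # Hypothetical stand-in for a real model API call.
        raise NotImplementedError

    def generate_with_feedback(task: str, max_rounds: int = 3) -> str:
        # Ask for code, run the test suite, feed failures back, repeat.
        code = call_llm(task)
        for _ in range(max_rounds):
            with open("candidate.py", "w") as f:
                f.write(code)
            result = subprocess.run(["python", "-m", "pytest", "-q"],
                                    capture_output=True, text=True)
            if result.returncode == 0:
                break  # tests pass; keep this candidate
            code = call_llm(task + "\n\nThe tests failed:\n" + result.stdout
                            + "\nPlease fix the code.")
        return code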


> if this can be accompanied by an increase in software quality

That’s a huge “if”, and by your own admission not what’s happening now.

> other LLM agents reviewing code, feeding in compile errors, letting other LLM agents interact with the produced code, etc.

What a stupid future. Machines which make errors being “corrected” by machines which make errors in a death spiral. An unbelievable waste of figurative and literal energy.

> Then we will definitely start seeing more and more code produced by LLMs.

We’re already there. And there’s a lot of bad code being pumped out. Which will in turn be fed back to the LLMs.

> Don't look at the state of the art now, look at the direction of travel.

That’s what leads to the eternal “in five years” which eventually sinks everyone’s trust.


> What a stupid future. Machines which make errors being “corrected” by machines which make errors in a death spiral. An unbelievable waste of figurative and literal energy.

Humans are machines which make errors. Somehow, we got to the moon. The suggestion that errors just mindlessly compound and that there is no way around it is what's stupid.


> Humans are machines

Even if we accept the premise (seeing humans as machines is literally dehumanising and a favourite argument of those who exploit them), not all machines are created equal. Would you use a bicycle to file your taxes?

> Somehow, we got to the moon

Quite hand-wavy. We didn’t get to the Moon by reading a bunch of text from the era, then probabilistically joining word fragments, passing that around the same funnel a bunch of times, then blindly doing what came out, that’s for sure.

> The suggestion that errors just mindlessly compound and that there is no way around it

Is one that you made up, as that was not my argument.


LLMs are a lot better at a lot of things than a lot of humans.

We got to the moon using a large number of systems to a) avoid errors where possible and b) build in redundancies. Even an LLM knows this and knew what the statement meant:

https://chatgpt.com/share/6722e04f-0230-8002-8345-5d2eba2e7d...

Putting "corrected" in quotes and saying "death spiral" implies error compounding.

https://chatgpt.com/share/6722e19c-7f44-8002-8614-a560620b37...

These LLMs seem so smart.


> LLMs are a lot better at a lot of things than a lot of humans.

Sure, I'm a really poor painter; Midjourney is better than me. Are they better than a human trained for that task, on that task? That's the real question.

And I reckon the answer is currently no.


The real question is: can they do a good enough job quickly and cheaply to be valuable? I.e., quick and cheap at some level of quality is often "better". Many people are using them in the real world because they can do in 1 minute what might take them hours. I personally save a couple of hours a day using ChatGPT.

Ah, well then, if the LLM said so then it’s surely right. Because as we all know, LLMs are never ever wrong and they can read minds over the internet. If it says something about a human, then surely you can trust it.

You’ve just proven my point. My issue with LLMs is precisely people turning off their brains and blindly taking them at face value, even arduously defending the answers in the face of contrary evidence.

If you’re basing your arguments on those answers then we don’t need to have this conversation. I have access to LLMs like everyone else, I don’t need to come to HN to speak with a robot.


You didn't read the responses from an LLM. You've turned your brain off. You probably think self-driving cars are also a nonsense idea. Can't work. Too complex. Humans are geniuses without equal. AI is all snake oil. None of it works.

You missed the mark entirely. But it does reveal how you latch on to an idea about someone and don’t let it go, completely letting it cloud your judgement and arguments. You are not engaging with the conversation at hand, you’re attacking a straw man you have constructed in your head.

Of course self-driving cars aren’t a nonsense idea. The execution and continued missed promises suck, but that doesn’t affect the idea. Claiming “humans are geniuses without equal” would be pretty dumb too, and is again something you’re making up. And something doesn’t have to be “all snake oil” to deserve specific criticism.

The world has nuance, learn to see it. It’s not all black and white and I’m not your enemy.


Nope, hit the mark.

Actually understand LLMs in detail and you'll see it isn't some huge waste of time and energy to have LLMs correct outputs from LLMs.

Or, don't, and continue making silly, snarky comments about how stupid some sensible thing is, in a field you don't understand.


> These LLMs seem so smart.

Yes, they do *seem* smart. My experience with a wide variety of LLM-based tools is that they are the industrialization of the Dunning-Kruger effect.


It's more likely the opposite. Humans rationalize their errors out the wazoo. LLMs are showing us we really aren't very smart at all.

Humans are obviously machines. If not, what are humans then? Fairies?

Now once you've recognized that, you're better equipped for the task at hand - which is augmenting and ultimately automating away every task that humans-as-machines perform, by building an equivalent or better machine that performs said tasks at a fraction of the cost!

People who want to exploit humans are the ones that oppose automation.

There's still a long way to go, but we've finally reached a point where some tasks that were very elusive to automation are starting to show great promise of being automated, or at least greatly augmented.


Profoundly spiritual take. Why is that the task at hand?

The conceit that humans are machines carries with it such powerful ideology: humans are for something, we are some kind of utility, not just things in themselves, like birds and rocks. How is it anything other than an affirmation of metaphysical/theological purpose particular to humans? Why is it like that? This must be coming from a religious context, right?

I at least cannot see how you could believe this while sustaining a rational, scientific mind about nature, cosmology, etc. Which is fine! We can all believe things; just know you can't have your cake and eat it too. Namely, if anybody should believe in fairies around here, it should probably be you!


> Why is that the task at hand?

Because it's boring stuff, and most of us would prefer to be playing golf/tennis/hanging out with friends/painting/etc. If you look at the history of humanity, we've been automating the boring stuff since the start. We don't automate the stuff we like.


Where's the spiritual part?

Recognizing that humans, just like birds, are self-replicating biological machines is the most level-headed way of looking at it.

It is consistent with observations and there are no (apparent) contradictions.

The spiritual beliefs are the ones with the fairies, binding of the soul, being made of a special substrate, beyond reason and understanding.

If you have a desire to improve the human condition (not everyone does) then the task at hand naturally arises - eliminate forced labour, aging, disease, suffering, death, etc.

This all naturally leads to automation and transhumanism.


> Humans are obviously machines. If not, what are humans then? Fairies?

If humans are machines, then so are fairies.


The difference is that when we humans learn from our errors, we learn how to make them less often.

LLMs get their errors fed back into them and become more confident that their wrong code is right.

I'm not saying that's completely unsolvable, but that does seem to be how it works today.


That isn't the way they work today. LLMs can easily find errors in outputs they themselves just produced.

Start adding different prompts, different models and you get all kinds of ways to catch errors. Just like humans.


I don’t think LLMs can easily find errors in their output.

There was a recent meme about asking LLMs to draw a wineglass full to the brim with wine.

Most really struggle with that instruction. No matter how much you ask them to correct themselves they can’t.

I’m sure they’ll get better with more input but what it reveals is that right now they definitely do not understand their own output.

I’ve seen no evidence that they are better with code than they are with images.

For instance, if the time to complete only scales with the length of the output in tokens, and not with the complexity of its contents, then it's probably safe to assume the content is not being comprehended.


> LLMs can easily find errors in outputs they themselves just produced.

No. LLMs can be told that there was an error and produce an alternative answer.

In fact LLMs can be told there was an error when there wasn't one and produce an alternative answer.



https://chatgpt.com/share/672331d2-676c-8002-b8b3-10fc4c8d88...

In my experience, if you confuse an LLM by deviating from the "expected", then all the shims of logic seem to disappear, and it goes into hallucination mode.


Try asking this question to a bunch of adults.

Tbf that was exactly my point. An adult might use 'inference' and 'reasoning' to ask for clarification, or go with an internal logic of their choosing.

ChatGPT here went with lexicographical ordering in Python for some reason, and then proceeded to make false statements from false observations, while also defying its own internal logic.

    "six" > "ten" is true because "six" comes after "ten" alphabetically.
No.

    "ten" > "seven" is false because "ten" comes before "seven" alphabetically.
No.
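
For reference, Python's actual lexicographic comparison (checked in a plain interpreter, not taken from the linked chat) contradicts both of those claims:

    >>> "six" > "ten"    # 's' < 't', so "six" sorts before "ten"
    False
    >>> "ten" > "seven"  # 't' > 's', so "ten" sorts after "seven"
    True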

From what I understand of LLMs (which - I admit - is not very much), logical reasoning isn't a property of LLMs, unlike information retrieval. I'm sure this problem can be solved at some point, but a good solution would need development of many more kinds of inference and logic engines than there are today.


Do you believe that the LLM understands what it is saying and is applying the logic that you interpret from its response, or do you think it's simply repeating similar patterns of words it's seen associated with the question you presented it?

If you take the time to build an (S?)LM yourself, you'll realize it's neither of these. "Understands" is an ill-defined term, as is "applying logic".

But an LLM is not "simply" doing anything. It's extremely complex and sophisticated. Once you go from tokens into high-dimensional embeddings... it seems these models (with enough training) figure out how all the concepts go together. I'd suggest reading the word2vec paper first, then think about how attention works. You'll come to the conclusion that these things are likely to be able to beat humans at almost everything.


You said humans are machines that make errors and that LLMs can easily find errors in outputs they themselves produce.

Are you sure you wanted to say that? Or is the other way around?


Yes. Just like humans. It's called "checking your work" and we teach it to children. It's effective.

> LLMs can easily find errors in outputs they themselves just produced.

Really? That must be a very recent development, because so far this has been a reason for not using them at scale. And no one is.

Do you have a source?


Lots of companies are using them at scale.

To err is human. To err at scale is AI.

I fear that we'll see a lot of humans err at scale next Tuesday. Global warming is another example of human error at scale.

>next Tuesday.

USA (s)election, I guess.


To err at scale isn't unique to AI. We don't say "no software, it can err at scale".

CEOs embracing the marginal gains of LLMs by dumping billions into it are certainly great examples of humans erring at scale.

yep, nano mega.

It is by will alone that I set my mind in motion.

It is by the juice of Sapho that thoughts acquire speed, the lips become stained, the stains become a warning...


err, "hallucinate" is the euphemism you're looking for. ;)

I don't like the use of "hallucinate". It implies that LLMs have some kind of model of reality and sometimes get confused. They don't have any kind of model of anything, they cannot "hallucinate", they can only output wrong results.

>They don't have any kind of model of anything, they cannot "hallucinate", they can only output wrong results.

it's even more fundamental than that.

even if they had any model, they would not be able to think.

thinking requires consciousness. only humans and some animals have it. maybe plants too.

machines? no way, jose.


yeah, i get you. it was a joke, though.

that "hallucinate" term is a marketing gimmick to make it seem to the gullible that this "AI" (i.e. LLMs) can actually think, which is flat out BS.

as many others have said here on hn, those who stand to benefit a lot from this are the ones promoting this bullcrap idea (that they (LLMs) are intelligent).

greater fool theory.

picks and shovels.

etc.

In detective or murder novels, the cliche is "look for the woman".

https://en.m.wikipedia.org/wiki/Cherchez_la_femme

in this case, "follow the money" is the translation, i.e. who really benefits (the investors and founders, the few), as opposed to who is grandly proclaimed to be the beneficiary (us, the many).


s/grand/grandiose/g

from a search for grand vs grandiose:

When it comes to bigness, there's grand and then there's grandiose. Both words can be used to describe something impressive in size, scope, or effect, but while grand may lend its noun a bit of dignity (i.e., “we had a grand time”), grandiose often implies a whiff of pretension.

https://www.merriam-webster.com/dictionary/grandiose


> Humans are machines which make errors.

Indeed, and one of the most interesting errors some human machines are making is hallucinating false analogies.


It wasn't an analogy.

Machines are intelligently designed for a purpose. Humans are born and grow up, have social lives, a moral status and are conscious, and are ultimately the product of a long line of mindless evolution that has no goals. Biology is not design. It's way messier.

Exactly my thought. Humans can correct humans. Machines can correct, or at least point to failures in the product of, machines.

I don't see how this is sustainable. We have essentially eaten the seed corn. These current LLMs have been trained by an enormous corpus of mostly human-generated technical knowledge from sources which we already know to be currently being polluted by AI-generated slop. We also have preliminary research into how poorly these models do when training on data generated by other LLMs. Sure, it can coast off of that initial training set for maybe 5 or more years, but where will the next giant set of unpolluted training data come from? I just don't see it, unless we get something better than LLMs which is closer to AGI or an entire industry is created to explicitly create curated training data to be fed to future models.

These tools also require the developer class that they are intended to replace to continue doing what they currently do (create the knowledge source to train the AI on). It's not like the AIs are going to be creating the accessible knowledge bases to train AIs on, especially for new language extensions/libraries/etc. This is a "one and f'd" development. It will give a one-time gain, and then companies will be shocked when it falls apart and there are no developers trained up (because they all had to switch careers) to replace them. Unless Google's expectation is that all languages/development/libraries will just be static going forward.

One of my concerns is that AI may actually slow innovation in software development (tooling, languages, protocols, frameworks and libraries), because the opportunity cost of adopting them will increase, if AI remains unable to be taught new knowledge quickly.

It also bugs me that these tools will reduce the incentive to write better frameworks and language features if all the horrible boilerplate is just written by an LLM for us rather than finding ways to design systems which don't need it.

The idea that our current languages might be as far as we get is absolutely demoralising. I don't want a tool to help me write pointless boilerplate in a bad language, I want a better language.


This is my main concern. What's the point of other tools when none of the LLMs have been trained on it and you need to deliver yesterday?

It's an insanely conservative tool


You already see this if you use a language outside of Python, JS or SQL.

that is solved via larger contexts

It’s not, unless contexts get as large as comparable training materials. And you’d have to compile adequate materials. Clearly, just adding some documentation about $tool will not have the same effect as adding all the gigabytes of internet discussion and open source code regarding $tool that the model would otherwise have been trained on. This is similar to handing someone documentation and immediately asking questions about the tool, compared to asking someone who had years of experience with the tool.

Lastly, it’s also a huge waste of energy to feed the same information over and over again for each query.


- context of millions of tokens is frontier

- context over training is like someone referencing docs vs vaguely recalling from decayed memory

- context caching


You’re assuming that everything can be easily known from documentation. That’s far from the truth. A lot of what LLMs produce is informed by having been trained on large amounts of source code and large amounts of discussions where people have shared their knowledge from experience, which you can’t get from the documentation.

Yea, I'm thinking along the same lines.

The companies valuing the expensive talent currently working at Google will be the winners.

Google and others are betting big right now, but I feel the winners might be those who watch how it unfolds first.


The LLM codegen at Google isn't unsupervised. It's integrated into the IDE as both autocomplete and prompt-based assistant, so you get a lot of feedback from a) what suggestions the human accepts and b) how they fix the suggestion when it's not perfect. So future iterations of the model won't be trained on LLM output, but on a mixture of human written code and human-corrected LLM output.

As a dev, I like it. It speeds up writing easy but tedious code. It's just a bit smarter version of the refactoring tools already common in IDEs...


What about (c) the human doesn't realize the LLM-generated code is flawed, and accepts it?

I mean what happens when a human doesn't realize the human generated code is wrong and accepts the PR and it becomes part of the corpus of 'safe' code?

Presumably someone will notice the bug in both of these scenarios at some point and it will no longer be treated as safe.

Do you ask a junior to review your code or someone experienced in the codebase?

Maybe most of the code in the future will be very different from what we’re used to. For instance, AI image processing/computer vision algorithms are being adopted very quickly, given that the best ones are now mostly transformer networks.

My main gripe with this form of code generation is that is primarily used to generate “leaf” code. Code that will not be further adjusted or refactored into the right abstractions.

It is now very easy to sprinkle in regexes to validate user input, like email addresses, on every controller instead of using a central lib/utility for that.

In the hands of a skilled engineer it is a good tool. But for the rest it mainly serves to output more garbage at a higher rate.
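
To illustrate the difference (a hypothetical sketch, not code from any real project; the file and function names are made up): a central utility gives you one place to test and fix the inevitably imperfect pattern, whereas sprinkling the regex copies that liability into every controller.

    # validators.py - one shared, testable definition
    import re

    # Deliberately simple pattern; the point is that it lives in one place.
    _EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

    def is_valid_email(value: str) -> bool:
        return bool(_EMAIL_RE.match(value))

    # controllers/users.py - every controller reuses the same rule
    # from validators import is_valid_email
    def create_user(payload: dict) -> dict:
        if not is_valid_email(payload.get("email", "")):
            return {"status": 400, "error": "invalid email"}
        return {"status": 201}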


>It is now very easy to sprinkle in regexes to validate user input, like email addresses, on every controller instead of using a central lib/utility for that.

Some people are touting this as a major feature. "I don't have to pull in some dependency for a minor function - I can just have AI write that simple function for me." I, personally, don't see this as a net positive.


Yes, I have heard similar arguments before. It could be an argument for including the functionality in the standard lib for the language. There can be a long debate about dependencies, and then there is still the benefit of being able to vendor and prune them.

The way it is now just leads to bloat and cruft.


> The direction of travel is very clear

And if we get 9 women we can produce a baby in a single month.

There's no guarantee such progression will continue. Indeed, there's much more evidence it is coming to a halt.


It might also be an example of 80/20 - we're just entering the 20% of features that take 80% of the time & effort.

It might be possible, but will shareholders/investors foot the bill for the 80% of the effort that still has to be paid for?


It's not even been 2 years, and you think things are coming to a halt?

Yes. The models require training data, and they've already been fed the internet.

More and more of the content generated since is LLM generated and useless as training data.

The models get worse, not better by being fed their own output, and right now they are out of training data.

This is why Reddit just went profitable: AI companies buy its text to train their models because it is at least somewhat human-written.

Of course, even reddit is crawling with LLM generated text, so yes. It is coming to a halt.


Data is not the only factor. Architecture improvements, data filtering etc. matter too.

I know for a fact they are, because the rate _and_ quality of improvement are diminishing exponentially. I keep a close eye on this field as part of my job.

> Don't look at the state of the art now, look at the direction of travel.

That's what people are doing. The direction of travel over the most recent few (6-12) months is mostly flat.

The direction of travel when first introduced was a very steep line going from bottom-left to top-right.

We are not there anymore.


> I'm continually surprised by the amount of negativity

Maybe I'm just old, but to me, LLMs feel like magic. A decade ago, anyone predicting their future capabilities would have been laughed at.


Magic Makes Money - the more magical something seems, the more people are willing to pay for that something.

The discussion here seems to bear this out: the CEO claims AI is magical; here, the truth becomes that it’s just an auto-complete engine.


Nah, you just were not up to speed with the current research. Which is completely normal. Now marketing departments are on the job.

Transformers were proposed in 2017. A decade ago none of this was predictable.

The Emacs psychologist was there from before :D

And so were a lot of Markov chain based chatbots. Also Doretta, the Microsoft AI/search engine chatbot.

Were they as good? No. Is this an iteration of those? Absolutely.


Kurzweil would disagree)

That's the hype, isn't it. The direction of travel hasn't been proven to go beyond surface level yet.

Because there seems to be a fundamental misunderstanding producing a lot of nonsense.

Of course LLMs are a fantastic tool to improve productivity, but current LLMs cannot produce anything novel. They can only reproduce what they have seen.


But they assist developers and collect novel coding experience from their projects all the time. Each application of the LLM creates feedback on the AI code - the human might leave it as is, slightly change it, or reject it.

> LLM based systems will be writing more and more code at all companies.

At Google, today, for sure.

I do believe we still are not across the road on this one.

> if this can be accompanied by an increase in software quality, which is possible. Right now it's very hit and miss

So, is it really a smart move for Google to enforce this today, before quality has increased? Or does this set them off on a path to losing market share because their software quality will deteriorate further over the next couple of years?

From the outside it just seems Google and others have no choice, they must walk this path or lose market valuation.


> I'm continually surprised by the amount of negativity that accompanies these sorts of statements.

I'm excited about the possibilities and I still recoil at the refined marketer prose.


I'm not really seeing this direction of travel. I hear a lot of claims, but they are always third-person. I don't know or work with any engineers who rely heavily on these tools for productivity. I don't even see any convincing videos on YouTube. Just show me one engineer sitting down with these tools for a couple of hours and writing a feature that would normally take a couple of days. I'll believe it when I see it.

Well, I rely on it a lot, but not in the IDE; I copy/paste my code and prompts between the IDE and the LLM. By now I have a library of prompts in each project that I can tweak and reuse. It makes me 25% up to 50% faster. Does this mean every project is done in 50-75% of the time? No, the actual completion time is maybe 10% faster, but I do get a lot more time to spend thinking about the overall design instead of writing boilerplate and reading reference documents.

Why no YouTube videos, though? Well, most dev YouTubers are actual devs who cultivate an image of "I'm faster than an LLM, I never re-read library references, I memorise them on first read" and so on. If they then show you a video of how they forgot the syntax for this or that Maven plugin config, and how the LLM fills it in in 10 seconds instead of a 5-minute Google search, that makes them look less capable on their own. Why would they do that?


Why don’t you read reference documents? The thing with bite-sized information is that it never gives you a coherent global view of the space. It’s like exploring a territory by crawling instead of using a map.

Can you give me an example of one of these useful prompts? I'd love to try it out.

you said it, bro.

I think that at least partially the negativity is due to the tech bros hyping AI just like they hyped crypto.

To me the most interesting part of this is the claim that you can accurately and meaningfully measure software engineering productivity.

You can - but not on the level of a single developer and you cannot use those measures to manage productivity of a specific dev.

For teams you can measure meaningful outcomes and improve team metrics.

You shouldn’t really compare teams but it also is possible if you know what teams are doing.

If you are some disconnected manager who thinks he can make decisions or improvements by reducing things to single numbers - yeah, that’s not possible.


> For teams you can measure meaningful outcomes and improve team metrics.

How? Which metrics?


My company uses the DORA metrics (deployment frequency, lead time for changes, change failure rate, and time to restore service) to measure the productivity of teams, and those metrics are incredibly good.

These are awesome, but feel more applicable to DevOps than anything else. Development can certainly affect these metrics, but assuming your code doesn't introduce a huge bug that crashes the server, this is mostly for people deploying apps.

I think it's harder to measure things like developer productivity. The closest thing we have is making an estimate and seeing how far off you are, but that doesn't account for hedging estimates or requirements suddenly changing. Changing requirements doesn't matter for DORA as it's just another sample to test for deployment.


There's only one metric that matters at the end of the day, and that's $. Revenue.

Unfortunately there's a lot of lag


> Unfortunately there's a lot of lag

A great generalisation and understatement! Often looking like you are becoming more efficient is more important than actually being more efficient, e.g. you need to impress investors. So you cut back on maintenance and other cost centres, and the new management can blame you for it in 6 years' time, when you are far enough away from it for it not to hurt you.


s/Revenue/profit/g

That is what we pay managers to figure out. They should find out which metrics and how by knowing the team, having familiarity with the domain, understanding company dynamics, understanding the customer, and understanding market dynamics.

That's basically a non-answer. Measuring "productivity" is a well known hard problem, and managers haven't really figured it out...

It's not a non-answer. Good managers need to figure out what metrics make sense for the team they are managing, and that will change depending on the company and team. It might be new features, bug fixes, new product launch milestones, customer satisfaction, ad revenue, or any of a hundred other things.

I would want a specific example in that case rather than "the good managers figure it out" because in my experience, the bad managers pretend to figure it out while the good managers admit that they can't figure it out. Worse still, if you tell your reports what those metrics are, they will optimize them to death, potentially tanking the product (I can increase my bug fix count if there are more bugs to fix...).

So for a specific example I would have to outline 1-2 years of history of a team and product as a starter.

Then I would have to go on outlining 6-12 months of trying stuff out.

Because if I just give "an example" I will get dozens of "smart ass" replies about how this specific one did not work for them and how I am stupid. Thanks, but I don't have time for that, or for writing an essay that no one will read anyway before calling me stupid or demanding even more explanation. :)


I get it, you are a true believer. I just disagree with your belief, and the fact that you can't bring credible examples to the table just reinforces that disagreement in my mind.

The thing is even bad managers can thrive in a company with a large userbase like Google. There is a lot of momentum built into product and engineering.

I heard lines of code is a hot one.

So basically you have nothing useful to say?

I have to say that there is no solution that will work for "every team on every product".

This seems to be useful to understand and internalize that there are no simple answers like "use story points!".

There are also loads of people who don't understand that, so I stand by it being useful and important to repeat on every possible occasion.


Economists are generally fine with defining productivity as the ratio of aggregate outputs to aggregate inputs.

Measuring it is not the hard part.

The hard part is doing anything about it. If you can't attribute specific outputs to specific inputs, you don't know how to change inputs to maximize outputs. That's what managers need to do, but of course they're often just guessing.


Measuring human productivity is hard since we can't quantify output beyond silly metrics like lines of code written or amount of time speaking during meetings. Maybe if we were hunter/gatherers we could measure it by amount of animals killed.

Well I pretty much see which team members are slacking and which are working hard.

But I do code myself, I write requirements so I do know which ones are trivial and which ones are not. I also see when there are complex migrations.

If you work in a group of people you will also get feedback - doesn't have to be snitching but still you get the feel who is a slacker in the group.

It is hard to quantify the output if you want to be a removed-from-the-group, "give me a number" manager. If you actually do the work of a manager and get a feel for the group - who is the "Hermione Granger" nagging that others are slacking (and whose opinion you can disregard), who is the "silent doer", who is the "we should do it properly" bullshitter - you can make a lot of meaningful adjustments.


> Maybe if we were hunter/gatherers we could measure it by amount of animals killed.

Even that would be hard since hunting is complex. If you are the one chasing the prey into the arms of someone else, you surely want it to be considered a team effort.

You need something like 'blueberries picked'.


That's why upthread we have https://news.ycombinator.com/item?id=41992562

"You can [accurately and meaningfully measure software engineering productivity] - but not on the level of a single developer and you cannot use those measures to manage productivity of a specific dev."

At the level of a company like Google, it's easy: both inputs and outputs are measured in terms of money.


As you point back to my comment.

I am not an Amazon person, but from my experience two-pizza teams were what worked - I never implemented them myself, that's just what I observed in the wild.

Measuring Google in terms of money is also flawed, there is loads of BS hidden there and lots of people paying big companies more just because they are big companies.


> Maybe if we were hunter/gatherers we could measure it by amount of animals killed.

So that's how animal husbandry came about!


Haha, that is not what managers do. Managers follow their KPIs exactly. If their KPIs say they get paid a bonus if profit goes up, then the manager does smart number stuff, sees "if we fire 15% of employees this year, my pay goes up 63%", and then that happens.

That sounds like a micro manager. I would imagine good engineers can figure out something for themselves.

I knew a superstar developer who worked on reports in an SQL tool. In the company metrics, the developer scored 420 points per month, the second developer scored 60 points. “Please learn how to score more points from the leader”, the boss would say.

The superstar developer’s secret… he would send blank reports to clients (who would only realize it days later, and someone else would end up redoing the report), and he would score many more points without doing anything. I’ve seen this happen a lot in many different companies. As a friend of mine used to say, “it’s very rare, but it happens all the time.”

I have no doubt that AI can help developers, but I don’t trust the metrics of the CEO or people who work on AI, because they are too involved in the subject.


> When people are pressured to meet a target value there are three ways they can proceed:

1) They can work to improve the system

2) They can distort the system

3) Or they can distort the data

https://commoncog.com/goodharts-law-not-useful/


Honestly I doubt he got away with this for long (unless it was a very dysfunctional org). Being the best gets you noticed (in a good way), and screwing people over gets you noticed too (in a bad way), the combination of the two paints a target on your back.

> Being the best gets you noticed (in a good way), and screwing people over gets you noticed too (in a bad way),

ah, to be young again...


I don't know what you're implying - I have had a few instances in my career when I went above and beyond and while I didn't receive too much praise for my efforts directly, after a while I noticed people who had no business knowing who I was, actually did.

Now, I was really bad at capitalizing on it, so nothing much came of it, but still, there are some positive things that higher-ups do notice.


At scale you can do this in a bunch of interesting ways. For example, you could measure "amount of time between opening a crash log and writing the first character of a new change" across 10,000s of engineers. Yes, each individual data point is highly messy. Alice might start coding as a means of investigation. Bob might like to think about the crash over dinner. Carol might get a really hard bug while David gets a really easy one. But at scale you can see how changes in the tools change this metric.

None of this works to evaluate individuals or even teams. But it can be effective at evaluating tools.
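
A hedged sketch of that kind of aggregation (the event data and field names here are made up; the point is that per-variant medians over many engineers are informative even though individual samples are noisy):

    from statistics import median

    # (tool variant, seconds from opening a crash log to first keystroke)
    events = [
        ("control", 340), ("control", 95), ("control", 1200),
        ("with_ai", 280), ("with_ai", 60), ("with_ai", 900),
    ]

    by_variant = {}
    for variant, seconds in events:
        by_variant.setdefault(variant, []).append(seconds)

    for variant, samples in sorted(by_variant.items()):
        print(variant, "median seconds to first keystroke:", median(samples))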


There's lots of stuff you can measure. It's not clear whether any of it is correlated with productivity.

To use your example, a user with an LLM might say "LLM please fix this" as a first line of action, drastically improving this metric, even if it ruins your overall productivity.


You can come up with measures for it and then watch them, that’s for sure.

When a metric becomes the target it ceases to be a good metric. Once developers discover how it works, they will type the first character immediately after opening the log.

edit: typo


Only if the developer is being judged on the thing. If the tool is being judged on the thing, it's much less relevant.

That is, I, personally, am not measured on how much AI generated code I create, and while the number is non-zero, I can't tell you what it is because I don't care and don't have any incentive to care. And I'm someone who is personally fairly bearish on the value of LLM-based codegen/autocomplete.


That was my point, veiled in an attempt to be cute.

Is AI ready to crawl through all open source and find / fix all the potential security bugs or all bugs for that matter? If so will that become a commercial service or a free service?

Will AI be able to detect bugs and back doors that require multiple pieces of code working together rather than being in a single piece of code? Humans have a hard time with this.

- Hypothetical Example: Authentication bugs in sshd that require a flaw in systemd, which then requires a flaw in udev or nss or PAM or some underlying library ... but looking at each individual library or daemon, there are no bugs that a professional penetration testing organization such as the NCC Group or Google's Project Zero would find. In other words, will AI soon be able to find more complex bugs in a year than Tavis has found in his career, and will they start to compete with one another and start finding all the state-sponsored complex bugs, and then ultimately be able to create a map that suggests a common set of developers that may need to be notified? Will there be a table that logs where AI found things that professional human penetration testers could not?


No, that would require AGI. Actual reasoning.

Adversaries are already detecting issues, though, using proven means such as code review and fuzzing.

Google project zero consists of a team of rock star hackers. I don't see LLM even replacing junior devs right now.


Seems like there is more gain on the adversary side of this equation. Think nation-states like North Korea or China, and commercial entities like Pegasus Group.

Google's AI would have the advantage of the source code. The adversaries would not. (At least, not without hacking Google's code repository, which isn't impossible...)

FWIW: NSO is the group, Pegasus is their product

You mention safety as #1, but my impression is that Google has taken a uniquely primitive approach to safety with many of their models. Instead of influencing the weights of the core model, they check core model outputs with a tiny and much less competent “safety model”. This approach leads to things like a text-to-image model that refuses to output images when a user asks to generate “a picture of a child playing hopscotch in front of their school, shot with a Sony A1 at 200 mm, f2.8”. Gemini has a similar issue: it will stop mid-sentence, erase its entire response and then claim that something is likely offensive and it can’t continue.

The whole paradigm should change. If you are indeed responsible for developer tools, I would hope that you’re actively leveraging Claude 3.5 Sonnet and o1-preview.


As someone working in cybersecurity and actively researching vulnerability scanning in codebases (including with LLMs), I’m struggling to understand what you mean by “safe.” If you’re referring to detecting security vulnerabilities, then you’re either working on a confidential project with unpublished methods, or your approach is likely on par with the current state of the art, which primarily addresses basic vulnerabilities.

How are you measuring productivity? And is the effect you see in A/B tests statistically significant? Both of these were challenging to do at Meta, even with many thousands of engineers - curious what worked for you.

Was this comment cleared by comms?

Is any of the AI generated code being committed to Google's open source repos, or is it only being used for private/internal stuff?

I’ve been thinking a lot lately about how an LLM trained on really high-quality code would perform.

I’m far from impressed with the output of GPT/Claude; all they’ve done is weight against Stack Overflow - which is still low-quality code relative to Google’s.

What is the probability Google makes this a real product, or is it too likely to autocomplete trade secrets?


Seems like everything is working out without any issues. Shouldn't you be a bit suspicious?

I assume the amount of monitoring effort is less than the amount of effort that would be required to replicate the AI generated code by humans, but do you have numbers on what that ROI looks like? Is it more like 10% or 200%?

Would you say that the efficiency gain is less than, equal to, or greater than the cost?

It's always felt like having AI in the cloud for better autocomplete is a lot for a small gain.


> We work very closely with Google DeepMind to adapt Gemini models for Google-scale coding and other software engineering use cases.

Considering how terrible and frequently broken the code that the public-facing Gemini produces is, I have to be honest: that kind of scares me.

Gemini frequently fails at some fairly basic stuff, even in popular languages where it would have had a lot of source material to work from; where other public models (even free ones) sail through.

To give a fun, fairly recent example, here's a prime factorisation algorithm it produced for Python:

  # Find the prime factorization of n
  prime_factors = []
  while n > 1:
    p = 2
    while n % p == 0:
      prime_factors.append(p)
      n //= p
    p += 1
  prime_factors.append(n)
Can you spot all the problems?

I'm the first to say that AI will not replace human coders.

But I don't understand this attempt to tell companies/persons that are successfully using AI that no they really aren't.

In my opinion, if they feel they're using AI successfully, the goal should be to learn from that.

I don't understand this need to tell individuals who say they are successfully using AI that, "no you aren't."

It feels like a form of denial.

Like someone saying, "I refuse to accept that this could work for you, no matter what you say."


They probably use AI for writing tests, small internal tools/scripts, building generic frontends and quick prototypes/demos/proofs of concept. That could easily be that 25% of the code. And modern LLMs are pretty okayish with that.

I believe most people use AI to help them quickly figure out how to use a library or an API without having to read all of its (often outdated) documentation, rather than to help them solve some mathematical challenge.

I've never had an AI not just make up an API when one didn't exist, instead of saying "it doesn't exist". Lol

If the documentation is out of date, such that it doesn't help, this doesn't bode well for the training data of the AI helping it get it right, either?

AI can presumably integrate all of the forum discussions talking about how people really use the code.

Assuming discussions don't happen in Slack, or Discord, or...


Unfortunately, it often hallucinates wrong parameters (or gets their order wrong) if there are multiple different APIs for similar packages. For example, there are plenty of ML model inference packages, and the code suggestions for NVIDIA Triton Inference Server Python code are pretty much always wrong, as it generates code that’s probably correct for other Python ML inference packages with a slightly different API.

I often find the opposite. Documentation can be up to date, but AI suggests deprecated or removed functions because there’s more old code than new code. Pgx v5 is a particularly consistent example.

And all the code on which it was trained...

Forum posts can also be out of date.

I think that too, but Google claims something else.

We are sorely lacking a "Make Computer Science a Science" movement; the tech lead's blurb is par for the course, talking about "SWE productivity" with no reference to scientific inquiry or to a foundational understanding of safety, correctness, verification, and validation of these new LLM technologies.

Did you know that Google is a for-profit business and not a university? Did you know that most places where people work on software are the same?

So are most medical facilities. Somehow, the vibes are massively different.

That's rich? Never heard of the opioid crisis? Or the over-prescription of imaging tests?

Did you know that Software Engineering is a university level degree? That it is a field of scientific study, with professors who dedicate their lives to it? What happens when companies ignore science and worse yet cause harm like pollution or medical malpractice, or in this case, spread Silicon Valley lies and bullshit???

Did you know? How WEIRD.

How about you not harass other commenters with such arrogantly ignorant sarcastic questions?? Or is that part of corporate "for-profit" culture too????


> Did you know that Software Engineering is a university level degree? That it is a field of scientific study, with professors who dedicate their lives to it?

So is marketing? So is finance? So is petroleum engineering?


> Can you spot all the problems?

You were probably being rhetorical, but there are two problems:

- `p = 2` should be outside the loop

- `prime_factors.append(n)` appends `1` onto the end of the list for no reason

With those two changes I'm pretty sure it's correct.
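
For completeness, a corrected sketch with those two changes applied (plain trial division; assumes n is a positive integer):

    def prime_factors(n):
        factors = []
        p = 2              # start from the smallest prime, outside the loop
        while n > 1:
            while n % p == 0:
                factors.append(p)
                n //= p
            p += 1
        return factors     # e.g. prime_factors(12) == [2, 2, 3]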


You don't need to append 'p' in the inner while loop more than once. Maybe keep the prime factors in a set instead of an array.

It’s valid to return the multiplicity of each prime, depending on the goal of this.


`n` isn't defined

The implicit context that the poster removed (as you can tell from the indentation) was a function definition:

    def factorize(n):
      ...
      return prime_factors

We collectively deride leetcoding interviews yet ask AI to flawlessly solve leetcode questions.

I bet I'd make more errors on my first try at it.


Writing a prime-number factorization function is hardly "leetcode".

I didn't say it's hard, but it's most definitely leetcode, as in "pointless algorithmic exercise that will only show you if the candidate recently worked on a similar question".

If that doesn't satisfy, here's a similar one at leetcode.com: https://leetcode.com/problems/distinct-prime-factors-of-prod...

I would not expect a programmer of any seniority to churn stuff like that and have it working without testing.


> "pointless algorithmic exercise that will only show you if the candidate recently worked on a similar question".

I've been able to write one, not from memory but from first principles, any time in the last 40 years.


Curious, I would expect a programmer of your age to remember Knuth's "beware of the bugs in above code, I have only proven it's correct but haven't actually run it".

I'm happy you know math, but my point before this thread got derailed was that we're holding (coding) AI to a higher standard than actual humans, namely to expect to write bug-free code.


> my point before this thread got derailed was that we're holding (coding) AI to a higher standard than actual humans, namely to expect to write bug-free code

This seems like a very layman attitude and I would be surprised to find many devs adhering to this idea. Comments in this thread alone suggest that many devs on HN do not agree.


I hold myself to a higher standard than AI tools are capable of, from my experience. (Maybe some people don't, and that's where the disconnect is between the apologists and the naysayers?)

Humans can actually run the code and know what it should output. The LLM can't, and putting it in a loop against code output doesn't work well either, since the LLM can't navigate that well.

A senior programmer like me knows that primality-based problems like the one posed in your link are easily gamed.

Testing for small prime factors is easy - brute force is your friend. Testing for large prime factors requires more effort. So the first trick is to figure out the bounds to the problem. Is it int32? Then brute-force it. Is it int64, where you might have a value like the Mersenne prime 2^61-1? Perhaps it's time to pull out a math reference. Is it longer, like an unbounded Python int? Definitely switch to something like the GNU Multiple Precision Arithmetic Library.

In this case, the maximum value is 1,000, which means we can enumerate all distinct prime values in that range, and test for its presence in each input value, one by one:

    # list from https://www.math.uchicago.edu/~luis/allprimes.html
    _primes = [
        2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59,
        61, 67, 71, 73, 79, 83, 89, 97, 101, 103, 107, 109, 113, 127, 131,
        137, 139, 149, 151, 157, 163, 167, 173, 179, 181, 191, 193, 197,
        199, 211, 223, 227, 229, 233, 239, 241, 251, 257, 263, 269, 271,
        277, 281, 283, 293, 307, 311, 313, 317, 331, 337, 347, 349, 353,
        359, 367, 373, 379, 383, 389, 397, 401, 409, 419, 421, 431, 433,
        439, 443, 449, 457, 461, 463, 467, 479, 487, 491, 499, 503, 509,
        521, 523, 541, 547, 557, 563, 569, 571, 577, 587, 593, 599, 601,
        607, 613, 617, 619, 631, 641, 643, 647, 653, 659, 661, 673, 677,
        683, 691, 701, 709, 719, 727, 733, 739, 743, 751, 757, 761, 769,
        773, 787, 797, 809, 811, 821, 823, 827, 829, 839, 853, 857, 859,
        863, 877, 881, 883, 887, 907, 911, 919, 929, 937, 941, 947, 953,
        967, 971, 977, 983, 991, 997]

    def distinctPrimeFactors(nums: list[int]) -> int:
        if __debug__:
            # The problem definition gives these constraints
            assert 1 <= len(nums) <= 10_000, "size out of range"
            assert all(2 <= num <= 1000 for num in nums), "num out of range"

        num_distinct = 0
        for p in _primes:
            for num in nums:
                if num % p == 0:
                    num_distinct += 1
                    break
        return num_distinct
That worked without testing, though I felt better after I ran the test suite, which found no errors. Here's the test suite:

    import unittest

    class TestExamples(unittest.TestCase):
        def test_example_1(self):
            self.assertEqual(distinctPrimeFactors([2,4,3,7,10,6]), 4)

        def test_example_2(self):
            self.assertEqual(distinctPrimeFactors([2,4,8,16]), 1)

        def test_2_is_valid(self):
            self.assertEqual(distinctPrimeFactors([2]), 1)

        def test_1000_is_valid(self):
            self.assertEqual(distinctPrimeFactors([1_000]), 2) # (2*5)**3

        def test_10_000_values_is_valid(self):
            values = _primes[:20] * (10_000 // 20)
            assert len(values) == 10_000
            self.assertEqual(distinctPrimeFactors(values), 20)

    @unittest.skipUnless(__debug__, "can only test in debug mode")
    class TestConstraints(unittest.TestCase):
        def test_too_few(self):
            with self.assertRaisesRegex(AssertionError, "size out of range"):
                distinctPrimeFactors([])
        def test_too_many(self):
            with self.assertRaisesRegex(AssertionError, "size out of range"):
                distinctPrimeFactors([2]*10_001)
        def test_num_too_small(self):
            with self.assertRaisesRegex(AssertionError, "num out of range"):
                distinctPrimeFactors([1])
        def test_num_too_large(self):
            with self.assertRaisesRegex(AssertionError, "num out of range"):
                distinctPrimeFactors([1_001])

    if __name__ == "__main__":
        unittest.main()
I had two typos in my test suite (an "=" for "==", and a ", 20))" instead of "), 20)"), and my original test_num_too_large() tested 10_001 instead of the boundary case of 1_001, so three mistakes in total.

If I had no internet access, I would compute that table thusly:

  _primes = [2]
  for value in range(3, 1000):
    if all(value % p > 0 for p in _primes):
        _primes.append(value)
Do let me know of any remaining mistakes.

What kind of senior programmers do you work with who can't handle something like this?

EDIT: For fun I wrote an implementation based on sympy's integer factorization:

    from sympy.ntheory import factorint
    def distinctPrimeFactors(nums: list[int]) -> int:
        distinct_factors = set()
        for num in nums:
            distinct_factors.update(factorint(num))
        return len(distinct_factors)
Here's a new test case, which takes about 17 seconds to run:

        def test_Mersenne(self):
            self.assertEqual(distinctPrimeFactors(
                [2**44497-1, 2,4,3,7,10,6]), 5)

Empirical testing (for example: https://news.ycombinator.com/item?id=33293522) has established that the people on Hacker News tend to be junior in their skills. Understanding this fact can help you understand why certain opinions and reactions are more likely here. Surprisingly, the more skilled individuals tend to be found on Reddit (same testing performed there).

I’m not sure that’s evidence; I looked at that and saw it was written in Go and just didn’t bother. As someone with 40 years of coding experience and a fundamental dislike of Go, I didn’t feel the need to even try. So the numbers can easily be skewed, surely.

Only individuals who submitted multiple bad solutions before giving up were counted as failing. If you look but don't bother, or submit a single bad solution, you aren't counted. Thousands of individuals were tested on Hacker News and Reddit, and surprisingly, it's not even close: Reddit is where the hackers are. I mean, at the time of the testing, years ago.

That doesn’t change my point. It didn’t test every dev on all platforms, it tested a subset. That subset may well have different attributes to the ones that didn’t engage. So, it says nothing about the audience for the forums as a whole, just the few thousand that engaged.

Perhaps even, there could be fewer Go programmers here and some just took a stab at it even though they don’t know the language. So it could just select for which forum has the most Go programmers. Hardly rigourous.

So I’d take that with a pinch of salt personally


Agreed. But remember, this isn't the only time the population has been tested. This is just the test (from two years ago, in 2022) that I happen to have a link to.

The population hasn’t been tested. A subset has.

It's also fine to be an outlier. I've been programming for 24 years and have been hanging out on HackerNews on and off for 11. HN was way more relevant to me 11 years ago than it is now, and I don't think that's necessarily only because the subject matter changed, but probably also because I have.

How is that thing testing? Is it expecting a specific solution or actually running the code? I tried some solutions and it complained anyway.

The way the site works is explained in the first puzzle, "Hack This Site". TLDR, it builds and runs your code against a test suite. If your solutions weren't accepted, it's because they're wrong.

Where is the data?

Yeah, this is useless.


