AI code generation is all the rage nowadays.

It’s not hard to see why. Look at it from a business leader’s perspective. Imagine not having to pay expensive developers whom you have to train for years before they start really producing quality work. Who demand more and more money and perks on top of that. Who loudly and publicly protest your business decisions. Just have an AI write the code instead! It never complains, never grows tired. It just chugs out code as long as you ask.

On the surface, the code these tools generate is plausible. Sometimes it even does what it’s supposed to. And even when it stumbles, these tools seem to be improving at such a rapid pace that the next generation, a few months from now, will easily handle whatever the current one cannot.

A lot has been written about what AI-assisted tools can and cannot do. There’s been a lot of consternation about whether these tools will make software developers obsolete. I feel this discourse represents a misunderstanding of what software development is all about. The questions on my mind are rather: what will be the effect of the use of these tools on the software we develop? And will they ultimately live up to their hype or will this be more of a passing fad?

I recently did my own experiments to gain experience and form my own opinion. Here is how I think about it.

What experience I have gathered

I’ve tried several exercises, but will focus on the following:

  • Building a todo list app with frontend from scratch using AI-assisted coding,
  • Attempting to make improvements to the computer AI for an existing strategy game, and
  • Converting integration tests of a web service into equivalent unit tests.

I used various tools for these, including Cursor, Cline, Roo Code, and Claude Code.

In each of these cases, I carefully examined the code generated by the AI tool at each stage, making adjustments either in the prompts or by directly editing the code. This is in contrast to vibe coding, where one just asks the AI to generate something and evaluates the results only on the basis of their observable behaviour.

The exercise to build a todo list was reasonably successful. I built a web frontend, backend in Python, and unit tests of the backend. I used only AI prompts, with no manual editing of code. The biggest difficulty was with the tests: it took several rounds of prompting to get them right. All in all, I feel I was able to finish the work somewhat more quickly than I would have writing the code manually. That said, this feels like such a basic exercise that I would expect any AI tool to master it easily.

My attempt to make improvements to the computer AI was absolutely unsuccessful. The tools produced only nonsense despite several rounds of prompting.

The conversion of integration tests to unit tests was half-successful. The tool did scaffold the tests in a halfway reasonable way, but it took many rounds of prompting to get it to produce well-structured unit tests. The tests it produced were somewhat reasonable, but did not really correspond to what had been covered in the integration tests. They also did not run successfully. I ended up having to do extensive manual edits to obtain reasonable unit tests. In the end, it is doubtful that I saved any time compared with just writing the tests by hand.

The code the AI tools wrote was generally idiomatic. This makes sense, because it’s trained on the code everyone else is writing. But that also means it suffers from the same problems which are widespread among existing code bases.

One sees this especially in the generated tests – a topic about which I am especially picky. In my experience, test code is some of the sloppiest and most neglected code one finds, to the detriment of the overall health of the code base. At first, the tests were a disorganised mess. They were not set up in an arrange-act-assert structure with clearly identifiable steps. The assertions in particular were strangely built. They would devote several lines of code to extracting data, then use an advanced assertion framework (whose matchers could already have done that extraction) to make only simple assertions. The tests had vague names like test<method>. All in all, they looked like what an amateur developer might think test code looks like. But they made no sense as tests.
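
To make the contrast concrete, here is a rough sketch in Python. The TodoService class is a hypothetical stand-in for the backend, and PyHamcrest stands in for the matcher-based assertion framework; the first test mirrors the style the tools kept producing, the second the arrange-act-assert shape I was prompting towards.

    import unittest

    from hamcrest import assert_that, equal_to, has_entries


    class TodoService:
        """Hypothetical stand-in for the backend under test."""

        def __init__(self):
            self._items = []

        def add(self, title):
            item = {"id": len(self._items) + 1, "title": title, "done": False}
            self._items.append(item)
            return item


    class TestTodoService(unittest.TestCase):
        # Roughly what the tools generated: a vague name, several lines spent
        # extracting values by hand, then the matcher library used only for
        # trivial equality checks.
        def test_add(self):
            service = TodoService()
            result = service.add("buy milk")
            title = result["title"]
            done = result["done"]
            assert_that(title, equal_to("buy milk"))
            assert_that(done, equal_to(False))

        # What I was prompting towards: a descriptive name, clear
        # arrange-act-assert steps, and the matchers doing the extraction.
        def test_add_returns_open_item_with_given_title(self):
            # Arrange
            service = TodoService()
            # Act
            item = service.add("buy milk")
            # Assert
            assert_that(item, has_entries(title="buy milk", done=False))


    if __name__ == "__main__":
        unittest.main()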

This demonstrates two points about the AI tools:

  • Their code is only as good as their training data. Since most test code is of pretty poor quality, the AI tools all build tests of similarly poor quality.
  • The models don’t actually understand what they are doing. They create tests which seem to be “typical” without considering whether the set of tests which they are writing actually makes sense. The experience of a real software developer is really important here.

In all cases, it took several rounds of prompts to get the tests to a satisfactory state. The AI did respond decently well to prompts for correction, and applied corrections reasonably consistently. It sometimes even “remembered” enough context to apply previous comments proactively. However, the amount of time I spent reprompting to get the tests right clearly exceeded the amount of time I would have spent writing them manually.

The AI tools also tended to do too much at once. For example, in one case I asked it to scaffold a project, then to change the name of the top-level directory to “todo-list”. It then surmised that the project was a todo list and implemented the whole thing! I find this really problematic. There are few things we really know about how to write good software. One of them is that we get the best results when we work incrementally. AI tools which actively work against this principle are prone to producing really poor results for code health.

No one really knows how this will play out long-term

A few months ago, a CEO confidently proclaimed in my presence, “everyone will be coding like this in five years”. I could only roll my eyes. The reality is, no one knows how this will play out. And I think there are good reasons to be skeptical that this will have the kind of impact its enthusiasts preach.

Suppose that CEO turns out to be right, though. What will be the effect on developers themselves? And on the long-term health of the projects involved?

I work best when I have a close connection to the code. When AI writes the code, do we lose that close connection? Do we start just accepting the suggested code out of habit without really looking carefully at it? Do we end up with unmaintainable AI-generated slop in the long run, where not even the AI can work on it effectively? Do developers lose the ability to do their jobs properly?

Where do I see this fitting in?

Let’s take a step back from the hype. What does this really look like to me?

AI-assisted coding is useful primarily for cases where typing is a bottleneck. And then only when there are many existing examples from which to draw. So things like:

  • “Bootstrap a project which looks like this”,
  • “Write a mapping from this structure to this other structure, including unit tests” (see the sketch below).
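
To make the second item concrete, here is a minimal sketch in Python of what such a mapping task might look like; the ApiUser and User structures and their field names are invented purely for illustration.

    from dataclasses import dataclass


    @dataclass
    class ApiUser:
        """Hypothetical record shape delivered by an external API."""
        user_name: str
        mail: str
        active: int  # the upstream API uses 0/1 flags


    @dataclass
    class User:
        """Hypothetical internal representation."""
        username: str
        email: str
        is_active: bool


    def to_user(api_user: ApiUser) -> User:
        """Map the external structure onto the internal one."""
        return User(
            username=api_user.user_name,
            email=api_user.mail,
            is_active=bool(api_user.active),
        )


    def test_to_user_maps_all_fields():
        # Arrange
        api_user = ApiUser(user_name="ada", mail="ada@example.com", active=1)
        # Act
        user = to_user(api_user)
        # Assert
        assert user == User(username="ada", email="ada@example.com", is_active=True)

Tasks of this shape are mostly typing: the solution is obvious, and the training data is full of similar code, so the tools have plenty to draw on.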

If there aren’t many good code examples out there, the AI will probably produce pretty sloppy code. To get a decent result, one needs to be ready to spend a lot of time issuing follow-up prompts to clean up the results. Or one will have to spend a lot of time cleaning it up oneself. This eats into its perceived advantages.

Another case where I find these tools helpful is when one doesn’t actually care about the resulting code because one is only building a prototype. Often the AI will produce something which even works. In a prototype, that’s all that matters.

Finally, this can be useful if one is learning a new language or framework and wants to see what an idiomatic solution to a problem might look like. In that case, the fact that the AI was trained on lots of existing code is helpful, for obvious reasons. As long as one is using it only as a learning exercise, there’s no harm done.

All said, this is not a 10x technology, and probably never will be. It’s a 10% technology. Maybe.

Contrary to what many non-devs seem to think, software development is mostly not about typing code into a computer. It’s much more a process of learning and discovery. AI-assisted code generation doesn’t usually help much with that. Indeed, it could well be a hindrance.

What worries me about this

So these tools could be useful for a certain set of tasks. In the best world, that makes developers a bit more productive and takes a bit of drudgery from their work. But that’s not the only way it could play out. What other scenarios concern me?

The concept of vibe coding really makes me shudder, at least when applied to real production applications and not just prototypes. The idea that an AI model would “understand” all the context needed to make code which not only works, but contributes to a sustainable code base is illusory. Particularly since the AI doesn’t actually have any way to distinguish “good” from “bad” code. Being a sophisticated statistical model, it can at best distinguish idiomatic from non-idiomatic code. If you’re building the same todo-list app a hundred thousand devs have already built in the past, then it’ll probably do a good job. If you’re building a highly specialized software solution (where devs, you know, actually deliver value), not so much.

Will developers really do the work to carefully review and fix AI-generated code? The temptation is already strong to just toss it over the fence and move on. But having to work with code by hand gives one a strong feel for how clean or messy it is. Messy code feels painful. One wants to clean it up. If AI is writing the code, one doesn’t feel that pain as viscerally. So AI risks removing the main drive to improve the code over time.

Perhaps worse is the prospect of developers outsourcing their thinking to AI. My recommendation is: Don’t AI-code anything you don’t understand, at least not for real production code. One can use AI assistance to learn how to do something. But one should then still take the time to actually implement it oneself. Once one knows how it really works, one can use AI to remove the toil.

I’ve heard the argument that an AI is just like a junior dev. But you don’t hire a junior dev because they are particularly productive. You hire one because they will grow into a professional and, later, a senior. And they do that because you work with them, helping them to grow. When you issue corrective feedback on a code review, you know that the dev will take it to heart and do better next time. This doesn’t work with models. You don’t train them. The software publisher does. No amount of working with them will make them better devs for you. You can add some global prompt to customize their behaviour to a degree, but they won’t really learn and grow the way real humans do.

Conclusion

It shouldn’t be surprising: LLMs are trained on massive volumes of data, and the models used in AI coding tools are no different. They are pretty good at producing code which looks like the code the average developer would produce. They “understand” a prompt and the context of the existing code well enough to adapt the generated code to the specifics of both. That is a real improvement over the past.

But they are just immensely complicated statistical models, predicting what code is “most likely” to appear. This is a far cry from being able to think and reason, or from having a true mental model of either the code or the system being built. They do not truly understand the problem you, as a developer, are trying to solve.

I think that if you’re a conscientious, professional software developer, it’s fine to add AI tools to your toolbox. They are useful for certain tasks.

But don’t be swept up by the hype. My prediction is that it’s not going to revolutionize software development any more than all the previous fads did. At least, not in a good way. So whatever you do, don’t let your craft atrophy. Sooner or later, the bubble will burst and the hype will fade.

And for heaven’s sake, keep hiring junior devs!
