Taming Autonomy Pitfalls in Code Generation
Hello everyone. You may not know me, and even if you do, please read this as an attempt to help, not harm. Put on your favorite song, take a walk, step outside and engage the world: anything that reminds you that you have power and can make a difference.
Summary
How to use automation is a question that has challenged society for the entirety of human history. Every major innovation is a gamble. The goal of this article is to help folks break away from the “Casino Economy” tech-bro way of looking at generative code, and to solidify how to reduce the “crapshoot” that generative code can produce.
Scarcity of resources is the bane of every economy and bottom line. How much you can afford to lose at the table is up to you, but this article should help you understand the limitations better, so you know how the odds are stacked and how an established technique, the Systems Development Life Cycle, can tame the pitfalls of autonomy.
My writing is based mainly on my own user experience, but there is an interesting published article on the limitations of prompt engineering. I have also been reading research on hallucination and on models overcompensating for answers, which I have written about in the past.
I take all research with a grain of salt, and I do see some potential gaps in it. What I can say with a high degree of certainty, however, is that fully automated generative code without human intervention is problematic at best. This research also predates the latest models; Claude Sonnet 4.6, for example, is in use as of this writing (March 2026).
The best advice I can give: think of this technology like many technologies at the beginning of the Industrial Revolution. Really good ideas that are still being refined, and code generation is still in its infancy.
There is a high probability that current implementations and architectures will become obsolete quickly, and valuations of current technology are well overextended. For example, back propagation is a very computationally expensive endeavor and, in my opinion, will run up against a very real wall of practical and physical engineering limitations. Sooner rather than later, that will hit your fiscal bottom line in the form of an increased burden on your profitability.
Run, Rabbit Run
Over the past eighteen months or so, I have leaned into generative code. I write about this in different posts, trying to strip away irrelevancies down to “how does this thing I am using behave?” I have felt like I have been digging holes, one after another, trying to catch an endless summer that was perhaps never there.
I’m going to look at this through the lens of what I see as the primary issue: when I am working with a model, more often than not the reasoning engine runs amok. My interaction with the machine turns jaded as I watch the thinking, then the token generation, slam out more information than I can handle or think through. Or than it can think through.
The way I have moved from “vibe coding” to “sanity coding” is by understanding how model training, back propagation, weights, tolerances, and temperature affect output. I actually ask the model what it is good at, where it is sloppy, and where it can focus.
And, I test a lot.
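To make the temperature knob above concrete, here is a minimal sketch (not any vendor’s actual implementation) of how sampling temperature reshapes a model’s token probabilities: low temperature sharpens the distribution toward the top token, high temperature flattens it toward a coin flip.

```typescript
// Softmax with temperature: converts raw logits into token probabilities.
// Lower temperature exaggerates differences (more deterministic output);
// higher temperature flattens them (more varied, riskier output).
function softmaxWithTemperature(logits: number[], temperature: number): number[] {
  const scaled = logits.map((l) => l / temperature);
  const max = Math.max(...scaled); // subtract max for numeric stability
  const exps = scaled.map((s) => Math.exp(s - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// Same three candidate tokens, two temperatures:
const cool = softmaxWithTemperature([2, 1, 0], 0.5); // top token dominates
const warm = softmaxWithTemperature([2, 1, 0], 2.0); // probabilities flatten
```

This is why “ask the model where it is sloppy” matters: the same weights produce very different behavior depending on how the sampling is tuned.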
Claude in particular has added skills (which is broadly similar to my approach), and my primary tool right now is Visual Studio Code with the GitHub Copilot chat/CLI. The tooling now supports planning, execution, prompt chains, and custom agents, and this has helped significantly.
Dig that hole, forget the sun
The two primary models I use are Gemini and Claude, mostly Sonnet and sometimes Opus. I have bench-tested and stress-tested many other models out there, but dropped them because they just do not work and are fatally flawed. I’ve tried, believe me: many of my weekends have been spent asking the same questions over and over again, trying to get stable results from models that just don’t work.
I use Claude Sonnet to plan and execute and Gemini to check the work. This is a winning combination that seems to be mostly self-healing.
Before we go any further, read the Visual Studio Code GitHub Copilot overview and dig deep (yes, that is a pun on the whole rabbit motif) to see what the Copilot extension can actually do. In the Best Practices section there are now good roadmaps for producing better code.
This gets intense quickly, and in some cases it is a rabbit hole, à la Alice in Wonderland. Generative code is a science and an art. You have to really understand (not ask a model, but understand) a lot of coding and IT concepts to get away from vibe coding. Let’s establish what we mean by that.
Generated code should be mature
That is, don’t take the result at face value. Ask questions: is this good enough? Can I live with it? Is this going to bite me in the future? Is it economical? Does it actually meet the goal? Re-frame the question back at the AI: “Is what you generated garbage? What are the pitfalls in your approach?” Frankly: interrogate it; it will make bad decisions.
One major gripe I have about generated code: the one thing it has proven to me is that the training data is flawed. Most code out in the ether is really bad. The examples are bad, and models are being trained on opinionated code rather than well-functioning, secure coding principles. Or the training is losing fidelity between the opinions and the best practices. Variable names are run-on sentences. I just don’t think the LLM works well in these cases: you must influence it in a very specific way to regain fidelity.
Your process should be mature
As I mentioned, in my opinion taming generative code is about the Systems Development Life Cycle (SDLC). Your baseline may carry a lot of risk or it may not. I personally sway from high risk tolerance to zero risk tolerance depending on what I am doing. Mostly, I just hate re-work. It is a waste of my time to keep re-working solutions unless I really need to. My goal is to not get sued, because I can’t afford an attorney.
Generative Code Accelerates the Life Cycle
The thing I like most about generative code is that it does accelerate the SDLC. For me, the days of huge requirements-gathering sessions just to get to a prototype are waning. What we can start doing is taking the basic requirements, lightly refining them, planning a quick prototype, and then getting into the iteration cycles. My motto has become “always be monitoring/maintaining,” in the sense that the faster we can ship, the faster we can maintain.
The problem is still tech debt. I equate vibe coding with uncontrolled and irresponsible tech debt. I’ve seen it, and I see it now. I personally feel that folks don’t really care about good code; they care about code making money. That’s not a criticism, that is reality. If the money is rolling in with bad code, I’m not here to judge the outcome.
As a director, my goal is to balance the “karmic credit card” so that if the bill comes due, I have options to pay off the balance. Of everything I do, I feel fiscal sanity is my core business benefit to an organization: I am fiscally aware and keep a steely eye on risk management.
Maintenance ≠ Re-work
What this does not mean, though, is constantly refactoring bad implementations, i.e., re-work. An incomplete feature is one thing; an esoteric bug is one thing; generating absolutely terrible code is another.
For example, I still will not use generative code to design a data architecture. It is really bad. Like so bad I had to give up on it. On my worst day, I am legendary compared to AI generated data models.
The Approach
Regardless of whether you use generative code, this has been my basis for a long time. I refine it and tweak things to stay LEAN, but in general my approach is this, and it works.
Create style guides. I’ve been doing this with teams for 20 years. Establish the core stylistic patterns that you want to see. Use best practices.
Design the data model by hand, with JSON Schema and an ERD. This is the core data pattern, and you need to lock in stylistic choices first.
Don’t get too locked up into normal form in this pass. Normal form is great, but can get clunky if overused.
The key is data quality and integrity. Add those as needed at the database level.
My preference has always been that the database be the final say in validation. Let the data layer support basic validation as appropriate.
Choose code-checking software to look for bad patterns. This can be Biome, Ultracite, SonarQube, ESLint, Pylint, etc. The point is that your IDE should be helping you see code issues long before shipping.
Blundering into anything is dangerous: read the f-ing manual yourself.
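To ground the data-model step above, here is a minimal sketch. The schema and the `isUser` guard are illustrative, hand-written examples (not anything generated), showing the idea of locking the shape down by hand and letting the data layer do basic validation while the database keeps the final say.

```typescript
// A hand-written JSON Schema for a small "user" record. Designing this by
// hand (rather than generating it) locks in stylistic and integrity choices.
const userSchema = {
  type: "object",
  required: ["email", "age"],
  properties: {
    email: { type: "string" },
    age: { type: "integer", minimum: 0 },
  },
  additionalProperties: false,
} as const;

interface User {
  email: string;
  age: number;
}

// Basic validation at the data layer; database constraints remain the final
// say (e.g. a CHECK (age >= 0) and a UNIQUE index on email).
function isUser(value: unknown): value is User {
  if (typeof value !== "object" || value === null) return false;
  const rec = value as Record<string, unknown>;
  return (
    typeof rec.email === "string" &&
    rec.email.includes("@") && // crude placeholder for real email validation
    typeof rec.age === "number" &&
    Number.isInteger(rec.age) &&
    rec.age >= 0 &&
    Object.keys(rec).length === 2 // mirrors additionalProperties: false
  );
}
```

The point is not this particular schema; it is that the shape was decided before any model was asked to generate against it.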
VS Code
My coding environment is set up as workspaces. This comes from using Visual Studio back in the day and has progressed from there. We can debate organization till we’re blue in the face, but I use workspaces to tweak settings in VS Code. If I am working on TypeScript, I’ve configured the IDE for TypeScript; Python changes the config; and so on.
What this does is keep those settings out of the repo; I can activate the workspace and not have to care about conflicting settings.
What my IDE does is give me feedback all the time. Errors, warnings, problems: I can see all of that in my editor at a glance. I’m a stickler for strict typing and pretty much avoid loose typing at all costs.
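As a concrete illustration of the workspace approach, here is the general shape of a `.code-workspace` file. The specific settings shown are examples of what I mean by strict, per-language configuration; treat the keys as a sketch and adapt them to your own stack and VS Code version.

```jsonc
{
  "folders": [{ "path": "." }],
  "settings": {
    // Per-workspace editor behavior, kept out of the repo itself
    "editor.formatOnSave": true,
    // Pin the workspace to the project's own TypeScript version
    "typescript.tsdk": "node_modules/typescript/lib",
    // Strict type checking for Python via Pylance
    "python.analysis.typeCheckingMode": "strict"
  }
}
```

Switching workspaces swaps the whole settings block, which is what keeps the TypeScript and Python configurations from fighting each other.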
Taming the beast
I think the thing I hate most about interacting with the models in a chat is the false equivalency that my “preference” will be adhered to. I constantly get into these circular “false choices.” I’m dealing with this right now, where ChatGPT gives some elongated response asking what I prefer when it has already given the answer on my preference. Honestly, I ignore this until it tells me what to do to change the answer. It’ll just spin you around in circles. Although, I have noticed in the last week or so that the responses are starting to infer what I am saying better.
Training
To understand how generation is going to work, you have to get past an opaque layer: how the model was trained. The industry is so hell-bent on training for general availability that it is missing the point that generality does not equate to expertise. There really is no sweet spot.
To tame this, we must give the model implicit and explicit instructions in varying degrees. This is harder than it looks. No matter what, you are going to be fighting training, back propagation, and this weird vanishing point that generation gets into.
I am not an AI expert, but what I do know is that back propagation is computationally expensive and is flawed when trying to statistically pick from the endless shades of gray. It will pick the wrong shade of gray, or the wrong needle in the endless stack of needles.
A neural network is a chain of decisions. I read a very nice article on this a while ago that said “good learning is calm and consistent,” and that is where I think training is moving. That is also where I am getting to in my implementation.
Custom Agents, Prompts, Skills and Instructions
Instruction Files
I think, for the most part, instruction files are nonsense. I just don’t think they work. I’ve had the chat basically tell me they don’t work. I feel the reason is that the instructions are just too derivative and lack exactitude. The paper I mentioned earlier said something similar. Instructions are OK for the basics, but they get lost in the context window. As soon as the instructions are loaded, they seem to have a very short shelf life. Much like neutrinos, a billion trillion instructions can pass through instantly, with maybe one hitting its mark.
I still have them, but they are very tame and short.
Prompts
We are working on a library of prompts that helps us give enough information to the model, with specific examples of good code vs. bad code and the implementation details it misses. For example, we use a translation tool on the site, so there is very specific localization that has to be performed.
What these concrete examples do is help us give the correct information to the model, so that the reference material is not lost. Like I mentioned before, the team is also looking at skills in concert with prompts.
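As an illustration, a reusable prompt can pair a bad example with a good one so the model has concrete reference material instead of vague guidance. The layout below follows VS Code’s reusable `.prompt.md` file convention; the `t()` helper and the string key are hypothetical stand-ins for whatever your translation tool actually uses.

```markdown
---
description: Localization rules for user-facing strings
---
Every user-facing string must go through the translation helper.

Bad (hard-coded English, will not localize):

    <button>Save</button>

Good (keyed through the translation helper):

    <button>{t("common.save")}</button>

If a key does not exist yet, add it to the locale files; never inline text.
```

The good/bad pairing matters more than the prose: models follow the shown pattern far more reliably than a stated rule.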
I recently watched this sort of propaganda piece on prompt engineering being dead, and I pretty much had my eyebrow raised the entire time. Is it? Welcome to another shade of gray in the diffuse forest of gray.
Custom Agents
I do use custom agents, but in a very specific way. What I have found is that, given too much autonomy, the agent will act with impunity and with false authority. Here is how I look at agents.
Requirements Gathering
The agent is told to be an architect. It is not allowed to write code, show code, or edit files: nothing like that. It is allowed to search the codebase and the internet, and to ask questions. That is its sole task.
The goal is to take my crummy language and refine it into a “spec sheet,” if you will, that focuses the task in language the model understands. The biggest thing I have noticed is that I cannot give the correct tokens (aka natural language) that carry the correct context. I am not a model. Ergo, the first task is to establish the requirements in a way the model and I can agree on.
This part, using the requirements agent, has increased code quality immensely. Until we agree and have a spec sheet, we do not design.
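For reference, this is roughly what my requirements agent looks like written as a VS Code custom chat mode (a `.chatmode.md` file). The tool names in the frontmatter are examples; check which tools your Copilot version actually exposes before copying this.

```markdown
---
description: Requirements-gathering architect. Produces a spec sheet, never code.
tools: ['codebase', 'search', 'fetch']
---
You are a software architect gathering requirements.

You MAY search the codebase and the internet, and you MUST ask clarifying
questions until the goal is unambiguous.

You may NOT write code, show code, edit files, or run commands.

Your only deliverable is a spec sheet that we both explicitly agree on.
Do not proceed to design until agreement is stated.
```

Note that the constraints are stated as capabilities removed, not as polite suggestions; that is what keeps the agent from acting with false authority.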
Design Solution
Now, step 2 is designing the solution. The agent is still not allowed to create code, but it can design the solution. This is usually the phase where first-pass code comes in, and it is usually a lot to parse. However, since we have not written anything yet, this stays isolated from overdoing things. It is strange: if I allow the agent to just move forward on its own, it will reason itself into oblivion. The same happens if I don’t temper the code responses.
In this phase, we are doing a light-to-medium code review. I will usually check the code it suggests and steer it. I still hate how it names variables; I’ll interject on patterns I think it’s not hitting, and so on.
But, we have not generated code yet.
Implement Solution
After requirements and design, the next phase is handing off the solution we have agreed on to multiple sub-agents.
Implement Solution uses a few sub-agents: “Create Code,” “Look for Deprecated Code,” and “Analyze/Test Code” (plus a few I am experimenting with). This breaks the implementation phase into separate agents, all attacking the problem from different angles. Finally, the agent writes code, and then we start the cycle over.
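The hand-offs above can be sketched as a simple gate: no phase starts until a human has signed off on the one before it. This is an illustrative model of the process, not any actual agent-framework API.

```typescript
// The three phases of the cycle, in order.
type Phase = "requirements" | "design" | "implementation";

interface PhaseApproval {
  phase: Phase;
  approvedByHuman: boolean; // the human sign-off is the gate
}

// Returns the phase we are allowed to work in, given what has been approved.
// Implementation is unreachable until requirements AND design are signed off.
function currentPhase(history: PhaseApproval[]): Phase {
  const approved = new Set(
    history.filter((r) => r.approvedByHuman).map((r) => r.phase),
  );
  if (!approved.has("requirements")) return "requirements";
  if (!approved.has("design")) return "design";
  return "implementation";
}
```

The gate is the whole trick: the agents can be as eager as they like, but the transition between phases is a human decision, not a model inference.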
What is this doing?
On a basic level, this is systems design with a few well-placed generative shortcuts. Outside of the code writing itself, we have requirements that we need to set as our goal. In the agents, we have defined what they should be doing and reinforced that the code being written has some sort of maturity. We’ve defined a repeatable, refinable process that can be improved over time.
So, instead of letting the wheels come off the wagon, we are focusing the model into not making assumptions about the goal. Because we as humans cannot state the goal explicitly, we implicitly give commands that are then turned into directives that the model can generate against.
In essence, we are acting as our own back propagation (re-training), finding what parts of the parameters we are seeing as having an issue and fixing them as we are in the stream.
Advantages
The advantage of this approach is that it puts folks in the driver’s seat. I can get in front of hallucination quickly and with authority. I am still tiptoeing into the world of “skills” alongside this approach, and there will be more to come on skills, as benchmarking is my jam in process refinement.
I’m going to say this again because it bears repeating: with this approach, I can get in front of hallucination and bad premise. I am not 100% convinced that skills and other approaches short-circuit bad pre-training yet. Humans are better decision makers in the long run. (Mostly)
Like any good process, refinement, integration, and improvement are built in.
There is a trade-off in that it has taken me hours of decisions to get to really good, well-thought-out, well-architected code that isn’t just “hello world.” Vibe coding is faster, but sloppier.
And I accept this time-box as the price of good code. The kicker, however, is that every decision now becomes a well-coded result, not bureaucratic overhead to a deliverable. I’m not fighting group-think that blocks technical delivery. I can backfill iteration into the group as they have more concrete things to look at and refine.
Hand to heart, most of the post-generative discussion will be “this blue is not blue enough” or “can we fix this part of the UI,” instead of months of blocking discussions, as people work better with a problem they can fix than one they can only hypothesize about. That is a huge win and lets folks focus on the strengths of the group.
Drawbacks
Decision. Fatigue. What I’ve been noticing with this approach is that the model will ask many clarifying questions and is very detail-oriented. This detail, however, is like having to be a full product manager, engineer, architect, data scientist, and executive all in one. This is way more than full stack; this is full enterprise. There is so much to unpack in each question-and-response session to get requirements enshrined that the process is a grind.
I nitpick a lot (I play 3D chess), and this level of detail is challenging even for me. Sometimes I have to just walk away for a minute. Decisions need a lot of care and feeding as you work through them, so if you don’t have a good handle on your goals, this process will be a huge grind.
Also, generative code still messes up. It just depends on how big the task is. Models are getting better, and we will continue to see improvements. “Skills” are on my radar to keep evaluating. My hope is that we can create better-refined coding models that don’t have such bad pre-training. I just don’t see why the market can’t promote better coding standards as part of the default.
I swear, every time I see “‘ZodIssue’ is deprecated” it drives me nuts, because it makes me not trust the model. Then I start wondering what else in my codebase is out of date. And so forth, and so on. Also, it is just ridiculous that I can’t convince a model that the code it made five minutes ago, which we are now changing, does not need backwards compatibility. (Although 4.6 has finally stopped saying “perfect” so much.)
Take Aways
You may already be using an approach like this, or be further along than I am. And for my friends who do not really know how process improvement works, this blog is really about how to improve process. I see folks at all skill levels using generative code, and my gut tells me that all skill levels in development and engineering can benefit from the SDLC and better process.
With this process, what I am noticing is that the model is more thoughtful and, in general, interrogates me for answers. It thinks of stuff I don’t think about, and I think about things it can’t see or imagine. The symbiosis is the key.
What I am learning is that, much like the way we’ve done things in the past, the cycle of development should not change. I wrote a paper in college about the use of CGI, where I predicted that practical effects and CGI would co-exist in different ways: sometimes practical effects are better than CGI, and vice versa. When I wrote that thesis, CGI was very heavy in almost every aspect of film, and now folks really want a balance of the two, because practical effects just look better in some ways.
The same still holds true here: we use practical effects (the human) to be there in the correct ways, and the model (the CGI) to be there as a supporting player.
Personally, I’m finding this solution to be much better, breaking out of that “uncanny valley” of “technically true” code into “yes, this works and makes sense.”
That’s sanity coding: everyone needs a lot of sanity checks.
Softly Spoken Magic Spells
Here are the custom agents that I am using; please adapt them to your needs.


