The Claude upgrade that was retrograde
Claude 3.7 seemed like a winner, but then it wasn’t, and it likely cost Cursor millions.
Like many new AI developers, I was eager to try my IDE’s implementation of Claude 3.7 upon its release. By various benchmarks, it seemed like an unconditional upgrade.

Hype quickly built among AI editor makers eager to ship a better way to build apps.
Replit Agent v2 is the most fun I had making something. It rarely gets super stuck in loops and can always make a bit of progress, thanks to reasoning capabilities powered by Sonnet 3.7.
— Amjad Masad (@amasad)
6:10 PM • Feb 25, 2025
But soon, problems started to arise.
It became clear that Replit and Cursor hadn't properly tested their Claude 3.7 implementations before go-live, because the model simply isn't ready for prime time in AI code editing. As we've learned on the OpenAI side of things, a higher version number doesn't mean a better model, at least not for this use case.
Like many others, I had major issues with Agent v2. The core problem is this: Agent v2 no longer reliably listens to what you ask it to do. It's no longer the obedient developer that treats you as the product and engineering manager; it constantly does what it thinks is best, and terrible results follow.
In exercising agency to try to be more helpful, the model has become significantly less effective.
I've run into so many major issues recently that I've downgraded back to the v1 Agent (3.5):
- v2 (3.7) ignores direct and repeated requests to focus on your task, instead returning to old bugs or outstanding problems. It's very difficult to ensure compliance.
- v2 no longer allows distinct plans for specific issues or feature development. Every chat is a continuation of previous ones and constantly revisits earlier discussions when working on new issues.
- v2 has memory, but it has backfired because there's no way to direct it to focus on something new. Every new chat reverts to old issues.
- v2 has a longer context window, but it can't actually handle it, hallucinating after just a few steps of its new 120-step context window. You no longer get warnings for a long chat, and by the time it reaches the 120-step limit, it's already far gone.
- v2 solves errors on its own, but its self-reinforcing reasoning does so in error-prone ways. Imagine talking to yourself and confirming your own mistaken beliefs instead of asking your human partner for help.
- v2 ignores direct requests to change its stack if it thinks a point solution is better, even when adjusting the stack is the better overall fix.
- v2 struggles to restart when stuck. v1 will happily scrap a flawed codebase; v2 insists on trying to fix it.
Perhaps v3.7 really does beat the benchmarks, but its own agency works against it. It's useful for tackling tough one-off issues (which is how I now use it, via t3.chat), but for ongoing, iterative coding development, v3.5 is still the best.
I pleaded with Replit not to force all users onto v3.7, and hopefully someone heard me, because they did not force the upgrade. However, their founder keeps posting notes about how v3.7 is a “leapfrog” despite the ongoing issues. I am 100% sure that forcing an upgrade would have cost Replit millions in revenue from churn, just as it likely has Cursor.
The paradox of agency and the best of both worlds
This has highlighted the paradox of agency in the AI realm. Andrej Karpathy was right when he wrote about the tradeoff between agency and intelligence for humans in a world of AI:
Agency > Intelligence
I had this intuitively wrong for decades, I think due to a pervasive cultural veneration of intelligence, various entertainment/media, obsession with IQ etc. Agency is significantly more powerful and significantly more scarce. Are you hiring for agency? Are …
— Andrej Karpathy (@karpathy)
6:58 PM • Feb 24, 2025
It applies equally well to AI. If we want AI to have a meaningful impact on our lives, we should focus on connecting it to more tasks (e.g., embeddings, integrations, use cases) rather than making it "smarter". An AI with more agency is closer to AGI than one that beats benchmarks in theory but can do nothing for us in practice. Indeed, despite twenty years of ever more advanced AI models and vastly more advanced computing, nothing impressed humanity like the advent of ChatGPT, which simply gave AI the agency to share its thoughts with the user.
Ultimately, agency is a double-edged sword. With the right implementation and guardrails, it can enhance productivity and efficiency. Unchecked (as in the early implementations of Claude 3.7), it can be disastrous.
Still, there’s a way to achieve the best of both worlds.
I find that 3.7 excels at building infrastructure and squashing nasty back-end bugs, but ask it to create a simple portfolio grid or a stylistically minimal site and it can't. Meanwhile, 3.5 makes the nicest landing pages and display components and keeps file directories organized, but it can get stuck on thorny technical and infrastructure issues.
3.5 creates a plan of attack, checks in regularly, explains itself, and won't destroy your codebase without leaving a recent rollback point. 3.7 is determined to solve your problems and will iterate on its own to tackle hard ones, but it doesn't create plans, check in, explain itself, or provide good rollbacks as it unilaterally destroys your codebase.
I want both, with the ability to pick between them based on context; a rough sketch of what that could look like follows.
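To make that concrete, here is a minimal sketch of context-based model routing, assuming the Anthropic Python SDK and an ANTHROPIC_API_KEY in the environment. The task labels, the model chosen for each, and the guardrail prompt are my own assumptions, not anything Replit or Cursor actually ships:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Hypothetical routing table: 3.5 for iterative UI and product work,
# 3.7 for one-off hard debugging, mirroring the strengths described above.
MODEL_FOR_TASK = {
    "ui": "claude-3-5-sonnet-20241022",       # landing pages, display components
    "iterate": "claude-3-5-sonnet-20241022",  # plan-driven, obedient development
    "debug": "claude-3-7-sonnet-20250219",    # nasty back-end and infrastructure bugs
}

# Hypothetical guardrail to rein in 3.7's agency before it wanders off-task.
AGENCY_GUARDRAIL = (
    "Work only on the task in the latest message. Do not revisit earlier "
    "issues, change the stack, or touch files outside the task's scope "
    "without asking first."
)

def ask(task_type: str, prompt: str) -> str:
    """Send a prompt to whichever model suits the task, defaulting to 3.5."""
    model = MODEL_FOR_TASK.get(task_type, "claude-3-5-sonnet-20241022")
    response = client.messages.create(
        model=model,
        max_tokens=2048,
        system=AGENCY_GUARDRAIL if model.startswith("claude-3-7")
               else "You are a careful pair programmer who plans and checks in.",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text

# e.g. ask("debug", "This endpoint deadlocks under load; find the root cause.")
```

The same routing could just as easily sit behind a per-chat model picker in the editor itself, which is all I'm really asking Replit and Cursor for.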