The Claude upgrade that was retrograde

Claude 3.7 seemed like a winner, but then it wasn't, and it likely cost Cursor millions.

Like many new AI developers, I was eager to try my IDE’s implementation of Claude 3.7 upon its release. By various benchmarks, it seemed like an unconditional upgrade.

The hype quickly built across AI editor makers eager to deploy a better way to build apps.

But soon, problems started to arise.

It became clear that Replit and Cursor hadn't adequately tested their Claude 3.7 implementations before go-live: the model simply isn't ready for prime time as an AI code editor. As we know from the OpenAI side of things, a higher version number doesn't mean a better model, at least for this use case.

Like many others, I had major issues with Agent v2. The core problem was this: Agent v2 no longer reliably listened to what you asked it to do. It's no longer the obedient developer that follows your lead as the product and engineering manager. It constantly does what it thinks is best, and terrible results follow.

By exercising agency in an attempt to be more helpful, the model has become significantly less effective.

I've had so many major issues recently that I've downgraded back to the v1 Agent (3.5):

  • v2 (3.7) will ignore direct and repeated requests to focus on your task, instead returning to earlier bugs or outstanding problems. It's very difficult to ensure compliance

  • v2 no longer allows distinct plans for specific issues or feature development. Every chat is a continuation of previous ones and constantly revisits earlier discussions when working on new issues

  • v2 has memory, but it has backfired because there's no way to direct it to focus on new things. Every new chat reverts to old issues

  • v2 has a longer context window, but it can't actually handle it and hallucinates after just a few steps of its new 120-step context window. You no longer get warnings for a big chat, and by the time it reaches the 120-step limit, it's already far gone

  • v2 solves errors on its own, but due to its self-reinforcing reasoning, it does so in error-prone ways. Imagine talking to yourself and confirming your own mistaken beliefs instead of asking your human partner for help

  • v2 ignores direct requests to change its stack if it thinks a point solution is better, even if it's better overall to adjust the stack

  • v2 struggles to restart when stuck. v1 will happily scrap a flawed codebase, whereas v2 insists on trying to fix it

Perhaps v3.7 is theoretically beating the benchmarks, but its own agency works against it. It is useful for tackling one-off tough issues (which is now how I use it, via t3.chat), but for ongoing, iterative coding development, v3.5 is still the best.

I pleaded with Replit not to force all users onto v3.7, and hopefully someone heard me, because they did not force the upgrade. However, their founder keeps posting notes about how v3.7 is a "leapfrog" despite the ongoing issues. I am 100% sure that forcing an upgrade would have cost Replit millions in revenue from churn, just as it likely has for Cursor.

The paradox of agency and the best of both worlds

This has highlighted the paradox of agency in the AI realm. Andrej Karpathy was right when he talked about the tradeoff between Agency and Intelligence for humanity in the world of AI.

The same tradeoff applies equally well to AI itself. If we want AI to have a meaningful impact on our lives, we should focus on connecting it to more tasks (e.g. embedding, integrations, use cases) rather than making it "smarter". AIs with more agency are closer to AGI than those that beat benchmarks in theory but can do nothing for us in practice. Indeed, AI models and ever more powerful computing have existed for decades, yet humanity was most impressed by the advent of ChatGPT, which simply gave AI the agency to share its thoughts with the user.

Ultimately, Agency is a double-edged sword. It can enhance productivity and efficiency with the right implementations and guardrails. When unchecked (as in early implementations of Claude v3.7), it can be disastrous.

Still, there’s a way to achieve the best of both worlds.

I find that 3.7 excels at creating infrastructure and solving nasty back-end bugs, but if you ask it to create a simple portfolio grid or a stylistically minimal site, it can't. Meanwhile, 3.5 will make the nicest landing pages, display components, and organized file directories, but can get stuck on thorny technical and infrastructure issues.

3.5 creates a plan of attack, checks in regularly, explains itself, and won't destroy your codebase without leaving a recent rollback point. 3.7 is determined to solve your problems and will iterate on its own to tackle hard ones, but it doesn't create plans, check in, explain itself, or provide good rollbacks as it unilaterally destroys your codebase.

I want both, with the ability to pick between them based on context.
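Until an editor offers that switch natively, a minimal sketch of what per-task routing could look like, using Python. The task categories and the routing table here are my own invention, drawn from the strengths described above; the model ID aliases follow Anthropic's published naming, but verify them against current documentation before relying on them:

```python
# Hypothetical per-task model router. The categories and routing choices
# are illustrative assumptions, not a documented feature of any editor.
ROUTES = {
    "ui": "claude-3-5-sonnet-latest",           # landing pages, display components
    "iterative": "claude-3-5-sonnet-latest",    # ongoing, plan-driven development
    "backend_bug": "claude-3-7-sonnet-latest",  # one-off nasty infrastructure bugs
}

# Default to the safer, more obedient model when the task type is unknown.
DEFAULT_MODEL = "claude-3-5-sonnet-latest"

def pick_model(task_type: str) -> str:
    """Return the model ID to use for a given task category."""
    return ROUTES.get(task_type, DEFAULT_MODEL)

# A real integration would pass the chosen ID to the API client, e.g.:
#   client.messages.create(model=pick_model("backend_bug"), ...)
print(pick_model("backend_bug"))
print(pick_model("ui"))
```

The point of the default is the same as my downgrade decision above: when in doubt, fall back to the model that listens.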