Whose Productivity Are You Measuring?

May 26, 2026

Header illustration: AI-generated with ChatGPT from an original concept and prompt by Vivek Vaidya.

Two trends in AI-assisted software engineering are getting loud right now. Both claim to measure productivity. Neither one does. And we've seen this movie before.

Trend one: tokenmaxxing.

Engineering leaders have started measuring developer productivity by tokens consumed. More tokens, more productivity. Some are building reward systems around it. Leaderboards. Internal dashboards. Quarterly reviews.

This is going to age badly.

Tokens are an input. Software that ships, works, and doesn’t fall over in production is the output. Measuring tokens is measuring fuel burn, not distance traveled. A developer burning through millions of tokens to produce a feature that could have been a fifty-line config change is not winning. They are losing expensively.
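To make the fuel-burn point concrete, here is a minimal sketch with entirely made-up numbers. The `DevStats` type, the developer names, and the definition of a "shipped change" are all hypothetical, chosen only to show that a token leaderboard and a tokens-per-shipped-change view can rank the same team in opposite order.

```python
from dataclasses import dataclass


@dataclass
class DevStats:
    name: str
    tokens: int           # input: tokens consumed in the period
    changes_shipped: int  # output: changes that shipped and held up (hypothetical definition)


def rank_by_tokens(stats):
    """Leaderboard view: most tokens consumed first."""
    return [s.name for s in sorted(stats, key=lambda s: s.tokens, reverse=True)]


def rank_by_tokens_per_change(stats):
    """Efficiency view: fewest tokens per shipped change first."""
    shipped = [s for s in stats if s.changes_shipped > 0]
    return [s.name for s in sorted(shipped, key=lambda s: s.tokens / s.changes_shipped)]


# Illustrative numbers only, not real data.
team = [
    DevStats("dev_a", 9_000_000, 2),
    DevStats("dev_b", 1_200_000, 6),
    DevStats("dev_c", 4_000_000, 4),
]

leaderboard = rank_by_tokens(team)
efficiency = rank_by_tokens_per_change(team)
```

In this toy team the leaderboard winner is last on the efficiency view. Tokens per shipped change is not offered as the right metric either; it can be gamed like any other. The only point is that the input ranking says nothing about the output.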

We have been here before. The lines-of-code era taught us this exact lesson. More LOC was supposed to mean more productivity. It meant the opposite. Verbose, unmaintainable, bloated code that future engineers had to dig out from under. The metric got gamed. The metric always gets gamed.

Now ask yourself who is promoting tokenmaxxing.

Model providers sell tokens. Coding tool vendors monetize throughput. The influencer economy that sits on top of both gets paid to evangelize usage. Of course the loudest voices are telling you to use more. No one is running ads that say you are using too much. The incentive structure produces the messaging it produces, and we are all watching it happen in real time.

I am not saying these tools are bad. I am saying the people telling you how to measure their value are not neutral parties.

The substantive critique stands on its own. Tokens are not the output. But the incentives explain why a bad metric got popular so fast.

Trend two: parallel sessions.

The other thing I keep seeing. Developers spinning up ten, twenty, thirty parallel Claude Code sessions. Building features in parallel. Fixing bugs in parallel. All on the same product. All at once.

The pitch is seductive. If one agent can do the work of one developer, then ten agents can do the work of ten developers. Linear scaling. Free productivity.

It does not work that way. It cannot work that way.

Multi-tasking is hard for humans. Context switching is hard. These are not new findings. Thirty years of cognitive science has been clear on this. AI does not change the underlying problem because the bottleneck was never the typing. The bottleneck is the human who has to understand what each agent is doing, review the output, integrate it with the rest of the system, and decide whether it is correct.

Here is the paradox. The more complex the task, the more attention you have to pay to what the AI is doing. Even with the best CLAUDE.md. Even with the best skills, the best agents, the best harnesses. Complex changes require judgment, and judgment does not parallelize.

If the sessions are small, contained, independent changes, fine. Refactor this file. Update those tests. Bump that dependency. Parallel works for embarrassingly parallel problems. It always has.

But the moment two sessions touch overlapping logic, or the moment a change requires understanding how it interacts with the rest of the system, the developer becomes the bottleneck. And a developer juggling ten contexts is a developer doing none of them well.

I have watched smart engineers do this. They open ten sessions. They feel productive. The dashboard lights up. The tokens flow. And then they spend the next three days untangling the merge conflicts, the half-broken abstractions, the agents that confidently went down the wrong path because no one was paying attention.

The output is not ten features. The output is one feature, eventually, after a lot of rework.
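A toy model shows why the scaling is not linear. It assumes agents run fully in parallel, one human reviews every change serially, and each extra concurrent session adds a fixed context-switching overhead to every review. The function names and all three parameters (`agent_minutes`, `review_minutes`, `switch_penalty`) are assumptions picked for illustration, not measurements.

```python
def wall_clock_minutes(n_sessions, agent_minutes=30.0, review_minutes=20.0,
                       switch_penalty=0.1):
    """Toy estimate of wall-clock time to land n_sessions changes.

    Agents work in parallel, so agent time is paid once.
    The human reviews serially, and each additional concurrent session
    inflates the review time of every change by switch_penalty.
    """
    review_each = review_minutes * (1 + switch_penalty * (n_sessions - 1))
    return agent_minutes + n_sessions * review_each


def speedup(n_sessions, **params):
    """Speedup versus doing the same changes one session at a time."""
    serial = n_sessions * wall_clock_minutes(1, **params)
    return serial / wall_clock_minutes(n_sessions, **params)


ten = speedup(10)
thirty = speedup(30)
```

With these made-up defaults, ten sessions come out at roughly 1.2x the one-at-a-time pace, nowhere near 10x, and thirty sessions come out slower than working serially. Different parameters move the numbers, but as long as review stays serial the human term dominates.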

The pattern.

Both trends substitute an easy-to-measure input for a hard-to-measure output. Both flatter the people doing the measuring.

Tokens consumed are not a measure of productivity. Parallel sessions are not a measure of throughput. Lines of code were never a measure of quality, and we figured that out, eventually, after a decade of damage.

The engineers who get the most out of AI are the ones who know that out-of-the-box Claude Code will generate an N+1 query pattern. Who know it reaches for JSONB when a normalized schema is the right call. Who know it picks synchronous architecture when the problem screams for async.
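For readers who have not hit it, this is what an N+1 query pattern looks like, sketched with Python's built-in `sqlite3` module. The `authors` and `posts` tables and both function names are hypothetical; this is an illustration of the pattern in general, not output from any particular tool.

```python
import sqlite3


def build_db():
    """Create an in-memory database with 5 authors and 20 posts."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    """)
    conn.executemany("INSERT INTO authors VALUES (?, ?)",
                     [(i, f"author{i}") for i in range(1, 6)])
    conn.executemany("INSERT INTO posts VALUES (?, ?, ?)",
                     [(i, (i % 5) + 1, f"post{i}") for i in range(1, 21)])
    return conn


def titles_n_plus_one(conn):
    """One query for the authors, then one more query per author."""
    queries = 1
    authors = conn.execute("SELECT id, name FROM authors ORDER BY id").fetchall()
    result = {}
    for author_id, name in authors:
        rows = conn.execute(
            "SELECT title FROM posts WHERE author_id = ? ORDER BY id",
            (author_id,),
        ).fetchall()
        queries += 1
        result[name] = [title for (title,) in rows]
    return result, queries


def titles_single_join(conn):
    """Same data in a single round trip using a JOIN."""
    rows = conn.execute("""
        SELECT a.name, p.title
        FROM authors a JOIN posts p ON p.author_id = a.id
        ORDER BY a.id, p.id
    """).fetchall()
    result = {}
    for name, title in rows:
        result.setdefault(name, []).append(title)
    return result, 1


conn = build_db()
slow_result, slow_queries = titles_n_plus_one(conn)
fast_result, fast_queries = titles_single_join(conn)
```

Both functions return the same data here, which is why the slow version passes review if nobody is looking at the query count. With five authors the difference is six queries versus one; with fifty thousand it is an outage. Note that the inner `JOIN` drops authors with no posts, so a real replacement might need a `LEFT JOIN`.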

I am all for tokenmaxxing. But somewhere along the way we forgot value for money.

None of that shows up in a token bill.
