commit 606c0b102ba5f6c441f3d4e02d1f6a581f26a358 · danabra.mov/overreacted

+241

public/how-to-fix-any-bug/index.md

---
title: How to Fix Any Bug
date: '2025-10-21'
spoiler: The joys of vibecoding.
---

I've been vibecoding a little app, and a few days ago I ran into a bug.

The bug went something like this. Imagine a route in a webapp. That route shows a sequence of steps--essentially, cards. Each card has a button that scrolls down to the next card. Everything works great. However, as soon as I tried to *also* call the server from that button, scrolling would no longer work. It would jitter and break.

So, adding a remote call somehow broke scrolling.

I wasn't sure what was causing the bug. Clearly, the newly added remote server call (which I was doing via [React Router actions](https://reactrouter.com/start/framework/actions)) was somehow interfering with my [`scrollIntoView`](https://developer.mozilla.org/en-US/docs/Web/API/Element/scrollIntoView) call. But I couldn't tell how. I initially thought the problem was that React Router re-renders my page (an action causes a data refetch), but in principle there's no reason why a refetch would interfere with an ongoing scroll. The server was returning the same items, so it shouldn't have changed anything.

In React, a re-render [should](/writing-resilient-components/#principle-2-always-be-ready-to-render) always be safe to do. Something else was wrong--either in my code, or in React Router, or in React, or even in the browser itself.

How do I fix this bug?

Can I get Claude to fix it?

---

## Step 0: Just Fix It

I told Claude to fix the problem.

Claude tried a few things. It rewrote conditions in the `useEffect` that contains the `scrollIntoView` call and said that the bug was fixed. But that didn't help. It then tried changing `smooth` scrolling to `instant`, and a few other things.

Each time, Claude would proudly declare that the problem was solved.

But it was not!

The bug was still there.

This might sound like I'm complaining about Claude, but really the impetus for writing this article is that I see human engineers (including myself) make the same mistakes. So I wanted to document the process I usually follow to fix bugs.

Why was Claude repeatedly wrong?

Claude was repeatedly wrong because **it didn't have a repro**.

---

## Step 1: Find a Repro

A *repro*, or a reproducing case, is a sequence of instructions that, when followed, gives you a reliable way to tell whether the bug is still happening. It's "the test". A repro says what to do, what's *expected* to happen, and what is *actually* happening.

From my perspective, I already had a good repro:

1. Click the button.
2. The *expected* behavior was scrolling down, but the *actual* behavior was scroll jitter.

Even better, the bug happened every time.

If my repro was unreliable (e.g. if it happened in just 30% of attempts), I'd either have to gradually remove different sources of uncertainty (e.g. recording the network and mocking it in future attempts), or live with the productivity hit of having to test every potential fix many more times. But luckily, my repro was reliable.

And yet, to Claude, my repro essentially didn't exist.

The problem is that "scrolling jitters" from my repro didn't mean anything to Claude. Claude doesn't have eyes or other ways to perceive the jitter directly. So Claude was essentially operating *without* a repro--it tried to fix the bug, but didn't do anything specific to verify it. That is too common, even with the best of us.

In this case, Claude *couldn't* have followed my repro exactly since it couldn't "look" at the screen (taking a few screenshots wouldn't capture it). So my first repro was unsuitable if I wanted Claude to fix it.
This might seem like a problem with Claude, but it's actually not uncommon when working with other people--sometimes a bug only happens on one machine, or for a specific user, or with specific settings.

Luckily, there is a trick. You can trade a repro for *another* repro as long as you're able to convince yourself that it'll help you make progress on the original problem.

Here's how you can change your repro, and some things to watch out for.

---

## Step 2: Narrow the Repro

Changing the repro you're working with is always a risk. The risk is that the new repro has nothing to do with your original bug, and solving it is a waste of time.

However, sometimes changing a repro is unavoidable (Claude can't look at my screen, so I have to come up with something else). And sometimes it is hugely beneficial for iteration (say, a repro that takes ten seconds is vastly more valuable than a repro that takes ten minutes). So learning to change repros is important.

Ideally, you'd trade a repro for a simpler, narrower, more direct repro.

Here's the idea I suggested to Claude:

1. Measure the document scroll position.
2. Click the button.
3. Measure the document scroll position again.
4. The *expected* behavior is that there is a delta, the *actual* behavior is there's none.

My thinking was that this seems roughly equivalent to the problem I saw with my own eyes. Although this repro doesn't capture the jitter, failing to scroll down is likely related. Even if it's not the *only* problem, it's worth fixing this on its own.

Claude added some `console.log`s, opened the page via Playwright MCP, and clicked around. Indeed, the scroll position was *not* changing despite the button click.

Okay, so now Claude *is* able to verify the bug exists!

Are we done with finding the repro?

Actually, we're not!

One common pitfall with narrowing a repro is that you *think* you found a good one, but actually your new repro captures some unrelated problem that presents in a similar way. **This is a very expensive mistake to make** because you can spend hours testing solutions to a different problem than the one you wanted to solve.

For example, it's possible that Claude simply was reading the scroll position *too early*, and even if the bug *was* fixed, it would still "see" the position unchanging. That would be very misleading--even for the right fix, the test would say it's still buggy, and Claude would miss the right fix! That happens to human engineers too.

**This is why, whenever you narrow a repro, you should also confirm that a *positive* result ("everything works") is still _possible_ to obtain with the new repro.**

This is easier to explain by an example.

I told Claude to comment out the network call (which originally surfaced the bug). If the new repro ("measure scroll, hit the button, measure scroll again") truly captures the bug I wanted to fix ("scroll jitters on click"), we should expect a change I've *already verified* to fix that bug (commenting out the action call) to *also* fix the behavior in the new repro (scroll positions should now be different).

And that's what happened! Indeed, temporarily commenting out the network call *also* fixed the test Claude was performing--the scroll positions *were* now different.

At this point, it's worth trying to change the code a few times in either direction (comment it in, comment it out) to verify that each edit predicts the new repro result. (It's also worth doing other edits to rule out that every second edit works.)

The scrolling measurement still seemed correlated with that network call.

This still doesn't mean that the new repro represents the same problem. It could be something unrelated, or just a part of it.
**It's always a risk to change the repro.** However, a network call having something to do with the scroll position is at least *just as suspicious* as the original problem (scroll jitter) and would have been worth solving on its own. This is a good enough reason to continue with the new repro.

---

## Step 3: Remove Everything Else

I created a new branch and asked Claude to follow this workflow:

1. Run the repro to verify the bug is present.
2. Remove something from the relevant code (remove components, remove event handlers, simplify conditions, remove styles, remove imports, etc).
3. Run the repro again to verify if the bug is still present.
4. If the bug is still there, commit the changes.
5. If the bug is *not* there, write down a theory about what might have "solved it", then **reset to last commit** and try deleting a smaller chunk.

I was about to step out so I told Claude to keep at it and to not rest until it's narrowed down the repro to something that can't be further reduced--a React component with no extra libraries (not even React Router) and minimal logic.

When I came back, Claude had created a few reproducing cases for me, but frustratingly, none of them were exhibiting the bug:

```
⏺ I've done extensive investigation and created three progressively more
accurate minimal reproductions. All of them work correctly, but the real
page fails. I've tested:

- Plain React state updates
- React Router revalidation
- Component remounting
- New object references
- Synchronous loaders
- Various dependency arrays
- Multiple renders during scroll

Everything I can think of has been tested and ruled out. The bug
consistently reproduces in the real app but not in any repro. This
suggests there's something very specific about the page
environment that I haven't captured.

The workaround (deferring network call with setTimeout) works reliably,
which confirms that the issue is the timing between the
mutation/revalidation and the scroll. But I cannot determine WHY React
Router's revalidation breaks scrollIntoView in the real app when it
doesn't in the repros.

I've committed all the repros and investigation notes for future
reference.
```

Does this mean narrowing down the bug doesn't always work?

No.

It means Claude failed to follow my instructions. But the *way* it failed to follow them is interesting because people (me included) often make the same mistake.

While Claude was simplifying the code, it started forming *theories*. Maybe this effect is buggy. Maybe there's something to do with remounting. Maybe React is doing something weird. And it started *testing* those theories, creating isolated reproduction cases that focused on them--and seeing if they exhibit the bug.

Creating theories and testing them is great! We should definitely do that.

But have a look at my instructions again:

1. Run the repro to verify the bug is present.
2. Remove something from the relevant code (remove components, remove event handlers, simplify conditions, remove styles, remove imports, etc).
3. Run the repro again to verify if the bug is still present.
4. If the bug is still there, commit the changes.
5. If the bug is *not* there, write down a theory about what might have "solved it", then **reset to last commit** and try deleting a smaller chunk.

There's something specific I was trying to get it to do.
What we're trying to ensure is that **at every point in time, we have a checkpoint where the bug still *is* happening, and with every step, we're reducing the surface area for that bug.**

Claude got too carried away testing its own theories and ended up with a bunch of test cases that don't actually exhibit the bug. Again, it's not a bad idea to test new theories, but if they fail, the correct thing to do is to come back to the original case (which still exhibits the bug!) and to keep removing things until we find the cause.

This reminds me of the concept of *well-founded recursion*. Consider this attempt to implement a `fib(n)` function that's supposed to calculate [Fibonacci numbers](https://en.wikipedia.org/wiki/Fibonacci_sequence):

```js
function fib(n) {
  if (n <= 1) {
    return n;
  } else {
    return fib(n) + fib(n - 1);
  }
}
```

Actually, this function is buggy--it will hang forever. By mistake, I wrote `fib(n)` instead of `fib(n - 2)`, and so `fib(n)` will call `fib(n)`, which will call `fib(n)`, and so on. It will never get out of recursion because `n` doesn't ever "get smaller".

Languages that understand *well-founded recursion* won't let me make this mistake. For example, in Lean, [this would have been a type error](https://live.lean-lang.org/?from=lean#codez=CYUwZgBGCWBGEAoB2EBcEByBDALgSjU1zQF4AoCCaSFQEyIIBGCHACxCQsog8pABsAziE6UY8FAGoocRCgC0EAEx4yZIA):

```lean
def fib (n : Nat) : Nat := /- error: fail to show termination for fib -/
  if n ≤ 1 then
    n
  else
    fib n + fib (n - 2)
```

Lean knows that `n` "isn't getting smaller" ([see here more precisely](https://lean-lang.org/doc/reference/latest/Definitions/Recursive-Definitions/#well-founded-recursion)) so it knows that this recursion will hang forever. It doesn't "get closer with time".
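
For contrast, here is a sketch of the fixed JavaScript version, with `fib(n - 2)` in place of `fib(n)` as described above. Both recursive calls now receive a strictly smaller argument, which is the property the buggy version lacks.

```js
function fib(n) {
  if (n <= 1) {
    return n;
  } else {
    // Both arguments are smaller than n, so every chain of calls
    // eventually reaches the base case above.
    return fib(n - 2) + fib(n - 1);
  }
}

console.log(fib(10)); // 55
```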

This is not a Lean tutorial but I hope you'll forgive this frivolous metaphor.

I think it's the same with the process of reducing the repro case. You want to know that *you're always, **always** making incremental progress* and the repro keeps getting smaller. This means that you must stay disciplined and remove pieces bit by bit, only committing as long as the bug still persists. At some point, you're bound to run out of things to remove, which would either present you with a mistake in your code, or a mistake in pieces you can't further reduce (e.g. React).

Repeat until it works.

---

## Step 4: Find the Root Cause

Claude didn't end up solving this one, but it got me very close.

After I told it to actually follow my instructions, and to only *remove* things, it removed enough code that the problem was contained in a single file. I moved that file outside the router, and suddenly the same code worked. Then I moved it back into the router, and it broke again. Then I made it a *top-level* route and it worked.

Something was breaking when it was nested inside the root layout.

My root layout looked like this:

```js
import { Outlet, ScrollRestoration } from "react-router-dom";

export function RootLayout() {
  return (
    <div>
      <ScrollRestoration />
      <Outlet />
    </div>
  );
}
```

Aha. It turns out, [there used to be a bug](https://github.com/remix-run/react-router/issues/13672) (already fixed in June) that caused React Router's `ScrollRestoration` to activate on every revalidation rather than on every route change. Since my network call (via an action) revalidated the route, it triggered `ScrollRestoration` during `scrollIntoView`, causing the jitter.
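
The removal loop from Step 3 can be sketched in a few lines of code. This is a minimal sketch under stated assumptions, not the code used in this investigation: `reduceRepro`, `pieces`, and `bugStillHappens` are hypothetical names, the list of pieces stands in for components, handlers, and imports you could delete, and the predicate stands in for running the repro.

```js
// Hypothetical sketch of the "remove, re-run the repro, commit or reset" loop.
// `pieces` stands in for removable parts of the code.
// `bugStillHappens(candidate)` stands in for running the repro on those parts.
function reduceRepro(pieces, bugStillHappens) {
  if (!bugStillHappens(pieces)) {
    throw new Error("Start from a state where the bug is present.");
  }
  let current = pieces; // last "commit": the bug is known to happen here
  let chunk = Math.max(1, Math.floor(current.length / 2));
  while (true) {
    let removedSomething = false;
    let start = 0;
    while (start < current.length) {
      // Try deleting `chunk` pieces beginning at `start`.
      const candidate = current.filter((_, i) => i < start || i >= start + chunk);
      if (bugStillHappens(candidate)) {
        current = candidate; // still buggy and smaller: "commit"
        removedSomething = true;
      } else {
        start += chunk; // bug vanished: "reset" and keep these pieces
      }
    }
    if (removedSomething) continue; // another pass at the same chunk size
    if (chunk === 1) return current; // no single piece can be removed anymore
    chunk = Math.ceil(chunk / 2); // otherwise try deleting smaller chunks
  }
}

// Illustrative only: pretend the bug needs both of these pieces to be present.
const minimal = reduceRepro(
  ["Header", "ScrollRestoration", "Sidebar", "actionCall", "Footer", "Styles"],
  (parts) => parts.includes("ScrollRestoration") && parts.includes("actionCall")
);
console.log(minimal); // [ 'ScrollRestoration', 'actionCall' ]
```

The important part is the invariant rather than the chunking strategy: `current` only ever changes to something smaller that still exhibits the bug, so the loop can't wander off into cases that don't reproduce it.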

**This exact workflow--removing things one by one while ensuring the bug is still present--has saved my ass many times.** (I once spent a week deleting half of Facebook's React tree chasing down a bug. The final repro was ~50 lines of code.) I don't know any other method that's so effective after you've run out of theories.

If I had been setting up the project myself, I'd have used the latest version of React Router and wouldn't have run into this bug. But the project was set up by Claude, which for some inexplicable reason decided I should use an old version of a core dependency.

Ah well!

The joys of vibecoding.