Many years ago I had a thought as I was falling asleep: "Could I pick apart a bunch of images to recreate a particular target?" This thought led to one of my most successful projects yet, pixel-matcher. Once I had it in a good state, I posted it to Reddit, where it made the front page and sparked quite a bit of discussion.
One comment that jumped out was from someone who had recreated the effect in ShaderToy. To my astonishment, this in-browser implementation ran in real-time with multiple video feeds. In contrast, my program took seconds per frame, many thousands of times slower.
I immediately got to work implementing the algorithm in OpenCL to parallelize the work and see if I could get close to real-time. This wasn't a huge undertaking; I've done a bit of GLSL work in the past, so I was familiar with the implementation. However, once I ran it, the results were completely different, and not just that, they were utterly boring! Whereas my CPU native version would twist and pulse as it made its way to the final image, this new version was bizarrely utilitarian. It grabbed exactly the pixels it needed, only ever working towards the optimum solution.
After a huge amount of debugging, I finally found the issue, or rather, lack of issue! It turned out that the CPU native version calculated the distances using unsigned 8-bit integers, because of how I had initialized the NumPy array. 8 bits gets you 0-255, so if a difference came out negative it would underflow, and if it went over 255 it would overflow, wrapping around modulo 256 either way.
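To illustrate the wraparound (this is a minimal sketch, not the actual pixel-matcher code; the pixel values are made up for the example):

```python
import numpy as np

# Per-channel difference between a target pixel and a candidate pixel,
# both stored as uint8, as happens with a NumPy array of dtype=np.uint8.
target = np.array([10, 200, 128], dtype=np.uint8)
candidate = np.array([250, 100, 128], dtype=np.uint8)

# uint8 arithmetic wraps modulo 256: 10 - 250 = -240, which wraps to 16.
wrapped = target - candidate
print(wrapped)  # [ 16 100   0]

# The "correct" signed difference requires widening the dtype first.
signed = target.astype(np.int16) - candidate.astype(np.int16)
print(signed)  # [-240  100    0]
```

A distance built from the wrapped values can make a visually distant pixel look close (and vice versa), which is exactly the kind of misjudgment that sends the search on those twisting, pulsing detours instead of straight to the optimum.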
With this knowledge in hand, I intentionally reintroduced the overflow and underflow bug, and all of a sudden, everything was back to how it was before. Pulsing, twisting, ripping and rending, constantly evolving, and ever hypnotizing.