The Failure Logs

AI can generate perfect code, but it can't tell you why it broke. Here are some real problems I've faced and how I systematically dismantled them.

CASE #01

The Ghost of the Tab Filter

The Problem

Users' filter state reset every time they switched tabs in the history dashboard.

The Fix

Discovered that the state was tied to a component that unmounted on every tab switch, so it was destroyed and recreated each time. Moved the state up to a parent provider with memoized selectors.
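A minimal sketch of the idea, not the production code. It assumes a plain store object held by the parent (in React it would be exposed through a context provider); DashboardStore, memoizeSelector, and the Filters/HistoryEntry shapes are hypothetical names for illustration.

```typescript
// Hypothetical shapes for the history dashboard; names are illustrative.
type Status = "success" | "failed";
interface Filters { status: Status | "all"; search: string }
interface HistoryEntry { id: number; status: Status; title: string }
interface DashboardState { filters: Filters; entries: HistoryEntry[] }

// Lives above the tab components, so unmounting a tab cannot reset it.
class DashboardStore {
  private state: DashboardState;
  private listeners = new Set<() => void>();

  constructor(initial: DashboardState) {
    this.state = initial;
  }

  getState(): DashboardState {
    return this.state;
  }

  // Immutable update: a new state object lets selectors compare by reference.
  setFilters(patch: Partial<Filters>): void {
    this.state = { ...this.state, filters: { ...this.state.filters, ...patch } };
    this.listeners.forEach((listener) => listener());
  }

  // Returns an unsubscribe function for the component's cleanup step.
  subscribe(listener: () => void): () => void {
    this.listeners.add(listener);
    return () => {
      this.listeners.delete(listener);
    };
  }
}

// Recomputes only when the state reference changes.
function memoizeSelector<S, R>(select: (state: S) => R): (state: S) => R {
  let cache: { state: S; result: R } | undefined;
  return (state: S): R => {
    if (cache !== undefined && cache.state === state) return cache.result;
    const result = select(state);
    cache = { state, result };
    return result;
  };
}

const selectVisibleEntries = memoizeSelector((state: DashboardState) =>
  state.entries.filter(
    (entry) =>
      (state.filters.status === "all" || entry.status === state.filters.status) &&
      entry.title.toLowerCase().includes(state.filters.search.toLowerCase())
  )
);
```

A tab subscribes on mount and unsubscribes on unmount; the filters stay in the store either way, and repeated renders with an unchanged state reuse the same filtered array.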

The Insight

Local state is great until it isn't. Architectural lift is better than a quick patch.

CASE #02

Map Performance Meltdown

The Problem

The interactive map became sluggish with more than 100 markers, dropping to 15 FPS.

The Fix

Debugged rendering cycles and found unnecessary re-renders in the marker component. Implemented custom clustering logic and React.memo.
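The clustering half of that fix can be sketched roughly as below. The actual algorithm is not described here, so this assumes the simplest variant, a fixed grid that merges markers sharing a cell; clusterMarkers, the Marker/Cluster shapes, and cellSizeDeg are hypothetical.

```typescript
interface Marker { id: string; lat: number; lng: number }
interface Cluster { lat: number; lng: number; count: number; markerIds: string[] }

// Grid clustering: markers that fall in the same lat/lng cell become one cluster.
function clusterMarkers(markers: Marker[], cellSizeDeg: number): Cluster[] {
  const cells = new Map<string, Marker[]>();
  for (const marker of markers) {
    const key = `${Math.floor(marker.lat / cellSizeDeg)}:${Math.floor(marker.lng / cellSizeDeg)}`;
    const bucket = cells.get(key);
    if (bucket) bucket.push(marker);
    else cells.set(key, [marker]);
  }
  // Each cluster sits at the average position of its members.
  return Array.from(cells.values()).map((group) => ({
    lat: group.reduce((sum, m) => sum + m.lat, 0) / group.length,
    lng: group.reduce((sum, m) => sum + m.lng, 0) / group.length,
    count: group.length,
    markerIds: group.map((m) => m.id),
  }));
}
```

The map then renders one element per cluster instead of one per marker, with the cell size chosen per zoom level. Wrapping the marker component in React.memo is the other half: clusters whose props did not change skip re-rendering.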

The Insight

Don't trust third-party components to be performant out of the box. Measure first.

CASE #03

The Phantom Scroll on First Load

The Problem

Goal: build a cinematic, animated portfolio with hero, particle background, and multiple glass sections. Complexity: heavy visuals, multiple sections, and scroll-triggered animations. Challenge: on first page load the scrollbar looked almost full and the first scroll 'stalled' as if the page was only one screen tall. After the first scroll, everything became normal.

The Fix

Root cause: unstable scroll height calculation on initial load when the scroll container is the default html/body. Fix: move scrolling to a dedicated container by setting overflow: hidden on body and making main the scroll container (height: 100vh/100svh with overflow-y: auto). Updated scroll listeners and observers to target main.
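The same setup expressed in script form, as a sketch only (the real fix may well live in a stylesheet). The structural types are there so the snippet runs without a DOM; in a browser, document.body and the main element satisfy them. makeMainScrollContainer and scrollProgress are hypothetical helper names.

```typescript
interface StyleTarget { style: { overflow: string; overflowY: string; height: string } }
interface ScrollMetrics { scrollTop: number; scrollHeight: number; clientHeight: number }

// Body stops scrolling; main becomes the single scroll container.
function makeMainScrollContainer(body: StyleTarget, main: StyleTarget): void {
  body.style.overflow = "hidden";
  main.style.height = "100vh"; // fallback value
  main.style.height = "100svh"; // browsers without svh support ignore this assignment
  main.style.overflowY = "auto";
}

// Scroll progress in [0, 1], read from the container instead of window.
function scrollProgress(container: ScrollMetrics): number {
  const maxScroll = container.scrollHeight - container.clientHeight;
  if (maxScroll <= 0) return 0;
  return Math.min(1, Math.max(0, container.scrollTop / maxScroll));
}
```

Scroll-triggered animations then attach their listeners to main and read progress from it, so nothing depends on the html/body scroll height during first paint.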

The Insight

When initial scroll feels 'stuck' and the scrollbar size is wrong, suspect the scroll container, not the sections. A dedicated scroll container can stabilize layout on first paint.

CASE #04

Jellyfish Particles That Felt Dead

The Problem

The goal was a living, jellyfish-like particle field for the hero. The first shader version used simple parametric waves, but the motion felt flat and the 'breathing' looked like static noise instead of a swimming organism.

The Fix

Rebuilt the motion model: separated bell pulse and tentacle waves, added layered flow fields and slow expansion cycles, then tuned amplitudes for water-like motion. The result is a more organic, fluid swim with visible waves.
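A rough sketch of that layering, written as a plain function rather than shader code. Every constant, frequency, and name here (jellyOffset, JellyParticle, the 0.35/0.55 blend range) is an illustrative assumption, not the tuned values from the hero.

```typescript
interface JellyParticle {
  along: number; // 0 = top of the bell, 1 = tentacle tip
  phase: number; // per-particle phase offset in radians
}

function smoothstep(edge0: number, edge1: number, x: number): number {
  const t = Math.min(1, Math.max(0, (x - edge0) / (edge1 - edge0)));
  return t * t * (3 - 2 * t);
}

// Displacement of one particle at a given time, in arbitrary scene units.
function jellyOffset(p: JellyParticle, timeSec: number): { radial: number; vertical: number } {
  const TAU = Math.PI * 2;
  // Anatomy split: weight is 0 inside the bell, 1 along the tentacles.
  const tentacle = smoothstep(0.35, 0.55, p.along);
  const bell = 1 - tentacle;

  // Three time scales layered on top of each other.
  const pulse = Math.sin(TAU * 0.5 * timeSec); // bell contraction, 2 s period
  const breathe = Math.sin(TAU * 0.05 * timeSec); // slow expansion cycle, 20 s period
  const drift = Math.sin(TAU * 0.13 * timeSec + 2 * p.phase); // slow flow layer

  // Traveling wave: the crest moves from the bell toward the tentacle tips.
  const wave = Math.sin(TAU * (1.5 * p.along - 0.8 * timeSec) + p.phase);

  const radial =
    bell * (0.15 * pulse + 0.05 * breathe) +
    tentacle * p.along * (0.1 * wave + 0.04 * drift);
  const vertical = bell * 0.08 * pulse + tentacle * 0.03 * drift;
  return { radial, vertical };
}
```

The point of the structure is that the bell and the tentacles never share one sine: the bell gets the pulse and the slow expansion, the tentacles get a wave that grows toward the tips, and the amplitudes are the knobs to tune.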

The Insight

If motion feels dead, it's usually the model, not the color. Separate the anatomy (bell vs tentacles), then layer multiple time scales.

CASE #05

The Double-Deduct Disaster

The Problem

An e-commerce client reported customers seeing negative stock after flash sales. Products were being oversold: 50 units in stock, but 63 orders went through. This was a production incident during peak hours.

The Fix

Race condition: multiple concurrent requests were reading the same stock count before any write completed. Implemented pessimistic locking with SELECT ... FOR UPDATE inside a transaction, added a Redis distributed lock as the first line of defense, and a database constraint as the final safety net.
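To show the shape of "lock, then read, then validate, then write", here is an in-memory sketch. It assumes a per-SKU promise chain as a stand-in for the row lock that SELECT ... FOR UPDATE takes in the real database; withRowLock, placeOrder, and the stock map are hypothetical, and nothing here talks to Redis or SQL.

```typescript
// In-memory inventory: 50 units of one hypothetical SKU.
const stock = new Map<string, number>([["sku-1", 50]]);

// One promise chain per SKU serializes the critical section,
// standing in for the row lock held by a transaction.
const lockTails = new Map<string, Promise<void>>();

async function withRowLock<T>(sku: string, critical: () => Promise<T>): Promise<T> {
  const previous = lockTails.get(sku) ?? Promise.resolve();
  let release: () => void = () => {};
  const held = new Promise<void>((resolve) => {
    release = () => resolve();
  });
  lockTails.set(sku, previous.then(() => held));
  await previous; // wait until every earlier holder has released
  try {
    return await critical();
  } finally {
    release();
  }
}

async function placeOrder(sku: string, quantity: number): Promise<boolean> {
  return withRowLock(sku, async () => {
    const available = stock.get(sku) ?? 0; // the read happens under the lock
    await Promise.resolve(); // simulated latency; other requests try to interleave here
    if (available < quantity) return false; // validate second
    stock.set(sku, available - quantity);
    return true;
  });
}
```

Firing 63 concurrent placeOrder calls against 50 units accepts 50 and rejects 13, because no request can read the count while another is between its read and its write. The database constraint still matters as a backstop for any code path that bypasses the lock.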

The Insight

Optimistic concurrency is fine for low traffic. For inventory during flash sales, assume the worst: lock first, validate second, and always have a database constraint as your last line of defense.

CASE #06

CRM Sync That Ate the Server

The Problem

The CRM dashboard took 45 seconds to load the customer list. Backend CPU spiked to 100% whenever the sales team opened the page. The database showed thousands of queries per request.

The Fix

Classic N+1 problem, multiplied: fetching 100 customers, then separately fetching orders, contacts, last activities, and tags for each one. Rewrote with Prisma include and proper relation loading. Replaced offset pagination with cursor pagination. Query count dropped from 500+ to 3.
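The difference is easiest to see by counting round trips. The sketch below uses a fake in-memory database with a query counter instead of Prisma, and only one relation (orders) to stay short; FakeDb, loadNaive, and loadBatched are hypothetical names, and the cursor is assumed to be the last seen customer id.

```typescript
interface Customer { id: number; name: string }
interface Order { id: number; customerId: number; total: number }

// In-memory stand-in for a database that counts round trips.
class FakeDb {
  queryCount = 0;
  constructor(private customers: Customer[], private orders: Order[]) {}

  // Cursor pagination: "rows after this id", assuming rows are sorted by id.
  findCustomers(afterId: number, take: number): Customer[] {
    this.queryCount += 1;
    return this.customers.filter((c) => c.id > afterId).slice(0, take);
  }

  findOrdersFor(customerId: number): Order[] {
    this.queryCount += 1;
    return this.orders.filter((o) => o.customerId === customerId);
  }

  // One query for many parents, the equivalent of WHERE customer_id IN (...).
  findOrdersForMany(customerIds: number[]): Order[] {
    this.queryCount += 1;
    const wanted = new Set(customerIds);
    return this.orders.filter((o) => wanted.has(o.customerId));
  }
}

// N+1: one query for the page, then one more per customer.
function loadNaive(db: FakeDb, afterId: number, take: number) {
  return db.findCustomers(afterId, take).map((customer) => ({
    customer,
    orders: db.findOrdersFor(customer.id),
  }));
}

// Batched: one query for the page, one for all of its orders, grouped in memory.
function loadBatched(db: FakeDb, afterId: number, take: number) {
  const customers = db.findCustomers(afterId, take);
  const byCustomer = new Map<number, Order[]>();
  for (const order of db.findOrdersForMany(customers.map((c) => c.id))) {
    const list = byCustomer.get(order.customerId);
    if (list) list.push(order);
    else byCustomer.set(order.customerId, [order]);
  }
  return customers.map((customer) => ({
    customer,
    orders: byCustomer.get(customer.id) ?? [],
  }));
}
```

For a page of 100 customers the naive loader issues 101 queries and the batched one issues 2, with the same result. An ORM's eager loading (Prisma include in the fix above) does this batching for you; each extra relation adds one query rather than N.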

The Insight

N+1 is sneaky in ORMs. Always check query logs in development. If you see the same query pattern repeating, you have N+1. Eager loading and pagination are not optional for lists.

CASE #07

The Midnight Token Massacre

The Problem

Production CRM went down at 2 AM. All API requests returning 401 Unauthorized. No code deployment, no infrastructure changes. Support tickets flooding in from APAC clients.

The Fix

JWT refresh token rotation was working, but the cron job that cleaned expired tokens had a bug: it was deleting ALL tokens older than 24 hours, including valid refresh tokens. Added token type check to the cleanup query. Implemented graceful token refresh with retry mechanism.
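A sketch of what a safer cleanup selection could look like, with the database query replaced by a pure function so it is easy to test. The StoredToken shape, selectTokensToDelete, and the maxDeleteRatio guard are assumptions for illustration; so is the rule that refresh tokens are only removed once their own expiry has passed.

```typescript
type TokenType = "access" | "refresh";

interface StoredToken {
  id: string;
  type: TokenType;
  issuedAt: number; // epoch milliseconds
  expiresAt: number; // epoch milliseconds
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Picks tokens for deletion; throws instead of returning a suspiciously large batch.
function selectTokensToDelete(
  tokens: StoredToken[],
  now: number,
  maxDeleteRatio: number = 0.5
): StoredToken[] {
  const doomed = tokens.filter(
    (token) =>
      // The type check the buggy job lacked: the 24-hour rule applies to access tokens only.
      (token.type === "access" && now - token.issuedAt > DAY_MS) ||
      // Assumed rule: any token past its own expiry can go, whatever its type.
      token.expiresAt <= now
  );
  // Safeguard against mass deletion: refuse to proceed and let a human look.
  if (tokens.length > 0 && doomed.length / tokens.length > maxDeleteRatio) {
    throw new Error(`cleanup would delete ${doomed.length} of ${tokens.length} tokens, aborting`);
  }
  return doomed;
}
```

The ratio guard is deliberately crude. Its job is to turn a silent mass deletion into a loud failure that pages someone before the 401s start.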

The Insight

Background jobs are invisible until they break everything. Always test cleanup/maintenance jobs with production-like data volumes. Add safeguards that prevent mass deletions.

CASE #08

Payment Webhook Hell

The Problem

E-commerce orders were stuck in 'pending' status. Customers had paid, but their orders were never confirmed. The payment gateway showed successful charges, but our system had no record of receiving the webhooks.

The Fix

Webhook endpoint was throwing 500 due to database timeout on order update. Gateway retried 3 times, then gave up. No logging of failed webhooks. Added: idempotency keys, webhook queue with BullMQ, dead letter queue for failed processing, and Slack alerts for stuck orders.
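The receiving side can be sketched as below: acknowledge fast, dedupe by event id, process later, and park what keeps failing. This is an in-memory illustration, not the BullMQ setup; WebhookInbox and its fields are hypothetical, the array stands in for the queue, and the Set stands in for a persistent idempotency-key store.

```typescript
interface WebhookEvent { id: string; orderId: string; type: string }
interface Job { event: WebhookEvent; attempts: number }

class WebhookInbox {
  private seenIds = new Set<string>(); // idempotency keys already accepted
  readonly queue: Job[] = [];
  readonly deadLetters: Job[] = [];

  // HTTP handler body: dedupe, enqueue, acknowledge. No database work here.
  receive(event: WebhookEvent): number {
    if (!this.seenIds.has(event.id)) {
      this.seenIds.add(event.id);
      this.queue.push({ event, attempts: 0 });
    }
    return 200; // duplicates are acknowledged too, so the gateway stops retrying
  }

  // Worker side: retry failed jobs, move exhausted ones to the dead letter list.
  drain(handle: (event: WebhookEvent) => void, maxAttempts: number = 3): void {
    let job = this.queue.shift();
    while (job !== undefined) {
      try {
        handle(job.event);
      } catch {
        job.attempts += 1;
        if (job.attempts >= maxAttempts) this.deadLetters.push(job);
        else this.queue.push(job);
      }
      job = this.queue.shift();
    }
  }
}
```

Because the handler only enqueues, a slow or timing-out order update can no longer turn into a 500 for the gateway. Anything that lands in deadLetters is what the alerting should watch.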

The Insight

From the sender's perspective, webhooks are close to fire-and-forget: the gateway retries a few times, then gives up. Your receiver must be bulletproof: fast acknowledgment, async processing, idempotency, and alerting. Never do heavy work in the webhook handler itself.

CASE #09

The Cache Stampede

The Problem

Every day at midnight, the e-commerce product catalog API would time out for 2-3 minutes. Redis showed cache misses spiking. Database connections were exhausted.

The Fix

The cache entries for all products were set to expire at the same time (midnight). When they expired, hundreds of concurrent requests hit the database simultaneously. Implemented cache stampede protection: random TTL jitter, a mutex lock for cache regeneration, and the stale-while-revalidate pattern.
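Two of those three pieces fit in a short sketch: TTL jitter and a regeneration lock. This assumes a single process, where the "mutex" is just a map of in-flight promises; across several servers the lock would have to live in Redis instead. jitteredTtl, singleFlight, and the 20% default spread are hypothetical, and stale-while-revalidate is not shown.

```typescript
// Spreads expirations over [base * (1 - spread), base * (1 + spread)).
// The random source is injectable so the behavior can be tested.
function jitteredTtl(
  baseSeconds: number,
  spread: number = 0.2,
  random: () => number = Math.random
): number {
  const factor = 1 + (random() * 2 - 1) * spread;
  return Math.round(baseSeconds * factor);
}

// In-process regeneration lock: concurrent callers for the same key
// share one load instead of each hitting the database.
const inFlight = new Map<string, Promise<unknown>>();

function singleFlight<T>(key: string, load: () => Promise<T>): Promise<T> {
  const pending = inFlight.get(key);
  if (pending !== undefined) return pending as Promise<T>;
  const started = load().then(
    (value) => {
      inFlight.delete(key); // allow a fresh load next time
      return value;
    },
    (error) => {
      inFlight.delete(key); // do not cache failures
      throw error;
    }
  );
  inFlight.set(key, started);
  return started;
}
```

On a cache miss, the request path would call singleFlight with the product key and a loader that queries the database, then write the result back with jitteredTtl as the expiry. A hundred simultaneous misses for one product become one query, and the next expiry no longer lines up with every other product's.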

The Insight

Uniform cache expiration is a time bomb. Add randomness to TTL. For hot data, use background refresh before expiration rather than on-demand regeneration.