The Notification Grind: Go, Node, and RabbitMQ in Multi-Tenant Hell
Systems Optimization & Performance Engineering

Jozef ehj · 8 min read

Remember that Tuesday, 2 AM? The one where a minor tenant's custom webhook brought down notification delivery for everyone else? Yeah, that one. It was a classic. A seemingly innocuous change, pushed with the usual 'looks fine in dev' shrug, and suddenly, hundreds of thousands of notifications were stuck in a RabbitMQ queue, slowly suffocating while our Node.js consumer choked on what looked suspiciously like a single, poorly behaved external API call. That’s when the real fun started, trying to untangle why a service designed for 'high throughput' was effectively operating on a prayer.

The premise was simple enough on paper: a multi-tenant notification service. Events hit an endpoint, get enriched, and then fan out to various tenants via email, webhook, push notification, whatever the customer_preferences table decided that day. Early on, the direct approach, sending inline as part of the API request, proved to be a liability. One blocked http.post and the whole API request was held up, burning precious database connections and user patience. So, naturally, we introduced RabbitMQ. Not because we loved managing another piece of infrastructure, but because the alternative was watching our database connection pool get vaporized by a single slow external provider. RabbitMQ, with its durability and fanout exchanges, became the unsung hero, at least initially. It handled the sheer volume, buffering the chaos until our workers could process it.

Then came the choice of worker. We started with Node.js, because, well, it was fast to build, await felt like magic, and everyone on the team knew JavaScript. The initial prototypes hummed along, consuming messages, hitting various third-party APIs, and doing their thing. Performance locally was great; on staging, it was acceptable. Production, however, has a funny way of introducing concepts like P99 latency and resource starvation into the conversation. For a while, things were 'fine'. Then, one tenant decided to integrate with a particularly sluggish, rate-limited, and occasionally offline webhook endpoint. Our Node.js consumer, designed around the event loop, suddenly started exhibiting… interesting behavior.

What happened was predictable if you’ve been around the block, but insidious if you bought into the 'JavaScript is non-blocking' narrative completely. While await makes asynchronous operations look synchronous, it also makes it very easy to process messages one at a time without noticing. A pending await doesn't block the thread, and the event loop itself stays free, but a consumer that awaits each message before picking up the next is stalled for as long as that call takes to resolve: say, a 10-second webhook timeout. Suddenly, a single, poorly performing message for one tenant could effectively backlog the entire worker process, delaying notifications for all other tenants. We’d see CPU utilization that looked like a flatline, yet message processing rate plummeted. The profiler would show the process spending an eternity idle, waiting for that one HTTP call, while other messages sat in the queue. You could throw more instances at it, sure, but that’s scaling horizontally to mask a bottleneck inside each process, which always feels a bit like vibe coding, not engineering.

This is where the idea of bringing Go into the mix started gaining traction. A lot of the initial resistance was the usual: 'another language to maintain,' 'steep learning curve,' 'do we really need this complexity?' But the recurring 2 AM pager incidents had a way of cutting through the rhetoric. We needed predictable concurrency, real concurrency, not just async I/O on a single thread. Go, with its goroutines and channels, offered exactly that. Instead of one event loop juggling all the balls, we could have a pool of goroutines, each a lightweight thread of execution, ready to pick up a message. If one goroutine got stuck waiting on a slow webhook, the others were unaffected. They’d keep chewing through the queue, isolating the problem to that specific message or tenant's goroutine, rather than the entire process.

The transition wasn’t entirely painless. Writing RabbitMQ consumers in Go involves a bit more boilerplate compared to a quick amqplib import and a for await loop in Node. Error handling is explicit, resource management is explicit, context cancellation needs to be wired up for graceful shutdowns. But what we got in return was immense. The Go consumers were lean, fast, and surprisingly resilient. Resource usage (CPU, memory) was consistently lower, even under heavy load. P99 latency became far more stable. Debugging slowdowns shifted from 'why is the event loop blocked?' to 'which goroutine is taking too long?', a much more actionable question with Go's built-in profiling tools. We could throttle specific tenants by limiting their goroutine allocation, a level of control that was a nightmare to implement reliably in Node.js without resorting to complex queue-per-tenant architectures or some deeply embedded scheduler hacks.

It wasn't about Go being inherently 'better' for everything. Node.js still has its place for certain kinds of services, especially those with minimal CPU-bound work and predictable I/O patterns. But for a critical, multi-tenant notification service, where unpredictable external API calls and varied processing demands per message are the norm, and where one tenant's misbehavior cannot be allowed to impact others, Go's explicit concurrency model and efficient resource utilization became the clear winner. We traded 'fast to prototype' for 'reliably performs at scale under duress.' And honestly, when you're staring down another 2 AM pager, that trade-off feels less like a choice and more like self-preservation. Turns out, not everything can be solved by throwing more await at the problem and hoping for the best.

So, yeah, that's how we ended up with a polyglot stack, not because some AI-generated architecture diagram told us to, but because production pain forced our hand. And now, Tuesdays at 2 AM are usually just quiet.

Frequently Asked Questions

Why use a message queue like RabbitMQ for notifications?

Initially, it was to decouple the sending of notifications from the user-facing API request. Direct sends would block the API, consuming resources and user patience if external services were slow or failed. RabbitMQ provides durability, buffering, and allows for asynchronous processing, preventing cascading failures and improving overall system resilience.

What were the main issues with Node.js for this type of service?

Node.js struggled with unpredictable external API call latencies inherent in a multi-tenant setup. Its single-threaded event loop is great for async I/O, but a consumer that awaits each message in turn stalls on a single slow `await` for one tenant, delaying all other messages for all other tenants. This led to high P99 latencies and unpredictable throughput despite low CPU usage.

How did Go solve these problems for the notification service?

Go's goroutines and channels provide true concurrency. Each message consumer could run in its own lightweight goroutine. If one goroutine got stuck waiting on a slow external call, other goroutines continued processing other messages independently. This isolated the problem to individual messages/tenants, ensured predictable performance, and offered better resource utilization and control, like tenant-specific throttling.

Jozef ehj, Studies and Development Engineer
