Reading about system design is one thing. Performing under interview pressure is another. This lesson is a complete mock interview walkthrough, presented as a dialogue between an interviewer and a candidate. We’ll design a Notification System from scratch, and I’ll provide commentary on what the candidate does well and where they could improve.
Read this like a script. Pay attention not just to what the candidate says, but to how they say it — the structure, the check-ins, the way they make trade-offs explicit.
Phase 1: Requirements (5 minutes)
Interviewer: Design a notification system.
Candidate: Before I start designing, I’d like to understand the scope. When you say notification system, are we talking about a platform that sends notifications across multiple channels — like push notifications, email, and SMS? Or is there a specific channel you’d like me to focus on?
Interviewer: Let’s support all three: push, email, and SMS.
Candidate: Got it. Let me ask a few more questions to scope this properly.
First, who triggers notifications? Is this driven by API calls from other services, or also by user-facing events like “someone liked your post”?
Interviewer: Both. Other services call the notification API, and some notifications are triggered by system events.
Candidate: Second, do users have preferences? For example, can a user say “I only want push notifications, not email”?
Interviewer: Yes, users can set per-channel preferences.
Candidate: Third, what about templates? Can notification content be templated, like “Hi {name}, your order {order_id} has shipped”?
Interviewer: Yes, we need template support.
Candidate: Last, what’s the scale? How many notifications per day?
Interviewer: Let’s say 10 million notifications per day, with peaks up to 100 million during events like Black Friday.
Candidate: OK, let me summarize what I’m hearing:
- Functional: Send notifications via push (iOS/Android), email, and SMS. Support user channel preferences, templates with variable substitution, and delivery tracking.
- Non-functional: Handle 10M notifications/day (steady) to 100M (peak). Push should be sub-second. At-least-once delivery — it’s better to send a duplicate than to lose a notification. We need to be robust against vendor failures (FCM going down, SES throttling).
Does that capture it?
Interviewer: That’s a great summary. I’d also add that we need some form of rate limiting — we don’t want to spam users with 50 push notifications in an hour.
Candidate: Good call. I’ll include per-user rate limiting. Let me move to the high-level design.
Commentary: This is a strong requirements phase. The candidate asked targeted questions (4 questions, not 15), confirmed scope, summarized back for alignment, and the interviewer even added a requirement they hadn’t planned to mention. Total time: about 4 minutes.
Phase 2: High-Level Design (10 minutes)
Candidate: Let me draw the high-level architecture, and then I’ll walk you through the data flow.
Here’s the overall architecture. Let me walk through each component:
1. Triggers (left side). Notifications enter the system in three ways:
- API requests from other services (e.g., the order service calls POST /notifications when an order ships)
- Event bus events (e.g., a “user_signup” event triggers a welcome email)
- Scheduled jobs (e.g., “send weekly digest every Monday at 9am”)
2. Notification Service. This is the brain. It receives a notification request and does three things:
- Validates the request (does this user exist? is the payload well-formed?)
- Checks the user’s preferences (did they opt out of email?)
- Renders the template (replaces
{name}with “Alice”)
3. Priority Queue. After validation, the notification is enqueued in a priority queue. I’d use Kafka with separate topics per priority level:
- P0 (Critical): OTP codes, security alerts — must be delivered in seconds
- P1 (High): Order updates, payment confirmations
- P2 (Normal): Social notifications (likes, comments)
- P3 (Low): Marketing, weekly digests
The priority queue is key because it decouples ingestion from delivery. If our SMS vendor goes down, notifications queue up instead of being lost.
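To make the routing concrete, here’s a rough sketch of a producer publishing to per-priority topics (the topic names and the kafka-python client are illustrative assumptions, not a prescribed implementation):

```python
import json

from kafka import KafkaProducer  # assumes the kafka-python package

# Illustrative topic names; any per-priority naming scheme works
PRIORITY_TOPICS = {
    0: "notifications.p0",  # OTP codes, security alerts
    1: "notifications.p1",  # order updates, payment confirmations
    2: "notifications.p2",  # social notifications
    3: "notifications.p3",  # marketing, digests
}

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def enqueue(notification: dict) -> None:
    topic = PRIORITY_TOPICS[notification["priority"]]
    # Keying by user_id keeps a user's notifications in one partition,
    # which is what gives us per-user ordering
    producer.send(topic, value=notification, key=notification["user_id"].encode())
```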
4. Channel Dispatchers. Separate consumer groups for each channel:
- Push: FCM for Android, APNs for iOS
- Email: Amazon SES or SendGrid
- SMS: Twilio
Each dispatcher has its own retry logic because vendor APIs behave differently. FCM has batching, SES has rate limits, Twilio has per-number throttling.
5. Delivery Tracker. Receives callbacks from vendors (delivered, bounced, opened, clicked) and updates delivery status in a database.
6. Analytics. Aggregates delivery metrics for dashboards: delivery rate, open rate, bounce rate per notification type.
Interviewer: Why Kafka for the queue? Why not SQS?
Candidate: Good question. I chose Kafka because of three properties:
- Ordering within partitions: Notifications for the same user arrive in order, which matters for coherent notification sequences
- Replay capability: If a consumer crashes, it can re-read messages from its last offset
- High throughput: Kafka handles 100K+ messages/second easily, which we need for peak traffic
The trade-off is operational complexity. SQS would be simpler to operate but doesn’t give us ordering guarantees or replay. If we’re in AWS and want to minimize ops burden, SQS with FIFO queues is a reasonable alternative.
Interviewer: Makes sense. What about the rate limiter you mentioned?
Candidate: Right. I’d implement per-user rate limiting in the Notification Service, before the queue. Something like:
```python
class UserRateLimiter:
    # Per-channel limits within a 1-hour window
    LIMITS = {
        "push": 10,   # Max 10 push per hour
        "email": 5,   # Max 5 emails per hour
        "sms": 3,     # Max 3 SMS per hour
    }

    def __init__(self, redis_client):
        self.redis = redis_client

    def is_allowed(self, user_id: str, channel: str) -> bool:
        key = f"ratelimit:{user_id}:{channel}"
        current = self.redis.incr(key)
        if current == 1:
            self.redis.expire(key, 3600)  # 1 hour window
        return current <= self.LIMITS.get(channel, 10)
```

This is a Redis fixed-window counter (a true sliding window would be more accurate at window boundaries, at the cost of extra bookkeeping). If a user has already received 10 push notifications this hour, additional ones are dropped (or downgraded to email). Critical notifications (P0) bypass the rate limiter — you always want to deliver OTP codes.
Interviewer: Good. Let’s dive deeper.
Commentary: The candidate covered the full architecture in about 8 minutes, explained the data flow clearly, and answered a probing question about technology choice with a real trade-off. They also proactively mentioned the rate limiter before being asked again. The code example for rate limiting was a nice touch — concrete enough to show understanding, short enough not to waste time.
Phase 3: Deep Dive (20 minutes)
Interviewer: I’d like to go deeper on a few things. First, how do you guarantee at-least-once delivery? What if the push vendor says “success” but the notification never actually reaches the device?
Candidate: Great question. At-least-once delivery requires us to track the full lifecycle of every notification. Let me break this down.
Every notification gets a unique ID when it enters the system. This ID flows through the entire pipeline:
```python
from datetime import datetime

class Notification:
    id: str           # UUID, globally unique
    user_id: str
    channel: str      # push, email, sms
    priority: int
    status: str       # created, queued, dispatched, delivered, failed
    created_at: datetime
    dispatched_at: datetime | None
    delivered_at: datetime | None
    retry_count: int
```

For push notifications, FCM and APNs give us delivery receipts. FCM returns a message ID on success. But “success” from FCM means “accepted by Google’s servers,” not “displayed on the user’s phone.” The device might be offline. So we track two states:
- dispatched = vendor accepted it
- delivered = device acknowledged it (via FCM delivery receipts or APNs feedback)
If we don’t get a delivery confirmation within a timeout window (say 30 minutes), we can retry via a different channel. For example, if push wasn’t confirmed, we fall back to email.
For email, SES provides delivery notifications via SNS webhooks: delivered, bounced, or complained. We subscribe to these and update the notification status.
For SMS, Twilio provides status callbacks: queued, sent, delivered, failed. We register a webhook URL when sending the SMS.
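A minimal sketch of what the Twilio status webhook could look like (FastAPI and the delivery_store interface are assumptions for illustration; Twilio POSTs MessageSid and MessageStatus as form fields):

```python
from fastapi import FastAPI, Form  # assumes a FastAPI service; any framework works

app = FastAPI()

@app.post("/webhooks/twilio/status")
async def twilio_status_callback(
    MessageSid: str = Form(...),     # Twilio's ID for the SMS
    MessageStatus: str = Form(...),  # queued, sent, delivered, failed, undelivered
):
    # delivery_store is a hypothetical interface over the status table.
    # We saved the vendor message ID at dispatch time, so the callback
    # can be mapped back to our notification_id.
    notification_id = delivery_store.lookup_by_vendor_id(MessageSid)
    delivery_store.update_status(notification_id, MessageStatus)
    return {"ok": True}
```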
The key principle: the notification stays in dispatched status until we get a positive confirmation. A background job scans for notifications stuck in dispatched for more than their timeout window and triggers a retry or fallback.
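The sweep itself can be simple; a hypothetical sketch (delivery_store and fallback_handler are illustrative interfaces):

```python
from datetime import datetime, timedelta

DISPATCH_TIMEOUT = timedelta(minutes=30)  # the timeout window from above

def scan_stuck_notifications(delivery_store, fallback_handler):
    # Run every few minutes by a scheduler (cron, Celery beat, etc.)
    cutoff = datetime.utcnow() - DISPATCH_TIMEOUT
    stuck = delivery_store.find_by_status("dispatched", dispatched_before=cutoff)
    for notification in stuck:
        # Retry on the same channel or escalate, per the fallback policy
        fallback_handler.retry_or_fallback(notification)
```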
Interviewer: What about idempotency? You said at-least-once, so there might be duplicates. How do you prevent sending the same email twice?
Candidate: Each notification has a unique notification_id. Before dispatching, the channel worker checks:
```python
class ChannelWorker:
    def dispatch_notification(self, notification: Notification):
        # Idempotency check
        if self.delivery_store.was_dispatched(notification.id, notification.channel):
            return  # Already sent on this channel, skip
        try:
            vendor_response = self.send(notification)
            self.delivery_store.mark_dispatched(
                notification.id,
                notification.channel,
                vendor_message_id=vendor_response.id,
            )
        except VendorException:
            self.retry_queue.enqueue(notification, retry_count=notification.retry_count + 1)
```

The was_dispatched check uses a database unique constraint on (notification_id, channel). Even if two workers somehow process the same message (consumer rebalance, Kafka offset reset), only one will succeed in marking it dispatched.
There’s still a small window for duplicates: if the vendor accepts the notification but our mark_dispatched call fails. In practice, this is rare and acceptable. Users would rather get a duplicate shipping notification than miss it entirely.
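To make the unique-constraint mechanics concrete, here’s a minimal sketch using SQLite for illustration (the real system would use its production database, but the race-resolution logic is the same):

```python
import sqlite3

conn = sqlite3.connect("notifications.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS dispatches (
        notification_id   TEXT NOT NULL,
        channel           TEXT NOT NULL,
        vendor_message_id TEXT,
        PRIMARY KEY (notification_id, channel)  -- the uniqueness guarantee
    )
""")

def mark_dispatched(notification_id: str, channel: str, vendor_message_id: str) -> bool:
    # Returns True if this worker recorded the dispatch first,
    # False if another worker already did
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO dispatches VALUES (?, ?, ?)",
                (notification_id, channel, vendor_message_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```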
Interviewer: Good analysis. Now let’s talk about the template engine. How does that work?
Candidate: The template engine has two parts: template storage and rendering.
Template storage: Templates are stored in a database or configuration service. Each template has:
```python
class NotificationTemplate:
    template_id: str
    channel: str          # Different template per channel
    subject: str          # For email: "Your order {{order_id}} has shipped"
    body: str             # "Hi {{user_name}}, your order..."
    variables: list[str]  # ["user_name", "order_id", "tracking_url"]
    version: int          # Template versioning for A/B tests
```

Rendering happens in the Notification Service before queueing. We use a simple template engine like Handlebars or Jinja2:
```python
import json

def substitute(text: str, context: dict) -> str:
    # Simple {{var}} substitution for plain-text channels
    for key, value in context.items():
        text = text.replace(f"{{{{{key}}}}}", str(value))
    return text

# Method on the Notification Service
def render_notification(self, template_id: str, channel: str, context: dict) -> str:
    template = self.template_store.get(template_id, channel)
    # Validate all required variables are present
    missing = set(template.variables) - set(context.keys())
    if missing:
        raise MissingVariablesError(f"Missing: {missing}")
    # Render per channel: HTML escaping for email, plain text for SMS
    if channel == "email":
        # from_string compiles the stored template source; jinja_env is
        # assumed to be configured with autoescape enabled
        rendered = jinja_env.from_string(template.body).render(**context)
    elif channel == "sms":
        rendered = substitute(template.body, context)
    elif channel == "push":
        rendered = json.dumps({
            "title": substitute(template.subject, context),
            "body": substitute(template.body, context),
        })
    else:
        raise ValueError(f"Unknown channel: {channel}")
    return rendered
```

We render before queueing so that the dispatchers receive fully-formed messages. This means the queue messages are self-contained — dispatchers don’t need access to the template store or user data store, which simplifies their failure modes.
The trade-off is that rendered messages are larger than template references, which means higher queue storage. But at 10M messages/day with average 1KB per rendered message, that’s only 10GB/day, which is nothing for Kafka.
Interviewer: What happens when a vendor goes down? Say FCM is having an outage.
Candidate: This is where the retry mechanism becomes critical.
We use exponential backoff with jitter:
```python
import random

class RetryPolicy:
    MAX_RETRIES = 3
    BASE_DELAY = 1.0   # seconds
    MAX_DELAY = 30.0   # seconds
    JITTER_MAX = 0.5   # seconds

    def get_delay(self, retry_count: int) -> float:
        delay = min(
            self.BASE_DELAY * (2 ** retry_count),
            self.MAX_DELAY,
        )
        jitter = random.uniform(0, self.JITTER_MAX)
        return delay + jitter

    def should_retry(self, retry_count: int, error: Exception) -> bool:
        if retry_count >= self.MAX_RETRIES:
            return False
        # Don't retry client errors (bad token, invalid payload)
        if isinstance(error, ClientError):
            return False
        # Do retry server errors (5xx, timeout, rate limited)
        if isinstance(error, (ServerError, TimeoutError, RateLimitError)):
            return True
        return False
```

The key distinction: we only retry on transient failures (server errors, timeouts, rate limiting). Client errors (invalid device token, malformed email address) go straight to the dead letter queue because retrying won’t fix them.
When retries are exhausted, the notification lands in a dead letter queue (DLQ). We have:
- Automated monitoring: If the DLQ depth exceeds a threshold (say 1000 messages in 5 minutes), we page the on-call engineer via PagerDuty (see the sketch after this list)
- Root cause buckets: DLQ messages are tagged with failure reasons. If 90% of failures are “FCM unavailable,” that’s a vendor outage. If they’re “invalid device token,” that’s a data quality issue.
- Manual replay: After the root cause is fixed, an engineer can replay messages from the DLQ back into the main queue
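A toy version of the depth check (dlq and pager are hypothetical interfaces; in practice you’d alert from your metrics system rather than poll in application code):

```python
# Illustrative threshold, matching the example above
DLQ_DEPTH_THRESHOLD = 1000

def check_dlq_depth(dlq, pager):
    depth = dlq.depth()  # e.g., consumer lag on the DLQ topic
    if depth > DLQ_DEPTH_THRESHOLD:
        pager.trigger(
            summary=f"Notification DLQ depth {depth} exceeds {DLQ_DEPTH_THRESHOLD}",
            severity="high",
        )
```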
For extended vendor outages, we implement channel fallback. If push fails three times, we escalate to email. If email also fails, we escalate to SMS (SMS is the most expensive channel, so it’s the last resort).
```python
FALLBACK_CHAIN = {
    "push": ["email", "sms"],
    "email": ["push", "sms"],
    "sms": ["push", "email"],
}

# Method on the channel dispatcher
def handle_permanent_failure(self, notification: Notification):
    current_channel = notification.channel
    fallbacks = FALLBACK_CHAIN.get(current_channel, [])
    for fallback_channel in fallbacks:
        if self.user_preferences.allows(notification.user_id, fallback_channel):
            notification.channel = fallback_channel
            notification.retry_count = 0
            self.queue.enqueue(notification)
            return
    # No fallback available, send to DLQ
    self.dlq.enqueue(notification)
```

Interviewer: How do you handle the scheduling component? You mentioned cron-triggered notifications earlier.
Candidate: For scheduled notifications, we have a Scheduler Service that runs alongside the Notification Service. It has two modes:
One-time scheduled: “Send this notification at 3 PM tomorrow.” We store the notification in a scheduled_notifications table with a send_at timestamp. A poller job runs every minute, picks up notifications where send_at <= now(), and submits them to the Notification Service.
Recurring: “Send weekly digest every Monday at 9 AM.” We use a cron-like scheduler. Each recurring notification has a cron expression. The scheduler evaluates all active cron expressions every minute and generates notification requests for any that match.
The poller needs to be idempotent: if two poller instances pick up the same scheduled notification, the idempotency check in the Notification Service prevents duplicate sends.
For time zones, we store the user’s timezone and convert send_at to UTC at scheduling time. The weekly digest for a user in Tokyo (Monday 9 AM JST) and a user in New York (Monday 9 AM EST) will be submitted at different UTC times.
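A sketch of the one-time poller under these assumptions (claim_due and the store interface are illustrative; the atomic claim is what keeps two poller instances from double-submitting):

```python
from datetime import datetime, timezone

def poll_scheduled_notifications(store, notification_service):
    # Run once per minute
    now = datetime.now(timezone.utc)  # send_at is stored in UTC
    for scheduled in store.claim_due(send_at_before=now):
        # claim_due atomically marks rows as picked up; the Notification
        # Service's idempotency check is the backstop if a claim races
        notification_service.submit(scheduled.to_notification_request())
```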
Interviewer: Good. Let’s move to wrap-up.
Commentary: The candidate covered three deep dives with strong technical depth: delivery guarantees with idempotency, the template engine with a clear rendering strategy, and the retry mechanism with fallback chains. Each one included code examples and explicit trade-offs. The fallback chain was a particularly strong signal — it shows the candidate thinks about real failure scenarios.
Phase 4: Wrap-Up (5 minutes)
Candidate: Let me summarize the key design decisions and trade-offs, and then mention what I’d add with more time.
Key decisions:
- Kafka for the queue because we need ordering, replay capability, and high throughput. Trade-off: operational complexity.
- Render before queueing so dispatchers are stateless. Trade-off: larger messages.
- Per-user rate limiting in Redis to prevent notification fatigue. P0 notifications bypass the limiter.
- Exponential backoff + DLQ for retry, with channel fallback for permanent vendor failures.
- Idempotency via unique notification IDs to achieve at-least-once delivery without duplicates.
What I’d add with more time:
- A/B testing for templates: Route 10% of users to a new template version, measure engagement
- Notification grouping/batching: Instead of 5 separate “someone liked your post” push notifications, bundle them into “5 people liked your post”
- Quiet hours: Don’t send push notifications at 3 AM in the user’s timezone
- Analytics pipeline: Funnel analysis (sent → delivered → opened → clicked) per notification type
- Multi-tenancy: If this becomes a platform serving multiple products, we’d need tenant isolation, per-tenant rate limits, and separate Kafka topics
Interviewer: Great job. This was a thorough design.
Commentary: The wrap-up was clean and organized. Summarizing trade-offs shows the interviewer you were deliberate about every choice. The extensions list demonstrates breadth — you know there’s more to build, and you can prioritize what matters most.
Debrief: What Went Well and What Could Improve
Let’s evaluate this performance against the standard system design interview dimensions.
What Went Well
Requirements gathering (9/10). The candidate asked focused, high-signal questions and confirmed understanding with a summary. The interviewer even volunteered an additional requirement (rate limiting) because the conversation flowed naturally. One improvement: could have explicitly stated latency targets instead of just mentioning “sub-second for push.”
High-level design (9/10). Clean architecture with clear data flow. Every component had a stated purpose. The Kafka vs SQS discussion showed real understanding, not just name-dropping. The rate limiter code was a nice proof that the candidate can translate design into implementation.
Deep dives (8/10). Strong coverage of delivery guarantees, template rendering, and retry logic. The idempotency discussion was particularly good — acknowledging the duplicate window and explaining why it’s acceptable shows mature engineering judgment. The channel fallback chain was a standout moment.
Could improve: the candidate could have discussed exactly-once semantics more carefully. The idempotency check has a race condition if two workers read the “not dispatched” status simultaneously. A distributed lock or database transaction would be needed.
Communication (9/10). Regular check-ins, clear structure (“let me walk you through each component”), and natural responses to probing questions. The candidate never monologued for more than 2-3 minutes.
Trade-offs (9/10). Almost every decision included an explicit trade-off. “I chose X because Y, accepting trade-off Z” was a consistent pattern. The Kafka vs SQS discussion and the render-before-queue decision both demonstrated this well.
Time management (8/10). Good allocation: roughly 4 minutes on requirements, 8 minutes on HLD, 20 minutes on deep dives, 5 minutes on wrap-up. Could have spent slightly less time on the template engine (which is straightforward) and more on the scheduling component (which has interesting distributed systems challenges).
What Could Improve
- Exactly-once delivery nuance. The candidate mentioned at-least-once delivery but didn’t fully address the race condition in the idempotency check. In a real interview, the interviewer might probe harder on this: “What if two Kafka consumers process the same notification simultaneously?” The answer involves either a database transaction with a unique constraint or a distributed lock.
- Monitoring and observability. The candidate mentioned analytics dashboards but didn’t discuss how to monitor the health of the notification pipeline itself. Key metrics like queue depth, dispatch latency, vendor error rates, and DLQ growth rate are critical for operating this system. A brief mention of “I’d add CloudWatch alarms on queue depth and vendor error rates” would have strengthened the design.
- Cost awareness. SMS is expensive ($0.01+ per message), push is free, email is cheap ($0.0001 per message). The candidate mentioned SMS as “last resort” but could have explicitly framed channel fallback in terms of cost optimization, not just reliability.
- Database schema. The candidate showed entity models but didn’t discuss the database choice for notification state tracking. A time-series database or Cassandra might be better than Postgres for write-heavy notification status updates at 10M+/day.
Overall Score
This is a strong pass at the senior engineer level. The candidate demonstrated:
- Clear problem decomposition
- Practical architecture without over-engineering
- Deep understanding of failure modes
- Consistent trade-off analysis
- Good collaboration with the interviewer
The areas for improvement are minor and wouldn’t prevent a hire decision. They represent growth areas for a principal engineer level, where the bar for distributed systems nuance and operational thinking is higher.
How to Use This Walkthrough
- Read it once to absorb the overall flow and pacing
- Re-read with a timer — notice how each phase fits within its time box
- Practice the same problem with a friend or recording, then compare your approach to this one
- Try a different problem (design a chat system, design a ride-sharing service) using the same structure: requirements, HLD, deep dives, wrap-up
- Focus on your weakest phase. If you always run out of time, practice time management. If your trade-offs are vague, practice the “X because Y, accepting Z” pattern.
The difference between a candidate who barely passes and one who crushes the interview isn’t raw knowledge — it’s the ability to communicate a coherent design under time pressure, with clear reasoning at every step. That’s a skill you build through practice, not reading.
Go practice.
