Fixing Alerting at Checkly

Product Designer (sole designer, ~50-person company) · 1 PM, 1 Frontend Engineer · 2024–2025

The brief

Every minute of confusion in your alert configuration is a minute added to your MTTR

Checkly is a monitoring platform used by engineering teams at Vercel, 1Password, LinkedIn, and Axon. Its core promise is precise alerting: you get notified exactly how and when you want, so you can detect issues fast and bring down your mean time to repair. High-impact outages cost a median of $2 million per hour.

When groups were introduced as an organisational layer (account → groups → checks), the alert configuration model wasn't rethought to match. Every added level of abstraction multiplies the number of states a user has to reason about, and the UX hadn't caught up.

The problem

The one place meant to build confidence was a source of confusion and anxiety

The Subscriptions modal on the Alert Channels page is the one place in Checkly where you can see everything an alert channel is subscribed to. It should be the clearest, most confidence-building screen in the product. Three things were broken:

The status indicators were alarming rather than informative. A group without override settings showed a bright "INACTIVE" badge, which read as though something were broken even when the individual checks inside were perfectly configured. A group with overrides showed a subtle info icon, easy to miss and carrying the opposite meaning. Two completely different treatments for two sides of the same coin.

You couldn't see inside groups. Groups appeared as collapsed, unexpandable rows. To find out which checks lived inside, you had to navigate away and come back. This created a disorienting loop that never resolved into clarity.

The modal was neither fully informative nor fully actionable. It existed in an uncomfortable middle ground. For a monitoring tool whose users need precision and control, that's the worst possible state.

My process

I didn't start at the UI. I started with what the team already knew.

Documentation and feedback first. The problem was already on my PM's backlog. Users had submitted complaints about the confusion around groups and alerts. I read through all existing documentation and user feedback to understand what had been identified and what hadn't.

Testing it myself. I walked through the alert configuration flow as a user would. The mental model didn't click. The relationship between groups, checks, override settings, and alert channels was genuinely hard to hold in your head. That was a signal, not a failure of my understanding.

Ad-hoc calls with the team. I had repeated conversations with my PM and engineer. During those calls, even the people who built it got confused as we talked through different states. That confirmed this wasn't just a new-user problem.

Mapping every possible state. I built out a FigJam board documenting every combination: groups with overrides, groups without, channels subscribed to groups that couldn't fire notifications. Walking the team through this map is how we found the biggest issue.

The insight

Users could do everything right and still never get notified

The critical discovery was what I started calling the "CLI trap." It was possible to subscribe an alert channel to a group that had no group-level alert settings configured. That channel would never receive a notification. The user had done everything right from their perspective: set up a channel, subscribed it to a group. But the system would stay silent.

This wasn't just a UI problem. The same trap existed in the CLI. A user could configure their entire alerting setup programmatically and still end up with channels subscribed to groups that would never notify them.
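To make the trap concrete, here is a minimal Python sketch of how such silent subscriptions could be detected in an existing configuration. The data model is hypothetical and heavily simplified: `Group`, `has_group_alert_settings`, and `silent_subscriptions` are illustrative names, not Checkly's schema or CLI constructs. It assumes only what is described above, namely that a channel subscribed to a group without group-level alert settings never receives a notification.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    """Hypothetical, simplified stand-in for a check group."""
    name: str
    # Assumption: True when the group defines its own (override) alert settings.
    has_group_alert_settings: bool
    subscribed_channels: set[str] = field(default_factory=set)


def silent_subscriptions(groups: list[Group]) -> list[tuple[str, str]]:
    """Return (channel, group) pairs that can never produce a notification."""
    return [
        (channel, group.name)
        for group in groups
        if not group.has_group_alert_settings
        for channel in sorted(group.subscribed_channels)
    ]


groups = [
    Group("checkout", has_group_alert_settings=True,
          subscribed_channels={"slack-oncall"}),
    Group("marketing-site", has_group_alert_settings=False,
          subscribed_channels={"email-team"}),
]

# prints [('email-team', 'marketing-site')]
print(silent_subscriptions(groups))
```

In this sketch the `email-team` subscription looks valid from the user's side, yet it is the one that stays silent.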

My position was simple: remove the trap entirely rather than design warnings around it.

What I proposed

Don't redesign around a broken interaction. Eliminate it.

Fix the status indicators. Stop using warning-like states as the default way to communicate group configuration. A user should be able to glance at this page and immediately understand what they're subscribed to and why.

Make groups expandable. Even groups with override settings should open to show the checks inside. You might not be able to subscribe to those checks individually, but seeing what's in the group turns the modal from an opaque list into an actual picture of your configuration.

Link group names to their settings page. A small change that removes the disorienting navigation loop.

Remove the CLI trap. Prevent users from subscribing to groups that can't fire notifications, both in the UI and the programmatic interface.
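As a rough illustration of that last proposal, the sketch below rejects a subscription at the moment it is created instead of letting it fail silently later. Everything here is an assumption for illustration: `Group`, `subscribe`, and `SilentSubscriptionError` are hypothetical names, and the real rule would live wherever the UI and the programmatic interface share validation.

```python
from dataclasses import dataclass, field


class SilentSubscriptionError(ValueError):
    """Raised when a subscription could never fire a notification."""


@dataclass
class Group:
    """Hypothetical, simplified stand-in for a check group."""
    name: str
    # Assumption: True when the group defines its own (override) alert settings.
    has_group_alert_settings: bool
    subscribed_channels: set[str] = field(default_factory=set)


def subscribe(channel: str, group: Group) -> None:
    """Subscribe a channel to a group, refusing combinations that stay silent."""
    if not group.has_group_alert_settings:
        raise SilentSubscriptionError(
            f"Group '{group.name}' has no group-level alert settings, "
            f"so '{channel}' would never be notified."
        )
    group.subscribed_channels.add(channel)


checkout = Group("checkout", has_group_alert_settings=True)
marketing = Group("marketing-site", has_group_alert_settings=False)

subscribe("slack-oncall", checkout)  # accepted

try:
    subscribe("email-team", marketing)  # rejected up front
except SilentSubscriptionError as err:
    print(f"Rejected: {err}")
```

Running the same check in both interfaces means a user finds out at configuration time, not during an outage.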

Future vision

One place for full visibility and full control

With my frontend engineer, I did a brief feasibility assessment on a more ambitious direction: consolidating all notification settings into this single Subscriptions modal. She confirmed it was technically achievable. The roadmap was to evolve this modal from a partially informative view into the single, powerful place where you understand and control your entire alert configuration. That's the experience that delivers on Checkly's core promise of reducing MTTR.

What this shows

Sitting with confusion long enough to find where the model breaks

  • Deep domain learning to understand a complex, technical system before proposing changes
  • Systematic state mapping that revealed a structural flaw the team hadn't identified
  • Scoped, shippable proposals grounded in a longer-term vision
  • A belief that when a fundamental interaction is broken, the best design move is often to remove the broken path entirely