Leveling Up Customer Experience Monitoring at Twitch: The QoUX Journey

About the author: I’m Chirag Sachdeva, an aerospace engineer turned software engineer on Twitch’s Memberships team. I’ve been a proud member of the Twitch family for the past four years working on initiatives like subscriptions, gifting, turbo, and founder’s badge. For me, customer experience is always top of mind and is one of the driving factors behind my work here at Twitch.

In the fast-paced world of live streaming, the quality of user experience directly impacts viewer retention and satisfaction. I’m excited to share how Twitch transformed its monitoring capabilities through Quality of User Experience (QoUX), an initiative that has allowed us to understand user behavior, catch user-impacting issues faster and with lower effort, and respond quickly.

The Challenge

Despite robust backend monitoring, we faced a critical blind spot: understanding exactly how users experienced our platform in real time. Let me walk you through a specific example that illustrates this problem.

Imagine you’re a Twitch viewer who wants to gift a multi-month subscription to your favorite streamer. You click the gift button, select the recipient, but when you try to choose a 3-month subscription option, the button simply doesn’t respond. From your perspective, the feature is broken - but from our backend perspective, everything looked normal.

This exact scenario happened when we launched a gifting experiment that inadvertently disabled the multi-month selection button for a small subset of users. Our backend systems showed no errors because technically, no requests were failing - the button simply wasn’t sending requests at all. The code on the user’s device (the client) was preventing the interaction from even reaching our servers.

Traditional server-side metrics, while valuable, couldn’t tell us if a subscriber was struggling to complete a purchase or if a gifting flow was failing in a specific region. Prior to QoUX, our monitoring and alerting systems primarily focused on backend services and infrastructure, leaving gaps in detecting these types of client-side outages.

You might wonder: why didn’t we just do client-side monitoring from the start? The tradeoffs are significant:

Data Volume. Client-side monitoring generates vastly more data than backend monitoring. With millions of users, each performing dozens of interactions, the amount of telemetry can quickly become overwhelming.
Privacy Considerations. Client-side monitoring requires careful implementation to respect user privacy and comply with regulations.
Reliability Challenge. Backend systems operate in controlled environments, while client monitoring must work across countless device types, browsers, network conditions, and extensions that might interfere with tracking.
Development Complexity. Implementing robust client-side monitoring requires additional code in every feature, increasing development time and potential for bugs.
Monitoring Complexity. Transforming raw events into actionable insights isn’t trivial. Events need to be carefully aggregated and transformed to avoid both false positives and false negatives.

QoUX was born out of the need to monitor software closer to where customers interact with it, on the client side, while addressing these challenges. This initiative has become a core part of Twitch’s operational capabilities: it gives us a better understanding of user behavior, helps us identify user-impacting issues more readily, and shortens our response times.

What is QoUX?

Quality of User Experience (QoUX) is a configurable framework designed to monitor and analyze client-side experiences. It is built on three core principles:

Client-side First. Monitor where the user actually interacts with Twitch.
Real-time Visibility. Instant insights into user experiences.
Actionable Metrics. Data that directly informs product decisions.

How Client-Side Events Work

At the heart of QoUX are client-side events - specialized telemetry signals emitted directly from users’ devices. These events are:

Emitted By The Client Application. Whether it’s the Twitch web app, mobile app, or console app, the client code itself generates these events.
Triggered During Key User Interactions. Events fire when users take specific actions like clicking buttons, attempting purchases, or navigating between pages.
Enriched With Contextual Data. Each event contains details about the user’s device, region, and the specific component or feature being used.
Designed For Minimal Performance Impact. Events are batched and compressed to avoid affecting the user experience.
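
To make the batching and compression idea concrete, here is a minimal sketch. It is written in Python for consistency with the other examples in this post, whereas real client code would live in the web, mobile, or console app. The ClientEvent fields, the EventBatcher class, and the batch size of 20 are illustrative assumptions, not Twitch’s actual schema or implementation.

```python
import gzip
import json
import time
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class ClientEvent:
    """One user interaction; field names are illustrative assumptions."""
    button_type: str
    funnel_id: str
    event_type: str
    client: str
    region: str
    timestamp_ms: int = field(default_factory=lambda: int(time.time() * 1000))


class EventBatcher:
    """Buffers events and emits them as one gzip-compressed JSON payload."""

    def __init__(self, max_batch_size: int = 20) -> None:
        self.max_batch_size = max_batch_size
        self._buffer: List[ClientEvent] = []

    def add(self, event: ClientEvent) -> Optional[bytes]:
        """Queue an event; return a compressed payload once the batch is full."""
        self._buffer.append(event)
        if len(self._buffer) >= self.max_batch_size:
            return self.flush()
        return None

    def flush(self) -> Optional[bytes]:
        """Serialize and compress the buffered events, then clear the buffer."""
        if not self._buffer:
            return None
        body = json.dumps([asdict(e) for e in self._buffer]).encode("utf-8")
        self._buffer.clear()
        return gzip.compress(body)
```

A caller would send each non-None payload in a single request and call flush() when the page or app is about to close, so that many interactions cost one network round trip instead of one each.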

Let me walk you through a real example of how this works:

Raw event coming from the client: When a user clicks the multi-month subscription button, an event is fired containing:
- Button type: “multi_month_selection”
- Funnel ID: “gift_subscription_flow”
- Event type: “click”
- Client information: Browser/app version, device type
- Region: User’s geographic location
- Timestamp: When the interaction occurred
After transformation + aggregation: These raw events are:
- Streamed to Kinesis
- Processed by Lambda functions that filter, transform and aggregate them
- Grouped into 5-minute intervals (the interval varies based on the normal volume of that event type)
- Enriched with additional context from our systems
What shows up at the end: The processed data:
- Gets emitted to CloudWatch dashboards with alarms configured
- For this specific multi-month button event, our on-call engineers get paged if 5 data points fall below a threshold, or no data points arrive at all (meaning no clicks on that button), within the last 25 minutes
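
The walkthrough above can be approximated in a few lines. This is a sketch under assumptions, not the production logic: it buckets click timestamps into 5-minute intervals and pages when each of the last five complete buckets is either missing or below the threshold, which is one reading of the rule described above. The function names and the threshold value are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable

BUCKET_SECONDS = 300  # 5-minute intervals
WINDOW_BUCKETS = 5    # 5 buckets x 5 minutes = the 25-minute window


def bucket_counts(click_timestamps: Iterable[int]) -> Dict[int, int]:
    """Count clicks per bucket, keyed by the bucket start time in epoch seconds."""
    return dict(Counter(ts - ts % BUCKET_SECONDS for ts in click_timestamps))


def should_page(counts: Dict[int, int], now: int, threshold: int) -> bool:
    """True when each of the last 5 complete buckets is missing or under the threshold."""
    current_bucket = now - now % BUCKET_SECONDS
    window = [current_bucket - i * BUCKET_SECONDS for i in range(1, WINDOW_BUCKETS + 1)]
    # A bucket with no data counts as zero clicks, so it also breaches.
    return all(counts.get(start, 0) < threshold for start in window)
```

In CloudWatch terms, a similar rule could plausibly be expressed as an alarm with Period=300, EvaluationPeriods=5, DatapointsToAlarm=5, and TreatMissingData set to "breaching". The source does not say which settings are actually used.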

The Technology Behind QoUX

At the core of QoUX’s monitoring capabilities is a framework that facilitates the transformation, aggregation, and streaming of event metrics. But why build a custom framework instead of just sending teams raw events?

Advantages of the QoUX framework:

Data Volume Management. Raw client events would overwhelm teams with petabytes of data. QoUX intelligently aggregates and filters to provide meaningful signals.
Standardization. Ensures consistent event structure and processing across all Twitch features.
Operational Efficiency. Teams don’t need to build their own processing pipelines or monitoring systems.
Privacy by Design. Handles PII and sensitive data appropriately at the framework level.

Events are posted to a Kinesis stream, where they are transformed, modeled, filtered, and aggregated by Lambda functions before being published as custom metrics to CloudWatch. The framework combines event data into per-minute distributions with team-defined dimensions, then compresses and writes this data to CloudWatch Metrics, either directly or via the Embedded Metric Format (EMF) for high-cardinality logs. Using EMF enables additional analysis with CloudWatch Logs Insights and top-N summaries with Contributor Insights.
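
As a rough illustration of this pipeline stage, the sketch below shows a Lambda handler that decodes Kinesis records, counts clicks per dimension set, and prints Embedded Metric Format records to stdout, where CloudWatch Logs can extract them as metrics. The namespace, dimension names, metric name, and event field names are assumptions made for the example. The real framework also models, filters, and compresses data in ways not shown here.

```python
import base64
import json
import time
from collections import Counter
from typing import Any, Dict, List

NAMESPACE = "QoUX/Example"  # hypothetical namespace


def decode_records(kinesis_event: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Decode the base64-encoded JSON payloads Lambda receives from Kinesis."""
    return [
        json.loads(base64.b64decode(record["kinesis"]["data"]))
        for record in kinesis_event["Records"]
    ]


def to_emf(events: List[Dict[str, Any]], timestamp_ms: int) -> List[Dict[str, Any]]:
    """Aggregate click counts per dimension set and wrap each in an EMF record."""
    counts = Counter(
        (e["funnel_id"], e["button_type"], e["region"])
        for e in events
        if e.get("event_type") == "click"  # filter step: keep clicks only
    )
    records = []
    for (funnel_id, button_type, region), clicks in sorted(counts.items()):
        records.append({
            "_aws": {
                "Timestamp": timestamp_ms,
                "CloudWatchMetrics": [{
                    "Namespace": NAMESPACE,
                    "Dimensions": [["FunnelId", "ButtonType", "Region"]],
                    "Metrics": [{"Name": "Clicks", "Unit": "Count"}],
                }],
            },
            "FunnelId": funnel_id,
            "ButtonType": button_type,
            "Region": region,
            "Clicks": clicks,
        })
    return records


def handler(event: Dict[str, Any], context: Any) -> int:
    """Lambda entry point: one printed JSON line per EMF record."""
    emf_records = to_emf(decode_records(event), int(time.time() * 1000))
    for record in emf_records:
        print(json.dumps(record))
    return len(emf_records)
```

Because the full record, including its dimensions, lands in the log stream, the same data remains queryable in Logs Insights even for dimensions that would be too high-cardinality to chart as individual metrics.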

Real-Time Monitoring and Alerting

The CloudWatch alarms aren’t simply set with static thresholds. They’re trained through:

Historical analysis of normal usage patterns
Adaptive thresholds that account for time-of-day and day-of-week variations
Machine learning models that detect anomalies beyond simple threshold breaches
Correlation analysis to reduce false positives

The metrics behind these alarms include:

Component Availability. Metrics for component-level availability and latencies, segmented by region.
User Actions. Data from user actions like button clicks, checkout attempts, and gifting activities.
Failures. Alerts triggered by failures in these flows, including specific failure reasons.
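
As a small illustration of how such metrics might be derived, the sketch below computes per-region availability and tallies failure reasons from a list of attempt outcomes. Defining availability as the share of successful attempts is an assumption made for this example, as are the function and field names. The post does not spell out the exact formula QoUX uses.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Tuple

# Each outcome is (region, failure_reason); a reason of None means success.
Outcome = Tuple[str, Optional[str]]


def availability_by_region(outcomes: Iterable[Outcome]) -> Dict[str, float]:
    """Share of successful attempts per region (assumed definition)."""
    totals: Counter = Counter()
    successes: Counter = Counter()
    for region, failure_reason in outcomes:
        totals[region] += 1
        if failure_reason is None:
            successes[region] += 1
    return {region: successes[region] / totals[region] for region in totals}


def failure_reasons(outcomes: Iterable[Outcome]) -> Dict[str, int]:
    """Count failures by reason so an alert can say what went wrong."""
    return dict(Counter(reason for _, reason in outcomes if reason is not None))
```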

Real Impact

Going back to our multi-month gifting example: When we launched a gifting experiment that inadvertently disabled the multi-month selection button for a subset of users, QoUX detected the issue within 25 minutes of deployment. Before QoUX, we would have only discovered this problem after customers reported it, which could have taken hours or even days, especially for features used by smaller segments of our audience.

The custom CloudWatch alarms monitor for spikes in error rates, failures in critical actions, and latency issues, ensuring we can detect client-side outages or degraded performance faster than ever before.

QoUX Use Cases

Real-Time Monitoring of Customer Experience

QoUX’s integration with CloudWatch lets us monitor key features such as Subscriptions, Gifting, and Checkout in real time. By tracking regional availability, latencies, and failure events for these components, we can quickly identify outages and failures that directly impact customer experience and revenue. This approach has surfaced multiple critical outages of a kind that previously went unnoticed because we lacked client-side monitoring.

Feature Launch Monitoring

One of the standout capabilities of QoUX is its ability to monitor user interactions in real time during feature launches. For instance, during a recent feature launch that modified a key user interface element, we observed an unexpected surge in subscription cancellations among long-term users. Thanks to QoUX’s monitoring capabilities, we identified this issue within 10 minutes of deployment. We were able to quickly adjust the user experience by altering the visibility of the problematic UI element, significantly reducing the negative impact on user retention. This rapid insight and response capability exemplifies the power of our QoUX system in maintaining a positive user experience, even during major feature rollouts.

Advanced Anomaly Detection through CloudWatch Anomaly Detection

Our anomaly detection leverages CloudWatch Anomaly Detection, which uses machine learning to establish dynamic baselines for user interaction patterns. Here’s how we’ve implemented it:

Baseline Creation. We configure CloudWatch to analyze 2-4 weeks of historical data for each metric to establish normal patterns.
Adaptive Thresholds. In addition to static thresholds, we use CloudWatch’s anomaly detection bands that automatically adjust for time-of-day and day-of-week patterns.
Contextual Alerting. When anomalies are detected, our alerts include specific context like failure reasons, affected regions, and component details.
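
For readers who want to try this, here is a hedged sketch of the parameters for an anomaly-band alarm. The function only builds the keyword arguments for CloudWatch’s PutMetricAlarm API. The alarm name pattern, the band width of 2 standard deviations, the period, and the choice to alarm only when the metric drops below the band are all assumptions for illustration, not Twitch’s actual configuration.

```python
from typing import Any, Dict


def anomaly_alarm_params(
    namespace: str,
    metric_name: str,
    dimensions: Dict[str, str],
    band_width: float = 2.0,
    period_seconds: int = 300,
    evaluation_periods: int = 5,
) -> Dict[str, Any]:
    """Build PutMetricAlarm arguments that compare a metric to its anomaly band."""
    return {
        "AlarmName": f"{metric_name}-below-expected-band",
        # Alarm when the metric falls under the lower edge of the band.
        "ComparisonOperator": "LessThanLowerThreshold",
        "EvaluationPeriods": evaluation_periods,
        "DatapointsToAlarm": evaluation_periods,
        # Treat missing data as a breach, since silence can mean a broken button.
        "TreatMissingData": "breaching",
        "ThresholdMetricId": "band",
        "Metrics": [
            {
                "Id": "m1",
                "ReturnData": True,
                "MetricStat": {
                    "Metric": {
                        "Namespace": namespace,
                        "MetricName": metric_name,
                        "Dimensions": [
                            {"Name": name, "Value": value}
                            for name, value in sorted(dimensions.items())
                        ],
                    },
                    "Period": period_seconds,
                    "Stat": "Sum",
                },
            },
            {
                "Id": "band",
                "Expression": f"ANOMALY_DETECTION_BAND(m1, {band_width})",
                "Label": "Expected range",
                "ReturnData": True,
            },
        ],
    }
```

With boto3 installed and AWS credentials configured, the result could be passed along as boto3.client("cloudwatch").put_metric_alarm(**params). Alarm actions such as an SNS topic for paging would be added separately.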

This approach has proven particularly effective for metrics with cyclical patterns or gradual trends that would trigger false alarms with traditional static thresholds.

Unified Cross-Team Monitoring and Alerting

QoUX has transformed our organizational monitoring capabilities by breaking down traditional silos and enabling seamless cross-team visibility. Through CloudWatch cross-account observability, we’ve created an ecosystem where teams can independently access, analyze, and set up alerts on shared metrics while maintaining clear ownership boundaries.

For example, when a customer-facing issue occurs in the subscription flow, both the Subscriptions team and the Checkout team have immediate access to the same source of truth. This shared visibility enables faster coordination and more effective incident response, as teams can see exactly which components are affected and collaborate on solutions without wasting time reconciling different monitoring systems.

The unified approach has reduced our mean time to detection by approximately 40% and improved cross-team collaboration during incidents, as everyone works from the same real-time data.

Lessons Learned and Next Steps

Implementing client-side monitoring through QoUX has demonstrated how powerful real-time visibility into user experiences can be. By monitoring where users actually interact with your application, you gain immediate insights into issues that traditional backend monitoring might miss entirely.

Key Takeaways

Start Small and Expand. Begin by instrumenting your most critical user flows - those directly tied to revenue or core functionality. In our case, subscription and checkout flows were natural starting points.
Choose the Right Metrics. Focus on metrics that directly indicate user success or failure, not just technical performance. Button clicks, completion rates, and error frequencies often tell a more complete story than server response times.
Leverage AWS Services Effectively. CloudWatch Anomaly Detection provides a powerful foundation for dynamic alerting without requiring complex machine learning expertise. The combination of Kinesis for streaming, Lambda for processing, and CloudWatch for monitoring creates a flexible and scalable architecture.
Balance Data Volume with Actionability. Client-side monitoring generates significantly more data than server-side monitoring. Be strategic about what you collect and how you aggregate it to avoid overwhelming your systems and teams.

Getting Started with Your Own QoUX Implementation

Audit Your Current Monitoring. Identify gaps in your existing monitoring strategy, particularly around user-facing components and interactions.
Instrument Key User Journeys. Add client-side event emission to critical paths in your application, focusing on:
- Conversion funnels
- Authentication flows
- Payment processes
- Core feature interactions
Build Your Processing Pipeline. Use AWS services like Kinesis, Lambda, and CloudWatch to collect, process, and visualize your client-side events.
Establish Baselines and Set Alerts. Collect data for 2-4 weeks to establish normal patterns before setting up anomaly detection alerts.
Create Cross-Team Visibility. Ensure all stakeholders have access to the same monitoring data to improve collaboration during incidents.

By implementing your own version of QoUX, you can transform how you understand and respond to user experiences, catching issues before they impact your business and creating a more reliable platform for your users.