Module 1 · Support English

Lesson 02: RCA Writing

2026-06-13 · ~20 min · B1 → C1 · Section 1 / 8
Section 1 Today's Scenario
#INC-9482 P1 · Critical Needs RCA

MCP Integration / Claude Code

Intermittent timeouts during long-running tool execution

The MCP server experienced severe timeouts yesterday when the Claude Code CLI attempted to execute long-running automation tasks. The immediate bleeding has stopped, but you now need to write a formal Root Cause Analysis (RCA) that explains the failure to management and details the permanent fix.

昨天在执行耗时自动化任务时,MCP服务器出现了严重的超时问题。虽然已经止血,但你现在需要编写一份正式的根本原因分析(RCA)向管理层解释故障原因,并详细说明永久性修复方案。

Section 2 Core Vocabulary
Root Cause /ruːt kɔːz/ 根本原因

The fundamental, underlying reason a problem occurred, rather than just the symptom.

"The root cause of the outage was a misconfigured Kafka connection pool in the MCP server."

Trigger /ˈtrɪɡər/ 触发因素 / 导火索

The specific event or condition that initiated the failure sequence.

"The trigger was a sudden spike in concurrent requests from the Claude Code CLI during peak hours."

Impact /ˈɪmpækt/ 影响范围

The measure of how severely business processes or customers were affected by the incident.

"The customer impact was limited to users attempting to generate new lesson pipelines; existing pipelines ran normally."

Remediation /rɪˌmiːdiˈeɪʃən/ 补救措施 / 修复方案

The process of correcting a fault or deficiency to prevent future recurrence.

"Our primary remediation strategy is to implement an asynchronous queueing system for all LangGraph agents."

Action Item /ˈækʃən ˈaɪtəm/ 待办/后续行动项

A specific, documented task assigned to an engineer following an incident review.

"I created a Jira ticket for the first action item: upgrading the Redis cluster memory."

Timeline /ˈtaɪmlaɪn/ 时间线

A chronological arrangement of events detailing how an incident unfolded and was resolved.

"Please review the timeline section of the RCA to see exactly when the memory spike occurred."

Section 3 Native Engineer Expressions
"We have identified the root cause as..."

我们已将根本原因确认为…… · Standard, confident phrasing for officially stating the core issue in an RCA.

"The incident was triggered by..."

故障是由……触发的 · Used to explain the immediate, proximate cause that exposed the deeper root cause.

"Customer impact was limited to..."

客户受到的影响仅限于…… · Used to scope the damage and reassure stakeholders that the blast radius was contained.

"As a permanent fix, we will..."

作为永久性修复方案,我们将…… · Used to transition the document from short-term workarounds to long-term architectural improvements.

"Action items have been scheduled for the next sprint."

后续行动项已安排在下一个冲刺阶段 · Used to confirm accountability and assure readers that remediation work is actually planned.

Section 4 Technical Reading

Incident Summary: On June 12th, the MCP Automation Platform experienced a 45-minute degradation. The incident was triggered by a sudden spike in concurrent agent requests from the Claude Code CLI.

We have identified the root cause as thread pool exhaustion within the LangGraph orchestration layer. Because the maximum thread count was hardcoded to 50, long-running tool executions occupied every available thread during the spike, so subsequent requests queued up and eventually timed out. Customer impact was limited to the Lesson Generation Pipeline; all core database functions and other external endpoints operated normally.
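To make the failure mode in this paragraph concrete, here is a minimal sketch of thread pool exhaustion using Python's standard `concurrent.futures`. It is an illustration only, not the platform's actual code: the names `MAX_WORKERS`, `TASK_SECONDS`, `CLIENT_TIMEOUT`, `long_running_tool`, and `run_burst` are hypothetical, and the pool size is scaled down from 50 to 2 so the effect shows up in under a second.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

MAX_WORKERS = 2        # hypothetical stand-in for the hardcoded limit of 50
TASK_SECONDS = 0.3     # hypothetical duration of one long-running tool call
CLIENT_TIMEOUT = 0.45  # hypothetical deadline each caller is willing to wait


def long_running_tool(task_id: int) -> int:
    """Simulate a slow tool execution that holds a worker thread."""
    time.sleep(TASK_SECONDS)
    return task_id


def run_burst(n_requests: int) -> tuple[int, int]:
    """Submit n_requests at once and return (completed, timed_out)."""
    completed = timed_out = 0
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        start = time.monotonic()
        futures = [pool.submit(long_running_tool, i) for i in range(n_requests)]
        for future in futures:
            # Every request shares the same deadline, measured from the burst.
            remaining = CLIENT_TIMEOUT - (time.monotonic() - start)
            try:
                future.result(timeout=max(remaining, 0))
                completed += 1
            except FutureTimeout:
                # The task was still queued or running when the deadline hit.
                timed_out += 1
    return completed, timed_out


if __name__ == "__main__":
    print("normal load:", run_burst(2))
    print("spike:", run_burst(6))
```

Under normal load every request gets a thread immediately. During the burst, only the first two requests get a thread; the rest wait in the queue past the caller's deadline. The spike is the trigger, while the fixed pool size is the root cause.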

For immediate remediation, we restarted the service and temporarily increased the hardcoded limit. As a permanent fix, we will implement dynamic scaling for the thread pool based on CPU utilization and add exponential backoff logic to the CLI client. These action items have been scheduled for the next sprint.
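The exponential backoff mentioned in this paragraph can be sketched as follows. This is a minimal illustration under stated assumptions, not the CLI client's real retry logic: `backoff_delays` and `call_with_backoff` are hypothetical helpers, the base delay, cap, and retry count are arbitrary, and the random "full jitter" is one common variant that the incident summary does not specify.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delays(retries: int, base: float = 0.5, cap: float = 8.0) -> list[float]:
    """Upper bound of the wait before each retry: base * 2**attempt, capped."""
    return [min(cap, base * 2 ** attempt) for attempt in range(retries)]


def call_with_backoff(
    request: Callable[[], T],
    retries: int = 4,
    base: float = 0.5,
    cap: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call request(); on TimeoutError, wait a growing random delay and retry."""
    for delay in backoff_delays(retries, base, cap):
        try:
            return request()
        except TimeoutError:
            # Full jitter: wait a random time up to the current bound so that
            # many clients do not all retry at the same instant.
            sleep(random.uniform(0, delay))
    # Final attempt: let any exception propagate to the caller.
    return request()
```

Each retry waits up to twice as long as the previous one, which gives an overloaded server time to drain its queue instead of receiving the same spike again. The `sleep` parameter is injectable only so the helper can be exercised without real waiting.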

Comprehension Check

1. What was the *root cause* of the degradation?

A sudden spike in concurrent agent requests.
Thread pool exhaustion in the LangGraph layer.
A complete database failure in the core system.
A syntax error in the Claude Code CLI logic.

2. How broad was the customer impact?

All automation endpoints were fully offline for 45 minutes.
Data was permanently lost in the database layer.
It was restricted to the Lesson Generation Pipeline.
The CLI client was entirely deleted for all users.

3. What is one of the planned action items for a permanent fix?

Adding exponential backoff logic to the CLI client.
Hardcoding the maximum thread limit to 100 instead of 50.
Migrating the entire platform off of LangGraph.
Disabling the Lesson Generation Pipeline permanently.
Section 5 Writing Task

Write an Executive Summary for an RCA Ticket

You just resolved a Sev-2 incident where the Daily English Lab Agent stopped generating lessons due to a memory leak in the prompt generation module. Write a concise executive summary for the official RCA.

  1. State the root cause clearly (memory leak).
  2. Describe the customer impact (e.g., 10 minutes of downtime).
  3. Mention one remediation step or action item.
  4. Keep it under 80 words.
Section 6 AI Review Rubric
Grammar / 20 pts
Uses correct past tense for incidents and future tense for action items.
Vocabulary / 20 pts
Effectively uses target words like "root cause", "impact", or "remediation".
Clarity / 20 pts
Root cause and impact are explicitly stated within the first 2 sentences.
Professionalism / 20 pts
Tone is calm, factual, blameless, and appropriate for management.
Native-like Expression / 20 pts
Incorporates at least one native phrasing pattern (e.g., "Customer impact was limited to...").
Total 100 pts
Section 7 Spaced Repetition Review

3 Words from Previous Lessons

Intermittent

间歇性的

Occurring at irregular intervals; not continuous.

Bottleneck

瓶颈

A point of congestion in a system that slows down performance.

Workaround

临时解决方案

A temporary bypass of a recognized problem.

2 Expressions from Previous Lessons

"I'm actively investigating the issue now."

"Could you provide the exact steps to reproduce?"

Section 8 Challenge Zone ⚡ Above current level

When writing an RCA, why is it critical to explicitly distinguish between the "trigger" (the proximate cause) and the "root cause"? How does confusing these two impact your long-term engineering strategy?

Answer in English. Use technical vocabulary from this lesson. No word limit.