Chapter 7.4: Managing the BMC System Event Log (SEL)

Your primary tool for investigating hardware events, tracking their resolution, and clearing critical system alerts.

ℹ️ BMC System Event Log Management

  • Available to: All user roles

  • Scope: Individual node level

  • Permissions:

    • Admin roles: Full event management including resolution status changes

    • Viewer roles: Read-only access to event logs

  • Data Source: Hardware events logged directly by BMC


Overview: The Node's Black Box Recorder

The BMC System Event Log (SEL) tab is the official hardware and service-level event log for the node, recorded directly by the Baseboard Management Controller (BMC). Think of it as the node's "black box recorder." It operates independently of the main operating system, meaning it will capture critical hardware events even if the OS has crashed.

When the Dashboard shows a CRITICAL alert, or a sensor reports an abnormal reading, this is where you come to find the detailed "who, what, and when" of the incident.


The Incident Response Workflow

Managing events in the SEL follows a clear, three-step process from investigation to resolution.

Investigate Event → Filter & Analyze → Resolve & Apply

Step 1: Investigate the Event

The event list is your primary source of information. Understanding how to read it is the first step in any diagnosis.

The main BMC SEL interface, showing the event list with color-coded severity indicators.

Event Table Column Reference

| Column | Description & Why It Matters | Usage Priority |
|---|---|---|
| Severity | The Event's Impact: Color-coded for immediate recognition. Critical (Red): Requires immediate attention. Warning (Orange): A non-critical issue that should be investigated. OK (Green): Informational events. | HIGHEST - Start here |
| Time | The Exact Timestamp: Crucial for correlating hardware events with other system logs to pinpoint a root cause. | HIGH - For correlation |
| Description | The "What Happened": A human-readable summary of the event. This is your most important clue. | HIGHEST - Key diagnostic info |
| Status | The Event's Lifecycle: An interactive toggle showing if the event is Unresolved (default, requires action) or Resolved (acknowledged and handled). | HIGH - For management |

Event Investigation Best Practices

Priority Reading Order:

  1. Critical Events First: Focus on red-coded events immediately

  2. Recent Events: Check latest timestamps for current issues

  3. Event Correlation: Look for patterns or sequences of related events

  4. Context Analysis: Read event descriptions for specific component details

Key Information to Extract:

  • Component Identity: Which specific hardware component is affected

  • Event Type: Hardware failure, threshold violation, or status change

  • Timing: When the event occurred for correlation with other activities

  • Severity Assessment: Impact level and urgency of response needed
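The priority reading order above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `SelEvent` record, its field names, and the sample events are hypothetical stand-ins for the four table columns, not an EDCC export format or API.

```python
from dataclasses import dataclass
from datetime import datetime

# Hypothetical record mirroring the Severity, Time, Description and Status columns.
@dataclass
class SelEvent:
    severity: str          # "Critical", "Warning", or "OK"
    time: datetime
    description: str
    status: str = "Unresolved"

SEVERITY_RANK = {"Critical": 0, "Warning": 1, "OK": 2}

def reading_order(events):
    """Critical events first; within each severity, most recent first."""
    # Python's sort is stable, so sorting by time and then by severity
    # keeps the newest-first order inside each severity group.
    by_time = sorted(events, key=lambda e: e.time, reverse=True)
    return sorted(by_time, key=lambda e: SEVERITY_RANK[e.severity])

# Made-up sample events, for illustration only.
events = [
    SelEvent("OK", datetime(2024, 5, 1, 9, 0), "Fan 2 speed back in range"),
    SelEvent("Critical", datetime(2024, 5, 1, 8, 30), "PSU 1 failure"),
    SelEvent("Warning", datetime(2024, 5, 1, 8, 45), "Inlet temperature above upper threshold"),
    SelEvent("Critical", datetime(2024, 5, 1, 8, 55), "Fan 2 failure"),
]

for e in reading_order(events):
    print(e.severity, e.time.isoformat(), e.description)
```

Ordering by severity before recency matches the advice to start with red-coded events and then look at the latest timestamps.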


Step 2: Filter to Find Specific Incidents

In a busy system or after a major event, the log can become long. Use the filter to quickly pinpoint the events you need.

Filtering Workflow

  1. Click the + Add Filter button

  2. Set your criteria in the dialog by Date Range or Severity

  3. Click Apply

The "Add Filter" dialog box, showing the date and severity options.

Common Filter Scenarios

| Filter Type | Use Case | Example |
|---|---|---|
| Severity: Critical | Find all critical hardware failures | Power supply failures, fan failures |
| Date Range: Last 24 hours | Focus on recent events | Troubleshooting current issues |
| Severity: Warning + Critical | Exclude informational noise | Focus on actionable events |
| Specific Time Range | Correlate with known incidents | Match with maintenance windows |

Advanced Filtering Strategies

Incident Investigation:

  • Start Broad: Begin with Critical + Warning events

  • Narrow by Time: Focus on timeframe when issues began

  • Clear and Refocus: Remove filters to see full context when needed

Regular Monitoring:

  • Daily Review: Filter to last 24 hours, all severities

  • Weekly Audit: Review unresolved events across longer periods

  • Maintenance Correlation: Filter around planned maintenance activities
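To make the filter logic concrete, the sketch below applies the same two criteria the dialog offers (severity and date range) to a list of events. The `SelEvent` record, the `filter_events` helper, and the sample data are assumptions made for illustration; the filter itself lives in the EDCC interface, and this is not its implementation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical event record; field names are illustrative.
@dataclass
class SelEvent:
    severity: str
    time: datetime
    description: str

def filter_events(events, severities=None, start=None, end=None):
    """Keep events that match the chosen severities and fall within [start, end]."""
    result = []
    for e in events:
        if severities is not None and e.severity not in severities:
            continue
        if start is not None and e.time < start:
            continue
        if end is not None and e.time > end:
            continue
        result.append(e)
    return result

now = datetime(2024, 5, 2, 9, 0)  # fixed "now" so the example is reproducible
events = [
    SelEvent("Critical", datetime(2024, 5, 1, 22, 15), "PSU 1 failure"),
    SelEvent("OK", datetime(2024, 5, 1, 23, 0), "PSU 1 presence detected"),
    SelEvent("Warning", datetime(2024, 4, 28, 14, 0), "Inlet temperature high"),
]

# "Start Broad": Critical + Warning, no time limit.
actionable = filter_events(events, severities={"Critical", "Warning"})

# "Daily Review": last 24 hours, all severities.
last_day = filter_events(events, start=now - timedelta(hours=24), end=now)
```

Leaving a criterion as `None` corresponds to not setting it in the dialog, which is how the "Clear and Refocus" strategy widens the view again.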


Step 3: Manage and Resolve the Event

This is the most critical part of the workflow. After you have taken action to fix the underlying physical issue, you must update the event's status in EDCC to clear the alert.

Resolution Workflow

Step-by-Step Process:

  1. Fix the Physical Issue: Address the underlying hardware problem first

  2. Locate the Event: Find the Unresolved event you have fixed

  3. Toggle Status: Click the toggle switch in the Status column to change it to Resolved

  4. Save Changes: Crucially, click the Apply button in the top-right corner of the page to save this change

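Critical Warning: Changes Are Not Saved Automatically. Toggling the Status switch alone does not save the change; until you click Apply, the event remains Unresolved and the Dashboard alert will not clear.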

Resolution Best Practices

Before Marking Resolved:

  • Verify Fix: Confirm the physical issue has actually been addressed

  • Check Sensors: Verify related sensors now show normal readings

  • Test Functionality: Ensure affected component is operating properly

  • Document Action: Note what was done to resolve the issue (for future reference)

Resolution Workflow Safety:

  • One at a Time: Resolve events individually to avoid mistakes

  • Double-Check: Verify you're resolving the correct event

  • Apply Immediately: Click Apply after each resolution

  • Verify Results: Check Dashboard to confirm alert clearance
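The reason Apply matters can be pictured with a toy model in which status toggles are staged and only persisted on an explicit apply. The `SelStatusEditor` class and the event IDs are hypothetical; how EDCC handles this internally is not documented here, so treat it as a mental model rather than a description of the product.

```python
class SelStatusEditor:
    """Toy model: toggles are staged locally and only persist on apply()."""

    def __init__(self, saved):
        self.saved = dict(saved)   # event id -> status as currently stored
        self.pending = {}          # staged toggles that have not been applied

    def toggle(self, event_id):
        current = self.pending.get(event_id, self.saved[event_id])
        new = "Resolved" if current == "Unresolved" else "Unresolved"
        if new == self.saved[event_id]:
            self.pending.pop(event_id, None)  # toggled back: nothing left to save
        else:
            self.pending[event_id] = new

    def apply(self):
        self.saved.update(self.pending)
        self.pending.clear()

editor = SelStatusEditor({101: "Unresolved", 102: "Unresolved"})
editor.toggle(101)
print(editor.saved[101])  # Unresolved (toggle is staged, not saved)
editor.apply()
print(editor.saved[101])  # Resolved
```

In this model, toggling without applying leaves the stored status untouched, which is the situation the "Apply Immediately" practice is meant to avoid.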


Overview: The health status you see on the main Dashboard is directly controlled by the events in this log. Understanding this relationship is key to effective monitoring.

Dashboard Health Status Logic

| SEL Event Status | Dashboard Impact | Required Action |
|---|---|---|
| Unresolved Critical Event(s) | POD Health shows CRITICAL | Resolve physical issue + mark event Resolved + Apply |
| Unresolved Warning Event(s) | POD Health shows WARNING | Investigate and resolve as appropriate |
| All Events Resolved | POD Health shows GOOD | Normal monitoring |

Alert Clearance Process

Physical Issue → SEL Event → Dashboard Alert
      ↓              ↓           ↓
Physical Fix → Mark Resolved → Alert Cleared
              + Click Apply

Key Points:

  • If a node has even one Unresolved Critical event in its SEL, the overall POD Health on the Dashboard will be flagged as CRITICAL

  • To clear that CRITICAL status, you must complete the workflow: fix the hardware issue, then mark the corresponding event(s) here as Resolved and click Apply

  • Dashboard alerts will NOT clear until both the physical issue is resolved AND the event status is updated in the SEL
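The health logic in the table can be expressed as a small function. This is a sketch, not the Dashboard's actual code: the `pod_health` name and the tuple shape are hypothetical, and two behaviors are assumptions the table does not spell out, namely that CRITICAL takes precedence when both Critical and Warning events are unresolved, and that unresolved OK (informational) events do not affect health.

```python
def pod_health(events):
    """Derive the POD Health label from (severity, status) pairs.

    Assumptions: CRITICAL outranks WARNING, and unresolved "OK"
    events leave the health at GOOD.
    """
    unresolved = {sev for sev, status in events if status == "Unresolved"}
    if "Critical" in unresolved:
        return "CRITICAL"
    if "Warning" in unresolved:
        return "WARNING"
    return "GOOD"

# One unresolved Critical event is enough to flag the POD.
print(pod_health([("Critical", "Unresolved"), ("Warning", "Resolved")]))  # CRITICAL
# Once it is marked Resolved and applied, the health can recover.
print(pod_health([("Critical", "Resolved"), ("Warning", "Resolved")]))    # GOOD
```

Because the function only looks at stored status, a hardware fix that is never marked Resolved still yields CRITICAL, matching the key points above.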


Event Log Management Strategies

Daily Operations

Morning Health Check:

  1. Review Unresolved Events: Check for any critical or warning events

  2. Verify Recent Events: Look for new events since last check

  3. Correlate with Dashboard: Ensure Dashboard status matches SEL status

  4. Plan Actions: Prioritize critical events for immediate attention
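If event data is available outside the interface (for example, copied into a script by hand), the morning check might be summarized as follows. The dictionary keys and the `morning_summary` helper are hypothetical; EDCC is not known to expose data in this shape.

```python
from collections import Counter
from datetime import datetime

def morning_summary(events, last_check):
    """Count unresolved events per severity and list events newer than last_check.

    events: list of dicts with 'severity', 'status', 'time', 'description'
    keys (an assumed shape, for illustration only).
    """
    unresolved = Counter(e["severity"] for e in events if e["status"] == "Unresolved")
    new_since = [e for e in events if e["time"] > last_check]
    return unresolved, new_since

# Made-up sample data.
events = [
    {"severity": "Critical", "status": "Unresolved",
     "time": datetime(2024, 5, 2, 3, 10), "description": "Fan 2 failure"},
    {"severity": "Warning", "status": "Resolved",
     "time": datetime(2024, 4, 30, 11, 0), "description": "Inlet temperature high"},
    {"severity": "Warning", "status": "Unresolved",
     "time": datetime(2024, 5, 1, 7, 45), "description": "PSU 2 redundancy degraded"},
]

unresolved, new_since = morning_summary(events, last_check=datetime(2024, 5, 1, 9, 0))
print(dict(unresolved))
print([e["description"] for e in new_since])
```

The per-severity counts cover steps 1 and 4 of the check, and the list of newer events covers step 2.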

Incident Response

When Dashboard Shows Critical:

  1. Navigate to SEL: Go directly to affected node's BMC SEL tab

  2. Identify Root Cause: Find the critical event(s) causing the alert

  3. Gather Context: Use filters to see event timeline and related events

  4. Plan Response: Determine physical action needed based on event details
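Gathering context usually means looking at what was logged shortly before and after the critical event. A minimal sketch, assuming events are available as `(time, description)` pairs and using a hypothetical `events_near` helper with an arbitrary 15-minute default window:

```python
from datetime import datetime, timedelta

def events_near(events, anchor_time, window=timedelta(minutes=15)):
    """Return (time, description) pairs within +/- window of anchor_time, oldest first."""
    nearby = [e for e in events if abs(e[0] - anchor_time) <= window]
    return sorted(nearby, key=lambda e: e[0])

# Made-up sample data.
events = [
    (datetime(2024, 5, 1, 8, 30), "PSU 1 failure"),
    (datetime(2024, 5, 1, 8, 20), "PSU 1 input lost"),
    (datetime(2024, 5, 1, 6, 0), "Chassis intrusion"),
]

for when, what in events_near(events, datetime(2024, 5, 1, 8, 30)):
    print(when.isoformat(), what)
```

Reading the window oldest first helps separate the initial root-cause event from the follow-on events it triggered.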

Maintenance Coordination

Before Maintenance:

  • Document Baseline: Note current unresolved events

  • Plan Resolution: Identify which events maintenance will address

After Maintenance:

  • Verify Fixes: Check that physical work resolved the issues

  • Update SEL Status: Mark resolved events and click Apply

  • Confirm Dashboard: Verify Dashboard reflects successful resolution
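The before/after comparison can be reduced to set arithmetic on the unresolved events noted as the baseline. The event identifiers and the `compare_to_baseline` helper below are hypothetical; any stable label you record for each event (for example, timestamp plus description) would work the same way.

```python
def compare_to_baseline(baseline_unresolved, current_unresolved):
    """Compare unresolved-event identifiers recorded before and after maintenance."""
    before, after = set(baseline_unresolved), set(current_unresolved)
    return {
        "cleared": sorted(before - after),     # addressed by the maintenance
        "still_open": sorted(before & after),  # need follow-up
        "new": sorted(after - before),         # appeared during or after the work
    }

result = compare_to_baseline(
    baseline_unresolved=["fan2-failure", "psu1-failure"],
    current_unresolved=["psu1-failure", "inlet-temp-high"],
)
print(result)
```

Anything listed under "new" is worth checking first, since it may have been introduced by the maintenance itself.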


Chapter Summary & Key Takeaways

  • Dashboard Alerts Start Here: An alert on the Dashboard is a symptom. The detailed event in the SEL is the diagnosis

  • Resolution is a Two-Step Process: You must first fix the physical hardware issue, then mark the event as Resolved in this interface

  • "Apply" is the Final Step: Dashboard alerts will not clear until you have marked an event as Resolved and clicked the Apply button

  • Use Filters: In a "log storm," the filter is your best tool for finding the initial root-cause event

  • Admin Rights Required: Event resolution requires Admin permissions - Viewers can investigate but cannot resolve

  • Direct Dashboard Connection: SEL event status directly controls Dashboard health indicators

What's Next: Chapter 7.5 will explore the Operations tab, where you'll learn to execute direct BMC commands for power management, firmware updates, and system maintenance operations.

💡 Pro Tip: Develop a habit of checking the SEL whenever Dashboard health changes - it's your fastest path to understanding what happened and what needs to be fixed.