Chapter 8: Monitoring Services Health

A centralized dashboard for viewing the health of internal Redfish services across your infrastructure.

ℹ️ Service Monitoring

  • Available to: All user roles with monitoring permissions

  • Scope: Organization / Hierarchy View / POD levels

  • Permissions: Monitor service health status for authorized scope

  • Purpose: Fleet-wide Redfish service health monitoring across different management scopes


Overview: Your Fleet-Wide Systems Check

The MANAGE → Services page is your fleet-wide systems check for the internal management engine. While the Node Detail → Services tab focuses on a single node, this page provides a high-level, aggregated view. Its primary purpose is to help you diagnose issues with the management plane itself.

Scope-Based Visibility:

  • Organization scope: View service health across all PODs you have access to

  • Hierarchy View scope: View service health across PODs within the selected Hierarchy View

  • POD scope: View service health for nodes within the selected POD

If EDCC is having trouble communicating with one or more nodes, this is the first place to look to understand the scope of the problem. It helps you quickly answer key questions depending on your selected scope:

  • Organization/HV level: "Is there a widespread service issue affecting multiple PODs?"

  • POD level: "Is there a widespread service issue affecting multiple nodes in this POD?"

  • All levels: "Is this a systemic problem or an isolated incident?"
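
When the answers point toward a communication problem rather than a genuine service fault, a quick out-of-band probe of the affected BMCs can confirm whether their Redfish services respond at all. The sketch below is a minimal illustration and is not part of EDCC: the hostnames and timeout are hypothetical, and it simply requests the standard Redfish service root (/redfish/v1/) on each BMC.

    # Minimal reachability probe for a set of BMC Redfish service roots.
    # The hostnames are hypothetical placeholders; adjust TLS verification
    # to match your environment (verify=False is a lab-only shortcut).
    import requests

    BMC_HOSTS = ["bmc-pod1-node01.example.net", "bmc-pod1-node02.example.net"]

    def probe_service_root(host: str, timeout: float = 5.0) -> str:
        """Return a short status string for one BMC's Redfish service root."""
        url = f"https://{host}/redfish/v1/"
        try:
            resp = requests.get(url, timeout=timeout, verify=False)
            return f"{host}: HTTP {resp.status_code}"
        except requests.RequestException as exc:
            return f"{host}: unreachable ({exc.__class__.__name__})"

    if __name__ == "__main__":
        for host in BMC_HOSTS:
            print(probe_service_root(host))

A BMC that answers here but still shows a stale status in EDCC usually points more toward a service-level issue than a network one.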


Understanding Scope-Based Monitoring

How Scope Affects Service Monitoring

The services you see on this page depend on your currently selected management scope. This dynamic filtering helps you focus on the relevant infrastructure without being overwhelmed by nodes outside your current context.

Scope Selection Examples:

  • Organization: All nodes from all authorized PODs. Example use case: a global infrastructure health check across the entire organization.

  • Hierarchy View: All nodes from PODs within the selected HV. Example use case: regional data center monitoring (e.g., "East Coast HV" shows POD1, POD2, POD3).

  • POD: All nodes within the selected POD. Example use case: detailed investigation of specific POD issues.

Permission Filtering:

User with POD1, POD3, and POD5 access:
• Select "Organization" → See nodes from POD1 + POD3 + POD5
• Select "HV1" (contains POD1, POD2, POD3, POD4) → See nodes from POD1 + POD3 only
• Select "POD1" → See nodes from POD1 only

The Monitoring & Triage Workflow

The interface is designed for a simple, two-step workflow: first, get a high-level overview from the status list, then use the powerful filtering and action tools to investigate specific problems.

Select Scope → Fleet Status Overview → Filter Problem Nodes → Investigate Individual Issues

Multi-Scope Investigation Pattern

Top-Down Approach (Recommended for widespread issues):

  1. Organization view: Identify which PODs are affected

  2. Narrow to HV: Focus on specific regional or logical groupings

  3. Drill to POD: Investigate specific node issues

  4. Node Detail: Deep-dive individual server diagnostics

Targeted Approach (For known issues):

  1. Direct POD selection: Jump straight to affected POD

  2. Filter critical nodes: Isolate problem servers

  3. Use quick actions: Navigate to diagnostic pages


Reading the Node Service Status List

Overview: This table is your main dashboard, acting as a "fleet status board" for the selected service. Each row represents a node, and each column gives you a piece of critical diagnostic information. It lets you spot systemic issues at a glance, such as a recent network change that has impacted communication for multiple nodes across different PODs.

The Redfish SEL Health view showing a list of nodes, with some marked as critical
  • Name: The unique name of the node. Diagnostic value: node identification.

  • Service Status: The key indicator. Shows the current health of the selected service on that node (Critical or Good). The entire row is color-coded, so you can spot failing nodes instantly. Diagnostic value: PRIMARY - immediate health status.

  • Service Last Update: The "heartbeat" timestamp. Shows the last time EDCC successfully received a status update from this service. A stale or old timestamp is a major red flag, indicating a potential network issue or an unresponsive BMC. Diagnostic value: CRITICAL - communication health.

  • Actions: Your investigation shortcuts. Dynamic, context-aware links that take you directly to the most relevant page for troubleshooting that specific service failure. Diagnostic value: TOOLS - direct investigation paths.

Key Monitoring Indicators

Service Status Priority:

  • Critical (Red): Immediate attention required - service failure detected

  • Good (Green): Service operating normally

  • Missing Status: Possible communication failure

Service Last Update Analysis:

  • Recent Timestamp: Normal communication with BMC

  • Stale Timestamp: Warning sign of network or BMC issues

  • Missing Timestamp: Communication completely broken
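
If you ever script against exported node data, the same analysis reduces to comparing each node's last-update timestamp with a staleness threshold. The sketch below is illustrative only; the 15-minute threshold and the helper name update_health are assumptions, not EDCC behavior.

    # Classify a node's "Service Last Update" timestamp as recent, stale, or missing.
    # The 15-minute threshold and record layout are illustrative assumptions.
    from datetime import datetime, timedelta, timezone
    from typing import Optional

    STALE_AFTER = timedelta(minutes=15)  # hypothetical threshold

    def update_health(last_update: Optional[datetime],
                      now: Optional[datetime] = None) -> str:
        """Return 'missing', 'stale', or 'recent' for a last-update timestamp."""
        if last_update is None:
            return "missing"        # communication completely broken
        now = now or datetime.now(timezone.utc)
        return "stale" if now - last_update > STALE_AFTER else "recent"

    # Example: a node last heard from an hour ago is flagged as stale.
    an_hour_ago = datetime.now(timezone.utc) - timedelta(hours=1)
    print(update_health(an_hour_ago))   # stale
    print(update_health(None))          # missing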

Cross-Scope Pattern Recognition

What to Look For:

  • Organization view: Are problems concentrated in specific PODs?

  • Hierarchy View: Are issues affecting one region more than others?

  • POD view: Are problems clustered on specific racks or network segments?

  • All scopes: Are similar services failing across multiple nodes?
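
When working from an export of the node list, the same pattern check amounts to counting critical nodes per POD (or per rack) and seeing where they cluster. A minimal sketch with entirely hypothetical node records:

    # Count critical nodes per POD to see whether failures cluster in one place.
    # The node records below are hypothetical examples.
    from collections import Counter

    nodes = [
        {"name": "node01", "pod": "POD1", "status": "Good"},
        {"name": "node02", "pod": "POD1", "status": "Critical"},
        {"name": "node03", "pod": "POD3", "status": "Critical"},
        {"name": "node04", "pod": "POD3", "status": "Critical"},
    ]

    critical_per_pod = Counter(n["pod"] for n in nodes if n["status"] == "Critical")
    for pod, count in critical_per_pod.most_common():
        print(f"{pod}: {count} critical node(s)")
    # POD3: 2 critical node(s)  -> problems concentrated in POD3
    # POD1: 1 critical node(s)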


Filtering and Investigating

Filtering for Problem Nodes

In a large deployment with hundreds of nodes across multiple PODs, the filter is your most essential tool. It allows you to cut through the noise and instantly isolate the nodes that need attention, regardless of your current scope.

Process:

  1. Click the + Add Filter button

  2. Filter the list by Name or Service Status

  3. Click Apply

The filter dialog showing the options to filter by Name and Service Status

Common Filter Scenarios by Scope:

Organization/HV Scope:

  • Critical Status Only: Find all critical nodes across multiple PODs

  • POD-Specific Issues: Filter by node names containing POD identifiers

  • Stale Updates: Identify communication problems across infrastructure

POD Scope:

  • Critical Status Only: Focus on nodes requiring immediate attention

  • Rack-Based Filtering: Filter by naming patterns for specific racks

  • Service Type Issues: Isolate specific service failures

Using Quick Actions to Drill Down

The Actions column provides the most efficient path from problem detection to root cause investigation. These shortcuts are designed to minimize "context switching" by providing a direct, one-click path to the most relevant diagnostic page.

Action Options by Service Type:

  • Redfish SEL Health: Actions BMC SEL and BMC Web. Jump to the event logs or the native BMC interface.

  • Redfish System Information: Actions System Info and BMC Web. Access hardware details or the BMC interface.

Investigation Path:

Fleet View Problem → Quick Action → Node-Level Diagnostics → Root Cause Resolution
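
If you want to confirm a finding outside the UI, the event logs that the BMC SEL action points to are also reachable directly over Redfish. The sketch below walks a manager's LogServices collection and prints its most recent entries; the BMC address and credentials are hypothetical, and the exact log service names and entry ordering vary by BMC vendor.

    # Fetch recent log entries from a BMC via its Redfish LogServices collection.
    # Host and credentials are hypothetical; log service names vary by vendor.
    import requests

    BMC = "https://bmc-pod1-node02.example.net"   # hypothetical BMC address
    AUTH = ("monitor", "secret")                   # hypothetical read-only account

    def get(path: str) -> dict:
        """GET a Redfish resource and return its JSON body."""
        resp = requests.get(f"{BMC}{path}", auth=AUTH, verify=False, timeout=10)
        resp.raise_for_status()
        return resp.json()

    # Walk each manager's LogServices collection; on many BMCs the SEL is one of them.
    for mgr_ref in get("/redfish/v1/Managers").get("Members", []):
        manager = get(mgr_ref["@odata.id"])
        log_services = manager.get("LogServices", {}).get("@odata.id")
        if not log_services:
            continue
        for svc_ref in get(log_services).get("Members", []):
            # The Entries collection normally sits directly under the log service.
            entries = get(svc_ref["@odata.id"] + "/Entries")
            for entry in entries.get("Members", [])[-5:]:   # last few entries; order may vary
                print(entry.get("Created"), entry.get("Severity"), entry.get("Message"))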


Managing Service Events

Understanding the Batch "Delete" Action

This function is an administrative tool for managing the user interface, not the nodes themselves. After you have resolved a widespread service issue, you can use this batch action to clear the resolved Critical status indicators from this screen in a single operation.

When to Use Delete Action:

  • After Issue Resolution: Clear resolved service alerts from the display

  • UI Cleanup: Remove outdated status indicators after maintenance

  • Event Management: Maintain clean service status overview

  • Works Across All Scopes: Clean up events regardless of Organization/HV/POD selection

What Delete Does NOT Do:

  • Remove nodes from POD

  • Delete actual service configurations

  • Affect node functionality

  • Remove nodes from other EDCC functions

  • Change your current scope selection


Chapter Summary & Key Takeaways

  • Multi-Scope Monitoring: This page works across Organization, Hierarchy View, and POD scopes - not POD-only

  • Dynamic Visibility: What you see depends on your selected scope and user permissions

  • Use This Page for Fleet-Wide Issues: Best for spotting problems that affect multiple nodes across your authorized infrastructure

  • Last Update is a Key Indicator: A stale timestamp can be a sign of a network or BMC connectivity issue, even if the status is "Good"

  • Actions are Your Shortcuts: Use the Actions column to jump directly from fleet-level alerts to node-level diagnostic pages

  • Delete Clears the View, Not the Node: The batch Delete action is for UI cleanup only and does not affect nodes or your scope selection

  • Pattern Recognition: Look for patterns across multiple nodes and PODs to identify systemic issues vs. isolated problems

  • Scope Switching: Use the hamburger menu to navigate between different scopes for top-down or targeted investigation

What's Next: Chapter 9 will explore POD View topology management, where you'll learn to visualize and organize your infrastructure layout.

💡 Pro Tip: Use Organization or HV scope as your first stop when the Dashboard shows widespread health issues - it quickly reveals whether problems are affecting specific PODs or multiple systems across your infrastructure. Then narrow your scope to investigate specific problem areas.