Sensors

Chapter 7.3: Monitoring Hardware Sensors

Your toolkit for precise, in-depth hardware health diagnostics.

ℹ️ Hardware Sensor Monitoring
Available to: All user roles
Scope: Individual node level
Permissions: Read-only for all users
Data Source: Real-time readings from BMC sensors via Redfish protocol

Overview

The Sensors tab is your tool for detailed hardware diagnostics. While the Summary tab's graphs help you spot trends over time, this page provides the exact, real-time numerical values and the official manufacturer-defined operating thresholds for every sensor in the node. This is the difference between seeing a fever chart and reading the exact temperature on a digital thermometer.

All data here is polled directly from the Baseboard Management Controller (BMC), making it completely independent of the operating system. Use this read-only page to answer one critical question: "Is this component currently operating within its safe, predefined limits?"
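For readers who want to see what such BMC data can look like underneath, here is a minimal sketch that parses a Redfish-style payload offline. It assumes the BMC exposes the DMTF Thermal resource with its `Temperatures` array; this guide only states that Redfish is used, so the exact resource your BMC serves may differ. The sample payload and the `summarize_temperatures` helper are hypothetical and the values are invented.

```python
import json

# Hypothetical sample shaped like a Redfish Thermal resource.
# Property names follow the DMTF Thermal schema; the values are invented.
SAMPLE = """
{
  "Temperatures": [
    {"Name": "CPU1 Temp", "ReadingCelsius": 54,
     "UpperThresholdNonCritical": 85, "UpperThresholdCritical": 95,
     "Status": {"State": "Enabled", "Health": "OK"}},
    {"Name": "Inlet Temp", "ReadingCelsius": 41,
     "UpperThresholdNonCritical": 40, "UpperThresholdCritical": 45,
     "Status": {"State": "Enabled", "Health": "Warning"}}
  ]
}
"""


def summarize_temperatures(payload: str) -> list[dict]:
    """Extract name, reading, upper thresholds, and health for each sensor."""
    data = json.loads(payload)
    rows = []
    for sensor in data.get("Temperatures", []):
        rows.append({
            "name": sensor.get("Name"),
            "value_c": sensor.get("ReadingCelsius"),
            "warn_high": sensor.get("UpperThresholdNonCritical"),
            "crit_high": sensor.get("UpperThresholdCritical"),
            "health": sensor.get("Status", {}).get("Health"),
        })
    return rows


for row in summarize_temperatures(SAMPLE):
    print(row)
```

The three pieces extracted per sensor (health, reading, thresholds) correspond to the Status, Current Value, and Thresholds columns described later in this chapter.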

Always Get the Latest Data

The sensor data is a snapshot taken when the page was loaded; it does not update on its own. To get the latest readings from the node's BMC, refresh the browser manually (F5).

How to Triage a Sensor's Health

The tables on this page are designed for quick and accurate assessment. To interpret any sensor reading, follow this three-step process.

Three-Step Health Assessment Process

Check the Status: Look at the color-coded dot first for an immediate health summary.
Read the Current Value: See the real-time measurement being reported by the component.
Compare with Thresholds: Verify where the current value falls within the hardware's official safe operating range.
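The comparison in step three can be sketched as a small function. This is an illustration only, not the product's actual logic: the `classify` name, the four threshold parameters, and the choice to treat a value exactly equal to a threshold as having crossed it are all assumptions.

```python
from typing import Optional


def classify(value: float,
             lower_critical: Optional[float] = None,
             lower_warning: Optional[float] = None,
             upper_warning: Optional[float] = None,
             upper_critical: Optional[float] = None) -> str:
    """Map a reading to Good, Warning, or Critical.

    Thresholds that a sensor does not define are passed as None and skipped.
    Assumption: reaching a threshold counts as crossing it.
    """
    # Critical limits are checked first so they take precedence over warnings.
    if lower_critical is not None and value <= lower_critical:
        return "Critical"
    if upper_critical is not None and value >= upper_critical:
        return "Critical"
    if lower_warning is not None and value <= lower_warning:
        return "Warning"
    if upper_warning is not None and value >= upper_warning:
        return "Warning"
    return "Good"


print(classify(54, upper_warning=85, upper_critical=95))
print(classify(90, upper_warning=85, upper_critical=95))
print(classify(0, lower_critical=500, lower_warning=1000))
```

The three calls show a temperature well inside its range, a temperature past its warning limit, and a stopped fan below its lower critical limit.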

Sensor Table Column Reference

| Column | Description & Why It Matters | How to Use |
| --- | --- | --- |
| Status | Your at-a-glance health indicator. Good is normal. Warning or Critical means the sensor has crossed a predefined threshold: Warning calls for investigation, Critical for immediate action. | Check this first and prioritize any status other than Good. |
| Current Value | The precise measurement reported by the sensor when the page was loaded (e.g., Volts, RPM, °C). | Treat it as the exact measurement and compare it against the expected range. |
| Thresholds | The official safe operating limits. These read-only values are defined by the hardware manufacturer. A Current Value outside these boundaries triggers a Status change. | Use them as reference ranges to tell normal from abnormal. |

Diagnostic Priority: Always start with Critical status sensors, then Warning, then verify Good sensors for baseline understanding.
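If you copy sensor rows into a script or spreadsheet export, the priority order above can be applied with a simple sort. This is a minimal sketch; the `triage_order` helper, the dictionary layout, and the sample rows are hypothetical.

```python
# Lower number means "look at this first".
PRIORITY = {"Critical": 0, "Warning": 1, "Good": 2}


def triage_order(sensors: list[dict]) -> list[dict]:
    """Sort sensors Critical first, then Warning, then Good.

    Unrecognized statuses sort last; ties are broken by sensor name.
    """
    return sorted(sensors, key=lambda s: (PRIORITY.get(s["status"], 3), s["name"]))


# Invented rows for illustration.
SENSORS = [
    {"name": "CPU1 Temp", "status": "Good"},
    {"name": "Inlet Temp", "status": "Warning"},
    {"name": "FAN3", "status": "Critical"},
    {"name": "12V", "status": "Good"},
]

for sensor in triage_order(SENSORS):
    print(sensor["status"], sensor["name"])
```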

Sensor Category Deep Dive

Each category of sensors provides insight into a different aspect of the node's physical health.

Discrete Sensors

These sensors report a discrete state (for example on/off or true/false) rather than a numeric measurement, and the table maps that state to a health status. They are excellent for quick, definitive checks.

Common Examples:

Chassis Intrusion: Detects if the case has been opened
PSU Redundancy: Confirms if the redundant power supply is healthy
System Status: Overall system health indicators

🖼️ $$Image: The Discrete Sensor table, showing examples like PSU Redundancy and Chassis Intrusion status.$$

Interpretation Guide:

Good/OK: Component functioning normally
Warning: Attention required but not critical
Critical: Immediate action required

Voltage Sensors

Think of these as the "heartbeat" and "blood pressure" monitor for the node's power system. They ensure that stable and correct voltages are being delivered to sensitive components.

How to Interpret:

Normal Behavior: Current Value should be extremely stable.
Warning Signs: Significant fluctuations or drifts into Warning range.
Potential Issues: Early indicator of failing Power Supply Unit (PSU) or motherboard issue.
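Because this page shows one snapshot per refresh, judging stability means noting a rail's value over several refreshes. The sketch below summarizes such a series; the `voltage_stability` helper and the sample readings are hypothetical, and what counts as an acceptable deviation is for the manufacturer's thresholds to define, not this code.

```python
from statistics import mean, pstdev


def voltage_stability(samples: list[float], nominal: float) -> dict:
    """Summarize how far a rail sits from nominal and how much it moves."""
    average = mean(samples)
    return {
        "mean": average,
        # Signed offset of the average from the nominal voltage, in percent.
        "deviation_pct": (average - nominal) / nominal * 100,
        # Difference between the highest and lowest noted reading.
        "spread": max(samples) - min(samples),
        # Population standard deviation of the noted readings.
        "stdev": pstdev(samples),
    }


# Invented readings of a 12V rail noted across three page refreshes.
result = voltage_stability([12.0, 12.1, 11.9], nominal=12.0)
for key, value in result.items():
    print(f"{key}: {value:.3f}")
```

A healthy rail should show a small spread and a deviation close to zero; a growing spread across checks is the kind of drift the list above describes.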

🖼️ $$Image: The Voltage sensor table, highlighting the Current Value in relation to the warning and critical thresholds.$$

Critical Voltage Rails to Monitor:

12V Rails: Primary power distribution
5V Rails: Legacy component power
3.3V Rails: Logic and memory power

Fan Sensors

These sensors are the "respiratory check" for the node's cooling system, reporting the speed (RPM) of each fan.

How to Interpret:

A fan's status is determined by comparing its Current Value (RPM) against its predefined Thresholds.

Good: The fan is spinning within its normal, expected RPM range.
Warning: The fan's speed has crossed a Warning threshold, either too slow (a possible sign of impending failure) or too fast (a possible sign of high heat load). This requires investigation.
Critical: The fan's speed has crossed a Critical threshold. This could mean it is spinning dangerously slow, dangerously fast, or has stopped entirely (0 RPM). This state requires immediate attention.

Fan Monitoring Best Practices:

Redundancy Check: Verify multiple fans are operational.
RPM Consistency: Compare similar fans for consistent speeds.
Trend Analysis: Monitor for gradual RPM degradation.
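The RPM consistency check can be sketched as a comparison of each fan against the median of its peers. The `rpm_outliers` helper, the 15% tolerance, and the sample speeds are assumptions chosen for illustration; fans of different models or positions may legitimately run at different speeds.

```python
from statistics import median


def rpm_outliers(fans: dict[str, float], tolerance_pct: float = 15.0) -> list[str]:
    """Return the names of fans whose RPM differs from the group median
    by more than tolerance_pct percent. Expects at least one fan."""
    middle = median(fans.values())
    if middle == 0:
        # Most fans are stopped: report the ones that are still spinning.
        return [name for name, rpm in fans.items() if rpm != 0]
    return [name for name, rpm in fans.items()
            if abs(rpm - middle) / middle * 100 > tolerance_pct]


# Invented readings for four similar fans.
FANS = {"FAN1": 8000, "FAN2": 8100, "FAN3": 5200, "FAN4": 7900}
print(rpm_outliers(FANS))
```

Using the median rather than the mean keeps one failing fan from dragging the reference value toward itself.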

Temperature Sensors

This is the "fever check" for your node, providing precise temperature readings from critical components. This is your primary tool for identifying and preventing overheating.

How to Interpret:

Normal Operation: Temperatures within manufacturer specifications.
High Load: Elevated but within acceptable ranges during heavy workloads.
Cooling Issues: Sustained high temperatures indicating cooling system problems.

🖼️ $$Image: The Temperature sensor table, with the °C/°F toggle visible.$$

Temperature Monitoring Zones:

CPU Zones: Processor thermal management
Memory Zones: DIMM thermal monitoring
Ambient Zones: Overall chassis temperature
PSU Zones: Power Supply Unit thermal monitoring

Unit Toggle: You can switch the display between Celsius (°C) and Fahrenheit (°F) using the toggle in the top-right corner of the table.
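When you record readings outside the page, the conversion behind the toggle is the standard formula F = C × 9/5 + 32. A minimal sketch (the function names are illustrative):

```python
def celsius_to_fahrenheit(celsius: float) -> float:
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32


def fahrenheit_to_celsius(fahrenheit: float) -> float:
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (fahrenheit - 32) * 5 / 9


print(celsius_to_fahrenheit(85))
print(fahrenheit_to_celsius(212))
```

Remember to convert thresholds along with readings so that both sides of a comparison use the same unit.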

From Sensor Alert to Actionable Insight

This page is your starting point for diagnosis. When you find a sensor with a non-good status, the goal is to turn that alert into actionable information for maintenance or support.

Your Diagnostic Workflow

Sensor Alert → Event Correlation → Evidence Gathering → Action Planning

Step-by-Step Process:

Identify the Fault: Note the full name of the sensor reporting an issue (e.g., "CPU1 DIMM A2 Temperature").
Find the Correlating Event: Immediately navigate to the BMC SEL tab. The system automatically logs a detailed event that corresponds directly to the sensor alert.
Gather the Evidence: The event log provides the precise timestamp and error details.
Take Action: This collected evidence is exactly what you need to provide to a technician for a physical inspection or to a vendor for technical support.
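If you export or copy event log entries, the correlation step can be sketched as a filter on the sensor name. The event structure below (`timestamp`, `sensor`, `message` keys), the `find_events` helper, and the sample entries are all hypothetical; the actual BMC SEL tab layout is covered in the next chapter.

```python
from datetime import datetime

# Invented entries for illustration only.
EVENTS = [
    {"timestamp": "2024-05-01T10:15:00", "sensor": "CPU1 DIMM A2 Temperature",
     "message": "Upper non-critical threshold crossed"},
    {"timestamp": "2024-05-01T10:42:30", "sensor": "CPU1 DIMM A2 Temperature",
     "message": "Upper critical threshold crossed"},
    {"timestamp": "2024-05-01T09:00:00", "sensor": "FAN3",
     "message": "Lower critical threshold crossed"},
]


def find_events(events: list[dict], sensor_name: str) -> list[dict]:
    """Return events for one sensor, newest first (case-insensitive match)."""
    wanted = sensor_name.casefold()
    matches = [e for e in events if e["sensor"].casefold() == wanted]
    return sorted(matches,
                  key=lambda e: datetime.fromisoformat(e["timestamp"]),
                  reverse=True)


for event in find_events(EVENTS, "CPU1 DIMM A2 Temperature"):
    print(event["timestamp"], event["message"])
```

Matching on the full sensor name is why step one asks you to note it exactly.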

Documentation Template for Support

When escalating sensor issues, include:

Node Identity: System Name and Serial Number
Sensor Details: Full sensor name and current reading
Threshold Information: Operating limits and current status
Event Log Entry: Corresponding BMC SEL event with timestamp
Environmental Context: Workload and environmental conditions
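The template can be turned into a small formatter so that every escalation carries the same five fields. This is a sketch only; the `build_escalation` helper and all sample values (node name, serial number, readings) are invented.

```python
def build_escalation(node_name: str, serial_number: str, sensor_name: str,
                     reading: str, thresholds: str, status: str,
                     sel_event: str, context: str) -> str:
    """Return a plain-text summary with one line per template field."""
    return "\n".join([
        f"Node Identity: {node_name} (S/N {serial_number})",
        f"Sensor Details: {sensor_name} = {reading}",
        f"Threshold Information: {thresholds}; status {status}",
        f"Event Log Entry: {sel_event}",
        f"Environmental Context: {context}",
    ])


report = build_escalation(
    node_name="node-07",
    serial_number="SN-EXAMPLE-0001",
    sensor_name="CPU1 DIMM A2 Temperature",
    reading="91 °C",
    thresholds="warning 85 °C, critical 95 °C",
    status="Warning",
    sel_event="2024-05-01T10:15:00 Upper non-critical threshold crossed",
    context="sustained heavy workload, normal room temperature",
)
print(report)
```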

Sensor Monitoring Best Practices

Daily Health Check Routine

Priority Scanning: Check for any Critical or Warning status indicators.
Baseline Verification: Note normal operating ranges for your environment.
Trend Correlation: Compare with Summary tab trends for context.
Event Correlation: Cross-reference with BMC SEL for related events.

Proactive Monitoring Strategies

Establish Baselines:

Document normal operating ranges for each sensor type.
Note typical values during different workload conditions.
Track seasonal variations in temperature readings.

Early Warning Detection:

Monitor sensors approaching warning thresholds.
Track gradual changes that might indicate developing issues.
Correlate sensor patterns across similar nodes.
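One way to spot a sensor approaching its warning threshold is to flag readings that fall within a margin below the limit. The `approaching_upper_warning` helper and the 10% margin are assumptions for illustration; choose a margin that suits your baselines.

```python
def approaching_upper_warning(value: float, upper_warning: float,
                              margin_pct: float = 10.0) -> bool:
    """Return True when the value is within margin_pct percent below the
    upper warning threshold but has not yet reached it."""
    early_warning_line = upper_warning * (1 - margin_pct / 100)
    return early_warning_line <= value < upper_warning


print(approaching_upper_warning(78, upper_warning=85))
print(approaching_upper_warning(60, upper_warning=85))
print(approaching_upper_warning(90, upper_warning=85))
```

The last call returns False on purpose: a reading of 90 has already crossed the threshold, so it is a Warning to act on rather than an early signal.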

Integration with Other Monitoring

Cross-Reference Points:

Summary Tab: Use graphs for trend analysis.
BMC SEL: Check for corresponding system events.
Operations Tab: Verify system operations impact on sensors.
Services Tab: Correlate with service health status.

Chapter Summary & Key Takeaways

Summary is for Trends, Sensors is for Thresholds: Use the Summary tab to see graphs over time. Use this Sensors tab to see the exact current value and compare it against the official safe operating limits.
Refresh is Required: The data is a snapshot. Always manually refresh (F5) to get the latest readings.
Your Goal is Evidence: A sensor alert is your clue. The detailed event in the BMC SEL tab is your evidence for taking action.
Follow the Workflow: Sensor Alert → Event Correlation → Evidence Gathering → Action Planning.
Monitor Proactively: Establish baselines and track trends to prevent issues before they become critical.

What's Next:

Chapter 7.4 will explore the BMC System Event Log (SEL), where you'll learn to investigate the detailed hardware events that correspond to sensor alerts and system activities.

💡 Pro Tip: Create a baseline document of normal sensor readings for each node type in your environment - this makes identifying abnormal conditions much faster and more accurate.
