Hardware Health

One of the most powerful features of the BMC is its ability to give you a real-time view into the server's physical health. Proactively monitoring hardware can help you prevent downtime by catching potential issues before they become critical failures.

In this chapter, you'll learn how to check overall system health, read sensor data, monitor and control the fans, and use the identify LED to locate a server.

Checking overall system health

The first place to check for a high-level health summary is the Inventory and LEDs page. It consolidates the status of all monitored components into a single health indicator.

To check the overall system health:

In the sidebar menu, navigate to Hardware status > Inventory and LEDs.
Look for the Overall System Health field.

[Image, EXISTING, Source: 5.1: Screenshot of the Inventory and LEDs page, showing the LED light control and component health status.]

This field provides a top-level status based on all active sensors and event logs. It will typically show one of the following states:

Healthy: No critical or warning events are detected.
Degraded: One or more warning-level events are present. The system is still operational but requires attention.
Critical: A critical event has been detected that could lead to failure. This requires immediate investigation.
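
The three states above amount to a "worst severity wins" roll-up. The sketch below is only an illustration of that rule, not the BMC's actual implementation: the Severity enum and the overall_health function are hypothetical names, and it assumes each active event or sensor has already been reduced to one of three severity levels.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical severity levels, ordered from least to most serious."""
    OK = 0
    WARNING = 1
    CRITICAL = 2


def overall_health(severities):
    """Return the roll-up state for an iterable of Severity values.

    Assumption: the most serious severity present decides the result.
    """
    worst = max(severities, default=Severity.OK)
    if worst == Severity.CRITICAL:
        return "Critical"
    if worst == Severity.WARNING:
        return "Degraded"
    return "Healthy"


# One warning among otherwise healthy components degrades the whole system.
state = overall_health([Severity.OK, Severity.WARNING, Severity.OK])
```

Under this model, a single warning is enough to move the field from Healthy to Degraded, which is why the field is a good first check but not a substitute for the detailed pages described next.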

Monitoring sensors (temperature, voltage)

The Sensors page gives you detailed, real-time readings from every sensor in the system, including those for temperature, voltage, and current. This is the most important screen for diagnosing thermal or power-related issues.

Navigate to Hardware status > Sensors to view the sensor list.

[Image, EXISTING, Source: 5.2: Screenshot of the Sensors page, showing a table of sensor readings.]

The sensor table provides the following information for each component:

Name: The unique identifier for the sensor (e.g., Core 48 CPU1, P12V AUX).
State: The operational state of the sensor, which can be Enabled or Disabled.
Status: The current health of the sensor reading, which can be OK, Warning, or Critical.
Lower Critical / Lower Warning: The minimum threshold values. If the current value drops below these, an alert will be triggered.
Current Value: The real-time reading from the sensor, with the appropriate unit (e.g., Cel for Celsius, V for Volts).
Upper Warning / Upper Critical: The maximum threshold values. If the current value exceeds these, an alert will be triggered.
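
To make the relationship between Current Value, the four thresholds, and Status concrete, here is a minimal sketch of how a reading could be classified. It is an assumption-laden illustration rather than the firmware's logic: the Thresholds class and classify function are hypothetical, the comparisons are strict because the descriptions above say "drops below" and "exceeds", and the example voltages are invented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Thresholds:
    """Hypothetical container; None means the sensor has no such threshold."""
    lower_critical: Optional[float] = None
    lower_warning: Optional[float] = None
    upper_warning: Optional[float] = None
    upper_critical: Optional[float] = None


def classify(value: float, t: Thresholds) -> str:
    """Map a reading to OK, Warning, or Critical."""
    # Check the critical bounds first so that a critical reading
    # is never reported as a mere warning.
    if t.lower_critical is not None and value < t.lower_critical:
        return "Critical"
    if t.upper_critical is not None and value > t.upper_critical:
        return "Critical"
    if t.lower_warning is not None and value < t.lower_warning:
        return "Warning"
    if t.upper_warning is not None and value > t.upper_warning:
        return "Warning"
    return "OK"


# Illustrative thresholds for a 12 V rail (values are made up for the example).
p12v = Thresholds(lower_critical=10.8, lower_warning=11.4,
                  upper_warning=12.6, upper_critical=13.2)
status = classify(12.8, p12v)
```

In this example the reading of 12.8 is above the upper warning threshold but below the upper critical threshold, so it is classified as Warning.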

Best Practice: Focus on the thresholds

When reviewing sensor data, pay close attention to the Current Value in relation to the Upper Warning and Upper Critical thresholds. A value consistently near the warning level may indicate a cooling problem or a component under excessive load.
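
If you record readings over time, the idea of a value being "consistently near the warning level" can be expressed as a simple check. This is a rough sketch under stated assumptions: the function name, the 10% margin, and the 80% fraction are all hypothetical choices, the threshold is assumed to be positive, and the sample readings are invented.

```python
def near_upper_warning(samples, upper_warning, margin=0.10, min_fraction=0.8):
    """Return True if most samples sit within `margin` of the upper warning.

    samples       -- recent readings for one sensor, collected by you
    upper_warning -- the sensor's Upper Warning threshold (assumed positive)
    margin        -- how close counts as "near" (0.10 = within 10%)
    min_fraction  -- share of samples that must be near to count as "consistent"
    """
    if not samples:
        return False
    cutoff = upper_warning * (1 - margin)
    close = sum(1 for value in samples if value >= cutoff)
    return close / len(samples) >= min_fraction


# A CPU temperature hovering just under an 85 degree warning threshold.
worth_investigating = near_upper_warning([78, 80, 79, 81, 77], upper_warning=85)
```

A sensor flagged this way is still reporting OK, so it will not show up when filtering by Warning or Critical status.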

You can use the Search bar to find a specific sensor by name or use the Filter option to narrow down the list to only show sensors in a Warning or Critical state. This is extremely useful for quickly identifying the source of a problem.

[Image, EXISTING, Source: 5.2: Screenshot of the sensor filter dropdown menu, showing options to filter by Status (Critical, Warning, OK) and State (Enabled, Disabled).]
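
The combined effect of Search and Filter can be pictured as the following sketch over a list of sensor rows. It only models the behavior described above. The filter_sensors function is hypothetical, the case-insensitive substring match is an assumption about how the search works, and the sample rows (including the "Inlet Temp" sensor) are invented for illustration.

```python
def filter_sensors(sensors, name_contains="", statuses=None, states=None):
    """Return the rows that match the search text and the selected filters.

    statuses / states -- sets of allowed values; None means "do not filter".
    """
    needle = name_contains.lower()
    return [
        s for s in sensors
        if needle in s["name"].lower()
        and (statuses is None or s["status"] in statuses)
        and (states is None or s["state"] in states)
    ]


# Invented sample rows using the column names from the sensor table.
SAMPLE_SENSORS = [
    {"name": "Core 48 CPU1", "state": "Enabled", "status": "Warning"},
    {"name": "P12V AUX", "state": "Enabled", "status": "OK"},
    {"name": "Inlet Temp", "state": "Disabled", "status": "OK"},
]

# Show only sensors that need attention.
problems = filter_sensors(SAMPLE_SENSORS, statuses={"Warning", "Critical"})
```

Here only the "Core 48 CPU1" row remains, which mirrors selecting Warning and Critical in the Status filter.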

Monitoring fan status and control

System fans are critical for maintaining proper operating temperatures. The Fans page displays the real-time speed and status of each fan in the chassis and allows for manual control.

Navigate to Hardware status > Fans to view the fan information.

[Image, EXISTING, Source: 5.3: Screenshot of the Fans page, showing Fan Health, Fan Control buttons, and a list of fans with their status and speed.]

The page shows the overall Fan Health and lists each fan with its current Status and Fan speed (in RPM).

You can also use the Fan Control buttons to change the fan speed profile:

Normal duty: The standard operating mode, balancing cooling and power consumption.
High duty: Forces fans to a higher speed for maximum cooling. This is useful when you anticipate a heavy workload or when ambient temperatures are high.
Auto control: Allows the BMC to dynamically adjust fan speeds based on sensor readings. This is the recommended setting for most environments.

Using the identify LED

In a data center with racks full of identical servers, finding the specific machine you need to work on can be challenging. The System Identify LED is a simple but essential tool to solve this problem.

To turn on the identify LED:

Navigate to Hardware status > Inventory and LEDs.
In the LED light control section, find the System Identify LED toggle.
Switch it to On.

This will cause a physical LED on the front and/or back of the server chassis to light up or blink (usually blue), making it easy to spot in the rack.

Pro Tip: A technician's best friend

Before sending a remote hands technician to perform physical work on a server (like replacing a drive or a PSU), always turn on the identify LED first. This simple step prevents accidental work on the wrong machine, which can be a costly mistake.

