Intel® VROC RAID Logging and Monitoring in Linux
5.1 Intel® VROC RAID Logging
Messages from the MDRAID subsystem in the Linux* kernel are logged for status, warnings, and errors. In most Linux distributions, these entries are stored in:
/var/log/messages
This system log aggregates kernel messages together with RAID-related outputs. Administrators can use it to monitor Intel® VROC RAID activity and identify issues that require attention.
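To watch for new RAID messages as they arrive, the log file can simply be followed. A minimal illustration (the log file path may differ on some distributions):
# tail -f /var/log/messages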
Retrieving Kernel Logs with dmesg
The dmesg command displays kernel messages in real time, including RAID-related events such as initialization, synchronization, device failures, and recovery operations.
Example output:
# dmesg
[Thu Aug 4 09:19:52 2022] md/raid1:md126: not clean -- starting background reconstruction
[Thu Aug 4 09:19:52 2022] md/raid1:md126: active with 2 out of 2 mirrors
[Thu Aug 4 09:19:52 2022] md126: detected capacity change from 0 to 107374182400
[Thu Aug 4 09:19:52 2022] md: resync of RAID array md126
[Thu Aug 4 09:21:36 2022] md: md126: resync done.
[Thu Aug 4 09:21:43 2022] md126: detected capacity change from 107374182400 to 0
[Thu Aug 4 09:21:43 2022] md: md126 stopped.
[Thu Aug 4 09:21:43 2022] md: md126 stopped.
[Thu Aug 4 09:21:43 2022] md: md127 stopped.
[Thu Aug 4 09:23:14 2022] md126: detected capacity change from 0 to 16003123642368
[Thu Aug 4 09:23:38 2022] md126: detected capacity change from 16003123642368 to 0
[Thu Aug 4 09:23:38 2022] md: md126 stopped.
[Thu Aug 4 09:23:38 2022] md: md127 stopped.
[Fri Aug 5 01:52:54 2022] md/raid:md126: not clean -- starting background reconstruction
[Fri Aug 5 01:52:54 2022] md/raid:md126: device nvme3n1 operational as raid disk 2
[Fri Aug 5 01:52:54 2022] md/raid:md126: device nvme0n1 operational as raid disk 1
[Fri Aug 5 01:52:54 2022] md/raid:md126: device nvme1n1 operational as raid disk 0
[Fri Aug 5 01:52:54 2022] md/raid:md126: raid level 5 active with 3 out of 3 devices, algorithm 0
[Fri Aug 5 01:52:54 2022] md126: detected capacity change from 0 to 214748364800
[Fri Aug 5 01:52:54 2022] md: resync of RAID array md126
[Fri Aug 5 01:54:36 2022] md: md126: resync done.
[Fri Aug 5 01:54:54 2022] md/raid:md126: Disk failure on nvme0n1, disabling device.
md/raid:md126: Operation continuing on 2 devices.
[Fri Aug 5 01:54:54 2022] md: recovery of RAID array md126
[Fri Aug 5 01:56:41 2022] md: md126: recovery done.
[Fri Aug 5 02:00:20 2022] md/raid:md126: Disk failure on nvme3n1, disabling device.
md/raid:md126: Operation continuing on 2 devices.
[Fri Aug 5 02:00:50 2022] md: recovery of RAID array md126
[Fri Aug 5 02:02:46 2022] md: md126: recovery done.
These logs provide detailed insights into the lifecycle of a RAID volume, helping administrators quickly identify events such as rebuilds, failures, or capacity changes.
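Because dmesg mixes RAID messages with all other kernel output, it is often convenient to filter the ring buffer. A minimal illustration, assuming the RAID volume is md126 as in the output above (-T prints human-readable timestamps):
# dmesg -T | grep md126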
Retrieving System Journal Logs with journalctl
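The systemd journal collects kernel and service messages in a single place and is queried with the journalctl command. As an illustration, the current boot can be reviewed in full, or the output can be limited to kernel messages and filtered for MDRAID entries:
# journalctl -b
# journalctl -k | grep md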
Below is an example snippet of what the journal log may look like:
Aug 05 01:52:55 localhost.localdomain kernel: md/raid:md126: not clean -- starting background reconstruction
Aug 05 01:52:55 localhost.localdomain kernel: md/raid:md126: device nvme3n1 operational as raid disk 2
Aug 05 01:52:55 localhost.localdomain kernel: md/raid:md126: device nvme0n1 operational as raid disk 1
Aug 05 01:52:55 localhost.localdomain kernel: md/raid:md126: device nvme1n1 operational as raid disk 0
Aug 05 01:52:55 localhost.localdomain kernel: md/raid:md126: raid level 5 active with 3 out of 3 devices, algorithm 0
Aug 05 01:52:55 localhost.localdomain kernel: md126: detected capacity change from 0 to 214748364800
Aug 05 01:52:55 localhost.localdomain systemd[1]: Starting MD Metadata Monitor on /dev/md127...
Aug 05 01:52:55 localhost.localdomain systemd[1]: Started MD Metadata Monitor on /dev/md127.
Aug 05 01:52:55 localhost.localdomain kernel: md: resync of RAID array md126
Aug 05 01:52:58 localhost.localdomain dhclient[6077]: DHCPDISCOVER on ens785f1 to 255.255.255.255 port 67 interval 3 (xid=0x3b9ab34d)
Aug 05 01:53:01 localhost.localdomain dhclient[6077]: DHCPDISCOVER on ens785f1 to 255.255.255.255 port 67 interval 5 (xid=0x3b9ab34d)
Aug 05 01:53:06 localhost.localdomain dhclient[6077]: DHCPDISCOVER on ens785f1 to 255.255.255.255 port 67 interval 8 (xid=0x3b9ab34d)
Aug 05 01:53:14 localhost.localdomain dhclient[6077]: DHCPDISCOVER on ens785f1 to 255.255.255.255 port 67 interval 15 (xid=0x3b9ab34d)
Aug 05 01:53:29 localhost.localdomain dhclient[6077]: DHCPDISCOVER on ens785f1 to 255.255.255.255 port 67 interval 21 (xid=0x3b9ab34d)
Aug 05 01:53:50 localhost.localdomain dhclient[6077]: DHCPDISCOVER on ens785f1 to 255.255.255.255 port 67 interval 9 (xid=0x3b9ab34d)
Aug 05 01:53:59 localhost.localdomain dhclient[6077]: No DHCPOFFERS received.
Aug 05 01:53:59 localhost.localdomain dhclient[6077]: No working leases in persistent database - sleeping.
Aug 05 01:54:37 localhost.localdomain kernel: md: md126: resync done.
Aug 05 01:54:55 localhost.localdomain kernel: md/raid:md126: Disk failure on nvme0n1, disabling device.
md/raid:md126: Operation continuing on 2 devices.
Aug 05 01:54:55 localhost.localdomain udisksd[2823]: Unable to resolve /sys/devices/virtual/block/md126/md/dev-nvme0n1/block symlink
Aug 05 01:54:55 localhost.localdomain kernel: md: recovery of RAID array md126
Reviewing Syslog Messages (/var/log/messages)
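Entries in /var/log/messages can also be filtered after the fact with standard text tools. A minimal illustration that matches the md126/md127 devices used in the example below:
# grep md12 /var/log/messages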
Below is an example snippet of what the log may look like in /var/log/messages:
Aug 4 09:05:10 localhost kernel: md126: detected capacity change from 0 to 16003123642368
Aug 4 09:05:23 localhost kernel: md126: detected capacity change from 16003123642368 to 0
Aug 4 09:05:23 localhost kernel: md: md126 stopped.
Aug 4 09:05:23 localhost kernel: md: md127 stopped.
Aug 4 09:06:56 localhost kernel: md/raid:md126: not clean -- starting background reconstruction
Aug 4 09:06:56 localhost kernel: md/raid:md126: device nvme2n1 operational as raid disk 3
Aug 4 09:06:56 localhost kernel: md/raid:md126: device nvme3n1 operational as raid disk 2
Aug 4 09:06:56 localhost kernel: md/raid:md126: device nvme0n1 operational as raid disk 1
Aug 4 09:06:56 localhost kernel: md/raid:md126: device nvme1n1 operational as raid disk 0
Aug 4 09:06:56 localhost kernel: md/raid:md126: raid level 5 active with 4 out of 4 devices, algorithm 0
Aug 4 09:06:56 localhost kernel: md126: detected capacity change from 0 to 322122547200
Aug 4 09:06:56 localhost systemd[1]: Starting MD Metadata Monitor on /dev/md127...
Aug 4 09:06:56 localhost systemd[1]: Started MD Metadata Monitor on /dev/md127.
Aug 4 09:06:56 localhost kernel: md: resync of RAID array md126
Aug 4 09:09:15 localhost kernel: md: md126: resync done.
Aug 4 09:09:24 localhost kernel: md126: detected capacity change from 322122547200 to 0
Aug 4 09:09:24 localhost kernel: md: md126 stopped.
Aug 4 09:09:24 localhost systemd[1]: mdmon@md127.service: Succeeded.
Aug 4 09:09:24 localhost kernel: md: md126 stopped.
Aug 4 09:09:24 localhost kernel: md: md127 stopped.
Aug 4 09:10:23 localhost kernel: md/raid10:md126: not clean -- starting background reconstruction
Aug 4 09:10:23 localhost kernel: md/raid10:md126: active with 4 out of 4 devices
Aug 4 09:10:23 localhost kernel: md126: detected capacity change from 0 to 214748364800
Aug 4 09:10:23 localhost systemd[1]: Starting MD Metadata Monitor on /dev/md127...
Aug 4 09:10:23 localhost systemd[1]: Started MD Metadata Monitor on /dev/md127.
Aug 4 09:10:23 localhost kernel: md: resync of RAID array md126
Aug 4 09:12:00 localhost kernel: md: md126: resync done.
Aug 4 09:16:32 localhost kernel: md126: detected capacity change from 214748364800 to 0
Aug 4 09:16:32 localhost kernel: md: md126 stopped.
Aug 4 09:16:32 localhost systemd[1]: mdmon@md127.service: Succeeded.
Aug 4 09:16:32 localhost kernel: md: md126 stopped.
Aug 4 09:16:32 localhost kernel: md: md127 stopped.
Aug 4 09:19:53 localhost kernel: md/raid1:md126: not clean -- starting background reconstruction
Aug 4 09:19:53 localhost kernel: md/raid1:md126: active with 2 out of 2 mirrors
Aug 4 09:19:53 localhost kernel: md126: detected capacity change from 0 to 107374182400
Aug 4 09:19:53 localhost systemd[1]: Starting MD Metadata Monitor on /dev/md127...
Aug 4 09:19:53 localhost systemd[1]: Started MD Metadata Monitor on /dev/md127.
Aug 4 09:19:53 localhost kernel: md: resync of RAID array md126
Aug 4 09:21:37 localhost kernel: md: md126: resync done.
Aug 4 09:21:44 localhost kernel: md126: detected capacity change from 107374182400 to 0
Aug 4 09:21:44 localhost kernel: md: md126 stopped.
Aug 4 09:21:44 localhost systemd[1]: mdmon@md127.service: Succeeded.
Aug 4 09:21:44 localhost kernel: md: md126 stopped.
Aug 4 09:21:44 localhost kernel: md: md127 stopped.
5.2 RAID Monitoring
Once an Intel® VROC RAID volume is active, the mdmonitor daemon starts automatically. It monitors RAID events such as degraded arrays, drive failures, and rebuild progress. If configured in /etc/mdadm.conf, it can also trigger predefined actions or notifications.
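Before relying on automated alerts, it can be useful to verify that event reporting works end to end. One way to do this (shown here as an illustrative invocation) is to have mdadm generate a TestMessage alert for every detected array and then exit; any configured mail address or event-handling program should receive the resulting events:
mdadm --monitor --scan --oneshot --test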
Using the mdadm Monitoring Daemon
You can start the mdadm monitoring service manually with the following command:
mdadm --monitor --scan --daemonise --syslog
This runs mdadm as a background daemon to monitor all RAID devices and report events to syslog. Administrators can then filter syslog entries for RAID-specific events.
Before starting monitoring, you must define an email address in the mdadm.conf file to receive notifications. For example:
echo "MAILADDR root" >> /etc/mdadm.conf
Using systemctl for RAID Monitoring
The mdmonitor daemon is integrated with systemd, allowing you to manage it using systemctl commands:
Check service status:
systemctl status mdmonitor.service
Start the service manually:
systemctl start mdmonitor.service
Restart the service:
systemctl restart mdmonitor.service
Enable service to start at boot:
systemctl enable mdmonitor.service
Stop the service:
systemctl stop mdmonitor.service
5.3 RAID Alerts
Intel® VROC reports RAID alerts through the monitoring service in Linux*. Administrators can integrate custom programs with the monitoring service to receive and process these alerts, enabling proactive response to RAID events.
Table 5-1. Intel® VROC RAID Alerts in Linux
VROC Alert/Event    Severity       Description
Fail                Critical       A member drive in the RAID has failed.
FailSpare           Critical       The spare drive used for rebuild has failed.
DeviceDisappeared   Critical       A RAID member device or volume has disappeared (removed or inaccessible).
DegradedArray       Critical       The RAID array has entered a degraded state.
RebuildStarted      Warning        A degraded RAID has started the rebuild (recovery) process.
RebuildNN           Warning        Notification of rebuild progress; NN is a two-digit number (e.g., 20, 40, …) indicating that the rebuild has passed that percentage of the total.
RebuildFinished     Warning        The rebuild of a degraded RAID is complete or aborted.
SparesMissing       Warning        One or more spare drives defined in mdadm.conf are missing or removed.
SpareActive         Information    A spare drive has been successfully rebuilt and activated.
NewArray            Information    A new RAID array has been detected.
MoveSpare           Information    A spare drive has been reassigned from one array to another.
5.4 Developing a Program to Handle RAID Alerts
The Intel® VROC RAID monitoring service allows administrators to register custom programs to receive and process RAID alerts. This is configured in the /etc/mdadm.conf file, enabling the monitoring service to call the user-defined program whenever an event occurs.
When invoked, the program receives two or three parameters (an example call follows the list below):
Event name – identifies the alert type.
RAID volume device name – indicates the affected RAID device.
Device identifier (optional) – provided when the event relates to a spare or member device.
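For example, a failed member drive could result in the handler being called in a form similar to the following (the device names here are illustrative):
/usr/sbin/vroc_linux_events_handler.sh Fail /dev/md126 /dev/nvme0n1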
Below is an example of a simple bash script that handles Intel VROC alerts by writing messages to a log file:
#!/bin/bash
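# Event handler invoked by the mdadm monitoring service.
# $1 = event name, $2 = RAID volume device, $3 = member/spare device (optional).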
event=$1
md_device=$2
device=$3

case $event in
DegradedArray)
msg="$md_device is running in the Degraded MODE"
;;
DeviceDisappeared)
msg="$md_device has disappeared"
;;
Fail)
msg="$md_device had a failed member device: $device"
;;
FailSpare)
msg="$md_device: Spare device ($device) FAIL during rebuild"
;;
RebuildStarted)
msg="Recovery/Rebuilding of $md_device has started"
;;
Rebuild??)
msg="$md_device REBUILD is now $(echo $event|sed 's/Rebuild//')% complete"
;;
RebuildFinished)
msg="Rebuild of $md_device is completed or aborted"
;;
SpareActive)
msg="$device has become an ACTIVE COMPONENT of $md_device"
;;
NewArray)
msg="$md_device has been detected"
;;
MoveSpare)
msg="SPARE device $device has been MOVED to a new array :$md_device"
;;
SparesMissing)
msg="$md_device is MISSING one or more SPARE devices"
;;
TestMessage)
msg="TEST MESSAGE generated for $md_device"
;;
esac
# In this example, we just send the event message to the tmp log.
echo "[$(date -u)] $msg" >> /tmp/vroc_alerts.log
To enable this handler:
Place the script in /usr/sbin, e.g., /usr/sbin/vroc_linux_events_handler.sh, and make it executable (chmod +x /usr/sbin/vroc_linux_events_handler.sh).
Add a PROGRAM line pointing to the script to /etc/mdadm.conf, then restart the mdmonitor service so the change takes effect. The resulting configuration may look like this:
# cat /etc/mdadm.conf
ARRAY metadata=imsm UUID=f69f9275:68fce440:3420da7a:48e2a723
ARRAY /dev/md/vol0 container=f69f9275:68fce440:3420da7a:48e2a723 member=0 UUID=06c1975e:2c160226:ef62cbc6:b42e4570
POLICY domain=DOMAIN path=* metadata=imsm action=spare-same-slot
PROGRAM /usr/sbin/vroc_linux_events_handler.sh
Sample output written to /tmp/vroc_alerts.log by this program:
# cat /tmp/vroc_alerts.log
[Tue Feb 21 01:59:28 UTC 2023] Rebuild of /dev/md/vol0 is completed or aborted
[Tue Feb 21 01:59:28 UTC 2023] /dev/md/vol0 has disappeared
[Tue Feb 21 02:13:47 UTC 2023] /dev/md/vol0 REBUILD is now 21% complete
[Tue Feb 21 02:29:53 UTC 2023] /dev/md/vol0 REBUILD is now 40% complete
[Tue Feb 21 02:43:53 UTC 2023] /dev/md/vol0 REBUILD is now 60% complete
[Tue Feb 21 02:56:54 UTC 2023] /dev/md/vol0 REBUILD is now 80% complete
[Tue Feb 21 03:10:08 UTC 2023] Rebuild of /dev/md/vol0 is completed or aborted
[Wed Feb 22 02:44:28 UTC 2023] /dev/md/vol0 had a failed member device: /dev/nvme7n1
[Wed Feb 22 02:47:01 UTC 2023] Recovery/Rebuilding of /dev/md/vol0 has started

