vSAN Monitoring and Troubleshooting Tools - Part 2

May 6, 2024

This is part of the VMware vSAN guide post series. You can access and explore more objectives from the VMware vSAN study guide using the following link.

VMware vSAN – Study Guide

Several tools are available for monitoring and troubleshooting vSAN, and being familiar with them makes troubleshooting faster. In the previous post, I walked through vSAN Skyline Health; in this post, I will explore vSAN cluster-level monitoring.

vSAN Skyline Health
vSAN Cluster Level Monitoring
vSAN Host Monitoring
vSAN VM Monitoring

vSAN Cluster Monitoring

Alongside Skyline Health, several cluster-level monitoring and troubleshooting tools help you identify and resolve issues from different perspectives. These tools include:

Virtual objects
Resyncing Objects
Proactive Testing
Capacity
Performance
Performance Diagnostics
Support
Data Migration Pre-Check

In this post, I will look at Virtual Objects and Resyncing Objects, and I will explain the remaining items in upcoming posts.

Virtual objects

Here, you can monitor the state of each object in the vSAN cluster, including the object's performance and its placement according to its policy. In the following screenshot, you can see that all objects are in a healthy state, which means every object is aligned with its respective policy. By expanding a parent object, you can view its child objects along with their individual states.

Let’s click on View Placement Details to open the Physical Placement window for VM1. Here, you can view the placement of each virtual object. Although the vSAN policy for this specific object is RAID 5, you may notice in the following GIF that there is also a RAID 1 tree under the object, which is the Performance Leg. vSAN ESA creates it automatically to enhance performance, and it is part of the new architecture. If you are unfamiliar with the vSAN ESA Performance Leg, you can learn more about it here and here.

The Capacity Leg, as I expected, is RAID 5, and it includes three RAID 0 components, each placed on a separate host. You can also group components for easier viewing.

If you select an object in the Virtual Objects section and click View Performance, you can monitor the workload for specific objects or for the entire virtual machine. Select a time range and click Show Results to display performance charts for the object or the virtual machine, depending on which tab you are on. These charts include IOPS, throughput, latency, and more, which help you identify bottlenecks and performance-related issues within vSAN. Since this is a home lab, I simulated some I/O to generate IOPS and provide more meaningful data in the performance section. As you may notice, I am seeing high latency on hard disk 1.

In this example, I used the dd command on CentOS 7 to write 1 GiB blocks of zeros, up to three thousand times, with synchronous writes to generate I/O load. Note that with if=/dev/zero, this command generates write I/O only.

dd if=/dev/zero of=/tmp/vsantest bs=1G count=3000 oflag=dsync
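The command above asks dd for 3,000 blocks of 1 GiB, which is more than many lab disks can hold. A bounded variant may be easier to repeat. The following is only a sketch under my own assumptions, not the exact test used in this lab: the `io_test` helper, its arguments, and the sizes are hypothetical. It writes a file with synchronous writes (GNU dd's `oflag=dsync`, as in the original command) and then reads it back. Keep in mind that the read-back may be served from the guest page cache rather than from vSAN.

```shell
#!/bin/sh
# Hypothetical helper (not from the original lab): generate write I/O, then read I/O.
# Usage: io_test FILE BLOCK_SIZE COUNT
io_test() {
    file=$1
    bs=$2
    count=$3
    # Write COUNT blocks of zeros; oflag=dsync makes each write synchronous (GNU dd).
    dd if=/dev/zero of="$file" bs="$bs" count="$count" oflag=dsync 2>/dev/null || return 1
    # Read the file back and discard the data.
    dd if="$file" of=/dev/null bs="$bs" 2>/dev/null || return 1
    # Print the resulting file size in bytes.
    wc -c < "$file" | tr -d ' '
}
```

For example, `io_test /tmp/vsantest 1M 1024` would write and then read 1 GiB, which is usually enough to make the performance charts move.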

When one or more hosts are unable to communicate with the others, you will observe some objects changing state, for example to inaccessible, unhealthy, or reduced availability. Similarly, if one or more physical disks encounter issues, certain objects may fail.

For the complete list of states, see the following link. For now, I will try to simulate some of them.

https://knowledge.broadcom.com/external/article?legacyId=2108319

I placed one host (esxi03) into maintenance mode with the “Ensure Accessibility” option to see what would happen to the virtual objects for VM1. As you can see in the following screenshot, the object state changed to “Reduced availability with no rebuild – delay timer”. This means the objects have suffered a failure, but vSAN has not started re-protecting them; it waits for the Object Repair Timer (60 minutes by default) to expire before starting the re-protection process.

You can change the “Object Repair Timer” under the Configure tab of the vSAN cluster. There is a card for Advanced Options; click Edit to adjust the timer to your needs.

I changed it to 10 minutes to continue with the simulation. Now that the timer has expired, I expect vSAN to attempt re-protection. However, since I have no more hosts available, it cannot succeed, so the object state should change to something else. During this process, the virtual machine remains accessible and working. The problem is only that the object is not compliant with its policy.

I ran the I/O simulation once again to verify that the virtual machine is still working, and it is working fine!

Now I see that the object state has changed to “Reduced availability with no rebuild“. It is similar to the previous state, with one difference: the object is still suffering a failure and vSAN is still not re-protecting it, but the delay timer has already expired, so the timer is no longer the reason. In my case, the cluster does not have enough resources.
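The two "no rebuild" situations can be summarized in a few lines of shell. This is a simplified model of the behavior I observed, not vSAN's actual logic: the `repair_status` function, its arguments, and the output strings are my own naming for illustration.

```shell
#!/bin/sh
# Simplified model of the observed behavior; not vSAN's actual decision logic.
# Usage: repair_status ELAPSED_MIN TIMER_MIN SPARE_HOSTS
repair_status() {
    elapsed=$1
    timer=$2
    spare=$3
    if [ "$elapsed" -lt "$timer" ]; then
        # The repair timer is still running, so vSAN waits.
        echo "no rebuild - delay timer ($((timer - elapsed)) min left)"
    elif [ "$spare" -gt 0 ]; then
        # Timer expired and there is somewhere to rebuild the missing component.
        echo "rebuild can start"
    else
        # Timer expired, but no spare host is left to hold a new component.
        echo "no rebuild (insufficient resources)"
    fi
}

repair_status 5 60 0    # prints: no rebuild - delay timer (55 min left)
repair_status 15 10 0   # prints: no rebuild (insufficient resources)
repair_status 15 10 1   # prints: rebuild can start
```

The second call corresponds to my three-node lab: the 10-minute timer has expired, but there is no fourth host to rebuild onto.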

Let’s do another simulation. I am using RAID 5, and given the nature of RAID 5, I expected that a single failure would not impact the virtual machine; as I observed, that expectation held. Now, let’s put another host into maintenance mode using the ‘Ensure Accessibility‘ option.

Before I see what happens, let’s think about the scenario. On one hand, placing two out of three hosts into maintenance mode is not acceptable in a RAID 5 setup with three nodes: when two nodes are lost, data loss occurs. On the other hand, selecting ‘Ensure Accessibility’ means the data should remain accessible even after the host enters maintenance mode.
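The first half of that reasoning is plain single-parity arithmetic, sketched below. It assumes one component per host and that no data is moved before a host goes offline; the `raid5_status` function and its output strings are hypothetical. It is not how vSAN decides placement, and ‘Ensure Accessibility’ exists precisely to break the "no data is moved" assumption.

```shell
#!/bin/sh
# Generic single-parity reasoning, not vSAN's placement logic.
# Assumes one component per host and no data relocation.
# Usage: raid5_status COMPONENTS OFFLINE
raid5_status() {
    components=$1
    offline=$2
    online=$((components - offline))
    if [ "$online" -eq "$components" ]; then
        echo "healthy"
    elif [ "$online" -ge $((components - 1)) ]; then
        # One component missing: data can be reconstructed from parity.
        echo "readable, no redundancy left"
    else
        # Two or more components missing: parity cannot reconstruct the data.
        echo "unreadable without relocation"
    fi
}

raid5_status 3 1   # prints: readable, no redundancy left
raid5_status 3 2   # prints: unreadable without relocation
```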

Putting two out of three hosts into maintenance mode is not recommended. I only do it in my lab to see the outcome! This test should never be performed in a production environment; otherwise, you are putting your data at great risk!

I put the second host into maintenance mode, and after that, I ran some tests on my VM. As shown in the GIF above, I still have access to the VM, and the simulated write and read operations worked well.

However, let’s look at the placement of the virtual objects. The object layout has been switched from RAID 5 to RAID 0! No way!

But it is possible, and the reason is tied to the selection of ‘Ensure accessibility’, which takes precedence over everything else, including the storage policy. If you have a RAID-5 or RAID-6 object and you repeatedly put hosts into maintenance mode, then eventually, without sufficient fault domains, vSAN may temporarily switch the object to RAID-0 and consolidate it onto a single host to keep it accessible. The object reverts to its original configuration once the fault domains come back online.

This was interesting for me as well, and to be honest, I was not aware of it. If you need more information about this topic, I recommend taking a look at the following link.

https://blogs.vmware.com/virtualblocks/2018/09/10/vsan-maintenance-mode-raid-1-and-raid-5-using-ensure-accessibility

Resyncing Objects

The second tool is Resyncing Objects. As its name indicates, here you can monitor resynchronization tasks that are currently in progress. Let’s bring the hosts back into service (exit maintenance mode) and then look at Resyncing Objects to get more information about the status of the objects undergoing resynchronization.

Various metrics offer insight into the resynchronization process. In my case, there is one object being resynchronized. There are 59 GB left before the resynchronization is complete, with an estimated 36 minutes remaining. Furthermore, no additional objects are waiting to be resynchronized.
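Those two numbers also give a rough idea of the average resync rate that the estimate implies. The sketch below is simple arithmetic on the figures from my screenshot; the `resync_rate_mb_s` function name is hypothetical, and it assumes 1 GB = 1000 MB, whereas the UI may be using binary units.

```shell
#!/bin/sh
# Average rate implied by "bytes left" and "ETA".
# Assumes 1 GB = 1000 MB; the vSphere Client may use binary units instead.
# Usage: resync_rate_mb_s GB_LEFT MINUTES_LEFT
resync_rate_mb_s() {
    awk -v gb="$1" -v min="$2" 'BEGIN { printf "%.1f\n", gb * 1000 / (min * 60) }'
}

resync_rate_mb_s 59 36   # prints: 27.3
```

So vSAN expects to move roughly 27 MB/s on average in this lab, which is a useful sanity check when an ETA looks unexpectedly long.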

In the ‘Object Lists‘ section, you’ll find a list of items that need to be synchronized. You might ask yourself why only one object is listed in the upper section while 23 items are listed below. This is because the upper section refreshes automatically every 10 seconds, whereas the lower section requires a manual refresh. When I took the screenshot, I forgot to refresh the object list first.