vSAN’s resilience features and data availability
This is part of the VMware vSAN guide post series. By using the following link, you can access and explore more objectives from the VMware vSAN study guide.
In this post, we will explore vSAN’s resilience features and data availability mechanisms, showcasing how it helps you safeguard your data in a vSAN environment.
vSAN Component States
In vSAN, components can be in an active, absent, degraded, or reconfiguring state.
- Active: An active component is in a healthy and operational state. It is actively participating in read and write operations within the vSAN cluster. Active components are fully functional and serve data to virtual machines.
- Absent: A component in the absent state is missing or unavailable in the vSAN cluster, for example because of a hardware failure or a network issue. vSAN cannot access the data stored on that component, but it treats the failure as temporary: the component might recover and return to its working state. If the component does not become available within a certain time interval (60 minutes by default), vSAN starts rebuilding it from the object's remaining redundancy (for example, a RAID-1 mirror copy or RAID-5/6 parity).
- Degraded: The degraded state indicates a permanent component failure, and vSAN assumes that the component is not going to recover to the working state. For example, a component is marked as degraded after a disk failure or a storage controller failure. vSAN starts rebuilding a degraded component immediately.
- Reconfiguring: The reconfiguring state means that vSAN is actively making changes to the configuration of a component or its associated data. This can occur when you perform maintenance tasks like adding or removing disks, changing storage policies, or during the process of data repair/rebuilding after a failure.
In a normal and healthy situation, all components are marked as active. However, in the event of a problem with one of the ESXi hosts, such as a disk failure, the components are marked as degraded, and vSAN attempts to rebuild them.
By contrast, if you need to patch an ESXi host and place it in maintenance mode with the default settings, the components on that host transition to the absent state. vSAN does not initiate the rebuilding process immediately. Instead, it waits for a predefined timer, which is set to 60 minutes by default. If the host does not exit maintenance mode within this timeframe, vSAN then initiates the rebuilding process.
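The difference between the absent and degraded states can be summarized in a few lines of Python. This is only an illustrative sketch of the behavior described above, not vSAN code: the names ComponentState, REPAIR_TIMER_MINUTES, and should_start_rebuild are hypothetical, and treating the timer as expired at exactly 60 minutes is a simplifying assumption.

```python
from enum import Enum


class ComponentState(Enum):
    ACTIVE = "active"
    ABSENT = "absent"
    DEGRADED = "degraded"
    RECONFIGURING = "reconfiguring"


# Default repair delay described in the text (hypothetical constant name).
REPAIR_TIMER_MINUTES = 60


def should_start_rebuild(state: ComponentState,
                         minutes_unavailable: int,
                         repair_timer: int = REPAIR_TIMER_MINUTES) -> bool:
    """Return True if a rebuild would start under the rules described above."""
    if state is ComponentState.DEGRADED:
        # Permanent failure: rebuild starts immediately.
        return True
    if state is ComponentState.ABSENT:
        # Possibly temporary failure: wait until the repair timer expires.
        return minutes_unavailable >= repair_timer
    # Active and reconfiguring components do not trigger a rebuild here.
    return False


if __name__ == "__main__":
    print(should_start_rebuild(ComponentState.ABSENT, 30))    # host still in maintenance
    print(should_start_rebuild(ComponentState.DEGRADED, 0))   # failed disk
```

In this model, a host that returns from maintenance mode after 30 minutes never triggers a rebuild, while a failed disk triggers one right away.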
If you look at the following screenshot, you see that all components that belong to this virtual machine are in an Active state.
Now, if I attempt to place one host (ESXi02) into maintenance mode, all the components within this host will transition to the Absent state, as shown in the screenshot below.
vSAN resynchronization involves the restoration of degraded or missing components and the synchronization of stale components to ensure they are up to date. This process is initiated when a hardware device, host, or network experiences a failure, or when a host remains offline for an extended period after being placed in maintenance mode.
vSAN mainly performs resynchronization in the following scenarios:
- When editing a virtual machine (VM) storage policy.
- Upon restarting a host after a failure.
- If a host remains unavailable for over 60 minutes (Object Repair Timer).
- When placing a host in maintenance mode with Full data migration mode enabled.
This is one of the reasons it is recommended to have more than the minimum number of hosts required for vSAN: with spare capacity in the cluster, resyncs can always take place and policy compliance is maintained.
Consider the following scenario: During the resynchronization process, a previously failed host returns to the environment. What will vSAN do? Will it continue to build new components or update the existing components?
In this case, vSAN calculates the time required for both building a new component and resynchronizing the existing component. After evaluating the time needed for each method, vSAN selects the one that takes the least amount of time.
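The decision can be pictured as a simple comparison of two estimates. The sketch below is a hypothetical model of that choice only: how vSAN actually estimates each duration is internal and not covered here, the function name and return labels are invented for illustration, and preferring the existing component on a tie is an assumption.

```python
def choose_repair_method(rebuild_minutes: float, resync_minutes: float) -> str:
    """Pick the faster option between building a new component and
    resynchronizing the existing (returned) one.

    Both inputs are hypothetical time estimates supplied by the caller.
    """
    # Assumption: on a tie, keep the existing component and resync it.
    if resync_minutes <= rebuild_minutes:
        return "resync_existing"
    return "build_new"


if __name__ == "__main__":
    # The rebuild is only 10% done, but the returned host needs a small delta.
    print(choose_repair_method(rebuild_minutes=90, resync_minutes=20))
    # The rebuild is almost finished, and the returned host is badly stale.
    print(choose_repair_method(rebuild_minutes=15, resync_minutes=40))
```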
Durability components further reduce the time needed for resynchronization by creating a temporary component that captures new writes. In vSAN ESA, a durability component is created when three conditions are met: a host is placed into maintenance mode using the Ensure Accessibility option, the object uses RAID-5 or RAID-6 erasure coding, and another fault domain is available to receive the new or updated data. The durability component is created on that fault domain and captures all new and incremental writes. If the absent component returns before the Object Repair Timer expires, it is resynchronized from the durability component and becomes active again. Afterward, the durability component is deleted.
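To make the lifecycle concrete, here is a minimal Python sketch of the scenario described above. It is a toy model under stated assumptions, not vSAN's implementation: the function names, the string labels for the maintenance mode and RAID level, and the use of dictionaries as block maps are all hypothetical.

```python
def durability_component_applies(maintenance_mode: str,
                                 raid_level: str,
                                 spare_fault_domains: int) -> bool:
    """Check the three conditions from the scenario described in the text."""
    return (
        maintenance_mode == "ensure_accessibility"
        and raid_level in {"RAID-5", "RAID-6"}
        and spare_fault_domains >= 1
    )


def resync_from_durability(absent_blocks: dict, durability_blocks: dict) -> int:
    """Replay captured writes onto the returning component.

    Returns the number of blocks replayed, then empties the durability
    component to model its deletion.
    """
    replayed = len(durability_blocks)
    absent_blocks.update(durability_blocks)  # newer writes overwrite stale blocks
    durability_blocks.clear()
    return replayed


if __name__ == "__main__":
    stale = {1: "a", 2: "b"}        # component on the host in maintenance mode
    captured = {2: "b2", 3: "c"}    # writes captured while the host was away
    if durability_component_applies("ensure_accessibility", "RAID-5", 1):
        print(resync_from_durability(stale, captured), stale)
```

Only the blocks written during the outage are copied, which is why this path is faster than rebuilding the whole component.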