Bugra Parlayan | Oracle Database Blog

Comprehensive Guide to Oracle Exadata Automatic Hard Disk Scrubbing

I. Introduction: Overview of the Exadata Hard Disk Scrubbing Process

Data integrity is a cornerstone of modern computing systems. Errors that may occur during the storage, reading, transmission, and processing of data can have devastating effects on business processes. Various error detection and correction mechanisms have been developed to mitigate these risks. One such mechanism is the “data scrubbing” process.

A. Data Scrubbing: General Concept

Data scrubbing is an error correction technique that periodically inspects storage devices or main memory for errors and corrects detected errors using redundant data, such as checksums or backup copies of the data. Its primary purpose is to reduce the likelihood that single, correctable errors will accumulate over time and lead to uncorrectable errors. This ensures data integrity and minimizes the risk of data loss.  

This technique is a widely used error detection and correction mechanism in memory modules (with ECC memory), RAID arrays, modern file systems like ZFS and Btrfs, and FPGAs. For example, a RAID controller can periodically read all hard disks in a RAID array to detect and repair bad blocks before applications access them, thereby reducing the probability of silent data corruption caused by bit-level errors.  

B. Exadata Automatic Hard Disk Scrubbing: Definition and Scope

The Oracle Exadata platform employs a multi-layered approach to ensure data integrity. One of these layers is the Exadata Automatic Hard Disk Scrub and Repair feature. As part of the Exadata System Software (Cell Software), this feature automatically and periodically inspects the hard disk drives (HDDs) within the Storage Servers (Cells) when the disks are idle.  

The primary goal of this process is to proactively detect and facilitate the repair of bad sectors or other physical/logical defects on the disks before applications attempt to access the affected data. This prevents “latent” or silent data corruption.  

The scope of Exadata scrubbing is deliberately narrow. The feature primarily targets physical bad sectors on hard disks, focusing on media errors that might be missed by standard drive Error Correcting Code (ECC) mechanisms or operating system checks. It complements, but does not replace, higher-level logical consistency checks performed by the database (e.g., via the DB_BLOCK_CHECKING parameter) or the manually executable ASM disk scrubbing process. Furthermore, this automatic scrubbing process does not apply to Flash drives in Exadata; these drives are protected by different mechanisms.

A distinctive aspect of Exadata scrubbing is its proactive nature. While database block checks typically occur during I/O operations, Exadata scrubbing specifically targets data that has not been accessed for a long time, and it does so while the disks are idle. This approach ensures that corruption in rarely used data is detected and repaired long before it can cause an access error at a critical moment.

C. Differences Between Exadata Hard Disk Scrubbing and ASM Disk Scrubbing

The term “scrubbing” can be used in different contexts within the Oracle ecosystem, so it’s crucial to distinguish Exadata’s automatic hard disk scrubbing from the disk scrubbing feature offered by Oracle Automatic Storage Management (ASM).  

These two mechanisms are complementary. Exadata scrubbing finds physical errors, potentially preventing them from causing logical corruptions later, while ASM scrubbing can find logical inconsistencies that might arise from sources other than physical media errors (e.g., software bugs). Oracle documentation suggests that due to the presence of automatic Exadata scrubbing in Exadata 11.2.3.3 and later, periodic ASM disk scrubbing becomes less critical for the specific purpose of proactive physical/latent error checking. However, manual ASM scrubbing retains its value for on-demand logical validation of specific files or disk groups.  

II. Internal Mechanism of the Exadata Scrubbing Process

The effectiveness of the Exadata Automatic Hard Disk Scrubbing process relies on the tight integration between the core components of the Exadata architecture: the Storage Servers (Cells) and Oracle Automatic Storage Management (ASM).

A. Role of Exadata Storage Servers (Cells)

The scrubbing process is executed by the Exadata System Software (specifically, the Cell Services – CELLSRV process) running on each Exadata Storage Server (Cell). The inspection is local to the cell where the scanned disk resides; data is not sent outside the cell during the sector check phase. This minimizes inter-cell network traffic for the inspection stage.  

The Cell Software continuously monitors disk health and I/O utilization to determine when to start, pause, or throttle the scrubbing process. Typically, scrubbing begins or resumes when the average disk I/O utilization drops below a certain threshold (often cited as 25%).  

B. Interaction with Oracle ASM for Detection and Repair

When the Exadata scrubbing process detects a bad sector on a hard disk, the procedure unfolds as follows:

  1. Detection: The Cell Software identifies a physical read error or inconsistency during its periodic scan.
  2. Request Submission: The Cell Software that detected the faulty sector automatically sends a repair request to the Oracle ASM instance managing the disk group containing that disk.  
  3. Repair by ASM: Upon receiving the request, ASM orchestrates the repair by reading a healthy copy of the data block (extent) containing the bad sector from another storage server where a mirrored copy resides.  

This interaction exemplifies Exadata’s “Intelligent Storage” philosophy; low-level physical error detection happens within the cell, while ASM, which understands the database structure and data placement, coordinates the logical repair.

C. Leveraging ASM Mirroring for Data Recovery

Oracle ASM mirroring (Normal or High Redundancy) is fundamental to Exadata’s data protection strategy, and the repair capability of the scrubbing process is entirely dependent on this mechanism.  

ASM distributes redundant copies (extents) of data blocks across different failure groups (which in Exadata are typically the Storage Servers). This ensures data accessibility even if an entire cell becomes unavailable, as data can be accessed from other copies.  

When ASM receives a repair request triggered by scrubbing, it follows these steps:

  1. Locate Healthy Copy: ASM identifies a disk on a different storage cell that holds a valid copy of the affected data block. ASM knows which disks are “partners” and where mirrored copies are stored.  
  2. Read Data: ASM reads the correct data from the disk containing the healthy copy.
  3. Write Over Bad Sector: ASM uses the correct data read to overwrite the bad sector on the original disk, thus correcting the error.  
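The locate, read, and overwrite steps above can be sketched as a toy model. This is only a conceptual illustration, not how CELLSRV or ASM is implemented: the `ToyDisk` class, the `scrub_and_repair` function, and the disk names are all hypothetical, and a "bad sector" is simulated as a sector ID that raises an error on read.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDisk:
    """Hypothetical stand-in for one disk in one failure group (cell)."""
    name: str
    sectors: dict = field(default_factory=dict)  # sector id -> bytes
    bad: set = field(default_factory=set)        # sector ids that fail to read

    def read(self, sector_id):
        if sector_id in self.bad:
            raise IOError(f"{self.name}: media error on sector {sector_id}")
        return self.sectors[sector_id]

    def write(self, sector_id, data):
        # In this toy model, rewriting a sector clears the simulated media error.
        self.sectors[sector_id] = data
        self.bad.discard(sector_id)


def scrub_and_repair(disk, partners):
    """Scan every sector on `disk`; repair unreadable ones from a partner copy.

    Returns (repaired_sector_ids, unrepairable_sector_ids).
    """
    repaired, unrepairable = [], []
    for sector_id in sorted(disk.sectors):
        try:
            disk.read(sector_id)          # detection: local read check
            continue                      # sector is readable, nothing to do
        except IOError:
            pass
        for partner in partners:          # step 1: locate a healthy copy
            try:
                good = partner.read(sector_id)   # step 2: read the good data
            except (IOError, KeyError):
                continue                  # this partner cannot help, try the next
            disk.write(sector_id, good)   # step 3: overwrite the bad sector
            repaired.append(sector_id)
            break
        else:
            unrepairable.append(sector_id)  # no partner had a readable copy
    return repaired, unrepairable


data = {0: b"alpha", 1: b"beta", 2: b"gamma"}
primary = ToyDisk("cell01_disk", dict(data), bad={1})
mirror = ToyDisk("cell02_disk", dict(data))
print(scrub_and_repair(primary, [mirror]))  # prints ([1], [])
```

If the partner list is empty, or every partner also fails to read the sector, the sector lands in the unrepairable list: detection still works, but repair does not.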

The success of this repair mechanism hinges entirely on the existence of valid and accessible ASM mirrors. If a second disk failure occurs in a Normal Redundancy (2 copies) disk group before a rebalance completes, or if all three copies become inaccessible simultaneously in a High Redundancy (3 copies) group, scrubbing can detect the error, but ASM cannot repair it. This is why High Redundancy is strongly recommended for critical systems: the extra copy significantly reduces the probability of losing all copies concurrently.
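A back-of-the-envelope calculation shows why the extra copy matters. The sketch below makes a strong simplifying assumption that is not from Oracle documentation: each copy is unavailable independently with the same probability. Real failures can be correlated, so treat the numbers as an illustration of the trend only. The function name and the example probability are hypothetical.

```python
def p_all_copies_lost(p_copy_unavailable: float, copies: int) -> float:
    """Probability that every mirror copy is unavailable at once.

    Assumes (unrealistically) that copies fail independently with equal probability.
    """
    if not 0.0 <= p_copy_unavailable <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if copies < 1:
        raise ValueError("at least one copy is required")
    return p_copy_unavailable ** copies


p = 0.001  # illustrative value only, not a measured failure rate
normal = p_all_copies_lost(p, copies=2)  # Normal Redundancy: 2 copies
high = p_all_copies_lost(p, copies=3)    # High Redundancy: 3 copies
print(f"normal={normal:.1e} high={high:.1e} ratio={normal / high:.0f}x")
```

Under these assumptions the third copy lowers the chance of an unrepairable sector by a factor of 1/p.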

Furthermore, the scrubbing process not only repairs isolated bad sectors but can also serve as an early indicator of more severe disk problems. If numerous or persistent errors are detected during scrubbing, it can lead ASM to take the corresponding grid disk offline and initiate a rebalance operation to redistribute data onto the remaining healthy disks. In this context, scrubbing also acts as an early warning system that triggers ASM’s existing high availability (HA) mechanisms. Monitoring the V$ASM_OPERATION view during or after scrub periods is important for tracking such ASM recovery actions.  

D. Types of Errors Detected

Exadata Automatic Hard Disk Scrubbing primarily focuses on detecting physical bad sectors and latent media errors on hard disk drives that might not be caught by standard drive ECC or operating system checks. Damaged or worn-out sectors or other physical defects fall under this scope.  

The “logical defects” mentioned typically refer to low-level media inconsistencies rather than logical corruptions at the ASM or database level (which is the domain of ASM scrubbing). The main goal is to find such issues before they impact data access or lead to silent data corruption.

III. Managing and Monitoring the Exadata Scrubbing Process

Effectively utilizing the Exadata Automatic Hard Disk Scrubbing feature requires proper configuration and continuous monitoring. The primary tool for these tasks is the CellCLI (Cell Command Line Interface) utility.

A. CellCLI Commands for Configuration

CellCLI is the main command-line interface for managing Exadata storage server features. Scrubbing-related configuration is done using the ALTER CELL command with the hardDiskScrubInterval and hardDiskScrubStartTime attributes.

To view the current scrubbing settings, use the command: CellCLI> LIST CELL ATTRIBUTES hardDiskScrubInterval, hardDiskScrubStartTime

Note that configuration is done on a per-cell basis, meaning these settings apply to all hard disks within a specific storage server. However, the “Adaptive Scrubbing Schedule” feature can automatically adjust the effective run frequency for specific disks identified as problematic, although the base schedule is configured cell-wide.  
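When the same settings must be applied on several cells, it can help to generate the CellCLI command text programmatically and validate the values first. The helper below is a minimal sketch: the function names are hypothetical, the accepted interval values are the ones listed in this guide (daily, weekly, biweekly, none), and the timestamp in the example is purely illustrative. It only builds strings; it does not run CellCLI.

```python
VALID_INTERVALS = ("daily", "weekly", "biweekly", "none")


def scrub_interval_command(interval: str) -> str:
    """Build the ALTER CELL command that sets the cell-wide scrub interval."""
    if interval not in VALID_INTERVALS:
        raise ValueError(
            f"interval must be one of {VALID_INTERVALS}, got {interval!r}"
        )
    return f"ALTER CELL hardDiskScrubInterval={interval}"


def scrub_start_command(start: str) -> str:
    """Build the ALTER CELL command that sets the next scrub start time.

    The keyword 'now' is passed unquoted; any other value is treated as a
    timestamp and wrapped in single quotes, following the syntax shown in
    this guide's command table.
    """
    if start == "now":
        return "ALTER CELL hardDiskScrubStartTime=now"
    if "'" in start:
        raise ValueError("timestamp must not contain a single quote")
    return f"ALTER CELL hardDiskScrubStartTime='{start}'"


print(scrub_interval_command("weekly"))
print(scrub_start_command("now"))
# Illustrative timestamp; check the accepted format for your software version.
print(scrub_start_command("2025-01-04T23:00:00-05:00"))
```

Validating before sending avoids pushing a mistyped interval to every cell in a loop.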

B. Monitoring Scrubbing Activity

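Several methods are available to understand the status and impact of the scrubbing process, including CellCLI metrics, the cell alert history, AWR reports, and ASM views such as V$ASM_OPERATION.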

C. Starting, Stopping, and Checking Status

D. Table 1: Essential CellCLI Commands for Exadata Hard Disk Scrubbing

The following table summarizes the key CellCLI commands used to manage and monitor the Exadata hard disk scrubbing process:

| Command | Purpose | Example |
| --- | --- | --- |
| `ALTER CELL hardDiskScrubInterval = [daily\|weekly\|biweekly\|none]` | Sets the scrubbing interval. | |
| `ALTER CELL hardDiskScrubStartTime = ['<timestamp>'\|now]` | Sets the start time for the next scheduled scrubbing operation. | `ALTER CELL hardDiskScrubStartTime=now` |
| `LIST CELL ATTRIBUTES hardDiskScrubInterval, hardDiskScrubStartTime` | Displays the current scrubbing schedule configuration. | `LIST CELL ATTRIBUTES hardDiskScrubInterval, hardDiskScrubStartTime` |
| `LIST METRICCURRENT WHERE name = 'CD_IO_BY_R_SCRUB_SEC'` | Monitors the real-time scrubbing I/O rate for each hard disk. | `LIST METRICCURRENT WHERE name = 'CD_IO_BY_R_SCRUB_SEC'` |
| `LIST ALERTHISTORY WHERE message LIKE '%scrubbing%'` | Checks logs for scrubbing start/finish/error messages. | `LIST ALERTHISTORY WHERE message LIKE '%scrubbing%'` |
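The output of the CD_IO_BY_R_SCRUB_SEC metric query can be post-processed to see which cell disks are being scrubbed at the moment. The sketch below assumes each output line holds the metric name, the cell disk name, and a value followed by a unit, separated by whitespace. That layout and the sample text are assumptions for illustration, so compare them against real output from your cells before relying on the parser. Both function names are hypothetical.

```python
# Illustrative sample only; real output and disk names will differ.
SAMPLE_OUTPUT = """\
    CD_IO_BY_R_SCRUB_SEC    CD_00_cell01    0.000 MB/sec
    CD_IO_BY_R_SCRUB_SEC    CD_01_cell01    112.500 MB/sec
    CD_IO_BY_R_SCRUB_SEC    CD_02_cell01    0.000 MB/sec
"""


def parse_scrub_rates(text: str) -> dict:
    """Map cell disk name -> scrub read rate (MB/sec) from metric output text."""
    rates = {}
    for line in text.splitlines():
        parts = line.split()
        # Expect at least: metric name, object name, numeric value.
        if len(parts) < 3 or parts[0] != "CD_IO_BY_R_SCRUB_SEC":
            continue
        # Drop thousands separators in case the value is printed with commas.
        rates[parts[1]] = float(parts[2].replace(",", ""))
    return rates


def disks_being_scrubbed(rates: dict) -> list:
    """Return the cell disks whose scrub read rate is above zero."""
    return sorted(name for name, rate in rates.items() if rate > 0)


rates = parse_scrub_rates(SAMPLE_OUTPUT)
print(disks_being_scrubbed(rates))  # prints ['CD_01_cell01']
```

A non-zero rate on a disk indicates scrubbing reads in progress; all zeros means scrubbing is idle, paused, or finished.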

IV. Performance Impacts of the Scrubbing Process

While designed to proactively protect data integrity, Exadata Automatic Hard Disk Scrubbing does have an impact on system resources, particularly the I/O subsystem. Understanding and managing this impact is crucial.

A. Resource Consumption (CPU, I/O)

The primary resource consumed by the scrubbing process is Disk I/O. The operation involves reading sectors from the hard disks. On an otherwise idle system or disk, the scrubbing process can significantly increase disk utilization, potentially reaching close to 100% for the disk being scanned.  

CPU consumption on the storage server (Cell) for the scrubbing check itself is generally low, as it’s largely an I/O-bound operation. However, if scrubbing detects an error and triggers a repair via ASM, that repair process (reading the good copy and writing it to the bad location) can consume additional resources (CPU and network) across cells and potentially database nodes, although the Exadata architecture aims to minimize this impact.  

B. Designed Operating Window (Low I/O Utilization)

A key design principle to minimize the performance impact of Exadata scrubbing is that the process only runs when the storage server detects low average I/O utilization. This threshold is commonly cited as 25%.  

The system automatically pauses or throttles scrubbing activity when I/O demand from the database workload exceeds this threshold. This mechanism aims to prevent scrubbing from significantly impacting production workloads.  
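The run-or-pause behavior described above can be modeled in a few lines. This is a simplified sketch, not the actual Cell Software logic: the function names are hypothetical, the 25% threshold is the commonly cited value mentioned in this guide, and treating a reading exactly at the threshold as "pause" is an assumption.

```python
def scrub_decision(avg_io_utilization: float, threshold: float = 0.25) -> str:
    """Return 'run' when utilization is below the threshold, else 'pause'."""
    if not 0.0 <= avg_io_utilization <= 1.0:
        raise ValueError("utilization must be a fraction between 0 and 1")
    return "run" if avg_io_utilization < threshold else "pause"


def scrub_run_fraction(samples, threshold: float = 0.25) -> float:
    """Fraction of utilization samples during which scrubbing would run."""
    if not samples:
        return 0.0
    running = sum(1 for u in samples if scrub_decision(u, threshold) == "run")
    return running / len(samples)


# Hypothetical utilization readings taken at equal intervals.
samples = [0.10, 0.22, 0.40, 0.80, 0.24, 0.05]
print([scrub_decision(u) for u in samples])
print(f"scrubbing active {scrub_run_fraction(samples):.0%} of the time")
```

The fraction of time below the threshold is what ultimately determines how long a full scrub cycle takes on a busy system.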

However, there’s a nuance to the “25% utilization” threshold: it does not mean absolute idleness. A persistent background I/O load may be running just below the threshold (e.g., 20-24%), and scrubbing I/O is added on top of it. While Exadata I/O Resource Management (IORM) prioritizes user I/O, even the modest added load from scrubbing could have a noticeable effect on applications that are highly sensitive to latency. Therefore, while “low impact” is the goal, “zero impact” is not guaranteed.

C. Interaction with I/O Resource Management (IORM)

Exadata I/O Resource Management (IORM) plays a critical role in managing the performance impact of background tasks like scrubbing. IORM prioritizes and schedules I/O requests within the storage server based on configured resource plans.  

IORM automatically prioritizes database workload I/O (e.g., user queries, OLTP transactions) over background I/O processes like scrubbing. This ensures minimal impact on application performance from scrubbing activity. IORM plans can be configured to manage resources among different databases or workloads, indirectly affecting the amount of resources available for background tasks like scrubbing.  

D. Potential Performance Impact and Mitigation Methods

Despite being designed for low impact, scrubbing can cause spikes in disk utilization and may increase latency, especially when utilization is below the “idle” threshold but the system is not truly idle. The concern about performance impact, though often associated with general ASM scrubbing, can also apply to Exadata scrubbing.

To mitigate this potential impact, consider these strategies: set hardDiskScrubStartTime to a known low-load window, adjust hardDiskScrubInterval to suit the workload, and monitor scrubbing activity to confirm that cycles complete without affecting production I/O.

E. Factors Affecting Scrubbing Duration

The time required to complete a scrubbing cycle depends on several factors, chiefly the Exadata generation (disk sizes and speeds), the software version (potential efficiency improvements), and the workload pattern and resulting “idle” time on the specific system.

Duration estimates in documentation vary significantly for exactly these reasons, so monitoring in your own environment is more accurate than general estimates. For instance, one observation noted a scrubbing rate of approximately 115MB/s per disk. At this rate, continuously scanning a 22TB disk (X10M) would take roughly 53 hours (22 × 10^12 bytes ÷ 115 × 10^6 bytes per second). Given that scrubbing runs intermittently based on load, the actual completion time could be considerably longer.
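The continuous-scan estimate is simply capacity divided by throughput, and the intermittent case scales it by the share of time scrubbing is actually allowed to run. The sketch below uses decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes), which gives a little over 53 hours for the 22TB example; binary units shift the result by a few hours. The function names and the 50% active fraction are illustrative assumptions.

```python
def continuous_scrub_hours(disk_tb: float, rate_mb_per_s: float) -> float:
    """Hours to read a whole disk once at a constant scrub rate (decimal units)."""
    if disk_tb <= 0 or rate_mb_per_s <= 0:
        raise ValueError("disk size and rate must be positive")
    total_bytes = disk_tb * 1e12
    bytes_per_second = rate_mb_per_s * 1e6
    return total_bytes / bytes_per_second / 3600


def elapsed_scrub_hours(disk_tb: float, rate_mb_per_s: float,
                        active_fraction: float) -> float:
    """Wall-clock hours when scrubbing runs only `active_fraction` of the time."""
    if not 0 < active_fraction <= 1:
        raise ValueError("active_fraction must be in (0, 1]")
    return continuous_scrub_hours(disk_tb, rate_mb_per_s) / active_fraction


print(f"{continuous_scrub_hours(22, 115):.1f} h continuous")
# Assume, for illustration, that the disk is idle enough half of the time.
print(f"{elapsed_scrub_hours(22, 115, 0.5):.1f} h at 50% active time")
```

Feeding in the scrub rate and active fraction observed on your own cells gives a more useful estimate than any published figure.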

V. Key Benefits of the Exadata Scrubbing Process

Exadata Automatic Hard Disk Scrubbing is a valuable feature that significantly contributes to the data integrity and high availability capabilities of the Exadata platform.

A. Proactive Detection of Latent Errors and Silent Data Corruption

Its most fundamental benefit is the proactive discovery of physical media errors before they are encountered during normal database operations. This prevents “silent” data corruption, where errors occur on disk but remain undetected until the data is read (which could be much later). By checking data blocks that haven’t been accessed recently, it ensures such hidden threats are uncovered.

B. Enhanced Data Integrity and Reliability

By detecting physical errors and enabling their repair, the scrubbing process directly contributes to the overall data integrity and reliability of the Exadata platform. This feature complements other protection layers like Oracle HARD (Hardware Assisted Resilient Data) checks, ASM mirroring, and database-level checks, providing robust defense against data corruption.

C. Automatic Repair Mechanism

A significant advantage is that the feature automates not just detection but also the initiation of the repair process. In typical bad sector scenarios, both error detection and the triggering of repair via ASM happen automatically, requiring no manual intervention. This reduces administrative overhead and ensures timely correction of detected issues.  

D. Complements Other Exadata High Availability Features

Scrubbing is part of Exadata’s comprehensive Maximum Availability Architecture (MAA) strategy. It works alongside features like redundant hardware components, Oracle RAC for instance continuity, ASM for storage virtualization and redundancy, HARD for I/O path validation, and potentially Data Guard for disaster recovery.

This reinforces Exadata’s “defense in depth” approach to data protection. HARD checks the I/O path during writes; database checks can verify logical structure; ASM provides redundant copies of data; and scrubbing proactively inspects the physical media at rest. No single feature covers all possible scenarios, but working together, they provide robust protection. Scrubbing forms a critical layer in this strategy, specifically targeting latent physical errors that might be missed by other mechanisms.

VI. Evolution of the Scrubbing Feature Across Exadata Versions

The Exadata Automatic Hard Disk Scrubbing feature has evolved along with the platform itself.

A. Feature Introduction

The Automatic Hard Disk Scrub and Repair feature was first introduced with Oracle Exadata System Software version 11.2.3.3.0. At that time, specific minimum database/Grid Infrastructure versions like 11.2.0.4 or 12.1.0.2 were required for the feature to function.  

B. Adaptive Scrubbing Schedule

A significant enhancement arrived with Exadata System Software version 12.1.2.3.0: the Adaptive Scrubbing Schedule. With this feature, if the scrubbing process finds a bad sector on a disk, the Cell Software automatically schedules the next scrubbing job for that specific disk to run more frequently (typically weekly). This temporarily overrides the cell-wide hardDiskScrubInterval setting for that disk. If the subsequent, more frequent run finds no errors, the disk’s schedule reverts to the global hardDiskScrubInterval setting. This feature also requires specific minimum Grid Infrastructure versions to operate.  

This adaptive approach makes scrubbing more efficient. Instead of frequently scanning all disks, it focuses more attention only on disks showing potential issues. This conserves I/O resources on healthy disks while providing quicker follow-up checks on suspect ones.
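The adaptive rule can be expressed as a small scheduling function. This is a sketch of the behavior as described in this section, not Oracle's implementation: the function name is hypothetical, the 7-day follow-up reflects the "typically weekly" wording above, and taking the shorter of the two intervals when the cell is already set to daily is an assumption.

```python
INTERVAL_DAYS = {"daily": 1, "weekly": 7, "biweekly": 14}
FOLLOW_UP_DAYS = 7  # assumed follow-up interval after a bad sector is found


def next_scrub_in_days(cell_interval: str, bad_sector_found: bool):
    """Days until the next scrub of one disk, or None if scrubbing is disabled."""
    if cell_interval == "none":
        return None  # scrubbing is turned off for the whole cell
    base = INTERVAL_DAYS[cell_interval]
    if bad_sector_found:
        # Per-disk override: follow up sooner than the cell-wide schedule,
        # but never less often than the cell-wide setting already requires.
        return min(base, FOLLOW_UP_DAYS)
    # A clean run reverts the disk to the cell-wide hardDiskScrubInterval.
    return base


print(next_scrub_in_days("biweekly", bad_sector_found=True))   # prints 7
print(next_scrub_in_days("biweekly", bad_sector_found=False))  # prints 14
```

Only the disk that reported a bad sector gets the shorter interval; healthy disks on the same cell keep the cell-wide schedule.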

C. Other Related Developments (Post-12.1.2.3.0)

Available documentation primarily focuses on the introduction of the scrubbing feature and the adaptive scheduling enhancement. Detailed information about significant changes to algorithms, performance tuning (beyond IORM interaction), or reporting in later versions (e.g., 18.x, 19.x, 20.x, 21.x, 22.x, 23.x) is not provided in the reviewed sources. For newer developments, consult the release notes for the specific Exadata System Software version.

D. Table 2: Evolution of Key Exadata Scrubbing Features

The following table summarizes the key milestones in the development of the Exadata scrubbing feature:

| Exadata Software Version | Key Feature/Enhancement | Description |
| --- | --- | --- |
| 11.2.3.3.0 | Automatic Hard Disk Scrub and Repair (Introduction) | Introduced the core feature for automatic, periodic inspection and initiation of repair via ASM. |
| 12.1.2.3.0 | Adaptive Scrubbing Schedule | Automatically increases scrubbing frequency (e.g., to weekly) for disks where bad sectors were recently detected. |
| Post-12.1.2.3.0 | (Other Enhancements Unspecified) | (Specific major enhancements for later versions are not detailed in the provided documentation) |

VII. Configuration and Best Practices

To maximize the benefits of the Exadata Automatic Hard Disk Scrubbing feature, proper configuration and adherence to Oracle’s Maximum Availability Architecture (MAA) principles are important.

A. Default Settings and Configuration Options

B. Recommended Scheduling Strategies for Production Environments

C. Importance of ASM Redundancy

D. Integration with Overall MAA Strategy

Scrubbing is just one part of the MAA best practices recommended by Oracle for Exadata, alongside Data Guard, regular health checks (exachk), and robust monitoring.

E. Table 3: Exadata Scrubbing Configuration Attributes and Best Practices

This table consolidates key configuration parameters and actionable recommendations:

| Parameter/Area | Configuration/Setting | Default | Recommendation |
| --- | --- | --- | --- |
| hardDiskScrubInterval | daily, weekly, biweekly, none | biweekly | Start with default. Consider weekly for busy systems if needed, monitoring impact. Avoid none. |
| hardDiskScrubStartTime | `<timestamp>`, now | None | Explicitly set to a known low-load window (e.g., weekend night). |
| ASM Redundancy | Normal (2 copies), High (3 copies) | Normal | Use High Redundancy for production disk groups to maximize repair success probability. |
| Monitoring | CellCLI Metrics, Cell Logs, AWR, ASM Views, exachk | None | Regularly monitor scrubbing activity, completion status, performance impact, and overall system health (exachk). |
| Scheduling Strategy | Workload-dependent | Idle-based | Schedule during predictably low-load times; ensure cycles complete. |
| MAA Integration | Part of overall HA | None | Integrate with Data Guard, regular health checks, and robust monitoring per MAA guidelines. |

VIII. Conclusion

Oracle Exadata Automatic Hard Disk Scrub and Repair is a proactive defense mechanism crucial for maintaining data integrity and high availability on the Exadata platform. By periodically scanning hard disks on storage servers for physical errors, this feature detects latent corruptions, especially in infrequently accessed data, before they can impact applications.

The core strength of the scrubbing process lies in the integration between Exadata System Software and Oracle ASM. While the Cell Software detects the error, ASM manages the automatic repair process using mirrored copies. The effectiveness of this repair capability is directly tied to the correctly configured redundancy of ASM disk groups, particularly High Redundancy, which is strongly recommended for production environments.

From a performance perspective, the scrubbing process is designed to run during periods of low I/O utilization detected by the system and is managed by IORM. This aims to minimize the impact on production workloads. However, it remains important for administrators to monitor scrubbing activity via CellCLI metrics, alert logs, and AWR reports, and potentially adjust the schedule based on their environment’s specific workload patterns.

Introduced in Exadata 11.2.3.3.0 and enhanced with Adaptive Scheduling in 12.1.2.3.0, this feature is an integral part of Exadata’s multi-layered data protection strategy (including HARD checks, ASM mirroring, RAC, Data Guard, etc.). Properly configuring and operating Exadata Automatic Hard Disk Scrubbing is critical for preserving data integrity, preventing unexpected outages, and maximizing the value of the Exadata investment. For best results, scrubbing configuration and operation should be considered within the framework of Oracle MAA best practices, supported by regular system health checks (exachk) and comprehensive monitoring.
