This blog post is dedicated to database corruption prevention, recently I did a database corruption recovery for a client of mine which caused by having bad sector in their storage system; Corruption has few degrees with lowest as logical consistency and highest is fatal corruption; We are able to recover low to medium degrees of corruption by using DBCC commands along with some data loss possibilities.
Corruption can occur due to IO subsystem issues, during checkpoint or lazywriter process take an action. A checkpoint is a point in time created by the checkpoint process at which SQL Server can be sure that any committed transactions have had all their changes written to disk. This checkpoint then becomes the marker from which database recovery can start. The checkpoint process ensures that any dirty pages associated with a committed transaction will be flushed to disk. Unlike the lazywriter, however, a checkpoint does not remove the page from cache; it makes sure the dirty page is written to disk and then marks the cached paged as clean in the page header. By default, on a busy server, SQL Server will issue a checkpoint roughly every minute, which is marked in the transaction log. If the SQL Server instance or the database is restarted, then the recovery process reading the log knows that it doesn’t need to do anything with log records prior to the checkpoint. The time between checkpoints therefore represents the amount of work that needs to be done to roll forward any committed transactions that occurred after the last checkpoint, and to roll back any transactions that hadn’t committed. By checkpointing every minute, SQL Server is trying to keep the recovery time when starting a database to less than one minute, but it won’t automatically check- point unless at least 10MB has been written to the log within the period. Checkpoints can also be manually called by using the CHECKPOINT T-SQL command, and can occur because of other events happening in SQL Server. For example, when you issue a backup command, a checkpoint will run first. Trace flag 3502 is an undocumented trace flag that records in the error log when a checkpoint starts and stops. For example, after adding it as a startup trace flag and running a workload with numerous writes, the error log contains the entries, which indicates checkpoints running between 30 and 40 seconds apart.
The lazywriter is a thread that periodically checks the size of the free buffer list. When it’s low, it scans the whole data cache to age-out any pages that haven’t been used for a while. If it finds any dirty pages that haven’t been used for a while, they are flushed to disk before being marked as free in memory. lazywriter also monitors the free physical memory on the server and will release memory from the free buffer list back to Windows in very low memory conditions. When SQL Server is busy, it will also grow the size of the free buffer list to meet demand (and therefore the buffer pool) when there is free physical memory and the configured Max Server Memory threshold hasn’t been reached.
Recovery Interval is a server configuration option that can be used to influence the time between checkpoints, and therefore the time it takes to recover a database on startup — hence, “recovery interval.” by default the recovery interval is set to 0, which allows SQL Server to choose an appropriate interval, which usually equates to roughly one minute between automatic checkpoints. Changing this value to greater than 0 represents the number of minutes you want to allow between checkpoints. Under most circumstances you won’t need to change this value, but if you were more concerned about the overhead of the checkpoint process than the recovery time, you have the option. However, the recovery interval is usually set only in test and lab environments where it’s set ridiculously high in order to effectively stop automatic checkpointing for the purpose of monitoring some- thing or to gain a performance advantage. Unless you’re chasing world speed records for SQL Server you shouldn’t need to change it in a real- world production environment. SQL Server evens throttles checkpoint I/O to stop it from impacting the disk subsystem too much, so it’s quite good at self-governing. If you ever see the SLEEP_BPOOL_FLUSH wait type on your server, that means checkpoint I/O was throttled to maintain overall system performance.
How to Prevent
It is quiet easy to prevent the database corruption, it is required to perform some windows and SQL Server configuration. by following the below steps:
- Avoid to place TempDB in shared storage in cluster environment.
- Do not enable hardware cache level option in Windows for shared storage.
- Monitor the SQL Server 845, 824 and IO latency error messages.
SQL Server Error 824
2019-03-06 14:03:03.43 spid113 Error: 824, Severity: 24, State: 2. 2019-03-06 14:03:03.43 spid113 SQL Server detected a logical consistency-based I/O error: incorrect pageid (expected 1:112240573; actual 0:0). It occurred during a read of page (1:112240573) in database ID 6 at offset 0x0000d614f7a000 in file ‘E:DatabaseMicrosoft SQL ServerMSSQL12.PRDLIVEMSSQLDATAPRODDB.mdf’. Additional messages in the SQL Server error log or system event log may provide more detail. This is a severe error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online. [/plain]
This error indicates that Windows reports that the page is successfully read from disk, but SQL Server has discovered something wrong with the page. This error is similar to error 823 except that Windows did not detect the error. This usually indicates a problem in the I/O subsystem, such as a failing disk drive, disk firmware problems, faulty device driver, and so on.
SQL Server Error 845
2019-03-06 14:56:34.76 spid14s Error: 845, Severity: 17, State: 1. 2019-03-06 14:56:34.76 spid14s Time-out occurred while waiting for buffer latch type 3 for page (1:70189504), database ID 6. 2019-03-06 15:01:37.66 spid14s A time-out occurred while waiting for buffer latch — type 3, bp 00000003EFFF0C00, page 1:70189504, stat 0x1c00109, database id: 6, allocation unit Id: 72057602977234944, task 0x0000000005A1C508 : 0, wait time 300, flags 0x3a, owning task 0x0000000005A4EE08. Not continuing to wait.
A process was waiting to acquire a latch, but the process waited until the time limit expired and failed to acquire one. This can occur if an I/O operation takes too long to complete, usually as a result of other tasks blocking system processes. In some instances, this error may be the result of hardware failure. Verify whether there are associated I/O failures in the error log or event log. I/O failures are typically caused by disk malfunction.
SQL Server IO Latency Error
2019-02-26 11:20:02.17 spid6s SQL Server has encountered 286 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [E:DatabaseMicrosoft SQL ServerMSSQL12.PRDLIVEMSSQLDATAPRODDB.mdf] in database [PRODDB] (6). The OS file handle is 0x0000000000000A68. The offset of the latest long I/O is: 0x0000008d640000
2019-02-26 11:25:06.93 spid6s SQL Server has encountered 175 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [E:DatabaseMicrosoft SQL ServerMSSQL12.PRDLIVEMSSQLDATAPRODDB.mdf] in database [PRODDB] (6). The OS file handle is 0x0000000000000A68. The offset of the latest long I/O is: 0x000072e05cc000
The long I/O latency can cause the SQL Server buffer latch time-out which indirectly increasing risk of database corruption.
Hopefully, nobody face database corruption as corruption recovery is very stress full process; stay tune for more blog posts as hope it was informative post.