Merge proposal.

This article and Integrity (computing) seem to cover the same topic. JonHarder 00:11, 28 January 2006 (UTC)

This is no longer true. Badon (talk) 22:15, 21 August 2011 (UTC)

Additional Definition.

Can't data corruption also be defined as when the data takes bribes?

software for corrupting data?

Hey, um, I was thinking "hey, wouldn't it be fun to make an I Love Bees style web site", so I went looking for software that corrupts data on purpose and I found this: http://www.recyclism.com/corruption/index.php

Maybe it should be added, along with links to any other software that does anything similar. --207.160.205.13 21:52, 24 September 2007 (UTC)

To make data corrupt is just as easy as making a politician corrupt: to make a politician corrupt, just offer him money. To make a file or a database corrupt, just open it with an application, write some data and force a shutdown of the application while it is in the middle of writing... —Preceding unsigned comment added by 89.125.133.162 (talk) 10:09, 27 September 2008 (UTC)

Excellent information to be used as references

This google search is loaded with information:

http://www.google.com/search?q=silent+data+corruption

As I get around to adding info, I'll be taking a lot of it from that search. The CERN study of the subject is particularly interesting, as is the widely unrecognized scope of the problem, even in 2011/2012, when computing technology is commonly thought to be mature. Without basic data integrity checks, it is not!

In particular, the myth that "computers don't make mistakes, people do", originating from the early days (1960s to 1980s?), is completely untrue in the context of a century-long era of unknown, unchecked, unverified, uncorrected data corruption. Badon (talk) 22:19, 21 August 2011 (UTC)

I added a bunch of links to the see also section, some of which should be worked into the article prose as the article gets better. Badon (talk) 22:41, 21 August 2011 (UTC)

Definition of Data Corruption and focus of article

The following was moved here from Talk:Hard_disk_drive#Silent_Data_Corruption since the subject is far broader than HDD failures, silent or not:

At first glance it looks like the CERN and NetApp articles have two different definitions of "Silent Data Corruption." In the CERN case it is truly silent and deadly; that is, data are lost. The loss of data is determined at the system level by comparing the read data with a copy of the originally written data. On the other hand, most if not all of the errors in the NetApp article are detected and corrected before they are presented to the system (they may get corrupted thereafter). The NetApp errors are only silent in the sense that you don't know you have an error until you attempt to read; it may have occurred a long time ago and just silently sat there. More than one such error in the same error correcting domain could cause loss of data, but that would be detected and reported, unlike the silent errors at CERN. The RAID industry typically background scrubs the disk drives in arrays to minimize the possibility of enough such silent errors accumulating to cause a loss of data.
I'll take a look at your new article keeping the above in mind. Tom94022 (talk) 06:32, 2 January 2014 (UTC)
Thank you for a quick response.
Basically, there are two issues there, with completely different backgrounds. First, we have huge disks and even bigger RAID configurations, where it's unknown whether a sector is good or not until an attempt is actually made to read it; that's where scrubbing plays its role. In a few words, scrubbing goes over the whole surface of each drive and attempts to read all sectors, in order to detect any media problems as early as possible. Many HDDs also do that on their own, without the scrubbing being initiated by a RAID controller; I'm sure you've heard an HDD attached to a plain controller doing something when it should actually be idle. That's the HDD's firmware doing the scrubbing.
On the other hand, we have silent data corruption, which happens for totally different reasons. Basically, it happens because of HDD firmware bugs, undetected memory corruption or bus errors, cosmic rays etc. Even if everything checks out OK while scrubbing and later reading the data from a sector, performing sector-level ECC checks etc., the data can become corrupted while leaving the HDD's cache, while traveling through various buses, while sitting in various buffers/caches etc. Of course, silent corruption happens much less often (and, more importantly, in smaller quantities) than typical data loss due to media errors, but it's much more deadly. That's why CERN and NetApp are discussing two quite different things.
Hope it makes sense. Of course, I'm more than happy to discuss it further. — Dsimic (talk) 16:27, 2 January 2014 (UTC)
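
For readers following the scrubbing description above, here is a minimal sketch of the idea in Python. It is an illustration under stated assumptions, not how any drive firmware or RAID controller actually does it: an ordinary file stands in for the drive, and a list of previously recorded SHA-256 digests stands in for the per-sector ECC or array parity that real hardware would use. The names `BLOCK_SIZE`, `block_checksums`, `scrub` and `scrub_demo` are all hypothetical.

```python
import hashlib
import os
import tempfile

BLOCK_SIZE = 4096  # hypothetical block size standing in for a sector or stripe


def block_checksums(path, block_size=BLOCK_SIZE):
    """Read the whole file block by block and return one SHA-256 digest per block."""
    sums = []
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            sums.append(hashlib.sha256(block).hexdigest())
    return sums


def scrub(path, expected, block_size=BLOCK_SIZE):
    """Re-read every block and return the indices whose digest no longer matches."""
    current = block_checksums(path, block_size)
    bad = [i for i, (old, new) in enumerate(zip(expected, current)) if old != new]
    # Blocks that disappeared or appeared since recording also count as mismatches.
    bad.extend(range(min(len(expected), len(current)), max(len(expected), len(current))))
    return bad


def scrub_demo():
    """Write three blocks, flip one bit in the second block, then scrub."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "disk.img")
        with open(path, "wb") as f:
            f.write(os.urandom(3 * BLOCK_SIZE))
        expected = block_checksums(path)
        with open(path, "r+b") as f:
            f.seek(BLOCK_SIZE + 10)
            original_byte = f.read(1)[0]
            f.seek(BLOCK_SIZE + 10)
            f.write(bytes([original_byte ^ 0x01]))
        return scrub(path, expected)
```

Calling `scrub_demo()` reports the second block (index 1) as bad even though nothing has tried to use that data yet, which is the point of scrubbing: latent errors are found while redundancy is still available to repair them.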

I suggest this article should discuss "Silent Data Corruption" in the CERN sense, that is, information in system memory that is purported to be valid but in fact is in error.

Every element of the storage hierarchy has failure rates: soft, hard and lifetime. The vast majority of these failures are detectable, and most can be recovered from, with many being corrected on the fly. A bit flipped in any memory by a cosmic ray may or may not be detected and corrected, depending upon the system configuration. Similarly, a packet corrupted in transit or a sector corrupted on a disk may be corrected or recovered by retry. A failure that is recovered does not cause data corruption and should be beyond the scope of this article.

However, the same failure might have different consequences depending upon its system environment. The 1 out of 90 SATA HDDs in the NetApp study that had checksum mismatches are detected and corrected in NetApp arrays, but might cause Silent Data Corruption in a PC where there is no checksum. We don't know; maybe the checksum was written badly? Likewise, a PC with no RAM ECC might not detect a bit flip, but an ECC-protected server would (most of the time). So the scope question becomes: do we want to cover the entire domain of today's computing, or limit ourselves to the high end like CERN and NetApp?

I agree that having data/metadata checksums is by no means an ironclad solution, but it's way better than not having them. I'd say we shouldn't be limiting ourselves to anything, and both detected and undetected (silent) corruptions should be covered in the article. Just as you described, something detected in one environment might be a silent corruption elsewhere.
As a small example from the consumer-grade environment, I have a laptop with an AMD chipset, and due to a documented hardware bug it consistently corrupts data moved over its USB 2.0 bus; I've detected that because I'm using file-level checksums, but 99% of consumers would have no idea about it, as their music files would continue playing even after being copied with that corruption. That should be a good real-life example of detected vs. silent corruption, if you agree.
Also, the article already makes a distinction between detected and undetected corruption. Though, providing a clearer divider should be beneficial, if you agree. — Dsimic (talk) 19:23, 2 January 2014 (UTC)
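
A file-level checksum check of the kind mentioned above might look roughly like the following Python sketch. It is only an illustration: the function names `sha256_of`, `copy_and_verify` and `copy_demo` are hypothetical, and the corruption is simulated by flipping one bit in the copy rather than by a faulty bus. One caveat: immediately re-reading a freshly written file may be served from the operating system's cache instead of the device, so a real check would normally be run after the cache has been flushed or the device re-attached.

```python
import hashlib
import os
import shutil
import tempfile


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def copy_and_verify(src, dst):
    """Copy src to dst and report whether both files have the same digest."""
    shutil.copyfile(src, dst)
    return sha256_of(src) == sha256_of(dst)


def copy_demo():
    """Return (verified_after_clean_copy, verified_after_simulated_corruption)."""
    with tempfile.TemporaryDirectory() as d:
        src = os.path.join(d, "song.bin")
        dst = os.path.join(d, "copy.bin")
        with open(src, "wb") as f:
            f.write(os.urandom(100_000))
        clean = copy_and_verify(src, dst)
        # Simulate corruption in transit: flip a single bit in the copy.
        with open(dst, "r+b") as f:
            f.seek(12345)
            original_byte = f.read(1)[0]
            f.seek(12345)
            f.write(bytes([original_byte ^ 0x01]))
        corrupted = sha256_of(src) == sha256_of(dst)
        return clean, corrupted
```

A single flipped bit in a music file would most likely go unnoticed by ear, but it changes the digest, so the same event is silent without the checksum and detected with it.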
Your USB experience is a perfect example of an "Undetected Data Corruption" in a PC environment, not even detectable by most users without sophisticated tools. It may be original research, depending upon what is disclosed by AMD and elsewhere.
The photo in the article is likely a good example of "Detected Data Corruption" in that, as I interpret the photo's comment, the user found an unreadable file ("Data Loss", the worst form of data corruption) and then attempted a recovery, which produced a readable file that upon examination was clearly corrupted. I say "likely" because I am only guessing as to how it was created. Maybe we can use it as an example.
Some blue screens of death might be an example of a "Detected Data Corruption" of the system caused by cosmic rays, and maybe we can find a reliable source. It's undetected when the ray hits, but gets detected when that part of RAM gets executed (or maybe even recovered).
An error that is detected and corrected by the system is not "Data Corruption" other than in a transient sense, since it is corrected back to its original state. As near as I can tell, the NetApp statistic applies to errors that were detected and corrected and is therefore inappropriate to this article. BTW, I looked for but did not find the 1 in 90 (1.1%) statistic, nor could I find the 67 TB calculation in the cited articles, but that really is not the point, since there was no reported permanent data corruption in the NetApp studies, just the potential for it during reconstruction. So I am again removing the citation; please let me know why you added it back:
While I think the article should focus on "Undetected Data Corruption" at the system level, I can see covering both detected and undetected, but I really see no reason to get into corrected errors, since the corruption is transient and typically invisible to the user. Are we going to cover all the recovery in networking? I hope not. Tom94022 (talk) 20:04, 2 January 2014 (UTC)
[DS] I'd disagree that the sample picture represents an example of detected data corruption, in case it was detected by a human. If the corruption was detected by some part of the computer system (HDD firmware, filesystem etc.), that would've been a detected corruption; this way it's a silent corruption, as nobody except the human end user was aware of it.
[TOM] This all depends upon the definition of "Data Corruption", which is where I started this Talk; the system does not detect the jpg file as bad. Tom94022 (talk) 01:35, 3 January 2014 (UTC)
[DS] Exactly. But, if the picture isn't detected as bad by a computer system, isn't that a silent data corruption? — Dsimic (talk) 02:31, 3 January 2014 (UTC)
[DS] Regarding the way how that garbled picture was created, it was probably the result of a semi-successful file undelete attempt. I've seen such pictures myself, and they're garbled because after deletion some part of them was overwritten by other data on a filesystem; the picture size is contained within the picture format header, resulting in a picture retaining its original size, but with a garbled portion. Also, I've seen a lot of those back at the time when modem communications (or slow uplinks in general) were common.
[TOM] Agreed, that is a possibility, but couldn't it be a file system corruption too?
[DS] Sure thing, it could be a filesystem (data and/or metadata) corruption. Though, then we should know whether the filesystem complained about its own errors or not, in order to classify that as silent or detected corruption. :)
[DS] "Blue screens of death" would be in fact signs of a silent data corruption, as the corruption was unknown to Windows until the effects reached the point of crashing the operating system. The same applies to Linux and its kernel oops. On the other hand, an NMI produced by a data corruption detected by the ECC memory, would be a sign of detected data corruption, as in that case the exact location, type and effect of the data corruption is known.
In other words, a corruption detected by a part of the computer system still is a data corruption, but it's not a silent data corruption. That's the important distinction to be made. Also, experiencing only the accumulated effects of silent corruptions, is still a silent corruption. Hope you agree.
[TOM] Again we are back to what is the definition of the types of Data Corruption. Why is a corrupted picture detected by the user a detected data corruption, while a blue screen of death detected by a user is an undetected data corruption?
[DS] Actually no, they're both silent corruptions, and that's what I previously wrote above (quoted): "I'd disagree that the sample picture represents an example of detected data corruption."
[DS] Regarding the NetApp's study, that's something quite important, and it should be included – please allow me to explain. They've detected a significant number of silent corruptions, and some of them were detected immediately by the RAID controller, but some of them were detected only upon rebuilding those RAID sets. If you think twice, it's even the fact that there were any silent corruptions, what's actually very important. Why? Because such silent corruptions are happening in one part of the data path which is common everywhere, even outside the NetApp's "enterprise" world – within the hard disk drives, common to the majority of consumer-grade PCs. The fact NetApp nailed those while rebuilding their zig-zag-cross-flow-whatever-parity-enterprise-RAID-123 sets :) changes nothing; a Joe Average could've had Btrfs or ZFS on his consumer-grade PC with two HDDs running in filesystem's internal RAID configuration, and experience nothing as Btrfs or ZFS would also transparently do the same as NetApp's "zig-zag" RAID controller did.
Hope you're getting my point... In a few words, there are also freely available "Joe Average" solutions that – if stretched enough – could even lead us to deleting this whole article as being completely unnecessary. :)
[TOM] I really disagree with your point as to the NetApp study. As best I can tell, all of the checksum errors were detected by NetApp and corrected. There is no evidence as to where in the NetApp subsystem these errors were introduced (it could be in their hardware or firmware), nor is there any evidence that the error rate exceeds the specified hard error rate or drive failure rate. All we know is that this is a phenomenon of the NetApp subsystems studied and therefore has no place in the HDD article. Whether it belongs in this article or not depends upon the article's definition of "Data Corruption". Tom94022 (talk) 01:35, 3 January 2014 (UTC)
[DS] Hm, but NetApp's appliances are still using ordinary hard drives, if I'm not mistaken? I've read somewhere that NetApp customizes firmwares of the HDDs used in their appliances, but those are still pretty much ordinary HDDs. How can we be sure that even CERN's study is relevant; who knows what are they using? Having that in mind, results of such studies are connected with the Hard disk drive article. I'd be happy to know what would be a good example for that article, as relying solely on the manufacturers' data for HDD error rates is close to be against the WP:NPOV?
Once again, if we don't want to list transparently corrected errors as any kind of data corruption, then this whole article is pointless, as Btrfs and ZFS are already made specifically to deal with such issues.
[DS] Regarding the restored content, I hope you'd agree with the explanations and reasoning from above; it's about putting things into context, and those studies are providing real-life numbers for silent data corruptions in HDDs. I've restored those again, please check it out: edit #1, and edit #2. You're totally right about "one out of 90" – somebody pulled those numbers out of thin air (or from the publicly inaccessible ACM paper, or by doing some math on the NetApp's published numbers). However, those numbers were quite misleading, and I've rewritten those sentences from scratch, using only the data available from the publicly accessible reference. Hope you'll find that good enough.
[TOM] I cannot see the relevance of the NetApp numbers in the HDD article, and whether they make it into this article depends upon the scope of the article. On that basis and the intent of WP:BRDC I reverted the HDD article, but as a gesture of good will left this article alone. Tom94022 (talk) 01:35, 3 January 2014 (UTC)
[DS] Please, don't get me wrong, I'm not attached to these articles in any way, you can even blank them if you wish. :) Just as stated in my response above, I'd like to know what can we take as some kind of a reliable study to verify those HDD error rates, as specified by their manufacturers, so we maintain WP:NPOV?
[DS] Of course, I'm more than happy to discuss this further. — Dsimic (talk) 00:41, 3 January 2014 (UTC)
[TOM] So far you have not discussed what should be in this article, even though I have asked for the same several times. There are a number of deficiencies that need to be straightened out before we start adding material of dubious value, e.g.:
  1. Right now the article implies disk drives are the major source of data corruption but if you read the CERN article you will have to conclude that "Silent Data Corruption" as they defined and measured it has many sources, one of which might be in their storage subsystems where the corruption may have originated in a disk drive. The scope of the article needs to cover the scope of the problem.
  2. The article confuses "Undetected Data Corruption" (also called silent), "Detected Unrecoverable Data Corruption" and "Recovered Data". Personally I think the latter is not relevant in this article, but the first thing we should do is define the terms (from a reliable source) and the scope of the article, and then we can see what makes sense. You do recognize that the CERN "Silent Corruptions" are not the same things as the NetApp "Silent Data Corruptions"?
  3. Most of the references are to high-end system experience, which has little relevance to "Joe Average", who by definition is running Windows or maybe Mac OS but certainly not ZFS or Btrfs, and not on RAID hardware. So their data corruption knowledge needs are quite a bit different from those of the big data folks in the references. Who is the article for?
I hope you will stop reverting and start discussing. What do we want this article to cover? Tom94022 (talk) 01:35, 3 January 2014 (UTC)
[DS] I do agree on your three points above, and please let's start – if you agree – with defining the difference between "NetApp corruption" and "CERN corruption". From my point of view, they're pretty much the same, it's just that CERN's study sounds like being linked with cosmic rays and such stuff, because of the nature of work they're doing (my previous classification of CERN vs. NetApp was different, and it was made under that influence).
Just as I noted on my talk page, I won't be touching anything until we reach a consensus. Once again, please don't get me wrong, I'm here to make Wikipedia better, and to expand my own knowledge. Thank you for your time, guidance, and patience! — Dsimic (talk) 02:31, 3 January 2014 (UTC)

CERN vs NetApp in context of data corruption taxonomy

I believe the following taxonomy should apply to data corruption for the purposes of this article:

Data Corruption is data at the host system level that differs from the original data

A. Undetected Data Corruption is a subset such that the corruption is not detectable by the system.
B. Detected Data Corruption is a subset such that the corruption is detected but cannot be corrected.

Data errors that are corrected within the system stack are not data corruption in the sense of this article.

Detection at the host system level means that when the corrupted data are accessed by the system the error is detected which may or may not be recoverable.

CERN errors are Type A and very relevant to this article with regard to big data systems. NetApp errors were all detected and corrected at the RAID layer in the stack and therefore are neither Type A nor Type B, and are not particularly relevant to this article other than perhaps a passing mention that they showed the possibility of Type B data corruption during a rebuild. It is not clear from the NetApp data where in the RAID layer the checksum errors occur, and looking at the CERN data I would be just as suspicious of the RAID system as of the component disk drives.

The Blue Screen of Death is a Type B since the system detected it. The jpg and USB audio are Type A. The Word document might be either Type A if the application loads it, or Type B if the application detects an error.
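
As a rough illustration of how the proposed taxonomy sorts the examples discussed so far, it can be written as a two-question decision. This is only a sketch of the proposal as worded above; the names `Outcome` and `classify` and the two boolean inputs are hypothetical, chosen for the example.

```python
from enum import Enum


class Outcome(Enum):
    NOT_CORRUPTION = "corrected within the stack: not data corruption"
    TYPE_A = "Type A: undetected data corruption"
    TYPE_B = "Type B: detected data corruption"


def classify(differs_at_host, detected_by_system):
    """Classify an event according to the taxonomy proposed in this section.

    differs_at_host: the data seen at the host level differs from the original
        (False when a lower layer corrected the error before the host saw it).
    detected_by_system: some part of the system flagged the error when the
        data was accessed.
    """
    if not differs_at_host:
        return Outcome.NOT_CORRUPTION
    return Outcome.TYPE_B if detected_by_system else Outcome.TYPE_A


# The examples from the discussion, under this reading of the proposal:
garbled_jpg = classify(differs_at_host=True, detected_by_system=False)
blue_screen = classify(differs_at_host=True, detected_by_system=True)
raid_corrected = classify(differs_at_host=False, detected_by_system=True)
```

Here `garbled_jpg` comes out as Type A, `blue_screen` as Type B, and `raid_corrected` as not data corruption, matching the assignments in the proposal.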

CERN further broke its Type A errors into four categories:

Type 1: mainly bit-sized, related to RAM and not likely disk drive related
Type 2: mainly small 2^n chunks of less than 512 bytes, which could be disk related but could be any other block-sized device in the stack
Type 3: multiple large chunks of 64K, which are not likely disk drive related
Type 4: not understood, but small.

Most of the corruptions were Type A.3, with Type A.1 and A.2 about equal. This strongly suggests that any article about data corruption in the big data environment needs to cover far more than the disk drives. For example, everyone knows that if you use TCP to transfer data across the Internet, virtually all corrupted TCP segments are detected and retransmitted. What most do not realize is that the CRCs and checksums used to detect errors in Ethernet frames, IP packets and TCP segments have a small but non-zero probability of missing a corruption, so that such corruption can cause a Type A or Type B Data Corruption. In "Performance of Checksums and CRCs over Real Data", Stone and Partridge, 1995, estimated that between 1 in 16 million and 1 in 10 billion TCP segments will have corrupt data and a correct TCP checksum. The possibility of corruption at the transmission layer needs to be covered. The fact that most packet errors are detected and corrected by retransmitting is not particularly relevant, just as the errors corrected by NetApp are not particularly relevant.
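
To make concrete how corrupt data can carry a correct checksum, here is a small Python sketch of the 16-bit ones' complement Internet checksum (the algorithm TCP uses, applied here to a bare payload without the TCP header or pseudo-header, which is a simplification). Because the checksum is a sum of 16-bit words, reordering the words does not change it. The payload and the helper name `internet_checksum` are made up for the example.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum over big-endian 16-bit words."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    # Fold any carries back into the low 16 bits (end-around carry).
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


original = b"PAY 100 TO ALICE"

# Corruption that swaps two aligned 16-bit words: the data changes,
# but the sum of the words, and therefore the checksum, does not.
swapped = original[2:4] + original[0:2] + original[4:]

# A single flipped bit, by contrast, is always caught by this checksum.
flipped = bytes([original[0] ^ 0x01]) + original[1:]

same_checksum = internet_checksum(swapped) == internet_checksum(original)
bit_flip_caught = internet_checksum(flipped) != internet_checksum(original)
```

Both `same_checksum` and `bit_flip_caught` evaluate to true: the checksum catches the common single-bit error but passes the reordered payload, which is the kind of gap through which a Type A corruption can slip.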

In the PC environment it's much simpler, since the stack is much simpler. A disk drive hard error, typically 1 in 10^-14 bits read, is a Type B error; fortunately for most Joe Averages, that's a lot of data.

So that's my proposal: only Type A and B Data Corruption from the viewpoint of the host, with both big data and PC perspectives. Tom94022 (talk) 06:49, 3 January 2014 (UTC)




Excellent proposal, thank you very much! Everything is based on the data from publicly available (and notable) sources, making a clear inclusion of all elements taking part within such data corruption scenarios; of course, it's not only up to HDDs and I've experienced all that myself. The only thing I'd add is that the article probably should also briefly describe the "Type C" corruptions (detectable and correctable), providing links to other articles (Forward error correction, for example).
Regarding the "CERN vs. NetApp" distinction, you're right there but it's still fuzzy, as NetApp's results would've been different if they had no "zig-zag" RAID parities in place. All that leads to a possible correlation of those NetApp results (before "zig-zag" doing the rebuilds) with big-scale statistical values that could be expected from the mainstream "Joe Average" hardware. Though, making such assumptions would go beyond presenting only what's available from the references.
Just as I already wrote, having data and/or metadata checksums isn't an ironclad solution; it's far from that. There's just one small correction: it should be 1 in 10^14 and not 1 in 10^−14; a few manufacturers are stating 1 in 10^16 rates for flipped bit reads.
Now that we're pretty much on the same page, shall we plan a new layout for this article? :) — Dsimic (talk) 21:03, 3 January 2014 (UTC)
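
For a sense of scale, the corrected figures can be converted from bits to bytes. This is plain unit arithmetic using decimal terabytes and petabytes, and it assumes the rate means one unrecoverable error per that many bits read on average:

```latex
\[
10^{14}\ \text{bits} = \frac{10^{14}}{8}\ \text{bytes} = 1.25 \times 10^{13}\ \text{bytes} = 12.5\ \text{TB}
\]
\[
10^{16}\ \text{bits} = \frac{10^{16}}{8}\ \text{bytes} = 1.25 \times 10^{15}\ \text{bytes} = 1.25\ \text{PB}
\]
```

So a 1 in 10^14 rate corresponds to roughly one hard error per 12.5 TB read, and 1 in 10^16 to one per 1.25 PB read.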

OK, see next section, but to put an end to the topic of CERN vs NetApp: my objection to NetApp, particularly for the HDD article, is that it is specific to the NetApp environment and not necessarily applicable to the rest of the world. The NetApp article ascribes much if not all of the "silent" corruption it corrects to firmware, which exists in both the drives and the NetApp filers (I suppose an undetected SEU turned by firmware into a block error is included, not sure). NetApp makes no attempt to disclose the sources but instead detects and corrects, which is great for their users but cannot be used to ascribe failure rates to the same drives in an EMC system, or a PC for that matter. It might deserve a mention herein as an example of additional error detection that prevents silent data corruption, but I suspect there might be a better example in ZFS or some other file system that implements end-to-end detection. Still looking. Tom94022 (talk) 19:28, 10 January 2014 (UTC)

Something like that, as NetApp's results are basically "we have no undetected corruption"... As a company relying on providing its customers with "ironclad" storage solutions, I'd say NetApp probably wouldn't disclose undetected corruption even if they actually were unable to detect a certain amount of it, and/or perform a successful later recovery. Regarding filesystems doing end-to-end protection, I agree they're a much better example, and here are a few links that might serve as some kind of a starting point, together with some associated (and documented) HDD silent corruptions:
However, I agree that NetApp's study should be mentioned, as an example of proprietary countermeasures. — Dsimic (talk) 23:54, 10 January 2014 (UTC)

Content/Layout of article

I've been looking for a reliable source overview of this topic as an outline of the article without much success. CERN is a pretty good overview of the size of the Undetected Data Corruption problem but it really doesn't go into their sources. Most articles point at only one source, typically the HDD. Silent Data Corruption: Causes and Mitigation Techniques does a good job of listing many sources [all?]:

  • What events can lead to SDC
    - Silicon faults
    - Hardware design faults
    - Firmware/software bugs
  • Transient and intermittent [silicon] faults are the most common form of SDC
  • Transient faults include SEU, noise and ESD

Unfortunately it is focused on semiconductors in computers and fails to point out that the same errors can occur at any place in the storage hierarchy, and that such errors in a file system structure can lead to larger errors due to corruption of pointers to blocks (reading or writing the wrong block). The latter might be original research unless we can find something. Still looking :-) Any ideas? Tom94022 (talk) 19:13, 10 January 2014 (UTC)

Here's another interesting article, in addition to those listed in the section above, but still nothing providing the required broader coverage: Bitrot and atomic COWs: Inside “next-gen” filesystems. — Dsimic (talk) 03:21, 16 January 2014 (UTC)
Thanks for the article; unfortunately the author seems to think "bit rot" is only an HDD phenomenon. Still looking. Tom94022 (talk) 21:56, 16 January 2014 (UTC)
Here's quite a good talk (although biased toward "Oracle is the best", but not too much) addressing the various layers involved in causing data corruption, ways of ensuring end-to-end data integrity etc.: Eliminating Silent Data Corruption in Linux (by Martin Petersen and Sergio Leunissen from Oracle, December 2010). It could be perceived as being Linux-centric, but the presented material is quite general. As a huge plus, slides can be linked externally (with no registration required) so they can be used here as references, like this one (slide1_028.png); to obtain URLs of slides, just use Firebug's Net panel to monitor the slides being fetched during the talk, or poke the numbers manually into the above URL, starting with slide1_001.png and ending in slide1_043.png. Looking usable? — Dsimic (talk) 19:19, 17 January 2014 (UTC)
Great find, here is a pdf of Eliminating Silent Data Corruption with Oracle Linux. Need to go through the talk again and then maybe on to the article. Thanks. Tom94022 (talk) 18:07, 18 January 2014 (UTC)
"Are Disks the Dominant Contributor for Storage Failures? A Comprehensive Study of Storage Subsystem Failure Characteristics" (PDF). USENIX. Retrieved 2014-01-18., which you linked on 1 Jan, actually has a lot to contribute. I think at this point we have sufficient reliable sources to craft a good article. Agree? Tom94022 (talk) 03:30, 19 January 2014 (UTC)
Totally agreed; these three main references (Cristian-Constantinescu-LANL-talk-2008.pdf, data-integrity-webcast.pdf and jiang.pdf), with some help from other narrower ones, provide very good backing for the article. Would you like to propose the new article layout, please? — Dsimic (talk) 03:49, 19 January 2014 (UTC)

Merge with Data degradation

Merge Data degradation with Data corruption. I don't care which moves where. This page just had a bigger references list. Both pages concern loss of data due to errors in computer systems, with no attempt made to distinguish degradation of data from corruption of data, which appear synonymous. Waerloeg (talk) 03:54, 6 October 2016 (UTC)

Data degradation is associated with image/photo data loss/change/degradation (i.e. missing or moved 'bits' resulting in missing image 'pixels' or misplaced parts of an image), usually resulting from intended removal or a poorly stored digital image file, and thus the loss of file data/'bits' of the digital image information. However, image degradation can also be caused intentionally, not by corruption of data but by intentional re-organising of the image data, such as in image compression, which results in a degraded image but one of order and with acceptable and expected image quality degradation. JPEG 'artefacts' (jagged image edges) are a visual effect of such a 'degraded image' but are accepted and expected.

Data Corruption is associated with the corruption, interference and/or loss of digital information from digital files such as documents, not image/photo data, and is usually caused during data file transfer from one source to another (or is intentional, such as sabotage). — Preceding unsigned comment added by Greg Stobie (talk • contribs) 01:14, 14 March 2018 (UTC)

Oppose: They really are two different things, one a consequence of normal activity and the other a consequence of failure of some sort; they should remain in two articles. Tom94022 (talk) 06:42, 14 March 2018 (UTC)
Closed (given the uncontested objections to the merge). Klbrain (talk) 22:19, 15 May 2018 (UTC)

Mention usage of corruption for entertainment.

The article should mention the use of data corruption for the purpose of entertainment. beepborp (talk) 02:20, 20 September 2021 (UTC)

I could get behind that. The only problem is how it could be written, and how to source it without it being 95% Vinesauce references.
I'll write a short draft, see what bearable sources I can find and get back to this in a "short" while. cogsan (talk) 16:18, 15 February 2023 (UTC)
...never mind, all I could get was
"People can also deliberately corrupt data for entertainment, such as by [insert methods here]. Most notably, that Bipyort guy from Sauce of the Vine has a recurring series where he..."
Which, to get straight to the point, would not work at all. cogsan (talk) 20:38, 15 February 2023 (UTC)