Talk:ECC memory

Latest comment: 8 months ago by RastaKins in topic Cray quote

Clean-up of "Problem background" section needed


I tried to reword the third paragraph of the Problem background section to actually make sense, but I'm not sure how to best fix this, and the paragraph still has some major problems. As has been pointed out below, the math makes no sense, and needs to be fixed along with some other stuff. Can someone more experienced/knowledgeable than me look over this paragraph and try to fix it? Hitechcomputergeek (talk) 19:51, 19 May 2017 (UTC)Reply

Performance decrease


Currently the article says: "a performance decrease of around 0.5–2 percent...", and uses a reference that says: "...decrease in performance of approximately 2-3%.". What is the actual normal performance decrease? Searching for other sources I have so far only found "around 2%" or "2-3%". Should the article be edited to say 2-3%, or maybe around 2%?

The article seems to be outdated. Modern ECC implementations fetch the unverified content and perform the ECC check simultaneously, so the performance overhead should only apply when an error is actually corrected. — Preceding unsigned comment added by 143.215.156.46 (talk) 15:59, 24 September 2012 (UTC)Reply

Desktop hardware with ECC support


I don't think what follows is encyclopaedic or timeless enough to go in the article; if it's thought it should go in, please add whatever degree of detail is thought appropriate. I think it would be useful for people to know how to implement a desktop system with ECC support, which is quite difficult due to manufacturers' marketing considerations and lack of available information.

In 2012 there is very little desktop-class equipment with ECC support; although the RAM is available, most processors and most motherboards do not support it, perhaps largely to make people who require reliable systems buy more expensive components. Most of the less-expensive Intel processors, and the chipsets which support them, do not support ECC; server-class equipment does. This was not always the case; an Intel Pentium 4 in a Gigabyte 8KNXP motherboard (c. 2004) supported ECC RAM. As of 2012, most AMD Athlon and later processors support ECC. Not all chipsets and motherboards do, though. However, most (not all) Asus motherboards for AMD processors support ECC. A very few others do too (I think one Gigabyte board, and a Biostar). Information is very hard to get; memory manufacturers' websites state incorrectly that only non-ECC RAM is supported; the motherboard manufacturer (not the seller) is the best resource.

When buying ECC RAM for desktop motherboards it's very important not to get registered RAM; most of the reported compatibility issues with ECC RAM are due to using registered server-type ECC RAM, often available on eBay.

ECC is supported by the hardware, not the operating system. This means that errors are automatically corrected, and information is made available to the operating system that this has happened. However, Windows non-server operating systems (e.g. Windows 7, as against Windows Server 2008) refuse to use this information; there is no way to tell what has happened. Most Linuxes do not have this limitation.

For definiteness, a specific example which I am using very successfully with ECC support is an Asus M5A78L/USB3 motherboard with AMD Athlon II X2 245e processor (most AMDs support ECC) and four Crucial 2GB DDR3 ECC unbuffered RAM modules (although the Crucial website says they don't support this motherboard). ECC support must be enabled in the motherboard BIOS settings; it defaults to off.

So Intel, the motherboard manufacturers, and Microsoft try to keep ECC off the desktop, and you need a lot of information to get round this. Pol098 (talk) 14:21, 20 August 2012 (UTC)Reply

The RAM controller has moved into the CPU for both AMD and Intel platforms, so there's nothing the chipset would need to support for ECC. If there's CPU support, it's only a matter of the board firmware (BIOS/UEFI). Windows 7 is supposed to log (event log) ECC errors (WHEA), so there seems to be support. Zac67 (talk) 16:38, 20 August 2012 (UTC)Reply
"Windows 7 is supposed to log (event log) ECC errors (WHEA)". I hope Zac67 is right, but think that this is Microsoft disinformation. I'd be very interested in any further details. As an example of the confusion prevailing, someone asked: "how does Windows 7 handle ECC error logging and reporting? Are corrected ECC errors noted anywhere? Is this a feature of only some flavors of Windows 7?"[1][2]. One answer, from Microsoft support, was "When ECC errors are reported WHEA monitors and corrects these errors automatically. If the error cannot be resolved then it will be logged in the Event Viewer", a weasel answer which means "corrected errors are not reported". I'd like to be wrong. Pol098 (talk) 19:53, 20 August 2012 (UTC)Reply
Take a look at http://msdn.microsoft.com/en-us/library/windows/hardware/ff560453%28v=vs.85%29.aspx; starting with Win7 WHEA logs ECC failures. You can take a look at the history by running (admin mode) bcdedit /enum {badmemory}. Zac67 (talk) 22:15, 1 March 2013 (UTC)Reply

Calculation


Sorry, my math fails me...

"with 25,000 to 70,000 errors per billion device hours per megabit (about 3–10 × 10^−9 error/bit·h)(i.e. 4.58752 single bit errors in 8 Gigabytes of RAM per hour using the top-end error rate)"

25,000 to 70,000 errors per billion hours per megabit - assuming "billion" = 10^9 and "megabit" = 1024^2 bit - equals 2.4 to 6.7 × 10^-11 errors/h×bit. For 8 GiB, that does scale up to 4.6 errors/h. What am I doing wrong in the middle? Zac67 (talk) 19:12, 11 February 2013 (UTC)Reply

Yes, I agree -- Something isn't adding up. I get the same numbers you got. And why do we need 5 decimals of precision? --DavidCary (talk) 16:29, 10 March 2013 (UTC)Reply
Yes, complete bull. You can't get more significant digits than you start with, 1-2 is fine. I'll correct the middle figures as well. Additionally, I'll look out for more sources: I've been running my computer with ECC/scrubbing for a couple of weeks now (8 GB DDR3-1333) and have yet to see a single bit error. 1.6 bits/h on the low end seems laughable. Zac67 (talk) 20:53, 10 March 2013 (UTC)Reply
User:Zac67, when you say "running" your computer, was the memory actually under a high load? —DIV (49.195.123.19 (talk) 01:43, 24 April 2020 (UTC))Reply
No, actually only under light (memory) load which should slightly increase the chance for bit errors to manifest and then get detected by scrubbing. --Zac67 (talk) 05:26, 24 April 2020 (UTC)Reply
I think the article should clarify what "megabit" means. By SI nomenclature it means only one million bits. If it is intended to mean something different here, then it should either be renamed to mebibits, or else explained parenthetically. Likewise, if gibibytes are intended, then they should be indicated in the article itself.
As for the calculation:
25,000 to 70,000 errors per billion hours per megabit equals...
  • 2.5 to 7.0 × 10^-11 errors per hour per bit — assuming "billion" = 10^9 and "megabit" = 10^6 bits
    = 2.5 to 7.0 × 10^-2 bit errors per hour per gigabit
    = 0.20 to 0.56 bit errors per hour per gigabyte
    = 1.6 to 4.5 bit errors per hour per 8 gigabytes
    = 1 bit error per gigabyte per 1.8 to 5.0 hours.
  • 2.4 to 6.7 × 10^-11 errors per hour per bit — assuming "billion" = 10^9 and "megabit" = 1024^2 bits
    = 2.6 to 7.2 × 10^-2 bit errors per hour per gibibit
    = 0.20 to 0.57 bit errors per hour per gibibyte
    = 1.6 to 4.6 bit errors per hour per 8 gibibytes
    = 1 bit error per gibibyte per 1.7 to 4.9 hours.
As for the presentation, the earlier examples quote error rates in bits per "gigabyte" per variable-period-of-time, and I think we should use the same format throughout the article for ease of comparison.
—DIV (49.195.123.19 (talk) 01:43, 24 April 2020 (UTC))Reply
We should use mebibit (1024^2) or 'gibibit' (1024^3) here. While it seems awkward it removes ambiguity. --Zac67 (talk) 05:26, 24 April 2020 (UTC)Reply
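For anyone checking these conversions later, a small Python sketch (using the figures quoted from the article) reproduces both readings side by side:

```python
# Convert "errors per billion device hours per megabit" into errors per
# hour for 8 GB/GiB of RAM, under both readings of "megabit":
# SI (10^6 bits) paired with SI gigabytes, and binary (1024^2 bits,
# i.e. mebibits) paired with gibibytes.
def errors_per_hour(rate_per_1e9_h_per_mbit, total_bits, mbit_size):
    per_bit_per_hour = rate_per_1e9_h_per_mbit / (1e9 * mbit_size)
    return per_bit_per_hour * total_bits

cases = {
    "SI (10^6-bit megabit, 8 GB)":        (10**6,   8 * 10**9 * 8),
    "binary (1024^2-bit mebibit, 8 GiB)": (1024**2, 8 * 1024**3 * 8),
}
for label, (mbit, total_bits) in cases.items():
    low  = errors_per_hour(25_000, total_bits, mbit)
    high = errors_per_hour(70_000, total_bits, mbit)
    print(f"{label}: {low:.2f} to {high:.2f} errors per hour")
```

Both runs land at roughly 1.6 to 4.6 errors per hour, matching the figures worked out above; the article's 4.58752 figure corresponds to the mebibit/gibibyte reading.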

The spacecraft Cassini–Huygens


This bit under Problem background seems like it doesn't belong in this article. I understand the point it is trying to make, but overall it seems out of place. Twotoned (talk) 01:51, 14 November 2014 (UTC)Reply

Hello! To me, having an example like that shouldn't hurt; I've just touched it up a bit so the paragraph as a whole should look a bit less out of place. — Dsimic (talk | contribs) 04:35, 14 November 2014 (UTC)Reply
I decided to give the source a read (interesting stuff!) to clean that up a bit (I find the phrasing confusing) and found some inaccuracies. The spacecraft does use EDAC, and the maximum reported count of single-bit errors per day wasn't 3072 but 4 times the normal rate (normal being 280). It doesn't give an exact number.
Anywho, here's my version. Twotoned (talk) 09:49, 14 November 2014 (UTC)Reply
Looking good! :) — Dsimic (talk | contribs) 10:02, 14 November 2014 (UTC)Reply

Techniques that deal with memory errors


The Wikipedia articles NX bit and AMD 10h briefly mention "memory mirroring".

Bob Plankers[3] mentions "memory mirroring" and "Dell Fault Resilient Mode". Kevin Noreen[4] mentions "Memory Page Retire" and "Dell Fault Resilient Memory (FRM)". UP2V[5] tries to explain the difference between "Reliable Memory" and "Memory Reliability" features of VMware ESX.

Is there a WP:RELIABLE source for these techniques? --DavidCary (talk) 03:17, 24 January 2015 (UTC)Reply

Hello! NX bit and AMD 10h articles actually don't refer to any memory redundancy when mentioning "mirrored memory": NX bit refers to the separation between execution and data accesses performed by PaX (see this file), while AMD 10h refers to how DIMM addressing can be alternatively mapped (see page 340 in this PDF).
Regarding a WP:RS for the Dell Fault Resilient Memory (FRM) itself, this paper would be one. However, that paper is pretty much a PR overview with no actual description of the underlying technology, the way it interfaces with the operating system, etc.; it might be related to Intel's lockstep memory, see this page for more details. — Dsimic (talk | contribs) 09:23, 25 January 2015 (UTC)Reply
Thank you for clarifying the difference between the kind of "memory mirroring" alluded to in those articles, and the kind of "memory mirroring" described in the chipkill article. I agree that a reference with a technical description of the underlying technology would be a good addition. --DavidCary (talk) 19:36, 27 January 2015 (UTC)Reply
Thank you for bringing it here in the first place. :) — Dsimic (talk | contribs) 23:02, 28 January 2015 (UTC)Reply
The chipkill article once mentioned "memory mirroring" with a reference to Qingsong Li and Utpal Patel.[6]
Today I see that the chipkill article no longer mentions "mirroring".
What other article would be more appropriate for mentioning that technique? --DavidCary (talk) 18:58, 2 February 2015 (UTC)Reply
As described in my edit summary, the trouble with that addition to the Chipkill article was that the provided reference pretty much didn't describe chipkill variants at all, but details of a non-chipkill advanced memory protection implementation. As such, it simply can't be linked with the "chipkill" term, which is IBM's trademark for advanced ECC technology; instead, it might fit well into Lockstep memory, which is most probably the underlying technology, but alas, the reference doesn't even mention the word "lockstep", which leaves the whole thing in some kind of limbo. With all that in mind, our Dell PowerEdge article would be the only good destination – if you agree. — Dsimic (talk | contribs) 19:55, 2 February 2015 (UTC)Reply

Criticism of the Google 2009 Study


There is a more recent paper that has some fundamental criticism of the methodology (counting errors vs. counting faults) of the Google/Weber 2009 paper linked to from the article.

See there, section 7 "Fallacies": https://www.cs.virginia.edu/%7Egurumurthi/papers/asplos15.pdf

It seems like they point to about a factor of 1000 lower fault rate. — Preceding unsigned comment added by 85.216.9.254 (talk) 23:41, 11 March 2017 (UTC)Reply

External links modified

Hello fellow Wikipedians,

I have just modified 6 external links on ECC memory. Please take a moment to review my edit. If you have any questions, or need the bot to ignore the links, or the page altogether, please visit this simple FaQ for additional information. I made the following changes:

When you have finished reviewing my changes, you may follow the instructions on the template below to fix any issues with the URLs.

This message was posted before February 2018. After February 2018, "External links modified" talk page sections are no longer generated or monitored by InternetArchiveBot. No special action is required regarding these talk page notices, other than regular verification using the archive tool instructions below. Editors have permission to delete these "External links modified" talk page sections if they want to de-clutter talk pages, but see the RfC before doing mass systematic removals. This message is updated dynamically through the template {{source check}} (last update: 5 June 2024).

  • If you have discovered URLs which were erroneously considered dead by the bot, you can report them with this tool.
  • If you found an error with any archives or the URLs themselves, you can fix them with this tool.

Cheers.—InternetArchiveBot (Report bug) 14:55, 15 September 2017 (UTC)Reply

slightly confusing sentence


This sentence is a little confusing:

" ... with between 25,000 (roughly 2.5 × 10^(−11) error/bit·h) and 70,000 (roughly 7 × 10^(−11) error/bit·h, or 5 bit errors per 8 gigabytes of RAM per hour) errors per billion device hours per megabit. More than 8% of DIMM memory modules were affected by errors per year. "

And I get a slightly different result. 8 gigabytes is 65,536 megabits; take the higher figure of 70,000; 65,536 × 70,000 / 10^9 = 4.59 errors in 8 GB per hour. I guess that is rounded up to 5? Bubba73 You talkin' to me? 05:10, 11 March 2019 (UTC)Reply

I think ECC memory deserves a more in depth discussion.


Although I have 45 years experience in computer and networking hardware, firmware and software, I am not an expert in ECC memory.

When I was in my twenties, parity detecting a one-bit error was the norm. A computer supplier I worked for estimated that there would be one error every 2 years on one of their systems. A customer bought 200 of those computers and had an error every week, which upset management because there was a service call every week on average, in out-of-the-way locations.

From my long-ago-forgotten education, state-of-the-art ECC was about fixing one-bit errors and detecting two-bit errors, but really nothing more after that. So data 1 but got 1 .. okay

  data 1 but got 2 .. no parity error .. ECC failure, 2-bit uncorrectable error
  data 1 but got 3 .. parity error .. ECC fixed to 1
  data 1 but got 14 .. no parity error .. but ECC may or may not detect, depending on the hash coding.
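The fix-one/detect-two behaviour sketched above is exactly what a SECDED code does. As an illustration only — a toy Hamming(7,4) code with an overall parity bit, not the word layout of any real memory controller:

```python
# Toy SECDED sketch: Hamming(7,4) plus an overall parity bit.
# Corrects any single-bit error, flags any double-bit error.

def encode(d):                          # d: list of 4 data bits
    p1 = d[0] ^ d[1] ^ d[3]
    p2 = d[0] ^ d[2] ^ d[3]
    p3 = d[1] ^ d[2] ^ d[3]
    code = [p1, p2, d[0], p3, d[1], d[2], d[3]]   # positions 1..7
    overall = 0
    for b in code:
        overall ^= b                    # overall parity over the 7 bits
    return code + [overall]

def decode(c):                          # c: 8-bit codeword
    syndrome = 0
    for pos, bit in enumerate(c[:7], start=1):
        if bit:
            syndrome ^= pos             # XOR of set positions = error position
    overall_ok = (sum(c) % 2 == 0)
    if syndrome == 0 and overall_ok:
        status = "ok"
    elif not overall_ok:                # odd flip count: single-bit error
        status = "corrected"
        c = c[:]
        if syndrome:
            c[syndrome - 1] ^= 1        # flip the erroneous bit back
        else:
            c[7] ^= 1                   # the overall parity bit itself flipped
    else:                               # even parity but nonzero syndrome
        return None, "uncorrectable 2-bit error"
    return [c[2], c[4], c[5], c[6]], status
```

Flipping one bit of a codeword decodes with status "corrected"; flipping two returns the uncorrectable indication; three or more flips can be silently miscorrected, which is the residual risk.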

My understanding through the decades is that people were looking for the most common ways a set of bits in a memory array might get changed, and developing hashing codes to detect them. I was wondering what the state of the art is today. Nothing there.

But the article has a critical flaw. ECC does NOT protect from memory errors; it is trying to lessen the ODDS of a memory error. Statements like:

ECC protects against undetected memory data corruption, and is used in computers where such corruption is unacceptable, for example in some scientific and financial computing applications, or in file servers.

are WRONG.

ECC lessens the odds of memory corruption... and EVERYONE makes a trade-off of their tolerance for corruption. For financial institutions, it might be worth an extra 4% on the cost of a server to avoid the DOWNTIME associated with a parity error (much more common), versus a less common detectable/correctable 2-bit error, versus a far less common undetectable 3+ bit error. But in the last case there IS an error... it is just that no one knows. And if there are 10^9 servers and 10^10 phones, laptops, etc., each with 16×10^9 bytes of memory, we actually start to talk about real amounts of memory...
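That odds argument can be put in rough numbers. A minimal sketch, assuming independent bit flips with an arbitrary illustrative per-bit probability p (not a measured rate), gives the chance that a single 72-bit SECDED word (64 data + 8 check bits) sees 0, 1, 2, or 3+ flips:

```python
from math import comb

def flip_probs(p, n=72):
    # Binomial model: probability of exactly k independent bit flips in
    # an n-bit ECC word. SECDED corrects k=1, detects k=2, and can be
    # silently wrong for k >= 3.
    exactly = lambda k: comb(n, k) * p**k * (1 - p) ** (n - k)
    p0, p1, p2 = exactly(0), exactly(1), exactly(2)
    return p0, p1, p2, 1.0 - (p0 + p1 + p2)   # remainder: 3 or more flips

p0, p1, p2, p3plus = flip_probs(1e-6)   # p chosen purely for illustration
# With a small p, each additional simultaneous flip is orders of
# magnitude less likely, so the undetectable 3+ case is vanishingly
# rare — but, as argued above, never zero.
```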


This should have been more tuned to the statistical implications of the trade-offs, the algorithms, the error-inducing phenomena (subatomic particles, voltage surges, EMF, etc.), and the associated costs.

But what do I know.. just a stupid old retired high tech guy.


Sandy.Nicolson (talk) 23:42, 27 October 2019 (UTC)Reply

ECC memory handling in Windows


This topic doesn't seem to be really addressed. Does it differ by edition: Home, Pro, Workstation, Server, ... WHEA vs PSHED. System Error Log entry IDs. — Preceding unsigned comment added by 24.241.152.175 (talk) 18:03, 24 April 2020 (UTC)Reply

Cosmic Radiation


The cosmic radiation we experience on Earth mainly consists of muons, not neutrons as described in the section "Description". 79.227.200.168 (talk) 12:23, 16 August 2020 (UTC) Diego SemmlerReply

Cray quote

edit

The cray quote "parity is for farmers" links punningly to an economics-of-agriculture page. Is this intentional? 18.29.31.111 (talk) 01:11, 15 February 2024 (UTC)Reply

Yes, that was the joke. Cray and his audience would be quite familiar with this agricultural price theory. RastaKins (talk) 21:43, 27 February 2024 (UTC)Reply