I received some helpful followups to my original summary regarding
ecache parity errors on UltraII CPUs. Most notably:

1. The problem was actually caused by faulty SRAMs made by IBM. There
   were two vendors used for this L2 cache, so it just depends on which
   one you got (you can't tell by looking at them).

2. This problem actually wasn't kept that quiet; it made front-page
   news on EE Times and Electronic Business, IIRC.

3. Given the manufacturing capacity of the two SRAM vendors, it would
   have been impossible for Sun to do a complete recall. (Consensus
   seems to be that their response was still inadequate.)

4. According to a Sun service engineer, best practice is to replace a
   CPU after two failures in 6 months.

5. Someone noted that Sun recommends the following /etc/system settings
   to "reduce the ecache bug's effect". I have not verified this with
   Sun, and I have not tried these settings; they increase the
   scrubbing rate of the ecache:

*eCache Scrubbing
set ecache_scrub_enable = 1
set ecache_scan_rate=1000
set ecache_calls_a_sec=100
*End eCache Settings

6. It is important to keep up to date on kernel patches, as changes
   have been made to mitigate this problem and reduce false-positive
   reports in the logs.

You can view Sun's "Best Practices" document on the ecache parity
problem at:

ftp://ncmir.ucsd.edu/outgoing/foster/BP_Ecache_10-16-01.pdf

I've attached a reply from a Sun service engineer regarding the "CBI
event", which is way more than I wanted to know about this!

Thanks to:

Jed Dobson
Jay Lessert
Donaldson, Mark
Scott Howard

> My apologies, the Manager's List archives were down so I couldn't
> tell that there are many posts about this.
>
> This is an Ecache parity error on the CPU, a known problem with
> the UltraII cpu's. Can happen when the cpu is under heavy load,
> extremely intermittently, but if it happens multiple times then
> Sun will replace the cpu under contract support. Just heard from a
> Sun engineer that "best practices" is to wait for 3 occurrences.
> It's happened once; they recommended upgrading to the latest kernel
> (108528-17 for Solaris 8) and see if it presents itself again.
> Apparently rev -16 included some fixes to prevent spurious cpu
> errors.
>
> Apparently this usually hits cpu's with 8 meg cache, but sometimes
> 4 meg as well.
>
> Rant (source anonymous)
>
> It never ceases to amaze me how well SUN kept the UltraII design
> problems quiet. In effect virtually a whole years
> production of chips was broken. A shortcut in the design
> (using parity instead of ECC on the cache) meant that
> thousands of these things had to be replaced. Never
> quite made the news though and how loud did they
> shout about the first Pentium being unable to add up.
>
> Thanks to:
>
> steven.ruby
> Ryan Bishop
> Will Enestvedt
> rene_casalme
> Tim Chipman
> joe.fletcher
>
> > Can anyone help with this, it doesn't look good...
> >
> > Nov 18 17:31:44 cressida SUNW,UltraSPARC-II: [ID 672871 kern.info] NOTICE: [AFT2] errID 0x000644be.021b33e1 CBI event on CPU1
> > Nov 18 17:31:44 cressida SUNW,UltraSPARC-II: [ID 192776 kern.info] [AFT2] errID 0x000644be.021b33e1 PA=0x00000000.00565000
> > Nov 18 17:31:44 cressida E$tag 0x00000000.0e40000a E$State: Shared E$parity 0x07
> > Nov 18 17:31:44 cressida SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x00): 0x00000000.00000000
> > Nov 18 17:31:44 cressida SUNW,UltraSPARC-II: [ID 989652 kern.info] [AFT2] E$Data (0x08): 0x00000000.00080000 *Bad* PSYND=0x0004
> > Nov 18 17:31:44 cressida SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x10): 0x00000000.00000000
> >
> > Dave

------------- End Forwarded Message -------------

<< All opinions expressed are mine, not the University's >>

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
David Foster          National Center for Microscopy and Imaging Research
Programmer/Analyst    University of California, San Diego
dfoster@ucsd.edu      Department of Neuroscience, Mail 0608
(858) 534-7968        http://ncmir.ucsd.edu/
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

"The reasonable man adapts himself to the world; the unreasonable one
persists in trying to adapt the world to himself. Therefore, all
progress depends on the unreasonable." -- George Bernard Shaw

A CBI event is an ecache error on a cache line that can occur without
the system panicking.

CBI stands for Clean Bad Idle. Clean means that the cache line is
clean, or has not been modified. If it was modified, it would be a
dirty page, which would have required flushing the changes out to
memory. Idle indicates that this cache line was not in use by the cpu
at this time. Bad means that it detected an error.

This is a corrected "scrubbed" Ecache event. This should be handled
just like any Ecache event, that is, swap on the second event only.

It appears that Ecache error reporting has changed (again). Solaris 8
kernel patch 108528-13 introduces the changes detailed in bug 4385694.
E$ errors seem to be reported as "xBy events" where x is C for "clean"
or D for "dirty", and y is I for "idle" or B for "busy" (so DBI event,
CBD event and so on), reflecting the state of the cache line when the
error was detected.

So basically, a CBI event is telling us that the scrubbing algorithm
has found a bad line of ecache data and scrubbed it.

_______________________________________________
sunmanagers mailing list
sunmanagers@sunmanagers.org
http://www.sunmanagers.org/mailman/listinfo/sunmanagers
Received on Fri Nov 22 16:59:36 2002
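
As a rough illustration of the "xBy event" naming described in the
reply above, here is a minimal Python sketch that expands a code such
as CBI into words and counts distinct events per CPU from syslog
lines. The regular expression is an assumption based only on the
single log excerpt quoted in this message, and the names decode_event
and count_events are made up for this example; none of it comes from
Sun.

```python
import re
from collections import Counter

# Letter meanings as given in the reply: x is C or D, y is I or B.
LINE_STATE = {"C": "clean", "D": "dirty"}
LINE_USE = {"I": "idle", "B": "busy"}

# Assumed shape of the first line of a report, modelled on the excerpt:
#   ... [AFT2] errID 0x000644be.021b33e1 CBI event on CPU1
EVENT_RE = re.compile(
    r"\[AFT\d\]\s+errID\s+(0x[0-9a-fA-F]+\.[0-9a-fA-F]+)\s+"
    r"([CD]B[IB])\s+event\s+on\s+CPU(\d+)"
)


def decode_event(code):
    """Expand a three-letter code such as 'CBI' into words."""
    if (len(code) != 3 or code[0] not in LINE_STATE
            or code[1] != "B" or code[2] not in LINE_USE):
        raise ValueError("not an xBy event code: %r" % code)
    return "%s, bad, %s" % (LINE_STATE[code[0]], LINE_USE[code[2]])


def count_events(lines):
    """Count distinct errIDs per CPU number.

    Each report spans several syslog lines that share one errID;
    only the line naming the event and CPU matches EVENT_RE, and
    repeated errIDs for the same CPU are counted once.
    """
    seen = set()
    per_cpu = Counter()
    for line in lines:
        match = EVENT_RE.search(line)
        if match is None:
            continue
        err_id, cpu = match.group(1), int(match.group(3))
        if (cpu, err_id) not in seen:
            seen.add((cpu, err_id))
            per_cpu[cpu] += 1
    return per_cpu
```

One possible use is to feed it the lines of the system log and compare
each CPU's count against whichever replacement threshold your Sun
contact quotes. Syslog timestamps carry no year, so the sketch makes no
attempt to apply a time window such as the six-month rule mentioned
earlier.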