There have been a number of replies to my question about detecting failures in mirrored disks, redundant power supplies, and dual CPUs on an Enterprise 220R server running Solaris 2.7.

Steve Camp's answer sounded like he knew what he was talking about, and basically stated that one would need a higher class machine (e.g. E250 or Ex000/Ex500) to detect a failed power supply. I guess I should be glad that they at least moved the status LEDs to the front of the machine. He also, probably correctly, pointed out that the CPUs are not really redundant, and that the machine would probably go down if one failed.

A number of people suggested the Sun Management Center, whose basic functionality is free. I have not tried this, and it is not clear that it will detect what I want either.

A number of people also suggested the Big Brother semi-freeware product (http://www.bb4.com). I currently use the freeware product Netsaint (http://www.netsaint.org) for most of my monitoring, and did not see much if anything in the Big Brother description to tempt me to switch. I did examine the modules which are supposed to check hardware such as power supplies on Sun machines, but they use the prtdiag command, and both my experience and the scripts themselves indicate that this would not detect power supply failures on a 220R.

One or two people also recommended swatch (ftp://coast.purdue.edu/pub/tools/unix/swatch) to watch the logs or console messages for errors related to such items. However, I am having an odd problem in that none of my simulated failures (unplugging a power supply, offlining a CPU, offlining a submirror) appeared in the logs. I am not sure if that is due to the inadequacy of my simulations, or something more fundamental (and needless to say I am reluctant to increase the reality of these simulations too much).
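Since the simulated failures left nothing in the logs, a check that reads the output of the status commands directly does not depend on the logs at all. What follows is only a rough sketch of such a check in Python, not a script anyone on the list supplied. The function names are made up for illustration. It assumes that metastat prints a "State:" line for each mirror and submirror, that psrinfo prints one line per processor with the processor id first and its status second, and that both commands are on the PATH that cron provides (on some installations full paths would be needed).

```python
import subprocess


def metastat_problems(text):
    """Return every metastat 'State:' value that is not 'Okay'."""
    problems = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("State:"):
            state = stripped.split(":", 1)[1].strip()
            if state != "Okay":
                problems.append("State: " + state)
    return problems


def offline_cpus(text):
    """Return a message for each psrinfo line whose status is not 'on-line'."""
    problems = []
    for line in text.splitlines():
        fields = line.split()
        # Assumed layout: "<id> <status> since <date> <time>"
        if len(fields) >= 2 and fields[0].isdigit() and fields[1] != "on-line":
            problems.append("CPU %s is %s" % (fields[0], fields[1]))
    return problems


def run(cmd):
    """Run a command and return its stdout, or None if it cannot be started."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    except OSError:
        return None
    return result.stdout


def main():
    """Print any problems found; return 1 if there were any, else 0."""
    problems = []
    # Command names are assumptions; adjust to the local installation.
    for cmd, check in ((["metastat"], metastat_problems),
                       (["psrinfo"], offline_cpus)):
        output = run(cmd)
        if output is None:
            # A check that cannot run is itself worth reporting.
            problems.append("could not run " + cmd[0])
        else:
            problems.extend(check(output))
    for problem in problems:
        print(problem)
    return 1 if problems else 0
```

To use this from cron, the file would end with a line such as raise SystemExit(main()). The script prints nothing when all is well, and cron normally mails any output to the owner of the job, so a message arrives only when something looks wrong. It reports only the bare state, so the full metastat output is still needed to see which submirror is affected.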
I have set up a cron job to check the output of metadb and metastat, thereby covering disk problems. (I also have mdlogd running, but so far it hasn't helped, possibly because no errors are showing up in the logs.) I will likely add checks of psrinfo (which should indicate a CPU problem if it doesn't crash the machine) and prtdiag -v (I am not entirely sure what problems that will detect, but am pretty sure I want to check it; it does NOT appear to detect power supply failures, at least in my tests). For now I will have to hope that someone notices the amber LED in order to catch power supply failures. Maybe eventually I can rig a phototransistor or some such to detect the LED :).

Thanks to all those responding, including Elizabeth Lee, Cristophe Dupre, Steve Camp, Gary Losito, Bertand Hutin, Kevin Buterbaugh, and Arthur Aldridge.

Tom Payerle
Dept of Physics                 payerle@physics.umd.edu
University of Maryland          (301) 405-6973
College Park, MD 20742-4111     Fax: (301) 314-9525

Received on Tue Sep 25 13:23:52 2001