Managers,

Many thanks to those who replied (Jon, Jay, Johan, Hendrik, Casper, Patrico, Petri, and Roland).

Most suggestions involved using psradmin to turn off one processor at a time to isolate the fault. This was my first thought too, but the program runs fairly infrequently and, even then, only dumps core occasionally. I calculated that statistically it would take about 6 weeks to find the processor, 12 weeks at worst. Although the program dumps core infrequently, the impact on downstream applications is huge: sysadmins get called out at horrible hours of the morning to replay database logs and all that stuff we hate doing.

pbind was suggested, and this is a good approach, though interestingly it is not possible to bind to the 'current' processor, only to an explicitly named one. This follows from the fact that even if Solaris did have a get_cpuid() function, its result would only be valid at the moment it was called, since the very next clock tick might timeslice the process off the current cpu, later to be restarted on a (possibly) different processor. Statistically, it's likely to restart on the same processor due to affinity rules, but this is not guaranteed, and it would only 'point me in the right direction' rather than give a concrete answer about which cpu was doing this bit flip.

It then struck me that this did not matter: we can start a process, let the scheduler start it anywhere, and then immediately pbind it to any one of the 6 cpus at random. Any core dump found can then be analysed for a variable recording the bound cpu.

Here's the code. BTW, cpus in Solaris are not necessarily numbered in a linear fashion.
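The init_cpus() helper called in the code is not shown in this post. Purely as a rough sketch of how such a helper might cope with non-linear cpu numbering (this is not the original helper), one could walk every possible id and keep only those that are online. The names collect_cpus, cpu_exists_fn, solaris_cpu_online and demo_cpu_exists are made up for illustration. The Solaris branch assumes p_online(2) with P_STATUS and sysconf(_SC_CPUID_MAX) behave as documented in the man pages; the non-Solaris branch is only a stand-in with an invented sparse layout so the loop can be tried elsewhere.

```c
#include <assert.h>
#include <stddef.h>

#ifdef __sun
#include <sys/types.h>
#include <sys/processor.h>
#include <unistd.h>
#else
typedef int processorid_t;      /* stand-in so the sketch compiles off Solaris */
#endif

/* Hypothetical callback: returns nonzero if processor `id` is usable. */
typedef int (*cpu_exists_fn)(processorid_t id);

/*
 * Store in out[] every id in 0..max_id for which exists() is true.
 * Returns the number of ids stored (never more than cap).
 */
static int collect_cpus(processorid_t *out, int cap, long max_id,
                        cpu_exists_fn exists)
{
    int n = 0;
    long id;

    assert(out != NULL && cap >= 0);
    for (id = 0; id <= max_id && n < cap; id++) {
        if (exists((processorid_t)id))
            out[n++] = (processorid_t)id;
    }
    return n;
}

#ifdef __sun
/* Assumption: p_online(id, P_STATUS) reports P_ONLINE for a usable cpu
 * and fails with -1 when no such processor exists. */
static int solaris_cpu_online(processorid_t id)
{
    return p_online(id, P_STATUS) == P_ONLINE;
}

/* Same shape as the call in the post: fills cpu[64], returns the count. */
int init_cpus(processorid_t *cpu)
{
    return collect_cpus(cpu, 64, sysconf(_SC_CPUID_MAX), solaris_cpu_online);
}
#else
/* Invented sparse layout (ids 0,1,4,5,8,9,...) for trying the loop only. */
static int demo_cpu_exists(processorid_t id)
{
    return (id % 4) < 2;
}
#endif
```

Off Solaris, calling collect_cpus(ids, 64, 9, demo_cpu_exists) fills ids with 0, 1, 4, 5, 8, 9 and returns 6, which mimics a 6-cpu box with gaps in its numbering. Keeping the probe behind a callback is just a convenience for trying the loop without the hardware.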
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/processor.h>
    #include <sys/procset.h>

    /* helper not shown; presumably fills cpu[] with valid ids and returns the count */
    int init_cpus(processorid_t *cpu);

    int main(void)
    {
        processorid_t cn = 0;
        processorid_t cpu[64];
        int ncpu;

        ncpu = init_cpus(cpu);
        srand(getpid());
        cn = rand() % ncpu;     /* random index into cpu[] */
        if (processor_bind(P_PID, P_MYID, cpu[cn], NULL) == -1)
            perror("processor_bind");
        /* read the binding back; cn now holds the bound cpu id */
        processor_bind(P_PID, P_MYID, PBIND_QUERY, &cn);
        printf("cpu %d of %d\n", cn, ncpu);
        return 0;
    }

So, in a nutshell, the process is started by the scheduler on any available cpu, it then binds for the rest of its life to a random cpu, and if it dumps core, the core will contain a symbol identifying where the binding took place. I have omitted the guts of the test, which uses code from the same libraries as our crashing process.

Sadly, this effort is all to disprove Sun's recommendation that it's a hardware fault. We would have expected a kernel panic by now if it really was a hardware fault. My suspicion is that the fault lies somewhere in process linkage in ld.so.1 (?).

Thanks for all your suggestions, much appreciated.

Simon.

_______________________________________________
sunmanagers mailing list
sunmanagers@sunmanagers.org
http://www.sunmanagers.org/mailman/listinfo/sunmanagers

Received on Fri Sep 21 05:47:49 2001