**** summary:
I received a couple of me-toos on this one, but no solutions to the
problem. Since my original posting, the product vendor has acknowledged
that there is a problem and is working on a fix for it.
**** thanks to:
Andrew Foote <acf@nabaus.com.au>
Jacques Rall <jacques.rall@za.eds.com>
Marc S. Gibian <gibian@stars1.hanscom.af.mil>
***** answers:
> From: gibian@stars1.hanscom.af.mil
>
> I've been away from the office, so I don't know if you've sent out a summary yet.
> Anyway, as far as I know, the only recovery path for a hung socket is a reboot.
> I'll add that hung sockets are not all that uncommon when I've run unattended
> ufsdumps over the LAN. This is why I strongly advise against backup
> products that rely on the OS's underlying tools for the actual tape handling. Some
> argue that they want to be able to restore without first installing the backup
> tool on a crashed system. My position is that you spend so much more time on the
> dump side that the slight overhead during recovery is far outweighed by the
> added reliability during the dump.
>
> Hope this helps,
> Marc S. Gibian
> Telos Consulting Services phone: (617) 377-6350
> PRISM/TFS email: gibian@stars1.hanscom.af.mil
> ----------
> From: Jacques Rall <jacques.rall@za.eds.com>
>
> What about using pmadm or sacadm? (Sorry, I don't know the switches.)
>
> ----------
> From: ACF
>
> Me too!
>
> However, I am running proxy backups under AIX between RS/6000s. Like you,
> the only method I've found to "reset" the socket is killing all of the
> associated processes.
>
> PDC need to work on this, as it's pretty dirty.
> Please let me know how you go.
>
> Rgds,
> Midrange Services.
**** original question:
Sun SPARCstation 20 running Solaris 2.5 with the 2.5 recommended patches installed.
Problem description:
This machine is a dedicated backup server that runs the PDC Budtool product.
This product runs dump/restore over the remote shell (rsh) to back up the
client machines. There appears to be a bug that is triggered when one of the
backup clients hangs or crashes while a dump is running: the backup server
keeps its socket connection to that client open, and the socket stays open
until I manually kill the parent process on the backup server that initiated
the remote dump.
I'm working with the backup product vendor on a fix for this problem, but
was hoping in the meantime to find a way to close this socket without
killing the parent backup process. When I kill the parent process, none of
the backups remaining in the "backup schedule" get run, and the summary of
the backup schedule is not generated. I guess my basic question is:
shouldn't a socket be closed when the destination machine is no longer
reachable (e.g. no longer answers ping)?
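Some background on the question above: TCP does not, by itself, notice that an idle peer has disappeared. A connection with no unacknowledged data outstanding sends nothing, so it can remain ESTABLISHED indefinitely unless the application enables the SO_KEEPALIVE socket option or enforces its own timeout. The sketch below is a minimal Python illustration of enabling keepalive, assuming you control the code that opens the connection (Budtool's source is not available here, and the function names are hypothetical). On Solaris the idle time before the first probe is a system-wide setting, the tcp_keepalive_interval parameter of ndd /dev/tcp, which defaults to two hours.

```python
import socket


def open_with_keepalive(host, port, connect_timeout=30.0):
    """Connect to host:port and enable TCP keepalive probes (hypothetical helper).

    With SO_KEEPALIVE set, the kernel periodically probes an idle
    connection; if the peer has crashed or become unreachable, the probes
    go unanswered and a later read fails with an error instead of
    blocking forever.
    """
    sock = socket.create_connection((host, port), timeout=connect_timeout)
    sock.settimeout(None)  # return to ordinary blocking mode once connected
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    return sock


def keepalive_enabled(sock):
    """Return True if SO_KEEPALIVE is set on sock."""
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
```

Keepalive only bounds how long a dead connection can linger; with the default interval, a crashed client would still hold the socket for a couple of hours before the read fails.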
Attached is some output that will hopefully clarify the problem.
All of the commands were run from the backup server (of course, since
the client is inaccessible!):
backupsvr: lsof | grep client
goserver 2095 root 11u inet 0xf611fec0 0t5 TCP backupsvr:1020->client.bms.com:shell
backupsvr: netstat -a | grep client
backupsvr.1020 client.bms.com.shell 61315 0 8760 0 ESTABLISHED
backupsvr: ping client 1
no answer from client.bms.com
(NOTE: goserver is the "parent" process that controls the backup schedule
and initiates the remote dump command.)
backupsvr: /usr/ucb/ps auxw | grep "goserver -x"
root 2095 0.0 4.3 3272 2676 ? S Jan 05 286:19 /usr/budtool/bin/solaris_sparc/goserver -x0
backupsvr: truss -aef -p 2095
2095: psargs: /usr/budtool/bin/solaris_sparc/goserver -x0
2095: getmsg(12, 0xEFFF87F8, 0xEFFF87EC, 0xEFFF8804) (sleeping...)
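The truss output shows goserver asleep in getmsg(), apparently blocked waiting for input that a crashed client will never send. An application-level alternative to keepalive is to wait for data with a deadline, so that a silent client aborts only its own dump instead of stalling the rest of the schedule. A minimal Python sketch using select(); the function name and timeout values are hypothetical, and this is not a description of how Budtool works.

```python
import select
import socket


def recv_with_deadline(sock, bufsize, idle_timeout):
    """Wait up to idle_timeout seconds for data on sock (hypothetical helper).

    Returns the bytes received (b"" means the peer closed the connection),
    or None if nothing arrived before the deadline, which the caller can
    treat as a hung client.
    """
    readable, _, _ = select.select([sock], [], [], idle_timeout)
    if not readable:
        return None  # no data within idle_timeout: assume the client is gone
    return sock.recv(bufsize)
```

A caller would loop on this while the remote dump is producing output; when it returns None, the caller closes that one socket, logs a failure for that client, and moves on, so the remaining jobs in the schedule still run and the summary still gets generated.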
Any information on how to close this socket without killing the goserver
process would be appreciated. Thanks!
--
Christopher M. Murphy email: murphy@bms.com
Bristol Myers Squibb phone: (609) 252-5741
Scientific Information Systems fax: (609) 252-6163
Princeton NJ
This archive was generated by hypermail 2.1.2 : Fri Sep 28 2001 - 23:11:43 CDT