SUMMARY: Are SCSI Warnings Normal When Using Extended SAN Fabrics?

From: Graham Leggate <graham.leggate_at_gmail.com>
Date: Sun Sep 21 2008 - 23:28:36 EDT

First, I'd like to thank all those who responded. Thank you.

Sorry for the Summary taking so long to be sent back to the group. However,
given the sensitivity of the site where these warnings were appearing, it
has taken quite a bit of time to get changes implemented and tested.

Getting to the bottom of this has involved some trial and error. Below
I'll summarise the changes which made a positive impact on our
system:

1. Implemented the changes to the ssd driver values in /etc/system. The
settings which seem to work best for us are:
set ssd:ssd_io_time=60
set ssd:ssd_max_throttle=20
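
For anyone repeating step 1, it can help to confirm what is actually set
in /etc/system before rebooting. The following is a minimal sketch,
assuming a POSIX shell and awk; the function name ssd_tunables is made up
for this example and is not part of Solaris. Comment lines in /etc/system
start with "*" and are skipped.

```shell
# List the ssd driver tunables set in a system file as name=value pairs.
# Usage: ssd_tunables [file]   (defaults to /etc/system)
ssd_tunables() {
    awk '
        # Only active "set ssd:..." lines match; "*" comment lines do not.
        /^[ \t]*set[ \t]+ssd:/ {
            line = $0
            sub(/^[ \t]*set[ \t]+ssd:/, "", line)   # drop the "set ssd:" prefix
            gsub(/[ \t]/, "", line)                 # tolerate spaces around "="
            print line
        }
    ' "${1:-/etc/system}"
}

# Assumption: on a running Solaris 10 kernel the live values can usually be
# read as root with mdb, for example:
#   echo "ssd_max_throttle/D" | mdb -k
```

Values in /etc/system only take effect after a reboot, so the file and the
running kernel can disagree until then.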

2. We patched the OS and Veritas software to the latest releases of patches
available. Our systems were nearly a year behind the current recommended
patches.

3. We reset the vxdmp settings back to default, as we had played around with
the iotimeouts and queuedepth:
# Set queuedepth and io back to defaults:
vxdmpadm setattr arraytype A/A-A-HDS recoveryoption=default

4. After doing some more research we discovered a little fact about the HDS
SANs: they are not true Asymmetric Active-Active (A/A-A) arrays. They
mimic an A/A-A array by performing internal switching in the HDS controllers.
This, in theory, shouldn't affect performance or reliability. However, after
talking to Veritas, it was decided to set vxdmp to use a single path to the
SAN for all I/O. This doesn't exclude the other path from being used, e.g. in
the event of an HBA failure, and with multiple LUNs you can still load
balance over your two HBAs, but once set it will use that HBA until a
failure on the path is detected.
# Trying this as a setting to stop VXDMP from flapping about on the
# HDS SAN:
vxdmpadm setattr enclosure AMS_WMS0 iopolicy=singleactive use_all_paths=no
vxdmpadm setattr enclosure AMS_WMS1 iopolicy=singleactive use_all_paths=no
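
With more than a couple of enclosures it may be worth generating the
commands for review before running them. The sketch below only prints the
setattr command used above plus a matching getattr check for each
enclosure; it does not execute anything. The function name
dmp_singleactive_plan is made up for this example, and the enclosure names
are the ones from our site (yours will differ; "vxdmpadm listenclosure all"
should show them).

```shell
# Print, without executing, the vxdmpadm commands to switch each named
# enclosure to the singleactive I/O policy and then read the policy back.
# Usage: dmp_singleactive_plan <enclosure> [<enclosure> ...]
dmp_singleactive_plan() {
    for enc in "$@"; do
        echo "vxdmpadm setattr enclosure $enc iopolicy=singleactive use_all_paths=no"
        echo "vxdmpadm getattr enclosure $enc iopolicy"
    done
}
```

For example, "dmp_singleactive_plan AMS_WMS0 AMS_WMS1" prints four lines
which can be checked by eye and then pasted into a root shell.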

Now our system appears to be stable, and the number of SCSI warnings has
dropped to 1 or 2 per day, which we can align with errors occurring on the
SAN fabric between the two sites (set and out of frame errors).
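
To track a figure like "1 or 2 per day" over time, the warnings can be
counted straight from the messages file. This is a minimal sketch,
assuming the syslog format shown in the original post below, where each
event starts with a "scsi: [ID ... kern.warning] WARNING:" line; the
function name scsi_warnings_per_day is made up for this example.

```shell
# Count SCSI kern.warning events per calendar day in a syslog file.
# Usage: scsi_warnings_per_day [file]   (defaults to /var/adm/messages)
scsi_warnings_per_day() {
    awk '
        / scsi: \[ID [0-9]+ kern\.warning\] WARNING:/ {
            day = $1 " " $2                      # month and day of month
            if (!(day in count)) order[++n] = day
            count[day]++
        }
        # Print days in the order they first appeared in the file.
        END { for (i = 1; i <= n; i++) print order[i], count[order[i]] }
    ' "${1:-/var/adm/messages}"
}
```

Each output line is the month, the day and the number of warnings, which
makes it easy to line the counts up against error counters on the
switches.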

Regards Graham

Subject: Are SCSI Warnings Normal When Using Extended SAN Fabrics?
------------------------

From: *Graham Leggate* <graham.leggate@gmail.com>
Date: 2008/7/31
To: sunmanagers@sunmanagers.org

Hi,

I have a question: what would be considered a "normal" number
of scsi warnings when using remote SANs?

We have a number of SUN Servers (E2900, V890, X4200M2) with dual
HBAs, running Solaris 10 U3 and Veritas Storage Foundation 5, connected
to an HDS SAN. We have two SANs, located in two physical datacentres
(prod & DRC) which are approximately 40km apart. We run dark fibre
between the two sites and use CWDMs to provide 2 x 2Gbps Data
Networking + 6 x 2Gbps Fibre Channel. The SUN servers use vxdmp to
connect to 2 Brocade switches, and each Brocade switch has 3 x
2Gbps trunked ISLs to connect to the switches at the remote
datacentre. We also use the Extended Fabric Licenses in the switches.
The servers' data volumes are located on the SANs, where we have a LUN
presented by the local SAN and a second LUN presented by the remote
SAN. The volumes are then mirrored using Veritas. The SUN servers run a
mix of Oracle RAC 10gR2 and an in-house transaction processing engine
and custom database.

Each day the servers log a number of warnings to syslog, as shown
below. Each time the system warns of a scsi transport issue, the
problem is always reported against the remote LUN. These
warnings are not causing the systems to fail in any way; however, the
customer is asking for an explanation as to why these messages are
occurring. Previously we did not have the Extended Fabric License or
the Trunking Licenses, and we would see many of these scsi errors in
succession, which would then cause Veritas to mark a disk as either
failing or failed, which would mean we would need to re-mirror the
disk. But since we have had the Extended Fabric Licenses installed on
the Brocade switches, the number of scsi warnings has greatly decreased
and we haven't had any disk failures. I do not know if these types of
messages are "normal" when running systems with remote mirrors, or if
this is something we need to investigate further to see if there are
any other underlying problems. Any insight from those of you who
run Solaris with remote mirrors would be greatly appreciated.
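
As a rough illustration of why the Extended Fabric License is likely to
matter at this distance: an ISL port can only keep a long link full if it
has enough buffer-to-buffer credits to cover the round trip. The sketch
below estimates that number. It assumes about 5 microseconds per km of
propagation delay in fibre, full-size frames of 2148 bytes, and 8b/10b
encoding at 1.0625 Gbaud per nominal Gbps; the function name
bb_credits_needed is made up for this example. Real traffic with smaller
frames needs more credits, so treat the result as a lower bound and not as
a Brocade sizing rule.

```shell
# Estimate the buffer-to-buffer credits needed to keep one ISL busy.
# Usage: bb_credits_needed <distance_km> <nominal_speed_gbps>
bb_credits_needed() {
    awk -v km="$1" -v gbps="$2" 'BEGIN {
        us_per_km   = 5               # assumed propagation delay in fibre
        frame_bytes = 2148            # assumed full-size FC frame
        bits_per_us = gbps * 1062.5   # 8b/10b line rate in bits per microsecond
        frame_us = frame_bytes * 10 / bits_per_us   # 10 line bits per byte
        rtt_us   = 2 * km * us_per_km               # frame out, credit back
        credits  = rtt_us / frame_us
        c = int(credits); if (c < credits) c++      # round up
        print c
    }'
}
```

For the 40km, 2Gbps links described above, "bb_credits_needed 40 2" prints
40, i.e. roughly one credit per km per ISL.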

---messages----
Jul 31 02:00:24 SERVER001 scsi: [ID 107833 kern.warning] WARNING:
/ssm@0,0/pci@19,600000/SUNW,emlxs@1/fp@0,0/ssd@w50060e80102a00f2,8
(ssd166):
Jul 31 02:00:24 SERVER001  Error for Command: write(10)
Error Level: Retryable
Jul 31 02:00:24 SERVER001 scsi: [ID 107833 kern.notice]    Requested
Block: 11880000                  Error Block: 11880000
Jul 31 02:00:24 SERVER001 scsi: [ID 107833 kern.notice]    Vendor:
HITACHI                            Serial Number: 750409750029
Jul 31 02:00:24 SERVER001 scsi: [ID 107833 kern.notice]    Sense Key:
Aborted Command
Jul 31 02:00:24 SERVER001 scsi: [ID 107833 kern.notice]    ASC: 0xc0
(<vendor unique code 0xc0>), ASCQ: 0x3, FRU: 0x0
Jul 31 02:00:24 SERVER001 scsi: [ID 107833 kern.warning] WARNING:
/ssm@0,0/pci@19,600000/SUNW,emlxs@1/fp@0,0/ssd@w50060e80102a0082,3
(ssd144):
Jul 31 02:00:24 SERVER001  Error for Command: write(10)
Error Level: Retryable
Jul 31 02:00:24 SERVER001 scsi: [ID 107833 kern.notice]    Requested
Block: 11880000                  Error Block: 11880000
Jul 31 02:00:24 SERVER001 scsi: [ID 107833 kern.notice]    Vendor:
HITACHI                            Serial Number: 750409680012
Jul 31 02:00:24 SERVER001 scsi: [ID 107833 kern.notice]    Sense Key:
Aborted Command
Jul 31 02:00:24 SERVER001 scsi: [ID 107833 kern.notice]    ASC: 0xc0
(<vendor unique code 0xc0>), ASCQ: 0x3, FRU: 0x0
Jul 31 02:00:24 SERVER001 scsi: [ID 107833 kern.warning] WARNING:
/ssm@0,0/pci@19,600000/SUNW,emlxs@1/fp@0,0/ssd@w50060e80102a0082,5
(ssd165):
Jul 31 02:00:24 SERVER001  Error for Command: write(10)
Error Level: Retryable
Jul 31 02:00:24 SERVER001 scsi: [ID 107833 kern.notice]    Requested
Block: 1132771808                Error Block: 1132771808
Jul 31 02:00:24 SERVER001 scsi: [ID 107833 kern.notice]    Vendor:
HITACHI                            Serial Number: 750409680023
Jul 31 02:00:24 SERVER001 scsi: [ID 107833 kern.notice]    Sense Key:
Aborted Command
Jul 31 02:00:24 SERVER001 scsi: [ID 107833 kern.notice]    ASC: 0xc0
(<vendor unique code 0xc0>), ASCQ: 0x3, FRU: 0x0
Jul 31 02:00:24 SERVER001 scsi: [ID 107833 kern.warning] WARNING:
/ssm@0,0/pci@19,600000/SUNW,emlxs@1/fp@0,0/ssd@w50060e80102a0082,7
(ssd169):
Jul 31 02:00:24 SERVER001  Error for Command: write(10)
Error Level: Retryable
Jul 31 02:00:24 SERVER001 scsi: [ID 107833 kern.notice]    Requested
Block: 327259936                 Error Block: 327259936
Jul 31 02:00:24 SERVER001 scsi: [ID 107833 kern.notice]    Vendor:
HITACHI                            Serial Number: 750409680029
Jul 31 02:00:24 SERVER001 scsi: [ID 107833 kern.notice]    Sense Key:
Aborted Command
Jul 31 02:00:24 SERVER001 scsi: [ID 107833 kern.notice]    ASC: 0xc0
(<vendor unique code 0xc0>), ASCQ: 0x3, FRU: 0x0
Jul 31 02:00:24 SERVER001 scsi: [ID 107833 kern.warning] WARNING:
/ssm@0,0/pci@19,600000/SUNW,emlxs@1/fp@0,0/ssd@w50060e80102a0082,8
(ssd171):
Jul 31 02:00:24 SERVER001  Error for Command: write(10)
Error Level: Retryable
Jul 31 02:00:24 SERVER001 scsi: [ID 107833 kern.notice]    Requested
Block: 1132650832                Error Block: 1132650832
Jul 31 02:00:24 SERVER001 scsi: [ID 107833 kern.notice]    Vendor:
HITACHI                            Serial Number: 750409680024
Jul 31 02:00:24 SERVER001 scsi: [ID 107833 kern.notice]    Sense Key:
Aborted Command
Jul 31 02:00:24 SERVER001 scsi: [ID 107833 kern.notice]    ASC: 0xc0
(<vendor unique code 0xc0>), ASCQ: 0x3, FRU: 0x0
Jul 31 02:00:24 SERVER001 scsi: [ID 107833 kern.warning] WARNING:
/ssm@0,0/pci@19,600000/SUNW,emlxs@1/fp@0,0/ssd@w50060e80102a00f2,1
(ssd150):
Jul 31 02:00:24 SERVER001  Error for Command: write(10)
Error Level: Retryable
Jul 31 02:00:24 SERVER001 scsi: [ID 107833 kern.notice]    Requested
Block: 3407136                   Error Block: 3407136
Jul 31 02:00:24 SERVER001 scsi: [ID 107833 kern.notice]    Vendor:
HITACHI                            Serial Number: 750409750014
Jul 31 02:00:24 SERVER001 scsi: [ID 107833 kern.notice]    Sense Key:
Aborted Command
Jul 31 02:00:24 SERVER001 scsi: [ID 107833 kern.notice]    ASC: 0xc0
(<vendor unique code 0xc0>), ASCQ: 0x1, FRU: 0x0
Jul 31 02:00:24 SERVER001 scsi: [ID 107833 kern.warning] WARNING:
/ssm@0,0/pci@19,600000/SUNW,emlxs@1/fp@0,0/ssd@w50060e80102a0082,1
(ssd146):
Jul 31 02:00:24 SERVER001  Error for Command: write(10)
Error Level: Retryable
Jul 31 02:00:24 SERVER001 scsi: [ID 107833 kern.notice]    Requested
Block: 1331088                   Error Block: 1331088
Jul 31 02:00:24 SERVER001 scsi: [ID 107833 kern.notice]    Vendor:
HITACHI                            Serial Number: 750409680014
Jul 31 02:00:24 SERVER001 scsi: [ID 107833 kern.notice]    Sense Key:
Aborted Command
Jul 31 02:00:24 SERVER001 scsi: [ID 107833 kern.notice]    ASC: 0xc0
(<vendor unique code 0xc0>), ASCQ: 0x1, FRU: 0x0
Jul 31 02:04:07 SERVER001 scsi: [ID 107833 kern.warning] WARNING:
/ssm@0,0/pci@19,600000/SUNW,emlxs@1/fp@0,0/ssd@w50060e80102a00f2,3
(ssd148):
Jul 31 02:04:07 SERVER001  Error for Command: write(10)
Error Level: Retryable
Jul 31 02:04:07 SERVER001 scsi: [ID 107833 kern.notice]    Requested
Block: 12554768                  Error Block: 12554768
Jul 31 02:04:07 SERVER001 scsi: [ID 107833 kern.notice]    Vendor:
HITACHI                            Serial Number: 750409750012
Jul 31 02:04:07 SERVER001 scsi: [ID 107833 kern.notice]    Sense Key:
Aborted Command
Jul 31 02:04:07 SERVER001 scsi: [ID 107833 kern.notice]    ASC: 0xc0
(<vendor unique code 0xc0>), ASCQ: 0x3, FRU: 0x0
Jul 31 02:11:05 SERVER001 scsi: [ID 107833 kern.warning] WARNING:
/ssm@0,0/pci@19,600000/SUNW,emlxs@1/fp@0,0/ssd@w50060e80102a0082,3
(ssd144):
Jul 31 02:11:05 SERVER001  SCSI transport failed: reason 'tran_err':
retrying command
Jul 31 03:37:08 SERVER001 scsi: [ID 107833 kern.warning] WARNING:
/ssm@0,0/pci@19,600000/SUNW,emlxs@1/fp@0,0/ssd@w50060e80102a0082,5
(ssd165):
Jul 31 03:37:08 SERVER001  Error for Command: write(10)
Error Level: Retryable
Jul 31 03:37:08 SERVER001 scsi: [ID 107833 kern.notice]    Requested
Block: 1132772880                Error Block: 1132772880
Jul 31 03:37:08 SERVER001 scsi: [ID 107833 kern.notice]    Vendor:
HITACHI                            Serial Number: 750409680023
Jul 31 03:37:08 SERVER001 scsi: [ID 107833 kern.notice]    Sense Key:
Aborted Command
Jul 31 03:37:08 SERVER001 scsi: [ID 107833 kern.notice]    ASC: 0xc0
(<vendor unique code 0xc0>), ASCQ: 0x3, FRU: 0x0

Many Thanks

Regards Graham
____________________
Graham Leggate
