VI3 Advanced Log Analysis Mostafa Khalil TA19 VCP, VMware Product Support Engineering

TA19
VI3 Advanced Log Analysis
Mostafa Khalil
VCP, VMware Product Support Engineering
Housekeeping
Please turn off your mobile phones, blackberries
and laptops
Your feedback is valued: please fill in the session
evaluation form (specific to that session) & hand it to the
room monitor / the materials pickup area at registration
Each delegate to return their completed event evaluation
form to the materials pickup area will be eligible for a free
evaluation copy of VMware’s ESX 3i
Please leave the room between sessions, even if your next
session is in the same room as you will need to be
rescanned
Agenda
ESX Server Boot process
Locating logs
How to read the logs
Interpreting log entries
Making sense of it all
ESX Server 3.0.x/3.5 Boot Process
boot loader
initrd
VMkernel
vmnix
VMware init
scripts
init scripts
/sbin/init
Boot loader in MBR points to the boot device
Grub.conf lists the boot menu options.
Selected menu provides:
Location of root partition. Uses UUID instead of device name
(e.g. /dev/sda)
RAM Disk file name relative to /boot location
ESX Server 3i Boot Process
boot loader
initrd
VMkernel
VMware init
scripts
/sbin/init
Boot loader in MBR points to the boot device
initrd (initial Ram disk) is loaded
Collecting Logs – UI
Logon to VI Client as an
Administrator
Select: File  Export  Export
Diagnostics Data
Or Administration  Export
Diagnostics Data
Select servers from which to collect
the logs including VC Server
Select “Include information ..”
Checkbox
Specify location for storing the files
ESX Server 3.0.x Logs (collected via vm-support)
Logs are located mostly under /var/log directory.
Locations listed here are relative to that directory
vmkernel
messages
dmesg
boot.log
initrdlogs/*
vmksummary
vmware/hostd.log
vmware/vpx/vpxa.log
vmware/esxcfg-boot.log
vmware/esxcfg-firewall.log
vmware/vmware-cim.log
vmware/esxcfg-linuxnet.log
vmware/esxupdate.log
oldconf/esx.conf.*
rpmpkgs
vmkernel-version
ESX 3i Log Files
config.log
vmware/hostd.log
messages
vmware/aam/*
slpd.log
vmware/vpx/vpxa.log
wsmand.log
configRP.log
vmkernel Log (3.0.x/3.5)
Located in /var/log directory
Contains all events generated by vmkernel
vmkwarning log is a subset of this one and contains only
the warning events
Rotated with a numeric extension. The current log
wihout extension and the next newest one with “.1”
extension
All events since last vmkernel load are also in memory in
/proc/vmware/log
messages log files (3i)
Located in /var/log directory
Contains all events generated by vmkernel
Rotated with a numeric extension. The current log
wihout extension and the next newest one with “.0.gz”
extension (rotated and compressed)
vmkernel Log - Components
Jun 19 09:12:54 giza vmkernel: 14:22:31:50.009 cpu3:1033)scsi-qla0: Scheduling SCAN for new luns....
System
Date/Time
Message
source
Host
name
Device
uptime
CPU:World ID
Message
Sample Rescan Event vmkernel Log Entries
cpu3:1033)<6>scsi-qla0: Scheduling SCAN for new luns....
cpu1:1034)SCSI: 8244: Starting rescan of adapter vmhba0
The beginning of a SAN Rescan event for vmhba0
Rescan Event – LUN Discovery
Vendor: IBM
Model: 1722-600
Type: Direct-Access
Rev: 0520
ANSI SCSI revision: 03
cpu3:1033)LinSCSI: 4625: Device vmhba0:0:0 has appeared
VMWARE SCSI Id: Supported VPD pages for vmhba0:0:0 : 0x0 0x80
0x83 0xc0 0xc1 0xc2 0xc3 0xc4 0xc5 0xc6 0xc7 0xc8 0xc9 0xca 0xd0
Storage Vendor’s ID, Array Model and Microcode Rev.
Reported ANSI version is 3 = SCSI-3
LUN 0 on target 0 on vmhba0 discovered
Array sends supported Vital Product Data (VPD)
pages for LUN 0
Array supports VPD pages 0x80 and 0x83
Rescan Event – LUN Discovery – Cont.
VMWARE SCSI Id: Device id info for vmhba0:0:0: 0x1 0x3 0x0 0x10
0x60 0xa 0xb 0x80 0x0 0x17 0x4e 0x84 0x0 0x0 0x12 0x84 0x43 0x 78
0xa3 0x9f
VMWARE SCSI Id: Id for vmhba0:0:0 0x60 0x0a 0x0b 0x80 0x00 0x17
0x4e 0x84 0x00 0x00 0x12 0x84 0x43 0x78 0xa3 0x9f 0x31 0x37 0x32
0x32 0x2d 0x36
Device ID for the LUN reported. If VPD page 0x83 were not
supported, this line would not show in the log
LUN Id is reported. This matches what is in the proc node
/proc/vmware/scsi/vmhba0/0:0:
Id: 60 a b 80 0 17 4e 84 0 0 12 84 43 78 a3 9f 31 37 32 32 2d 36
Rescan Event – LUN Discovery - Conclusion
cpu3:1033)SCSI: 1424: Device vmhba0:0:0 is attached to a V53 FAStT
SAN.
cpu3:1033)SCSI: 640: Dual Controllers active for adapter vmhba0
cpu3:1033)SCSI: 1450: The IBM FAStT device on vmhba0:0:0 is not
configured in Auto-Volume Transfer mode. ESX will handle path
failover to passive controllers as necessary.
cpu3:1033)SCSI: 2044: Setting default path policy to MRU on target
vmhba0:0:0
The LUN is identified as attached to a V53 FAStT SAN
vmhba0 can access both controllers on the FAStT
AVT is identified as “Disabled” and path failover will be
handled by ESX
Default Path Policy is set to MRU for that LUN
If AVT were enabled, the policy would have been set to Fixed
Path Failover Event
cpu1:1038)WARNING: SCSI: 1785: Manual switchover to path vmhba0:1:5
begins.
cpu1:1038)WARNING: SCSI: 1110: Did not switchover to vmhba0:1:5. Check
Unit Ready Command returned READY instead of NOT READY for standby
controller.
cpu1:1038)WARNING: SCSI: 1820: Manual switchover to vmhba0:1:5
completed successfully.
cpu1:1038)SCSI: 1789: Changing active path to vmhba0:1:5
Starting a manual switchover (done on ESX side)
No need to move the LUN on the array to target 1
since it returned a “READY” state on that target
Switchover (failover) completed
Now vmhba0:0:5 active path is changed to vmhba0:1:5
Snapshot LUN Detection (ESX 3.0.x)
LVM: 5739: Device vmhba2:2:2:1 is a snapshot:
LVM: 5745:
disk ID:
<type 3, len 15, lun 2, devType 0,
scsi 3, h(id) 1771423412675533879>
LVM: 5747:
m/d disk ID: <type 3, len 15, lun 2, devType 0,
scsi 3, h(id) 9219142619163180480>
LVM: 5739: Device vmhba2:2:2:1 is a snapshot:
LVM: 5745:
disk ID: <type 3, len 15, lun 2, devType 0, scsi
3, h(id) 1771423412675533879>
LVM: 5747:
m/d disk ID: <type 3, len 15, lun 2, devType 0,
scsi 3, h(id) 9219142619163180480>
ALERT: LVM: 4903: vmhba2:2:2:1 may be snapshot: disabling
access. See resignaturing section in SAN config guide.
• This logging appears in the /var/log/vmkernel log file.
• The line containing m/d is the metadata.
• In this case it is the h(id) data in the LVM header which is mismatched.
Snapshot LUN Detection (ESX 3.5 and 3i)
LVM: 5573: Device
vml.010044000044363048313739443030343420202020444636303046:1 detected to be
a snapshot:
LVM: 5580:
queried disk ID: <type 1, len 22, lun 68, devType 0, scsi 4,
h(id) 3084339621621410734>
LVM: 5587:
on-disk disk ID: <type 1, len 22, lun 68, devType 0, scsi 4,
h(id) 3661551745314019942>
ALERT: LVM: 4469:
vml.010044000044363048313739443030343420202020444636303046:1 may be
snapshot: disabling access. See resignaturing section in SAN config guide.
• This logging appears in /var/log/vmkernel log file on ESX 3.5
• This logging appears in /var/log/message log file on ESX 3i
• The line containing m/d is the metadata
• In this case it is the h(id) data in the LVM header which is mismatched
• The “type” field identifies the Disk ID type
Type value
Disk ID Type
1
Serial Number
2
NAA
3
Symm6
Understanding SCSI Error Strings
Format:
Device/Host Sense_buffer[2] 12 13
Abbrev: D/H S ASC ASCQ
Extended:
“Device Status”/”Host Status” “Sense Key” “Additional
Sense Code” “Additional Sense Code Qualifier”
Example:
2/0 0x6 0x29 0x0
Understanding SCSI Error Strings –
Device Status
Device Status: (Displayed in decimal values)
Code
Meaning
0
No errors
2
Check Condition
8
Device Busy
24
Reservation Conflict
Understanding SCSI Error Strings – Host
Status
Host Status (displayed in decimal values)
Code
Meaning
0
Host_OK
1
Host No_Connect
2
Host_Bus_Busy
3
Host_Timeout
4
Host_Bad_Target
5
Host_Abort
Understanding SCSI Error Strings – Host
Status –cont.
Host Status (displayed in decimal values)
Code
Meaning
6
Host_Parity
7
Host_Error
8
Host_Reset
9
Host_Bad_INTR
10
Host_PassThrough
11
Host_Soft_Error
Understanding SCSI Error Strings –
Sense Key
Sense Key (displayed in hex)
Code Meaning
0x0
No Sense Information
0x1
Last command completed but used error correction
0x2
Unit Not Ready
0x3
Medium Error
0x4
Hardware error
0x5
ILLEGAL_REQUEST (Passive SP)
0x6
LUN Reset
Understanding SCSI Error Strings – Sense
Key – cont.
Sense Key (displayed in hex)
Code Meaning
0x7
Data_Protect – Acces to data is blocked
0x8
Blank_Check – Reached an unexpected region
0xa
Copy_Aborted
0xb
Aborted_Command – Target aborted command
0xc
Comparison for SEARCH DATA unsuccessful
0xd
Volume_Overflow – Medium is full
0xe
Source and Data on Medium do not agree
Understanding SCSI Error Strings –
ASC/ASCQ
ASC and ASCQ are always in pairs (in hex) (ASQ usually 0)
Code Meaning
0x4
Unit Not Ready
0x3
Unit Not Ready – Manual Intervention Required
0x2
Unit Not Ready - Initializing Command Required
0x29
Device Power on or SCSI Reset
0x8b
ASC_QUIESCENCE_HAS_BEEN_ACHIEVED (IBM FAStT)
0x94
ASC_Invalid_Req_due_To_Current_LU_Ownership (IBM FAStT)
0x01
ASCQ_Invalid_Req_due_To_Current_LU_Ownership (IBM
FAStT)
0x02
ASCQ_QUIESCENCE_HAS_BEEN_ACHIEVED (IBM FAStT)
Understanding SCSI Error Strings - Examples
cpu3)WARNING: SCSI: 5663: vmhba1:0:10:1 status = 2/0 0x6 0x29 0x0
cpu3)WARNING: SCSI: 5663: vmhba2:1:5:0 status = 24/0 0x0 0x0 0x0
cpu0)SCSI: 8879: vmhba2:1:5:0 status = 24/0 0x0 0x0 0x0
cpu0)WARNING: SCSI: 8760: returns error: "SCSI reservation conflict". Code: 0xbad0023.
cpu3)WARNING: SCSI: 5663: vmhba1:0:9:1 status = 0/1 0x0 0x0 0x0
2/0 0x6 0x29 0x0 (Device Check Condition - Lun Reset)
24/0 0x0 0x0 0x0 (SCSI Reservation Conflict)
0/1 0x0 0x0 0x0 (Device OK/Host No_Connect)
Translating vmkernel Error Codes
In VI3 already listed in English along with the code
You can find them in:
VMware-esx-drivers-public-source-<ver>-<build>.tar.gz
At:
http://www.vmware.com/download/vi/open_source.html
File: return_status.h
In:
/src/include/vmware/vmklinks/vmkernel/public
Codes listed sequentially starting from 0x0 (hex)
Codes get renumbered with new releases
Translating vmkernel Error Codes – Examples
(3.0.2U1)
Line number starts from “0”. Calculate the hex value which gives the last 2 digts in the
hex code that starts with 0xBAD00
Line# Hex
Code
Meaning
27
1b
0xBAD001b Corrupt Redo Log
34
22
0xBAD0022 SCSI Reservation Conflict
35
23
0xBAD0023 File System Locked
51
33
0xBAD0033 VMFS volume missing physical extents
65
41
0xBAD0041 Error parsing MPS Table
129
81
0xBAD0081 No Swap File
138
8A
0xBAD008a SCSI LUN is in snapshot state
142
8E
0xBAD008e Exceed maximum number of files on
the filesystem
Messages Log (3.0x/3.5)
Console events
Logon events
iSCSI Authentication events
Jul 24 19:13:33 giza sshd[18915]: Connection from 10.16.112.24 port 1396
Jul 24 19:13:36 giza sshd[18915]: Accepted password for root from 10.16.112.24 port 1396 ssh2
Jul 24 19:13:36 giza sshd(pam_unix)[18915]: session opened for user root by (uid=0)
Jul 29 01:01:03 giza iscsid[32725]: cannot make connection to 10.16.95.161:3260: No route to host
initrdlogs
Located in /var/log/initrdlogs
Events during initial boot from RAM Disk
Logs include:
vmklog.vmk
messages
vmklog.<storage-driver-name> (e.g. vmklog.qla2300_7xx)
hostd.log
Located in /var/log/vmware
Sym-linked to the current rotated hostd log file
Hostd events
VI Client communications when directly connected to ESX
Events done on behalf of
VPXA
System Services
Firewall System
HA services
VMware Converter
vpxa.log
Located in /var/log/vmware/vpx
Sym-linked to the current rotated vpxa.log
Events of intractions with Virtual Center Server
Log for VMware VirtualCenter, pid=4470, version=2.0.2, build=build-50618, option=Release,
section=2
[2007-07-19 11:41:11.172 'App' 3076436896 info] Current working directory: /var/log/vmware/vpx
[2007-07-19 11:41:11.172 'App' 3076436896 info] Initializing SSL context
[2007-07-19 11:41:11.216 'App' 3076436896 info] Starting VMware VirtualCenter Agent Daemon 2.0.2
build-50618
[2007-07-19 11:41:11.221 'App' 3076436896 info] [VpxaInvtHost] Manager IP: :902 Host IP:
esxcfg-firewall.log
Located in /var/log/vmware directory
All VMware Firewall rules events
2006-10-09 13:57:02 (15463) INFO :
2006-10-09 13:57:02 (15463) INFO :
2006-10-09 13:57:02 (15463) INFO :
2006-10-09 13:57:02 (15463) INFO :
options --log-level debug"
2006-10-09 13:57:02 (15463) INFO :
2006-10-09 13:57:02 (15463) INFO :
2006-10-09 13:57:02 (15463) INFO :
2006-10-09 13:57:02 (15463) INFO :
2006-10-09 13:57:02 (15463) INFO :
"/sbin/iptables
"/sbin/iptables
"/sbin/iptables
"/sbin/iptables
-A icmp-out -p icmp --icmp-type echo-reply -j ACCEPT"
-A icmp-out -j DROP"
-N log-and-drop"
-A log-and-drop -j LOG --log-ip-options --log-tcp-
"/sbin/iptables -A log-and-drop -j DROP"
"/sbin/iptables -N valid-source-address"
"/sbin/iptables -A valid-source-address -s 127.0.0.1 -j DROP"
"/sbin/iptables -A valid-source-address -s 0.0.0.0/8 -j DROP"
"/sbin/iptables -A valid-source-address -d 255.255.255.255 -j DROP“
2007-07-14 13:28:30 (30269) INFO : Setting service AAMClient to 1
oldconf Files
Backup copies of /etc/vmware/esx.conf file
Located in /var/log/oldconf directory
Created prior to updating the existing file
Only when changes done via VC, VI Client or esxcfg-* scripts
Date and time of backup used as the extension of the file name
esx.conf.2007-06-30_03:56:58
esxupdate.log
In /var/log/vmware directory
History of all updates done via esxupdate tool
Date and PID
Packages installed
Results of the installation
vmkernel-version
In /var/log directory
lists current and all previous kernel build numbers
Found vmkernel version 27701
Found vmkernel version 32039
Questions?
TA19
VI3 Advanced Log Analysis
Mostafa Khalil, VCP
VMware Product Support Engineering
For more information …
http://www.vmware.com