MegaRaid Problems - Can't recreate a Logical disk containing a single physical disk after failure.
While working on a disk replacement on an older Oracle (SUN) X4170 M2 we ran into trouble.
In this config the RAID controller had 4 physical hard disks, each presented as its own logical disk (4 logical disks in total).
No redundancy was configured on the controller. Redundancy was configured in the Solaris 10 ZFS layer, in zpools.
Here is where it got ugly. When a disk died, the MegaRaid logical disk disappeared (no redundancy, so that's expected).
After disk replacement you need to recreate the logical disk (again expected). MegaRaid, however, refused to recreate the logical disk. Try as we might, it only spat out a generic exit code and failure message:
# ./MegaCli -CfgLdAdd -r0 [252:1] -a0
Adapter 0: Configure Adapter Failed
Exit Code: 0x54
The solution is below, but first a little background on this config.
Normal MegaRaid config:
Logical disks:
# ./MegaCli -LDInfo -Lall -aALL |egrep 'Virtual|size|State'
Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
State : Optimal
Virtual Drive: 1 (Target Id: 1)
State : Optimal
Virtual Drive: 2 (Target Id: 2)
State : Optimal
Virtual Drive: 3 (Target Id: 3)
State : Optimal
Physical disks:
# ./MegaCli -PDList -aALL |egrep 'Slot|state|Inq'
Slot Number: 0
Firmware state: Online, Spun Up
Inquiry Data: SEAGATE ST930005SSUN300G06061201Q1ABC
Slot Number: 1
Firmware state: Online, Spun Up
Inquiry Data: SEAGATE ST930003SSUN300G0E71101471DEF
Slot Number: 2
Firmware state: Online, Spun Up
Inquiry Data: SEAGATE ST930005SSUN300G06061201Q1LGHI
Slot Number: 3
Firmware state: Online, Spun Up
Inquiry Data: SEAGATE ST930005SSUN300G06061201Q1LJKL
After the disk failure, this is what we saw.
Logical disks:
# ./MegaCli -LDInfo -Lall -aALL |egrep 'Virtual|State'
Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
State : Optimal
Virtual Drive: 1 (Target Id: 1)
State : Optimal
Virtual Drive: 3 (Target Id: 3)
State : Optimal
# ./MegaCli -LDInfo -L2 -aALL
Adapter 0 -- Virtual Drive Information:
Adapter 0: Virtual Drive 2 Does not Exist.
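Spotting the gap by eye is easy with four drives, less so with more. Here is a minimal sketch that does it for you, assuming the "Virtual Drive: N (Target Id: N)" line format shown in the listings above. The function name missing_target_ids is made up for illustration. It can only see gaps below the highest surviving Target Id, so a lost last drive would go unnoticed.

```shell
# Read `MegaCli -LDInfo -Lall` output on stdin and print every Target Id
# missing from the sequence 0..(highest Target Id seen).
# Assumption: lines look like "Virtual Drive: 3 (Target Id: 3)".
missing_target_ids() {
    sed -n 's/^Virtual Drive: *[0-9][0-9]* *(Target Id: *\([0-9][0-9]*\)).*/\1/p' |
    sort -n |
    awk 'BEGIN { expected = 0 }
         {
             id = $1 + 0
             # Every number skipped between two seen ids is a missing drive.
             while (expected < id) { print expected; expected++ }
             expected = id + 1
         }'
}

# Usage:
#   ./MegaCli -LDInfo -Lall -a0 | missing_target_ids
```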
The replacement disk was now available in the physical disk list:
# ./MegaCli -PDList -aALL |egrep 'Slot|state|Inq|Enc'
Enclosure Device ID: 252
Slot Number: 0
Enclosure position: N/A
Firmware state: Online, Spun Up
Inquiry Data: SEAGATE ST930005SSUN300G06061201Q1ABC
Enclosure Device ID: 252
Slot Number: 1
Enclosure position: N/A
Firmware state: Unconfigured(good), Spun Up
Inquiry Data: SEAGATE ST930003SSUN300G0E71101471ZDEF
Enclosure Device ID: 252
Slot Number: 2
Enclosure position: N/A
Firmware state: Online, Spun Up
Inquiry Data: SEAGATE ST930005SSUN300G06061201Q1LGHI
Enclosure Device ID: 252
Slot Number: 3
Enclosure position: N/A
Firmware state: Online, Spun Up
Inquiry Data: SEAGATE ST930005SSUN300G06061201Q1LJKL
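The [252:1] argument used with -CfgLdAdd is simply the Enclosure Device ID and Slot Number of the unconfigured drive. Here is a small sketch that pulls it out of the -PDList output, assuming the field labels shown in the listing above. unconfigured_good_drives is a hypothetical helper name. On Solaris 10 you may need nawk in place of awk.

```shell
# Read `MegaCli -PDList` output on stdin and print "[enclosure:slot]" for
# every drive whose firmware state is Unconfigured(good).
# Assumption: the "Enclosure Device ID", "Slot Number" and "Firmware state"
# lines appear in that order for each drive, as in the listing above.
unconfigured_good_drives() {
    awk '
        /^Enclosure Device ID:/ { enc = $NF }
        /^Slot Number:/         { slot = $NF }
        /^Firmware state: Unconfigured\(good\)/ { print "[" enc ":" slot "]" }
    '
}

# Usage:
#   ./MegaCli -PDList -aALL | unconfigured_good_drives
```

When you pass the result to MegaCli by hand, quoting it ("[252:1]") keeps the shell from treating the brackets as a filename pattern.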
So we should be able to add the new disk back into a logical disk, but it fails:
# ./MegaCli -CfgLdAdd -r0 [252:1] -a0
Adapter 0: Configure Adapter Failed
Exit Code: 0x54
After trying a few things with the field engineer who was replacing the disk (reinserting it in different slots, etc.), out came my new favorite quote:
"Friends don't let friends MegaRaid"
But here is the solution:
Any logical disk with caching enabled retains the data that was being written at the time of failure. So the cache data for logical disk 2 (LD2) was retained by the controller as preserved cache.
Neither LD2 nor, in fact, any new LD could be created while that cache was being preserved.
Confirm:
# ./MegaCli -GetPreservedCacheList -a0
Adapter #0
Virtual Drive(Target ID 02): Missing.
Exit Code: 0x00
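If you want to script the check, here is a minimal sketch that extracts the affected Target IDs from that output. It assumes the "Virtual Drive(Target ID 02): Missing." line format shown above, which is the only sample we have. preserved_cache_targets is a hypothetical helper name.

```shell
# Read `MegaCli -GetPreservedCacheList` output on stdin and print the
# Target ID of every virtual drive that still has preserved cache,
# with leading zeros stripped (02 -> 2) so it can be reused as -L<n>.
preserved_cache_targets() {
    sed -n 's/^Virtual Drive(Target ID 0*\([0-9][0-9]*\)).*/\1/p'
}

# Usage:
#   ./MegaCli -GetPreservedCacheList -a0 | preserved_cache_targets
```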
Discard the preserved cache:
# ./MegaCli -DiscardPreservedCache -L2 -a0
Adapter #0
Virtual Drive(Target ID 02): Preserved Cache Data Cleared.
Exit Code: 0x00
Try again to recreate the LD:
# ./MegaCli -CfgLdAdd -r0[252:1] -a0
Adapter 0: Created VD 2
Adapter 0: Configured the Adapter!!
Exit Code: 0x00
SUCCESS.
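The confirm, discard and recreate steps can be put together in one function. This is a sketch under assumptions, not a vetted tool: recreate_raid0_ld and the MEGACLI variable are names invented here, and the output matching relies on the single "Target ID 02" sample above. Discarding throws the pinned data away for good. That was fine for us because the disk it belonged to was dead and ZFS held the redundancy.

```shell
# Path to the MegaCli binary; override via the environment if needed.
MEGACLI=${MEGACLI:-./MegaCli}

# Usage: recreate_raid0_ld TARGET_ID "[ENCLOSURE:SLOT]" ADAPTER
# Discards preserved cache for TARGET_ID only if the controller reports
# it, then creates a single-disk RAID 0 logical disk on the given drive.
recreate_raid0_ld() {
    target=$1
    drive=$2
    adapter=$3

    if "$MEGACLI" -GetPreservedCacheList "-a$adapter" |
            grep "Target ID 0*$target)" >/dev/null; then
        "$MEGACLI" -DiscardPreservedCache "-L$target" "-a$adapter" || return 1
    fi

    "$MEGACLI" -CfgLdAdd -r0 "$drive" "-a$adapter"
}

# Example, matching the case above (quote the brackets to avoid globbing):
#   recreate_raid0_ld 2 "[252:1]" 0
```

The target is passed in explicitly rather than discarding everything in the preserved cache list, so cache belonging to some other offline logical disk is left alone.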
Now you can move on to any software-configured mirroring, etc.
In this case we were able to run a zpool replace command.
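For the ZFS side, here is a hedged sketch. The pool name tank and device c0t2d0 in the example are placeholders, not the names from this system, and zfs_replace_in_place and ZPOOL are names invented here. With a single device argument, zpool replace asks ZFS to replace the device with whatever now sits at the same path, which is the usual form after a physical swap.

```shell
# zpool binary; override via the environment if needed.
ZPOOL=${ZPOOL:-zpool}

# Usage: zfs_replace_in_place POOL DEVICE
# Replaces DEVICE in place, then shows pool status so the resilver
# progress can be checked.
zfs_replace_in_place() {
    pool=$1
    device=$2
    "$ZPOOL" replace "$pool" "$device" &&
        "$ZPOOL" status "$pool"
}

# Example (placeholder names):
#   zfs_replace_in_place tank c0t2d0
```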
A little more background on LSI Logic RAID controllers using MegaRaid, MegaCli, storcli
For comparison's sake, we have seen similar config issues on both IBM and Cisco UCS servers. With non-redundant (RAID 0) logical disks this preserved cache issue can arise there as well.
No doubt the vendor or OS package on those systems ships a more up-to-date version of the MegaRaid tooling, i.e. MegaCli64 or storcli. In the case of Cisco UCS we have seen a more meaningful error appear, leading to the solution much quicker:
# ./MegaCli64 -CfgLdAdd -r0[8:13] -a0
Adapter 0: Configure Adapter Failed
FW error description:
The current operation is not allowed because the controller has data in cache for offline
or missing virtual disks.
Exit Code: 0x54
The solution is the same option: -DiscardPreservedCache.