Friday, February 12, 2016

How to Replace a Hard Drive in an Exadata Storage Server (Predictive Failure)


1. If not already on, enable the service LED for the device with the following command, where <ID> is the "name" value provided in the action plan (such as 20:3 in the example below):


CellCLI> alter physicaldisk <ID> serviceled on
CellCLI> alter physicaldisk 20:3  serviceled on

This causes the disk's amber fault LED to blink rapidly as a locate indication.
2. Remove the faulty disk from the drive slot and insert the new disk into the slot.
3. Wait for the green OK LED to light as the system recognizes the new drive. If the service LED was enabled manually, turn it off:

CellCLI> alter physicaldisk 20:3 serviceled off

If that is unsuccessful, you can use
"lsscsi"
"/opt/MegaRAID/MegaCli/MegaCli64 -PDList -a0"

to verify the drive from an OS perspective.

From the cell node, perform the following steps:

1. When you replace a physical disk, the disk must first be acknowledged by the RAID controller before the rest of the system can access it. Log in to the cell server, enter the CellCLI interface, and run the following command, where <ID> is the "name" value provided in the action plan:
CellCLI> LIST PHYSICALDISK <ID> detail
CellCLI> list physicaldisk 20:3 detail
name:                   20:3
...
luns:                   0_3
...
physicalInsertTime:     2012-07-23T19:11:58-04:00
...
slotNumber:             3
status:                 normal
The "status" field should report "normal". Also verify that physicalInsertTime shows the current date and time, not an earlier time.

2. Validate the cell configuration:
CellCLI> alter cell validate configuration
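When scripting this check across many cells, the detail output can be parsed programmatically. A minimal sketch in Python; the field names follow the sample output above, but the helper names are illustrative, and CellCLI itself must supply the input text:

```python
from datetime import datetime, timedelta, timezone

def parse_detail(text):
    """Parse 'attribute: value' lines from CellCLI detail output."""
    attrs = {}
    for line in text.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            attrs[key.strip()] = value.strip()
    return attrs

def check_replacement(attrs, max_age_hours=24):
    """Return True if the disk reports 'normal' and was inserted recently."""
    if attrs.get("status") != "normal":
        return False
    inserted = datetime.fromisoformat(attrs["physicalInsertTime"])
    age = datetime.now(timezone.utc) - inserted
    return age < timedelta(hours=max_age_hours)
```

With the sample output above, `check_replacement` would return False, because the 2012 physicalInsertTime is not recent; for a freshly replaced disk it should return True.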

3. Upon replacement, the LUN, cell disk, and grid disks that existed on the previous disk are automatically re-created on the new disk. The grid disks are added back to their ASM disk groups, and the data is rebalanced onto them based on the disk group redundancy and the asm_power_limit parameter value.
Grid disks and cell disks can be verified with the following CellCLI commands, where the LUN name is reported in the physicaldisk output from step 1 above ("0_3" in this example):
CellCLI> list lun 0_3 detail
name:                   0_3
cellDisk:               CD_03_edx2cel01
...
status:                 normal
CellCLI> list celldisk where lun=0_3 detail
name:                   CD_03_edx2cel01
comment:
creationTime:           2012-07-23T19:12:04-04:00
...
status:                 normal

CellCLI> list griddisk where celldisk=CD_03_edx2cel01 detail
name:                   DATA_Q1_CD_03_edx2cel01
availableTo:
cellDisk:               CD_03_edx2cel01
comment:
creationTime:           2012-07-23T19:13:24-04:00
diskType:               HardDisk
errorCount:             0
id:                     8cd3556f-ee21-4497-9e0d-10b7ed9b4e74
offset:                 32M
size:                   423G
status:                 active

name:                   DBFS_DG_CD_03_edx2cel01
availableTo:
cellDisk:               CD_03_edx2cel01
comment:
creationTime:           2012-07-23T19:14:08-04:00
diskType:               HardDisk
errorCount:             0
id:                     26d6729c-9d24-44de-b8d9-91831e2010d2
offset:                 528.734375G
size:                   29.125G
status:                 active

name:                   RECO_Q1_CD_03_edx2cel01
availableTo:
cellDisk:               CD_03_edx2cel01
comment:
creationTime:           2012-07-23T18:15:31-04:00
diskType:               HardDisk
errorCount:             0
id:                     09f3859b-2e2c-4b68-97ea-96570d1ded29
offset:                 423.046875G
size:                   105.6875G
status:                 active
Status should be "normal" for the cell disk and "active" for the grid disks. All of the creation times should also match the insertion time of the replacement disk.

4. To confirm the status of the rebalance, connect to the ASM instance on a database node and validate that the disks were added back to the ASM disk groups and that a rebalance is running:
SQL> set linesize 132
SQL> col path format a50
SQL> select group_number,path,header_status,mount_status,name from V$ASM_DISK where path like '%CD_03_edx2cel01';
GROUP_NUMBER PATH                                         HEADER_STATU MOUNT_S NAME
------------ -------------------------------------------- ------------ ------- ------------------------------
1 o/192.168.9.9/DATA_Q1_CD_03_edx2cel01         MEMBER       CACHED  DATA_Q1_CD_03_edx2cel01
2 o/192.168.9.9/DBFS_DG_CD_03_edx2cel01         MEMBER       CACHED  DBFS_DG_CD_03_edx2cel01
3 o/192.168.9.9/RECO_Q1_CD_03_edx2cel01         MEMBER       CACHED  RECO_Q1_CD_03_edx2cel01

SQL> select * from gv$asm_operation;
INST_ID GROUP_NUMBER OPERA STAT      POWER     ACTUAL      SOFAR   EST_WORK   EST_RATE
---------- ------------ ----- ---- ---------- ---------- ---------- ---------- ----------
EST_MINUTES ERROR_CODE
----------- --------------------------------------------
2            3 REBAL WAIT         10

1            3 REBAL RUN          10         10       1541       2422
7298           0
An active rebalance operation is identified by STATE=RUN. The GROUP_NUMBER and INST_ID columns identify the disk group being rebalanced and the instance where the operation is running. The rebalance operation is complete when the above query returns "no rows selected".
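The completion check can be scripted as a polling loop. A hedged sketch: `fetch_rebalance_rows` below is a placeholder for however you query gv$asm_operation (for example via an Oracle client library), not a real API:

```python
import time

def wait_for_rebalance(fetch_rebalance_rows, poll_seconds=60, timeout_seconds=4 * 3600):
    """Poll until gv$asm_operation returns no rows (rebalance finished).

    fetch_rebalance_rows: callable returning the current rows of
    gv$asm_operation (an empty sequence when the rebalance is done).
    Returns True on completion, False if the timeout expires first.
    """
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        rows = fetch_rebalance_rows()
        if not rows:
            return True   # "no rows selected" -> rebalance complete
        time.sleep(poll_seconds)
    return False          # still rebalancing at timeout
```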

Verify that all disks are back online in their disk groups (MODE_STATUS = ONLINE or MOUNT_STATUS = CACHED):
SQL> select group_number,failgroup,mode_status,count(*) from v$asm_disk group by group_number,failgroup,mode_status;
If the new grid disks were not automatically added back into the ASM disk group configuration, then locate the disks with group_number=0 and add them back manually using the "alter diskgroup <name> add disk '<path>' rebalance power 10;" command:


SQL> select path,header_status from v$asm_disk where group_number=0;
PATH                                               HEADER_STATU
-------------------------------------------------- ------------
o/192.168.9.9/DBFS_DG_CD_03_edx2cel01        FORMER
o/192.168.9.9/DATA_Q1_CD_03_edx2cel01        FORMER
o/192.168.9.9/RECO_Q1_CD_03_edx2cel01        FORMER

SQL> alter diskgroup dbfs_dg add disk 'o/192.168.9.9/DBFS_DG_CD_03_edx2cel01' rebalance power 10;
SQL> alter diskgroup data_q1 add disk 'o/192.168.9.9/DATA_Q1_CD_03_edx2cel01' rebalance power 10;
SQL> alter diskgroup reco_q1 add disk 'o/192.168.9.9/RECO_Q1_CD_03_edx2cel01' rebalance power 10;
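Building those ALTER DISKGROUP statements by hand is error-prone; they can be derived from the grid disk names. A sketch assuming the standard Exadata grid disk naming convention <DISKGROUP>_CD_<nn>_<cellname> (verify it matches your own naming before use):

```python
def add_disk_sql(paths, power=10):
    """Build ALTER DISKGROUP statements from Exadata grid disk paths.

    Assumes grid disks are named <DISKGROUP>_CD_<nn>_<cellname>, so the
    disk group name is everything before the '_CD_' marker.
    """
    stmts = []
    for path in paths:
        griddisk = path.rsplit("/", 1)[-1]     # e.g. DATA_Q1_CD_03_edx2cel01
        diskgroup = griddisk.split("_CD_")[0]  # e.g. DATA_Q1
        stmts.append(
            f"alter diskgroup {diskgroup.lower()} add disk '{path}' "
            f"rebalance power {power};"
        )
    return stmts
```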


5. If the disk replaced was a system disk in slot 0 or 1, then the status of the OS volume should also be checked. Log in as 'root' on the storage cell and check the status using the 'df' and 'mdadm' commands:
[root@dbm1cel1 ~]# mdadm -Q --detail /dev/md5
/dev/md5:
Version : 0.90
Creation Time : Thu Mar 17 23:19:42 2011
Raid Level : raid1
Array Size : 10482304 (10.00 GiB 10.73 GB)
Used Dev Size : 10482304 (10.00 GiB 10.73 GB)
Raid Devices : 2
Total Devices : 3
Preferred Minor : 5
Persistence : Superblock is persistent
Update Time : Wed Jul 18 11:56:36 2012
State : active, degraded
Active Devices : 1
Working Devices : 2
Failed Devices : 1
Spare Devices : 1

UUID : e75c1b6a:64cce9e4:924527db:b6e45d21
Events : 0.215

Number   Major   Minor   RaidDevice State
3      65      213        0      spare rebuilding   /dev/sdad5
1       8       21        1      active sync   /dev/sdb5

2       8        5        -      faulty spare

[root@dbm1cel1 ~]#

While the system disk is rebuilding, the state will show as "active, degraded" or "active, degraded, recovering", with one device listed as rebuilding and a third listed as 'faulty'. Once the rebuild has started, re-running this command will include a "Rebuild Status: X% complete" line in the output. When the system disk sync is complete, the state should return to "clean" with only 2 devices.
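To script this check, the key fields can be pulled out of the `mdadm -Q --detail` output. A minimal sketch, with field labels taken from the sample output above:

```python
def md_state(detail_text):
    """Extract 'State' and optional 'Rebuild Status' from mdadm --detail output."""
    info = {}
    for line in detail_text.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            info[key.strip()] = value.strip()
    state = info.get("State", "")
    return {
        "state": state,
        "degraded": "degraded" in state,
        "rebuild": info.get("Rebuild Status"),  # e.g. "45% complete", or None
    }
```

A monitoring wrapper could simply loop until `degraded` is False and `rebuild` is None.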

Hiccups after replacement:


If, after replacement, you find the disk is not being imported, try clearing the RAID controller's preserved (pinned) cache.

status:                 import failure

CD_01_mytestkitcel02    normal
CD_02_mytestkitcel02    not present
...

name:                   20:02
deviceId:               21
diskType:               HardDisk
enclosureDeviceId:      20
errMediaCount:          0
errOtherCount:          0
makeModel:              "HITACHI H7230AS60SUN2.88T"
physicalFirmware:       A142
physicalInterface:      sas
physicalSerial:         xxxxxx
physicalSize:           2.7290234193205833T
slotNumber:             3
status:                 import failure

Slot Number: 3
Firmware state: Unconfigured(good), Spun Up

Try the import with the force option:


CellCLI> import celldisk CD_02_mytestkitcel02 force

CELL-04560: Cannot complete import of cell disk CD_02_mytestkitcel02. Received Error: CELL-04549: Cannot obtain LUN for cell disk: CD_02_mytestkitcel02.

Cell disks not imported: Celldisk: CD_02_mytestkitcel02

1) Check the pinned cache:
# MegaCli64 -GetPreservedCacheList -a0
2) Discard the cache:
# MegaCli64 -DiscardPreservedCache -L<n> -force -a0   (where <n> is the logical drive number, 3 in this example)
3) Validate the cache was discarded:
# MegaCli64 -GetPreservedCacheList -a0
4) Create the LUN:
# MegaCli64 -CfgLdAdd -R0 [20:02] WB NORA Direct NoCachedBadBBU -strpsz1024 -a0
5) Check whether the LUN was created:
CellCLI> list lun
6) CellCLI> list physicaldisk <ID> detail


If the disk shows a foreign configuration:

Firmware state: Unconfigured(good), Spun Up
Foreign State: Foreign

# MegaCli64 -CfgForeign -Clear -a0


If you get an error, try the following:


# MegaCli64 -CfgForeign -Clear -aALL
# MegaCli64 -CfgLdAdd -R0 [20:02] WB NORA Direct NoCachedBadBBU -strpsz1024 -a0

# MegaCli64 -pdlist -a0 | egrep 'Slot Number|Firmware state'
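The egrep output pairs each "Slot Number" line with the following "Firmware state" line. A small parser (line labels as in the MegaCli output above) makes it easy to spot disks that are not "Online, Spun Up":

```python
def slot_states(pdlist_text):
    """Pair 'Slot Number' lines with their 'Firmware state' lines
    from filtered MegaCli -pdlist output."""
    states = {}
    slot = None
    for line in pdlist_text.splitlines():
        line = line.strip()
        if line.startswith("Slot Number:"):
            slot = int(line.split(":", 1)[1])
        elif line.startswith("Firmware state:") and slot is not None:
            states[slot] = line.split(":", 1)[1].strip()
            slot = None
    return states
```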


Check whether the new disk's serial number is reflected under the cell detail. If not, the cell may have to be dropped and recreated. Once the disk shows normal from the RAID controller, check that the physical disk shows 'normal' in CellCLI. If it still shows 'import failure', it is time to verify the firmware level.

Verify the FW level:

# MegaCli64 -PDList -a0 -NoLog | grep Firmware

This might require bringing down the MS service before you upgrade the firmware.

#CheckHWnFWProfile -action updatefw -component HardDisk -attribute all_fw -slot 20:02


 Then verify its status: 


# MegaCli64 -pdlist -a0 | egrep 'Slot Number|Firmware state'


After the firmware upgrade, the disk in the "import failure" state needs to be dropped and re-enabled:

alter physicaldisk 20:02 drop for replacement

If that fails, try the following:

alter physicaldisk 20:02 drop for replacement force


Then reenable:
alter physicaldisk 20:02 reenable


If required, create the ASM disks manually (Doc ID 1281395.1).

 
For more details, refer to Doc ID 1390836.1 and Doc ID 1113013.1.

