Wednesday, October 19, 2016

Cluster Time Synchronization Service (CTSS) on RAC

Cluster Time Synchronization Service (CTSS) is installed as part of Oracle Clusterware.
It runs in one of two modes:


 Observer mode

   When an existing time synchronization service such as NTP is configured on the system, CTSS runs in Observer mode.
  CRS-4700: The Cluster Time Synchronization Service is in Observer mode.


 Active mode

   When CTSS detects that no time synchronization service is running, or that its configuration is broken on the system, CTSS runs in Active mode.

To change CTSS from Observer to Active mode, the cluster must be stopped. Once the cluster is stopped on all nodes, stop the NTP service and deconfigure it.
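For example, on a Linux node the sequence might look like the following (a sketch only; service names and configuration file paths vary by platform and release):

# <GI_HOME>/bin/crsctl stop crs           <========= run on every node
# service ntpd stop                       <========= stop the NTP daemon
# chkconfig ntpd off                      <========= keep it from starting at boot
# mv /etc/ntp.conf /etc/ntp.conf.org      <========= deconfigure NTP so CTSS no longer detects it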


Once the cluster is started, CTSS will be in Active mode.
CRS-4701: The Cluster Time Synchronization Service is in Active mode.
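The current mode can be verified at any time with crsctl; output similar to the following is expected (the offset value shown is illustrative):

$ crsctl check ctss
CRS-4701: The Cluster Time Synchronization Service is in Active mode.
CRS-4702: Offset (in msec): 0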



Wednesday, May 4, 2016

Ocrcheck: Logical corruption check failed: How to backup and recover OLR

Oracle Local Registry (OLR) was introduced with 11gR2/12c Grid Infrastructure. It contains the local, node-specific configuration required by OHASD and is not shared between nodes; in other words, every node has its own OLR. This note provides steps to back up or restore the OLR.


Solution


OLR location

The OLR location pointer file is '/etc/oracle/olr.loc' or '/var/opt/oracle/olr.loc' depending on platform. The default location after installing Oracle Clusterware is:
GI Cluster: <GI_HOME>/cdata/<hostname>.olr
GI Standalone (Oracle Restart): <GI_HOME>/cdata/localhost/<hostname>.olr


To backup

The OLR is backed up automatically during GI configuration (installation or upgrade). In contrast to the OCR, the OLR is NOT backed up again automatically after GI is configured; only manual backups can be taken. To take a manual backup of the OLR, use the following command:
# <GI_HOME>/bin/ocrconfig -local -manualbackup


To list backups

To list the backups currently available:
# <GI_HOME>/bin/ocrconfig -local -showbackup
node1 2010/12/14 14:33:20 /u01/app/oracle/grid/11.2.0.1/cdata/node1/backup_20101214_143320.olr
node1 2010/12/14 14:33:17 /u01/app/oracle/grid/11.2.0.1/cdata/node1/backup_20101214_143317.olr
 
 Clusterware maintains the history of the five most recent manual backups and will not update or delete a manual backup after it has been created.
'ocrconfig -local -showbackup' lists the manual backups recorded in the registry even if the backup files have since been removed or archived at the OS level by OS commands.




To restore

Make sure the GI stack is completely down and ohasd.bin is not running; use the following command to confirm:


ps -ef| grep ohasd.bin

This should return no process. If ohasd.bin is still up and running, stop it on the local node:


# <GI_HOME>/bin/crsctl stop crs -f  <========= for GI Cluster

OR

# <GI_HOME>/bin/crsctl stop has  <========= for GI Standalone

Once it's down, restore with the following command:
# <GI_HOME>/bin/ocrconfig -local -restore <olr-backup>

If the command fails, create a dummy OLR, set the correct ownership and permissions, and retry the restore command:


# cd <OLR location>
# touch <hostname>.olr
# chmod 600 <hostname>.olr
# chown <grid>:<oinstall> <hostname>.olr

Once it's restored, GI can be brought up:
# <GI_HOME>/bin/crsctl start crs   <========= for GI Cluster

OR

$ <GI_HOME>/bin/crsctl start has  <========= for GI Standalone, this must be done as grid user.
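Once the stack is up, the restored OLR can be sanity-checked with ocrcheck; a clean run should report the integrity check as succeeded (sizes and paths will differ per system):

# <GI_HOME>/bin/ocrcheck -local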


Error:
====
ocrcheck
Status of Oracle Cluster Registry is as follows :
  Version : 2
  Total space (kbytes) : 1024372
  Used space (kbytes) : 3876
  Available space (kbytes) : 1020496
  ID : 916021765
  Device/File Name : /dev/ocr_disk1
  Device/File integrity check succeeded
  Device/File Name : /dev/ocr_disk2
  Device/File integrity check succeeded

  Cluster registry integrity check succeeded


  Logical corruption check failed


Changes

Most likely an OCR parameter was incorrectly set.


Cause

In this case, the -force option was used to set the diagwait parameter while CRS 11.1.0.7 was up and running on one or more nodes.
This resulted in two keys pointing to the same keyname.


Solution

To fix the logical corruption:
1. Restore a consistent backup of the OCR. The backup must be from before the change that introduced the corruption.
See steps in OCR / Vote disk Maintenance Operations: (ADD/REMOVE/REPLACE/MOVE) (Doc ID 428681.1)

2. If an OCR backup is not available, then rebuild the OCR.
See steps in  How to Deconfigure/Reconfigure(Rebuild OCR) or Deinstall Grid Infrastructure (Doc ID 1377349.1)

References

NOTE:428681.1 - OCR / Vote disk Maintenance Operations: (ADD/REMOVE/REPLACE/MOVE)
NOTE:1377349.1 - How to Deconfigure/Reconfigure(Rebuild OCR) or Deinstall Grid Infrastructure

Tuesday, March 8, 2016

Griddisks not coming ONLINE after execution of alter griddisk all active

SYMPTOMS:

Grid disks do not come ONLINE after running "alter griddisk all active" while a cell-to-cell offload operation is in progress, such as a resilvering operation during ASM disk resync, or an "alter diskgroup ... drop disks" statement in ASM.

Affected Versions:

Grid Infrastructure version is 12.1.0.2, with a bundle patch level lower than 12.1.0.2.160119.

The fix for bug 21218243 is installed in the Grid Infrastructure home. This fix was included in bundle patches 12.1.0.2.11, 12.1.0.2.12, and 12.1.0.2.13, and was also provided as a backport to some earlier bundle patches. The fix for bug 22304421 is not installed in the Grid Infrastructure home; this is the resolution fix for this issue.

Workaround:

Step 1 - Verify the cellsrv process is running on all storage servers, and restart it if necessary.

CellCLI> list cell attributes cellsrvStatus
         stopped

CellCLI> alter cell startup services cellsrv

Starting CELLSRV services...
The STARTUP of CELLSRV services was successful.

Step 2 - Set initialization parameter _kxdbio_disable_offload_opcode=1 in all ASM instances.
This parameter may be set while ASM remains online.  For example,
SQL> alter system set "_kxdbio_disable_offload_opcode"=1;
System altered.
_kxdbio_disable_offload_opcode=1 disables cell-to-cell offload operations, such as cell-to-cell offloaded resilvering during ASM disk resync.  Such operations will proceed without direct cell-to-cell communication when this parameter is set as directed.  Other offloaded operations, such as query, incremental backup, and file creation, are unaffected when setting _kxdbio_disable_offload_opcode=1.


Note: This workaround should be used as a temporary remedy only and should be removed once the recommended fix is applied.

 On 2015-Nov-09, patch 21976960 was released to correct this issue. Subsequently, it was discovered that patch 21976960 did not prevent the issue in all possible scenarios, and it was withdrawn. On 2015-Dec-22, patch 22304421 replaced patch 21976960. If patch 21976960 was previously installed, it is not necessary to roll it back prior to taking further action.
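Once the fix for bug 22304421 is installed, the workaround parameter should be returned to its default in all ASM instances. Assuming the default value of 0 (verify the default for your version before applying), something like:

SQL> alter system set "_kxdbio_disable_offload_opcode"=0;
System altered.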


Wednesday, February 17, 2016

What is BBU and Learn Cycle

Battery Monitoring via Learn Cycles

Learn cycles are run periodically to fully discharge and then recharge the battery. When complete, the BBU determines the new capacity of charge the battery can hold. Failure to run learn cycles at their recommended intervals may reduce the usable life of the battery by reducing the full charge capacity more rapidly, leading to a premature end of service life. This is reported in the "Full Charge Capacity" field of the MegaCLI BBU output and is updated after a learn cycle. Refer to the next section for an example.

When a learn cycle is initiated, the charging circuit automatically places any virtual drives that are in WB mode into WT mode for the duration of the cycle which will temporarily reduce write performance. Once the learn cycle completes, the virtual drives are automatically transitioned back to WB mode if the battery is still capable of holding the required charge amount. Learn cycle time will vary dependent on the BBU type.

For BBU07 the complete learn cycle process and the cache in WT mode is expected to be 6 to 8 hours.
For BBU08 the complete learn cycle process and the cache in WT mode is expected to be 2 to 3 hours.

Note, when a new BBU is installed into a system, it will have a depleted charge state. Any virtual drives attached will be forced into WT cache mode while a full learn cycle is performed. Usually a sufficient charge to maintain the cache is reached after this cycle is complete. This may take 24 hours or longer.

To determine the Battery Type, run the following:
# ./MegaCli64 -AdpBbuCmd -a0 | grep BatteryType
BatteryType: iBBU08


To determine which ASIC version of the LSI controller is present:
#lspci | grep RAID
13:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 2108 [Liberator] (rev 05)
#


To check the firmware version:
#./MegaCli64 -AdpAllInfo -a0 | grep "FW Package Build"
FW Package Build: 12.12.0-0178
#

Learn cycles on Exadata are configured by default as follows:
    Storage cells with image 11.2.1.2.x: the learn cycle occurs monthly from first power-on.
    Storage cells with image 11.2.1.3.1 or later: the learn cycle is manually scheduled quarterly.
    Database nodes are set for an automatic schedule, which occurs every 30 days from first power-on.
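On the database nodes, the controller's automatic learn-cycle settings can be inspected with MegaCli (a sketch; option availability depends on the MegaCli and controller firmware versions in use):

# ./MegaCli64 -AdpBbuCmd -GetBbuProperties -a0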

To change the start time on Storage Cells for when the learn cycle occurs, use a command similar to the following. The time reverts to the default learn cycle time after the cycle completes:

CellCLI> ALTER CELL bbuLearnCycleTime="2011-01-22T02:00:00-08:00"

To check the current bbuLearnCycleTime:

CellCLI> list cell attributes bbuLearnCycleTime
         2016-04-17T02:00:00+01:00

CellCLI>
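To check the setting across all storage cells at once, dcli can be used from a database node, assuming the usual cell_group file (same pattern as the dcli checks used elsewhere in these notes):

# dcli -g cell_group -l root cellcli -e "list cell attributes name,bbuLearnCycleTime"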


# ./MegaCli64 -AdpBbuCmd -a0 | grep "Learn Cycle" 
Learn Cycle Requested        : No
 Learn Cycle Active           : No
 Learn Cycle Status           : OK
 Learn Cycle Timeout          : No


Battery Charge Condition Requirements & Replacement Guidelines

The absolute minimum BBU07 charge required to meet the minimum 48 hours hold-up time is 600mAh.
When the BBU07 can no longer hold this much charge, MegaCli64 reports it by changing the "Remaining Capacity Low" setting from the normal "No" to "Yes", which may be an early warning to check whether the "Full Charge Capacity" is getting low.
The absolute minimum BBU08 charge required to meet the minimum 48 hours hold-up time is 674mAh.
Note, on BBU08 this may be flagged prematurely due to a firmware bug (Sun CR 7018730) that incorrectly sets the value higher at 960mAh based on incorrect operational assumptions. If this is being flagged due to this bug, ignore the alert if the "Full Charge Capacity" value is over 800mAh.

# ./MegaCli64 -AdpBbuCmd -a0 | grep "Remaining Capacity Low"
  Remaining Capacity Low       : Yes
   
# ./MegaCli64 -AdpBbuCmd -a0 | grep "Capacity: "
Remaining Capacity: 597 mAh
Full Charge Capacity: 612 mAh
Design Capacity: 1215 mAh


In this condition, the BBU can no longer support the cache for the duration required and needs replacement immediately. All virtual drives on a system with this set will be forced into WT mode to protect data until it is replaced, reducing performance until then.

Another parameter that needs to be checked is Max Error. Max Error indicates whether the reported battery condition reading is accurate. An error of less than 10% is considered a valid condition reading. If it is 10% or greater, the battery condition cannot be reported reliably and the BBU is treated as failed.
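The value can be read the same way as the other BBU fields; the percentage shown below is illustrative:

# ./MegaCli64 -AdpBbuCmd -a0 | grep "Max Error"
Max Error                    : 2 %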

BBU07 -
    1. Check the battery status and replace the battery if the full charge capacity after a learn cycle is less than 600 mAh.
BBU08 -
    1. Check the battery status and replace the battery if the full charge capacity after a learn cycle is less than 674 mAh, regardless of any other BBU output field.
    2. Check the battery status and replace the battery if the Max Error rate reported is 10% or greater.

Guidelines for proactive battery replacement (recommended within the next 60 days) are as follows:

BBU07 -
    1. Replace the battery module after a 3-year service life, assuming the battery temperature does not exceed 45C. If the temperature exceeds 45C (battery temperature shall not exceed 49C), replace the battery every 2 years.
BBU08 -
    1. Replace the battery module after a 3-year service life, assuming the battery temperature does not exceed 45C. If the temperature exceeds 45C (battery temperature shall not exceed 55C), replace the battery every 2 years.

# ./MegaCli64 -AdpBbuCmd -a0 | grep "Temperature"
Temperature: 47 C
Temperature             : High
Over Temperature        : Yes


The virtual drive on this DB node is currently in WT and will remain so until the temperature drops and the BBU resumes charging.

# ./MegaCli64 -LDInfo -LALL -aALL | grep "Cache Policy"
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU
Disk Cache Policy: Disabled

Monday, February 15, 2016

How to backup cell config


 How to back up the cell config file?

 1. Stop cell services on the cell server having the disk issue:

cellcli> alter cell shutdown services all

2. Move the cell_disk_config* files:

cd $OSSCONF
mv cell_disk_config* /tmp/

3. Start all services:

cellcli> alter cell startup services all
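If the goal is only to keep a backup copy of the configuration (rather than forcing it to be regenerated), the files can instead be copied to a safe location while the services are down, for example (destination directory is arbitrary):

# mkdir -p /root/cellconf_backup
# cp -p $OSSCONF/cell_disk_config* /root/cellconf_backup/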

Friday, February 12, 2016

How to resolve error "CELL-02772: Cannot create or alter flash cache due to unequal cell disk sizes."

Flash F40 cards may come from field replacement stock with FMODs that have been through a complete erasure process using a Secure Erase method (which ensures absolutely no data is left on the device). In some cases, the secure erase procedure can cause the visible size of the FMODs to change to a capacity slightly larger than expected. This happens because a configuration area of the device is reset as a result of the Secure Erase needing to touch every sector on the FMOD.
On Engineered Systems (Exadata, SPARC SuperCluster, Exalytics), use the "ddoemcli" utility to change the "max visible sectors" to the correct setting.
For non-Engineered Systems, another solution is available that uses the Linux hdparm utility to change the "maximum visible sectors" to the expected size.


# cellcli -e create flashcache all
CELL-02772: Cannot create or alter flash cache due to unequal cell disk sizes.
[root@mytestkit ~]#

Line 176: lunSize: 93.13225793838501G
Line 187: lunSize: 93.13225793838501G
Line 198: lunSize: 93.13225793838501G
Line 209: lunSize: 93.13225793838501G
Line 220: lunSize: 93.13225793838501G
Line 231: lunSize: 93.13225793838501G
Line 242: lunSize: 93.13225793838501G
Line 253: lunSize: 93.13225793838501G
Line 264: lunSize: 93.1604232788086G
Line 275: lunSize: 93.1604232788086G
Line 286: lunSize: 93.1604232788086G
Line 297: lunSize: 93.1604232788086G
Line 308: lunSize: 93.13225793838501G
Line 319: lunSize: 93.13225793838501G
Line 330: lunSize: 93.13225793838501G
Line 341: lunSize: 93.13225793838501G
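A listing like the one above can be produced directly from CellCLI by querying the flash LUN sizes (attribute names as used elsewhere in CellCLI output; adjust if your version differs):

# cellcli -e list lun attributes name, lunSize where diskType=FlashDisk

In this case four of the sixteen flash LUNs report 93.1604232788086G instead of the expected 93.13225793838501G, which is what triggers CELL-02772.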



=== Action Plan ===

Using the "ddoemcli" utility
NOTE - ENSURE THAT THE AFFECTED FLASH MODULES ARE DROPPED FROM ASM. (This should have been done automatically upon failure)
1. Read the current maximum visible sectors of the FMOD using the "DDOEMCLI" Command
./ddoemcli -c 1 -list
(-c 1 = controller 1, -list means list all output)
NOTE - Flash Card slot # does not match the controller #
F40 slot 1 = controller 3
F40 Slot 2 = controller 4
F40 slot 4 = controller 2
F40 slot 5 = controller 1

2. Change the maximum user-addressable LBAs to the LSI factory setting using the ddoemcli -c # -format -op -maxlba command:
[root@mytestkit ~]# ./ddoemcli -c 1 -format -slot 4 -op -maxlba 195312500 -s
****************************************************************************
LSI Corporation WarpDrive Management Utility
Version 110.00.07.00 (2013.08.12)
Copyright (c) 2013 LSI Corporation. All Rights Reserved.
****************************************************************************
LSI WarpDrive Management Utility: Please wait. Format of WarpDrive is in
progress..
Selected over-provisioning level change is done successfully.
LSI WarpDrive Management Utility: WarpDrive format successfully completed.
LSI WarpDrive Management Utility: Execution completed successfully.

The command above is the one used to program the correct MaxLBA value.
3. Repeat the steps for any other FMODs on the F40 controller that also need to be adjusted.
Typically the 4 FMODs are numbered slots 4, 5, 6 and 7 on each F40 card.
NOTE - ADDITIONAL STEPS WILL BE REQUIRED IN CELLSRV/ASM TO COMPLETE THE FLASH ASSEMBLY REPLACEMENT.
A REBOOT MAY ALSO BE NECESSARY


When completed, all Flash Disks should show the same size, as shown in the below example:
CellCLI> LIST PHYSICALDISK attributes name ,physicalSize where name like '.*FLASH.*'


For more detail, refer to Doc ID 1671022.1.

How to enable Write-Back flash cache

Write-back flash cache significantly improves performance for write-intensive applications: it caches write I/Os directly on PCI flash, in addition to read I/Os, instead of writing straight to the hard disks. Look for significant "free buffer waits" or high I/O times in AWR reports to identify a write bottleneck; if present, it is time to consider this mode.
 It is available from storage software release 11.2.3.2.0 (V2 hardware and later).
 FlashCache is "WriteThrough" by default.


LIST CELL shows the current value.
CELLCLI> list cell attributes flashcachemode
WriteThrough


  How to enable Write-Back Flash Cache:
 Two methods are available:
  1. Rolling Method
  2. Non-Rolling Method
Note: Before performing the steps below, perform the following checks as root from one of the compute nodes:
Check all griddisk “asmdeactivationoutcome” and “asmmodestatus” to ensure that all griddisks on all cells are “Yes” and “ONLINE” respectively.

 # dcli -g cell_group -l root cellcli -e list griddisk attributes asmdeactivationoutcome, asmmodestatus



Check that all of the flashcache are in the “normal” state and that no flash disks are in a degraded or critical state:


# dcli -g cell_group -l root cellcli -e list flashcache detail


Rolling Method:
 (Assuming that the RDBMS and ASM instances are UP and Write-Back Flash Cache is enabled on one cell server at a time)


Login to Cell Server:
 Step 1. Drop the flash cache on that cell
#cellcli -e drop flashcache
Step 2. Check whether ASM will be OK if the grid disks go OFFLINE. The following command should return 'Yes' for the grid disks being listed:
# cellcli -e list griddisk attributes name,asmmodestatus,asmdeactivationoutcome

Step 3. Inactivate the griddisk on the cell
# cellcli -e alter griddisk all inactive

Step 4. Shut down cellsrv service
# cellcli -e alter cell shutdown services cellsrv

Step 5. Set the cell flashcache mode to writeback
# cellcli -e "alter cell flashCacheMode=writeback"

Step 6. Restart the cellsrv service
# cellcli -e alter cell startup services cellsrv

Step 7. Reactivate the griddisks on the cell
# cellcli -e alter griddisk all active

Step 8. Verify all grid disks have been successfully put online using the following command:
# cellcli -e list griddisk attributes name, asmmodestatus

Step 9. Recreate the flash cache
# cellcli -e create flashcache all

Step 10. Check the status of the cell to confirm that it's now in WriteBack mode:
# cellcli -e list cell detail | grep flashCacheMode

Step 11. Repeat the same steps on each remaining cell, one cell at a time, through the final cell. Before taking another storage server offline, execute the following and make sure 'asmdeactivationoutcome' displays YES for all grid disks:
# cellcli -e list griddisk attributes name,asmmodestatus,asmdeactivationoutcome
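Once the final cell has been converted, a quick cluster-wide confirmation can be run with dcli (same cell_group assumption as the earlier checks); every cell should report writeback:

# dcli -g cell_group -l root cellcli -e "list cell attributes name,flashcachemode"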


Non-Rolling Method:
 (Assuming that RDBMS & ASM instances are DOWN while enabling Write-Back Flash Cache)
Step 1. Drop the flash cache on that cell
# cellcli -e drop flashcache

Step 2. Shut down cellsrv service
# cellcli -e alter cell shutdown services cellsrv

Step 3. Set the cell flashcache mode to writeback
# cellcli -e "alter cell flashCacheMode=writeback"

Step 4. Restart the cellsrv service
# cellcli -e alter cell startup services cellsrv

Step 5. Recreate the flash cache
# cellcli -e create flashcache all



Write-Back Flash Cache Not Required for DiskGroup:
 Note: Write-back flash cache can be disabled for grid disks belonging to disk groups, such as RECO, that do not require this feature. This can save space in the flash cache.
The CACHINGPOLICY attribute can be used to change the flash cache policy of a grid disk.
Before changing the cache policy from the default to none, ensure there is no cached data in the flash cache for the grid disk:
CellCLI> create griddisk all harddisk prefix=RECO, size=1006, cachingPolicy="none"
OR
CELLCLI>ALTER GRIDDISK grid_disk_name FLUSH
CELLCLI>ALTER GRIDDISK grid_disk_name CACHINGPOLICY="none"



Flushing the data from Flash Cache to Disk – Manual Method:
 Data that has not yet been synchronized to the grid disk can be synchronized (flushed) using the FLUSH option.
CELLCLI>ALTER GRIDDISK grid_disk_name FLUSH
Use the following command to check the progress of this activity:
CELLCLI>LIST GRIDDISK ATTRIBUTES name, flushstatus, flusher


 Reinstating WriteThrough FlashCache:
  1. To reinstate Writethrough caching, FlashCache must first be flushed.
  2. FlashCache must then be dropped and cellsrv stopped.
Step 1. CELLCLI> alter flashcache all flush
Step 2. CELLCLI> drop flashcache
Step 3. CELLCLI> alter cell shutdown services cellsrv
Step 4. CELLCLI> alter cell flashCacheMode = WriteThrough
Step 5. CELLCLI> alter cell startup services cellsrv

How to Replace a Hard Drive in an Exadata Storage Server (Predictive Failure)

How to Replace a Hard Drive in an Exadata Storage Server (Cell) (Predictive Failure)


1. If not already on, enable the service LED for the device with the following command, where <ID> is the "name" value provided in the action plan (such as 20:3 in the example below):

CellCLI> alter physicaldisk <ID> serviceled on
CellCLI> alter physicaldisk 20:3  serviceled on

This will cause the disk's amber fault LED to blink rapidly as a locate indication.
2. Remove the faulty disk from the drive slot and insert the new disk into the slot.
3. Wait for the green OK LED as the system recognizes the new drive. Turn the service LED off if it was enabled manually:
CellCLI> alter physicaldisk 20:3 serviceled off

If that is unsuccessful, you can use:
"lsscsi"
"/opt/MegaRAID/MegaCli/MegaCli64 -PDList -a0"

to verify the drive from an OS perspective.

From Cell Node perform below steps:

1. When you replace a physical disk, the disk must first be acknowledged by the RAID controller before the rest of the system can access it. Log in to the cell server, enter the CellCLI interface, and run the following command, where <ID> is the "name" value provided in the action plan:

CellCLI> LIST PHYSICALDISK <ID> detail
CellCLI> list physicaldisk 20:3 detail
name:                   20:3
...
luns:                   0_3
...
physicalInsertTime:     2012-07-23T19:11:58-04:00
...
slotNumber:             3
status:                 normal
The "status" field should report "normal". Note also that the physicalInsertTime should be current date and time, and not an earlier time.
CellCLI> alter cell validate configuration

 3. Upon replacement, the LUN, cell disk, and grid disks that existed on the previous disk are automatically re-created on the new disk; the grid disks are added back into their ASM disk groups and the data is rebalanced onto them, based on the disk group redundancy and the asm_power_limit parameter value.
Grid disks and cell disks can be verified with the following CellCLI command, where the lun name is reported in the physicaldisk output from step 1 above ("0_3" in this example"):
CellCLI> list lun 0_3 detail
name:                   0_3
cellDisk:               CD_03_edx2cel01
...
status:                 normal
CellCLI> list celldisk where lun=0_3 detail
name:                   CD_03_edx2cel01
comment:
creationTime:           2012-07-23T19:12:04-04:00
...
status:                 normal
CellCLI> list griddisk where celldisk=CD_03_edx2cel01 detail
name:                   DATA_Q1_CD_03_edx2cel01
availableTo:
cellDisk:               CD_03_edx2cel01
comment:
creationTime:           2012-07-23T19:13:24-04:00
diskType:               HardDisk
errorCount:             0
id:                     8cd3556f-ee21-4497-9e0d-10b7ed9b4e74
offset:                 32M
size:                   423G
status:                 active
name:                   DBFS_DG_CD_03_edx2cel01
availableTo:
cellDisk:               CD_03_edx2cel01
comment:
creationTime:           2012-07-23T19:14:08-04:00
diskType:               HardDisk
errorCount:             0
id:                     26d6729c-9d24-44de-b8d9-91831e2010d2
offset:                 528.734375G
size:                   29.125G
status:                 active
name:                   RECO_Q1_CD_03_edx2cel01
availableTo:
cellDisk:               CD_03_edx2cel01
comment:
creationTime:           2012-07-23T18:15:31-04:00
diskType:               HardDisk
errorCount:             0
id:                     09f3859b-2e2c-4b68-97ea-96570d1ded29
offset:                 423.046875G
size:                   105.6875G
status:                 active
Status should be normal for the cell disks and active for the grid disks. All of the creation times should also match the insertion time of the replacement disk. 

4. To confirm the status of the rebalance, connect to the ASM instance on a database node, and validate that the disks were added back to the ASM diskgroups and that a rebalance is running:
SQL> set linesize 132
SQL> col path format a50
SQL> select group_number,path,header_status,mount_status,name from V$ASM_DISK where path like '%CD_03_edx2cel01';
GROUP_NUMBER PATH                                         HEADER_STATU MOUNT_S NAME
------------ -------------------------------------------- ------------ ------- ------------------------------
1 o/192.168.9.9/DATA_Q1_CD_03_edx2cel01         MEMBER       CACHED  DATA_Q1_CD_03_edx2cel01
2 o/192.168.9.9/DBFS_DG_CD_03_edx2cel01         MEMBER       CACHED  DBFS_DG_CD_03_edx2cel01
3 o/192.168.9.9/RECO_Q1_CD_03_edx2cel01         MEMBER       CACHED  RECO_Q1_CD_03_edx2cel01
SQL> select * from gv$asm_operation;
INST_ID GROUP_NUMBER OPERA STAT      POWER     ACTUAL      SOFAR   EST_WORK   EST_RATE
---------- ------------ ----- ---- ---------- ---------- ---------- ---------- ----------
EST_MINUTES ERROR_CODE
----------- --------------------------------------------
2            3 REBAL WAIT         10
1            3 REBAL RUN          10         10       1541       2422
7298           0
An active rebalance operation can be identified by STATE=RUN. The column group_number and inst_id provide the diskgroup number of the diskgroup being rebalanced and the instance number where the operation is running.  The rebalance operation is complete when the above query returns "no rows selected".

Verify that the disks are online (MODE_STATUS = ONLINE or MOUNT_STATUS = CACHED) via SQL:
SQL> select group_number,failgroup,mode_status,count(*) from v$asm_disk group by group_number,failgroup,mode_status;
If the new griddisks were not automatically added back into the ASM diskgroup configuration, then locate the disks with group_number=0, and add them back in manually using "alter diskgroup <name> add disk <path> rebalance power 10;" command:

SQL> select path,header_status from v$asm_disk where group_number=0;
PATH                                               HEADER_STATU
-------------------------------------------------- ------------
o/192.168.9.9/DBFS_DG_CD_03_edx2cel01        FORMER
o/192.168.9.9/DATA_Q1_CD_03_edx2cel01        FORMER
o/192.168.9.9/RECO_Q1_CD_03_edx2cel01        FORMER
SQL> alter diskgroup dbfs_dg add disk 'o/192.168.9.9/DBFS_DG_CD_03_edx2cel01' rebalance power 10;
SQL> alter diskgroup data_q1 add disk 'o/192.168.9.9/DATA_Q1_CD_03_edx2cel01' rebalance power 10;
SQL> alter diskgroup reco_q1 add disk 'o/192.168.9.9/RECO_Q1_CD_03_edx2cel01' rebalance power 10;

5. If the disk replaced was a system disk in slot 0 or 1, then the status of the OS volume should also be checked. Login as 'root' on the Storage cell and check the status using the same 'df' and 'mdadm' commands listed above:
[root@dbm1cel1 ~]# mdadm -Q --detail /dev/md5
/dev/md5:
Version : 0.90
Creation Time : Thu Mar 17 23:19:42 2011
Raid Level : raid1
Array Size : 10482304 (10.00 GiB 10.73 GB)
Used Dev Size : 10482304 (10.00 GiB 10.73 GB)
Raid Devices : 2
Total Devices : 3
Preferred Minor : 5
Persistence : Superblock is persistent
Update Time : Wed Jul 18 11:56:36 2012
State : active, degraded
Active Devices : 1
Working Devices : 2
Failed Devices : 1
Spare Devices : 1
UUID : e75c1b6a:64cce9e4:924527db:b6e45d21
Events : 0.215
Number   Major   Minor   RaidDevice State
3      65      213        0      spare rebuilding   /dev/sdad5
1       8       21        1      active sync   /dev/sdb5
2       8        5        -      faulty spare

[root@dbm1cel1 ~]#

While the system disk is rebuilding, the state will show as "active, degraded" or "active, degraded, recovering", with one device shown as rebuilding and a third device listed as 'faulty'. After the rebuild has started, re-running this command will show a "Rebuild Status: X% complete" line in the output. When the system disk sync is complete, the state should return to "clean" with only 2 devices.
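To follow the resync without re-running the command by hand, it can be polled periodically, for example (interval is arbitrary):

# watch -n 60 "mdadm -Q --detail /dev/md5 | egrep 'State|Rebuild Status'"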
Hiccups after replacement:

Upon replacement, if you find the disk is not being imported, try clearing the preserved cache.

status: import failure

CD_01_mytestkitcel02  normal
CD_02_mytestkitcel02   not present
:

           name:                    20:02
          deviceId:               21
          diskType:               HardDisk
        enclosureDeviceId:      20
        errMediaCount:          0
        errOtherCount:          0
        makeModel:              "HITACHI H7230AS60SUN2.88T"
        physicalFirmware:       A142
        physicalInterface:      sas
        physicalSerial:          xxxxxx
          physicalSize:           2.7290234193205833T
        slotNumber:              3
          status:                 import failure

Slot Number: 3
Firmware state: Unconfigured(good), Spun Up

Try the import with the force option:




CellCLI> import celldisk CD_02_mytestkitcel02 force

CELL-04560: Cannot complete import of cell disk CD_02_mytestkitcel02 . Received Error: CELL-04549: Cannot obtain LUN for cell disk: CD_02_mytestkitcel02.

Cell disks not imported: Celldisk: CD_02_mytestkitcel02

1) Check the pinned cache:
#MegaCli64 -GetPreservedCachelist -a0
2) Discard the cache:
#MegaCli64 -DiscardPreservedCache -L3 -force -a0      (replace 3 with the logical drive number for the affected slot)
3) Validate the cache was discarded:
#MegaCli64 -GetPreservedCachelist -a0
4) Create the LUN:
#MegaCli64 -CfgLdAdd -R0 [20:02] WB NORA Direct NoCachedBadBBU -strpsz1024 -a0
5) Check whether the LUN was created:
cellcli> list lun
6) Check the physical disk detail:
CellCLI> list physicaldisk <ID> detail

Firmware state: Unconfigured(good), Spun Up                                 
Foreign State: Foreign                                                                    

# MegaCli64 -CfgForeign -Clear -a0

If you get any error, try the following:

#MegaCli -CfgForeign -Clear -aALL
#MegaCli64 -CfgLdAdd -R0 [20:02] WB NORA Direct NoCachedBadBBU -strpsz1024 -a0

#MegaCli64 -pdlist -a0 |egrep 'Slot Number|Firmware state'

Check whether the serial number of the new disk is reflected under the cell detail. If not, the cell disk may have to be dropped and re-created. Once the disk shows normal from the RAID controller, check that the physical disk shows 'normal' in CellCLI. If it still shows 'import failure', it is time to verify the firmware level.

Verify the FW level:

#MegaCli64 -PDList -a0 -NoLog|grep Firmware

This may require bringing down the MS service before you upgrade the firmware.

#CheckHWnFWProfile -action updatefw -component HardDisk -attribute all_fw -slot 20:02


 Then verify its status:

 # MegaCli64 -pdlist -a0 |egrep 'Slot Number|Firmware state'


After the firmware upgrade, the disk in the "import failure" state needs to be dropped and re-enabled:

alter physicaldisk 20:02 drop for replacement

 If that fails, try the following:

alter physicaldisk 20:02 drop for replacement force

Then re-enable:
alter physicaldisk 20:02 reenable


If required, create the ASM disks manually (Doc ID 1281395.1).


 
For more details, refer to Doc ID 1390836.1 and Doc ID 1113013.1.