Tuesday, December 25, 2018

IORM plan on Exadata

Configuring an IORM plan on Exadata


The IORM plan is configured with the ALTER IORMPLAN command in the CellCLI (cell command-line interface) utility on each Exadata storage cell. It consists of two parameters: dbplan and catplan. The "dbplan" creates the I/O resource directives for the databases, while the "catplan" allocates resources by workload category across the databases consolidated on the target system. Both parameters are optional, i.e. if catplan is not specified, category-wise I/O allocation does not take place. The directives in an inter-database plan specify allocations to databases rather than to consumer groups. A database plan directive is built from the attributes listed below (see the sketch after the list).
  • name - Specify the database name or, from Exadata Storage Software release 12.1.2.1, a profile name. Use "other" with allocation-based plans and "default" with share-based plans to cover databases that have no explicit directive.
  • level - Specify the level of allocation. In a multi-level plan, if the current level is unable to utilize the allocated resources, the resources are cascaded to the next level.
  • role - Specify the database role, i.e. primary or standby, in an Oracle Data Guard environment. The directive then applies only while the database runs in the specified role. The attribute is not applicable to "other" and "default" directives.
  • allocation/share - Specify the resource allocation for a specific database as a percentage or as shares. If you specify both allocation and share, the directive is invalid. With percentage-based allocation, you can specify a "level" so that unused resources cascade to the successive levels. There can be a maximum of eight levels, the sum of all allocations at a level must not exceed 100, and a plan can hold at most 32 directives.
With share-based allocation, you specify neither levels nor percentages. A share is a value between 1 and 32 that represents the relative importance of a specific database. Share-based plans support up to 1024 directives.
  • limit - Specify the maximum disk utilization allowed for a database. This directive is handy in consolidation exercises because it helps achieve consistent I/O performance and enables pay-for-performance models.
  • flashCache - Specify whether or not a database can make use of the flash cache.
  • flashLog - Specify whether or not a database can make use of the flash log.
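A minimal sketch of a share-based dbplan, assuming two consolidated databases with the hypothetical names PRODDB and TESTDB (the trailing hyphen is the CellCLI line-continuation character):

CellCLI> ALTER IORMPLAN                                    -
         dbplan=((name=PRODDB, share=16, limit=80),        -
                 (name=TESTDB, share=4, flashCache=off),   -
                 (name=default, share=1))

Here PRODDB carries four times the importance of TESTDB and is capped at 80% disk utilization, while any database without an explicit directive falls under the "default" entry.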
 
From Exadata cell version 11.2.3.2 onward, IORM is enabled by default with the BASIC objective. The BASIC objective lets IORM protect small I/O requests against high latency and manage the flash cache. To enable IORM for user-defined plans, you must set the objective to AUTO. To disable IORM, set the objective back to BASIC.

CellCLI> ALTER IORMPLAN OBJECTIVE = AUTO;
 
 

The IORM Objective

The objective is an essential setting in an IORM plan: it tunes how I/O requests are issued based on the workload characteristics. An IORM objective can be basic, auto, low_latency, high_throughput, or balanced; the current setting can be queried as shown after this list.
  • basic - The default setting; user-defined plans are not enforced. It only guards small I/Os against high latency while maximum throughput is maintained.
  • low_latency - Reduces latency by capping the number of concurrent I/O requests held in the disk drive buffer. This setting specifically suits OLTP workloads.
  • high_throughput - Targets warehouse workloads; it maximizes throughput by keeping a larger buffer of concurrent I/O requests.
  • balanced - Balances low latency and high throughput.
  • auto - Lets IORM decide the appropriate objective based on the active workload on the cell.
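To check which objective is currently in effect on a cell, query the plan attributes:

CellCLI> LIST IORMPLAN ATTRIBUTES objective
         auto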

 

Managing Exadata Flash Cache

One of the key enablers of Exadata's extreme performance and scalability is the Exadata Smart Flash Cache. The I/O Resource Manager allows enabling and disabling flash cache usage for the multiple databases consolidated on an Exadata machine: an IORM plan directive can set the "flashCache" attribute to prevent a database from using the flash cache. If the attribute is not specified in a directive, the database is allowed to use the flash cache. Disabling the flash cache for a database requires considerable thought and strong justification. Flash log usage can likewise be controlled through IORM plans by setting the "flashLog" attribute in a plan directive; because the flash log consumes only a very small portion of the total flash, it is recommended to leave it enabled.
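As an illustration, a hedged sketch of a directive that keeps a low-priority reporting database (the name REPTDB is an assumption) out of the flash cache while leaving its flash log enabled:

CellCLI> ALTER IORMPLAN                                                  -
         dbplan=((name=REPTDB, share=2, flashCache=off, flashLog=on),    -
                 (name=default, share=8))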
 
Starting with Exadata Storage Server software release 12.1.2.1, IORM can also manage flash I/Os along with disk I/Os through a feature known as Flash IORM. OLTP flash I/Os are automatically prioritized over scan I/Os, ensuring faster OLTP response times. Flash bandwidth is distributed across multiple databases based on the allocations in the IORM plan directives, and the excess flash bandwidth available between scans is further divided among the consumer groups within each database.
Flash Cache Resource Management, also new in Exadata Storage Server software release 12.1.2.1, lets users configure the minimum and maximum amount of flash that a database can consume. Two new attributes support this: "flashCacheMin" sets the minimum flash space guaranteed for a database, while "flashCacheLimit" sets a soft upper limit that is enforced only when the flash cache is full.
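A sketch of how these attributes combine in a directive; the sizes and the database name FINDB are illustrative assumptions:

CellCLI> ALTER IORMPLAN                                       -
         dbplan=((name=FINDB, share=8, flashCacheMin=200G,    -
                  flashCacheLimit=1T),                        -
                 (name=default, share=2))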
 

Tuesday, December 18, 2018

[CELL-05651] [oracle.ossmgmt.ms.core.MSCellMetricDef] File system "/opt/oracle" is now 100% used.

CELL-01514: Connect Error. Verify that Management Server is listening at the specified HTTP port: 8888


 

Oracle Exadata Storage Server Software - Version 18.1.0.0.0 to 18.1.4.0.0 [Release 12.2]
Information in this document applies to any platform.

 

Symptoms

File system "/opt/oracle" is 100% full on Cell

Cause

Bug 26995980 EXADATA /OPT/ORACLE FULL DUE TO ACCESS LOGS FULL OF /CLISERVICE MESSAGES

Bug 27525029 CONTENT INCLUSION OF 26995980 IN EXADATA PSU 18.1.5.0.0

Solution

This is a known issue caused by Base Bug 26995980 'EXADATA /OPT/ORACLE FULL DUE TO ACCESS LOGS FULL OF /CLISERVICE MESSAGES'

Release Notes:

WebServer logs (access.log) are not removed after reaching the allowed file count.

--

No other workarounds are known. The recommendation is to manually remove the older access log files to free up space, as sketched below.
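A hedged cleanup sketch; the exact directory holding the MS WebServer access logs varies by release, so locate and verify the files before deleting anything:

# find /opt/oracle -name 'access.log*' -exec ls -lh {} \;
# Once confirmed, remove the older rotated copies, e.g. those untouched for a week:
# find /opt/oracle -name 'access.log.*' -mtime +7 -delete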

The fix for Bug 26995980 is included in the latest Exadata release 18.1.5.0.0

Reference:

Bug 26995980 EXADATA /OPT/ORACLE FULL DUE TO ACCESS LOGS FULL OF /CLISERVICE MESSAGES

Bug 27525029 CONTENT INCLUSION OF 26995980 IN EXADATA PSU 18.1.5.0.0
 

Tuesday, December 11, 2018

Where to find supported versions for Exadata


 

Purpose

This document lists the software patches and releases for Oracle Exadata Database Machine. This document includes versions for both the database servers and storage servers of Oracle Exadata Database Machine with database servers running Intel x86-64 processors.

For an index and references to the most frequently used My Oracle Support Notes with respect to Oracle Exadata and Oracle Exadata Database Machine environments, refer to the Master Note for Oracle Exadata Database Machine and Exadata Storage Server
Note 1187674.1.

Scope

The information in this document applies only to Exadata software 11.2 and higher. It does not apply to any previous version of Exadata software. Current releases for other Exadata software versions are maintained in a different note.

Note: The currently supported versions may change frequently, so it is important to review this document immediately prior to any Oracle Exadata Database Machine deployment.

Details 


Latest Releases and Patching News

Before upgrading, see the Requirements for Exadata Feature Usage section in this document for software requirements that may necessitate pre-upgrade patch application to other software in order to support specific Exadata features or patch application methods.
  • New Oracle Exadata Deployment Assistant (OEDA) release Nov 2018 - Supports 18 (18.1.0-18.4.0), 12.2.0.1 (BP170620-RU181016), 12.1.0.2 (BP1-BP181016) and 11.2.0.4 (BP1-BP181016)
  • New Exadata 18.1.10.0.0 (Note 2463368.1)
  • New QFSDP release - Quarterly Full Stack Download Patch (QFSDP) Oct 2018
  • New Exadata 19.1.0.0.0 (Note 2334423.1)
    • Requires OEDA Oct 2018 or later.
    • Verify minimum Grid Infrastructure and Database version requirements are met before updating to this release. See Note 2334423.1 for details
    • Database servers configured as physical or domU move to Oracle Linux 7.5. Verify compatibility of custom-installed software with OL7 before updating to this release.
  • New 18c Database release - 18.4.0.0.181016 Release Update
  • New 12.2.0.1 Database release - 12.2.0.1.181016 Release Update
  • New 12.1.0.2 Database release - 12.1.0.2.181016 Database Proactive Bundle Patch
  • New 11.2.0.4 Database release - 11.2.0.4.181016 Database Patch for Exadata
  • Updated ACFS drivers required for the most efficient CVE-2017-5715 (Spectre variant 2) mitigation in Exadata versions >= 18.1.5.0.0 and >= 12.2.1.1.7 are included in the July 2018 quarterly database releases. Earlier quarterly database releases (April 2018 and earlier) still require a separate ACFS patch. See Document 2356385.1 for details.
  • Oracle Database and Grid Infrastructure Upgrade Recommendations
    • If you are currently running 11.2.0.4 or 12.1.0.2, review the upgrade recommendations in Document 742060.1 to help you stay within the guidelines of Lifetime Support and Error Correction Policies.
 

Exadata Software Updates Overview and Guidelines

For an explanation and overview of Oracle Exadata Database Machine updates, and guidelines for applying and testing software on Exadata, refer to Document 1262380.1.

Exadata Software and Hardware Maintenance Planning

For information about Exadata Software and Hardware Support Lifecycle, see Document 1570460.1.
For information about planning for software and hardware maintenance, see Document 1461240.1.

Critical Issues

Review Document 1270094.1 for Exadata Critical Issues.

Security-Related Guidance

Review Document 1405320.1 for Responses to common Exadata security scan findings.

 

Disable pstack Called From Diagsnap After Applying PSU/RU released between October 2017 and July 2018 to Grid Infrastructure (GI) Home on 12.1.0.2 and 12.2. (Doc ID 2422509.1)

Description

Troubleshooting node reboots/evictions within Grid Infrastructure (GI) is often difficult due to the lack of network and OS-level resource information. To help circumvent this, the diagsnap feature was developed and integrated with Grid Infrastructure. Diagsnap is triggered to collect network and OS-level resource information when a given node is about to be evicted or when Grid Infrastructure is about to crash.
  
The diagsnap feature is enabled automatically starting with the 12.1.0.2 Oct 2017 PSU and the 12.2.0.1 Oct 2017 RU. For more information about the diagsnap feature, refer to Document 2345654.1 "What is diagsnap resource in 12c GI and above?"

Occurrence

In certain situations diagsnap executes pstack (and pfiles on Solaris) against critical daemons like ocssd.bin and gipcd.bin. 
Although very infrequent, taking pstack and pfiles on ocssd.bin can suspend the ocssd.bin daemon long enough to cause node reboots and evictions. For this reason, Oracle asks customers to disable the diagsnap functionality until the proper fixes are provided in a future PSU and/or RU. Once the fixes are applied, diagsnap will not call pstack (or pfiles on Solaris).

Symptoms

Node reboots and evictions occur after applying the 12.1.0.2 Oct 2017 PSU (and later) or the 12.2.0.1 Oct 2017 RU (and later), but before the 12.1.0.2 Oct 2018 PSU or 12.2.0.1 Oct 2018 RU, to the Grid Infrastructure (GI) home.
The problem is fixed in the 12.1.0.2 Oct 2018 PSU and the 12.2.0.1 Oct 2018 RU.


Workaround

Either apply the patch or prevent osysmond from issuing pstack (and diagsnap from issuing pfiles on Solaris).

For non-Solaris environments:

1.  Apply the latest PSU or RU, or the patch for Bug 28266751; the fix prevents osysmond from issuing pstack.

The fix for Bug 28266751 is included in the 12.1.0.2 Oct 2018 PSU and the 12.2.0.1 Oct 2018 RU, so the strong recommendation is to apply the 12.1.0.2 Oct 2018 PSU or the 12.2.0.1 Oct 2018 RU or later. Refer to Document 756671.1 "Master Note for Database Proactive Patch Program" for the patch number of the latest 12.1.0.2 PSU and 12.2.0.1 RU.

OR
2.  Disable osysmond from issuing pstack:
As root user, issue
crsctl stop res ora.crf -init
Update PSTACK=DISABLE in $GRID_HOME/crf/admin/crf<HOSTNAME>.ora
crsctl start res ora.crf -init
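A hedged shell sketch of step 2; it assumes a PSTACK line already exists in the file (if not, append PSTACK=DISABLE instead of editing), and that the file name uses the short host name:

# crsctl stop res ora.crf -init
# sed -i 's/^PSTACK=.*/PSTACK=DISABLE/' $GRID_HOME/crf/admin/crf$(hostname -s).ora
# crsctl start res ora.crf -init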
 

Patches

The following bugs were opened to remove the pstack and pfiles calls from diagsnap:
Bug 28266751 - REMOVE PSTACK FOR CSS AND GIPC IN DIAGSNAP
Bug 26943660 - DIAGSNAP.PL SHOULDN'T RUN PFILES ON CRSD.BIN

ORA-09817: Write to audit file failed. Linux-x86_64 Error: 28: No space left on device



 Affects:

Product (Component): Oracle Server (PCW)
Range of versions believed to be affected: (Not specified)
Versions confirmed as being affected:
Platforms affected: Generic (all / most platforms affected)

Fixed:


The fix for 28266751 is first included in

Interim patches may be available for earlier versions.


Description:

  On Clusterware you may see pstack collected on processes like ocssd or gipcd

 

Rediscovery Notes:  

   CHM sometimes calls pstack on ocssd.bin or gipcd

Workaround:  

    None


For more details check - Doc ID 2422509.1

Monday, November 19, 2018

Which to select, InfiniBand or 10GbE, to connect to ZFS?

Selecting InfiniBand or 10GbE
The Oracle ZFS Storage Appliance can be configured with either InfiniBand or 10GbE for the purpose of connecting to an Exadata Database Machine.
Exadata utilizes a native InfiniBand infrastructure that provides increased bandwidth, lower latency, and reduced system resource utilization compared to 10GbE. The Oracle ZFS Storage Appliance can integrate seamlessly by connecting to the two InfiniBand leaf switches (NM2-36P) that are pre-configured in the Exadata rack.
In many situations it is recommended to utilize InfiniBand when connecting to Exadata due to the superior performance that can be realized. However, in some situations 10GbE is the better choice; examples include distance limitations that make InfiniBand deployments prohibitive, or backing up five or more isolated Exadatas to a single Oracle ZS5-4.
An InfiniBand link provides equal or superior performance to a 10GbE link under all workloads. However, certain disk-IOPS-intensive workloads, such as direct database backups of OLTP systems, will not see a significant benefit from InfiniBand, which makes 10GbE a good fit for those workloads.

Sunday, October 14, 2018

Spectre / Meltdown - Intel processor vulnerabilities [CVE-2017-5715 CVE-2017-5753 CVE-2017-5754]

Status of Oracle Exadata Database Machine with respect to CVE-2017-5753 (Spectre v1), CVE-2017-5715 (Spectre v2), and CVE-2017-5754 (Meltdown) Intel processor vulnerabilities:


Patch Availability Table

Affected Products                                 CVE-2017-5715               CVE-2017-5753              CVE-2017-5754
Oracle Exadata Database Machine                   18.1.5.0.0 / 12.2.1.1.7     18.1.4.0.0 / 12.2.1.1.6    18.1.4.0.0 / 12.2.1.1.6
(compute nodes/storage servers)                   (Notes 1, 5, 6)
Sun Data Center InfiniBand Switch 36 (NM2-36P)    Not Required (Note 2)       Not Required (Note 2)      Not Required (Note 2)
Cisco Catalyst C4948/C4948E-F-S                   Not Required (Note 3)       Not Required (Note 3)      Not Required (Note 3)
Cisco Nexus 93108TC-EX                            May be Required (Note 4)    May be Required (Note 4)   May be Required (Note 4)




Note 1: Install updated ACFS drivers before updating Exadata database servers to >= 18.1.5.0.0 or >= 12.2.1.1.7, and use the most recent patchmgr/dbserver.patch (Patch 21634633 version 18.1.5.0.0 / 5.180529 or later) to perform the database server updates.  Updated ACFS drivers are needed whether or not ACFS is in use in order for the most efficient CVE-2017-5715 mitigation to be in place once >= 18.1.5.0.0 or >= 12.2.1.1.7 is installed.  See section Required ACFS Driver Updates below for details.
Note 2: The InfiniBand Switch component (NM2-36P) is not currently believed to be impacted by these vulnerabilities.


Required ACFS Driver Updates

Install updated ACFS drivers before updating Exadata database servers to >= 18.1.5.0.0 or >= 12.2.1.1.7. Note the following:
  1. Updated ACFS drivers are needed for the most efficient CVE-2017-5715 mitigation to be in place once >= 18.1.5.0.0 or >= 12.2.1.1.7 is installed.
  2. Updated ACFS drivers are needed whether or not ACFS is in use.
  3. In an OVM configuration dom0 may be updated to >= 18.1.5.0.0 or >= 12.2.1.1.7 before the ACFS drivers are updated in domU.
Updated ACFS drivers are available with the following:
  • Grid infrastructure 18.3.0.0.180717 and later version 18 quarterly updates
  • Grid infrastructure 12.2.0.1.180717 and later version 12.2.0.1 quarterly updates
  • Grid infrastructure 12.1.0.2.180831 and later version 12.1.0.2 quarterly updates
  • Grid infrastructure 12.1.0.2.180717 plus patch 23312691
How to Verify Proper Mitigation


To verify your systems have the proper mitigations in place after the ACFS driver and Exadata updates are performed, run the following command:
$ grep . /sys/devices/system/cpu/vulnerabilities/*
The expected output depends on the hardware version, server type, and system configuration. If the output shown on your system does not match the expected output for your configuration below, see the "Troubleshooting unexpected mitigation" section.
X7 hardware - storage server and database server (non-OVM and domU)
Expected vulnerability mitigation output:
$ grep . /sys/devices/system/cpu/vulnerabilities/*
/sys/devices/system/cpu/vulnerabilities/meltdown:Mitigation: PTI
/sys/devices/system/cpu/vulnerabilities/spectre_v1:Mitigation: lfence
/sys/devices/system/cpu/vulnerabilities/spectre_v2:Mitigation: IBRS, IBRS_FW, IBPB
X6 and earlier hardware - storage server and database server (non-OVM and domU)
Expected vulnerability mitigation output:
$ grep . /sys/devices/system/cpu/vulnerabilities/*
/sys/devices/system/cpu/vulnerabilities/meltdown:Mitigation: PTI
/sys/devices/system/cpu/vulnerabilities/spectre_v1:Mitigation: lfence
/sys/devices/system/cpu/vulnerabilities/spectre_v2:Mitigation: Full generic retpoline, IBRS_FW, IBPB
All hardware - database server (dom0)
Expected vulnerability mitigation output:
# grep . /sys/devices/system/cpu/vulnerabilities/*
/sys/devices/system/cpu/vulnerabilities/meltdown:Vulnerable
/sys/devices/system/cpu/vulnerabilities/spectre_v1:Mitigation: lfence
/sys/devices/system/cpu/vulnerabilities/spectre_v2:Mitigation: Full generic retpoline
Note that the output for dom0 incorrectly indicates the system is vulnerable to Meltdown (CVE-2017-5754).  Exadata database servers configured with OVM use the Xen hypervisor and guest VMs in HVM mode.  Xen with guests in HVM mode is not vulnerable to Meltdown.
Troubleshooting unexpected mitigation
If the Spectre v2 mitigation does not match the expected output shown above, then confirm "imageinfo -ver" indicates the installed Exadata version is >= 18.1.5.0.0 or >= 12.2.1.1.7. If the proper Exadata version is installed, then follow these guidelines:
Storage servers
This situation should never occur on storage servers.  Contact Oracle Support.

Database servers (non-OVM and domU)
Perform the following verification steps on each database server:
  1. Verify the patch containing the updated ACFS drivers has been installed in the grid infrastructure home.  If the following OPatch command does not return output then the ACFS patch is not installed. 
    $ <gihome>/grid/OPatch/opatch lsinventory -bugs_fixed | grep ^27463879
    27463879 27463879 Tue May 15 20:31:10 UTC 2018 TRACKING BUG FOR RECOMPILING USM DRIVERS WITH
     
    If the patch is not installed, perform the following steps: 1) install the ACFS patch, as indicated in the "Required ACFS Driver Updates" section above; 2) reboot (required for the kernel to re-enable proper mitigation); 3) verify proper mitigation. This may be done in a rolling manner.

    If the patch is installed, but it was installed after Exadata software was upgraded to >= 18.1.5.0.0 or >= 12.2.1.1.7, then reboot is required for the kernel to re-enable proper mitigation.

    Otherwise, proceed to the next troubleshooting step.
  2. Verify the state of the ACFS driver using the acfsdriverstate command.
    $ <gihome>/bin/acfsdriverstate version
    ACFS-9325: Driver OS kernel version = 4.1.12-94.8.2.el6uek.x86_64.
      
    If Driver OS kernel version = 4.1.12-32, then the updated ACFS driver is not loaded.  This scenario may occur if the ACFS patch was installed before database server update, but the database server update was performed with an older patchmgr/dbserver.patch.  To load the proper ACFS driver perform the following steps as root: 1) stop clusterware; 2) run "<gihome>/bin/acfsroot install" to load the proper drivers; 3) reboot (required for the kernel to re-enable proper mitigation); 4) verify proper mitigation.  This may be done in a rolling manner.

    If Driver OS kernel version = 4.1.12-94.8.2 or a higher kernel version (note that this will not necessarily match the installed kernel version), then the correct driver is loaded.  Proceed to the next troubleshooting step.
  3. Review /var/log/messages for output referring to "Spectre V2" logged shortly after the time for the current system boot, indicating a module is loaded "not compiled with retpoline compiler", similar to the following:
    kernel: badmodule: loading module not compiled with retpoline compiler.
    kernel: Spectre V2 : Disabling Spectre v2 mitigation retpoline.
    kernel: Spectre V2 : Spectre v2 mitigation set to IBRS.
      
    If /var/log/messages does not contain any message referring to "Spectre V2" then one of the following conditions exist: 1) /var/log/messages file was rotated - review an older messages file; or 2) Exadata software has not been upgraded to >= 18.1.5.0.0 or >= 12.2.1.1.7.

    If the message reported is "oracleoks: loading module not compiled with retpoline compiler", then it indicates the proper ACFS driver is not in place.  Review the previous troubleshooting steps.

    If the message refers to any other module, then it is likely caused by user-installed software that supplies a module that has not been compiled with a retpoline-aware compiler.  Contact the vendor of that kernel module to obtain an update.

    For retpoline mitigation to be active, kernel modules/drivers that contain code needing retpolines must be compiled with a retpoline-aware compiler. Loading a module needing retpolines that was not compiled with a retpoline-aware compiler (e.g. an older ACFS driver, or a third-party module) will cause the kernel to disable retpoline mitigation systemwide, and fallback to a different mitigation (e.g. IBRS), which may have higher than expected performance impact on some systems. Review /var/log/messages, as shown above, for output showing the kernel disabling retpoline because a module was not compiled with a retpoline-aware compiler. All kernel modules delivered with Exadata 18.1.5.0.0 and 12.2.1.1.7 have been compiled with a retpoline-aware compiler. The updated ACFS drivers discussed above have been compiled with a retpoline-aware compiler.

  4. If the previous troubleshooting steps do not resolve the issue, then contact Oracle Support.

Tuesday, July 10, 2018

M.2 devices in X7 and how to monitor and replace faulty disks

Oracle Exadata Database Machine X7 systems come with two internal M.2 devices that contain the system area. In all previous systems, the first two disks of the Oracle Exadata Storage Server are system disks, and the portions on these system disks are referred to as the system area.



Note:
Oracle Exadata Rack and Oracle Exadata Storage Servers can remain online and available while replacing an M.2 disk.

This section contains the following topics:

Monitoring the Status of M.2 Disks

You can monitor the status of an M.2 disk by checking its attributes with the CellCLI LIST PHYSICALDISK command.
The disk firmware maintains the error counters and marks a drive with Predictive Failure when the disk is about to fail. The drive, not the cell software, determines whether it needs replacement.
  • Use the CellCLI command LIST PHYSICALDISK to determine the status of an M.2 disk:
 
CellCLI> LIST PHYSICALDISK WHERE disktype='M2Disk' DETAIL
         name:                  M2_SYS_0
         deviceName:            /dev/sdm
         diskType:              M2Disk
         makeModel:             "INTEL SSDSCGJK150G7"
         physicalFirmware:      N2010112
         physicalInsertTime:    2017-07-14T08:42:24-07:00
         physicalSerial:        PHDW7082000M150A
         physicalSize:          139.73558807373047G
         slotNumber:            "M.2 Slot: 0"
         status:                failed

         name:                  M2_SYS_1        
         deviceName:            /dev/sdn
         diskType:              M2Disk
         makeModel:             "INTEL SSDSCKJB150G7"
         physicalFirmware:      N2010112
         physicalInsertTime:    2017-07-14T12:25:05-07:00
         physicalSerial:        PHDW708204SZ150A
         physicalSize:          139.73558807373047G
         slotNumber:            "M.2 Slot: 1"
         status:                normal


Replacing an M.2 Disk Due to Failure or Other Problems

Failure of an M.2 disk reduces the redundancy of the system area and can impact patching, imaging, and system rescue. Therefore, the disk should be replaced with a new disk as soon as possible. When an M.2 disk fails, the storage server automatically and transparently switches to using the software stored on the inactive system disk, making it the active system disk.



An Exadata alert is generated when an M.2 disk fails. The alert includes specific instructions for replacing the disk. If you have configured the system for alert notifications, the alert is sent by e-mail to the designated address. The M.2 disk is hot-pluggable and can be replaced while the power is on. After the M.2 disk is replaced, Oracle Exadata System Software automatically adds the new device to the system partition and starts the rebuilding process.
 
  1. Identify the failed M.2 disk.
    CellCLI> LIST PHYSICALDISK WHERE diskType=M2Disk AND status!=normal DETAIL
             name:                  M2_SYS_0
             deviceName:            /dev/sda
             diskType:              M2Disk
             makeModel:             "INTEL SSDSCKJB150G7"
             physicalFirmware:      N2010112
             physicalInsertTime:    2017-07-14T08:42:24-07:00
             physicalSerial:        PHDW7082000M150A
             physicalSize:          139.73558807373047G
             slotNumber:            "M.2 Slot: 0"
             status:                failed - dropped for replacement
    
  2. Locate the cell that has the white LED lit.
  3. Open the chassis and identify the M.2 disk by the slot number in Step 1.
  4. The amber LED for this disk should be lit to indicate service is needed.
    M.2 disks are hot pluggable, so you do not need to power down the cell before replacing the disk.
  5. Remove the M.2 disk:
    1. Rotate both riser board socket ejectors up and outward as far as they will go.
      The green power LED on the riser board turns off when you open the socket ejectors.
    2. Carefully lift the riser board straight up to remove it from the sockets.
  6. Insert the replacement M.2 disk:
    1. Unpack the replacement flash riser board and place it on an antistatic mat.
    2. Align the notch in the replacement riser board with the connector key in the connector socket.
    3. Push the riser board into the connector socket until the riser board is securely seated in the socket.
      Caution:
      If the riser board does not easily seat into the connector socket, verify that the notch in the riser board is aligned with the connector key in the connector socket. If the notch is not aligned, damage to the riser board might occur.
    4. Rotate both riser board socket ejectors inward until the ejector tabs lock the riser board in place.
      The green power LED on the riser board turns on when you close the socket ejectors.
  7. Confirm the M.2 disk has been replaced.
    CellCLI> LIST PHYSICALDISK WHERE DISKTYPE=M2Disk DETAIL
        name:                  M2_SYS_0
        deviceName:            /dev/sdm
        diskType:              M2Disk
        makeModel:             "INTEL SSDSCKJB150G7"
        physicalFirmware:      N2010112
        physicalInsertTime:    2017-08-24T18:55:13-07:00
        physicalSerial:        PHDW708201G0150A
        physicalSize:          139.73558807373047G
        slotNumber:            "M.2 Slot: 0"
        status:                normal
    
       name:                  M2_SYS_1   
       deviceName:            /dev/sdn   
       diskType:              M2Disk   
       makeModel:             "INTEL SSDSCKJB150G7"    
       physicalFirmware:      N2010112   
       physicalInsertTime:    2017-08-24T18:55:13-07:00   
       physicalSerial:        PHDW708200SZ150A   
       physicalSize:          139.73558807373047G   
       slotNumber:            "M.2 Slot: 1"   
       status:                normal 
    
  8. Confirm the system disk arrays have an active sync status or are being rebuilt (rebuild progress can also be watched with /proc/mdstat; see the sketch after the output).
# mdadm --detail /dev/md[2-3][4-5]
/dev/md24:
      Container : /dev/md/imsm0, member 0
     Raid Level : raid1
     Array Size : 104857600 (100.00 GiB 107.37 GB)
  Used Dev Size : 104857600 (100.00 GiB 107.37 GB)
   Raid Devices : 2
  Total Devices : 2

               State  : active
 Active Devices  : 2
Working Devices  : 2
 Failed Devices  : 0
   Spare Devices : 0  

             UUID : 152f728a:6d294098:5177b2e5:8e0d8c6c
   Number    Major    Minor    RaidDevice    State
    1           8         16             0       active sync  /dev/sdb
    0           8           0            1       active sync  /dev/sda
/dev/md25:
      Container : /dev/md/imsm0, member 1
     Raid Level : raid1
     Array Size : 41660426 (39.73 GiB 42.66 GB)
  Used Dev Size : 41660524 (39.73 GiB 42.66 GB)
   Raid Devices : 2
  Total Devices : 2

               State  : clean
 Active Devices  : 2
Working Devices  : 2
 Failed Devices  : 0
   Spare Devices : 0  

             UUID : 466173ba:507008c7:6d65ed89:3c40cf23
   Number    Major    Minor    RaidDevice    State
 1           8         16        0      active sync  /dev/sdb
 0           8         0         1      active sync  /dev/sda
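While the arrays resynchronize after a disk replacement, progress can also be followed through the standard Linux software-RAID status file; a minimal sketch:

# cat /proc/mdstat    # shows per-array state and a resync percentage while rebuilding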

Wednesday, July 4, 2018

Storage Cell reports error RS-7445 [Serv MS Leaking Memory]


Bug 19790644 - RS-7445 [SERV MS LEAKING MEMORY]

The issue is a memory leak in the Java executable.

This bug affects systems running JDK 7u51 or later versions (1.7.0_55-b13), which ship with Exadata 11.2.3.3.1 and 12.1.1.1.1.

It is relevant for all versions from 11.2.3.3.1 to 12.1.2.1.1 [Release 11.2 to 12.1], excluding 11.2.3.3.0 and 12.1.1.1.0.
Systems running 11.2.3.3.0 or 12.1.1.1.0 are not affected, as they use 1.7.0_25-b15.

Cause:


The MS process consumes a growing amount of memory (up to 2 GB). Normally MS uses around 1 GB of memory, but because of the bug the allocated memory can grow up to 2 GB.

Normal memory usage:

ps -feal|grep java
0 S root     18585 13652  0  80   0 - 15319 pipe_w 15:21 pts/1    00:00:00 grep java
0 S root     27960 27958  0  80   0 - 292553 futex_ Jun17 ?       01:45:06 /usr/java/default/bin/java -Xms256m -Xmx512m -XX:-UseLargePages -Djava.library.path=/opt/oracle/cell/cellsrv/lib -Ddisable.checkForUpdate=true -jar /opt/oracle/cell/oc4j/ms/j2ee/home/oc4j.jar -out /opt/oracle/cell/cellsrv/deploy/log/ms.lst -err /opt/oracle/cell/cellsrv/deploy/log/ms.err
292553 * 4096 = 1142 MB (~1.1 GB), since the SZ column reported by ps -feal is in 4 KB pages.
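A quick hedged one-liner to compute the same figure for the running MS java process (the awk filter on oc4j.jar matches the MS command line shown above):

$ ps -eo pid,sz,cmd | awk '/oc4j.jar/ && !/awk/ {printf "%s: %d pages = %.0f MB\n", $1, $2, $2*4096/1048576}'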

Larger values indicate the memory leak.
Running pmap -x <MS process pid> while the leak is present will report a large number of 64 MB memory chunks:
Address           Kbytes     RSS   Dirty Mode   Mapping
0000000000400000       4       4       0 r-x--  java
0000000000600000       4       4       4 rw---  java
00000000019c3000   85212   83816   83816 rw---    [ anon ]
00000000dae00000   46080   45856   45856 rw---    [ anon ]
00000000ddb00000   37888       0       0 -----    [ anon ]
00000000e0000000  175104  174900  174900 rw---    [ anon ]
00000000eab00000  174080       0       0 rw---    [ anon ]
00000000f5500000   87552   87552   87552 rw---    [ anon ]
00000000faa80000   87552       0       0 -----    [ anon ]
00007f261c000000   38384   37488   37488 rw---    [ anon ]
00007f261e57c000   27152       0       0 -----    [ anon ]
00007f2624000000   58488   56628   56628 rw---    [ anon ]
00007f262791e000    7048       0       0 -----    [ anon ]
00007f262c000000   65524   65444   65444 rw---    [ anon ]
00007f262fffd000      12       0       0 -----    [ anon ]
00007f2634000000   65528   65528   65528 rw---    [ anon ]
00007f2637ffe000       8       0       0 -----    [ anon ]
00007f263c000000   65536   65528   65528 rw---    [ anon ]
00007f2644000000   65536   65360   65360 rw---    [ anon ]
00007f264c000000   65528   65520   65520 rw---    [ anon ]
00007f264fffe000       8       0       0 -----    [ anon ]
00007f2654000000   65456   65456   65456 rw---    [ anon ]


Solution:

1. The error is ignorable, as the MS service is restarted automatically, which resets the process and the memory used.
2. While patching the storage cell with one-off patches is not generally recommended, if the MS service is not restarted automatically, the JDK needs to be upgraded on the storage cell.
    Use Patch 20328167: TRACKING BUG FOR JDK 1.7.0.72-B33 PATCH (a wrapper for Oracle JDK 7 Update 72 b33 or later)
3. If no other issues are being seen, the recommended action is to wait for Exadata Cell Software version 12.1.2.1.2 or later.
 

Tuesday, July 3, 2018

Asmcmd daemon consuming High CPU

When ASMCMD commands are executed with parameters, they leave the ASM connection open (visible as the "asmcmd daemon" process) and consume high CPU.

Bug 28019068 - EXADATA: ASMCMD CONSUMING HIGH CPU IN 18C

$ ps -ef | grep asmcmd | grep -v grep
root      224347      1  3 03:20 ?        00:00:00 asmcmd daemon
grid   234123      1  2 03:20 ?        00:00:00 oracle+asm_asmcmd
pstack output:

#0  0x00007f4954b00050 in __open_nocancel () from /lib64/libpthread.so.0
#1  0x00000000005622de in PerlIOUnix_open ()
#2  0x0000000000563c74 in PerlIOBuf_open ()
#3  0x0000000000565795 in PerlIO_openn ()
#4  0x000000000053b030 in Perl_do_open6 ()
#5  0x00000000005274d2 in Perl_pp_open () 
#6  0x00000000004cefad in Perl_runops_standard ()
#7  0x0000000000443622 in S_run_body ()
#8  0x000000000044350b in perl_run ()
#9  0x000000000041de78 in main ()
The stack shows the open() call that leaves the pooled connection open.

Workaround:

1. When you kick off an ASMCMD command, it actually establishes a connection to the ASM instance. To disable connection pooling, use the --nocp parameter to the ASMCMD tool:
 $ asmcmd --nocp <parameters>

2. Run commands inside the ASMCMD command line instead of passing them as parameters, to avoid leaving pooled connections open; see the sketch below.
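A brief sketch of both workarounds; the disk group name DATA is an assumption:

$ asmcmd --nocp ls +DATA    # one-shot command without connection pooling
$ asmcmd                    # or run the commands interactively
ASMCMD> ls +DATA
ASMCMD> exit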
 

Thursday, June 28, 2018

Physical disk in failed state but online on MegaRaid on CELL



BUG ID:  25632147

Workaround:



 1. Copy the cell disk config.xml file (Take backup)


 [root@jfclcx0024 config]# cp cell_disk_config.xml* /tmp
 [root@jfclcx0024 config]# ls /tmp/cell_disk_config.xml*
 /tmp/cell_disk_config.xml /tmp/cell_disk_config.xml_
 /tmp/cell_disk_config.xml__



 2. Remove the config file


 [root@jfclcx0024 config]# rm cell_disk_config.xml*
 rm: remove regular file `cell_disk_config.xml'? y
 rm: remove regular file `cell_disk_config.xml_'? y
 rm: remove regular file `cell_disk_config.xml__'? y




 3. Stop the celld services.


 [root@jfclcx0024 config]# service celld stop
 Stopping the RS, CELLSRV, and MS services...
 The SHUTDOWN of services was successful.



 4. Start the cell services


 [root@jfclcx0024 config]# service celld start
 1: 108 usec
 1: 76 usec

 Starting the RS, CELLSRV, and MS services...
 Getting the state of RS services... running
 Starting CELLSRV services...
 The STARTUP of CELLSRV services was not successful.
 CELL-01537: Unable to read the cell_disk_config.xml file because the file is missing or empty.
 Starting MS services...
 The STARTUP of MS services was successful.




 5. Check the status of the cell services; CELLSRV will not have started.


 [root@jfclcx0024 config]# service celld status
 rsStatus: running
 msStatus: running
 cellsrvStatus: stopped

 [root@jfclcx0024 config]#
 [root@jfclcx0024 config]#

 6. Copy the backup config.xml file back to its original location.


 [root@jfclcx0024 config]# cp /tmp/cell_disk_config.xml .



 7. Restart the cell services.


 [root@jfclcx0024 config]# service celld restart
 Stopping the RS, CELLSRV, and MS services...
 The SHUTDOWN of services was successful.
 Starting the RS, CELLSRV, and MS services...
 Getting the state of RS services... running
 Starting CELLSRV services...
 The STARTUP of CELLSRV services was successful.
 Starting MS services...
 The STARTUP of MS services was successful.

 8. Confirm the cell services are up
 [root@jfclcx0024 config]# service celld status
 rsStatus: running
 msStatus: running
 cellsrvStatus: running
 [root@jfclcx0024 config]#



 9. Now check the status of the physical disk and confirm that it is normal.


 [root@jfclcx0024 config]# cellcli -e list physicaldisk 8:10 detail
 name: 8:10
 deviceId: 25
 deviceName: /dev/sdk
 diskType: HardDisk
 enclosureDeviceId: 8
 errOtherCount: 0
 luns: 0_10
 makeModel: "HGST H7280A520SUN8.0T"
 physicalFirmware: P9E2
 physicalInsertTime: 2017-02-26T07:23:06+00:00
 physicalInterface: sas
 physicalSerial: P1PMBV
 physicalSize: 7.153663907200098T
 slotNumber: 10
 status: normal

 

Wednesday, June 27, 2018

CLSU-00107: operating system function: open failed; failed with error data: 2; at location: SlfFopen1

CLSU-00107: operating system function: open failed; failed with error data: 2; at location: SlfFopen1

CLSU-00101: operating system error message: No such file or directory



When executing the asmcmd command in GI, the following error was raised: Can't open '/opt/oracle/log/diag/asmcmd/user_grid/weasel1xa.rjf.com/alert/alert.log' for append.
The path points to the Oracle base, not the Oracle home, for the log directory location.
 
[grid@weasel1xa ~]$ export DBI_TRACE=1
[grid@weasel1xa ~]$ asmcmd
   DBI 1.616-ithread default trace level set to 0x0/1 (pid 2193 pi 7fa010) at DBI.pm line 278 via asmcmdshare.pm line 270
Can not create path /opt/oracle/log/diag/asmcmd/user_grid/weasel1xa.rjf.com/alert
Can not create path /opt/oracle/log/diag/asmcmd/user_grid/weasel1xa.rjf.com/trace
Can't open '/opt/oracle/log/diag/asmcmd/user_grid/weasel1xa.rjf.com/alert/alert.log' for append
CLSU-00100: Operating System function: open failed failed with error data: 2
CLSU-00101: Operating System error message: No such file or directory
CLSU-00103: error location: SlfFopen1


As per documentation:

Under certain circumstances, $ORACLE_BASE and $ORACLE_HOME can be set to override the default locations of the alert.log and trace.trc files.
http://docs.oracle.com/cd/E11882_01/server.112/e18951/asm_util001.htm#OSTMG94362
 
 
"log" directory was missing in asmcmd log location path /opt/oracle/log/diag/asmcmd ($ORACLE_BASE/log/diag/asmcmd)

$ls -l /opt/oracle

total 48
drwxr-x--- 4 grid oinstall 4096 Oct 31 11:27 admin
drwxr-x--- 2 grid oinstall 4096 Oct 31 11:27 audit
drwxr-x--- 6 grid oinstall 4096 Oct 31 11:29 cfgtoollogs
drwxr-xr-x 2 grid oinstall 4096 Oct 31 11:32 checkpoints
drwxr-xr-x 3 oracle oinstall 4096 Nov 4 16:48 core
drwxrwxr-x 3 grid oinstall 4096 Oct 30 17:54 crsdata
drwxrwxr-x 12 grid oinstall 4096 Nov 1 16:26 diag
drwxr-xr-x 3 oracle oinstall 4096 Oct 31 13:44 opatchauto
drwxr-xr-x 10 oracle oinstall 4096 Oct 31 16:17 ora_sw
drwxr-xr-x 5 oracle oinstall 4096 Nov 5 14:10 product
drwxr-xr-x 3 grid oinstall 4096 Oct 24 14:14 weasel1xa
drwxr-xr-x 3 root root 4096 Oct 31 13:41 weasel1xa.rjf.com


Solution: The issue was resolved after creating the directory structure /opt/oracle/log/diag/asmcmd, including the missing "log" subdirectory, as sketched below.
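A minimal sketch of the fix; the ownership and permissions are assumptions and should match the existing $ORACLE_BASE directories:

$ mkdir -p /opt/oracle/log/diag/asmcmd
$ chown -R grid:oinstall /opt/oracle/log
$ chmod -R 750 /opt/oracle/log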