Wednesday, May 3, 2017

Be aware of network socket files before you clean up /var/tmp, /tmp or /usr/tmp: they are critical for Oracle Clusterware to run

ORA-29701: unable to connect to Cluster Synchronization Service

Purpose

This note explains the issues that arise when Oracle Clusterware's network socket files are deleted or have the wrong ownership.


Details

Oracle Clusterware (CRS or Grid Infrastructure) network socket files are located in /tmp/.oracle, /usr/tmp/.oracle or /var/tmp/.oracle. To keep the clusterware healthy, do not touch them manually unless instructed to do so by Oracle Support.


Cause

The hidden directory '/var/tmp/.oracle' (or /tmp/.oracle on some platforms) or its contents were removed while instances & the CRS stack were up and running. Typically this directory contains a number of "special" socket files that local clients use to connect via the IPC protocol (sqlnet) to various Oracle processes, including the TNS listener, the CSS, CRS & EVM daemons, and even database or ASM instances. These files are created when the "listening" process starts.
A typical listing of '/var/tmp/.oracle' shows a number of such files:
# cd /var/tmp/.oracle
# ls -l
srwxrwxrwx 1 oracle   dba     0 Sep  6 10:50 s#9862.2
srwxrwxrwx 1 oracle   dba     0 Sep 15 11:35 sAracnode1_crs_evm
srwxrwxrwx 1 root     root    0 Sep 15 11:35 sracnode1DBG_CRSD
srwxrwxrwx 1 oracle   dba     0 Sep 15 11:34 sracnode1DBG_CSSD
srwxrwxrwx 1 oracle   dba     0 Sep 15 11:35 sracnode1DBG_EVMD
srwxrwxrwx 1 oracle   dba     0 Sep 15 11:35 sCracnode1_crs_evm
srwxrwxrwx 1 root     root    0 Sep 15 11:35 sCRSD_UI_SOCKET
srwxrwxrwx 1 oracle   dba     0 Sep 15 11:35 sEXTPROC
srwxrwxrwx 1 oracle   dba     0 Sep 15 11:34 sOCSSD_LL_racnode1_crs
srwxrwxrwx 1 oracle   dba     0 Sep 15 11:34 sOracle_CSS_LclLstnr_crs_1
srwxrwxrwx 1 root     root    0 Sep 15 11:35 sora_crsqs
srwxrwxrwx 1 root     root    0 Sep 15 11:35 sprocr_local_conn_0_PROC
srwxrwxrwx 1 oracle   dba     0 Sep 15 11:35 sSYSTEM.evm.acceptor.auth
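
To see which of these locations a given host actually uses, a check along the following lines can help. This is a minimal sketch, not an Oracle-supplied tool: the function name count_sockets is made up for illustration, and the three paths are simply the candidate locations named in this note. It only reads the directories and changes nothing.

```shell
#!/bin/sh
# Hypothetical helper: print the number of socket files found under directory $1.
# A missing or unreadable directory is reported as 0.
count_sockets() {
    find "$1" -type s 2>/dev/null | wc -l | tr -d ' '
}

# Candidate locations taken from this note; which one exists depends on the platform.
for d in /var/tmp/.oracle /tmp/.oracle /usr/tmp/.oracle; do
    if [ -d "$d" ]; then
        echo "$d: $(count_sockets "$d") socket file(s)"
    else
        echo "$d: not present"
    fi
done
```

If the directory exists but reports no socket files while the clusterware is running, that would be consistent with the situation described in this note.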

When a file is deleted on Unix, it becomes "invisible" at the filesystem level, but any process that had the file open when it was deleted can still use it.
Attempts to open a "deleted" file for reading fail (ENOENT 2 /* No such file or directory */), and opening a file with the same name for writing creates a new (different) file.
Therefore only processes that attempt to open the socket file during the initial handshake fail with ORA-29701, while processes that are already connected are unaffected.
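
The behaviour can be reproduced safely with a throwaway file. The sketch below is an illustration only: it uses a regular file in a temporary directory rather than a real Unix domain socket, and the variable names are arbitrary. The point it shows is the same: a descriptor opened before the delete keeps working, while a new open by name fails.

```shell
#!/bin/sh
# Demonstration only: works in a scratch directory, never in /var/tmp/.oracle.
demo_dir=$(mktemp -d)
demo_file="$demo_dir/demo"
echo "still readable" > "$demo_file"

# Open the file on descriptor 3, as a long-running process would have done.
exec 3< "$demo_file"

# Remove the directory entry while the descriptor is still open.
rm "$demo_file"

# The already-open descriptor can still be read.
existing=$(cat <&3)
echo "existing descriptor read: $existing"

# A fresh open by name fails because the name no longer exists.
if cat "$demo_file" >/dev/null 2>&1; then
    new_open="succeeded"
else
    new_open="failed"
fi
echo "new open by name: $new_open"

# Close the descriptor and remove the scratch directory.
exec 3<&-
rmdir "$demo_dir"
```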

A very common cause of this issue is system administration activity that frees up space in /tmp, /var/tmp etc., run either occasionally or regularly via cron jobs. As a rule of thumb, the directory .oracle in /var/tmp or /tmp should always be excluded from such activities. The best time to completely clean out these directories is during system boot, before the clusterware is started.

Solution

The only way to re-create these special files is to restart the process that owns them (instance, listener, CRS). In a RAC environment this requires the shutdown & restart of the entire CRS stack.

As these special files are required to communicate with the various CRS daemons, it will most likely not be possible to stop (and restart) the CRS stack using the following commands as user root, but it does no harm to try:


11g:
# $ORA_CRS_HOME/bin/crsctl stop crs
# $ORA_CRS_HOME/bin/crsctl start crs

If the above fails to stop the CRS stack, a system reboot is unavoidable.

As for deleting files from temporary directories via a cron job (or otherwise):
the directory '/var/tmp/.oracle' (on some platforms /tmp/.oracle) should be excluded from such jobs/tasks. The files in this directory occupy only a few bytes and generally do not need to be cleaned up.

Please note that the location of the .oracle directory is not configurable, so the only way to avoid such issues is to make sure it is not deleted while the clusterware is up & running.
 
If the temp location must be cleaned to release space, consider deleting only files that meet both of the following criteria:


1. Size. The file must be big enough, for example anything bigger than 5MB.
2. Date. The file must be old enough, for example only files that have not been accessed or modified for more than 30 days.
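
The two criteria, together with the exclusion of the .oracle directory, can be sketched as a find command. This is a hedged example rather than a recommended cleanup script: the function name list_cleanup_candidates is hypothetical, the 5MB and 30-day thresholds are just the examples given above, and the command only lists candidates without deleting anything.

```shell
#!/bin/sh
# Hypothetical helper: list cleanup candidates under directory $1.
#   -name .oracle -prune  skips any directory named .oracle and everything in it
#   -type f               regular files only, so socket files are never listed
#   -size +10240          more than 10240 blocks of 512 bytes, i.e. larger than 5MB
#   -atime +30 -mtime +30 not accessed and not modified for more than 30 days
list_cleanup_candidates() {
    find "$1" -name .oracle -prune -o \
        -type f -size +10240 -atime +30 -mtime +30 -print
}
```

For example, list_cleanup_candidates /var/tmp prints the matching files so that the list can be reviewed before anything is removed.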

Ref: Doc ID 1322234.1 & Doc ID 391790.1
