Wednesday, November 27, 2013

Troubleshooting ASM Startup Issue in 11g Grid Infrastructure



In 11g, when the ASM spfile, OCR or voting disk is placed on an ASM disk group, these components become tightly interlinked. ASM plays a very important role in starting the Grid Infrastructure stack, whether it is single node or multi-node (i.e. RAC).
With single-node Grid Infrastructure, a database that uses the ASM instance has to register itself with CSS.

So when ASM cannot be started for some reason, the question is how to troubleshoot the issue, since everything is tightly integrated. First, let's look at the ASM startup sequence.

When Grid Infrastructure starts, Oracle tries to locate the ASM parameter file, as part of starting CSSDAgent, in the following sequence.


  • First it looks in the GPnP profile for the ASM entry, which holds the disk discovery string (the equivalent of asm_diskstring) and the spfile location.
    • Profile is usually located under <GRID_HOME>/gpnp/profiles/peer
    • file - profile.xml
<orcl:ASM-Profile id="asm" DiscoveryString="" SPFile="+DATA/<host>/asmparameterfile/registry.253.768413123"/>

If the file named in the profile cannot be found, ASM will fail to start, so make sure your profile reflects the correct value.
  • If the profile does not give a value, the next lookup is done in the GRID_HOME/dbs folder; if a parameter file is found there, ASM starts using that pfile.
 Again, the file has to be present there; if it is not, ASM will fail to start because it cannot locate a parameter file. So it is advisable to keep both an spfile and a pfile to save some pain later.
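To see which spfile the profile actually points at, the ASM-Profile attributes can be pulled out of profile.xml with standard text tools. The following is only a minimal sketch: the function name asm_profile_attr is made up for illustration, and it assumes the element and attribute names match the sample line shown above.

```shell
#!/bin/sh
# Hypothetical helper: print one attribute of the <orcl:ASM-Profile> element.
# usage: asm_profile_attr <profile.xml> <attribute-name>
asm_profile_attr() {
    profile=$1
    attr=$2
    # Put every '<' on its own line so a single-line profile.xml still works,
    # keep only the ASM-Profile element, then cut out attr="value".
    tr '<' '\n' < "$profile" |
        sed -n "/^orcl:ASM-Profile[[:space:]]/s/.*[[:space:]]${attr}=\"\([^\"]*\)\".*/\1/p"
}

# Example (GRID_HOME is assumed to be set to the Grid Infrastructure home):
#   asm_profile_attr "$GRID_HOME/gpnp/profiles/peer/profile.xml" SPFile
#   asm_profile_attr "$GRID_HOME/gpnp/profiles/peer/profile.xml" DiscoveryString
```

An empty result for SPFile would suggest that the lookup falls through to the GRID_HOME/dbs folder.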

There is a caveat here: what if the GPnP profile points to the proper file, which exists on ASM, but the file cannot be opened due to corruption? In this case ASM will again fail to start.
The solution to these issues is to start ASM with a transient parameter file as follows (this example is for a 2-node RAC).

Use Case - On a 2-node RAC, only one node is healthy and the other node has a problem with CRS stack startup.

1- Create a new ASM pfile

ora_+ASM1.ora
+ASM1.asm_diskgroups='DATA','FRA'#Manual Mount
+ASM2.asm_diskgroups='DATA','FRA'#Manual Mount
*.asm_diskstring='/dev/oracleasm/disks/*'
*.asm_power_limit=5
*.diagnostic_dest='/u01/app/11.2.0.3/grid/log'
*.instance_type='asm'
*.large_pool_size=12M
*.remote_login_passwordfile='EXCLUSIVE'
2- Start up the ASM instance

on the first node

$ export ORACLE_SID=<asm instance name>
$ export ORACLE_HOME=<full path of the asm home>

$ sqlplus / as sysdba
SQL> startup pfile=<the full pathname of ora_+ASM1.ora>

This will start ASM Instance on node 1.

3- Recreate the spfile


SQL> create spfile='+DATA' from pfile='/u01/app/11.2.0.3/grid/dbs/init+ASM1.ora';
File created.
SQL> sho parameter spfile;
NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
spfile                               string      +DATA/<host>/asmparameterfile
                                                      /registry.253.768413123                             
Note - Make sure you check that your GPnP profile reflects the new spfile location.
4- Restart the OHAS/CRS stack on the node

Connect as root user:


# /u01/app/11.2.0.3/grid/bin/crsctl stop crs
Make sure that all the processes have exited, and then start CRS.
# /u01/app/11.2.0.3/grid/bin/crsctl start crs

After some time, run a health check on the CRS stack to confirm that it started successfully.

# /u01/app/11.2.0.3/grid/bin/crsctl check cluster -all
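Rather than re-running the check by hand, the wait can be scripted as a simple retry loop. This is a rough sketch only: wait_until is a hypothetical helper, and it assumes the command being polled exits with a non-zero status while the stack is still down. If that does not hold on your version, check the command's output instead.

```shell
#!/bin/sh
# Hypothetical helper: retry a command until it succeeds or attempts run out.
# usage: wait_until <attempts> <sleep-seconds> <command> [args...]
wait_until() {
    attempts=$1
    delay=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@" >/dev/null 2>&1; then
            echo "ok after $i attempt(s)"
            return 0
        fi
        # Do not sleep after the final failed attempt.
        [ "$i" -lt "$attempts" ] && sleep "$delay"
        i=$((i + 1))
    done
    echo "gave up after $attempts attempt(s)" >&2
    return 1
}

# Example, as root (30 tries, 10 seconds apart; the numbers are arbitrary):
#   wait_until 30 10 /u01/app/11.2.0.3/grid/bin/crsctl check crs
```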

Thursday, November 21, 2013


11g Grid Start-up Issue Due to Missing Permissions



Recently one of our 11g clusters went down on multiple nodes. Upon checking, we found that the issue was with permissions: the owner of the Grid Home had changed from the "grid" user to the "oracle" user.

Diagnosis - 


grid@/u01/app/11.2.0.3/grid/cdata/>ls -ltr

total 2888

drwxr-xr-x 2 oracle oinstall      4096 Mar 11  2012 localhost
drwxr-xr-x 2 oracle oinstall      4096 Mar 11  2012 hostxxx
drwxrwxr-x 2 oracle oinstall      4096 Nov 15 21:56 devorclrac
-rw------- 1 oracle oinstall     272756736 Nov 21 05:30 hostxxx.olr

The fix for this issue is to relink the Grid Home binaries.
The process is as follows.

grid@/u01/app/11.2.0.3/grid/crs/install/>./rootcrs.pl -unlock -crshome /u01/app/11.2.0.3/grid
You must be logged in as root to run this script.
Log in as root and rerun this script.
2013-11-21 05:48:27: Not running as authorized user
Insufficient privileges to execute this script.
root or administrative privileges needed to run the script.

[root@hostxxx~]# cd /u01/app/11.2.0.3/grid/crs/install/
[root@hostxxx install]# ./rootcrs.pl -unlock -crshome /u01/app/11.2.0.3/grid
Using configuration parameter file: ./crsconfig_params
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'hostxxx'
CRS-2673: Attempting to stop 'ora.drivers.acfs' on 'hostxxx'
CRS-2673: Attempting to stop 'ora.cssdmonitor' on 'hostxxx'
CRS-2677: Stop of 'ora.cssdmonitor' on 'hostxxx' succeeded
CRS-2677: Stop of 'ora.drivers.acfs' on 'hostxxx' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'hostxxx' has completed
CRS-4133: Oracle High Availability Services has been stopped.
Successfully unlock /u01/app/11.2.0.3/grid

However, the relink as the grid user hit the following error -

grid@/u01/app/11.2.0.3/grid/bin/>./relink
./relink: line 164: /u01/app/11.2.0.3/grid/install/current_makeorder.xml: Permission denied
writing relink log to: /u01/app/11.2.0.3/grid/install/relink.log
./relink: line 181: /u01/app/11.2.0.3/grid/install/relink.log: Permission denied
grid@/u01/app/11.2.0.3/grid/bin/>ls -ltr /u01/app/11.2.0.3/grid/install/current_makeorder.xml
ls: /u01/app/11.2.0.3/grid/install/current_makeorder.xml: No such file or directory

So the relink failed again with a permissions issue. The reason is that many binaries/executables under the Grid Home are still owned by the oracle user, so the ownership has to be changed.

Relink Log -
oracle.xml.parser.v2.XMLParseException: Start of root element expected.

The above error in the relink log is completely misleading, so ignore it.

The following are the steps to fix the issue.

Step 1 - Make sure no Grid processes are running / force stop
/u01/app/11.2.0.3/grid/bin/crsctl stop crs -f

Step 2 - Change the ownership of the Grid Home to grid:oinstall (just to make relink work)

[root@hostxxx 11.2.0.3]# chown -R grid:oinstall grid

Step 3 - Relink the Grid Home binaries (make sure the ORACLE_HOME variable is set to the Grid Home and that you run this as the grid unix user)

As the Oracle Grid Infrastructure for a Cluster owner: 

grid@/home/grid/>/u01/app/11.2.0.3/grid/bin/relink
writing relink log to: /u01/app/11.2.0.3/grid/install/relink.log

As root again: 

# cd $Grid_home/rdbms/install/
# ./rootadd_rdbms.sh 


Step 4 - Load the ASMLib driver ( basically start the init service if not already started )

/etc/init.d/oracleasm status

If down,

[root@hostxxx~]# /etc/init.d/oracleasm start
Initializing the Oracle ASMLib driver:                     [  OK  ]
Scanning the system for Oracle ASMLib disks:     [  OK  ]

Make sure ASM can see the devices -

[root@hostxxx~]# /etc/init.d/oracleasm listdisks

Step 5 - Make sure no Grid processes are running / force stop

/u01/app/11.2.0.3/grid/bin/crsctl stop crs -f


Step 6 - Lock the Grid Home, which will also start the CRS stack

[root@hostxxx~]# /u01/app/11.2.0.3/grid/crs/install/rootcrs.pl -patch
Using configuration parameter file: /u01/app/11.2.0.3/grid/crs/install/crsconfig_params
CRS-4123: Oracle High Availability Services has been started.

Step 7 - Perform the health check on the entire cluster

crsctl status resource -t -init
crsctl status resource -t

crsctl check has
crsctl check cluster
crsctl check crs
crsctl check cluster -all

If you check now, your cluster should have started okay.
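The health checks from Step 7 can also be run in one go with a PASS/FAIL summary. This is only a sketch: run_checks is a hypothetical helper, and it assumes each check command exits with a non-zero status when something is wrong, which you should confirm on your version before relying on it.

```shell
#!/bin/sh
# Hypothetical helper: run each command line read from stdin and report
# PASS or FAIL per line. Returns 0 only if every check passed.
run_checks() {
    failed=0
    while IFS= read -r check; do
        [ -z "$check" ] && continue
        # </dev/null keeps the child from consuming the list of checks.
        if sh -c "$check" </dev/null >/dev/null 2>&1; then
            echo "PASS  $check"
        else
            echo "FAIL  $check"
            failed=$((failed + 1))
        fi
    done
    [ "$failed" -eq 0 ]
}

# Example, with the Grid Home bin directory on PATH:
#   run_checks <<'EOF'
#   crsctl check has
#   crsctl check crs
#   crsctl check cluster -all
#   EOF
```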