Tuesday, December 31, 2013


Apply PSU 9 on OEM 11.1.0.1 Base Release


While running OEM 11.1.0.1, we ran into few issues over a period of time. With 12c Cloud control out for a while, we will considering the upgrade of the existing OEM 11g. Looking at the complexity of 12c upgrade and Oracle suggesting 2 cycles of test, production upgrade still a long shot. 
Mean while, the option we had is to consider the application of PSU on top of base release of OEM 11g. 
Looking around we figured out the latest PSU available was PSU 9 aka 11.1.0.1.9 patch . 

Following is the outline should one consider the application of PSU patch. 

If you do not have the latest version of OPatch, then download it from patch# 6880880 for 11.1.0.0.0 release.

Pre-requisite Patch 12620174

This patch is an auto-update patch. That means that any new install of OMS will have it by default, you dont need to apply if manually. Though, for old installations, one has to apply it manually. 

This patch is called as generic patch i.e. it can be applied on top of 11.1 release. The purpose of this patch is to enhance the patching process for OMS. If you remember, earlier patches used to have two post patch SQL files, to introduce the SQL changes. Post application of this patch, you will only need one. This will reduce the time as well as the user errors, As there are many who forgot to apply the second file due similar sounding names. 
Another important thing is that, this patch only affects how the SQL changes rolled out by applying patch, it doesn't affect how java changes are rolled out during patch. 

Steps to apply Patch 12620174

1. Perquisites: Make sure opatch,unzip are in path. Also ensure you are using latest OPatch version.

2. Copy the patch to the server and unzip

$ unzip p12620174_111010_Generic.zip

3. Shut down services running from the ORACLE_HOME.

$ emctl stop oms
Oracle Enterprise Manager 11g Release 1 Grid Control
Copyright (c) 1996, 2010 Oracle Corporation. All rights reserved.
Stopping WebTier…
WebTier Successfully Stopped
Stopping Oracle Management Server…
Oracle Management Server Successfully Stopped
Oracle Management Server is Down

$ $AGENT_HOME/bin/emctl stop agent
Oracle Enterprise Manager 11g Release 1 Grid Control 11.1.0.1.0
Copyright (c) 1996, 2010 Oracle Corporation. All rights reserved.
Stopping agent … stopped.

4. Set your current directory to the directory where the patch is located:

$ cd 12620174
$ opatch napply
OPatch succeeded.

5. Connect to rcuJDBCEngine as SYS and execute the following sql file. Please make sure you set $ORACLE_HOME to OMS_HOME before you connect to rcuJDBCEngine.


$ORACLE_HOME/bin/rcuJDBCEngine sys/welcome1@myhost.myorg.com:1521:sid JDBC_SCRIPT 10154264/patch_10154264.sql $PWD $ORACLE_HOME


Completed SQL script execution normally.
1 scripts were processed

6. Start OMS using the following command. In case of multi-OMS environment, start on all OMS machines

$ emctl start oms
Oracle Enterprise Manager 11g Release 1 Grid Control
Copyright (c) 1996, 2010 Oracle Corporation. All rights reserved.
Starting WebTier…
WebTier Successfully Started
Starting Oracle Management Server…
Oracle Management Server Successfully Started
Oracle Management Server is Up

Steps to apply 16572176

This is the actual PSU that will roll out the changes to OMS (java+SQL)

1. Perquisites: Make sure opatch,unzip are in path. Also ensure you are using latest OPatch version.
Set ORACLE_HOME to OMS home

2. Ensure that the PSU does not conflict with the already-installed one-off patches. To do so, run the following command to generate a report that lists all conflicting patches.

$ opatch prereq CheckConflictAgainstOHWithDetail -phBaseDir ./16572176
Invoking OPatch 11.2.0.1.1

Oracle Interim Patch Installer version 11.2.0.1.1
Copyright (c) 2009, Oracle Corporation. All rights reserved.

PREREQ session
Oracle Home : /u01/app/oracle/Middleware/oms11g
Central Inventory : /u01/app/oraInventory
from : /etc/oraInst.loc
OPatch version : 11.2.0.1.1
OUI version : 11.1.0.8.0
OUI location : /u01/app/oracle/Middleware/oms11g/oui
Log file location : /u01/app/oracle/Middleware/oms11g/cfgtoollogs/opatch/opatch2011-08-04_11-06-55AM.log

Patch history file: /u01/app/oracle/Middleware/oms11g/cfgtoollogs/opatch/opatch_history.txt
OPatch detects the Middleware Home as “/u01/app/oracle/Middleware”
Invoking prereq “checkconflictagainstohwithdetail”
Prereq “checkConflictAgainstOHWithDetail” passed.
OPatch succeeded.

Note: If you do see any conflicting patches refer README.txt

3. Stop all om services

$<ORACLE_HOME>/bin/emctl stop oms -all
Oracle Enterprise Manager 11g Release 1 Grid Control
Copyright (c) 1996, 2010 Oracle Corporation. All rights reserved.
Stopping WebTier…
WebTier Successfully Stopped
Stopping Oracle Management Server…
Oracle Management Server Successfully Stopped
Oracle Management Server is Down

4. Download and Unzip the p16572176_111010_Generic.zip and cd to 16572176

$cd 16572176
$ opatch apply

Invoking OPatch 11.2.0.1.1
Oracle Interim Patch Installer version 11.2.0.1.1
Copyright (c) 2009, Oracle Corporation. All rights reserved.
OPatch succeeded.

5. Connect to rcuJDBCEngine as SYSMAN and run the apply.sql script as follows:

$ /u01/app/oracle/Middleware/oms11g/bin/rcuJDBCEngine sysman@Host:1521:GCREPO JDBC_SCRIPT apply.sql $PWD $ORACLE_HOME

###### SQL Patching operation has started. The Pre-requisites check ######
###### may take upto 3 minutes. Please do not cancel the operation. ######
###### Refer to My Oracle Support note 1326515.1 for more information ######
----------------
Completed SQL script execution normally.
41 scripts were processed

6. Start OMS

$ emctl start oms
Oracle Enterprise Manager 11g Release 1 Grid Control
Copyright (c) 1996, 2010 Oracle Corporation. All rights reserved.
Starting WebTier…
WebTier Successfully Started
Starting Oracle Management Server…
Oracle Management Server Successfully Started
Oracle Management Server is Up


If you have installed emcli, then run the following command on all the emcli installations:
$ emcli sync

Once the PSU applied on OMS home, you may need to apply patch on Agent Home as well. At the time of writing of this article, the PSU for agent home is not yet available. The latest patch is available for Agent Home is PSU 7.

Patch 9346282 - 11.1.0.1.7 Patch Set Update for Oracle Management Agent. Pls refer following table (as publish in Doc ID - 1358092.1



11.1.0.1
OMSAGENT UnixAGENT Windows
PSU1 (11.1.0.1.1)10065631--
PSU2 (11.1.0.1.2)102700731027360710273607
PSU3 (11.1.0.1.3)11727299934590611778791
PSU4 (11.1.0.1.4)12423703934591312423714
PSU5 (11.1.0.1.5)12833678934592112833724
PSU6  (11.1.0.1.6)13248190934624313248202
PSU7 (11.1.0.1.7)13711705934628213711732
PSU8 (11.1.0.1.8)14766609--
PSU9 (11.1.0.1.9)16572176--
PSU10 (11.1.0.1.10)17154155--

Wednesday, November 27, 2013

Troubleshooting ASM Startup Issue in 11g Grid Infrastructure



In 11g when one places the ASM spfile, OCR or VoteDisk on ASM diskgroup, thus exists a tight interlink between these components. ASM plays very important role when starting grid infrastructure stack, let it single node or multi node i.e. RAC. 
With Single Node Grid Infrastructure, the database using ASM instance has to register it self with CSS. 

So when one cannot start ASM for some reason, the question will be how to trouble shoot the issue since everything is tightly integrated to each other. So first lets look at ASM startup sequence. 

When Grid Infrastructure starts,  Oracle will try to locate the ASM parameter file, to start CSSDAgent in following sequence. 


  • First it will look into GPNP Profile to find the parameter with name "asmdiskstring" 
    • Profile is usually located under <GRID_HOME>/gpnp/profiles/peer
    • file - profile.xml
<orcl:ASM-Profile id="asm" DiscoveryString="" SPFile="+DATA/<host>/asmparameterfile  /registry.253.768413123"/> .

Here the issue can be if the above mentioned file is not found then ASM will fail to start. So make sure your profile reflects the correct value.
  • If the above step doesn't reflect any value then, the next look up will be done in GRID_HOME/dbs folder and if located the file will start using pfile found. 
 Again, the file has to be present here and if not the ASM will fail to start as it cannot locate the parameter file. So it will be advisable to have both spfile and pfile to save some pain later. 

There is a caveat here. what if  the gpnp profile reflect the the proper file which exists on ASM but cannot be opened due to corruption. In this case again ASM will fail to start.
Solution to these issue will be to start ASM with transient parameter file as follows ( this is for 2 node RAC )

Use Case - On 2 node RAC only one node is healthy and another node is having problem with CRS stack start up. 

1- Create a new ASM pfile

ora_+ASM1.ora
+ASM1.asm_diskgroups='DATA','FRA'#Manual Mount
+ASM2.asm_diskgroups='DATA','FRA'#Manual Mount
*.asm_diskstring='/dev/oracleasm/disks/*'
*.asm_power_limit=5
*.diagnostic_dest='//u01/app/11.2.0.3/grid/log'
*.instance_type='asm'
*.large_pool_size=12M
*.remote_login_passwordfile='EXCLUSIVE'
2- Start up the ASM instance

on the first node

$ export $ORACLE_SID=<asm instance name>
$ export $ORACLE_HOME=<full path of the asm home>

$ sqlplus / as sysdba
sql> startup pfile=<the full pathname of ora_+ASM1.ora>

This will start ASM Instance on node 1.

3) Recreate the spfile


SQL> create spfile='+DATA' from pfile='/u01/app/11.2.0.3/grid/dbs/init+ASM1.ora';
File created.
SQL> sho parameter spfile;
NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
spfile                               string      +DATA/<host>/asmparameterfile
                                                      /registry.253.768413123                             
Note - Pls make sure that you check your gpnp profile reflecting its new value. 
4) Restarted the OHAS/CRS stack on node

Connect as root user:


# /u01/app/11.2.0.3/grid/bin/crsctl stop crs
Make sure that all the processes are exited and then start crs
# /u01/app/11.2.0.3/grid/bin/crsctl start crs

After some time do the health check on crs stack that it is started successfully. 

# /u01/app/11.2.0.3/grid/bin/crsctl check cluster -all

Thursday, November 21, 2013


11g Grid Start-up Issue Due to Missing Permissions



Recently one of our 11g cluster went down on multiple nodes. Upon checking we figured out the issue was with permissions as the owner of the GRID Home changed from "grid" user to "oracle" user. 

Diagnosis - 


grid@/u01/app/11.2.0.3/grid/cdata/>ls -ltr

total 2888

drwxr-xr-x 2 oracle oinstall      4096 Mar 11  2012 localhost
drwxr-xr-x 2 oracle oinstall      4096 Mar 11  2012 hostxxx
drwxrwxr-x 2 oracle oinstall      4096 Nov 15 21:56 devorclrac
-rw------- 1 oracle oinstall     272756736 Nov 21 05:30 hostxxx.olr

So the solution to fix this issue is to re-link Grid Home binaries.
Following is the process to do that.

grid@/u01/app/11.2.0.3/grid/crs/install/>./rootcrs.pl -unlock -crshome /u01/app/11.2.0.3/grid
You must be logged in as root to run this script.
Log in as root and rerun this script.
2013-11-21 05:48:27: Not running as authorized user
Insufficient privileges to execute this script.
root or administrative privileges needed to run the script.

[root@hostxxx~]# cd /u01/app/11.2.0.3/grid/crs/install/
[root@hostxxxinstall]# ./rootcrs.pl -unlock -crshome /u01/app/11.2.0.3/grid
Using configuration parameter file: ./crsconfig_params
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'hostxxx'
CRS-2673: Attempting to stop 'ora.drivers.acfs' on 'hostxxx'
CRS-2673: Attempting to stop 'ora.cssdmonitor' on 'hostxxx'
CRS-2677: Stop of 'ora.cssdmonitor' on 'hostxxx' succeeded
CRS-2677: Stop of 'ora.drivers.acfs' on 'hostxxx' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'hostxxx' has completed
CRS-4133: Oracle High Availability Services has been stopped.
Successfully unlock /u01/app/11.2.0.3/grid

However during relink as Grid user, hit following error  -

grid@/u01/app/11.2.0.3/grid/bin/>./relink
./relink: line 164: /u01/app/11.2.0.3/grid/install/current_makeorder.xml: Permission denied
writing relink log to: /u01/app/11.2.0.3/grid/install/relink.log
./relink: line 181: /u01/app/11.2.0.3/grid/install/relink.log: Permission denied
grid@/u01/app/11.2.0.3/grid/bin/>ls -ltr /u01/app/11.2.0.3/grid/install/current_makeorder.xml
ls: /u01/app/11.2.0.3/grid/install/current_makeorder.xml: No such file or directory

So the relink failed again with permissions issue. The reason being is that there are lots of binaries/executable under GRID home which are still owned by Oracle user. hence you need to change that.

Relink Log -
oracle.xml.parser.v2.XMLParseException: Start of root element expected.

The above error in relink log is completely mis-leading. so ignore it.

So following are the steps to fix the issue. 

Step 1 - Make sure no Grid processes are running / force stop
/u01/app/11.2.0.3/grid/bin/crsctl stop crs -f

Step 2 - Changed the permissions of GRID HOME to grid:oinstall  ( just to make relink work )

[root@hostxxx11.2.0.3]# chown -R grid:oinstall grid

Step 3 - relink grid home binaries ( Make sure variable ORACLE_HOME is set to Grid Home and you are running this as grid unix user )

As the Oracle Grid Infrastructure for a Cluster owner: 

grid@/home/grid/>/u01/app/11.2.0.3/grid/bin/relink
writing relink log to: /u01/app/11.2.0.3/grid/install/relink.log

As root again: 

# cd $Grid_home/rdbms/install/
# ./rootadd_rdbms.sh 


Step 4 - Load the ASMLib driver ( basically start the init service if not already started )

/etc/init.d/oracleasm status

If down,

[root@hostxxx~]# /etc/init.d/oracleasm start
Initializing the Oracle ASMLib driver:                     [  OK  ]
Scanning the system for Oracle ASMLib disks:     [  OK  ]

Make sure ASM can see the devices -

[root@hostxxx~]# /etc/init.d/oracleasm listdisks

Step 5 - Make sure no Grid processes are running / force stop

/u01/app/11.2.0.3/grid/bin/crsctl stop crs -f


Step 6 - Lock Grid Home...and that will also start the CRS stack

[root@hostxxx~]# /u01/app/11.2.0.3/grid/crs/install/rootcrs.pl -patch
Using configuration parameter file: /u01/app/11.2.0.3/grid/crs/install/crsconfig_params
CRS-4123: Oracle High Availability Services has been started.

Step 7 – Perform the Health Check on entire cluster

crsctl status resource –t –init
crsctl status resource –t

crsctl check has
crsctl check cluster
crsctl check crs
crsctl check cluster -all

Now if you check your cluster will be started okay... 

Tuesday, September 17, 2013

ADRCI Purge Command Fails With DIA-48322




When running adrci from Grid home for removal or purge of the old data when we ran purge command, we hit following. 

adrci> purge
DIA-48322: Relation [INCIDENT] of ADR V[2] incompatible with V[2] tool
DIA-48210: Relation Not Found
DIA-48166: error with opening ADR block file because file does not exist 


This happened when trying to purge the listener Home. 

The reason being is that this home does not contain any activity to be purged:

adrci> show incident

0 rows fetched

However this is equally applicable to other homes as well and only restricted to listener home. 


These errors are raised when you run a purge command in the listener/other ADR Home without incidents to purge. So if may seem like bug but actually not a bug as this is an expected behavior in this situation as there is nothing to purge. 

Thursday, August 29, 2013


 Instance Caging

Recently on one of our production cluster running 8 nodes and having shared environment with multiple applications, ran into high CPU usage. Since this is shared environment, running few databases, its always difficult to pinpoint which is culprit. Hence after more than couple of incidents we were able to point out which one is actually causing mayhem. Also the pattern of problem is not predictable as application is used across globe. But since hang was not momentarily it really affects all other candidates as well. This is serious problem in shared environment as one wants to save cost on hardware and software, which is quite under stable. However, this also ensures the optimum usage of the capacity as not all the databases are always used up to the capacity (again, case to case).

So the question is how do we avoid such an instance and still provide the optimum usage of resources ? 

11g, introduces one new feature called "instance caging", where you can have more than one instance on the same server, and you can share the CPUs reported by the operating system

-- To find out total cpu and cores available to your system
SQL> select cpu_count_current,cpu_core_count_current from v$license;

System with 12 physical cpu cores


CPU_COUNT_CURRENT CPU_CORE_COUNT_CURRENT

----------------- ----------------------

               12                     12


System with 6 physical cpu  and 12 cores, Hyper Threaded


CPU_COUNT_CURRENT CPU_CORE_COUNT_CURRENT

----------------- ----------------------

               12                     6


One can use following query to find out the CPU SOCKET and CPU CORES available to database 

    select * from v$osstat where stat_name like '%CPU%'





In Hyper threaded Servers, the output will be like following.


 
Now one needs to determine how the database instances on your server will share the CPU.  With Instance Caging, each instance's cpu_count specifies the maximum number of CPUs you want it to use at any time for that particular instance. The sum of the cpu_counts across all database instances determines the amount of isolation between the database instances and the efficiency of the server.

For maximum isolation one can use  "partition" approach. With the partition approach, the sum of the cpu_counts is less than or equal to the number of CPUs. 

For example, suppose the total number of CPUs (i.e. CPU threads/cores) is 16.  Using the partition approach, we could set cpu_count=8 for database A, cpu_count=4 for database B, and cpu_count=4 for database C.  The sum of the cpu_counts is 16, which equals the number of CPUs.

The disadvantage of the partition approach is that any CPU unused by one database instance cannot be used by another. Therefore, for non-critical databases where you also want to achieve better CPU utilization efficiency, use the "over-subscribe" approach. With the over-subscribe approach, the sum of the cpu_counts is less than or equal to 3x the number of CPUs.

For example, for a server with 16 CPUs, you could use the over-subscribe approach and set cpu_count=8 for database A, cpu_count=8 for database B, and cpu_count=8 for database C.  The sum of the cpu_counts is 24, which is greater than the number of CPUs.  Therefore, if all databases are using their full CPU allocation, there will be some CPU contention.

Enable Instance Caging – 

-- Set the cpu_count initialization parameter.
-- This is a dynamic parameter, and can be set with the following statement:

SQL> ALTER SYSTEM SET CPU_COUNT = 4;

To verify that Instance Caging is enabled, check that "instance_caging" equals "ON" and that "cpu_count" is set appropriately.

SQL> select instance_caging from v$rsrc_plan where is_top_plan = 'TRUE';

INS

---

ON

Once this is enabled,  The CPU spikes will be limited to only specified number of cpu's and will be not be hampering other databases.



Wednesday, August 28, 2013

Identify The Top Temp Consumers In DataBase

First of all one needs to identify the sessions using temp space. The view reflecting this info is v$tempseg_usage. 

One imp note is that v$tempseg_usage is same as v_$sort_area_usage view

SELECT s.sid, s.username, u.tablespace , s.sql_hash_value, u.sqlhash hash_value, u.segtype, u.contents, u.blocks
FROM gv$session s, gv$tempseg_usage u
WHERE s.saddr=u.session_addr
order by u.blocks desc;











The main segment type one needs to check is SORT segment as that is what we are after. 
Now we have session that is causing strain on our database sort area we need to check the statement causing all the mayhem. If you are in luck there might be only open cursor against this session. However, in heavily used production databases there will be multiple open cursors against one session. 
So the view you need to consult is v$open_cursor. We also need to identify the number of child cursor for this parent cursor as well as text. Following is the way to do it..


select sql_id, sorts ,  sql_text from gv$sql 
where hash_value in 
(select hash_value from gv$open_cursor where sid=1920)
order by 2 desc;




As you can see there are possible two possible sql that are using temp at the moment. Now you have identified the possible culprits you can investigate further. 


Following Script will tell you how much free space you have available in Temp Tablespace.
SQL> select b.tablespace_name, sum(b.bytes_used/1024/1024) used_mb,
     sum(bytes_free/1024/1024) free_mb
     from v$temp_space_header b
     group by b.tablespace_name;

Thursday, August 1, 2013

NET-BACKUP Configuration For Oracle



If the oracle_link script is executed, it will display a message similar to the following and backups of the Oracle instance will not utilize the NetBackup implementation of the Oracle SBT API.

Following is the work around to fix the issue. This is linux/Unix specific. For AIX one has to do modifications as per environment. 

On the client host, manually create the needed symbolic link between the Oracle installation and the NetBackup installation.

1)     Confirm that the NetBackup shared libraries are present on the client host and of the correct version.

$ cd /usr/openv/netbackup/
$ ./bin/goodies/support/versioninfo –f bin/libobk.so
======= bin/libobk.so =======
@(#) NetBackup_7.1 1234567890

2)     Check for an existing libobk being used by Oracle and rename if present.

$ cd $ORACLE_HOME/lib
$ ls libobk.so
libobk.so
$ mv libobk.so libobk.so.orig

3)     Create the symbolic link to the NetBackup shared library.

$ ln –s /usr/openv/netbackup/bin/libobk.so libobk.so
  
 An updated oracle_link script is included in NetBackup release update 7.1.0.4 and release 7.5.


Error During OEM Agent Install On RAC with XML-20108: (Fatal Error) Start of root element expected. Completed with Status=102


During install of an agent on RAC environment we hit following error.



ERROR: <Line 1, Column 1>: XML-20108: (Fatal Error) Start of root element expected.

Completed with Status=102



Looked at MOS but that didn’t gave up any clues and has to look around a bit. Eventually when trying to look at installed products on one of the nodes in cluster, I found that entire inventory.xml file is missing.

So that might the cause of problem here.

Though I was not sure whether the problem is due to missing file so decided to go step by step. First thing first, I need an inventory.xml.



Now the standard way is to create inventory is to run runInstaller executable with attachHome option. However, I decided to try another work around, much simpler than earlier one.  The reason being that, this is functional node and cannot do any changes to current configuration without change approval.  So I decided to copy the inventory.xml from another node to the current node(assuming that both have ideal ORACLE products installed as they are part of same cluster).



Following is the snippet of the XML file copied.



<?xml version="1.0" standalone="yes" ?>

<!-- Copyright (c) 1999, 2010, Oracle. All rights reserved. -->

<!-- Do not modify the contents of this file by hand. -->

<INVENTORY>

<COMPOSITEHOME_LIST>

</COMPOSITEHOME_LIST>

<VERSION_INFO>

   <SAVED_WITH>11.1.0.8.0</SAVED_WITH>

   <MINIMUM_VER>2.1.0.6.0</MINIMUM_VER>

</VERSION_INFO>

<HOME_LIST>

<HOME NAME="Ora11g_gridinfrahome1" LOC="/u01/app/11.2.0/grid" TYPE="O" IDX="1" CRS="true">

   <NODE_LIST>

      <NODE NAME="xxxxxxxxx01"/>

      <NODE NAME=" xxxxxxxxx02"/>

   </NODE_LIST>

</HOME>

<HOME NAME="OraDb11g_home1" LOC="/oracle/rdbms/11203" TYPE="O" IDX="2">

   <NODE_LIST>

      <NODE NAME=" xxxxxxxxx01"/>

      <NODE NAME=" xxxxxxxxx02"/>

   </NODE_LIST>

</HOME>

</HOME_LIST>

</INVENTORY>



Now with inventory.xml file in place when I tried installation, it worked.



Lesson learned –

Inventory.xml file is needed as any new product installed will add an entry in that file hence missing file may cause failures…