Monday, November 27, 2017


Unable To Start DSE Post 5.1 Upgrade

After upgrading our DSE servers running Spark from version 5.0.8 to 5.1.0, DSE refused to start with the following error.


ERROR [main] 2017-11-20 14:33:58,931  DseDaemon.java:494 - Unable to start DSE server.
com.datastax.bdp.plugin.PluginManager$PluginActivationException: Unable to activate plugin com.datastax.bdp.plugin.DseFsPlugin

Caused by: org.apache.cassandra.exceptions.ConfigurationException: DSEFS does not support authentication method configured with org.apache.cassandra.auth.PasswordAuthenticator. DSEFS supports INTERNAL, LDAP and DIGEST authentication schemes configured with DseAuthenticator.

As you can see, the default authenticator and authorizer have been deprecated, and cassandra.yaml must now point to the following.

authenticator: com.datastax.bdp.cassandra.auth.DseAuthenticator
authorizer: com.datastax.bdp.cassandra.auth.DseAuthorizer
role_manager: com.datastax.bdp.cassandra.auth.DseRoleManager

In the dse.yaml file, configure the corresponding options:

Configure the DSE Authenticator by uncommenting the authentication_options and changing the settings.
FROM -
# authentication_options:
#     enabled: false
#     default_scheme: internal

TO -
authentication_options:
    enabled: true
    default_scheme: internal


Configure the DSE Role Manager by uncommenting role_management_options and setting the mode
role_management_options:
     mode: internal

Configure the DSE Authorizer by uncommenting the authorization_options and changing the settings.
 authorization_options:
     enabled: true
#     transitional_mode: normal
#     allow_row_level_security: false
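
To sanity-check the edits, you can grep dse.yaml for the three option blocks; this is only a sketch, and the path below assumes a package install (tarball installs keep dse.yaml under resources/dse/conf).

# Confirm the three option blocks are present and enabled in dse.yaml
grep -A 3 -E 'authentication_options|role_management_options|authorization_options' /etc/dse/dse.yaml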


Once done, run nodetool upgradesstables if you have not already run it. 

[root@hqidlinfdb36 cassandra]# nodetool  upgradesstables
WARN  14:50:54,675 Small cdc volume detected at /var/lib/cassandra/cdc_raw; setting cdc_total_space_in_mb to 243.  You can override this in cassandra.yaml
WARN  14:50:54,681 memtable_cleanup_threshold has been deprecated and should be removed from cassandra.yaml


You must configure the replication factors appropriate for using DSE Security in production environments. The keyspaces that require an increased replication factor are:
  • system_auth
  • dse_security

Change the system_auth keyspace RF:
ALTER KEYSPACE system_auth
    WITH REPLICATION= {'class' : 'NetworkTopologyStrategy',
                       'data_center_name' : N,
                       'data_center_name' : N};

Example - ALTER KEYSPACE system_auth WITH REPLICATION =
  {'class' : 'NetworkTopologyStrategy', 'dc1' : 3, 'dc2' : 2};


Change the dse_security keyspace RF:

ALTER KEYSPACE "dse_security"
   WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'dc1' : 3, 'dc2' : 2};
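
Before running the repairs below, you can confirm that the new replication settings took effect from any node. The cassandra/cassandra login here is just an example superuser; use whatever credentials apply in your cluster.

# Confirm the updated replication settings for the security keyspaces
cqlsh -u cassandra -p cassandra -e "DESCRIBE KEYSPACE system_auth;"
cqlsh -u cassandra -p cassandra -e "DESCRIBE KEYSPACE dse_security;"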

Run the nodetool repair on the security keyspaces.
nodetool repair --full system_auth
nodetool repair --full dse_security

After changing the replication strategy, you must run nodetool repair with the --full option. When running the full repair, you may see the following warning, which means you also have to modify the RF for the dse_leases keyspace. 

insufficient replication (you created a new DC and didn't ALTER KEYSPACE dse_leases) and the duration (30000) being different (you have to disable/delete/recreate the lease to change the duration).
java.io.IOException: No live replicas for lease Leader/master/5.1.dc1 in table dse_leases.leases Nodes [/10.22.102.27] are all down/still starting.

ALTER  KEYSPACE dse_leases WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': '3'}  AND durable_writes = true;
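
A minimal way to apply that change and make the new replicas consistent, again assuming an internal superuser login for cqlsh, is:

# Apply the dse_leases replication change, then repair the keyspace
cqlsh -u cassandra -p cassandra -e "ALTER KEYSPACE dse_leases WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': '3'} AND durable_writes = true;"
nodetool repair --full dse_leases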

Once this is done, you are now running successfully on version 5.1. 


Wednesday, November 8, 2017

DB Does Not Open Post FlashBack ORA-01139: RESETLOGS option only valid after an incomplete database recovery


During a flashback activity on one of our RAC databases on an Exadata system, we faced a challenge while opening the DB post flashback. 
We tried to open the DB with RESETLOGS on a single instance and it would not open. 

alter database open resetlogs 

ERROR at line 1: 
ORA-38760: This database instance failed to turn on flashback database 

We then tried turning flashback off and opening the database, but after turning flashback off the following was the status. 

alter database flashback off; 
SYS@DBS01P>select flashback_on from v$database; 

FLASHBACK_ON 
------------------ 
RESTORE POINT ONLY 

SYS@DBS01P>select open_mode from v$database; 

OPEN_MODE 
-------------------- 
MOUNTED 

alter database open resetlogs 

ERROR at line 1: 
ORA-01139: RESETLOGS option only valid after an incomplete database recovery 

ERROR at line 1: 
ORA-00600: internal error code, arguments: [kcvcrv_fb_inc_mismatch], [46], [525459024], [959427550], [45], [4004329076], [958125589], [], [], [], [], [] 


Upon research we found that the issue was locks on the datafiles and controlfile: after the flashback database operation, the locks on the file headers and datafiles did not clear, which caused the failure while starting the DB. 

Hence we needed to recreate the controlfile from trace with NORESETLOGS, as the resetlogs had already been done. 

We tried doing that, but we faced an error stating that the DB must be open in read/write mode, which was not possible for us. So another way is to restore the controlfile to a temporary location and use it to generate the controlfile creation script. 

1. create pfile from current spfile. 
create pfile='/tmp/DBS01P_new.ora' from spfile;

2. Open pfile and modify controlfile location. 
control_files='/tmp/DBS01P_new.ctl'

3. Mount the instance with new pfile 
startup nomount pfile='/tmp/DBS01P_new.ora'

4. Restore controlfile from backup and create trace file 

Backup Piece -  Piece Name: /ora/DBS01P/backup01/c-2841507472-20171106-01
restore controlfile to '/tmp/DBS01P_new.ctl' from '/ora/DBS01P/backup01/c-2841507472-20171106-01'
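
If run interactively, step 4 would be an RMAN session along these lines (the backup piece path is the one from this example):

# Restore the controlfile to the temporary location from the backup piece
rman target / <<'EOF'
restore controlfile to '/tmp/DBS01P_new.ctl' from '/ora/DBS01P/backup01/c-2841507472-20171106-01';
EOF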

5. Create trace from current controlfile.
alter database backup controlfile to trace

6. Open the trace file and just copy the NORESETLOGS Command as follows to new file. 

CREATE CONTROLFILE REUSE DATABASE "DBS01P" NORESETLOGS FORCE LOGGING ARCHIVELOG
MAXLOGFILES 192
MAXLOGMEMBERS 3
MAXDATAFILES 1024
MAXINSTANCES 32
MAXLOGHISTORY 4192
LOGFILE
  GROUP 1 '+DATAC1/DBS01P/onlinelog/group_1.296.907412313'  SIZE 4096M BLOCKSIZE 512,
  GROUP 2 '+DATAC1/DBS01P/onlinelog/group_2.295.907412287'  SIZE 4096M BLOCKSIZE 512,
  GROUP 3 '+DATAC1/DBS01P/onlinelog/group_3.277.907412331'  SIZE 4096M BLOCKSIZE 512,
  GROUP 4 '+DATAC1/DBS01P/onlinelog/group_4.313.907412375'  SIZE 4096M BLOCKSIZE 512,
  GROUP 5 '+DATAC1/DBS01P/onlinelog/group_5.287.907412391'  SIZE 4096M BLOCKSIZE 512,
  GROUP 6 '+DATAC1/DBS01P/onlinelog/group_6.278.907412419'  SIZE 4096M BLOCKSIZE 512,
  GROUP 7 '+DATAC1/DBS01P/onlinelog/group_7.289.907412449'  SIZE 4096M BLOCKSIZE 512,
  GROUP 8 '+DATAC1/DBS01P/onlinelog/group_8.288.907412459'  SIZE 4096M BLOCKSIZE 512
-- STANDBY LOGFILE
DATAFILE
  '+DATAC1/DBS01P/datafile/system.1405.904054341',
  '+DATAC1/DBS01P/datafile/sysaux.1403.904054339',
  .............
  '+DATAC1/DBS01P/datafile/dpa_tbs.9286.950976195'
CHARACTER SET AL32UTF8
;

CREATE CONTROLFILE REUSE DATABASE "DBS01P" NORESETLOGS FORCE LOGGING ARCHIVELOG
*
ERROR at line 1:
ORA-01503: CREATE CONTROLFILE failed
ORA-12720: operation requires database is in EXCLUSIVE mode

## This activity needs the DB to be started in non-cluster (exclusive) mode. 

SYS@DBS01P1>alter system set cluster_database=false scope=spfile sid='*';
System altered.

SYS@DBS01P1>shut immediate;
ORA-01507: database not mounted
ORACLE instance shut down.

SYS@DBS01P1>startup nomount
ORA-32004: obsolete or deprecated parameter(s) specified for RDBMS instance
ORACLE instance started.

Total System Global Area 1.0262E+11 bytes

Fixed Size                  2265224 bytes
Variable Size            1.7985E+10 bytes
Database Buffers         8.4289E+10 bytes
Redo Buffers              342851584 bytes

SYS@DBS01P1>sho parameter cluster

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
cluster_database                     boolean     FALSE
cluster_database_instances           integer     1

SYS@DBS01P1> @create_ctl
;
Control file created.

SYS@DBS01P1>recover database;
ORA-00283: recovery session canceled due to errors
ORA-00264: no recovery required

SYS@DBS01P1>alter database open ;
Database altered.

SYS@DBS01P1>select  open_mode from v$database;
OPEN_MODE
--------------------
READ WRITE

Now one can put the DB in cluster mode and start the other instances. 
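
For completeness, switching back might look like the following sketch; the database name DBS01P is taken from this example, and srvctl is used to start all instances through Clusterware.

# Re-enable cluster mode and restart all instances
sqlplus / as sysdba <<'EOF'
alter system set cluster_database=true scope=spfile sid='*';
shutdown immediate
EOF
srvctl start database -d DBS01P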
Hope this will help you to overcome any controlfile corruption issues. 

Thursday, August 17, 2017

Golden Gate Extract/Pump/Replicat Shows "RUNNING" when it is Down


After a UNIX system crashes and is restarted, a status check of the GoldenGate group (process) may show "RUNNING", but the process really is not running.
This can happen for Extract, an Extract data pump, or Replicat.

GoldenGate considers a group as "RUNNING" when its checkpoint file is locked. In a mounted file system (such as NFS or NAS), an operating system bug prevents the locks on the files from being released after a system crash. Because the checkpoint file is locked, the GoldenGate process is considered "RUNNING" when checked through GGSCI.

In a normal situation, the process should be accessing the .pce file (if Extract) or .pcr file (if Replicat) in the dirpcs directory AND the .cpe (Extract) or .cpr (Replicat) file in the dirchk directory (in the GoldenGate installation directory).
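
One quick way to see whether anything still holds those files open is to check from the OS. This is only a sketch: the GoldenGate home path is a placeholder, and RQOG025 is one of the replicat groups from the example below.

# See whether any process still has the replicat status/checkpoint files open
cd /path/to/gg_home            # hypothetical GoldenGate installation directory
lsof dirpcs/RQOG025.pcr dirchk/RQOG025.cpr 2>/dev/null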

When I checked, I saw the following status. 

Program     Status      Group       Lag at Chkpt  Time Since Chkpt

MANAGER     STOPPED
JAGENT      STOPPED
REPLICAT    RUNNING     RQOG025     00:00:00      175:17:57
REPLICAT    RUNNING     RQOG035     00:00:00      175:17:56
REPLICAT    STOPPED     RQOG125     00:00:00      03:08:05
REPLICAT    RUNNING     RQOG135     00:00:00      175:17:48
REPLICAT    RUNNING     RQOG225     00:00:00      175:17:51
REPLICAT    RUNNING     RQOG235     00:00:00      175:17:46

Try to find out the PID for the process that is being shown as running. In my case the OS didn't return any PID, so effectively no process was running. 

ps -ef | grep RQOG025

Check the dirpcs directory to find out whether the processes have uncleared checkpoint status files. 

[ggsadmin 12.2.0]$cd dirpcs/
MGR.pcm      RQOG025.pcr  RQOG035.pcr  RQOG135.pcr  RQOG225.pcr  RQOG235.pcr
[ggsadmin 12.2.0]$cd dirpcs/
[ggsadmin dirpcs]$ll
total 24
-rw-r----- 1 ggsadmin oinstall 60 Aug 17 09:07 MGR.pcm
-rw-r----- 1 ggsadmin oinstall 69 Aug  7 10:46 RQOG025.pcr
-rw-r----- 1 ggsadmin oinstall 69 Aug  7 10:46 RQOG035.pcr
-rw-r----- 1 ggsadmin oinstall 69 Aug  7 10:46 RQOG135.pcr
-rw-r----- 1 ggsadmin oinstall 69 Aug  7 10:46 RQOG225.pcr
-rw-r----- 1 ggsadmin oinstall 69 Aug  7 10:46 RQOG235.pcr

Check the process ID recorded for each replicat, verify whether that process still exists, and if it does, kill it. 

[ggsadmin dirpcs]$cat RQOG025.pcr
PROGRAM REPLICAT        PROCESSID       RQOG025 PORT    xxxxxxdb11.7846       PID     93930
[ggsadmin dirpcs]$ps -p 93930
  PID TTY          TIME CMD
93930 ?        00:10:08 replicat

[ggsadmin dirpcs]$kill 93930
[ggsadmin dirpcs]$

[ggsadmin dirpcs]$cat RQOG035.pcr
PROGRAM REPLICAT        PROCESSID       RQOG035 PORT    xxxxxxdb11.7845       PID     93931
[ggsadmin dirpcs]$kill 93931
[ggsadmin dirpcs]$

[ggsadmin dirpcs]$cat RQOG135.pcr
PROGRAM REPLICAT        PROCESSID       RQOG135 PORT    xxxxxxdb11.7843       PID     93933
[ggsadmin dirpcs]$kill 93933
[ggsadmin dirpcs]$

[ggsadmin dirpcs]$cat RQOG225.pcr
PROGRAM REPLICAT        PROCESSID       RQOG225 PORT    xxxxxxdb11.7841       PID     93934
[ggsadmin dirpcs]$kill 93934
[ggsadmin dirpcs]$

[ggsadmin dirpcs]$cat RQOG235.pcr
PROGRAM REPLICAT        PROCESSID       RQOG235 PORT    xxxxxxdb11.7842       PID     93935
[ggsadmin dirpcs]$kill 93935
[ggsadmin dirpcs]$
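
Instead of handling each group by hand, the same cleanup can be scripted. This sketch assumes the .pcr layout shown above, where the PID is the last field.

# For every stale replicat status file, read the recorded PID and kill it if still alive
cd /path/to/gg_home/dirpcs     # hypothetical GoldenGate home; adjust to your install
for f in *.pcr; do
    pid=$(awk '{print $NF}' "$f")
    ps -p "$pid" > /dev/null 2>&1 && kill "$pid"
done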

Start the GG instance using AGCTL 

[ggsadmin dirpcs]$agctl status   goldengate gg_inst
Goldengate  instance 'gg_inst' is not running

[ggsadmin dirpcs]$agctl start   goldengate gg_inst
CRS-2672: Attempting to start 'xag.gg_inst.goldengate' on 'xxxxxxdb11'
ERROR: Manager not currently running.

CRS-2674: Start of 'xag.gg_inst.goldengate' on 'xxxxxxdb11' failed
CRS-2679: Attempting to clean 'xag.gg_inst.goldengate' on 'xxxxxxdb11'
CRS-2681: Clean of 'xag.gg_inst.goldengate' on 'xxxxxxdb11' succeeded
CRS-2563: Attempt to start resource 'xag.gg_inst.goldengate' on 'xxxxxxdb11' has failed. Will re-retry on 'xxxxxxdb12' now.
CRS-2673: Attempting to stop 'gg-xxx-veri-agent' on 'xxxxxxdb11'
CRS-2677: Stop of 'gg-xxx-veri-agent' on 'xxxxxxdb11' succeeded
CRS-2673: Attempting to stop 'hqiqlxxxgg03-vip.xxxx.com' on 'xxxxxxdb11'
CRS-2677: Stop of 'hqiqlxxxgg03-vip.xxxx.com' on 'xxxxxxdb11' succeeded
CRS-2672: Attempting to start 'hqiqlxxxgg03-vip.xxxx.com' on 'xxxxxxdb12'
CRS-2676: Start of 'hqiqlxxxgg03-vip.xxxx.com' on 'xxxxxxdb12' succeeded
CRS-2672: Attempting to start 'gg-xxx-veri-agent' on 'xxxxxxdb12'
CRS-2676: Start of 'gg-xxx-veri-agent' on 'xxxxxxdb12' succeeded
CRS-2672: Attempting to start 'xag.gg_inst.goldengate' on 'xxxxxxdb12'
CRS-2674: Start of 'xag.gg_inst.goldengate' on 'xxxxxxdb12' failed
CRS-2679: Attempting to clean 'xag.gg_inst.goldengate' on 'xxxxxxdb12'
CRS-2681: Clean of 'xag.gg_inst.goldengate' on 'xxxxxxdb12' succeeded
CRS-2673: Attempting to stop 'gg-xxx-veri-agent' on 'xxxxxxdb12'
CRS-2677: Stop of 'gg-xxx-veri-agent' on 'xxxxxxdb12' succeeded
CRS-2673: Attempting to stop 'hqiqlxxxgg03-vip.xxxx.com' on 'xxxxxxdb12'
CRS-2677: Stop of 'hqiqlxxxgg03-vip.xxxx.com' on 'xxxxxxdb12' succeeded
CRS-2672: Attempting to start 'hqiqlxxxgg03-vip.xxxx.com' on 'xxxxxxdb11'
CRS-2676: Start of 'hqiqlxxxgg03-vip.xxxx.com' on 'xxxxxxdb11' succeeded
CRS-2672: Attempting to start 'gg-xxx-veri-agent' on 'xxxxxxdb11'
CRS-2676: Start of 'gg-xxx-veri-agent' on 'xxxxxxdb11' succeeded
CRS-2632: There are no more servers to try to place resource 'xag.gg_inst.goldengate' on that would satisfy its placement policy
CRS-4000: Command Start failed, or completed with errors.

So as you can see, the startup failed even after clearing up the checkpoint files. 

[ggsadmin dirpcs]$agctl status   goldengate gg_inst
Goldengate  instance 'gg_inst' is not running

[ggsadmin 12.2.0]$./ggsci

GGSCI (xxxxxxdb11) 1> info all

Program     Status      Group       Lag at Chkpt  Time Since Chkpt

MANAGER     STOPPED
JAGENT      STOPPED
REPLICAT    STOPPED     RQOG025     00:00:00      175:27:07
REPLICAT    STOPPED     RQOG035     00:00:00      175:27:06
REPLICAT    STOPPED     RQOG125     00:00:00      03:17:15
REPLICAT    STOPPED     RQOG135     00:00:00      175:26:57
REPLICAT    STOPPED     RQOG225     00:00:00      175:27:01
REPLICAT    STOPPED     RQOG235     00:00:00      175:26:55


GGSCI (xxxxxxdb11) 2> start mgr
Manager started.

GGSCI (xxxxxxdb11) 3> info all
Program     Status      Group       Lag at Chkpt  Time Since Chkpt

MANAGER     STOPPED    -- Manager still not started. 
JAGENT      STOPPED
REPLICAT    STOPPED     RQOG025     00:00:00      175:27:10
REPLICAT    STOPPED     RQOG035     00:00:00      175:27:09
REPLICAT    STOPPED     RQOG125     00:00:00      03:17:18
REPLICAT    STOPPED     RQOG135     00:00:00      175:27:01
REPLICAT    STOPPED     RQOG225     00:00:00      175:27:04
REPLICAT    STOPPED     RQOG235     00:00:00      175:26:59

THE PROBLEM HERE IS THE DATASTORE. ONE NEEDS TO RECREATE IT AS FOLLOWS. 

GGSCI (xxxxxxdb11) 2> delete datastore ;
Are you sure you want to delete the datastore? yes
2017-08-17T16:20:17Z  INFO    OGG-06492  Datastore deleted
GGSCI (xxxxxxdb11) 3> exit

[ggsadmin@xxxxxxdb11 12.2.0]$./ggsci
GGSCI (xxxxxxdb11) 1> create datastore shm id 1000
Profile 'Trace' added.
2017-08-17T16:20:30Z  INFO    OGG-06489  Datastore created
GGSCI (xxxxxxdb11) 2> exit

GGSCI (xxxxxxdb11) 1> info datastore
2017-08-17T16:20:38Z  INFO    OGG-06309  Datastore uses shared memory (SHM) environment files with a starting ID of 0x000003e8.
GGSCI (xxxxxxdb11) 2> exit

[ggsadmin@xxxxxxdb11 12.2.0]$agctl start  goldengate gg_inst
[ggsadmin@xxxxxxdb11 12.2.0]$agctl status  goldengate gg_inst
Goldengate  instance 'gg_inst' is running on xxxxxxdb51

GGSCI (xxxxxxdb11) 2> info all

Program     Status      Group       Lag at Chkpt  Time Since Chkpt

MANAGER     RUNNING
JAGENT      RUNNING
REPLICAT    STOPPED     RQOG025     00:00:00      175:31:41
REPLICAT    STOPPED     RQOG035     00:00:00      175:31:40
REPLICAT    STOPPED     RQOG125     00:00:00      03:21:49
REPLICAT    STOPPED     RQOG135     00:00:00      175:31:31
REPLICAT    STOPPED     RQOG225     00:00:00      175:31:35
REPLICAT    STOPPED     RQOG235     00:00:00      175:31:30

GGSCI (xxxxxxdb11) 3> start er *

Sending START request to MANAGER ...
REPLICAT RQOG025 starting

Sending START request to MANAGER ...
REPLICAT RQOG035 starting

Sending START request to MANAGER ...
REPLICAT RQOG125 starting

Sending START request to MANAGER ...
REPLICAT RQOG135 starting

Sending START request to MANAGER ...
REPLICAT RQOG225 starting

Sending START request to MANAGER ...
REPLICAT RQOG235 starting

GGSCI (xxxxxxdb11) 4> info all

Program     Status      Group       Lag at Chkpt  Time Since Chkpt

MANAGER     RUNNING     <-- MANAGER IS NOW RUNNING AFTER RECREATING DATASTORE
JAGENT      RUNNING
REPLICAT    RUNNING     RQOG025     00:00:00      175:32:03
REPLICAT    RUNNING     RQOG035     00:00:00      175:32:02
REPLICAT    RUNNING     RQOG125     00:00:00      03:22:11
REPLICAT    RUNNING     RQOG135     00:00:00      175:31:53
REPLICAT    RUNNING     RQOG225     00:00:00      175:31:57
REPLICAT    RUNNING     RQOG235     00:00:00      175:31:51

IF THE ABOVE FIX DOESN'T WORK, YOU CAN TRY THE FOLLOWING WORKAROUNDS TOO.

Workaround 1:
Rename the checkpoint file, then copy (not rename again) the file back to the original name. For example:

mv RQOG025.cpe RQOG025.cpe.save
cp -p RQOG025.cpe.save RQOG025.cpe

Workaround 2:
Move the checkpoint file to another file system, and then move it back again. For example:

mv RQOG025.cpe /tmp/RQOG025.cpe
mv /tmp/RQOG025.cpe ./RQOG025.cpe

Monday, June 26, 2017



DSE Cassandra Node Failed To Start Post OS Upgrade 


One of our Cassandra production cluster nodes refused to start after OS patching was done, with the following error.

ERROR [main] 2017-06-25 18:09:40,906  CassandraDaemon.java:709 - Exception encountered during startup
org.apache.cassandra.io.FSReadError: java.io.EOFException
        at org.apache.cassandra.hints.HintsDescriptor.readFromFile(HintsDescriptor.java:142) ~[cassandra-all-3.0.8.1293.jar:3.0.8.1293]
        at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193) ~[na:1.8.0_66]
        at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175) ~[na:1.8.0_66]
        at java.util.Iterator.forEachRemaining(Iterator.java:116) ~[na:1.8.0_66]
        at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801) ~[na:1.8.0_66]
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) ~[na:1.8.0_66]
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) ~[na:1.8.0_66]
        at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708) ~[na:1.8.0_66]
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[na:1.8.0_66]
        at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499) ~[na:1.8.0_66]
        at org.apache.cassandra.hints.HintsCatalog.load(HintsCatalog.java:65) ~[cassandra-all-3.0.8.1293.jar:3.0.8.1293]
        at org.apache.cassandra.hints.HintsService.<init>(HintsService.java:88) ~[cassandra-all-3.0.8.1293.jar:3.0.8.1293]
        at org.apache.cassandra.hints.HintsService.<clinit>(HintsService.java:63) ~[cassandra-all-3.0.8.1293.jar:3.0.8.1293]
        at org.apache.cassandra.service.StorageProxy.<clinit>(StorageProxy.jav


Upon looking around, we found that this error is a reported major bug, tracked under the following JIRA:

CASSANDRA-12728: Handling partially written hint files

Cause – 
Corruption in the hint files caused Cassandra to go into a failure loop. This could have happened due to the following: 

1. The node was rebooted before the service was shut down properly. 
2. The service went down abruptly while writing hints. 
3. The node rebooted due to a power failure. 

Since the cause of the issue was a corrupted hint file, we need to clean up the hints for the node and then try to restart, as sketched below. 
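
A minimal cleanup sketch, assuming a package (service-based) install and the default hints_directory; check the hints_directory setting in cassandra.yaml if yours differs.

# Clear the corrupted hint files before restarting the node
sudo service dse stop
sudo rm -f /var/lib/cassandra/hints/*.hints /var/lib/cassandra/hints/*.crc32
sudo service dse start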

After that the node started fine. Also, since the node was down, it is imperative to run a repair on the node to make sure its data is consistent. 
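
For example, from the node that was down:

# Repair the node so it streams back any writes it missed while down
nodetool repair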

Hope that helps. 

Thursday, April 6, 2017

Golden Gate 12.2 Monitoring JAGENT Not Starting Post Upgrade


After an upgrade of the JAGENT from 12.1 to 12.2, the monitoring agent would not start up. 


Error - 
[2017-04-05T09:50:33.133-07:00] [JAGENT] [ERROR] [OGGMON-20497] [com.goldengate.monitor.jagent.comm.ws.NotificationsCollector] [tid: StatusCollector] [ecid: 0000Lfr0uKD5mZSpyCd9iW1Oof4E000003,0] Failed to get Status Information. RESTful Web Service might not be available.
[2017-04-05T09:50:38.134-07:00] [JAGENT] [ERROR] [OGGMON-20494] [com.goldengate.monitor.jagent.comm.ws.NotificationsCollector] [tid: StatusCollector] [ecid: 0000Lfr0uKD5mZSpyCd9iW1Oof4E000003,0] Error occurred while registering the OGG process. Exception: [[
 source parameter must not be null
]]

OR 

Exception in thread "StatusCollector" java.lang.NullPointerException
        at com.goldengate.monitor.jagent.comm.ws.NotificationsCollector.checkAndDeleteProcess(NotificationsCollector.java:140)
        at com.goldengate.monitor.jagent.comm.ws.NotificationsCollector.checkAndPublishRegisteredAndDeletedProcess(NotificationsCollector.java:128)


SOLUTION

Log in to the GGSCI prompt and run the following: 

[ggsadmin@12.2.0]$agctl stop   goldengate xxxx

[ggsadmin@12.2.0]$./ggsci

GGSCI () 1> delete datastore
Are you sure you want to delete the datastore? y
2017-04-06T18:23:20Z  INFO    OGG-06492  Datastore deleted

GGSCI () 2> create datastore
Profile 'Trace' added.
2017-04-06T18:23:27Z  INFO    OGG-06489  Datastore created
GGSCI () 3> exit

[ggsadmin@12.2.0]$agctl start   goldengate xxxxx

If the agent still doesn't start even after this, check the prm file for the agent. 

[ggsadmin@ bin]$cat /ggsadmin/app/cqxad02p/ggate/12.2.0/dirprm/jagent.prm

COMMAND java -Dconfig.dir=/ggsadmin/app/cqxad02p/ggate/12.2.0/cfg -Djava.util.logging.config.class=oracle.core.ojdl.logging.LoggingConfiguration -Doracle.core.ojdl.logging.config.file=/ggsadmin/app/cqxad02p/ggate/12.2.0/cfg/logging-config.xml -Doracle.core.ojdl.logging.componentId=JAGENT -jar -Xms512m -Xmx1024m /ggsadmin/app/cqxad02p/jagent12.2/oggmon/ogg_agent/dirjar/jagent.jar

[ggsadmin@cfg]$ll /ggsadmin/app/cqxad02p/jagent12.2/oggmon/ogg_agent/dirjar/jagent.jar

-rw-r----- 1 ggsadmin oinstall 363165 Oct 12  2015 /ggsadmin/app/cqxad02p/jagent12.2/oggmon/ogg_agent/dirjar/jagent.jar

The jagent.jar file had no executable permissions, and hence the agent was not starting up.

[ggsadmin@cfg]$chmod  755 /ggsadmin/app/cqxad02p/jagent12.2/oggmon/ogg_agent/dirjar/jagent.jar


[ggsadmin@cfg]$ll /ggsadmin/app/cqxad02p/jagent12.2/oggmon/ogg_agent/dirjar/jagent.jar
-rwxr-xr-x 1 ggsadmin oinstall 363165 Oct 12  2015 /ggsadmin/app/cqxad02p/jagent12.2/oggmon/ogg_agent/dirjar/jagent.jar
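
After fixing the permissions, a quick restart-and-verify pass might look like this; the instance name gg_inst is a placeholder for whatever is registered with XAG.

# Restart the GoldenGate instance through the clusterware agent and confirm it comes up
agctl start goldengate gg_inst
agctl status goldengate gg_inst
echo "info all" | ./ggsci          # run from the GoldenGate home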

After this change, the JAGENT started fine. 


Thursday, February 16, 2017

PSU bundle patch 5 Database Patch Set Update : 12.1.0.2.x : Installed in the CDB but not in the PDB 

OR 

validate_con_names: PDB$SEED is not open


The issue happened on one of our Oracle Database Appliances where, after patching, one of the PDBs failed to start. 

Upon checking the issue we came across the following. The cause was that when the patch was applied, the PDB was not open in read/write mode, so the datapatch component was not applied on that PDB. 

SQL> select * from PDB_PLUG_IN_VIOLATIONS;

26-JUL-16 02.15.57.395533 PM                                                PQGECAH02                                                    SQL Patch                                                         ERROR                0          1
PSU bundle patch 5 (Database Patch Set Update : 12.1.0.2.5 (21359755)): Installed in the PDB but not in the CDB.
PENDING
Call datapatch to install in the PDB or the CDB

Then we ran the following statements to open the PDB in read/write mode. 

SQL> alter session set container = xxxxxx;

alter system disable restricted session;
select con_id,name,open_mode from gv$pdbs;
select con_id,logins,status from gv$instance;


SYS@xxxxx>select con_id,name,open_mode from gv$pdbs;

    CON_ID NAME                           OPEN_MODE
---------- ------------------------------ ----------
         2 PDB$SEED                       READ ONLY
         3 xxxxx                               READ WRITE
         2 PDB$SEED                       READ ONLY
         3 xxxxx                               MOUNTED


SYS@xxxxx>SELECT restricted FROM gv$containers WHERE con_id = 3;

RES
---
YES

SYS@CQXAC06P1>alter session set container = xxxxxx;
Session altered.

SYS@xxxxx>shutdown
Pluggable Database closed.

SYS@xxxxx>startup
Warning: PDB altered with errors.
Pluggable Database opened.

SYS@xxxxx>select con_id, name , open_mode , restricted FROM gv$containers ;

    CON_ID NAME                           OPEN_MODE  RES
---------- ------------------------------ ---------- ---
         1 CDB$ROOT                       READ WRITE NO
         2 PDB$SEED                       READ ONLY  NO
         3 xxxxx                               READ WRITE YES
         1 CDB$ROOT                       READ WRITE NO
         2 PDB$SEED                       READ ONLY  NO
         3 xxxxx                               READ WRITE YES


Run following from one of the nodes.
./datapatch -verbose -pdbs xxxxxx

[oracle@xxxxOPatch]$ ./datapatch -verbose -pdbs xxxxxx
SQL Patching tool version 12.1.0.2.0 on Thu May 26 13:25:17 2016
Copyright (c) 2015, Oracle.  All rights reserved.

Log file for this invocation: /u01/app/oracle/cfgtoollogs/sqlpatch/sqlpatch_68726_2016_05_26_13_25_17/sqlpatch_invocation.log

Bootstrapping registry and package to current versions...done
Connecting to database...OK
Note:  Datapatch will only apply or rollback SQL fixes for PDBs
       that are in an open state, no patches will be applied to closed PDBs.
       Please refer to Note: Datapatch: Database 12c Post Patch SQL Automation
       (Doc ID 1585822.1)
Bootstrapping registry and package to current versions...done
Determining current state...done

Current state of SQL patches:
Bundle series PSU:
  ID 5 in the binary registry and not installed in any PDB

Adding patches to installation queue and performing prereq checks...
Installation queue:
  For the following PDBs: PQBSMAC06
    Nothing to roll back
    The following patches will be applied:
      21359755 (Database Patch Set Update : 12.1.0.2.5 (21359755))

Installing patches...
Patch installation complete.  Total patches installed: 1

Validating logfiles...
Patch 21359755 apply (pdb PQBSMAC06): SUCCESS
  logfile: /u01/app/oracle/cfgtoollogs/sqlpatch/21359755/19194568/21359755_apply_CQXAC06P_PQBSMAC06_2016May26_13_26_04.log (no errors)
SQL Patching tool complete on Thu May 26 13:26:09 2016


Lower patch in CDB_DEST than in CDB_SOURCE?

There is another situation, where your PDB has a higher patch level than your CDB. 
Now this becomes a bit trickier. See the output of PDB_PLUG_IN_VIOLATIONS:

TYPE  MESSAGE
---- ----------------------------------------------------------------------------
ERROR PSU bundle patch 1 (PSU Patch 4711): Installed in the PDB but not in the CDB
ACTION
------------------------------------------------
Call datapatch to install in the PDB or the CDB

What does this mean? Should I now install the current CDB/PDB's PSU into my target environment before being able to step down? 

You may find this message misleading. When you look into the My Oracle Support note describing this under scenario 3 (MOS Note 1935365.1 - Multitenant Unplug/Plug Best Practices), you'll see that the author silently assumed as well that it is more likely that you'll remove the patch from the PDB. 

But how do you remove changes which came in with datapatch from within a PDB only?

You will need to run datapatch -rollback on the affected PDBs only:

[oracle@xxxxOPatch]$ ./datapatch -rollback <patch id> -force [-bundle_series] -pdbs <pdb1,pdb2,...,pdbn>

This will roll back the extra patches installed in the PDBs, and then you can subsequently run datapatch on the CDB, which will fix all the issues. A combined sketch follows below. 
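
Put together, the sequence might look like the following sketch; the patch ID 4711 and the PDB names are placeholders, so take the real values from your PDB_PLUG_IN_VIOLATIONS output.

# Roll back the extra SQL patch in the affected PDBs only, then bring the whole CDB up to date
cd $ORACLE_HOME/OPatch
./datapatch -rollback 4711 -force -pdbs PDB1,PDB2
./datapatch -verbose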

One more workaround would be the following. 

SQL>alter pluggable database pdb close immediate instances=all;

SQL>alter pluggable database pdb open upgrade;

SQL> select * from gv$pdbs;


-- Run datapatch from ORACLE_HOME/OPatch

[oracle@xxx OPatch]$ ./datapatch -verbose

There is one more thing that one can come across when applying datapatch on a CDB: 
during the datapatch apply it may wait and then fail with a timeout while granting access to one user, as follows. 


SQL> GRANT SELECT ON sys.gv_$instance TO dv_secanalyst
  2  /
GRANT SELECT ON sys.gv_$instance TO dv_secanalyst
                    *
ERROR at line 1:
ORA-04021: timeout occurred while waiting to lock object


WORKAROUND:

One can start up the database in restricted mode and execute the failed SQL statement manually, but this requires downtime to start the database in restricted mode.


SQL> GRANT SELECT ON sys.gv_$instance TO dv_secanalyst

The above statement will not work if you start the database in normal mode.
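
One possible sequence, sketched on the assumption that you can take the outage, is:

# Restart in restricted mode, run the grant that timed out, then lift the restriction
sqlplus / as sysdba <<'EOF'
shutdown immediate
startup restrict
GRANT SELECT ON sys.gv_$instance TO dv_secanalyst;
alter system disable restricted session;
EOF
# then re-run datapatch from $ORACLE_HOME/OPatch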


Another Issue - 


Error
validate_con_names: PDB$SEED is not open
catconInit: Unexpected error returned by validate_con_names for inclusive Container list

Cause - 

PDB$SEED is not open:

SQL> SELECT name, open_mode FROM v$pdbs;

NAME OPEN_MODE
------------------------------ ----------
PDB$SEED MOUNTED
PQAUG01  READ WRITE 


Fix - 

SQL> alter session set "_oracle_script"=TRUE;
SQL> alter pluggable database pdb$seed OPEN READ ONLY;
SQL> alter session set "_oracle_script"=FALSE;
SQL> SELECT name, open_mode FROM v$pdbs;

NAME                                                                             OPEN_MODE
-------------------------------------------------------------------------------- ----------
PDB$SEED                                                                         READ ONLY
PQAUG01                                                                          READ WRITE