Monday, November 27, 2017


Unable To Start DSE Post 5.1 Upgrade

After upgrading our DSE servers running Spark from version 5.0.8 to 5.1.0, DSE refused to start with the following error.


ERROR [main] 2017-11-20 14:33:58,931  DseDaemon.java:494 - Unable to start DSE server.
com.datastax.bdp.plugin.PluginManager$PluginActivationException: Unable to activate plugin com.datastax.bdp.plugin.DseFsPlugin

Caused by: org.apache.cassandra.exceptions.ConfigurationException: DSEFS does not support authentication method configured with org.apache.cassandra.auth.PasswordAuthenticator. DSEFS supports INTERNAL, LDAP and DIGEST authentication schemes configured with DseAuthenticator.

As you can see, the default authenticator and authorizer have been deprecated, and cassandra.yaml must now point to the following.

authenticator: com.datastax.bdp.cassandra.auth.DseAuthenticator
authorizer: com.datastax.bdp.cassandra.auth.DseAuthorizer
role_manager: com.datastax.bdp.cassandra.auth.DseRoleManager

In the dse.yaml file, configure the corresponding options:

Configure the DSE Authenticator by uncommenting the authentication_options and changing the settings.
FROM -
# authentication_options:
#     enabled: false
#     default_scheme: internal

TO -
authentication_options:
    enabled: true
    default_scheme: internal


Configure the DSE Role Manager by uncommenting role_management_options and setting the mode
role_management_options:
     mode: internal

Configure the DSE Authorizer by uncommenting the authorization_options and changing the settings.
 authorization_options:
     enabled: true
#     transitional_mode: normal
#     allow_row_level_security: false
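
To sanity-check the edits, you can grep dse.yaml for the three option blocks; this is only a sketch, and the path below assumes a package install (tarball installs keep dse.yaml under resources/dse/conf).

# Confirm the three option blocks are present and enabled in dse.yaml
grep -A 3 -E 'authentication_options|role_management_options|authorization_options' /etc/dse/dse.yaml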


Once done, run nodetool upgradesstables if you have not already run it. 

[root@hqidlinfdb36 cassandra]# nodetool  upgradesstables
WARN  14:50:54,675 Small cdc volume detected at /var/lib/cassandra/cdc_raw; setting cdc_total_space_in_mb to 243.  You can override this in cassandra.yaml
WARN  14:50:54,681 memtable_cleanup_threshold has been deprecated and should be removed from cassandra.yaml


You must configure the replication factors appropriate for using DSE Security in production environments. The keyspaces that require an increased replication factor are:
  • system_auth
  • dse_security

Change the system_auth keyspace RF:
ALTER KEYSPACE system_auth
    WITH REPLICATION= {'class' : 'NetworkTopologyStrategy',
                       'data_center_name' : N,
                       'data_center_name' : N};

Example - ALTER KEYSPACE system_auth WITH REPLICATION =
  {'class' : 'NetworkTopologyStrategy', 'dc1' : 3, 'dc2' : 2};


Change the dse_security keyspace RF:

ALTER KEYSPACE "dse_security"
   WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'dc1' : 3, 'dc2' : 2};
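
Before running the repairs below, you can confirm that the new replication settings took effect from any node. The cassandra/cassandra login here is just an example superuser; use whatever credentials apply in your cluster.

# Confirm the updated replication settings for the security keyspaces
cqlsh -u cassandra -p cassandra -e "DESCRIBE KEYSPACE system_auth;"
cqlsh -u cassandra -p cassandra -e "DESCRIBE KEYSPACE dse_security;"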

Run the nodetool repair on the security keyspaces.
nodetool repair --full system_auth
nodetool repair --full dse_security

After changing the replication strategy, you must run nodetool repair with the --full option. When running the full repair, you may see the following warning, which means you also have to modify the RF for the dse_leases keyspace. 

insufficient replication (you created a new DC and didn't ALTER KEYSPACE dse_leases) and the duration (30000) being different (you have to disable/delete/recreate the lease to change the duration).
java.io.IOException: No live replicas for lease Leader/master/5.1.dc1 in table dse_leases.leases Nodes [/10.22.102.27] are all down/still starting.

ALTER  KEYSPACE dse_leases WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': '3'}  AND durable_writes = true;
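
A minimal way to apply that change and make the new replicas consistent, again assuming an internal superuser login for cqlsh, is:

# Apply the dse_leases replication change, then repair the keyspace
cqlsh -u cassandra -p cassandra -e "ALTER KEYSPACE dse_leases WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': '3'} AND durable_writes = true;"
nodetool repair --full dse_leases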

Once this is done, you are now running successfully on version 5.1. 


Wednesday, November 8, 2017

DB Does Not Open Post FlashBack ORA-01139: RESETLOGS option only valid after an incomplete database recovery


During a flashback activity on one of our RAC databases on an Exadata system, we faced a challenge while opening the DB post flashback. 
We tried to open the DB with RESETLOGS on a single instance and it would not open. 

alter database open resetlogs 

ERROR at line 1: 
ORA-38760: This database instance failed to turn on flashback database 

We then tried turning flashback off and opening the database, but after turning flashback off the following was the status. 

alter database flashback off; 
SYS@DBS01P>select flashback_on from v$database; 

FLASHBACK_ON 
------------------ 
RESTORE POINT ONLY 

SYS@DBS01P>select open_mode from v$database; 

OPEN_MODE 
-------------------- 
MOUNTED 

alter database open resetlogs 

ERROR at line 1: 
ORA-01139: RESETLOGS option only valid after an incomplete database recovery 

ERROR at line 1: 
ORA-00600: internal error code, arguments: [kcvcrv_fb_inc_mismatch], [46], [525459024], [959427550], [45], [4004329076], [958125589], [], [], [], [], [] 


Upon research we found that the issue was locks on the datafiles and controlfile: after the flashback database operation, the locks on the file headers and datafiles did not clear, which caused the failure while starting the DB. 

Hence we needed to recreate the controlfile from trace with NORESETLOGS, as the resetlogs had already been done. 

We tried doing that, but we faced an error stating that the DB must be open in read/write mode, which was not possible for us. So another way is to restore the controlfile to a temporary location and use it to generate the controlfile creation script. 

1. create pfile from current spfile. 
create pfile='/tmp/DBS01P_new.ora' from spfile;

2. Open pfile and modify controlfile location. 
control_files='/tmp/DBS01P_new.ctl'

3. Mount the instance with new pfile 
startup nomount pfile='/tmp/DBS01P_new.ora'

4. Restore controlfile from backup and create trace file 

Backup Piece -  Piece Name: /ora/DBS01P/backup01/c-2841507472-20171106-01
restore controlfile to '/tmp/DBS01P_new.ctl' from '/ora/DBS01P/backup01/c-2841507472-20171106-01'
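
If run interactively, step 4 would be an RMAN session along these lines (the backup piece path is the one from this example):

# Restore the controlfile to the temporary location from the backup piece
rman target / <<'EOF'
restore controlfile to '/tmp/DBS01P_new.ctl' from '/ora/DBS01P/backup01/c-2841507472-20171106-01';
EOF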

5. Create trace from current controlfile.
alter database backup controlfile to trace

6. Open the trace file and just copy the NORESETLOGS Command as follows to new file. 

CREATE CONTROLFILE REUSE DATABASE "DBS01P" NORESETLOGS FORCE LOGGING ARCHIVELOG
MAXLOGFILES 192
MAXLOGMEMBERS 3
MAXDATAFILES 1024
MAXINSTANCES 32
MAXLOGHISTORY 4192
LOGFILE
  GROUP 1 '+DATAC1/DBS01P/onlinelog/group_1.296.907412313'  SIZE 4096M BLOCKSIZE 512,
  GROUP 2 '+DATAC1/DBS01P/onlinelog/group_2.295.907412287'  SIZE 4096M BLOCKSIZE 512,
  GROUP 3 '+DATAC1/DBS01P/onlinelog/group_3.277.907412331'  SIZE 4096M BLOCKSIZE 512,
  GROUP 4 '+DATAC1/DBS01P/onlinelog/group_4.313.907412375'  SIZE 4096M BLOCKSIZE 512,
  GROUP 5 '+DATAC1/DBS01P/onlinelog/group_5.287.907412391'  SIZE 4096M BLOCKSIZE 512,
  GROUP 6 '+DATAC1/DBS01P/onlinelog/group_6.278.907412419'  SIZE 4096M BLOCKSIZE 512,
  GROUP 7 '+DATAC1/DBS01P/onlinelog/group_7.289.907412449'  SIZE 4096M BLOCKSIZE 512,
  GROUP 8 '+DATAC1/DBS01P/onlinelog/group_8.288.907412459'  SIZE 4096M BLOCKSIZE 512
-- STANDBY LOGFILE
DATAFILE
  '+DATAC1/DBS01P/datafile/system.1405.904054341',
  '+DATAC1/DBS01P/datafile/sysaux.1403.904054339',
  .............
  '+DATAC1/DBS01P/datafile/dpa_tbs.9286.950976195'
CHARACTER SET AL32UTF8
;

CREATE CONTROLFILE REUSE DATABASE "DBS01P" NORESETLOGS FORCE LOGGING ARCHIVELOG
*
ERROR at line 1:
ORA-01503: CREATE CONTROLFILE failed
ORA-12720: operation requires database is in EXCLUSIVE mode

## This activity needs the DB to be started in non-cluster (exclusive) mode. 

SYS@DBS01P1>alter system set cluster_database=false scope=spfile sid='*';
System altered.

SYS@DBS01P1>shut immediate;
ORA-01507: database not mounted
ORACLE instance shut down.

SYS@DBS01P1>startup nomount
ORA-32004: obsolete or deprecated parameter(s) specified for RDBMS instance
ORACLE instance started.

Total System Global Area 1.0262E+11 bytes

Fixed Size                  2265224 bytes
Variable Size            1.7985E+10 bytes
Database Buffers         8.4289E+10 bytes
Redo Buffers              342851584 bytes

SYS@DBS01P1>sho parameter cluster

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
cluster_database                     boolean     FALSE
cluster_database_instances           integer     1

SYS@DBS01P1> @create_ctl
;
Control file created.

SYS@DBS01P1>recover database;
ORA-00283: recovery session canceled due to errors
ORA-00264: no recovery required

SYS@DBS01P1>alter database open ;
Database altered.

SYS@DBS01P1>select  open_mode from v$database;
OPEN_MODE
--------------------
READ WRITE

Now one can put the DB in cluster mode and start the other instances. 
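
For completeness, switching back might look like the following sketch; the database name DBS01P is taken from this example, and srvctl is used to start all instances through Clusterware.

# Re-enable cluster mode and restart all instances
sqlplus / as sysdba <<'EOF'
alter system set cluster_database=true scope=spfile sid='*';
shutdown immediate
EOF
srvctl start database -d DBS01P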
Hope this will help you to overcome any controlfile corruption issues. 

Thursday, August 17, 2017

Golden Gate Extract/Pump/Replicat Shows "RUNNING" when it is Down


After a UNIX system crashes and is restarted, a status check of the GoldenGate group (process) may show "RUNNING", but the process really is not running.
This can happen for Extract, an Extract data pump, or Replicat.

GoldenGate considers a group as "RUNNING" when its checkpoint file is locked. In a mounted file system (such as NFS or NAS), an operating system bug prevents the locks on the files from being released after a system crash. Because the checkpoint file is locked, the GoldenGate process is considered "RUNNING" when checked through GGSCI.

In a normal situation, the process should be accessing the .pce file (if Extract) or .pcr file (if Replicat) in the dirpcs directory AND the .cpe (Extract) or .cpr (Replicat) file in the dirchk directory (in the GoldenGate installation directory).
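
One quick way to see whether anything still holds those files open is to check from the OS. This is only a sketch: the GoldenGate home path is a placeholder, and RQOG025 is one of the replicat groups from the example below.

# See whether any process still has the replicat status/checkpoint files open
cd /path/to/gg_home            # hypothetical GoldenGate installation directory
lsof dirpcs/RQOG025.pcr dirchk/RQOG025.cpr 2>/dev/null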

When I checked, I saw the following status. 

Program     Status      Group       Lag at Chkpt  Time Since Chkpt

MANAGER     STOPPED
JAGENT      STOPPED
REPLICAT    RUNNING     RQOG025     00:00:00      175:17:57
REPLICAT    RUNNING     RQOG035     00:00:00      175:17:56
REPLICAT    STOPPED     RQOG125     00:00:00      03:08:05
REPLICAT    RUNNING     RQOG135     00:00:00      175:17:48
REPLICAT    RUNNING     RQOG225     00:00:00      175:17:51
REPLICAT    RUNNING     RQOG235     00:00:00      175:17:46

Try to find out the PID for the process that is being shown as running. In my case the OS didn't return any PID, so effectively no process was running. 

ps -ef | grep RQOG025

Check the dirpcs directory to find out whether the processes have uncleared checkpoint status files. 

[ggsadmin 12.2.0]$cd dirpcs/
MGR.pcm      RQOG025.pcr  RQOG035.pcr  RQOG135.pcr  RQOG225.pcr  RQOG235.pcr
[ggsadmin 12.2.0]$cd dirpcs/
[ggsadmin dirpcs]$ll
total 24
-rw-r----- 1 ggsadmin oinstall 60 Aug 17 09:07 MGR.pcm
-rw-r----- 1 ggsadmin oinstall 69 Aug  7 10:46 RQOG025.pcr
-rw-r----- 1 ggsadmin oinstall 69 Aug  7 10:46 RQOG035.pcr
-rw-r----- 1 ggsadmin oinstall 69 Aug  7 10:46 RQOG135.pcr
-rw-r----- 1 ggsadmin oinstall 69 Aug  7 10:46 RQOG225.pcr
-rw-r----- 1 ggsadmin oinstall 69 Aug  7 10:46 RQOG235.pcr

Check the process ID recorded for each replicat, verify whether that process still exists, and if it does, kill it. 

[ggsadmin dirpcs]$cat RQOG025.pcr
PROGRAM REPLICAT        PROCESSID       RQOG025 PORT    xxxxxxdb11.7846       PID     93930
[ggsadmin dirpcs]$ps -p 93930
  PID TTY          TIME CMD
93930 ?        00:10:08 replicat

[ggsadmin dirpcs]$kill 93930
[ggsadmin dirpcs]$

[ggsadmin dirpcs]$cat RQOG035.pcr
PROGRAM REPLICAT        PROCESSID       RQOG035 PORT    xxxxxxdb11.7845       PID     93931
[ggsadmin dirpcs]$kill 93931
[ggsadmin dirpcs]$

[ggsadmin dirpcs]$cat RQOG135.pcr
PROGRAM REPLICAT        PROCESSID       RQOG135 PORT    xxxxxxdb11.7843       PID     93933
[ggsadmin dirpcs]$kill 93933
[ggsadmin dirpcs]$

[ggsadmin dirpcs]$cat RQOG225.pcr
PROGRAM REPLICAT        PROCESSID       RQOG225 PORT    xxxxxxdb11.7841       PID     93934
[ggsadmin dirpcs]$kill 93934
[ggsadmin dirpcs]$

[ggsadmin dirpcs]$cat RQOG235.pcr
PROGRAM REPLICAT        PROCESSID       RQOG235 PORT    xxxxxxdb11.7842       PID     93935
[ggsadmin dirpcs]$kill 93935
[ggsadmin dirpcs]$
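
Instead of handling each group by hand, the same cleanup can be scripted. This sketch assumes the .pcr layout shown above, where the PID is the last field.

# For every stale replicat status file, read the recorded PID and kill it if still alive
cd /path/to/gg_home/dirpcs     # hypothetical GoldenGate home; adjust to your install
for f in *.pcr; do
    pid=$(awk '{print $NF}' "$f")
    ps -p "$pid" > /dev/null 2>&1 && kill "$pid"
done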

Start the GG instance using AGCTL 

[ggsadmin dirpcs]$agctl status   goldengate gg_inst
Goldengate  instance 'gg_inst' is not running

[ggsadmin dirpcs]$agctl start   goldengate gg_inst
CRS-2672: Attempting to start 'xag.gg_inst.goldengate' on 'xxxxxxdb11'
ERROR: Manager not currently running.

CRS-2674: Start of 'xag.gg_inst.goldengate' on 'xxxxxxdb11' failed
CRS-2679: Attempting to clean 'xag.gg_inst.goldengate' on 'xxxxxxdb11'
CRS-2681: Clean of 'xag.gg_inst.goldengate' on 'xxxxxxdb11' succeeded
CRS-2563: Attempt to start resource 'xag.gg_inst.goldengate' on 'xxxxxxdb11' has failed. Will re-retry on 'xxxxxxdb12' now.
CRS-2673: Attempting to stop 'gg-xxx-veri-agent' on 'xxxxxxdb11'
CRS-2677: Stop of 'gg-xxx-veri-agent' on 'xxxxxxdb11' succeeded
CRS-2673: Attempting to stop 'hqiqlxxxgg03-vip.xxxx.com' on 'xxxxxxdb11'
CRS-2677: Stop of 'hqiqlxxxgg03-vip.xxxx.com' on 'xxxxxxdb11' succeeded
CRS-2672: Attempting to start 'hqiqlxxxgg03-vip.xxxx.com' on 'xxxxxxdb12'
CRS-2676: Start of 'hqiqlxxxgg03-vip.xxxx.com' on 'xxxxxxdb12' succeeded
CRS-2672: Attempting to start 'gg-xxx-veri-agent' on 'xxxxxxdb12'
CRS-2676: Start of 'gg-xxx-veri-agent' on 'xxxxxxdb12' succeeded
CRS-2672: Attempting to start 'xag.gg_inst.goldengate' on 'xxxxxxdb12'
CRS-2674: Start of 'xag.gg_inst.goldengate' on 'xxxxxxdb12' failed
CRS-2679: Attempting to clean 'xag.gg_inst.goldengate' on 'xxxxxxdb12'
CRS-2681: Clean of 'xag.gg_inst.goldengate' on 'xxxxxxdb12' succeeded
CRS-2673: Attempting to stop 'gg-xxx-veri-agent' on 'xxxxxxdb12'
CRS-2677: Stop of 'gg-xxx-veri-agent' on 'xxxxxxdb12' succeeded
CRS-2673: Attempting to stop 'hqiqlxxxgg03-vip.xxxx.com' on 'xxxxxxdb12'
CRS-2677: Stop of 'hqiqlxxxgg03-vip.xxxx.com' on 'xxxxxxdb12' succeeded
CRS-2672: Attempting to start 'hqiqlxxxgg03-vip.xxxx.com' on 'xxxxxxdb11'
CRS-2676: Start of 'hqiqlxxxgg03-vip.xxxx.com' on 'xxxxxxdb11' succeeded
CRS-2672: Attempting to start 'gg-xxx-veri-agent' on 'xxxxxxdb11'
CRS-2676: Start of 'gg-xxx-veri-agent' on 'xxxxxxdb11' succeeded
CRS-2632: There are no more servers to try to place resource 'xag.gg_inst.goldengate' on that would satisfy its placement policy
CRS-4000: Command Start failed, or completed with errors.

So as you can see, the startup failed even after clearing up the checkpoint files. 

[ggsadmin dirpcs]$agctl status   goldengate gg_inst
Goldengate  instance 'gg_inst' is not running

[ggsadmin 12.2.0]$./ggsci

GGSCI (xxxxxxdb11) 1> info all

Program     Status      Group       Lag at Chkpt  Time Since Chkpt

MANAGER     STOPPED
JAGENT      STOPPED
REPLICAT    STOPPED     RQOG025     00:00:00      175:27:07
REPLICAT    STOPPED     RQOG035     00:00:00      175:27:06
REPLICAT    STOPPED     RQOG125     00:00:00      03:17:15
REPLICAT    STOPPED     RQOG135     00:00:00      175:26:57
REPLICAT    STOPPED     RQOG225     00:00:00      175:27:01
REPLICAT    STOPPED     RQOG235     00:00:00      175:26:55


GGSCI (xxxxxxdb11) 2> start mgr
Manager started.

GGSCI (xxxxxxdb11) 3> info all
Program     Status      Group       Lag at Chkpt  Time Since Chkpt

MANAGER     STOPPED    -- Manager still not started. 
JAGENT      STOPPED
REPLICAT    STOPPED     RQOG025     00:00:00      175:27:10
REPLICAT    STOPPED     RQOG035     00:00:00      175:27:09
REPLICAT    STOPPED     RQOG125     00:00:00      03:17:18
REPLICAT    STOPPED     RQOG135     00:00:00      175:27:01
REPLICAT    STOPPED     RQOG225     00:00:00      175:27:04
REPLICAT    STOPPED     RQOG235     00:00:00      175:26:59

THE PROBLEM HERE IS THE DATASTORE. ONE NEEDS TO RECREATE IT AS FOLLOWS. 

GGSCI (xxxxxxdb11) 2> delete datastore ;
Are you sure you want to delete the datastore? yes
2017-08-17T16:20:17Z  INFO    OGG-06492  Datastore deleted
GGSCI (xxxxxxdb11) 3> exit

[ggsadmin@xxxxxxdb11 12.2.0]$./ggsci
GGSCI (xxxxxxdb11) 1> create datastore shm id 1000
Profile 'Trace' added.
2017-08-17T16:20:30Z  INFO    OGG-06489  Datastore created
GGSCI (xxxxxxdb11) 2> exit

GGSCI (xxxxxxdb11) 1> info datastore
2017-08-17T16:20:38Z  INFO    OGG-06309  Datastore uses shared memory (SHM) environment files with a starting ID of 0x000003e8.
GGSCI (xxxxxxdb11) 2> exit

[ggsadmin@xxxxxxdb11 12.2.0]$agctl start  goldengate gg_inst
[ggsadmin@xxxxxxdb11 12.2.0]$agctl status  goldengate gg_inst
Goldengate  instance 'gg_inst' is running on xxxxxxdb51

GGSCI (xxxxxxdb11) 2> info all

Program     Status      Group       Lag at Chkpt  Time Since Chkpt

MANAGER     RUNNING
JAGENT      RUNNING
REPLICAT    STOPPED     RQOG025     00:00:00      175:31:41
REPLICAT    STOPPED     RQOG035     00:00:00      175:31:40
REPLICAT    STOPPED     RQOG125     00:00:00      03:21:49
REPLICAT    STOPPED     RQOG135     00:00:00      175:31:31
REPLICAT    STOPPED     RQOG225     00:00:00      175:31:35
REPLICAT    STOPPED     RQOG235     00:00:00      175:31:30

GGSCI (xxxxxxdb11) 3> start er *

Sending START request to MANAGER ...
REPLICAT RQOG025 starting

Sending START request to MANAGER ...
REPLICAT RQOG035 starting

Sending START request to MANAGER ...
REPLICAT RQOG125 starting

Sending START request to MANAGER ...
REPLICAT RQOG135 starting

Sending START request to MANAGER ...
REPLICAT RQOG225 starting

Sending START request to MANAGER ...
REPLICAT RQOG235 starting

GGSCI (xxxxxxdb11) 4> info all

Program     Status      Group       Lag at Chkpt  Time Since Chkpt

MANAGER     RUNNING     <-- MANAGER IS NOW RUNNING AFTER RECREATING DATASTORE
JAGENT      RUNNING
REPLICAT    RUNNING     RQOG025     00:00:00      175:32:03
REPLICAT    RUNNING     RQOG035     00:00:00      175:32:02
REPLICAT    RUNNING     RQOG125     00:00:00      03:22:11
REPLICAT    RUNNING     RQOG135     00:00:00      175:31:53
REPLICAT    RUNNING     RQOG225     00:00:00      175:31:57
REPLICAT    RUNNING     RQOG235     00:00:00      175:31:51

IF THE ABOVE FIX DOESN'T WORK, YOU CAN TRY THE FOLLOWING WORKAROUNDS TOO.

Workaround 1:
Rename the checkpoint file, then copy (not rename again) the file back to the original name. For example:

mv RQOG025.cpe RQOG025.cpe.save
cp -p RQOG025.cpe.save RQOG025.cpe

Workaround 2:
Move the checkpoint file to another file system, and then move it back again. For example:

mv RQOG025.cpe /tmp/RQOG025.cpe
mv /tmp/RQOG025.cpe ./RQOG025.cpe

Monday, June 26, 2017



DSE Cassandra Node Failed To Start Post OS Upgrade 


One of our Cassandra production cluster nodes refused to start after OS patching was done, with the following error.

ERROR [main] 2017-06-25 18:09:40,906  CassandraDaemon.java:709 - Exception encountered during startup
org.apache.cassandra.io.FSReadError: java.io.EOFException
        at org.apache.cassandra.hints.HintsDescriptor.readFromFile(HintsDescriptor.java:142) ~[cassandra-all-3.0.8.1293.jar:3.0.8.1293]
        at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193) ~[na:1.8.0_66]
        at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175) ~[na:1.8.0_66]
        at java.util.Iterator.forEachRemaining(Iterator.java:116) ~[na:1.8.0_66]
        at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801) ~[na:1.8.0_66]
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) ~[na:1.8.0_66]
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) ~[na:1.8.0_66]
        at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708) ~[na:1.8.0_66]
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[na:1.8.0_66]
        at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499) ~[na:1.8.0_66]
        at org.apache.cassandra.hints.HintsCatalog.load(HintsCatalog.java:65) ~[cassandra-all-3.0.8.1293.jar:3.0.8.1293]
        at org.apache.cassandra.hints.HintsService.<init>(HintsService.java:88) ~[cassandra-all-3.0.8.1293.jar:3.0.8.1293]
        at org.apache.cassandra.hints.HintsService.<clinit>(HintsService.java:63) ~[cassandra-all-3.0.8.1293.jar:3.0.8.1293]
        at org.apache.cassandra.service.StorageProxy.<clinit>(StorageProxy.jav


Upon looking around, we found that this error is a reported major bug, tracked under the following JIRA:

CASSANDRA-12728: Handling partially written hint files

Cause – 
Corruption in the hint files caused Cassandra to go into a failure loop. This could have happened due to the following: 

1. The node was rebooted before the service was shut down properly. 
2. The service went down abruptly while writing hints. 
3. The node rebooted due to a power failure. 

Since the cause of the issue was a corrupted hint file, we need to clean up the hints for the node and then try to restart, as sketched below. 
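
A minimal cleanup sketch, assuming a package (service-based) install and the default hints_directory; check the hints_directory setting in cassandra.yaml if yours differs.

# Clear the corrupted hint files before restarting the node
sudo service dse stop
sudo rm -f /var/lib/cassandra/hints/*.hints /var/lib/cassandra/hints/*.crc32
sudo service dse start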

After that the node started fine. Also, since the node was down, it is imperative to run a repair on the node to make sure its data is consistent. 
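
For example, from the node that was down:

# Repair the node so it streams back any writes it missed while down
nodetool repair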

Hope that helps. 

Thursday, April 6, 2017

Golden Gate 12.2 Monitoring JAGENT Not Starting Post Upgrade


After an upgrade of the JAGENT from 12.1 to 12.2, the monitoring agent would not start up. 


Error - 
[2017-04-05T09:50:33.133-07:00] [JAGENT] [ERROR] [OGGMON-20497] [com.goldengate.monitor.jagent.comm.ws.NotificationsCollector] [tid: StatusCollector] [ecid: 0000Lfr0uKD5mZSpyCd9iW1Oof4E000003,0] Failed to get Status Information. RESTful Web Service might not be available.
[2017-04-05T09:50:38.134-07:00] [JAGENT] [ERROR] [OGGMON-20494] [com.goldengate.monitor.jagent.comm.ws.NotificationsCollector] [tid: StatusCollector] [ecid: 0000Lfr0uKD5mZSpyCd9iW1Oof4E000003,0] Error occurred while registering the OGG process. Exception: [[
 source parameter must not be null
]]

OR 

Exception in thread "StatusCollector" java.lang.NullPointerException
        at com.goldengate.monitor.jagent.comm.ws.NotificationsCollector.checkAndDeleteProcess(NotificationsCollector.java:140)
        at com.goldengate.monitor.jagent.comm.ws.NotificationsCollector.checkAndPublishRegisteredAndDeletedProcess(NotificationsCollector.java:128)


SOLUTION

Log in to the GGSCI prompt and run the following: 

[ggsadmin@12.2.0]$agctl stop   goldengate xxxx

[ggsadmin@12.2.0]$./ggsci

GGSCI () 1> delete datastore
Are you sure you want to delete the datastore? y
2017-04-06T18:23:20Z  INFO    OGG-06492  Datastore deleted

GGSCI () 2> create datastore
Profile 'Trace' added.
2017-04-06T18:23:27Z  INFO    OGG-06489  Datastore created
GGSCI () 3> exit

[ggsadmin@12.2.0]$agctl start   goldengate xxxxx

If the agent still doesn't start even after this, check the prm file for the agent. 

[ggsadmin@ bin]$cat /ggsadmin/app/cqxad02p/ggate/12.2.0/dirprm/jagent.prm

COMMAND java -Dconfig.dir=/ggsadmin/app/cqxad02p/ggate/12.2.0/cfg -Djava.util.logging.config.class=oracle.core.ojdl.logging.LoggingConfiguration -Doracle.core.ojdl.logging.config.file=/ggsadmin/app/cqxad02p/ggate/12.2.0/cfg/logging-config.xml -Doracle.core.ojdl.logging.componentId=JAGENT -jar -Xms512m -Xmx1024m /ggsadmin/app/cqxad02p/jagent12.2/oggmon/ogg_agent/dirjar/jagent.jar

[ggsadmin@cfg]$ll /ggsadmin/app/cqxad02p/jagent12.2/oggmon/ogg_agent/dirjar/jagent.jar

-rw-r----- 1 ggsadmin oinstall 363165 Oct 12  2015 /ggsadmin/app/cqxad02p/jagent12.2/oggmon/ogg_agent/dirjar/jagent.jar

The jagent.jar file had no executable permissions, and hence the agent was not starting up.

[ggsadmin@cfg]$chmod  755 /ggsadmin/app/cqxad02p/jagent12.2/oggmon/ogg_agent/dirjar/jagent.jar


[ggsadmin@cfg]$ll /ggsadmin/app/cqxad02p/jagent12.2/oggmon/ogg_agent/dirjar/jagent.jar
-rwxr-xr-x 1 ggsadmin oinstall 363165 Oct 12  2015 /ggsadmin/app/cqxad02p/jagent12.2/oggmon/ogg_agent/dirjar/jagent.jar
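
After fixing the permissions, a quick restart-and-verify pass might look like this; the instance name gg_inst is a placeholder for whatever is registered with XAG.

# Restart the GoldenGate instance through the clusterware agent and confirm it comes up
agctl start goldengate gg_inst
agctl status goldengate gg_inst
echo "info all" | ./ggsci          # run from the GoldenGate home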

After this change, the JAGENT started fine. 


Thursday, February 16, 2017

PSU bundle patch 5 Database Patch Set Update : 12.1.0.2.x : Installed in the CDB but not in the PDB 

OR 

validate_con_names: PDB$SEED is not open


The issue happened on one of our Oracle Database Appliances where, after patching, one of the PDBs failed to start. 

Upon checking the issue we came across the following. The cause was that when the patch was applied, the PDB was not open in read/write mode, so the datapatch component was not applied on that PDB. 

SQL> select * from PDB_PLUG_IN_VIOLATIONS;

26-JUL-16 02.15.57.395533 PM                                                PQGECAH02                                                    SQL Patch                                                         ERROR                0          1
PSU bundle patch 5 (Database Patch Set Update : 12.1.0.2.5 (21359755)): Installed in the PDB but not in the CDB.
PENDING
Call datapatch to install in the PDB or the CDB

Then we ran the following statements to open the PDB in read/write mode. 

SQL> alter session set container = xxxxxx;

alter system disable restricted session;
select con_id,name,open_mode from gv$pdbs;
select con_id,logins,status from gv$instance;


SYS@xxxxx>select con_id,name,open_mode from gv$pdbs;

    CON_ID NAME                           OPEN_MODE
---------- ------------------------------ ----------
         2 PDB$SEED                       READ ONLY
         3 xxxxx                               READ WRITE
         2 PDB$SEED                       READ ONLY
         3 xxxxx                               MOUNTED


SYS@xxxxx>SELECT restricted FROM gv$containers WHERE con_id = 3;

RES
---
YES

SYS@CQXAC06P1>alter session set container = xxxxxx;
Session altered.

SYS@xxxxx>shutdown
Pluggable Database closed.

SYS@xxxxx>startup
Warning: PDB altered with errors.
Pluggable Database opened.

SYS@xxxxx>select con_id, name , open_mode , restricted FROM gv$containers ;

    CON_ID NAME                           OPEN_MODE  RES
---------- ------------------------------ ---------- ---
         1 CDB$ROOT                       READ WRITE NO
         2 PDB$SEED                       READ ONLY  NO
         3 xxxxx                               READ WRITE YES
         1 CDB$ROOT                       READ WRITE NO
         2 PDB$SEED                       READ ONLY  NO
         3 xxxxx                               READ WRITE YES


Run following from one of the nodes.
./datapatch -verbose -pdbs xxxxxx

[oracle@xxxxOPatch]$ ./datapatch -verbose -pdbs xxxxxx
SQL Patching tool version 12.1.0.2.0 on Thu May 26 13:25:17 2016
Copyright (c) 2015, Oracle.  All rights reserved.

Log file for this invocation: /u01/app/oracle/cfgtoollogs/sqlpatch/sqlpatch_68726_2016_05_26_13_25_17/sqlpatch_invocation.log

Bootstrapping registry and package to current versions...done
Connecting to database...OK
Note:  Datapatch will only apply or rollback SQL fixes for PDBs
       that are in an open state, no patches will be applied to closed PDBs.
       Please refer to Note: Datapatch: Database 12c Post Patch SQL Automation
       (Doc ID 1585822.1)
Bootstrapping registry and package to current versions...done
Determining current state...done

Current state of SQL patches:
Bundle series PSU:
  ID 5 in the binary registry and not installed in any PDB

Adding patches to installation queue and performing prereq checks...
Installation queue:
  For the following PDBs: PQBSMAC06
    Nothing to roll back
    The following patches will be applied:
      21359755 (Database Patch Set Update : 12.1.0.2.5 (21359755))

Installing patches...
Patch installation complete.  Total patches installed: 1

Validating logfiles...
Patch 21359755 apply (pdb PQBSMAC06): SUCCESS
  logfile: /u01/app/oracle/cfgtoollogs/sqlpatch/21359755/19194568/21359755_apply_CQXAC06P_PQBSMAC06_2016May26_13_26_04.log (no errors)
SQL Patching tool complete on Thu May 26 13:26:09 2016


Lower patch in CDB_DEST than in CDB_SOURCE?

There is another situation, where your PDB has a higher patch level than your CDB. 
Now this becomes a bit trickier. See the output of PDB_PLUG_IN_VIOLATIONS:

TYPE  MESSAGE
---- ----------------------------------------------------------------------------
ERROR PSU bundle patch 1 (PSU Patch 4711): Installed in the PDB but not in the CDB
ACTION
------------------------------------------------
Call datapatch to install in the PDB or the CDB

What does this mean? Should I now install the current CDB/PDB's PSU into my target environment before being able to step down? 

You may find this message misleading. When you look into the My Oracle Support note describing this under scenario 3 (MOS Note 1935365.1 - Multitenant Unplug/Plug Best Practices), you'll see that the author silently assumed as well that it is more likely that you'll remove the patch from the PDB. 

But how do you remove changes which came in with datapatch from within a PDB only?

You will need to run datapatch -rollback on the affected PDBs only:

[oracle@xxxxOPatch]$ ./datapatch -rollback <patch id> -force [-bundle_series] -pdbs <pdb1,pdb2,...,pdbn>

This will roll back the extra patches installed in the PDBs, and then you can subsequently run datapatch on the CDB, which will fix all the issues. A combined sketch follows below. 
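
Put together, the sequence might look like the following sketch; the patch ID 4711 and the PDB names are placeholders, so take the real values from your PDB_PLUG_IN_VIOLATIONS output.

# Roll back the extra SQL patch in the affected PDBs only, then bring the whole CDB up to date
cd $ORACLE_HOME/OPatch
./datapatch -rollback 4711 -force -pdbs PDB1,PDB2
./datapatch -verbose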

One more workaround would be the following. 

SQL>alter pluggable database pdb close immediate instances=all;

SQL>alter pluggable database pdb open upgrade;

SQL> select * from gv$pdbs;


-- Run datapatch from ORACLE_HOME/OPatch

[oracle@xxx OPatch]$ ./datapatch -verbose

There is one more thing that one can come across when applying datapatch on a CDB: 
during the datapatch apply it may wait and then fail with a timeout while granting access to one user, as follows. 


SQL> GRANT SELECT ON sys.gv_$instance TO dv_secanalyst
  2  /
GRANT SELECT ON sys.gv_$instance TO dv_secanalyst
                    *
ERROR at line 1:
ORA-04021: timeout occurred while waiting to lock object


WORKAROUND:

One can start up the database in restricted mode and execute the failed SQL statement manually, but this requires downtime to start the database in restricted mode.


SQL> GRANT SELECT ON sys.gv_$instance TO dv_secanalyst

The above statement will not work if you start the database in normal mode.
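
One possible sequence, sketched on the assumption that you can take the outage, is:

# Restart in restricted mode, run the grant that timed out, then lift the restriction
sqlplus / as sysdba <<'EOF'
shutdown immediate
startup restrict
GRANT SELECT ON sys.gv_$instance TO dv_secanalyst;
alter system disable restricted session;
EOF
# then re-run datapatch from $ORACLE_HOME/OPatch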


Another Issue - 


Error
validate_con_names: PDB$SEED is not open
catconInit: Unexpected error returned by validate_con_names for inclusive Container list

Cause - 

PDB$SEED is not open:

SQL> SELECT name, open_mode FROM v$pdbs;

NAME OPEN_MODE
------------------------------ ----------
PDB$SEED MOUNTED
PQAUG01  READ WRITE 


Fix - 

SQL> alter session set "_oracle_script"=TRUE;
SQL> alter pluggable database pdb$seed OPEN READ ONLY;
SQL> alter session set "_oracle_script"=FALSE;
SQL> SELECT name, open_mode FROM v$pdbs;

NAME                                                                             OPEN_MODE
-------------------------------------------------------------------------------- ----------
PDB$SEED                                                                         READ ONLY
PQAUG01                                                                          READ WRITE