Tuesday, January 13, 2015

12.1.0.2 Bug When Starting Grid Infra Management Repository DB 


During start up 12.1.0.2 Grid Infrastructure Management Repository, we came across the following error.

Startup failed with following error..

[oracle@dbrac1 bin]$ srvctl start mgmtdb
PRCR-1079 : Failed to start resource ora.mgmtdb
CRS-5017: The resource action "ora.mgmtdb start" encountered the following error:
ORA-00600: internal error code, arguments: [kcratr_scan_lastbwr], [], [], [], [], [], [], [], [], [], [], []
. For details refer to "(:CLSN00107:)" in "/u01/app/oracle/diag/crs/dbrac2/crs/trace/crsd_oraagent_oracle.trc".

CRS-2674: Start of 'ora.mgmtdb' on 'dbrac2' failed
CRS-5017: The resource action "ora.mgmtdb start" encountered the following error:
ORA-00600: internal error code, arguments: [kcratr_scan_lastbwr], [], [], [], [], [], [], [], [], [], [], []
. For details refer to "(:CLSN00107:)" in "/u01/app/oracle/diag/crs/dbrac1/crs/trace/crsd_oraagent_oracle.trc".

CRS-2674: Start of 'ora.mgmtdb' on 'dbrac1' failed
CRS-2632: There are no more servers to try to place resource 'ora.mgmtdb' on that would satisfy its placement policy

Upon checking alert logs we noticed the corruption, albeit logical... 

Beginning crash recovery of 1 threads
Tue Jan 13 21:22:21 2015
Started redo scan
Hex dump of (file 4, block 3301) in trace file /u01/app/oracle/diag/rdbms/_mgmtdb/-MGMTDB/trace/-MGMTDB_ora_22748.trc

Reading datafile '+DATA/_MGMTDB/DATAFILE/undotbs1.259.860525535' for corruption at rdba: 0x01000ce5 (file 4, block 3301)
Reread (file 4, block 3301) found same corrupt data (logically corrupt)
Write verification failed for File 4 Block 3301 (rdba 0x1000ce5)
Tue Jan 13 21:22:36 2015
Errors in file /u01/app/oracle/diag/rdbms/_mgmtdb/-MGMTDB/trace/-MGMTDB_m000_22997.trc:
ORA-00338: log 2 of thread 1 is more recent than control file
ORA-00312: online log 2 thread 1: '+DATA/_MGMTDB/ONLINELOG/group_2.262.860525589'
Errors in file /u01/app/oracle/diag/rdbms/_mgmtdb/-MGMTDB/trace/-MGMTDB_ora_22748.trc  (incident=41020) (PDBNAME=CDB$ROOT):
ORA-00600: internal error code, arguments: [kcratr_scan_lastbwr], [], [], [], [], [], [], [], [], [], [], []
Incident details in: /u01/app/oracle/diag/rdbms/_mgmtdb/-MGMTDB/incident/incdir_41020/-MGMTDB_ora_22748_i41020.trc
Tue Jan 13 21:22:49 2015
Dumping diagnostic data in directory=[cdmp_20150113212249], requested by (instance=1, osid=22748), summary=[incident=41020].
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Tue Jan 13 21:22:51 2015
Aborting crash recovery due to error 600
Tue Jan 13 21:22:51 2015
Errors in file /u01/app/oracle/diag/rdbms/_mgmtdb/-MGMTDB/trace/-MGMTDB_ora_22748.trc:
ORA-00600: internal error code, arguments: [kcratr_scan_lastbwr], [], [], [], [], [], [], [], [], [], [], []

However, our management listener was running fine. 

[oracle@dbrac1 trace]$ srvctl status  mgmtlsnr
Listener MGMTLSNR is enabled
Listener MGMTLSNR is running on node(s): dbrac1

Upon checking MOS, we figure this is a partially published bug which caused this behaviour. The solution to overcome to this is as follows.

export ORACLE_SID= -MGMTDB

SQL>  startup mount
SQL>  recover database;
SQL>  alter database open ;

Upon startup we hit the following error
alter database open
*
ERROR at line 1:
ORA-03113: end-of-file on communication channel
Process ID: 29109
Session ID: 1 Serial number: 24261

Alert Logs shows... 

ORA-00338: log 2 of thread 1 is more recent than control file
ORA-00312: online log 2 thread 1: '+DATA/_MGMTDB/ONLINELOG/group_2.262.860525589'

This means the SCN of control file is older than SCN of logfile. This happens when you do recovery of old controlfile, which was not the case here. So something does not seem right. I also tried to start the DB on second node of cluster and again it failed.
I also tried to start MGMTCA tool, to reconfigure the management repository but it failed since my GIRM DB was not running.

There may be possibility of the wrong control file being used, but again that was ruled out. So ultimately, what I have to do is to simulate the cancel based recovery and then try to start the db with resetlogs option. 

SQL> startup mount
ORACLE instance started.

Total System Global Area  788529152 bytes
Fixed Size                  2929352 bytes
Variable Size             314576184 bytes
Database Buffers          465567744 bytes
Redo Buffers                5455872 bytes
Database mounted.
SQL> recover database until cancel;
Media recovery complete.
SQL> alter database open resetlogs;
Database altered.

And as you can this approach worked. 

[oracle@dbrac1 bin]$ ps -ef | grep pmon
oracle    3320     1  0 20:30 ?        00:00:00 asm_pmon_+ASM1
oracle    3508     1  0 22:10 ?        00:00:00 mdb_pmon_-MGMTDB
oracle    4712 18509  0 22:13 pts/1    00:00:00 grep pmon
oracle    6378     1  0 20:34 ?        00:00:00 ora_pmon_orcl1
[oracle@dbrac1 bin]$ srvctl  start mgmtdb
[oracle@dbrac1 bin]$ srvctl  status mgmtdb
Database is enabled
Instance -MGMTDB is running on node dbrac1

Hope this helps...