Wednesday, March 27, 2013

ORA-12547: TNS:lost contact / Process W000 died / Process J000 died / kkjcre1p: unable to spawn jobq slave process



There are days when you can't seem to do a thing wrong, and there are days when you can't do a thing right. 26 March 13 was one of the latter kind.
The DBA on call got paged that the production DB was not accessible for users running batch jobs, and we jumped on the issue. The error reported was as follows:

ORA-12547: TNS:lost contact

This is a very generic error. After spending a bit of time on it, the DBAs figured out that even they could not connect to the DB. Things were not looking right. The usual suspect in this kind of situation is the process limits, so the alert log was scanned for related errors (though users had never reported any). Surprisingly there was no such error, which indicated that neither the PROCESSES parameter nor the SESSIONS parameter had maxed out.
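As an aside, when at least one connection is still available, a quick look at V$RESOURCE_LIMIT will confirm whether PROCESSES or SESSIONS is actually anywhere near its limit; a minimal sketch, not part of the original troubleshooting:

-- Check current and high-water utilization against the configured limits
SQL> select resource_name, current_utilization, max_utilization, limit_value
     from v$resource_limit
     where resource_name in ('processes', 'sessions');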

The errors we did find in the alert log, though, were the following:

Process J000 died, see its trace file
kkjcre1p: unable to spawn jobq slave process
Errors in file /db/archive/app/oracle/product/diag/rdbms/mmprod/MMPROD/trace/MMPROD_cjq0_778436.trc:
Process W000 died, see its trace file
Process J000 died, see its trace file
kkjcre1p: unable to spawn jobq slave process

Hmm... unable to spawn a process... Maybe the JOB_QUEUE_PROCESSES parameter had reached its limit? But that would only stop additional job queue slaves from starting; it would not block user connections, and certainly not a sysdba connection.
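Still, it costs nothing to rule that out; a minimal sketch of the check (assuming a sysdba connection is possible):

-- Check the configured job queue slave limit
SQL> show parameter job_queue_processes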
Something was amiss, and we needed to look at this from a different angle. Probably something was wrong at the OS level, so the SAs were called in, and they declared (as always) that nothing seemed wrong with the OS. We also asked them to check the profile limits, and those seemed fine too. So the ball was back in our court.

MOS was consulted and we came across a few notes suggesting that this might be due to permission issues. But the permissions on the ORACLE_HOME binaries looked fine!

So the decision was taken to bounce the DB and bring it back into business, as much valuable time had been lost in troubleshooting and all the SLAs were in shambles. The DB was bounced, and now we found ourselves in deeper trouble than we could have imagined: the DB was not ready to start.

Things were getting more interesting by the minute. We noticed that when trying to set ORACLE_HOME we got a PERMISSION DENIED error, and one of the DBAs came across the following in a trace file:

skgpgcmdout: read() for cmd /usr/bin/procstack 1425470 2>&1 timed out after 24.378 seconds

Now something was really messed up, and the application's primary DBA (a very experienced guy, whose opinion cannot be overruled!) asked the SAs to check the OS more thoroughly.
And guess what: surprise, surprise. The SA came back a few minutes later and declared that the permissions on the special device /dev/null had been changed by an unknown party, and that had brought the whole house down.

Once the permissions were fixed, the DB came up nicely.
So the important lesson learned: for connection issues, don't always look only at the DB; look beyond it as well.

Wednesday, March 13, 2013

ORA-01110: data file xx / MRP0: Background Media Recovery process shutdown



Whilst doing a simple admin task we ran into an issue on our production Data Guard setup. The primary DB ran out of space in one tablespace, and as a routine task one of our DBAs added a data file on the primary. This kind of change is routine in our environments with DG in place, but on this particular setup it caused a problem.
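For context, the primary-side change itself was nothing special; something along these lines (the tablespace name here is made up for illustration):

-- Routine space addition on the primary; USERS_TBS is a hypothetical tablespace name
SQL> alter tablespace USERS_TBS add datafile '+DATA' size 8192m;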


Log file shows - 

MRP0: Background Media Recovery terminated with error 1111
Errors in file /u01/app/oracle/diag/rdbms/xxxxxxx/xxxxxxx/trace/xxxxxxx_pr00_22450.trc:
ORA-01111: name for data file 25 is unknown - rename to correct file
ORA-01110: data file 25: '/u01/app/oracle/product/11.2.0/db_1/dbs/UNNAMED00025'
ORA-01157: cannot identify/lock data file 25 - see DBWR trace file
ORA-01111: name for data file 25 is unknown - rename to correct file
ORA-01110: data file 25: '/u01/app/oracle/product/11.2.0/db_1/dbs/UNNAMED00025'
Slave exiting with ORA-1111 exception
Errors in file /u01/app/oracle/diag/rdbms/xxxxxxx/xxxxxxx/trace/xxxxxxx_pr00_22450.trc:
ORA-01111: name for data file 25 is unknown - rename to correct file
ORA-01110: data file 25: '/u01/app/oracle/product/11.2.0/db_1/dbs/UNNAMED00025'
ORA-01157: cannot identify/lock data file 25 - see DBWR trace file
ORA-01111: name for data file 25 is unknown - rename to correct file
ORA-01110: data file 25: '/u01/app/oracle/product/11.2.0/db_1/dbs/UNNAMED00025'
Recovery Slave PR00 previously exited with exception 1111
Errors in file /u01/app/oracle/diag/rdbms/xxxxxxx/xxxxxxx/trace/xxxxxxx_mrp0_22432.trc:
ORA-01111: name for data file 25 is unknown - rename to correct file
ORA-01110: data file 25: '/u01/app/oracle/product/11.2.0/db_1/dbs/UNNAMED00025'
ORA-01157: cannot identify/lock data file 25 - see DBWR trace file
ORA-01111: name for data file 25 is unknown - rename to correct file
ORA-01110: data file 25: '/u01/app/oracle/product/11.2.0/db_1/dbs/UNNAMED00025'
MRP0: Background Media Recovery process shutdown (xxxxxxx)


On the standby database -

SQL> select name from v$datafile;
NAME
--------------------------------------------------------------------------------
+DATA/p2psfdom/datafile/domstab3.264.804084437
+DATA/p2psfdom/datafile/hadoopuser_tab.272.804085249
/u01/app/oracle/product/11.2.0/db_1/dbs/UNNAMED00025

The standby database uses ASM, so the new data file should have been created in ASM, not as a dummy file under $ORACLE_HOME/dbs. This behaviour is controlled by the STANDBY_FILE_MANAGEMENT initialization parameter: when it is set to AUTO, files added on the primary are created automatically on the standby. Here it was set to MANUAL, so the standby only recorded a placeholder (UNNAMED00025) and media recovery stopped, hence the need for a fix.
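Both halves of that diagnosis are easy to confirm on the standby; a quick sketch:

-- Confirm the parameter is set to MANUAL on the standby
SQL> show parameter standby_file_management

-- Confirm which placeholder file recovery is complaining about
SQL> select file#, name from v$datafile where name like '%UNNAMED%';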

The following fix can be used to overcome this sort of issue; all the steps are run on the standby.

-- First make sure STANDBY_FILE_MANAGEMENT is set to MANUAL on the standby (the CREATE DATAFILE in the next step is not allowed while it is AUTO)
SQL> alter system set standby_file_management='manual';

-- Next, re-create the placeholder as a real data file in ASM:

SQL> alter database create datafile '/u01/app/oracle/product/11.2.0/db_1/dbs/UNNAMED00025' as '+DG_DATA' size 8192m;

-- Next we re-enable auto standby file management.
SQL> alter system set standby_file_management='auto';

-- The last step on the standby database side
SQL> recover managed standby database disconnect from session;

Managed recovery should now be able to proceed normally.
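To double-check, a couple of quick queries on the standby should show the renamed file sitting in ASM and MRP0 applying redo again; a sketch (v$managed_standby is the 11g view):

-- File 25 should now show an ASM path instead of dbs/UNNAMED00025
SQL> select file#, name from v$datafile where file# = 25;

-- MRP0 should be back and applying redo
SQL> select process, status, sequence# from v$managed_standby where process like 'MRP%';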