Friday, November 11, 2016

MongoDB StartUp Issues with "systemctl status mongod.service" and "journalctl -xe"


We recently had an issue where one of our replica set members went offline for some time and fell behind in syncing.
To work around the situation, we decided to clean out the node and start the service so that it would sync automatically from the other replica set members.
However, after cleaning out the data directories, mongod refused to start.

sh-4.2$ pbrun /etc/init.d/mongod start
Starting mongod (via systemctl):  Job for mongod.service failed because the control process exited with error code. See "systemctl status mongod.service" and "journalctl -xe" for details.                 [FAILED]

-sh-4.2$ pbrun /etc/init.d/mongod status

mongod.service - SYSV: Mongo is a scalable, document-oriented database.
   Loaded: loaded (/etc/rc.d/init.d/mongod)
   Active: failed (Result: exit-code) since Fri 2016-11-11 10:09:20 MST; 2min 49s ago
     Docs: man:systemd-sysv-generator(8)
  Process: 6552 ExecStop=/etc/rc.d/init.d/mongod stop (code=exited, status=0/SUCCESS)
  Process: 10514 ExecStart=/etc/rc.d/init.d/mongod start (code=exited, status=1/FAILURE)
 Main PID: 2156 (code=exited, status=0/SUCCESS)

Nov 11 10:09:20 hqiqlmemdb01 systemd[1]: Starting SYSV: Mongo is a scalable, document-oriented database....
Nov 11 10:09:20 hqiqlmemdb01 runuser[10522]: pam_unix(runuser:session): session opened for user mongod by (uid=0)
Nov 11 10:09:20 hqiqlmemdb01 runuser[10522]: pam_unix(runuser:session): session closed for user mongod
Nov 11 10:09:20 hqiqlmemdb01 mongod[10514]: Starting mongod: [FAILED]
Nov 11 10:09:20 hqiqlmemdb01 systemd[1]: mongod.service: control process exited, code=exited status=1
Nov 11 10:09:20 hqiqlmemdb01 systemd[1]: Failed to start SYSV: Mongo is a scalable, document-oriented database..
Nov 11 10:09:20 hqiqlmemdb01 systemd[1]: Unit mongod.service entered failed state.
Nov 11 10:09:20 hqiqlmemdb01 systemd[1]: mongod.service failed.
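Since the systemd output is terse for SysV-style services, the real reason is usually in the mongod log. A quick check (the path below is an assumption; adjust it to the systemLog.path set in your mongod.conf):

 tail -n 50 /var/log/mongodb/mongod.log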

While checking the mongod.conf file, I spotted the keyFile setting pointing to mongodb.key. This file is used for authentication between replica set members.

security:
  authorization: enabled
  keyFile: /data/mongodb/pqmrsme001/mongodb.key

While cleaning up the data directory, we had also wiped out that file, so I copied the file back from backup and placed it in the data directory.
After that, the process started just fine.
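A rough sketch of the restore, assuming the backup copy of the key file sits under /backup (adjust the source path to your environment):

 cp /backup/mongodb.key /data/mongodb/pqmrsme001/mongodb.key
 chown mongod:mongod /data/mongodb/pqmrsme001/mongodb.key
 chmod 400 /data/mongodb/pqmrsme001/mongodb.key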


-sh-4.2$ pbrun /etc/init.d/mongod start
Starting mongod (via systemctl):                           [  OK  ]

-sh-4.2$ pbrun /etc/init.d/mongod status
mongod.service - SYSV: Mongo is a scalable, document-oriented database.
   Loaded: loaded (/etc/rc.d/init.d/mongod)
   Active: active (running) since Fri 2016-11-11 10:28:44 MST; 6s ago
     Docs: man:systemd-sysv-generator(8)
  Process: 6552 ExecStop=/etc/rc.d/init.d/mongod stop (code=exited, status=0/SUCCESS)
  Process: 11733 ExecStart=/etc/rc.d/init.d/mongod start (code=exited, status=0/SUCCESS)
 Main PID: 11745 (mongod)
   CGroup: /system.slice/mongod.service
           11745 /usr/bin/mongod -f /etc/mongod.conf

Thursday, September 15, 2016


Enable Authentication & Authorization on MongoDB Ops Manager


In this note, I will describe how to enable authentication and authorization on a MongoDB Ops Manager (cloud control) environment.

My setup is as follows:
3-node replica set for the Ops Manager Metadata Repository
3-node replica set for the Ops Manager Sync Store DB
2 nodes for the application servers
1 node for the Backup Daemon

At a high level, we need to do the following:

a. Create a DB user in each replica set with the right roles.
b. Modify the Ops Manager config files with the user details.
c. Create a key file on one replica set node and copy it to all nodes (assuming you are using a single key file across all replica sets; otherwise create multiple key files).
d. Modify mongod.conf on all nodes of all replica sets and restart MongoDB in a rolling fashion.


1. Create the user on both replica sets (metadata and blockstore/sync store). Connect to the primary, switch to the admin database, and run:

db.createUser(
  {
    user: "opsmanager",
    pwd: "xxxxxxx",
    roles: [ { role: "root", db: "admin" } ]
  }
)
Make sure the new user has replicated to the secondaries.
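A quick sanity check from the mongo shell on the primary (a simple verification, run against the admin database):

 use admin
 db.getUsers()    // the opsmanager user should be listed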

2. Modify the Ops Manager config files with the user details; the mongo.mongoUri property must include the credentials:

#mongo.mongoUri=mongodb://username:password@memdb03.phx.aexp.com:27017,memdb02.phx.aexp.com:27017,memdb01.phx.aexp.com:27017

3. Bounce all Ops Manager application servers and the backup daemon:

pbrun /etc/init.d/mongodb-mms restart


Create the key file on the replica set nodes -

1. Create the key file (this is needed for both replica sets):

 openssl  rand -base64 755 > /etc/mongodb.key
 chown mongod:mongod /etc/mongodb.key
 chmod  400 /etc/mongodb.key

Copy /etc/mongodb.key to all DB instances in both replica sets.
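As a sketch, the copy could be scripted as follows (the host names are placeholders; make sure the destination matches the keyFile path referenced in each node's mongod.conf):

 for h in memdb01 memdb02 memdb03; do
   scp /etc/mongodb.key $h:/etc/mongodb.key
   ssh $h "chown mongod:mongod /etc/mongodb.key && chmod 400 /etc/mongodb.key"
 done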

2. Modify the mongod.conf files for authorization and the key file, in the order secondary, secondary, primary.

 Go to the first secondary instance and make the following change in its conf file. (Try to stage this change ahead of time on all nodes of all replica sets.)

 security:
  authorization: enabled
  keyFile: /data/mongodb/pqmrsme001/mongodb.key
  
 Bounce the mongod instance and make sure it is in sync with the primary.

 Go to the second secondary, perform the same change, and then do the primary.

 pbrun /etc/init.d/mongod restart
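After each bounce, one quick way to confirm the restarted member is caught up again (mongo shell helpers available in this MongoDB version):

 rs.printSlaveReplicationInfo()    // replication lag per secondary
 rs.status()                       // overall member states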


## After the first/primary node restart

[root@lpdosput00250 ~]#  mongo --port 27017 --authenticationDatabase admin -u opsmanager -p xxxxxxx
MongoDB shell version: 3.2.1
MongoDB Enterprise rs0:RECOVERING> rs.status()
{
        "set" : "rs0",
        "date" : ISODate("2016-08-18T18:46:06.886Z"),
        "myState" : 3,
        "term" : NumberLong(141),
        "heartbeatIntervalMillis" : NumberLong(2000),
        "members" : [
                {
                        "_id" : 0,
                        "name" : "10.20.176.248:28017",
                        "health" : 1,
                        "state" : 3,
                        "stateStr" : "RECOVERING",    <-- Notice the change of state
                        "uptime" : 34,
                        "optime" : {
                                "ts" : Timestamp(1471545769, 1),
                                "t" : NumberLong(141)
                        },
                        "optimeDate" : ISODate("2016-08-18T18:42:49Z"),
                        "infoMessage" : "could not find member to sync from",
                        "configVersion" : 3,
                        "self" : true
                },
                {
                        "_id" : 1,
                        "name" : "10.20.176.248:28018",
                        "health" : 0,
                        "state" : 6,
                        "stateStr" : "(not reachable/healthy)", <-- Notice the change of state
                        "uptime" : 0,
                        "optime" : {
                                "ts" : Timestamp(0, 0),
                                "t" : NumberLong(-1)
                        },
                        "optimeDate" : ISODate("1970-01-01T00:00:00Z"),
                        "lastHeartbeat" : ISODate("2016-08-18T18:46:03.207Z"),
                        "lastHeartbeatRecv" : ISODate("1970-01-01T00:00:00Z"),
                        "pingMs" : NumberLong(0),
                        "authenticated" : false,
                        "configVersion" : -1
                },
                {
                        "_id" : 2,
                        "name" : "10.20.176.248:28019",
                        "health" : 0,
                        "state" : 6,
                        "stateStr" : "(not reachable/healthy)", <-- Notice the change of state
                        "uptime" : 0,
                        "optime" : {
                                "ts" : Timestamp(0, 0),
                                "t" : NumberLong(-1)
                        },
                        "optimeDate" : ISODate("1970-01-01T00:00:00Z"),
                        "lastHeartbeat" : ISODate("2016-08-18T18:46:03.209Z"),
                        "lastHeartbeatRecv" : ISODate("1970-01-01T00:00:00Z"),
                        "pingMs" : NumberLong(0),
                        "authenticated" : false,
                        "configVersion" : -1
                }
        ],
        "ok" : 1
}

## After the second node restart -

MongoDB Enterprise rs0:SECONDARY> rs.status()
{
        "set" : "rs0",
        "date" : ISODate("2016-08-18T18:47:27.806Z"),
        "myState" : 2,
        "term" : NumberLong(141),
        "heartbeatIntervalMillis" : NumberLong(2000),
        "members" : [
                {
                        "_id" : 0,
                        "name" : "10.20.176.248:28017",
                        "health" : 1,
                        "state" : 2,
                        "stateStr" : "PRIMARY", <-- Notice the change of state
                        "uptime" : 10,
                        "optime" : {
                                "ts" : Timestamp(1471545769, 1),
                                "t" : NumberLong(141)
                        },
                        "optimeDate" : ISODate("2016-08-18T18:42:49Z"),
                        "lastHeartbeat" : ISODate("2016-08-18T18:47:27.724Z"),
                        "lastHeartbeatRecv" : ISODate("2016-08-18T18:47:24.382Z"),
                        "pingMs" : NumberLong(0),
                        "configVersion" : 3
                },
                {
                        "_id" : 1,
                        "name" : "10.20.176.248:28018",
                        "health" : 1,
                        "state" : 2,
                        "stateStr" : "SECONDARY", <-- Notice the change of state
                        "uptime" : 10,
                        "optime" : {
                                "ts" : Timestamp(1471545769, 1),
                                "t" : NumberLong(141)
                        },
                        "optimeDate" : ISODate("2016-08-18T18:42:49Z"),
                        "infoMessage" : "could not find member to sync from",
                        "configVersion" : 3,
                        "self" : true
                },
                {
                        "_id" : 2,
                        "name" : "10.20.176.248:28019",
                        "health" : 0,
                        "state" : 6,
                        "stateStr" : "(not reachable/healthy)", <-- Notice the change of state
                        "uptime" : 0,
                        "optime" : {
                                "ts" : Timestamp(0, 0),
                                "t" : NumberLong(-1)
                        },
                        "optimeDate" : ISODate("1970-01-01T00:00:00Z"),
                        "lastHeartbeat" : ISODate("2016-08-18T18:47:27.726Z"),
                        "lastHeartbeatRecv" : ISODate("1970-01-01T00:00:00Z"),
                        "pingMs" : NumberLong(0),
                        "authenticated" : false,
                        "configVersion" : -1
                }
        ],
        "ok" : 1
}


## After the third node restart -

MongoDB Enterprise rs0:SECONDARY> rs.status()
{
        "set" : "rs0",
        "date" : ISODate("2016-08-18T18:51:29.408Z"),
        "myState" : 2,
        "term" : NumberLong(143),
        "syncingTo" : "10.20.176.248:28018",
        "heartbeatIntervalMillis" : NumberLong(2000),
        "members" : [
                {
                        "_id" : 0,
                        "name" : "10.20.176.248:28017",
                        "health" : 1,
                        "state" : 1,
                        "stateStr" : "PRIMARY", <-- Notice the change of state
                        "uptime" : 9,
                        "optime" : {
                                "ts" : Timestamp(1471546211, 1),
                                "t" : NumberLong(143)
                        },
                        "optimeDate" : ISODate("2016-08-18T18:50:11Z"),
                        "lastHeartbeat" : ISODate("2016-08-18T18:51:25.260Z"),
                        "lastHeartbeatRecv" : ISODate("2016-08-18T18:51:29.261Z"),
                        "pingMs" : NumberLong(0),
                        "electionTime" : Timestamp(0, 0),
                        "electionDate" : ISODate("1970-01-01T00:00:00Z"),
                        "configVersion" : 3
                },
                {
                        "_id" : 1,
                        "name" : "10.20.176.248:28018",
                        "health" : 1,
                        "state" : 2,
                        "stateStr" : "SECONDARY", <-- Notice the change of state
                        "uptime" : 9,
                        "optime" : {
                                "ts" : Timestamp(1471546211, 1),
                                "t" : NumberLong(143)
                        },
                        "optimeDate" : ISODate("2016-08-18T18:50:11Z"),
                        "lastHeartbeat" : ISODate("2016-08-18T18:51:25.260Z"),
                        "lastHeartbeatRecv" : ISODate("2016-08-18T18:51:28.127Z"),
                        "pingMs" : NumberLong(0),
                        "syncingTo" : "10.20.176.248:28017",
                        "configVersion" : 3
                },
                {
                        "_id" : 2,
                        "name" : "10.20.176.248:28019",
                        "health" : 1,
                        "state" : 2,
                        "stateStr" : "SECONDARY", <-- Notice the change of state
                        "uptime" : 10,
                        "optime" : {
                                "ts" : Timestamp(1471546211, 1),
                                "t" : NumberLong(143)
                        },
                        "optimeDate" : ISODate("2016-08-18T18:50:11Z"),
                        "syncingTo" : "10.20.176.248:28018",
                        "infoMessage" : "syncing from: 10.20.176.248:28018",
                        "configVersion" : 3,
                        "self" : true
                }
        ],
        "ok" : 1
}


After you make this change, you also have to set the username and password for the oplog store database, so that Ops Manager can log in and write the oplog from the replica sets to the oplog store database.

From Ops Manager: click 'Admin' in the upper right, then the 'Backup' tab, then the 'Oplog Storage' sub-tab, and fill in the username and password fields.

Once that is done, restart the Ops Manager application servers and the backup daemon.

If you miss the above, you will see the following error in the Ops Manager system warnings:

com.xgen.svc.brs.web.res.BackupConfigurationResource.updateReplicaSetConfiguration(javax.servlet.http.HttpServletRequest,javax.servlet.http.HttpServletResponse,java.lang.String,java.lang.String,java.lang.String,java.lang.String,java.lang.String,java.lang.String,java.lang.String,boolean,java.lang.String,java.util.List,java.lang.String,java.lang.String,java.lang.String) - msg: { "serverUsed" : "memdb05:27017" , "ok" : 0.0 , "errmsg" : "not authorized on 570d36d634fc72f8790f2e63 to execute command { createIndexes: \"oplog_pdmcl101\", indexes: [ { name: \"groupId_1_rsId_1_valid_1_end_1__id_1\", ns: \"570d36d634fc72f8790f2e63.oplog_pdmcl101\", background: true, key: { groupId: 1, rsId: 1, valid: 1, end: 1, _id: 1 } } ] }" , "code" : 13} com.mongodb.CommandFailureException { "serverUsed" : "memdb05:27017" , "ok" : 0.0 , "errmsg" : "not authorized on 570d36d634fc72f8790f2e63 to execute command { createIndexes: \"oplog_pdmcl101\", indexes: [ { name: \"groupId_1_rsId_1_valid_1_end_1__id_1\", ns: \"570d36d634fc72f8790f2e63.oplog_pdmcl101\", background: true, key: { groupId: 1, rsId: 1, valid: 1, end: 1, _id: 1 } } ] }" , "code" : 13} 

Also remember that if you have monitoring and backup agents running on the backup daemon host, they do not need a bounce, because the monitoring agent does not talk to the metadata DB directly; it checks the target DB status and asks Ops Manager to write the details to the metadata DB.

MongoDB Start Up Issues & Warnings


While taking backups of MongoDB using Ops Manager (cloud control), we came across various warnings at different stages.
The following are warnings one may encounter during startup of the head DB.


WARNING: soft rlimits too low. rlimits set to 4096 processes, 64000 files.

## Create the following file if it is not already present:

 cat /etc/security/limits.d/90-nproc.conf
* soft nproc 64000
* hard nproc 64000

Make sure to reboot the system after the change.
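A quick way to confirm the new limit is in effect (a sketch, assuming a mongod process is running after the reboot):

 grep "Max processes" /proc/$(pgrep -x mongod | head -1)/limits    # should report 64000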


As we are using Data Domain to store snapshots and SAN for the head DB, another error we faced was the following:

unable to create or lock mongod.lock file

Here the issue was with the options used to mount the file system, so we changed the mount options to include the following:

nolock, noatime, nointr
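For illustration, a hypothetical /etc/fstab entry for an NFS mount carrying those options might look like this (server, export, and mount point are placeholders, not our actual values):

 datadomain01:/backup/headdb  /mongo/headdb  nfs  nolock,noatime,nointr  0 0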

There were also some warnings about numactl not being installed.
One option is to install numactl on the backup daemon server with the help of an SA,
or we can disable NUMA through the Ops Manager UI as follows (a sketch of the numactl approach follows the UI steps below).

Non-Uniform Memory Access (NUMA) Settings

Key - mongodb.disable.numa
Value Type: boolean

To disable NUMA for the head databases:

1. Click the Admin link, then the General tab, then the Ops Manager Config page, and then the Custom section.
2. Add mongodb.disable.numa as a Key and set its Value to true.
3. Click Save.
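For reference, if numactl is installed instead, the general MongoDB recommendation for a manually started mongod is to launch it through numactl with interleaved memory (general guidance, not an Ops Manager setting):

 numactl --interleave=all /usr/bin/mongod -f /etc/mongod.conf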

Tuesday, August 9, 2016


Remove Old SNAPs From AWR History



Recently we came across a situation where our AWR snapshots were taking up a lot of space in SYSAUX. Upon review we figured out that the DB had recently been restored to new servers during a physical migration.
Hence the data from the old DBID was still present in the DB.

New DBID = 3360745648
Old DBID = 926777616


SYS@NPOFE02P1>select DBID,NAME from v$database;

      DBID NAME
---------- ------------
3360745648 xxxxx

SYS@NPOFE02P1>select * from dba_hist_wr_control;

      DBID SNAP_INTERVAL                  RETENTION                      TOPNSQL
---------- ------------------------------ ------------------------------ ----------
3360745648 +00000 00:15:00.0              +00015 00:00:00.0              DEFAULT
 926777616 +00000 00:15:00.0              +00060 00:00:00.0              DEFAULT

-- We checked how many snapshots we have for the old DBID.
SYS@NPOFE02P1>select SNAP_ID, DBID,INSTANCE_NUMBER,SNAP_LEVEL from dba_hist_snapshot where dbid=926777616 order by snap_id, dbid;

-- Find the min and max snapshot IDs to remove from the DB

SYS@NPOFE02P1>select min(SNAP_ID), max(SNAP_ID) from dba_hist_snapshot where dbid=926777616;

MIN(SNAP_ID) MAX(SNAP_ID)
------------ ------------
      113085       118150

 
SQL> execute dbms_workload_repository.drop_snapshot_range(113085,118150);

PL/SQL procedure successfully completed.

Once the above step was done, all the snapshots for the old DBID were removed from the database.
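To confirm, one can re-check the snapshot count for the old DBID; if any rows were to remain, the dbid argument of drop_snapshot_range can be supplied explicitly (a hedged note; verify the three-argument signature in your release's DBMS_WORKLOAD_REPOSITORY documentation):

SQL> select count(*) from dba_hist_snapshot where dbid=926777616;
-- If rows remain, target the old DBID explicitly:
-- execute dbms_workload_repository.drop_snapshot_range(113085, 118150, 926777616);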


Friday, January 8, 2016

Add Node To MongoDB Replica-Set With Minimum Sync Time


Normally when one adds a node to a MongoDB replica set, MongoDB will wipe out any existing DBs on the newly added node and perform an initial sync from the primary node.
This is the easiest way to add a node to a replica set, but if the data volume is high, the initial sync can take a long time to complete.
To avoid that, there is another way to achieve the same result in much less time.

The following are the steps one can take to add a node to a MongoDB replica set with minimum sync time.

Initial Setup -
All of my mongo processes are running on one node on different ports.

RepB - Primary
RepA - Secondary

RepC - New Secondary <- This is the new node which will be added to the replica set.

Now one needs to copy the data files from one of the existing secondary nodes.
To make sure the copy is consistent, lock the secondary node against writes (db.fsyncLock()) while the files are copied.
Also make sure the oplog is large enough to retain the updates from the primary during the copy, without rolling over.
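To gauge the available oplog window before starting, rs.printReplicationInfo() on the primary reports the configured oplog size and the log length from first to last event (a mongo shell helper in this version):

TestABC:PRIMARY> rs.printReplicationInfo()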

[root@]# mongo  --port 29001
MongoDB shell version: 3.0.7
connecting to: 127.0.0.1:29001/test
Server has startup warnings:
TestABC:SECONDARY>

TestABC:SECONDARY> db.fsyncLock()
{
        "info" : "now locked against writes, use db.fsyncUnlock() to unlock",
        "seeAlso" : "http://dochub.mongodb.org/core/fsynccommand",
        "ok" : 1
}
TestABC:SECONDARY> EXIT

Copy the files from the existing secondary node to the new secondary node.

[root@lpdosput00251 RepA]# scp -r * ../RepC/
[root@lpdosput00251 RepA]# cd ../RepC/
[root@lpdosput00251 RepC]# ll
total 163852
drwxr-xr-x 2 root root     4096 Jan  8 14:28 journal
-rwxr-xr-x 1 root root 67108864 Jan  8 14:28 local.0
-rwxr-xr-x 1 root root 16777216 Jan  8 14:28 local.ns
-rwxr-xr-x 1 root root        5 Jan  8 14:28 mongod.lock
-rwxr-xr-x 1 root root       69 Jan  8 14:28 storage.bson
-rwxr-xr-x 1 root root 67108864 Jan  8 14:28 test.0
-rwxr-xr-x 1 root root 16777216 Jan  8 14:28 test.ns
[root@lpdosput00251 RepC]# rm mongod.lock

Once the copy is complete, unlock the secondary node as follows.

TestABC:SECONDARY> db.fsyncUnlock()
{ "ok" : 1, "info" : "unlock completed" }

Now start the mongod process as a member of the replica set. 

[root@lpdosput00251 RepC]# mongod --replSet TestABC --dbpath /amex/mongodb/RepC --port 29003 --oplogSize 50 --logpath log.C --logappend --fork

Now one has to add this node to the replica set configuration.

Connect to the primary node:

[root@lpdosput00251 ~]# mongo  --port 29002
MongoDB shell version: 3.0.7
connecting to: 127.0.0.1:29002/test

TestABC:PRIMARY> rs.add("10.20.176.194:29003")
{ "ok" : 1 }

The log file will show the following messages:

2016-01-08T14:30:32.734-0700 I REPL     [ReplicationExecutor] New replica set config in use: { _id: "TestABC", version: 6, members: [ { _id: 0, host: "10.20.176.194:29001", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 1, host: "10.20.176.194:29002", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 2, host: "10.20.176.194:29003", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 } ], settings: { chainingAllowed: true, heartbeatTimeoutSecs: 10, getLastErrorModes: {}, getLastErrorDefaults: { w: 1, wtimeout: 0 } } }
2016-01-08T14:30:32.734-0700 I REPL     [ReplicationExecutor] This node is 10.20.176.194:29003 in the config
2016-01-08T14:30:32.734-0700 I REPL     [ReplicationExecutor] transition to STARTUP2
2016-01-08T14:30:32.735-0700 I REPL     [ReplicationExecutor] Member 10.20.176.194:29002 is now in state PRIMARY
2016-01-08T14:30:32.735-0700 I REPL     [ReplicationExecutor] Member 10.20.176.194:29001 is now in state SECONDARY
2016-01-08T14:30:34.280-0700 I NETWORK  [initandlisten] connection accepted from 10.20.176.194:39327 #4 (3 connections now open)
2016-01-08T14:30:34.281-0700 I NETWORK  [conn4] end connection 10.20.176.194:39327 (2 connections now open)
2016-01-08T14:30:35.450-0700 I REPL     [ReplicationExecutor] syncing from: 10.20.176.194:29002
2016-01-08T14:30:35.450-0700 I REPL     [SyncSourceFeedback] replset setting syncSourceFeedback to 10.20.176.194:29002
2016-01-08T14:30:37.431-0700 I REPL     [ReplicationExecutor] transition to RECOVERING
2016-01-08T14:30:37.432-0700 I REPL     [ReplicationExecutor] transition to SECONDARY

Once you notice that the newly added node has finished recovering, check the replica set status.

TestABC:SECONDARY> rs.status().members[0].name
10.20.176.194:29001
TestABC:SECONDARY> rs.status().members[0].stateStr
SECONDARY
TestABC:SECONDARY> rs.status().members[1].name
10.20.176.194:29002
TestABC:SECONDARY> rs.status().members[1].stateStr
PRIMARY
TestABC:SECONDARY> rs.status().members[2].name
10.20.176.194:29003
TestABC:SECONDARY> rs.status().members[2].stateStr
SECONDARY

To check whether the data has been populated correctly:

[root@lpdosput00251 RepC]# mongo  --port 29003
MongoDB shell version: 3.0.7
connecting to: 127.0.0.1:29003/test

TestABC:SECONDARY> show dbs
2016-01-08T15:01:31.697-0700 E QUERY    Error: listDatabases failed:{ "note" : "from execCommand", "ok" : 0, "errmsg" : "not master" }
    at Error (<anonymous>)
    at Mongo.getDBs (src/mongo/shell/mongo.js:47:15)
    at shellHelper.show (src/mongo/shell/utils.js:630:33)
    at shellHelper (src/mongo/shell/utils.js:524:36)
    at (shellhelp2):1:1 at src/mongo/shell/mongo.js:47
TestABC:SECONDARY> rs.slaveOk()
TestABC:SECONDARY> show dbs
local  0.078GB
test   0.078GB
TestABC:SECONDARY> use test
switched to db test

TestABC:SECONDARY> show collections
foo
system.indexes
TestABC:SECONDARY> db.foo.find()
{ "_id" : ObjectId("5660264dfb21b4cc04a0c734"), "str" : "Sunil" }
{ "_id" : ObjectId("5660265bfb21b4cc04a0c735"), "str" : "Hardik", "x" : 3 }
{ "_id" : ObjectId("56602667fb21b4cc04a0c736"), "str" : "Hardik", "x" : 1 }
{ "_id" : ObjectId("56602674fb21b4cc04a0c737"), "str" : "sunny", "x" : 3 }
{ "_id" : ObjectId("5660267ffb21b4cc04a0c738"), "str" : "sunilkumar", "x" : 7 }

Your node is now successfully added to the replica set and in sync with the primary.