Friday, October 29, 2010

Oracle 11G : root.sh fails with - Failure at final check of Oracle CRS stack. 10

I was setting up an Oracle 11g RAC on a two-node Linux cluster and ran into an issue while running root.sh on the second node of the cluster, as below:


/rdbms/crs/root.sh
Checking to see if Oracle CRS stack is already configured
/etc/oracle does not exist. Creating it now.

Setting the permissions on OCR backup directory
Setting up Network socket directories
Oracle Cluster Registry configuration upgraded successfully
clscfg: EXISTING configuration version 4 detected.
clscfg: version 4 is 11 Release 1.
Successfully accumulated necessary OCR keys.
Using ports: CSS=49895 CRS=49896 EVMC=49898 and EVMR=49897.
node :
node 1: devdb03b devdb03b-priv devdb03b
node 2: devdb03a devdb03a-priv devdb03a
clscfg: Arguments check out successfully.

NO KEYS WERE WRITTEN. Supply -force parameter to override.
-force is destructive and will destroy any previous cluster
configuration.
Oracle Cluster Registry for cluster has already been initialized
Startup will be queued to init within 30 seconds.
Adding daemons to inittab
Expecting the CRS daemons to be up within 600 seconds.
Failure at final check of Oracle CRS stack.
10


After the error I manually verified that the basic setup was right; there were a couple of trivial issues that had escaped the cluvfy verification:

1. The private and virtual host names were commented out in the /etc/hosts file on one node.
2. The time was not synced between the two nodes, which could cause node eviction.

[oracle@devdb03b cssd]$ date
Thu Oct 28 10:49:38 GMT 2010
[oracle@devdb03a ~]$ date
Thu Oct 28 10:48:39 GMT 2010
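The skew between the two nodes can be quantified from those timestamps; a minimal sketch using GNU date, with the hostnames and times taken from the listing above:

```shell
# Convert the two timestamps from the listing above to epoch seconds
# and compute the skew; here it is almost a full minute, which is
# enough for CSS to consider evicting a node once the stack is up.
t1=$(date -d 'Thu Oct 28 10:49:38 GMT 2010' +%s)   # devdb03b
t2=$(date -d 'Thu Oct 28 10:48:39 GMT 2010' +%s)   # devdb03a
echo "clock skew: $((t1 - t2)) seconds"            # prints 59
```

Syncing both nodes against the same NTP server avoids this class of problem entirely.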


After the required changes were made, the installation was cleaned up following the note below and reinstalled.

Note: How to Clean Up After a Failed 10g or 11.1 Oracle Clusterware Installation [ID 239998.1]

This still didn't resolve the issue. After some more analysis of the trace dumps from ocssd, we could see that the network heartbeat was not coming through for some other reason, such as a blocked port or a firewall issue; checking /etc/services and iptables confirmed it.

[ CSSD]2010-10-28 10:55:58.709 [1098586432] >TRACE: clssnmReadDskHeartbeat: node 1, devdb03a, has a disk HB, but no network HB, DHB has rcfg 183724820, wrtcnt, 476, LATS 51264, lastSeqNo 476, timestamp 1288262691/387864

OL=tcp)(HOST=wv1devdb03b-priv)(PORT=49895))

iptables was enabled with many restrictions, so the following rules were added to iptables and the nodes were restarted (on one node the CRS restart was hanging forever).

In node devdb03a


ACCEPT all -- devdb03b anywhere

In node devdb03b


ACCEPT all -- devdb03a anywhere
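The listings above are `iptables -L` output; as a sketch, rules like those can be added with something along the following lines (run as root; `-I INPUT 1` inserts the rule ahead of the existing restrictive ones, and the `service iptables save` step assumes a RHEL-style init):

```shell
# On devdb03a: allow all traffic from the other node.
iptables -I INPUT 1 -s devdb03b -j ACCEPT
service iptables save    # persist the rule across reboots

# On devdb03b: the mirror-image rule.
iptables -I INPUT 1 -s devdb03a -j ACCEPT
service iptables save
```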

After this the CRS became healthy, but no resources were registered.

This was due to the root.sh failure on the second node. To fix it, vipca was run as the root user from the first node, everything fell into place quickly, and the VIP, ONS and GSD all came up fine.
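As a sketch of that fix (the CRS home path is taken from the root.sh invocation above; vipca is a GUI tool, so a working X display is assumed):

```shell
# On the first node, as root:
export DISPLAY=:0.0        # vipca needs an X display
/rdbms/crs/bin/vipca       # recreates the VIP, ONS and GSD resources

# Verify the nodeapps are online afterwards:
/rdbms/crs/bin/crs_stat -t
```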

Tuesday, October 12, 2010

Oracle Netbackup restore to a different user/server

I am working in an environment where our backup strategy is Oracle RMAN + NetBackup, and today one of our APPS DBAs sought help restoring a production backup to a different server (dev) under a different user.

Restoring to a different server was straightforward, as I had done it multiple times before. The solution, shown below, is to send the name of the client that took the backup via the NB_ORA_CLIENT parameter, which lets the NetBackup client browse the backups taken from the production server (prd-bkp).

Note: the actual client here is prd-bkp

run
{
host "date";
allocate channel t1 DEVICE TYPE sbt_tape PARMS 'SBT_LIBRARY=/usr/openv/netbackup/bin/libobk.so64.1' format '%d_dbf_%u_%t' ;
send 'NB_ORA_CLIENT=prd-bkp';
restore controlfile to '/upg04/FINDEV/control01.ctl' ;
set until time "to_date('05-10-2010 10:01:00','dd-mm-yyyy hh24:mi:ss')";
release channel t1 ;
debug off;
host "date";
}

Coming to the second issue, restoring the files under a different user, I had to dig deeper. With the help of the backup admin I pulled the log files from the NetBackup master server and could see the following messages:


07:21:46.888 [6581] <2> db_valid_master_server: dev-bkp is not a valid server
07:21:46.933 [6581] <2> process_request: command C_BPLIST_4_5 (82) received
07:21:46.933 [6581] <2> process_request: list request = 329199 82 oradev dbadev prd-bkp dev-bkp
dev-bkp NONE 0 3 999 1281405910 1284084310 4 4 1 1 1 0 4 7230 9005 4 0 C C C C C 0 2 0 0 0
07:21:46.947 [6581] <2> get_type_of_client_list_restore: list and restore not specified for dev-bkp
07:21:46.947 [6581] <2> get_type_of_client_free_browse: Free browse allowed for dev-bkp
07:21:46.948 [6581] <2> db_valid_client: -all clients valid-
07:21:46.949 [6581] <2> fileslist: sockfd = 9
07:21:46.949 [6581] <2> fileslist: owner = oradev
07:21:46.949 [6581] <2> fileslist: group = dbadev
07:21:46.949 [6581] <2> fileslist: client = prd-bkp
07:21:46.949 [6581] <2> fileslist: sched_type = 12

The reason: the backup was taken as the oracle user (dba group), while the restore was attempted as the oradev user (dbadev group), and the user and group have to match for the restore to succeed. Changing the existing setup to a production-like user was not possible, because we had users like oradev and oratst on the box, all of which were going to hold copies from production.

Finally we found that since NetBackup 6.0 MP4, backup images can be read as long as the groups of the users match, and voila, we were on 6.5. This change was possible for us to make, with the help of the sysadmin, without compromising on security. The restore then went fine; the solution was as below:

1. Stop the Oracle instance running in dev.
2. Change the group of oradev from dbadev to dba.
3. Change the ownership of the binaries to oradev:dba.
4. Start the Oracle instance with the oradev user.
5. Restart the restore.
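The steps above can be sketched as follows (run as root; the ORACLE_HOME path is an illustrative assumption, not from the original post):

```shell
# 1. Stop the dev instance as its current owner.
su - oradev -c 'echo "shutdown immediate" | sqlplus / as sysdba'

# 2. Move oradev into the dba group so it matches the backup owner's group.
usermod -g dba oradev

# 3. Fix the ownership of the Oracle binaries (path is an assumption).
chown -R oradev:dba /u01/app/oracle

# 4. Start the instance back up as oradev, then rerun the RMAN restore.
su - oradev -c 'echo "startup" | sqlplus / as sysdba'
```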

Now we had a smooth restore and below was the log from the Netbackup master server.


07:53:11.923 [9423] <2> db_valid_client: -all clients valid-
07:53:11.924 [9423] <2> fileslist: sockfd = 9
07:53:11.924 [9423] <2> fileslist: owner = oradev
07:53:11.924 [9423] <2> fileslist: group = dba
07:53:11.924 [9423] <2> fileslist: client = prd-bkp
07:53:11.924 [9423] <2> fileslist: sched_type = 12.