The Classic case of OFFLINE Drop in an EMC BCV with Transportable Tablespace

Introduction:

Recently our team has been working a lot with a technology in the EMC storage environment called BCV (Business Continuity Volume). This technology makes it possible to clone a (production) Oracle database to another server. Another method would be to use Snaps in the EMC, but that is out of scope for this blog (though I work with them as well, and to an Oracle DBA the end result is the same, since both technologies offer replication, i.e. cloning).

There are two ways to replicate the data in the EMC environment:

  1. Either you issue an alter database begin backup on the source side, then do a split in the EMC, and afterwards an alter database end backup. After that, on the Unix side, the disks need to be made visible and mounted (when using file systems). If you are using ASM, then the Linux administrators need to make sure that all the disks needed for the database are present in the multipath configuration. Once that Unix/Linux part is done you can mount the database (when using a file system structure), or mount the disk groups in ASM and then mount the database for backups, or open the database.
  2. Or you do a split in the EMC (without the begin and end backup) on the source environment. The rest of the scenario is similar to 1: make the disks visible and mount the database, or mount the disk groups and then mount the database. In this scenario, opening the database on the target side means the instance performs a crash recovery.
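The scenario-1 sequence can be sketched as a small generated script. This is only a sketch: the actual EMC split command is site-specific and is shown as a placeholder comment, and the file name is hypothetical.

```shell
# Sketch of the scenario-1 command sequence. The EMC split itself
# is site-specific and appears only as a placeholder comment.
cat > bcv_split.sql <<'EOF'
alter database begin backup;
-- <perform the BCV split in the EMC at this point>
alter database end backup;
EOF
cat bcv_split.sql
```

The begin/end backup pair brackets the split so the copied datafiles are consistent for backup purposes.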

In this blog we will follow scenario 2. Once the file systems are mounted on the target Unix machine, the database is opened. The purpose of all this was to create a migration database where the project members could consolidate data in an extra, locally created tablespace as part of a migration strategy.

Details

After the file systems had been mounted by the Unix admins, at the DBAs' request some extra local mount points had been set up on top of the mount points that had come across with the BCV. Once the database was opened for the first time, an extra tablespace of 1.1 TB was created on those file systems, which are present only locally on the target machine. Together with the project it was agreed that this tablespace would become a transportable tablespace, since recreating it after a BCV refresh would take some 6 to 7 hours each time a refresh occurred. Making it a transportable tablespace offered the option of unplugging it right before a BCV refresh and plugging it back in afterwards, thus saving time.

This scenario worked well many times, but as always, one time things did not work out. After a refresh, and after the plug-in of the tablespace, there was a technical issue on the local file systems. The Unix admins had run repair jobs on the file system where the datafiles of the transportable tablespace (TTS) were located, and in the end had even cleaned out that file system. So there I was with a refreshed database that had plugged in the 1.1 TB transportable tablespace; the database had crashed and would not open (since the datafiles for the TTS were missing). A new refresh from production would mean 3 to 4 hours of extra downtime, and I would still have to recreate the tablespace by hand afterwards anyhow.

Then I recalled a scenario from the old days: a crashed database holding a newly created tablespace for which no backup existed. Once again Google was my friend, and after doing some reading, and after confirming that this could be my way out, I set up the scenario below:

## First of all I needed to find all the datafiles supposed to be part of this TTS:

select FILE#,substr(NAME,1,80)
from v$datafile
where name like '%CVN_TBS%'
order by 1;

SQL> /

FILE# SUBSTR(NAME,1,80)
———- ——————————————————————————–
1063 /db/MYDB1/cvn_tbs/CVN_TBS_050.dbf
...
1113 /db/MYDB1/cvn_tbs/CVN_TBS_005.dbf
1114 /db/MYDB1/cvn_tbs/CVN_TBS_004.dbf
1115 /db/MYDB1/cvn_tbs/CVN_TBS_003.dbf
1116 /db/MYDB1/cvn_tbs/CVN_TBS_002.dbf
1117 /db/MYDB1/cvn_tbs/CVN_TBS_001.dbf

55 rows selected.

## Next step was to offline drop them, since there was no backup present.

select 'alter database datafile '||file#||' offline drop;'
from v$datafile
where name like '%CVN_TBS%'
order by 1;

## Spooled that to a file and ran the script, which dropped the datafiles.
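The spool step itself was done in SQL*Plus; a minimal sketch of such a wrapper script, generated here from the shell so the whole piece is visible, could look like the following (the script and spool file names are hypothetical):

```shell
# Generate a SQL*Plus script that spools the generated offline-drop
# statements to a file (script and spool file names are hypothetical).
cat > gen_offline_drop.sql <<'EOF'
set heading off feedback off pagesize 0 linesize 200
spool offline_drop.sql
select 'alter database datafile '||file#||' offline drop;'
from v$datafile
where name like '%CVN_TBS%'
order by 1;
spool off
EOF
cat gen_offline_drop.sql
```

Running gen_offline_drop.sql in SQL*Plus would then produce offline_drop.sql, ready to be executed against the mounted database.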

## In the next step the database was opened:
alter database open;

## Recreating the 1.1 TB tablespace would take long hours, so I first created the tablespace with a small datafile, then prepared scripts that each add a range of datafiles to the tablespace, and ran them in the background:

#!/bin/ksh
#
# start all  steps
#

## recreate
cat step6_all.ksh
nohup /opt/oracle/SRCENS1/admin/create/step6_1.ksh&
sleep 60
nohup /opt/oracle/SRCENS1/admin/create/step6_11.ksh&   # adds datafiles 2 - 10
nohup /opt/oracle/SRCENS1/admin/create/step6_21.ksh&   # adds datafiles 11 - 20
nohup /opt/oracle/SRCENS1/admin/create/step6_31.ksh&
nohup /opt/oracle/SRCENS1/admin/create/step6_41.ksh&
nohup /opt/oracle/SRCENS1/admin/create/step6_51.ksh&
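The individual step6_*.ksh scripts are not shown above; a sketch of how one such batch could be generated follows. The tablespace name, datafile path pattern, and the 20g size are assumptions based on the file names seen earlier (55 files of roughly 20 GB match the 1.1 TB total).

```shell
# Generate one "add datafile" SQL batch for a range of datafile numbers.
# Tablespace name, path, and 20g size are assumptions for illustration.
lo=2; hi=10
out=step6_11.sql
: > "$out"
i=$lo
while [ "$i" -le "$hi" ]; do
  printf "alter tablespace CVN_TBS add datafile '/db/MYDB1/cvn_tbs/CVN_TBS_%03d.dbf' size 20g;\n" "$i" >> "$out"
  i=$((i + 1))
done
cat "$out"
```

Each step6_NN.ksh would then feed its generated batch to SQL*Plus, so several ranges of datafiles are added in parallel background sessions.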

Since I knew the project would not need all 1.1 TB from the first minute, this scenario minimized my downtime. And indeed, at the next normal refresh I first unplugged the TTS again, got a new refreshed BCV, and plugged the TTS back in, making everybody happy again.

As always  happy reading  and till we meet again,

Mathijs

When your Restart / RAC Node screams: [ohasd(1749)]CRS-0004:logging terminated

Introduction:

On one of those Mondays when a hot-line (ticket) service was provided, the team received a ticket that all instances of an Oracle Restart (11.2.0.3) environment were down. I assisted in the troubleshooting, which is always fun because it gives a bit of a Sherlock Holmes feeling, investigating the issue at hand. After logging on to the box I saw that the ASM instance was up, but a "crsctl check has" showed that no communication with the HAS daemon could be established. So the puzzle became a bit more interesting at that moment.

Details:

  • Environment is an Oracle Restart environment on Linux.
  • 11.2.0.3 Grid Infrastructure with 11.2.0.3 databases running on it.
  • Grid Infrastructure has been installed in: /opt/crs/product/11203/crs/
  • The environment went down during the weekend (fortunately it was a test environment with a standard office-hours SLA).

It was time to take a peek at the alert log of the cluster on that node. Since my installation was done in /opt/crs/product/11203/crs/, that meant I had to look for the log file in the subdirectory /opt/crs/product/11203/crs/log/mysrvr01/ (where mysrvr01 is the name of the cluster node I was working on). In there I found this, showing what was going on:

2014-10-18 10:47:37.060:
[ohasd(1749)]CRS-0004:logging terminated for the process. log file: "/opt/crs/product/11203/crs/log/mysrvr01/ohasd/ohasd.log"
2014-10-18 10:47:37.128:
[ohasd(1749)]CRS-10000:CLSU-00100: Operating System function: mkdir failed with error data: 28
CLSU-00101: Operating System error message: No space left on device
CLSU-00103: error location: authprep6
CLSU-00104: additional error information: failed to make dir /opt/crs/product/11203/crs/auth/ohasd/mysrvr01/A4107671
2014-10-18 10:47:37.146:
[ohasd(1749)]CRS-10000:CLSU-00100: Operating System function: mkdir failed with error data: 28
CLSU-00101: Operating System error message: No space left on device
CLSU-00103: error location: authprep6
CLSU-00104: additional error information: failed to make dir /opt/crs/product/11203/crs/auth/ohasd/mysrvr01/A5139099

With this as a clue, dear Watson, it was a lot easier to proceed. I started hunting for large files in the /opt/crs/product/11203/crs/ file system and found a flood of XML files that had not been erased for ages. This made me realize that a cleanup job in cron would come in handy here as well.
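Such a cron cleanup can be sketched like this. The GI home path is the one from this environment; the 30-day retention and the cron schedule are assumptions, and the find expression should be tested in list-only mode before any -delete is added:

```shell
# Sketch of a cron-able cleanup: list XML debris older than 30 days
# under the GI home (retention of 30 days is an assumption).
CRS_HOME=${CRS_HOME:-/opt/crs/product/11203/crs}
if [ -d "$CRS_HOME" ]; then
  find "$CRS_HOME" -type f -name '*.xml' -mtime +30 -print
fi
# Once the listing looks right, append -delete (GNU find) and schedule it,
# e.g.: 15 3 * * * /home/oracle/scripts/clean_crs_xml.ksh   (hypothetical path)
```

Running it print-only first avoids surprises in a directory tree the clusterware actively writes to.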

After removing the XML files and checking the mount point again, I logged in as root and first issued a crsctl start has.

After some time I noticed in the alert log of the node:
2014-10-20 08:26:50.847:
[/opt/crs/product/11203/crs/bin/oraagent.bin(32461)]CRS-5818:Aborted command 'check' for resource 'ora.DATA1.dg'. Details at (:CRSAGF00113:) {0:0:2} in /opt/crs/product/11203/crs/log/mysrvr01/agent/ohasd/oraagent_oracle/oraagent_oracle.log.
2014-10-20 08:26:50.847:
[/opt/crs/product/11203/crs/bin/oraagent.bin(32461)]CRS-5818:Aborted command 'check' for resource 'ora.DUMMY_DATA.dg'. Details at (:CRSAGF00113:) {0:0:2} in /opt/crs/product/11203/crs/log/mysrvr01/agent/ohasd/oraagent_oracle/oraagent_oracle.log.
2014-10-20 08:26:50.847:
[/opt/crs/product/11203/crs/bin/oraagent.bin(32461)]CRS-5818:Aborted command 'check' for resource 'ora.FRA1.dg'. Details at (:CRSAGF00113:) {0:0:2} in /opt/crs/product/11203/crs/log/mysrvr01/agent/ohasd/oraagent_oracle/oraagent_oracle.log.
2014-10-20 08:27:52.189:
[evmd(4105)]CRS-1401:EVMD started on node mysrvr01.
2014-10-20 08:27:53.685:
[cssd(4193)]CRS-1713:CSSD daemon is started in local-only mode
2014-10-20 08:27:55.581:
[ohasd(32421)]CRS-2767:Resource state recovery not attempted for 'ora.diskmon' as its target state is OFFLINE
2014-10-20 08:28:02.233:
[cssd(4193)]CRS-1601:CSSD Reconfiguration complete. Active nodes are mysrvr01 .

When I checked again :) ah, good news: not only was the ASM instance running, but all instances present on the box were up as well. Another puzzle solved.

Happy reading and till next Time ,

Mathijs

PRVF-4557 During installation of Oracle Binaries ( 11.2.0.3)

Introduction:

During an Oracle software install for a new RAC environment I was once again surprised by Oracle. I had installed the Grid Infrastructure (GI) 11.2.0.3 on Red Hat Linux with success, and I was about to install the Oracle binaries, which should be a piece of cake with the installed GI up and running properly. True to the old quote "never stop learning (or being surprised)", during the preparations the runInstaller managed to surprise me, and I was unable to continue with the installation. So obviously this surprise needed fixing first.

 

Details:

When running the runInstaller, after choosing the option to install the Oracle binaries in a Real Application Clusters environment and after a couple of screens, I received a pop-up telling me:

PRVF-4557 : Node application "ora.svrb1hr.vip" is offline on node

I was unable to go ahead and had to investigate. As always, Google and the Oracle community were my brothers in Oracle arms, and so I came across this scenario:

First I checked the hosts file to make sure the information was present on both nodes. It showed the following details, which looked OK:

oracle@svrb1hr:/opt/oracle [CRM]# grep  vip /etc/hosts 
10.97.242.32 svrb1hr-vip.myenv.dc-world.de svrb1hr-vip
10.97.242.33 svrb2hr-vip.myenv.dc-world.de svrb2hr-vip

The next step was checking my cluster resources. As mentioned before, the GI install had finished properly the day before, so this really made me wonder:

oracle@svrb1hr:/opt/oracle [CRM]# crsctl status resource -t 
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS       
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.CLUSTERDATA.dg
               ONLINE  ONLINE       svrb1hr                                     
               ONLINE  ONLINE       svrb2hr                                     
ora.LISTENER.lsnr
               ONLINE  OFFLINE      svrb1hr                                     
               ONLINE  ONLINE       svrb2hr                                     
ora.asm
               ONLINE  ONLINE       svrb1hr                 Started             
               ONLINE  ONLINE       svrb2hr                 Started             
ora.gsd
               OFFLINE OFFLINE      svrb1hr                                     
               OFFLINE OFFLINE      svrb2hr                                     
ora.net1.network
               ONLINE  ONLINE       svrb1hr                                     
               ONLINE  ONLINE       svrb2hr                                     
ora.ons
               ONLINE  ONLINE       svrb1hr                                     
               ONLINE  ONLINE       svrb2hr                                     
ora.registry.acfs
               ONLINE  ONLINE       svrb1hr                                     
               ONLINE  ONLINE       svrb2hr                                     
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
      1        ONLINE  ONLINE       svrb1hr                                     
ora.LISTENER_SCAN2.lsnr
      1        ONLINE  ONLINE       svrb2hr                                     
ora.LISTENER_SCAN3.lsnr
      1        ONLINE  ONLINE       svrb1hr                                     
ora.cvu
      1        ONLINE  ONLINE       svrb2hr                                     
ora.svrb1hr.vip
      1        ONLINE  INTERMEDIATE svrb2hr                 FAILED OVER         
ora.svrb2hr.vip
      1        ONLINE  ONLINE       svrb2hr                                     
ora.oc4j
      1        ONLINE  ONLINE       svrb1hr                                     
ora.scan1.vip
      1        ONLINE  ONLINE       svrb1hr                                     
ora.scan2.vip
      1        ONLINE  ONLINE       svrb2hr                                     
ora.scan3.vip
      1        ONLINE  ONLINE       svrb1hr                                     

Well, that was unexpected, because the expectation was that ora.svrb1hr.vip would be running on the first node. I still don't understand what caused this (and frankly, if you have suggestions as to what happened, please let me know). But I did know I needed to bring the VIP address back to the first server.

The first attempt was to issue the needed command (crs_relocate) on the node where I was already working (node 1):

oracle@svrb1hr:/opt/oracle [CRM]# which crs_relocate
/opt/crs/product/11203/crs/bin/crs_relocate

oracle@svrb1hr:/opt/oracle [CRM]# crs_relocate svrb1hr.vip
CRS-0210: Could not find resource 'svrb1hr.vip'.
### activities needed to be done from the second node

Grumbling at this, but well, at least it was clear what to do next…

So I opened a session to the second node and made sure my ORACLE_HOME was pointing to the GI:

oracle@svrb1hr:/opt/oracle [CRM]# ssh svrb2hr
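Pointing the session at the GI home so that crs_relocate resolves from the GI bin directory can be sketched as follows (the path is the one from this post; on real boxes a profile alias usually handles this):

```shell
# Point the session at the Grid Infrastructure home so crs_relocate
# and the other crs_* tools resolve from the GI bin directory.
ORACLE_HOME=/opt/crs/product/11203/crs
PATH=$ORACLE_HOME/bin:$PATH
export ORACLE_HOME PATH
echo "${PATH%%:*}"
```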

On the second box the crs_relocate command was entered:

                     
oracle@svrb2hr:/opt/oracle [CRS]# crs_relocate ora.svrb1hr.vip
Attempting to stop `ora.svrb1hr.vip` on member `svrb2hr`
Stop of `ora.svrb1hr.vip` on member `svrb2hr` succeeded.
Attempting to start `ora.svrb1hr.vip` on member `svrb1hr`
Start of `ora.svrb1hr.vip` on member `svrb1hr` succeeded.
Attempting to start `ora.LISTENER.lsnr` on member `svrb1hr`
Start of `ora.LISTENER.lsnr` on member `svrb1hr` succeeded.

Well, that looked promising, so let's check one more time:

 

oracle@svrb2hr:/opt/oracle [CRS]#  crsctl status resource -t 
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS       
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.CLUSTERDATA.dg
               ONLINE  ONLINE       svrb1hr                                     
               ONLINE  ONLINE       svrb2hr                                     
ora.LISTENER.lsnr
               ONLINE  ONLINE       svrb1hr                                     
               ONLINE  ONLINE       svrb2hr                                     
ora.asm
               ONLINE  ONLINE       svrb1hr                 Started             
               ONLINE  ONLINE       svrb2hr                 Started             
ora.gsd
               OFFLINE OFFLINE      svrb1hr                                     
               OFFLINE OFFLINE      svrb2hr                                     
ora.net1.network
               ONLINE  ONLINE       svrb1hr                                     
               ONLINE  ONLINE       svrb2hr                                     
ora.ons
               ONLINE  ONLINE       svrb1hr                                     
               ONLINE  ONLINE       svrb2hr                                     
ora.registry.acfs
               ONLINE  ONLINE       svrb1hr                                     
               ONLINE  ONLINE       svrb2hr                                     
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
      1        ONLINE  ONLINE       svrb1hr                                     
ora.LISTENER_SCAN2.lsnr
      1        ONLINE  ONLINE       svrb2hr                                     
ora.LISTENER_SCAN3.lsnr
      1        ONLINE  ONLINE       svrb1hr                                     
ora.cvu
      1        ONLINE  ONLINE       svrb2hr                                     
ora.svrb1hr.vip
      1        ONLINE  ONLINE       svrb1hr                                     
ora.svrb2hr.vip
      1        ONLINE  ONLINE       svrb2hr                                     
ora.oc4j
      1        ONLINE  ONLINE       svrb1hr                                     
ora.scan1.vip
      1        ONLINE  ONLINE       svrb1hr                                     
ora.scan2.vip
      1        ONLINE  ONLINE       svrb2hr                                     
ora.scan3.vip
      1        ONLINE  ONLINE       svrb1hr

 

Much better! After this I restarted the runInstaller from the first node, and indeed the installation of the Oracle binaries went flawlessly.

 

As always  happy  reading ,

 

Mathijs

The good the bad and the corrupted (ORA-01578, ORA-26040, ORA-01110)

Introduction:

 

At the moment I am part of a team that is upgrading a billing environment from 10.2.0.5 to 11.2.0.4 on Solaris. After a restore and point-in-time recovery on a preproduction environment, the team upgraded all databases using the manual upgrade method (still not that confident that the upgrades would run as smoothly with the dbua…). As always after an upgrade, there is a period of hyper care where the alert files should be monitored even more closely for funky issues. This blog gives you some more details on the funky stuff we came across and how we analyzed it. The one thing that did make me frown during this action, however, was the fact that the various tools did not necessarily all show the same results.

Conclusions:

When upgrading important environments, a DBA should really have eyes and ears everywhere before performing the upgrade, because once upgraded it will be a tough call to prove whether something was already broken in the old environment or was caused by the upgrade. So I would really recommend scrutinizing your environment like someone studying a creature under a microscope before upgrading: question broken jobs, invalid objects, and, as I learned the hard way during this upgrade, objects created with the NOLOGGING option (check DBA_TABLES, DBA_INDEXES, DBA_LOBS, DBA_TAB_PARTITIONS, DBA_LOB_PARTITIONS, DBA_TAB_SUBPARTITIONS, etc.; LOGGING='NO' indicates NOLOGGING). In my case a number of objects had been created with the NOLOGGING option, and indeed only after upgrading did I have to fight my way back to find out what was going on and which objects were involved (in my case it was a number of staging tables for a specific user, where a drop and recreate plus exp/imp dealt with it). I did recommend enabling force logging for the complete database, but was told that the customer wants to be able to create objects with NOLOGGING for performance reasons.

As mentioned, in the end I was plain lucky that this was caused by staging tables, which made the scenario easier since we could drop and recreate them. Once again I learned something (and found proof again that success just loves preparation). Of course this is a slightly different scenario from the one we will run in production, since there we will run the upgrade directly; still, we need to bear in mind the objects created in production with the NOLOGGING option, in case one of the databases in scope would need a restore if the upgrade broke it. Well, in this case all databases were upgraded successfully, so all DBAs are happy and confident for the scheduled production migration.

Details:

SERVER: mysrvr

Database: OPTMYDB

Timeline:

  • All databases of the billing environment have been restored to the new OPT server called mysrvr.
  • We have upgraded all databases from 10.2.0.5 to 11.2.0.4 on Solaris.
  • After opening the database OPTMYDB and running the statistics jobs, it started throwing error messages in the alert log about block corruptions, as seen in Example 1 below. Yesterday, together with a colleague from the ADBA team, the gather-statistics issue was resolved, and three objects were identified as having corrupt blocks (see Example 2). These three objects have been dropped and recreated, and there were no more corruption messages in the alert file.
  • After that I was asked by my manager to run a dbv check for the full database, covering all datafiles.
  • I have also used RMAN in parallel to find possible corruption; that run is still ongoing, but the first output is in Example 3:
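Driving dbv over every datafile is usually done by generating one command per file. A sketch of such a generator follows; the datafile paths here are sample values in the style of this post, the real list would be spooled from v$datafile, and blocksize=8192 is an assumption that must match db_block_size.

```shell
# Generate one dbv command per datafile from a plain list of paths.
# In real use the list is spooled from v$datafile; blocksize 8192
# is an assumption that has to match db_block_size.
cat > datafiles.lst <<'EOF'
/db/MYDB/data/tbsdata_e/tbsdip01_4m_14.dbf
/db/MYDB/data/tbsdata_e/tbsdip01_4m_15.dbf
EOF
while read -r f; do
  echo "dbv file=$f blocksize=8192 logfile=${f##*/}.dbv"
done < datafiles.lst > run_dbv.ksh
cat run_dbv.ksh
```

The resulting run_dbv.ksh can then be executed (or split into chunks and backgrounded) to verify all datafiles.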

Example 1

## In the alert file this was the information we came across, setting off the first alarm bells:

Mon Aug 04 22:00:43 2014

DBMS_STATS: GATHER_STATS_JOB encountered errors.  Check the trace file.

Errors in file /opt/oracle/OPTMYDB/diag/rdbms/MYDB/OPTMYDB/trace/OPTMYDB_j002_3640.trc:

ORA-20011: Approximate NDV failed: ORA-29913: error in executing ODCIEXTTABLEOPEN callout

ORA-29400: data cartridge error

error opening file /home/tmcon/oracle/UNMATCHED_SUBS_3640.log

Errors in file /opt/oracle/OPTMYDB/diag/rdbms/MYDB/OPTMYDB/trace/OPTMYDB_j002_3640.trc  (incident=8665):

ORA-01578: ORACLE data block corrupted (file # 620, block # 391690)

ORA-01110: data file 620: '/db/MYDB/data/tbsdata_e/tbsdip01_4m_14.dbf'

ORA-26040: Data block was loaded using the NOLOGGING option

Incident details in: /opt/oracle/OPTMYDB/diag/rdbms/MYDB/OPTMYDB/incident/incdir_8665/OPTMYDB_j002_3640_i8665.trc

 

Example2

I grepped the ORA-01578 errors from the alert file and added them to a table to get an overview of the datafiles and blocks having issues. Since the alert file is a growing object, this is what I got over the time interval since starting the database as an 11.2.0.4 environment.

grep ORA-01578 alert_OPTMYDB.log AFN BL
ORA-01578: ORACLE data block corrupted (file # 620, block # 391690) 620 391690
ORA-01578: ORACLE data block corrupted (file # 620, block # 391690) 620 391690
ORA-01578: ORACLE data block corrupted (file # 620, block # 391690) 620 391690
ORA-01578: ORACLE data block corrupted (file # 620, block # 391690) 620 391690
ORA-01578: ORACLE data block corrupted (file # 620, block # 391690) 620 391690
ORA-01578: ORACLE data block corrupted (file # 620, block # 391690) 620 391690
ORA-01578: ORACLE data block corrupted (file # 620, block # 391690) 620 391690
ORA-01578: ORACLE data block corrupted (file # 620, block # 392202) 620 392202
ORA-01578: ORACLE data block corrupted (file # 620, block # 392202) 620 392202
ORA-01578: ORACLE data block corrupted (file # 620, block # 392202) 620 392202
ORA-01578: ORACLE data block corrupted (file # 620, block # 392202) 620 392202
ORA-01578: ORACLE data block corrupted (file # 620, block # 392202) 620 392202
ORA-01578: ORACLE data block corrupted (file # 620, block # 392202) 620 392202
ORA-01578: ORACLE data block corrupted (file # 620, block # 392202) 620 392202
ORA-01578: ORACLE data block corrupted (file # 661, block # 423946) 661 423946
ORA-01578: ORACLE data block corrupted (file # 661, block # 423946) 661 423946
ORA-01578: ORACLE data block corrupted (file # 661, block # 423946) 661 423946
ORA-01578: ORACLE data block corrupted (file # 661, block # 423946) 661 423946
ORA-01578: ORACLE data block corrupted (file # 661, block # 423946) 661 423946
ORA-01578: ORACLE data block corrupted (file # 661, block # 423946) 661 423946
ORA-01578: ORACLE data block corrupted (file # 661, block # 423946) 661 423946
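The flood above can be reduced to unique (file#, block#) pairs with a small grep/sed pipeline. A sketch follows; a couple of sample lines stand in for the real alert_OPTMYDB.log, and the pattern relies on the standard ORA-01578 message format:

```shell
# Reduce repeated ORA-01578 lines to unique "file# block#" pairs.
# Sample lines stand in for the real alert_OPTMYDB.log here.
printf '%s\n' \
  'ORA-01578: ORACLE data block corrupted (file # 620, block # 391690)' \
  'ORA-01578: ORACLE data block corrupted (file # 620, block # 391690)' \
  'ORA-01578: ORACLE data block corrupted (file # 661, block # 423946)' > alert_sample.log
grep 'ORA-01578' alert_sample.log \
  | sed 's/.*file # \([0-9]*\), block # \([0-9]*\)).*/\1 \2/' \
  | sort -u
```

Run against the real alert log, this gives one row per distinct corrupted block, ready to feed into the dba_extents lookup below.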

 

## On MOS I retrieved a note that offered more information on the object located around the corrupted block:

select file_id AFN, relative_fno, tablespace_name
from dba_data_files
where relative_fno=&RFN;
Enter value for rfn: 620
old   3: where relative_fno=&RFN
new   3: where relative_fno=620
       AFN RELATIVE_FNO TABLESPACE_NAME
———- ———— ——————————
      3689          620 UNDO_TBS_01
       620          620 TBSDIP01_4M
## Used this statement to find the object at that spot in the tablespace:
select *
from dba_extents
where file_id = &AFN
and &BL between block_id AND block_id + blocks - 1;
SQL> select *
from dba_extents
where file_id = &AFN
and &BL between block_id AND block_id + blocks - 1;
  2    3    4  Enter value for afn: 620
old   3: where file_id = &AFN
new   3: where file_id = 620
Enter value for bl: 391690
old   4: and &BL between block_id AND block_id + blocks - 1
new   4: and 391690 between block_id AND block_id + blocks – 1
OWNER
——————————
SEGMENT_NAME
——————————————————————————–
PARTITION_NAME                 SEGMENT_TYPE       TABLESPACE_NAME
—————————— —————— ——————————
EXTENT_ID    FILE_ID   BLOCK_ID      BYTES     BLOCKS RELATIVE_FNO
———- ———- ———- ———- ———- ————
SCOTT
SCOTT_BPS_05062014000_NOBP
                               TABLE              TBSDIP01_4M
         0        620     391689    4194304        512          620
SQL> select *
from dba_extents
where file_id = &AFN
and &BL between block_id AND block_id + blocks - 1;
  2    3    4  Enter value for afn: 620
old   3: where file_id = &AFN
new   3: where file_id = 620
Enter value for bl: 392202
old   4: and &BL between block_id AND block_id + blocks - 1
new   4: and 392202 between block_id AND block_id + blocks – 1
OWNER
——————————
SEGMENT_NAME
——————————————————————————–
PARTITION_NAME                 SEGMENT_TYPE       TABLESPACE_NAME
—————————— —————— ——————————
EXTENT_ID    FILE_ID   BLOCK_ID      BYTES     BLOCKS RELATIVE_FNO
———- ———- ———- ———- ———- ————
SCOTT
SCOTT_BPS_05062014000_NORUEB
                               TABLE              TBSDIP01_4M
         0        620     392201    4194304        512          620
SQL> select file_id AFN, relative_fno, tablespace_name
from dba_data_files
where relative_fno=&RFN;
  2    3  Enter value for rfn: 661
old   3: where relative_fno=&RFN
new   3: where relative_fno=661
       AFN RELATIVE_FNO TABLESPACE_NAME
———- ———— ——————————
       661          661 TBSDIP01_4M
select *
from dba_extents
where file_id = &AFN
and &BL between block_id AND block_id + blocks - 1;
Enter value for afn: 661
Enter value for bl: 423946
OWNER
——————————
SEGMENT_NAME
——————————————————————————–
PARTITION_NAME                 SEGMENT_TYPE       TABLESPACE_NAME
—————————— —————— ——————————
EXTENT_ID    FILE_ID   BLOCK_ID      BYTES     BLOCKS RELATIVE_FNO
———- ———- ———- ———- ———- ————
SCOTT
SCOTT_SII_05062014000R_BAN
                               TABLE              TBSDIP01_4M
         0        661     423945    4194304        512          661

 

Example3

On MOS I also retrieved a note that offered RMAN as a tool to find logical corruption. Since I am curious by nature, of course I had to use that goodie too!

 

RMAN> run {
allocate channel d1 type disk;
allocate channel d2 type disk;
allocate channel d3 type disk;
allocate channel d4 type disk;
backup validate check logical database;
}

 

## Running this in RMAN will put rows, if any, into v$database_block_corruption. When I asked for more details on this view, I was told that it should return 0 rows when rerunning. I restarted the RMAN process after some time (so maybe, since it did not complete, it did not wrap up properly), but at the second run I noticed rows still present there.

SQL> select * from V$DATABASE_BLOCK_CORRUPTION

 

FILE#     BLOCK#     BLOCKS CORRUPTION_CHANGE# CORRUPTIO
———- ———- ———- —————— ———
235      34825         92         1.3760E+13 NOLOGGING
235     323083        510         1.3760E+13 NOLOGGING
548     315914        833         1.3760E+13 NOLOGGING
648     148489        512         1.3760E+13 NOLOGGING

 

## Well, the good thing was that this showed in one view what was most likely the cause of the issue: objects created with the NOLOGGING option, thus invalidating a 100% recovery after a restore (and recovery of a number of archived logs) followed by the upgrade to 11g.

From Oracle support I got these notes, which confirmed that the corruption was caused by NOLOGGING operations:

The Gains and Pains of Nologging Operations in a Data Guard Environment (Doc ID 290161.1)

ORA-1578 / ORA-26040 Corrupt blocks by NOLOGGING – Error explanation and solution (Doc ID 794505.1)

A quote from this Note really made me both frown and giggle:

…… Begin Quote

Note that the data inside the affected blocks is not salvageable. Methods like “Media Recovery” or “RMAN blockrecover” will not fix the problem unless the data file was backed up after the NOLOGGING operation was registered in the Redo Log.

…… END  Quote

I also learned that the way Oracle marks a block as invalidated due to NOLOGGING is by updating most of the bytes in that block with 0xff, but only if that "invalidate" redo is applied during a recovery. If the associated redo/archived log file is used to RECOVER the datafiles (which was the case in this preproduction environment), Oracle invalidates such blocks, and the errors ORA-26040 and ORA-01578 are reported by SQL statements on the next reads of those blocks. So in this case the recovery was done, but the blocks touched by NOLOGGING operations were marked soft corrupt during the recovery, as the redo records flagged "this block was nologging" were applied to them.

http://docs.oracle.com/cd/E11882_01/backup.112/e10642/osadvsce.htm#BRADV90047

The recommendation is: always perform a backup after NOLOGGING operations, don't touch those blocks again, and use that backup for recovery.

 

As always  happy reading and till we meet again,

 

Mathijs

 

Dropping redo groups in Another Thread

Introduction:

Quite some time ago I had to turn a RAC environment into a single instance / database. However, I noticed there were still redo groups of the second instance present, which I wanted to drop. This is just a quick blog showing the steps I followed:

 

## Checking the redo logs and the thread information:

SQL> select * from v$log;

GROUP#   THREAD# SEQUENCE#     BYTES BLOCKSIZE      MEMBERS ARC STATUS          FIRST_CHANGE# FIRST_TIME      NEXT_CHANGE# NEXT_TIME
———- ———- ———- ———- ———- ———- — —————- ————- ——————- ———— ——————-
1         1      1931 2147483648       512         2 YES INACTIVE         1.3789E+13 14.08.2014 18:46:19   1.3789E+13 18.08.2014 07:08:18
2         1      1933 2147483648       512         2 YES INACTIVE         1.3789E+13 18.08.2014 07:08:20   1.3789E+13 18.08.2014 07:08:22
3         1      1934 2147483648       512         2 NO  CURRENT          1.3789E+13 18.08.2014 07:08:22   2.8147E+14
4         2       128 2147483648       512         2 YES INACTIVE            146319194 19.04.2012 23:21:10   146397537 20.04.2012 09:19:04
5         2       127 2147483648       512         2 YES INACTIVE            146311735 19.04.2012 22:35:29   146319194 19.04.2012 23:21:10
6         2       126 2147483648       512         2 YES INACTIVE            146169044 18.04.2012 23:20:54   146311735 19.04.2012 22:35:29

 

## Now let's try to drop a group that is part of the 2nd thread:

SQL> alter database drop logfile group 4;

alter database drop logfile group 4

*

ERROR at line 1:

ORA-01623: log 4 is current log for instance MYDBP12 (thread 2) - cannot drop

ORA-00312: online log 4 thread 2: '+MYDBP1_REDO/MYDBP1/onlinelog/group_4.260.772812657'

ORA-00312: online log 4 thread 2: '+MYDBP1_DATA/MYDBP1/onlinelog/group_4.267.772812685'

 

 ## Checking Threads

SQL> select THREAD#, STATUS, ENABLED from v$thread;

THREAD#  STATUS  ENABLED
-------  ------  --------
      1  OPEN    PUBLIC
      2  CLOSED  PUBLIC

 

## And, as we saw in the first query, these are the groups belonging to the thread we want to drop:

SQL> select group# from v$log where THREAD#=2;

GROUP#
------
     4
     5
     6

 

## First we need to make sure that the thread will not interfere anymore, so we disable it:

SQL>

SQL> alter database disable thread 2;

Database altered.

 

## Now we can drop the logfile group:

SQL> alter database drop logfile group 4;

Database altered.

SQL> alter database drop logfile group 5;

Database altered.

SQL> alter database drop logfile group 6;

Database altered.
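If the disabled thread has many groups, the individual drops above can be condensed into a small PL/SQL loop. This is a sketch only: it assumes the thread has already been disabled with ALTER DATABASE DISABLE THREAD and that none of its groups are needed for recovery.

```sql
-- Drop every redo log group belonging to thread 2 in one go.
-- Assumes "ALTER DATABASE DISABLE THREAD 2" has already been issued.
BEGIN
  FOR g IN (SELECT group# FROM v$log WHERE thread# = 2) LOOP
    EXECUTE IMMEDIATE 'ALTER DATABASE DROP LOGFILE GROUP ' || g.group#;
  END LOOP;
END;
/
```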

## Believing is one thing, checking is better, so verify that the thread is gone:

SQL> select THREAD#, STATUS, ENABLED from v$thread;

THREAD#  STATUS  ENABLED
-------  ------  --------
      1  OPEN    PUBLIC

## Let's check the redo log groups:

SQL> select * from v$log;

GROUP#  THREAD#  SEQUENCE#       BYTES  BLOCKSIZE  MEMBERS  ARC  STATUS    FIRST_CHANGE#  FIRST_TIME           NEXT_CHANGE#  NEXT_TIME
------  -------  ---------  ----------  ---------  -------  ---  --------  -------------  -------------------  ------------  -------------------
     1        1       1931  2147483648        512        2  YES  INACTIVE     1.3789E+13  14.08.2014 18:46:19    1.3789E+13  18.08.2014 07:08:18
     2        1       1933  2147483648        512        2  YES  INACTIVE     1.3789E+13  18.08.2014 07:08:20    1.3789E+13  18.08.2014 07:08:22
     3        1       1934  2147483648        512        2  NO   CURRENT      1.3789E+13  18.08.2014 07:08:22    2.8147E+14

 

Happy DBA again!

 

Enjoy reading and till we meet again,

 

Mathijs

 

Is your backup strategy safe? (a story of BCV and Snap restores with EMC)

Introduction:

As a DBA there should always be room to improve your backup and recovery scenarios. This blog is about a judgment-day scenario: you lose a 3 TB database in full, and your bosses ask what you will do and when the database will be back on its feet. In such discussions I tend to ask when you would really lose the full database and need to restore it. Plausible answers are during an upgrade to a newer Oracle release, or when you lose the server the database runs on (although, given that so many sites use SAN storage, that would offer options other than a restore of such a big database, so even that is not very likely). In my experience it is far more common to restore single datafiles or tablespaces. But for the sake of argument, and to prove we can handle a scenario where we lose the full database (all data files and control files), let us go into more detail.

Details:

On my most important databases, this has been the setup when implementing the database(s) in ASM, to give us a safe feeling.

  • Every database has three disk groups, called SID_DATA, SID_FRA and SID_REDO.
  • The database files are in SID_DATA, the control files and redo log group members are in SID_DATA and SID_REDO, and the archives are put in SID_FRA.
  • Fortunately we are using EMC technology, both Snap and BCV. We use them to set up clone/reporting environments and pre-production, and to do our backups.

With regard to the backups, an interesting debate arose this week, which triggered this blog. In the setup of the Snap and BCV it has always been assumed that there have to be two of them for each database: one holding the SID_DATA disk group (the database files) and one holding SID_FRA and SID_REDO. When a backup is triggered, the following actions are performed on the production side:

  • The database is put in begin backup mode,
  • A copy of the controlfile is pushed to the SID_FRA disk group,
  • The mirror is broken in the EMC tools,
  • An end backup is issued in the database,
  • On the backup server the database is mounted with the copied version of the controlfile, and a level backup is run with RMAN (both level 0 and level 1, also using block change tracking),
  • After the backup the Snap/BCV remains mounted on the backup server, so that in case of a true disaster it can be used to restore the full database from disk,
  • On the production server the archives are saved to tape directly, at intervals.
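The production-side part of the steps above can be sketched roughly as follows. This is a sketch, not the site's actual script: the device group name MYDBP1_DATA_DG and the exact SYMCLI invocation are assumptions, and the controlfile copy path is an example; check your own TimeFinder configuration before using anything like this.

```shell
# Put the database in backup mode and push a controlfile copy to _FRA
# (the target path is an example).
sqlplus -s / as sysdba <<'EOF'
ALTER DATABASE BEGIN BACKUP;
ALTER DATABASE BACKUP CONTROLFILE TO '+MYDBP1_FRA/mydbp1/controlfile/backup.ctl';
EOF

# Split the BCV mirror holding the SID_DATA disk group on the EMC side
# (device group name and options are assumptions).
symmir -g MYDBP1_DATA_DG split -noprompt

# End backup mode on production.
sqlplus -s / as sysdba <<'EOF'
ALTER DATABASE END BACKUP;
EOF

# The split BCV is then mounted on the backup server, where the database
# is mounted with the copied controlfile and RMAN runs level 0 / level 1.
```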

 

So far so good, right? We have it all covered (or do we?).

If you take a closer look at a judgment-day disaster recovery request, where the customer says we need a restore of the FULL database (of 3 TB), would we be happy seeing this?

If we check the redo logs and control files, we see this:

1* select a.GROUP#,MEMBER,BYTES/1024/1024 from v$logfile a, v$log b where a.group#= b.group#  order by group#;

 

GROUP#  MEMBER                                               BYTES/1024/1024
------  ---------------------------------------------------  ---------------
     1  +MYDBP1_REDO/mydbp1/onlinelog/group_1.258.772722547             2048
     1  +MYDBP1_DATA/mydbp1/onlinelog/group_1.262.772722575             2048
     2  +MYDBP1_REDO/mydbp1/onlinelog/group_2.259.772723069             2048
     2  +MYDBP1_DATA/mydbp1/onlinelog/group_2.263.772723099             2048
     3  +MYDBP1_REDO/mydbp1/onlinelog/group_3.260.772723879             2048
     3  +MYDBP1_DATA/mydbp1/onlinelog/group_3.266.772723909             2048
     4  +MYDBP1_REDO/mydbp1/onlinelog/group_4.261.772723247             2048
     4  +MYDBP1_DATA/mydbp1/onlinelog/group_4.267.772723277             2048
     5  +MYDBP1_REDO/mydbp1/onlinelog/group_5.262.772723627             2048
     5  +MYDBP1_DATA/mydbp1/onlinelog/group_5.534.772723657             2048
     6  +MYDBP1_REDO/mydbp1/onlinelog/group_6.263.772722935             2048
     6  +MYDBP1_DATA/mydbp1/onlinelog/group_6.535.772722963             2048

12 rows selected.

 

## Control files are in:

SQL> show parameter control_files

 

NAME           TYPE    VALUE
-------------  ------  -----------------------------------------------------
control_files  string  +MYDBP1_REDO/mydbp1/controlfile/current.257.771667625,
                       +MYDBP1_DATA/mydbp1/controlfile/current.261.771667625

 

So if all hell breaks loose and you are asked to restore the backup from the Snap or BCV, what will happen?

The first scenario was: well, it is all taken care of, since we have two BCVs/Snaps, remember. The DBA would ask the storage colleague to restore the SID_DATA snap (sync it back to production), and after that we would do a recovery using the archives and the online redo logs that are still present on the production box. And we felt very good about this, because we had decided that _FRA and _REDO should not be synced back to production... right? (Actually, the proper answer is: wrong.)

As we discussed this scenario, and after looking at the locations in v$logfile, we noticed a weak spot in that idea. If we restore SID_DATA from the BCV, the first member of every online redo log group would be wiped out as well, since we have a redo member in the SID_DATA disk group (and Oracle does not like online redo logs that are out of sync). We would also lose one current control file, since that too lives in the same disk group (out-of-sync control files are not appreciated by Oracle either). Of course such a scenario will only be needed between dusk and dawn, after four nights of being on call, with everyone shouting that the issue must be fixed as soon as possible, so there would be little time to investigate and design a scenario on the spot.

Looking at the scenario at hand, when asked to restore ASAP from the BCV: losing one of the control files could easily be fixed by starting the database with one control file. Since we did not touch the _FRA and _REDO disk groups (we did not restore the second BCV, lucky us!), we would still have a current control file; after doing our restore and recovery we could use RMAN to copy that controlfile again to a second and third location and amend the spfile.

The online redo log member, however, would hurt badly if one member of each group really were overwritten (since we restore from the SID_DATA Snap/BCV) on the fastest possible way back on our feet from disk once the full database is destroyed. In a lab test I might consider a scenario of mounting the database and dropping the first member from each group, but not on a highly important production database like the one in this scenario.

So that means we need to make sure that when a SID_DATA Snap/BCV is used (resynced to production) to restore from disk, it holds only datafiles and, at most, a control file (and not also an online redo log member, as happens at the moment).

In this case I will set up new online redo log groups with members in the SID_FRA and SID_REDO disk groups. Then I will do log switches, and in the end I will drop the groups that had members in both _DATA and _REDO. After that I would be much more relaxed about a judgment-day request to restore the FULL 3 TB database from the Snap/BCV at maximum speed. It is not that much work to carry out, and I am sure I would feel a lot more confident that even a 3 TB database could be restored from a BCV quite quickly and with fewer risks.
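The move just described can be sketched as follows. Group numbers, sizes and disk group names follow the examples earlier in this post; adapt them to your own layout, and check v$log between switches.

```sql
-- Create replacement groups with members only in _REDO and _FRA.
ALTER DATABASE ADD LOGFILE GROUP 7 ('+MYDBP1_REDO', '+MYDBP1_FRA') SIZE 2048M;
ALTER DATABASE ADD LOGFILE GROUP 8 ('+MYDBP1_REDO', '+MYDBP1_FRA') SIZE 2048M;
ALTER DATABASE ADD LOGFILE GROUP 9 ('+MYDBP1_REDO', '+MYDBP1_FRA') SIZE 2048M;

-- Switch until none of the old groups is CURRENT or ACTIVE anymore.
ALTER SYSTEM SWITCH LOGFILE;
ALTER SYSTEM CHECKPOINT;

-- Old groups must show INACTIVE in v$log before they can be dropped.
ALTER DATABASE DROP LOGFILE GROUP 1;
ALTER DATABASE DROP LOGFILE GROUP 2;
-- ...and so on for the remaining groups with a member in _DATA.
```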

 

As Always  happy reading and till we meet again,

 

Mathijs

 

The return of the relink: Grid Infrastructure and RDBMS relink

Introduction:

This week I was again part of the debate: do we or don't we relink when major activities like a Linux kernel upgrade are performed? I was asked to do the relink after the RAC cluster's Linux was upgraded. So, as always, I thought it would be wise to make notes during the day, as a plan to be performed during the night. In this blog you will find the steps I performed on a two-node RAC cluster with 11.2.0.4 Grid Infrastructure and two Oracle software trees holding the 11.2.0.4 RDBMS and the 11.1 RDBMS.

With regard to relinking, the discussion in the team went like this: 1) we might break things by relinking, and 2) we don't have the resources to do it for every server. My recommendation is to follow Oracle here and relink the Grid Infrastructure right after the OS has been upgraded. If something breaks during the upgrade and you relink right after it, at least you know where the breakage came from and can deal with it from there. Whereas if you do not relink your software right after such a major OS change, you might still be hit in the dark in the coming weeks, and you would then need to figure out what caused it.

You can even debate whether it is necessary to stop resources like listeners and databases gracefully before shutting down the cluster, or whether to just perform a checkpoint in your database and shut down the CRS. I have used both approaches and never had issues so far, but I can imagine that heavily used, busy systems might prefer a graceful shutdown before stopping the GI.

 

Below you will find my steps . As always happy reading  and till we meet again ,

Mathijs.

 

Detailed Plan:

 

mysrvrar / mysrvrbr: steps 1 - 8 will be performed on both nodes in my cluster, in sequential order and with some delay, to make sure no cluster panic occurs.
1 crsctl status resource -t > /tmp/BeforeWork.lst Check your cluster so you can compare it to what it looks like after the relinking; it is a good idea to put the output into a file. I often end up on clusters I do not work with on a daily basis, so I tend to make this overview before I start working on the cluster.
2 cSpfile.ksh This is a home-made script that performs several activities: it creates an spfile, does a checkpoint and does a switch logfile right before the cluster node is shut down.
3 emctl stop agent
4 srvctl stop home -o $ORACLE_HOME -s /tmp/statusRDBMS -n mysrvrar This will stop all resources started from the 11.2.0.4 home and keep a record of them in the file /tmp/statusRDBMS. This will be convenient when starting again.
5
6 srvctl stop instance -d MYDBCM -i MYDBCM1 This is a shared cluster: some customers require the 11.2.0.4 software and some the 11.1 software. The 11.1 databases have to be stopped individually.
srvctl stop instance -d MYDBCMAC -i MYDBCMAC1
7 srvctl stop listener -n mysrvrar -l listener_MYDBCM1 It is common to have a listener per database, so I will stop the 11.1 listeners in a proper way as well.
srvctl stop listener -n mysrvrar -l listener_MYDBCMAC1
8 As root: Dealing with the cluster means you have to log on, or perform sudo su - as the ORACLE user, to become root and perform the tasks needed to stop the clusterware on the cluster node.
9 cd /opt/crs/product/11204/crs/bin
10 ./crsctl disable crs During this maintenance Linux will be patched and rebooted various times, so I was asked to make sure the Grid Infrastructure does not start at each reboot until we are ready.
11 ./crsctl stop crs Last preparation step for the Linux guys to patch the machines: shutting down the Grid Infrastructure. Time to take a 2-hour sleep.
Time to relink the software on the two nodes Starting the relink on the first node, performing steps 9 and following. I will complete all steps needed on the first node and make sure the Grid Infrastructure is started before moving on to the second node.
12 CHECK IF CRS IS DOWN otherwise REPEAT step 11 After returning to the cluster, still check whether CRS is down, because it is better to be safe than sorry.
13 As root: In order to relink the Grid Infra you have to become the root user again.
14 cd /opt/crs/product/11204/crs/bin as root
15 cd /opt/crs/product/11204/crs/crs/install
16 perl rootcrs.pl -unlock Earlier this night the GI was shut down for Linux patching. When you run perl rootcrs.pl -unlock it will try to shut down the GI, so in my case I got a message that the system was not able to stop the CRS.
17 As the Grid Infrastructure for a cluster owner: This was a bit tricky, because the owner of the Grid Infrastructure in my case is Oracle, so do not try this as root. Better to open a second window as Oracle for the steps below.
18 export ORACLE_HOME=/opt/crs/product/11204/crs As the Oracle user.
cd /opt/crs/product/11204/crs/bin As the Oracle user.
19 relink Relink will also write a relink log which you can tail.
20 [Step 1] Log into the UNIX system as the Oracle software owner: Once the GI software has been relinked it is time for relinking the Oracle Homes( in my case an 11.1 and 11.2. software tree). In my case i logged on as the oracle user.
21 [STEP 2] Verify that your $ORACLE_HOME is set correctly:
22 For all Oracle Versions and Platforms, perform this basic environment check first:
export ORACLE_HOME=/opt/oracle/product/11204_ee_64/db Oracle 11.2.0.4
export ORACLE_HOME=/opt/oracle/product/111_ee_64/db Oracle 11.1
cd $ORACLE_HOME
pwd Check the environment.
23 [Step 3] Verify and/or Configure the UNIX Environment for proper relinking:
Set LD_LIBRARY_PATH to include $ORACLE_HOME/lib LD_LIBRARY_PATH needs to be in place so when relinking both ORACLE versions make sure you set the environment in a correct way.
export LD_LIBRARY_PATH=/opt/oracle/product/11204_ee_64/db/lib
echo $LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/opt/oracle/product/111_ee_64/db/lib
echo $LD_LIBRARY_PATH
24 [Step 4] For all Oracle Versions and UNIX Platforms:
Verify that you performed Step 2 correctly: Check , check and check again
env | grep -i LD_ ….make sure that you see the correct absolute path for $ORACLE_HOME in the variable definitions.
25 [Step 5] For all Oracle Versions and UNIX Platforms:
Verify umask is set correctly:
umask This must return 022. If it does not, set umask to 022.
umask 022
umask
26 [Step 6] Run the OS Commands to Relink Oracle:
Important Notes:
* Before relinking Oracle, shut down both the database and the listener.
* The following commands will output a lot of text to your session window. To capture this output for upload to support, redirect the output to a file.
* If relinking a client installation, it’s expected that some aspects of the following commands will fail if the components were not originally installed.
27 For all UNIX platforms:
Oracle 8.1.X, 9.X.X, 10.X.X or 11.X.X
————————————-
$ORACLE_HOME/bin/relink all Oracle 11.1
$ORACLE_HOME/bin/relink all Oracle 11.2
writing relink log to: /opt/oracle/product/11204_ee_64/db/install/relink.log
28 How to Tell if Relinking Was Successful: If relinking was successful, the make command will eventually return to the OS prompt without an error; there will NOT be a 'Relinking Successful' type message. I performed a tail on the log files while relink was running in a second window and did not see any issues. As the note says, wait for the prompt to return (with no error messages) and you are good to go.
29 As root again: Since I am relinking both the GI and the RDBMS, I have moved this step (starting the GI again) till after the RDBMS relinking has finished, because of course during the relink of the RDBMS the environment (databases, listeners) has to be down!
30 cd /opt/crs/product/11204/crs/crs/install/
31 perl rootcrs.pl -patch This perl rootcrs.pl -patch will also start the cluster on this node again. NOTE: we had an issue where this was hanging on the first node. It appeared that the second node was up and running after all (my Linux colleague had issued crsctl disable crs from an old, no longer active clusterware installation that was still present on the box). So in this specific scenario I stopped CRS on the second node again, and then the script continued on the first node.
32 crsctl enable crs If you used disable crs, enable it again so that after a node reboot the GI will start.
33 As Oracle
emctl start agent Agent was already running so no manual action needed.
34 srvctl start home -o $ORACLE_HOME -s /tmp/statusRDBMS -n mysrvrar This will start all resources started from the 11.2.0.4 home. The resources had been saved previously in the /tmp/statusRDBMS file.
35 srvctl start instance -d MYDBCM -i MYDBCM1 Starting the 11.1 Resources.
srvctl start instance -d MYDBCMAC -i MYDBCMAC1
36 srvctl start listener -n mysrvrar -l listener_MYDBCM1 Starting the 11.1 Resources.
srvctl start listener -n mysrvrar -l listener_MYDBCMAC1
37 As Oracle User on the second node once it is relinked:
38 srvctl start instance -d MYDBCM -i MYDBCM2 Starting the 11.1 Resources.
srvctl start instance -d MYDBCMAC -i MYDBCMAC2
39 srvctl start listener -n mysrvrbr -l listener_REQMOD2 Starting the 11.1 Resources.
srvctl start listener -n mysrvrbr -l listener_MYDBCM2
srvctl start home -o $ORACLE_HOME -s /tmp/statusRDBMS -n mysrvrbr
crsctl status resource -t Check your cluster again and compare the result with the status from before. Hopefully all resources will show ONLINE ONLINE, or at least the situation as it was before. There may be one extra activity if you are using services that were relocated during the action; in that case you will have to relocate them back to their original location.
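Condensed to its essence, the per-node sequence from the table above looks roughly like this. It is a sketch following my own paths and the steps as documented here; run each part as the indicated user, and only with the databases and listeners down.

```shell
# --- As root: disable/stop and unlock the Grid Infrastructure ---
/opt/crs/product/11204/crs/bin/crsctl disable crs
/opt/crs/product/11204/crs/bin/crsctl stop crs
# (Linux kernel patching and reboots happen here.)
cd /opt/crs/product/11204/crs/crs/install
perl rootcrs.pl -unlock

# --- As the GI owner: relink the Grid Infrastructure home ---
export ORACLE_HOME=/opt/crs/product/11204/crs
$ORACLE_HOME/bin/relink            # writes a relink log you can tail

# --- As oracle: relink each RDBMS home (repeat per home) ---
export ORACLE_HOME=/opt/oracle/product/11204_ee_64/db
export LD_LIBRARY_PATH=$ORACLE_HOME/lib
umask 022
$ORACLE_HOME/bin/relink all

# --- As root: lock and restart the Grid Infrastructure ---
cd /opt/crs/product/11204/crs/crs/install
perl rootcrs.pl -patch             # also starts the cluster on this node
/opt/crs/product/11204/crs/bin/crsctl enable crs
```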