Wednesday, May 23, 2007

Dataguard, documentation/scripts for non-DBAs during failover

We're looking at implementing Dataguard as part of an implementation of Documentum (a document management system from EMC), and I have been asked to look at producing documentation and scripts for non-DBA users to use during a failover. The actual failover of the database itself will be handled by our DBAs; this is for the sys admins, network admins, application admins &c who may need to do things during failover.

I'm currently going through the Dataguard concepts guide and have found some other documentation on OTN, but I was hoping that someone with more knowledge of Dataguard could point me towards any documentation that might help with the non-Oracle side of failover. If anyone has written documentation or scripts like this and is willing to share them so that I can adapt them to our environment, I'd be very grateful.

Here is the background:

We have a pair of IBM p590 servers running AIX 5, currently sitting in different parts of the same datacentre, but (hopefully) one will be moving to a different site in the near future. Each will run both primary and standby environments, with each server hosting the standbys for the primaries on the other. Each environment will run in a virtual server.

The Documentum service runs in an N-Tier configuration:

Presentation layer
Application/Business logic layer
Metadata/Storage layer

The presentation layer is either a fat client running on local PCs or a web front end running in Tomcat that the users can access via a browser. The application/business logic layer is the Documentum application running in Oracle Application Server 10g. The metadata/storage layer consists of an Oracle 10g database and filesystem storage on IBM storage devices.

Additionally, on the application/business logic layer there are interfaces to SAP, provided by Documentum Services for SAP running on a separate Windows 2003 server, as well as scanning stations and servers. These all connect to the Documentum application.

Users do not access the database directly, nor do any other services; all access is via the application.

When a document is added to the repository (via the presentation layer, Services for SAP or scanning) it is rendered to PDF and added to a filestore on the storage (i.e. the file is saved to a directory). Metadata about the document (title, location, categories, key words &c) is stored in the Oracle database.

The filestore will be synced from primary to standby by either IBM Flashcopy or IBM Metro Mirror; the metadata will be synced by Oracle Dataguard. Due to the way Documentum handles inconsistency between the metadata and the filestore (i.e. documents in one that are not in the other), the metadata sync will always lag behind the filestore sync. If there's metadata for a non-existent document then the metadata can easily be found and deleted, but if there's no metadata for an existing document it's a bigger job to find the document. An analogy would be looking up words in a book's index to find them in the book vs checking each word in a book to see if it's in the index.
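To make the index analogy concrete, here is a minimal sketch of the two checks. It is not how Documentum itself does this; the function names are hypothetical, and it assumes the metadata can be reduced to a list of file paths relative to the filestore root. Checking metadata against the filestore needs one lookup per metadata record, while finding files with no metadata needs a walk of the entire filestore.

```python
import os


def metadata_without_file(metadata_paths, root):
    """Metadata records whose file is missing.

    Cheap: one existence check per metadata record
    (looking a word up via the index).
    """
    return sorted(
        p for p in metadata_paths
        if not os.path.isfile(os.path.join(root, p))
    )


def files_without_metadata(metadata_paths, root):
    """Files that have no metadata record.

    Expensive: every file in the filestore has to be visited
    (checking each word in the book against the index).
    """
    known = set(metadata_paths)
    orphans = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            if rel not in known:
                orphans.append(rel)
    return sorted(orphans)
```

In a real environment the list of paths would come from a query against the repository rather than a Python list, and the walk over a large filestore could take a long time, which is the point of the analogy.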

Edited to add (as a result of comments on Experts Exchange):

The failover of the database itself will be handled by the DBA team. I have been asked to produce documentation for any changes that need to be made outside the database. All the documentation I can find ignores anything outside the database. Clearly there will need to be activities outside of the database when a failover or switchover takes place; the obvious one that comes to mind is pointing the clients to the new server. I haven't been able to find anything about that.

Possible solutions that come to mind for pointing the clients to the new server are:
  • Edit the TNSNAMES.ORA files. This would be possible in this setup as we only have a few boxes (application servers, scanning servers and Services for SAP servers) that connect to these databases. If the number of boxes increases significantly then it might no longer be practical. In any case I prefer to avoid manual processes, as I know how easy it is to miss something when you're under pressure.
  • Use a unique hostname for each database and have a DNS entry for it (so if database ORCL1 is on server bigprod, then we have a DNS entry orcl1 which resolves to bigprod's IP address, and we use that name in the Net8 settings for the ORCL1 service). When failover happens we just edit the DNS settings to point to the IP address of the standby server. This was proposed in another project and may get implemented. The downside is that it means involving a directories person in a failover or switchover.
  • Use OID for database name resolution. Probably the ideal solution, but it also means implementing another technology, and paying for it. On the other hand we do have a plan to implement OID at some point in the future, so we will probably use this eventually.
  • Specify multiple ADDRESSes in the TNSNAMES.ORA file so that if a client can't reach the primary it will try the standbys. If the failover hasn't happened yet (primary is down, standby hasn't changed to primary yet) then the connection attempt will have to time out. As we're planning regular switchovers, we'd have to make sure it times out quickly for those times when the first server in the list is the secondary.
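As an illustration of the last option, here is a rough sketch that generates a TNSNAMES.ORA entry with several ADDRESSes tried in order. The function name and the standby host name bigstandby are made up for the example; ORCL1 and bigprod are the names used above. It assumes connect-time failover via an ADDRESS_LIST with FAILOVER = ON and LOAD_BALANCE = OFF, and it assumes the service is only offered by whichever database currently holds the primary role. Timeout settings are not covered here and would need to be configured separately.

```python
def tns_entry(alias, service_name, hosts, port=1521):
    """Build a tnsnames.ora entry that tries each host in list order.

    hosts: host names, preferred (usual primary) first.
    port:  listener port, assumed to be the same on every host.
    """
    addresses = "\n".join(
        f"      (ADDRESS = (PROTOCOL = TCP)(HOST = {host})(PORT = {port}))"
        for host in hosts
    )
    return (
        f"{alias} =\n"
        "  (DESCRIPTION =\n"
        "    (ADDRESS_LIST =\n"
        "      (LOAD_BALANCE = OFF)\n"  # keep the order given, no random pick
        "      (FAILOVER = ON)\n"       # move to the next address on failure
        f"{addresses}\n"
        "    )\n"
        f"    (CONNECT_DATA = (SERVICE_NAME = {service_name}))\n"
        "  )\n"
    )


if __name__ == "__main__":
    # bigstandby is a hypothetical name for the standby server.
    print(tns_entry("ORCL1", "ORCL1", ["bigprod", "bigstandby"]))
```

Generating the entry from one list of hosts, rather than editing each file by hand, would at least keep the few client boxes consistent with each other, though the files would still need to be distributed to them.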