SQL Server to Oracle Replication Troubleshooting

A couple of days ago we had a serious service affecting outage to do with our replication which is responsible for taking order data from our OLTP SQL Server 2005 system, and into a Oracle 11g RAC database (11.1.0.7.9).

In the end, we had to make use of our Premier Microsoft support in order to help resolve the situation and after a long but useful 3 hours on the phone with them, we manage to resolve and continue replication without the loss of any (good) data.

Troubleshooting

The first notification we received that there was an issue was a Distribution Agent failure

SQL Server Alert System: ‘Replication: agent failure’ occurred on MYServer DESCRIPTION:    Replication-Replication Distribution Subsystem: agent Server\Instance-Publisher-ATD-ORADB-32 failed. ORA-01722: invalid number

We also got alerted of the issue from our Oracle grid COntrol monitoring:

Message=UDM WEB2: SESSION MISSING FOR SQL SERVER DISTRIB.EXE!!
Metric=User-Defined Numeric Metric

What is an ORA-01722?

Cause: The attempted conversion of a character string to a number failed because the character string was not a valid numeric literal. Only numeric fields or character fields containing numeric data may be used in arithmetic functions or expressions. Only numeric fields may be added to or subtracted from dates.

Action: Check the character strings in the function or expression. Check that they contain only numbers, a sign, a decimal point, and the character "E" or "e" and retry the operation.

Confirming the subscription error

Use [Distribution]
Go
sp_helpsubscriptionerrors [ @publisher = ] 'publisher'
        , [ @publisher_db = ] 'publisher_db' 
        , [ @publication = ] 'publication' 
        , [ @subscriber = ] 'subscriber' 
        , [ @subscriber_db = ] 'subscriber_db'

To retrieve the variables for the above command, run the following on the Publication

sp_helpsubscription

Or the same information is also available in the Run Agent step of the distribution agent

When running sp_helpsubscriptionerrors, you get the following output. In our case, confirmation of the ORA error:

image

This gives me some very key information for troubleshooting further, including the xact_seqno and the command_id which will help me identify the point of failure and the command that is being executed.

Further Issues

Our replication issue was further complicated because the replication agent kept on crashing since the first occurence of the ORA error. This meant that it wasnt retrying every minute as i would normally expect. This was a bit of a red herring as we couldnt be sure if the crash was because of the dodgy data or that there was something greater wrong with replication. Unfortuinately, we didnt get much help from the error it was producing:

Replication-Replication Distribution Subsystem: agent Server\Instance-Publisher-ATD-ORADB-32 failed. The replication agent encountered a failure. See the previous job step history message or Replication Monitor for more information.  The step failed.

Add Verbose logging

The first step after realising that the replication agent was crashing, was to add more verbose logging:

http://support.microsoft.com/kb/312292

-Output C:\Temp\OUTPUTFILE.txt -Outputverboselevel [0|1|2]

I set logging to 2, which is the most detailed level (although Microsoft later told me to change to level 4. I cant say i noticed anything different in the logging)

The last thing in the log before crashing was this:

Transaction seqno: 0x00025a04000381d400f0

Command Id:  24

Partial: 0

Type: 30

Command: <<parameterized command>>

2012-07-12 12:15:51.672 OLE DB Subscriber ‘WEB1’: SELECT * FROM "EVENT" WHERE 0 = 1

2012-07-12 12:15:51.687 Column: DBTYPE_VARCHAR, SQL Type: 129

Column: DBTYPE_VARCHAR, SQL Type: 129

Column: DBTYPE_NUMERIC, SQL Type: 131

Column: DBTYPE_NUMERIC, SQL Type: 131

Column: DBTYPE_NUMERIC, SQL Type: 131

Column: DBTYPE_VARCHAR, SQL Type: 129

Column: DBTYPE_VARCHAR, SQL Type: 129

Column: DBTYPE_VARCHAR, SQL Type: 129

Column: DBTYPE_VARCHAR, SQL Type: 129

Column: DBTYPE_NUMERIC, SQL Type: 131

Column: DBTYPE_VARCHAR, SQL Type: 129

Column: DBTYPE_VARCHAR, SQL Type: 129

Column: DBTYPE_NUMERIC, SQL Type: 131

Column: DBTYPE_VARCHAR, SQL Type: 129

Column: DBTYPE_DBTIMESTAMP, SQL Type: 135

Column: DBTYPE_NUMERIC, SQL Type: 131

Column: DBTYPE_NUMERIC, SQL Type: 131

Column: DBTYPE_VARCHAR, SQL Type: 129

Column: DBTYPE_NUMERIC, SQL Type: 131

The Application event log gave further stack trace info:

The description for Event ID ( 52 ) in Source ( SQLSERVERAGENT ) cannot be found. The local computer may not have the necessary registry information or message DLL files to display messages from a remote computer. You may be able to use the /AUXSOURCE= flag to retrieve this description; see Help and Support for details. The following information is part of the event: A replication agent encountered a fatal error and was shut down. A mini-dump has been generated at the following location:

C:\Program Files\Microsoft SQL Server\90\Shared\ErrorDumps\ReplAgent20120712121316_0.mdmp

This was passed to Microsoft and they told us that it was the oracle driver which was causing the crash and told us that this may be down to the data being sent, which brought us back to the initial theory of solving the dodgy data.

Attempts at fixing replication

Skip Errors – The first thing i attempted was skipping the error, which would have meant missing data, but would have at least caught up replication and i could have investigated the dodgy data in the backgroun without affecting service. Add the SkipErrors flag into the distribution agent and restart

-SkipErrors 1722

Unfortunately this didn’t work, possibly because the agent was crashing prior to receiving the error code back from oracle.

Finding the transaction and the command

Using the exact_seqno variable for start and end sequence, run the following:

sp_browsereplcmds 
    @xact_seqno_start ='0x00025a04000381d400f0' , 
    @xact_seqno_end ='0x00025a04000381d400f0'

This confirmed what we saw in both the sp_helpsubscriptionerrors and the verbose logging, that the EVENT table contained the dodgy data.

The problem we had was how to get this data out of the distribution database.

Please note, that the following information is unsupported and undocumented, and should not be used under normal circumstances

Every time you create replication from SQL Server to Oracle, A MSREPL7 table is created on the Oracle subscriber which keeps track of which transactions have been delivered and last delivery time etc. Ive blogged about this table in the past (https://dbamohsin.wordpress.com/2010/09/23/error-message-when-the-distribution-agent-tries-to-apply-to-the-initial-snapshot-to-an-oracle-11g-subscriber-%E2%80%9Cora-00932-inconsistent-datatypes%E2%80%9D/)

select PUBLCN, DSTAGENT, TIMECOL, XACTTS from atd_uk_tab_replication.MSREPL7
WHERE PUBLCN = 'MYPUBLICATION';
PUBLCN DSTAGENT TIMECOL XACTTS
MYPUBLICATION distrib 17/07/2012 11:10:22 00025A71000B464C014200000000

To fix the issue we manually updated the XACTTS column in the MSREPL7 table to the offenting xact_seqno. This tells replication that it has completed that XACTTS and it will move past it to look for the next sequence in the distribution database.

Before making the change, we confirmed that the MSREPL7 table was showing the XACTTS just prior to our failing sequence by checking msrepl_transactions.

select * from msrepl_transactions
where entry_time >= '2012-07-12 10:10:00'
order by entry_time

We also manually ran all the commands in the offending xact_seqno manually against the subscriber. We masked the dodgy data and replaced it with something else. The idea behind this is that your subscription does not lose any rows and stays completely in sync.

As soon as we then updated the MSREPL7 table, and started replication, transactions started flowing through. We have 2 identical subscribers so we had to follow this method twice.

Progress of a Backup or Restore in SQL 2000?

In SQL 2005 onwards, you can use DMV’s to monitor percentage completed of certain operations (https://dbamohsin.wordpress.com/2011/02/08/dmvs-sys-dm_exec_requests-and-percent_complete/)

Is this possible in SQL 2000 apart from using the STATS flag on the BACKUP or RESTORE syntax…

Yes…

Use the following command against the SPID running the backup or Restore to see buffer information

DBCC OUTPUTBUFFER(spid)

Output Buffer                                                                
—————————————————————————–

00000000   04 00 00 5d 00 38 06 00 79 01 00 00 00 ab 44 00   …].8..y….«D.

00000010   8b 0c 00 00 01 00 14 00 33 00 30 00 20 00 70 00   ‹…….3.0. .p.

00000020   65 00 72 00 63 00 65 00 6e 00 74 00 20 00 72 00   e.r.c.e.n.t. .r.

00000030   65 00 73 00 74 00 6f 00 72 00 65 00 64 00 2e 00   e.s.t.o.r.e.d…

Intelligent Reindexing on Filer based volumes

Problem

A reindexing job in our billing system DB caused an outage on the SQL server, by filling up the snapshot space for the db_log volume on out NetApp FIler. This happened due to the indexes being very highly fragmented causing a huge amount of log churn when rebuilding.

Reindexing in SQL Server is necessary in defragmenting indexes, which over time have become fragmented. Learn more here

http://sql-articles.com/articles/performance-tunning/index-fragmentation-in-sql-server-2005/

https://dbamohsin.wordpress.com/2011/05/20/fillfactor-of-an-index-slow-down-fragmentation/

I’ve been working on re-engineering the reindexing process so that it can consider the churn and limit the work done based on churn.

Solution

The new Reindexing process is as follows:

Every night at 23:00 a job runs called “Maintenance – Populate Fragmentation” runs on the SQL Server. This populates a table called TMG_Index_Fragmentation in the Billing database which is used in the job detailed below. This job records all fragmentation in the database.

Every Tuesday & Sunday at 01:00 a job called “Intelligent Reindex/Reorganise” runs on the SQL server. This does the Reindexing for the billing system DB based on the following criteria:

Only Reindex while the churn created by the reindexing is less that 5gb – if the upper limit is reached then reindexing stops

  • Reindex if above 25% fragmented
  • Reorganise if between 10% and 25% fragmented
  • Reindex online if QUERY_GOVERNER_COST_LIMIT is more than 30 seconds (Discussed here – https://dbamohsin.wordpress.com/2011/02/08/query-control-with-query_governer_cost_limit/). We have a table which stores these objects (TMG_High_Cost_Objects)
  • Skip objects at index level if churn limit is likely to be broken by reindexing (calculated by – current churn + (index reserved space * 1.10) < 5120)
    The Procedure which does the reindexing is detailed below. This procedure can potentially work on any database (as long as the objects needed have been created)

See attached log for what you would expect to see after reindexing job has completed (Intelligent Reindex-Reorganise.log).

A summary is also produced for the workload completed:

############ SUMMARY OF WORKLOAD ############## [SQLSTATE 01000]
Sum of Index Size Rebuilt: 4408 MB [SQLSTATE 01000]
Log Size on Completion: 3074 MB [SQLSTATE 01000]
Number of Indexes Rebuilt online: 10 [SQLSTATE 01000]
Number of Indexes Rebuilt offline: 76 [SQLSTATE 01000]
Number of Indexes Reorganized: 9 [SQLSTATE 01000]
Number of Indexes Skipped – Size Quota: 593 [SQLSTATE 01000]
Number of Indexes Skipped – No Fragmentation: 0 [SQLSTATE 01000]

DECLARE @Table_Name varchar(255)
DECLARE @Index_Name varchar(255)
DECLARE @Reserved int
DECLARE @Used int
DECLARE @Fragmentation int
DECLARE @Page_Count int
DECLARE @Fragment_Count int
DECLARE @SumMB int
DECLARE @onlinerebuildcount int
DECLARE @offlinerebuildcount int
DECLARE @reorgcount int
DECLARE @maxreindexsize int
DECLARE @SkipCount int
DECLARE @SkipCountMB int
DECLARE @UsedMB int

-- Reset Variables used for Summary Report
SET @SumMB = 0
SET @onlinerebuildcount = 0
SET @offlinerebuildcount = 0
SET @reorgcount = 0
SET @SkipCountMB = 0
SET @SkipCount = 0

-- Maximum total reindex size in MB. This dictates how many indexes are picked up for rebuild
SET @maxreindexsize = 5120

DECLARE TableCursor CURSOR FOR
    SELECT 
        OBJECT_NAME(i.[object_id]) as [Table_Name],
        i.name as [Index_Name], 
        (a.total_pages * 8) / 1024 as [Reserved],
        (a.used_pages * 8)/ 1024 as [Used],
        fr.avg_fragmentation_in_percent as [Fragmentaiton], 
        fr.page_count as [Page_Count], 
        fr.fragment_count as [Fragment_Count]
    FROM sys.indexes as i
        inner join tmg_index_fragmentation fr on ((fr.object_id = i.object_id) and (fr.index_id = i.index_id))
        inner join sys.partitions as p on i.object_id = p.object_id and i.index_id = p.index_id 
        inner join sys.allocation_units as a on p.partition_id = a.container_id
-- Rebuilding re-enables an Index therefore discount disabled indexes
    WHERE i.is_disabled = 0
    ORDER BY fr.avg_fragmentation_in_percent DESC

OPEN TableCursor
FETCH NEXT FROM TableCursor INTO @Table_Name, @Index_Name, @Reserved, @Used, @Fragmentation, @Page_Count, @Fragment_Count
WHILE @@FETCH_STATUS = 0
BEGIN
    
SELECT @UsedMB = 
SUM(CONVERT(NUMERIC(10,2),ROUND(FILEPROPERTY([name],'SpaceUsed')/128.,2)))
FROM [sys].[database_files]
Where Type_desc = 'LOG'

    IF (@UsedMB + (@Reserved*1.2)) < @maxreindexsize -- Max churn per reindexing window. Add 20% Leeway to sizing
    BEGIN
        IF @Fragmentation > 25    
        BEGIN
            IF @Table_Name in (Select name from TMG_High_Cost_Objects where reindex_mode = 'ONLINE')
            BEGIN
                PRINT 'REINDEX ONLINE STARTED - '+ @Index_Name + ' ON ' + @Table_Name + ' AT ' + convert(varchar(20),getdate(),113)
                EXEC ('ALTER INDEX [' + @Index_Name + '] ON ' + @Table_Name + ' REBUILD WITH (SORT_IN_TEMPDB=ON, ONLINE=ON);');
                SET @onlinerebuildcount = @onlinerebuildcount + 1
                PRINT 'REINDEX ONLINE COMPLETED - '+ @Index_Name + ' ON ' + @Table_Name + ' AT ' + convert(varchar(20),getdate(),113)
            END 
            ELSE
            BEGIN
                PRINT 'REINDEX OFFLINE STARTED - '+ @Index_Name + ' ON ' + @Table_Name + ' AT ' + convert(varchar(20),getdate(),113)
                EXEC ('ALTER INDEX [' + @Index_Name + '] ON ' + @Table_Name + ' REBUILD WITH (SORT_IN_TEMPDB=ON);');
                SET @offlinerebuildcount = @offlinerebuildcount + 1
                PRINT 'REINDEX OFFLINE COMPLETED - '+ @Index_Name + ' ON ' + @Table_Name + ' AT ' + convert(varchar(20),getdate(),113)
            END
        
        END
        ELSE IF @Fragmentation > 10 and @Fragmentation <= 25
        BEGIN
            PRINT 'REORGANIZE STARTED - '+ @Index_Name + ' ON ' + @Table_Name + ' AT ' + convert(varchar(20),getdate(),113)    
            EXEC ('ALTER INDEX [' + @Index_Name + '] ON ' + @Table_Name + ' REORGANIZE;');
            SET @reorgcount = @reorgcount + 1
            PRINT 'REORGANIZE COMPLETED - '+ @Index_Name + ' ON ' + @Table_Name + ' AT ' + convert(varchar(20),getdate(),113)
        END
        ELSE
        BEGIN
            PRINT 'OBJECT SKIPPED - '+ @Index_Name + ' ON ' + @Table_Name + '. Minimal Fragmentation (' + CAST(@Fragmentation as varchar(5)) + '%)'
            SET @SkipCount = @SkipCount + 1 
        END
    SET @SumMB = @SumMB + @Reserved
    END
    ELSE
    BEGIN
        PRINT 'OBJECT SKIPPED - '+ @Index_Name + ' ON ' + @Table_Name + '. Failed Size Check (' + CAST(@Reserved as varchar(5)) + ' MB)'
        SET @SkipCountMB = @SkipCountMB + 1 
    END

FETCH NEXT FROM TableCursor INTO @Table_Name, @Index_Name, @Reserved, @Used, @Fragmentation, @Page_Count, @Fragment_Count
END

PRINT '############ SUMMARY OF WORKLOAD ##############'
PRINT 'Sum of Index Size Rebuilt: ' + CAST(@SumMB as varchar(6)) + ' MB'
PRINT 'Log Size on Completion: ' + CAST(@UsedMB as varchar(6)) + ' MB'
PRINT 'Number of Indexes Rebuilt online: ' + CAST(@onlinerebuildcount as varchar(6))
PRINT 'Number of Indexes Rebuilt offline: ' + CAST(@offlinerebuildcount as varchar(6))
PRINT 'Number of Indexes Reorganized: ' + CAST(@reorgcount as varchar(6))
PRINT 'Number of Indexes Skipped - Size Quota: ' + CAST(@SkipCountMB as varchar(6))
PRINT 'Number of Indexes Skipped - No Fragmentation: ' + CAST(@SkipCount as varchar(6))

CLOSE TableCursor

DEALLOCATE TableCursor

return
GO

Shrink MSDB Database

Shrinking a database isnt generally recommended but there may be situations when a compact operation is required. For example, introducing a purging policy of 90 days on backup history when previously there wasn’t one would cause a lot of free space in the msdb, especially if its a high activity instance based on jobs.

Backup MSDB

The first step is to do a full backup of MSDB.  You should always make a backup before you plan on doing any changes with a system database.  You should also have a plan to restore the system database just in case you have to implement it.

To check what is using the space:

USE [msdb]
go
SELECT object_name(i.object_id) as objectName,
i.[name] as indexName,
Sum(a.total_pages) as totalPages,
sum(a.used_pages) as usedPages,
sum(a.data_pages) as dataPages,
(sum(a.total_pages) * 8 ) / 1024 as totalSpaceMB,
(sum ( a.used_pages) * 8 ) / 1024 as usedSpaceMB,
(sum(a.data_pages) * 8 ) / 1024 as dataSpaceMB
FROM sys.indexes i
INNER JOIN sys.partitions p
ON i.object_id = p.object_id
AND i.index_id = p.index_id
INNER JOIN sys.allocation_units a
ON p.partition_id = a.container_id
GROUP BY i.object_id, i.index_id, i.[name]
ORDER BY sum(a.total_pages) DESC, object_name(i.object_id)
GO

Disable all Agent Jobs

Select * into msdb..DBA_Agent_jobs_Snapshot
FROM msdb..sysjobs
WHERE ENABLED = 1

SELECT 'EXEC msdb.dbo.sp_update_job @job_id=N''' 
+ cast(job_id as varchar(40)) + ''', @enabled=0'
FROM msdb..DBA_Agent_jobs_Snapshot

--Run output from above

Shrink MSDB

USE [msdb]
GO
--1GB was sufficient for this MSDB. 
DBCC SHRINKFILE (N'MSDBData' , 1024)
GO

You can keep an eye on the compact status by running the following for your specific session:

select percent_complete, * from sys.dm_exec_requests
--session+_id of the shrink operation
where session_id = 50

Enable all Agent Jobs

SELECT 'EXEC msdb.dbo.sp_update_job @job_id=N''' 
+ cast(job_id as varchar(40)) + ''', @enabled=1'
FROM msdb..DBA_Agent_jobs_Snapshot

--Run output from above

Free eBooks from Microsoft

Came across this the other day – http://blogs.msdn.com/b/microsoft_press/archive/2012/05/04/free-ebooks-great-content-from-microsoft-press-that-won-t-cost-you-a-penny.aspx

Some great content :

image

NOLOCK equivalent in MySQL

SQL Server WITH (NOLOCK) looks like this:

SELECT * FROM TABLE WITH (nolock)

To achieve the same with MySQL, we change the session isolation mode using the SET SESSION command.

SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED ;
SELECT * FROM TABLE_NAME ;
SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ ;

This statement will work similar to WITH (NOLOCK) i.e READ UNCOMMITTED data. We can also set the isolation level for all connections globally:

SET GLOBAL TRANSACTION ISOLATION LEVEL READ UNCOMMITTED ;

In addition, two system variables related to isolation also level exist in MySQL:

--global isolation level
SELECT @@global.tx_isolation; 
--session isolation level
SELECT @@tx_isolation; 

Or set the isolation level inside a transaction:

SET TRANSACTION ISOLATION LEVEL REPEATABLE READ

MySQL Isolation levels documentation is here

Generate AWR reports from command line

Sometimes generating AWR reports from Grid Control can be slow, or even hang with timeouts. Creating the report from the server directly is usually quicker.

All the report generation sql are stored in the directory $ORACLE_HOME/rdbms/admin/

1. The awrrpt.sql SQL script generates an HTML or text report that displays statistics for a range of snapshot Ids.
2. The awrrpti.sql SQL script generates an HTML or text report that displays statistics for a range of snapshot Ids on a specified database and instance.
3. The awrsqrpt.sql SQL script generates an HTML or text report that displays statistics of a particular SQL statement for a range of snapshot Ids. Run this report to inspect or debug the performance of a SQL statement.
4. The awrsqrpi.sql SQL script generates an HTML or text report that displays statistics of a particular SQL statement for a range of snapshot Ids on a specified SQL.
5. The awrddrpt.sql SQL script generates an HTML or text report that compares detailed performance attributes and configuration settings between two selected time periods.
6. The awrddrpi.sql SQL script generates an HTML or text report that compares detailed performance attributes and configuration settings between two selected time periods on a specific database and instance.

 

[**LIVE**]mydb07:/u01/app/oracle/product/11.1/db/rdbms/admin : sqlplus / as sysdba
SQL*Plus: Release 11.1.0.7.0 – Production on Fri Apr 27 09:51:09 2012
Copyright (c) 1982, 2008, Oracle.  All rights reserved.

Connected to:
Oracle Database 11g Enterprise Edition Release 11.1.0.7.0 – 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining
and Real Application Testing options

SQL> @awrrpt.sql

Current Instance
~~~~~~~~~~~~~~~~
DB Id       DB Name      Inst Num Instance
———– ———— ——– ————
4284183767  DB1                2  DB12

Specify the Report Type
~~~~~~~~~~~~~~~~~~~~~~~
Would you like an HTML report, or a plain text report?
Enter ‘html’ for an HTML report, or ‘text’ for plain text
Defaults to ‘html’
Enter value for report_type: text

Type Specified:  text

Specify the Begin and End Snapshot Ids
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Enter value for begin_snap: 151533
Begin Snapshot Id specified: 151533

Enter value for end_snap: 151537
End   Snapshot Id specified: 151537

Specify the Report Name
~~~~~~~~~~~~~~~~~~~~~~~
The default report file name is awrrpt_2_151533_151537.txt.  To use this name,
press <return> to continue, otherwise enter an alternative.

Enter value for report_name: awrrpt_db12_201204180700-201204180800.txt

Using the report name awrrpt_db12_201204180700-201204180800.txt