SQL Server to Oracle Replication Troubleshooting

A couple of days ago we had a serious service affecting outage to do with our replication which is responsible for taking order data from our OLTP SQL Server 2005 system, and into a Oracle 11g RAC database (11.1.0.7.9).

In the end, we had to make use of our Premier Microsoft support in order to help resolve the situation and after a long but useful 3 hours on the phone with them, we manage to resolve and continue replication without the loss of any (good) data.

Troubleshooting

The first notification we received that there was an issue was a Distribution Agent failure

SQL Server Alert System: ‘Replication: agent failure’ occurred on MYServer DESCRIPTION:    Replication-Replication Distribution Subsystem: agent Server\Instance-Publisher-ATD-ORADB-32 failed. ORA-01722: invalid number

We also got alerted of the issue from our Oracle grid COntrol monitoring:

Message=UDM WEB2: SESSION MISSING FOR SQL SERVER DISTRIB.EXE!!
Metric=User-Defined Numeric Metric

What is an ORA-01722?

Cause: The attempted conversion of a character string to a number failed because the character string was not a valid numeric literal. Only numeric fields or character fields containing numeric data may be used in arithmetic functions or expressions. Only numeric fields may be added to or subtracted from dates.

Action: Check the character strings in the function or expression. Check that they contain only numbers, a sign, a decimal point, and the character "E" or "e" and retry the operation.

Confirming the subscription error

Use [Distribution]
Go
sp_helpsubscriptionerrors [ @publisher = ] 'publisher'
        , [ @publisher_db = ] 'publisher_db' 
        , [ @publication = ] 'publication' 
        , [ @subscriber = ] 'subscriber' 
        , [ @subscriber_db = ] 'subscriber_db'

To retrieve the variables for the above command, run the following on the Publication

sp_helpsubscription

Or the same information is also available in the Run Agent step of the distribution agent

When running sp_helpsubscriptionerrors, you get the following output. In our case, confirmation of the ORA error:

image

This gives me some very key information for troubleshooting further, including the xact_seqno and the command_id which will help me identify the point of failure and the command that is being executed.

Further Issues

Our replication issue was further complicated because the replication agent kept on crashing since the first occurence of the ORA error. This meant that it wasnt retrying every minute as i would normally expect. This was a bit of a red herring as we couldnt be sure if the crash was because of the dodgy data or that there was something greater wrong with replication. Unfortuinately, we didnt get much help from the error it was producing:

Replication-Replication Distribution Subsystem: agent Server\Instance-Publisher-ATD-ORADB-32 failed. The replication agent encountered a failure. See the previous job step history message or Replication Monitor for more information.  The step failed.

Add Verbose logging

The first step after realising that the replication agent was crashing, was to add more verbose logging:

http://support.microsoft.com/kb/312292

-Output C:\Temp\OUTPUTFILE.txt -Outputverboselevel [0|1|2]

I set logging to 2, which is the most detailed level (although Microsoft later told me to change to level 4. I cant say i noticed anything different in the logging)

The last thing in the log before crashing was this:

Transaction seqno: 0x00025a04000381d400f0

Command Id:  24

Partial: 0

Type: 30

Command: <<parameterized command>>

2012-07-12 12:15:51.672 OLE DB Subscriber ‘WEB1’: SELECT * FROM "EVENT" WHERE 0 = 1

2012-07-12 12:15:51.687 Column: DBTYPE_VARCHAR, SQL Type: 129

Column: DBTYPE_VARCHAR, SQL Type: 129

Column: DBTYPE_NUMERIC, SQL Type: 131

Column: DBTYPE_NUMERIC, SQL Type: 131

Column: DBTYPE_NUMERIC, SQL Type: 131

Column: DBTYPE_VARCHAR, SQL Type: 129

Column: DBTYPE_VARCHAR, SQL Type: 129

Column: DBTYPE_VARCHAR, SQL Type: 129

Column: DBTYPE_VARCHAR, SQL Type: 129

Column: DBTYPE_NUMERIC, SQL Type: 131

Column: DBTYPE_VARCHAR, SQL Type: 129

Column: DBTYPE_VARCHAR, SQL Type: 129

Column: DBTYPE_NUMERIC, SQL Type: 131

Column: DBTYPE_VARCHAR, SQL Type: 129

Column: DBTYPE_DBTIMESTAMP, SQL Type: 135

Column: DBTYPE_NUMERIC, SQL Type: 131

Column: DBTYPE_NUMERIC, SQL Type: 131

Column: DBTYPE_VARCHAR, SQL Type: 129

Column: DBTYPE_NUMERIC, SQL Type: 131

The Application event log gave further stack trace info:

The description for Event ID ( 52 ) in Source ( SQLSERVERAGENT ) cannot be found. The local computer may not have the necessary registry information or message DLL files to display messages from a remote computer. You may be able to use the /AUXSOURCE= flag to retrieve this description; see Help and Support for details. The following information is part of the event: A replication agent encountered a fatal error and was shut down. A mini-dump has been generated at the following location:

C:\Program Files\Microsoft SQL Server\90\Shared\ErrorDumps\ReplAgent20120712121316_0.mdmp

This was passed to Microsoft and they told us that it was the oracle driver which was causing the crash and told us that this may be down to the data being sent, which brought us back to the initial theory of solving the dodgy data.

Attempts at fixing replication

Skip Errors – The first thing i attempted was skipping the error, which would have meant missing data, but would have at least caught up replication and i could have investigated the dodgy data in the backgroun without affecting service. Add the SkipErrors flag into the distribution agent and restart

-SkipErrors 1722

Unfortunately this didn’t work, possibly because the agent was crashing prior to receiving the error code back from oracle.

Finding the transaction and the command

Using the exact_seqno variable for start and end sequence, run the following:

sp_browsereplcmds 
    @xact_seqno_start ='0x00025a04000381d400f0' , 
    @xact_seqno_end ='0x00025a04000381d400f0'

This confirmed what we saw in both the sp_helpsubscriptionerrors and the verbose logging, that the EVENT table contained the dodgy data.

The problem we had was how to get this data out of the distribution database.

Please note, that the following information is unsupported and undocumented, and should not be used under normal circumstances

Every time you create replication from SQL Server to Oracle, A MSREPL7 table is created on the Oracle subscriber which keeps track of which transactions have been delivered and last delivery time etc. Ive blogged about this table in the past (https://dbamohsin.wordpress.com/2010/09/23/error-message-when-the-distribution-agent-tries-to-apply-to-the-initial-snapshot-to-an-oracle-11g-subscriber-%E2%80%9Cora-00932-inconsistent-datatypes%E2%80%9D/)

select PUBLCN, DSTAGENT, TIMECOL, XACTTS from atd_uk_tab_replication.MSREPL7
WHERE PUBLCN = 'MYPUBLICATION';
PUBLCN DSTAGENT TIMECOL XACTTS
MYPUBLICATION distrib 17/07/2012 11:10:22 00025A71000B464C014200000000

To fix the issue we manually updated the XACTTS column in the MSREPL7 table to the offenting xact_seqno. This tells replication that it has completed that XACTTS and it will move past it to look for the next sequence in the distribution database.

Before making the change, we confirmed that the MSREPL7 table was showing the XACTTS just prior to our failing sequence by checking msrepl_transactions.

select * from msrepl_transactions
where entry_time >= '2012-07-12 10:10:00'
order by entry_time

We also manually ran all the commands in the offending xact_seqno manually against the subscriber. We masked the dodgy data and replaced it with something else. The idea behind this is that your subscription does not lose any rows and stays completely in sync.

As soon as we then updated the MSREPL7 table, and started replication, transactions started flowing through. We have 2 identical subscribers so we had to follow this method twice.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: