Efficiently retrieving Oracle CLOBS in high latency environments

The Problem

We have recently come across an interesting scenario whereby a java application living within a container in Kubernetes in GCP needs to talk back to an Oracle 12.1 database in an on premise DC.

There are several things that are a given in these situations; that the latency in the chatter between a cloud environment and an on premise environment will be Higher than two services talking in the same DC, and depending on the way you look at it, creating a high latency situation will highlight poor code/queries or inefficiencies that may well have been hidden for years on premise due to the sub 2ms latency that developers and ops were lucky enough to have…

So, althougth there is short term pain, there is definately a lot to gain in spending the resource effort in creating efficiencies in the app to database conversations.


So how do you start investigating an application developer complaining that his database queries are taking 70-80 seconds since moving the application to the cloud, queries that used to take 2-3 seconds? Mainly by breaking down the problem and finding little nuggets (or red herrings!) to concentrate on.

We are lucky to have a multi disciplined squad/team so we are able to quickly troubleshoot from various angles including database, networking and our kubernetes architecture.

What metrics did we have?

Although we like to trust our developers, we also like to have proven data or logs to back up slowness claims. Our developer provided us with Kibana Logs and Graphana dashboards which showed that in some cases a particular heavy query was taking a significant amount of time (greather than 30 seconds up to 80 seconds). The application had a timeout of 30 seconds so in these cases, it was user impacting.

We also had the query which was causing the problem, and from the Kibana logs we could see that the issue was only happening on big data sets – or was more exponential on big datasets.

Creating a safe testing environment

We knew we had a problem with the query in production, but I didnt want to affect Prod performance even more so we backported a copy of the poorly performing dataset into our QA database environment.

As an example, if we continued to use prod to test the Select query, we may be inadvertantly flushing out hot data from the buffer pool and replacing it with out testing data.

Creating a repeatable test and finding patterns

Generally when I am investigating a problem, my approach is normally scientific and make sure I am doing both a Fair test and a Controlled experiment. A fair test is important as when changing variables you should only change one thing at a time, and controlling a test is important so that you have one test that follows the normal expected behaviour.

The query itself always returned 3727 rows with around 35MB of data from the database.

I embedded the query into a script and set some sqlplus options:

spool qadb21-local-5000.log
set ARRAYSIZE 5000
set TIMING on
set TERMOUT off
SELECT /*+ ALL_ROWS() */ V.ID as V1,A.*, V.*,M.*
WHERE A.ID = '2dfdda64-e15c-4cdd-9d56-2db1d013c6a0'
spool off

AUTOTRACE TRACEONLY – I didnt want to write any results to the spool – only the trace. Reason for this is that its possible the writing of the result set to a file would add seconds to the query completion.

ARRAYSIZE – To test fetch size in sqlplus – I tested multiple array sizes of 50, 125, 250, 1000, 2500 & 5000

ALL_ROWS Hint – Just to force the same optimizer results behaviour.

I tested 4 scenarios in total:

  • Running the query directly on the Oracle DB [<1 second runtime]
  • Running the query from my laptop to the Oracle DB [25-30 seconds runtime]
  • Running the query froma container in our GCP testing project to the Oracle DB [1:50 – 2:00 minutes runtime]
  • Running the query from a container in our GCP PreProd project to the oracle DB [1:50 – 2:00 minutes runtime]

By testing 2 different GCP Projects against our database, we can see if the problem is environment specific or a general error. The query was returning in similar durations both in our testing project and our preprod project so we deduced that it was something environment independent – but continued to doublecheck our networking config.

Below is an extract of the kind of output I was using to make sure the test stayed fair. I only took results if the recursive calls were 0 – i.e no hard parses.

0 recursive calls
0 db block gets
32406 consistent gets
28457 physical reads
0 redo size
3065681 bytes sent via SQL*Net to client
522366 bytes received via SQL*Net from client
6008 SQL*Net roundtrips to/from client
1 sorts (memory)
0 sorts (disk)
3727 rows processed

As the example output above shows, we were seeing over 6000 rountrips on 3727 rows of data. This rang some alarm bells as in a high latency environment, every rountrip is exponential so this was a key area of focus.

SQLPlus has a default fetchsize of 15 so evern at its worst, we should be seeing no more than approx. 500 roundtrips (rows processed / fetchsize * 2).

I setup a sql trace against my connection and this also showed a high level of chatter between the database and client in all test cases. (https://oracle-base.com/articles/misc/sql-trace-10046-trcsess-and-tkprof)


FETCH #139847718960616:c=0,e=4,p=0,cr=0,cu=0,mis=0,r=0,dep=4,og=4,plh=2542797530,tim=15317741724313
CLOSE #139847718960616:c=0,e=1,dep=4,type=3,tim=15317741724339
EXEC #139847718275456:c=0,e=22,p=0,cr=0,cu=0,mis=0,r=0,dep=4,og=4,plh=3765558045,tim=15317741724381
FETCH #139847718275456:c=0,e=19,p=0,cr=3,cu=0,mis=0,r=1,dep=4,og=4,plh=3765558045,tim=15317741724409
FETCH #139847718275456:c=0,e=3,p=0,cr=0,cu=0,mis=0,r=1,dep=4,og=4,plh=3765558045,tim=15317741724424
FETCH #139847718275456:c=0,e=3,p=0,cr=0,cu=0,mis=0,r=0,dep=4,og=4,plh=3765558045,tim=15317741724435
CLOSE #139847718275456:c=0,e=0,dep=4,type=3,tim=15317741724446


Our network engineers testing various parts of our cloud to on prem architecture including transfer speeds between different environments, How many hops it was taking to get to and from the database, and packet loss.

We found that the network route the query was taking was doing one extra hop, so we did some remediation work to correct this – but we didnt see much impact on query times [reduction of less than 1-2 seconds].

We also did a tcpdump of the packets between my laptop and the database server, to see what was actually happening when the query ran…

sudo tcpdump -i en0 host -w query.pcap
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on en0, link-type EN10MB (Ethernet), capture size 262144 bytes
08:58:05.142087 IP > ICMP echo request, id 35751, seq 0, length 64
08:58:05.146186 IP > ICMP echo reply, id 35751, seq 0, length 64
2 packets captured
804 packets received by filter
0 packets dropped by kernel

We ran the pcap file thorugh wireshark and could see straight away that the query was constantly going back and forth to the database – thousands of times in a 2 minute period. This matched up pretty well to what we found with the database statistics showing a high level of roundtrips, and also the SQL Trace output.

Digging Deeper into the fetch size

I broke up the query and looked into all the data type of the result set and found that there was a CLOB column which was storing json data. After investigating I found that sqlplus does individual row retrieval for queries with CLOBs. SQLPlus has been developed so that it doesnt use a feth array when retriving CLOBS (https://asktom.oracle.com/pls/apex/asktom.search?tag=performance-issue-with-clob).

To test that the CLOBs were the culprit, I converted the CLOB data to varchar2 and reran the query and found that the data was returned under 1 seconds with 2 roundtrips.

UPDATE M SET METRICS_DATA_VAR = dbms_lob.substr( METRICS_DATA, 4000, 1 );

How does the Java Driver CLOB retrieval work?

It seems it works in a similar fashion to the sqlplus retrieval. Our CLOBs ranged between 4200 and 5000 characters. The default fetch size for CLOBS is 4000 characters so potentially you could have 2 fetches for 1 CLOB depending on the size.

As of Oracle 11.2g JDBC Driver you can use a prefetch.

if (statement1 instanceof OracleStatement) {
    ((OracleStatement) statement1).setLobPrefetchSize(50000);

or can be done globally for the application;

System.setProperty(OracleConnection.CONNECTION_PROPERTY_DEFAULT_LOB_PREFETCH_SIZE, "50000");


Setting the CONNECTION_PROPERTY_DEFAULT_LOB_PREFETCH_SIZE to 50000 has made a huge difference to our application performance. Why 50000? I worked it out based on being able to fit 10 rows per fetch but not overloading the application java heap.

We have seen queries that were taking 70 seconds on big datasets now taking less that 3 seconds. Round trip latency is the major factor at play here and by reducing the number of fetch trips the query is making to retrieve the data, it significantly boosts performance.

ADRCI – Automatic Diagnostic Repository Command Interpreter

Notes on using ADRCI to package incidents for Oracle support.

From OS command line export Oracle SID and run adrci to launch the command line tool.



Some useful commands once in the tool

show home
show incident
show problem
ips pack problem <problem_number>

Some example output:

ADRCI: Release – Production on Fri Dec 4 11:46:46 2015

Copyright (c) 1982, 2011, Oracle and/or its affiliates.  All rights reserved.

ADR base = "/u01/app/oracle"

adrci> show home

ADR Homes:

adrci> show incident

ADR Home = /u01/app/oracle/diag/rdbms/mydb/INST22:

INCIDENT_ID          PROBLEM_KEY                                                                            CREATE_TIME
——————– ———————————————————–                            —————————————-

327230               ORA 7445 [qmcxdAddNmspcAttrFromPfxId()+216]                 2015-12-03 09:33:35.465000 +00:00
330918               ORA 7445 [qmcxdAddNmspcAttrFromPfxId()+216]                 2015-12-04 10:38:12.696000 +00:00

2 rows fetched

adrci> show problem

ADR Home = /u01/app/oracle/diag/rdbms/mydb/INST22:

PROBLEM_ID           PROBLEM_KEY                                                 LAST_INCIDENT        LASTINC_TIME
——————– ———————————————————– ——————– —————————————-

1                    ORA 7445 [kolarsAssign()+1408]                              35914                2013-07-03 02:21:55.866000 +01:00

1 rows fetched

adrci> ips pack problem 1

Generated package 3 in file /u01/app/oracle/diag/rdbms/mydb/INST22/trace/ORA7445qm_20151204120758_COM_1.zip, mode complete


Full Credit to a colleague who wrote this up!

ORA 600s after applying PSU7 for

We were seeing the following two ORA 600s after applying PSU7. Technically the issue wasnt related to PSU 7 but more the switchover involved when patching…

SR 3-8646935731 : ORA 600 [ktbdchk1: bad dscn]
SR 3-8664869401 : ORA 600-[FILE:jsks.c LINE:2388 FUNCTION:jsksStartOCICall() ID:OCIKCallPus]


The solution was to set the following dynamic parameter on both the primary and standby database:

ALTER SYSTEM SET "_ktb_debug_flags"=8 SCOPE=BOTH;

This parameter is designed to heal blocks having invalid dependent scn’s on switchover operations.

From the traces provided to oracle, I was told that the affected object for this issue was ID 12152331.

Running the following SQL determines the object:

select owner,object_name,object_type,subobject_name,object_id,data_object_id
  from   dba_objects
  where  object_id in (12152331)
  or     data_object_id in (12152331);

The affected blocks were on index SYS.I_SCHEDULER_JOB_RUN_DETAILS on table SCHEDULER$_JOB_RUN_DETAILS.

As well as the dynamic parameter, I also did the following:



Retrieve Oracle SPID from unix OS PID

The Unix/Linux PID that you see in ‘top’ is visible as SPID in V$PROCESS. Then, you’d have to join V$PROCESS to V$SESSION to get the SID. (Note : PID in V$PROCESS is the Oracle PID, not the OS PID).

--Replace bind variable with OS PID List
select s.sid, s.serial#, s.username,
       to_char(s.logon_time,'DD-MON HH24:MI:SS') logon_time,
       p.pid oraclepid, p.spid "ServerPID", s.process "ClientPID",
       s.program clientprogram, s.module, s.machine, s.osuser,
       s.status, s.last_call_et
from  gv$session s, gv$process p
where p.spid=nvl('&unix_process',' ')
and s.paddr=p.addr
order by s.sidÔ…

Unregister Mview Snapshot Cleanly

Useful SQL to dynamically create unregister mview script for all mviews for a certain site or schema or both!

--Dynamic SQL - Unregister Snapshot info for a particular schema and database
'execute dbms_snapshot.unregister_snapshot(''' || OWNER || ''',''' || NAME || ''',''' || SNAPSHOT_SITE || ''');' 
from dba_registered_snapshots rs

This links to my previous posts on this topic:

Find Site holding up mlog & Register Orphaned Mviews

Investigation SQL for Mview and Mlog growth

Also see metalink note “How to Manually Register/Unregister Snapshots” [ID 67371.1]

Investigating high number of sessions/scraping

Some queries to help investigate scraping or max session errors on the database:

--Total Session counts on database per schema
select username, Service_name, count(*) from gv$session
group by username, Service_name
HAVING count(*) > 100
order by 3 desc;ì*
--Total Session counts split by specific instance 
-- Use if only certain instances recieving max session error
select username, inst_id, Service_name, count(*) from gv$session
where inst_id in (1,2)
group by username, inst_id, Service_name
HAVING count(*) > 100
order by 4 desc;

Out machine names have Data Centre references so we are able to further split:

--Split Sessions by Data Centre
    WHEN '%dc1%' THEN 'DC1'
    ELSE 'DC2'
, count(*)
FROM gv$session
where SERVICE_NAME = 'MYSERVICE_1' and username = 'MYSCHEMA_2'
    WHEN '%dc1%' THEN 'DC1'
    ELSE 'DC2'

Trace Session for Schema on Login

The following trigger starts a session trace whenever a session for a particular schema logs into the database:

execute immediate 'alter session set events ''10046 trace name context forever, level 12''';

Level definitions:

For methods that require tracing levels, the following are valid values:

  • 0 – No trace. Like switching sql_trace off.

  • 2 – The equivalent of regular sql_trace.

  • 4 – The same as 2, but with the addition of bind variable values.

  • 8 – The same as 2, but with the addition of wait events.

  • 12 – The same as 2, but with both bind variable values and wait events.

  • To turn off session level traces that are running:

    ALTER SESSION SET EVENTS '10046 trace name context off';

EXCEPT & INTERSECT in SQL Server – Same as MINUS & UNION in Oracle

Need to run a query to find the distinct different rows in an oracle table compared to a SQL server table. These tables are comparable because the source SQL Server table is replicated to the Oracle Schema.

Note: This example shows the use of Except in a more complex scenario but it can be used in its simplest form to return results from 2 SQL server tables (with the proviso that they are comparable tables)


This SQL requires a link server to be in place linking to the oracle schema with select permissions.

SELECT Person_ID, Misc_Data_ID
FROM OPENQUERY(ORA_TEMP, 'select * from MYSchema.misc_data')
SELECT Person_ID, Misc_Data_ID 
FROM misc_data
  WHERE (Person_ID IN 
    (SELECT Person_ID FROM Person 
    WHERE (Customer_ID IN 
        (SELECT Customer_ID FROM Customer WHERE Business_Subset_ID = 123456))))

In the above SQL, EXCEPT finds distinct rows which are in the oracle schema which are NOT in the SQL table. To make sure that we are returning exact rows, i use the primary key on the oracle table for the except comparison.

INTERSECT works in the same way, but finds all common rows between the 2 tables.

SELECT Person_ID, Misc_Data_ID
FROM OPENQUERY(ORA_TEMP, 'select * from MYSchema.misc_data')
SELECT Person_ID, Misc_Data_ID 
FROM misc_data
  WHERE (Person_ID IN 
    (SELECT Person_ID FROM Person 
    WHERE (Customer_ID IN 
        (SELECT Customer_ID FROM Customer WHERE Business_Subset_ID = 123456))))

EXCEPT is doing the same functionality as the MINUS operation in Oracle, whereas INTERSECT is the same as the UNION operation.

ORA-00304: requested INSTANCE_NUMBER is busy

Problem: Creating a copy of 2 databases using netapp flexclones on a sandpit cluster. Both databases were to be mounted on the same node as a colleague was using node 2 for a different database clone. One of the databases came up fine, but the other always gave the following error:

ORA-00304: requested INSTANCE_NUMBER is busy

The Alert log showed the following error:

USER (ospid: 1110): terminating the instance due to error 304
Instance terminated by USER, pid = 1110

Error 304 is:

00304, 00000, "requested INSTANCE_NUMBER is busy"

// *Cause: An instance tried to start by using a value of the

// initialization parameter INSTANCE_NUMBER that is already in use.

// *Action: Either

// a) specify another INSTANCE_NUMBER,

// b) shut down the running instance with this number

// c) wait for instance recovery to complete on the instance with this number.

Tried several different things including checking shared memory /dev/shm for any memory blocks relating to the instance i was trying to start (which have memory attached), but no memory for that instance.

Also tried rebooting the machine which didn’t help so started looking into the startup parameters.

Looked into this blog article but this wasn’t the solution for me: http://arup.blogspot.co.uk/2009/12/instancenumber-is-busy-message-during.html


As we were cloning and starting up several databases on Node 1, and a colleague was working on a separate database on Node 2. There was a point when his database on Node 2 was mounted as instance number 2 on the cluster, and my database was trying to be the same instance number on Node 1.

As CRS was still the underlying architecture holding the nodes together, the fact that we were mounting databases which to CRS were on the same cluster, it was causing an error.

Make sure to check all databases on the cluster when working in sandpit scenarios and in cluster aware situations! Once we shut down the database on Node 2, we were able to mount without issues.

CRS Commands

Some basic CRS Commands for 11.1

--start and stop
sudo su

cd /etc/init.d/
[HOST]host:/etc/init.d : ./init.crs start
Startup will be queued to init within 30 seconds.

cd /etc/init.d/
[HOST]host:/etc/init.d : ./init.crs stop

To view the current status of CRS

crs_stat -t

Full documentation list for CRS 11.1: