Active Data Guard Heavy GC and Latch Contention

This past week we experienced severe slowdowns on our Active Data Guard database.  I thought I’d document what we saw and the steps we took to remedy the issue.  Both primary and secondary databases are 16-node (two-rack) Exadatas running Oracle version 11.2.0.4.

What We Saw – Symptoms

Users were complaining of slow response times for their read-only queries on Active Data Guard, and even simple operations like logging in and running basic queries against GV$/V$ views were slow.  We took several hanganalyze dumps and saw the following predominant wait events:

  • library cache lock / library cache load lock
  • cursor: pin S wait on X
  • gc cr request

We were also seeing ORA-4031 errors on large pool allocations, and there was frequent memory stealing between the DB cache and the large pool.  The shared pool on the recovery instance had also grown quite large (over 40 GB).
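To get a feel for how frequently the ORA-4031 errors were occurring, grepping the alert logs with a little context works well.  The sketch below runs against a tiny fabricated alert log excerpt; the path and messages are illustrative, not our actual files:

```shell
# Count ORA-04031 occurrences and pull the timestamp lines preceding them
# (sample data stands in for a real alert log).
cat > /tmp/alert_inst10.log <<'EOF'
Wed Sep 28 04:10:02 2016
Errors in file /u01/.../trace/xxx.trc:
ORA-04031: unable to allocate 4160 bytes of shared memory ("large pool",...)
Wed Sep 28 04:12:44 2016
ORA-04031: unable to allocate 4160 bytes of shared memory ("large pool",...)
EOF

grep -c "ORA-04031" /tmp/alert_inst10.log                  # how many
grep -B2 "ORA-04031" /tmp/alert_inst10.log | grep "^Wed"   # when
```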

When I looked at SGA/shared pool allocations, I saw:

  select pool, name, bytes
  from v$sgastat
  order by pool, bytes desc;

POOL        NAME                                  BYTES
----------- -------------------------- ----------------
shared pool gcs dynamic r                13,807,768,000
            gcs dynamic s                 8,961,551,424
            free memory                   7,530,946,112
            ASM extent pointer array      3,191,875,096
            ges enqueues                  2,925,579,056
            gcs resources                 1,118,565,984
            Checkpoint queue                983,056,384
            gcs shadows                     774,391,848
. . .
***********                            ----------------
sum                                      41,204,842,496

Querying GV$RESOURCE_LIMIT, we saw that high numbers of gcs_resources and gcs_shadows had been (and were still being) used on instances 10 and 11 (MRP/recovery had been running on instance 10 and was moved to instance 11 to try to mitigate the issue):

INST                                 CURRENT     MAX         INITIAL    LIMIT
ID    RESOURCE_NAME                  UTILIZATION UTILIZATION ALLOCATION VALUE
----- ------------------------------ ----------- ----------- ---------- ----------
    1 gcs_resources                          800     3594418    9923100 UNLIMITED
    1 gcs_shadows                            800     4341615    9923100 UNLIMITED
    2 gcs_resources                          800     3635583    9923100 UNLIMITED
    2 gcs_shadows                            800     4310007    9923100 UNLIMITED
    3 gcs_resources                          800     3673520    9923100 UNLIMITED
    3 gcs_shadows                            800     4425169    9923100 UNLIMITED
    4 gcs_resources                          800     1154037    9923100 UNLIMITED
    4 gcs_shadows                            800     2249026    9923100 UNLIMITED
    5 gcs_resources                          800     3259731    9923100 UNLIMITED
    5 gcs_shadows                            800     3956792    9923100 UNLIMITED
    6 gcs_resources                          800     3236723    9923100 UNLIMITED
    6 gcs_shadows                            800     3962286    9923100 UNLIMITED
    7 gcs_resources                          800     3251719    9923100 UNLIMITED
    7 gcs_shadows                            800     3944630    9923100 UNLIMITED
    8 gcs_resources                          800      967050    9923100 UNLIMITED
    8 gcs_shadows                            800     1961536    9923100 UNLIMITED
    9 gcs_resources                         1200     3237116    9891090 UNLIMITED
    9 gcs_shadows                           1200     3914301    9891090 UNLIMITED
   10 gcs_resources                         1200    70984040    5377680 UNLIMITED
   10 gcs_shadows                           1200    67610640    5377680 UNLIMITED
   11 gcs_resources                     18004454    18041094    9891090 UNLIMITED
   11 gcs_shadows                       28291646    28651690    9891090 UNLIMITED
   12 gcs_resources                         1200     3208902    9891090 UNLIMITED
   12 gcs_shadows                           1200     4040165    9891090 UNLIMITED
   13 gcs_resources                         1200     3182042    9891090 UNLIMITED
   13 gcs_shadows                           1200     3883302    9891090 UNLIMITED
   14 gcs_resources                         1200     3154007    9891090 UNLIMITED
   14 gcs_shadows                           1200     3899527    9891090 UNLIMITED
   15 gcs_resources                         1200     3146403    9891090 UNLIMITED
   15 gcs_shadows                           1200     3924065    9891090 UNLIMITED
   16 gcs_resources                         1200     2986005    9891090 UNLIMITED
   16 gcs_shadows                           1200     3839600    9891090 UNLIMITED
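For anyone wanting to spot this pattern quickly, a small awk filter over a saved copy of a listing like the one above will flag any instance whose GCS high-water marks blew past its initial allocation.  The file below is a hypothetical excerpt in the same column order (inst_id, resource_name, current, max, initial, limit):

```shell
# Flag rows where the high-water mark (MAX_UTILIZATION, column 4) exceeds
# the initial allocation (column 5).  Sample rows from the listing above.
cat > /tmp/gcs_limits.txt <<'EOF'
9 gcs_resources 1200 3237116 9891090 UNLIMITED
10 gcs_resources 1200 70984040 5377680 UNLIMITED
10 gcs_shadows 1200 67610640 5377680 UNLIMITED
11 gcs_resources 18004454 18041094 9891090 UNLIMITED
EOF

awk '$4+0 > $5+0 { print "inst " $1 " " $2 ": max " $4 " > initial " $5 }' /tmp/gcs_limits.txt
```

On our system, only the recovery instances (10 and 11) get flagged this way, which matches where MRP had been running.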

We also saw the following recurring messages, about 11 seconds apart, in the alert logs on multiple instances:

Wed Sep 28 09:51:27 2016
Global flush: getting parseLock
Global flush: getting parseLock
Wed Sep 28 09:51:38 2016
Global flush: getting parseLock
Global flush: getting parseLock

We are used to seeing the following sequence in our alert.log:

Global flush: getting parseLock
Start global flush...
Complete global flush...

However, the previous night was the last time we saw a “Complete global flush” message.  Then, starting at 04:04 that morning, we got the following “redo cache is full” message:

Wed Sep 28 04:04:12 2016
Recovery: global flush, redo cache is full
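Since a healthy flush cycle logs a matching “Complete global flush” for every “Start global flush”, a quick count of each message is enough to spot flushes that never finish.  This is just a sketch against a made-up alert log excerpt; the path and contents are illustrative:

```shell
# Compare "Start" vs "Complete" global flush messages in an alert log;
# a surplus of starts means flushes are hanging.
cat > /tmp/alert_sample.log <<'EOF'
Global flush: getting parseLock
Start global flush...
Complete global flush...
Global flush: getting parseLock
Start global flush...
Global flush: getting parseLock
EOF

starts=$(grep -c "Start global flush" /tmp/alert_sample.log)
completes=$(grep -c "Complete global flush" /tmp/alert_sample.log)
echo "starts=$starts completes=$completes"
if [ "$starts" -gt "$completes" ]; then
  echo "WARNING: $((starts - completes)) global flush(es) never completed"
fi
```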

To compound the problem, we also had user sessions running on the recovery node.  This caused additional contention as it took longer to serve GC requests.

We were also experiencing a recovery lag.

What We Found Out

In Active Data Guard, the node that MRP runs on (the recovery node) becomes the master for all blocks, so the GCS regions in the shared pool on that one instance become very large.  Every time the shared pool needed to grow, it triggered large flushes (up to 1,000 objects at a time, according to X$KSMLRU).  This accounted for our ORA-4031 errors on large pool allocations, as well as the severe library cache and shared pool contention.

What We Did to Address the Issue

The first thing we did was to migrate all users/services off of nodes 10 and 11.  We decided that until we got more definitive answers from Oracle Support, we would dedicate nodes 10 and 11 to recovery.  By keeping users off of both nodes, we could then “toggle” MRP back and forth between the two nodes and tweak our configurations without impact to users.

Based on our high-water marks for GCS allocations, Oracle Support recommended that we increase the resource limits for those objects and also increase the shared pool and SGA sizes on the two recovery nodes (nodes 10 and 11).  Here are the values we used:

 _gcs_resources=80000000
 _gcs_shadow_locks=80000000
 shared_pool_size=48G
 db_cache_size=24G
 sga_target=84G
 sga_max_size=90G

We made the changes and then restarted both instances and MRP/recovery.

Results

After those changes, things stabilized and MRP caught up.  Users reported no further problems with their queries.

Additional Info

There is a patch for bug 17597704 that allows blocks to be “multi-mastered” (mastered on all nodes in the cluster, instead of just on the recovery node).  We opted not to apply and enable this fix as we felt that it was potentially trading one problem for another.  In other words, instead of going to the recovery nodes for the block masters, queries running on one node would potentially have to visit any of the other 15 nodes.  Since performance was fine with mastering on the recovery node, we left this one alone.

In addition, in 11.2, these GCS-related allocations are permanent objects and not freeable.  There is a patch (21863727) in 12.1.0.2 that makes these objects freeable, but unfortunately there is no remedy for 11.2 at this time.  This explains why flushes of the shared pool did not help.


SecureFile LOB Performance Issue

I wanted to document an issue that we experienced in one of our databases after upgrading to Oracle 12c (12.1.0.2).  The issue was that, occasionally, an update to one of our tables would take much longer than normal to execute.  Unfortunately, this issue did not get caught in testing and made its way into production.  It didn’t happen on every execution of the update, and for this reason it was difficult to track down.

Here’s how I approached the troubleshooting of this issue and the bug that we identified.

Since this wasn’t happening every time but we knew which SQL statement was involved, I elected to do a SQL_ID trace:

alter system set events 'sql_trace [sql:sql_id=89hm7xzttp9c9] plan_stat=all_executions,wait=true,bind=false';

After setting the trace, I monitored the trace directory waiting to find the long running UPDATEs.  To do this, I grep’d the trace files looking for EXEC operations on that SQL_ID with long elapsed time (“e=”) values.  The longest one was just over 89 seconds:

find . -type f -mtime -1 -name "*.trc" -exec grep -l "89hm7xzttp9c9" {} \; | xargs egrep "EXEC" | cut -d"," -f2 | cut -d"=" -f2 | sort -n
6134275
6406358
53427796
78365426
89383890
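The numbers above are the elapsed (“e=”) values in microseconds.  Here is a small sketch of the same field extraction plus a conversion to seconds, run against one illustrative EXEC line in the trace format:

```shell
# Extract the e= (elapsed, in microseconds) field from an EXEC line and
# convert it to seconds.
line='EXEC #140344624162424:c=3000,e=89383890,p=1,cr=4,cu=23,mis=0,r=1,dep=0,og=1,plh=2409839646,tim=2967436896563'
usec=$(echo "$line" | cut -d"," -f2 | cut -d"=" -f2)
echo "$usec" | awk '{ printf "%.2f seconds\n", $1 / 1000000 }'   # 89.38 seconds
```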

Once I got several trace files, I turned the SQL_ID trace off:

alter system set events 'sql_trace [sql:sql_id=89hm7xzttp9c9] off';

I then pulled the trace files that had the longest EXEC times.  Here is the pattern I saw in the various trace files:

 PARSING IN CURSOR . . .
 UPDATE  x SET . . .
 . . .
 RETURNING  INTO :dl0
 END OF STMT
 BINDS #140344624162424:
. . .
 WAIT #140344624162424: nam='db file sequential read' ela= 634 file#=107 block#=3220853 blocks=1 obj#=161211 tim=2967436896291
 WAIT #140344624162424: nam='SQL*Net message to client' ela= 2 driver id=1650815232 #bytes=1 p3=0 obj#=87726 tim=2967436896532
 EXEC #140344624162424:c=3000,e=3721,p=1,cr=4,cu=23,mis=0,r=1,dep=0,og=1,plh=2409839646,tim=2967436896563
 WAIT #140344624162424: nam='SQL*Net message from client' ela= 88 driver id=1650815232 #bytes=1 p3=0 obj#=87726 tim=2967436896739
 LOBPGSIZE: type=PERSISTENT LOB,bytes=8060,c=0,e=24,p=0,cr=0,cu=0,tim=2967436896797
 WAIT #0: nam='SQL*Net message to client' ela= 1 driver id=1650815232 #bytes=1 p3=0 obj#=87726 tim=2967436896823
 WAIT #0: nam='SQL*Net message from client' ela= 78 driver id=1650815232 #bytes=1 p3=0 obj#=87726 tim=2967436896918
 WAIT #140344624822120: nam='SQL*Net more data from client' ela= 13 driver id=1650815232 #bytes=222 p3=0 obj#=87726 tim=2967436897091
 WAIT #140344624822120: nam='db file sequential read' ela= 350 file#=124 block#=212669 blocks=1 obj#=87726 tim=2967436897811
 WAIT #140344624822120: nam='db file sequential read' ela= 329 file#=124 block#=212670 blocks=1 obj#=87726 tim=2967436898239
 WAIT #140344624822120: nam='db file sequential read' ela= 336 file#=124 block#=212671 blocks=1 obj#=87726 tim=2967436898675

. . .  < 196000 WAITS !!! > . . .

 WAIT #140344624822120: nam='gc current grant 2-way' ela= 238 p1=15 p2=22656 p3=33554444 obj#=87726 tim=2967520128096
 WAIT #140344624822120: nam='gc current grant 2-way' ela= 208 p1=15 p2=22657 p3=33554433 obj#=87726 tim=2967520129038
 WAIT #140344624822120: nam='gc current grant 2-way' ela= 189 p1=15 p2=22658 p3=33554433 obj#=87726 tim=2967520129359
 LOBWRITE: type=PERSISTENT LOB,bytes=6468,c=33450915,e=83232687,p=133280,cr=419937,cu=274972,tim=2967520129628
 WAIT #0: nam='SQL*Net message to client' ela= 2 driver id=1650815232 #bytes=1 p3=0 obj#=87726 tim=2967520129713
 WAIT #0: nam='SQL*Net message from client' ela= 608 driver id=1650815232 #bytes=1 p3=0 obj#=87726 tim=2967520130349

Between the LOBPGSIZE operation and the LOBWRITE operation, there were over 196,000 wait events, most of them “db file sequential read” and “gc cr disk read”.  The LOBWRITE operation took about 83 seconds.
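An easy way to count the waits sandwiched between those two operations is an awk range pattern.  The sketch below runs against a tiny synthetic trace that stands in for the real file:

```shell
# Count WAIT lines between the LOBPGSIZE and LOBWRITE markers
# (synthetic trace excerpt in the same format).
cat > /tmp/trace_sample.trc <<'EOF'
LOBPGSIZE: type=PERSISTENT LOB,bytes=8060
WAIT #1: nam='db file sequential read' ela= 350
WAIT #1: nam='gc cr disk read' ela= 200
WAIT #1: nam='db file sequential read' ela= 329
LOBWRITE: type=PERSISTENT LOB,bytes=6468
EOF

awk '/LOBPGSIZE/,/LOBWRITE/' /tmp/trace_sample.trc | grep -c "^WAIT"
```

Against the real trace file, the same one-liner is what produced the 196,000+ figure.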

I wanted to see if there were any particular blocks that were being visited as part of this operation so I grep’d and counted the various occurrences of the “sequential read” and “gc cr disk read” operations.   What I noticed is that the same blocks were being visited over and over again – some of them as many as 681 times:

$ grep "sequential read" pdb2_ora_76294.trc|cut -d" " -f9,10,12|sort|uniq -c|sort -rn|more
    681 file#=28 block#=1793355 obj#=96666
    681 file#=28 block#=1793354 obj#=96666
    681 file#=28 block#=1793353 obj#=96666
. . .
    680 file#=28 block#=1620571 obj#=96660
    680 file#=28 block#=1620570 obj#=96660
    680 file#=28 block#=1620569 obj#=96660
. . .
    320 file#=28 block#=3592814 obj#=96643
    320 file#=28 block#=3592813 obj#=96643
    320 file#=28 block#=3592812 obj#=96643
. . .

$ grep "gc cr disk read" pdb2_ora_76294.trc|cut -d" " -f9,10,12|sort|uniq -c|sort -rn|more
    681 p1=28 p2=1793355 obj#=96666
    681 p1=28 p2=1793354 obj#=96666
    681 p1=28 p2=1793353 obj#=96666
. . .
    680 p1=28 p2=1620570 obj#=96660
    680 p1=28 p2=1620569 obj#=96660
    680 p1=28 p2=1620568 obj#=96660
. . .
    320 p1=28 p2=3592814 obj#=96643
    320 p1=28 p2=3592813 obj#=96643
    320 p1=28 p2=3592812 obj#=96643
. . .
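The block-visit counting above boils down to a grep -o / sort / uniq -c pipeline.  Here it is in self-contained form against a synthetic three-line trace excerpt (block numbers borrowed from the listing for familiarity):

```shell
# Group identical (file#, block#) reads and rank by visit count.
cat > /tmp/reads.trc <<'EOF'
WAIT #1: nam='db file sequential read' ela= 10 file#=28 block#=1793355 blocks=1 obj#=96666 tim=1
WAIT #1: nam='db file sequential read' ela= 11 file#=28 block#=1793355 blocks=1 obj#=96666 tim=2
WAIT #1: nam='db file sequential read' ela= 12 file#=28 block#=1793354 blocks=1 obj#=96666 tim=3
EOF

grep "sequential read" /tmp/reads.trc \
  | grep -o "file#=[0-9]* block#=[0-9]*" \
  | sort | uniq -c | sort -rn
```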

Here is what a “normal” execution of this UPDATE looked like.  The normal LOBWRITE took only 680 microseconds:

 PARSING IN CURSOR . . .
 UPDATE   set 
 RETURNING  INTO :dl0
 END OF STMT
 BINDS #140344626255832:
. . .
 WAIT #140344626255832: nam='SQL*Net message to client' ela= 1 driver id=1650815232 #bytes=1 p3=0 obj#=-1 tim=2967434169512
 EXEC #140344626255832:c=3000,e=4262,p=0,cr=5,cu=17,mis=0,r=1,dep=0,og=1,plh=2409839646,tim=2967434169557
 WAIT #140344626255832: nam='SQL*Net message from client' ela= 76 driver id=1650815232 #bytes=1 p3=0 obj#=-1 tim=2967434169722
 LOBPGSIZE: type=PERSISTENT LOB,bytes=8060,c=0,e=25,p=0,cr=0,cu=0,tim=2967434169813
 WAIT #0: nam='SQL*Net message to client' ela= 1 driver id=1650815232 #bytes=1 p3=0 obj#=-1 tim=2967434169845
 WAIT #0: nam='SQL*Net message from client' ela= 76 driver id=1650815232 #bytes=1 p3=0 obj#=-1 tim=2967434169938
 WAIT #140344635454240: nam='SQL*Net more data from client' ela= 12 driver id=1650815232 #bytes=222 p3=0 obj#=-1 tim=2967434170178
 LOBWRITE: type=PERSISTENT LOB,bytes=5358,c=0,e=680,p=0,cr=2,cu=8,tim=2967434170670
 WAIT #0: nam='SQL*Net message to client' ela= 1 driver id=1650815232 #bytes=1 p3=0 obj#=-1 tim=2967434170717
 WAIT #0: nam='SQL*Net message from client' ela= 254 driver id=1650815232 #bytes=1 p3=0 obj#=-1 tim=2967434170990

Believing this to be a bug, we opened an SR and found that we were hitting bug 22905136, “SECUREFILE LOB HIGH SINGLE BLOCK PHYSICAL READ”.  The note for this bug is MOS Doc ID 2189248.1.

Applying the patch for this bug fixed the issue.

So if your application makes use of SecureFile LOBs and you are upgrading to 12.1, check out this bug and make sure the patch is applied before going into test and production.

“Result Cache: Channel” Reliable Message Waits

In an earlier post, I wrote about “reliable message” waits in Oracle and how to find out more about the channel being waited on.  One of the waits we’ve been seeing with some frequency is on “Result Cache: Channel”.  In some cases, these waits have been debilitating, and I believe the topic merits its own post in case someone else runs into this issue.

Here is an example of one issue we experienced in 11.2.0.3.  Users were reporting a slow system and in looking at session waits, we saw long wait times for the “reliable message” event. In querying GV$CHANNEL_WAITS, we saw the following waits on instance 3 of a RAC database:

select inst_id, channel, messages_published, wait_count, WAIT_TIME_USEC/1000000 wait_time_sec
from GV$CHANNEL_WAITS
order by inst_id, wait_time_sec desc;
 INST
   ID CHANNEL                                                MESSAGES_PUBLISHED WAIT_COUNT WAIT_TIME_SEC
----- ------------------------------------------------------ ------------------ ---------- -------------
. . .
    3 Result Cache: Channel                                           474745796  473767959    267825.106
    3 obj broadcast channel                                             4928144    2643616      5065.185
    3 kxfp control signal channel                                       3349090    1140344       484.500
    3 MMON remote action broadcast channel                               401055       6999       266.940
    3 RBR channel                                                        320568      37809       137.223
. . .

Continue reading

Diagnosing Oracle “reliable message” Waits

“Reliable message” waits are cryptic by nature.  This is a general-purpose wait event that tracks many different types of channel communications within the Oracle database.  I’ve read some blogs suggesting that it is a benign wait event that can be ignored.  My experience is that these waits are not benign and should not be ignored.  This post will show you how to decipher them and resolve the underlying issue.

Here is what you might see in an AWR report:

Continue reading

Using Wireshark to Diagnose a Connection Drop Issue in Oracle

As a long-time performance DBA, I’ve often felt that it is important to know something about troubleshooting the layers that are upstream and downstream of the database in the technology stack.  Lately, I’ve been making use of packet captures and Wireshark to solve tough issues in the TCP layer.  We recently resolved a long-standing issue with TCP retransmissions that were causing connection drops between an application server and one of our databases and I thought this might help others faced with similar issues.

This problem started with a series of TNS-12535 messages that were seen in the Oracle alert logs for one of our databases:

Continue reading

JavaOne/Oracle OpenWorld Highlights-Part Two

In my previous post, I covered the first two days of JavaOne and Oracle OpenWorld.  I’ll cover the rest of the week’s highlights in this post.

Blockchain and the Internet of Things (IoT)

This presentation was interesting in that it talked about different ways that blockchain and alt-chain technologies (like “pegged sidechains”) might be used in the Internet of Things.  It also introduced one company (Blockstream) that is using pegged sidechains to extend Bitcoin and blockchains for micropayments, smart contracts and property registries.  The presentation also introduced the concept of “Sensing as a Service”, an emerging business model for the Internet of Things.

Continue reading

Pumping Up Performance with Linux Hugepages – Part 2


In my previous article on hugepages, I discussed what hugepages are and talked about the page table, the Translation Lookaside Buffer (TLB) and TLB misses, page walks and page faults.  I also discussed how using hugepages reduces the amount of memory used and also reduces the number of CPU cycles needed to do the logical-to-physical memory mapping.

In this post, I’d like to talk about how to use Hugepages with the Oracle database and with JVMs. I’ll also talk about Transparent Hugepages (THP) and why you should turn off this new Linux “feature”.

Continue reading

Lies, Damned Lies and Metrics

(Apologies to Mark Twain…)

I’ve long subscribed to the principle of “Follow the Data” when it comes to troubleshooting performance. However, sometimes the data can be misleading (“lie” is an awful strong word) and sometimes the metrics you need just aren’t there. I was reminded of that this week while looking into a production performance issue with one of our critical applications.

The issue was presenting itself as an I/O problem in the database layer. Oracle wait event metrics from ASH (Active Session History) were indicating that I/O operations were taking longer than normal. Normally when we see this, we gather data about the I/O subsystem using utilities like iostat. Since this was on an Exadata, we also used the cellcli utility to report on storage cell information. However – this time – neither of these utilities was showing long I/O waits corresponding to our issue.

Continue reading

Red Flags – How a Data Quality Issue Becomes a Performance Issue

When I’m reviewing the query performance of our applications, one of the SQL constructs that raises a red flag for me is the use of the LOWER and UPPER functions in the WHERE clause (as well as TRIM/LTRIM/RTRIM).

The reason these functions trigger a red flag is because they are typically used as a workaround to a data quality problem and this workaround usually causes downstream performance impacts. Developers use these functions when they don’t trust the format of the data in the column or when they don’t trust the format of the value that is being compared.

How does this create a downstream performance problem?  Wrapping an indexed column in a function prevents the use of a regular index on that column and often causes a full table scan to occur.  This can take a query that would normally execute in less than a millisecond and make it take much longer, depending on how big the table is that must be full-scanned.

Here is an example from a recently reviewed application:

Continue reading