October | 2016 | The Performance Chronicles

This past week we experienced severe slowdowns on our Active Data Guard database. I thought I’d document what we saw and the steps we took to remedy the issue. Both primary and secondary databases are 16-node (two-rack) Exadatas running Oracle version 11.2.0.4.

What We Saw – Symptoms

Users were complaining of slow response times for their read-only queries on Active Data Guard and even simple things like logging in and simple queries against GV$/V$ views was slow. We took several hanganalyze dumps and saw the following predominant wait events:

library cache lock / library cache load lock
cursor: pin S wait on X
gc cr request

We were also seeing ORA-4031 errors on large pool allocations and we had frequent memory stealing going on between DB cache and large pool. And shared pool on the recovery instance had grown quite large (over 40 GB).

When I looked at SGA/shared pool allocations, I saw:

  1  select pool, name, bytes
  2  from v$sgastat
  3* order by pool, bytes desc

POOL        NAME                                  BYTES                                                                                              
----------- -------------------------- ----------------                                                                                               
shared pool gcs dynamic r                13,807,768,000                                                                                              
            gcs dynamic s                 8,961,551,424                                                                                               
            free memory                   7,530,946,112                                                                                              
            ASM extent pointer array      3,191,875,096                                                                                              
            ges enqueues                  2,925,579,056                                                                                               
            gcs resources                 1,118,565,984                                                                                              
            Checkpoint queue                983,056,384                                                                                               
            gcs shadows                     774,391,848                                                                                              
. . .
***********                            ----------------                                                                                              
sum                                      41,204,842,496

Querying GV$RESOURCE_LIMIT, we saw high numbers of “gcs resources” and “gc shadows” had been/were being used on instances 10 and 11 (MRP/recovery had been on instance 10 and was moved to instance 11 to try to mitigate the issue):

INST                                 CURRENT     MAX         INITIAL    LIMIT
ID    RESOURCE_NAME                  UTILIZATION UTILIZATION ALLOCATION VALUE
----- ------------------------------ ----------- ----------- ---------- ----------
    1 gcs_resources                          800     3594418    9923100 UNLIMITED
    1 gcs_shadows                            800     4341615    9923100 UNLIMITED
    2 gcs_resources                          800     3635583    9923100 UNLIMITED
    2 gcs_shadows                            800     4310007    9923100 UNLIMITED
    3 gcs_resources                          800     3673520    9923100 UNLIMITED
    3 gcs_shadows                            800     4425169    9923100 UNLIMITED
    4 gcs_resources                          800     1154037    9923100 UNLIMITED
    4 gcs_shadows                            800     2249026    9923100 UNLIMITED
    5 gcs_resources                          800     3259731    9923100 UNLIMITED
    5 gcs_shadows                            800     3956792    9923100 UNLIMITED
    6 gcs_resources                          800     3236723    9923100 UNLIMITED
    6 gcs_shadows                            800     3962286    9923100 UNLIMITED
    7 gcs_resources                          800     3251719    9923100 UNLIMITED
    7 gcs_shadows                            800     3944630    9923100 UNLIMITED
    8 gcs_resources                          800      967050    9923100 UNLIMITED
    8 gcs_shadows                            800     1961536    9923100 UNLIMITED
    9 gcs_resources                         1200     3237116    9891090 UNLIMITED
    9 gcs_shadows                           1200     3914301    9891090 UNLIMITED
   10 gcs_resources                         1200    70984040    5377680 UNLIMITED
   10 gcs_shadows                           1200    67610640    5377680 UNLIMITED
   11 gcs_resources                     18004454    18041094    9891090 UNLIMITED
   11 gcs_shadows                       28291646    28651690    9891090 UNLIMITED
   12 gcs_resources                         1200     3208902    9891090 UNLIMITED
   12 gcs_shadows                           1200     4040165    9891090 UNLIMITED
   13 gcs_resources                         1200     3182042    9891090 UNLIMITED
   13 gcs_shadows                           1200     3883302    9891090 UNLIMITED
   14 gcs_resources                         1200     3154007    9891090 UNLIMITED
   14 gcs_shadows                           1200     3899527    9891090 UNLIMITED
   15 gcs_resources                         1200     3146403    9891090 UNLIMITED
   15 gcs_shadows                           1200     3924065    9891090 UNLIMITED
   16 gcs_resources                         1200     2986005    9891090 UNLIMITED
   16 gcs_shadows                           1200     3839600    9891090 UNLIMITED

We also saw the following recurring messages about 11 seconds apart in the alert.logs on multiple instances:

Wed Sep 28 09:51:27 2016
Global flush: getting parseLock
Global flush: getting parseLock
Wed Sep 28 09:51:38 2016
Global flush: getting parseLock
Global flush: getting parseLock

We are used to seeing the following sequence in our alert.log:

Global flush: getting parseLock
Start global flush...
Complete global flush...

However, the previous night was the last time we saw a “Complete global flush” message. And, starting at 04:04 in the morning, we got the following “redo cache is full” message:

Wed Sep 28 04:04:12 2016
Recovery: global flush, redo cache is full

To compound the problem, we also had user sessions running on the recovery node. This caused additional contention as it took longer to serve GC requests.

We were also experiencing a recovery lag.

What We Found Out

In Active Data Guard, the node that MRP runs on (the recovery node) becomes the master for all blocks. Therefore the GCS regions in the shared pool on that one instance become very large. Every time the shared pool grew, it was causing large flushes to occur (up to 1000 objects at a time according to X$KSMLRU). This also accounted for our ORA-4031 errors on large pool allocations as well as the severe library cache and shared pool contention.

What We Did to Address the Issue

The first thing we did was to migrate all users/services off of nodes 10 and 11. We decided that until we got more definitive answers from Oracle Support, we would dedicate nodes 10 and 11 to recovery. By keeping users off of both nodes, we could then “toggle” MRP back and forth between the two nodes and tweak our configurations without impact to users.

Support recommended that based on our high water marks for GCS allocations, that we increase the resource limits for those objects as well as increasing shared pool and SGA sizes on the two recovery nodes (nodes 10 and 11). Here are the values we used:

 _gcs_resources=80000000
 _gcs_shadows_locks=80000000
 shared_pool_size=48G 
 db_cache_size=24G 
 sga_target=84G 
 sga_max_size=90G

We made the changes and then restarted both instances and MRP/recovery.

Results

After those changes, things stabilized and MRP caught up. Users reported no further problems with queries and the problem went away.

Additional Info

There is a patch for bug 17597704 that allows blocks to be “multi-mastered” (mastered on all nodes in the cluster, instead of just on the recovery node). We opted not to apply and enable this fix as we felt that it was potentially trading one problem for another. In other words, instead of going to the recovery nodes for the block masters, queries running on one node would potentially have to visit any of the other 15 nodes. Since performance was fine with mastering on the recovery node, we left this one alone.

In addition, in 11.2, these GCS-related allocations are permanent objects and not freeable. There is a patch (21863727) in 12.1.0.2 that makes these objects freeable but unfortunately there is no remedy for 11.2 at this time. This explains why flushes of the shared pool did not help.

The Performance Chronicles

Adventures in Troubleshooting and Tuning

Monthly Archives: October 2016

Active Data Guard Heavy GC and Latch Contention