Hybrid Histograms & Unpadded NCHAR Predicates

EDIT:2018-03-16 Seems WordPress or SQLPlus messed up my testcase, I have updated tihs post to include unedited testcase as well as output.

EDIT:2018-03-22 Oracle Support have created a new bug for this issue, Bug 27730925 – WRONG CARDINALITY ESTIMATION FROM HYBRID HISTOGRAM ON NCHAR COLUMN

Paraphrasing Jonathan Lewis it’s very easy to start ‘breaking’ Oracle as soon as you start combining a few features that most people are not running. The following demonstrates that if you are using NCHAR datatypes (in our case this is forced on us because due to our JDEdwards application) and if you have a hybrid histogram then cardinality estimates vary greatly depending on whether or not you pad your predicates.

Of course with NCHAR datatypes, the results and hence cardinality should be unchanged whether or not predicates are padded.  I have been unable to reproduce the symptoms with CHAR datatypes.

DROP TABLE HH_NC_TAB;
CREATE TABLE HH_NC_TAB (COL NvarCHAR2(10));

insert /*+APPEND*/ into HH_NC_TAB
SELECT 'AAA'||mod(rownum, 10)  FROM DUAL CONNECT BY LEVEL < 100000
union all
SELECT 'AAA'||rownum FROM DUAL CONNECT BY LEVEL < 100000; 
commit; 
EXEC DBMS_STATS.GATHER_TABLE_STATS(NULL, 'HH_NC_TAB', METHOD_OPT=>'FOR COLUMNS COL SIZE 254');

EXPLAIN PLAN FOR SELECT * FROM HH_NC_TAB WHERE COL = 'AAA1';
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);

EXPLAIN PLAN FOR SELECT * FROM HH_NC_TAB WHERE COL = RPAD('AAA1', 10);
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
SQL> DROP TABLE HH_NC_TAB;

Table HH_NC_TAB dropped.

SQL> CREATE TABLE HH_NC_TAB (COL NCHAR(10));

Table HH_NC_TAB created.

SQL> insert /*+APPEND*/ into HH_NC_TAB
SELECT 'AAA'||mod(rownum, 10) FROM DUAL CONNECT BY LEVEL < 100000
union all
SELECT 'AAA'||rownum FROM DUAL CONNECT BY LEVEL < 100000; Commit complete.  SQL> EXEC DBMS_STATS.GATHER_TABLE_STATS(NULL, 'HH_NC_TAB', METHOD_OPT=>'FOR COLUMNS COL SIZE 254');

PL/SQL procedure successfully completed.

SQL>
SQL> EXPLAIN PLAN FOR SELECT * FROM HH_NC_TAB WHERE COL = 'AAA1';

Explained.

SQL> SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
PLAN_TABLE_OUTPUT
Plan hash value: 2981390417

---------------------------------------------------------------------------------------
| Id  | Operation                 | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |           |     1 |    21 |    13  (24)| 00:00:01 |
|*  1 |  TABLE ACCESS STORAGE FULL| HH_NC_TAB |     1 |    21 |    13  (24)| 00:00:01 |
---------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - storage("COL"=U'AAA1')
       filter("COL"=U'AAA1')


14 rows selected.

SQL>
SQL> EXPLAIN PLAN FOR SELECT * FROM HH_NC_TAB WHERE COL = RPAD('AAA1', 10);

Explained.

SQL> SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
PLAN_TABLE_OUTPUT
Plan hash value: 2981390417

---------------------------------------------------------------------------------------
| Id  | Operation                 | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |           |  9689 |   198K|    12  (17)| 00:00:01 |
|*  1 |  TABLE ACCESS STORAGE FULL| HH_NC_TAB |  9689 |   198K|    12  (17)| 00:00:01 |
---------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - storage("COL"=U'AAA1      ')
       filter("COL"=U'AAA1      ')


14 rows selected.

SQL>

Issue reproduces on 12.1, 12.2 and I’ve uploaded to LiveSQL so you can verify for yourself on 18.1

Multi-Tenant Resource Manager Oddity

I’ll try to write this up further when I make progress, but just wanted somewhere to share testcase. I’m guessing it’s a multi-tenant bug, but if that’s the case I’m a little surprised why no-one else has hit it.

SQL> CREATE TABLE DUMMY AS SELECT * FROM DBA_OBJECTS WHERE 1=0;

Table created.

SQL> SELECT resource_consumer_group
  2  FROM v$session
  3  WHERE sid IN (SELECT sid FROM v$mystat);

RESOURCE_CONSUMER_GROUP
--------------------------------
OTHER_GROUPS

SQL> BEGIN
  2    DBMS_SCHEDULER.create_job (
  3      job_name        => 'RC_JOBA',
  4      job_type        => 'PLSQL_BLOCK',
  5      job_action      => 'BEGIN INSERT INTO DUMMY SELECT * FROM DBA_OBJECTS; DELETE FROM DUMMY; COMMIT; END;',
  6      enabled         => TRUE);
  7  END;
  8  /

PL/SQL procedure successfully completed.

SQL> SELECT DISTINCT consumer_group
  2  FROM
  3      v$active_session_history ash,
  4      dba_rsrc_consumer_groups cg
  5  WHERE
  6      action = 'RC_JOBA'
  7      AND   ash.consumer_group_id = cg.consumer_group_id;

CONSUMER_GROUP
--------------------------------------------------------------------------------
SYS_GROUP

SQL>

Edit: I guess if you are testing you may need to update the task the job does to ensure it writes something into v$active_session_history, if you’ve got suggestion for better way to do that let me know, I can’t think of one right now.

Update 2018-03-02: Worryingly this doesn’t seem to be simply a reporting error. I updated the current database resource plan such that any SQL estimated to take over 1 second to complete would be cancelled.
plan

Running a bad SQL directly as user caused query to be cancelled as expected, running via dbms_scheduler attempted to execute the bad SQL.

SQL> CREATE OR REPLACE PROCEDURE KILL_ME_PROC AS
  2  BEGIN
  3     FOR LOOPY IN (SELECT NULL FROM ALL_OBJECTS A1, ALL_OBJECTS A2, ALL_OBJECTS A3, ALL_OBJECTS A3)
  4     LOOP
  5        NULL;
  6     END LOOP;
  7  END;
  8  /

Procedure created.

SQL>
SQL> EXEC KILL_ME_PROC;
BEGIN KILL_ME_PROC; END;

*
ERROR at line 1:
ORA-00040: active time limit exceeded - call aborted
ORA-06512: at "C##PATRICKJOLLIFFE.KILL_ME_PROC", line 3
ORA-06512: at line 1


SQL> SELECT resource_consumer_group
  2  FROM v$session
  3  WHERE sid IN (SELECT sid FROM v$mystat);

RESOURCE_CONSUMER_GROUP
--------------------------------
OTHER_GROUPS

SQL> BEGIN
  2      DBMS_SCHEDULER.create_job (
  3        job_name        => 'KILL_ME_JOB2',
  4        job_type        => 'STORED_PROCEDURE',
  5        job_action      => 'KILL_ME_PROC',
  6        enabled         => TRUE);
  7  END;
  8  /

PL/SQL procedure successfully completed.

SQL> SELECT ERRORS FROM DBA_SCHEDULER_JOB_RUN_DETAILS WHERE JOB_NAME = 'KILL_ME_JOB2' AND LOG_DATE > SYSDATE-1/24;

ERRORS
--------------------------------------------------------------------------------
ORA-01652: unable to extend temp segment by 128 in tablespace TEMP
ORA-06512: at "C##PATRICKJOLLIFFE.KILL_ME_PROC", line 3


SQL>

 

Windows Single Sign-On to an Oracle database using Kerberos

I’ve recently been investigating Kerberos authentication as a means for providing Windows Single Sign On from Windows clients into an Oracle database servers. Maybe you are wondering one of the following questions.
What on earth does Kerberos authentication have to do with Microsoft Windows?
Don’t you have to pay extra money to Oracle for such functionality?
How to I configure this, are any additional components required?

To answer those questions in order, firstly Kerberos is the default authentication protocol since Windows 2000 (type klist at your Windows command prompt if you’re logged into a Windows domain to see evidence of this).
Secondly, while initially part of the Advanced Security Option, Kerberos Authentication no longer requires this extra cost option.
To answer the final question, read on; the following are the steps I took to enable Kerberos Authentication on my lab environment running on AWS.

First of all the initial setup.  The following machines all belong to a domain called patrick.domain.

  • Host MSWDC runs Windows Server 2016 is the Domain Controller
  • Host WINCLIENT runs Windows Server 2016, and holds Oracle (instant) client binaries
  • Host linoradb runs Oracle Linux Server release 7.4, holding an Oracle 11.2.0.4 Enterprise Edition Database.  Kerberos client packages are installed

I have created the two domain user accounts on patrick.domain:

  • patrickjolliffe – my personal account with which will try to SSO to the database.
  • orasvc – a service account that will be used to generate the keytab file used by the database server

First I add the following entries to sqlnet.ora on the database server:

SQLNET.KERBEROS5_CONF=/etc/krb5.conf
SQLNET.KERBEROS5_CONF_MIT=true

And update /etc/krb5.conf so that it contains the following:

[libdefaults]
  default_realm = PATRICK.DOMAIN
[realms]
  PATRICK.DOMAIN = {
    kdc = mswdc.patrick.domain
    admin_server = mswdc.patrick.domain
  }

At this point it’s worth doing some initial checks to validate Kerberos communication between the Oracle Database Server and the Domain Controller.

First let’s obtain, view and then destroy a Kerberos ticket through Linux
Kerberos Stack:

[oracle@linoradb ~]$ kinit patrickjolliffe
Password for patrickjolliffe@PATRICK.DOMAIN:
[oracle@linoradb admin]$ klist
Ticket cache: FILE:/tmp/krb5cc_54321
Default principal: patrickjolliffe@PATRICK.DOMAIN

Valid starting Expires Service principal
02/24/2018 06:58:44 02/24/2018 16:58:44 krbtgt/PATRICK.DOMAIN@PATRICK.DOMAIN
renew until 02/25/2018 06:58:41
[oracle@linoradb admin]$ kdestroy

Next let’s perform the same task via the Oracle Kerberos stack:

[oracle@linoradb admin]$ okinit patrickjolliffe

Kerberos Utilities for Linux: Version 11.2.0.4.0 - Production on 24-FEB-2018 07:01:34

Copyright (c) 1996, 2013 Oracle. All rights reserved.

Password for patrickjolliffe@PATRICK.DOMAIN:
[oracle@linoradb admin]$ oklist

Kerberos Utilities for Linux: Version 11.2.0.4.0 - Production on 24-FEB-2018 07:01:45

Copyright (c) 1996, 2013 Oracle. All rights reserved.

Ticket cache: /tmp/krb5cc_54321
Default principal: patrickjolliffe@PATRICK.DOMAIN

Valid Starting Expires Principal
24-Feb-2018 07:01:38 24-Feb-2018 15:01:34 krbtgt/PATRICK.DOMAIN@PATRICK.DOMAIN
[oracle@linoradb admin]$ okdstry

Kerberos Utilities for Linux: Version 11.2.0.4.0 - Production on 24-FEB-2018 07:01:52

Copyright (c) 1996, 2013 Oracle. All rights reserved.

[oracle@linoradb admin]$

Next we generate a keytab file on the domain controller for deployment to the database server.

ktpass -princ oracle/linoradb.patrick.domain@PATRICK.DOMAIN -ptype KRB5_NT_PRINCIPAL
-crypto RC4-HMAC-NT -mapuser orasvc@patrick.domain -out krb5.keytab -pass

File krb5.keytab is created and I copy to /etc directory on the database server linoradb.
I’ll add a couple of entries to sqlnet.ora on the database server so the Oracle Kerberos Stack can find this keytab, and also provide the service name “oracle” which I specified when generating the keytab.

SQLNET.KERBEROS5_KEYTAB=/etc/krb5.keytab
SQLNET.AUTHENTICATION_KERBEROS5_SERVICE=oracle

Let’s validate the keytab through the Linux stack:

[oracle@linoradb admin]$ klist -k -e
Keytab name: FILE:/etc/krb5.keytab
KVNO Principal
---- --------------------------------------------------------------------------
4 oracle/linoradb.patrick.domain@PATRICK.DOMAIN (arcfour-hmac)
[oracle@linoradb admin]$

All good, so let’s do the same check using the Oracle stack:

[oracle@linoradb admin]$ oklist -k

Kerberos Utilities for Linux: Version 11.2.0.4.0 - Production on 24-FEB-2018 07:16:14

Copyright (c) 1996, 2013 Oracle. All rights reserved.

Service Key Table: /etc/krb5.keytab

Ver Timestamp Principal
4 01-Jan-1970 00:00:00 oracle/linoradb.patrick.domain@PATRICK.DOMAIN
[oracle@linoradb admin]$

Next I’ll add the following entry to sqlnet.ora on the database server to allow Kerberos in addition to BEQ (local) connection

SQLNET.AUTHENTICATION_SERVICES=(BEQ,KERBEROS5)

On the database I create a new user that will be authenticated using Kerberos. Note the double quotes and the capitalization.

SQL> create user "PATRICKJOLLIFFE@PATRICK.DOMAIN" identified externally;

User created.

SQL> GRANT CREATE SESSION TO "PATRICKJOLLIFFE@PATRICK.DOMAIN";

Grant succeeded.

Before we complicate matters by getting the Windows client involved, lets first check Kerberos authentication for a connection locally on the database server (note I create and use a tnsnames entry to make sure we don’t use local BEQ authentication):

[oracle@linoradb ~]$ kinit patrickjolliffe
Password for patrickjolliffe@PATRICK.DOMAIN:
[oracle@linoradb admin]$ sqlplus /@SSOTEST

SQL*Plus: Release 11.2.0.4.0 Production on Sat Feb 24 07:49:19 2018

Copyright (c) 1982, 2013, Oracle. All rights reserved.

Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

SQL> SHOW USER;
USER is "PATRICKJOLLIFFE@PATRICK.DOMAIN"
SQL>

That’s working fine so finally lets try the Windows client. I have installed instant client 11.1, 11.2, 12.1 and 12.2 versions and ensured the following settings work with all of them.

For 11.x clients authentication service KERBEROS5 is used, with Credential Cache (CC_NAME) OSMSFT:

For 12.x client 12.x in theory, KERBROS5 service should be used with MSLSA: for the CC_NAME, however due to bug 18895651, KERBEROS5PRE is required with CC_NAME OSMSFT:

Sources on the internet specify “OSMSFT:” or “OSMSFT://”, testing shows that any characters after the “:” are ignored so I provide the simplest possible configuration.

SQLNET.AUTHENTICATION_SERVICES=(KERBEROS5PRE,KERBEROS5)
SQLNET.KERBEROS5_CONF_MIT=true
SQLNET.KERBEROS5_CONF=C:\Users\patrickjolliffe\krb5.conf
SQLNET.KERBEROS5_CC_NAME=OSMSFT:

Testing with instant clients show that KERBEROS5_CONF just needs to point to a valid kerberos configuration file, it can even be empty, however for consistency, I just configure exactly the same as the krb5.conf on the database server.

Below I show connecting via 11.2 instant client. For demonstration purposes I purge Kerberos tickets before connecting, and after successful connection show that a ticket has been obtained for the access to the database:

C:\Users\patrickjolliffe\instantclient_11_1>klist purge

Current LogonId is 0:0xa69ed
Deleting all tickets:
Ticket(s) purged!

C:\Users\patrickjolliffe\instantclient_11_1>sqlplus /@ssotest

SQL*Plus: Release 11.1.0.7.0 - Production on Sat Feb 24 08:19:15 2018

Copyright (c) 1982, 2008, Oracle. All rights reserved.

Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

SQL> show user
USER is "PATRICKJOLLIFFE@PATRICK.DOMAIN"
SQL> exit
Disconnected from Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

C:\Users\patrickjolliffe\instantclient_11_1>klist

Current LogonId is 0:0xa69ed

Cached Tickets: (2)

#0> Client: patrickjolliffe @ PATRICK.DOMAIN
Server: krbtgt/PATRICK.DOMAIN @ PATRICK.DOMAIN
KerbTicket Encryption Type: AES-256-CTS-HMAC-SHA1-96
Ticket Flags 0x40e10000 -> forwardable renewable initial pre_authent name_canonicalize
Start Time: 2/24/2018 8:19:15 (local)
End Time: 2/24/2018 18:19:15 (local)
Renew Time: 3/3/2018 8:19:15 (local)
Session Key Type: AES-256-CTS-HMAC-SHA1-96
Cache Flags: 0x1 -> PRIMARY
Kdc Called: MSWDC.patrick.domain

#1> Client: patrickjolliffe @ PATRICK.DOMAIN
Server: oracle/linoradb.patrick.domain @ PATRICK.DOMAIN
KerbTicket Encryption Type: RSADSI RC4-HMAC(NT)
Ticket Flags 0x40a10000 -> forwardable renewable pre_authent name_canonicalize
Start Time: 2/24/2018 8:19:15 (local)
End Time: 2/24/2018 18:19:15 (local)
Renew Time: 3/3/2018 8:19:15 (local)
Session Key Type: RSADSI RC4-HMAC(NT)
Cache Flags: 0
Kdc Called: MSWDC.patrick.domain

C:\Users\patrickjolliffe\instantclient_11_1>

Of course it’s never so easy in practice, there are always complications, debugging and investigation needed to get this working, but when you consider the benefits to a large organization of providing Single Sign-On to Oracle databases, I think it’s time well spent.

Direct Path Reads and Chained Rows

Another DBA was investigating a performance difference of a query between two different systems and asked me to take a look. The execution plan, a full table scan, was the same in both cases.  We ran Snapper on the problem system and observed that most of the time was spent on event ‘db file sequential read’.

-- Session Snapper v4.24 - by Tanel Poder ( http://blog.tanelpoder.com/snapper ) - Enjoy the Most Advanced Oracle Troubleshooting Script on the Planet! :)

---------------------------------------------------------------------------------------------------------------
  ActSes   %Thread | INST | SQL_ID          | SQL_CHILD | EVENT                               | WAIT_CLASS
---------------------------------------------------------------------------------------------------------------
     .98     (98%) |    1 | 2kcksks9ht14t   | 0         | db file sequential read             | User I/O
     .03      (3%) |    1 | 2kcksks9ht14t   | 0         | ON CPU                              | ON CPU

--  End of ASH snap 1, end=2017-11-24 16:12:18, seconds=5, samples_taken=40, AAS=1

‘db file sequential read’ is a single block access, not usually associated with a full table scan, however from previous experience I guessed that we had chained rows (one row split among multiple blocks), so we reran snapper and checked the value of statistic ‘table fetch continued row’.

-- Session Snapper v4.24 - by Tanel Poder ( http://blog.tanelpoder.com/snapper ) - Enjoy the Most Advanced Oracle Troubleshooting Script on the Planet! :)

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
SID, USERNAME  , TYPE, STATISTIC                                                 ,         DELTA, HDELTA/SEC,    %TIME, GRAPH       , NUM_WAITS,  WAITS/SEC,   AVERAGES
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
30, GLOGOWNER , STAT, session logical reads                                     ,          8477,      1.56k,         ,             ,          ,           ,      8.48k total buffer visits
30, GLOGOWNER , STAT, user I/O wait time                                        ,           531,      98.03,         ,             ,          ,           ,          ~ per execution
30, GLOGOWNER , STAT, non-idle wait time                                        ,           531,      98.03,         ,             ,          ,           ,          ~ per execution
30, GLOGOWNER , STAT, non-idle wait count                                       ,          8449,      1.56k,         ,             ,          ,           ,          ~ per execution
30, GLOGOWNER , STAT, physical read total IO requests                           ,          8449,      1.56k,         ,             ,          ,           ,          ~ per execution
30, GLOGOWNER , STAT, physical read total bytes                                 ,      69214208,     12.78M,         ,             ,          ,           ,          ~ per execution
30, GLOGOWNER , STAT, cell physical IO interconnect bytes                       ,      69214208,     12.78M,         ,             ,          ,           ,          ~ per execution
30, GLOGOWNER , STAT, consistent gets                                           ,          8477,      1.56k,         ,             ,          ,           ,          ~ per execution
30, GLOGOWNER , STAT, consistent gets direct                                    ,          8477,      1.56k,         ,             ,          ,           ,          ~ per execution
30, GLOGOWNER , STAT, physical reads                                            ,          8449,      1.56k,         ,             ,          ,           ,          ~ per execution
30, GLOGOWNER , STAT, physical reads direct                                     ,          8449,      1.56k,         ,             ,          ,           ,          ~ per execution
30, GLOGOWNER , STAT, physical read IO requests                                 ,          8449,      1.56k,         ,             ,          ,           ,      8.19k bytes per request
30, GLOGOWNER , STAT, physical read bytes                                       ,      69214208,     12.78M,         ,             ,          ,           ,          ~ per execution
30, GLOGOWNER , STAT, file io service time                                      ,         16898,      3.12k,         ,             ,          ,           ,        2us bad guess of IO service time per IO request
30, GLOGOWNER , STAT, file io wait time                                         ,       5285794,    975.84k,         ,             ,          ,           ,   625.61us bad guess of IO wait time per IO request
30, GLOGOWNER , STAT, no work - consistent read gets                            ,          8478,      1.57k,         ,             ,          ,           ,          ~ per execution
30, GLOGOWNER , STAT, table scan rows gotten                                    ,          3798,     701.17,         ,             ,          ,           ,          ~ per execution
30, GLOGOWNER , STAT, table scan blocks gotten                                  ,           882,     162.83,         ,             ,          ,           ,          ~ per execution
30, GLOGOWNER , STAT, table fetch continued row                                 ,          7567,       1.4k,         ,             ,          ,           ,          ~ per execution
30, GLOGOWNER , STAT, buffer is not pinned count                                ,          8477,      1.56k,         ,             ,          ,           ,          ~ per execution
30, GLOGOWNER , TIME, DB CPU                                                    ,        716890,   132.35ms,    13.2%, [@@        ],          ,           ,
30, GLOGOWNER , TIME, sql execute elapsed time                                  ,       6001826,      1.11s,   110.8%, [##########],          ,           ,
30, GLOGOWNER , TIME, DB time                                                   ,       6001826,      1.11s,   110.8%, [##########],          ,           ,      -10.8 % unaccounted time
30, GLOGOWNER , WAIT, db file sequential read                                   ,       5300427,   978.54ms,    97.9%, [WWWWWWWWWW],      8409,      1.55k,   630.33us average wait

--  End of Stats snap 1, end=2017-11-24 16:12:29, seconds=5.4

We were starting to understand the problem we were facing, especially when the DBA observed that the table in question had more than 255 columns (rows in such tables will always be chained), however we still didn’t understand why on the good performing system, also with chained rows, the query ran very fast, mostly on CPU.

I reran the query with SQL trace on the problem system, and attach a small portion below:

WAIT #139940851571352: nam='direct path read' ela= 3354 file number=11 first dba=257026 block cnt=62 obj#=131201 tim=1988902071975
WAIT #139940851571352: nam='db file sequential read' ela= 316 file#=11 block#=13034 blocks=1 obj#=131201 tim=1988902073088
WAIT #139940851571352: nam='db file sequential read' ela= 576 file#=11 block#=13034 blocks=1 obj#=131201 tim=1988902073711
WAIT #139940851571352: nam='db file sequential read' ela= 472 file#=11 block#=118100 blocks=1 obj#=131201 tim=1988902074209
WAIT #139940851571352: nam='db file sequential read' ela= 382 file#=11 block#=13034 blocks=1 obj#=131201 tim=1988902074629
WAIT #139940851571352: nam='db file sequential read' ela= 558 file#=11 block#=118140 blocks=1 obj#=131201 tim=1988902075214
WAIT #139940851571352: nam='db file sequential read' ela= 479 file#=11 block#=13034 blocks=1 obj#=131201 tim=1988902075732
WAIT #139940851571352: nam='db file sequential read' ela= 532 file#=11 block#=141498 blocks=1 obj#=131201 tim=1988902076290
WAIT #139940851571352: nam='db file sequential read' ela= 605 file#=11 block#=13034 blocks=1 obj#=131201 tim=1988902076933
WAIT #139940851571352: nam='db file sequential read' ela= 512 file#=11 block#=12950 blocks=1 obj#=131201 tim=1988902077472
WAIT #139940851571352: nam='direct path read' ela= 4326 file number=11 first dba=259074 block cnt=62 obj#=131201 tim=1988902081885
WAIT #139940851571352: nam='db file sequential read' ela= 754 file#=11 block#=13035 blocks=1 obj#=131201 tim=1988902082905
WAIT #139940851571352: nam='db file sequential read' ela= 544 file#=11 block#=13035 blocks=1 obj#=131201 tim=1988902083483
WAIT #139940851571352: nam='db file sequential read' ela= 617 file#=11 block#=12955 blocks=1 obj#=131201 tim=1988902084132
WAIT #139940851571352: nam='db file sequential read' ela= 546 file#=11 block#=13035 blocks=1 obj#=131201 tim=1988902084723
WAIT #139940851571352: nam='db file sequential read' ela= 976 file#=11 block#=141559 blocks=1 obj#=131201 tim=1988902085730
WAIT #139940851571352: nam='direct path read' ela= 3240 file number=11 first dba=264450 block cnt=62 obj#=131201 tim=1988902089070
WAIT #139940851571352: nam='db file sequential read' ela= 380 file#=11 block#=13036 blocks=1 obj#=131201 tim=1988902090605
WAIT #139940851571352: nam='db file sequential read' ela= 812 file#=11 block#=13036 blocks=1 obj#=131201 tim=1988902091452
WAIT #139940851571352: nam='db file sequential read' ela= 452 file#=11 block#=118118 blocks=1 obj#=131201 tim=1988902091936
WAIT #139940851571352: nam='db file sequential read' ela= 608 file#=11 block#=13036 blocks=1 obj#=131201 tim=1988902092588
WAIT #139940851571352: nam='db file sequential read' ela= 566 file#=11 block#=13008 blocks=1 obj#=131201 tim=1988902093186

While trying to figure out why the same block, say block# 13034, in the trace was being reread from disk during subsequent events, rather than being accessed from the buffer cache, another of my colleagues (thanks Kei) pointed out that the event associated with the table scan was a ‘direct path read’, rather than the usual ‘db file scattered read’.

‘direct path reads’ bypass the buffer cache, reading blocks directly into the session’s PGA. It seems this behavior also affects subsequent ‘db file sequential read’s necessary for accessing chained row pieces, potentially causing some significant performance issues.

As is often the case, Jonathan Lewis has already blogged about this issue however it’s always more rewarding and fun (re)discovering these issues oneself!

Slovenia & Croatia Oracle User Group Conferences 2017

20171014_183311.jpgThe fact that these two conferences run back-to-back seemed a perfect opportunity to experience both, at the expense of a single long-haul flight.  Their taking place at amazing locations on the Istrian coast which I had never visited before sealed the deal. I submitted abstracts and was delighted when they both accepted me to speak.

I landed in Zagreb after eighteen hours in transit, picked up my rental car, and crossed the Croatian-Slovenian border on my way to the Slovenian capital Ljubljana.  The city was  beautiful, but I was only able to spend about half an hour looking around before exhaustion kicked in and I retired for an early night.

20171015_113318.jpgThe Slovenian conference was held on the town of Portoroz, and I had heard about some cave systems, Postojna and Skocjan on the way.  I had planned to visit both, but on the way to the Postojna, I saw a sign to a cave at Planina so I decided to check that out.  The public accessible area only extends a couple of hundred meters inside (much more is accessible with a guide)  but it was really cool just being alone in the quiet and pitch darkness.

20171015_131242.jpgPostojna Cave  is a complete contrast; you take a 20 minute train journey with hundreds of other sightseers deep into an amazing cave system, followed by a 40 minute guided tour, before taking the train back out again.  After these two experiences I’d had more then enough of caves, so I decided too skip Skocjan and head direct to the conference venue.

20171015_201945.jpg

20171016_203427.jpgThe speakers dinner was held at an fantastic restaurant, Na Burji, and the meal was accompanied by bread baked by the multi-talented Jože Senegačnik.  I was thrilled to finally catch up in person with Jure Bratina, the first subscriber to my blog (!).

Both were fantastic hosts, epitomizing the warmth of the all Slovenian and Croatian people I met on the trip.

After the two day conference I drove with ‘Mr SmartScan’ Roger McNicol back across the border to Croatia and the town of Rovinj.

The Croatian conference is held at Hotel Istra, on Red Island, a 20 minute ferry trip from Rovinj.  There are actually two islands joined together by a causeway, and you can walk the circumference of both in about forty minutes.  They must have been spectacular before the hotel was built!

20171018_072500.jpg

Most days during the conference I managed to swim in the morning and evening, one morning completing the 75 minute swim around the smaller island with Kamil “Let’s do it quickly before we realize it’s a stupid idea” Stawiarski!

One of the highlights was a guided trip to the historic town of Pula, followed by drinking and go-karting, organized by the larger than life figure of Davor Ranković.

20171018_150440.jpg

The above description makes everything seem like an amazing holiday, which indeed it was.  However in addition, I saw so many excellent and informative presentations on topics such as Oracle database, Application Express and Oracle JET.  Being able to interact and make connections in person with so many incredibly knowledgeable and talented people from both inside, and outside of, Oracle corporation is very empowering, and something that is just not possible through the internet. Without exception I found that everyone was so welcoming to me as a new speaker.

Thanks everyone that made it possible, or made it special just by being part of it!

 

 

 

 

Identifying invalid SQL with event 10035

I recently learned from Jonathan Lewis’s blog that in version 12.2 details of invalid SQL statements are written to the alert log (after every 100 occurrences for a particular SQL statement).

Its worth to  mention that similar functionality can be achieved in earlier versions by setting event 10035 (not to be confused with it’s more (in)famous sibling, event 10053).

I will demonstrate with version 12.1 using the same invalid SQL from Jonathan’s post (I have tested that this works at least as far back as 10.2):

Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options

SQL> ALTER SESSION SET EVENTS='10035 trace name context forever, level 1';

Session altered.

SQL> select count(*) frm dual;

Error starting at line : 1 in command -
select count(*) frm dual
Error at Command Line : 1 Column : 21
Error report -
SQL Error: ORA-00923: FROM keyword not found where expected
00923. 00000 - "FROM keyword not found where expected"
*Cause:
*Action:

The following information is written in the alert log:

PARSE ERROR: ospid=14737, error=923 for statement:
Sun Oct 08 01:14:43 2017
select count(*) frm dual
Additional information: hd=0x6bffaa18 phd=0x71c88e18 flg=0x28 cisid=8 sid=8 ciuid=8 uid=8
PARSE ERROR: ospid=14737, error=923 for statement:
Sun Oct 08 01:14:43 2017
select count(*) frm dual
Additional information: hd=0x6bffaa18 phd=0x71c88e18 flg=0x28 cisid=8 sid=8 ciuid=8 uid=8
Sun Oct 08 01:14:43 2017
----- PL/SQL Call Stack -----
object line object
handle number name
0x6f9e6a10 1120 package body SYS.DBMS_SQL
0x727c2ba8 15 anonymous block

Due to the extra overhead this would add to the parse of every invalid SQL, I would suggest to only set this event temporarily at session or system level after identifying that this is a problem from the statistic “parse count (failures)”.

Update

I just found out that Tanel covered this back in 2009 as part of an deep-dive article into diagnostic events, that is well worth a read.

In-Memory Column Store and tables with more than 255 columns

Last week, developers reported some unexpected behavior on a test system.  I thought I would take advantage of the long weekend here in Hong Kong to build the simplest possible reproducible testcase.  I find this is useful to not only in allowing others to easily test on their system, but also to distill exactly what is and is not relevant to the problem, an essential first step for understanding what is going on.

The result confirms something I think I heard from Jonathan Lewis, in that when you start to mix certain features that majority of customers are not using, in this case in-memory column store, tables with >255 columns, and NCHAR columns, then you can start to see some ‘interesting’ results.

For my testcase I am running 12.1.0.2 with 170418 DBBP.

Observe the first part of the setup, I simply use PL/SQL block to create a table with 256 columns (note the first is an NCHAR) and populate it with a couple of rows.  The rows both have c1 set to ‘X’, and on one I populate the last field c256 with value 0, on the other it is NULL.  A query from the the table where c1=’X’ returns both rows as expected.

SQL> DECLARE
  2      sqltext   VARCHAR2(32767);
  3  BEGIN
  4      sqltext := 'CREATE TABLE IM256 (C1 NCHAR(1), ';
  5      FOR loopy IN 2..256 LOOP
  6          sqltext := sqltext
  7           || 'c'
  8           || loopy
  9           || ' NUMBER,';
 10      END LOOP;
 11      -- Remove trailing comma and add close braket
 12      sqltext := SUBSTR(sqltext, 1, LENGTH(SQLTEXT)-1) || ')' ;
 13      EXECUTE IMMEDIATE sqltext;
 14  END;
 15  /

PL/SQL procedure successfully completed.

SQL>
SQL> INSERT INTO IM256(C1, C256) VALUES ('X' , 0);

1 row inserted.

SQL> INSERT INTO IM256(C1) VALUES ('X');

1 row inserted.

SQL> COMMIT;

Commit complete.

SQL>
SQL> SELECT * FROM IM256 WHERE C1 = 'X';

X














                                                                                                                                                                   0

X

















2 rows selected.

SQL>

As soon as I move the table into In-Memory Column Store, the query with the predicate starts to return only one row. The query without predicate still returns both rows.

SQL> ALTER TABLE IM256 INMEMORY NO MEMCOMPRESS;

Table IM256 altered.

SQL> SELECT * FROM IM256 WHERE C1 = 'X';

X














                                                                                                                                                                   0


1 row selected.

SQL> SELECT * FROM IM256;

X














                                                                                                                                                                   0

X














                                                                                                                                                                   0


2 rows selected.

SQL>

Stefan Koehler pointed me towards event 12099 which is mentioned in a couple of bug reports in this area, as a way to disable ‘turboscan’ and indeed setting this event causes the expected results to appear.

SQL>  alter session set events '12099 trace name context forever, level 1';

Session altered.

SQL> SELECT * FROM IM256 WHERE C1 = 'X';

X














                                                                                                                                                                   0

X

















2 rows selected.

Oracle support have identified this as bug 20508559 which is described in further detail in document 20508559.8 – Wrong Results / ORA-600 [kdmv_check_row] with more than 255 columns with nulls.

Edit: I have just tested the patch and after applying it the issue is resolved.

High Five POUG

Earlier in the year while digging into AWR Warehouse internals I noticed that POUG (Pint with/Polish Oracle User Group depending on who you ask) had opened their call for papers so on impulse I submitted an abstract and thought no more about it.

I was surprised to receive a response that I could attend as a reserve speaker, they hoped I wasn’t too disappointed. I don’t think they realized I considered this the perfect outcome, an opportunity to attend a fantastic conference without the pressure of having to present.

A couple of weeks before the conference I received some disturbing news, one of their speakers had dropped out, and would be presenting. I would actually have to work for my conference ticket!

The conference itself was everything I had hoped for and more. It was held in a microbrewery so there was free flow beer for the duration which really contributed to the relaxed social atmosphere. All the organizers and staff were great, in particular Kamil’s enthusiasm was complemented by Luiza’s attention to detail. They were so accommodating and had added so many personalized touches to how I was treated. Likely because of this they had also attracted a top list of A-list speakers.

One thing I hadn’t expected was the support from the other speakers. As a first time presenter I received reassuring words from many, but in particular those from Neil Chandler and Jim Czuprynski really helped calm my nerves.

IMG_1582

In summary if you have yet to present at a conference then consider submitting a paper, you never know what might happen. Also if you ever get the opportunity to attend a future POUG conference then don’t hesitate to go.

IMG_1584

“ORA-20110: set stamp set count conflict” after upgrading to 12.2 Recovery Catalog

This is the second time we have hit this error after upgrading the recovery catalog to 12.2.

The first is now documented by Oracle as Document 2291791.1, and affects 12.1 multi-tenant instances with DBBP 12.1.0.2.161018 or later applied.
If you have applied that patchset, you can see the problem by checking the contents of file $ORACLE_HOME/rdbms/admin/recover.bsq

  cursor bs(low_recid number, high_recid number) is
    select bs.recid, bs.stamp, bs.set_stamp, bs.set_count, bs.backup_type,
           bs.incremental_level,
           bs.pieces, start_time, completion_time, controlfile_included,
           bs.input_file_scan_only, keep_until,
           decode (bs.keep_options, 'LOGS'      , KEEP_LOGS
                               , 'NOLOGS'       , KEEP_NOLOGS
                               , 'BACKUP_LOGS'  , KEEP_CONSIST
                                                , 0) keep_options,
           bs.block_size, bs.multi_section, bs.guid,
           decode(bs.con_id, invalid_pdbid, 1, 0) dropped_pdb
    from v$backup_set bs, v$containers pdb
    where bs.recid between low_recid and high_recid
      and (bs.stamp >= kccdivts OR bs.recid = high_recid)
      and bs.stamp >= resyncstamp
      and bs.for_xtts != 'YES'
    order by bs.recid;

Note that the join has been missed between v$backup_set and v$containers(!). The next cursor in the source, bp, also has the same ommission.
You can work-around the issue by adding the following condition to the two cursors, but you should probably check with Oracle Support first.

    
 and bs.con_id = pdb.con_id(+) 

The latest issue is on a 11g (hence single tenant obviously) on a dataguard standby site (in this case we take backups from both primary and standby instances).
Tracing the resync of the catalog from the standby site, the issue begins when inserting to table bs (backup sets) in the recovery catalog.
The insert fails because of a violation the unique key BS_U2 on columns db_key, set_stamp and set_count.
Checking the recovery catalog there already is record for this combination of set_stamp and set_count, but from the primary instance.

My suspicion is that the unique key also needs to include site_key field, however I don’t have enough of an understanding of what the set_stamp, set_count combination represents to be sure.

The only reference to the fields I can find come quotes in the documentation such as the following:

The SET_STAMP value from V$BACKUP_SET. SET_STAMP and SET_COUNT form a concatenated key that uniquely identifies this record in the target database control file.

What does seem apparent is that the 12.2 recovery catalog is stricter in enforcing this uniqueness, thus revealing some bugs elsewhere in the codebase.

Instance crash when not restarted through clusterware after patching

We had some issues, fortunately on a test instance, after applying a database
patch. With the help of a colleague (thanks Kei) I was able to identify what had
gone wrong and setup a testcase to demonstrate the issue. I think it is worth
sharing as it reveals some of the complexities of the relationships between
the components of the Oracle stack.

My setup consists of two database instances on the same host configured in Oracle Restart. To start the patching process I stop both instances. Note in particular the group owner of oracle binary is asmadmin.

[oracle@hkexdb01 ~]$ srvctl stop database -db PVJA -o abort
[oracle@hkexdb01 ~]$ srvctl stop database -db PVJB -o abort
[oracle@hkexdb01 ~]$ ls -alrt $ORACLE_HOME/bin/oracle
-rwsr-s--x 1 oracle asmadmin 327575894 Jun 13 14:52 /u01/app/oracle/product/12.1.0.2/dbhome_2/bin/oracle

Next I apply database patch (I have omitted some details from the
patch session as they are not relevant). Observe that after patch completes,
oracle binary has been re-linked, and owner is now oinstall.

[oracle@hkexdb01 ~]$ cd ~/25397136
[oracle@hkexdb01 25397136]$ $ORACLE_HOME/OPatch/opatch apply
Oracle Interim Patch Installer version 12.2.0.1.9
Copyright (c) 2017, Oracle Corporation.  All rights reserved.


Oracle Home       : /u01/app/oracle/product/12.1.0.2/dbhome_2
Central Inventory : /u01/app/oraInventory
   from           : /u01/app/oracle/product/12.1.0.2/dbhome_2/oraInst.loc
OPatch version    : 12.2.0.1.9
OUI version       : 12.1.0.2.0
Log file location : /u01/app/oracle/product/12.1.0.2/dbhome_2/cfgtoollogs/opatch/opatch2017-06-13_14-58-15PM_1.log

Verifying environment and performing prerequisite checks...
OPatch continues with these patches:   24732088  25397136  

Do you want to proceed? [y|n]
y
User Responded with: Y
All checks passed.

Please shutdown Oracle instances running out of this ORACLE_HOME on the local system.
(Oracle Home = '/u01/app/oracle/product/12.1.0.2/dbhome_2')


Is the local system ready for patching? [y|n]
y
User Responded with: Y
Backing up files...
Applying sub-patch '24732088' to OH '/u01/app/oracle/product/12.1.0.2/dbhome_2'

--DETAILS OMITTED

Patching component oracle.rdbms.install.plugins, 12.1.0.2.0...
Composite patch 25397136 successfully applied.
Log file location: /u01/app/oracle/product/12.1.0.2/dbhome_2/cfgtoollogs/opatch/opatch2017-06-13_14-58-15PM_1.log

OPatch succeeded.
[oracle@hkexdb01 25397136]$ ls -alrt $ORACLE_HOME/bin/oracle
-rwsr-s--x 1 oracle oinstall 327813870 Jun 13 14:59 /u01/app/oracle/product/12.1.0.2/dbhome_2/bin/oracle

After patching is complete, I now restart PVJA outside of
clusterware (note the group owner of oracle binary is still oinstall).

[oracle@hkexdb01 25397136]$ export ORACLE_SID=PVJA
[oracle@hkexdb01 25397136]$ sqlplus / as sysdba

SQL*Plus: Release 12.1.0.2.0 Production on Tue Jun 13 15:00:54 2017

Copyright (c) 1982, 2014, Oracle.  All rights reserved.

Connected to an idle instance.

SYS@PVJA> startup
ORACLE instance started.

Total System Global Area 2147483648 bytes
Fixed Size                  2926472 bytes
Variable Size            1564297336 bytes
Database Buffers          570425344 bytes
Redo Buffers                9834496 bytes
Database mounted.
Database opened.
SYS@PVJA> exit
Disconnected from Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Advanced Analytics and Real Application Testing options
[oracle@hkexdb01 25397136]$ ls -alrt $ORACLE_HOME/bin/oracle
-rwsr-s--x 1 oracle oinstall 327813870 Jun 13 14:59 /u01/app/oracle/product/12.1.0.2/dbhome_2/bin/oracle

I then restart PVJB instance through clusterware. Note that clusterware
updates the group owner of the oracle binary.

srvctl start database -db PVJB 
ls -alrt $ORACLE_HOME/bin/oracle

[oracle@hkexdb01 25397136]$ srvctl start database -db PVJB 
[oracle@hkexdb01 25397136]$ ls -alrt $ORACLE_HOME/bin/oracle
-rwsr-s--x 1 oracle asmadmin 327813870 Jun 13 14:59 /u01/app/oracle/product/12.1.0.2/dbhome_2/bin/oracle
[oracle@hkexdb01 25397136]$ 

This change in group ownership ‘breaks’ the PVJA instance. There are lots of errors in the alert log.

Errors in file /u01/app/oracle/diag/rdbms/pvja/PVJA/trace/PVJA_j000_64968.trc:
ORA-27140: attach to post/wait facility failed
ORA-27300: OS system dependent operation:invalid_egid failed with status: 1
ORA-27301: OS failure message: Operation not permitted
ORA-27302: failure occurred at: skgpwinit6
ORA-27303: additional information: startup egid = 1001 (oinstall), current egid = 1006 (asmadmin)

Note the clusterware log seems to indicate the point at which this action took
place.

2017-06-13 14:55:47.233925 :CLSDYNAM:4045649664: [ora.pvjb.db]{1:31716:24755} [start] Utils:execCmd action = 1 flags = 6 ohome = /u01/app/12.1.0.2/grid cmdname = setasmgidwrap

This behaiour seems to match Bug 9784037 : SETASMGID CAUSING ORA-27303, which
was closed as “Not a Bug”. The provided workaround is “always remember to execute setasmgidwrap after doing relink all / patch apply in RDBMS Oracle Home”.
However I think just making sure that all instances are restarted through
clusterware is probably a better way to handle this.

Update, seems Frits Hoogland has run into this issue too, I now feel a bit better about my ‘mistake’.