Oracle – ANALYZE TABLE..VALIDATE STRUCTURE Runs Forever

corruption | oracle | oracle-10g | performance

So, as I've mentioned before, I am a SQL Server DBA/consultant who is helping a customer with their Oracle databases. Mostly I've been able to figure out what I need and work through things with the help of the docs and Google, but I am currently dealing with something that seems inscrutable.

One of the databases is an Oracle 10g LIMS database that has been around (and upgraded) for at least 10 years (maybe 20). It's a critical app, but they've had numerous reliability issues with it over the past couple of years. To get a handle on what's really wrong with it, and what we might be able to do about it, we had the hosting/managed-services provider make a physical copy of it on another (identical) server. (I am pretty sure the provider interpreted that as "RMAN backup and restore".)

The idea being that I could then do whatever analysis, investigation, and repair attempts I wanted, with any tools, without any danger of impacting production. So far, so good. I ran DBVerify (dbv) on all of the data files: no problems.
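For reference, a dbv run looks roughly like this (the data file path below is a placeholder, not the actual file name; blocksize must match db_block_size, 8192 here):

```
dbv file=D:\oradata\LIMS\sqllims_results01.dbf blocksize=8192
```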

Then I ran ANALYZE TABLE .. VALIDATE STRUCTURE CASCADE ONLINE; on every table in the database. For two of the tables, the command took much longer than the other ~2500 tables combined, so I cancelled those two and let the rest finish, which they did with no problems and no reported errors.
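(For the record, I generated the per-table statements from the dictionary rather than typing them; a sketch of the generator, assuming the single O$LIMS schema, is:)

```sql
-- Emit one ANALYZE statement per table; spool/run the output as a script.
SELECT 'ANALYZE TABLE ' || owner || '.' || table_name
       || ' VALIDATE STRUCTURE CASCADE ONLINE;' AS cmd
  FROM dba_tables
 WHERE owner = 'O$LIMS'
 ORDER BY table_name;
```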

These two tables, as it turns out, are also the biggest application data tables. So I started with the largest table (23 GB, 36M rows) and decided to analyze it in parts (without the CASCADE): first the table itself, then the indexes. The table finished in 30-60 minutes (I do not remember exactly), but the ANALYZE INDEX .. VALIDATE STRUCTURE ONLINE; for the first index never finishes.

I decided to let it run for a while and to see if I could find a way to monitor its progress. Googling this, I found many (old) claims that ANALYZE TABLE can be monitored via v$session_longops, but it never showed anything for my command, and I came to the conclusion that this must work only for the statistics-gathering function, not the VALIDATE function. After trying a bunch of things, I eventually found that by looking at the v$session row for the executing command, I could use the P1/P2 wait columns to track which file and block were being read.

This showed me that it was indeed reading from the expected data files, and that it was reading different blocks over time. That, plus the fact that disk utilization on the server was at 100% while the command ran (and nothing else was running), convinced me that it was really doing something and was not blocked/locked/hung. So I just let it run.
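The monitoring query I used was something like the following. For the I/O wait events ('db file sequential read' / 'db file scattered read'), P1 is the absolute file number and P2 the block number, which dba_extents can map back to a segment (the SID 123 is a placeholder for the session running the ANALYZE):

```sql
SELECT s.event, s.p1 AS file#, s.p2 AS block#,
       e.owner, e.segment_name, e.segment_type
  FROM v$session s, dba_extents e
 WHERE s.sid = 123
   AND e.file_id = s.p1
   AND s.p2 BETWEEN e.block_id AND e.block_id + e.blocks - 1;
```

(The dba_extents lookup itself can be slow on a database with many extents, so I only ran it occasionally.)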

It has now been running for over 3 days (72+ hours), it shows no signs of finishing, and I am kind of at a loss here.

Additional info:

The table's DDL:

-- Unable to render TABLE DDL for object O$LIMS.N__RESULTS with DBMS_METADATA attempting internal generator.
CREATE TABLE O$LIMS.N__RESULTS 
(
  SAMPLE_ID NUMBER(10, 0) NOT NULL 
, SUBMISSION_ID NUMBER(10, 0) NOT NULL 
, RESULT_ID NUMBER(10, 0) NOT NULL 
, RESULT_VERSION NUMBER(3, 0) NOT NULL 
, TASK_ID NUMBER(10, 0) NOT NULL 
, TASK_REPETITION NUMBER(3, 0) NOT NULL 
, TASK_VERSION NUMBER(3, 0) NOT NULL 
, REQUIRED VARCHAR2(1 BYTE) NOT NULL 
, METHOD_DATAGROUP VARCHAR2(40 BYTE) NOT NULL 
, COMPONENT VARCHAR2(40 BYTE) NOT NULL 
, MEASURE VARCHAR2(40 BYTE) NOT NULL 
, UNITS VARCHAR2(40 BYTE) NOT NULL 
, STATUS VARCHAR2(20 BYTE) NOT NULL 
, PLANNED_RESULT VARCHAR2(3 BYTE) NOT NULL 
, RESULT_ORIGIN VARCHAR2(1 BYTE) NOT NULL 
, CONDITION VARCHAR2(20 BYTE) NOT NULL 
, CONDITION_LEVEL VARCHAR2(20 BYTE) 
, VALUE_TYPE VARCHAR2(20 BYTE) NOT NULL 
, NUMBER_VALUE NUMBER 
, TEXT_VALUE VARCHAR2(80 BYTE) 
, TIME_VALUE DATE 
, REASON VARCHAR2(40 BYTE) 
, INLIMIT VARCHAR2(3 BYTE) 
, INDETECTION VARCHAR2(3 BYTE) 
, INSPEC VARCHAR2(3 BYTE) 
, ENTRY_USERID VARCHAR2(20 BYTE) 
, ENTRY_DATE DATE 
, SPEC_ID NUMBER(10, 0) 
, SPEC_VERSION NUMBER(3, 0) 
, DETECTION_ID NUMBER(10, 0) 
, DETECTION_VERSION NUMBER(3, 0) 
, LIMIT_ID NUMBER(10, 0) 
, LIMIT_VERSION NUMBER(3, 0) 
, CUSTOMER_DATAGROUP VARCHAR2(40 BYTE) 
, ANALYST VARCHAR2(20 BYTE) 
, REPORT VARCHAR2(3 BYTE) 
, TIMESTAMP DATE 
, USERSTAMP VARCHAR2(20 BYTE) 
, MEASURE_LINK O$LIMS.N_UT_MEASURE 
, RESULT_PLAN_LIST_LINK O$LIMS.N_UT_RESULT_PLAN_LIST 
, TEXT O$LIMS.N_TT_TEXT 
, ATTRIBUTES O$LIMS.N_TT_ATTRIBUTES 
, SEQUENCE NUMBER(4, 0) 
, CONSTRAINT N_C_RESULTS_1 PRIMARY KEY 
  (
    RESULT_ID 
  , RESULT_VERSION 
  )
  USING INDEX 
  (
      CREATE UNIQUE INDEX O$LIMS.N_C_RESULTS_1 ON O$LIMS.N__RESULTS (RESULT_ID ASC, RESULT_VERSION ASC) 
      LOGGING 
      TABLESPACE "SQLLIMS_INDEX" 
      PCTFREE 10 
      INITRANS 2 
      STORAGE 
      ( 
        INITIAL 311296 
        NEXT 1048576 
        MINEXTENTS 1 
        MAXEXTENTS UNLIMITED 
        BUFFER_POOL DEFAULT 
      ) 
      NOPARALLEL 
  )
  ENABLE 
) 
LOGGING 
TABLESPACE "SQLLIMS_RESULTS" 
PCTFREE 25 
INITRANS 1 
STORAGE 
( 
  INITIAL 566231040 
  NEXT 1048576 
  MINEXTENTS 1 
  MAXEXTENTS UNLIMITED 
  BUFFER_POOL DEFAULT 
) 
NOCOMPRESS 
NOPARALLEL 
NESTED TABLE TEXT STORE AS RESULT_TEXT RETURN AS VALUE  
NESTED TABLE ATTRIBUTES STORE AS RESULT_ATTRIBUTES RETURN AS VALUE

The DDL of the index currently being analyzed:

-- Unable to render INDEX DDL for object O$LIMS.SYS_C0010496 with DBMS_METADATA attempting internal generator.
CREATE UNIQUE INDEX O$LIMS.SYS_C0010496 ON O$LIMS.N__RESULTS (ATTRIBUTES ASC) 
LOGGING 
TABLESPACE "SQLLIMS_RESULTS" 
PCTFREE 10 
INITRANS 2 
STORAGE 
( 
  INITIAL 65536 
  NEXT 1048576 
  MINEXTENTS 1 
  MAXEXTENTS UNLIMITED 
  BUFFER_POOL DEFAULT 
) 
NOPARALLEL

My questions:

  1. Is it reasonable for the ANALYZE INDEX to take this long when the ANALYZE TABLE took at most an hour?

  2. What are my options at this point? Assuming I kill this, how can I proceed with checking for (logical) corruption, etc.? Is there something I can do to get this command to finish in a reasonable time, or is there some other tool or approach I can use instead?


Version, etc. information:

  • Host: Windows 2003 server, sp2
  • Oracle version: 10g r2 ps3 10.2.0.4.0
  • Compatibility: 10.2.0.3.0

Here are the relevant SGA parameters (I think):

  • sga_target: 7012876288 (6688M)
  • sga_max_size: 10015997952 (9552M)
  • db_block_size: 8192
  • db_cache_size: 2885681152 (2752M)
  • db_*k_cache_size: all are zero

Best Answer

I think the article Raj quoted (https://www.pythian.com/blog/analyze-index-validate-structure-dark-side/) describes this pretty well. "Clustering factor" was also my first guess while reading the description of your problem. I also prefer to use RMAN to check for corruption:

RMAN> backup check logical validate database;

Afterwards you can query V$DATABASE_BLOCK_CORRUPTION for details on any corrupt blocks.
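For example:

```sql
-- One row per range of blocks RMAN found corrupt (empty = no corruption found).
SELECT file#, block#, blocks, corruption_change#, corruption_type
  FROM v$database_block_corruption;
```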


Normally there is a reason for huge clustering. You can check the following details:

  • How many rows does the index have?
  • How many blocks does it consume?

This may help you determine whether the clustering_factor is high. The clustering_factor itself is in dba_indexes; dba_extents (or dba_segments) tells you how many blocks are actually allocated.

  • What type of segment space management does the tablespace use? I'd guess it's not ASSM, since the database is pretty old and has been upgraded several times.
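All three checks can be done with one dictionary query along these lines (owner and index name taken from the question):

```sql
-- num_rows/leaf_blocks/clustering_factor come from optimizer statistics
-- in dba_indexes (so they are only as fresh as the last stats gathering);
-- dba_segments shows the allocated blocks; dba_tablespaces shows whether
-- the tablespace uses AUTO (ASSM) or MANUAL segment space management.
SELECT i.num_rows, i.leaf_blocks, i.clustering_factor,
       s.blocks AS allocated_blocks,
       t.segment_space_management
  FROM dba_indexes     i
  JOIN dba_segments    s ON s.owner = i.owner
                        AND s.segment_name = i.index_name
  JOIN dba_tablespaces t ON t.tablespace_name = i.tablespace_name
 WHERE i.owner = 'O$LIMS'
   AND i.index_name = 'SYS_C0010496';
```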

Why is clustering bad for performance? When Oracle reads data from disk or cache, it always reads whole blocks. If a block is half empty, you lose 50% of your read throughput. When you do a full table or full index scan, Oracle scans every block belonging to the segment (index or table), from the first block up to the high-water mark (HWM); it does not check whether a block is empty. If your index occupies 10 million blocks but only needs 1 million, Oracle reads 9 million blocks of junk.
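You can measure how sparse an index really is: after ANALYZE INDEX ... VALIDATE STRUCTURE completes, the session-private view INDEX_STATS holds one row for that index (it is overwritten by the next ANALYZE in the same session). A sketch, using the primary-key index from the question since the problem index never finishes:

```sql
ANALYZE INDEX O$LIMS.N_C_RESULTS_1 VALIDATE STRUCTURE;

-- High del_lf_rows relative to lf_rows, or a low pct_used,
-- indicates a sparse index worth rebuilding.
SELECT name, height, lf_rows, del_lf_rows,
       btree_space, used_space,
       ROUND(used_space / btree_space * 100) AS pct_used
  FROM index_stats;
```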

ASSM (Automatic Segment Space Management) helps a lot to reduce/prevent this. If possible, you should migrate manually segment-space-managed tablespaces to ASSM tablespaces.


Why check for corruption at all? I always start analyzing a database by gathering AWR or Statspack reports from the production database (not a test/dev copy). This does not hurt anyone and will give you a lot of detail about the database and its little secrets. Corruption does not usually affect performance, since the database does not try to repair it -- it just fails the query that hits it.