"hastenthunder" wrote in message news:[email protected]
>> I've read many documentations online stating to only create an index if
>> queries against this table frequently retrieve less than 15% of the rows.
>> However, if the query returns, say, 40% of the rows, wouldn't indexing the
>> column still help by cutting the work by roughly half?
A much *simplified* example on how I teach this stuff...
Let's say we have a table that has 10,000,000 rows which are
stored in 1,000,000 data blocks meaning we have approximately
10 rows per block on average.
Let's say we have an index on this table that has 100,000 leaf
blocks meaning we have on average approximately 100 leaf entries
per leaf block the index has 3 levels.
Let's also say we have an "effective" multi-block read capability
of 10 blocks per I/O (meaning Oracle will read 10 "consecutive"
blocks at a time on average during a full table scan multiblock
Finally, let's say we're interested in accessing *just* 10% of
the data (or 1,000,000 of the total 10,000,000 rows). Will
Oracle use the index or won't it ? Hopefully, I've picked an
easy set of numbers to help illustrate the answer ...
Firstly, to calculate the "cost" of using the index access path.
We need to read the root block + a branch block in order to get
to the first leaf block of interest. That's 2 logical I/Os
(LIOs). We then need to read approximately 10% of the leaf
blocks in order to get our 1,000,000 leaf entries required to
directly access our 1,000,000 rows of interest, that's 10% of
the 100,000 leaf blocks = 10,000 leaf blocks. Because we're
reading an index via a range scan and because the leaf blocks
are not (necessarily) physically co-related, Oracle must read
each leaf block via a single I/O. So that's 10,000 LIOs. So,
just to read the index alone, we require 2 + 10,000 = 10,002
Note by default, Oracle assumes the above "cost" to be physical
I/Os (PIOs). Now assuming this index is heavily accessed, a good
number of these index blocks may already be cached in memory.
The optimizer_index_caching parameter can be used to adjust the
above cost by suggesting that x% are actually already cached and
so are "cheaper" to access. To keep things simple, we'll assume
the default value of 0% or that no index blocks are actually likely
to be cached (generally not a wise assumption but let's keep the
To access the corresponding table blocks, again Oracle can only
perform these reads via a single block read as each index entry
points to a table block that contains it's specific table row .
Now we're after 1,000,000 rows which means we require 1,000,000
LIOs in order to access the required rows. Question is, how many
*different* table blocks do we need to access ? Well, this is
entirely dependent on the Clustering Factor (CF) of the index,
or how closely aligned are the corresponding rows in the table
in relation to the order of the index (which must be in the
order of the index values). In the "best" possible case, all the
required rows are all ordered and grouped together in the same
"collection" of table blocks meaning we only have to access 10%
of the 1,000,000 table blocks or 100,000 table blocks in a
roughly *consecutively* manner.
However, as is more common, if the required rows are randomly
and evenly distributed among the table blocks, then on average
we need to read 1 row (10%) from *each and every table block*.
Note in your case of wanting to access 40% of the data, we might
depending on a poor CF need to visit on average *each and every*
data block *4 times*. This is the key point (no pun intended).
The greater the number of differing blocks we access, then the
less likely we will find the block in memory from it being
previously read and the more likely that the block will need to
be read from disk (PIO). Oracle considers this and uses the CF
in it's costing calculations. Assuming a randomly distributed
set of required rows, note we will need to visit *all* the table
blocks on average because on average we are interested in 1 in
10 of the rows that each block contains (yes, some blocks may
not actually be visited and some may be visited a number of
times but with such volume of blocks, it conceivably might be a
significant duration between reads to the same block meaning it
could easily have been aged and be physically re-read anyways).
The point though is that it's 1,000,000 LIOs regardless, of
which a very significant number *could* be *actual distinct*
(or differing) blocks. So that's 10,002 for the index + 1,000,000
for the table = 1,010,002 LIOs to read *just* 10% of the data
via an index.
Now to calculate the "cost" of a FTS. A FTS has a number of
advantages over an index access path. Firstly, because we read
each block "consecutively" (kinda) Oracle can investigate the
appropriate selectiveness of each row within the block ensuring
that each table block is read just *once* (special blocks such
as extent maps withstanding). Secondly, again because each block
is read consecutively, Oracle can perform a multi-block read and
read multiple blocks within the one LIO. This is based on factors
such as db_file_multiblock_read_count, system statistics, OS I/O
characteristics, the caching characteristics of the table and the
"fudge-factor" that theOracle CBO applies in it's calculations.
For simplicity (and to keep the numbers really simple), assuming
an effective multi-block read of 10, we can read the entire table
in approximately 1,000,000 table blocks / 10 = 100,000 LIOs.
Note that although these are larger and potentially more "costly"
I/Os than the single block I/Os used by the index, Oracle assumes
by default that the actual cost of each type of I/O to be the
same. The optimizer_index_cost_adj parameter can be used to more
accurately estimate (if necessary) the relative cost of a single
block I/O to that of a FTS multi-block I/O. Again for simplicity,
we'll assume the default of 100 meaning that the cost of a single
block I/O is 100% (or the same) as a FTS I/O.
So, we now have our two comparative costings. The index access
has a rough cost of 1,010,002 and the FTS has a rough cost of
just 100,000. The FTS wins hands down.... Note for 40% of the
data, the relative costs would have been roughly 4,040,002 vs.
100,000. Even more hands down ...
The break-even point can now be calculated based on the above
criteria, some of which include:
* the selectivity of the query
* number of leaf blocks
* average number of leaf entries per leaf block
* height of index
* caching characteristics of index
* clustering factor of index
* number of table blocks (below HWM)
* average number of rows per block
* effective (or calculated) multi-block read
* caching characteristics of the table (which can influence the effective
* relative cost of a single block I/O vs. a multi-block I/O
* amount of row migration / row chaining (although the CBO is not so good
* parallelism (potentially a major factor)
So your assumption that reading 40% of rows would cut the work
by roughly half is not correct. In the example above, it would
actually cost about 40 times as much. In my long-winded manner,
I hope this makes some kind of sense and goes some way to
One final piece of advice. Ignore any writings or suggestions
that there is a magical break even point is x% (where x could
be 2% or 10% or 50% or whatever). Hopefully the above will hint
that there is *no* such percentage as it all depends on too many
factors. I can easily give you an example where an index is most
efficient when reading 0% of data and I can easily give you an
example where an index is most efficient when reading *100%* of
data (and *any* value in between). When one understands how the
CBO functions, one understands why such so-called rules of thumb
are a nonsense.