Sunday, November 19, 2006

Histogram problems

A question just appeared on the Oracle-L mailing list about whether there are any issues with using histograms. I've seen a number of presentations over the last couple of years that have warned about issues where you have histograms on highly skewed data and use bind variable (or have cursor_sharing=force, which synthesises bind variable from literals in queries), so I responded:
On 19/11/06, A Joshi wrote:
> Hi,
> About use of histograms : I think histograms are useful for indexes on
> columns that are not very selective. However I came across note 1031826.6 on
> metalink. About maintenance and space cost. I think space cost is negligible
> and can be ignored. About maintenance : does it mean statistics need to be
> gather often? Or does it mean some other cost.
>
> Question : Is there any other overhead or any other negative impact of using
> histograms?

One downside I've been hearing a lot about over the last couple of
years results from Bind Variable Peeking (link to the relevant Oracle Documentation). When a where clause contains bind variables (either explicitly declared or synthesised due to cursor_sharing=force) the optimizer can see what they are and use the information to build an execution plan
during the hard parse phase. If the application executes the same query again whilst the plan is still cached then it will reuse the same plan even if the bind variables have different values (no need to reparse so no peeking).


Suppose you have a table (t) where a field (thefield) has a number of possible values, one of which (say 'X') appears in 50% of the records whilst the others are uniformly distributed over the remaining 50%. Very highly skewed and obviously you would expect to get a different plan for the high frequency value (probably a full table scan) than you would for a low frequency one say 'Y' (probably an index read anbd access by row id). You have an index on thefield and have collected stats with histograms on that table and thefield.

If you don't use bind variables then a queries like:

select [field_list]
from t
where thefield='X';


and

select [field_list]
from t
where thefield='Y';


will both be hard parsed get different execution plans appropriate to the values used in the where clause.

Now suppose you rewrite the query replaciong the constants with a bind variable. The first time the query is parsed the optimizer will use the value of the bind variable and the histograms to work out the best execution plan. The next time it will reuse the same plan. This is fine if the values used in the bind variable are of similar
distribution but if they are not then you get issues.

Say the first time through the value is 'X' (appears in 50% of rows) the optimizer will peek the bind variable and probably go for a full table scan as that is the most efficient way to pull back that proportion of the rows thanks to the wonders of multiblock reads and read ahead caching in the filesystem/SAN. The second time through the value is 'Y' (appears in very few rows), because there is already an
execution plan cached the optimizer won't do the hard parse and so won't peek the bind variable, it will just use the existing plan (a full table scan). A full table scan is a really inefficient way to read a small proportion of the rows in a table.

Reverse the order of the queries ('Y' first then 'X') and you have the database doing an index read then access by rowid to retrieve 50% of the rows in a table. Hitting at least 50% of the index and at least 50% of the table blocks (probably as single block reads), a really inefficient way to a large proportion of the rows in a table.



>
> Is it advisable to use histograms just for some tables and some specific
> columns or is it OK to just set database wide?
>

It depends on your app and your data. If the data is highly skewed and the app uses bind variables (or you have cursor_sharing=force) then possibly not.


Hopefully that's OK.

No comments: