Thursday, November 30, 2006

du and ls report very different file sizes for Oracle Tempfiles on NFS v3 mounted volume on NetApp filer under Solaris 9

We have a NetApp FAS filer attached via Gigabit Ethernet to a Sun V440 running Oracle databases, the files for which are stored on the filer. The NetApp volume is mounted under NFS version 3 at $ORACLE_BASE/oradata. Under that directory are 7 directories: .snapshot, dctm, dchr, dctm_backup, dchr_backup, dctm_nobackup and dchr_nobackup.

.snapshot is where the NetApp filer stores snapshots of the volume.

The server runs two databases, dctm and dchr. The dctm and dchr directories hold the datafiles (and a small tempfile) for the appropriate database, *_backup holds the RMAN and cold backup sets for the appropriate databases. *_nobackup holds the bulk of the tempfiles (temporary tablespace, used for sorting) for the appropriate database (when we do cold backups we don't bother to backup the tempfiles).

For dctm, dchr and *_backup both du and ls report the same file sizes for both data and tempfiles. For *_nobackup ls reports the correct sizes (as specified in the Oracle data dictionary) but du reports a much smaller size for both the individual file size and the total for the directory.

v$sort_segment.max_blocks * db_block_size approximates to the total size reported by du for the tempfiles in *_nobackup.
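As a back-of-the-envelope check, the arithmetic is just a multiplication (the figures below are hypothetical, not the values from our system):

```python
# Hypothetical figures: an 8 KB Oracle block size and a made-up
# v$sort_segment.max_blocks value, purely to show the calculation.
db_block_size = 8192      # bytes per Oracle block
max_blocks = 131072       # hypothetical v$sort_segment.max_blocks

# Approximate space du should report for the tempfiles
approx_du_bytes = max_blocks * db_block_size
print(approx_du_bytes // (1024 * 1024), "MB")  # 1024 MB for these figures
```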

We know that ls gets file sizes by reading what is recorded in the inodes, or at least that's what the documentation says. The documentation for du doesn't specify how it gets its space measurement, but we've found pages on the web that indicate it does so by counting allocated blocks. Normally these two ways of getting the file size should return the same result, the only exceptions being hard links (multiple directory entries pointing to the same file on the same file system) and sparse files. Neither directory contains any hard links, and the tempfiles do not conform to our understanding of sparse files because they are formatted at creation.
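The ls/du discrepancy for sparse files is easy to demonstrate on a local filesystem that supports holes (a sketch run against an ordinary local filesystem, not the filer; the filename is made up, and whether blocks are actually left unallocated depends on the filesystem):

```python
import os
import tempfile

# Create a file with a hole: seek almost 10 MB in and write one byte.
# Most filesystems record the full size in the inode (what ls reports)
# but only allocate blocks for the data actually written (what du counts).
path = os.path.join(tempfile.mkdtemp(), "sparse.dat")
with open(path, "wb") as f:
    f.seek(10 * 1024 * 1024 - 1)
    f.write(b"\0")

st = os.stat(path)
print("ls-style size (st_size):   ", st.st_size)          # 10485760
print("du-style size (blocks*512):", st.st_blocks * 512)  # typically far smaller
```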

Our current hypothesis is that the formatting does not write to every block, and because of the way the NetApp filer handles block level writes it only allocates a block if that block is actually written to, while still recording the requested file size in the inode; blocks that have not yet been written to exist only in a virtual state (ghost blocks?). With a NetApp filer, blocks within the same file are not necessarily stored contiguously, and a block that is read, changed and rewritten will probably not be written back to the place it was originally read from. The Oracle buffer cache is likely to contain blocks from various segments read from various tablespaces and many different places on the disks. When Oracle flushes out dirty blocks the NetApp filer stores them in its own cache. When the NetApp flushes its own cache it will write as many blocks as possible into the first free space it can find, regardless of where they came from, so you get blocks from various segments in various files written next to each other on the disk, almost certainly somewhere other than where they were read from. The original blocks are then freed. Or at least that's how they explained it to me when I asked.

We are trying to verify this with NetApp, but what we believe has happened here is that when the files were created the NetApp wrote the inodes with the correct size but only allocated blocks on disk for the blocks that have actually been used (the formatting of the file and the temp segments that have been written out between file creation and now); the remaining blocks are logically allocated but the physical blocks are still unallocated.

Whilst this no doubt improves the speed of creating the file, it would probably cause a performance hit on sorts when more sort space needs to be allocated, as when blocks within the file are allocated to extents by Oracle the filer has to allocate them at filesystem level. We haven't noticed anything so far, so presumably any hit is masked by the caching; it may become relevant if we get to the point where the cache becomes a bottleneck.

We await a response from NetApp.

Incidentally, the reason we don't backup the tempfiles is that the OS/Storage team were complaining about how long the backups took and said we were at risk of blowing the backup window (we weren't). After investigating I found that someone had changed the second tempfile in each database to autoextend and they had grown to 64GB each. Quite surprising, as each database had under 5Gb of data (including system stuff, i.e. data dictionary &c). After some investigation I found a lot of SQL was being run that was very unselective and returned very large datasets (over 10% of the total data) which would be sorted then sent to the application for display. When you have hundreds of users doing this concurrently you rapidly reach the point where many copies of the same data are being sorted at the same time; as Oracle cannot share sorts between sessions, each copy is in a different sort segment, especially when many of the queries use joins that require sorts. We removed the large tempfiles and replaced them with 16*2Gb tempfiles in a separate directory for each database. These are not backed up. In a recovery situation where we cannot use the RMAN backups we take each night, we drop the missing tempfiles then recreate them.

The tempfiles in dctm and dchr are the originals and will have been filled before the second tempfile was extended, hence all their blocks have been allocated at the filesystem level.

Thursday, November 23, 2006

How people find this blog

I was just looking at my stat counter analysis for this blog; it's usually a good way to find out if anyone has linked to me. From the results it looks like the vast majority of people (around 97%) find this blog through a search engine, virtually all of them (over 99%) through Google. The most common search strings are or include "ORA-07445", "opmnctl failed to start process" and "ORA-01031".

Tuesday, November 21, 2006

Never volunteer

During his presentation on "Inside RAC", Julian Dyke appeared to suffer laptop difficulties, so he asked for 4 volunteers to assist in a demonstration. Being the helpful soul (i.e. idiot) that I am, I volunteered. Mr Dyke did seem remarkably well-prepared in terms of props, almost as if he had been expecting his laptop to have problems :-)!

Photographic evidence (courtesy of Connor McDonald) can be found here and here. I'm the fat, ugly, balding, bearded one on the far left, aka Node 1. Node 2 was Babbette Turner-Underwood, Node 3 (aka Master) was Murali Vallath and Node 4 was Carel-Jan Engel.

Sunday, November 19, 2006

Histogram problems

A question just appeared on the Oracle-L mailing list about whether there are any issues with using histograms. I've seen a number of presentations over the last couple of years that have warned about issues where you have histograms on highly skewed data and use bind variables (or have cursor_sharing=force, which synthesises bind variables from literals in queries), so I responded:
On 19/11/06, A Joshi wrote:
> Hi,
> About use of histograms : I think histograms are useful for indexes on
> columns that are not very selective. However I came across note 1031826.6 on
> metalink. About maintenance and space cost. I think space cost is negligible
> and can be ignored. About maintenance : does it mean statistics need to be
> gather often? Or does it mean some other cost.
> Question : Is there any other overhead or any other negative impact of using
> histograms?

One downside I've been hearing a lot about over the last couple of
years results from Bind Variable Peeking (link to the relevant Oracle Documentation). When a where clause contains bind variables (either explicitly declared or synthesised due to cursor_sharing=force) the optimizer can see what they are and use the information to build an execution plan
during the hard parse phase. If the application executes the same query again whilst the plan is still cached then it will reuse the same plan even if the bind variables have different values (no need to reparse so no peeking).

Suppose you have a table (t) where a field (thefield) has a number of possible values, one of which (say 'X') appears in 50% of the records whilst the others are uniformly distributed over the remaining 50%. Very highly skewed, and obviously you would expect to get a different plan for the high frequency value (probably a full table scan) than you would for a low frequency one, say 'Y' (probably an index read and access by rowid). You have an index on thefield and have collected stats with histograms on that table and thefield.

If you don't use bind variables then queries like:

select [field_list]
from t
where thefield='X';


select [field_list]
from t
where thefield='Y';

will both be hard parsed and get different execution plans appropriate to the values used in the where clause.

Now suppose you rewrite the query replacing the constants with a bind variable. The first time the query is parsed the optimizer will use the value of the bind variable and the histograms to work out the best execution plan. The next time it will reuse the same plan. This is fine if the values used in the bind variable are of similar distribution, but if they are not then you get issues.

Say the first time through the value is 'X' (appears in 50% of rows): the optimizer will peek the bind variable and probably go for a full table scan, as that is the most efficient way to pull back that proportion of the rows thanks to the wonders of multiblock reads and read-ahead caching in the filesystem/SAN. The second time through the value is 'Y' (appears in very few rows); because there is already an execution plan cached the optimizer won't do the hard parse and so won't peek the bind variable, it will just use the existing plan (a full table scan). A full table scan is a really inefficient way to read a small proportion of the rows in a table.

Reverse the order of the queries ('Y' first then 'X') and you have the database doing an index read then access by rowid to retrieve 50% of the rows in a table, hitting at least 50% of the index and at least 50% of the table blocks (probably as single block reads), a really inefficient way to retrieve a large proportion of the rows in a table.
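The caching behaviour above can be sketched as a toy model (purely illustrative; the 5% threshold, the data distribution and the plan names are made up, and this is not how the CBO is actually implemented):

```python
# Toy model of a shared plan cache keyed on SQL text: the plan is chosen
# by "peeking" at the bind value on the first (hard) parse, then reused
# unchanged on every later execution of the same SQL text.
row_fraction = {"X": 0.50, "Y": 0.001}  # made-up data distribution

plan_cache = {}

def execute(sql_text, bind_value):
    if sql_text not in plan_cache:
        # Hard parse: peek at the bind and pick a plan for *this* value.
        if row_fraction[bind_value] > 0.05:
            plan_cache[sql_text] = "FULL TABLE SCAN"
        else:
            plan_cache[sql_text] = "INDEX RANGE SCAN"
    return plan_cache[sql_text]  # soft parse: reuse the plan, no peeking

sql = "select * from t where thefield = :b1"
print(execute(sql, "X"))  # first call peeks 'X' -> FULL TABLE SCAN
print(execute(sql, "Y"))  # cached plan reused: still FULL TABLE SCAN
```

Run it with 'Y' first instead and both executions use the index plan, which is just as wrong for 'X'.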

> Is it advisable to use histograms just for some tables and some specific
> columns or is it OK to just set database wide?

It depends on your app and your data. If the data is highly skewed and the app uses bind variables (or you have cursor_sharing=force) then possibly not.

Hopefully that's OK.

Tuesday, November 14, 2006

UKOUG 2006 - Day 1

Today was the first day of the United Kingdom Oracle User Group 2006 conference.

The day itself was pretty good, lots of good talks (more on that later), but I ran into a couple of issues in the evening. The first issue was when I went into the Handmade Hamburger just across the canal from the ICC. I was alone, and the waitress/Maitre d' said they didn't have any free tables suitable for one person (looking around they had several free tables, just they were for 2, 4 or more people), so after expressing my dissatisfaction with the poor quality of service I went across Broad Street to Walkabout and had a Kangaroo Burger. The second issue was that I had hoped to go to the bloggers' meetup at All Bar One. Although I blog about Oracle I'm not sure if I qualify as a true Oracle Blogger, so when everyone disappeared upstairs to a private party (UKOUG had laid on some sort of 'Thank You' event for the volunteers) I didn't follow. I know I'm probably worrying over nothing but didn't want to risk my biggest memory of this year's conference being getting thrown out of an event for gate crashing. Hopefully by next year I'll have sorted out what the criteria are for access to such gatherings, and will fulfil the criteria.

Getting back to the day, the talks.

A number of the talks I attended dealt with bitmap indexes to a greater or lesser degree. A bitmap index (for those who don't yet know and don't want to wade through the guides) is one where there is a series of bits (the bitmap) for each possible value of a field, with a bit for each record in the table, so if there are 4 possible values and 10 records there will be 4 bitmaps of 10 bits each (apparently bitmap indexes compress runs of repeating identical values so the bitmaps would be smaller than that, but I've only heard that from one source and haven't been able to confirm it yet; anyhow, from a logical perspective there's still a bit for each record in each bitmap).

For example, suppose you have a people table which contains a gender column. The gender column can have 4 possible values (actually, depending on your (political/social) environment, it can have many more than that (at least 30 in some systems I know of) but that's a blog entry for another time): "Don't Know", "Refused to Divulge", "Female" and "Male". The difference between "Don't Know" and "Refused to Divulge" is that in the case of "Don't Know" we haven't asked, and for "Refused to Divulge" we asked but they wouldn't say; due to the nature of data privacy laws in the UK we cannot, except in very specific circumstances, compel an answer or put in what we believe the answer to be. I'll use a 1-letter code for each one in this example (D R F M); the letter at the start of each line is the field value, the string of digits is the value of the bits in the bitmap:
D 00000000000000000100
R 00001000001000000001
F 11100001100011111111
M 00010110010100000000

As you can see from this we have 1 record with "Don't Know", 3 with "Refused to Divulge", 13 with "Female" and 5 with "Male" in the indexed field.
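Representing each bitmap as a string of bits makes those counts easy to check (a quick sketch using the exact bit strings from the example above):

```python
# The four bitmaps from the example, one bit per record
# (the leftmost bit is the first record in the table).
bitmaps = {
    "D": "00000000000000000100",
    "R": "00001000001000000001",
    "F": "11100001100011111111",
    "M": "00010110010100000000",
}

for code, bits in bitmaps.items():
    # Counting the set bits gives the number of records with that value.
    print(code, bits.count("1"))
```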

An interesting point from the talks was that each presenter seemed to have a slightly different view on the criteria for when you use a bitmap index. They mostly agreed but there were slight variations in the detail and not all of them listed all the same criteria. Collating the criteria given ('bitwise and' :-)) gives something like:
  • Small number of possible values. The key question here is what constitutes a small number. 4 is a small number, but if you only have 5 records you might question whether the overhead of the bitmap is worth it; you might even question whether it's worth indexing the table at all (it might be, I recall attending a talk (by Connor McDonald IIRC) on the advantages of indexing very small tables). 100 isn't a particularly small number, but if you've got 10,000 records then it is in comparison. Assuming the compression of repeating identical values does happen then, given that statistically for each of those 100 bitmaps 99% of the bits will be zero, you should get a reasonable compression rate, as the worst case scenario is sequences of a 1 followed by 99 zeros. If the distribution is highly skewed (say there are 100 different possible values but a few values account for most of the records) then your compression rate might be far better. Like many things it's probably a case of test it and see; I'll hold off judgement till I've had a chance to test it.
  • The field doesn't tend to get updated or deleted. This applies to any type of index really: if you can, you want to avoid indexing fields that get updated (i.e. changed) or whose records get deleted, as this will cause extra IO (updating the index when the field is updated or the record is deleted) and cause your index to hold 'dead keys' (entries that used to point to a record but don't any more because the indexed field was updated or the record deleted), which waste space and thus can slow down IO on the index (Oracle has to read blocks which contain dead space). Sometime I really must look into how a bitmap index handles dead keys. If anyone else does, please comment here with a link to your results.
  • Field tends to be used a lot in where clauses. Again, this applies to any sort of index. Indexes generally slow down updates/inserts/deletes (something else to update) and speed up queries that use the indexed field in their where clause. If the field is rarely or never used in a where clause then the index is not used and is just wasting resources.
  • Field tends to be used in where clauses in combination with other bitmap indexable fields. Oracle can do bitwise operations (AND, OR &c) on the bitmaps, bitwise operations tend to be very fast (they're implemented at a very low level in the microcode in the processor). Suppose your people table in the example above also has fields for marital status, employment status and ethnicity. If you run queries with where clauses such as "gender='F' and marital_status='Married' and ethnicity='White British' and employment_status='Full Time Permanent'" (useful for Equal Opportunities monitoring) then Oracle can just load the appropriate bitmaps and do a bitwise AND to quickly identify all records that satisfy that clause.
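That last combination can be sketched with integers as bitmaps (the gender bitmap is the 'F' one from the earlier example; the marital_status bitmap is made up purely for illustration):

```python
# Two bitmaps over the same 20 records: gender = 'F' from the earlier
# example, plus an invented bitmap for marital_status = 'Married'.
female  = int("11100001100011111111", 2)
married = int("01010000100010101010", 2)  # made-up for the example

# A single bitwise AND finds every record satisfying both predicates.
both = female & married
# Record i (0-based, leftmost bit first) corresponds to bit 19 - i.
matches = [i for i in range(20) if both & (1 << (19 - i))]
print(f"{both:020b}")
print("matching record numbers:", matches)
```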
I'm sure I'll have forgotten some but once the presentations are up on the web I'll read them over again along with my notes and fill in the blanks.

One surprise during the day was that I ran into John Ryland, a DBA from Capita currently seconded to Service Birmingham (where I work); I hadn't known he was going to be there. Service Birmingham is a joint venture company between Birmingham City Council and the Capita Group, formed in April 2006, which provides ICT and Business Transformation services to Birmingham City Council. The majority of the staff are secondees from Birmingham City Council (mostly from the Business Solutions and IT department) with some secondees from Capita and TUPE transferees from the council and other bodies. John hadn't been aware that Service Birmingham had a membership or that Birmingham City Council had had a membership; apparently when Capita asked the UKOUG (as part of their due diligence checking) the UKOUG denied we had a membership.

Of the presentations that I attended today, the one that probably grabbed my attention and sparked my interest the most was "Accountability for System Performance (introducing Six Sigma quality in Oracle software performance)" from Cary Millsap. What I picked up from it was that everything should be instrumented, in every layer, and every transaction (that is, from when the user clicks a button or link to when they get the result, not just a database transaction) should measure and record how long it took. When I get back to work next week I think I might propose that we put into our standards for applications (those we procure externally (most of them) and the ones we develop internally (a few)) that the application must be instrumented to record time based stats at every level. Of course the 'Holy Grail' would be to tag every transaction with a unique ID that would be resolvable at every level (a lot of our apps are browser-apps server-database-SAN/NAS or client-apps server-database-SAN/NAS) and have each level record start/end times and as many other time based statistics as possible (if the application does three things before it contacts the database and 2 things with the results it gets back then I want to know how long each of the 5 things took and what it did at each step). Over the years the vast majority of the performance problems I've seen have been the result of slow steps outside the database (e.g. a user sees a transaction taking a minute but only half a second of that is in the database; the slow step is in the client, apps server or simply network delay, and invisible to the database instrumentation. You can pick up some storage related information from the database instrumentation, but having instrumentation in that layer would be much better).
If the application had been properly instrumented then solving the problem would have been a lot quicker (if only because it would have saved a lot of time arguing over where the problem was; saying it's not in the database is one thing, but being able to say where it is, with hard evidence to back it up, is much better). After the session I spoke with Cary about tacking instrumentation on to uninstrumented apps and he suggested a few UNIX tools that would be useful. I'm going to be in Tom Kyte's session on "Instrumentation 101" tomorrow, maybe I'll pick up more then.
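A minimal sketch of the per-step, tagged timing I have in mind (the step names, the sleeps standing in for real work, and the transaction ID scheme are all invented for illustration):

```python
import time
import uuid

def run_instrumented(steps):
    """Run a list of (name, fn) steps, recording the elapsed time of each,
    all tagged with one transaction ID so that records written by every
    layer can be correlated later."""
    txn_id = uuid.uuid4().hex  # unique ID for the whole transaction
    timings = []
    for name, fn in steps:
        start = time.perf_counter()
        fn()
        timings.append((txn_id, name, time.perf_counter() - start))
    return timings

# Example: three client-side steps around a (simulated) database call.
timings = run_instrumented([
    ("build_query", lambda: time.sleep(0.01)),
    ("db_call",     lambda: time.sleep(0.02)),
    ("render",      lambda: time.sleep(0.01)),
])
for txn_id, name, elapsed in timings:
    print(f"{txn_id} {name} {elapsed:.3f}s")
```

In a real system each layer would write its timings to its own log, and the shared transaction ID is what lets you say which of the steps ate the user's minute.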

I had intended to talk more about the other sessions but it's after midnight and I have to be up in a little over 5 hours so they'll just have to wait.