Thursday, November 30, 2006

du and ls report very different file sizes for Oracle Tempfiles on NFS v3 mounted volume on NetApp filer under Solaris 9

We have a NetApp FAS filer attached via Gigabit Ethernet to a Sun V440 running Oracle databases whose files are stored on the filer. The NetApp volume is mounted via NFS version 3 at $ORACLE_BASE/oradata. Under that directory are seven directories: .snapshot, dctm, dchr, dctm_backup, dchr_backup, dctm_nobackup and dchr_nobackup.

.snapshot is where the NetApp filer stores snapshots of the volume.

The server runs two databases, dctm and dchr. The dctm and dchr directories hold the datafiles (and a small tempfile) for the appropriate database; *_backup holds the RMAN and cold backup sets for the appropriate database; *_nobackup holds the bulk of the tempfiles (temporary tablespace, used for sorting) for the appropriate database. When we do cold backups we don't bother to back up the tempfiles.

For dctm, dchr and *_backup, both du and ls report the same file sizes for both data and tempfiles. For *_nobackup, ls reports the correct sizes (as specified in the Oracle data dictionary) but du reports a much smaller figure, both for the individual files and for the directory total.

v$sort_segment.max_blocks * db_block_size approximates the total size reported by du for the tempfiles in *_nobackup.
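The comparison was a back-of-the-envelope calculation along these lines (a sketch only; joining to v$parameter for the block size is one way of doing it):

```sql
-- High-water mark of the sort segment, converted to bytes using the
-- database block size, to compare against the du total for *_nobackup.
SELECT ss.tablespace_name,
       ss.max_blocks * p.value AS max_sort_bytes
FROM   v$sort_segment ss,
       v$parameter   p
WHERE  p.name = 'db_block_size';
```

Remember that du -k reports kilobytes, so divide max_sort_bytes by 1024 before comparing.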

We know that ls gets file sizes by reading what is recorded in the inodes, or at least that's what the documentation says. The documentation for du doesn't specify how it gets its space measurement, but we've found pages on the web indicating that it counts allocated blocks. Normally these two methods should return the same result; the only exceptions are hard links (multiple directory entries pointing to the same file on the same filesystem) and sparse files. Neither directory contains any hard links, and the tempfiles do not conform to our understanding of sparse files because they are formatted at creation.
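The classic sparse-file case is easy to reproduce on most Unix systems, which is why we checked it before ruling it out. A minimal sketch (the file name is made up): seek past the end of a new file, write a single block, then compare the two tools:

```shell
# Write one 1KB block at the 100MB mark; the blocks before it are never
# written, so the filesystem need not allocate them.
dd if=/dev/zero of=/tmp/sparse_demo bs=1024 count=1 seek=102399 2>/dev/null

ls -l /tmp/sparse_demo   # size from the inode: 104857600 bytes
du -k /tmp/sparse_demo   # allocated blocks only: a few KB
```

This is exactly the ls/du divergence we see on *_nobackup, except that Oracle's tempfile formatting is supposed to prevent the files from being sparse in the first place.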

Our current hypothesis is that the formatting does not write to every block, and because of the way the NetApp filer handles block-level writes, it only bothers to allocate a block if that block is actually written to, while still recording the requested file size in the inode; blocks that have not yet been written to exist only in a virtual state (ghost blocks?). With a NetApp filer, blocks within the same file are not necessarily stored contiguously, and a block that is read, changed and rewritten will probably not be written back to the place it was originally read from. The Oracle buffer cache is likely to contain blocks from various segments, read from various tablespaces and many different places on the disks. When Oracle flushes out dirty blocks the NetApp filer stores them in its own cache. When the NetApp flushes its own cache it writes as many blocks as possible into the first free space it can find, regardless of where they came from, so you get blocks from various segments in various files written next to each other on the disk, almost certainly somewhere other than where they were read from. The original blocks are then freed. Or at least that's how they explained it to me when I asked.

We are trying to verify this with NetApp, but what we believe has happened here is that when the files were created the NetApp wrote the inodes with the correct size but only allocated disk blocks for those actually used (the formatting of the file, plus the temp segments written out between file creation and now); the remaining blocks are logically allocated but the physical blocks are still unallocated.

Whilst this no doubt improves the speed of creating the file, it would probably cause a performance hit on sorts when more sort space needs to be allocated: when blocks within the file are allocated to extents by Oracle, the filer has to allocate them at the filesystem level. We haven't noticed anything so far, so presumably any hit is masked by the caching; it may become relevant if we reach the point where the cache becomes a bottleneck.

We await a response from NetApp.

Incidentally, the reason we don't back up the tempfiles is that the OS/Storage team were complaining about how long the backups took and said we were at risk of blowing the backup window (we weren't). After investigating I found that someone had changed the second tempfile in each database to autoextend and they had grown to 64GB each. Quite surprising, as each database held under 5GB of data (including system stuff, i.e. data dictionary &c). After some investigation I found a lot of SQL was being run that was very unselective and returned very large datasets (over 10% of the total data), which would be sorted then sent to the application for display. When you have hundreds of users doing this concurrently you rapidly reach the point where many copies of the same data are being sorted at the same time; as Oracle cannot share sorts between sessions, each copy is in a different sort segment, especially when many of the queries use joins that require sorts. We removed the large tempfiles and replaced them with 16 x 2GB tempfiles in a separate directory for each database. These are not backed up. In a recovery situation where we cannot use the RMAN backups we take each night, we drop the missing tempfiles then recreate them.
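For the record, recreating a lost tempfile is trivial once the database is open. A sketch of the statements involved (the path, tablespace name and file name here are illustrative, not our real ones, and there are 16 such files per database):

```sql
-- Remove the control-file entry for the missing tempfile.
ALTER DATABASE TEMPFILE '/oradata/dctm_nobackup/temp02.dbf'
  DROP INCLUDING DATAFILES;

-- Recreate it at the standard 2GB size.
ALTER TABLESPACE temp
  ADD TEMPFILE '/oradata/dctm_nobackup/temp02.dbf'
  SIZE 2000M REUSE;
```

No data is lost because tempfiles never hold anything that must survive a restart; the sort segments are rebuilt as needed.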

The tempfiles in dctm and dchr are the originals and will have been filled before the second tempfile extended, hence all their blocks have been allocated at the filesystem level.

1 comment:

Bjørn said...

Stephen,
We're seeing the same thing on a SAN connected to an AIX server. Sorry, I don't have the SAN details, but ls and du do report very different sizes.