FAQ - Berkeley DB


Last Updated: July 26, 2007

General

· Is a Berkeley DB database the same as a SQL "table"?
· How can I architect a multi-process Berkeley DB application without a separate monitor process?
· How can I associate application information with a database or database environment handle?
· What is the ECCN (Export Control Classification Number) for Berkeley DB?

Installation and Build

· Why do I get a compilation error on Windows when I create a project using MFC or anything that includes oledb.h?
· Why does the Windows Release build fail with Visual C++ version 6?

Troubleshooting

· What do I do when I run out of lockers / locks / lock objects?
· Berkeley DB returned error number 12 or 22. What does that mean?
· I'm seeing database corruption when I run out of disk space.
· I'm seeing database corruption when creating multiple databases in a single physical file.
· It takes a long time for my application to exit, and I can hear the disk spinning.
· Why do I get a "unable to initialize mutex: Function not implemented" error?
· Why do I get memory faults after allocating memory in a callback on Windows?
· Why do I get reports of uninitialized memory reads and writes when running software analysis tools (for example, Rational Software Corp.'s Purify tool)?

Databases and Access Methods

· Berkeley DB operations are returning errors like EINVAL and I'm using the API correctly.
· Are Berkeley DB databases portable between architectures with different integer sizes and different byte orders?
· How do I find out the size of a physical database file?
· Is there a way to get a count of the number of keys in a database without iterating through a cursor?
· What's the fastest way to check for the existence of a key/data pair?
· Is there any way to compact databases and return unused database pages to the filesystem?
· Why can't I get the compact method to release emptied pages back to the filesystem when using the DB_FREE_SPACE flag?
· How do I perform a custom sort of secondary duplicates?

Database Environments

· Will checkpointing my database environment block access to my databases?
· A transactional database environment is hanging, and no threads of control are making progress.
· Locks are accumulating, or threads and/or processes are deadlocking in a transactional environment, even though there is no concurrent access to the database.
· A transactional database environment cannot be recovered or normal database operations fail with messages that "LSN" values are past the end of the log.
· A transactional application is seeing an inordinately high number of deadlocks.

Transactions

· Berkeley DB occasionally returns the error: "Unable to allocate memory for transaction detail". What does that mean?
· How do I use the DB_AUTO_COMMIT flag with a database cursor?

Filesystems

· Can Berkeley DB use NFS, SAN, or other remote/shared/network filesystems for databases and their environments?
· Can Berkeley DB store data to raw disk partitions?
· What Linux filesystem is best for Berkeley DB database storage?

Performance

· I'm using integers as keys for a Btree database, and even though the key/data pairs are entered in sorted order, the page-fill factor is low.
· Is there any way to avoid double buffering in the Berkeley DB system?


Is a Berkeley DB database the same as a SQL "table"?

Yes; "tables" are databases, "rows" are key/data pairs, and "columns" are application-encapsulated fields. The application must provide its own methods for accessing a specific field, or "column" within the data value.
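As a minimal sketch of what "application-encapsulated fields" can look like in C, assume a hypothetical fixed-layout row. The names employee_row, row_pack and row_get_salary are illustrations only and are not part of Berkeley DB. The packed bytes are what the application would store as the data value, and the accessor plays the role of reading one "column":

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical "row" layout: the application, not Berkeley DB, defines it. */
struct employee_row {
    uint32_t id;        /* "column" 1 */
    uint32_t salary;    /* "column" 2 */
    char     name[32];  /* "column" 3, NUL-terminated */
};

/* Pack a row into the byte buffer that would be stored as the data value.
 * The buffer must hold at least sizeof(struct employee_row) bytes.
 * Returns the number of bytes written. */
size_t row_pack(const struct employee_row *row, unsigned char *buf)
{
    memcpy(buf, row, sizeof(*row));
    return sizeof(*row);
}

/* Application-provided "column" accessor: read the salary field straight
 * out of a stored data value without unpacking the whole row. */
uint32_t row_get_salary(const unsigned char *buf)
{
    uint32_t salary;

    memcpy(&salary, buf + offsetof(struct employee_row, salary),
        sizeof(salary));
    return salary;
}
```

Copying a struct verbatim ties the stored bytes to one compiler's padding and one machine's byte order; an application that moves its data between architectures would define an explicit field encoding instead.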


How can I architect a multi-process Berkeley DB application without a separate monitor process?

See the "Handling failure in Transactional Data Store applications" and "Architecting Transactional Data Store applications" sections of the Berkeley DB Reference Guide for more information.


How can I associate application information with a database or database environment handle?

In the C API, the DB and DB_ENV handles each contain an "app_private" field intended to be used to reference application-specific information. See the db_create and db_env_create documentation for more information.

In the C++ or Java APIs, the easiest way to associate application-specific data with a handle is to subclass the Db and DbEnv handles, for example subclassing Db to get MyDb. Objects of type MyDb will still have the Berkeley DB API methods available on them, and you can put any extra data or methods you want into the MyDb class. If you are using "callback" APIs that take Db or DbEnv arguments (for example, the Db.set_bt_compare method), these will always be called with the Db or DbEnv objects you create. So if you always use MyDb objects, you will be able to take the first argument to the callback function and cast it to a MyDb (in C++, cast it to (MyDb*)). That will allow you to access your data members or methods.


What is the ECCN (Export Control Classification Number) for Berkeley DB?

As an open source product, DB does not fall under standard export controls. However, DB does optionally include strong cryptographic support; export/import and/or use of cryptography software, or even communicating technical details about cryptography software, is illegal in some parts of the world. You are strongly advised to pay close attention to any export/import and/or use laws which apply to you when you import a release of Berkeley DB including cryptography to your country or re-distribute source code from it in any way. If this is a concern, we recommend using versions of DB that do not include strong cryptographic support.


Why do I get a compilation error on Windows when I create a project using MFC or anything that includes oledb.h?

Berkeley DB's header file db.h and Microsoft's header file oledb.h both define the symbol DBTYPE. Unfortunately, changing either use of this symbol would break existing code.

The first and simplest solution to this problem is to organize your source code so that only one of these two header files is needed in any given source file. In other words, separate the uses of Berkeley DB from the uses of Microsoft's OLE DB library, and include either db.h or oledb.h in each source file, never both.

If that is not possible, and you have to mix both headers, wrap one of the #include lines as follows.

Find the line where oledb.h is included in your source code. This may be in the automatically-generated stdafx.h include file. Decide whether that header file is really needed. If it is, change the include line from this:


#include <oledb.h>

to this:


/* Work around DBTYPE name conflict with Berkeley DB */
#define DBTYPE MS_DBTYPE
#include <oledb.h>
#undef DBTYPE

Then if you need to use Microsoft's DBTYPE, refer to it as MS_DBTYPE.

Alternatively, for C applications, you can wrap the include of db.h in a similar way.

You cannot wrap db_cxx.h using this technique. If you are using the C++ interface to Berkeley DB, you need to either avoid mixing oledb.h with db_cxx.h or wrap the include of oledb.h as described above.


Why does the Windows Release build fail with Visual C++ version 6?

The Visual C++ project files included in Berkeley DB 4.3.29 have a bug when used with Visual C++ version 6. The symptom is that at link time, the following error is generated:

Linking...
   Creating library Release/libdb43.lib and object Release/libdb43.exp
mut_win32.obj : error LNK2001: unresolved external symbol __imp__SetSecurityDescriptorDacl@16
mut_win32.obj : error LNK2001: unresolved external symbol __imp__InitializeSecurityDescriptor@8
Release/libdb43.dll : fatal error LNK1120: 2 unresolved externals
Error executing link.exe.

This release requires an additional library at link time called advapi32.lib. Although the library is included in the project file, it is not recognized by Visual C++ version 6 (versions 7.0 and higher do not have this problem).

The solution is to add this library manually to the build of the db_dll target:

  1. In Visual C++, right-click on the db_dll target and choose Settings...
  2. Select the Link tab
  3. In the "Object/library modules" field, add advapi32.lib.


What do I do when I run out of lockers / locks / lock objects?

The Berkeley DB environment keeps memory for a fixed number of lockers, locks and lock objects -- so it is always possible to run out of these resources.

The maximum amount of lock resources to be allocated is set when the database environment is created, so to change this number, you will need to increase the number of lockers, locks and/or lock objects and re-create your environment.

See the "Configuring locking: sizing the system" section of the Berkeley DB Reference Guide for more information.


Berkeley DB returned error number 12 or 22. What does that mean?

The application is calling the Berkeley DB API incorrectly or configuring the database environment with insufficient resources.

The Berkeley DB library outputs a verbose error message whenever it is about to return a general-purpose error, or throw a non-specific exception. Whenever it is not clear why an application call into Berkeley DB is failing, the first step is always to review the verbose error messages, which will almost always explain the problem.

See the "Run-time error information" section of the Berkeley DB Reference Guide for more information.

It's also useful to know how Berkeley DB divides up the error name space: Except for the historic dbm, ndbm, and hsearch interfaces, Berkeley DB does not use the global variable errno to return error values. The return values for all Berkeley DB functions are grouped into the following three categories:

A return value of 0 indicates the operation was successful.

A return value that is greater than 0 indicates that there was a system error. The errno value returned by the system is returned by the function; for example, when a Berkeley DB function is unable to allocate memory, the return value from the function will be ENOMEM.

A return value that is less than 0 indicates a condition that was not a system failure, but was not an unqualified success, either. For example, a routine to retrieve a key/data pair from the database may return DB_NOTFOUND when the key/data pair does not appear in the database; as opposed to the value of 0, which would be returned if the key/data pair were found in the database.

All error values defined by Berkeley DB itself are less than 0 in order to avoid conflict with possible values of errno. Specifically, Berkeley DB reserves all values from -30,800 to -30,999 to itself as possible error values. There are a few Berkeley DB interfaces where it is possible for an application function to be called by a Berkeley DB function and subsequently fail with an application-specific return. Such failure returns will be passed back to the function that originally called a Berkeley DB interface. To avoid ambiguity about the cause of the error, error values separate from the Berkeley DB error name space should be used.

Finally, you can always use the db_strerror function to get the message string associated with the error number that Berkeley DB returns. The db_strerror function is a superset of the ANSI C X3.159-1989 (ANSI C) strerror(3) function. If the error number is greater than or equal to 0, then the string returned by the system function strerror(3) is returned. If the error number is less than 0, an error string appropriate to the corresponding Berkeley DB library error is returned.
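The return-value categories described above can be sketched as a small classifier. The enum and function names are hypothetical and only illustrate the ranges stated in this answer; real code would normally pass the value to db_strerror to obtain a message.

```c
#include <errno.h>

/* Hypothetical categories matching the grouping described in the text. */
enum ret_kind {
    RET_SUCCESS,      /* 0 */
    RET_SYSTEM,       /* greater than 0: an errno value such as ENOMEM */
    RET_BERKELEY_DB,  /* inside the reserved range -30,800 to -30,999 */
    RET_APPLICATION   /* any other negative value, e.g. from a callback */
};

/* Classify a value returned by a Berkeley DB function. */
enum ret_kind classify_return(int ret)
{
    if (ret == 0)
        return RET_SUCCESS;
    if (ret > 0)
        return RET_SYSTEM;
    if (ret <= -30800 && ret >= -30999)
        return RET_BERKELEY_DB;
    return RET_APPLICATION;
}
```

An application callback that wants its failures to be distinguishable should therefore return negative values outside the reserved range, which this sketch reports as RET_APPLICATION.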


I'm seeing database corruption when I run out of disk space.

Berkeley DB can continue to run when out-of-disk-space errors occur, but it requires the application to be transaction protected. Applications which do not enclose update operations in transactions cannot recover from out-of-disk-space errors, and the result of running out of disk space may be database corruption.


I'm seeing database corruption when creating multiple databases in a single physical file.

This problem is usually the result of database handles not sharing an underlying database environment. See the "Opening multiple databases in a single file" section of the Berkeley DB Reference Guide for more information.


It takes a long time for my application to exit, and I can hear the disk spinning.

When a Berkeley DB application calls the database handle close method to discard a database handle, the dirty pages in the cache will be written to the backing database file by default. To change this behavior, specify the DB_NOSYNC flag to the Db.close method (or the noSync flag to the Database.close method when using the Java API); setting this flag will cause the handle close method to ignore dirty pages in the cache.

Many applications do not need to flush the dirty pages from the cache when the database handle close method is called. Applications using transactions or replication for durability don't need to flush dirty pages as the transactional mechanisms ensure that no data is ever lost. Further, there is never a requirement to flush the dirty pages from the cache until the database environment is about to be removed: processes can join and leave a database environment without flushing the dirty pages held in the cache, and only when the database environment will never be accessed again should dirty pages be flushed to the backing file.


Why do I get a "unable to initialize mutex: Function not implemented" error?

The most common reason for this error in a Berkeley DB application is that a system call underlying a mutex configured by Berkeley DB is not available on the system, so the call returns ENOSYS (the system error associated with the "Function not implemented" message).

Generally, this happens because the Berkeley DB library build was specifically configured to use POSIX mutexes, and POSIX mutexes aren't available on this system, or the library was configured on a different system where POSIX mutexes were available, and then the library was physically moved to a system where POSIX mutexes were not available.

We see this error occasionally on Linux systems, because some Linux systems have POSIX mutex support in the C library configuration, but not in the operating system, or the POSIX mutex support they have is only for intra-process mutexes, not inter-process mutexes.

To avoid this error, explicitly specify the mutex implementation DB should use, with the --with-mutex=MUTEX configuration flag:

	
    --with-mutex=MUTEX
        To force Berkeley DB to use a specific mutex implementation,
        configure with --with-mutex=MUTEX, where MUTEX is the mutex
        implementation you want. For example,
        --with-mutex=x86/gcc-assembly will configure Berkeley DB to use
        the x86 GNU gcc compiler based test-and-set assembly mutexes.
        This is rarely necessary and should be done only when the
        default configuration selects the wrong mutex implementation. A
        list of available mutex implementations can be found in the
        distribution file dist/aclocal/mutex.ac


Why do I get memory faults after allocating memory in a callback on Windows?

On Windows, an application may end up using multiple versions of the standard allocation routines (malloc, realloc and free) depending on the build flags used for each component. In addition, memory may be allocated from multiple heaps depending on where in the code the allocation takes place.

So when the application allocates some memory with its version of malloc, and Berkeley DB later attempts to free that memory using its version of free, a memory fault may occur.

The solution is to use the DB_ENV->set_alloc method before the call to DB_ENV->open. In C:

dbenv->set_alloc(dbenv, malloc, realloc, free);

In C++:

dbenv->set_alloc(::malloc, ::realloc, ::free);


Why do I get reports of uninitialized memory reads and writes when running software analysis tools (for example, Rational Software Corp.'s Purify tool)?

For performance reasons, Berkeley DB does not write the unused portions of database pages or fill in unused structure fields. To turn off these errors when running software analysis tools, configure Berkeley DB with the --enable-umrw configuration option before building.


Berkeley DB operations are returning errors like EINVAL and I'm using the API correctly.

The application is failing to zero out DBT objects before calling Berkeley DB. Before using a DBT, you must initialize all its elements to 0 and then set the ones you are using explicitly.
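A minimal sketch of the initialization pattern follows. So that it compiles without the Berkeley DB headers, dbt_like is a hypothetical stand-in carrying a few DBT-style fields, and dbt_set is an illustrative helper name; real code would include db.h and apply the same memset-then-assign steps to a DBT.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for DBT, used only so this sketch is
 * self-contained. Real applications use the DBT type from db.h. */
typedef struct {
    void    *data;
    uint32_t size;
    uint32_t ulen;
    uint32_t dlen;
    uint32_t doff;
    uint32_t flags;
} dbt_like;

/* Zero every element first, then set only the ones being used. */
void dbt_set(dbt_like *dbt, void *data, uint32_t size)
{
    memset(dbt, 0, sizeof(*dbt));
    dbt->data = data;
    dbt->size = size;
}
```

A structure declared on the stack and filled in field by field, without the memset, leaves the remaining elements holding whatever bytes happened to be there, which is the kind of input the library may reject with EINVAL.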

Another reason for this symptom is the application may be using Berkeley DB handles in a free-threaded manner, without specifying the DB_THREAD flag to the DB->open or DB_ENV->open methods. Any time you are sharing a handle across multiple threads, you must specify DB_THREAD when you open that handle.

Another reason for this symptom is the application is concurrently accessing the database, but not acquiring locks. The Berkeley DB Data Store product does no locking at all; the application must do its own serialization of access to the database to avoid corruption. The Berkeley DB Concurrent Data Store and Berkeley DB Transactional Data Store products do lock the database, but still require that locking be configured.


Are Berkeley DB databases portable between architectures with different integer sizes and different byte orders?

Yes. Specifically, databases can be moved between 32- and 64-bit machines, as well as between little- and big-endian machines. See the "Selecting a byte order" section of the Berkeley DB Reference Guide for more information.


How do I find out the size of a physical database file?

This information is returned by the DB->stat method.


Is there a way to get a count of the number of keys in a database without iterating through a cursor?

You can use the DB->stat method to obtain key and data item counts; see that method's documentation for additional information.


What's the fastest way to check for the existence of a key/data pair?

Use the DB->exists method.


Is there any way to compact databases and return unused database pages to the filesystem?

In all Berkeley DB access methods, as database pages are emptied they are made available for other uses, that is, new pages will not be allocated from the underlying filesystem as long as there are unused pages available. Additionally, Btree databases may be compacted at run-time and pages returned to the filesystem by calling the DB->compact method. Finally, Queue access method extent files are removed when they are emptied, and their pages returned to the underlying filesystem.


Why can't I get the compact method to release emptied pages back to the filesystem when using the DB_FREE_SPACE flag?

If you plan to use the compact method with a volatile database, it is recommended that the database reside in its own file. If you have databases x, y and z in a single file and plan to compact database x and free emptied pages to the filesystem, it is likely that databases y and z will prevent pages from being freed to the filesystem. If you run compact with the DB_FREE_SPACE flag, emptied pages will still be placed on the free page list, but not freed to the filesystem. To free space from the file, there must be free pages at the end of the file; the other databases (y and z) may own those pages, and in particular the metadata pages of those databases can never be freed.


How do I perform a custom sort of secondary duplicates?

If you have a secondary database with sorted duplicates configured, you may wish to sort the duplicates according to some other field in the primary record. Let's say your secondary key is F1 and you have another field in your primary record, F2, that you wish to use for ordering duplicates. You would like to use F1 as your secondary key, with duplicates ordered by F2.

In Berkeley DB, the "data" for a secondary database is the primary key. When duplicates are allowed in a secondary, the duplicate comparison function simply compares those primary key values. Therefore, a duplicate comparison function cannot be used to sort by F2, since the primary record is not available to the comparison function.

The purpose of key and duplicate comparison functions in Berkeley DB is to allow sorting values in some way other than simple byte-by-byte comparison. In general it is not intended to provide a way to order keys or duplicates using record data that is not present in the key or duplicate entry. Note that the comparison functions are called very often -- whenever any Btree operation is performed -- so it is important that the comparison be fast.

There are two ways you can accomplish sorting by F2:

  1. Instead of using F1 as the secondary key, use a concatenated key F1+F2 as the secondary key. When you wish to do a lookup by F1, use a range search (Cursor.getSearchKeyRange).
  2. Use F1 as the secondary key, as you are already doing. When you query the duplicates for F1, sort them manually by F2 after you query them. Since when you query the secondary you will have the primary record in hand, F2 will be available for sorting.

Option #1 has the advantage of automatically sorting by F2. However, you will never be able to do a join (via the Database.join method) on the F1 key alone. You will be able to do a join on the F1+F2 value, but it seems unlikely that will be useful.

Secondaries are often used for joins. Therefore, we recommend option #2 unless you are quite sure that you won't need to do a join on F1.

The trade-offs are:

  • Option #1 does not allow performing join on F1.
  • Option #1 has larger secondary keys (more overhead).
  • Option #2 requires programming the sort by F2 manually.
  • Option #2 requires enough memory to sort all duplicates for a given key F1.
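Both options can be sketched in C under the assumption that F1 and F2 are 32-bit unsigned integers. The struct and function names are hypothetical. For option #1, the fields are written most-significant byte first so that the default byte-by-byte key comparison orders the concatenated keys by F1 and then by F2; for option #2, the primary records retrieved for one F1 value are sorted in memory.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical primary record carrying the two fields F1 and F2. */
struct record {
    uint32_t f1;
    uint32_t f2;
};

/* Store a 32-bit value most-significant byte first, so that
 * byte-by-byte comparison orders values numerically. */
static void put_be32(unsigned char *out, uint32_t v)
{
    out[0] = (unsigned char)(v >> 24);
    out[1] = (unsigned char)(v >> 16);
    out[2] = (unsigned char)(v >> 8);
    out[3] = (unsigned char)v;
}

/* Option #1: build the 8-byte concatenated secondary key F1+F2. */
void make_f1f2_key(const struct record *rec, unsigned char key[8])
{
    put_be32(key, rec->f1);
    put_be32(key + 4, rec->f2);
}

/* Comparison callback for qsort(), ordering records by F2. */
static int cmp_by_f2(const void *a, const void *b)
{
    const struct record *ra = a, *rb = b;

    return (ra->f2 > rb->f2) - (ra->f2 < rb->f2);
}

/* Option #2: sort the primary records fetched for one F1 value by F2. */
void sort_dups_by_f2(struct record *recs, size_t n)
{
    qsort(recs, n, sizeof(*recs), cmp_by_f2);
}
```

With option #1, a lookup by F1 alone would position a range search at the key made from F1 followed by four zero bytes and read forward while the first four bytes still match.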


Will checkpointing my database environment block access to my databases?

A checkpoint doesn't block access to the Berkeley DB database environment, and threads of control can continue to read and write databases during checkpoint. However, the checkpoint potentially triggers a large amount of I/O which could slow other threads of control, and make it appear that access has been blocked.

You can use the DB_ENV->memp_trickle method to spread out the I/O that checkpoint will need to perform (the DB_ENV->memp_trickle method ensures a specified percent of the pages in the cache are kept clean). Alternatively, you can limit the number of sequential write operations scheduled by the DB library, using the DB_ENV->memp_set_max_write method. The DB_ENV->memp_set_max_write method affects all of the methods that flush the database cache (checkpoint, as well as other methods, for example, DB->sync).


A transactional database environment is hanging, and no threads of control are making progress.

The most common cause of this failure is a thread of control exiting unexpectedly, while holding a Berkeley DB mutex or a read/write logical database lock. If a thread of control exits holding a data structure mutex, other threads of control will likely lock up fairly quickly, queued behind the mutex. If a thread of control exits holding a logical database lock, other threads of control may lock up over a long period of time, as they will not be blocked until they attempt to acquire the specific page for which a lock is not available. See the "Deadlock debugging" section of the Berkeley DB Reference Guide for more information on debugging deadlocks.

Whenever a thread of control exits while holding a Berkeley DB mutex or logical lock, the failure must be resolved. See the "Handling failure in Transactional Data Store applications" section of the Berkeley DB Reference Guide for more information.

Finally, the Berkeley DB API is not re-entrant, and it is usually unsafe for signal handlers to call the Berkeley DB methods. See the "Signal handling" section of the Berkeley DB Reference Guide for more information.


Locks are accumulating, or threads and/or processes are deadlocking in a transactional environment, even though there is no concurrent access to the database.

The application may have failed to close a cursor. Cursors retain locks between calls. Everywhere the application uses a cursor, the cursor should be explicitly closed as soon as possible after it is used.

Another explanation for this symptom is the application is not checking for DB_LOCK_DEADLOCK errors (or DbDeadlockException exceptions in the C++ or Java APIs). Unless you are using the Berkeley DB Concurrent Data Store product, whenever there are multiple threads and/or processes concurrently accessing a database and at least one of them is writing the database, there is potential for deadlock.

If deadlock can occur, applications must test for deadlock failures and abort the enclosing transaction, or locks will be left held. See the "Recoverability and deadlock handling" section of the Berkeley DB Reference Guide for more information.
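The test-abort-retry shape described above can be sketched without the Berkeley DB headers. Everything named here is hypothetical: SKETCH_DEADLOCK stands in for DB_LOCK_DEADLOCK, the txn_hooks callbacks stand in for the application's own begin, work, commit and abort code, and the sim_ functions are a simulated workload that exists only to exercise the loop.

```c
#include <stddef.h>

#define SKETCH_DEADLOCK (-1)  /* hypothetical stand-in for DB_LOCK_DEADLOCK */

/* Hypothetical hooks standing in for the application's transaction code. */
struct txn_hooks {
    int (*begin)(void *ctx);
    int (*work)(void *ctx);       /* the reads and writes */
    int (*commit)(void *ctx);
    int (*abort_txn)(void *ctx);
};

/* Run one unit of work, aborting on any failure so its locks are
 * released, and retrying only when the failure was a deadlock. */
int run_with_retry(const struct txn_hooks *h, void *ctx, int max_attempts)
{
    int ret = SKETCH_DEADLOCK;

    for (int attempt = 0; attempt < max_attempts; attempt++) {
        if ((ret = h->begin(ctx)) != 0)
            return ret;
        if ((ret = h->work(ctx)) == 0)
            return h->commit(ctx);
        (void)h->abort_txn(ctx);
        if (ret != SKETCH_DEADLOCK)
            return ret;           /* not a deadlock: report it */
    }
    return ret;                   /* still deadlocking after all attempts */
}

/* Simulated workload: reports a deadlock a fixed number of times,
 * then succeeds. It also counts how often each hook was called. */
struct sim_state { int deadlocks_left, begins, aborts, commits; };

static int sim_begin(void *ctx)  { ((struct sim_state *)ctx)->begins++;  return 0; }
static int sim_commit(void *ctx) { ((struct sim_state *)ctx)->commits++; return 0; }
static int sim_abort(void *ctx)  { ((struct sim_state *)ctx)->aborts++;  return 0; }

static int sim_work(void *ctx)
{
    struct sim_state *s = ctx;

    if (s->deadlocks_left > 0) {
        s->deadlocks_left--;
        return SKETCH_DEADLOCK;
    }
    return 0;
}

const struct txn_hooks sim_hooks = { sim_begin, sim_work, sim_commit, sim_abort };
```

In a real application the work hook would also close any cursors it opened before returning, for the reason given at the start of this answer.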


A transactional database environment cannot be recovered or normal database operations fail with messages that "LSN" values are past the end of the log.

The application may have removed all of its log files without resetting the database log sequence numbers (LSNs). Log files should never be removed unless explicitly authorized by the db_archive utility or the DB_ENV->log_archive method. Note that those interfaces will never authorize removal of all existing log files.

Another reason for this symptom is the application may have created a database file in one transactional environment and then moved it into another transactional environment. While it is possible to create databases in non-transactional environments (for example, when doing bulk database loads) and then move them into transactional environments, once a database has been used in a transactional environment, it cannot be moved to another environment without first resetting the database log sequence numbers. See the DB_ENV->lsn_reset method documentation for more information.


A transactional application is seeing an inordinately high number of deadlocks.

The application may be acquiring database objects in inconsistent orders; having threads of control always acquire objects in the same order will reduce the frequency of deadlocks.
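One simple way to get a consistent order, sketched here under the assumption that the objects a transaction will touch can be identified by 32-bit record numbers, is to sort those numbers before operating on them. The function names are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>

/* Comparison callback for qsort(), ascending numeric order. */
static int cmp_u32(const void *a, const void *b)
{
    uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;

    return (x > y) - (x < y);
}

/* Put the record numbers a transaction will touch into ascending order,
 * so every thread of control acquires them in the same sequence. */
void order_for_locking(uint32_t *recnos, size_t n)
{
    qsort(recnos, n, sizeof(*recnos), cmp_u32);
}
```

Two threads that each need records 3, 5 and 9 will then both start with record 3, so one waits for the other instead of each holding a record the other needs.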

If you frequently read a piece of data, modify it and then write it, you may be inadvertently causing a large number of deadlocks. Try specifying the DB_RMW flag on your get calls.

Finally, reducing the isolation level can significantly reduce the number of deadlocks seen by the application. See the "Isolation" and "Degrees of isolation" sections of the Berkeley DB Reference Guide for more information.


Berkeley DB occasionally returns the error: "Unable to allocate memory for transaction detail". What does that mean?

This error means the maximum number of active transactions configured for Berkeley DB has been reached. The Berkeley DB environment should be configured to support more active transactions. When all of the memory available in the database environment for transactions is in use, calls to begin a transaction will fail until some active transactions complete. By default, the database environment is configured to support at least 20 active transactions.

For more information see the "Configuring transactions" section of the Berkeley DB Reference Guide.


How do I use the DB_AUTO_COMMIT flag with a database cursor?

The DB_AUTO_COMMIT flag does not apply to cursors. If you want transactions for cursor operations you must create and use an explicit transaction.


Can Berkeley DB use NFS, SAN, or other remote/shared/network filesystems for databases and their environments?

This is a tricky question. The answer isn't as obvious as you might think.

Berkeley DB works great with a SAN (and with any other filesystem type as far as we know), but if you attempt to access any filesystem from multiple machines, you are treating the filesystem as a shared, remote filesystem and this can cause problems for Berkeley DB. See the "Remote filesystems" section of the Berkeley DB Reference Guide for more information.

There are two problems with shared/remote filesystems: mutexes and cache consistency.

First, mutexes:
For remote filesystems that do allow remote files to be mapped into process memory, database environment directories accessed via remote filesystems cannot be used simultaneously from multiple clients (that is, from multiple computers). No commercial remote filesystem of which we're aware supports coherent, distributed shared memory for remote-mounted files. As a result, different machines will see different versions of these shared region files, and the behavior is undefined.

For example, if machine A opens a database environment on a remote filesystem, and machine B does the same, then machine A acquires a mutex backed in that remote filesystem, the mutex won't correctly serialize machine B. That means both machines are potentially modifying a single data structure at the same time, and any bad database thing you can imagine can happen as a result.

Second, caches:
Databases, log files, and temporary files may be placed on remote filesystems, as long as the remote filesystem fully supports standard POSIX filesystem semantics (although the application may incur a performance penalty for doing so). Further, read-only databases on remote filesystems can be accessed from multiple systems simultaneously. However, it is difficult (or impossible) for modifiable databases on remote filesystems to be accessed from multiple systems simultaneously. The reason is the Berkeley DB library caches modified database pages, and when those modified pages are written to the backing file is not entirely under application control. If two systems were to write database pages to the remote filesystem at the same time, database corruption could result. If a system were to write a database page back to the remote filesystem at the same time as another system read a page, a core dump in the reader could result.

For example, if machine A reads page 5 of a database (and page 5 references page 6), then machine B writes page 6 of the database, and then machine A reads page 6 of the database, machine A has an inconsistent page 5 and page 6, which can lead to incorrect or inconsistent data being returned to the application, or even core dumps.

The core dumps and inconsistent data are limited to the readers in this scenario, and some applications might choose to live with that.

You can, of course, serialize access to the databases outside of Berkeley DB, but that would imply a pretty significant hit to the overall performance of the system.

So Berkeley DB is not designed to fully support multi-system concurrent access to a database environment on a shared disk available either on a network filesystem or a SAN.


Can Berkeley DB store data to raw disk partitions?

Berkeley DB wasn't designed to use raw disk partitions, for a few different reasons:

First, using a raw disk partition requires specialized archival, tuning and other database administration tools, because you can't trivially write tools to access the physical database and other files. Berkeley DB's design allows use of existing tools for database administration (for example, using the POSIX cp, tar, or pax utilities for hot backups), resulting in better integration into the local environment. This is also an advantage for the 3rd party software vendors that license DB, as they don't want to require non-standard archival procedures or tools or having to create and provide the same to their customers. Another reason is it's difficult or impossible to extend raw partitions, and so it becomes significantly harder to change the size of a database installation. When using the file system as DB does, you can mount another partition or disk, and you're done. (I should mention that DB database applications are able to continue running when there is no disk space available, unlike many database products -- because other applications can run DB applications out of disk space, it was necessary to make DB resilient to a lack of disk space.)

Second, raw partitions don't return much of a performance improvement anyway, at least on modern operating systems. Transactional performance in a write-ahead logging database system is usually bounded by writing log files, which are written sequentially, and writing a file sequentially minimizes any file system overhead. In terms of operation latency, Berkeley DB only goes to the file system if a read or write misses in the cache. Certainly, data sets exist where the working set doesn't fit into available cache, but there aren't many of them. In short, porting Berkeley DB to raw partitions would not improve performance for applications where the working set fits into cache. You can put hundreds of gigabytes of cache on a system now, and that can handle a pretty large working set.

Third, there would be a lot of additional code needed to name "files" on the raw partition, allocate blocks to files, extend files, and so on. You have to write a layer that looks a lot like a file system, significantly increasing Berkeley DB's footprint, and that code will require continuous porting and performance tuning. If Berkeley DB implemented its own file system, we would have to invest time in adding new drivers and working on disk performance optimizations. That's not where we have deep expertise, and it's not where we can add value. With the existing architecture, customers don't have to worry whether Berkeley DB can run on a new piece of hardware; they just know it will. Further, Berkeley DB automatically performs better as the underlying file system is tuned, or as new file systems are rolled out for new types of disks (for example, solid-state disks, NVRAM disks, or new RAID devices).

All that said, the one strong argument for porting to a raw partition is to avoid double buffering (where a copy of a Berkeley DB database page is held in both the DB cache and the operating system's buffer cache). Fortunately, modern operating systems allow you to configure I/O to copy directly to/from the DB cache, avoiding the OS buffer cache and double buffering.

It would not be a lot of work to change Berkeley DB to create databases on a raw partition: simply replace the underlying open, read, write and lseek interface calls to work on a raw partition. However, making Berkeley DB fully functional in that environment would require a lot of work, in order to make the general administration of the database environment work as smoothly as it does now. That said, third-party researchers experimenting with Berkeley DB have done this. Their work is not available in any form, but serves as a proof point.

Back to top


What Linux filesystem is best for Berkeley DB database storage?

We believe that ext2 is the best performing Linux file system for transaction processing (TP) applications, but using it may lead to data corruption because ext2 lacks ordered data mode. For that reason we recommend using ext3, as it both performs well and supports ordered data mode. We don't have performance measurement information for XFS, but we've seen failures in the field (XFS has problems with applications that repeatedly extend files, and that is a common usage pattern in Berkeley DB databases).

Back to top


I'm using integers as keys for a Btree database, and even though the key/data pairs are entered in sorted order, the page-fill factor is low.

This is usually the result of using integer keys on little-endian architectures such as the x86. Berkeley DB sorts keys as byte strings, and little-endian integers don't sort well when viewed as byte strings. For example, take the numbers 254 through 257. Their byte patterns on a little-endian system are:


254	fe 0 0 0
255	ff 0 0 0
256	 0 1 0 0
257	 1 1 0 0

If you treat them as strings, then they sort badly:


256
257
254
255

On a big-endian system, their byte patterns are:


254	0 0 0 fe
255	0 0 0 ff
256	0 0 1 0
257	0 0 1 1

If you treat them as strings, they sort nicely. As a result, if you use steadily increasing integers as keys on a big-endian system, Berkeley DB behaves well and you get compact trees, but on a little-endian system Berkeley DB produces much less compact trees. To avoid this problem, you may want to convert the keys to flat text or big-endian representations, or provide your own Btree comparison function using the DB->set_bt_compare method.
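
As a rough illustration of the big-endian conversion, the following C sketch encodes a 32-bit integer key most-significant byte first, so that a byte-string comparison agrees with numeric order on any architecture. The function names (encode_key_be, decode_key_be, compare_encoded, check_key_order) are hypothetical helpers invented for this example and are not part of the Berkeley DB API. The sketch assumes unsigned 32-bit keys and uses memcmp to stand in for a byte-string sort of equal-length keys.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: encode a 32-bit key most-significant byte first,
 * regardless of the host's byte order. */
static void encode_key_be(uint32_t key, unsigned char out[4])
{
    out[0] = (unsigned char)(key >> 24);
    out[1] = (unsigned char)(key >> 16);
    out[2] = (unsigned char)(key >> 8);
    out[3] = (unsigned char)key;
}

/* Hypothetical helper: recover the integer from its big-endian encoding. */
static uint32_t decode_key_be(const unsigned char in[4])
{
    return ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
           ((uint32_t)in[2] << 8) | (uint32_t)in[3];
}

/* Compare two keys by their encoded bytes, the way a byte-string sort
 * would order two keys of equal length. */
static int compare_encoded(uint32_t a, uint32_t b)
{
    unsigned char ea[4], eb[4];

    encode_key_be(a, ea);
    encode_key_be(b, eb);
    return memcmp(ea, eb, sizeof(ea));
}

/* Self-check: the four keys from the example above sort numerically
 * once they are encoded big-endian. */
static void check_key_order(void)
{
    assert(compare_encoded(254, 255) < 0);
    assert(compare_encoded(255, 256) < 0);
    assert(compare_encoded(256, 257) < 0);
}
```

In an application, the 4-byte buffer filled in by encode_key_be would be used as the key's data (with a size of 4) in place of the native integer, and decode_key_be would be applied to keys read back from the database.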

Back to top


Is there any way to avoid double buffering in the Berkeley DB system?

While you cannot avoid double buffering entirely, there are a few things you can do to address this issue:

First, the Berkeley DB cache size can be explicitly set. Rather than allocating additional space in the Berkeley DB cache to cover unexpectedly heavy load or large table sizes, the presence of double buffering suggests sizing the cache to function well under normal conditions and then depending on the file buffer cache to cover abnormal conditions. Obviously, this is a trade-off, as Berkeley DB may not then perform as well as usual under abnormal conditions.

Second, depending on the underlying operating system you're using, you may be able to alter the amount of physical memory devoted to the file buffer cache. Running as the system super-user makes a difference for some UNIX or UNIX-like operating systems as well.

Third, changing the size of the Berkeley DB environment regions can change the amount of space the operating system makes available for the file buffer cache, and it's often worth considering exactly how the operating system is dividing up its available memory. Further, moving the Berkeley DB database environment regions from filesystem backed memory into system memory (or heap memory), can often make additional system memory available for the file buffer cache, especially on systems without a unified buffer cache and VM system.

Finally, for operating systems that allow buffering to be turned off, specifying the DB_DIRECT_DB and DB_DIRECT_LOG flags will attempt to do so.
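
As a sketch of how the first and last suggestions might be applied without changing application code, the settings can be placed in the environment's DB_CONFIG file. The cache size shown (256MB in a single region) is purely illustrative and is an assumption of this example, not a recommendation. Whether the direct I/O flags have any effect depends on the operating system, and the flag names available may differ between Berkeley DB releases.

```
# Illustrative values only: 0GB + 268435456 bytes (256MB), one cache region.
set_cachesize 0 268435456 1

# Ask Berkeley DB to bypass the file buffer cache where the OS supports it.
set_flags DB_DIRECT_DB
set_flags DB_DIRECT_LOG
```

The same settings can be made programmatically through the DB_ENV->set_cachesize and DB_ENV->set_flags methods before the environment is opened.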

Back to top
