Minimizing Memory Usage for Creating Application Subprocesses

   
By Greg Nakhimovsky, May 2006  

Abstract: This article explains how a Solaris OS application with large memory requirements can effectively create a subprocess without unduly running out of memory or creating a deadlock. It also explores a related issue of how application memory is committed in the Solaris OS as opposed to other operating systems such as Linux.

Contents
 
Historical Background and Problem Description
posix_spawn(3C), popen(3C), and system(3C) Interfaces
Solaris Implementation of posix_spawn(), system(3C), and popen(3C)
Memory Overcommit: The Solaris OS vs. Other Operating Systems
How to Call posix_spawn() Only When It's Available
Acknowledgments
References
About the Author

Historical Background and Problem Description

Traditionally, Unix has had only one way to create a new process: using a fork() system call, often followed by an exec() system call. The fork() call makes a copy of the entire parent process's address space, and exec() then replaces that copy with the image of the new program.

(Note: In the Solaris OS, the term swap space is used to describe a combination of physical memory and disk swap space configured for the system. However, with other Unix systems this term may mean swap space on disk, also known as backing store. To avoid any confusion, I'll use the term Virtual Memory (VM) to mean physical memory plus disk swap space.)

Generally, the fork/exec method has worked quite well. However, it has disadvantages in some cases, such as running out of memory without a good reason and poor fork performance.

Out of Memory: For a large-memory process, the fork() system call can fail due to an inadequate amount of VM, because the child needs a VM reservation as large as the parent's, so the two processes together momentarily require twice the parent's memory. This can happen even when fork() is immediately followed by an exec() call that would release most of that extra memory. When this happens, the application will usually terminate.

For example, suppose a 64-bit application is consuming 6 gigabytes (Gbytes) of VM at the moment, and it needs to create a subprocess to run the ls(1) command. The parent process issues a fork() call that will succeed only if there is another 6 Gbytes of VM available at the moment. If the system doesn't have that much VM available (which is a frequent situation), fork() will fail with ENOMEM. Obviously, the ls(1) command doesn't need anywhere near 6 Gbytes of memory to run, but fork() doesn't know that.
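The failure mode described above can be sketched in C as follows. This is a minimal illustration, not code from any Sun tool; run_with_fork is a hypothetical helper name, and the error handling is deliberately simple.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Run argv[0] with the classic fork()/exec() pair and wait for it.
 * Returns the child's exit status, or -1 if fork() or waitpid() failed
 * or the child was killed by a signal.
 * run_with_fork is a hypothetical name used only for this sketch.
 */
int run_with_fork(char *const argv[])
{
    int status;
    pid_t pid = fork();

    if (pid == -1) {
        /* For a large parent, this is where ENOMEM would be reported. */
        fprintf(stderr, "fork failed: %s\n", strerror(errno));
        return -1;
    }
    if (pid == 0) {
        execv(argv[0], argv);
        _exit(127);            /* reached only if execv() failed */
    }
    while (waitpid(pid, &status, 0) == -1) {
        if (errno != EINTR)
            return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

In the 6-Gbyte scenario, the call to fork() is the step that fails, before the small child program ever gets a chance to run.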

Not only applications, but also Sun's own tools can suffer from the same problem. For example, the following Sun RFE (request for enhancement) has been filed for dbx: "4748951 dbx shell should use posix_spawn() for non-builtin commands rather than fork(2)".

RFE 4748951 came about when a customer's utility invoked dbx to read a huge core file using a script that also needed to run a cut(1) command from within dbx. They got a cannot fork - try again error message causing dbx to abort. An investigation revealed that dbx used fork/exec to execute that tiny cut(1) command and ran out of VM during the fork() call.

The Solaris Java Virtual Machine (JVM) is also suffering from the same problem currently, as described in this Sun RFE: "5049299 Use posix_spawn, not fork, on S10 to avoid swap exhaustion".

Fork Performance: The fork() call can hurt performance. Even though fork() has been improved over the years to use the COW (copy-on-write) semantics, it still has to copy a certain amount of data from parent to child. That copying is not really necessary if the child process is immediately replaced with a new one by a call to exec(). This performance hit may be especially noticeable when the parent process has many memory mapped regions.

These disadvantages, i.e. running out of memory and poor fork performance, become especially important when the parent process consumes a large amount of memory (which has become very common in recent years), and when the memory requests require reservation (commitment) of VM as they do in the Solaris OS; see discussion of Solaris memory commitment below.

To deal with such disadvantages of the fork/exec model when fork() is immediately followed by exec(), the Berkeley version of Unix (BSD) introduced the vfork() system call in the early 1980s. vfork(2) does not copy the parent process to the child. Both processes share the parent's virtual address space; the parent is suspended until the child exits or calls exec().

The vfork(2) system call was also adopted in the Solaris OS. Much later, however, when multithreading (MT) became available and widely used, it was discovered that vfork() may introduce a new problem when the application has multiple threads running: deadlock. The deadlock can happen because the dynamic linker, ld.so.1, is involved in resolving the necessary symbols. In particular, if the child process calls an external function (such as exec()), the dynamic linker may be invoked to resolve the Procedure Linkage Table (PLT) entry, for which the dynamic linker will acquire a mutex lock. This lock may already be held by a different thread in the parent process. If this happens, it will create a deadlock between the parent and child processes, because no thread in the parent can run until the child has called exec() or exit(). As a result, both the parent and the child processes will hang.

The introduction of posix_spawn(3C) (see reference [1]), starting with the Solaris 10 OS, has solved these problems. In earlier Solaris versions (Solaris 9 and Solaris 8 2/02), the system(3C) and popen(3C) calls were also made deadlock-safe and no longer require double the amount of VM.

Some operating systems, notably Linux, have what's known as the memory overcommit feature. That is, when an application calls malloc() or any other memory-requesting interface, the OS always returns a non-NULL pointer without reserving any swap space for the requested memory.

Linux has posix_spawn() implemented with fork() and exec(). Due to its memory overcommit policy (as explained below), Linux may not suffer from the first problem described above: running out of VM space when a large process calls fork(). However, the memory overcommit feature has created serious problems of its own. See the "Memory Overcommit" discussion below.

It's worth mentioning here that there is an alternative method to create subprocesses that can be safely and efficiently used with any version of the Solaris OS or other Unix versions. During initialization of the main process (before it creates any threads or allocates any significant amount of memory) you can fork/exec a special small helper process dedicated to creating subprocesses for the big parent process. The parent process can send requests to create a subprocess, commands to execute, and so on, to the helper process via a pipe or any other inter-process communication mechanism. The helper process will not run out of VM space during the fork() call and it will be fast, because its memory requirements are very small so it can safely call fork/exec to create each subprocess.

The disadvantage of this alternative method is its extra complexity. An application using this method will have to make sure the extra helper process is properly terminated any time the main process terminates, to use MT-safe methods to communicate with the helper process, and so on. Also, using a helper process makes it harder to share file descriptors with the child processes. Calling posix_spawn(), popen(), or system() directly from the large parent process is much simpler.

posix_spawn(3C), popen(3C), and system(3C) Interfaces

The simplest way to start a new process from a C/C++ application is to use system(3C). The system() call causes the shell to execute the given command. See the system(3C) man page for details. This interface is adequate if all you need to do is run a shell command and wait until it's finished executing.
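As a small illustration of system(3C), the sketch below runs a command and decodes the returned wait status. The wrapper name run_shell_command is hypothetical.

```c
#include <stdlib.h>
#include <sys/wait.h>

/*
 * Run a shell command with system(3C) and return its exit status,
 * or -1 if the command could not be started or was killed by a signal.
 * run_shell_command is a hypothetical name used only for this sketch.
 */
int run_shell_command(const char *command)
{
    int status = system(command);

    if (status == -1)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

For example, run_shell_command("ls -l /etc/passwd") prints the listing and returns the exit status of ls.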

A somewhat more powerful interface to start a new process is popen(3C). In addition to starting a new process, popen() allows you to capture the output of the given shell command and manipulate it in various ways, for example parse it. For details, see the popen(3C) man page.

In addition to offering more flexibility, the popen() call is safer in multithreaded programs than system(). The system() call modifies certain signal dispositions that may affect other threads, while popen() doesn't do that. This is why the popen(3C) man page marks it MT-Safe, while system(3C) is marked as MT-unsafe. See the system(3C) man page for details.

The most powerful interface for this functionality is posix_spawn(3C) (and its variant posix_spawnp(3C)) introduced in the Solaris 10 OS. It allows you to do additional manipulations of the kind that can be done between the fork() and exec() calls.

The posix_spawn() interface requires you to specify the full path of the executable file that the child process will run. The posix_spawnp() variant can do the same, but it can also search the $PATH directories for the executable file if the file name is provided without a path.

Here are two example programs showing how you can use the posix_spawn() interface. These examples should be useful, considering that the posix_spawn(3C) man page contains no examples and that the use of posix_spawn() can be quite bewildering.

Here is the first example:

              posix_spawn_example1.c
(Note: Please save file without the .txt suffix.)
          

This shows how to call posix_spawn() for the simple purpose of executing a shell command ( /bin/ls -l /etc/passwd in this case) and waiting until the command is finished. Also note how the program checks for errors from posix_spawn().
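The original posix_spawn_example1.c file is not reproduced in this text. As a rough sketch of the approach described above (not the author's program), the function below spawns a program, waits for it, and checks for errors; spawn_and_wait is a hypothetical name.

```c
#include <spawn.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>

extern char **environ;

/*
 * Spawn the program at `path` with the given argv, wait for it, and
 * return its exit status (-1 on any failure).
 * spawn_and_wait is a hypothetical name used only for this sketch.
 */
int spawn_and_wait(const char *path, char *const argv[])
{
    pid_t pid;
    int status;
    /* posix_spawn() returns an error number directly, not -1. */
    int err = posix_spawn(&pid, path, NULL, NULL, argv, environ);

    if (err != 0) {
        fprintf(stderr, "posix_spawn(%s): %s\n", path, strerror(err));
        return -1;
    }
    if (waitpid(pid, &status, 0) == -1) {
        perror("waitpid");
        return -1;
    }
    if (!WIFEXITED(status))
        return -1;
    /* An exit status of 127 may mean the child could not run the program. */
    return WEXITSTATUS(status);
}
```

To run the command from the example, the caller would pass "/bin/ls" as the path and an argv of { "ls", "-l", "/etc/passwd", NULL }. The posix_spawnp() variant takes the same arguments but accepts a bare file name in place of the full path.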

The second example is more sophisticated:

              posix_spawn_example2.c
(Note: Please save file without the .txt suffix.)
          

This example demonstrates how you can use posix_spawn() to do the kind of manipulation that is often done between fork() and exec(). It mimics what a shell might do for file redirection. The program creates a child process that has its standard input bound to a particular file, without disturbing the open file descriptors in the parent. In the interest of simplicity, it's using /bin/cat as the child process.

The posix_spawn_example2.c example program shows the use of both posix_spawn() and fork()/exec(), such that you can compare the interfaces.

If invoked with the argument -spawn, posix_spawn_example2.c uses posix_spawn() and binds the child's input to file /etc/hosts.

If invoked with no argument, posix_spawn_example2.c uses fork()/exec() and binds the child's input to the file /etc/passwd.

Even without a debugger, you can verify which path is being used by using truss(1). The fork() option actually calls fork1() in the Solaris 10 OS, while the posix_spawn() execution path calls vfork() (the libc-internal version of it).

Note how posix_spawn_example2.c performs error detection. It checks for an error code returned from each function related to posix_spawn(). In addition, it checks for error code 127 that may be returned from the child process to indicate a problem there. See the ERRORS section of the posix_spawn(3C) man page (reference [1]) for details of when the child process may exit with status 127.
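The posix_spawn() path of the second example can be approximated as follows. This is not the original posix_spawn_example2.c; it is a minimal sketch of the same file-redirection technique, and spawn_with_stdin is a hypothetical name.

```c
#include <fcntl.h>
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/*
 * Spawn `path` with its standard input bound to `infile`, leaving the
 * parent's descriptors untouched. Returns the child's exit status, or
 * -1 if any spawn-related call fails.
 * spawn_with_stdin is a hypothetical name used only for this sketch.
 */
int spawn_with_stdin(const char *path, char *const argv[], const char *infile)
{
    posix_spawn_file_actions_t actions;
    pid_t pid;
    int status, err;

    if (posix_spawn_file_actions_init(&actions) != 0)
        return -1;
    /* In the child only: open infile read-only as descriptor 0. */
    err = posix_spawn_file_actions_addopen(&actions, STDIN_FILENO,
                                           infile, O_RDONLY, 0);
    if (err == 0)
        err = posix_spawn(&pid, path, &actions, NULL, argv, environ);
    posix_spawn_file_actions_destroy(&actions);
    if (err != 0)
        return -1;
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    /* Status 127 may indicate a failure in the child, such as the open. */
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling spawn_with_stdin("/bin/cat", argv, "/etc/hosts") with an argv of { "cat", NULL } mimics the shell command cat < /etc/hosts. Whether a failed open is reported by posix_spawn() itself or by a child exit status of 127 depends on the implementation, so a caller should check for both.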

These examples demonstrate how to use the simpler abilities of posix_spawn(), which is enough in many cases. However, posix_spawn() can also perform more esoteric adjustments to the child process, such as changing user and group IDs, signal mask, and scheduling class. See the posix_spawn() man page for details.
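As one illustration of these attribute adjustments, the sketch below starts the child with an empty signal mask and with SIGINT and SIGQUIT reset to their default actions. The choice of signals and the name spawn_with_clean_signals are assumptions made for this example.

```c
#include <signal.h>
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>

extern char **environ;

/*
 * Spawn `path` with an empty signal mask and default handling for
 * SIGINT and SIGQUIT. Returns the child's exit status or -1.
 * spawn_with_clean_signals is a hypothetical name for this sketch.
 */
int spawn_with_clean_signals(const char *path, char *const argv[])
{
    posix_spawnattr_t attr;
    sigset_t mask, defaults;
    pid_t pid;
    int status, err;

    if (posix_spawnattr_init(&attr) != 0)
        return -1;
    sigemptyset(&mask);
    sigemptyset(&defaults);
    sigaddset(&defaults, SIGINT);
    sigaddset(&defaults, SIGQUIT);

    err = posix_spawnattr_setsigmask(&attr, &mask);
    if (err == 0)
        err = posix_spawnattr_setsigdefault(&attr, &defaults);
    if (err == 0)   /* the flags select which attributes take effect */
        err = posix_spawnattr_setflags(&attr,
                  POSIX_SPAWN_SETSIGMASK | POSIX_SPAWN_SETSIGDEF);
    if (err == 0)
        err = posix_spawn(&pid, path, NULL, &attr, argv, environ);
    posix_spawnattr_destroy(&attr);
    if (err != 0)
        return -1;
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Attributes that are set but not named in the flags are ignored, which is an easy detail to miss.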

Solaris Implementation of posix_spawn(), system(3C), and popen(3C)

These interfaces can be implemented in a variety of ways. The most obvious way is to use fork() followed by exec(). Some operating systems do just that, Linux for example. However, see reference [2], which is a request to improve the Linux implementation of posix_spawn() by calling vfork() instead of fork().

Starting with the Solaris 10 OS for posix_spawn() (and starting with the Solaris 8 2/02 release for system() and popen()), all three of these interfaces in the Solaris OS will never create a deadlock, and they will not cause the out-of-swap condition for large applications.

In the Solaris 10 OS, posix_spawn() is currently implemented using private-to-libc vfork(), execve(), and exit() functions. They are identical to regular vfork(), execve(), and exit() in functionality, but they are not exported from libc and therefore don't cause the deadlock-in-the-dynamic-linker problem that any multithreaded code outside of libc that calls vfork() can cause.

The Solaris 10 versions of system(3C) and popen(3C) are implemented using posix_spawn(). The Solaris 9 and Solaris 8 2/02 versions of those interfaces are implemented with the private-to-libc vfork() and execve().

Now that the Solaris OS has been open sourced, you can see all the implementation details online, here for example:

http://cvs.opensolaris.org/source/xref/usr/src/lib/libc/port/threads/spawn.c

Of course, the Solaris implementation of posix_spawn() can change in the future, perhaps to make it more efficient in various ways. However, any new implementation will always support the standard API.

Note that posix_spawn() is a requirement of the Single UNIX Specification, Version 3 (SUSv3); see reference [3].

The vfork(2) system call itself can't be made MT-safe and it's no longer necessary anyway. In the Solaris 10 OS, vfork(2) is deprecated. According to its man page, "Its sole legitimate use as a prelude to an immediate call to a function from the exec family can be achieved safely by posix_spawn(3C) or posix_spawnp(3C)."

Memory Overcommit: The Solaris OS vs. Other Operating Systems

Some operating systems (such as Linux, IBM AIX, and HP-UX) have a feature called memory overcommit (also known as lazy swap allocation). In a memory overcommit mode, malloc() does not reserve swap space and always returns a non-NULL pointer, regardless of whether there is enough VM on the system to support it or not.

The memory overcommit feature has advantages and disadvantages.

Advantages

  • The fork() call never fails because of the lack of VM.
  • The system memory can be used more flexibly and efficiently, especially when application programs dynamically allocate a lot of memory but don't fill much of it with data.
  • Memory overcommit can be used for growable (infinitely large) arrays and memory buffers; compare with "Virtual Memory Arrays for Application Software," reference [4].
  • As a variation of the growable array idea described above, many Fortran programs (especially those created before Fortran-90) don't use dynamic memory allocation, because earlier Fortran versions did not provide any standard facility for it. Instead, they declare very large arrays but use only parts of them. On a non-overcommit system, these Fortran programs may not load at all, or may load only if you waste a lot of disk space on swap space that is never used.

Disadvantages

  • The memory overcommit feature changes the malloc() failure semantics. All those good applications that faithfully check for malloc() returning NULL, and that produce meaningful error messages and workarounds in that case, are doing it for nothing when memory overcommit is used.
  • Arguably, memory overcommit violates the C/C++ standards that require that when malloc() returns a non-NULL pointer, the allocated memory should be available when needed.
  • The memory overcommit feature is global for the entire system. There is no way to use it for some applications but not others, or only for certain memory buffers within a certain application.
  • Most importantly, when a memory overcommit system is out of VM, one or more processes will be killed by the infamous OOM (out-of-memory) process killer due to memory pressure. This may be unacceptable, especially in enterprise-class environments. Random application programs shouldn't be killed just because somebody else allocated too much VM and filled it with data.

An interesting analogy for this issue is described in "Respite from the OOM killer," see reference [5] (which also contains a related discussion). Look for the sentence starting with "An aircraft company discovered ..."

The Linux documentation regarding this issue is somewhat contradictory. Red Hat has the following article explaining it: "Understanding Virtual Memory" (reference [6]), containing an explanation of the overcommit_memory parameter. See the paragraph starting with "overcommit_memory is a value ..."

Compare the overcommit_memory explanation in the above Red Hat article with the one given in the Linux Documentation, see reference [7]. See the section there starting with "The Linux kernel supports the following overcommit handling modes."

Under the Linux kernel version 2.6 and later, theoretically there is a way to modify the kernel's behavior so that it will not overcommit memory. This is done by selecting what is called the strict overcommit mode via sysctl:

sysctl -w vm.overcommit_memory=2

or placing an equivalent vm.overcommit_memory=2 entry in /etc/sysctl.conf.

 

Mode 2 (which is new in 2.6) is certainly an improvement over modes 0 and 1 available in the older versions of the Linux kernel. However, mode 2 doesn't mean that memory will never be overcommitted. It just uses a different heuristic for guessing how much memory is safe to allow to be allocated.

Also note that vm.overcommit_memory=2 is still not the default setting. The default is vm.overcommit_memory=0.
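An application that wants to know which mode is in effect can read the same setting through the /proc file system, where vm.overcommit_memory appears as /proc/sys/vm/overcommit_memory. The sketch below is one possible way to do that; read_overcommit_mode is a hypothetical name.

```c
#include <stdio.h>

/*
 * Return the current vm.overcommit_memory mode (0, 1, or 2), or -1 if
 * the value cannot be read, for example on a non-Linux system.
 * read_overcommit_mode is a hypothetical name used only for this sketch.
 */
int read_overcommit_mode(void)
{
    FILE *fp = fopen("/proc/sys/vm/overcommit_memory", "r");
    int mode = -1;

    if (fp == NULL)
        return -1;
    if (fscanf(fp, "%d", &mode) != 1)
        mode = -1;
    fclose(fp);
    return mode;
}
```

A program that depends on malloc() returning NULL under memory pressure could log a warning at startup when this function returns anything other than 2.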

In contrast, under the Solaris OS when the application calls malloc() (internally invoking sbrk(2) to get more memory from the system), the kernel goes through its free memory lists trying to find the requested amount of VM. If found, the kernel returns a pointer to that memory and reserves the swap space for it such that no other process can use it until the owner releases it. If not found, malloc() will return NULL.

The Solaris OS does not use any heuristics or guessing of any kind. Therefore, it never needs to kill random processes when it runs out of memory.

While the Solaris OS doesn't use memory overcommit in its malloc() and sbrk(), it does allow similar functionality but with a much finer granularity and more control, via the mmap(MAP_NORESERVE) feature. Using mmap(MAP_NORESERVE), you can use this facility only for certain selected memory buffers and/or only in selected applications. For details, see "Virtual Memory Arrays for Application Software," reference [4].

How to Call posix_spawn() Only When It's Available

The posix_spawn(3C) interface was introduced in the Solaris 10 OS. If your application is built on, and may be run under, an earlier version of Solaris where posix_spawn(3C) is not available, you can use the following method.

Your application can dynamically determine if posix_spawn() is available by calling dlsym(RTLD_NEXT, "posix_spawn"). For example:

 
#include <dlfcn.h>
int (*posix_spawn_ptr)();
...
posix_spawn_ptr = (int(*)())dlsym(RTLD_NEXT, "posix_spawn");
if(posix_spawn_ptr != NULL)
{
  /* Call posix_spawn_ptr() the same way as posix_spawn() */
}
else
{
  /* posix_spawn() is not available; use older methods */
}

 

However, there is a problem with using this general method in this case. Using posix_spawn(3C) requires inclusion of system include file spawn.h like the following:

#include <spawn.h>

The problem is that the file spawn.h is not available under the Solaris 9 OS or earlier. To work around this problem, you can replace the above #include <spawn.h> statement with the following:

 
/*
 * To allow compiling such a program under Solaris 9 or earlier,
 * copy /usr/include/spawn.h from Solaris 10
 * locally and explicitly add definition of _RESTRICT_KYWD.
 * Note that "spawn.h" is included, rather than <spawn.h>.
 */
#if (defined(__STDC__) && defined(_STDC_C99))
#define _RESTRICT_KYWD  restrict
#else
#define _RESTRICT_KYWD
#endif
#include "spawn.h"
 

The resulting program will compile successfully under either Solaris 10 or an earlier Solaris version.

Note that having a local copy of a Solaris 10 header file in your application area is likely to require a certain amount of maintenance in the future when the system header file spawn.h may change. When you start building your application on the Solaris 10 OS or later, it will be best to remove the local copy of spawn.h and the special trick shown above, and use the system version of spawn.h directly instead. You may want to add a comment to that effect to your special code, so that the future developers will know why you added this code and a local copy of spawn.h to your application.

Acknowledgments

I'd like to thank my Sun colleagues Morgan Herrington (who also created example program posix_spawn_example2.c), Chris Quenelle, and Eric Sosman for their advice related to the issues discussed in this article.

References
About the Author

Greg Nakhimovsky is a Sun engineer working with application software vendors to make sure their products run well on Sun systems.
