Murthy Chintalapati, Meena Vyas, and Ramchander Varadarajan, August 2008
This article describes the system architecture, configuration, and deployment troubleshooting for Sun Forums.
Sun Forums (available at forums.sun.com) is one of the busiest Sun.com sites after www.sun.com and the Sun download centers.
Deployed on Sun Java System Web Server 7.0, Sun Forums offers Web 2.0-style forum services for all Sun products and technologies, and features RSS feeds and a rich-text editor for forum posts and responses.
Sun Forums traffic is increasing at the rate of over 3 percent year over year. In 2007, this web site had over 60 million hits. In 2008, it should surpass that number quite easily. In March 2008, the site had 6.6 million page views and 3.8 million visits (not including RSS feeds). Typically, over 60 users are logged in to the site at all times, and the site reaches a maximum of 1,200 users online during high-traffic times.
This article provides a system architecture and administrator view of how Sun Forums is deployed.
Figure 1 outlines the Sun Forums system architecture. As you can see, it's a very typical web tier deployment with a farm of web servers that are secured by a firewall, front-ended by a load balancer, and supported by a database server with redundancy in the back end.
The forums site is deployed on the Sun Java System Web Server (henceforth, "Web Server") 7.0 Update 2 release. The site runs on a couple of Sun Fire V40Z servers (which are superseded by Sun Fire X4600 M2 servers) and an Oracle 10g database server running on a Sun Fire V490 server using the Solaris 10 Operating System.
Sun Forums is based on Jive Forums, a Java web application that is deployed as a simple WAR file on the web server running on Java Platform, Standard Edition (Java SE) 6.
Sun Forums is a two-tier web deployment with Web Server 7.0 handling traffic from the Internet and a Nortel Alteon 2424ssl load balancer on the front end.
Figure 1: Sun Forums System Architecture
Web Server 7.0 uses a couple of configuration files that administrators often modify:
server.xml file contains information related to server runtime behavior, such as number of threads, number of concurrent connections, and Java Virtual Machine (JVM) tuning for the built-in web container to serve servlet and JSP applications.
<vs>-obj.conf contains information specific to virtual servers and how requests are processed.
For more information on Sun Java System Web Server configuration files, please refer to the product documentation mentioned in the For More Information section.
TIP: Default configuration values are often good. Regarding system-specific performance tuning, Web Server 7.0 Update 2 and later releases detect the available system resources and automatically choose values related to the number of threads or concurrent connections to provide optimal out-of-the-box performance. Hence, it is always best to start with the default values and then tune the web server or JVM configurations depending on a given application.
This section shows the configuration changes that the Sun Forums team needed to make to get the desired site performance. Here are relevant sections of the web server server.xml configuration file, with inline comments (for the sake of brevity, only selected attributes are shown):
One of the themes of the Sun Java System Web Server 7.0 Update 2 is manageability of clustered deployments. Any real-world deployment such as Sun Forums will undoubtedly run into production issues. However, having administration tools to troubleshoot those issues and apply necessary remedies is critical for successful production deployment.
This section examines a couple of early production issues and the tools used by Sun Forums administrators to troubleshoot those issues as they took the site live successfully with Web Server 7.0 Update 2.
When Sun Forums first upgraded to Web Server 7.0 Update 2, the site ran into a couple of production issues. For example, the site would be unresponsive or appear to be hung every 40 to 50 minutes after a fresh restart. The problem could have been due to assorted reasons, such as runaway database connections, excessive tuning of thread pool, and so on.
Troubleshooting is an exercise in systematically identifying the underlying root causes of a site's symptoms by using a combination of tools that are appropriate to different aspects of the problem. As the root causes are identified, you apply necessary tunings, configuration changes, and other remedies, see how the site responds to those changes, and repeat this process iteratively until the site performs satisfactorily.
Sun Java System Web Server includes tools for administrators to troubleshoot and resolve the issues mentioned previously. In particular, the following tools were used by the Sun Forums team early on:
perfdump
This tool provides a quick peek into what's happening inside the server, such as connection queue and queueing delays, file cache size and hit ratio, requests in flight and response latencies, and so on. A few improvements were made to perfdump in Web Server 7.0 Update 2 that the Sun Forums team found particularly useful:
perfdump provides a list of the functions the worker threads are processing and the URIs currently being serviced (see the Appendix for a sample output).
perfdump can be accessed through the command-line administration interface and continues to function even when the web server appears to be hung or unresponsive, which is very handy when troubleshooting production issues. Enabling perfdump is simple. (See the
Administration Console's monitoring GUI
One of the features of the monitoring GUI that the Sun Forums team found particularly useful was the database connection pool statistics: current connection pool sizes, number of connections leased, and so on. Using the monitoring GUI, the team traced a database connection leak bug in the application. By observing that the database connections weren't being returned back (or were leaking), they quickly turned on server error logging at a fine level and traced the problem to a section of the application's exception handling code where the connection wasn't getting freed.
Operating system tools
The Solaris 10 OS provides several useful system-level tools for tracking down the production issues that administrators often debug. In particular, the Sun Forums team found prstat -L -p <server pid> and pstack <server pid> to be quite helpful.
When the Sun Forums team first observed the site becoming unresponsive every 40 to 50 minutes, the team noticed a couple of related things. The server initially loaded fine, and then the load would increase slowly until prstat showed it reaching close to 98% CPU utilization. The database server is a separate component, and the database administrator confirmed that the load on the database was not that high.
Sometimes, during peak load, although prstat indicated about a 98% load, the site responded extremely quickly for a while, then slowed, and then picked up again. Eventually, though, the server became unresponsive.
So, as a first step of troubleshooting, the team used perfdump. (Bear in mind that even if the web server seems to be hanging, perfdump continues to work, and it is accessed from command-line administration.)
$ wadm get-perf-dump --config=<config_name> --node=<node_name>
When the web server seems unresponsive, most often the worker threads (referred to as daemon session threads) are busy doing something or are blocked waiting on some resource (such as a response from the back-end database, an LDAP server, or another web service). The perfdump output shows which threads are blocked and where. (It shows the service function as well as the URI that got it into this blocked state.)
The perfdump output also shows that when worker threads are blocked, the connection queue fills up, because there is no thread to service those connections. Once the connection queue is full, the server starts returning server-busy errors. In this case, with the web server taking 98% CPU time, perfdump showed a few worker threads in the J2EE service function (that is, inside the Jive Forums application) taking up time. To look deeper, it became instructive to look at a thread dump.
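Because a hang develops over time, it can help to keep a series of perfdump snapshots for comparison. The following is a minimal sketch, not the team's actual procedure: the function name snapshot_perfdump, the WADM override variable, and the output file naming are hypothetical, and any authentication options that wadm needs in your environment are omitted, as in the command shown earlier.

```sh
#!/bin/sh
# Save one perfdump snapshot to a timestamped file and print the file name.
# Usage: snapshot_perfdump <config_name> <node_name> <output_dir>
# Assumption: WADM may name an alternative wadm binary; it defaults to
# "wadm" on the PATH. Only the options shown in the article are passed.
snapshot_perfdump() {
    out="$3/perfdump-$(date +%Y%m%d-%H%M%S).txt"
    "${WADM:-wadm}" get-perf-dump --config="$1" --node="$2" > "$out" && echo "$out"
}
```

Calling the function from a loop or a cron job at a fixed interval gives a timeline of the connection queue and the busy worker threads leading up to a hang.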
Using prstat to Trace Threading Issues
It's relatively simple to take a thread dump of both the native server threads and the threads created by the JVM that is embedded in Web Server. Running prstat -L shows thread-level CPU usage, so you can identify the busiest threads. Then, run pstack <server pid> > outputfile and find thread #xxx in the output file to review the actual thread stack. Note: <server pid> should be the actual process ID of the Web Server instance (webservd process).
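Once prstat -L has pointed at a busy thread, pulling just that thread's stack out of a large pstack output file can be scripted. The sketch below is illustrative only: the function name lwp_stack is hypothetical, and it assumes that each thread section in the pstack output begins with a header line containing "lwp# <id>", so adjust the pattern if your output differs.

```sh
#!/bin/sh
# Print the stack of one LWP from a saved pstack output file.
# Usage: lwp_stack <pstack_output_file> <lwp_id>
# Assumption: every thread section starts with a header line that
# contains "lwp# <id>"; lines up to the next header belong to that thread.
lwp_stack() {
    awk -v id="$2" '
        /lwp# *[0-9]/ {
            # Header line: decide whether this section is the one we want.
            printing = 0
            for (i = 1; i < NF; i++)
                if ($i == "lwp#" && $(i + 1) == id) printing = 1
        }
        printing { print }
    ' "$1"
}
```

For example, after pstack <server pid> > outputfile, running lwp_stack outputfile 12 would print the header and stack frames for LWP 12 only.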
The Sun Forums team used a thread dump (in combination with the perfdump output) to track down the 98% CPU usage problem. The Java thread dumps, which were taken at regular intervals, indicated that in this case the problem was in the application code, which seemed to throw exceptions. It turned out, upon enabling finer logging, that the application leaked JDBC connections in its exception handling code.
By the way, it's useful to take such thread dumps at some intervals for comparison. Java thread dumps could, in fact, be taken at regular intervals this way:
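#!/bin/sh
# Send SIGQUIT to the server process every 60 seconds;
# each signal makes the JVM write a Java thread dump.
# Usage: pass the webservd process ID as the first argument.
pid=$1
while true; do
  kill -QUIT "$pid"
  sleep 60
done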
Back to the perfdump output, it became clear that the server was tuned rather excessively. For example, the keep-alive thread count of 96 was pretty high, because the recommendation is to use a small multiple of the number of processors or cores on the system. Likewise, the stack size of 512 Kbyte appeared to be too high. The worker threads (daemon session threads) maximum setting of 512 was rather high, too.
So, as a result of looking at the output, the team quickly tuned down the keep-alive threads to be around 8, and they dialed down the worker thread pool.
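The "small multiple of the number of cores" guideline is simple arithmetic, sketched below. The function name suggest_keepalive_threads and the default multiplier of 2 are assumptions made for illustration; they are not values taken from the product documentation.

```sh
#!/bin/sh
# Suggest a keep-alive thread count as a small multiple of the core count.
# Usage: suggest_keepalive_threads <cores> [multiplier]
suggest_keepalive_threads() {
    cores=$1
    multiplier=${2:-2}   # assumption: illustrative default, not a documented value
    echo $((cores * multiplier))
}

suggest_keepalive_threads 4   # prints 8
```

With four cores and a multiplier of 2, the result is 8, far below the original setting of 96.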
Again, back to the perfdump output, the team also noticed that some static content services were waiting forever to be completed. The application loads a lot of static resources for some actions, and severe performance degradation was noticeable. Eventually, the server had to be restarted.
Here is how the file cache on the monitoring statistics looked at the time:
Total Cache Hits                              288832
Total Cache Misses                            24256
Total Cache Content Hits                      94769
Number of File Lookup Failures                244
Number of File Information Lookups            193775
Number of File Information Lookup Failures    20442
Number of Entries                             251
Maximum Cache Size                            1024
Number of Open File Entries                   0
Number of Maximum Open Files Allowed          1024
Heap Size                                     493456
Maximum Heap Cache Size                       10743828
Size of Memory Mapped File Content            0
Maximum Memory Mapped File Size               0
Maximum Age of Entries                        30
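A quick way to read these counters is to turn the hit and miss totals into a hit ratio. The helper below is a small sketch; the function name cache_hit_ratio is hypothetical, and the ratio is defined here as hits divided by hits plus misses.

```sh
#!/bin/sh
# Compute the file cache hit ratio as a percentage with one decimal place.
# Usage: cache_hit_ratio <total_cache_hits> <total_cache_misses>
cache_hit_ratio() {
    LC_ALL=C awk -v h="$1" -v m="$2" 'BEGIN {
        total = h + m
        if (total == 0) { print "n/a"; exit }   # no lookups recorded yet
        printf "%.1f\n", 100 * h / total
    }'
}

cache_hit_ratio 288832 24256   # prints 92.3
```

For the statistics shown above, that works out to roughly 92 percent of lookups being served from the cache.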
Since the Sun Forums site had lots of images and other static resources that were served on every request but changed relatively rarely, the question was whether to increase the maximum age. Clearly, if the static content isn't updated frequently, then increasing the maximum age will help. Refer to the file cache tuning section of the Sun Java System Web Server 7.0 Update 3 Performance Tuning, Sizing, and Scaling Guide.
Depending on how large the static content is (for example, number of files, mostly large files, or mostly small files), it might be appropriate to tune the
The Sun Forums team increased the cache max age to 120 from the default of 30 and enabled sendfilev usage. Site response seemed to improve as the server cached the content for a longer time, and the system load came down quite a bit. Sun Java System Web Server file caching is quite versatile and it allows:
The file cache is very scalable. Studies have shown it to scale to 32 Gbyte of static resources cached. Also, when increasing the max-open-files setting, you should review and tune the file descriptor limit, too, because there will be a relatively larger number of open file descriptors corresponding to the files cached. The fd-limit on the Solaris 10 OS is 64k.
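The relationship between the descriptor limit and max-open-files can be checked with a few lines of shell. This is only a sketch: the function name check_fd_headroom and the default reserve of 1024 descriptors (for sockets, log files, and so on) are assumptions chosen for illustration, not product defaults.

```sh
#!/bin/sh
# Report whether a file descriptor limit leaves room for the file cache.
# Usage: check_fd_headroom <fd_limit> <max_open_files> [reserve]
check_fd_headroom() {
    fd_limit=$1
    max_open_files=$2
    reserve=${3:-1024}   # assumption: illustrative reserve for sockets, logs, etc.
    if [ "$fd_limit" -ge $((max_open_files + reserve)) ]; then
        echo "ok"
    else
        echo "raise fd limit"
    fi
}
```

A typical call would be check_fd_headroom "$(ulimit -n)" 1024, run as the user that owns the server process. If ulimit -n reports "unlimited", the numeric comparison does not apply and the check can be skipped.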
Note: Do not use sendfilev unless you are on the Solaris 10 8/07 (s10u4-b12b) or later update releases or on OpenSolaris.
Sun Forums (forums.sun.com) is one of the busiest Web 2.0-style sun.com sites. It has growing traffic demands and has been upgraded successfully to Sun Java System Web Server 7.0, which offers greater deployment ease of use, performance, and scalability. Based on enterprise-class forum software (Jive Forums) and an Oracle database back end, Sun Forums is a classic web-tier Java deployment.
Deploying and taking on a new web site can pose a number of production usage issues. Web Server 7.0 made it easy for us to troubleshoot these deployment issues. This article showcases troubleshooting tools available to web site administrators who are using Web Server 7.0, and it demonstrates how the Sun Forums team resolved early deployment issues and made the new site launch a successful venture.
Murthy Chintalapati was an architect and is engineering manager for Sun's Web Tier products. Meena Vyas is a software engineer with the Sun Java System Web Server development engineering team. Ramachander Varadarajan is a systems architect at Sun Forums.
Here are additional resources:
Here is a sample of output from