Making Java Technology Faster Than C with LRWP

   
By Nagendra Nagarajayya, Ashish Banerjee, Ranjan Kumar, Vikas Gera, Gayathry Manikandan, and Dmitry Isakbayev,
August 2007
 

Introduction

Multi-threaded applications are a way to scale and meet today's growing business requirements while reducing the number of systems needed. However, a multi-threaded application's scalability is limited by the portions of its code that cannot run in parallel; these serial components cap scalability (see Amdahl's law [11]), as do problems with I/O. Our previous paper, Horizontal Scaling on a Vertical System Using Solaris Zones [1], described a workaround that scaled Xitami/NexSRS by running a copy of each in a Solaris Zone. Running a copy in each zone improved performance by more than 100% but still did not solve Xitami's underlying scalability problem. Possible solutions were to migrate the Long Running Web Process (LRWP) protocol to the Sun Web Server or to migrate NexSRS to the Netscape Server API (NSAPI). We decided to try something entirely different: implementing the LRWP protocol in Java, running in a web container. GlassFish had been open sourced at around this time, so we chose GlassFish to try the idea. We expected performance close to Xitami/NexSRS on smaller systems and better scaling on bigger systems.

The implementation turned out to be faster than Xitami/NexSRS -- notable because Xitami is a very small web server written in C and one of the top 10 web servers. We had expected our implementation merely to scale better on bigger CMT systems, but LRWP in Java was faster by 23% on a single-core system and by 76% on a 4-core system, scaling from a single core to multiple cores while Xitami leveled off at about 15K CPM on the 4-core system (Figure 1).

Figure 1: Xitami/C vs. GlassFish/Java Scaling
 
Long Running Web Process (LRWP)

LRWP is a protocol used by the Xitami web server to communicate with its peers. Peers are processes that communicate with web clients; web clients could be browsers or other types of clients communicating over HTTP. LRWP is similar to CGI, where a web client makes a request to a cgi-bin context, the web container invokes a cgi-bin executable, passes the input from the web client to the executable, and returns the output back to the web client. In LRWP, a TCP connection is established between the LRWP peer and an LRWP agent. The LRWP agent could be the web container or a process running within the web container, and the LRWP peer could be any process running on the network. On connecting, the LRWP peer registers the web context it is interested in. The web context could be any context, such as /osp, /tep, or /cgi-bin itself. When a request for that context arrives, the agent transfers the input to the LRWP peer and sends the output from the peer back to the web client. The LRWP agent also supports multiple peers at the same time; the peers could be different threads within one process or multiple processes. Each peer makes a connection and registers the context it is interested in.
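
As a rough illustration of the peer side of this exchange, the sketch below opens a TCP connection to the agent and registers a context. The registration message shown is only a placeholder; the actual message layout is defined by the LRWP specification [2], not by this sketch.

import java.io.OutputStream;
import java.net.Socket;

// Hypothetical peer-side registration. The "REGISTER" line below is a
// placeholder; the real registration message format is defined by the
// LRWP specification.
public class LrwpPeerSketch {
    public static void main(String[] args) throws Exception {
        String context = "/osp";                           // context this peer wants to serve
        Socket agent = new Socket("localhost", 1081);      // LRWP agent host and port
        try {
            OutputStream out = agent.getOutputStream();
            out.write(("REGISTER " + context + "\n").getBytes("US-ASCII"));
            out.flush();
            // After registration the peer would loop: read a request from the
            // agent on this connection, process it, and write the response back.
        } finally {
            agent.close();
        }
    }
}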

LRWP in Java

To implement this protocol in Java, we needed a web container to process HTTP. As the ISV was interested in a free, open source web container, we chose GlassFish, a Java Platform, Enterprise Edition (Java EE) application server built on top of the Apache Tomcat web container. The design was to use servlets to listen for HTTP requests and pass them to an LRWP agent running within the container. The LRWP agent would then pass each request to the right LRWP peer and pass the response back to the servlet. The LRWP agent registers the contexts that a peer is interested in and waits for requests on those contexts. When a request matches a context, the servlet thread goes to sleep on a context lock while the agent passes the request to the LRWP peer, waits for the peer's response, wakes up the servlet thread, and hands it the response. The servlet thread then returns the response to the web client.
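
The servlet-thread/peer handoff described above can be pictured with a simple wait/notify exchange. The sketch below is only illustrative; the class and method names are hypothetical and this is not the actual agent code.

// Illustrative only: a context lock that a servlet thread sleeps on while the
// LRWP agent forwards the request to the peer and waits for its response.
class ContextLock {

    private byte[] response;      // filled in by the peer-handling side
    private boolean done;

    // Called on the servlet thread: forward the request, then sleep until
    // the peer's response has been stored by complete().
    synchronized byte[] exchange(byte[] request, PeerConnection peer) throws InterruptedException {
        peer.send(request);       // pass the web client's input to the LRWP peer
        while (!done) {
            wait();               // servlet thread sleeps on the context lock
        }
        done = false;             // reset for the next request on this context
        return response;
    }

    // Called when the peer's response arrives: store it and wake the servlet thread.
    synchronized void complete(byte[] peerResponse) {
        response = peerResponse;
        done = true;
        notify();
    }
}

// Minimal abstraction of the agent's connection to one LRWP peer (hypothetical).
interface PeerConnection {
    void send(byte[] request);
}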

Figure: LRWP in Java
 

The LRWP agent has been implemented as a web application listening on the "/*" context, so every request comes to this application first. If the request is for a context registered by an LRWP peer, the request is passed on to that peer for further processing; if the request is for a context not registered with the LRWP agent, it is dispatched to the default servlet using the ServletContext RequestDispatcher object. Service providers (LRWP peer applications) that wish to provide LRWP service must register themselves with the LRWP agent using the LRWP protocol. After the initial exchange of messages defined by the LRWP protocol, the connection between the LRWP agent and the peer is established through an LRWP RequestHandler; an instance of RequestHandler is created per peer application. The peer application registers a relative URL context, and any requests to that context are forwarded to the peer application. An LRWP peer application can open more than one connection with the LRWP agent, registering the same context on each, to achieve load balancing. An LRWP peer application can also register more than one URL context.
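
For requests to contexts that no peer has registered, the hand-off to the container's default servlet can be done through a named RequestDispatcher. The sketch below shows the idea; the class and helper methods are hypothetical, not the agent's actual code.

import java.io.IOException;
import javax.servlet.RequestDispatcher;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Sketch of a "/*" front servlet: registered contexts go to the LRWP agent,
// everything else is forwarded to the container's default servlet.
public class LrwpFrontServlet extends HttpServlet {

    protected void service(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        String path = req.getRequestURI();                       // e.g. "/osp"
        if (isRegisteredWithAgent(path)) {
            forwardToPeer(req, resp);                            // handled over LRWP
        } else {
            // "default" is the container's built-in file-serving servlet.
            RequestDispatcher rd = getServletContext().getNamedDispatcher("default");
            rd.forward(req, resp);
        }
    }

    // Hypothetical helpers standing in for the agent's real lookup and forwarding.
    private boolean isRegisteredWithAgent(String path) { return false; }
    private void forwardToPeer(HttpServletRequest req, HttpServletResponse resp) { }
}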

Integration with an LRWP peer, NexSRS

Open Settlements Protocol (OSP) is an international standard for VoIP carriers that provides a secure mechanism for IP communication. An OSP server authorizes call setup between peer VoIP gateways. The source gateway (the originating gateway in a call setup) sends an authorization request message to the OSP server to obtain the IP address of a destination gateway that can complete the call to the dialed number. The OSP server sends an authorization response message back to the source gateway. The authorization response contains the IP address of the destination gateway that can complete the call to the dialed number, along with a digitally signed token to be used by the source gateway in the call setup. The source gateway uses the digitally signed token to connect to the destination gateway; the destination gateway verifies the token to make sure that it comes from a trusted source.

When the call is over, the source and destination gateways each send a UsageIndication message to the OSP server. The OSP server confirms each with a UsageConfirmation message back to the corresponding gateway (Figure 2).

Figure 2: UsageConfirmation Message
 

NexSRS is a multi-threaded OSP server that is also an LRWP peer. Clients communicate with NexSRS over HTTP, but NexSRS uses an external web server to process the HTTP requests. The external web server passes each client request to NexSRS over the LRWP protocol for processing. NexSRS connects to an LRWP agent at HOSTNAME:1081, registers multiple contexts such as /osp, /tep, and /cgi-bin, and waits for requests from web clients. Each context is handled by a separate thread. The LRWP agent registers the contexts, and when a web client makes a request to a URL such as http://hostname:1080/osp, the agent matches /osp to an LRWP peer and passes the request to it for processing.

Improving LRWP Performance

Tuning LRWP agent Java Code

The initial design used a multi-threaded server as the LRWP agent within the servlet container. Each LRWP peer registered with the server, which started a thread to handle the connection. The servlet would pass the request to the agent, which would wake up the peer's thread and send it the request from the web client while putting the servlet thread to sleep. The response from the peer would be passed back to the web client by waking up the servlet thread while putting the agent's peer thread back to sleep. This was modified to make use of the container's threading model: the servlet thread itself passes the request to the peer, waits for a response, and returns the response to the web client. In the revised design, an LRWP agent thread accepts connections from peers and registers each connection by creating a RequestHandler instance.

The agent maintained the handlers in a Vector, used as a list of LRWP RequestHandlers, and initially added and removed handlers as each request was forwarded to the LRWP peer and its response processed. This turned out to be a bottleneck: Vector is a synchronized structure, and access to the critical section in the method was also synchronized through a wait/notify mechanism. In the first design, looking up a RequestHandler meant iterating through the list to find a context match. This was changed by introducing a ContextAssistantManager object, which manages the list of RequestHandlers. Instead of adding and removing request handlers, the ContextAssistantManager tracks whether a handler is in use; a second request to the same context simply puts that servlet thread to sleep.

Code snippet of the ContextAssistantManager:

import java.util.Vector;

class ContextAssistantManager {

    // Fields assumed from the design described above.
    private Vector vect = new Vector();  // registered ProxyServiceWrapper handlers for this context
    private int vectSize = 0;            // number of registered handlers
    private int lastExec = 0;            // round-robin index of the last handler handed out

    // Returns the next free peer service for this context, round-robin.
    // ProxyServiceWrapper is assumed to implement HttpProxyService.
    public synchronized HttpProxyService getPeerService() {
        ProxyServiceWrapper wrpSvc = null;
        if (vectSize == 0) {
            return null;
        }

        do {
            // Wrap the index before use; >= avoids reading past the end of the list.
            if (lastExec < 0 || lastExec >= vectSize)
                lastExec = 0;
            wrpSvc = (ProxyServiceWrapper) vect.get(lastExec);
            lastExec++;
        } while (!(wrpSvc.isFree()));    // keep scanning until a free handler is found
        return wrpSvc;
    }
}
Code Sample 1: Code snippet of the ContextAssistantManager
 

Some of the other changes that could improve performance:

  1. Avoiding multiple copies
    A coding style that calls a function twice, once to check its return value and once to use it, for example:

    if (getValue() == null) {
            // handle the error
    } else {
            String value = getValue();
    }

    is expensive, because getValue() is invoked twice and may create two String objects (or two instances of whatever type it returns). It should be changed to call the function once and reuse the result:

    String value = getValue();

    if (value == null) {
            // handle the error
    }

  2. Avoiding allocating byte arrays
    Network code that receives from or sends to a socket works in bytes, so data held as a String must be converted, allocating a byte array for every send or receive. Instead of holding the data as a String or a byte array, direct-mapped buffers such as ByteBuffer let you avoid repeatedly creating and destroying these objects. For example, to send a request to a peer, the headers are currently built in a StringBuffer, converted to a String, and then converted again to bytes; they could instead be composed directly in a ByteBuffer (or through a CharBuffer view of it), and the same buffer could be reused to return the response, as in the sketch below.
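
A minimal sketch of that idea, assuming an NIO SocketChannel to the peer; the header text and buffer size are illustrative only.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;

// Sketch: compose request headers directly into a reusable direct ByteBuffer
// instead of building a StringBuffer, converting it to a String, and then
// copying its bytes into a fresh byte[] for every request.
public class HeaderBufferSketch {

    private final ByteBuffer out = ByteBuffer.allocateDirect(4096);   // reused across requests

    public void sendHeaders(SocketChannel channel, String context, int contentLength)
            throws IOException {
        out.clear();
        putAscii(out, "POST " + context + " HTTP/1.1\r\n");           // illustrative header lines
        putAscii(out, "Content-Length: " + contentLength + "\r\n\r\n");
        out.flip();
        while (out.hasRemaining()) {
            channel.write(out);            // no intermediate String/byte[] copy of the buffer
        }
    }

    // Append the characters of s to buf as single bytes (headers are ASCII).
    private static void putAscii(ByteBuffer buf, String s) {
        for (int i = 0; i < s.length(); i++) {
            buf.put((byte) s.charAt(i));
        }
    }
}
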
Tuning GlassFish

Tuning HTTPConnector Grizzly

GlassFish's HTTP connector, Grizzly, by default uses NIO to handle connection requests from clients. New I/O (NIO), introduced with JDK 1.4, provides scalable network and file I/O and native buffer management. NIO introduced channels; SocketChannel is a selectable channel, so multiple connections can be selected for reading and writing from a single thread. This eliminates the requirement of one thread per connection: servers can be built with a few threads that handle many client connections, increasing performance and eliminating per-thread overhead. A SocketChannel can be blocking or non-blocking. Grizzly provides both blocking and non-blocking implementations; by default it is non-blocking and starts with 2 threads and a maximum of 5 threads to handle client requests. This is tunable, and increasing the maximum number of threads to 10 gave the best performance. Increasing pool sizes also improved performance. The following pool sizes were increased:

<request-processing header-buffer-length-in-bytes="4096" initial-thread-count="2" request-timeout-in-seconds="30" thread-count="10" thread-increment="1"/>
<keep-alive max-connections="10000000" thread-count="1" timeout-in-seconds="30"/>
<connection-pool max-pending-count="14096" queue-size-in-bytes="14096" receive-buffer-size-in-bytes="14096" send-buffer-size-in-bytes="18192"/>
 

The keep-alive limit was also increased by raising max-connections to 10000000. There seems to be a problem with keep-alive, as increasing the count did not actually stop the server from making new connections. We handled this by changing tcp_time_wait_interval to 1000 and increasing the file descriptor limit.
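
For background on the NIO model discussed above, the minimal selector loop below shows how a single thread can service many connections. It is purely illustrative and is not Grizzly's implementation; the port and buffer size are arbitrary.

import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.Iterator;

// Minimal single-threaded NIO echo loop: one thread, many connections.
public class SelectorSketch {
    public static void main(String[] args) throws Exception {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.configureBlocking(false);
        server.socket().bind(new InetSocketAddress(9090));   // illustrative port
        server.register(selector, SelectionKey.OP_ACCEPT);

        ByteBuffer buf = ByteBuffer.allocateDirect(8192);
        while (true) {
            selector.select();                                // block until some channel is ready
            Iterator<SelectionKey> keys = selector.selectedKeys().iterator();
            while (keys.hasNext()) {
                SelectionKey key = keys.next();
                keys.remove();
                if (key.isAcceptable()) {
                    SocketChannel client = server.accept();   // new connection, no new thread
                    client.configureBlocking(false);
                    client.register(selector, SelectionKey.OP_READ);
                } else if (key.isReadable()) {
                    SocketChannel client = (SocketChannel) key.channel();
                    buf.clear();
                    if (client.read(buf) < 0) {               // peer closed the connection
                        key.cancel();
                        client.close();
                    } else {
                        buf.flip();
                        client.write(buf);                    // echo back what was read
                    }
                }
            }
        }
    }
}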

Tuning Garbage Collection
<jvm-options>-Xms3400m</jvm-options>
<jvm-options>-Xmx3400m</jvm-options>
<jvm-options>-XX:+UseParallelGC</jvm-options>
<jvm-options>-Xmn256m</jvm-options>
 

Using the parallel collector improved GlassFish performance to 27K CPM (Figure 1). The pause seen with the default collector on 4 cores disappeared with the parallel collector. Increasing the heap from 1400m to 3400m improved performance; increasing it further, to about 7 GB, should yield additional improvement. (GlassFish seemed to have a problem with the 64-bit JVM, so we could not try this.)

Tuning Solaris

Solaris 10 is tuned for performance out of the box. The tunables we used were tcp_time_wait_interval, set to 1000 so that connections do not stay in the TIME_WAIT state for the default interval of 4 minutes; an increased file descriptor limit; tcp_conn_req_max_q [6] increased to 10000; and tcp_conn_req_max_q0 increased to 10000. We also added two other tunables in /etc/system: ip:ipcl_conn_hash_sizes set to 10000, and tcp_ip_abort_interval set to 500.

        $ndd -set /dev/tcp tcp_time_wait_interval 1000
        $ndd -set /dev/tcp  tcp_conn_req_max_q 10000
        $ndd -set /dev/tcp  tcp_conn_req_max_q0 10000
 

The following were added to /etc/system

set ip:ipcl_conn_hash_sizes=10000
set tcp_ip_abort_interval=500
 

Setting ndd -set /dev/tcp tcp_time_wait_interval 1000 and increasing the file descriptor limit raised performance from 4500 CPM to the current numbers. This is related to the keep-alive problem described above.
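
The file descriptor limit itself is not shown above. One common way to raise it on Solaris 10 is through /etc/system (the values below are illustrative, not the ones used in these tests):

set rlim_fd_max=65536
set rlim_fd_cur=65536

Alternatively, ulimit -n can be raised in the shell that starts the application server.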

Running On an x86-Based System

Load Generation

The load was generated using a Sun Fire V280 (2 CPUs) and the ApacheBench tool. Three instances of ApacheBench were started from a script, each sending messages to a URL such as http://eagle:1080/osp.

%./ab.sol8 -p auth.xml -n 1000000 -c 10 -k http://eagle:1080/osp &
%./ab.sol8 -p src.xml  -n 1000000 -c 10 -k http://eagle:1080/osp &
%./ab.sol8 -p dest.xml -n 3000000 -c 30 -k http://eagle:1080/osp &
 

On the server side, an instance of GlassFish listened for requests on port 1080 and passed each request to the LRWP agent web application listening on the "/*" context.

Measuring CPS

Calls per second (CPS) was measured by tailing the nexus.log file; the log reports calls per minute (CPM), which is converted by dividing by 60 (for example, 27246 CPM is about 454 CPS). ApacheBench also reports requests per second at the end of each test, and this was compared against the log to confirm that the tests ran successfully.

System Performance

The tests were run on an x4100 (2 sockets with 2 cores each, 8 GB memory, 2.6 GHz) running Solaris 10. Cores were enabled and disabled using psradm, the Solaris dynamic processor configuration utility.

Table 1. CPU Usage with Xitami/NexSRS

    Cores   Calls per Minute   CPU Utilization (%)
                               nexus_server   Xitami
    1        8575              53             45
    2       12880              44             34
    4       15470              31             21
 
Table 2. CPU Usage with GlassFish (LRWP agent in Java)/NexSRS

    Cores   Calls per Minute   CPU Utilization (%)
                               nexus_server   GlassFish
    1       10569              68             28
    2       20578              66             26
    4       27246              43             21

 
Improvement in Performance

GlassFish (LRWP agent in Java)/NexSRS performance exceeded Xitami/NexSRS performance from a single core to 4 cores: GlassFish/NexSRS was 23% faster on a single core and 76% faster on 4 cores. With GlassFish, NexSRS uses about 68% of the CPU on a single core and about 43% on 4 cores; with Xitami, NexSRS uses about 53% on a single core and 31% on 4 cores. GlassFish itself averages about 25% CPU from a single core to 4 cores, while Xitami drops from about 45% on a single core to about 21% on 4 cores. GlassFish with NIO appears to use less CPU time than Xitami, allowing NexSRS to scale better.

Conclusions

"LRWP agent in Java with GlassFish" performs very well exceeding the "LRWP agent in C with Xitami" 1 performance from a single core to 4 cores. GlassFish with NIO scales extremely well from a single core to 4 cores and could see further improvement in performance on a 64bit JVM with an increased heap.

Acknowledgments

We would like to thank Satyajit Tripathi for excellent project management, managing resources across time zones and timelines and keeping communication efficient. We would also like to thank the GlassFish performance team of Scott Oaks and Jean-Francois Arcand for helping to tune the Grizzly HTTP connector, and Bruce Chapman for reviewing the paper and providing fine suggestions, including help with the chart.

About the Authors

Nagendra Nagarajayya has been working with Sun for the last 13 years. He is a Staff Engineer in ISV Engineering, working with Independent Software Vendors (ISVs) in the telecommunications (telco) industry on issues related to architecture, performance tuning, sizing and scaling, benchmarking, porting, and more. He specializes in multi-threading issues, concurrency and parallelism, HA, distributed computing, networking, and performance tuning.

Dmitry Isakbayev has worked at TransNexus since 1997 and leads all software development. TransNexus has been an innovator of commercial and open source VoIP Operations and Billing Support Systems (OSS/BSS) since 1997. Deployment of the TransNexus OSS/BSS solution provides wholesale VoIP carriers with an immediate increase in operational profits. Key features include Least Cost Routing, Quality of Service Routing, secure inter-domain VoIP peering, traffic analysis and control, management reports, new revenues from wholesale services and lower cost back-office operations.

Satyajit Tripathi is a computer engineer with more than 10 years of industry experience. He practices project management at Sun Microsystems and has previously worked on network identity management, a hospital SCM system, mobile SOA, and more.

Ashish Banerjee is an independent software developer with 20 years of programming experience. He is passionate about Solaris internals and Java technology.

Ranjan Kumar is a software professional presently working with Headstrong Inc. He has worked extensively on object-oriented and open-systems technologies. His white paper on network virtualization was published by IBM, and he is active in open source development as the administrator of a project on sourceforge.net.

Vikas Gera is a distributed computing specialist with 8 years of experience in C++ programming on the Solaris platform. He enjoys learning Japanese.

Download

Source for LRWP in Java/GlassFish

References
  1. Horizontal Scaling on a Vertical System Using Zones in the Solaris 10 OS, Ashutosh Kumar, et al.
  2. LRWP Protocol
  3. Xitami web server
  4. GlassFish
  5. JDK 1.6 Garbage Collector
  6. Solaris tunables
  7. OSP protocol (PDF)
  8. NexSRS (PDF)
  9. Threads
  10. Wikipedia
  11. Amdahl's law
 
Glossary

LRWP – Long Running Web Process, a protocol used by the Xitami web server to communicate with its peers. Peers are processes that communicate with web clients. LRWP is similar to CGI, but the peer maintains its connection with the web server across requests, increasing performance.

Xitami – An open source web server, written in C.

GlassFish – A Java EE open source application server.

User threads – Also known as fibers, these are threads implemented in a user-level library and run in user space. A user threads library can provide good performance but can also be a scalability bottleneck. Xitami makes use of its own user threads library.

Solaris threads – Threads on the Solaris OS. Solaris provides a 1x1 threading model, in which every application thread has a corresponding kernel thread.

POSIX threads (Pthreads) – Threads that adhere to the POSIX standard. The Solaris OS provides both a POSIX and a Solaris-specific threads API.

OSP – Open Settlements Protocol is an international standard for VoIP carriers that provides a secure mechanism for IP communication.

NexSRS – An OSP application from TransNexus.

LRWP Peer – Peers are processes that communicate with web clients. Web clients could be browsers or other types of clients communicating over HTTP. Peers use the LRWP protocol to communicate with the web server, which in turn communicates with the web client using HTTP.

LRWP Agent – A process that can communicate LRWP protocol with an LRWP peer. The LRWP agent could be the web container or a component running within the web container.

HTTP – Hypertext Transfer Protocol (HTTP) is a method used to transfer or convey information on the World Wide Web. Its original purpose was to provide a way to publish and retrieve HTML pages.

CPM – Calls Per Minute. A call could be a VoIP, mobile or fixed line call.

CPS – Calls Per Second.

CPU – A central processing unit (CPU), or sometimes simply processor, is the component in a digital computer that interprets computer program instructions and processes data. CPUs provide the fundamental digital computer trait of programmability, and are one of the necessary components found in computers of any era, along with primary storage and input/output facilities.

ApacheBench (ab) – A command line program for measuring the performance of HTTP web servers, in particular the Apache HTTP Server. It was designed to give an idea of the performance a given Apache installation can provide; in particular, it shows how many requests per second the server is capable of serving [11].

Grizzly (HTTP Connector) – The HTTP Connector used by GlassFish.

NIO – A collection of Java programming language APIs that offer features for intensive I/O operations. It was introduced with the J2SE 1.4 release of Java by Sun Microsystems to complement an existing standard I/O. NIO was developed under the Java Community Process as JSR 51.

ServletContext – Servlets allow a software developer to add dynamic content to a web server using the Java platform. The generated content is commonly HTML but may be other data such as XML. Servlets are the Java counterpart to non-Java dynamic web content technologies such as PHP, CGI, and ASP.NET, and can maintain state across many server transactions by using HTTP cookies, session variables, or URL rewriting. There is only one ServletContext per web application; this object can be used by all servlets to obtain application-level information or container details.

CMT – Today's traditional single-core processors can only process one thread at a time, spending a majority of their time waiting for data from memory. In sharp contrast, chip multithreading (CMT) refers to a processor's ability to process multiple software threads. A CMT processor could implement this multithreaded capability using a variety of methods, such as (i) having multiple cores on a single chip (CMP), (ii) executing multiple threads on a single core (SMT), or (iii) a combination of both CMP and SMT.

Solaris Zones – Solaris Containers (including Solaris Zones) is a virtualization feature first available with Solaris 10. This is an implementation of operating system-level virtualization technology.


Note 1: Processor sets were not tried. Processor sets could help improve Xitami's performance.
