Skip to main content

     
  TPF : Library : Newsletters
  Products > Software > Host Transaction Processing > TPF > Library > Newsletters >

Super Scalable Supported Socket Server Solutions

Mark Gambino, IBM TPF Development

One of the most frequently asked questions is, "How do I write a socket application?" The answer is, "That's easy, in C language." The more important question is, "How do I design a socket application?" That topic was the primary focus of the socket education class that was taught at the Spring 2000 TPF Users Group (TPFUG) meeting in Lost Wages (Las Vegas). Before we continue that discussion, we need to level set the requirements.

A simple utility program that generates a few dozen messages per minute is not interesting because even the most inefficient design would have no real impact on system performance. Instead, our first goal is to create a mainline socket application that is capable of scaling to well over 100 000 active connections. You could throw 5000 small servers together and try to manage that mess with operators on roller skates going around with floppy disks, machine to machine, trying to keep the database synchronized, or you could use TPF. The other goal is to design an application that communicates with a small set of remote nodes, but does so at a rate of hundreds of messages per second over each pipe. We will examine how a TPF server meets these requirements.

You have a large network with over 100 000 users. These include human beings (and managers) that log on to your TPF system in the morning and stay connected all day long. There are also computer-to-computer connections like automated teller machines (ATMs) that stay connected all the time. The UNIX model ties a socket to a process, meaning that when the process goes away, the socket is cleaned up. That model falls apart when your requirement is to have 100 000 active sockets. This is one reason why the TPF TCP/IP native stack implementation allows a socket to be shared by all processes (ECBs) in the system. Taking that statement a step further, sockets can remain active even when there are no active ECBs in the system. For capacity planning, the message rate (messages per second) generated by these users is the key statistic. For example, having a TPF system with 100 000 active sockets that generates a total of 2000 messages per second is virtually the same as a TPF system with 150 000 active sockets that also generates 2000 messages per second. The only difference in this case is that the second system would require approximately an extra 15 MB of storage to handle the control block structures for the additional 50 000 sockets. All other factors for these TPF systems could be identical. These include the processor capacity required to handle the load (meaning both TPF systems would be running at the same CPU utilization), the number of ECBs defined, and the number of IP routers required to handle the volume of traffic. How is that possible? Let's walk through the details.

In the environment described, a user starts a connection that remains active for a long period of time. For example purposes, assume the connections remain active for 8 hours (though no one works for only 8 hours a day anymore). An 8-hour time period at an average rate of 2000 messages per second yields a total of 57.6 million messages. Taking the best case, each message is three packets: a single packet in, followed by one packet out (which also acknowledges the input packet), and then the remote end sends an acknowledgment to the TPF output message. That totals 172.8 million packets in our example. Three flows are required to start a TCP connection and up to four flows are required to end the connection. Starting and ending 100 000 or even 150 000 connections accounts for less than 1% of the total packets that flow in the network in this example. Even though going from 100 000 to 150 000 connections is an increase of 50% in the number of connections, at the same message rate (2000 packets per second in this example) the total number of packets in the network increases by only 0.2%. This tiny increase in packets is the only effect on the network. With TCP/IP, only the endpoints of the connection have knowledge of the sockets; IP routers have no knowledge of sockets, nor do they require any storage per socket. Instead, the amount of routers you need is based solely on message rate (and message size). The number of routers needed for a rate of 2000 messages per second is the same regardless of the number of sockets it takes to generate that message rate. Now that we have shown that packet rate rather than number of sockets is what determines the network size requirement, let's look at the TPF server itself.

Each active socket requires a fixed amount of main memory storage on the TPF host for its control block structures. Having 100 000 active sockets requires approximately 30 MB of storage. The amount of storage required is directly proportional to the number of sockets; therefore, 150 000 sockets requires 45 MB of storage on TPF. The IP message table (IPMT) is a separate table that is used to store input messages until a socket application reads them and holds output messages until they are acknowledged by the remote end. The amount of storage needed for the IPMT is based on message size, message rate, and round-trip time (how long it takes the remote end to acknowledge output messages sent by TPF). In our example, the average rate was 2000 messages per second. However, message rates are never steady (otherwise, the capacity planning job would be too easy), so assume the peak message rate is 3000 messages per second. We also assume that there is always a read() or activate_on_receipt() (AOR) call pending when a message is received from the network by TPF, which is a fair assumption in this environment. Because there is always a read/AOR pending, input messages are passed directly to the application and bypass the IPMT. To keep the math simple, assume the average size of an output message is 4000 bytes and the average round-trip time is 1 second. The IPMT would need to be at least 12 MB in size to handle peak periods. To account for network failures and the occasional unexpected spike in traffic, you could play it safe and nearly triple the size and define the IPMT size as 35 MB in this example. What this means is that a single TPF processor (CEC) could have 150 000 active sockets processing 2000 messages per second (3000 at peak) and the TCP/IP stack requires only 80 MB of storage. So far, so good.

When we look at system performance, we assume the application is using the AOR model, meaning that there is an AOR pending for each active socket. When an input message is received by the TPF system, an ECB is created, the input message is passed to the application, the application issues another AOR (that will create a new ECB when the next message comes in on this socket), the application processes the input message, sends the output message, and exits. At 2000 messages per second, 2000 ECBs are created each second. This is true regardless of the number of active sockets. In the case of long-life sockets using the traditional sockets programming model, a read() is done on each socket, resulting in one active process for each active socket. For large networks, the traditional model is not practical; for example, good luck trying to manage 150 000 active ECBs in a TPF system or 150 000 active processes on any platform!

To enable the TPF system to be a truly large TCP/IP server, the AOR model was invented. Besides the obvious scalability advantages of the AOR model, it is also simple to implement. In the traditional TCP model, a new process or thread (ECB in the case of TPF) is created when a socket is created and the new process/thread/ECB sits in a loop issuing a read() to read in the next input message, processes that message, issues a write() to send the output message, and then issues another read(). This pattern continues until the socket ends. To use the AOR model, all you need to do is add an AOR call to the application program and exit instead of looping. It's that easy.

Figure 1 shows long-life sockets using the traditional TCP server model. The Internet daemon (INETD) listens for connections and, when a remote client connects, a new task is created to exchange data with this client for the life of the connection.

Newsletter Article Illustration

Figure 1. Traditional TCP Server Model

Figure 2 shows the same application using the AOR model instead. The INETD still listens for new connections and, when a remote client connects, the INETD issues the first activate_on_receipt(), which creates a new ECB (Application Program Instance 1) when the first input message is received on this socket, and the application program is activated. When the application program is activated, it issues a read() to get the input message and then issues another AOR. This will cause another instance of the application program to be activated in a new ECB when the next input message comes in on the socket. The application processes the input message and then sends the output message, the same as always. Rather than looping, the program exits in this case because the next input message coming in will kick off a new ECB to process it.

Newsletter Article Illustration

Figure 2. AOR TCP Server Model

So far, the discussion has been about a large number of sockets. Now, let's look at the opposite extreme with an application that has only 10 sockets, but each of those sockets generates 200 messages per second. With only 10 sockets, you might think that the traditional TCP server model is OK to use because there would only be 10 active ECBs. Having 10 active ECBs is not a problem; however, the fact that these are long-running ECBs does present problems. The AOR TCP server model solves the long-running ECB problems, but if you know 99% of the time when you finish processing one input message that the next input message has already been received from the network, why should the application program exit and have a new ECB created when the existing ECB could just process the next message? For situations like this, or for cases where application initialization is very expensive, you want to use a hybrid implementation that combines the advantages of the traditional and AOR server models. Each ECB processes a fixed number of input messages (50 in this example) using regular read() calls, issues AOR, and exits. Each ECB will process, at most, 50 messages and then exit; therefore, no long-running ECBs exist. Instead of having ECB and application initialization/exit overhead for each message, you only execute the initialization/exit code one time for every 50 messages. For times when traffic is light, you do not necessarily want the ECB to wait for 50 messages to arrive before the ECB exits. To prevent this from happening, run the socket in nonblocking mode and, if a read() indicates that there are no more input messages to process, have the application AOR and exit.

Figure 3 shows the application program logic to implement the hybrid TCP server model. The INETD is still set up to use the AOR model. When an application instance (ECB) is created, it places the socket in nonblocking mode using ioctl() to make sure that read operations do not wait for data to arrive. Next, the ECB loops to process 50 messages, or until there are no more input messages to process. After that, a new ECB is created to process the next batch of messages.

Newsletter Article Illustration

Figure 3. Hybrid TCP Server Model

Data flow control is done on a per socket basis and is the responsibility of the protocol layer. TCP uses a windowing mechanism where endpoint Y controls the rate of data sent by endpoint X to endpoint Y. The base TCP architecture supports window sizes up to 64 KB, meaning that a node could send, at most, 64 KB of data and would then have to wait for an acknowledgment from the remote end before it could send more data. In the 150 000 sockets example at peak (3000 messages per second), each socket generates only one message every 50 seconds on the average; therefore, data flow is not an issue for these sockets. The other application that generated 200 messages per second over each of its 10 sockets is more interesting when you look at data flow. That message rate with 4000-byte messages yields 800 000 bytes per second over each of these sockets. For applications like this, a 64-KB window size is probably not sufficient because 64-KB bursts of data with idle time in between waiting for acknowledgments limits the throughput on the socket. To take full advantage of high-bandwidth sockets (and networks), an extension to the TCP architecture was made called window scaling. If both endpoints of the socket support TCP window scaling, the window size can be well above 64 KB. TPF supports window scaling and window sizes up to 1 MB per socket.

Now that we have crunched the numbers, let's summarize the results in plain English:

  • The TPF system can handle a very large number of sockets and a high volume of data on individual sockets. These are the two extremes of TCP/IP networks, both of which can be supported at the same time by a single TPF system.
  • Server applications developed on other platforms using the traditional TCP server model require only minor modifications to use the AOR server model, which gives you enough scalability to roll out mainline TCP applications on TPF that your entire network can use.
  • While it is true that activate_on_receipt() is a unique function to the TPF system, that does not mean you need people with TPF-specific skills to write TCP applications to run on the TPF system. You can leverage the existing TCP application design skills in your shop without changing the core processing of the application at all, enable that application to run on TPF, and handle far more connections and traffic than is possible on other server platforms.

Third Quarter 2000 - Table of Contents