 |
Super Scalable Supported Socket Server Solutions
Mark Gambino, IBM TPF Development
One of the most frequently asked questions is, "How do I write a
socket application?" The answer is, "That's easy, in C language."
The more important question is, "How do I design a
socket application?" That topic was the primary focus of the socket
education class that was taught at the Spring 2000 TPF Users Group
(TPFUG) meeting in Lost Wages (Las Vegas). Before we continue that
discussion, we need to level set the requirements.
A simple utility program that generates a few dozen messages per
minute is not interesting because even the most inefficient design
would have no real impact on system performance. Instead, our first
goal is to create a mainline socket application that is capable of
scaling to well over 100 000 active connections. You could throw
5000 small servers together and try to manage that mess with
operators on roller skates going around with floppy disks, machine
to machine, trying to keep the database synchronized, or you could
use TPF. The other goal is to design an application that
communicates with a small set of remote nodes, but does so at a
rate of hundreds of messages per second over each pipe. We will
examine how a TPF server meets these requirements.
You have a large network with over 100 000 users. These include
human beings (and managers) that log on to your TPF system in the
morning and stay connected all day long. There are also
computer-to-computer connections like automated teller machines
(ATMs) that stay connected all the time. The UNIX model ties a
socket to a process, meaning that when the process goes away, the
socket is cleaned up. That model falls apart when your requirement
is to have 100 000 active sockets. This is one reason why the TPF
TCP/IP native stack implementation allows a socket to be shared by
all processes (ECBs) in the system. Taking that statement a step
further, sockets can remain active even when there are no active
ECBs in the system. For capacity planning, the message rate
(messages per second) generated by these users is the key
statistic. For example, having a TPF system with 100 000 active
sockets that generates a total of 2000 messages per second is
virtually the same as a TPF system with 150 000 active sockets that
also generates 2000 messages per second. The only difference in
this case is that the second system would require approximately an
extra 15 MB of storage to handle the control block structures for
the additional 50 000 sockets. All other factors for these TPF
systems could be identical. These include the processor capacity
required to handle the load (meaning both TPF systems would be
running at the same CPU utilization), the number of ECBs defined,
and the number of IP routers required to handle the volume of
traffic. How is that possible? Let's walk through the details.
In the environment described, a user starts a connection that
remains active for a long period of time. For example purposes,
assume the connections remain active for 8 hours (though no one
works for only 8 hours a day anymore). An 8-hour time period at an
average rate of 2000 messages per second yields a total of 57.6
million messages. Taking the best case, each message is three
packets: a single packet in, followed by one packet out (which also
acknowledges the input packet), and then the remote end sends an
acknowledgment to the TPF output message. That totals 172.8 million
packets in our example. Three flows are required to start a TCP
connection and up to four flows are required to end the connection.
Starting and ending 100 000 or even 150 000 connections accounts
for less than 1% of the total packets that flow in the network in
this example. Even though going from 100 000 to 150 000 connections
is an increase of 50% in the number of connections, at the same
message rate (2000 packets per second in this example) the total
number of packets in the network increases by only 0.2%. This tiny
increase in packets is the only effect on the network. With TCP/IP,
only the endpoints of the connection have knowledge of the sockets;
IP routers have no knowledge of sockets, nor do they require any
storage per socket. Instead, the amount of routers you need is
based solely on message rate (and message size). The number of
routers needed for a rate of 2000 messages per second is the same
regardless of the number of sockets it takes to generate that
message rate. Now that we have shown that packet rate rather than
number of sockets is what determines the network size requirement,
let's look at the TPF server itself.
Each active socket requires a fixed amount of main memory
storage on the TPF host for its control block structures. Having
100 000 active sockets requires approximately 30 MB of storage. The
amount of storage required is directly proportional to the number
of sockets; therefore, 150 000 sockets requires 45 MB of storage on
TPF. The IP message table (IPMT) is a separate table that is used
to store input messages until a socket application reads them and
holds output messages until they are acknowledged by the remote
end. The amount of storage needed for the IPMT is based on message
size, message rate, and round-trip time (how long it takes the
remote end to acknowledge output messages sent by TPF). In our
example, the average rate was 2000 messages per second. However,
message rates are never steady (otherwise, the capacity planning
job would be too easy), so assume the peak message rate is 3000
messages per second. We also assume that there is always a
read() or activate_on_receipt() (AOR) call
pending when a message is received from the network by TPF, which
is a fair assumption in this environment. Because there is always a
read/AOR pending, input messages are passed directly to the
application and bypass the IPMT. To keep the math simple, assume
the average size of an output message is 4000 bytes and the average
round-trip time is 1 second. The IPMT would need to be at least 12
MB in size to handle peak periods. To account for network failures
and the occasional unexpected spike in traffic, you could play it
safe and nearly triple the size and define the IPMT size as 35 MB
in this example. What this means is that a single TPF processor
(CEC) could have 150 000 active sockets processing 2000 messages
per second (3000 at peak) and the TCP/IP stack requires only 80 MB
of storage. So far, so good.
When we look at system performance, we assume the application is
using the AOR model, meaning that there is an AOR pending for each
active socket. When an input message is received by the TPF system,
an ECB is created, the input message is passed to the application,
the application issues another AOR (that will create a new ECB when
the next message comes in on this socket), the application
processes the input message, sends the output message, and exits.
At 2000 messages per second, 2000 ECBs are created each second.
This is true regardless of the number of active sockets. In the
case of long-life sockets using the traditional sockets programming
model, a read() is done on each socket, resulting in
one active process for each active socket. For large networks, the
traditional model is not practical; for example, good luck trying
to manage 150 000 active ECBs in a TPF system or 150 000 active
processes on any platform!
To enable the TPF system to be a truly large TCP/IP server, the
AOR model was invented. Besides the obvious scalability advantages
of the AOR model, it is also simple to implement. In the
traditional TCP model, a new process or thread (ECB in the case of
TPF) is created when a socket is created and the new
process/thread/ECB sits in a loop issuing a read() to
read in the next input message, processes that message, issues a
write() to send the output message, and then issues
another read(). This pattern continues until the
socket ends. To use the AOR model, all you need to do is add an AOR
call to the application program and exit instead of looping. It's
that easy.
Figure 1 shows long-life sockets using the traditional TCP
server model. The Internet daemon (INETD) listens for connections
and, when a remote client connects, a new task is created to
exchange data with this client for the life of the connection.
Figure 1. Traditional TCP Server Model
Figure 2 shows the same application using the AOR model instead.
The INETD still listens for new connections and, when a remote
client connects, the INETD issues the first
activate_on_receipt(), which creates a new ECB (Application
Program Instance 1) when the first input message is received on
this socket, and the application program is activated. When the
application program is activated, it issues a read()
to get the input message and then issues another AOR. This will
cause another instance of the application program to be activated
in a new ECB when the next input message comes in on the socket.
The application processes the input message and then sends the
output message, the same as always. Rather than looping, the
program exits in this case because the next input message coming in
will kick off a new ECB to process it.
Figure 2. AOR TCP Server Model
So far, the discussion has been about a large number of sockets.
Now, let's look at the opposite extreme with an application that
has only 10 sockets, but each of those sockets generates 200
messages per second. With only 10 sockets, you might think that the
traditional TCP server model is OK to use because there would only
be 10 active ECBs. Having 10 active ECBs is not a problem; however,
the fact that these are long-running ECBs does present problems.
The AOR TCP server model solves the long-running ECB problems, but
if you know 99% of the time when you finish processing one input
message that the next input message has already been received from
the network, why should the application program exit and have a new
ECB created when the existing ECB could just process the next
message? For situations like this, or for cases where application
initialization is very expensive, you want to use a hybrid
implementation that combines the advantages of the traditional and
AOR server models. Each ECB processes a fixed number of input
messages (50 in this example) using regular read()
calls, issues AOR, and exits. Each ECB will process, at most, 50
messages and then exit; therefore, no long-running ECBs exist.
Instead of having ECB and application initialization/exit overhead
for each message, you only execute the initialization/exit code one
time for every 50 messages. For times when traffic is light, you do
not necessarily want the ECB to wait for 50 messages to arrive
before the ECB exits. To prevent this from happening, run the
socket in nonblocking mode and, if a read() indicates
that there are no more input messages to process, have the
application AOR and exit.
Figure 3 shows the application program logic to implement the
hybrid TCP server model. The INETD is still set up to use the AOR
model. When an application instance (ECB) is created, it places the
socket in nonblocking mode using ioctl() to make sure
that read operations do not wait for data to arrive. Next, the ECB
loops to process 50 messages, or until there are no more input
messages to process. After that, a new ECB is created to process
the next batch of messages.
Figure 3. Hybrid TCP Server Model
Data flow control is done on a per socket basis and is the
responsibility of the protocol layer. TCP uses a windowing
mechanism where endpoint Y controls the rate of data sent by
endpoint X to endpoint Y. The base TCP architecture supports window
sizes up to 64 KB, meaning that a node could send, at most, 64 KB
of data and would then have to wait for an acknowledgment from the
remote end before it could send more data. In the 150 000 sockets
example at peak (3000 messages per second), each socket generates
only one message every 50 seconds on the average; therefore, data
flow is not an issue for these sockets. The other application that
generated 200 messages per second over each of its 10 sockets is
more interesting when you look at data flow. That message rate with
4000-byte messages yields 800 000 bytes per second over each of
these sockets. For applications like this, a 64-KB window size is
probably not sufficient because 64-KB bursts of data with idle time
in between waiting for acknowledgments limits the throughput on the
socket. To take full advantage of high-bandwidth sockets (and
networks), an extension to the TCP architecture was made called
window scaling. If both endpoints of the socket support TCP
window scaling, the window size can be well above 64 KB. TPF
supports window scaling and window sizes up to 1 MB per socket.
Now that we have crunched the numbers, let's summarize the
results in plain English:
- The TPF system can handle a very large number of sockets and a
high volume of data on individual sockets. These are the two
extremes of TCP/IP networks, both of which can be supported at the
same time by a single TPF system.
- Server applications developed on other platforms using the
traditional TCP server model require only minor modifications to
use the AOR server model, which gives you enough scalability to
roll out mainline TCP applications on TPF that your entire network
can use.
- While it is true that
activate_on_receipt() is a
unique function to the TPF system, that does not mean
you need people with TPF-specific skills to write TCP applications
to run on the TPF system. You can leverage the existing TCP
application design skills in your shop without changing the core
processing of the application at all, enable that application to
run on TPF, and handle far more connections and traffic than is
possible on other server platforms.
Third Quarter 2000 - Table of Contents
|