Separating Contenders from Pretenders
Mark Gambino, IBM TPF Development
The TCP/IP native stack road map for TPF 4.1 was presented
at the Fall 2003 TPF Users Group meeting. A copy of this presentation
is available at
http://www.ibm.com/software/htp/tpf/tpfug/tgf03/tgf03.htm
Each Sold Separately and Some Assembly Required
The purpose of this presentation was twofold: First, to organize
the more than 60 native stack enhancements into logical categories
such as availability, performance, security, and so on. Second,
to point out that not all TCP/IP stacks are the same and that
many quality of service (QoS) features that you expect and require
are optional features in the architecture or are purely roll-your-own
(RYO). This second point is expanded on in this article to help
explain how TPF differentiates itself from other servers and
TCP/IP stacks. Besides implementing the base requirements of the
Internet Protocol (IP) and Transmission Control Protocol (TCP)
architectures, TPF supports several optional features of the IP
and TCP architectures that have been developed in recent years
along with numerous features that are unique to the TPF platform.
A Long Time Ago In a Galaxy Far, Far Away...
Request for Comments (RFC) documents 791 and 793 define the
IP and TCP architectures, respectively. These documents define
the bits and bytes of packets that flow between nodes and the
sequence of events for starting and ending a given session. Both
RFCs came out in 1981. What were considered large buffers
and high-speed networks in 1981 are orders of magnitude smaller
and slower than their counterparts today; therefore, modern high-end servers
need to take into consideration the much higher throughput requirements
and how that affects the various TCP/IP timers and algorithms.
The base IP architecture was designed when a given application
ran on only one server instance, the physical network interface
on the server was a single point of failure, all packets in the
network (destined for different servers, applications, or both)
flowed at the same priority, and security was limited to physical
connectivity built around private trusted networks. Oh, how times
have changed!
What's Mine Is Mine
Today, most applications run on multiple server instances,
generically called server farms or clusters. TPF has had this
capability for decades with its loosely coupled feature. Some
servers are dedicated to a single application type (like Web servers,
file servers, or mail servers) while others, including most TPF
systems, run a variety of different applications on the same server.
In a distributed transaction environment, multiple heterogeneous
servers are involved in the processing of a given transaction.
For example, an end user sends a request message into server X
causing server X to send an authorization or availability query
to server Y. Once server X receives the response to its query,
it sends a reply message to the end user. The TCP/IP stack of
most systems implements process-scoped sockets. This means
that a given socket (TCP/IP session) is tied to a given process.
The process that creates a socket is the only process that can
use that socket and, if the process ends for any reason, the socket
is cleaned up. In the previous example, if server X implements
process-scoped sockets, the process must remain active while waiting
for the answer from server Y. If server Y takes, on average, 1
second to respond and the requirement is for server X to handle
500 messages per second, that means there would need to be 500
active processes running on server X. As the message rate increases
or the response time of server Y increases, the scalability concerns
of this design become more and more evident because having that
many active processes on server X is not possible.
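The arithmetic behind that figure can be sketched in a few lines.
This is only an illustration of the reasoning above (average busy
processes = message rate x wait time); the function name is
hypothetical and nothing here is TPF code.

```python
def processes_needed(messages_per_second, wait_seconds):
    """Average number of processes held open while waiting on server Y.

    With process-scoped sockets, each in-flight message pins one process
    for the whole wait, so the average number of busy processes is
    rate * wait.
    """
    return messages_per_second * wait_seconds


# The case from the text, then a slower server Y, then a higher rate.
for rate, wait in [(500, 1.0), (500, 2.0), (5000, 1.0)]:
    print(rate, "msg/s with a", wait, "s wait needs",
          processes_needed(rate, wait), "active processes")
```

Doubling either the message rate or the response time of server Y
doubles the number of processes that must stay active.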
We Share Because We Care
One of the fundamental design points of TPF TCP/IP native stack
is kernel-scoped sockets, where the system owns all sockets
rather than a socket being tied to a process (where an ECB represents
a process in TPF). Part of this design includes a TPF-unique capability
called activate_on_receipt (AOR). Let's look at the distributed
transaction example now.
ECB 1 in TPF (server X) receives the request from the end user
over socket 1. After ECB 1 sends the query to server Y over socket
2, ECB 1 uses the AOR function to tell TPF that when the response
from Y comes in on socket 2, create a new ECB (ECB 2) and pass
the response to the specified application program in ECB 2. ECB
1 can exit after issuing AOR. This means that no ECBs (processes)
are tied up (active) while waiting for data to arrive from server
Y. When ECB 2 is created, it will send the reply message to the
end user over socket 1 and then exit. Taking advantage of the
kernel-scoped sockets and AOR capabilities of TPF, you can scale
up to tens of thousands of messages per second on a single server
image. ECB 1 received the request message from the end user over
the socket, but ECB 2 sent the reply message. This is one example
of shared sockets, where the same socket can be used by
multiple ECBs (processes). Shared sockets are a very powerful feature
because, for example, you could create a single socket that is
used to log data to a remote system and have all ECBs send data
on that same socket, either over sockets directly or using a higher
level messaging protocol like MQSeries.
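The flow described above can be modeled with a small simulation. This
is a toy sketch, not the TPF API: the Kernel class, its method names,
and the two handler functions are all hypothetical stand-ins chosen
to show that no process stays alive while waiting for server Y, and
that the reply goes back on the end user's socket from a different
ECB than the one that read the request.

```python
class Kernel:
    """Toy stand-in for a system that owns all sockets (kernel-scoped)."""

    def __init__(self):
        self.pending = {}      # socket id -> program to start on next receipt
        self.active_ecbs = 0   # processes alive right now
        self.peak_ecbs = 0     # highest number alive at any one time
        self.sent = []         # (socket id, data) pairs written so far

    def run_ecb(self, program, *args):
        """Create a short-lived ECB, run the program, then let it exit."""
        self.active_ecbs += 1
        self.peak_ecbs = max(self.peak_ecbs, self.active_ecbs)
        try:
            program(self, *args)
        finally:
            self.active_ecbs -= 1

    def send(self, sock, data):
        self.sent.append((sock, data))

    def activate_on_receipt(self, sock, program):
        """Remember which program to start when data arrives on sock."""
        self.pending[sock] = program

    def data_arrived(self, sock, data):
        """Network event: start a new ECB for the registered program."""
        program = self.pending.pop(sock)
        self.run_ecb(program, data)


USER_SOCK, SERVER_Y_SOCK = 1, 2


def handle_request(kernel, request):       # runs in the first ECB
    kernel.send(SERVER_Y_SOCK, "query:" + request)
    kernel.activate_on_receipt(SERVER_Y_SOCK, handle_response)
    # The first ECB exits here; nothing waits for server Y.


def handle_response(kernel, response):     # runs in the second ECB
    kernel.send(USER_SOCK, "reply:" + response)


kernel = Kernel()
kernel.run_ecb(handle_request, "seat 12A?")
idle_while_waiting = kernel.active_ecbs    # no ECB is tied up at this point
kernel.data_arrived(SERVER_Y_SOCK, "available")
```

In this model the number of live ECBs depends only on how long each
handler runs, not on how long server Y takes to answer.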
Mobile Homes Coming to a Network Near You
IP is connectionless. Only the two nodes that are
the endpoints of the sockets have knowledge of individual sockets.
An IP packet contains the destination address, but does not specify
what path to take to reach that destination. IP routers keep track
of the available paths and select the path that a given packet
will take. If a router in the middle of the network fails, IP
will reroute traffic through another path (assuming alternate
paths exist), enabling sockets to survive the failure of a network
component. If a server has multiple network interfaces and one
of those interfaces fails, do you want your sockets to fail? Of
course not. If sockets are tied to a real IP address in the server,
those sockets are tied to the physical network interface associated
with that real IP address. In other words, if that network adapter
fails, the sockets fail as well. Virtual IP addresses (VIPAs)
were created to enable sockets to survive the failure of a network
adapter on a server. VIPAs accomplish this by allowing a VIPA
to be moved from one network adapter to another adapter on that
server. High-availability servers, like TPF, support VIPAs. TPF
has extended the concept of VIPA with movable VIPAs, which
allow a VIPA to be moved from one server to a network adapter
in another server in the loosely coupled complex.
This not only provides for even higher availability, but gives
you the ability to balance the load across servers in the TPF
complex.
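As a rough sketch of the idea, the model below tracks which adapter
currently hosts each VIPA and rehomes a VIPA when its adapter fails,
trying another adapter on the same server before moving to another
server in the complex. The class, the adapter names, and the "same
server first" policy are assumptions made for illustration; the
article does not describe how TPF chooses the new home.

```python
class Complex:
    """Toy model: which adapter on which server hosts each VIPA."""

    def __init__(self, adapters):
        self.adapters = dict(adapters)   # (server, adapter) -> is the adapter up?
        self.vipa_home = {}              # vipa -> (server, adapter)

    def assign(self, vipa, server, adapter):
        if not self.adapters[(server, adapter)]:
            raise ValueError("adapter is down")
        self.vipa_home[vipa] = (server, adapter)

    def adapter_failed(self, server, adapter):
        """Mark the adapter down and rehome every VIPA it was hosting."""
        self.adapters[(server, adapter)] = False
        for vipa, home in list(self.vipa_home.items()):
            if home == (server, adapter):
                self.vipa_home[vipa] = self._pick_new_home(server)

    def _pick_new_home(self, preferred_server):
        up = [key for key, is_up in self.adapters.items() if is_up]
        if not up:
            raise RuntimeError("no working adapter left in the complex")
        same_server = [key for key in up if key[0] == preferred_server]
        return (same_server or up)[0]


cx = Complex({("A", "adapter1"): True,
              ("A", "adapter2"): True,
              ("B", "adapter1"): True})
cx.assign("192.0.2.1", "A", "adapter1")

cx.adapter_failed("A", "adapter1")
after_first_failure = cx.vipa_home["192.0.2.1"]    # stays on server A

cx.adapter_failed("A", "adapter2")
after_second_failure = cx.vipa_home["192.0.2.1"]   # moves to server B
```

Because clients connect to the VIPA rather than to a real adapter
address, the address they use does not change in either case.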
Server Images Are Snowflakes: No Two Are Exactly the Same
In addition to movable VIPAs, the TPF Domain Name System (DNS)
server is another method to balance traffic across TPF servers
in the complex and across multiple network interfaces on a single
TPF server. DNS allows multiple IP addresses to be defined for
the host name representing the server complex. However, using
external DNS servers to select the server IP address to be used
for a given client session does not necessarily result in a balanced
load because external DNS servers assume all server images are
equal and active.
For example, what if one server image is running on a more
powerful processor than other server images, or if one server
image is currently running CPU-intensive utilities? In other words,
not all active server images have the same processing power available
for new transactions. In the case of a TPF complex, it is quite
common to expand the complex (add more server images) during peak
periods and then collapse the complex when the load drops off.
To overcome the problems of external DNS servers assigning more
work to an overloaded processor or selecting an IP address of
an inactive server image, TPF has its own internal DNS server
that can be used to balance traffic for connections to the TPF
complex. If you want to know the status of the server complex
regarding which server images and network interfaces are currently
active, and what the current load is on each server image, you
need to ask the server complex itself. The TPF DNS server always
responds with a usable (active) IP address and is customizable
to enable the path selection (load balancing) logic to take into
consideration whatever factors are appropriate to your environment.
The TPF DNS server has another important advantage: centralized
load balancing logic. The more external DNS servers that you allow
to do path selection for new sessions, the less likely it is that
you will end up with a balanced load on the server.
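A load-aware selection of this kind might look like the sketch below.
It skips inactive server images and picks the active one with the
most spare capacity. The ServerImage fields and the "spare capacity"
metric are assumptions for illustration only; the actual selection
logic in the TPF DNS server is customizable and is not shown in this
article.

```python
from dataclasses import dataclass


@dataclass
class ServerImage:
    ip: str
    active: bool
    capacity: float      # relative processing power (assumed metric)
    utilization: float   # fraction of capacity in use, 0.0 to 1.0


def pick_address(images):
    """Return the IP address of the active image with the most spare capacity."""
    candidates = [img for img in images if img.active]
    if not candidates:
        raise LookupError("no active server image")
    best = max(candidates,
               key=lambda img: img.capacity * (1.0 - img.utilization))
    return best.ip


images = [
    ServerImage("192.0.2.10", active=True, capacity=1.0, utilization=0.30),
    ServerImage("192.0.2.11", active=True, capacity=2.0, utilization=0.80),
    ServerImage("192.0.2.12", active=False, capacity=4.0, utilization=0.00),
]
chosen = pick_address(images)
```

Here the largest image is never returned because it is inactive, and
the busy image loses to the smaller but mostly idle one. An external
DNS server rotating through all three addresses would see none of
this.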
I'm Sorry, Sir. This Event Is by Invitation Only!
We have discussed the methods for deciding which path a client
should use to reach the server; however, that assumes this particular
client is allowed to connect to the server. That blind and trusting
assumption is not wise in this security-conscious era. Instead,
you should verify that this client is authorized to connect to
not only the server node itself, but to the requested server application.
At the network level, this can be done using firewall filter rules
or access control lists. A comprehensive security strategy should
include firewalls at the edge routers of your private network
and in server nodes as well. The TPF TCP/IP native stack includes
a built-in firewall that allows you to define filter rules to
control access to TPF applications from external users as well
as users on your private network. Incorporating the firewall into
the TPF native stack also has the benefit of being able to detect
and prevent denial of service (DoS) attacks that attempt to exploit
holes in the TCP/IP architecture. For end-to-end security, you
can implement secure sockets layer (SSL) functionality in your
applications. SSL-enabled applications are able to validate the
identity of the partner and exchange data in a secure manner over
public networks. Besides standard SSL support, TPF has shared
SSL support that provides TPF-unique capabilities like the
ability to share SSL sessions across multiple ECBs and AOR functionality
for SSL sessions.
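To make the filter rule idea concrete, here is a minimal sketch of
rule matching by client address and application port. The Rule
layout, the "first match wins" ordering, and the default of denying
unmatched traffic are assumptions for illustration; they are not the
TPF filter rule syntax.

```python
import ipaddress
from dataclasses import dataclass


@dataclass
class Rule:
    source: str   # CIDR block the rule applies to
    port: int     # local port of the server application
    allow: bool


def is_allowed(rules, client_ip, port):
    """First matching rule wins; anything unmatched is denied."""
    addr = ipaddress.ip_address(client_ip)
    for rule in rules:
        if rule.port == port and addr in ipaddress.ip_network(rule.source):
            return rule.allow
    return False


rules = [
    Rule("10.1.2.0/24", 8000, allow=False),  # block one internal subnet first
    Rule("10.0.0.0/8", 8000, allow=True),    # then allow the rest of the private network
    Rule("0.0.0.0/0", 443, allow=True),      # application open to external users
]
```

With these rules, an external client can reach the application on
port 443 but not the one on port 8000, and users on the private
network are filtered as well.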
We All Have Our Limits... Don't Push It!
A remote client requests a connection with the server. The
rules state that this client is authorized to connect to the specified
server application; therefore, it would seem that the server should
accept the connection request. Not necessarily.
If multiple applications run in the server, you might want
to limit the amount of resources that a given application can
use so that one application does not monopolize the entire server.
The TCP connection limiting support of TPF provides this capability
by allowing you to define the maximum number of active sessions
that are allowed for each TCP application in TPF. If the limit
is reached and a new connection request is received, the connection
request is rejected. By limiting the number of active sessions,
you can control the amount of network, CPU, and server resources
that the application can use. Connection limiting is valuable
for overload situations where the traffic rate is much higher
than normal, and for intentional floods during DoS attacks aimed
at trying to take down the server.
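The bookkeeping this implies is simple, as the sketch below suggests:
one counter per application, checked when a connection request
arrives. The class and method names are hypothetical and this is not
how the limits are configured on TPF.

```python
class ConnectionLimiter:
    """Tracks active sessions per application against a configured maximum."""

    def __init__(self, limits):
        self.limits = dict(limits)                    # application -> max sessions
        self.active = {name: 0 for name in self.limits}

    def try_accept(self, app):
        """Return True and count the session, or False if the limit is reached."""
        if self.active[app] >= self.limits[app]:
            return False                              # reject the connection request
        self.active[app] += 1
        return True

    def closed(self, app):
        """Release one session slot when a connection ends."""
        if self.active[app] > 0:
            self.active[app] -= 1


limiter = ConnectionLimiter({"web": 2, "mail": 1})
results = [limiter.try_accept("web") for _ in range(3)]   # third request is rejected
mail_ok = limiter.try_accept("mail")                      # other applications are unaffected
limiter.closed("web")
after_close = limiter.try_accept("web")                   # a freed slot can be reused
```

A flood aimed at one application exhausts only that application's
limit, which is the isolation the text describes.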
OK, You Can Come in, but We'll Be Keeping a Close Eye on You!
Some applications like Web servers and mail servers use short-lived
connections where a socket is started, only one or a few transactions
flow, and then the socket is closed. Connection limiting works
very well for this type of application. However, many applications
use long-life socket connections where the connection is started,
remains active for hours or even days, and is used for thousands
of transactions. For applications like this, it is not enough
to just make resource checks when the connection is first started;
resource checks must be made throughout the life of the connection.
This is where TPF traffic limiting support comes into play.
Traffic limiting allows you to define the maximum message rate
(in messages per second) for a given socket and for each application.
If the socket or application limit is reached and the application
attempts to read another message over this socket, the application
will be blocked, making it look like there is no message available
to read even if there are messages to read. Once the current time
interval expires, if there is a message available to read, the
application will be posted and passed the message. Traffic limiting
has the ability to control the rate at which input messages are
given to an application and does so without any changes required
to the application program. Similar to connection limiting, traffic
limiting is also valuable for overload situations where the traffic
rate is much higher than normal, and for intentional floods as
part of a DoS attack aimed at trying to take down the server.
Traffic limiting has additional benefits in that it can be used
for UDP applications as well as TCP applications. For TCP applications,
no messages are lost, even if the traffic limits are exceeded.
For UDP applications, if messages arrive faster than they are
allowed to be given to the application (based on the defined traffic
limits) and the socket receive buffer fills up, some input messages
may be lost. This is consistent with UDP behavior because even
if you do not use traffic limiting, messages can arrive faster
than the UDP application reads them, and if the socket receive
buffer becomes full, some input messages are lost.
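The per-socket behavior can be sketched as a counter that resets each
time interval. This is a simplified model with hypothetical names: it
covers only a single socket limit (not the per-application limit),
and time is passed in explicitly so the example is deterministic.
Messages stay queued while the limit is in effect, which matches the
TCP case described above where nothing is lost.

```python
from collections import deque


class TrafficLimitedSocket:
    """Hands at most max_per_interval queued messages to the reader per interval."""

    def __init__(self, max_per_interval, interval=1.0):
        self.max_per_interval = max_per_interval
        self.interval = interval         # length of one interval, in seconds
        self.queue = deque()             # messages received but not yet read
        self.window_start = 0.0
        self.delivered = 0               # messages handed over in this interval

    def message_arrived(self, msg):
        self.queue.append(msg)

    def read(self, now):
        """Return the next message, or None if none may be delivered yet."""
        if now - self.window_start >= self.interval:
            self.window_start = now      # a new interval starts; reset the count
            self.delivered = 0
        if self.delivered >= self.max_per_interval or not self.queue:
            return None                  # looks like "no message available"
        self.delivered += 1
        return self.queue.popleft()


sock = TrafficLimitedSocket(max_per_interval=2)
for msg in ("a", "b", "c"):
    sock.message_arrived(msg)

reads = [sock.read(0.0), sock.read(0.1), sock.read(0.2), sock.read(1.0)]
```

The third read returns nothing even though a message is queued, and
the same message is delivered once the next interval begins. The
reading application needs no changes because it only ever sees
"message" or "no message".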
You Cannot Put 10 Pounds in a 5-Pound Bag!
We have seen that TPF has methods for controlling the rate
at which traffic flows from clients to the TPF server, but what
about traffic flowing from the server to a client? For TCP sockets,
the remote client controls the rate at which traffic flows from
server to client. This is based purely on the available resources
of the client node, but there are other factors to consider. For
example, just because the client says that it is ready for
100 KB of data from the server, that does not mean the network can
handle a burst of traffic that large. If the network cannot handle
the data rate, the server needs to slow down the rate at which
it sends data. TPF has congestion control built into the TCP layer
based on RFC 2001. TPF has also implemented TCP congestion avoidance
mechanisms. Congestion control is reactive while congestion avoidance
is proactive. What does that mean?
Let's say that snow is falling and my sidewalk is getting slippery.
If I were purely reactive, I would wait until someone slipped
and fell, and then shovel some snow off the sidewalk. Next, I
would wait until the next person fell and then shovel more snow.
If I were proactive, I would see the snow building up and would
start shoveling before anyone has fallen in an effort to reduce
the likelihood that anyone does fall. Similarly, TCP congestion
control (reactive) waits for problems to occur (packets to become
lost in the network) and then takes action (reduces the rate at
which data is sent). TCP congestion avoidance (proactive) monitors
the round-trip times (RTTs) of messages to anticipate when congestion
is likely to occur and takes action before any packets
are lost. Congestion control mixed with congestion avoidance is
a powerful combination. It greatly reduces packet loss and increases
end-to-end throughput.
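The difference between the two approaches can be illustrated with a
single window-update function. This is a deliberately simplified
sketch and an assumption throughout: it is not the algorithm from
RFC 2001 and not the TPF implementation. The halving on loss, the
one-segment steps, and the RTT threshold of 1.5 times the baseline
are illustrative choices only.

```python
def next_window(cwnd, packet_lost, rtt, base_rtt, rtt_threshold=1.5):
    """Return the next send window, in segments.

    Reactive:  a lost packet halves the window.
    Proactive: an RTT well above the uncongested baseline trims the
               window before any loss has occurred.
    Otherwise: the window grows by one segment.
    """
    if packet_lost:
        return max(1, cwnd // 2)
    if rtt > rtt_threshold * base_rtt:
        return max(1, cwnd - 1)
    return cwnd + 1


BASE_RTT = 0.050                                        # seconds, measured when idle
grow = next_window(10, False, 0.050, BASE_RTT)          # network looks healthy
back_off_early = next_window(10, False, 0.100, BASE_RTT)  # RTT doubled, no loss yet
back_off_late = next_window(10, True, 0.100, BASE_RTT)    # a packet was lost
```

The proactive branch gives up one segment to avoid the much larger
cut that follows a loss, which is the snow-shoveling trade-off
described above.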
What Is This, a Television Mini-Series?
This concludes part 1 of our discussion about the capabilities
of the TCP/IP native stack that differentiate TPF from other
platforms. This article touched on high availability, load balancing,
sharing sockets, security, and methods for controlling traffic
flowing in and out of TPF. Part 2 will follow in a subsequent
TPF Systems Technical Newsletter edition and discuss performance,
advanced socket features available to TPF applications, and the
many diagnostic tools that are available.