IEN: 168
VAX-UNIX Networking Support Project
Implementation Description
Robert F. Gurwitz
Computer Systems Division
Bolt Beranek and Newman, Inc.
Cambridge, MA 02138
January, 1981
VAX-UNIX Networking January, 1981
Support Project IEN 168
1 Introduction
The purpose of this report is to describe the implementation
of network software for the VAX-11/780 * running UNIX. ** This is
being done as part of the VAX-UNIX Networking Support Project.
The overall purpose of this effort is to provide the capability
for the VAX to communicate with other computers via packet-
switching networks, such as the ARPANET. Specifically, the
project centers around an implementation of the DoD standard
host-host protocol, the Transmission Control Protocol (TCP) [4].
TCP allows communication with ARPANET hosts, as well as hosts on
networks outside the ARPANET, by its use of the DoD standard
Internet Protocol (IP) [3]. The implementation is designed for
the VAX, running VM/UNIX, the modified version of UNIX 32/V
developed at the University of California, Berkeley [1]. This
version of UNIX includes virtual paging capabilities.
In the following paragraphs, we will discuss some features
and design goals of the implementation, and its organization.
2 Features of the Implementation
2.1 Protocol Dependent Features
2.1.1 Separation of Protocol Layers
The TCP software that we are developing for the VAX
incorporates several important features. First, the
implementation provides for separation of the various protocol
layers so that they can be accessed independently by various
applications. (1) Thus, there is a capability for access to the
TCP level, which will provide complete, reliable, multiplexed,
host-host communications connections. In addition, the IP level
is also accessible for applications other than TCP, which require
its internet addressing and data fragmentation/reassembly
services. Finally, the implementation also allows independent
access to the local network interface (in this case, to the
ARPANET, whose host interface is defined in BBN Report No. 1822
[2]) in a "raw" fashion, for software which wishes to communicate
with hosts on the local network and do its own higher level
protocol processing.
_______________
* VAX is a trademark of Digital Equipment Corporation.
** UNIX is a trademark of Bell Laboratories.
(1) In this context, the terms application and user refer to any
software that is a user of lower level networking services. Thus,
programs such as FTP and TELNET can be considered applications
when viewed from the TCP level, and TCP itself may be viewed as
an application from the IP level.
2.1.2 Protocol Functions
Another feature of the implementation is that it provides the full
functionality of each level of protocol (TCP and IP), as
described in their specifications [3,4]. Thus, on the TCP level,
features such as the flow control mechanism (windows),
precedence, and security levels will be supported. On the IP
level, datagram fragmentation and reassembly will be supported,
as well as IP option processing, gateway-host flow control
(source-quenching) and routing updates. However, it is
anticipated that some of these features (such as handling IP
gateway-host routing updates, and IP option processing) will be
implemented in later stages of development, after more basic
features (such as TCP flow control and IP
fragmentation/reassembly) are debugged.
2.2 Operating System Dependent Features
2.2.1 Kernel Resident Networking Software
There are several features of the implementation which are
operating system dependent. The most important of these is the
fact that the networking software is being implemented in the
UNIX kernel as a permanently resident system process, rather than
a swappable user level process.
This organization has several implications which bear on
performance. The most obvious effect is that since the
networking software is always resident, it can more efficiently
respond to network and user initiated events, as it is always
available to service such events and need not be swapped in. In
addition, residence in the kernel removes the burden of the use
of potentially inefficient interprocess communication mechanisms,
such as pipes and ports, since simpler data structures, such as
globally available queues, can be used to transmit data between
the network and user processes. Kernel provided services (e.g.,
timers and memory allocation) also become much easier and more
efficient to use.
The large address space of the VAX makes this organization
practical and allows the avoidance of expedients like the NCP
split kernel/user process implementation that have been
necessary in previous UNIX networking software on machines with
limited address space, like the PDP-11/70. It is hoped that the
kernel resident approach will contribute to the speed and
efficiency of this TCP.
2.2.2 User Interface
Use of the "traditional" UNIX file oriented user interface
is another operating system dependent feature of this
implementation. The user will access the network software by
means of standard system file I/O calls: open, close, read, and
write. This entails modification of certain of these calls to
accommodate the extra information needed to open and maintain a
connection. In addition, the communication of exceptional
conditions to the user (such as the foreign host going down) must
also be accommodated by extension of the standard system calls.
In the case of open, for example, use of the call's mode field
will be extended to accommodate a pointer to a parameter
structure. In the case of exceptional conditions, the return
code for reads and writes will be used to signal the presence of
exceptional conditions, much like an error. An additional status
call (ioctl) will be provided for the user to determine detailed
information about the nature of the condition, and the general
status of the connection.
In this way, the necessary additional information needed to
maintain network communications will be supported, while still
allowing the use of the functionality that the UNIX file
interface provides, such as the pipe mechanism.
In the initial versions, this interface will be the standard
UNIX blocking I/O mechanism. Thus, outstanding reads for data
which has not been accepted from the foreign host, and writes
which exceed the buffering resources of a connection, will block.
It is expected that the await/capacity mechanism, currently
available for Version 6 systems, will be added to the VM/UNIX
kernel in the near future. These non-blocking I/O modifications
will be supported by the network software, relieving the blocking
restriction.
3 Design Goals
Several design goals have been formulated for this
implementation. Among these goals are efficiency and low
operating system overhead, promoted by a kernel resident network
process, which allows for reduced process and interprocess
communication overhead.
Another goal of the implementation is to reduce the amount
of extraneous data copying in handling network traffic. To
achieve this, a buffer data structure has been adopted which has
the following characteristics: intermediate size (128 bytes);
low overhead (10 bytes of control information per buffer); and
flexibility in data handling through the use of data offset and
length fields, which reduce the amount of data copying required
for operations like IP fragment reassembly and TCP sequence space
manipulations.
The use of queueing between the various software levels has
been limited in the implementation by processing incoming network
data to the highest level possible as soon as possible. Thus, an
unfragmented message coming from the network is passed to the IP
and TCP levels, with queueing taking place at the device driver
only until the message has been fully read from the network.
Similarly, on the output side, data transmission is only
attempted when the software is reasonably certain that the data
will be accepted by the network.
Finally, it is planned that the inclusion of the network
software will entail relatively little modification of the basic
kernel code beyond that provided by Berkeley. The only
modifications to kernel code outside the network software will be
slight changes to the file I/O system calls to support the user
interface described above. In addition, an extension to the
virtual page map data structure in low core will be necessary to
support the memory allocation scheme, which makes use of the
kernel's page frame allocation mechanisms.
4 Organization
4.1 Control Flow
4.1.1 Local Network Interface
The network software can be viewed as a kernel resident
system process, much like the scheduler and page daemon of
Berkeley VM/UNIX. This process is initiated as part of network
initialization. A diagram of its control and data flow is shown
in Figure 1.
  |  |-----|           |-----|  |-----|  |-----|                  |
  |  |LOCAL|  |-----|  |LOCAL|  |     |  |     |                  |
  |  | NET |  |input|  | NET |  | IP  |  | TCP |                  |
  |->|INPUT|->|queue|->|INPUT|->|INPUT|->|INPUT|                  |
  |  | I/F |  |-----|  |     |  |     |  |     |                  |
N |  |-----|==========>|-----|  |-----|  |-----|                  |
  |     ^    (wakeup)              ^          \      (timer)      |
E |     |                          |           \       /          | U
  |  (input)                       V            \     /           |
T |  ( int )                    |-----|          \   /            | S
  |                             |frag |         |-----|  |-----|  |
W |                             |queue|         |     |=>|     |->| E
  |                             |-----|         | TCP |  |USER |  |
O |                         |-----||-----|      |MACH |  | I/F |  | R
  |                         |unack|| snd |<---->|     |<=|     |<-|
R |  (outpt)                |queue||queue|      |-----|  |-----|  |
  |  ( int )                |-----||-----|      /    \        /   |
K |     |                                      /      \      /    |
  |     V                                     /        \    /     |
  |  |-----|           |-----|  |-----|  |-----|        \  /      |
  |  |LOCAL|  |-----|  |LOCAL|  |     |  |     |      |-----|     |
  |  | NET |  |outpt|  | NET |  | IP  |  | TCP |      | rcv |     |
  |<-|OUTPT|<-|queue|<-|OUTPT|<-|OUTPT|<-|OUTPT|      |queue|     |
  |  | I/F |  |-----|  |     |  |     |  |     |      |-----|     |
  |  |-----|<----------|-----|  |-----|  |-----|                  |
  |                                                               |
  |                                                               |
                    |<----------TCP PROCESS------------>|
  |                                                               |
             Figure 1. Network Software Organization
Its main flow of control is an input loop which is activated (via
wakeup) by the network interface device driver when an incoming
message has been completely read from the network. (It can also
be awakened by TCP user or timer events, described below.) The
message is then taken from an input queue and dispatched on the
basis of local network format (e.g., 1822 leader link number).
ARPANET imp-host messages (RFNMs, incompletes, imp/host status)
are handled at this level. For other types of messages, the
local network level input handler calls higher level "message
handlers." The "standard message handler" is the IP input
routine. Handlers for other protocols at this level (such as
UNIX NCP) may be accommodated in either of two ways. First, a
"raw message" service is available which simply queues data on
specified links to/from the local network. By reading or writing
on a connection opened for this service, a user process may
handle its own higher level protocol communication.
Alternatively, for frequently used protocols, a new handler may
be defined in the kernel and called directly.
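
The dispatch described above can be pictured as a table of handlers indexed by link number. The following is only a sketch under our own assumptions: the names (set_handler, local_net_input, the disposition labels) are hypothetical, the link number is assumed to fit in 8 bits, and the handlers are stubs that merely report where a message would go.

```c
#include <assert.h>
#include <stddef.h>

#define NLINKS 256              /* assumption: link numbers are 8 bits wide */

/* What became of a message; these labels are ours, for illustration. */
enum disposition { DROPPED, TO_IP, TO_RAW_QUEUE };

typedef enum disposition (*msg_handler)(const unsigned char *msg, size_t len);

/* One slot per link number; NULL means nothing is listening. */
static msg_handler handlers[NLINKS];

/* Stand-in for the IP input routine, the "standard message handler". */
enum disposition ip_input(const unsigned char *msg, size_t len)
{
    (void)msg; (void)len;
    return TO_IP;
}

/* Stand-in for the raw service: queue the message for a user process. */
enum disposition raw_input(const unsigned char *msg, size_t len)
{
    (void)msg; (void)len;
    return TO_RAW_QUEUE;
}

/* Install a handler for one link, e.g. when a raw connection is opened. */
void set_handler(unsigned link, msg_handler h)
{
    assert(link < NLINKS);
    handlers[link] = h;
}

/* Local network level input: dispatch on the link number in the leader. */
enum disposition local_net_input(unsigned link, const unsigned char *msg,
                                 size_t len)
{
    assert(link < NLINKS);
    if (handlers[link] == NULL)
        return DROPPED;
    return handlers[link](msg, len);
}
```

A new kernel handler for a frequently used protocol would be one more function installed with set_handler; the raw service installs raw_input on whichever links a user process has opened.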
4.1.2 Internet Protocol
At the IP level, the fragment reassembly algorithm is
executed. Unfragmented messages with valid IP leaders are passed
to the higher level protocol handler in a manner similar to the
lower level dispatch, but on the basis of IP protocol number.
The "standard handler" is TCP. Another protocol handler
interprets IP gateway-host flow control and routing update
messages.
Fragmented messages are placed on a fragment reassembly
queue, where incoming fragments are separated by source and
destination address, protocol number, and IP identification
field. For each "connection" (as defined by these fields), a
linked list of fragments is maintained, tagged by fragment offset
start and end byte numbers. As fragments are received, the
proper list is found (or a new one is created), and the new
fragment is merged in by comparing start and end byte numbers
with those of fragments already on the list. Duplicate data is
thrown away. A timer is associated with this queue, and
incomplete messages which remain after timeout are dropped and
their storage is freed. Completed messages are passed to the
next level.
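
The merging step described above can be sketched as follows. This is an illustration rather than the code of the implementation: it tracks only the byte ranges held on one reassembly list (not the data buffers), it omits the lookup of the proper list by address, protocol number, and identification field, and the names frag_insert, frag_complete, and frag_free are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

struct frag {
    unsigned     start, end;    /* byte range [start, end) held so far */
    struct frag *next;
};

struct reasm {
    struct frag *head;          /* sorted by start, never overlapping */
    int          have_last;     /* the final fragment has been seen */
    unsigned     total;         /* datagram length, valid if have_last */
};

/* Merge [start, end) into the list, throwing away bytes already held.
 * Returns the number of bytes newly stored, or -1 if out of memory. */
int frag_insert(struct reasm *r, unsigned start, unsigned end, int last)
{
    struct frag **pp = &r->head;
    struct frag *f, *n;
    unsigned stop;
    int added = 0;

    assert(start < end);
    if (last) {
        r->have_last = 1;
        r->total = end;
    }
    while (start < end) {
        f = *pp;
        if (f != NULL && f->end <= start) {     /* f lies wholly before us */
            pp = &f->next;
        } else if (f != NULL && f->start <= start) {
            start = f->end;                     /* duplicate data: skip it */
            pp = &f->next;
        } else {                                /* a gap: fill it up to f */
            stop = (f != NULL && f->start < end) ? f->start : end;
            n = malloc(sizeof *n);
            if (n == NULL)
                return -1;
            n->start = start;
            n->end = stop;
            n->next = f;
            *pp = n;
            added += (int)(stop - start);
            start = stop;
            pp = &n->next;
        }
    }
    return added;
}

/* A message is complete when the ranges run from 0 to total with no gap. */
int frag_complete(const struct reasm *r)
{
    const struct frag *f;
    unsigned next = 0;

    if (!r->have_last)
        return 0;
    for (f = r->head; f != NULL; f = f->next) {
        if (f->start != next)
            return 0;
        next = f->end;
    }
    return next == r->total;
}

/* Drop an incomplete message, as on reassembly timeout. */
void frag_free(struct reasm *r)
{
    while (r->head != NULL) {
        struct frag *f = r->head;
        r->head = f->next;
        free(f);
    }
}
```

In this sketch a fragment that overlaps data already on the list contributes only its new bytes, which is one way of realizing "duplicate data is thrown away."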
4.1.3 TCP Level
At the TCP level, incoming datagrams are processed via calls
to a "TCP machine." This is the TCP itself, which is organized
as a finite state machine whose states are roughly the various
states of the protocol as defined in [4], and whose inputs
include incoming data from the network, user
open/close/read/write requests, and timer events. Input from the
network is handled directly, passing through the above described
levels. User requests and timer events are handled through a
work queue.
When a user process executes a network request via system
call, the relevant data (on a read or write) is copied from user
to kernel space (or vice versa), a work entry is enqueued, and
the network process is awakened. Similarly, when timers
associated with TCP (such as the retransmission timer) go off,
timer work requests are enqueued and the network input process is
awakened. Once awakened, it checks for the presence of completed
messages from the network interface and processes them. After
these inputs are processed, the TCP machine is called to handle
any outstanding requests on the work queue. The network process
then sleeps, waiting for more network input or work requests.
Thus, the TCP machine may be called directly with network input,
or awakened indirectly to check its work queue for user and timer
requests.
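
The order of events after a wakeup can be sketched as a single pass over two sources of work. The sketch below is ours and makes several assumptions: the work queue is a fixed-size ring, the actual protocol processing is replaced by counters, and all names (work_enqueue, netproc_pass, and so on) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum work_type { W_OPEN, W_CLOSE, W_READ, W_WRITE, W_TIMER };

struct work {
    enum work_type type;
    int            conn;        /* index of the connection concerned */
};

#define WORKQ_SIZE 32

struct workq {
    struct work item[WORKQ_SIZE];
    size_t      head;           /* index of the oldest entry */
    size_t      count;          /* entries currently queued */
};

/* Called from a system call or a timer; returns -1 if the queue is full. */
int work_enqueue(struct workq *q, enum work_type type, int conn)
{
    size_t tail;

    assert(q != NULL);
    if (q->count == WORKQ_SIZE)
        return -1;
    tail = (q->head + q->count) % WORKQ_SIZE;
    q->item[tail].type = type;
    q->item[tail].conn = conn;
    q->count++;
    return 0;                   /* the caller would now wake the process */
}

/* Remove the oldest entry; returns -1 if the queue is empty. */
int work_dequeue(struct workq *q, struct work *out)
{
    if (q->count == 0)
        return -1;
    *out = q->item[q->head];
    q->head = (q->head + 1) % WORKQ_SIZE;
    q->count--;
    return 0;
}

struct netproc {
    struct workq work;
    int pending_input;          /* complete messages on the input queue */
    int inputs_handled;         /* counters standing in for real work */
    int work_handled;
};

/* One pass of the network process after a wakeup. */
void netproc_pass(struct netproc *np)
{
    struct work w;

    while (np->pending_input > 0) {     /* network input comes first */
        np->pending_input--;
        np->inputs_handled++;           /* local net -> IP -> TCP machine */
    }
    while (work_dequeue(&np->work, &w) == 0)
        np->work_handled++;             /* TCP machine handles the request */
    /* nothing left to do: the process would sleep here */
}
```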
After reset processing and sequence and acknowledgement
number validation, acceptable received data is sequenced and
placed on the receive queue. This sequencing process is similar
to the IP fragment reassembly algorithm described above. Data
placed on this queue is acknowledged to the foreign host.
Received data whose sequence numbers lie outside the current
receive window are not processed, but are placed on an
unacknowledged message queue. The advertised receive window is
determined on the basis of the remaining amount of buffering
allocated to the connection (see below). When buffering becomes
available, data on the unacknowledged message queue is then
processed and placed on the receive data queue.
On the output side, TCP requests for data transmission
result in calls to the IP level output routine. This routine
does fragmentation, if necessary, and makes calls on the local
network output routine. Outgoing messages are then placed on a
buffering queue, for transmission to the network interface by the
device driver. In data transmission, an attempt is made to
ensure that data moving from the highest level (TCP) will not be
sent unless there is reasonable certainty that the lower levels
will have the necessary resources to accept the message for
transmission to the network.
All data to be sent is maintained on a single send queue,
where data is added on user writes, and removed when proper
acknowledgement is received. Whenever the TCP machine sends
data, a retransmission timer is set, and the sequence number of
the first data byte on the queue is saved. After initial
transmission the sequence number of the next data to send is
advanced beyond what was first sent. If the retransmission timer
goes off before that data is acknowledged, the sequence number of
the next data to send is backed up, and the contents of the send
buffer (for the length determined by the current send window) are
retransmitted, with the ACK and window fields set appropriately.
The retransmission timer is set with increasingly higher values
from 3 to 30 seconds, if the saved sequence number does not
advance.
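
The retransmission timer policy can be sketched as below. The text gives only the range of values (3 to 30 seconds) and the condition for increasing them; the doubling rule used here is our assumption, as are the names rexmt_start and rexmt_expired.

```c
#include <assert.h>

#define REXMT_MIN  3            /* seconds, from the text */
#define REXMT_MAX 30            /* seconds, from the text */

struct rexmt {
    unsigned saved_seq;         /* first data byte on the send queue */
    unsigned interval;          /* current timer value in seconds */
};

/* Data has just been sent: remember the queue head, use the base value. */
void rexmt_start(struct rexmt *t, unsigned first_unacked)
{
    t->saved_seq = first_unacked;
    t->interval = REXMT_MIN;
}

/* The timer went off. snd_una is the first unacknowledged sequence
 * number now. Returns the value to use for the next timer setting. */
unsigned rexmt_expired(struct rexmt *t, unsigned snd_una)
{
    assert(t->interval >= REXMT_MIN && t->interval <= REXMT_MAX);
    if (snd_una != t->saved_seq) {
        /* Some data was acknowledged: start again from the base value. */
        t->saved_seq = snd_una;
        t->interval = REXMT_MIN;
    } else {
        /* No progress: back off (doubling is assumed), up to the limit. */
        t->interval *= 2;
        if (t->interval > REXMT_MAX)
            t->interval = REXMT_MAX;
    }
    return t->interval;
}
```

Under these assumptions a connection making no progress would see timer values of 3, 6, 12, 24, and then 30 seconds.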
A persistence timer is also set when data is sent. This
allows communication to be maintained if the foreign process
advertises a zero length window. When the persistence timer goes
off, one byte of data is forced out of the TCP.
4.2 Buffering Strategy
As mentioned earlier, all data is passed from the network to
the various protocol software layers in intermediate sized (128
byte) buffers. The buffers have two chain pointers, a data
offset, and a data length field (see Figure 2). As data is read
from the network or copied from the user, multiple buffers are
chained together. Protocol headers are also held in these
buffers. As messages are passed between the various software
levels, the offset is modified to point at the appropriate
header. The length field gives the end of data in a particular
buffer. This offset/length pair facilitates merging of messages
in IP fragment reassembly and TCP sequencing.
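
The way the offset/length pair avoids copying can be sketched with a buffer declaration and two small operations. The field widths below are our guess at how 10 bytes of control information might be divided (two 4 byte links, a 1 byte offset, and a 1 byte length); the links are shown as pool indices rather than pointers so that the structure stays 128 bytes on any machine; the length is treated as a count of valid bytes; and the names nb_pull and nb_push are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NBUF_SIZE 128
#define NBUF_HDR   10
#define NBUF_DATA (NBUF_SIZE - NBUF_HDR)        /* 118 bytes of data */

struct netbuf {
    uint32_t next;              /* next buffer in this message's chain */
    uint32_t qlink;             /* next message on the same queue */
    uint8_t  off;               /* offset of the first valid data byte */
    uint8_t  len;               /* number of valid bytes from off */
    uint8_t  data[NBUF_DATA];
};

void nb_init(struct netbuf *b)
{
    assert(b != NULL);
    memset(b, 0, sizeof *b);
}

/* First valid byte, i.e. the header of the level now looking at it. */
uint8_t *nb_data(struct netbuf *b)
{
    return b->data + b->off;
}

/* Pass a message up one level: step over n header bytes, no copying. */
int nb_pull(struct netbuf *b, unsigned n)
{
    if (n > b->len)
        return -1;
    b->off = (uint8_t)(b->off + n);
    b->len = (uint8_t)(b->len - n);
    return 0;
}

/* Pass a message down one level: expose n bytes in front of the data. */
int nb_push(struct netbuf *b, unsigned n)
{
    if (n > b->off)
        return -1;
    b->off = (uint8_t)(b->off - n);
    b->len = (uint8_t)(b->len + n);
    return 0;
}
```

Trimming duplicate bytes during fragment reassembly or TCP sequencing would be a similar adjustment of the offset or length, leaving the data itself in place.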
The allocation of these buffers is handled by the network
software. Buffers are obtained by "stealing" page frames from
the kernel's free memory map (CMAP). In VM/UNIX, these page
frames are 1024 bytes long, and thus have room for eight 128 byte
buffers. The advantage of using kernel paging memory as a source
of network buffers is that their allocation can be done totally
dynamically, with little effect on the operation of the overall
system. Buffers are allocated from a cache of free page frames,
maintained on a circular free list by the network memory
allocator. As the demand for buffers increases, new page frames
are stolen from the paging freelist and added to the network
buffer cache. Similarly, as the need for pages decreases, free
pages are returned to the system. To minimize fragmentation in
buffer allocation within the page frames, the free list is
sorted. When no more pages are available for allocation, data on
the IP reassembly and TCP unacknowledged data queues are dropped,
and their buffers are recycled.
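
The relationship between page frames and buffers can be sketched as follows. This is a simplified model, not the allocator of the implementation: malloc and free stand in for stealing page frames from, and returning them to, the paging system; a per-page bitmask replaces the sorted circular free list; a page is returned as soon as all of its buffers are free; and the names and the 64 page limit are our assumptions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define PAGE_SIZE   1024
#define BUF_SIZE     128
#define BUFS_PER_PG (PAGE_SIZE / BUF_SIZE)      /* eight buffers per page */
#define MAX_PAGES     64                        /* assumed page limit */

struct pool {
    unsigned char *page[MAX_PAGES];     /* NULL: no page frame held here */
    unsigned char  used[MAX_PAGES];     /* one bit per buffer in the page */
    int            npages;              /* page frames currently held */
};

/* Take a buffer from the lowest numbered page with room, which keeps
 * allocations packed; obtain a new page only when every held page is
 * full. Returns NULL at the page limit, where the caller would drop
 * data from the droppable queues and recycle those buffers. */
void *buf_alloc(struct pool *p)
{
    int pg, i, empty = -1;

    for (pg = 0; pg < MAX_PAGES; pg++) {
        if (p->page[pg] == NULL) {
            if (empty < 0)
                empty = pg;
            continue;
        }
        for (i = 0; i < BUFS_PER_PG; i++) {
            if (!(p->used[pg] & (1u << i))) {
                p->used[pg] |= (unsigned char)(1u << i);
                return p->page[pg] + i * BUF_SIZE;
            }
        }
    }
    if (empty < 0)
        return NULL;
    p->page[empty] = malloc(PAGE_SIZE);         /* "steal" a page frame */
    if (p->page[empty] == NULL)
        return NULL;
    p->npages++;
    p->used[empty] = 1;
    return p->page[empty];
}

/* Free a buffer; give the page back once all eight buffers are free. */
void buf_free(struct pool *p, void *b)
{
    int pg;

    for (pg = 0; pg < MAX_PAGES; pg++) {
        uintptr_t base = (uintptr_t)p->page[pg];
        uintptr_t addr = (uintptr_t)b;

        if (base != 0 && addr >= base && addr - base < PAGE_SIZE) {
            unsigned bit = 1u << ((addr - base) / BUF_SIZE);

            assert(p->used[pg] & bit);
            p->used[pg] &= (unsigned char)~bit;
            if (p->used[pg] == 0) {
                free(p->page[pg]);
                p->page[pg] = NULL;
                p->npages--;
            }
            return;
        }
    }
}
```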
      ^   |------------------------|   ^
      |   |     -> NEXT BUFFER     |   |
     10   |------------------------|   |
    BYTES |       QUEUE LINK       |   |
      |   |-----------|------------|   |
      V   |  OFFSET   |   LENGTH   |   |
          |-----------|------------|   |
          |                        |  128
          |                        | BYTES
          |                        |   |
          |        D A T A         |   |
          |                        |   |
          |                        |   |
          |                        |   |
          |                        |   |
          |------------------------|   V
        Figure 2. Layout of a Network Buffer
The number of pages that can be stolen from the system is
limited to a moderate number (in practice 64-256, depending on
network utilization in a particular system). To enforce fairness
of network resource utilization between connections, the number
of buffers that can be dedicated to a particular connection at
any time is limited. This limit can be varied to some small
degree by the user when a connection is opened. Thus, a TELNET
user may open a connection with the minimum 1K bytes of send and
receive buffering, while an FTP user, anticipating larger
transfers, might desire up to 4K of buffering. The effect of
this connection buffering allocation is to place a limit on the
amount of data that the TCP may accept from the user for sending
before blocking, and the amount of input from the network that
the TCP may acknowledge. Note that in receiving, the network
software may allocate available buffers beyond the user's
connection limit for incoming data. However, this data is
considered volatile, and may be dropped when buffer demands go
higher. Incoming data is acknowledged by TCP only until the
user's connection buffer limit is exhausted. The advertised TCP
flow control window for a connection is set on the basis of the
remaining amount of this buffering.
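
The effect of the connection buffering limit on the advertised window can be sketched as simple arithmetic. The 1K and 4K bounds come from the text; counting in bytes rather than in buffers, and the names used, are our assumptions.

```c
#include <assert.h>

#define CONN_BUF_MIN 1024       /* minimum buffering per direction */
#define CONN_BUF_MAX 4096       /* maximum a user may request */

struct conn_bufs {
    unsigned rcv_limit;         /* receive buffering granted at open */
    unsigned rcv_held;          /* acknowledged bytes not yet read */
};

/* Bring the amount requested at open time within the allowed range. */
unsigned clamp_buf_request(unsigned req)
{
    if (req < CONN_BUF_MIN)
        return CONN_BUF_MIN;
    if (req > CONN_BUF_MAX)
        return CONN_BUF_MAX;
    return req;
}

/* The advertised window is whatever buffering remains. */
unsigned advertised_window(const struct conn_bufs *c)
{
    assert(c->rcv_limit >= CONN_BUF_MIN && c->rcv_limit <= CONN_BUF_MAX);
    if (c->rcv_held >= c->rcv_limit)
        return 0;
    return c->rcv_limit - c->rcv_held;
}

/* How many bytes of an arriving segment may be acknowledged now. Any
 * remainder may still be held, but only as droppable data. */
unsigned ackable(const struct conn_bufs *c, unsigned seglen)
{
    unsigned win = advertised_window(c);

    return seglen < win ? seglen : win;
}
```

For example, a TELNET connection opened with the minimum 1K of buffering that already holds 1000 unread bytes would advertise a window of 24 bytes.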
Thus, the network software must ensure that it has enough
buffering for 1) its own internal use in processing data on the
IP and local network levels; 2) retaining acknowledged TCP data
that have not been copied to user space; and 3) retaining data
accepted by the TCP for transmission which have not yet been
acknowledged by the foreign host TCP. Other data, such as
unacknowledged TCP input from the network and fragments on the IP
reassembly queue, are vulnerable to being dropped when demand for
more buffers makes necessary the recycling of buffers on these
queues. Since there is an absolute limit on the number of page
frames that may be stolen from the paging system, and hence the
total number of buffers available, there is a resultant limit on
the total number of simultaneous connections.
Several data structures are required for stealing page
frames from the kernel and maintaining the buffer free list.
These include enough page table entries for mapping the maximum
number of page frames which can be stolen from the system, an
allocation map for allocating these page table entries, and the
free page list itself. For a 256 page maximum, this requires 2K
bytes of page tables, 1K bytes for page frame allocation mapping,
and another 1K bytes for the network freelist. The maximum page
parameter and others, including the minimum and maximum amount of
buffering that the user may specify, are modifiable constants of
the implementation.
4.3 Data Structures
Along with the data structures needed to support the buffer
management system, there are several others used in the network
software (see Figure 3). The focus of activity is the user
connection block (UCB), and the TCP control block (TCB). The UCB
is allocated from a table on a per connection basis. It holds
non-protocol specific information to maintain a connection. This
includes a pointer to the UNIX process structure of the opener of a
connection, (2) a pointer to the foreign host entry for the peer
process's host, a pointer to the protocol-specific connection
control block (for TCP, the TCB), pointers to the user's send and
receive data buffer chains, and miscellaneous flags and status
information. When a network connection is opened, an entry in
the user's open file table is allocated, which holds a pointer to
the UCB.
_______________
(2) For details on data structures specific to UNIX, see [5].
For TCP connections, a TCB is allocated. All TCBs are
chained together to facilitate buffer recycling. The TCB
contains a pointer to the corresponding UCB, a block of sequence
number variables and state flags used by the TCP finite state
machine, pointers to the various TCP data queues, and flags and
state variables. Protocols other than TCP would have their own
control blocks instead of the TCB. For the "raw" local network
and IP handlers, all necessary information is kept in the UCB.
                                        Foreign
                                       Host Table
                                       |--------|
                   Network     |------>|Host Adr|
                  Conn Table   |       |--------|
                  |--------|   |       | #RFNM  |
             |--->|->Proc  |<--+--|    |--------|
             |    |--------|   |  |    | Status |
             |    |->Host  |---|  |    |--------|
 Per User    |    |--------|      |
File Table   |    | ->TCB  |---|  |       TCB
|--------|   |    |--------|   |  |    |--------|
| Flags  |   |    |->S Buf |   |--+--->| ->next |
|--------|   |    |--------|      |    |--------|
| ->UCB  |---|    |->R Buf |      |----| ->UCB  |
|--------|        |--------|           |--------|
                  | Flags  |           |  FSM   |
                  |  and   |           |Sequence|
                  | Status |           |  Vars  |
                  |--------|           |--------|
                                       |->Snd Q |
                                       |--------|
                                       |->Rcv Q |
                                       |--------|
                                       |->UnackQ|
                                       |--------|
                                       | Flags  |
                                       |  and   |
                                       | Status |
                                       |--------|
            Figure 3. Network Data Structures
Finally, there is a foreign host table, where entries are
allocated for each host that is part of a connection. The entry
contains the foreign host's internet address, the number of
outstanding RFNMs for 1822 level host-imp communication, and the
status of the foreign host. Entries in this table are hashed on
the foreign host address.
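
A table hashed on the foreign host address might be organized as sketched below. The text does not say which hash function or collision scheme is used; the modulus hash, the chained buckets, the reference count that frees an entry when its last connection goes away, and all of the names are our assumptions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define NHASH 16

struct host {
    uint32_t     addr;          /* foreign host's internet address */
    unsigned     rfnm;          /* outstanding RFNMs */
    int          up;            /* status: nonzero if believed up */
    unsigned     refs;          /* connections using this entry */
    struct host *next;          /* next entry in the same bucket */
};

struct host_table {
    struct host *bucket[NHASH];
};

static unsigned host_hash(uint32_t addr)
{
    return (unsigned)(addr % NHASH);
}

/* Look up a host; returns NULL if no connection involves it. */
struct host *host_find(struct host_table *t, uint32_t addr)
{
    struct host *h;

    assert(t != NULL);
    for (h = t->bucket[host_hash(addr)]; h != NULL; h = h->next)
        if (h->addr == addr)
            return h;
    return NULL;
}

/* Called when a connection is opened: find or create the entry. */
struct host *host_enter(struct host_table *t, uint32_t addr)
{
    struct host *h = host_find(t, addr);

    if (h == NULL) {
        h = calloc(1, sizeof *h);
        if (h == NULL)
            return NULL;
        h->addr = addr;
        h->up = 1;
        h->next = t->bucket[host_hash(addr)];
        t->bucket[host_hash(addr)] = h;
    }
    h->refs++;
    return h;
}

/* Called when a connection is closed: free the entry if unused. */
void host_release(struct host_table *t, struct host *h)
{
    struct host **pp;

    assert(h != NULL && h->refs > 0);
    if (--h->refs > 0)
        return;
    for (pp = &t->bucket[host_hash(h->addr)]; *pp != NULL; pp = &(*pp)->next) {
        if (*pp == h) {
            *pp = h->next;
            free(h);
            return;
        }
    }
}
```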
5 References
[1] Babaoglu, O., W. Joy, and J. Porcar, "Design and
Implementation of the Berkeley Virtual Memory Extensions to
the UNIX Operating System," Computer Science Division, Dept.
of Electrical Engineering and Computer Science, University
of California, Berkeley, December, 1979.
[2] Bolt Beranek and Newman, "Specification for the
Interconnection of a Host and an IMP," Bolt Beranek and
Newman Inc., Report No. 1822, May 1978 (Revised).
[3] Postel, J. (ed.), "DoD Standard Internet Protocol," Defense
Advanced Research Projects Agency, Information Processing
Techniques Office, RFC 760, IEN 128, January, 1980.
[4] Postel, J. (ed.), "DoD Standard Transmission Control
Protocol," Defense Advanced Research Projects Agency,
Information Processing Techniques Office, RFC 761, IEN 129,
January, 1980.
[5] Thompson, K., "UNIX Implementation," The Bell System
Technical Journal, 57 (6), July-August, 1978, pp. 1931-1946.
Table of Contents
1 Introduction
2 Features of the Implementation
2.1 Protocol Dependent Features
2.1.1 Separation of Protocol Layers
2.1.2 Protocol Functions
2.2 Operating System Dependent Features
2.2.1 Kernel Resident Networking Software
2.2.2 User Interface
3 Design Goals
4 Organization
4.1 Control Flow
4.1.1 Local Network Interface
4.1.2 Internet Protocol
4.1.3 TCP Level
4.2 Buffering Strategy
4.3 Data Structures
5 References