IEN 104
Minutes of the Fault Isolation Meeting
12 March 1979
Virginia Strazisar
Bolt, Beranek, and Newman
20 March 1979

Minutes of the Fault Isolation Meeting held at BBN on March 12

Attendees:
Virginia Strazisar, BBN, chairman
Peter Sevcik, BBN
Dale McNeill, BBN
Noel Chiappa, MIT
Ray McFarland, DOD
Mike Wingfield, BBN
Jack Haverty, BBN
Bill Plummer, BBN
Mike Brescia, BBN

Ginny suggested that there are three situations in which fault
isolation is needed: 1) the user at a terminal on the catenet
who cannot reach some destination on the catenet, 2) a catenet
control center that must decide what network or gateway in the
catenet has failed, and 3) the gateway implementor who must
decide what part of the gateway hardware or software has failed.
These situations were put forth as a framework for discussing the
types of fault isolation facilities that we need. Ginny stated
that the object of the meeting was to draw up a list of fault
isolation tools needed, giving special consideration to what
situations each of these tools would be used in and what
questions they could be used to answer. From the suggestions
drawn up at the meeting, the detailed formats and protocols could
be designed; this level of design was specifically avoided at the
meeting.

The first situation discussed was the user at a catenet terminal,
who discovers that he either cannot connect to a particular
destination host or that he no longer gets any response from his
previously working connection. At present no information is
passed to the user in either of these cases. Everyone agreed
that the user should receive some error reply. It was suggested
that the user should receive a response indicating that either 1)
the destination host is unreachable, 2) the local gateway or
network is unreachable, or 3) the catenet is inoperative. Most
people agreed that the naive user does not care to know what the
catenet problems are in any more detail than this. For example,
an error message of the form "Can't reach destination network
because gateway 3 is down" would be totally useless to the naive
user. The user also wants to know when the service will be
restored: either "within a short time," in which case the user is
willing to wait for the service to be restored, or "not for a
long time," in which case the user will give up on the service
for the time being. Several people pointed out that a more
sophisticated user may want to know exactly what component of the
catenet failed. There was some discussion as to whether users
should be given access to tools that would enable them to probe
the catenet gateways to determine where the failure occurred.
The consensus was that the user should be given access
to such tools, but that no user should be required to use such
tools. Our model was that the naive user, on receiving an error
message, would call a network or catenet control center, whereas
the more sophisticated user may attempt to track down the problem
before contacting the control center.

We discussed in more
detail what sort of message a gateway could return to the user.
It was suggested that if the network returned an error message
about a specific host, the text of that error message should be
returned verbatim to the user. It was also suggested that error
codes be defined for "common" failures, e.g. net down, host down,
and that these be included in the error message. It was pointed
out that the gateways currently return messages to the source
host if they believe (based on their routing information) that
the destination network is unreachable. These messages contain
the source and destination addresses and the protocol field from
the original datagram. Several people pointed out that this
information is insufficient to return an error message to the
source user and that the entire internet header of the original
datagram should be returned in the error message.

We discussed
the problem of what to do in the case where datagrams are lost in
a gateway or network in such a manner that no error message is
generated and returned to the source. It was decided in this
case that the source host should automatically probe the gateways
in order to return a reasonable status message to the user. It
was assumed that the user is running a program that implements
some type of internet protocol, such as TCP, and that that
program is capable of detecting long delays or multiple
retransmissions and of generating some type of probe packet to
attempt to track down the failure when this occurred. These
probe packets are discussed in more detail below. Information
obtained from such probing could also be sent to a monitoring
center.
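
As a concrete illustration of such an error message, the sketch
below shows one possible layout: a code for the common failure
classes, a hint about when service might be restored, and the
entire internet header of the offending datagram copied in
verbatim so that the source host can match the error to the right
connection. No format was defined at the meeting; all names and
field sizes here are hypothetical, given in C.

    /* Hypothetical layout for the gateway error message discussed
       above; none of these names or sizes were agreed on. */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    enum fault_code {
        FAULT_DEST_HOST_UNREACHABLE = 1, /* destination host unreachable */
        FAULT_LOCAL_UNREACHABLE     = 2, /* local gateway or net unreachable */
        FAULT_CATENET_DOWN          = 3  /* catenet inoperative */
    };

    struct fault_report {
        uint8_t code;            /* one of enum fault_code */
        uint8_t restore_hint;    /* 0 = "within a short time",
                                    1 = "not for a long time" */
        uint8_t header_len;      /* octets of copied internet header */
        uint8_t orig_header[60]; /* internet header of the datagram that
                                    provoked the error, copied verbatim */
    };

    int main(void) {
        uint8_t datagram[20] = { 0x45 }; /* stand-in internet header */
        struct fault_report r;

        r.code = FAULT_DEST_HOST_UNREACHABLE;
        r.restore_hint = 0;              /* "within a short time" */
        r.header_len = sizeof datagram;
        memcpy(r.orig_header, datagram, sizeof datagram);

        printf("fault code %u, %u header octets returned\n",
               r.code, r.header_len);
        return 0;
    }
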
We discussed the concept of a monitoring or control center. The
primary purpose of a monitoring or control center in terms of
fault isolation is to isolate the component (network or gateway)
that failed and to notify the proper authority to have it fixed.
We felt that a control center was needed to avoid having all the
users in the catenet calling any and all implementors they felt
might be responsible for problems. The concept of a single
control center was discussed and rejected for both technical and
political reasons. From the technical point of view, it was
pointed out that the catenet could become partitioned such that
the control center was cut off from part of the catenet and thus
could no longer handle faults in that portion of the catenet. On
the political side, it was pointed out that organizations
responsible for the individual networks may be unwilling to
support one control center run by one organization. We agreed
that the catenet control center should actually be multiple
control centers. These could be either the existing network
control centers working in co-operation or separate catenet
control centers, each of which was established by co-operating
network groups. Tools that these control centers would need
included a facility to probe gateways to determine why a
particular destination was unreachable.

We elaborated slightly on the design of a facility for probing
gateways. A host or control center sends its local gateway a
message saying "poll the gateways in the catenet to determine why
I cannot get to destination X". The gateway then polls its
neighbors, its neighbors' neighbors, etc., extracting routing
tables, addresses of neighbor gateways, status of neighbor
gateways and networks, etc. to determine why the destination is
unreachable. The gateway would then formulate a response to the
host; this response would be of the form: "the network
connection between gateway 3 and net 2 is down", "gateway 5 and
gateway 6 are down", etc. This mechanism would be an extension
of the gateway-gateway protocol as defined in IEN #30. This
probe facility would be used by the source host to generate a
message to the user in the case where no response is received
from the destination and no error message is returned by the
gateways. The facility would also be used by catenet control
centers to isolate the component of the catenet that has failed.
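
As a rough sketch of the polling idea, the fragment below models
the catenet as a handful of gateways with an up/down status and
walks outward from the local gateway breadth-first, standing in
for "poll the neighbors, the neighbors' neighbors, etc.", and
reports any down gateway found on the frontier. The topology,
and the assumption that a single status bit per gateway is
obtainable, are inventions for illustration only.

    /* Breadth-first poll outward from the local gateway (gateway 0),
       reporting down gateways; topology and status are made up. */
    #include <stdio.h>

    #define NGW 6

    static const int adj[NGW][NGW] = {   /* 1 = neighbors */
        {0,1,0,0,0,0}, {1,0,1,0,1,0}, {0,1,0,1,0,0},
        {0,0,1,0,0,0}, {0,1,0,0,0,1}, {0,0,0,0,1,0}
    };
    static const int up[NGW] = {1,1,1,0,1,0};

    int main(void) {
        int seen[NGW] = {0}, queue[NGW], head = 0, tail = 0;
        seen[0] = 1;
        queue[tail++] = 0;                   /* start at local gateway */
        while (head < tail) {
            int g = queue[head++];
            for (int n = 0; n < NGW; n++) {
                if (!adj[g][n] || seen[n]) continue;
                seen[n] = 1;
                if (!up[n])
                    printf("gateway %d is down\n", n); /* report to host */
                else
                    queue[tail++] = n;       /* poll its neighbors next */
            }
        }
        return 0;
    }
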
It was pointed out that we should be concerned not only with
total failures, but also with system performance, especially
delay. In this context, we were not concerned with cases where
delay seemed slightly longer than usual, but rather cases in
which traffic crossed the catenet with extremely high delays,
i.e., several minutes. A facility was suggested to track this sort
of problem: generate a packet from source A addressed to
destination B; have each gateway on the route to B record itself
and a timestamp in the packet; at B, echo the packet; return
the packet to the source, A, using source routing and the route
stored in the packet via the trace mechanism; timestamp the
packet on its route back to A. The timestamps in the packet
could now be interpreted to yield transit times across each
network as there would be a pair of timestamps for each gateway
traversed.
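
A small sketch of how the timestamps from one such probe might be
interpreted: assume one timestamp (in milliseconds) per gateway
crossing, recorded in order on the way out and in reverse order on
the way back, so that differences between stamps at consecutive
gateways give the transit time across each intervening network in
each direction. The numbers below are invented.

    /* Interpret paired timestamps from a traced, echoed, source-routed
       probe; gateway names and stamp values are hypothetical. */
    #include <stdio.h>

    int main(void) {
        long outbound[] = { 100, 450, 700 };   /* stamped at G1, G2, G3 out */
        long inbound[]  = { 960, 1210, 1580 }; /* stamped at G3, G2, G1 back */
        int n = sizeof outbound / sizeof outbound[0];

        for (int i = 0; i + 1 < n; i++)
            printf("net between G%d and G%d: %ld ms out, %ld ms back\n",
                   i + 1, i + 2,
                   outbound[i + 1] - outbound[i],   /* G(i+1) -> G(i+2) */
                   inbound[n - 1 - i] - inbound[n - 2 - i]);
        return 0;
    }
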
The final stage of fault isolation is the situation in which the
failure has been attributed to a particular gateway and the
implementor of that gateway must debug it. This part of fault
isolation was not discussed in detail. It was suggested that at
this point, it would be very useful to be able to turn off
timeouts in the catenet to avoid having the state of the catenet
change in such a way that the problem can no longer be isolated.

In summary, the following list of tools and situations in which
they would be used was suggested.

1) Error messages indicating whether the destination host, the
   local network or gateway, or the catenet had failed, and
   indicating the time at which service should be restored.

   These are to be returned automatically to the catenet user
   whenever there is a failure in using a catenet service.

2) Gateway-to-gateway probing mechanism that can be initiated
   with a host-to-gateway message.

   This mechanism would be used by a control center to isolate a
   component failure. It would also be available to the user. It
   would be used by source host protocol programs to formulate an
   error message for the user when no response was received from
   the destination and no error message was received from the
   gateways.

3) Ability to trace, echo, and source route packets with
   timestamping.

   This facility would be used to determine where delays are
   occurring when a destination is reachable, but delays cannot be
   accounted for.

4) Ability to echo packets off any gateway.

5) Ability to trace packets.

6) Ability to source route packets.

7) Ability to dump gateway tables.

8) Ability to trace packets by sending replies from every
   gateway that handles the packet.

These capabilities would be used by control centers and gateway
implementors to isolate failed components and determine the
reasons for failure. These facilities were not discussed in
detail. A description of mechanisms for tracing packets and
source routing packets was given in IEN #30, although these have
not yet been implemented.
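
To make the trace and source route items slightly more concrete,
the sketch below shows the kind of option such a mechanism might
carry: a list of gateway addresses with a pointer that each
gateway advances. This is not the IEN #30 format, which is not
reproduced here; it is an invented illustration of the technique.

    /* Hypothetical trace/source-route option: for source routing a
       gateway forwards to route[ptr] and advances ptr; for tracing it
       stores its own address at route[ptr] and advances ptr. */
    #include <stdint.h>
    #include <stdio.h>

    struct route_option {
        uint8_t  ptr;      /* index of the next slot in route[] */
        uint8_t  len;      /* number of slots in use */
        uint32_t route[8]; /* gateway addresses in path order */
    };

    /* What a gateway would do in the trace case (assumed behavior). */
    static void trace_stamp(struct route_option *opt, uint32_t my_addr) {
        if (opt->ptr < 8) {
            opt->route[opt->ptr++] = my_addr;
            if (opt->ptr > opt->len)
                opt->len = opt->ptr;
        }
    }

    int main(void) {
        struct route_option opt = {0};
        trace_stamp(&opt, 0x0A000001); /* gateway 1 records itself */
        trace_stamp(&opt, 0x0A000002); /* gateway 2 records itself */
        printf("route recorded through %u gateways\n", opt.len);
        return 0;
    }
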
The next step in developing fault isolation mechanisms for the
catenet is to work out the detailed design for the mechanisms
suggested above, and to implement these in hosts, gateways and
control centers.