Connection-oriented data networks, or "optical networks", are gaining popularity. Creating a connection in these networks is done manually and often takes weeks. One of the reasons is that fault detection and fault isolation are very hard and therefore take a very long time. This short article describes four possible methods to detect and isolate faults in optical network connections.
In previous years, network connections have been set up for demonstrations during Supercomputing and iGrid conferences, as well as for other experiments. All of these were connection-oriented network connections that crossed the Atlantic Ocean and spanned the networks of multiple administrative domains. Many of these network connections were set up by us or our partners, including the UvA, SURFnet, SARA, and UIC.
Not every connection worked right from the start. During SuperComputing 2004, one lightpath had 1-2% packet loss and another had a broken trans-Atlantic link. The problem was detected by an end-user. Tracking down the source of the problem required the user to mail all network operators. Each network operator then logged in on the machines they owned and ran some tests. After a few e-mail exchanges between the network operators, some potential problems were found, and after just a week the problem was solved. According to the people involved, the whole process was a relatively smooth one, with great co-operation between everyone. This time scale was also quite fast in our experience. A possible reason is that everyone already knew each other, and thus knew whom to contact. Most of the delay was caused by time-zone differences and miscommunication. One of the miscommunications was a change made to the network by one operator during a test, while the others were not aware of the change.
This debugging process does not scale to more complex network connections. The above scenario involved four organisations. With six or more organisations involved, or with network operators who are not familiar with each other's working methods, the time to solve a problem will greatly increase.
To track the source of connection problems, there is a clear need to monitor network connections or the underlying links. First of all, an early-detection system is necessary to point at network failures before they occur. Secondly, both end-users and system administrators need a tool to narrow down the location of the problem to one or two domains at most. For layer 3, the tool that accomplishes this is traceroute. So we need a traceroute-like tool for layer 0 to layer 2. Since layer 0 and layer 1 devices do not look into packets (or interfere with them), this may have to be done out-of-band.
Four possible solutions are outlined below: techniques from telephony networks, an expert system, a traceroute-like tool, and in-band control.
Telephony networks adhere to strict availability requirements. Digital exchanges, also known as Stored Program Controlled (SPC) exchanges, have been in production since 1965. To detect errors at an early stage, SPC exchanges can: do routine checks on the exchange equipment; periodically test attached lines and trunks; generate data traffic for testing purposes; and produce equipment trouble reports.
Perhaps these checks can be applied to optical networks as well. One solution is to place a monitoring host in each domain, which periodically sets up connections to the monitoring hosts in neighbouring domains. If such a connection fails, there must be a problem in or between the two domains.
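As a minimal sketch of this periodic neighbour check, assume that each monitoring host knows the address of the monitoring host in every neighbouring domain. The host names, the port number, and the use of a plain TCP connection as the probe are illustrative assumptions, not part of any existing tool.

```python
import socket
from typing import Callable, Dict, List, Tuple

Address = Tuple[str, int]  # (host, port) of a neighbouring monitoring host


def tcp_probe(address: Address, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to the address can be opened.

    Using TCP as the test connection is an assumption; any end-to-end
    test over the link under observation could be substituted.
    """
    try:
        with socket.create_connection(address, timeout=timeout):
            return True
    except OSError:
        return False


def unreachable_neighbours(
    neighbours: Dict[str, Address],
    probe: Callable[[Address], bool] = tcp_probe,
) -> List[str]:
    """Return the names of neighbouring domains whose monitoring host
    could not be reached. A failure points at a problem in or between
    the local domain and that neighbour."""
    return sorted(name for name, address in neighbours.items()
                  if not probe(address))


# Hypothetical neighbour table for one monitoring host.
EXAMPLE_NEIGHBOURS: Dict[str, Address] = {
    "domain-b": ("monitor.domain-b.example", 7),
    "domain-c": ("monitor.domain-c.example", 7),
}
```

A scheduler on the monitoring host would call `unreachable_neighbours` at a fixed interval and raise an alarm for every domain it returns. The probe is passed in as a parameter so that it can be replaced by whatever test suits the link type.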
Each device may produce a different type of monitoring information, such as signal strength, bit error rates, and packet loss. A solution might be to gather all this information from the devices along the path and expose it to the end-user, for example using web services. The end-user can then gather all the data and invoke some sort of expert system to draw conclusions from it.
Regrettably, there is no one-to-one correspondence between signal strength and packet loss. Thus, there is no guarantee that signal loss can be identified as the cause of bit errors. Nonetheless, signal strength can be used to give an estimate. For example, a power above -13 dBm is likely sufficient, while a power below -16 dBm is likely to cause problems. Similarly, there is no trivial relation between bit error rate and packet loss. However, the combined information on signal strength, bit error rates, packet loss and total packet counters does give a rather good indication of the location of problems.
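The two power levels mentioned above can be turned into a rough first-pass classifier. The sketch below uses only the -13 dBm and -16 dBm values from the text. The label names, the `Reading` type, the device names, and the treatment of the range in between as "uncertain" are assumptions made for illustration.

```python
from typing import Iterable, List, NamedTuple

LIKELY_OK_DBM = -13.0   # above this, power is likely sufficient
LIKELY_BAD_DBM = -16.0  # below this, problems are likely


def classify_power(dbm: float) -> str:
    """Give a rough estimate for one received-power value in dBm."""
    if dbm > LIKELY_OK_DBM:
        return "likely ok"
    if dbm < LIKELY_BAD_DBM:
        return "likely problem"
    # The text gives no verdict for this range; "uncertain" is an assumption.
    return "uncertain"


class Reading(NamedTuple):
    device: str          # hypothetical device name
    rx_power_dbm: float  # received optical power


def devices_to_inspect(readings: Iterable[Reading]) -> List[str]:
    """Return the devices whose power is not clearly sufficient, in path order."""
    return [r.device for r in readings
            if classify_power(r.rx_power_dbm) != "likely ok"]


path_readings = [
    Reading("oxc-1", -9.5),
    Reading("oxc-2", -14.2),
    Reading("oxc-3", -17.8),
]
print(devices_to_inspect(path_readings))  # ['oxc-2', 'oxc-3']
```

Since signal strength alone does not determine packet loss, such a classifier can only rank candidates. An expert system would weigh its output against bit error rates and packet counters from the same devices.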
An advantage of this method is that it leaves the connection intact. This is important if the error is a 1% packet loss on a production network that cannot be disconnected for testing.
If you take the concept of traceroute down to layer 0, you get a tool that makes loopbacks: starting from one of the end-points, make a loopback at each network device along the way. Measure whether the loopback works properly (perhaps using a modified ping tool). If it works, move on to the next device along the way and make a loopback there. If it does not work, there is an error in the link between this device and the previous one, or in one of these two devices themselves. This pinpoints the problem rather well.
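The loopback procedure can be sketched as a simple walk along the path. In this sketch, `loopback_ok` is a hypothetical callback that would configure a loopback on the given device, send test traffic from the end-point, and report whether it came back. The device names are placeholders.

```python
from typing import Callable, Optional, Sequence, Tuple

# (last device whose loopback worked, first device whose loopback failed);
# the first element is None if the very first device already fails.
Suspect = Tuple[Optional[str], str]


def locate_fault(
    devices: Sequence[str],
    loopback_ok: Callable[[str], bool],
) -> Optional[Suspect]:
    """Test a loopback at each device in path order, starting from the
    local end-point, and stop at the first loopback that fails.

    Returns None if every loopback works.
    """
    previous: Optional[str] = None
    for device in devices:
        if not loopback_ok(device):
            # The fault is in the link between `previous` and `device`,
            # or in one of these two devices.
            return (previous, device)
        previous = device
    return None


# Simulated path in which everything beyond oxc-2 is unreachable.
path = ["oxc-1", "oxc-2", "oxc-3", "oxc-4"]
working = {"oxc-1", "oxc-2"}
print(locate_fault(path, lambda d: d in working))  # ('oxc-2', 'oxc-3')
```

The walk is linear, as in traceroute. On long paths a binary search over the devices would need fewer loopbacks, at the cost of assuming there is only a single fault.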
This method works for layer 0 devices, since optical cross connects (OXCs) can easily make loopbacks, though only for full-duplex connections. A drawback is that this method modifies the settings of the network equipment, which is sometimes not possible or causes problems. For example, you cannot make a loopback on one wavelength if the device switches all wavelengths at the same time and has no WDM demultiplexing/multiplexing support; you would surely interrupt any production traffic on the other wavelengths. Another example is that a layer 2 device may automatically disable a link when it detects a loopback in the spanning tree.
Theoretically, it may be possible to define in-band control packets which can be detected by network devices. Imagine a path over which a control packet is sent, and all devices along the way report if they saw this packet. Then you would be able to pinpoint the location of problems.
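Given such reports, locating the fault comes down to finding where the control packet was last seen. The sketch below assumes a hypothetical `reports` mapping from device name to whether that device saw the packet. Devices that did not report at all are treated as unknown, which widens the suspected segment.

```python
from typing import Dict, Optional, Sequence, Tuple


def fault_segment(
    path: Sequence[str],
    reports: Dict[str, bool],
) -> Optional[Tuple[Optional[str], str]]:
    """Return (last device that saw the packet, first device that did not).

    Devices without a report are skipped, so the returned segment may
    span several devices. Returns None if no device reported a miss.
    """
    last_seen: Optional[str] = None
    for device in path:
        saw_packet = reports.get(device)
        if saw_packet is None:
            continue  # no report from this device: cannot narrow further
        if not saw_packet:
            return (last_seen, device)
        last_seen = device
    return None


path = ["dev-a", "dev-b", "dev-c", "dev-d"]
# dev-c did not report, so the fault lies somewhere between dev-b and dev-d.
reports = {"dev-a": True, "dev-b": True, "dev-d": False}
print(fault_segment(path, reports))  # ('dev-b', 'dev-d')
```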
This method relies on in-band signaling and requires devices to be modified to recognize special control packets, which is undesirable. It probably also suffers from the same problem that SDH/SONET has. SDH has excellent in-band control, and you can clearly pinpoint an error, even across multiple SDH devices. However, this typically only works within a single domain, since administrators will most likely not allow users from other domains to look into their device settings, even if it is technically possible. The result may be that all domains claim their SDH circuit works fine while the end user still experiences problems. That is exactly the problem we want to resolve, which results in a vicious circle.