NetCraftsmen does a lot of network assessments. Customers want us to tell them why their “network is slow.” The assumption, as usual, is that something is wrong with the network. What they really want to know is why their applications are slow and what they can do about it. That’s where Application Performance Management (APM) products provide tangible benefits. APM technology takes many forms. Some systems add instrumentation to client systems, browsers, or the call stacks of production servers. Other systems passively capture and analyze network traffic. My focus in this article is on APM that passively watches applications on the network. We like to think of it as turbocharging the assessment because we can produce exceptional results quickly and gain views into the network that we can’t obtain with other tools.
In a number of cases where the network was reported to be slow, I’ve found that the network itself wasn’t at fault. Something else in the environment was causing the application slowness. Early in my consulting career, it was a backup that was running during the day because the backup set had grown to the point that it couldn’t finish in the allotted time. The network utilization caused by the backup created congestion, which in turn caused packet loss. Even small amounts of packet loss cause large degradations in TCP performance, as shown in my blog on TCP Performance and the Mathis Equation. In a case this past year, we found that congested inter-data center links were causing significant application slowness. Using APM on this case provided valuable insight and allowed us to quickly identify the source of the packet loss.
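To make the sensitivity to loss concrete, here is a minimal sketch of the Mathis bound, throughput ≤ (MSS / RTT) × (C / √p). The 1460-byte MSS, the 10 ms RTT, the loss values, and the constant C = 1 are illustrative assumptions, not figures from any assessment described here.

```python
import math


def mathis_throughput_bps(mss_bytes: float, rtt_s: float, loss: float, c: float = 1.0) -> float:
    """Approximate upper bound on steady-state TCP throughput, in bits per second.

    Mathis et al.: rate <= (MSS / RTT) * (C / sqrt(p)), where p is the loss probability.
    C is a constant close to 1; the default of 1.0 here is an assumption.
    """
    if mss_bytes <= 0 or rtt_s <= 0:
        raise ValueError("mss_bytes and rtt_s must be positive")
    if not 0.0 < loss <= 1.0:
        raise ValueError("loss must be in the range (0, 1]")
    return (mss_bytes * 8.0 / rtt_s) * (c / math.sqrt(loss))


if __name__ == "__main__":
    # Hypothetical path: 1460-byte MSS and 10 ms round-trip time.
    for loss in (0.0001, 0.001, 0.01):
        bps = mathis_throughput_bps(1460, 0.010, loss)
        print(f"loss={loss:.2%}  bound={bps / 1e6:8.1f} Mbps")
```

Under these assumptions, a single flow is bounded at roughly 117 Mbps with 0.01% loss and roughly 12 Mbps with 1% loss, regardless of how fast the underlying link is.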
Focusing on the Network
A network assessment would normally be focused on network design and implementation, perhaps with some analysis of network operations. But that approach ignores the real reason for most network assessments – slow application performance.
In almost every network, it is easy to find areas in which to make improvements. Occasionally, we’ll have an engagement where someone wants us to check the network against industry best practices. But most of the time, we’re looking at a network with problems.
There are almost always problems with spanning tree, duplex mismatches, and topology design. We know how to look for each of these problems, spot it quickly, and determine the extent to which it exists. What is missing from the basic network assessment is the evaluation of the applications themselves. That’s where APM provides valuable insights.
APM: The Assessment Booster
We like to use APM to validate what we’re seeing on the network side. Frequently, it provides visibility into problems that we didn’t otherwise know existed. In a recent case, we were working on a network assessment and had identified several minor network improvements that could be made. But there wasn’t anything that jumped out as a significant contributor to poor application performance, which was why we had been contracted to do the assessment.
We deployed OPNET’s AppResponse Xpert appliance in the data center, where we could look at traffic for most of the key applications. We quickly identified that the network was indeed causing communications problems. Within a day, we knew that there were very high volumes of TCP retransmissions in the applications.
A little more investigation allowed us to determine the source of the retransmissions. Using SNMP polling, we found a set of inter-data center interfaces that showed noticeable peaks of discards during the times when we observed high TCP retransmissions. The number of discards didn’t look far out of place, considering the volume of data transiting the 1G interfaces. However, the APM analysis showed that some applications were experiencing 0.08% retransmissions. Based on our work with the Mathis Equation (see the blog referenced above), we knew that something was causing TCP retransmit timers to trigger: either packet loss or very high jitter. Armed with that knowledge, we started checking the path in detail. For a description of the analysis, see my blog Application Analysis Using TCP Retransmissions, Part 1.
We found that the 1G inter-data center links had been configured with extremely large buffers, enough buffering to extend the normal 2ms RTT to 14ms. So even though the interface stats didn’t look too bad, TCP was timing out some packets and retransmitting them. The excessive buffering was defeating TCP’s congestion control mechanisms, and congestion collapse occurred when the load exceeded the link capacity.
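The SNMP side of an investigation like this comes down to turning raw interface counters into per-interval discard ratios that can be lined up against the retransmission peaks. The sketch below assumes 32-bit counters in the style of ifOutDiscards and ifOutUcastPkts from the standard IF-MIB. The Sample structure and the sample values are hypothetical, and no SNMP library is used: the counter values are assumed to have been collected already. Whether the packet counter includes discarded packets can vary by platform, so treat the ratio as an approximation.

```python
from dataclasses import dataclass

COUNTER32_MODULUS = 2**32  # Counter32 objects wrap back to zero at 2^32


@dataclass
class Sample:
    """One poll of an interface (hypothetical structure for this sketch)."""
    timestamp_s: float
    out_pkts: int      # e.g. ifOutUcastPkts
    out_discards: int  # e.g. ifOutDiscards


def counter_delta(prev: int, curr: int, modulus: int = COUNTER32_MODULUS) -> int:
    """Difference between two counter readings, tolerating a single wrap."""
    return (curr - prev) % modulus


def discard_ratios(samples: list[Sample]) -> list[tuple[float, float]]:
    """Return (interval end time, discards / packets) for each consecutive pair of polls."""
    results = []
    for prev, curr in zip(samples, samples[1:]):
        pkts = counter_delta(prev.out_pkts, curr.out_pkts)
        dropped = counter_delta(prev.out_discards, curr.out_discards)
        ratio = dropped / pkts if pkts else 0.0
        results.append((curr.timestamp_s, ratio))
    return results


if __name__ == "__main__":
    # Made-up polls at 5-minute intervals; the packet counter wraps in the second interval.
    polls = [
        Sample(0, 4_294_000_000, 10),
        Sample(300, 4_294_900_000, 15),
        Sample(600, 932_704, 815),
    ]
    for end_time, ratio in discard_ratios(polls):
        print(f"t={end_time:5.0f}s  discard ratio={ratio:.4%}")
```

One design note: a ratio averaged over a long polling interval can hide short bursts of discards, so interface statistics may look unremarkable even while individual flows are suffering.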
Upon further analysis, we also found duplicate TCP ACKs, which indicate that duplicate packets arrived at the destination. This is another indication that TCP timed out and retransmitted packets. The retransmitted packets then consume additional bandwidth, exacerbating the problem. Without APM, it would have been much more difficult to spot the problem and eventually determine its cause. Our primary recommendation was to increase the link capacity. The secondary recommendation was to reduce the buffering to less than 4ms of data at 1Gbps.
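The buffering figures in this case can be sanity-checked with simple arithmetic: a queue that adds a given delay holds as many bytes as the link drains in that time. The sketch below assumes that the whole RTT increase (2ms to 14ms, so 12ms) comes from queuing on a single 1 Gbps interface in one direction, which is a simplification. The function name is hypothetical.

```python
def queue_bytes(link_bps: int, delay_ms: int) -> int:
    """Bytes a link of link_bps drains in delay_ms milliseconds.

    This is the queue depth that adds delay_ms of delay when the queue is full.
    """
    if link_bps < 0 or delay_ms < 0:
        raise ValueError("link_bps and delay_ms must be non-negative")
    # bits per second * milliseconds / 1000 gives bits; divide by 8 for bytes.
    return link_bps * delay_ms // 8000


if __name__ == "__main__":
    link_bps = 1_000_000_000  # 1 Gbps inter-data center link

    observed_extra_ms = 14 - 2  # RTT grew from 2 ms to 14 ms
    print(queue_bytes(link_bps, observed_extra_ms))  # 1500000 bytes, about 1.5 MB

    recommended_max_ms = 4  # secondary recommendation: under 4 ms of data
    print(queue_bytes(link_bps, recommended_max_ms))  # 500000 bytes, about 500 KB
```

Under that assumption, the observed delay corresponds to about 1.5 MB of queued data, and the recommended ceiling corresponds to about 500 KB.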
Rapid APM Deployment
One of the benefits of using APM in a network assessment is ease of deployment. Network assessments need to happen fast. The customer is losing money and the network team is being blamed for the problem. It isn’t acceptable to wait a month while someone methodically gathers data, analyzes it, and finally writes a report. We like tools that quickly produce useful results. AppResponse Xpert is one of those tools.
Installation primarily consists of determining where in the network the application flows can be obtained. A span port is needed to provide the raw data to the APM system. In a permanent installation, we often recommend a span port aggregator, such as those sold by Gigamon and Anue/Ixia, or the Net Optics Director. It is useful to be able to get data from large, multi-tiered applications so that a slow back-end server or a networking problem within the data center can be easily detected.
Once span data is being fed to the APM system, we determine groups of clients and groups of servers. AppResponse Xpert automatically identifies applications from the traffic. We find it useful to build a ‘business group’ of clients for each important application and a separate group of servers for the same application. We can then work on a per-application basis, identifying each one that has a problem. Do we see network-induced problems like TCP retransmissions and duplicate TCP ACKs, or do we see slow server response times? We might also see that data transfer times dominate in an application, indicating that the application architecture may need to change or that higher speed links may need to be used.
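The per-application triage described above can be sketched as a simple classification over exported metrics. Everything in this example is hypothetical: the AppMetrics fields, the thresholds, and the sample values are illustrative assumptions and do not reflect AppResponse Xpert’s data model or any API.

```python
from dataclasses import dataclass

# Illustrative thresholds only; suitable values depend on the environment.
RETRANS_RATIO_THRESHOLD = 0.0005  # 0.05% of packets retransmitted
DOMINANT_SHARE = 0.5              # a component "dominates" above 50% of total time


@dataclass
class AppMetrics:
    """Per-application averages for one client/server group pair (hypothetical fields)."""
    name: str
    retrans_ratio: float       # retransmitted packets / total packets
    network_rtt_ms: float      # network round-trip time
    server_response_ms: float  # server think time
    data_transfer_ms: float    # time spent moving the payload


def classify(m: AppMetrics) -> list[str]:
    """Label the likely contributors to slowness for one application."""
    findings = []
    if m.retrans_ratio >= RETRANS_RATIO_THRESHOLD:
        findings.append("network: TCP retransmissions")
    total = m.network_rtt_ms + m.server_response_ms + m.data_transfer_ms
    if total > 0:
        if m.server_response_ms / total > DOMINANT_SHARE:
            findings.append("server: slow response time")
        if m.data_transfer_ms / total > DOMINANT_SHARE:
            findings.append("transfer: data transfer time dominates")
    return findings or ["no dominant contributor"]


if __name__ == "__main__":
    apps = [
        AppMetrics("erp", 0.0008, 2.0, 20.0, 30.0),
        AppMetrics("intranet", 0.0001, 2.0, 90.0, 10.0),
    ]
    for app in apps:
        print(app.name, classify(app))
```

A fixed-threshold approach like this is only a starting point for sorting applications into “network,” “server,” and “transfer” buckets. In practice the thresholds would be tuned against each application’s normal behavior.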
We have seen how application analysis can highlight network problems that would otherwise remain hidden. Of course, certain classes of problems require APM instrumentation outside of the network domain. APM also can’t help with a network design review or identify redundancy failures; that’s where a comprehensive network assessment provides value. But the addition of APM to network assessments provides a valuable look at how the applications use the network. The result is a turbocharged network assessment, quickly delivering results that are useful to more than just the network team.
Thanks to Terry Slattery and OPNET for this article.