Turbocharge Network Assessments with APM

NetCraftsmen does a lot of network assessments. Customers want us to tell them why their “network is slow”. Once again, there’s the assumption that there’s something wrong with the network. What they really want to know is why their applications are slow and what they can do about it. That’s where Application Performance Management (APM) products provide tangible benefits. APM technology takes many forms. Some systems add instrumentation to client systems, browsers, or the call stacks of production servers. Other systems passively capture and analyze network traffic. My focus in this article is on APM that passively watches applications on the network. We like to think of it as turbocharging the assessment, because we can produce exceptional results quickly and gain views into the network that we can’t obtain with other tools.

In a number of cases where the network was reported to be slow, I’ve found that it wasn’t actually the network at fault; it was something else in the environment causing the application slowness. Early in my consulting career, it was a backup running during the day because the backup set had grown to the point that it couldn’t finish in the allotted time. The network utilization caused by the backup created congestion, which in turn caused packet loss. Small amounts of packet loss will cause large degradations in TCP performance, as shown in my blog on TCP Performance and the Mathis Equation. In a case this past year, we found congested inter-data-center links were causing significant application slowness. Using APM on this case provided valuable insight and allowed us to quickly identify the source of the packet loss.
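For reference, a common statement of the Mathis Equation bounds a single TCP flow’s steady-state throughput by segment size, round-trip time, and loss:

$$\text{Throughput} \le \frac{\mathit{MSS}}{\mathit{RTT}} \cdot \frac{C}{\sqrt{p}}, \qquad C \approx 1.22$$

where MSS is the maximum segment size, RTT is the round-trip time, and p is the packet loss rate. Because loss appears under a square root, even a fraction of a percent of loss pulls the throughput ceiling down dramatically.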

Focusing on the Network

A network assessment would normally be focused on network design and implementation, perhaps with some analysis of network operations. But that approach ignores the real reason for most network assessments – slow application performance.

In almost every network, it is easy to find areas in which to make improvements. Occasionally, we’ll have an engagement where someone wants us to check the network against industry best practices. But most of the time, we’re looking at a network with problems.

There are almost always problems with spanning tree, duplex mismatches, and topology design. We know how to look for these problems, spot them quickly, and determine the extent to which each exists. What is missing from the basic network assessment is the evaluation of the applications themselves. That’s where APM provides valuable insights.

APM: The Assessment Booster

We like to use APM to validate what we’re seeing on the network side. Frequently, it provides visibility into problems that we didn’t otherwise know existed. In a recent case, we were working on a network assessment and had identified several minor network improvements that could be made. But there wasn’t anything that jumped out as a significant contributor to poor application performance, which was why we had been contracted to do the assessment.

We deployed OPNET’s AppResponse Xpert appliance in the data center, where we could look at traffic for most of the key applications. We quickly identified that the network was indeed causing communications problems. Within a day, we knew that there were very high volumes of TCP retransmissions in the applications.

A little more investigation allowed us to determine the source of the retransmissions. SNMP polling turned up a set of inter-data-center interfaces with pronounced peaks of discards during the times when we observed high TCP retransmissions. The number of discards didn’t look out of place, considering the volume of data transiting the 1G interfaces. However, the APM analysis showed that some applications were experiencing 0.08% retransmissions. Based on our work with the Mathis Equation (see the link above), we knew that something was causing TCP retransmit timers to trigger: either packet loss or very high jitter.

Armed with that knowledge, we started checking the path in detail. For a description of the analysis, see my blog Application Analysis Using TCP Retransmissions, Part 1. We found that the 1G inter-data-center links had been configured with extremely large buffers – enough buffering to extend the normal 2ms RTT to 14ms. So even though the interface stats didn’t look too bad, TCP was timing out some packets and retransmitting them. The excessive buffering was circumventing TCP’s congestion control mechanisms, and congestive collapse occurred when the load exceeded the link capacity.
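To make the numbers concrete, here is a small Python sketch applying the Mathis Equation to what we measured. The 1460-byte MSS is an assumption (a typical Ethernet value), not a figure from the engagement:

```python
from math import sqrt

C = 1.22      # Mathis constant
MSS = 1460    # bytes; typical Ethernet TCP segment size (assumed)
p = 0.0008    # the 0.08% retransmission rate the APM analysis reported

def mathis_ceiling_bps(mss_bytes, rtt_seconds, loss_rate):
    """Upper bound on a single TCP flow's steady-state throughput, in bits/sec."""
    return (mss_bytes * 8 / rtt_seconds) * (C / sqrt(loss_rate))

for rtt_ms in (2, 14):  # normal RTT vs. the buffer-inflated RTT
    bps = mathis_ceiling_bps(MSS, rtt_ms / 1000, p)
    print(f"RTT {rtt_ms:2d} ms -> ceiling ~{bps / 1e6:6.1f} Mbps per flow")
```

At the buffer-inflated 14ms RTT, each flow is capped at roughly 36 Mbps – far below the 1G link rate, and consistent with the slowness users were reporting.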

Upon further analysis, we also found duplicate TCP ACKs, which indicate that duplicate packets arrived at the destination – another sign that TCP timed out and retransmitted packets that had actually been delivered. The retransmitted packets then consume additional bandwidth, exacerbating the problem. Without APM, it would have been much more difficult to spot the problem and eventually determine its cause. Our primary recommendation was to increase the link capacity. The secondary recommendation was to reduce the buffering to less than 4ms of data at 1Gbps.
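The 4ms recommendation translates directly into a maximum queue depth. A quick back-of-the-envelope check:

```python
link_bps = 1e9            # 1 Gbps inter-data-center link
max_buffer_delay = 0.004  # recommended ceiling of 4 ms of queueing

buffer_bytes = link_bps * max_buffer_delay / 8
print(f"4 ms at 1 Gbps = {buffer_bytes / 1e3:.0f} kB of buffer")  # 500 kB
```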

Rapid APM Deployment

One of the benefits of using APM in a network assessment is ease of deployment. Network assessments need to happen fast. The customer is losing money and the network team is being blamed for the problem. It isn’t acceptable to wait a month while someone methodically gathers data, analyzes it, and finally writes a report. We like tools that quickly produce useful results. AppResponse Xpert is one of those tools.

Installation primarily consists of determining where in the network the application flows can be obtained. A span port is needed to provide the raw data to the APM system. In a permanent installation, we often recommend a span port aggregator, such as those sold by Gigamon, Anue/Ixia, or Net Optics (the Director). It is useful to be able to get data from large, multi-tiered applications so that if a back-end server is slow, or there is a networking problem within the data center, it can be easily detected.

Once span data is being fed to the APM system, we determine groups of clients and groups of servers. AppResponse Xpert automatically identifies applications from the traffic. We find it useful to build a ‘business group’ of clients for each important application and a separate group of servers for the same application. We can then work on a per-application basis, identifying each application that has a problem. Do we see network-induced problems like TCP retransmissions and duplicate TCP ACKs, or do we see slow server response times? We might also see that data transfer times dominate in an application, indicating that the application architecture may need to change or that higher-speed links may be needed.
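The grouping itself is configured in the product, but as a rough illustration of the concept, here is a hypothetical Python sketch that sorts addresses into per-application client and server groups by subnet (the group names and subnets are invented):

```python
from ipaddress import ip_address, ip_network

# Hypothetical business-group definitions; a real deployment would configure
# these in the APM product rather than in code.
GROUPS = {
    "ERP clients": ip_network("10.10.0.0/16"),
    "ERP servers": ip_network("10.20.5.0/24"),
}

def classify(ip):
    """Return the business-group names an address belongs to."""
    addr = ip_address(ip)
    return [name for name, net in GROUPS.items() if addr in net]

client, server = "10.10.3.7", "10.20.5.12"
print(client, classify(client))  # ['ERP clients']
print(server, classify(server))  # ['ERP servers']
```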

Summary

We have seen how application analysis can highlight network problems that would otherwise remain hidden. Of course there are certain classes of problems that require APM instrumentation outside of the network domain. And of course, APM can’t help with network design review or identify redundancy failures — that’s where a comprehensive network assessment provides value. But the addition of APM to network assessments provides a valuable look at how the applications use the network. The result is a turbocharged network assessment, quickly delivering results that are useful to more than just the network team.

Thanks to Terry Slattery and OPNET for this article.


Defending the Network from Application Performance problems (part II)

In my prior blog post, I wrote about different network problems that negatively impact application performance. In this post, I’ll follow up with non-network problems that impact application performance, but for which the network provides a unique vantage point from where such problems can be identified and solved. In the next post, I’ll tie everything together by describing how to determine if the network is at fault and how to get the other organizations to understand more about application performance.

Slow Client

Many modern web-based applications push much of the user-interaction work to the client workstation. Sometimes this is done by pushing a lot of data to the workstation, where JavaScript code processes it. I’ve seen applications with long, multi-second pauses because the JavaScript had to process hundreds or thousands of rows of data before the client display could be updated.

A good Application Performance Management (APM) system identifies clients that have these types of delays. It requires looking at the client-to-server transactions and identifying when the client is paused due to internal processing. The analysis needs to differentiate between the client workstation application pauses and the “think time” of the human who is interacting with the application.
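As a toy illustration of that distinction, the sketch below classifies the gap between a server response and the client’s next request. The threshold is invented for illustration; a real APM system uses much richer signals than a single cutoff:

```python
# Illustrative heuristic only: a short gap suggests the client application
# was busy (e.g., JavaScript processing); a long gap suggests a human
# reading the screen before acting.
THINK_TIME_CUTOFF_S = 5.0  # assumed value, not a product default

def classify_gap(gap_seconds):
    if gap_seconds < THINK_TIME_CUTOFF_S:
        return "client processing pause"
    return "human think time"

for gap in (0.8, 3.2, 42.0):
    print(f"{gap:5.1f}s gap -> {classify_gap(gap)}")
```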

Slow Server

The server teams don’t like to hear it, but the most common causes of slow application performance are the applications or the servers themselves. I’ve frequently found that the network is not the cause, even though it often gets the blame.

Modern applications are typically deployed on a multi-tiered infrastructure. There often is a front-end web server that talks with an application server. The application server in turn talks with a middleware server that queries one or more database servers for the data it needs. These servers may all talk with DNS servers to look up IP addresses or to map IP addresses back to server names. All it takes is for any one of these servers to have performance problems and the whole application runs slow. Of course, the problem is then one of identifying the slow server out of the set of servers that implement an application.
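As a simplified illustration of isolating the slow tier, suppose the network gave you per-hop response times for one user transaction; the outlier stands out immediately (the tier names and timings below are hypothetical):

```python
# Hypothetical per-hop server response times (seconds) for one transaction.
hop_times = {
    "web -> app":        0.020,
    "app -> middleware": 0.045,
    "middleware -> db":  3.100,  # the outlier: the database tier
    "app -> dns":        0.002,
}

slowest = max(hop_times, key=hop_times.get)
total = sum(hop_times.values())
print(f"Slowest hop: {slowest} ({hop_times[slowest]:.3f}s of {total:.3f}s total)")
```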

Understanding the interactions between multiple components in an application is an essential part of understanding the root cause of performance problems. This process, called Application Dependency Mapping, is typically part of an integrated APM approach, and ideally leverages information from monitoring solutions already in place to draw a dependency map between system components. The network provides a unique vantage point from which to derive these relationships, and as such the network team can provide strong value to the application and server teams.

Although we can collect a lot of very rich information from the network, using packet capture tools to answer the question of “Is it the network or the application?” could take many, many hours of work. All the while, the application is running slow, affecting the productivity of anyone using that application.

I’ve used AppResponse Xpert to significantly reduce the time needed to identify why a slow application was slow. Once you have set up the proper monitoring points and some basic configuration, it is very easy to use and provides immediate value for “the network is slow” fire drills. The information gathered by AppResponse Xpert also provides input to AppMapper Xpert, which automatically draws dependency maps of critical applications.

Identifying Database Scaling Problems

A common cause of application slowness is an application that was developed with a small data set in a fast-LAN development environment. Then the application is rolled out to production. It may initially run with acceptable performance, but over time, as the database grows, it becomes slower and slower. A quick analysis with AppResponse Xpert shows that one of the key middleware servers is making a lot of requests to a database server. One client request can result in many database requests, or perhaps in the transfer of a significant volume of data. Changing the database query to be more efficient typically solves the problem.
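A rough sketch of how that fan-out can be flagged from request counts alone (the threshold and numbers are illustrative, not from any product):

```python
# Flag front-end transactions that fan out into excessive database queries
# (the classic "N+1 query" scaling problem).
FANOUT_THRESHOLD = 50  # assumed alert level; tune per application

transactions = [
    {"id": "order-lookup", "db_queries": 3},
    {"id": "report-all",   "db_queries": 4200},  # grows with the database
]

for t in transactions:
    if t["db_queries"] > FANOUT_THRESHOLD:
        print(f"{t['id']}: {t['db_queries']} DB queries per request "
              "- candidate for query consolidation")
```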

I’ve also seen cases where a database server takes many seconds to return data to the middleware or application server. The application team can use AppResponse Xpert’s Database Monitoring module to identify the offending query. Sometimes a good development team can look at the user transaction and quickly determine which queries are likely to be the culprit; other times, the application is making so many database queries that a SQL query analysis tool is really what is needed. In the cases I’ve seen, the queries were poorly structured, sometimes joining large tables in ways that resulted in extremely long query times on production data sets. Simply rewriting the queries dropped the query times by several orders of magnitude. This is where these tools pay off. The advantage of using deep packet inspection on the network to identify problems with SQL queries is that no overhead is added to the database. This is another example of how the network team can provide value to other IT teams.

Chatty Conversation

Another typical example of problems within the application is the chatty conversation. One application server, or perhaps the client itself, makes many small requests to execute one transaction on behalf of the person running the application. It runs fine as long as the network latency between the client and server is low. However, with the advent of virtualization, the server team may have configured automatic migration of the server image to a lightly loaded host. This might move a server image to a location that puts it several milliseconds further away from other servers or from its disk storage system. A few milliseconds may not seem like much, but an application that makes hundreds or thousands of small requests to complete one transaction multiplies that latency: 1,000 serialized requests over a path with 5ms of added round-trip latency add 5 seconds to the transaction. Suddenly, the application goes from an acceptable level of performance to unacceptable performance. Of course, database size also affects performance, because the number of small requests grows with the database size.

You need visibility into the number of requests between systems, where the systems are connected to the network, and the delays between requests. Getting a baseline of system performance against which you can measure future performance is extremely useful for identifying whether a given application is performing as expected and possibly identifying which server needs to be examined.

This kind of examination can be automated by AppTransaction Xpert, which can capture baseline transactions from the packet store of AppResponse Xpert and predict the change in their response times given different network parameters such as latency, bandwidth, and loss rate.
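The article doesn’t describe AppTransaction Xpert’s internal model, but a first-order approximation of this kind of what-if analysis counts application turns and payload bytes:

```python
def predict_response_time(turns, payload_bytes, rtt_s, bandwidth_bps,
                          server_time_s=0.0):
    """Rough first-order estimate, not OPNET's actual model: each
    application turn costs one RTT, plus payload serialization time,
    plus fixed server processing time."""
    return turns * rtt_s + payload_bytes * 8 / bandwidth_bps + server_time_s

# A chatty transaction: 800 turns, 2 MB total payload, 100 Mbps path.
for rtt_ms in (1, 5, 20):
    t = predict_response_time(800, 2e6, rtt_ms / 1000, 100e6, server_time_s=0.5)
    print(f"RTT {rtt_ms:2d} ms -> ~{t:5.1f} s response time")
```

Even this crude model shows how a chatty transaction’s response time scales almost linearly with latency.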

Slow Network Services

Finally, the problem may be due to slow network services. This isn’t the network itself, but services that most network-based applications depend upon for proper operation. Consider an application that queries a DNS server, but the primary DNS server is unreachable, so the app must let the first request time out before querying the secondary DNS server. I’ve seen applications that had a 30-60 second delay on first execution but would then run fine for a while. Periodically, the application would be very slow, while running fine the rest of the time. Intermittent problems are very challenging to diagnose, so this is where having something like AppResponse Xpert watching and recording all the transactions is extremely helpful. Just identify the time of the slow performance and look for something in the data. In this case, it would be an unanswered DNS request that succeeded when retried against the secondary DNS server.
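You can reproduce the client-side symptom with a few lines of Python that time each resolution attempt (the hostname is a placeholder). A slow or unreachable primary resolver shows up as a multi-second outlier on the first call:

```python
import socket
import time

def timed_lookup(hostname):
    """Time one DNS resolution attempt."""
    start = time.monotonic()
    try:
        result = socket.gethostbyname(hostname)
    except socket.gaierror as exc:
        result = f"failed: {exc}"
    return time.monotonic() - start, result

for _ in range(3):
    elapsed, result = timed_lookup("app.example.com")  # placeholder name
    print(f"{elapsed:6.3f}s  {result}")
```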

Summary

Accurately diagnosing application performance problems can be impossible, or at least very time-consuming, with the wrong tools. With the right tools, installed so that they capture the necessary data, the analysis and diagnosis can proceed very quickly. In addition, these tools not only help defend or troubleshoot the network, but also provide value to other IT teams in the organization. I know of one site that went from being unable to help diagnose slow applications to providing deep visibility into what an application is doing from the network perspective, delivering real value to the application teams in solving the problem.

We thank Terry Slattery and OPNET for this article.

OPNET Positioned in the “Leaders” Quadrant of the Magic Quadrant for Application Performance Monitoring

Evaluation Based on Completeness of Vision and Ability to Execute

BETHESDA, MD – August 20, 2012 – OPNET Technologies, Inc. (NASDAQ: OPNT), the leading provider of solutions for application and network performance management, today announced that it has been positioned by Gartner, Inc. in the “Leaders” quadrant of the “Magic Quadrant for Application Performance Monitoring,” published on August 16, 2012, and written by Jonah Kowall and Will Cappelli. Gartner evaluated 14 vendors in the report and recognized 8 in the “Leaders” quadrant based on their completeness of vision and ability to execute.

Marc Cohen, OPNET’s Chairman and CEO, stated, “We are very pleased that OPNET has been recognized by Gartner as a leader in the rapidly growing APM market.  A core part of OPNET’s business strategy has been to be a market leader in APM, both in terms of innovation and market share.  APM now represents over 70% of our product sales, and has been growing over 30% per year for the last four years.”

“Managing the performance and availability of applications has become a top priority for IT organizations,” stated Alain Cohen, OPNET’s President and CTO.  “APM offers a paradigm shift for ensuring the quality of end user experience and the continuity of business operations. We believe that Gartner’s recognition of OPNET as a leader in application performance monitoring underscores OPNET’s success in delivering innovative APM solutions that advance this critical IT discipline.”

According to P. J. Malloy, OPNET Senior Vice President of R&D and Chief Architect of APM Solutions, “APM Xpert’s capabilities are the most comprehensive in the industry, covering all key functional areas. OPNET’s High Definition APM approach emphasizes breadth and depth in monitoring, with analytics that detect patterns and pinpoint relevant information. Big Data technology is leveraged to efficiently analyze billions of transactions. OPNET’s solutions dramatically accelerate problem resolution, enable problem prevention, and improve the effectiveness of the IT organization.”


OPNET Talks End-to-End Management and Monitoring of Unified Communications

Unified Communications (UC), especially its real-time applications, is unique because of user expectations. Management of UC is very important because it ensures service availability and service performance, as well as other aspects of UC, including security and compliance.

“At the end of the day, the user expectation is real-time communication services are available any time, all the time,” Gurmeet Lamba, VP of R&D, Unified Communications Management, OPNET, told TMCnet in an exclusive interview.

For example, if your phone is not connected, you can’t make a phone call. A user expects the phone to work every time it’s used. If you call somebody on a cell phone but the voice is broken, you can’t complete your conversation. In both of these cases, the service performance is not adequate and the user’s expectations are not met.

“The reason the user expectation is so high is because it is so critical to the users living their lives,” said Dave Roberts, director of product management, Unified Communications, OPNET. “Your car should start, your lights should turn on, your shower should run.”

End-to-end in UC means managing and monitoring the breadth and the depth of all components involved in orchestrating a successful communications session. To make this happen in today’s world, there are a number of components involved.

For instance, in order to complete a conference call, all parties’ phones must work, the network to the conferencing server must work, and the conferencing server itself must work properly.

End-to-end management and monitoring means getting visibility into the performance of every single component involved in the complete communication session – every component across both the breadth and the depth of that session. The breadth consists of the applications: unified messaging, conferencing servers, devices, call management servers, and so on. The depth includes the client, such as a phone or an application on your computer, the configuration of the application, the network, the virtual server, and the physical server.

“You can see the technology stack from top to bottom and from left to right. All of it has to work,” explained Lamba. “Communication is the oil that keeps everything moving.”

When it comes to the most perfect end-to-end UC management and monitoring solution ever invented, Dave Roberts said, “The first goal would be to have 100 percent visibility to all of the information at all times. You would have to know every configuration of everything that is involved with the communication and every state of everything happening on the network at any instance. And then, it would have to have a way to correlate and use that information to do a few things, including detect problems, analyze information to find the cause of the problem, and fix the problem.”

Article from TMCnet.com by Amanda Ciccatelli on OPNET

Best Practices for Migrating Applications

Application Performance Management (APM) can help you answer several questions when migrating applications:

  • Will the migration be successful? Will applications perform?
  • How best to execute the migration?
  • How to manage performance once the project is complete?

The cost advantages of consolidating your data center can be significant, but you may have concerns about application performance after the migration is completed. These four steps can help ensure that application performance is maintained:

Application Performance Baselining.

Before migration to a consolidated datacenter, characterize application usage and behavior. Data collected by APM solutions can help to accurately size the new infrastructure, eliminate pre-existing performance problems, and determine which applications are suitable for migration.
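As a minimal illustration of baselining, here is a sketch that reduces response-time samples to a median and an approximate 95th percentile (the samples are synthetic; a real baseline would come from the APM data):

```python
import statistics

# Synthetic per-transaction response times (seconds) captured pre-migration.
samples = sorted([0.21, 0.25, 0.22, 0.31, 0.24, 0.90,
                  0.23, 0.26, 0.28, 0.22])

p50 = statistics.median(samples)
p95 = samples[int(len(samples) * 0.95) - 1]  # crude percentile, fine for a sketch
print(f"Baseline: median {p50:.2f}s, ~95th percentile {p95:.2f}s")
```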

Application Dependency Mapping.

A clear understanding of the intricate front-end and back-end client-server relationships is vital to executing the migration, since physically separating highly dependent application tiers can cause serious performance degradation (or, in some cases, complete failures). Ideally, clear run-time dependency maps should be assembled automatically from the APM instrumentation already in place from the baselining exercise.
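The core idea behind deriving a dependency map from the network can be sketched in a few lines: accumulate observed flows into a directed who-talks-to-whom graph (the flow records below are hypothetical):

```python
from collections import defaultdict

# Hypothetical (source, destination) flow records observed on the network.
flows = [
    ("web01", "app01"), ("app01", "db01"),
    ("app01", "db01"),  ("web01", "app01"),
    ("app01", "dns01"),
]

deps = defaultdict(set)
for src, dst in flows:
    deps[src].add(dst)

for src, dsts in sorted(deps.items()):
    print(f"{src} depends on: {', '.join(sorted(dsts))}")
```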

Application Migration Planning.

Once the applications suitable for migration have been identified and their dependent components reviewed, predictive analysis on key application transactions should provide accurate insight into post-migration performance. The models used for this analysis are driven by a few key parameters of the infrastructure, as well as application profiles captured from the live environment (before the migration).

Post-Migration Production Monitoring.

After migration, it’s important to verify that performance objectives are met on an ongoing basis, and when they are not, to determine why. This is particularly important when a new department or service provider is involved, for obvious reasons. Who is responsible? What should be done? The most meaningful way to monitor application performance is from an “end user transaction” perspective rather than solely from a resource perspective. APM technologies can make it easy to see individual user transactions for multi-tier applications through SSL, through Citrix sessions, and across virtual and physical systems, and to highlight specific database or web queries. When considering how to monitor application performance from the end user’s perspective, the blog post “Making Sense of End User Experience Monitoring” may be useful.

By implementing these best practices, many customers have achieved successful migrations.

Thanks to OPNET APM Solutions for sharing this article