The Top 25 Network Problems and Their Business Impact (Part 2)

In our last post we covered the first 13 of the top network problems and how they can affect the applications our business needs to operate. In this issue we cover problems 14 to 25 and how you can communicate their impact to the business people in your organization.

14.  OSPF recalculations high

Routing protocol unstable; poor and inconsistent application performance. Link stability, link errors, or spanning tree stability can cause an OSPF topology to be unstable. The routing protocol may intermittently select non-optimum paths. Applications experience high jitter or loss of connectivity if routes are flapping as a result.

15.  Poor VoIP quality

Due to high jitter, delay, or packet loss; Choppy voice calls; Calls mysteriously disconnect. Poor VoIP quality can have many different root causes. By monitoring delay, jitter, and packet loss, you can reduce the set of possible problems to examine. By identifying the range of phones that are reporting poor statistics, you can better identify the potential source of the problem.
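
As a rough illustration of the monitoring described above, the sketch below computes a running jitter estimate using the smoothing formula from RFC 3550, plus a simple loss percentage. The function names and the input format (lists of send and receive timestamps in milliseconds) are assumptions made for this example, not part of any particular monitoring product.

```python
def interarrival_jitter(send_times, recv_times):
    """Running jitter estimate in the style of RFC 3550 (same units as the inputs)."""
    jitter = 0.0
    for i in range(1, len(send_times)):
        # Change in transit time between two consecutive packets.
        transit_now = recv_times[i] - send_times[i]
        transit_prev = recv_times[i - 1] - send_times[i - 1]
        d = transit_now - transit_prev
        # Smooth with a gain of 1/16 so one late packet does not dominate.
        jitter += (abs(d) - jitter) / 16.0
    return jitter


def loss_percent(expected, received):
    """Share of expected packets that never arrived."""
    if expected == 0:
        return 0.0
    return 100.0 * (expected - received) / expected
```

A steady transit time gives a jitter of zero no matter how large the delay is, which is why delay and jitter are worth tracking separately.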

16.  Routing Neighbor changes high

Access via this router is negatively affected by a high number of neighbor changes (BGP, OSPF, EIGRP). Similar to problems 13 and 14, something is causing the neighbor relationships to change regularly, which affects the stability and reliability of the routing protocol. As a result, applications can experience high jitter or packets arriving out of order. Finding and fixing the cause of the neighbor changes will result in a more stable and efficient network.

17.  OSPF area not connected to backbone

The disconnected OSPF area will not be reachable from other OSPF areas, impacting applications that need to communicate between areas. OSPF inter-area routing relies on connectivity through the backbone area (area 0). When an area is disconnected from the backbone, communication within the area works, but communication between systems in that area and systems in other areas will not work (the inter-area routes don’t exist). Users and systems within the area will report what seems to be intermittent connectivity, depending on whether the destination is located within the area or in another area.
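
The check described above can be sketched in a few lines. This is a simplified illustration that assumes you already have a mapping of routers to the OSPF areas they participate in (the `router_areas` structure and router names are hypothetical), and it ignores virtual links.

```python
def areas_missing_backbone(router_areas, backbone=0):
    """Return the areas that no router connects to the backbone area."""
    all_areas = set()
    attached = {backbone}
    for areas in router_areas.values():
        all_areas.update(areas)
        if backbone in areas:
            # This router has an interface in area 0, so every other
            # area it belongs to is attached to the backbone.
            attached.update(areas)
    return sorted(all_areas - attached)


# Hypothetical topology: area 2 only touches area 1, never area 0.
routers = {
    "r1": {0, 1},
    "r2": {1},
    "r3": {1, 2},
    "r4": {2},
}
disconnected = areas_missing_backbone(routers)
```

In this example area 2 is reported, because its only border router connects it to area 1 rather than to the backbone.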

18.  Unidirectional traffic flow

Typically the result of misconfigured routing, application traffic will be using non-optimum paths, increasing delay and potentially overloading other links and affecting other applications. Sometimes asymmetric routing is desired; however, it increases network complexity and complicates troubleshooting. Servers are often configured with incoming and outgoing interfaces, which may cause unicast flooding, a condition in which frames are sent to all ports in a VLAN. High traffic levels result, impacting the operation of all devices in the VLAN. In routed networks, a measure of zero packets in one direction on a link for long time periods indicates a potential routing misconfiguration.

19.  Router interface down

Any router interface that is administratively up but operationally down is likely to be a redundant connection that will cause an outage if the other connection also fails, affecting all applications that use it. Redundant networks hide first failures, so it is important to identify those failures before a second failure causes an outage. Best practice is to administratively shut down router interfaces that are not supposed to be active, so that any interface in the up/down state is an indication that something has failed.
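
A minimal sketch of that up/down check follows. It assumes interface status has already been collected into simple records; the field names (`name`, `admin`, `oper`) and the interface names are made up for the example.

```python
def suspect_interfaces(interfaces):
    """Return names of interfaces that are administratively up but operationally down."""
    return [
        intf["name"]
        for intf in interfaces
        if intf["admin"] == "up" and intf["oper"] == "down"
    ]


# Hypothetical status records for one router.
status = [
    {"name": "Gi0/1", "admin": "up", "oper": "up"},
    {"name": "Gi0/2", "admin": "up", "oper": "down"},    # likely a failed link
    {"name": "Gi0/3", "admin": "down", "oper": "down"},  # intentionally shut down
]
failed = suspect_interfaces(status)
```

If unused interfaces are always shut down, everything this function returns deserves a look.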

20.  Unstable root bridge

Bridge priority not set; applications quit working over unstable VLANs. An inexpensive switch that has the same bridge priority but a lower MAC address than the desired root bridge in a spanning tree will try to become the root bridge. But in a busy VLAN, it may not have the backplane bandwidth or CPU to handle the task and may not send BPDUs as frequently as it should (every 2 seconds by default). When several BPDUs are missed, the other switches elect another switch as the root. The STP re-convergence will affect application connectivity. The change is difficult to troubleshoot because the network is working again by the time a network engineer looks at it. Application connectivity seems to be intermittent.
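
To make the election rule concrete, here is a small sketch of how spanning tree picks a root: the lowest bridge priority wins, and the lowest MAC address breaks ties. The switch names, priorities, and MAC addresses are invented for the example.

```python
def bridge_id(priority, mac):
    """Bridge ID as a comparable tuple: priority first, then MAC as an integer."""
    return (priority, int(mac.replace(":", ""), 16))


def elect_root(switches):
    """Return the name of the switch with the lowest bridge ID."""
    return min(switches, key=lambda s: bridge_id(s["priority"], s["mac"]))["name"]


# Both switches left at the default priority: the lower MAC wins.
defaults = [
    {"name": "core-1", "priority": 32768, "mac": "00:1b:54:aa:00:01"},
    {"name": "cheap-desk-switch", "priority": 32768, "mac": "00:0a:41:00:00:01"},
]
root_with_defaults = elect_root(defaults)

# Lowering the priority on the intended root makes the MAC irrelevant.
tuned = [dict(defaults[0], priority=4096), defaults[1]]
root_after_tuning = elect_root(tuned)
```

With default priorities the desk switch becomes root simply because of its MAC address; setting an explicit lower priority on the core switch removes that accident.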

21.  Duplex mismatch

Increasing link errors; Applications get slower as traffic volume increases. CRC errors, late collisions, and FCS errors are indicators of duplex mismatch. A server is installed and ping works, so it is declared functional, but as the traffic to it builds, errors increase. Finger pointing between the network, server, and application teams often results until the duplex mismatch is discovered. Vendor recommendations (Microsoft: fixed full duplex; Cisco: auto-negotiate) exacerbate the problem.

22.  Downstream hub or switch

Unauthorized devices added to the network; Compromise to network integrity and security; See 20. Wireless routers, switches, hubs, and other network devices should be under a common administration in order to provide the best network security. Another switch could have a lower priority, making it the root bridge of a VLAN and causing stability problems (see 20). Rogue DHCP servers in wireless routers can cause intermittent connectivity problems within a subnet, unless specific configurations protect against it.

23.  Port in ErrDisable state

The set of end stations connected via this port are disconnected from the network until the port is enabled (either automatically or by user control). A variety of configuration options allow switch ports to be disabled when certain conditions occur, such as receiving BPDUs or DHCP responses (see 20, 22). Some vendors will disable a port if it experiences too many errors. Automatically identifying these ports can avoid a trouble call from a user or server administrator who is having connectivity problems as a result of a port being disabled.

24.  Unbalanced & unused ether-channels

Increased latency & jitter affecting sensitive applications like VoIP; Compromised redundancy. Packet distribution across an ether-channel may be unbalanced if a non-optimum packet distribution algorithm is selected. By changing the algorithm, the ether-channel packet distribution is more balanced and overall throughput increases. An unbalanced ether-channel will be more easily congested, resulting in application performance that’s less than expected.
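
The effect of the distribution algorithm can be shown with a toy model. Real switches use vendor-specific hash functions; the sketch below is a deliberately simplified assumption in which addresses are small integers and the member link is chosen by a modulo of either the destination alone or the XOR of source and destination.

```python
from collections import Counter


def link_for(flow, links, method):
    """Pick a member link for a (src, dst) flow using a toy hash."""
    src, dst = flow
    key = dst if method == "dst" else src ^ dst
    return key % links


def distribution(flows, links, method):
    """Count how many flows land on each member link."""
    counts = Counter(link_for(f, links, method) for f in flows)
    return [counts.get(i, 0) for i in range(links)]


# Eight clients (addresses 1 to 8) all talking to one server (address 100).
flows = [(client, 100) for client in range(1, 9)]
by_dst = distribution(flows, 4, "dst")          # every flow hashes the same way
by_src_dst = distribution(flows, 4, "src-dst")  # source address adds variety
```

When most traffic goes to a single server, hashing on the destination alone sends every flow down one link, while including the source address spreads the flows across all four.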

25.  HSRP or VRRP peer not found

Redundancy configured and not operating correctly; Outage when a second failure occurs. A connectivity or application outage may not yet have occurred because one device in the redundant pair is still running, but its backup peer cannot be found. The cause may be a broken link between the devices, a redundant device that has not yet been installed, or a failure of the redundant device or its interface. When the second failure in the redundant configuration occurs, a network outage occurs, impacting applications. Knowing that a redundant configuration is not operational allows it to be corrected before important applications are affected. Identifying and correcting these problems will allow your network to better service your business’ network requirements.

We want to thank NetCordia for sharing this list with us and for showing how NetMRI has helped to discover these problems and reduce network outages. You can view the first list here.

The Top 25 Network Problems and Their Business Impact (Part1)

It is always interesting to talk about the importance of network analysis and the problems that NetMRI can discover. But often these problems don’t mean very much on their own; the business person wants to know what the business impact is. For the network engineer, the problems are interesting, but they need to be related to the business in order to communicate their importance to the business people.

NetMRI has compiled a list of the top 25 network problems and how they impact the applications our business depends on to operate. We will look at problems 1 to 13 in this issue and at 14 to 25 in the next issue.

1. Configuration not saved: A reboot will cause the new configuration to be lost. After a power outage on a network device, the operation of the network changes because the old saved configuration replaces the new one upon reboot.

2. Saved configurations don’t meet corporate policy: Source of many problems, from performance to reliability to security. Corporate policy may be due to regulatory policies (PCI, HIPAA, SOX), or may be based on accepted best practices. Checking that they are consistently applied across hundreds of routers and switches is nearly impossible to do with manual processes.

3. Bloated firewall rule set, unused ACL entries: Poor firewall performance; Open, unused rules, creating potential security problems. Identifying unused firewall rules makes rule sets much easier to understand and maintain, and shows which rules can be safely removed, resulting in improved network security.

4. Firewall connection count exceeded: New connections via the firewall fail; Business applications exhibit intermittent failure at high firewall loads; VPNs begin to fail. When the connection count of a busy firewall is exceeded, new connections are refused. The applications experience intermittent network connectivity as the connection count is exceeded and then drops, making it difficult to troubleshoot.

5. Link hog – downloading music or videos: Slower application response, impacting user productivity. When one application or user is consuming most of the bandwidth on a link, it impacts the other applications and users of that link. NetMRI uses Getflow to immediately collect NetFlow data on a link that’s suddenly running at high utilization to identify applications and users of the link, allowing the network engineer to quickly understand the cause of the slowdown to other applications and take action if necessary.

6. Interface traffic congestion: Unpredictable application performance, impacting user productivity. When a router interface is congested, it starts discarding packets, so monitoring packet discards is an early indicator that the applications using the link need more bandwidth, or that a rogue application is now consuming bandwidth that’s needed by business applications.
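
As an illustration of monitoring packet discards, the sketch below turns two samples of interface counters into a discard percentage for the interval. The counter names are assumptions for this example (`packets` means packets forwarded and `discards` means packets dropped), and counter wrap is ignored.

```python
def discard_rate(prev, curr):
    """Percent of packets discarded between two counter samples."""
    forwarded = curr["packets"] - prev["packets"]
    discarded = curr["discards"] - prev["discards"]
    total = forwarded + discarded
    if total <= 0:
        return 0.0  # idle interface, or counters were reset
    return 100.0 * discarded / total


# Two hypothetical polls of the same interface, five minutes apart.
earlier = {"packets": 1000, "discards": 0}
later = {"packets": 1990, "discards": 10}
rate = discard_rate(earlier, later)
```

Tracking the rate per interval rather than the raw counter makes it easier to see when discards start, which is the early warning the item describes.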

7. Link problems & stability: Physical or DataLink errors cause slow or intermittent application performance; Link or interface stability can impact routing and spanning tree (see 13, 14, 15, 16, 20). Whenever a link has high errors or is unstable, applications will have problems making effective use of the link. When routing or spanning-tree protocols are impacted, the effects may spread to other parts of the network, depending on the network’s design.

8. Environmental limits exceeded: Fan failure, power supply problems, and high temperatures are indicators of problems that will likely cause a network device to reboot, affecting any applications relying on the device. Identifying and correcting environmental problems will make the network, and the applications that depend on it, more reliable.

9. Memory utilization increasing: A bug in the device’s operating system is consuming more memory and when no free memory exists, the device will reboot, disrupting applications that are transiting the device. Imagine troubleshooting a network problem that occurs every two weeks as the device runs out of memory and reboots. We’ve seen this happen in production networks. The business impact depends on how often it occurs and what applications are affected.
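
One way to catch this pattern before the reboot is to fit a straight line to free-memory samples and project when it reaches zero. The sketch below is a plain least-squares fit under the assumption that the leak is roughly linear; the sample format (hour, free bytes) is made up for the example.

```python
def hours_until_exhausted(samples):
    """Project hours from the last sample until free memory hits zero.

    samples is a list of (hour, free_bytes) pairs. Returns None when
    there is too little data or free memory is not shrinking.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    spread = sum((t - mean_t) ** 2 for t, _ in samples)
    if spread == 0:
        return None
    slope = sum((t - mean_t) * (m - mean_m) for t, m in samples) / spread
    if slope >= 0:
        return None  # free memory is flat or growing
    intercept = mean_m - slope * mean_t
    zero_at = -intercept / slope
    return zero_at - samples[-1][0]


# Hypothetical daily samples: 100 bytes of free memory lost per day.
remaining = hours_until_exhausted([(0, 1000), (24, 900), (48, 800)])
```

A projection like this turns "the device reboots every two weeks" into an alert that fires days in advance.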

10. Incorrect serial bandwidth setting: Causes routing protocols to make non-optimum routing decisions. If the bandwidth is too low, it can affect the operation of the routing protocol itself, making routes unstable. Remote branches will experience unreliable application operation, which will be difficult to troubleshoot because you’ll have to catch it when it is happening. As applications begin using more link bandwidth, the routing protocol can become unstable. If you need to alter network traffic paths, use policy based routing mechanisms instead of changing link bandwidth parameters. Also make sure tunnels have accurate bandwidth settings.
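
To see why the bandwidth setting matters, consider OSPF as one example. The sketch below assumes the Cisco-style cost formula (reference bandwidth divided by interface bandwidth, with a default reference of 100 Mbps and a minimum cost of 1); other protocols and vendors compute metrics differently.

```python
def ospf_cost(bandwidth_bps, reference_bps=100_000_000):
    """OSPF interface cost, assuming the Cisco-style formula and default reference."""
    return max(1, reference_bps // bandwidth_bps)


t1_correct = ospf_cost(1_544_000)  # T1 configured with its real bandwidth
t1_wrong = ospf_cost(64_000)       # same T1 mistakenly configured as 64 kbps
```

The same physical link looks far more expensive to the routing protocol when the bandwidth statement is wrong, so traffic may be steered onto a path that is actually slower.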

11. No QoS: Important business applications are not prioritized, yielding unpredictable or poor performance during times of interface congestion. Applications like VoIP or SAP are susceptible to high jitter and packet loss when QoS is not used. Configurations that match corporate policy for QoS deployment are important (see 2).

12. QoS Queue Drops: Important business applications are slow; Business needs have changed since the queue definitions were created. A network designed for four concurrent VoIP calls will not perform well when more people are hired and the number of concurrent calls increases. Similar conditions exist for other applications. Queue drops are an early indicator of potential problems that require a network change.

13. Route flaps: Poor application performance as packets take the wrong or inefficient paths in the network. It may be caused by unstable links or improperly configured routing protocol timers (see 2, 7). Packets may also arrive out of order, which some applications cannot tolerate. Varying paths will also cause high jitter, which affects time sensitive applications like VoIP and SAP. Studies have shown that people can deal with relatively high delay as long as the delay is consistent. But high variance in application response will drive people crazy. Identifying and correcting these problems will allow your network to better service your business’ network requirements.

For more information on how Telnet Networks can help solve your network problems, or if you would like your own poster, e-mail us at sales@telnetnetworks.ca or go to www.telnetnetworks.ca

Fine-Tuning WAN Acceleration with Observer

As more users access applications from remote locations over the WAN, it will be critical for you to ensure a positive user experience. WAN acceleration technology is an attractive way to enhance application delivery while reducing bandwidth needs. The key to successfully deploying WAN optimization technology is to be aware of how acceleration will impact application delivery and quality, and to understand the source of delay to improve troubleshooting. This article is a follow-up to our session Preparing for WAN Optimization. In this post we will discuss how Network Instruments Observer® can help evaluate, optimize, and troubleshoot WAN performance before and after rolling out WAN optimization technology.

  1. Auto-Baselining
    Run auto-baselining frequently to establish application performance baselines and compare application response and operation times over time. For example, verify that applications such as back-ups and software updates are running as scheduled and not impeding other business functions. Sometimes, operations can take longer to complete, which can indicate a problem. Also, before implementing WAN acceleration, it’s important to understand response and completion times so that you can identify and measure any improvements post-deployment. To view Auto-Baselining reports:
    Main Menu > Trending/Analysis > Start Web Browser Report > select Application Transaction Analysis or Application Performance Analysis Baselining Report
  2. Application Transaction Analysis (ATA)
    ATA plays a critical role in allowing you to identify the specific point of an application delay and whether the point of delay was caused by the application traversing the WAN. To view ATA:
    Main Menu > Trending/Analysis > Application Transaction Analysis
  3. VoIP Expert Events
    In converged networks with VoIP, it can be difficult to isolate the source of delay. VoIP Expert Events is critical for identifying the impact traversing the WAN may have on calls, as well as isolating specific problem points in a VoIP call. You’ll also be able to monitor VoIP in relation to other applications on the network. To view VoIP Events:
    Capture > Packet Capture > Start Capture > Decode > Expert Analysis Tab > VoIP Events
  4. End-to-End Analysis
    Quickly identify whether delay issues occurred on the WAN or at the server or client. To view Server Analysis:
    Main Menu > Trending/Analysis > End-to-End Server Analysis
  5. MultiHop Analysis
    Once delay has been identified, use MultiHop Analysis to pinpoint the specific network segment or hop where delay occurred on the WAN. To view MultiHop Analysis:
    Main Menu > Trending

Preparing for WAN Acceleration

Over the past few years, we have seen server consolidation occurring for many reasons ranging from security to virtualization and simple cost-cutting efforts. In addition to consolidation at the core, users are more distributed, accessing the network from a variety of locations including remote and home offices and smart phones.

Having an increased number of remote users accessing applications on fewer servers can introduce significant bandwidth and latency issues. Applications are typically designed to operate in LAN environments, and may not function well when accessed over the WAN. In this article, we’ll look at using WAN acceleration technologies to address latency issues, types of WAN accelerators, and key issues when deploying WAN accelerators.

Overcoming WAN Performance Problems
To overcome WAN performance constraints and address latency and delay issues, many have turned to WAN acceleration solutions. WAN accelerators speed the delivery of applications by eliminating redundant transmissions, staging data in local caches, and compressing and prioritizing data. WAN accelerators generally perform their task via three methods:

Tokenization: Saves bandwidth by ‘remembering’ chunks of data at each accelerator and forwarding tokens as reference points for data previously transmitted, rather than sending the same data over and over. When a token is received, the local accelerator swaps the received token for the referenced data to be forwarded from its local memory. This method works well for most applications with the occasional exception of digital scanner systems.
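
The token swap described above can be sketched as a pair of in-memory stores, one per accelerator. This is only an illustration of the idea: the `Accelerator` class, the tiny 8-byte chunk size, and the truncated SHA-256 token are all assumptions chosen to keep the example short, not how any specific product works.

```python
import hashlib

CHUNK = 8  # unrealistically small chunk size, just for the demo


def make_token(chunk):
    """Short fingerprint for a chunk (truncated SHA-256, an illustrative choice)."""
    return hashlib.sha256(chunk).digest()[:8]


class Accelerator:
    def __init__(self):
        self.store = {}  # token -> chunk this accelerator has seen

    def encode(self, data):
        """Replace chunks already sent with tokens; send new chunks as raw data."""
        messages = []
        for i in range(0, len(data), CHUNK):
            chunk = data[i:i + CHUNK]
            token = make_token(chunk)
            if token in self.store:
                messages.append(("token", token))
            else:
                self.store[token] = chunk
                messages.append(("data", chunk))
        return messages

    def decode(self, messages):
        """Rebuild the original bytes, learning new chunks along the way."""
        parts = []
        for kind, payload in messages:
            if kind == "data":
                self.store[make_token(payload)] = payload
                parts.append(payload)
            else:
                parts.append(self.store[payload])
        return b"".join(parts)


sender, receiver = Accelerator(), Accelerator()
payload = b"ABCDEFGH" * 3
first_pass = sender.encode(payload)   # one raw chunk, then tokens for the repeats
rebuilt = receiver.decode(first_pass)
second_pass = sender.encode(payload)  # everything is already known: tokens only
```

The second transfer of the same payload consists entirely of 8-byte tokens, which is where the bandwidth saving comes from once chunks are realistically large.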

Compression: Likely the most effective method of acceleration, it compresses the raw data before sending. Data is then uncompressed on the remote machine unbeknownst to the user.
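
A minimal sketch of the compress-then-restore step, using Python's standard `zlib` module as a stand-in for whatever algorithm an accelerator actually uses. The function names and the sample payload are invented for the example.

```python
import zlib


def compress_for_wan(payload):
    """Compress raw bytes before they cross the WAN link."""
    return zlib.compress(payload, 6)


def restore(compressed):
    """Decompress on the far side so the application sees the original bytes."""
    return zlib.decompress(compressed)


# Repetitive application traffic compresses well.
original = b"GET /reports/weekly HTTP/1.1\r\nHost: intranet\r\n\r\n" * 200
on_the_wire = compress_for_wan(original)
delivered = restore(on_the_wire)
```

How much is saved depends heavily on the data: text-based protocols shrink a lot, while traffic that is already compressed or encrypted gains little.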

Caching: Data sent to a remote site is cached locally and synchronized at scheduled times, thus allowing content to be sent at off-peak times. This method can greatly reduce the amount of data sent over WAN links during normal work hours. Having data stored locally speeds its delivery to remote offices, and can allow them to function even if the WAN is down. When WAN service is restored, there can be issues with reconciling the newly cached data at each remote site with the core data.

The type of acceleration used will depend upon your goals, the applications you’re optimizing, and network devices and configuration. For example, if you have implemented Quality of Service (QoS) for WAN prioritization, you’ll want to understand whether the WAN accelerator will impact application QoS settings.

Deployment Preparations
Now let’s look at five key considerations your network team will want to make before rolling out any type of WAN optimization device.

  1. Define the applications traversing the WAN and identify the underlying protocols and codecs. You’ll want to understand how each of these protocols is impacted by different types of acceleration. In the case of VoIP, for example, the G.711 codec’s quality will be impacted when traversing the WAN, whereas G.729 would only be minimally affected.
  2. Understand which stations are communicating across the WAN and their locations. Ensure that equipment and stations are placed correctly for optimizing WAN performance. If you have multi-tiered applications with web frontends interfacing with SQL servers accessing a database to pull objects, it is best to keep the SQL packets local to one network rather than flooding the WAN with 80-byte packets.
  3. Define baseline measurements for application utilization and performance, including utilization throughout the day, application response times, and when operations should occur. Are backups taking too long, or occurring during peak demand times?
  4. What is the actual latency for WAN links? It’s important to understand whether latency is significant enough of an issue to justify WAN acceleration. A rule of thumb is if latency is above 45 ms, it is worth considering WAN acceleration.
  5. Optimize applications by pinpointing delay locations. The WAN is an easy target to blame; however, the backend core processes are frequently the more likely culprit. Issues experienced by users when accessing the WAN may actually be attributed to the LAN, but due to proximity and bandwidth issues they may not be easy to detect. Acceleration in this case may help but not be as effective in resolving delay issues as making configuration changes at the core.
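
To put the latency question from item 4 in numbers, a single TCP connection can send at most one window of data per round trip, so its throughput is capped at window size divided by round-trip time. The sketch below applies that relationship; it assumes the latency figure is a round-trip time and a 65,535-byte window (no window scaling), both of which are assumptions for illustration.

```python
def max_tcp_throughput_mbps(window_bytes, rtt_ms):
    """Upper bound on single-connection TCP throughput: one window per round trip."""
    bits_per_round_trip = window_bytes * 8
    seconds_per_round_trip = rtt_ms / 1000.0
    return bits_per_round_trip / seconds_per_round_trip / 1_000_000


lan_limit = max_tcp_throughput_mbps(65_535, 1)   # typical LAN round trip
wan_limit = max_tcp_throughput_mbps(65_535, 45)  # the 45 ms rule-of-thumb threshold
```

Under these assumptions a connection that could move hundreds of megabits per second on the LAN is limited to roughly 11.65 Mbps at 45 ms, regardless of how much WAN bandwidth has been purchased.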

In Conclusion
As more users access applications from remote locations over the WAN, it will be critical for you to ensure a positive user experience. WAN acceleration technology is an attractive way to enhance application delivery while reducing bandwidth needs. The key to successfully deploying WAN optimization technology is to be aware of how acceleration will impact application delivery and quality, and to understand the source of delay to improve troubleshooting. In the next issue, we’ll discuss using Observer® to prepare the network for a WAN acceleration deployment as well as troubleshooting WAN performance problems.