How would you fix a network issue?

How do you approach it? Do you panic? Or do you start with the unofficial level 1 support greeting, “Did you try turning it off and on?”

“You don’t have a place to start if you’re put in the middle.”

That being said, this 6-step approach works best for me:

Identify the problem
Establish a theory
Game plan
Implement
Verify
Documentation update (This step typically makes people laugh until the day they realize that an issue they are facing could have been fixed much faster if the documentation had been up to date. At least that’s what my colleagues say; personally, I have documented everything from day one of my career, I swear.)

Remember that when identifying the issue, there are three approaches:

a) top-down
b) bottom-up
c) divide and conquer

The last one is usually the most time-efficient, while the first is the most consistent and has never failed me in finding the root cause of the issue.
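
To make the difference in effort concrete, here is a minimal sketch that compares a layer-by-layer walk with divide and conquer. It rests on two simplifying assumptions that are mine, not a rule of networking: each layer can be tested with a yes/no check, and a fault at one layer also breaks every layer above it. The layer list, `make_check`, and `broken_from` are made up for illustration.

```python
from typing import Callable, Tuple

# Layers ordered from bottom (physical) to top (application).
LAYERS = ["physical", "data link", "network", "transport", "application"]


def make_check(broken_from: int) -> Callable[[int], bool]:
    """Return a fake layer test: layers below `broken_from` pass, the rest fail."""
    return lambda layer_index: layer_index < broken_from


def bottom_up(check: Callable[[int], bool]) -> Tuple[str, int]:
    """Test layers from the bottom; return the first failing layer and the test count."""
    tests = 0
    for index, name in enumerate(LAYERS):
        tests += 1
        if not check(index):
            return name, tests
    return "none", tests


def divide_and_conquer(check: Callable[[int], bool]) -> Tuple[str, int]:
    """Bisect the layer stack; only valid if a fault breaks everything above it."""
    lo, hi, tests = 0, len(LAYERS), 0
    while lo < hi:
        mid = (lo + hi) // 2
        tests += 1
        if check(mid):
            lo = mid + 1  # layer `mid` works, so the fault is higher up
        else:
            hi = mid      # layer `mid` fails, so the fault is here or lower
    return (LAYERS[lo] if lo < len(LAYERS) else "none"), tests


check = make_check(broken_from=4)   # pretend only the application layer is broken
print(bottom_up(check))             # ('application', 5)
print(divide_and_conquer(check))    # ('application', 3)
```

In this toy model the bisecting search needs fewer tests, but it depends entirely on the assumption above; the layer-by-layer walk does not, which is one way to read why it feels more consistent.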

Let’s put this reasoning to the test and consider the following scenario:

Users of PC1 and PC2 are reporting that they cannot access the corporate web page. This, along with a network diagram-style topology, is all the information you have. You figure out that the web page is hosted on the HTTP Server located in the internal data centre network. A DNS lookup shows that the server’s IP address is 10.1.1.100.

If you’re following the strategy, the process should look something like this:

Step 1: Identify the issue. Let’s use a bottom-up approach.

Layer 1 (Physical Layer) Testing:

Power Status: Verify that all devices are powered on, including the PCs, the HTTP Server, Multilayer Switch S1, Multilayer Switch S2, and the Firewall (FW1).
Physical Connectivity: Check that PC1 and PC2 are physically connected to Multilayer Switch S1, and the HTTP Server is connected to Multilayer Switch S2. Ensure that the cables are plugged in securely at both ends.
LED Status: Observe the LED link status indicators on the relevant ports on S1 and S2. If any LED is off, test the cable with a cable tester.

Layer 2 (Data Link Layer) Testing:

Neighbour Visibility/Link: On the switches, check the MAC address table to ensure that the MAC addresses of the PCs and the HTTP Server have been learned on the expected ports. This confirms Layer 2 connectivity.
VLAN Configuration: Ensure that the correct VLANs are assigned to the switch ports connected to PC1, PC2, and the HTTP Server. Based on the topology, all devices should be on VLAN 10.
Spanning Tree Protocol (STP): Check the STP status on S1 and S2 to ensure that no port is unexpectedly blocking and that the root bridge is the switch you expect.
EtherChannel: It’s worth checking if any are configured and, if so, that the configuration is correct.

Layer 3 (Network Layer) Testing:

IP Configuration: Verify the IP settings on PC1 and PC2. Both should have IP addresses within the 10.1.2.0/24 network, with the default gateway set to the internal interface of FW1 (10.1.2.254). Confirm the HTTP Server’s IP configuration is also correct.
DHCP Settings: Check the DHCP server configuration and lease assignments to ensure the scopes are correct.
Ping Test: From PC1 and PC2, ping the IP address of the HTTP Server and other hosts in the 10.1.1.X network to test connectivity. Since everything seems to work, we move forward.
Access-List Check: After thoroughly reviewing the access control lists (ACLs) on the Cisco Firewall (FW1), we discover what appears to be an issue.
An entry in the ACL is denying HTTP traffic from the internal network (10.1.1.0/24) to the HTTP Server’s IP address:
#access-list 100 deny tcp 10.1.1.100 0.0.0.255 host 10.1.1.100 eq 80
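
If the wildcard mask in that entry looks odd, this small sketch shows how I read it: bits set to 0 in the wildcard must match, bits set to 1 are ignored, so a source of 10.1.1.100 with wildcard 0.0.0.255 covers every 10.1.1.x address. The helper name `wildcard_match` and the sample addresses other than 10.1.1.100 are my own; treat it as an illustration of the matching rule, not as a model of the firewall.

```python
import ipaddress


def wildcard_match(address: str, base: str, wildcard: str) -> bool:
    """Return True if `address` matches `base` under a Cisco-style wildcard mask.

    Bits set to 0 in the wildcard must match; bits set to 1 are ignored.
    """
    addr = int(ipaddress.IPv4Address(address))
    ref = int(ipaddress.IPv4Address(base))
    care = ~int(ipaddress.IPv4Address(wildcard)) & 0xFFFFFFFF
    return (addr & care) == (ref & care)


# Source part of the entry: 10.1.1.100 0.0.0.255
print(wildcard_match("10.1.1.7", "10.1.1.100", "0.0.0.255"))     # True
print(wildcard_match("10.1.1.100", "10.1.1.100", "0.0.0.255"))   # True
print(wildcard_match("192.168.0.7", "10.1.1.100", "0.0.0.255"))  # False
```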

Step 2: Establish a Theory

The ACL entry that was found is denying HTTP traffic from the 10.1.1.0/24 network to the HTTP Server at 10.1.1.100. Since the ping test from PC1 and PC2 to the HTTP Server works, we know that network IP connectivity exists, but the HTTP traffic is being blocked. This incorrect ACL entry is likely the cause of the issue.
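
A quick way to back up this theory is to test TCP port 80 directly instead of relying on ping alone. The following is a minimal sketch using only the Python standard library; the function name `tcp_port_open` and the 3-second timeout are my own choices, and it only tells you whether a TCP connection can be opened, not why it fails.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Run from PC1 or PC2, `tcp_port_open("10.1.1.100", 80)` returning False while ping succeeds would be consistent with something between the PCs and the server filtering HTTP.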

Step 3: Game Plan

Let’s try to remove the erroneous ACL entry and replace it with one that permits HTTP traffic.
It’s also a good time to review all ACLs for any other potential misconfigurations.

Step 4: Implement

Since the devices are Cisco, we need to do the following:
Log in to FW1 with the necessary credentials.
Enter privileged EXEC mode by typing ‘enable’.
Enter global configuration mode by typing ‘conf t’.
Remove the incorrect ACL entry by typing ‘no access-list 100 deny tcp 10.1.1.100 0.0.0.255 host 10.1.1.100 eq 80’.
Insert the correct ACL entry by typing ‘access-list 100 permit tcp 10.1.1.100 0.0.0.255 host 10.1.1.100 eq 80’.
Return to privileged EXEC mode by typing ‘end’, then save the configuration by typing ‘write memory’.

Step 5: Verify

Test connectivity from PC1 and PC2 to the web server page.
Since we dealt with access lists, I strongly recommend checking all services and device statuses in the organization. This is where automation shines.
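
As a starting point for that kind of automation, here is a minimal sketch that requests a list of URLs and reports the HTTP status of each. It uses only the Python standard library; the function names `http_status` and `sweep` and the shape of the URL inventory are assumptions of mine, and a real check would probably also cover non-HTTP services.

```python
import urllib.error
import urllib.request
from typing import Dict, Optional


def http_status(url: str, timeout: float = 5.0) -> Optional[int]:
    """Return the HTTP status code for `url`, or None if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return None      # blocked, refused, timed out, or unresolvable


def sweep(urls: Dict[str, str]) -> Dict[str, Optional[int]]:
    """Check every named URL and return name -> status code (or None)."""
    return {name: http_status(url) for name, url in urls.items()}
```

For this scenario, `sweep({"corporate web page": "http://10.1.1.100/"})` run from PC1 and PC2 should return a status code rather than None once the ACL is fixed; any other entries in the dictionary would come from your own service inventory.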

Step 6: Documentation Update

In this scenario, make sure to have an up-to-date running config backup.
Additionally, check whether logs from the time of the change that caused the issue are still available, so you have the upper hand in case you need a favour from the person who messed up.
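
One low-effort way to document the change is to keep a diff between the config backup taken before the fix and the one taken after. The sketch below uses Python’s standard `difflib`; the two snapshots are shortened, made-up examples (the `hostname FW1` line is an assumption), and in practice you would read them from your backup files.

```python
import difflib


def config_diff(before: str, after: str) -> str:
    """Return a unified diff between two saved running-config snapshots."""
    return "".join(
        difflib.unified_diff(
            before.splitlines(keepends=True),
            after.splitlines(keepends=True),
            fromfile="running-config.before",
            tofile="running-config.after",
        )
    )


# Shortened, hypothetical snapshots for illustration only.
before = (
    "hostname FW1\n"
    "access-list 100 deny tcp 10.1.1.100 0.0.0.255 host 10.1.1.100 eq 80\n"
)
after = (
    "hostname FW1\n"
    "access-list 100 permit tcp 10.1.1.100 0.0.0.255 host 10.1.1.100 eq 80\n"
)
print(config_diff(before, after))
```

The output marks the removed deny line with a leading minus and the added permit line with a leading plus, which is usually enough context for whoever reads the change record later.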
