Approach to Troubleshooting

ajaym259 | General

The important thing is that “a problem is a problem”. It must be resolved however critical or minor it is, whether it is classed as a problem or an incident, and whether it affects one user, one machine, or many.

When you troubleshoot an issue, it’s important to speak to the user of the application first. Start by asking the user some questions, and then reproduce the issue at your end. There are four steps we always follow while resolving any issue, and after it has been resolved.

Define the issue and document it – Make sure you understand the incident you are facing. If it’s a client issue, confirm that your understanding matches theirs and that you know what the success criteria for closing the issue will be. The process to follow is then:

  1. Incident/Problem (issue reported by user or tool) and Cause

Identify and diagnose the potential causes of the issue. Document each stage of your investigation and its results so that you don’t repeat the same path. At this stage you should build your theory as to why the issue is happening.

  2. Resolution (what you did to resolve it)

Based on your diagnosis, attempt a planned resolution, again making sure you document what you have done. If it works, go to the next stage; if not, consider what you’ve learned and go back to the diagnosis.

  3. Acceptance

If you believe it’s fixed, go back to your original issue statement and test your solution against it. If it’s a client issue, walk through this with the client.

  4. Closure

Now that the client has accepted that the issue is resolved, ask what can be done to prevent it from happening again.
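The four stages above can be sketched as a small state machine. This is only an illustration under assumptions: the names `Stage`, `ALLOWED`, and `IssueRecord`, and the sample issue text, are hypothetical and not taken from any particular ticketing tool.

```python
from dataclasses import dataclass, field
from enum import Enum


class Stage(Enum):
    DIAGNOSIS = "diagnosis"      # incident/problem and cause
    RESOLUTION = "resolution"
    ACCEPTANCE = "acceptance"
    CLOSURE = "closure"


# Allowed moves between stages (an assumption for this sketch): a failed fix
# or a failed acceptance test sends the issue back to diagnosis.
ALLOWED = {
    Stage.DIAGNOSIS: {Stage.RESOLUTION},
    Stage.RESOLUTION: {Stage.ACCEPTANCE, Stage.DIAGNOSIS},
    Stage.ACCEPTANCE: {Stage.CLOSURE, Stage.DIAGNOSIS},
    Stage.CLOSURE: set(),
}


@dataclass
class IssueRecord:
    statement: str            # the agreed issue statement
    success_criteria: str     # what "fixed" means to the client
    stage: Stage = Stage.DIAGNOSIS
    notes: list = field(default_factory=list)

    def log(self, text: str) -> None:
        """Document what was tried or found at the current stage."""
        self.notes.append((self.stage, text))

    def move_to(self, new_stage: Stage) -> None:
        """Move to another stage, rejecting jumps the process does not allow."""
        if new_stage not in ALLOWED[self.stage]:
            raise ValueError(
                f"cannot go from {self.stage.value} to {new_stage.value}"
            )
        self.stage = new_stage


# Hypothetical example issue
issue = IssueRecord("Report export times out", "Export completes without error")
issue.log("Theory: the export query is missing an index")
issue.move_to(Stage.RESOLUTION)
issue.log("Added the index; export still times out")
issue.move_to(Stage.DIAGNOSIS)   # the fix did not work, so back to diagnosis
```

Keeping the notes next to the stage they were written in is one way to avoid repeating the same path when an issue goes back to diagnosis.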

Ask some very basic questions, for instance: 

  1. When did this issue first occur? 
  2. Is the issue intermittent? 
  3. Can you replicate the issue? 
  4. What are you expecting the application to do? 
  5. How have you been working around the issue? 

Then ascertain the scope of the issue, for instance: 

  1. How big is the issue in comparison to other issues currently being triaged? 
  2. How important is this issue to the user? 
  3. How many people are affected by this issue? 
  4. How much (if any) loss of revenue is this issue causing? 

Then work with the user, or a group of users depending on the issue, to see the problem for yourself, and work from the error down until the root cause is located. How the issue is fixed ultimately depends on its cause. Gather as much information as you can about the issue and the user’s expectations before you start investigating.

When you then look at the issue, work systematically from the top down until the root cause is located. This may take time, but once the cause is found, resolving it will clear one or more symptoms. You also have the added bonus of sometimes learning something new. You must:

  1. Listen / read actively and catch the keywords. 
    1. Probe and determine the real issue – a slowdown and a deadlock aren’t the same thing, yet “the client is slow” and “the client doesn’t respond anymore” can mean the same to most end users.
    2. Probe until you narrow down where the issue is – if only one user is affected it is probably a client issue; if several are affected it is likely a server-side problem. 
    3. Do NOT stop once you believe you have put your finger on it. For example, you find the user has SP1 installed whereas every other user has SP5. Replicate the problem if you can, then probe further: you may discover the issue could also be due to a lack of RAM, because this user has 2 GB of RAM and the others have 4 GB. 
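One way to avoid stopping at the first difference is to list every attribute on which the affected machine differs from the working ones. The sketch below assumes the environment details have already been collected into plain dictionaries; the function name `differing_attributes` and the `client_version` value are hypothetical, while the service pack and RAM figures come from the example above.

```python
def differing_attributes(affected: dict, baseline: dict) -> dict:
    """Return every attribute where the affected machine differs from the baseline.

    Each value is a pair: (value on affected machine, value on baseline).
    """
    keys = affected.keys() | baseline.keys()
    return {
        key: (affected.get(key), baseline.get(key))
        for key in sorted(keys)
        if affected.get(key) != baseline.get(key)
    }


# Illustrative data only
affected_user = {"service_pack": "SP1", "ram_gb": 2, "client_version": "hypothetical-1.0"}
other_users = {"service_pack": "SP5", "ram_gb": 4, "client_version": "hypothetical-1.0"}

for name, (theirs, others) in differing_attributes(affected_user, other_users).items():
    print(f"{name}: affected={theirs} others={others}")
```

Here both the service pack and the RAM show up as differences, so neither can be declared the root cause until the problem has been replicated with each one changed in turn.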

Customers/users are mostly looking for two types of solution:

  1. A quick fix that avoids any work stoppage.
    The Support Team should always have a workaround for every known issue so that a quick fix can be provided as soon as possible.

  2. A permanent solution for a problem that annoys customers/users repeatedly.
     The Support Analyst should log the recurring issue as an “Incident” and perform a Root Cause Analysis (RCA) to fix the problem permanently for the user. This may involve reproducing the issue, trying a different environment or user, gathering all the related information from the user, and consulting or escalating to the appropriate team or person if required.

It is always good practice to give the customer/user a tentative ETA to maintain transparency between the Support Team and customers. The ETA depends on which level of support you are providing (Level 1, Level 2 … Level 4), and that in turn depends on the organization and how it follows the ITIL process. 

Level 1 Support – First point of contact: identification of incidents; diagnosis, escalation and resolution based on documented processes and procedures. 
Level 2 Support – First point of escalation. Provides guidance and instructions to Level 1 support for diagnosis and resolution, and takes ownership of incidents where subject matter expertise and experience are required for diagnosis. 
Level 3 Support – Handles issues where a change to a component is required to resolve them. Which level deals with an issue also depends on its type: code, configuration, or performance.

Use FAQ lists, the organisation’s wiki, and the KEDB (Known Error Database) as much as possible. These may provide a list of proven solutions, illustrated with clear pictures, that explain what the issue could be and, more importantly, what the solution can be. This can filter out about 90% of issues. For technical issues that the FAQs can’t filter out (the other 10%), or that are occurring for the first time, work as if the issue were a P1 so that the problem is resolved and business and financial loss are minimised.
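As a rough illustration of filtering against a KEDB, the sketch below ranks known-error entries by how many of their keywords appear in the user’s description. The entry layout (`id`, `keywords`, `workaround`), the sample entries, and the function name `search_known_errors` are all assumptions made for this example; a real KEDB will have its own structure and search.

```python
import re


def search_known_errors(description: str, kedb: list) -> list:
    """Return KEDB entries whose keywords appear in the description, best match first."""
    words = set(re.findall(r"[a-z0-9]+", description.lower()))
    scored = []
    for entry in kedb:
        hits = len(words & entry["keywords"])
        if hits:
            scored.append((hits, entry))
    # Stable sort: entries with the same number of hits keep their KEDB order
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [entry for _, entry in scored]


# Hypothetical known-error entries
KEDB = [
    {"id": "KE-001", "keywords": {"login", "timeout"},
     "workaround": "Clear the cached session and sign in again"},
    {"id": "KE-002", "keywords": {"export", "slow", "report"},
     "workaround": "Export in smaller date ranges"},
]

matches = search_known_errors("Report export is very slow since Monday", KEDB)
for entry in matches:
    print(entry["id"], "-", entry["workaround"])
```

An empty result is the signal that the issue falls outside the known errors and should be investigated from scratch.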

You must follow some basic rules:

  1. The main rule is to keep the production environment under version control (configurations, DB schemas, etc.). Whether you use Rational ClearCase or Subversion does not matter.
  2. The second rule is to know who changed what on the previous day, using Rational ClearQuest, JIRA, or anything similar. 
  3. The third rule is to reduce direct changes to the production environment, and to require all changes to be made via releases after appropriate testing and improvement. 
  4. The fourth rule is to write and share genuinely useful functional specifications, clear user guides, and test plans.

The main requirement is to keep all artefacts under the control of a configuration management system; whether it is Subversion or Rational ClearCase does not matter. If you keep source code and all configuration under control, you can answer four main questions at any moment: 
– where the change was made 
– when it was made 
– who made the change 
– what the reason / purpose for the change was 

You are going to have stress every single morning unless you have that information, and you may have problems (and a bad reputation) every single day for as long as any of these requirements is not met.
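The four questions map naturally onto one record per change. The sketch below assumes the change history has already been exported from whatever tool is in use into a list of such records; `ChangeRecord`, `changes_since`, and the sample entries are hypothetical and do not represent the API of Subversion, ClearCase, ClearQuest, or JIRA.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ChangeRecord:
    where: str        # file, schema, or configuration item that changed
    when: datetime    # when it was changed
    who: str          # who made the change
    why: str          # reason or purpose for the change


def changes_since(records, now, window=timedelta(days=1)):
    """Return the changes made within the window before `now`, newest first."""
    recent = [r for r in records if now - window <= r.when <= now]
    return sorted(recent, key=lambda r: r.when, reverse=True)


# Illustrative data only
now = datetime(2024, 1, 10, 9, 0)
log = [
    ChangeRecord("app.properties", datetime(2024, 1, 9, 17, 30), "alice", "raise pool size"),
    ChangeRecord("orders schema", datetime(2024, 1, 5, 11, 0), "bob", "add index"),
    ChangeRecord("web.xml", datetime(2024, 1, 9, 22, 15), "carol", "session timeout"),
]

for change in changes_since(log, now):
    print(change.when, change.where, change.who, change.why)
```

With the default one-day window this answers the "who changed what on the previous day" question from the second rule; a record with an empty `why` is itself a sign that the process was bypassed.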

GEMS

  1. Listen to the user and make sure you understand the steps they are taking (good listening skills).
  2. Know your product, know your client and know the application.
  3. Understand the symptoms of the trouble and get a history of performance.
  4. Don’t lose focus on the client and the client’s needs against the scope of agreed services provided.
  5. Know when to involve tech teams in a positive manner to target the best solution for the client. 
