I have worked with many customers to track service level agreements (SLAs) in their Business Service Management (BSM) implementations. I can honestly say that all of the projects had only one thing in common: they were extremely difficult.
Now, I was usually called in midway through the implementation, when the decisions had already been made and the schedule was looking impossible. Or, even worse, I would become involved after the implementation had gone into production and the mistakes had already been made.
So why are SLAs so challenging to track and manage?
- Have you seen the contracts? In general, I don’t like contracts. I’m not a lawyer, and let’s face it, they can be difficult to decipher. With SLAs, the first thing to do is take the contract and figure out exactly what was promised. Then determine what underlying data should be used for the calculations. Then figure out how to get that data from the IT devices and combine it for the service. These steps are crucial to success and must all be done before implementing the SLA solution.
- It’s just (total time – downtime) / total time… Saying that a service needs to be available 99% of the time during peak hours is easy. Determining the actual value of the key availability metric is more challenging. You need to define exactly what constitutes an outage, set up calendars for peak hours, and identify any outages that shouldn’t count (should 1 second of downtime count?). The math for simple availability isn’t difficult, but accounting for all of the necessary factors…well, that is more complex.
- So many numbers…so little time. For as long as computers have existed, engineers have worked tirelessly to optimize performance, but there are still limits to what software can do. You must think about the amount of data to be stored and calculated. For instance, if availability data is stored every minute and the report shows the last two years of availability metrics (oh, and real-time metrics too), that is more than a million samples per metric, and the report is going to take some time to calculate and display.
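The availability formula itself is simple; the inputs are where the work is. As a minimal sketch in Python, assuming hypothetical SLA terms (peak hours of 08:00 to 18:00, Monday through Friday, and a rule that outages shorter than 60 seconds do not count), the calculation might look like this. The constants and function names (`PEAK_START`, `MIN_OUTAGE`, `peak_availability`, and so on) are illustrative assumptions, not part of any BSM product or contract.

```python
from datetime import datetime, time, timedelta

# Hypothetical SLA terms, assumed for illustration only:
PEAK_START = time(8, 0)              # peak hours begin at 08:00
PEAK_END = time(18, 0)               # peak hours end at 18:00
PEAK_DAYS = {0, 1, 2, 3, 4}          # Monday-Friday (datetime.weekday() numbering)
MIN_OUTAGE = timedelta(seconds=60)   # outages shorter than this are ignored


def peak_windows(start: datetime, end: datetime):
    """Yield the peak-hour windows that fall inside [start, end)."""
    day = start.date()
    while day <= end.date():
        if day.weekday() in PEAK_DAYS:
            w_start = max(datetime.combine(day, PEAK_START), start)
            w_end = min(datetime.combine(day, PEAK_END), end)
            if w_start < w_end:
                yield w_start, w_end
        day += timedelta(days=1)


def overlap(a_start, a_end, b_start, b_end) -> timedelta:
    """Length of the intersection of two intervals (zero if they are disjoint)."""
    return max(min(a_end, b_end) - max(a_start, b_start), timedelta(0))


def peak_availability(start: datetime, end: datetime, outages) -> float:
    """(total time - downtime) / total time, counting peak hours only.

    outages: list of (outage_start, outage_end) datetimes, assumed not to
    overlap one another (overlapping outages would be double-counted).
    """
    windows = list(peak_windows(start, end))
    total = sum((w_end - w_start for w_start, w_end in windows), timedelta(0))
    downtime = timedelta(0)
    for o_start, o_end in outages:
        if o_end - o_start < MIN_OUTAGE:
            continue  # below the assumed threshold, so it does not count
        for w_start, w_end in windows:
            downtime += overlap(o_start, o_end, w_start, w_end)
    return (total - downtime) / total  # timedelta / timedelta -> float
```

For example, over the week starting Monday, January 1, 2024, a one-hour outage on Tuesday at 09:00 uses 1 of the 50 peak hours, giving 0.98. A 30-second blip falls under the threshold and an outage at 02:00 falls outside peak hours, so neither counts. Every one of those rules is a decision that has to come out of the contract, not the code.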
These are the three main challenges I see when working with SLA implementations. Now, how do we solve them?
- Know the data before starting. This sounds like a simple task, and most people think they have a good understanding of all of the underlying devices, metrics, and relationships that define the service and the key metric for their SLA. No one would want to start implementing an SLA project without knowing all of the ins and outs. Or would they? People often start modeling their services and tying services to SLAs before all of the underlying infrastructure is in place. A thorough understanding of where the data will come from (monitoring software, trouble ticket systems, back-end databases) is critical, because the calculation can change depending on the type of data.
- Determine what details can alter the key metric. As I mentioned earlier, calculating availability is not difficult. However, determining the total time and the downtime can be. Take into account the time periods reserved for maintenance. Is there a weekly maintenance window? What counts as “on time”? Also, what sort of data can be ignored? Are there certain outages that do not affect the service’s availability? Don’t be too generic…try to identify all of the details that contribute to the SLA’s key metric.
- Be realistic when creating reports. The dashboards and reports are what we really care about. We need a way to show how the SLAs are tracking, and a quick visual of what has failed or is on its way to failing. Putting 1,000 services on a single page is probably not the way to go. Let’s also not reinvent the wheel. If your organization has been calculating SLA metrics in an external program for years, use that data. Why spend extra time setting up the lower-level data to feed into another program that will just do the same calculation?
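Being realistic about reports also helps with the data-volume problem. Instead of recomputing availability from every one-minute sample each time a report loads, one option is to roll the samples up once into per-day counts, so a two-year report reads roughly 730 rows per metric. This minimal sketch also drops samples taken during a planned maintenance window. The Sunday 02:00 to 04:00 window, the `(timestamp, is_up)` sample format, and the function names are hypothetical assumptions for illustration, not any particular product’s data model.

```python
from collections import defaultdict
from datetime import date, datetime

# Hypothetical planned-maintenance window: Sundays 02:00-04:00 (assumption).
MAINT_WEEKDAY = 6          # datetime.weekday(): Monday=0 ... Sunday=6
MAINT_HOURS = range(2, 4)  # hours 02 and 03, i.e. 02:00 up to (not including) 04:00


def in_maintenance(ts: datetime) -> bool:
    """True if the sample falls inside the assumed maintenance window."""
    return ts.weekday() == MAINT_WEEKDAY and ts.hour in MAINT_HOURS


def daily_rollup(samples):
    """Collapse one-minute (timestamp, is_up) samples into per-day counts.

    Returns {date: (up_minutes, counted_minutes)}. Run this once (for
    example, nightly) and let reports read the much smaller result.
    """
    up = defaultdict(int)
    counted = defaultdict(int)
    for ts, is_up in samples:
        if in_maintenance(ts):
            continue  # planned maintenance is excluded from the SLA
        day = ts.date()
        counted[day] += 1
        if is_up:
            up[day] += 1
    return {day: (up[day], counted[day]) for day in counted}


def availability(rollup, first: date, last: date) -> float:
    """Availability over the inclusive range [first, last] from rolled-up rows."""
    rows = [v for d, v in rollup.items() if first <= d <= last]
    total_up = sum(u for u, _ in rows)
    total_counted = sum(c for _, c in rows)
    if total_counted == 0:
        raise ValueError("no countable samples in range")
    return total_up / total_counted
```

The rollup keeps raw counts rather than daily percentages on purpose: averaging daily ratios would give a Sunday with two hours of maintenance the same weight as a full day, while summing counts combines days into weeks or months exactly.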
Tracking and managing service level agreements will continue to take time and effort. It requires buy-in from many different departments and resources, but BSM can and should simplify an SLA implementation.