Service Level Objectives
05-30-2020, 12:46 AM (This post was last modified: 06-06-2020 11:02 PM by Bill Duncan.)
Post: #1
 Bill Duncan Member Posts: 86 Joined: Jan 2016
Background:

This will mostly be of interest to systems operations folk:
SREs (Site Reliability Engineers) and DevOps these days.

Things fail. Systems fail. Large scale systems often depend
on hundreds or even thousands of "Backend" systems these days;
usually Virtual Machines (VMs) or more recently "containers".

The more backend systems that are used (usually to improve
response times), the more likely there will be failures to
deal with.

The terminology that has grown around this includes:

SLI - Service Level Indicators -- things like availability, latency, errors

SLO - Service Level Objectives -- objectives based on the
indicators that are used to gauge reliability of a
service.

SLA - Service Level Agreements -- sometimes objectives are
communicated with customers in the form of agreements,
often with penalties to the provider if the objectives
are not met.

The reliability (e.g. availability, latency, errors) that users
experience can deteriorate dramatically as the number of
backend systems increases. This program is for exploring
some of the variables involved: how the number and reliability
of backend systems affect the user experience.

A more detailed description and background can be found in these two articles:

The Tail at Scale
The Tail at Scale Revisited

Operation:

This program enables you to play with the numbers a bit, possibly
while developing SLOs (for the backend and/or users) and SLAs.

Code:
 +----+----+----+----+----+
 | SL | FR | A  | BE | N  |  Mnemonics
 +----+----+----+----+----+
 | A  | B  | C  | D  | E  |  User Keys
 +----+----+----+----+----+
 | 01 | 02 | 03 | 04 | 05 |  Registers
 +----+----+----+----+----+
   ^    ^    ^    ^    ^
   |    |    |    |    |
   |    |    |    |    |
   |    |    |    |    +---> Number of Back End Systems
   |    |    |    +---> Back End Service Level  (SLO)
   |    |    +---> Agreement (SLA/SLO) Customer, Front End
   |    +---> Failure Rate (reciprocal)
   +---> Service Level

The "A" and "B" keys (and corresponding registers) translate
between "service level" (probability of meeting objective) and
"failure rate" (reciprocal). Two ways of describing the same thing.
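For anyone without the calculator handy, the A/B translation can be sketched in a few lines of Python. This is only my reading of the listing (service level as a percentage, failure rate as the N in "1 failure in N"); the function names are made up for illustration and are not part of the program.

```python
def service_level_from_failure_rate(one_in_n: float) -> float:
    """Turn '1 failure in N' into a service level in percent (like the A key)."""
    if one_in_n < 1:
        raise ValueError("one_in_n must be at least 1")
    return (1.0 - 1.0 / one_in_n) * 100.0


def failure_rate_from_service_level(service_level_pct: float) -> float:
    """Turn a service level in percent into the N of '1 failure in N' (like the B key)."""
    if not 0.0 <= service_level_pct < 100.0:
        raise ValueError("service level must be in [0, 100)")
    return 1.0 / (1.0 - service_level_pct / 100.0)


if __name__ == "__main__":
    # 1 failure in 1000 corresponds to roughly a 99.9% service level, and back again.
    print(round(service_level_from_failure_rate(1000), 4))
    print(round(failure_rate_from_service_level(99.9), 2))
```

A service level of exactly 100% has no finite reciprocal, which is why the sketch rejects it.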

The "C", "D" and "E" keys (and registers) are used to look at the
relationship between the front end SLO or SLA and backend SLO for
supporting it. The "E" key specifies the number of backend services.
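The relationship behind C, D and E can be sketched in Python as follows. It assumes, as the listing appears to, that a request succeeds only if all N backends meet their objective and that backends fail independently, so the front end level is the backend level raised to the power N. The function names are hypothetical.

```python
import math


def front_end_level(backend_pct: float, n_backends: float) -> float:
    """Front end service level (%) when every one of N backends must succeed (C key)."""
    return (backend_pct / 100.0) ** n_backends * 100.0


def backend_level_needed(front_end_pct: float, n_backends: float) -> float:
    """Backend service level (%) needed to hit a front end target with N backends (D key)."""
    return (front_end_pct / 100.0) ** (1.0 / n_backends) * 100.0


def max_backends(front_end_pct: float, backend_pct: float) -> float:
    """Number of backends a query can touch and still meet the front end target (E key)."""
    return math.log10(front_end_pct / 100.0) / math.log10(backend_pct / 100.0)


if __name__ == "__main__":
    print(f"{front_end_level(99.9, 500):.2f}")      # front end level for 500 backends
    print(f"{max_backends(95.0, 99.9):.2f}")        # backends allowed for a 95% target
    print(f"{backend_level_needed(95.0, 500):.4f}") # backend level needed for 500 backends
```

Given any two of the three values, the third follows, which is all the C, D and E keys are doing.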

The "F" key converts the back end service level in the X register
to the level you get when each query goes to two replicas, so that
a request misses only if both replicas miss. It also updates Register 04.
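A rough Python equivalent of the replica calculation, assuming the two replicas fail independently (the listing squares the miss probability). The `replicas` parameter is my own generalisation for illustration; the program itself only handles two.

```python
def replicated_level(backend_pct: float, replicas: int = 2) -> float:
    """Service level (%) when a request misses only if every replica misses."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    miss = 1.0 - backend_pct / 100.0      # probability a single replica misses
    return (1.0 - miss ** replicas) * 100.0


if __name__ == "__main__":
    # A 99.9% backend queried on two replicas misses about one time in a million.
    print(f"{replicated_level(99.9):.4f}")
```

If replicas tend to fail together (same host, same network path), the real improvement will be smaller than this suggests.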

Pressing "R/S" after any calculations or storing is finished
will bring up the mnemonics again. Pressing "R/S" one more time
will turn the calculator off in a way that will display the
mnemonics (and remind you what program you're in) when you turn
it on again.

Using the user keys usually works fine. You can also RCL the register
directly or prefix with "XEQ" to force the calculation. (User keys
will fail to detect "number entry" if you use an existing number in
the X register for example. Just STO the number. Also, if you had entered
a number that you hadn't intended to store, pressing a user key will
store it. Pressing it again will do the calculation, or use "XEQ" directly.)

Example:

Some customers are complaining that our services are not meeting target
objectives (or agreements). We find that the backend services are
failing to meet their time budgets at a rate of about one in a thousand,
which is two orders of magnitude better than the front end (99.9% vs.
90% in the front end).

Most of our customers are small and the queries hit a few dozen backend
systems while the few larger customers who are complaining can sometimes
hit 500+ systems.

What service level objectives should we be aiming for in the backend
to meet the objectives for all clients? How can we best do that?

Code:
  1000     # backend failure rate reciprocal, roughly 1/1000
  B        # stores it
  A        # Calculate Service Level, see "99.9"
  STO D    # backend SLO
  95 C     # frontend SLO (Objective, more conservative than SLA)
  E        # calculate number of systems we're good to.  51.27
  500 E    # number of systems req'd for large customers
  C        # Calculate frontend service level probability, 60.64%
           # Fret! Barely better than 50/50
  RCL D    # backend service level
  F        # calculate service level while querying replicas
  C        # Calculate the new improved front end service level, 99.95%
           # Celebrate!!
  STO A
  XEQ B    # 1 fail in 2000 for Front end!
  50 E
  C        # 99.995% for N==50  !!  Bonus!
  STO A
  XEQ B    # 1 in 20,000 fail for N==50 querying both replicas
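As a cross-check, the same walk-through can be reproduced off the calculator. This Python sketch assumes independent backend failures and that a query needs every backend it touches; the function and key names are invented for illustration.

```python
import math


def worked_example() -> dict:
    """Recompute the numbers from the keystroke example above."""
    backend = 1.0 - 1.0 / 1000.0                     # 1 failure in 1000
    results = {
        # How many backends a query can touch and still meet a 95% front end SLO.
        "max_backends_for_95": math.log(0.95) / math.log(backend),
        # Front end level (%) for the large customers hitting 500 backends.
        "front_end_500": backend ** 500 * 100.0,
    }
    # Query two replicas: a request misses only if both replicas miss.
    backend_two = 1.0 - (1.0 - backend) ** 2
    front_500 = backend_two ** 500
    front_50 = backend_two ** 50
    results["front_end_500_two_replicas"] = front_500 * 100.0
    results["front_end_50_two_replicas"] = front_50 * 100.0
    # Same levels expressed as the N of '1 failure in N'.
    results["one_in_n_500"] = 1.0 / (1.0 - front_500)
    results["one_in_n_50"] = 1.0 / (1.0 - front_50)
    return results


if __name__ == "__main__":
    for name, value in worked_example().items():
        print(f"{name:28s} {value:.4f}")
```

The values should agree with the calculator results above to the precision shown there (51.27, 60.64%, 99.95%, 99.995%, and roughly 1 in 2000 and 1 in 20,000).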

The Code:

Code:
LBL SLO
  LBL 00
  "SL FR A BE N"
  AVIEW
  CF 22
  SF 27
RTN
  SF 11
  OFF
GTO 00
LBL A
  FC?C 22
  GTO 01
  STO 01
RTN
GTO 00
LBL B
  FC?C 22
  GTO 02
  STO 02
RTN
GTO 00
LBL 01
  RCL 02
  1/X
  XEQ 09
  STO 01
RTN
GTO 00
LBL 02
  RCL 01
  XEQ 08
  1/X
  STO 02
RTN
GTO 00
LBL C
  FC?C 22
  GTO 03
  STO 03
RTN
GTO 00
LBL 03
  RCL 04
  XEQ 11
  RCL 05
  Y^X
  XEQ 12
  STO 03
RTN
GTO 00
LBL D
  FC?C 22
  GTO 04
  STO 04
RTN
GTO 00
LBL 04
  RCL 03
  XEQ 11
  RCL 05
  1/X
  Y^X
  XEQ 12
  STO 04
RTN
GTO 00
LBL E
  FC?C 22
  GTO 05
  STO 05
RTN
GTO 00
LBL 05
  RCL 03
  XEQ 11
  LOG
  RCL 04
  XEQ 11
  LOG
  /
  STO 05
RTN
GTO 00
LBL F
  XEQ 08
  X^2
  XEQ 09
  STO 04
RTN
GTO 00
LBL 11
  1 E2
  /
RTN
LBL 12
  1 E2
  *
RTN
LBL 09
  1
  X<>Y
  -
  XEQ 12
RTN
LBL 08
  XEQ 11
  1
  X<>Y
  -
RTN
07-19-2020, 01:29 AM
Post: #2
 Bill Duncan Member Posts: 86 Joined: Jan 2016
RE: Service Level Objectives
I've added another post with a "close enough" approximation.

The approximation is close enough in the range of customer happiness that matters, and so simple that a calculator isn't really required.. lol..

https://billduncan.org/the-tail-at-scale-approximation/