Service Level Objectives
05-30-2020, 12:46 AM (This post was last modified: 06-06-2020 11:02 PM by Bill Duncan.)
Post: #1
Background

This will mostly be of interest to systems operations folk:
system administrators (sysadmins), often called SREs
(Site Reliability Engineers) or DevOps these days.

Things fail. Systems fail. Large-scale systems often depend
on hundreds or even thousands of backend systems these days,
usually virtual machines (VMs) or, more recently, containers.

The more backend systems a request fans out to (usually to
improve response times), the more likely it is that at least
one of them will fail.

The terminology that has grown around this includes:

SLI - Service Level Indicators -- things like availability, latency, errors

SLO - Service Level Objectives -- objectives based on the
indicators that are used to gauge reliability of a
service.

SLA - Service Level Agreements -- sometimes objectives are
communicated with customers in the form of agreements,
often with penalties to the provider if the objectives
are not met.


The reliability (e.g. availability, latency, errors) that users
experience can deteriorate dramatically when the number of
backend systems is increased. This program is about exploring
some of the variables involved: how the number and reliability
of backend systems affect the user experience.
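
To make the deterioration concrete, here is a minimal Python sketch (it is not part of the calculator program). It assumes a request only meets its objective if every one of the N backends it touches does, and that backends fail independently of each other. The name all_backends_ok is made up for illustration.

```python
def all_backends_ok(backend_level: float, n: int) -> float:
    """Probability that all n independent backends meet their objective.

    backend_level is a probability between 0 and 1 (0.999 means 99.9%).
    """
    return backend_level ** n


# Each backend meets its objective 99.9% of the time.
for n in (1, 10, 50, 100, 500, 1000):
    print(f"{n:5d} backends -> {100 * all_backends_ok(0.999, n):6.2f}% for the user")
```

Even with backends that each look very reliable, the user-facing level drops quickly as N grows.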

A more detailed description and background can be found in these two articles:

The Tail at Scale
The Tail at Scale Revisited


Operation:

This program enables you to play with the numbers a bit, possibly
while developing SLOs (for the backend and/or users) and SLAs.

Code:


+----+----+----+----+----+
| SL | FR | A  | BE | N  |  Mnemonics
+----+----+----+----+----+
| A  | B  | C  | D  | E  |  User Keys
+----+----+----+----+----+
| 01 | 02 | 03 | 04 | 05 |  Registers
+----+----+----+----+----+
  ^    ^    ^    ^    ^
  |    |    |    |    |
  |    |    |    |    |
  |    |    |    |    +---> Number of Back End Systems
  |    |    |    +---> Back End Service Level  (SLO)
  |    |    +---> Agreement (SLA/SLO) Customer, Front End
  |    +---> Failure Rate (reciprocal)
  +---> Service Level

The "A" and "B" keys (and corresponding registers) translate
between "service level" (probability of meeting objective) and
"failure rate" (reciprocal). Two ways of describing the same thing.
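
For anyone without the calculator handy, the "A"/"B" translation can be sketched in Python roughly like this. The function names are hypothetical; the arithmetic follows the program listing further down, with service levels in percent and the failure rate given as "1 failure in this many requests".

```python
def service_level_from_failure_reciprocal(fr: float) -> float:
    """'1 failure in fr requests' -> service level in percent (the "A" key)."""
    return (1.0 - 1.0 / fr) * 100.0


def failure_reciprocal_from_service_level(sl_pct: float) -> float:
    """Service level in percent -> '1 failure in this many' (the "B" key)."""
    return 1.0 / (1.0 - sl_pct / 100.0)


print(service_level_from_failure_reciprocal(1000))   # about 99.9
print(failure_reciprocal_from_service_level(99.9))   # about 1000
```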

The "C", "D" and "E" keys (and registers) are used to look at the
relationship between the front end SLO or SLA and backend SLO for
supporting it. The "E" key specifies the number of backend services.
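
The relationship behind "C", "D" and "E" is A = BE^N, with levels taken as fractions. Here is a rough Python sketch of solving it for each variable in turn, assuming independent backend failures. The function names are invented for illustration, and levels are passed in percent as on the calculator.

```python
import math


def frontend_level(backend_pct: float, n: float) -> float:
    """"C": front end level in percent when n backends must all succeed."""
    return 100.0 * (backend_pct / 100.0) ** n


def backend_level(frontend_pct: float, n: float) -> float:
    """"D": backend level in percent needed to support frontend_pct."""
    return 100.0 * (frontend_pct / 100.0) ** (1.0 / n)


def max_backends(frontend_pct: float, backend_pct: float) -> float:
    """"E": number of backends a request can touch and still hit frontend_pct."""
    return math.log10(frontend_pct / 100.0) / math.log10(backend_pct / 100.0)


print(round(max_backends(95.0, 99.9), 2))     # systems we're good for
print(round(frontend_level(99.9, 500), 2))    # front end level at N = 500
```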

The "F" key translates the back end service level in X to the level
you get when each query goes to two replicas and fails only if both
replicas fail. It also updates Register 04.
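
A small Python sketch of what "F" does, going by the program listing: the failure probability is squared. This assumes the two replicas fail independently, which real systems may not quite manage. The function name is hypothetical.

```python
def with_two_replicas(backend_pct: float) -> float:
    """Backend level in percent when a query fails only if both replicas fail."""
    fail = 1.0 - backend_pct / 100.0      # failure probability of one replica
    return 100.0 * (1.0 - fail ** 2)      # both must fail for the query to fail


print(f"{with_two_replicas(99.9):.4f}")   # prints 99.9999
```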

Pressing "R/S" after any calculations or storing is finished
will bring up the mnemonics again. Pressing "R/S" one more time
will turn the calculator off in a way that will display the
mnemonics (and remind you what program you're in) when you turn
it on again.

Using the user keys usually works fine. You can also RCL the register
directly, or prefix the key with "XEQ" to force the calculation. Two
caveats: a user key only stores a value that has just been keyed in, so
it will not detect "number entry" if you reuse a number already sitting
in the X register (just STO that number instead). Conversely, if you had
keyed in a number that you hadn't intended to store, pressing a user key
will store it; press the key again, or use "XEQ" directly, to do the calculation.


Example:

Some customers are complaining that our services are not meeting target
objectives (or agreements). We find that the backend services are
failing to meet their time budgets at a rate of about one in a thousand,
which is two orders of magnitude better than the front end (99.9% vs.
90% in the front end).

Most of our customers are small and the queries hit a few dozen backend
systems while the few larger customers who are complaining can sometimes
hit 500+ systems.

What service level objectives should we be aiming for in the backend
to meet the objectives for all clients? How can we best do that?

Code:


  1000     # backend failure rate reciprocal, roughly 1/1000
  B        # stores it
  A        # Calculate Service Level, see "99.9"
  STO D    # backend SLO
  95 C     # frontend SLO (Objective, more conservative than SLA)
  E        # calculate the number of systems we're good for: 51.27
  500 E    # number of systems req'd for large customers
  C        # Calculate frontend service level probability, 60.64%

           # Fret! Barely better than 50/50

  RCL D    # backend service level
  F        # calculate service level while querying replicas
  C        # Calculate the new improved front end service level, 99.95%
           # Celebrate!!

  STO A
  XEQ B    # 1 fail in 2000 for Front end!

  50 E
  C        # 99.995% for N==50  !!  Bonus!

  STO A
  XEQ B    # 1 in 20,000 fail for N==50 querying both replicas
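
As a cross-check on the session above, this Python sketch (same independence assumptions, hypothetical function names) compares two ways of getting the large customers (N = 500) to a 95% front end objective: tightening the backend SLO itself, which is what the "D" key computes, or keeping 99.9% backends and querying two replicas with "F".

```python
def required_backend_level(frontend_pct: float, n: int) -> float:
    """Backend level in percent needed so n backends together reach frontend_pct."""
    return 100.0 * (frontend_pct / 100.0) ** (1.0 / n)


def failure_reciprocal(level_pct: float) -> float:
    """Service level in percent -> '1 failure in this many'."""
    return 1.0 / (1.0 - level_pct / 100.0)


def replicated_level(backend_pct: float) -> float:
    """Backend level in percent when a query fails only if both replicas fail."""
    fail = 1.0 - backend_pct / 100.0
    return 100.0 * (1.0 - fail ** 2)


# Option 1: tighten the backend SLO itself.
needed = required_backend_level(95.0, 500)
print(f"needed backend level: {needed:.4f}% "
      f"(1 failure in {failure_reciprocal(needed):.0f})")

# Option 2: keep 99.9% backends but query two replicas.
front = 100.0 * (replicated_level(99.9) / 100.0) ** 500
print(f"front end with replicas: {front:.2f}%")
```

Under these assumptions, option 1 needs backends roughly ten times more reliable than they are today, which is why the replica route looks attractive.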

The Code:

Code:

LBL "SLO"
  LBL 00
  "SL FR A BE N"
  AVIEW
  CF 22
  SF 27
RTN
  SF 11
  OFF
GTO 00

LBL A
  FC?C 22
  GTO 01
  STO 01
RTN
GTO 00

LBL B
  FC?C 22
  GTO 02
  STO 02
RTN
GTO 00

LBL 01
  RCL 02
  1/X
  XEQ 09
  STO 01
RTN
GTO 00

LBL 02
  RCL 01
  XEQ 08
  1/X
  STO 02
RTN
GTO 00

LBL C
  FC?C 22
  GTO 03
  STO 03
RTN
GTO 00

LBL 03
  RCL 04
  XEQ 11
  RCL 05
  Y^X
  XEQ 12
  STO 03
RTN
GTO 00

LBL D
  FC?C 22
  GTO 04
  STO 04
RTN
GTO 00

LBL 04
  RCL 03
  XEQ 11
  RCL 05
  1/X
  Y^X
  XEQ 12
  STO 04
RTN
GTO 00

LBL E
  FC?C 22
  GTO 05
  STO 05
RTN
GTO 00

LBL 05
  RCL 03
  XEQ 11
  LOG
  RCL 04
  XEQ 11
  LOG
  /
  STO 05
RTN
GTO 00

LBL F
  XEQ 08
  X^2
  XEQ 09
  STO 04
RTN
GTO 00


LBL 11
  1 E2
  /
RTN

LBL 12
  1 E2
  *
RTN

LBL 09
  1
  X<>Y
  -
  XEQ 12
RTN

LBL 08
  XEQ 11
  1
  X<>Y
  -
RTN
07-19-2020, 01:29 AM
Post: #2
RE: Service Level Objectives
I've added another post with a "close enough" approximation.

The approximation is close enough in the range of customer happiness that matters, and so simple that a calculator isn't really required.. lol..

https://billduncan.org/the-tail-at-scale-approximation/