05-30-2020, 12:46 AM
Background
This will mostly be of interest to systems operations folk:
system administrators (sysadmins), these days often called
SREs (Site Reliability Engineers) or DevOps.
Things fail. Systems fail. Large scale systems these days often
depend on hundreds or even thousands of "backend" systems,
usually Virtual Machines (VMs) or, more recently, "containers".
The more backend systems that are used (usually to improve
response times), the more likely it is that there will be
failures to deal with.
The terminology that has grown around this includes:
SLI - Service Level Indicators -- things like availability, latency, errors
SLO - Service Level Objectives -- objectives based on the
      indicators that are used to gauge reliability of a
      service.
SLA - Service Level Agreements -- sometimes objectives are
      communicated with customers in the form of agreements,
      often with penalties to the provider if the objectives
      are not met.
The reliability (e.g. availability, latency, errors) that users
experience can deteriorate dramatically when the number of
backend systems is increased. This program is about exploring
some of the variables involved: how the number and reliability
of backend systems affect the user experience.
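As a rough worked illustration, assume a front end request succeeds only if all N backends it touches meet their objective, and that backends miss independently (this is the model the program listing itself uses). The symbols p (backend service level as a fraction), N and P_front are notation chosen here, not taken from the articles:

```latex
P_{\text{front}} = p^{N}, \qquad
0.999^{100} \approx 0.905, \qquad
0.999^{500} \approx 0.606
```

So a backend that looks very good on its own (99.9%) leaves a 500-way fan-out only a little better than a coin toss.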
A more detailed description and background can be found in these two articles:
The Tail at Scale
The Tail at Scale Revisited
Operation:
This program enables you to play with the numbers a bit, possibly
while developing SLOs (for the backend and/or users) and SLAs.
Code:
+----+----+----+----+----+
| SL | FR | A  | BE | N  |  Mnemonics
+----+----+----+----+----+
| A  | B  | C  | D  | E  |  User Keys
+----+----+----+----+----+
| 01 | 02 | 03 | 04 | 05 |  Registers
+----+----+----+----+----+
  ^    ^    ^    ^    ^
  |    |    |    |    |
  |    |    |    |    +---> Number of Back End Systems
  |    |    |    +---> Back End Service Level (SLO)
  |    |    +---> Agreement (SLA/SLO) Customer, Front End
  |    +---> Failure Rate (reciprocal)
  +---> Service Level
The "A" and "B" keys (and corresponding registers) translate
between "service level" (the probability, in percent, of meeting the
objective) and "failure rate" (expressed as a reciprocal: one failure
in N requests). These are two ways of describing the same thing.
The "C", "D" and "E" keys (and registers) are used to look at the
relationship between the front end SLO or SLA and the backend SLO
needed to support it. The "E" key specifies the number of backend
services. Given any two of the three values, the program calculates
the third.
The "F" key translates the back end service level in the X register
to the level achieved when two replicas are queried. It also updates
Register 04.
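The relationships behind the six keys can be sketched in Python. This is only an illustration and not part of the calculator program: the function names are made up here, and it assumes what the program listing assumes, namely that a front end request succeeds only if all N backends meet their objective and that backends (and replicas) miss independently.

```python
import math

def service_level(one_in_n):                 # "A": failure rate -> service level (%)
    return (1.0 - 1.0 / one_in_n) * 100.0

def failure_rate(level_pct):                 # "B": service level (%) -> one failure in N
    return 1.0 / (1.0 - level_pct / 100.0)   # undefined for exactly 100%

def front_end_level(backend_pct, n):         # "C": front end level from backend level and N
    return (backend_pct / 100.0) ** n * 100.0

def backend_level(front_pct, n):             # "D": backend level needed for a front end level
    return (front_pct / 100.0) ** (1.0 / n) * 100.0

def backend_count(front_pct, backend_pct):   # "E": how many backends the levels allow
    return math.log10(front_pct / 100.0) / math.log10(backend_pct / 100.0)

def two_replicas(backend_pct):               # "F": a query misses only if both replicas miss
    miss = 1.0 - backend_pct / 100.0
    return (1.0 - miss ** 2) * 100.0

print(round(service_level(1000), 4))         # 99.9
print(round(backend_count(95, 99.9), 2))     # 51.27
print(round(two_replicas(99.9), 4))          # 99.9999
```

Percentages are kept as the calculator shows them (99.9 rather than 0.999), which is why every function divides by 100 on the way in and multiplies on the way out.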
Pressing "R/S" after any calculations or storing is finished
will bring up the mnemonics again. Pressing "R/S" one more time
will turn the calculator off in a way that will display the
mnemonics (and remind you what program you're in) when you turn
it on again.
Using the user keys usually works fine: keying in a number and pressing
a user key stores it, while pressing a user key without entering a
number does the calculation. You can also RCL the register directly, or
prefix the key with "XEQ" to force the calculation. (User keys cannot
detect "number entry" if you use a number that is already in the X
register, for example the result of a previous calculation; just STO
the number instead. Also, if you had entered a number that you hadn't
intended to store, pressing a user key will store it. Pressing the key
again will do the calculation, or use "XEQ" directly.)
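The store-or-calculate behaviour of a user key can be modelled roughly as follows. This is a hypothetical Python sketch, not a translation of the listing: the class name `UserKey` and its methods are invented here, and the `keyed_in` argument stands in for the calculator's number-entry flag (flag 22 in the program).

```python
class UserKey:
    """Hypothetical model of one user key and the register behind it."""

    def __init__(self, compute):
        self.register = 0.0
        self.compute = compute            # called when no number was keyed in

    def press(self, keyed_in=None):
        if keyed_in is not None:          # digits were just entered: store only
            self.register = keyed_in
        else:                             # no entry detected: calculate and store
            self.register = self.compute()
        return self.register


# "B" holds the failure rate, "A" derives the service level from it.
fr = UserKey(compute=lambda: 0.0)
sl = UserKey(compute=lambda: (1.0 - 1.0 / fr.register) * 100.0)

fr.press(1000)                            # like keying 1000 and pressing B
print(round(sl.press(), 4))               # like pressing A on its own: 99.9
```

The model also shows why an accidental number entry gets stored: the key has no way of knowing whether the entry was intended.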
Example:
Some customers are complaining that our services are not meeting target
objectives (or agreements). We find that the backend services are
failing to meet their time budgets at a rate of about one in a thousand,
which is two orders of magnitude better than the front end (99.9% vs.
90% in the front end).
Most of our customers are small and the queries hit a few dozen backend
systems while the few larger customers who are complaining can sometimes
hit 500+ systems.
What service level objectives should we be aiming for in the backend
to meet the objectives for all clients? How can we best do that?
Code:
1000      # backend failure rate reciprocal, roughly 1/1000
B         # stores it
A         # calculate service level, see "99.9"
STO D     # backend SLO
95 C      # frontend SLO (objective, more conservative than SLA)
E         # calculate number of systems we're good for: 51.27
500 E     # number of systems req'd for large customers
C         # calculate frontend service level probability, 60.64%
          # Fret! Barely better than 50/50
RCL D     # backend service level
F         # calculate service level while querying replicas
C         # calculate the new improved front end service level, 99.95%
          # Celebrate!!
STO A
XEQ B     # 1 fail in 2000 for front end!
50 E
C         # 99.995% for N==50 !! Bonus!
STO A
XEQ B     # 1 in 20,000 fail for N==50 querying both replicas
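The figures in the session above can be cross-checked away from the calculator. The following is a minimal Python sketch under the same assumptions as the program (all N backends must meet their objective, and backends and replicas miss independently); the function names are illustrative only.

```python
def front_end(backend_pct, n):
    """Front end service level (%) when a request touches n backends."""
    return (backend_pct / 100.0) ** n * 100.0

def replicated(backend_pct):
    """Backend service level (%) when each query goes to two replicas."""
    return (1.0 - (1.0 - backend_pct / 100.0) ** 2) * 100.0

def one_in(level_pct):
    """Failure rate as a reciprocal: one failure in this many requests."""
    return 1.0 / (1.0 - level_pct / 100.0)

backend = 99.9
for n in (500, 50):
    single = front_end(backend, n)
    paired = front_end(replicated(backend), n)
    print(f"N={n}: {single:.2f}% single, {paired:.3f}% with replicas, "
          f"about 1 failure in {round(one_in(paired), -2):.0f}")
```

For N=500 this should agree with the session: roughly 60.64% without replicas, 99.95% with them, and about one failure in 2000.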
The Code:
Code:
LBL "SLO"
LBL 00
"SL FR A BE N"
AVIEW
CF 22
SF 27
RTN
SF 11
OFF
GTO 00
LBL A
FC?C 22
GTO 01
STO 01
RTN
GTO 00
LBL B
FC?C 22
GTO 02
STO 02
RTN
GTO 00
LBL 01
RCL 02
1/X
XEQ 09
STO 01
RTN
GTO 00
LBL 02
RCL 01
XEQ 08
1/X
STO 02
RTN
GTO 00
LBL C
FC?C 22
GTO 03
STO 03
RTN
GTO 00
LBL 03
RCL 04
XEQ 11
RCL 05
Y^X
XEQ 12
STO 03
RTN
GTO 00
LBL D
FC?C 22
GTO 04
STO 04
RTN
GTO 00
LBL 04
RCL 03
XEQ 11
RCL 05
1/X
Y^X
XEQ 12
STO 04
RTN
GTO 00
LBL E
FC?C 22
GTO 05
STO 05
RTN
GTO 00
LBL 05
RCL 03
XEQ 11
LOG
RCL 04
XEQ 11
LOG
/
STO 05
RTN
GTO 00
LBL F
XEQ 08
X^2
XEQ 09
STO 04
RTN
GTO 00
LBL 11
1 E2
/
RTN
LBL 12
1 E2
*
RTN
LBL 09
1
X<>Y
-
XEQ 12
RTN
LBL 08
XEQ 11
1
X<>Y
-
RTN