Help with Modelling a large system of probabilities
Bill Duncan - 02-05-2018, 02:07 AM
Apologies in advance; this really doesn't have much to do with HP calculators,
except that I'd love to develop an approximation model that would run on one.. ;-)

I'm hoping for some expertise from someone smarter than me..

I'm trying to model a system that involves about 2400 hosts in a sharded
database of sorts. Users enter queries, which get distributed to the 2400
systems simultaneously, and when all have finished, the user gets the results.

The overall system has an SLA (Service Level Agreement) that queries
finish within a certain length of time, 90% of the time.

Obviously, the weakest link among the 2400 systems determines how long a
search takes to complete.

A naive approach that assumes all queries on the backend are independent
would mean we are looking for the probability that at least one of the
backend systems is over SLA. To find that, of course, we'd use the
complement: the probability that every query on the back end is within
SLA. If they were all equal and independent, it might look something like:

Probability of the search being within SLA = 0.99995^2400 ≈ 0.887
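
For playing with the numbers, the independence case is a one-liner. A minimal
sketch (the 0.99995 per-shard figure is just the illustrative value from the
formula above):

Code:
# Naive independence model: the search meets SLA only if all n shards do.
def p_search(p_shard, n=2400):
    return p_shard ** n

print(p_search(0.99995))   # ~0.887, i.e. close to the observed ~90%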

Unfortunately, the queries on the backend aren't totally independent.
Sometimes they are, such as when a query hits a system that happens to be
busy doing something else. Often, though, query performance depends on the
query itself (complexity, number of terms, etc.) and will impact a number
of backend systems at once.

Nevertheless, there is a long tail of systems that are always within SLA.
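
One simple way to capture that shared per-query factor is a two-component
mixture: condition on whether the query itself is "hard", treat the shards
as independent within each branch, and mix the two. All the rates below are
made-up illustration values, not measurements:

Code:
def p_within_sla(n=2400, p_hard=0.02, miss_easy=0.00005, miss_hard=0.002):
    # Shards are independent *given* the query type, so in each branch
    # P(all shards ok) = (1 - miss_rate)^n; then mix by query type.
    easy = (1 - p_hard) * (1 - miss_easy) ** n
    hard = p_hard * (1 - miss_hard) ** n
    return easy + hard

# Even a tiny shared bump in per-shard miss rate (0.2% here) makes a
# hard query almost certain to blow the SLA: 0.998^2400 is about 0.008.
print(p_within_sla())   # ~0.87

What the mixture makes obvious is how brutally the 2400-way fan-out amplifies
any query-level factor: a hard query misses the SLA with near certainty, so
the fraction of hard queries comes straight off the top of the overall number.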

I'm wondering if there is a relatively simple way to model the behaviour
such that we can "play with the numbers" and get approximate results.

E.g., what percentage of the systems would we have to "fix", and by how
much, to maintain the user experience?

As a sample, the worst systems are within SLA 95% of the time, while the
best are at 100%. The user experience, however, is closer to 90%, which is
natural of course: the probabilities accumulate, so the overall result is
worse than the worst single system..
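
To play the "what percentage do we have to fix" game under the independence
approximation: the fleet below is hypothetical, chosen only so the product
lands near the observed ~90%, and the repair floor of 0.999 is likewise an
assumption:

Code:
import math

def p_overall(shards):
    # Independence approximation: the search is within SLA only if
    # every shard is, so multiply the per-shard rates.
    return math.prod(shards)

# Hypothetical fleet: a few problem shards plus a long healthy tail.
shards = [0.95, 0.97, 0.99] + [0.99999] * 2397
print("before:", p_overall(shards))    # ~0.89, near the observed ~90%

# What-if: bring every shard up to at least 99.9% within SLA.
fixed = [max(p, 0.999) for p in shards]
print("after: ", p_overall(fixed))     # ~0.97

In this toy fleet, fixing just the three worst shards buys back most of the
lost SLA, which matches the intuition that the product is dominated by its
smallest factors.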

Thoughts? Suggestions? Pointers? I'm stuck..

Thanks!