The (Not Much) Better Than Excel Distributed Systems Simulator fills the gap where precise simulation of a distributed system is too time-consuming or expensive, but making a bunch of stuff up on a whiteboard or in Excel is unseemly.

It can help illuminate choke points or issues in a fast and simple way for people inexperienced with the subtleties of these problems. It CANNOT help with correctness (this is not TLA+). It CANNOT help solve existing performance issues. It can help answer very simple what-if questions.
```
> mkdir build
> cd build
> cmake ../
> cmake --build .
```
This project uses rapidjson.
```
> bte input.json output.csv
```
The tool will print some output on its configuration and then begin running the simulation. The CSV file will contain, per tick: the tick, the node type, the instance id of that node, the current number of requests on that instance, and the timeouts and median latency for requests on that instance.
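Purely for illustration, a few rows of output might look like the sketch below; the exact column names and ordering are whatever the tool actually emits, so treat these headers as assumptions:

```csv
tick,node_type,instance,active_requests,timeouts,median_latency_ms
0,FrontendA,0,3,0,42
0,FrontendA,1,2,0,38
1,SpeedyDataStore,0,7,1,110
```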
Imagine the organically grown, no-preservatives cluster of something below. Why does the SpeedyDataStore talk to the Singleton PostgreSQL? No one knows, but it was the right decision at the time. All the load comes in through a background process and two different frontends.

A cluster with some cycles in the dependencies, which slowly ramps load up and then down. The background jobs have a geometric distribution over the hour, meaning that most of the jobs are started within the first 15 minutes. Frontend A has a uniform incoming load distribution, while Frontend B has a normal distribution.
Each node type has different latencies, self-time latencies, timeouts and cache hit rates, as well as different ways of balancing load across the instances.
In the image below each node type is a row and the colors represent individual instances. The y-axis is the number of current requests. When a large background operation occurs at around minute 300, we see its effects cascade through the cluster.
- Active Requests: sum of active requests across instances
- Latency: p95 of the median of each instance per tick
- Timeouts: sum of timeouts per instance per tick
Same as example_ok.json, but we cut the capacity of the SingletonPSQL by half. It becomes saturated, which results in latency issues and requests building up across the cluster.
With this single change we see that almost everything gets worse.
- Active Requests: sum of active requests across instances
- Latency: p95 of the median of each instance per tick
- Timeouts: sum of timeouts per instance per tick
- `ms_per_tick` is the number of milliseconds per simulation tick.
- `"drivers": [...]` is the set of node types which receive load from outside of the cluster.
A node type is defined with an object such as:

```jsonc
"NodeTypeA": {
  "instances": 10,                    // number of instances of this type
  "scheduled-instances": [5,5,10...], // optional scheduled scaling of instances
  "timeout": 10000,                   // request timeout in ms
  "balanced": "content",              // load balancing across instances
  "growthmodel": "linear",            // capacity growth model
  "limit": 10,                        // capacity limit
  "load": {                           // external load to generate per hour
    "distribution": "geometric",      // distribution within the hour
    "requests": [...],                // number of requests per hour
    "users": [...],                   // maximum number of users in hour
    "content": [...],                 // max "content" per hour
    "sites": [...]                    // max "sites" per hour
  },
  "cache": { "dist": "normal", "mean": 50, "stddev": 25 },
  "self_time": { "dist": "uniform", "min": 10, "max": 250 },
  "network_latency": { "dist": "geometric", "prob": 0.05 },
  "dependencies": {
    "NodeTypeB": { "dist": "normal", "mean": 50, "stddev": 10 },
    "NodeTypeC": { "dist": "normal", "mean": 50, "stddev": 10 }
  }
}
```
- `instances` defines the number of instances of this node type.
- `scheduled-instances` allows for instance count changes on the hour. You specify the number of instances at hour 1, hour 2, and so on. Each scheduled instance count must be less than the `instances` parameter.
- `timeout` defines the timeout of requests on an instance, in ms.
- `balanced` defines how requests are balanced across those instances. The options are `random`, `content`, `sites`, and `users`.
- `growthmodel` defines how capacity on an instance is regulated as load changes. Currently the options are simple `linear` and `logistic` models.
- `limit` is used by the growth model to set the current capacity.
- `load` is used by the load drivers within the cluster to define the requests and the maximum distinct users, content, and sites to generate per hour. Each of these arrays should be of length 24 (one entry per hour of the day) and contain integers.
  - Distribution: you must specify how the load is distributed throughout the hour. The parameterization of these distributions is fixed. The choices are `geometric`, where most load is distributed at the start of the hour; `uniform`, where it is distributed uniformly throughout the hour; and `normal`, which is a normal distribution around the 30-minute mark.
  - The integers in the arrays are related by `sites < users < content < requests`.
- `cache` defines the probability that the cache will be hit for a request. You specify a distribution (`uniform` or `normal`). A sampled value above 50 counts as a cache hit, so the distribution should be configured so that its values fall sensibly within this range.
- `self_time` defines the time spent servicing the request by the instance itself. It is specified as a distribution of milliseconds: `uniform`, `normal`, or `geometric`.
- `network_latency` is similar to `self_time`, but cache hits do not affect it.
- `dependencies` is a list of node types this node depends on and the probability that a particular dependency will be used.
The standard C++ libraries are used to generate the distributions. Some care should be taken to pick distributions that make sense and parameters that are correct. For load distribution within an hour, "normally" distributed load generation makes very little sense, but a "geometrically" decreasing load makes sense for jobs which often start on the hour.
- `uniform` only uses the parameters `min` and `max`.
- `normal` uses the parameters `mean` and `stddev`.
- `geometric` uses `mean`, which is plugged into the inverse cumulative distribution function `-(mean * log(-[random 0:1] + 1))` to find a value.
The output CSV can be used with the visualization tool of your choice. A Tableau workbook is provided to help (it was used for the images above).
- Random errors
- Better capacity growth models