Workload-Suite Configuration
A workload suite is a logical group of workloads. Suites can be composed with each other for benchmarking tasks or to simulate different cluster use cases.
Parameters
Name | Required | Default | Description |
---|---|---|---|
benchmark-output | no | - | Path to the file where benchmark results should be stored, or "console" to print them to the terminal |
save-mode | no | errorifexists | Options are “errorifexists”, “ignore” (no-op if exists), “overwrite”, and “append” |
descr | yes | - | Human-readable string description of what the suite intends to do |
parallel | no | false | Whether the workloads in the suite run in parallel (true) or serially (false). |
repeat | no | 1 | How many times the workloads in the suite should be repeated. |
benchmark-output
Use benchmark-output to control where benchmark results are written. Each workload can write the results of its particular algorithm using the output parameter within its workload block, while benchmark-output collects the benchmark results in one place.
For example, in the following configuration, the contents of output will be the dataset generated by running the query over the input.
The contents of benchmark-output will be a single line containing the timing results of the sql run.
workload-suites = [
  {
    descr = "One run of a SQL query"
    benchmark-output = "hdfs:///tmp/sql-benchmark-results.csv"
    workloads = [
      {
        name = "sql"
        input = "/tmp/generated-kmeans-data.parquet"
        output = "/tmp/sql-query-results.parquet"
        query = "select `0` from input where `0` < -0.9"
      }
    ]
  }
]
Omitting benchmark-output prevents benchmark results from being written. For example, the following configuration runs the same workload as above and still writes the workload output, but does not write any benchmark results.
workload-suites = [
  {
    descr = "One run of a SQL query with no benchmark result output"
    workloads = [
      {
        name = "sql"
        input = "/tmp/generated-kmeans-data.parquet"
        output = "/tmp/sql-query-results.parquet"
        query = "select `0` from input where `0` < -0.9"
      }
    ]
  }
]
save-mode
If benchmark-output is specified, save-mode sets the write behavior when the output file already exists. Options are:
- errorifexists: if the file exists, throw an error
- ignore: if the file exists, no-op
- overwrite: if the file exists, overwrite it
- append: if the file exists, append to it
Note: “append” is allowed for benchmark-output because results from separate runs may conceptually belong to the same dataset. It is disallowed for workload output because those outputs are conceptually different datasets.
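As a minimal sketch, assuming the same suite layout as the benchmark-output examples, a suite that adds its timing results to an existing results file might look like this. The paths and query are reused from those examples and are illustrative only.

```
workload-suites = [
  {
    descr = "One run of a SQL query, appending benchmark results"
    benchmark-output = "hdfs:///tmp/sql-benchmark-results.csv"
    // If the benchmark results file already exists, append to it
    // instead of throwing an error (the errorifexists default).
    save-mode = "append"
    workloads = [
      {
        name = "sql"
        input = "/tmp/generated-kmeans-data.parquet"
        output = "/tmp/sql-query-results.parquet"
        query = "select `0` from input where `0` < -0.9"
      }
    ]
  }
]
```

Here save-mode applies only to the file named by benchmark-output, not to the workload's own output.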
descr
descr is a human-readable string that is added to the benchmark results.
parallel
The parameter parallel is a boolean that controls whether the workloads within the suite run serially or are launched in parallel.
The default is false, meaning that workloads run serially unless parallel is set to true.
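For illustration, a suite that launches two workloads in parallel could be sketched as follows. This assumes the sql workload and input file from the benchmark-output examples. The second query and both output paths are hypothetical and exist only to give the suite two distinct workloads.

```
workload-suites = [
  {
    descr = "Two SQL queries launched in parallel"
    benchmark-output = "console"
    // Launch the workloads below at the same time rather than one after another
    parallel = true
    workloads = [
      {
        name = "sql"
        input = "/tmp/generated-kmeans-data.parquet"
        // Hypothetical output path
        output = "/tmp/sql-query-results-low.parquet"
        query = "select `0` from input where `0` < -0.9"
      },
      {
        name = "sql"
        input = "/tmp/generated-kmeans-data.parquet"
        // Hypothetical output path and query
        output = "/tmp/sql-query-results-high.parquet"
        query = "select `0` from input where `0` > 0.9"
      }
    ]
  }
]
```

Giving each workload its own output path keeps the two result datasets separate.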
repeat
repeat controls how many times the suite repeats.
For example, say a suite contains the workloads A, B, a different instance of B, and C.
Let’s also say it’s running serially, and repeat is 2.
This setup will run like this:
A
B
B
C
---
A
B
B
C
---
Done
And it will NOT run like this:
// Will NOT run like this!
A
A
---
B
B
---
B
B
---
C
C
---
Done
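In a configuration file, the repeat setting sits alongside the other suite parameters. The following is a minimal sketch that assumes the sql workload and input file from the benchmark-output examples. It leaves out the workload-level output, on the assumption that this parameter is optional for the sql workload.

```
workload-suites = [
  {
    descr = "A SQL query suite that runs twice, serially"
    benchmark-output = "console"
    // Serial execution is the default; stated here for clarity
    parallel = false
    // Run the full list of workloads, then run the full list again
    repeat = 2
    workloads = [
      {
        name = "sql"
        input = "/tmp/generated-kmeans-data.parquet"
        query = "select `0` from input where `0` < -0.9"
      }
    ]
  }
]
```

With more than one workload in the list, the whole list would run in order once before the second repetition begins, matching the first ordering shown above.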