Workload-Suite Configuration

A workload suite is a logical group of workloads. Suites can be composed with each other for benchmarking tasks or to simulate different cluster use cases.

Parameters

Name	Required	Default	Description
benchmark-output	no	-	Path to the file where benchmark results should be stored, or `"console"` to print to the terminal
save-mode	no	errorifexists	Write behavior when the benchmark-output file already exists: “errorifexists”, “ignore” (no-op if exists), “overwrite”, or “append”
descr	yes	-	Human-readable string description of what the suite intends to do
parallel	no	false	Whether the workloads in the suite run serially or in parallel
repeat	no	1	How many times the workloads in the suite should be repeated

benchmark-output

Use benchmark-output to control where benchmark results are written. While each workload can write the results of its particular algorithm using the configurable parameter output within a workload block, benchmark-output collects the benchmark results for the suite in one place.

For example, in the following configuration, the contents of output will be the dataset generated by running the query over the input. The contents of benchmark-output will be a single line containing the timing results of the SQL run.

workload-suites = [
  {
    descr = "One run of a SQL query"
    benchmark-output = "hdfs:///tmp/sql-benchmark-results.csv"
    workloads = [
      {
        name = "sql"
        input = "/tmp/generated-kmeans-data.parquet"
        output = "/tmp/sql-query-results.parquet"
        query = "select `0` from input where `0` < -0.9"
      }
    ]
  }
]

Omitting benchmark-output prevents benchmark results from being written. For example, the following configuration runs the same workload as above; the workload output is still written, but the benchmark results are not.

workload-suites = [
  {
    descr = "One run of a SQL query with no benchmark result output"
    workloads = [
      {
        name = "sql"
        input = "/tmp/generated-kmeans-data.parquet"
        output = "/tmp/sql-query-results.parquet"
        query = "select `0` from input where `0` < -0.9"
      }
    ]
  }
]
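
The parameter table also lists "console" as an accepted value for benchmark-output. A minimal sketch of the same suite printing its benchmark results to the terminal instead of a file, reusing the paths and query from the examples above (the descr text is only illustrative):

```hocon
workload-suites = [
  {
    descr = "One run of a SQL query with benchmark results printed to the terminal"
    // "console" prints the benchmark results instead of writing them to a file
    benchmark-output = "console"
    workloads = [
      {
        name = "sql"
        input = "/tmp/generated-kmeans-data.parquet"
        output = "/tmp/sql-query-results.parquet"
        query = "select `0` from input where `0` < -0.9"
      }
    ]
  }
]
```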

save-mode

If benchmark-output is specified, save-mode controls the write behavior when the file already exists. The options are:

errorifexists: if the file exists, throw an error
ignore: if the file exists, no-op
overwrite: if the file exists, overwrite it
append: if the file exists, append to it

Note: “append” is allowed for benchmark-output as it may be conceptually the same dataset, but disallowed for workload output as those are conceptually different datasets.
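
A minimal sketch of save-mode in use, assuming it is set at the suite level next to benchmark-output, as the parameter table suggests. The paths and query are reused from the earlier examples, and the descr text is only illustrative. With "append", rerunning the suite should add to the existing benchmark results file rather than raising an error:

```hocon
workload-suites = [
  {
    descr = "SQL query run whose benchmark results are appended to an existing file"
    benchmark-output = "hdfs:///tmp/sql-benchmark-results.csv"
    // Assumption: save-mode sits beside benchmark-output at the suite level
    save-mode = "append"
    workloads = [
      {
        name = "sql"
        input = "/tmp/generated-kmeans-data.parquet"
        output = "/tmp/sql-query-results.parquet"
        query = "select `0` from input where `0` < -0.9"
      }
    ]
  }
]
```

Note that save-mode here governs only the benchmark-output file; how the workload's own output path behaves when it already exists is a separate matter.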

descr

descr is simply a human-readable string that gets added to benchmark results.

parallel

parallel is a boolean that controls whether the workloads within the suite run serially or are launched in parallel. The default is false, so workloads run serially unless parallel is set to true.
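
As an illustration, a suite that launches two workloads in parallel might look like the sketch below. The second query and both output paths are hypothetical values chosen for this example; only the sql workload name and the input path come from the examples above.

```hocon
workload-suites = [
  {
    descr = "Two SQL queries launched in parallel"
    benchmark-output = "console"
    // Launch the workloads below together instead of one after another
    parallel = true
    workloads = [
      {
        name = "sql"
        input = "/tmp/generated-kmeans-data.parquet"
        // Hypothetical output path for this example
        output = "/tmp/sql-query-results-low.parquet"
        query = "select `0` from input where `0` < -0.9"
      },
      {
        name = "sql"
        input = "/tmp/generated-kmeans-data.parquet"
        // Hypothetical output path and query for this example
        output = "/tmp/sql-query-results-high.parquet"
        query = "select `0` from input where `0` > 0.9"
      }
    ]
  }
]
```

Each workload writes to its own output path so the two runs do not contend for the same destination.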

repeat

repeat controls how many times the suite as a whole is run; it does not repeat each workload individually. For example, say a suite contains the workloads A, B, a different instance of B, and C. Let’s also say it’s running serially, and repeat is 2. This setup will run like this:

A
B
B
C
---
A
B
B
C
---
Done

And it will NOT run like this:

// Will NOT run like this!
A
A
---
B
B
---
B
B
---
C
C
---
Done
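
As a configuration sketch, a serial suite with repeat set to 2 might look like the following. The two sql workloads stand in for the abstract workloads in the illustration above, and the second query is a hypothetical variation. The output parameter is left out of each workload on the assumption that it is optional, so that repeated runs do not write to the same path twice.

```hocon
workload-suites = [
  {
    descr = "Two SQL queries, run serially, with the whole suite repeated twice"
    benchmark-output = "console"
    parallel = false
    // The full list of workloads runs once, then runs again
    repeat = 2
    workloads = [
      {
        name = "sql"
        input = "/tmp/generated-kmeans-data.parquet"
        query = "select `0` from input where `0` < -0.9"
      },
      {
        name = "sql"
        input = "/tmp/generated-kmeans-data.parquet"
        // Hypothetical second query for this example
        query = "select `0` from input where `0` > 0.9"
      }
    ]
  }
]
```

Following the ordering described above, this would run the first query, then the second, and then both again in the same order.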