A workload suite is a logical group of workloads. Suites can be composed with each other for benchmarking tasks or to simulate different cluster use cases.

Parameters

Name | Required | Default | Description
benchmark-output | no | - | Path to the file where benchmark results should be stored, or "console" to print to the terminal
save-mode | no | errorifexists | One of “errorifexists”, “ignore” (no-op if exists), “overwrite”, or “append”
descr | yes | - | Human-readable string describing what the suite intends to do
parallel | no | false | Whether the workloads in the suite run serially or in parallel
repeat | no | 1 | How many times the workloads in the suite should be repeated

benchmark-output

Use benchmark-output to control where benchmark results are written. While each workload can write the results of its particular algorithm using the output parameter within a workload block, benchmark-output collects the benchmark results in one place.

For example, in the following configuration, the contents of output will be the dataset generated by running the query over the input. The contents of benchmark-output will be a single line containing the timing results of the SQL run.

workload-suites = [
  {
    descr = "One run of a SQL query"
    benchmark-output = "hdfs:///tmp/sql-benchmark-results.csv"
    workloads = [
      {
        name = "sql"
        input = "/tmp/generated-kmeans-data.parquet"
        output = "/tmp/sql-query-results.parquet"
        query = "select `0` from input where `0` < -0.9"
      }
    ]
  }
]

Omitting benchmark-output prevents benchmark results from being written. For example, the following configuration runs the same workload as above: the workload output is still written, but the benchmark results are not.

workload-suites = [
  {
    descr = "One run of a SQL query with no benchmark result output"
    workloads = [
      {
        name = "sql"
        input = "/tmp/generated-kmeans-data.parquet"
        output = "/tmp/sql-query-results.parquet"
        query = "select `0` from input where `0` < -0.9"
      }
    ]
  }
]
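
The parameter table also lists "console" as a valid value for benchmark-output. A minimal sketch of that variant, reusing the same workload as above, might look like the following; it assumes nothing beyond the parameters already described, and the descr text is only illustrative.

```hocon
workload-suites = [
  {
    descr = "One run of a SQL query with timing results printed to the terminal"
    // "console" prints the benchmark results instead of writing them to a file
    benchmark-output = "console"
    workloads = [
      {
        name = "sql"
        input = "/tmp/generated-kmeans-data.parquet"
        output = "/tmp/sql-query-results.parquet"
        query = "select `0` from input where `0` < -0.9"
      }
    ]
  }
]
```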

save-mode

When benchmark-output is set, save-mode controls what happens if the target file already exists. The options are:

  • errorifexists: if the file exists, throw an error
  • ignore: if the file exists, no-op
  • overwrite: if the file exists, overwrite it
  • append: if the file exists, append to it

Note: “append” is allowed for benchmark-output as it may be conceptually the same dataset, but disallowed for workload output as those are conceptually different datasets.
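
As an illustration, a suite that should accumulate timing results across repeated invocations could set save-mode to “append”. This is a sketch that reuses the paths from the earlier example; the descr text is illustrative only.

```hocon
workload-suites = [
  {
    descr = "SQL query whose timing results accumulate across runs"
    benchmark-output = "hdfs:///tmp/sql-benchmark-results.csv"
    // Add new timing rows to the existing file instead of throwing an error
    save-mode = "append"
    workloads = [
      {
        name = "sql"
        input = "/tmp/generated-kmeans-data.parquet"
        output = "/tmp/sql-query-results.parquet"
        query = "select `0` from input where `0` < -0.9"
      }
    ]
  }
]
```

The save-mode shown here applies only to the suite-level benchmark-output. How the workload's own output path behaves when it already exists is a separate matter, as the note above indicates.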

descr

descr is simply a human-readable string that gets added to benchmark results.

parallel

The parameter parallel is a boolean that controls whether the workloads within the suite run serially or are launched in parallel. The default is false, meaning that workloads will run serially by default.
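
A sketch of a suite that launches two SQL workloads in parallel could look like this. The second query and both output paths are hypothetical and chosen only to keep the two workloads distinct.

```hocon
workload-suites = [
  {
    descr = "Two SQL queries launched in parallel"
    benchmark-output = "console"
    // Launch both workloads together rather than one after the other
    parallel = true
    workloads = [
      {
        name = "sql"
        input = "/tmp/generated-kmeans-data.parquet"
        output = "/tmp/sql-query-results-low.parquet"   // hypothetical path
        query = "select `0` from input where `0` < -0.9"
      },
      {
        name = "sql"
        input = "/tmp/generated-kmeans-data.parquet"
        output = "/tmp/sql-query-results-high.parquet"  // hypothetical path
        query = "select `0` from input where `0` > 0.9" // hypothetical query
      }
    ]
  }
]
```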

repeat

repeat controls how many times the suite repeats. For example, say a suite contains the workloads A, B, a different instance of B, and C. Let’s also say it’s running serially, and repeat is 2. This setup will run like this:

A
B
B
C
---
A
B
B
C
---
Done

And it will NOT run like this:

// Will NOT run like this!
A
A
---
B
B
---
B
B
---
C
C
---
Done
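
In configuration terms, the behavior described above corresponds to setting repeat at the suite level. The following is a minimal sketch with two workloads; the second query and the output paths are hypothetical. Running serially with repeat = 2, the whole list runs once from first to last workload, and then the whole list runs a second time.

```hocon
workload-suites = [
  {
    descr = "Two SQL queries, with the whole suite run twice"
    benchmark-output = "console"
    parallel = false
    // The full list of workloads runs once, then the full list runs again
    repeat = 2
    workloads = [
      {
        name = "sql"
        input = "/tmp/generated-kmeans-data.parquet"
        output = "/tmp/sql-query-results-low.parquet"   // hypothetical path
        query = "select `0` from input where `0` < -0.9"
      },
      {
        name = "sql"
        input = "/tmp/generated-kmeans-data.parquet"
        output = "/tmp/sql-query-results-high.parquet"  // hypothetical path
        query = "select `0` from input where `0` > 0.9" // hypothetical query
      }
    ]
  }
]
```

Because the workloads write to fixed output paths, the second pass may encounter output left by the first; how that is handled depends on the workload's own write behavior, which this sketch does not configure.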