About Workloads

The atomic unit of organization in spark-bench is the workload. Workloads are standalone Spark jobs that read their input data, if any, from disk, and write their output, if the user wants it, out to disk.

Some workloads are designed to exercise a particular algorithm implementation or a particular method. Others are designed to simulate Spark use cases such as multiple notebook users hitting a single Spark cluster.

Types of Workloads

Some existing categories of workloads include:

  • ML workloads: Logistic Regression, KMeans, etc.
  • “Exercise” workloads: designed to examine one particular portion of the Spark pipeline. A good example is SparkPi, a very compute-heavy workload with no need for disk IO.
  • Data Generators: spark-bench has the capability to generate data according to many different configurable generators. Generated data can be written to any storage addressable by Spark, including local files, hdfs, S3, etc.

    Data generators are run just like workloads in spark-bench. Users should take care that data generation finishes before the workloads that need that input run. This is simple to ensure in most cases. If in doubt, a bullet-proof approach is to create two configuration files, one for data generation and one for the workloads, and run each through spark-bench in turn.
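The two-file approach can be sketched as a small driver script. This is a minimal sketch, not part of spark-bench: the launcher path `./bin/spark-bench.sh`, the function `run_configs_in_order`, and the config file names are assumptions chosen for illustration. It runs one invocation per configuration file, in sequence, and stops if an earlier step fails so that workloads never start against missing data.

```python
import subprocess
from typing import Callable, List, Sequence


def run_configs_in_order(
    config_paths: Sequence[str],
    launcher: str = "./bin/spark-bench.sh",  # assumed launcher location
    run: Callable[[List[str]], int] = subprocess.call,
) -> List[str]:
    """Run one launcher invocation per config file, strictly in sequence.

    Raises RuntimeError on the first non-zero exit code, so later configs
    (the workloads) are skipped when an earlier one (data generation) fails.
    Returns the list of config files that completed successfully.
    """
    completed = []
    for path in config_paths:
        exit_code = run([launcher, path])
        if exit_code != 0:
            raise RuntimeError(f"{path} failed with exit code {exit_code}")
        completed.append(path)
    return completed


# Example call (hypothetical file names), data generation first:
# run_configs_in_order(["generate-data.conf", "run-workloads.conf"])
```

The `run` parameter is only there so the ordering logic can be exercised without a Spark installation; in normal use the default `subprocess.call` blocks until each run finishes.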
    

Custom Workloads

Users can create and run their own workloads by implementing the Workload trait and placing the resulting JAR in the classpath. For details, see the developer guide.

Parameters

Workloads are all highly configurable. You can see the available workloads and their parameters in the workloads tab. Workloads have some or all of the following parameters in common.

Name                                 Description
-----------------------------------  -----------------------------------------------------------------
name                                 The name/type of workload. For example, “kmeans”, “sparkpi”, “logisticregression”…
input                                The path (local, hdfs, S3, etc.) to the input dataset for the workload. Some workloads (ex: SparkPi) do not require input.
output                               If the user wishes to keep the output of the workload (ex: the results of a query in the SQL workload), they can specify a path here.
arguments specific to the workload   Configuration arguments. Ex: Value of K for KMeans, the query string for SQL, etc.
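To make the split between common and workload-specific parameters concrete, here is a minimal Python sketch that models a workload definition as a dictionary. It is a conceptual model only, not spark-bench's configuration format or API: the helper `split_params`, the example paths, and the keys `k` and `slices` are assumptions for illustration.

```python
from typing import Any, Dict, Tuple

# Parameters that workloads share, per the table above.
COMMON_PARAMS = ("name", "input", "output")


def split_params(workload: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Separate the common parameters from the workload-specific arguments."""
    if "name" not in workload:
        raise ValueError("every workload needs a name")
    common = {k: workload[k] for k in COMMON_PARAMS if k in workload}
    specific = {k: v for k, v in workload.items() if k not in COMMON_PARAMS}
    return common, specific


# Illustrative definitions; the values and extra keys are hypothetical.
kmeans = {"name": "kmeans", "input": "/tmp/kmeans-data.csv", "k": 45}
sparkpi = {"name": "sparkpi", "slices": 10}  # no input or output needed
```

Here `split_params(kmeans)` separates `name` and `input` from the KMeans-specific `k`, while the SparkPi definition carries no `input` or `output` at all.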

Benchmark Output vs. Workload Output

Benchmark Output: The timing results, workload config, system config, and other benchmarking info. One row per workload.

Workload Output: The output generated by the workload itself. Examples: the data returned by a SQL query, the clusters generated by running KMeans, or the model generated by running Logistic Regression.

In the configuration file, paths for benchmark output are set in the Suite (see more below). Paths for workload output are set in the individual workloads.
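The distinction can be illustrated with a minimal Python sketch. It does not reflect how spark-bench is implemented: the function `run_workload` and the column names `name`, `seconds`, and `output` are assumptions. The point is only that a single run yields two separate things, a benchmark row describing the run and the result the workload itself computed.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


def run_workload(
    name: str,
    job: Callable[[], Any],
    output_path: Optional[str] = None,
) -> Tuple[Dict[str, Any], Any]:
    """Time a job and return (benchmark_row, workload_output)."""
    start = time.perf_counter()
    workload_output = job()  # what the workload itself produces
    elapsed = time.perf_counter() - start
    # One row of benchmark output per workload: timing plus its config.
    benchmark_row = {"name": name, "seconds": elapsed, "output": output_path}
    return benchmark_row, workload_output


# A toy stand-in for a real workload.
row, result = run_workload("toy-sum", lambda: sum(range(1000)))
```

In this sketch `row` would be appended to the suite-level benchmark results, while `result` is what would be written to the workload's own `output` path if one were given.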