Workloads
- Data Generator - Graph Generator
- Data Generator - KMeans
- Data Generator - Linear Regression
- KMeans
- Logistic Regression
- Sleep
- SparkPi
- SQL
About Workloads
The atomic unit of organization in spark-bench is the workload. Workloads are standalone Spark jobs that read their input data, if any, from disk and write their output, if the user wants it, to disk.
Some workloads are designed to exercise a particular algorithm implementation or a particular method. Others are designed to simulate Spark use cases such as multiple notebook users hitting a single Spark cluster.
Types of Workloads
Some existing categories of workloads include:
- ML workloads: Logistic Regression, KMeans, etc.
- “Exercise” workloads: designed to examine one particular portion of the Spark pipeline. A good example is SparkPi, a very compute-heavy workload with no need for disk I/O.
- Data Generators: spark-bench can generate data with many different configurable generators. Generated data can be written to any storage addressable by Spark, including local files, hdfs, S3, etc.
Data generators are run just like workloads in spark-bench. Users should ensure that data generation happens before the workloads that need that input run. This is fairly simple to ensure in most cases. If in doubt, a reliable way to do this is to create two configuration files, one for data generation and one for the workloads, and run each through spark-bench in that order.
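The two-file approach could look like the following sketch. The file names, paths, and row and column counts are hypothetical. The surrounding structure (`spark-bench`, `spark-submit-config`, `workload-suites`), the generator name `data-generation-kmeans`, and the keys `rows`, `cols`, and `k` are assumptions about the configuration format, not a definitive reference.

```hocon
// File 1 (hypothetical name: generate-data.conf). Run this first.
spark-bench = {
  spark-submit-config = [{
    workload-suites = [{
      descr = "Generate the KMeans input data"
      benchmark-output = "console"
      workloads = [{
        name = "data-generation-kmeans"        // assumed generator name
        rows = 1000                            // assumed key
        cols = 24                              // assumed key
        output = "/tmp/kmeans-data.parquet"    // hypothetical path
      }]
    }]
  }]
}
```

```hocon
// File 2 (hypothetical name: run-kmeans.conf). Run this after File 1 has finished.
spark-bench = {
  spark-submit-config = [{
    workload-suites = [{
      descr = "Run KMeans on the generated data"
      benchmark-output = "console"
      workloads = [{
        name = "kmeans"
        input = "/tmp/kmeans-data.parquet"     // the path written by File 1
        k = 2                                  // assumed key
      }]
    }]
  }]
}
```

Because the second file is only submitted once the first run has completed, the input is certain to exist when the KMeans workload reads it.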
Custom Workloads
Users can create and run their own workloads by implementing the Workload trait and placing the resulting JAR in the classpath. For details, see the developer guide.
Parameters
Workloads are all highly configurable. You can see the available workloads and their parameters in the workloads tab. Workloads have some or all of the following parameters in common.
Name | Description |
---|---|
name | The name/type of workload. For example, “kmeans”, “sparkpi”, “logisticregression”… |
input | The path (local, hdfs, S3, etc.) to the input dataset for the workload. Some workloads (ex: SparkPi) do not require input. |
output | If the user wishes to keep the output of the workload (ex: the results of a query in the SQL workload), they can specify a path here. |
arguments specific to the workload | Configuration arguments. Ex: Value of K for KMeans, the query string for SQL, etc. |
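As an illustration of how these common parameters might appear together, here is a sketch of a single KMeans workload entry. The paths are hypothetical, and `k` is assumed to be the key for the KMeans-specific argument; check the workload's own page for the exact parameter names.

```hocon
{
  name = "kmeans"                           // the name/type of workload
  input = "/tmp/kmeans-data.parquet"        // hypothetical path to the input dataset
  output = "/tmp/kmeans-clusters.parquet"   // hypothetical path; only needed to keep the workload output
  k = 2                                     // workload-specific argument (assumed key name)
}
```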
Benchmark Output vs. Workload Output
Benchmark Output: The timing results, workload config, system config, and other benchmarking info. One row per workload.
Workload Output: The output generated by the workload itself. Examples: the data returned by a SQL query, the clusters generated by running KMeans, or the model generated by running Logistic Regression.
In the configuration file, paths for benchmark output are set in the Suite (see more below). Paths for workload output are set in the workloads.
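A sketch of where the two kinds of output path might sit in a suite definition follows. The suite-level key `benchmark-output`, the SQL workload's `query` key, and the table name `input` used inside the query are assumptions, and all paths are hypothetical.

```hocon
{
  descr = "Run one SQL query"
  benchmark-output = "/tmp/results/benchmark.csv"      // benchmark output: set on the suite
  workloads = [{
    name = "sql"
    input = "/tmp/kmeans-data.parquet"
    query = "select * from input"                      // assumed key and table name
    output = "/tmp/results/query-result.parquet"       // workload output: set on the workload
  }]
}
```

With a layout like this, the timing row for the query would land in the benchmark file, while the rows returned by the query would be written to the workload's own path.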