Runs a SQL query over the input dataset. The `input` parameter from the common parameters is required.

The query string must use “input” as the name of the table. For example:

select * from input where SomeNumericField < 15

You can also query unnamed columns, such as those generated by the KMeans data generator. The following example from the SQL Workload unit test uses `0` as the name of the first column, `1` as the name of the second, and so on:

select `0` from input where `0` < -0.9

Parameters

| Name | Required | Default | Description |
| ---- | -------- | ------- | ----------- |
| name | yes | | “sql” |
| input | yes | | The input dataset |
| output | no | | To capture the actual results of the SQL query, specify an output file here. |
| save-mode | no | errorifexists | Options are “errorifexists”, “ignore” (no-op if exists), and “overwrite” |
| query | yes | | The SQL query to perform. The table name must be “input” as shown in the examples above. |
| cache | no | false | Whether the dataset should be cached after being read from disk |
| partitions | no | Natural partitioning | If output is specified for this workload, the dataset can optionally be repartitioned using this option. |

Examples

{
  name = "sql"
  input = "/tmp/generated-kmeans-data.parquet"
  output = "/tmp/sql-query-results.parquet"
  query = "select `0` from input where `0` < -0.9"
}

{
  name = "sql"
  input = "/tmp/generated-kmeans-data.parquet"
  query = "select `0` from input where `0` < -0.9"
  cache = true
}

{
  name = "sql"
  input = "/tmp/generated-kmeans-data.parquet"
  query = "select `0` from input where `0` < -0.9"
  cache = true
  output = "hdfs:///query-output-in-three-partitions.csv"
  partitions = 3 // will repartition the dataset before writing out
}
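
None of the examples above sets save-mode, so they use the default “errorifexists” whenever output is given. The following is a minimal sketch of a rerunnable variant, reusing the paths from the first example. It assumes save-mode takes a quoted string like the other values shown here, which the parameter table does not confirm.

```
{
  name = "sql"
  input = "/tmp/generated-kmeans-data.parquet"
  output = "/tmp/sql-query-results.parquet"
  save-mode = "overwrite" // assumed quoting; one of the options listed in the Parameters table
  query = "select `0` from input where `0` < -0.9"
}
```

With “ignore” in place of “overwrite”, the table describes the write as a no-op when the output already exists.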