SQL

Runs a SQL query over the input dataset. The `input` parameter from the common parameters is required.

The query string must use “input” as the table name. For example:

select * from input where SomeNumericField < 15

You can also query unnamed columns, such as those generated by the KMeans data generator. The following example from the SQL Workload unit test uses `0` as the name of the first column, `1` as the name of the second, and so on:

select `0` from input where `0` < -0.9
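To see how such a query behaves, the sketch below runs the same statement against a small in-memory table. It is only an illustration: it uses Python's built-in `sqlite3` module as a stand-in for the Spark SQL engine, and the helper name `run_query` and the sample rows are hypothetical. SQLite happens to accept backtick-quoted identifiers, so the query string can be used unchanged.

```python
import sqlite3


def run_query(rows, query):
    """Hypothetical helper: load two-column rows into a table named
    `input` (columns `0` and `1`) and return the query results."""
    conn = sqlite3.connect(":memory:")
    try:
        # The table must be called "input"; the columns are named by position.
        conn.execute("CREATE TABLE input (`0` REAL, `1` REAL)")
        conn.executemany("INSERT INTO input VALUES (?, ?)", rows)
        return conn.execute(query).fetchall()
    finally:
        conn.close()


# Made-up sample data standing in for generated KMeans rows.
rows = [(-0.95, 0.2), (0.1, -0.99), (-1.5, 0.7)]
result = run_query(rows, "select `0` from input where `0` < -0.9")
print(result)
```

Only the rows whose first column is below -0.9 are returned, and only that first column is selected.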

Parameters

Name	Required	Default	Description
name	yes	–	“sql”
input	yes	–	the input dataset
output	no	–	To capture the results of the SQL query, specify an output file here.
save-mode	no	errorifexists	Options are “errorifexists”, “ignore” (no-op if exists), and “overwrite”
query	yes	–	The SQL query to run. The table name must be “input”, as shown in the examples above.
cache	no	false	whether the dataset should be cached after being read from disk
partitions	no	Natural partitioning	If users specify `output` for this workload, they can optionally repartition the dataset using this option.
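The three `save-mode` options can be read as a small decision rule applied when the output path already exists. The following is a minimal sketch of that rule in plain Python, assuming a single local file as the output; `write_output` is a hypothetical helper and not the workload's actual implementation.

```python
import os
import tempfile

VALID_MODES = ("errorifexists", "ignore", "overwrite")


def write_output(path, data, save_mode="errorifexists"):
    """Hypothetical helper: write data to path according to save_mode.

    Returns True if the file was written, False if the write was skipped.
    """
    if save_mode not in VALID_MODES:
        raise ValueError(f"unknown save-mode: {save_mode}")
    if os.path.exists(path):
        if save_mode == "errorifexists":
            raise FileExistsError(path)  # refuse to clobber existing output
        if save_mode == "ignore":
            return False                 # no-op when the output exists
    # Either the path is new, or save_mode is "overwrite".
    with open(path, "w") as f:
        f.write(data)
    return True


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "sql-query-results.csv")
    first = write_output(target, "a\n")                # path is new: written
    second = write_output(target, "b\n", "ignore")     # exists: skipped
    third = write_output(target, "c\n", "overwrite")   # exists: replaced
    with open(target) as f:
        final_contents = f.read()
```

With the default `errorifexists`, a second write to the same path raises an error instead of silently replacing earlier results.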

Examples

{
  name = "sql"
  input = "/tmp/generated-kmeans-data.parquet"
  output = "/tmp/sql-query-results.parquet"
  query = "select `0` from input where `0` < -0.9"
}

{
  name = "sql"
  input = "/tmp/generated-kmeans-data.parquet"
  query = "select `0` from input where `0` < -0.9"
  cache = true
}

{
  name = "sql"
  input = "/tmp/generated-kmeans-data.parquet"
  query = "select `0` from input where `0` < -0.9"
  cache = true
  output = "hdfs:///query-output-in-three-partitions.csv"
  partitions = 3 // will repartition the dataset before writing out
}