SQL
Runs a SQL query over the input dataset. The `input` common parameter is required.
The query string must use “input” as the name of the table. For example:
select * from input where SomeNumericField < 15
You can also query unnamed columns, such as those generated by the KMeans data generator. The following example, taken from the SQL Workload unit test, uses `0` as the name of the first column, `1` as the name of the second, and so on:
select `0` from input where `0` < -0.9
Parameters
Name | Required | Default | Description |
---|---|---|---|
name | yes | – | “sql” |
input | yes | – | the input dataset |
output | no | – | If users wish to capture the actual results of the SQL query, they can specify an output file here. |
save-mode | no | errorifexists | Options are “errorifexists”, “ignore” (no-op if exists), and “overwrite” |
query | yes | – | The SQL query to run. The table name must be “input”, as shown in the examples above. |
cache | no | false | whether the dataset should be cached after being read from disk |
partitions | no | Natural partitioning | If users specify output for this workload, they can optionally repartition the dataset using this option. |
Examples
{
name = "sql"
input = "/tmp/generated-kmeans-data.parquet"
output = "/tmp/sql-query-results.parquet"
query = "select `0` from input where `0` < -0.9"
}
{
name = "sql"
input = "/tmp/generated-kmeans-data.parquet"
query = "select `0` from input where `0` < -0.9"
cache = true
}
{
name = "sql"
input = "/tmp/generated-kmeans-data.parquet"
query = "select `0` from input where `0` < -0.9"
cache = true
output = "hdfs:///query-output-in-three-partitions.csv"
partitions = 3 // will repartition the dataset before writing out
}
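None of the examples above sets `save-mode`. The block below is a minimal sketch of how it might be combined with `output` so that results from a previous run are replaced rather than causing an error. It assumes `save-mode` takes a quoted string like the other values shown here, and the output path is a hypothetical one chosen for illustration.

```hocon
{
name = "sql"
input = "/tmp/generated-kmeans-data.parquet"
// hypothetical output path, not taken from the original examples
output = "/tmp/sql-query-results-rerun.parquet"
// assumed syntax: one of "errorifexists" (the default), "ignore", or "overwrite"
save-mode = "overwrite"
query = "select `0` from input where `0` < -0.9"
}
```

With the default of “errorifexists”, rerunning the same workload against an existing output file would be expected to fail, so “overwrite” or “ignore” is the likely choice for repeated runs.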