Writing workflow definitions for HealthOmics workflows
HealthOmics supports workflow definitions written in WDL, Nextflow, or CWL. To learn more about these workflow languages, see the specifications for WDL, Nextflow, and CWL.
HealthOmics supports version management for the three workflow definition languages. For more information, see Version support for HealthOmics workflow definition languages.
Writing workflows in WDL
The following tables show how WDL input types map to the matching primitive or complex JSON types. Type coercion is limited; whenever possible, declare types explicitly.
WDL type | JSON type | Example WDL | Example JSON key and value | Notes
---|---|---|---|---
Boolean | boolean | Boolean b | "b": true | The value must be lower case and unquoted.
Int | integer | Int i | "i": 7 | Must be unquoted.
Float | number | Float f | "f": 42.2 | Must be unquoted.
String | string | String s | "s": "characters" | JSON strings that are a URI must be mapped to a WDL File to be imported.
File | string | File f | "f": "s3://amzn-s3-demo-bucket1/path/to/file" | Amazon S3 and HealthOmics storage URIs are imported as long as the IAM role provided for the workflow has read access to these objects. No other URI schemes are supported (such as file://, https://, and ftp://). The URI must specify an object; it can't be a directory, so it can't end with a /.
Directory | string | Directory d | "d": "s3://bucket/path/" | The Directory type isn't included in WDL 1.0 or 1.1, so add version development to the header of the WDL file. The URI must be an Amazon S3 URI with a prefix that ends with a /. All contents of the directory are recursively copied to the workflow as a single download. The directory should only contain files related to the workflow (see the sketch following this table).
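As an illustration of the Directory note above, the following is a minimal sketch. The input name, bucket, and container image are placeholder assumptions (for HealthOmics runs, use an Amazon ECR URI for the container); note the version development header that the Directory type requires.

version development

workflow DirectoryExample {
    input {
        # Hypothetical input; supply it in the input JSON as
        # "reference_dir": "s3://amzn-s3-demo-bucket1/reference/" (the prefix must end with '/')
        Directory reference_dir
    }

    call ListContents { input: dir = reference_dir }

    output {
        File listing = ListContents.listing
    }
}

task ListContents {
    input {
        Directory dir
    }
    command {
        ls -R ~{dir} > listing.txt
    }
    runtime {
        # Placeholder image; replace with an Amazon ECR URI for HealthOmics runs
        docker: "ubuntu:20.04"
    }
    output {
        File listing = "listing.txt"
    }
}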
Complex types in WDL are data structures composed of primitive types. Data structures such as lists are converted to arrays.
WDL type | JSON type | Example WDL | Example JSON key and value | Notes
---|---|---|---|---
Array | array | Array[Int] nums | "nums": [1, 2, 3] | The members of the array must follow the format of the WDL array type.
Pair | object | Pair[String, Int] str_to_i | "str_to_i": {"left": "0", "right": 1} | Each value of the pair must use the JSON format of its matching WDL type.
Map | object | Map[Int, String] int_to_string | "int_to_string": { "2": "hello", "1": "goodbye" } | Each entry in the map must use the JSON format of its matching WDL type.
Struct | object | See the sketch following this table. | See the sketch following this table. | The names of the struct members must exactly match the names of the JSON object keys. Each value must use the JSON format of the matching WDL type.
Object | N/A | N/A | N/A | The WDL Object type is outdated and should be replaced by Struct in all cases.
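The Struct row above omits an example. The following sketch uses a hypothetical SampleInfo struct to show a WDL struct declaration with an input of that type.

version 1.0

struct SampleInfo {
    String sample_name
    Int read_length
}

workflow StructExample {
    input {
        SampleInfo info
    }
}

The matching JSON key and value uses object keys that exactly match the struct member names:

"info": { "sample_name": "NA12878", "read_length": 150 }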
The HealthOmics workflow engine doesn't support qualified or name-spaced input parameters. Handling of qualified parameters and their mapping to WDL parameters isn't specified by the WDL language and can be ambiguous. For these reasons, best practice is to declare all input parameters in the top level (main) workflow definition file and pass them down to subworkflow calls using standard WDL mechanisms.
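The following is a minimal sketch of this practice, assuming a hypothetical subworkflow file named tasks/subworkflow.wdl that defines a Convert workflow. All inputs are declared in the main workflow and passed explicitly to the subworkflow call instead of relying on qualified input names.

version 1.0

import "tasks/subworkflow.wdl" as sub

workflow Main {
    input {
        File input_cram
        String sample_name
    }

    # Pass the top-level inputs down to the subworkflow call
    call sub.Convert {
        input:
            input_cram = input_cram,
            sample_name = sample_name
    }

    output {
        File result = Convert.result
    }
}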
Writing workflows in Nextflow
HealthOmics supports Nextflow DSL1 and DSL2. For details, see Nextflow version support.
Nextflow DSL2 is based on the Groovy programming language, so parameters are dynamic and type coercion is possible using the same rules as Groovy. Parameters and values supplied by the input JSON are available in the workflow's parameters map (params).
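For example, with an input JSON such as {"sample_name": "NA12878", "threads": 4}, the values are available through the params map, as in this minimal sketch (the parameter names and defaults are hypothetical):

// Default values; values from the input JSON override them
params.sample_name = 'sample'
params.threads = 2

workflow {
    println "Processing ${params.sample_name} with ${params.threads} threads"
}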
Using nf-schema and nf-validation plugins
Note
Summary of HealthOmics support for plugins:
- v22.04 – no support for plugins
- v23.10 – supports nf-schema and nf-validation
- v24.10 – supports nf-schema
HealthOmics provides the following support for Nextflow plugins:
- For Nextflow v23.10, HealthOmics pre-installs the nf-validation@1.1.1 plugin.
- For Nextflow v23.10 and later, HealthOmics pre-installs the nf-schema@2.3.0 plugin.
- You cannot retrieve additional plugins during a workflow run. HealthOmics ignores any other plugin versions that you specify in the nextflow.config file (see the sketch following this list).
- For Nextflow v24 and higher, nf-schema is the new version of the deprecated nf-validation plugin. For more information, see nf-schema in the Nextflow GitHub repository.
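For example, with Nextflow v23.10 or later you can declare the pre-installed nf-schema plugin and validate run parameters against a schema. This is a minimal sketch; it assumes a nextflow_schema.json file in the workflow project, and HealthOmics uses the pre-installed plugin version regardless of the version requested in the config.

nextflow.config:

plugins {
    // Pre-installed by HealthOmics; other versions specified here are ignored
    id 'nf-schema@2.3.0'
}

main.nf:

include { validateParameters } from 'plugin/nf-schema'

workflow {
    // Checks params against nextflow_schema.json in the project directory
    validateParameters()
}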
Specifying storage URIs
When an Amazon S3 or HealthOmics URI is used to construct a Nextflow file or path object, it makes the matching object available to the workflow, as long as read access is granted. The use of prefixes or directories is allowed for Amazon S3 URIs. For examples, see Amazon S3 input parameter formats.
HealthOmics supports the use of glob patterns in Amazon S3 URIs and HealthOmics storage URIs. Use glob patterns in the workflow definition when you create path or file channels.
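For example, the following sketch (the bucket and prefix are placeholders) creates a channel from all FASTQ objects under an Amazon S3 prefix by using a glob pattern. The IAM role for the run needs read access to the matching objects.

workflow {
    // Glob over an S3 prefix; matches objects such as sample1.fastq.gz, sample2.fastq.gz
    reads = Channel.fromPath('s3://amzn-s3-demo-bucket1/fastq/*.fastq.gz')
    reads.view()
}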
Setting maximum task duration using time directives
HealthOmics provides an adjustable quota (see HealthOmics service quotas) to specify the maximum duration for a run. For Nextflow v23 and v24 workflows, you can also specify maximum task durations using Nextflow time directives.
During new workflow development, setting a maximum task duration helps you catch runaway or unexpectedly long-running tasks.
For more information about the Nextflow time directive, see the time directive documentation.
HealthOmics provides the following support for Nextflow time directives:
- HealthOmics supports 1-minute granularity for the time directive. You can specify a value between 60 seconds and the maximum run duration value.
- If you enter a value less than 60 seconds, HealthOmics rounds it up to 60 seconds. For values above 60 seconds, HealthOmics rounds down to the nearest minute.
- If the workflow supports retries for a task, HealthOmics retries the task if it times out.
- If a task times out (or the last retry times out), HealthOmics cancels the task. This operation can take one to two minutes.
- On task timeout, HealthOmics sets the run and task status to failed, and it cancels the other tasks in the run (for tasks in Starting, Pending, or Running status). HealthOmics exports the outputs from tasks that completed before the timeout to your designated S3 output location.
- Time that a task spends in pending status doesn't count toward the task duration.
- If the run is part of a run group and the run group times out sooner than the task timer, the run and task transition to failed status.
Specify the timeout duration using one or more of the following units: ms, s, m, h, or d. You can specify time directives in the Nextflow config file and in the workflow definition. The following list shows the order of precedence, from lowest to highest priority:
1. Global configuration in the config file.
2. Task section of the workflow definition.
3. Task-specific selectors in the config file.
The following example shows how to specify global configuration in the Nextflow config file. It sets a global timeout of 1 hour and 30 minutes:
process {
    time = '1h30m'
}
The following example shows how to specify a time directive in the task section of the workflow definition. This example sets a timeout of 3 days, 5 hours, and 4 minutes. This value takes precedence over the global value in the config file, but doesn't take precedence over a task-specific time directive for my_label in the config file:
process myTask {
    label 'my_label'
    time '3d5h4m'

    script:
    """
    your-command-here
    """
}
The following example shows how to specify task-specific time directives in the Nextflow config file, based on the name or label selectors. This example sets a global task timeout value of 30 minutes. It sets a value of 2 hours for task myTask and a value of 3 hours for tasks with label my_label. For tasks that match a selector, these values take precedence over the global value and the value in the workflow definition:
process {
    time = '30m'

    withLabel: 'my_label' {
        time = '3h'
    }

    withName: 'myTask' {
        time = '2h'
    }
}
Exporting task content
For workflows written in Nextflow, define a publishDir directive to export task content to your output Amazon S3 bucket. As shown in the following example, set the publishDir value to /mnt/workflow/pubdir. To export files to Amazon S3, the files must be in this directory.
nextflow.enable.dsl=2

workflow {
    CramToBamTask(params.ref_fasta, params.ref_fasta_index, params.ref_dict, params.input_cram, params.sample_name)
    ValidateSamFile(CramToBamTask.out.outputBam)
}

process CramToBamTask {
    container "<account>.dkr.ecr.us-west-2.amazonaws.com/genomes-in-the-cloud"
    publishDir "/mnt/workflow/pubdir"

    input:
    path ref_fasta
    path ref_fasta_index
    path ref_dict
    path input_cram
    val sample_name

    output:
    path "${sample_name}.bam", emit: outputBam
    path "${sample_name}.bai", emit: outputBai

    script:
    """
    set -eo pipefail

    samtools view -h -T $ref_fasta $input_cram | samtools view -b -o ${sample_name}.bam -
    samtools index -b ${sample_name}.bam
    mv ${sample_name}.bam.bai ${sample_name}.bai
    """
}

process ValidateSamFile {
    container "<account>.dkr.ecr.us-west-2.amazonaws.com/genomes-in-the-cloud"
    publishDir "/mnt/workflow/pubdir"

    input:
    file input_bam

    output:
    path "validation_report"

    script:
    """
    java -Xmx3G -jar /usr/gitc/picard.jar \
        ValidateSamFile \
        INPUT=${input_bam} \
        OUTPUT=validation_report \
        MODE=SUMMARY \
        IS_BISULFITE_SEQUENCED=false
    """
}
Writing workflows in CWL
Workflows written in Common Workflow Language, or CWL, offer similar functionality to workflows written in WDL and Nextflow. You can use Amazon S3 or HealthOmics storage URIs as input parameters.
If you define an input with a secondaryFile in a subworkflow, add the same definition to the main workflow.
HealthOmics workflows don't support operation processes. To learn more about operation processes in CWL workflows, see the CWL documentation.
To convert an existing CWL workflow definition file to use HealthOmics, make the following changes:
- Replace all Docker container URIs with Amazon ECR URIs.
- Make sure that all the workflow files are declared in the main workflow as input, and all variables are explicitly defined.
- Make sure that all JavaScript code is strict-mode compliant.
Define a workflow input for each container that the workflow uses; hardcoding the dockerPull entry with a fixed Amazon ECR URI isn't recommended.
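Following that recommendation, the example below takes the container image as a workflow input (docker_image) rather than hardcoding it in dockerPull. The following is a sketch of a matching input file, assuming the standard CWL File object form for in_file (which expects a matching .fai secondary file alongside it); the account, Region, bucket, and image tag are placeholders.

{
    "in_file": {
        "class": "File",
        "location": "s3://amzn-s3-demo-bucket1/inputs/sample.fasta"
    },
    "out_filename": "sample_copy.fasta",
    "docker_image": "<account_id>.dkr.ecr.<region>.amazonaws.com/ubuntu:latest"
}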
The following is an example of a workflow written in CWL.
cwlVersion: v1.2
class: Workflow
inputs:
  in_file:
    type: File
    secondaryFiles: [.fai]
  out_filename: string
  docker_image: string
outputs:
  copied_file:
    type: File
    outputSource: copy_step/copied_file
steps:
  copy_step:
    in:
      in_file: in_file
      out_filename: out_filename
      docker_image: docker_image
    out: [copied_file]
    run: copy.cwl
The following file defines the copy.cwl task.
cwlVersion: v1.2
class: CommandLineTool
baseCommand: cp
inputs:
  in_file:
    type: File
    secondaryFiles: [.fai]
    inputBinding:
      position: 1
  out_filename:
    type: string
    inputBinding:
      position: 2
  docker_image:
    type: string
outputs:
  copied_file:
    type: File
    outputBinding:
      glob: $(inputs.out_filename)
requirements:
  InlineJavascriptRequirement: {}
  DockerRequirement:
    dockerPull: "$(inputs.docker_image)"
The following is an example of a workflow written in CWL with a GPU requirement.
cwlVersion: v1.2
class: CommandLineTool
baseCommand: ["/bin/bash", "docm_haplotypeCaller.sh"]
$namespaces:
  cwltool: http://commonwl.org/cwltool#
requirements:
  cwltool:CUDARequirement:
    cudaDeviceCountMin: 1
    cudaComputeCapability: "nvidia-tesla-t4"
    cudaVersionMin: "1.0"
  InlineJavascriptRequirement: {}
  InitialWorkDirRequirement:
    listing:
      - entryname: 'docm_haplotypeCaller.sh'
        entry: |
          nvidia-smi --query-gpu=gpu_name,gpu_bus_id,vbios_version --format=csv
inputs: []
outputs: []
Example workflow definition
The following example shows the same workflow definition in WDL, Nextflow, and CWL.
WDL workflow definition example
The following examples show private workflow definitions for converting from CRAM to BAM in WDL. The CRAM to BAM workflow defines two tasks and uses tools from the genomes-in-the-cloud container, which is shown in the example and is publicly available.
The following example shows how to include the Amazon ECR container as a parameter. This allows HealthOmics to verify the access permissions to your container before it starts the run.
{ ... "gotc_docker":"<account_id>.dkr.ecr.<region>.amazonaws.com/genomes-in-the-cloud:2.4.7-1603303710" }
The following example shows how to specify which files to use in your run when the files are in an Amazon S3 bucket.
{ "input_cram": "s3://amzn-s3-demo-bucket1/inputs/NA12878.cram", "ref_dict": "s3://amzn-s3-demo-bucket1/inputs/Homo_sapiens_assembly38.dict", "ref_fasta": "s3://amzn-s3-demo-bucket1/inputs/Homo_sapiens_assembly38.fasta", "ref_fasta_index": "s3://amzn-s3-demo-bucket1/inputs/Homo_sapiens_assembly38.fasta.fai", "sample_name": "NA12878" }
To specify files from a HealthOmics sequence store, use the sequence store URI as shown in the following example.
{ "input_cram": "omics://429915189008.storage.us-west-2.amazonaws.com/111122223333/readSet/4500843795/source1", "ref_dict": "s3://amzn-s3-demo-bucket1/inputs/Homo_sapiens_assembly38.dict", "ref_fasta": "s3://amzn-s3-demo-bucket1/inputs/Homo_sapiens_assembly38.fasta", "ref_fasta_index": "s3://amzn-s3-demo-bucket1/inputs/Homo_sapiens_assembly38.fasta.fai", "sample_name": "NA12878" }
You can then define your workflow in WDL, as shown in the following example.
version 1.0

workflow CramToBamFlow {
    input {
        File ref_fasta
        File ref_fasta_index
        File ref_dict
        File input_cram
        String sample_name
        String gotc_docker = "<account>.dkr.ecr.us-west-2.amazonaws.com/genomes-in-the-cloud:latest"
    }

    #Converts CRAM to SAM to BAM and makes BAI.
    call CramToBamTask {
        input:
            ref_fasta = ref_fasta,
            ref_fasta_index = ref_fasta_index,
            ref_dict = ref_dict,
            input_cram = input_cram,
            sample_name = sample_name,
            docker_image = gotc_docker,
    }

    #Validates Bam.
    call ValidateSamFile {
        input:
            input_bam = CramToBamTask.outputBam,
            docker_image = gotc_docker,
    }

    #Outputs Bam, Bai, and validation report to the FireCloud data model.
    output {
        File outputBam = CramToBamTask.outputBam
        File outputBai = CramToBamTask.outputBai
        File validation_report = ValidateSamFile.report
    }
}

#Task definitions.
task CramToBamTask {
    input {
        # Command parameters
        File ref_fasta
        File ref_fasta_index
        File ref_dict
        File input_cram
        String sample_name

        # Runtime parameters
        String docker_image
    }

    #Calls samtools view to do the conversion.
    command {
        set -eo pipefail

        samtools view -h -T ~{ref_fasta} ~{input_cram} | samtools view -b -o ~{sample_name}.bam -
        samtools index -b ~{sample_name}.bam
        mv ~{sample_name}.bam.bai ~{sample_name}.bai
    }

    #Runtime attributes:
    runtime {
        docker: docker_image
    }

    #Outputs a BAM and BAI with the same sample name
    output {
        File outputBam = "~{sample_name}.bam"
        File outputBai = "~{sample_name}.bai"
    }
}

#Validates BAM output to ensure it wasn't corrupted during the file conversion.
task ValidateSamFile {
    input {
        File input_bam
        Int machine_mem_size = 4
        String docker_image
    }

    String output_name = basename(input_bam, ".bam") + ".validation_report"
    Int command_mem_size = machine_mem_size - 1

    command {
        java -Xmx~{command_mem_size}G -jar /usr/gitc/picard.jar \
            ValidateSamFile \
            INPUT=~{input_bam} \
            OUTPUT=~{output_name} \
            MODE=SUMMARY \
            IS_BISULFITE_SEQUENCED=false
    }

    runtime {
        docker: docker_image
    }

    #A text file is generated that lists errors or warnings that apply.
    output {
        File report = "~{output_name}"
    }
}