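Writing workflow definitions for HealthOmics workflows

HealthOmics supports workflow definitions written in WDL, Nextflow, or CWL. To learn more about these workflow languages, see the specifications for WDL, Nextflow, or CWL.

HealthOmics supports version management for the three workflow definition languages. For more information, see Version support for HealthOmics workflow definition languages.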

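Topics

Writing workflows in WDL
Writing workflows in Nextflow
Writing workflows in CWL
Workflow definition examples
WDL workflow definition example

Writing workflows in WDL

The following tables show how inputs in WDL map to the matching primitive or complex JSON type. Type coercion is limited, so make types explicit whenever possible.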

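Primitive types
WDL type	JSON type	Example WDL	Example JSON key and value	Notes
`Boolean`	`boolean`	`Boolean b`	`"b": true`	The value must be lowercase and unquoted.
`Int`	`integer`	`Int i`	`"i": 7`	Must be unquoted.
`Float`	`number`	`Float f`	`"f": 42.2`	Must be unquoted.
`String`	`string`	`String s`	`"s": "characters"`	A JSON string that is a URI must be mapped to a WDL `File` to be imported.
`File`	`string`	`File f`	`"f": "s3://amzn-s3-demo-bucket1/path/to/file"`	Amazon S3 and HealthOmics storage URIs are imported as long as the IAM role provided for the workflow has read access to these objects. No other URI schemes are supported (such as `file://`, `https://`, and `ftp://`). The URI must specify an object. It cannot be a directory, meaning it cannot end with a `/`.
`Directory`	`string`	`Directory d`	`"d": "s3://bucket/path/"`	The `Directory` type isn't included in WDL 1.0 or 1.1, so you need to add `version development` to the header of the WDL file. The URI must be an Amazon S3 URI with a prefix that ends with a `/`. All contents of the directory are recursively copied to the workflow as a single download. The `Directory` should contain only files related to the workflow.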
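
The URI rules in the `File` and `Directory` rows can be checked before you submit a run. The following is a minimal Python sketch, not part of any HealthOmics SDK: `check_input_uri` is a hypothetical helper, it assumes HealthOmics storage URIs use the `omics://` scheme shown in the sequence store example later in this topic, and it inspects only the URI text (not IAM permissions or whether the object exists).

```python
def check_input_uri(uri: str, wdl_type: str) -> None:
    """Raise ValueError if the URI breaks the rules listed in the table.

    Hypothetical helper for illustration; it inspects only the URI string.
    """
    if wdl_type == "File":
        # File inputs accept Amazon S3 or HealthOmics storage URIs
        # (assumed here to start with "omics://").
        if not uri.startswith(("s3://", "omics://")):
            raise ValueError(f"unsupported URI scheme: {uri}")
        # A File must name a single object, so it cannot end with "/".
        if uri.endswith("/"):
            raise ValueError(f"File URI must not end with '/': {uri}")
    elif wdl_type == "Directory":
        # Directory inputs must be Amazon S3 prefixes that end with "/".
        if not uri.startswith("s3://"):
            raise ValueError(f"Directory URI must be an Amazon S3 URI: {uri}")
        if not uri.endswith("/"):
            raise ValueError(f"Directory URI must end with '/': {uri}")
    else:
        raise ValueError(f"no URI rule for WDL type: {wdl_type}")


# Both calls return without raising.
check_input_uri("s3://amzn-s3-demo-bucket1/path/to/file", "File")
check_input_uri("s3://bucket/path/", "Directory")
```

A check like this catches a missing or extra trailing `/` early, before the run is submitted.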

Complex types in WDL are data structures composed of primitive types. Data structures such as lists are converted to arrays.

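Complex types
WDL type	JSON type	Example WDL	Example JSON key and value	Notes
`Array`	`array`	`Array[Int] nums`	`"nums": [1, 2, 3]`	The members of the array must follow the format of the WDL array type.
`Pair`	`object`	`Pair[String, Int] str_to_i`	`"str_to_i": {"left": "0", "right": 1}`	Each value of the pair must use the JSON format of its matching WDL type.
`Map`	`object`	`Map[Int, String] int_to_string`	`"int_to_string": { "2": "hello", "1": "goodbye" }`	Each entry in the map must use the JSON format of its matching WDL type. JSON object keys must be quoted strings, including keys that map to WDL `Int`.
`Struct`	`object`	`struct SampleBamAndIndex { String sample_name File bam File bam_index } SampleBamAndIndex b_and_i`	`"b_and_i": { "sample_name": "NA12878", "bam": "s3://amzn-s3-demo-bucket1/NA12878.bam", "bam_index": "s3://amzn-s3-demo-bucket1/NA12878.bam.bai" }`	The names of the struct members must exactly match the names of the JSON object keys. Each value must use the JSON format of the matching WDL type.
`Object`	Not applicable	Not applicable	Not applicable	The WDL `Object` type is outdated and should be replaced by `Struct` in all cases.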
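
To make the mappings in both tables concrete, the following Python sketch builds an input document with the standard `json` module. The parameter names mirror the table examples, except `ref`, which is a hypothetical name for a `File` input (the table reuses `f` for both `Float` and `File`). Treat it as an illustration of the JSON shapes, not as a required way to produce inputs.

```python
import json

# Python values chosen to match the table examples.
inputs = {
    "b": True,                                        # Boolean
    "i": 7,                                           # Int
    "f": 42.2,                                        # Float
    "s": "characters",                                # String
    "ref": "s3://amzn-s3-demo-bucket1/path/to/file",  # File (hypothetical name)
    "nums": [1, 2, 3],                                # Array[Int]
    "str_to_i": {"left": "0", "right": 1},            # Pair[String, Int]
    "int_to_string": {2: "hello", 1: "goodbye"},      # Map[Int, String]
}

# json.dumps writes True as lowercase, unquoted true and converts the
# integer map keys to strings, because JSON object keys are always strings.
document = json.dumps(inputs, indent=2)
print(document)
```

Generating the document from native values avoids hand-quoting mistakes, such as writing `"true"` or `"7"` where an unquoted value is required.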

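The HealthOmics workflow engine doesn't support qualified or name-spaced input parameters. The WDL language doesn't specify how qualified parameters are handled or how they map to WDL parameters, so the mapping can be ambiguous. For these reasons, the best practice is to declare all input parameters in the top-level (main) workflow definition file and pass them down to subworkflow calls using standard WDL mechanisms.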

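Writing workflows in Nextflow

HealthOmics supports Nextflow DSL1 and DSL2. For more information, see Nextflow version support.

Nextflow DSL2 is based on the Groovy programming language, so parameters are dynamic and type coercion follows the same rules as Groovy. Parameters and values supplied by the input JSON are available in the parameters (`params`) map of the workflow.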

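Using the nf-schema and nf-validation plugins

Note

Summary of HealthOmics support for plugins:

v22.04: plugins aren't supported
v23.10: supports nf-schema and nf-validation
v24.10: supports nf-schema

HealthOmics provides the following support for Nextflow plugins:

For Nextflow v23.10, HealthOmics pre-installs the nf-validation@1.1.1 plugin.
For Nextflow v23.10 and later, HealthOmics pre-installs the nf-schema@2.3.0 plugin.
You cannot retrieve additional plugins during a workflow run. HealthOmics ignores any other plugin versions that you specify in the nextflow.config file.
For Nextflow v24 and later, nf-schema is the new version of the deprecated nf-validation plugin. For more information, see nf-schema in the Nextflow GitHub repository.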

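Specifying storage URIs

When you use an Amazon S3 or HealthOmics URI to construct a Nextflow file or path object, the matching object becomes available to the workflow, as long as read access is granted. Amazon S3 URIs can use prefixes or directories. For examples, see Amazon S3 input parameter formats.

HealthOmics supports glob patterns in Amazon S3 URIs and HealthOmics storage URIs. Use glob patterns in the workflow definition to create path or file channels.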

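Setting maximum task duration using time directives

HealthOmics provides an adjustable quota (see HealthOmics service quotas) that specifies the maximum duration for a run. For Nextflow v23 and v24 workflows, you can also specify maximum task durations using Nextflow time directives.

During new workflow development, setting a maximum task duration helps you catch runaway tasks and long-running tasks.

For more information about the Nextflow time directive, see time directive in the Nextflow reference.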

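HealthOmics provides the following support for Nextflow time directives:

HealthOmics supports 1-minute granularity for time directives. You can specify a value between 60 seconds and the maximum run duration value.
If you enter a value less than 60 seconds, HealthOmics rounds it up to 60 seconds. For values above 60 seconds, HealthOmics rounds down to the nearest minute.
If the workflow supports retries for a task, HealthOmics retries the task when it times out.
If a task times out (or the last retry times out), HealthOmics cancels the task. This operation can take one to two minutes.
On task timeout, HealthOmics sets the run and task status to failed, and it cancels the other tasks in the run (tasks in Starting, Pending, or Running status). HealthOmics exports the outputs of tasks that completed before the timeout to your designated S3 output location.
Time that a task spends in pending status doesn't count toward the task duration.
If the run is part of a run group and the run group times out sooner than the task timer, the run and task transition to failed status.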
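
The rounding rules in the list above can be modeled as follows. This is an illustrative Python sketch, not HealthOmics code: the function name `effective_timeout_seconds` is hypothetical, and the sketch assumes that a value of exactly 60 seconds is left unchanged and ignores the upper limit set by the maximum run duration.

```python
def effective_timeout_seconds(requested_seconds: int) -> int:
    """Model the 1-minute granularity described above (hypothetical helper)."""
    if requested_seconds < 60:
        # Values below one minute are rounded up to 60 seconds.
        return 60
    # Larger values are rounded down to a whole number of minutes.
    return (requested_seconds // 60) * 60


# A 90.5-minute request (5430 seconds) is treated as 90 minutes.
print(effective_timeout_seconds(5430))
```

Under this model, a directive of `'119s'` behaves like `'1m'`, which is worth keeping in mind when you choose a timeout close to a task's expected runtime.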

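Specify the timeout duration using one or more of the following units: ms, s, m, h, or d. You can specify time directives in the Nextflow config file and in the workflow definition. The following list shows the order of precedence, from lowest to highest priority:

Global configuration in the config file.
Task section of the workflow definition.
Task-specific selectors in the config file.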
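
For reference, a duration string such as `1h30m` can be converted to seconds with a small parser. The following Python sketch is an illustration under stated assumptions: `parse_duration` is a hypothetical name, it accepts only whole-number amounts written without spaces (the form used in this topic's examples), and it doesn't reproduce every syntax that Nextflow itself accepts.

```python
import re

# Milliseconds per unit, for the units listed above.
_UNIT_MS = {"ms": 1, "s": 1_000, "m": 60_000, "h": 3_600_000, "d": 86_400_000}

# "ms" is listed first so that it is tried before "m" and "s".
_TOKEN = re.compile(r"(\d+)(ms|s|m|h|d)")
_WHOLE = re.compile(r"(?:\d+(?:ms|s|m|h|d))+")


def parse_duration(text: str) -> float:
    """Return the duration in seconds (hypothetical helper, simplified syntax)."""
    if not _WHOLE.fullmatch(text):
        raise ValueError(f"unrecognized duration: {text!r}")
    total_ms = sum(int(amount) * _UNIT_MS[unit]
                   for amount, unit in _TOKEN.findall(text))
    return total_ms / 1000


print(parse_duration("1h30m"))   # 1 hour 30 minutes, in seconds
print(parse_duration("3d5h4m"))  # 3 days, 5 hours, 4 minutes, in seconds
```

Converting to seconds this way makes it easy to compare a planned task timeout with the run-level maximum duration before you start a run.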

The following example shows how to specify global configuration in the Nextflow config file. It sets a global timeout of 1 hour and 30 minutes:


process {
    time = '1h30m'
}

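The following example shows how to specify a time directive in the task section of the workflow definition. This example sets a timeout of 3 days, 5 hours, and 4 minutes. This value takes precedence over the global value in the config file, but it doesn't take precedence over a task-specific time directive for my_label in the config file: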


process myTask {
    label 'my_label'
    time '3d5h4m'
        
    script:
    """
    your-command-here
    """
}

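The following example shows how to specify task-specific time directives in the Nextflow config file, based on name or label selectors. This example sets a global task timeout of 30 minutes. It sets a value of 2 hours for the task myTask and a value of 3 hours for tasks with the label my_label. For tasks that match a selector, these values take precedence over the global value and the value in the workflow definition: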


process {
    time = '30m'
    
    withLabel: 'my_label' {
        time = '3h'  
    }

    withName: 'myTask' {
        time = '2h'  
    }
}

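Exporting task content

For workflows written in Nextflow, define a publishDir directive to export task content to your output Amazon S3 bucket. As shown in the following example, set the publishDir value to /mnt/workflow/pubdir. To export files to Amazon S3, the files must be in this directory.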


 nextflow.enable.dsl=2
              
  workflow {
    CramToBamTask(params.ref_fasta, params.ref_fasta_index, params.ref_dict, params.input_cram, params.sample_name)
    ValidateSamFile(CramToBamTask.out.outputBam)
  }
  
  process CramToBamTask {
    container "<account>.dkr.ecr.us-west-2.amazonaws.com/genomes-in-the-cloud"
  
    publishDir "/mnt/workflow/pubdir"
  
    input:
        path ref_fasta
        path ref_fasta_index
        path ref_dict
        path input_cram
        val sample_name
  
    output:
        path "${sample_name}.bam", emit: outputBam
        path "${sample_name}.bai", emit: outputBai
  
    script:
    """
        set -eo pipefail
  
        samtools view -h -T $ref_fasta $input_cram |
        samtools view -b -o ${sample_name}.bam -
        samtools index -b ${sample_name}.bam
        mv ${sample_name}.bam.bai ${sample_name}.bai
    """
  }
  
  process ValidateSamFile {
    container "<account>.dkr.ecr.us-west-2.amazonaws.com/genomes-in-the-cloud"
  
    publishDir "/mnt/workflow/pubdir"
  
    input:
        file input_bam
  
    output:
        path "validation_report"
  
    script:
    """
        java -Xmx3G -jar /usr/gitc/picard.jar \
        ValidateSamFile \
        INPUT=${input_bam} \
        OUTPUT=validation_report \
        MODE=SUMMARY \
        IS_BISULFITE_SEQUENCED=false
    """
  }

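Writing workflows in CWL

Workflows written in Common Workflow Language (CWL) offer functionality similar to workflows written in WDL and Nextflow. You can use Amazon S3 or HealthOmics storage URIs as input parameters.

If you define an input in a secondaryFile in a subworkflow, add the same definition in the main workflow.

HealthOmics workflows don't support operation processes. To learn more about operation processes in CWL workflows, see the CWL documentation.

To convert an existing CWL workflow definition file for use with HealthOmics, make the following changes:

Replace all Docker container URIs with Amazon ECR URIs.
Make sure that all the workflow files are declared in the main workflow as input, and that all variables are explicitly defined.
Make sure that all JavaScript code is strict-mode compliant.

CWL workflows should be defined for each container used. Hardcoding the dockerPull entry with a fixed Amazon ECR URI isn't recommended.

The following is an example of a workflow written in CWL.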



cwlVersion: v1.2
class: Workflow

inputs:
  in_file:
    type: File
    secondaryFiles: [.fai]
  out_filename: string
  docker_image: string

outputs:
  copied_file:
    type: File
    outputSource: copy_step/copied_file

steps:
  copy_step:
    in:
      in_file: in_file
      out_filename: out_filename
      docker_image: docker_image
    out: [copied_file]
    run: copy.cwl

The following file defines the copy.cwl task.



cwlVersion: v1.2
class: CommandLineTool
baseCommand: cp

inputs:
  in_file:
    type: File
    secondaryFiles: [.fai]
    inputBinding:
      position: 1

  out_filename:
    type: string
    inputBinding:
      position: 2
  docker_image:
    type: string

outputs:
  copied_file:
    type: File
    outputBinding:
      glob: $(inputs.out_filename)

requirements:
  InlineJavascriptRequirement: {}
  DockerRequirement:
    dockerPull: "$(inputs.docker_image)"

The following is an example of a workflow written in CWL with a GPU requirement.


cwlVersion: v1.2
class: CommandLineTool
baseCommand: ["/bin/bash", "docm_haplotypeCaller.sh"]
$namespaces:
  cwltool: http://commonwl.org/cwltool#
requirements:
  cwltool:CUDARequirement:
    cudaDeviceCountMin: 1
    cudaComputeCapability: "nvidia-tesla-t4"
    cudaVersionMin: "1.0"
  InlineJavascriptRequirement: {}
  InitialWorkDirRequirement:
    listing:
    - entryname: 'docm_haplotypeCaller.sh'
      entry: |
        nvidia-smi --query-gpu=gpu_name,gpu_bus_id,vbios_version --format=csv

inputs: []
outputs: []

Workflow definition examples

The following examples show the same workflow definition in WDL, Nextflow, and CWL.

WDL


version 1.1

task my_task {
   runtime { ... }
   input {
       File input_file
       String name
       Int threshold
   }
   
   command <<<
   my_tool --name ~{name} --threshold ~{threshold} ~{input_file}
   >>>
   
   output {
       File results = "results.txt"
   }
}

workflow my_workflow {
   input {
       File input_file
       String name
       Int threshold = 50
   }
   
   call my_task {
       input:
          input_file = input_file,
          name = name,
          threshold = threshold
   }
   output {
       File results = my_task.results
   }
}

Nextflow


nextflow.enable.dsl = 2

params.input_file = null
params.name = null
params.threshold = 50

process my_task {
   // <directives>
   
   input:
     path input_file
     val name
     val threshold
   
   output:
     path 'results.txt', emit: results
   
   script:
     """
     my_tool --name ${name} --threshold ${threshold} ${input_file}
     """
     
   
}

workflow MY_WORKFLOW {
   my_task(
       params.input_file,
       params.name,
       params.threshold
   )
}

workflow {
   MY_WORKFLOW()
}

CWL


cwlVersion: v1.2
class: Workflow

requirements:
    InlineJavascriptRequirement: {}

inputs:
   input_file: File
   name: string
   threshold: int

outputs:
    result:
        type: ...
        outputSource: ...

steps:
    my_task:
        run:
            class: CommandLineTool
            baseCommand: my_tool
            requirements:
                ...
            inputs:
                name:
                    type: string
                    inputBinding:
                        prefix: "--name"
                threshold:
                    type: int
                    inputBinding:
                        prefix: "--threshold"
                input_file:
                    type: File
                    inputBinding: {}
            outputs:
                results:
                    type: File
                    outputBinding:
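                        glob: results.txt
        # Wire the workflow inputs to the tool inputs of the same name.
        in:
            name: name
            threshold: threshold
            input_file: input_file
        out: [results]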

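WDL workflow definition example

The following examples show private workflow definitions for converting from CRAM to BAM in WDL. The CRAM to BAM workflow defines two tasks and uses tools from the genomes-in-the-cloud container, which is shown in the example and is publicly available.

The following example shows how to include the Amazon ECR container as a parameter. This allows HealthOmics to verify the access permissions to your container before it starts the run.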


{
     ...
     "gotc_docker":"<account_id>.dkr.ecr.<region>.amazonaws.com/genomes-in-the-cloud:2.4.7-1603303710"
  }

The following example shows how to specify which files to use in your run when the files are in an Amazon S3 bucket.


{
      "input_cram": "s3://amzn-s3-demo-bucket1/inputs/NA12878.cram",
      "ref_dict": "s3://amzn-s3-demo-bucket1/inputs/Homo_sapiens_assembly38.dict",
      "ref_fasta": "s3://amzn-s3-demo-bucket1/inputs/Homo_sapiens_assembly38.fasta",
      "ref_fasta_index": "s3://amzn-s3-demo-bucket1/inputs/Homo_sapiens_assembly38.fasta.fai",
      "sample_name": "NA12878"
  }

To specify files from a sequence store, use the URI for the sequence store, as shown in the following example.


{
      "input_cram": "omics://429915189008.storage.us-west-2.amazonaws.com/111122223333/readSet/4500843795/source1",
      "ref_dict": "s3://amzn-s3-demo-bucket1/inputs/Homo_sapiens_assembly38.dict",
      "ref_fasta": "s3://amzn-s3-demo-bucket1/inputs/Homo_sapiens_assembly38.fasta",
      "ref_fasta_index": "s3://amzn-s3-demo-bucket1/inputs/Homo_sapiens_assembly38.fasta.fai",
      "sample_name": "NA12878"
  }

You can then define your workflow in WDL as shown in the following example.


 version 1.0
  workflow CramToBamFlow {
      input {
          File ref_fasta
          File ref_fasta_index
          File ref_dict
          File input_cram
          String sample_name
          String gotc_docker = "<account>.dkr.ecr.us-west-2.amazonaws.com/genomes-in-the-cloud:latest"
      }
      #Converts CRAM to SAM to BAM and makes BAI.
      call CramToBamTask{
           input:
              ref_fasta = ref_fasta,
              ref_fasta_index = ref_fasta_index,
              ref_dict = ref_dict,
              input_cram = input_cram,
              sample_name = sample_name,
              docker_image = gotc_docker,
       }
       #Validates Bam.
       call ValidateSamFile{
          input:
             input_bam = CramToBamTask.outputBam,
             docker_image = gotc_docker,
       }
       #Outputs Bam, Bai, and validation report to the FireCloud data model.
       output {
           File outputBam = CramToBamTask.outputBam
           File outputBai = CramToBamTask.outputBai
           File validation_report = ValidateSamFile.report
        }
  }
  #Task definitions.
  task CramToBamTask {
      input {
         # Command parameters
         File ref_fasta
         File ref_fasta_index
         File ref_dict
         File input_cram
         String sample_name
         # Runtime parameters
         String docker_image
      }
     #Calls samtools view to do the conversion.
     command {
         set -eo pipefail
  
         samtools view -h -T ~{ref_fasta} ~{input_cram} |
         samtools view -b -o ~{sample_name}.bam -
         samtools index -b ~{sample_name}.bam
         mv ~{sample_name}.bam.bai ~{sample_name}.bai
      }
      
      #Runtime attributes:
      runtime {
          docker: docker_image
      }
  
      #Outputs a BAM and BAI with the same sample name
       output {
           File outputBam = "~{sample_name}.bam"
           File outputBai = "~{sample_name}.bai"
      }
  }
  
  #Validates BAM output to ensure it wasn't corrupted during the file conversion.
  task ValidateSamFile {
     input {
        File input_bam
        Int machine_mem_size = 4
        String docker_image
     }
     String output_name = basename(input_bam, ".bam") + ".validation_report"
     Int command_mem_size = machine_mem_size - 1
     command {
         java -Xmx~{command_mem_size}G -jar /usr/gitc/picard.jar \
         ValidateSamFile \
         INPUT=~{input_bam} \
         OUTPUT=~{output_name} \
         MODE=SUMMARY \
         IS_BISULFITE_SEQUENCED=false
      }
      runtime {
      docker: docker_image
      }
     #A text file is generated that lists errors or warnings that apply.
      output {
          File report = "~{output_name}"
      }
  }
