本文属于机器翻译版本。若本译文内容与英语原文存在差异,则一律以英文原文为准。
将读取集导入 HealthOmics 序列存储
创建序列存储后,创建导入任务以将读取集上传到数据存储中。您可以从 Amazon S3 存储桶上传文件,也可以使用同步 API 操作直接上传。您的 Amazon S3 存储桶必须与您的序列存储位于同一区域。
您可以将对齐和未对齐读取集的任意组合上传到序列存储中,但是,如果导入中的任何读取集是对齐的,则必须包括参考基因组。
您可以重复使用用于创建参考存储的 IAM 访问策略。
以下主题描述了将读取集导入序列存储然后获取有关导入数据信息的主要步骤。
将文件上传到亚马逊 S3
以下示例显示了如何将文件移动到您的 Amazon S3 存储桶中。
aws s3 cp s3://1000genomes/phase1/data/HG00100/alignment/HG00100.chrom20.ILLUMINA.bwa.GBR.low_coverage.20101123.bam s3://your-bucket aws s3 cp s3://1000genomes/phase3/data/HG00146/sequence_read/SRR233106_1.filt.fastq.gz s3://your-bucket aws s3 cp s3://1000genomes/phase3/data/HG00146/sequence_read/SRR233106_2.filt.fastq.gz s3://your-bucket aws s3 cp s3://1000genomes/data/HG00096/alignment/HG00096.alt_bwamem_GRCh38DH.20150718.GBR.low_coverage.cram s3://your-bucket aws s3 cp s3://gatk-test-data/wgs_ubam/NA12878_20k/NA12878_A.bam s3://your-bucket
本示例BAM
中CRAM
使用的样本需要不同的基因组参考文献,Hg19
以及Hg38
。要了解更多信息或访问这些参考文献,请参阅开放数据注册表中的广泛基因组参考文献
创建清单文件
您还必须以 JSON 格式创建清单文件来对导入任务进行建模import.json
(参见以下示例)。如果您在控制台中创建序列存储,则无需指定sequenceStoreId
或roleARN
,因此清单文件以sources
输入开头。
或者,您可以上传 YAML 格式的清单文件。
启动导入任务
要启动导入作业,请使用以下 AWS CLI 命令。
aws omics start-read-set-import-job --cli-input-json file://import.json
您会收到以下响应,表示成功创建了就业机会。
{ "id": "3660451514", "sequenceStoreId": "3936421177", "roleArn": "arn:aws:iam::111122223333:role/OmicsImport", "status": "CREATED", "creationTime": "2022-07-13T22:14:59.309Z" }
监控导入作业
导入任务启动后,您可以使用以下命令监控其进度。在以下示例中,
替换为您的序列存储 ID,然后sequence store id
替换为导入 ID。job import ID
aws omics get-read-set-import-job --sequence-store-id
--id
sequence store id
job import ID
下图显示了与指定序列存储 ID 关联的所有导入任务的状态。
{ "id": "1234567890", "sequenceStoreId": "1234567890", "roleArn": "arn:aws:iam::111122223333:role/OmicsImport", "status": "RUNNING", "statusMessage": "The job is currently in progress.", "creationTime": "2022-07-13T22:14:59.309Z", "sources": [ { "sourceFiles": { "source1": "s3://amzn-s3-demo-bucket/HG00100.chrom20.ILLUMINA.bwa.GBR.low_coverage.20101123.bam" }, "sourceFileType": "BAM", "status": "IN_PROGRESS", "statusMessage": "The job is currently in progress." "subjectId": "mySubject", "sampleId": "mySample", "referenceArn": "arn:aws:omics:us-west-2:111122223333:referenceStore/3242349265/reference/8625408453", "name": "HG00100", "description": "BAM for HG00100", "generatedFrom": "1000 Genomes", "readSetID": "1234567890" }, { "sourceFiles": { "source1": "s3://amzn-s3-demo-bucket/SRR233106_1.filt.fastq.gz", "source2": "s3://amzn-s3-demo-bucket/SRR233106_2.filt.fastq.gz" }, "sourceFileType": "FASTQ", "status": "IN_PROGRESS", "statusMessage": "The job is currently in progress." "subjectId": "mySubject", "sampleId": "mySample", "name": "HG00146", "description": "FASTQ for HG00146", "generatedFrom": "1000 Genomes", "readSetID": "1234567890" }, { "sourceFiles": { "source1": "s3://amzn-s3-demo-bucket/HG00096.alt_bwamem_GRCh38DH.20150718.GBR.low_coverage.cram" }, "sourceFileType": "CRAM", "status": "IN_PROGRESS", "statusMessage": "The job is currently in progress." "subjectId": "mySubject", "sampleId": "mySample", "referenceArn": "arn:aws:omics:us-west-2:111122223333:referenceStore/3242349265/reference/1234568870", "name": "HG00096", "description": "CRAM for HG00096", "generatedFrom": "1000 Genomes", "readSetID": "1234567890" }, { "sourceFiles": { "source1": "s3://amzn-s3-demo-bucket/NA12878_A.bam" }, "sourceFileType": "UBAM", "status": "IN_PROGRESS", "statusMessage": "The job is currently in progress." "subjectId": "mySubject", "sampleId": "mySample", "name": "NA12878_A", "description": "uBAM for NA12878", "generatedFrom": "GATK Test Data", "readSetID": "1234567890" } ] }
查找导入的序列文件
任务完成后,您可以使用 list-read-setsAPI 操作来查找导入的序列文件。在以下示例中,
替换为您的序列存储 ID。sequence store
id
aws omics list-read-sets --sequence-store-id
sequence store id
您会收到以下回复。
{ "readSets": [ { "id": "0000000001", "arn": "arn:aws:omics:us-west-2:111122223333:sequenceStore/01234567890/readSet/0000000001", "sequenceStoreId": "1234567890", "subjectId": "mySubject", "sampleId": "mySample", "status": "ACTIVE", "name": "HG00100", "description": "BAM for HG00100", "referenceArn": "arn:aws:omics:us-west-2:111122223333:referenceStore/01234567890/reference/0000000001", "fileType": "BAM", "sequenceInformation": { "totalReadCount": 9194, "totalBaseCount": 928594, "generatedFrom": "1000 Genomes", "alignment": "ALIGNED" }, "creationTime": "2022-07-13T23:25:20Z" "creationType": "IMPORT", "etag": { "algorithm": "BAM_MD5up", "source1": "d1d65429212d61d115bb19f510d4bd02" } }, { "id": "0000000002", "arn": "arn:aws:omics:us-west-2:111122223333:sequenceStore/0123456789/readSet/0000000002", "sequenceStoreId": "0123456789", "subjectId": "mySubject", "sampleId": "mySample", "status": "ACTIVE", "name": "HG00146", "description": "FASTQ for HG00146", "fileType": "FASTQ", "sequenceInformation": { "totalReadCount": 8000000, "totalBaseCount": 1184000000, "generatedFrom": "1000 Genomes", "alignment": "UNALIGNED" }, "creationTime": "2022-07-13T23:26:43Z" "creationType": "IMPORT", "etag": { "algorithm": "FASTQ_MD5up", "source1": "ca78f685c26e7cc2bf3e28e3ec4d49cd" } }, { "id": "0000000003", "arn": "arn:aws:omics:us-west-2:111122223333:sequenceStore/0123456789/readSet/0000000003", "sequenceStoreId": "0123456789", "subjectId": "mySubject", "sampleId": "mySample", "status": "ACTIVE", "name": "HG00096", "description": "CRAM for HG00096", "referenceArn": "arn:aws:omics:us-west-2:111122223333:referenceStore/0123456789/reference/0000000001", "fileType": "CRAM", "sequenceInformation": { "totalReadCount": 85466534, "totalBaseCount": 24000004881, "generatedFrom": "1000 Genomes", "alignment": "ALIGNED" }, "creationTime": "2022-07-13T23:30:41Z" "creationType": "IMPORT", "etag": { "algorithm": "CRAM_MD5up", "source1": "66817940f3025a760e6da4652f3e927e" } }, { "id": "0000000004", "arn": "arn:aws:omics:us-west-2:111122223333:sequenceStore/0123456789/readSet/0000000004", "sequenceStoreId": "0123456789", "subjectId": "mySubject", "sampleId": "mySample", "status": "ACTIVE", "name": "NA12878_A", "description": "uBAM for NA12878", "fileType": "UBAM", "sequenceInformation": { "totalReadCount": 20000, "totalBaseCount": 5000000, "generatedFrom": "GATK Test Data", "alignment": "ALIGNED" }, "creationTime": "2022-07-13T23:30:41Z" "creationType": "IMPORT", "etag": { "algorithm": "BAM_MD5up", "source1": "640eb686263e9f63bcda12c35b84f5c7" } } ] }
获取有关阅读集的详细信息
要查看有关读取集的更多详细信息,请使用 GetReadSetMetadataAPI 操作。在以下示例中,
替换为您的序列存储 ID,然后sequence store id
替换为您的读取集 ID。read set id
aws omics get-read-set-metadata --sequence-store-id
--id
sequence store id
read set id
您会收到以下回复。
{ "arn": "arn:aws:omics:us-west-2:123456789012:sequenceStore/2015356892/readSet/9515444019", "creationTime": "2024-01-12T04:50:33.548Z", "creationType": "IMPORT", "creationJobId": "33222111", "description": null, "etag": { "algorithm": "FASTQ_MD5up", "source1": "00d0885ba3eeb211c8c84520d3fa26ec", "source2": "00d0885ba3eeb211c8c84520d3fa26ec" }, "fileType": "FASTQ", "files": { "index": null, "source1": { "contentLength": 10818, "partSize": 104857600, "s3Access": { "s3Uri": "s3://
accountID
-sequence store ID
-ajdpi90jdas90a79fh9a8ja98jdfa9jf98-s3alias/592761533288/sequenceStore/2015356892/readSet/9515444019/import_source1.fastq.gz" }, "totalParts": 1 }, "source2": { "contentLength": 10818, "partSize": 104857600, "s3Access": { "s3Uri": "s3://accountID
-sequence store ID
-ajdpi90jdas90a79fh9a8ja98jdfa9jf98-s3alias/592761533288/sequenceStore/2015356892/readSet/9515444019/import_source1.fastq.gz" }, "totalParts": 1 } }, "id": "9515444019", "name": "paired-fastq-import", "sampleId": "sampleId-paired-fastq-import", "sequenceInformation": { "alignment": "UNALIGNED", "generatedFrom": null, "totalBaseCount": 30000, "totalReadCount": 200 }, "sequenceStoreId": "2015356892", "status": "ACTIVE", "statusMessage": null, "subjectId": "subjectId-paired-fastq-import" }
下载读取集数据文件
您可以使用 Amazon S3 GetObject API 操作访问活动读取集的对象。该对象的 URI 将在 GetReadSetMetadataAPI 响应中返回。有关更多信息,请参阅 使用 Amazon S3 访问 HealthOmics 读取集 URIs。
或者,也可以使用 HealthOmics GetReadSet API 操作。您可以使用GetReadSet通过下载各个部分来并行下载。这些部分与 Amazon S3 部件类似。以下是如何从读取集下载第 1 部分的示例。在以下示例中,
替换为您的序列存储 ID,然后sequence store id
替换为您的读取集 ID。read set id
aws omics get-read-set --sequence-store-id
--id
sequence store id
--part-number 1 outfile.bam
read set id
您也可以使用 HealthOmics 传输管理器下载文件以 HealthOmics 供参考或读取。您可以在此