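Importing read sets into a HealthOmics sequence store

After you create a sequence store, create import jobs to upload read sets into it. You can upload files from an Amazon S3 bucket, or upload them directly by using the synchronous API operations. The Amazon S3 bucket must be in the same Region as the sequence store.

You can upload any combination of aligned and unaligned read sets into a sequence store. However, if any read set in the import is aligned, you must include a reference genome.

You can reuse the IAM access policy that you used to create the reference store.

The following topics describe the main steps for importing read sets into a sequence store and then getting information about the imported data.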

Upload files to Amazon S3

The following example shows how to copy files into your Amazon S3 bucket.

aws s3 cp s3://1000genomes/phase1/data/HG00100/alignment/HG00100.chrom20.ILLUMINA.bwa.GBR.low_coverage.20101123.bam s3://your-bucket
aws s3 cp s3://1000genomes/phase3/data/HG00146/sequence_read/SRR233106_1.filt.fastq.gz s3://your-bucket
aws s3 cp s3://1000genomes/phase3/data/HG00146/sequence_read/SRR233106_2.filt.fastq.gz s3://your-bucket
aws s3 cp s3://1000genomes/data/HG00096/alignment/HG00096.alt_bwamem_GRCh38DH.20150718.GBR.low_coverage.cram s3://your-bucket
aws s3 cp s3://gatk-test-data/wgs_ubam/NA12878_20k/NA12878_A.bam s3://your-bucket

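The sample BAM and CRAM files used in this example require different genome references, Hg19 and Hg38. To learn more about these references or to access them, see The Broad Genome References in the Registry of Open Data on AWS.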
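Because the source bucket must be in the same Region as the sequence store, it can help to check the bucket's Region before creating the import job. The following is a minimal sketch, not part of the original procedure. It assumes a boto3 S3 client (for example, `boto3.client("s3")`) is passed in; the function names `bucket_region` and `check_same_region` are hypothetical.

```python
def bucket_region(s3_client, bucket):
    """Return the Region name that an S3 bucket lives in."""
    location = s3_client.get_bucket_location(Bucket=bucket)["LocationConstraint"]
    if location is None:
        # S3 reports buckets in us-east-1 with a null LocationConstraint.
        return "us-east-1"
    if location == "EU":
        # Legacy value that S3 can return for eu-west-1.
        return "eu-west-1"
    return location


def check_same_region(s3_client, bucket, store_region):
    """Raise ValueError if the bucket is not in the sequence store's Region."""
    region = bucket_region(s3_client, bucket)
    if region != store_region:
        raise ValueError(
            f"Bucket {bucket!r} is in {region}, but the sequence store is in {store_region}"
        )
    return region
```

For example, `check_same_region(boto3.client("s3"), "amzn-s3-demo-bucket", "us-west-2")` would raise before any import job is submitted if the Regions differ.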

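Create a manifest file

You must also create a JSON manifest file, import.json, that models the import job (see the following examples). If you use the console, you don't need to specify sequenceStoreId or roleArn, so the manifest file starts with the sources input.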

API manifest

The following example uses the API to import four read sets: one BAM, one FASTQ, one CRAM, and one uBAM. The FASTQ and uBAM sources are unaligned, so they don't need a referenceArn.

{
    "sequenceStoreId": "3936421177",
    "roleArn": "arn:aws:iam::555555555555:role/OmicsImport",
    "sources": [
        {
            "sourceFiles": {
                "source1": "s3://amzn-s3-demo-bucket/HG00100.chrom20.ILLUMINA.bwa.GBR.low_coverage.20101123.bam"
            },
            "sourceFileType": "BAM",
            "subjectId": "mySubject",
            "sampleId": "mySample",
            "referenceArn": "arn:aws:omics:us-west-2:555555555555:referenceStore/0123456789/reference/0000000001",
            "name": "HG00100",
            "description": "BAM for HG00100",
            "generatedFrom": "1000 Genomes"
        },
        {
            "sourceFiles": {
                "source1": "s3://amzn-s3-demo-bucket/SRR233106_1.filt.fastq.gz",
                "source2": "s3://amzn-s3-demo-bucket/SRR233106_2.filt.fastq.gz"
            },
            "sourceFileType": "FASTQ",
            "subjectId": "mySubject",
            "sampleId": "mySample",
            "name": "HG00146",
            "description": "FASTQ for HG00146",
            "generatedFrom": "1000 Genomes"
        },
        {
            "sourceFiles": {
                "source1": "s3://amzn-s3-demo-bucket/HG00096.alt_bwamem_GRCh38DH.20150718.GBR.low_coverage.cram"
            },
            "sourceFileType": "CRAM",
            "subjectId": "mySubject",
            "sampleId": "mySample",
            "referenceArn": "arn:aws:omics:us-west-2:555555555555:referenceStore/0123456789/reference/0000000001",
            "name": "HG00096",
            "description": "CRAM for HG00096",
            "generatedFrom": "1000 Genomes"
        },
        {
            "sourceFiles": {
                "source1": "s3://amzn-s3-demo-bucket/NA12878_A.bam"
            },
            "sourceFileType": "UBAM",
            "subjectId": "mySubject",
            "sampleId": "mySample",
            "name": "NA12878_A",
            "description": "uBAM for NA12878",
            "generatedFrom": "GATK Test Data"
        }
    ]
}

Console manifest

The following example manifest is used to import read sets by using the console.

[
    {
        "sourceFiles": {
            "source1": "s3://amzn-s3-demo-bucket/HG00100.chrom20.ILLUMINA.bwa.GBR.low_coverage.20101123.bam"
        },
        "sourceFileType": "BAM",
        "subjectId": "mySubject",
        "sampleId": "mySample",
        "name": "HG00100",
        "description": "BAM for HG00100",
        "generatedFrom": "1000 Genomes"
    },
    {
        "sourceFiles": {
            "source1": "s3://amzn-s3-demo-bucket/SRR233106_1.filt.fastq.gz",
            "source2": "s3://amzn-s3-demo-bucket/SRR233106_2.filt.fastq.gz"
        },
        "sourceFileType": "FASTQ",
        "subjectId": "mySubject",
        "sampleId": "mySample",
        "name": "HG00146",
        "description": "FASTQ for HG00146",
        "generatedFrom": "1000 Genomes"
    },
    {
        "sourceFiles": {
            "source1": "s3://amzn-s3-demo-bucket/HG00096.alt_bwamem_GRCh38DH.20150718.GBR.low_coverage.cram"
        },
        "sourceFileType": "CRAM",
        "subjectId": "mySubject",
        "sampleId": "mySample",
        "name": "HG00096",
        "description": "CRAM for HG00096",
        "generatedFrom": "1000 Genomes"
    },
    {
        "sourceFiles": {
            "source1": "s3://amzn-s3-demo-bucket/NA12878_A.bam"
        },
        "sourceFileType": "UBAM",
        "subjectId": "mySubject",
        "sampleId": "mySample",
        "name": "NA12878_A",
        "description": "uBAM for NA12878",
        "generatedFrom": "GATK Test Data"
    }
]

Alternatively, you can upload the manifest file in YAML format.
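An API manifest can also be generated programmatically, which makes it easy to catch a missing reference before the job is submitted. The sketch below uses only the Python standard library. It assumes that BAM and CRAM are the aligned file types that need a referenceArn, which matches the API manifest example; the helper names `make_source`, `make_manifest`, and `manifest_json` are hypothetical.

```python
import json

# Assumption: these are the file types that hold aligned reads and
# therefore need a reference genome, as in the API manifest example.
ALIGNED_TYPES = {"BAM", "CRAM"}


def make_source(file_type, files, subject_id, sample_id, name,
                reference_arn=None, **extra):
    """Build one entry of the manifest's "sources" list."""
    if file_type in ALIGNED_TYPES and not reference_arn:
        raise ValueError(f"{file_type} source {name!r} needs a referenceArn")
    source = {
        # Files are keyed source1, source2, ... in the order given.
        "sourceFiles": {f"source{i}": uri for i, uri in enumerate(files, start=1)},
        "sourceFileType": file_type,
        "subjectId": subject_id,
        "sampleId": sample_id,
        "name": name,
    }
    if reference_arn:
        source["referenceArn"] = reference_arn
    # Optional fields such as description or generatedFrom.
    source.update(extra)
    return source


def make_manifest(sequence_store_id, role_arn, sources):
    """Assemble the top-level API manifest."""
    return {
        "sequenceStoreId": sequence_store_id,
        "roleArn": role_arn,
        "sources": list(sources),
    }


def manifest_json(manifest):
    """Serialize the manifest as strict JSON (no comments allowed)."""
    return json.dumps(manifest, indent=4)
```

Writing the result of `manifest_json(...)` to import.json produces a file in the same shape as the API manifest example.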

Start the import job

To start the import job, use the following AWS CLI command.

aws omics start-read-set-import-job --cli-input-json file://import.json

You receive the following response, which indicates that the job was created successfully.

{
    "id": "3660451514",
    "sequenceStoreId": "3936421177",
    "roleArn": "arn:aws:iam::111122223333:role/OmicsImport",
    "status": "CREATED",
    "creationTime": "2022-07-13T22:14:59.309Z"
}
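The same request can be sent from Python. The following sketch assumes the manifest file uses the API manifest layout and that a boto3 HealthOmics client (for example, `boto3.client("omics")`) is passed in as `omics_client`; the function names `load_manifest` and `start_import` are hypothetical.

```python
import json


def load_manifest(path):
    """Read an API manifest such as import.json from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def start_import(omics_client, manifest):
    """Start an import job from a manifest dict and return the job ID."""
    response = omics_client.start_read_set_import_job(
        sequenceStoreId=manifest["sequenceStoreId"],
        roleArn=manifest["roleArn"],
        sources=manifest["sources"],
    )
    # The response carries the job ID that later status calls need.
    return response["id"]
```

A typical call would be `start_import(boto3.client("omics"), load_manifest("import.json"))`.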

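Monitor the import job

After the import job starts, you can monitor its progress with the following command. In the example, replace sequence store id with your sequence store ID, and replace job import ID with the import job ID.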

aws omics get-read-set-import-job --sequence-store-id sequence store id --id job import ID

The response shows the status of the specified import job and of each of its sources.

{
    "id": "1234567890",
    "sequenceStoreId": "1234567890",
    "roleArn": "arn:aws:iam::111122223333:role/OmicsImport",
    "status": "RUNNING",
    "statusMessage": "The job is currently in progress.",
    "creationTime": "2022-07-13T22:14:59.309Z",
    "sources": [
        {
            "sourceFiles": {
                "source1": "s3://amzn-s3-demo-bucket/HG00100.chrom20.ILLUMINA.bwa.GBR.low_coverage.20101123.bam"
            },
            "sourceFileType": "BAM",
            "status": "IN_PROGRESS",
            "statusMessage": "The job is currently in progress.",
            "subjectId": "mySubject",
            "sampleId": "mySample",
            "referenceArn": "arn:aws:omics:us-west-2:111122223333:referenceStore/3242349265/reference/8625408453",
            "name": "HG00100",
            "description": "BAM for HG00100",
            "generatedFrom": "1000 Genomes",
            "readSetId": "1234567890"
        },
        {
            "sourceFiles": {
                "source1": "s3://amzn-s3-demo-bucket/SRR233106_1.filt.fastq.gz",
                "source2": "s3://amzn-s3-demo-bucket/SRR233106_2.filt.fastq.gz"
            },
            "sourceFileType": "FASTQ",
            "status": "IN_PROGRESS",
            "statusMessage": "The job is currently in progress.",
            "subjectId": "mySubject",
            "sampleId": "mySample",
            "name": "HG00146",
            "description": "FASTQ for HG00146",
            "generatedFrom": "1000 Genomes",
            "readSetId": "1234567890"
        },
        {
            "sourceFiles": {
                "source1": "s3://amzn-s3-demo-bucket/HG00096.alt_bwamem_GRCh38DH.20150718.GBR.low_coverage.cram"
            },
            "sourceFileType": "CRAM",
            "status": "IN_PROGRESS",
            "statusMessage": "The job is currently in progress.",
            "subjectId": "mySubject",
            "sampleId": "mySample",
            "referenceArn": "arn:aws:omics:us-west-2:111122223333:referenceStore/3242349265/reference/1234568870",
            "name": "HG00096",
            "description": "CRAM for HG00096",
            "generatedFrom": "1000 Genomes",
            "readSetId": "1234567890"
        },
        {
            "sourceFiles": {
                "source1": "s3://amzn-s3-demo-bucket/NA12878_A.bam"
            },
            "sourceFileType": "UBAM",
            "status": "IN_PROGRESS",
            "statusMessage": "The job is currently in progress.",
            "subjectId": "mySubject",
            "sampleId": "mySample",
            "name": "NA12878_A",
            "description": "uBAM for NA12878",
            "generatedFrom": "GATK Test Data",
            "readSetId": "1234567890"
        }
    ]
}
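Monitoring usually means repeating this call until the job reaches a final state. The sketch below polls through a boto3 HealthOmics client passed in as `omics_client` and tallies the per-source statuses from a response like the one above. The set of terminal status values is an assumption and should be checked against the API reference; `wait_for_import` and `summarize_sources` are hypothetical names.

```python
import time
from collections import Counter

# Assumption: job statuses after which no further progress is expected.
TERMINAL_STATUSES = {"COMPLETED", "COMPLETED_WITH_FAILURES", "FAILED", "CANCELLED"}


def summarize_sources(job):
    """Count the job's sources by their individual status."""
    return Counter(source["status"] for source in job.get("sources", []))


def wait_for_import(omics_client, sequence_store_id, job_id,
                    poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Poll the import job until it reaches a terminal status, then return it."""
    for _ in range(max_polls):
        job = omics_client.get_read_set_import_job(
            sequenceStoreId=sequence_store_id, id=job_id
        )
        if job["status"] in TERMINAL_STATUSES:
            return job
        sleep(poll_seconds)
    raise TimeoutError(f"Import job {job_id} did not finish after {max_polls} polls")
```

The `sleep` parameter exists only so that the wait can be replaced in tests; in normal use the default is enough.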

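Find the imported sequence files

After the job completes, you can use the list-read-sets API operation to find the imported sequence files. In the example, replace sequence store id with your sequence store ID.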

aws omics list-read-sets --sequence-store-id sequence store id

You receive the following response.

{
    "readSets": [
        {
            "id": "0000000001",
            "arn": "arn:aws:omics:us-west-2:111122223333:sequenceStore/0123456789/readSet/0000000001",
            "sequenceStoreId": "0123456789",
            "subjectId": "mySubject",
            "sampleId": "mySample",
            "status": "ACTIVE",
            "name": "HG00100",
            "description": "BAM for HG00100",
            "referenceArn": "arn:aws:omics:us-west-2:111122223333:referenceStore/0123456789/reference/0000000001",
            "fileType": "BAM",
            "sequenceInformation": {
                "totalReadCount": 9194,
                "totalBaseCount": 928594,
                "generatedFrom": "1000 Genomes",
                "alignment": "ALIGNED"
            },
            "creationTime": "2022-07-13T23:25:20Z",
            "creationType": "IMPORT",
            "etag": {
                "algorithm": "BAM_MD5up",
                "source1": "d1d65429212d61d115bb19f510d4bd02"
            }
        },
        {
            "id": "0000000002",
            "arn": "arn:aws:omics:us-west-2:111122223333:sequenceStore/0123456789/readSet/0000000002",
            "sequenceStoreId": "0123456789",
            "subjectId": "mySubject",
            "sampleId": "mySample",
            "status": "ACTIVE",
            "name": "HG00146",
            "description": "FASTQ for HG00146",
            "fileType": "FASTQ",
            "sequenceInformation": {
                "totalReadCount": 8000000,
                "totalBaseCount": 1184000000,
                "generatedFrom": "1000 Genomes",
                "alignment": "UNALIGNED"
            },
            "creationTime": "2022-07-13T23:26:43Z",
            "creationType": "IMPORT",
            "etag": {
                "algorithm": "FASTQ_MD5up",
                "source1": "ca78f685c26e7cc2bf3e28e3ec4d49cd"
            }
        },
        {
            "id": "0000000003",
            "arn": "arn:aws:omics:us-west-2:111122223333:sequenceStore/0123456789/readSet/0000000003",
            "sequenceStoreId": "0123456789",
            "subjectId": "mySubject",
            "sampleId": "mySample",
            "status": "ACTIVE",
            "name": "HG00096",
            "description": "CRAM for HG00096",
            "referenceArn": "arn:aws:omics:us-west-2:111122223333:referenceStore/0123456789/reference/0000000001",
            "fileType": "CRAM",
            "sequenceInformation": {
                "totalReadCount": 85466534,
                "totalBaseCount": 24000004881,
                "generatedFrom": "1000 Genomes",
                "alignment": "ALIGNED"
            },
            "creationTime": "2022-07-13T23:30:41Z",
            "creationType": "IMPORT",
            "etag": {
                "algorithm": "CRAM_MD5up",
                "source1": "66817940f3025a760e6da4652f3e927e"
            }
        },
        {
            "id": "0000000004",
            "arn": "arn:aws:omics:us-west-2:111122223333:sequenceStore/0123456789/readSet/0000000004",
            "sequenceStoreId": "0123456789",
            "subjectId": "mySubject",
            "sampleId": "mySample",
            "status": "ACTIVE",
            "name": "NA12878_A",
            "description": "uBAM for NA12878",
            "fileType": "UBAM",
            "sequenceInformation": {
                "totalReadCount": 20000,
                "totalBaseCount": 5000000,
                "generatedFrom": "GATK Test Data",
                "alignment": "UNALIGNED"
            },
            "creationTime": "2022-07-13T23:30:41Z",
            "creationType": "IMPORT",
            "etag": {
                "algorithm": "BAM_MD5up",
                "source1": "640eb686263e9f63bcda12c35b84f5c7"
            }
        }
    ]
}

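A sequence store can hold more read sets than one list-read-sets response returns, so a script generally has to follow the pagination token. The sketch below assumes a boto3 HealthOmics client passed in as `omics_client` and that each page carries a `readSets` list and an optional `nextToken`; `iter_read_sets` and `read_set_ids_by_name` are hypothetical helper names.

```python
def iter_read_sets(omics_client, sequence_store_id):
    """Yield every read set in a sequence store, following nextToken pages."""
    kwargs = {"sequenceStoreId": sequence_store_id}
    while True:
        page = omics_client.list_read_sets(**kwargs)
        yield from page.get("readSets", [])
        token = page.get("nextToken")
        if not token:
            return
        kwargs["nextToken"] = token


def read_set_ids_by_name(omics_client, sequence_store_id):
    """Map each read set name to the list of read set IDs that carry it."""
    ids = {}
    for read_set in iter_read_sets(omics_client, sequence_store_id):
        # Names are not guaranteed to be unique, so collect IDs in a list.
        ids.setdefault(read_set.get("name"), []).append(read_set["id"])
    return ids
```

With the sample response above, `read_set_ids_by_name` would map "HG00100" to the ID list containing "0000000001", which is the value that get-read-set-metadata expects.
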
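Get details about a read set

To view the details of a read set, use the GetReadSetMetadata API operation. In the example, replace sequence store id with your sequence store ID, and replace read set id with your read set ID.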

aws omics get-read-set-metadata --sequence-store-id sequence store id --id read set id

You receive the following response.

{
    "arn": "arn:aws:omics:us-west-2:123456789012:sequenceStore/2015356892/readSet/9515444019",
    "creationTime": "2024-01-12T04:50:33.548Z",
    "creationType": "IMPORT",
    "creationJobId": "33222111",
    "description": null,
    "etag": {
        "algorithm": "FASTQ_MD5up",
        "source1": "00d0885ba3eeb211c8c84520d3fa26ec",
        "source2": "00d0885ba3eeb211c8c84520d3fa26ec"
    },
    "fileType": "FASTQ",
    "files": {
        "index": null,
        "source1": {
            "contentLength": 10818,
            "partSize": 104857600,
            "s3Access": {
                "s3Uri": "s3://accountID-sequence store ID-ajdpi90jdas90a79fh9a8ja98jdfa9jf98-s3alias/592761533288/sequenceStore/2015356892/readSet/9515444019/import_source1.fastq.gz"
            },
            "totalParts": 1
        },
        "source2": {
            "contentLength": 10818,
            "partSize": 104857600,
            "s3Access": {
                "s3Uri": "s3://accountID-sequence store ID-ajdpi90jdas90a79fh9a8ja98jdfa9jf98-s3alias/592761533288/sequenceStore/2015356892/readSet/9515444019/import_source1.fastq.gz"
            },
            "totalParts": 1
        }
    },
    "id": "9515444019",
    "name": "paired-fastq-import",
    "sampleId": "sampleId-paired-fastq-import",
    "sequenceInformation": {
        "alignment": "UNALIGNED",
        "generatedFrom": null,
        "totalBaseCount": 30000,
        "totalReadCount": 200
    },
    "sequenceStoreId": "2015356892",
    "status": "ACTIVE",
    "statusMessage": null,
    "subjectId": "subjectId-paired-fastq-import"
}
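The `files` section of this response is what a download script needs: the Amazon S3 URI, the size, and the number of parts for each file. The following sketch pulls those fields out of a metadata response that has already been parsed into a Python dict, skipping entries that are null (such as `index` for FASTQ). The function name `describe_files` is hypothetical.

```python
def describe_files(metadata):
    """Return {label: {s3Uri, contentLength, totalParts}} for each present file."""
    described = {}
    for label, info in (metadata.get("files") or {}).items():
        if not info:
            # For example, "index" is null for a FASTQ read set.
            continue
        described[label] = {
            "s3Uri": (info.get("s3Access") or {}).get("s3Uri"),
            "contentLength": info.get("contentLength"),
            "totalParts": info.get("totalParts"),
        }
    return described
```

For the response above, the result has the keys `source1` and `source2`, each with `totalParts` equal to 1.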

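Download the read set data files

You can access the objects of an active read set by using the Amazon S3 GetObject API operation. The URI for each object is returned in the GetReadSetMetadata API response. For more information, see Accessing HealthOmics read sets with Amazon S3 URIs.

Alternatively, use the HealthOmics GetReadSet API operation. With GetReadSet, you can download individual parts, which lets you download in parallel. These parts are similar to Amazon S3 parts. The following example downloads part 1 of a read set. In the example, replace sequence store id with your sequence store ID, and replace read set id with your read set ID.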

aws omics get-read-set --sequence-store-id sequence store id --id read set id --part-number 1 outfile.bam
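Because each part can be fetched independently, a script can request several parts at once and write them out in order. The sketch below assumes a boto3 HealthOmics client passed in as `omics_client`, that `get_read_set` returns the part's bytes as a readable `payload` stream, and that `total_parts` comes from the `totalParts` field of the metadata response. The names `download_part` and `download_file` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def download_part(omics_client, sequence_store_id, read_set_id, part_number,
                  file="SOURCE1"):
    """Fetch one part of a read set file and return its bytes."""
    response = omics_client.get_read_set(
        sequenceStoreId=sequence_store_id,
        id=read_set_id,
        partNumber=part_number,
        file=file,
    )
    return response["payload"].read()


def download_file(omics_client, sequence_store_id, read_set_id, total_parts,
                  out, file="SOURCE1", max_workers=4):
    """Download all parts in parallel and write them to `out` in part order."""
    def fetch(part_number):
        return download_part(
            omics_client, sequence_store_id, read_set_id, part_number, file
        )

    written = 0
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in part order even if downloads finish out
        # of order. Completed parts wait in memory until they are written.
        for chunk in pool.map(fetch, range(1, total_parts + 1)):
            out.write(chunk)
            written += len(chunk)
    return written
```

Here `out` is any binary file object, for example the handle from `open("outfile.bam", "wb")`. For very large read sets, writing each part to its own temporary file would keep memory use lower than this sketch does.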

You can also use the HealthOmics Transfer Manager to download files for a HealthOmics reference or read set. The HealthOmics Transfer Manager is available for download, and its GitHub repository describes how to set it up and use it.