

**This page is only for existing customers of the Amazon Glacier service using Vaults and the original REST API from 2012.**

If you're looking for archival storage solutions, we recommend using the Amazon Glacier storage classes in Amazon S3: S3 Glacier Instant Retrieval, S3 Glacier Flexible Retrieval, and S3 Glacier Deep Archive. To learn more about these storage options, see [Amazon Glacier storage classes](https://aws.amazon.com/s3/storage-classes/glacier/).

Amazon Glacier (the original standalone vault-based service) no longer accepts new customers. Amazon Glacier is a standalone service with its own APIs that stores data in vaults, and it is distinct from Amazon S3 and the Amazon S3 Glacier storage classes. Your existing data will remain secure and accessible in Amazon Glacier indefinitely. No migration is required. For low-cost, long-term archival storage, AWS recommends the [Amazon S3 Glacier storage classes](https://aws.amazon.com/s3/storage-classes/glacier/), which deliver a superior customer experience with S3 bucket-based APIs, full AWS Region availability, lower costs, and AWS service integration. If you want enhanced capabilities, consider migrating to the Amazon S3 Glacier storage classes by using our [AWS Solutions Guidance for transferring data from Amazon Glacier vaults to Amazon S3 Glacier storage classes](https://aws.amazon.com/solutions/guidance/data-transfer-from-amazon-s3-glacier-vaults-to-amazon-s3/).


# Downloading a Large Archive Using Parallel Processing with Python
<a name="downloading-large-archive-parallel-python"></a>

This topic describes how to download a large archive from Amazon S3 Glacier (S3 Glacier) using parallel processing with Python. This approach lets you download archives of any size by splitting them into smaller chunks that can be processed independently.

## Overview
<a name="downloading-large-archive-python-overview"></a>

The Python script provided in this example performs the following tasks:

1. Sets up the required AWS resources (an Amazon SNS topic and Amazon SQS queues) for notifications

1. Initiates an archive retrieval job with Amazon Glacier

1. Monitors an Amazon SQS queue for job completion notifications

1. Splits the large archive into manageable chunks

1. Downloads the chunks in parallel using multiple worker threads

1. Saves each chunk to disk for later reassembly
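
As an illustration of the splitting step, the following minimal sketch computes inclusive byte ranges for an archive of a given size. The function name `byte_ranges` and the small sizes used here are assumptions chosen for readability; the full script later on this page applies the same arithmetic with a 1 GB chunk size.

```python
def byte_ranges(archive_size, chunk_size):
    """Return (start, end, chunk_number) tuples with inclusive byte offsets."""
    ranges = []
    start = 0
    chunk_number = 0
    while start < archive_size:
        chunk_number += 1
        # The last chunk may be shorter than chunk_size.
        end = min(start + chunk_size - 1, archive_size - 1)
        ranges.append((start, end, chunk_number))
        start = end + 1
    return ranges


# Hypothetical sizes: a 2,500-byte archive split into 1,000-byte chunks.
for start, end, number in byte_ranges(2_500, 1_000):
    print(f"chunk {number}: bytes={start}-{end}")
```

Each tuple maps directly to a `bytes=start-end` range string, so chunks can be requested and retried independently of one another.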

## Prerequisites
<a name="downloading-large-archive-python-prerequisites"></a>

Before you begin, make sure that you have:
+ Python 3.6 or later installed
+ AWS SDK for Python (Boto3) installed
+ AWS credentials configured with appropriate permissions for Amazon Glacier, Amazon SNS, and Amazon SQS
+ Sufficient disk space to store the downloaded archive chunks
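
The Python version and disk space prerequisites can be checked up front. The following is a minimal sketch using only the standard library; the helper name `has_enough_space` and the 1 GB figure are assumptions for illustration, and you would substitute your output directory and archive size.

```python
import shutil
import sys


def has_enough_space(directory, required_bytes):
    """Return True if the filesystem holding `directory` has at least required_bytes free."""
    return shutil.disk_usage(directory).free >= required_bytes


# Report whether the interpreter and the current directory meet the prerequisites.
print("Python 3.6 or later:", sys.version_info >= (3, 6))
print("Room for 1 GB in the current directory:", has_enough_space(".", 1_000_000_000))
```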

## Example: Downloading an Archive Using Parallel Processing with Python
<a name="downloading-large-archive-python-code"></a>

The following Python script demonstrates how to download a large archive from Amazon Glacier using parallel processing:

```
import boto3
import time
import json
import concurrent.futures
import os

output_file_path = "{{output_directory_path}}"  # directory where chunk files are written
vault_name = "{{vault_name}}"

chunk_size = 1000000000  # 1 GB - size of each chunk for parallel download
notify_queue_name = '{{GlacierJobCompleteNotifyQueue}}'  # SQS queue for Glacier job-completion notifications
chunk_download_queue_name = '{{GlacierChunkReadyNotifyQueue}}'  # SQS queue for chunk download messages
sns_topic_name = '{{GlacierRecallJobCompleted}}'  # SNS topic notified when the archive retrieval job completes
chunk_queue_visibility_timeout = 7200  # 2 hours - this may need to be adjusted
region = '{{us-east-1}}'
archive_id = "{{archive_id_to_restore}}"
retrieve_archive = True  # set to False to skip initiating a retrieval job - useful for restarting or for parallel processing of the chunk queue
workers = 12  # the number of parallel worker threads for downloading chunks

# Create every client below in the configured Region.
boto3.setup_default_session(region_name=region)

# Shared SQS client used by the module-level helper functions.
sqs = boto3.client('sqs')

def setup_queues_and_topic():
    sqs = boto3.client('sqs')
    sns = boto3.client('sns')

    # Create the SNS topic
    topic_response = sns.create_topic(
        Name=sns_topic_name
    )
    topic_arn = topic_response['TopicArn']
    print("Creating the SNS topic " + topic_arn)

    # Create the notification queue
    notify_queue_response = sqs.create_queue(
        QueueName=notify_queue_name,
        Attributes={
            'VisibilityTimeout': '300',  # 5 minutes
            'ReceiveMessageWaitTimeSeconds': '20'  # Enable long polling
        }
    )
    notify_queue_url = notify_queue_response['QueueUrl']
    print("Creating the archive-retrieval notification queue " + notify_queue_url)

    # Create the chunk download queue
    chunk_queue_response = sqs.create_queue(
        QueueName=chunk_download_queue_name,
        Attributes={
            'VisibilityTimeout': str(chunk_queue_visibility_timeout),  # 2 hours by default (see chunk_queue_visibility_timeout)
            'ReceiveMessageWaitTimeSeconds': '0'
        }
    )
    chunk_queue_url = chunk_queue_response['QueueUrl']

    print("Creating the chunk ready notification queue " + chunk_queue_url)


    # Get the ARN for the notification queue
    notify_queue_attributes = sqs.get_queue_attributes(
        QueueUrl=notify_queue_url,
        AttributeNames=['QueueArn']
    )
    notify_queue_arn = notify_queue_attributes['Attributes']['QueueArn']

    # Set up the SNS topic policy on the notification queue
    queue_policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "allow-sns-messages",
            "Effect": "Allow",
            "Principal": {"AWS": "*"},
            "Action": "SQS:SendMessage",
            "Resource": notify_queue_arn,
            "Condition": {
                "ArnEquals": {
                    "aws:SourceArn": topic_arn
                }
            }
        }]
    }

    # Set the queue policy
    sqs.set_queue_attributes(
        QueueUrl=notify_queue_url,
        Attributes={
            'Policy': json.dumps(queue_policy)
        }
    )

    # Subscribe the notification queue to the SNS topic
    sns.subscribe(
        TopicArn=topic_arn,
        Protocol='sqs',
        Endpoint=notify_queue_arn
    )

    return {
        'topic_arn': topic_arn,
        'notify_queue_url': notify_queue_url,
        'chunk_queue_url': chunk_queue_url
    }


def split_and_send_chunks(archive_size, job_id,chunk_queue_url):
    ranges = []
    current = 0
    chunk_number = 0

    while current < archive_size:
        chunk_number += 1
        next_range = min(current + chunk_size - 1, archive_size - 1)
        ranges.append((current, next_range, chunk_number))
        current = next_range + 1

    # Send messages to SQS queue
    for start, end, chunk_number in ranges:
        body = {"start": start, "end": end, "job_id": job_id, "chunk_number": chunk_number}
        body = json.dumps(body)
        print("Sending SQS message for range:" + str(body))
        response = sqs.send_message(
            QueueUrl=chunk_queue_url,
            MessageBody=str(body)
        )

def GetJobOutputChunks(job_id, byterange, chunk_number):
    glacier = boto3.client('glacier')
    response = glacier.get_job_output(
        vaultName=vault_name,
        jobId=job_id,
        range=byterange
    )

    # Stream the body to disk in 1 MiB pieces instead of holding the whole chunk in memory.
    with open(os.path.join(output_file_path, str(chunk_number) + ".chunk"), 'wb') as output_file:
        for piece in response['body'].iter_chunks(chunk_size=1024 * 1024):
            output_file.write(piece)

    return response

def ReceiveArchiveReadyMessages(notify_queue_url,chunk_queue_url):

    response = sqs.receive_message(
        QueueUrl=notify_queue_url,
        AttributeNames=['All'],
        MaxNumberOfMessages=1,
        WaitTimeSeconds=20,
        MessageAttributeNames=['Message']
    )
    print("Polling archive retrieval job ready queue...")
    # Checking that there is a Messages key before proceeding. No 'Messages' key likely means the queue is empty

    if 'Messages' in response:
        print("Received a message from the archive retrieval job queue")
        jsonresponse = response
        # Loading the string into JSON and checking that ArchiveSizeInBytes key is present before continuing.
        jsonresponse=json.loads(jsonresponse['Messages'][0]['Body'])
        jsonresponse=json.loads(jsonresponse['Message'])
        if 'ArchiveSizeInBytes' in jsonresponse:
            receipt_handle = response['Messages'][0]['ReceiptHandle']
            if jsonresponse['ArchiveSizeInBytes']:
                archive_size = jsonresponse['ArchiveSizeInBytes']

                print(f'Received message: {response}')
                # Queue byte ranges for download. An archive smaller than
                # chunk_size produces a single range covering the whole archive.
                split_and_send_chunks(archive_size, jsonresponse['JobId'], chunk_queue_url)

                sqs.delete_message(
                    QueueUrl=notify_queue_url,
                    ReceiptHandle=receipt_handle
                )

            else:
                print("No ArchiveSizeInBytes value found in message")
                print(response)

    else:
        print('No messages available in the queue at this time.')

    time.sleep(1)

def ReceiveArchiveChunkMessages(chunk_queue_url):
    response = sqs.receive_message(
        QueueUrl=chunk_queue_url,
        AttributeNames=['All'],
        MaxNumberOfMessages=1,
        WaitTimeSeconds=0,
        MessageAttributeNames=['Message']
    )
    print("Polling archive chunk queue...")
    print(response)
    # Checking that there is a Messages key before proceeding. No 'Messages' key likely means the queue is empty
    if 'Messages' in response:
        jsonresponse = response
        # Loading the message body into JSON; it carries the job ID, byte range, and chunk number.
        jsonresponse=json.loads(jsonresponse['Messages'][0]['Body'])
        if 'job_id' in jsonresponse: #checking that there is a job id before continuing
            job_id = jsonresponse['job_id']
            byterange = "bytes="+str(jsonresponse['start']) + '-' + str(jsonresponse['end'])
            chunk_number = jsonresponse['chunk_number']
            receipt_handle = response['Messages'][0]['ReceiptHandle']
            if jsonresponse['job_id']:
                print(f'Received message: {response}')
                GetJobOutputChunks(job_id,byterange,chunk_number)
                sqs.delete_message(
                QueueUrl=chunk_queue_url,
                ReceiptHandle=receipt_handle)
    else:
        print('No messages available in the chunk queue at this time.')

def initiate_archive_retrieval(archive_id, topic_arn):
    glacier = boto3.client('glacier')

    job_parameters = {
        "Type": "archive-retrieval",
        "ArchiveId": archive_id,
        "Description": "Archive retrieval job",
        "SNSTopic": topic_arn,
        "Tier": "Bulk"  # You can change this to "Standard" or "Expedited" based on your needs
    }

    try:
        response = glacier.initiate_job(
            vaultName=vault_name,
            jobParameters=job_parameters
        )

        print("Archive retrieval job initiated:")
        print(f"Job ID: {response['jobId']}")
        print(f"Job parameters: {job_parameters}")
        print(f"Complete response: {json.dumps(response, indent=2)}")

        return response['jobId']

    except Exception as e:
        print(f"Error initiating archive retrieval job: {str(e)}")
        raise

def run_async_tasks(chunk_queue_url, workers):
    max_workers = workers  # Set the desired maximum number of concurrent tasks
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        for _ in range(max_workers):
            executor.submit(ReceiveArchiveChunkMessages, chunk_queue_url)

# One time setup of the necessary queues and topics. 
queue_and_topic_atts = setup_queues_and_topic()

topic_arn = queue_and_topic_atts['topic_arn']
notify_queue_url = queue_and_topic_atts['notify_queue_url']
chunk_queue_url = queue_and_topic_atts['chunk_queue_url']

if retrieve_archive:
    print("Retrieving the defined archive... The topic arn we will notify when recalling the archive is: "+topic_arn)
    job_id = initiate_archive_retrieval(archive_id, topic_arn)
else:
    print("Retrieve archive is false, polling queues and downloading only.")

while True:
    ReceiveArchiveReadyMessages(notify_queue_url, chunk_queue_url)
    run_async_tasks(chunk_queue_url, workers)
```
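
The script decodes each job-completion message twice: the Amazon SQS body is an Amazon SNS envelope, and the envelope's `Message` field is itself a JSON string describing the job. The following minimal sketch isolates that step. The function name `parse_job_notification` and the sample body are assumptions for illustration; the sample contains only the fields the script reads, not a complete notification.

```python
import json


def parse_job_notification(sqs_body):
    """Unwrap an SNS-delivered SQS body and return (job_id, archive_size)."""
    envelope = json.loads(sqs_body)        # outer SNS envelope
    job = json.loads(envelope["Message"])  # inner job description, stored as a JSON string
    return job.get("JobId"), job.get("ArchiveSizeInBytes")


# Hypothetical sample body with only the fields used by the script.
sample = json.dumps({
    "Type": "Notification",
    "Message": json.dumps({"JobId": "example-job-id", "ArchiveSizeInBytes": 2500}),
})
print(parse_job_notification(sample))
```

Using `dict.get` here returns `None` for a missing field rather than raising, which mirrors the script's checks for `ArchiveSizeInBytes` before it queues any chunks.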

## Using the Script
<a name="downloading-large-archive-python-usage"></a>

To use this script, follow these steps:

1. Replace the placeholder values in the script with your specific information (listed here by variable name):
   + {{output\_file\_path}}: Directory where chunk files will be saved
   + {{vault\_name}}: Name of your S3 Glacier vault
   + {{notify\_queue\_name}}: Name for the job notification queue
   + {{chunk\_download\_queue\_name}}: Name for the chunk download queue
   + {{sns\_topic\_name}}: Name for the SNS topic
   + {{region}}: AWS Region where your vault is located
   + {{archive\_id}}: ID of the archive to retrieve

1. Run the script.

   ```
   python download_large_archive.py
   ```

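1. After all chunks are downloaded, you can combine them into a single file. Chunk files are named by chunk number, so concatenate them in numeric order. For example, assuming a `sort` that supports `-V` (such as GNU `sort`):

   ```
   ls /path/to/chunks/*.chunk | sort -V | xargs cat > complete_archive.file
   ```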
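
Because chunk files are named by chunk number (`1.chunk`, `2.chunk`, and so on), a plain shell glob lists `10.chunk` before `2.chunk`. As a portable alternative, the following minimal sketch joins the chunks in numeric order with Python. The function name `join_chunks` is an assumption for illustration, and the sketch assumes every file matching `N.chunk` in the directory belongs to the same archive.

```python
import os
import shutil


def join_chunks(chunk_dir, output_path):
    """Concatenate N.chunk files from chunk_dir into output_path in numeric order."""
    suffix = ".chunk"
    names = [
        name for name in os.listdir(chunk_dir)
        if name.endswith(suffix) and name[:-len(suffix)].isdigit()
    ]
    # Sort by the integer chunk number, not lexicographically.
    names.sort(key=lambda name: int(name[:-len(suffix)]))
    with open(output_path, "wb") as out:
        for name in names:
            with open(os.path.join(chunk_dir, name), "rb") as part:
                shutil.copyfileobj(part, out)
    return names
```

For example, `join_chunks("/path/to/chunks", "complete_archive.file")` writes the combined file and returns the chunk file names in the order they were appended.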

## Important Considerations
<a name="downloading-large-archive-python-considerations"></a>

When using this script, keep the following in mind:
+ Archive retrieval from S3 Glacier can take several hours to complete, depending on the retrieval tier selected.
+ The script runs indefinitely, continuously polling the queues. You might want to add a termination condition based on your specific requirements.
+ Make sure that you have sufficient disk space to store all the chunks of your archive.
+ If the script is interrupted, you can restart it with `retrieve_archive=False` to continue downloading chunks without initiating a new retrieval job.
+ Adjust the `chunk_size` and `workers` parameters based on your network bandwidth and system resources.
+ Standard AWS charges apply for S3 Glacier retrievals, Amazon SNS, and Amazon SQS usage.
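
One possible termination condition is to stop polling once every expected chunk file exists on disk with the expected size. The following minimal sketch shows such a check. The function name `missing_chunks` is an assumption for illustration, and it relies on the `N.chunk` naming and the chunk arithmetic used by the script on this page.

```python
import os


def missing_chunks(chunk_dir, archive_size, chunk_size):
    """Return the chunk numbers whose N.chunk file is absent or has the wrong size."""
    total = (archive_size + chunk_size - 1) // chunk_size  # ceiling division
    missing = []
    for number in range(1, total + 1):
        # Every chunk is chunk_size bytes except possibly the last one.
        expected = min(chunk_size, archive_size - (number - 1) * chunk_size)
        path = os.path.join(chunk_dir, f"{number}.chunk")
        if not os.path.isfile(path) or os.path.getsize(path) != expected:
            missing.append(number)
    return missing
```

A polling loop could call this after each round of downloads and exit when the returned list is empty. Comparing file sizes also flags chunks that were only partially written before an interruption.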