

# Tutorials
<a name="examples-blocks"></a>

[Block](API_Block.md) objects that are returned from Amazon Textract operations contain the results of text detection and text analysis operations, such as [AnalyzeDocument](API_AnalyzeDocument.md). The following Python tutorials show some of the different ways that you can use Block objects. For example, you can export table information to a comma-separated values (CSV) file.

The tutorials use synchronous Amazon Textract operations that return all results. If you want to use asynchronous operations such as [StartDocumentAnalysis](API_StartDocumentAnalysis.md), you need to change the example code to handle multiple batches of returned `Block` objects. To use the asynchronous example, ensure that you have followed the instructions at [Configuring Amazon Textract for Asynchronous Operations](api-async-roles.md).

For examples that show you other ways to use Amazon Textract, see [Additional Code Samples](other-examples.md).

**Topics**
+ [Prerequisites](#examples-prerequisites)
+ [Extracting Key-Value Pairs from a Form Document](examples-extract-kvp.md)
+ [Exporting Tables into a CSV File](examples-export-table-csv.md)
+ [Detecting text with an AWS Lambda function](lambda.md)
+ [Extracting and Sending Text to Amazon Comprehend for Analysis](textract-to-comprehend.md)
+ [Additional Code Samples](other-examples.md)

## Prerequisites
<a name="examples-prerequisites"></a>

Before you can run the examples in this section, you have to configure your environment. 

**To configure your environment**

1. Give a user the `AmazonTextractFullAccess` permissions. For more information, see [Step 1: Set Up an AWS Account and Create a User](setting-up.md).

1. Install and configure the AWS CLI and the AWS SDKs. For more information, see [Step 2: Set Up the AWS CLI and AWS SDKs](setup-awscli-sdk.md).
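After configuration, you can verify your environment from the command line. The following commands are a quick check; `aws configure` prompts for your access key, secret key, default Region, and output format:

```shell
# Set or review credentials and a default Region (interactive prompts).
aws configure
# Confirm that the configured credentials resolve to your user.
aws sts get-caller-identity
```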

# Extracting Key-Value Pairs from a Form Document
<a name="examples-extract-kvp"></a>

The following Python example shows how to extract key-value pairs in form documents from [Block](API_Block.md) objects that are stored in a map. Block objects are returned from a call to [AnalyzeDocument](API_AnalyzeDocument.md). For more information, see [Form Data (Key-Value Pairs)](how-it-works-kvp.md).

You use the following functions: 
+ `get_kv_map` – Calls [AnalyzeDocument](API_AnalyzeDocument.md), and stores the KEY and VALUE BLOCK objects in a map.
+ `get_kv_relationship` and `find_value_block` – Construct the key-value relationships from the maps.

**To extract key-value pairs from a form document**

1. Configure your environment. For more information, see [Prerequisites](examples-blocks.md#examples-prerequisites).

1. Save the following example code to a file named *textract_python_kv_parser.py*. In the function `get_kv_map`, replace `profile-name` with the name of a profile that can call Amazon Textract, and replace `region` with the AWS Region in which you want to run the code.

   ```
   import boto3
   import sys
   import re
   import json
   from collections import defaultdict
   
   
   def get_kv_map(file_name):
       with open(file_name, 'rb') as file:
           img_test = file.read()
           bytes_test = bytearray(img_test)
           print('Image loaded', file_name)
   
       # process using image bytes
       session = boto3.Session(profile_name='profile-name')
       client = session.client('textract', region_name='region')
       response = client.analyze_document(Document={'Bytes': bytes_test}, FeatureTypes=['FORMS'])
   
       # Get the text blocks
       blocks = response['Blocks']
   
       # get key and value maps
       key_map = {}
       value_map = {}
       block_map = {}
       for block in blocks:
           block_id = block['Id']
           block_map[block_id] = block
           if block['BlockType'] == "KEY_VALUE_SET":
               if 'KEY' in block['EntityTypes']:
                   key_map[block_id] = block
               else:
                   value_map[block_id] = block
   
       return key_map, value_map, block_map
   
   
   def get_kv_relationship(key_map, value_map, block_map):
       kvs = defaultdict(list)
       for block_id, key_block in key_map.items():
           value_block = find_value_block(key_block, value_map)
           key = get_text(key_block, block_map)
           val = get_text(value_block, block_map)
           kvs[key].append(val)
       return kvs
   
   
   def find_value_block(key_block, value_map):
       # Default to an empty block so a key without a VALUE relationship
       # yields empty text instead of an unbound variable.
       value_block = {}
       for relationship in key_block['Relationships']:
           if relationship['Type'] == 'VALUE':
               for value_id in relationship['Ids']:
                   value_block = value_map[value_id]
       return value_block
   
   
   def get_text(result, blocks_map):
       text = ''
       if 'Relationships' in result:
           for relationship in result['Relationships']:
               if relationship['Type'] == 'CHILD':
                   for child_id in relationship['Ids']:
                       word = blocks_map[child_id]
                       if word['BlockType'] == 'WORD':
                           text += word['Text'] + ' '
                       if word['BlockType'] == 'SELECTION_ELEMENT':
                           if word['SelectionStatus'] == 'SELECTED':
                               text += 'X '
   
       return text
   
   
   def print_kvs(kvs):
       for key, value in kvs.items():
           print(key, ":", value)
   
   
   def search_value(kvs, search_key):
       for key, value in kvs.items():
           if re.search(search_key, key, re.IGNORECASE):
               return value
   
   
   def main(file_name):
       key_map, value_map, block_map = get_kv_map(file_name)
   
       # Get Key Value relationship
       kvs = get_kv_relationship(key_map, value_map, block_map)
       print("\n\n== FOUND KEY : VALUE pairs ===\n")
       print_kvs(kvs)
   
       # Start searching a key value
       while input('\n Do you want to search a value for a key? (enter "n" to exit) ') != 'n':
           search_key = input('\n Enter a search key:')
           print('The value is:', search_value(kvs, search_key))
   
   if __name__ == "__main__":
       file_name = sys.argv[1]
       main(file_name)
   ```

1. At the command prompt, enter the following command. Replace `file` with the document image file that you want to analyze.

   ```
   python textract_python_kv_parser.py file
   ```

1. When you're prompted, enter a key that's in the input document. If the code detects the key, it displays the key's value. 
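To see how the code above walks the `KEY_VALUE_SET` relationships without calling Amazon Textract, you can run the same map-building logic against a hand-made response fragment. The block IDs and text below are hypothetical; a real `AnalyzeDocument` response has the same shape:

```python
from collections import defaultdict

# A minimal, hand-made AnalyzeDocument response fragment (hypothetical IDs).
blocks = [
    {'Id': 'key-1', 'BlockType': 'KEY_VALUE_SET', 'EntityTypes': ['KEY'],
     'Relationships': [{'Type': 'VALUE', 'Ids': ['val-1']},
                       {'Type': 'CHILD', 'Ids': ['word-1']}]},
    {'Id': 'val-1', 'BlockType': 'KEY_VALUE_SET', 'EntityTypes': ['VALUE'],
     'Relationships': [{'Type': 'CHILD', 'Ids': ['word-2']}]},
    {'Id': 'word-1', 'BlockType': 'WORD', 'Text': 'Name:'},
    {'Id': 'word-2', 'BlockType': 'WORD', 'Text': 'Jane'},
]

block_map = {b['Id']: b for b in blocks}
key_map = {b['Id']: b for b in blocks
           if b['BlockType'] == 'KEY_VALUE_SET' and 'KEY' in b['EntityTypes']}
value_map = {b['Id']: b for b in blocks
             if b['BlockType'] == 'KEY_VALUE_SET' and 'VALUE' in b['EntityTypes']}

def get_text(block, block_map):
    # Concatenate the WORD children of a KEY or VALUE block.
    text = ''
    for rel in block.get('Relationships', []):
        if rel['Type'] == 'CHILD':
            for cid in rel['Ids']:
                child = block_map[cid]
                if child['BlockType'] == 'WORD':
                    text += child['Text'] + ' '
    return text

kvs = defaultdict(list)
for key_block in key_map.values():
    for rel in key_block['Relationships']:
        if rel['Type'] == 'VALUE':
            for vid in rel['Ids']:
                kvs[get_text(key_block, block_map)].append(
                    get_text(value_map[vid], block_map))

print(dict(kvs))  # {'Name: ': ['Jane ']}
```

Each KEY block points to its VALUE block through a `VALUE` relationship, and both point to their `WORD` children through `CHILD` relationships; the trailing spaces come from the word-concatenation step, just as in the full example.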

# Exporting Tables into a CSV File
<a name="examples-export-table-csv"></a>

These Python examples show how to export tables from an image of a document into a comma-separated values (CSV) file.

The example for synchronous document analysis collects table information from a call to [AnalyzeDocument](API_AnalyzeDocument.md). The example for asynchronous document analysis makes a call to [StartDocumentAnalysis](API_StartDocumentAnalysis.md) and then retrieves the results from [GetDocumentAnalysis](API_GetDocumentAnalysis.md) as `Block` objects.

Table information is returned as [Block](API_Block.md) objects from a call to [AnalyzeDocument](API_AnalyzeDocument.md). For more information, see [Tables](how-it-works-tables.md). The `Block` objects are stored in a map structure that's used to export the table data into a CSV file. 

------
#### [ Synchronous ]

In this example, you use the following functions: 
+ `get_table_csv_results` – Calls [AnalyzeDocument](API_AnalyzeDocument.md), and builds a map of tables that are detected in the document. Creates a CSV representation of all detected tables.
+ `generate_table_csv` – Generates the CSV file for an individual table.
+ `get_rows_columns_map` – Gets the rows and columns from the map.
+ `get_text` – Gets the text from a cell.

**To export tables into a CSV file**

1. Configure your environment. For more information, see [Prerequisites](examples-blocks.md#examples-prerequisites).

1. Save the following example code to a file named *textract_python_table_parser.py*. In the function `get_table_csv_results`, replace `profile-name` with the name of a profile that can call Amazon Textract, and replace `region` with the AWS Region in which you want to run the code.

   ```
   import webbrowser, os
   import json
   import boto3
   import io
   from io import BytesIO
   import sys
   from pprint import pprint
   
   
   def get_rows_columns_map(table_result, blocks_map):
       rows = {}
       scores = []
       for relationship in table_result['Relationships']:
           if relationship['Type'] == 'CHILD':
               for child_id in relationship['Ids']:
                   cell = blocks_map[child_id]
                   if cell['BlockType'] == 'CELL':
                       row_index = cell['RowIndex']
                       col_index = cell['ColumnIndex']
                       if row_index not in rows:
                           # create new row
                           rows[row_index] = {}
                       
                       # get confidence score
                       scores.append(str(cell['Confidence']))
                           
                       # get the text value
                       rows[row_index][col_index] = get_text(cell, blocks_map)
       return rows, scores
   
   
   def get_text(result, blocks_map):
       text = ''
       if 'Relationships' in result:
           for relationship in result['Relationships']:
               if relationship['Type'] == 'CHILD':
                   for child_id in relationship['Ids']:
                       word = blocks_map[child_id]
                       if word['BlockType'] == 'WORD':
                           if "," in word['Text'] and word['Text'].replace(",", "").isnumeric():
                               text += '"' + word['Text'] + '"' + ' '
                           else:
                               text += word['Text'] + ' '
                       if word['BlockType'] == 'SELECTION_ELEMENT':
                           if word['SelectionStatus'] =='SELECTED':
                               text +=  'X '
       return text
   
   
   def get_table_csv_results(file_name):
   
       with open(file_name, 'rb') as file:
           img_test = file.read()
           bytes_test = bytearray(img_test)
           print('Image loaded', file_name)
   
       # process using image bytes
       # get the results
       session = boto3.Session(profile_name='profile-name')
       client = session.client('textract', region_name='region')
       response = client.analyze_document(Document={'Bytes': bytes_test}, FeatureTypes=['TABLES'])
   
       # Get the text blocks
       blocks=response['Blocks']
       pprint(blocks)
   
       blocks_map = {}
       table_blocks = []
       for block in blocks:
           blocks_map[block['Id']] = block
           if block['BlockType'] == "TABLE":
               table_blocks.append(block)
   
       if len(table_blocks) <= 0:
           return "<b> NO Table FOUND </b>"
   
       csv = ''
       for index, table in enumerate(table_blocks):
           csv += generate_table_csv(table, blocks_map, index +1)
           csv += '\n\n'
   
       return csv
   
   def generate_table_csv(table_result, blocks_map, table_index):
       rows, scores = get_rows_columns_map(table_result, blocks_map)
   
       table_id = 'Table_' + str(table_index)
       
       # get cells.
       csv = 'Table: {0}\n\n'.format(table_id)
   
       for row_index, cols in rows.items():
           col_indices = len(cols.items())
           for col_index, text in cols.items():
               csv += '{}'.format(text) + ","
           csv += '\n'
           
       csv += '\n\n Confidence Scores % (Table Cell) \n'
       cols_count = 0
       for score in scores:
           cols_count += 1
           csv += score + ","
           if cols_count == col_indices:
               csv += '\n'
               cols_count = 0
   
       csv += '\n\n\n'
       return csv
   
   def main(file_name):
       table_csv = get_table_csv_results(file_name)
   
       output_file = 'output.csv'
   
       # replace content
       with open(output_file, "wt") as fout:
           fout.write(table_csv)
   
       # show the results
       print('CSV OUTPUT FILE: ', output_file)
   
   
   if __name__ == "__main__":
       file_name = sys.argv[1]
       main(file_name)
   ```

1. At the command prompt, enter the following command. Replace `file` with the name of the document image file that you want to analyze.

   ```
   python textract_python_table_parser.py file
   ```

When you run the example, the CSV output is saved in a file named `output.csv`.
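To see how the `RowIndex` and `ColumnIndex` fields on `CELL` blocks drive the CSV layout, you can run the same row/column mapping against a hand-made response fragment. The block IDs and cell text below are hypothetical; a real `AnalyzeDocument` response has the same structure:

```python
# Hand-made TABLE/CELL/WORD blocks (hypothetical IDs). The TABLE block lists
# its CELL children; each CELL carries RowIndex and ColumnIndex and lists
# its WORD children.
blocks = [
    {'Id': 't1', 'BlockType': 'TABLE',
     'Relationships': [{'Type': 'CHILD', 'Ids': ['c1', 'c2', 'c3', 'c4']}]},
    {'Id': 'c1', 'BlockType': 'CELL', 'RowIndex': 1, 'ColumnIndex': 1,
     'Relationships': [{'Type': 'CHILD', 'Ids': ['w1']}]},
    {'Id': 'c2', 'BlockType': 'CELL', 'RowIndex': 1, 'ColumnIndex': 2,
     'Relationships': [{'Type': 'CHILD', 'Ids': ['w2']}]},
    {'Id': 'c3', 'BlockType': 'CELL', 'RowIndex': 2, 'ColumnIndex': 1,
     'Relationships': [{'Type': 'CHILD', 'Ids': ['w3']}]},
    {'Id': 'c4', 'BlockType': 'CELL', 'RowIndex': 2, 'ColumnIndex': 2,
     'Relationships': [{'Type': 'CHILD', 'Ids': ['w4']}]},
    {'Id': 'w1', 'BlockType': 'WORD', 'Text': 'Item'},
    {'Id': 'w2', 'BlockType': 'WORD', 'Text': 'Qty'},
    {'Id': 'w3', 'BlockType': 'WORD', 'Text': 'Apples'},
    {'Id': 'w4', 'BlockType': 'WORD', 'Text': '3'},
]

block_map = {b['Id']: b for b in blocks}
table = block_map['t1']

# Build the same {row: {column: text}} map that get_rows_columns_map builds.
rows = {}
for rel in table['Relationships']:
    if rel['Type'] == 'CHILD':
        for cid in rel['Ids']:
            cell = block_map[cid]
            word_ids = cell['Relationships'][0]['Ids']
            text = ' '.join(block_map[w]['Text'] for w in word_ids)
            rows.setdefault(cell['RowIndex'], {})[cell['ColumnIndex']] = text

csv = '\n'.join(','.join(cols[c] for c in sorted(cols))
                for _, cols in sorted(rows.items()))
print(csv)
# Item,Qty
# Apples,3
```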

------
#### [ Asynchronous ]

In this example, you use two different scripts. The first script starts asynchronous analysis of a document with `StartDocumentAnalysis` and gets the `Block` information returned by `GetDocumentAnalysis`. The second script takes the returned `Block` information for each page, formats the data as a table, and saves the tables to a CSV file.

**To export tables into a CSV file**

1. Configure your environment. For more information, see [Prerequisites](examples-blocks.md#examples-prerequisites).

1. Ensure that you have followed the instructions given at [Configuring Amazon Textract for Asynchronous Operations](api-async-roles.md). The process documented on that page enables you to send and receive messages about the completion status of asynchronous jobs.

1. In the following code example, replace the value of `roleArn` with the ARN assigned to the role that you created in Step 2. Replace the value of `bucket` with the name of the Amazon S3 bucket that contains your document, and the value of `document` with the name of the document in that bucket. Replace the value of `region_name` with the AWS Region of your bucket.

   Save the following example code to a file named *start_doc_analysis_for_table_extraction.py*.

   ```
   import boto3
   import time
   
   class DocumentProcessor:
   
       jobId = ''
       region_name = ''
   
       roleArn = ''
       bucket = ''
       document = ''
   
       sqsQueueUrl = ''
       snsTopicArn = ''
       processType = ''
   
       def __init__(self, role, bucket, document, region):
           self.roleArn = role
           self.bucket = bucket
           self.document = document
           self.region_name = region
   
           self.textract = boto3.client('textract', region_name=self.region_name)
           self.sqs = boto3.client('sqs')
           self.sns = boto3.client('sns')
   
       def ProcessDocument(self):
   
           jobFound = False
   
           response = self.textract.start_document_analysis(DocumentLocation={'S3Object': {'Bucket': self.bucket, 'Name': self.document}},
                   FeatureTypes=["TABLES", "FORMS"], NotificationChannel={'RoleArn': self.roleArn, 'SNSTopicArn': self.snsTopicArn})
           print('Processing type: Analysis')
   
           print('Start Job Id: ' + response['JobId'])
   
           print('Done!')
   
       def CreateTopicandQueue(self):
   
           millis = str(int(round(time.time() * 1000)))
   
           # Create SNS topic
           snsTopicName = "AmazonTextractTopic" + millis
   
           topicResponse = self.sns.create_topic(Name=snsTopicName)
           self.snsTopicArn = topicResponse['TopicArn']
   
           # create SQS queue
           sqsQueueName = "AmazonTextractQueue" + millis
           self.sqs.create_queue(QueueName=sqsQueueName)
           self.sqsQueueUrl = self.sqs.get_queue_url(QueueName=sqsQueueName)['QueueUrl']
   
           attribs = self.sqs.get_queue_attributes(QueueUrl=self.sqsQueueUrl,
                                                   AttributeNames=['QueueArn'])['Attributes']
   
           sqsQueueArn = attribs['QueueArn']
   
           # Subscribe SQS queue to SNS topic
           self.sns.subscribe(TopicArn=self.snsTopicArn, Protocol='sqs', Endpoint=sqsQueueArn)
   
           # Authorize SNS to write SQS queue
           policy = """{{
      "Version":"2012-10-17",
         "Statement":[
           {{
             "Sid":"MyPolicy",
             "Effect":"Allow",
             "Principal" : {{"AWS" : "*"}},
             "Action":"SQS:SendMessage",
             "Resource": "{}",
             "Condition":{{
               "ArnEquals":{{
                 "aws:SourceArn": "{}"
               }}
             }}
           }}
         ]
       }}""".format(sqsQueueArn, self.snsTopicArn)
   
           response = self.sqs.set_queue_attributes(
               QueueUrl=self.sqsQueueUrl,
               Attributes={
                   'Policy': policy
               })
   
   def main():
       roleArn = 'role-arn'
       bucket = 'bucket-name'
       document = 'document-name'
       region_name = 'region-name'
   
       analyzer = DocumentProcessor(roleArn, bucket, document, region_name)
       analyzer.CreateTopicandQueue()
       analyzer.ProcessDocument()
   
   if __name__ == "__main__":
       main()
   ```

1. Run the code. The code prints a job ID (`JobId`). Copy this value; you need it in the next step.

1. Wait for your job to finish processing. After it has finished, copy the following code to a file named *get_doc_analysis_for_table_extraction.py*. Replace the value of `jobId` with the job ID that you copied earlier. Replace the value of `region_name` with the name of the AWS Region associated with your Amazon Textract role. Replace the value of `file_name` with the name that you want to give the output CSV file.

   ```
   import boto3
   from pprint import pprint
   
   jobId = ''
   region_name = ''
   file_name = ''
   
   textract = boto3.client('textract', region_name=region_name)
   
   # Display information about a block
   def DisplayBlockInfo(block):
       print("Block Id: " + block['Id'])
       print("Type: " + block['BlockType'])
       if 'EntityTypes' in block:
           print('EntityTypes: {}'.format(block['EntityTypes']))
   
       if 'Text' in block:
           print("Text: " + block['Text'])
   
       if block['BlockType'] != 'PAGE':
           print("Confidence: " + "{:.2f}".format(block['Confidence']) + "%")
   
   def GetResults(jobId, file_name):
       maxResults = 1000
       paginationToken = None
       finished = False
   
       while finished == False:
   
           response = None
   
           if paginationToken == None:
               response = textract.get_document_analysis(JobId=jobId, MaxResults=maxResults)
           else:
               response = textract.get_document_analysis(JobId=jobId, MaxResults=maxResults,
                                                              NextToken=paginationToken)
   
           blocks = response['Blocks']
           table_csv = get_table_csv_results(blocks)
           output_file = file_name + ".csv"
           # append the results for this batch of Block objects
           with open(output_file, "at") as fout:
               fout.write(table_csv)
           # show the results
           print('Detected Document Text')
           print('Pages: {}'.format(response['DocumentMetadata']['Pages']))
           print('OUTPUT TO CSV FILE: ', output_file)
   
           # Display block information
           for block in blocks:
               DisplayBlockInfo(block)
               print()
               print()
   
           if 'NextToken' in response:
               paginationToken = response['NextToken']
           else:
               finished = True
   
   
   def get_rows_columns_map(table_result, blocks_map):
       rows = {}
       for relationship in table_result['Relationships']:
           if relationship['Type'] == 'CHILD':
               for child_id in relationship['Ids']:
                   try:
                       cell = blocks_map[child_id]
                       if cell['BlockType'] == 'CELL':
                           row_index = cell['RowIndex']
                           col_index = cell['ColumnIndex']
                           if row_index not in rows:
                               # create new row
                               rows[row_index] = {}
   
                           # get the text value
                           rows[row_index][col_index] = get_text(cell, blocks_map)
                   except KeyError as err:
                       print("Error extracting Table data: {}".format(err))
       return rows
   
   
   def get_text(result, blocks_map):
       text = ''
       if 'Relationships' in result:
           for relationship in result['Relationships']:
               if relationship['Type'] == 'CHILD':
                   for child_id in relationship['Ids']:
                       try:
                           word = blocks_map[child_id]
                           if word['BlockType'] == 'WORD':
                               text += word['Text'] + ' '
                           if word['BlockType'] == 'SELECTION_ELEMENT':
                               if word['SelectionStatus'] == 'SELECTED':
                                   text += 'X '
                       except KeyError as err:
                           print("Error extracting Table data: {}".format(err))
   
       return text
   
   
   def get_table_csv_results(blocks):
   
       pprint(blocks)
   
       blocks_map = {}
       table_blocks = []
       for block in blocks:
           blocks_map[block['Id']] = block
           if block['BlockType'] == "TABLE":
               table_blocks.append(block)
   
       if len(table_blocks) <= 0:
           return "<b> NO Table FOUND </b>"
   
       csv = ''
       for index, table in enumerate(table_blocks):
           csv += generate_table_csv(table, blocks_map, index + 1)
           csv += '\n\n'
           # In order to generate separate CSV file for every table, uncomment code below
           #inner_csv = ''
           #inner_csv += generate_table_csv(table, blocks_map, index + 1)
           #inner_csv += '\n\n'
           #output_file = file_name + "___" + str(index) + ".csv"
           # replace content
           #with open(output_file, "at") as fout:
           #    fout.write(inner_csv)
   
       return csv
   
   
   def generate_table_csv(table_result, blocks_map, table_index):
       rows = get_rows_columns_map(table_result, blocks_map)
   
       table_id = 'Table_' + str(table_index)
   
       # get cells.
       csv = 'Table: {0}\n\n'.format(table_id)
   
       for row_index, cols in rows.items():
   
           for col_index, text in cols.items():
               csv += '{}'.format(text) + ","
           csv += '\n'
   
       csv += '\n\n\n'
       return csv
   
   GetResults(jobId, file_name)
   ```

1. Run the code.

   After you have obtained your results, be sure to delete the associated SNS and SQS resources, or you might accrue charges for them.
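The procedure above leaves two manual gaps: waiting for the job to finish and deleting the messaging resources afterward. Both can be scripted. The following sketch assumes the queue URL and topic ARN that `CreateTopicandQueue` created; SNS wraps the Amazon Textract notification (fields such as `JobId` and `Status`) in an envelope whose `Message` key holds a JSON string:

```python
import json


def parse_notification(sqs_message_body):
    # The Textract fields (JobId, Status, API, ...) arrive as a JSON string
    # under the SNS envelope's 'Message' key.
    notification = json.loads(json.loads(sqs_message_body)['Message'])
    return notification['JobId'], notification['Status']


def wait_for_job(sqs, queue_url, job_id):
    # Long-poll the queue until the completion message for job_id arrives.
    while True:
        resp = sqs.receive_message(QueueUrl=queue_url,
                                   MaxNumberOfMessages=1,
                                   WaitTimeSeconds=20)
        for message in resp.get('Messages', []):
            found_id, status = parse_notification(message['Body'])
            sqs.delete_message(QueueUrl=queue_url,
                               ReceiptHandle=message['ReceiptHandle'])
            if found_id == job_id:
                return status  # 'SUCCEEDED' or 'FAILED'


def delete_topic_and_queue(sns, sqs, topic_arn, queue_url):
    # Deleting the topic also removes its subscriptions.
    sns.delete_topic(TopicArn=topic_arn)
    sqs.delete_queue(QueueUrl=queue_url)

# Example wiring (ARN, URL, and job ID are placeholders):
# import boto3
# sqs = boto3.client('sqs', region_name='region-name')
# sns = boto3.client('sns', region_name='region-name')
# status = wait_for_job(sqs, 'queue-url', 'job-id')
# delete_topic_and_queue(sns, sqs, 'topic-arn', 'queue-url')
```

The clients are passed in as parameters so the helpers are easy to exercise without AWS credentials.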

------

# Detecting text with an AWS Lambda function
<a name="lambda"></a>

AWS Lambda is a compute service that you can use to run code without provisioning or managing servers. You can call Amazon Textract API operations from within an AWS Lambda function. The following instructions show how to create a Lambda function in Python that calls [DetectDocumentText](API_DetectDocumentText.md). 

The Lambda function returns a list of [Block](API_Block.md) objects with information about the detected words and lines of text. The instructions include example Python code that shows you how to call the Lambda function with a document supplied from an Amazon S3 bucket or your local computer. Images stored in Amazon S3 must be in single-page PDF or TIFF document format, or in JPEG or PNG format. Local images must be in single-page PDF or TIFF format. The Python code returns part of the JSON response for each Block type detected in the document.

For an example that uses Lambda functions to process documents at a large scale, see [Amazon Textract IDP CDK Constructs](https://github.com/aws-samples/amazon-textract-idp-cdk-constructs/) and [Use machine learning to automate and process documents at scale](https://s12d.com/aws-idp-scale-workshop).

**Topics**
+ [Step 1: Create an AWS Lambda function (console)](#example-lambda-create-function)
+ [Step 2: (Optional) Create a layer (console)](#example-lambda-create-layer)
+ [Step 3: Add Python code (console)](#example-lambda-add-code)
+ [Step 4: Try your Lambda function](#example-lambda-test)

## Step 1: Create an AWS Lambda function (console)
<a name="example-lambda-create-function"></a>

In this step, you create an empty AWS Lambda function and an IAM execution role that lets your function call the `DetectDocumentText` operation. If you are supplying documents from Amazon S3, this step also shows you how to grant access to the bucket that stores your documents.

Later you add the source code and optionally add a layer to the Lambda function.

**To create an AWS Lambda function (console)**

1. Sign in to the AWS Management Console and open the AWS Lambda console at [https://console.aws.amazon.com/lambda/](https://console.aws.amazon.com/lambda/).

1. Choose **Create function**. For more information, see [Create a Lambda Function with the Console](https://docs.aws.amazon.com/lambda/latest/dg/getting-started-create-function.html).

1. Choose the following options:
   + Choose **Author from scratch**. 
   + Enter a value for **Function name**.
   + For **Runtime**, choose **Python 3.9**.
   + For **Architecture**, choose **x86_64**.

1. Choose **Create function** to create the AWS Lambda function.

1. On the function page, choose the **Configuration** tab.

1. On the **Permissions** pane, under **Execution role**, choose the role name to open the role in the IAM console.

1. In the **Permissions** tab, choose **Add permissions** and then **Create inline policy**.

1. Choose the **JSON** tab and replace the policy with the following policy:

------
#### [ JSON ]


   ```
   {
       "Version":"2012-10-17",
       "Statement": [
           {
               "Action": "textract:DetectDocumentText",
               "Resource": "*",
               "Effect": "Allow",
               "Sid": "DetectDocumentText"
           }
       ]
   }
   ```

------

1. Choose **Review policy**.

1. Enter a name for the policy, for example *DetectDocumentText-access*.

1. Choose **Create policy**.

1. If you are storing documents for analysis in an Amazon S3 bucket, you must add an Amazon S3 access policy. To do this, repeat steps 7 to 11 in the AWS Lambda console and make the following changes. 

   1. For step 8, use the following policy. Replace *bucket/folder path* with the Amazon S3 bucket and folder path to the documents that you want to analyze. 

------
#### [ JSON ]


      ```
      {
          "Version":"2012-10-17",
          "Statement": [
              {
                  "Sid": "S3Access",
                  "Effect": "Allow",
                  "Action": "s3:GetObject",
                  "Resource": "arn:aws:s3:::bucket/folder path/*"
              }
          ]
      }
      ```

------

   1. For step 10, choose a different policy name, such as *S3Bucket-access*.

## Step 2: (Optional) Create a layer (console)
<a name="example-lambda-create-layer"></a>

To run this example, you don't need to perform this step. The `DetectDocumentText` operation is included in the default Lambda Python environment as part of AWS SDK for Python (Boto3). If other parts of your Lambda function require recent AWS service updates that aren't in the default Lambda Python environment, then perform this step to add the most recent Boto3 SDK release as a layer to your function. 

First, you create a zip file archive that contains the Boto3 SDK. Then, you create a layer and add the zip file archive to the layer. For more information, see [Using layers with your Lambda function](https://docs.aws.amazon.com/lambda/latest/dg/invocation-layers.html#invocation-layers-using).

**To create and add a layer (console)**

1. Open a command prompt and enter the following commands to create a deployment package with the most recent version of the AWS SDK.

   ```
   pip install boto3 --target python/.
   zip boto3-layer.zip -r python/
   ```

1. Note the name of the zip file (boto3-layer.zip), which you use in step 8 of this procedure.

1. Open the AWS Lambda console at [https://console.aws.amazon.com/lambda/](https://console.aws.amazon.com/lambda/).

1. In the navigation pane, choose **Layers**. 

1. Choose **Create layer**.

1. Enter values for **Name** and **Description**.

1. For **Code entry type**, choose **Upload a .zip file** and select **Upload**.

1. In the dialog box, choose the zip file archive (boto3-layer.zip) that you created in step 1 of this procedure.

1. For **Compatible runtimes**, choose **Python 3.9**.

1. Choose **Create** to create the layer.

1. Choose the navigation pane menu icon.

1. In the navigation pane, choose **Functions**.

1. In the resources list, choose the function that you created previously in [Step 1: Create an AWS Lambda function (console)](#example-lambda-create-function). 

1. Choose the **Code** tab.

1. In the **Layers** section, choose **Add a layer**.

1. Choose **Custom layers**.

1. In **Custom layers**, choose the layer name that you entered in step 6. 

1. In **Version**, choose the layer version, which should be 1.

1. Choose **Add**.

## Step 3: Add Python code (console)
<a name="example-lambda-add-code"></a>

In this step, you add Python code to your Lambda function by using the Lambda console code editor. The code detects text in a document with `DetectDocumentText` and returns a list of Block objects with information about the detected text. The document can be located in an Amazon S3 bucket or a local computer. Images stored in Amazon S3 must be single-page PDF or TIFF format documents or in JPEG or PNG format. Local images must be in single-page PDF or TIFF format. 

**To add Python code (console)**

1. Navigate to the **Code** tab.

1. In the code editor, replace the code in **lambda_function.py** with the following code: 

   ```
   # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
   # SPDX-License-Identifier: Apache-2.0
   
   """
   Purpose
   An AWS lambda function that analyzes documents with Amazon Textract.
   """
   import json
   import base64
   import logging
   import boto3
   
   from botocore.exceptions import ClientError
   
   # Set up logging.
   logger = logging.getLogger(__name__)
   
   # Get the boto3 client.
   textract_client = boto3.client('textract')
   
   
   def lambda_handler(event, context):
       """
       Lambda handler function
       param: event: The event object for the Lambda function.
       param: context: The context object for the lambda function.
       return: The list of Block objects recognized in the document
       passed in the event object.
       """
   
       try:
   
           # Determine document source.
           if 'image' in event:
               # Decode the image
               image_bytes = event['image'].encode('utf-8')
               img_b64decoded = base64.b64decode(image_bytes)
               image = {'Bytes': img_b64decoded}
   
   
           elif 'S3Object' in event:
               image = {'S3Object':
                        {'Bucket':  event['S3Object']['Bucket'],
                         'Name': event['S3Object']['Name']}
                        }
   
           else:
               raise ValueError(
                'Invalid source. Only base64-encoded image bytes or an S3Object are supported.')
   
   
           # Analyze the document.
           response = textract_client.detect_document_text(Document=image)
   
           # Get the Blocks
           blocks = response['Blocks']
   
           lambda_response = {
               "statusCode": 200,
               "body": json.dumps(blocks)
           }
   
       except ClientError as err:
           error_message = "Couldn't analyze image. " + \
               err.response['Error']['Message']
   
           lambda_response = {
               'statusCode': 400,
               'body': {
                   "Error": err.response['Error']['Code'],
                   "ErrorMessage": error_message
               }
           }
           logger.error("Error function %s: %s",
               context.invoked_function_arn, error_message)
   
       except ValueError as val_error:
           lambda_response = {
               'statusCode': 400,
               'body': {
                   "Error": "ValueError",
                   "ErrorMessage": format(val_error)
               }
           }
           logger.error("Error function %s: %s",
               context.invoked_function_arn, format(val_error))
   
       return lambda_response
   ```

1. Choose **Deploy** to deploy your Lambda function.

## Step 4: Try your Lambda function
<a name="example-lambda-test"></a>

Now that you’ve created your Lambda function, you can invoke it to detect text in a document. In this step, you use Python code on your computer to pass a local document or a document in an Amazon S3 bucket to your Lambda function. Documents passed from a local computer must be smaller than 6291456 bytes (6 MB). If your documents are larger, upload them to an Amazon S3 bucket and call the script with the Amazon S3 path to the document. For information about uploading files to an Amazon S3 bucket, see [Uploading objects](https://docs.aws.amazon.com/AmazonS3/latest/userguide/upload-objects.html).
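
As a quick guard before invoking the function, you might check the file size locally. The following sketch is illustrative (the helper name and constant are not part of the tutorial code) and compares against the 6291456-byte limit mentioned above:

```
import os

# Size limit for documents passed directly to the Lambda function (6 MB).
MAX_PAYLOAD_BYTES = 6291456

def fits_in_lambda_payload(file_path):
    """Returns True if a local document is small enough to send
    to the Lambda function; larger files should go through Amazon S3."""
    return os.path.getsize(file_path) < MAX_PAYLOAD_BYTES
```

Note that base64 encoding grows the payload by roughly a third, so files close to the limit may still exceed the maximum invocation payload once encoded.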

Make sure you run the code in the same [AWS Region](https://docs.aws.amazon.com/general/latest/gr/rande.html) in which you created the Lambda function. You can view the AWS Region for your Lambda function in the navigation bar of the function details page in the [Lambda console](https://console.aws.amazon.com/lambda/).

If the AWS Lambda function returns a timeout error, extend the timeout period for the Lambda function. For more information, see [Configuring function timeout (console)](https://docs.aws.amazon.com/lambda/latest/dg/configuration-function-common.html#configuration-timeout-console).

For more information about invoking a Lambda function from your code, see [Invoking AWS Lambda Functions](https://docs.aws.amazon.com/lambda/latest/dg/invoking-lambda-functions.html).

**To try your Lambda function**

1. If you haven't already done so, do the following:

   1. Make sure that the user has `lambda:InvokeFunction` permission for your Lambda function. 

      You can get the ARN for your Lambda function from the function overview in the [Lambda console](https://console.aws.amazon.com/lambda/).

      To provide access, add permissions to your users, groups, or roles:
      + Users and groups in AWS IAM Identity Center:

        Create a permission set. Follow the instructions in [Create a permission set](https://docs.aws.amazon.com/singlesignon/latest/userguide/howtocreatepermissionset.html) in the *AWS IAM Identity Center User Guide*.
      + Users managed in IAM through an identity provider:

        Create a role for identity federation. Follow the instructions in [Create a role for a third-party identity provider (federation)](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_create_for-idp.html) in the *IAM User Guide*.
      + IAM users:
        + Create a role that your user can assume. Follow the instructions in [Create a role for an IAM user](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_create_for-user.html) in the *IAM User Guide*.
        + (Not recommended) Attach a policy directly to a user or add a user to a user group. Follow the instructions in [Adding permissions to a user (console)](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_users_change-permissions.html#users_change_permissions-add-console) in the *IAM User Guide*.

   1. Install and configure the AWS SDK for Python. For more information, see [Step 2: Set Up the AWS CLI and AWS SDKs](setup-awscli-sdk.md).

1. Save the following code to a file named `client.py`: 

   ```
   # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
   # SPDX-License-Identifier: Apache-2.0
   
   """
   Purpose
   Test code for running the Amazon Textract Lambda
   function example code.
   """
   
   import argparse
   import logging
   import base64
   import json
   import boto3
   
   from botocore.exceptions import ClientError
   
   
   logger = logging.getLogger(__name__)
   
   
   def analyze_image(function_name, image):
       """Analyzes a document with an AWS Lambda function.
       :param function_name: The name of the AWS Lambda function.
       :param image: The document that you want to analyze.
       :return: The list of Block objects in JSON format.
       """
   
       lambda_client = boto3.client('lambda')
   
       lambda_payload = {}
   
       if image.startswith('s3://'):
           logger.info("Analyzing document from S3 bucket: %s", image)
           bucket, key = image.replace("s3://", "").split("/", 1)
           s3_object = {
               'Bucket': bucket,
               'Name': key
           }
   
           lambda_payload = {"S3Object": s3_object}
   
       else:
           with open(image, 'rb') as image_file:
               logger.info("Analyzing local document: %s ", image)
               image_bytes = image_file.read()
               data = base64.b64encode(image_bytes).decode("utf8")
   
               lambda_payload = {"image": data}
   
       # Call the lambda function with the document.
   
       response = lambda_client.invoke(FunctionName=function_name,
                                       Payload=json.dumps(lambda_payload))
   
       return json.loads(response['Payload'].read().decode())
   
   
   def add_arguments(parser):
       """
       Adds command line arguments to the parser.
       :param parser: The command line parser.
       """
   
       parser.add_argument(
           "function", help="The name of the AWS Lambda function that you want " \
           "to use to analyze the document.")
       parser.add_argument(
           "image", help="The document that you want to analyze.") 
   
   
   def main():
       """
       Entrypoint for script.
       """
       try:
           logging.basicConfig(level=logging.INFO,
                               format="%(levelname)s: %(message)s")
   
           # Get command line arguments.
           parser = argparse.ArgumentParser(usage=argparse.SUPPRESS)
           add_arguments(parser)
           args = parser.parse_args()
   
           # Get analysis results.
           result = analyze_image(args.function, args.image)
           status = result['statusCode']
   
           blocks = result['body']
           blocks = json.loads(blocks)
   
           if status == 200:
   
               for block in blocks:
                   print('Type: ' + block['BlockType'])
                   if block['BlockType'] != 'PAGE':
                       print('Detected: ' + block['Text'])
                       print('Confidence: ' + "{:.2f}".format(block['Confidence']) + "%")
   
                   print('Id: {}'.format(block['Id']))
                   if 'Relationships' in block:
                       print('Relationships: {}'.format(block['Relationships']))
                   print('Bounding Box: {}'.format(block['Geometry']['BoundingBox']))
                   print('Polygon: {}'.format(block['Geometry']['Polygon']))
                   print()
               print("Blocks detected: " + str(len(blocks)))
           else:
               print(f"Error: {result['statusCode']}")
               print(f"Message: {result['body']}")
   
       except ClientError as error:
           logging.error(error)
           print(error)
   
   
   if __name__ == "__main__":
       main()
   ```

1. Run the code. For the command line arguments, supply the Lambda function name and the document that you want to analyze. You can supply a path to a local document, or you can use the Amazon S3 path to a document stored in an Amazon S3 bucket. For example:

   ```
   python client.py function_name s3://bucket/path/document.jpg
   ```

   If the document is in an Amazon S3 bucket, make sure that it is the same bucket that you specified previously in step 12 of [Step 1: Create an AWS Lambda function (console)](#example-lambda-create-function).

   If successful, your code prints details from the JSON response for each Block object detected in the document.
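
Once you have the parsed list of Block objects, you can post-process it however you like. As a minimal sketch (the helper name is illustrative, not part of the tutorial code), the following pulls just the detected lines of text out of the list:

```
def get_detected_lines(blocks):
    """Returns the text of every LINE block from a list of Block
    dictionaries, such as the parsed body of the Lambda response."""
    return [block['Text'] for block in blocks
            if block.get('BlockType') == 'LINE']
```

For example, calling it on the `blocks` list produced in `client.py` yields one string per detected line.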

# Extracting and Sending Text to AWS Comprehend for Analysis
<a name="textract-to-comprehend"></a>

Amazon Textract lets you include document text detection and analysis in your applications. With Amazon Textract, you can extract text from a variety of document types using both synchronous and asynchronous document processing. The extracted text can then be saved to a file or database, or sent to another AWS service for further processing. 

In this tutorial you carry out a common end-to-end workflow. This workflow involves:
+ Processing numerous input documents with Amazon Textract
+ Providing the extracted text to Amazon Comprehend for analysis
+ Saving both the analyzed text and the analysis data to an Amazon Simple Storage Service (Amazon S3) bucket

You use the [AWS SDK for Python](https://aws.amazon.com/sdk-for-python/) for this tutorial. You can also see the AWS Documentation SDK examples [GitHub repo](https://github.com/awsdocs/aws-doc-sdk-examples) for more Python tutorials. 

## Prerequisites
<a name="tutorial-prerequisites"></a>

Before you begin this tutorial, you’ll need to install Python and complete the steps required to [set up the Python AWS SDK](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/quickstart.html). Beyond this, ensure that you have:
+ [Created an AWS account and an IAM role](https://docs.aws.amazon.com/rekognition/latest/dg/setting-up.html)
+ [Properly configured your AWS access credentials](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-quickstart.html)
+ [Created an Amazon S3 bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/create-bucket-overview.html)
+ [Configured Amazon Textract for asynchronous processing](https://docs.aws.amazon.com/textract/latest/dg/api-async-roles.html), copying down the Amazon Resource Name (ARN) of the IAM role you configured for use with Amazon Textract
+ [Granted your IAM role access to Amazon Comprehend](https://docs.aws.amazon.com/comprehend/latest/dg/security-iam.html#security_iam_access-manage) 
+ Selected a few documents for text extraction and analysis, and uploaded those documents to Amazon S3. Ensure that the files you select for analysis are in formats supported by Amazon Textract.
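
When selecting files to upload, a small helper like the following can screen out unsupported formats before they reach your bucket. This is an illustrative sketch; the extension set covers the PDF, TIFF, JPEG, and PNG formats that Amazon Textract supports:

```
import os

# File extensions for the document formats Amazon Textract supports.
SUPPORTED_EXTENSIONS = {".pdf", ".tif", ".tiff", ".jpg", ".jpeg", ".png"}

def is_supported_document(path):
    """Returns True if the file extension matches a format
    that Amazon Textract can process."""
    return os.path.splitext(path)[1].lower() in SUPPORTED_EXTENSIONS
```

You could filter your local file list with this check before uploading each file with `s3.upload_file`.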

## Starting Asynchronous Document Text Detection
<a name="tutorial-step-1"></a>

You can extract the text from your documents and then analyze the extracted text with a service like Amazon Comprehend. Amazon Textract supports the extraction of text from large, multipage documents through asynchronous operations. Processing a PDF file asynchronously lets your application complete other tasks while it waits for the process to complete. This section demonstrates how to import your documents from an Amazon S3 bucket and provide them to Textract’s asynchronous text detection operation. 

This tutorial assumes that you will be using Amazon S3 to store the files you want to extract text from. You’ll start by creating a class and functions that detect the text in your input documents. Your application will need to connect to the Textract client, as well as the Amazon SQS and Amazon SNS clients, to monitor the completion status of the asynchronous job.

1. Start by writing the code to create an Amazon SNS topic and Amazon SQS queue.

   The following code sample creates a `DocumentProcessor` class that connects to the three required services and then creates both an Amazon SQS queue and Amazon SNS topic. The Amazon SNS topic is used to provide information about the job completion status to an Amazon SQS queue, which will be polled to obtain the completion status of a job. There are also methods to delete the Amazon SQS queue and Amazon SNS topic once the job has been completed and the resources are no longer needed.

   ```
   import boto3
   import json
   import sys
   import time
   
   class DocumentProcessor:
   
       jobId = ''
       region_name = ''
   
       roleArn = ''
       bucket = ''
       document = ''
   
       sqsQueueUrl = ''
       snsTopicArn = ''
       processType = ''
   
       def __init__(self, role, bucket, document, region):
           self.roleArn = role
           self.bucket = bucket
           self.document = document
           self.region_name = region
   
           # Instantiates necessary AWS clients.
           # Replace 'profile-name' with the name of your AWS CLI profile.
           session = boto3.Session(profile_name='profile-name',
                                   region_name=self.region_name)
           self.textract = session.client('textract', region_name=self.region_name)
           self.sqs = session.client('sqs', region_name=self.region_name)
           self.sns = session.client('sns', region_name=self.region_name)
   
       def CreateTopicandQueue(self):
   
           millis = str(int(round(time.time() * 1000)))
   
           # Create SNS topic
           snsTopicName = "AmazonTextractTopic" + millis
   
           topicResponse = self.sns.create_topic(Name=snsTopicName)
           self.snsTopicArn = topicResponse['TopicArn']
   
           # create SQS queue
           sqsQueueName = "AmazonTextractQueue" + millis
           self.sqs.create_queue(QueueName=sqsQueueName)
           self.sqsQueueUrl = self.sqs.get_queue_url(QueueName=sqsQueueName)['QueueUrl']
   
           attribs = self.sqs.get_queue_attributes(QueueUrl=self.sqsQueueUrl,
                                                   AttributeNames=['QueueArn'])['Attributes']
   
           sqsQueueArn = attribs['QueueArn']
   
           # Subscribe SQS queue to SNS topic
           self.sns.subscribe(
               TopicArn=self.snsTopicArn,
               Protocol='sqs',
               Endpoint=sqsQueueArn)
   
           # Authorize SNS to write SQS queue
           policy = """{{
     "Version": "2012-10-17",
     "Statement":[
       {{
         "Sid":"MyPolicy",
         "Effect":"Allow",
         "Principal" : {{"AWS" : "*"}},
         "Action":"SQS:SendMessage",
         "Resource": "{}",
         "Condition":{{
           "ArnEquals":{{
             "aws:SourceArn": "{}"
           }}
         }}
       }}
     ]
   }}""".format(sqsQueueArn, self.snsTopicArn)
   
           response = self.sqs.set_queue_attributes(
               QueueUrl=self.sqsQueueUrl,
               Attributes={
                   'Policy': policy
               })
   
       def DeleteTopicandQueue(self):
           self.sqs.delete_queue(QueueUrl=self.sqsQueueUrl)
           self.sns.delete_topic(TopicArn=self.snsTopicArn)
   ```

1. Write the code to call the `StartDocumentTextDetection` operation and get the results of the operation.

   The `DocumentProcessor` class will also need methods to: 
   + Call the `StartDocumentTextDetection` operation
   + Poll an Amazon SQS queue for the job completion status
   + Retrieve the results of the job once it is done processing

   The following code creates the `ProcessDocument` and `GetResults` methods, which call `StartDocumentTextDetection` and get the extracted text, respectively.

   ```
       def ProcessDocument(self):
   
           # Tracks whether the matching job has been found
           jobFound = False
   
           # Starts the text detection operation on the document in the provided bucket.
           # Sends status to the supplied SNS topic ARN.
           response = self.textract.start_document_text_detection(
               DocumentLocation={'S3Object': {'Bucket': self.bucket, 'Name': self.document}},
               NotificationChannel={'RoleArn': self.roleArn, 'SNSTopicArn': self.snsTopicArn})
           print('Processing type: Detection')
   
           print('Start Job Id: ' + response['JobId'])
           dotLine = 0
           while jobFound == False:
               sqsResponse = self.sqs.receive_message(QueueUrl=self.sqsQueueUrl,
                                                      MessageAttributeNames=['ALL'],
                                                      MaxNumberOfMessages=10)
   
               # Waits until messages are found in the SQS queue
               if sqsResponse:
                   if 'Messages' not in sqsResponse:
                       if dotLine < 40:
                           print('.', end='')
                           dotLine = dotLine + 1
                       else:
                           print()
                           dotLine = 0
                       sys.stdout.flush()
                       time.sleep(5)
                       continue
   
                   # Checks for a completed job that matches the JobId in the response from
                   # StartDocumentTextDetection
                   for message in sqsResponse['Messages']:
                       notification = json.loads(message['Body'])
                       textMessage = json.loads(notification['Message'])
                       if str(textMessage['JobId']) == response['JobId']:
                           print('Matching Job Found:' + textMessage['JobId'])
                           jobFound = True
                           text_data = self.GetResults(textMessage['JobId'])
                           self.sqs.delete_message(QueueUrl=self.sqsQueueUrl,
                                                   ReceiptHandle=message['ReceiptHandle'])
                           return text_data
                       else:
                           print("Job didn't match:" +
                                 str(textMessage['JobId']) + ' : ' + str(response['JobId']))
                           # Delete the unknown message. Consider sending to a dead-letter queue.
                           self.sqs.delete_message(QueueUrl=self.sqsQueueUrl,
                                                   ReceiptHandle=message['ReceiptHandle'])
   
           print('Done!')
   
       # Gets the results of the completed text detection job.
       # Checks for pagination tokens to determine whether the input document has multiple pages.
       def GetResults(self, jobId):
           maxResults = 1000
           paginationToken = None
           finished = False
   
           # List to hold the detected text across all pages of results
           detected_text = []
   
           while finished == False:
               if paginationToken == None:
                   response = self.textract.get_document_text_detection(JobId=jobId,
                                                                        MaxResults=maxResults)
               else:
                   response = self.textract.get_document_text_detection(JobId=jobId,
                                                                        MaxResults=maxResults,
                                                                        NextToken=paginationToken)
   
               blocks = response['Blocks']
   
               # Add the text from each LINE block to the list
               for block in blocks:
                   if 'Text' in block and block['BlockType'] == "LINE":
                       detected_text.append(block['Text'])
   
               # If the response contains a next token, update the pagination token
               if 'NextToken' in response:
                   paginationToken = response['NextToken']
               else:
                   finished = True
   
           return detected_text
   ```

1. Save the above code in a file called `detectFileAsync.py`.

   You use this file in the next section to handle the detection of text in your input documents.

## Processing Your Documents and Sending the Text to Comprehend
<a name="tutorial-step-2"></a>

Your application will use the class you created in the preceding section to:
+ read documents from your Amazon S3 bucket
+ extract the text in those documents
+ send the text to Amazon Comprehend for analysis

You start by creating some functions that use Amazon Comprehend to analyze the text detected in your input documents. A common type of text analysis is sentiment analysis, which aims to determine the sentiment of a statement (whether it is positive, negative, or neutral). You can also carry out entity detection and key phrase detection on the data.

The following code takes the detected text and invokes the Amazon Comprehend `BatchDetectSentiment` operation to carry out sentiment analysis.

1. Write the code to carry out sentiment analysis on your detected text.

   ```
   from detectFileAsync import DocumentProcessor
   import boto3
   import pandas as pd
   
   # Detect sentiment
   def sentiment_analysis(detected_text, lang):
   
       comprehend = boto3.client("comprehend")
   
       detect_sent_response = comprehend.batch_detect_sentiment(
               TextList=detected_text, LanguageCode=lang)
   
       # Lists to hold sentiment labels and sentiment scores
       sentiments = []
       pos_score = []
       neg_score = []
       neutral_score = []
       mixed_score = []
   
       # For all results, add the sentiment label and sentiment scores to the lists
       for res in detect_sent_response['ResultList']:
           sentiments.append(res['Sentiment'])
           for key, val in res['SentimentScore'].items():
               if key == "Positive":
                   pos_score.append(val)
               if key == "Negative":
                   neg_score.append(val)
               if key == "Neutral":
                   neutral_score.append(val)
               if key == "Mixed":
                   mixed_score.append(val)
   
       return sentiments, pos_score, neg_score, neutral_score, mixed_score
   ```

   You may also want to perform other analysis operations, such as entity detection or key phrase detection, on your detected text. You can write functions to carry out these analysis operations on your text, just as you did for the preceding sentiment analysis operation.

1. Write the code to carry out entity detection on your detected text.

   ```
   # detect entities
   def entity_detection(detected_text, lang):
   
       comprehend = boto3.client("comprehend")
   
       # Detect entities with the BatchDetectEntities operation
       detect_ent_response = comprehend.batch_detect_entities(
           TextList=detected_text, LanguageCode=lang)
   
       # Lists to hold detected entities and entity types
       ents = []
       types = []
   
       # Get detected entities and types from the response returned by Comprehend
       for i in detect_ent_response['ResultList']:
           if len(i['Entities']) == 0:
               ents.append("N/A")
               types.append("N/A")
           else:
               sentence_ents = []
               sentence_types = []
               for entities in i['Entities']:
                   sentence_ents.append(entities['Text'])
                   sentence_types.append(entities['Type'])
               ents.append(sentence_ents)
               types.append(sentence_types)
   
       return ents, types
   ```

1. Write the code to carry out key phrase detection on your detected text.

   ```
   # Detect key phrases
   def key_phrases_detection(detected_text, lang):
   
       comprehend = boto3.client("comprehend")
   
       key_phrases = []
       detect_phrases_response = comprehend.batch_detect_key_phrases(
           TextList=detected_text, LanguageCode=lang)
       for i in detect_phrases_response['ResultList']:
           if len(i['KeyPhrases']) == 0:
               key_phrases.append("N/A")
           else:
               phrases = []
               for phrase in i['KeyPhrases']:
                   phrases.append(phrase['Text'])
               key_phrases.append(phrases)
   
       return key_phrases
   ```

   You need to create a function that invokes all of the code you’ve created so far. The function will use the `DocumentProcessor` class you created in your `detectFileAsync.py` file, and then save the detected text to a variable for input into the three functions you previously wrote that use Amazon Comprehend. The function will also need to construct a pandas DataFrame, into which the detected text and analysis data will be inserted. Finally, the DataFrame will be saved as a CSV file.

1. Write the code to process your input documents with Textract and pass the detected text to Comprehend.

   ```
   def process_document(roleArn, bucket, document, region_name):
   
       # Create an analyzer from the DocumentProcessor class, create a topic and queue,
       # use Textract to get the text, then delete the topic and queue
       analyzer = DocumentProcessor(roleArn, bucket, document, region_name)
       analyzer.CreateTopicandQueue()
       extracted_text = analyzer.ProcessDocument()
       analyzer.DeleteTopicandQueue()
   
       # Detect the dominant language by sampling the first few lines of text
       comprehend = boto3.client("comprehend")
       response = comprehend.detect_dominant_language(Text=str(extracted_text[:10]))
       lang = ""
       for i in response['Languages']:
           lang = i['LanguageCode']
       print("Detected language: " + lang)
   
       # Or you can enter a language code directly
       # lang = "en"
   
       print("Lines in detected text: " + str(len(extracted_text)))
   
       # Comprehend batch operations accept at most 25 documents per request,
       # so split the detected text into slices of up to 25 lines
       sliced_list = []
       start = 0
       end = 25
       while start < len(extracted_text):
           sliced_list.append(extracted_text[start:end])
           start += 25
           end += 25
       print(sliced_list)
   
       # Create lists to hold analytics data, these will be turned into columns
       all_sents = []
       all_scores = []
       all_ents = []
       all_types = []
       all_key_phrases = []
       all_pos_ratings = []
       all_neg_ratings = []
       all_neutral_ratings = []
       all_mixed_ratings = []
   
       # For every slice, get sentiment analysis, entity detection and key phrases, append results to lists
       for slice in sliced_list:
           slice_labels, pos_ratings, neg_ratings, neutral_ratings, mixed_ratings = sentiment_analysis(slice, lang)
           all_sents.append(slice_labels)
           all_pos_ratings.append(pos_ratings)
           all_neg_ratings.append(neg_ratings)
           all_neutral_ratings.append(neutral_ratings)
           all_mixed_ratings.append(mixed_ratings)
           slice_ents, slice_types = entity_detection(slice, lang)
           all_ents.append(slice_ents)
           all_types.append(slice_types)
           key_phrases = key_phrases_detection(slice, lang)
           all_key_phrases.append(key_phrases)
   
       # List comprehensions to flatten the lists of lists into single lists
       extracted_text = [line for sublist in sliced_list for line in sublist]
       all_sents = [sent for sublist in all_sents for sent in sublist]
       all_ents = [ents for sublist in all_ents for ents in sublist]
       all_types = [types for sublist in all_types for types in sublist]
       all_key_phrases = [kp for sublist in all_key_phrases for kp in sublist]
       all_mixed_ratings = [rating for sublist in all_mixed_ratings for rating in sublist]
       all_pos_ratings = [rating for sublist in all_pos_ratings for rating in sublist]
       all_neg_ratings = [rating for sublist in all_neg_ratings for rating in sublist]
       all_neutral_ratings = [rating for sublist in all_neutral_ratings for rating in sublist]
   
       print(len(extracted_text))
       print(len(all_sents))
       print(len(all_ents))
       print(len(all_types))
       print(len(all_key_phrases))
   
       print("List of Recognized Entities:")
   
       # Create dataframe and save as CSV
       df = pd.DataFrame({'Sentences':extracted_text, 'Sentiment':all_sents, 'SentPosScore':all_pos_ratings,
                          'SentNegScore':all_neg_ratings, 'SentNeutralScore':all_neutral_ratings, 'SentMixedRatings':all_mixed_ratings,
                          'Entities':all_ents, 'EntityTypes':all_types, 'KeyPhrases':all_key_phrases})
       analysis_results = str(document.replace(".","_") + "_" + "analysis" + ".csv")
       df.to_csv(analysis_results, index=False)
   
       print(df)
       print("Data written to file!")
   
       return extracted_text, analysis_results
   ```

1. Write the code to process your documents and upload the resulting data to Amazon S3. In the code sample below, replace the value of `roleArn` with the ARN of the IAM role you configured for use with Amazon Textract. Replace the value of `region_name` with the AWS Region your account operates in. Finally, replace the value of `bucket_name` with the name of the Amazon S3 bucket containing your documents.

   ```
   def main():
   
       # Initialize S3 client and set RoleArn, region name, and bucket name
       s3 = boto3.client("s3")
       roleArn = ''
       region_name = ''
       bucket_name = ''
   
       # initialize global corpus
       full_corpus = []
   
       # to hold all docs in bucket
       docs_list = []
   
       # loop through docs in bucket, get names of all docs
       s3_resource = boto3.resource("s3")
       bucket = s3_resource.Bucket(bucket_name)
       for bucket_object in bucket.objects.all():
           docs_list.append(bucket_object.key)
       print(docs_list)
   
       # For all the docs in the bucket, invoke the document processing function,
       # add the detected text to the corpus of all text in the batch of docs,
       # and upload a CSV of the Comprehend analysis data and Textract-detected text to S3
       for i in docs_list:
           detected_text, analysis_results = process_document(roleArn, bucket_name, i, region_name)
           full_corpus.append(detected_text)
           print("Uploading file: {}".format(str(analysis_results)))
           name_of_file = str(analysis_results)
           s3.upload_file(name_of_file, bucket_name, name_of_file)
   
       # print the global corpus
       print(full_corpus)
   
   if __name__ == "__main__":
       main()
   ```

1. Put the preceding code in this section into a Python file and run it. 

You have successfully extracted text using Amazon Textract, sent the text to Amazon Comprehend for analysis, and then saved the results in an Amazon S3 bucket.
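
As a quick sanity check on the results, you can download one of the `*_analysis.csv` files and tally its sentiment labels. This sketch is illustrative (the helper name is not part of the tutorial code) and assumes the `Sentiment` column name used by the DataFrame above:

```
import csv

def summarize_analysis_csv(csv_path):
    """Counts the rows per sentiment label in an analysis CSV
    produced by the tutorial script."""
    counts = {}
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            label = row.get("Sentiment", "UNKNOWN")
            counts[label] = counts.get(label, 0) + 1
    return counts
```

For example, after downloading a results file with `s3.download_file`, calling this helper on it returns a dictionary such as `{'POSITIVE': 12, 'NEUTRAL': 8}`.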

# Additional Code Samples
<a name="other-examples"></a>

The following table provides links to more Amazon Textract code examples.


| Example | Description | 
| --- | --- | 
|  [Amazon Textract Code Samples](https://github.com/aws-samples/amazon-textract-code-samples)  |  Shows various ways in which you can use Amazon Textract.  | 
|  [Large scale document processing with Amazon Textract](https://github.com/aws-samples/amazon-textract-serverless-large-scale-document-processing)  |  Shows a serverless reference architecture that processes documents at a large scale.  | 
|  [Amazon Textract Parser](https://github.com/aws-samples/amazon-textract-response-parser)  |  Shows how to parse the [Block](API_Block.md) objects returned by Amazon Textract operations.  | 
|  [Amazon Textract Documentation Code Examples](https://github.com/awsdocs/aws-doc-sdk-examples/tree/master/python/example_code/textract)  |  Code examples used in this guide.  | 
|  [Textractor](https://github.com/aws-samples/amazon-textract-textractor)  |  Shows how to convert Amazon Textract output into multiple formats.  | 
|  [Generate Searchable PDF documents with Amazon Textract](https://github.com/aws-samples/amazon-textract-searchable-pdf)  |  Shows how to create a searchable PDF document from different types of input documents such as JPG/PNG format images and scanned PDF documents.  | 