

# Tutorials
<a name="examples-blocks"></a>

[Block](API_Block.md) objects that are returned from Amazon Textract operations contain the results of text detection and text analysis operations, such as [AnalyzeDocument](API_AnalyzeDocument.md). The following Python tutorials show some of the different ways that you can use Block objects. For example, you can export table information to a comma-separated values (CSV) file.

The tutorials use synchronous Amazon Textract operations that return all results. If you want to use asynchronous operations such as [StartDocumentAnalysis](API_StartDocumentAnalysis.md), you need to change the example code to handle multiple batches of returned `Block` objects. To use the asynchronous example, ensure that you have followed the instructions at [Configuring Amazon Textract for Asynchronous Operations](api-async-roles.md).

For examples that show you other ways to use Amazon Textract, see [Additional Code Samples](other-examples.md).

**Topics**
+ [Prerequisites](#examples-prerequisites)
+ [Extracting Key-Value Pairs from a Form Document](examples-extract-kvp.md)
+ [Exporting Tables into a CSV File](examples-export-table-csv.md)
+ [Detecting text with an AWS Lambda function](lambda.md)
+ [Extracting and Sending Text to Amazon Comprehend for Analysis](textract-to-comprehend.md)
+ [Additional Code Samples](other-examples.md)

## Prerequisites
<a name="examples-prerequisites"></a>

Before you can run the examples in this section, you have to configure your environment. 

**To configure your environment**

1. Give a user the `AmazonTextractFullAccess` permissions. For more information, see [Step 1: Set Up an AWS Account and Create a User](setting-up.md).

1. Install and configure the AWS CLI and the AWS SDKs. For more information, see [Step 2: Set Up the AWS CLI and AWS SDKs](setup-awscli-sdk.md).
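After configuration, you can verify your environment from the command line. The following commands are a quick check; `aws configure` prompts for your access key, secret key, default Region, and output format:

```shell
# Set or review credentials and a default Region (interactive prompts).
aws configure
# Confirm that the configured credentials resolve to your user.
aws sts get-caller-identity
```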

# Extracting Key-Value Pairs from a Form Document
<a name="examples-extract-kvp"></a>

The following Python example shows how to extract key-value pairs in form documents from [Block](API_Block.md) objects that are stored in a map. Block objects are returned from a call to [AnalyzeDocument](API_AnalyzeDocument.md). For more information, see [Form Data (Key-Value Pairs)](how-it-works-kvp.md).

You use the following functions: 
+ `get_kv_map` – Calls [AnalyzeDocument](API_AnalyzeDocument.md), and stores the KEY and VALUE BLOCK objects in a map.
+ `get_kv_relationship` and `find_value_block` – Construct the key-value relationships from the maps.

**To extract key-value pairs from a form document**

1. Configure your environment. For more information, see [Prerequisites](examples-blocks.md#examples-prerequisites).

1. Save the following example code to a file named *textract_python_kv_parser.py*. In the function `get_kv_map`, replace `profile-name` with the name of a profile that can call Amazon Textract, and replace `region` with the AWS Region in which you want to run the code.

   ```
   import boto3
   import sys
   import re
   import json
   from collections import defaultdict
   
   
   def get_kv_map(file_name):
       with open(file_name, 'rb') as file:
           img_test = file.read()
           bytes_test = bytearray(img_test)
           print('Image loaded', file_name)
   
       # process using image bytes
       session = boto3.Session(profile_name='profile-name')
       client = session.client('textract', region_name='region')
       response = client.analyze_document(Document={'Bytes': bytes_test}, FeatureTypes=['FORMS'])
   
       # Get the text blocks
       blocks = response['Blocks']
   
       # get key and value maps
       key_map = {}
       value_map = {}
       block_map = {}
       for block in blocks:
           block_id = block['Id']
           block_map[block_id] = block
           if block['BlockType'] == "KEY_VALUE_SET":
               if 'KEY' in block['EntityTypes']:
                   key_map[block_id] = block
               else:
                   value_map[block_id] = block
   
       return key_map, value_map, block_map
   
   
   def get_kv_relationship(key_map, value_map, block_map):
       kvs = defaultdict(list)
       for block_id, key_block in key_map.items():
           value_block = find_value_block(key_block, value_map)
           key = get_text(key_block, block_map)
           val = get_text(value_block, block_map)
           kvs[key].append(val)
       return kvs
   
   
   def find_value_block(key_block, value_map):
       # Default to an empty block so a key without a VALUE relationship
       # yields empty text instead of an unbound variable.
       value_block = {}
       for relationship in key_block['Relationships']:
           if relationship['Type'] == 'VALUE':
               for value_id in relationship['Ids']:
                   value_block = value_map[value_id]
       return value_block
   
   
   def get_text(result, blocks_map):
       text = ''
       if 'Relationships' in result:
           for relationship in result['Relationships']:
               if relationship['Type'] == 'CHILD':
                   for child_id in relationship['Ids']:
                       word = blocks_map[child_id]
                       if word['BlockType'] == 'WORD':
                           text += word['Text'] + ' '
                       if word['BlockType'] == 'SELECTION_ELEMENT':
                           if word['SelectionStatus'] == 'SELECTED':
                               text += 'X '
   
       return text
   
   
   def print_kvs(kvs):
       for key, value in kvs.items():
           print(key, ":", value)
   
   
   def search_value(kvs, search_key):
       for key, value in kvs.items():
           if re.search(search_key, key, re.IGNORECASE):
               return value
   
   
   def main(file_name):
       key_map, value_map, block_map = get_kv_map(file_name)
   
       # Get Key Value relationship
       kvs = get_kv_relationship(key_map, value_map, block_map)
       print("\n\n== FOUND KEY : VALUE pairs ===\n")
       print_kvs(kvs)
   
       # Start searching a key value
       while input('\n Do you want to search a value for a key? (enter "n" to exit) ') != 'n':
           search_key = input('\n Enter a search key:')
           print('The value is:', search_value(kvs, search_key))
   
   if __name__ == "__main__":
       file_name = sys.argv[1]
       main(file_name)
   ```

1. At the command prompt, enter the following command. Replace `file` with the document image file that you want to analyze.

   ```
   python textract_python_kv_parser.py file
   ```

1. When you're prompted, enter a key that's in the input document. If the code detects the key, it displays the key's value. 
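To see how the code above walks the `KEY_VALUE_SET` relationships without calling Amazon Textract, you can run the same map-building logic against a hand-made response fragment. The block IDs and text below are hypothetical; a real `AnalyzeDocument` response has the same shape:

```python
from collections import defaultdict

# A minimal, hand-made AnalyzeDocument response fragment (hypothetical IDs).
blocks = [
    {'Id': 'key-1', 'BlockType': 'KEY_VALUE_SET', 'EntityTypes': ['KEY'],
     'Relationships': [{'Type': 'VALUE', 'Ids': ['val-1']},
                       {'Type': 'CHILD', 'Ids': ['word-1']}]},
    {'Id': 'val-1', 'BlockType': 'KEY_VALUE_SET', 'EntityTypes': ['VALUE'],
     'Relationships': [{'Type': 'CHILD', 'Ids': ['word-2']}]},
    {'Id': 'word-1', 'BlockType': 'WORD', 'Text': 'Name:'},
    {'Id': 'word-2', 'BlockType': 'WORD', 'Text': 'Jane'},
]

block_map = {b['Id']: b for b in blocks}
key_map = {b['Id']: b for b in blocks
           if b['BlockType'] == 'KEY_VALUE_SET' and 'KEY' in b['EntityTypes']}
value_map = {b['Id']: b for b in blocks
             if b['BlockType'] == 'KEY_VALUE_SET' and 'VALUE' in b['EntityTypes']}

def get_text(block, block_map):
    # Concatenate the WORD children of a KEY or VALUE block.
    text = ''
    for rel in block.get('Relationships', []):
        if rel['Type'] == 'CHILD':
            for cid in rel['Ids']:
                child = block_map[cid]
                if child['BlockType'] == 'WORD':
                    text += child['Text'] + ' '
    return text

kvs = defaultdict(list)
for key_block in key_map.values():
    for rel in key_block['Relationships']:
        if rel['Type'] == 'VALUE':
            for vid in rel['Ids']:
                kvs[get_text(key_block, block_map)].append(
                    get_text(value_map[vid], block_map))

print(dict(kvs))  # {'Name: ': ['Jane ']}
```

Each KEY block points to its VALUE block through a `VALUE` relationship, and both point to their `WORD` children through `CHILD` relationships; the trailing spaces come from the word-concatenation step, just as in the full example.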

# Exporting Tables into a CSV File
<a name="examples-export-table-csv"></a>

These Python examples show how to export tables from an image of a document into a comma-separated values (CSV) file.

The example for synchronous document analysis collects table information from a call to [AnalyzeDocument](API_AnalyzeDocument.md). The example for asynchronous document analysis makes a call to [StartDocumentAnalysis](API_StartDocumentAnalysis.md) and then retrieves the results from [GetDocumentAnalysis](API_GetDocumentAnalysis.md) as `Block` objects.

Table information is returned as [Block](API_Block.md) objects from a call to [AnalyzeDocument](API_AnalyzeDocument.md). For more information, see [Tables](how-it-works-tables.md). The `Block` objects are stored in a map structure that's used to export the table data into a CSV file. 

------
#### [ Synchronous ]

In this example, you use the following functions: 
+ `get_table_csv_results` – Calls [AnalyzeDocument](API_AnalyzeDocument.md), and builds a map of tables that are detected in the document. Creates a CSV representation of all detected tables.
+ `generate_table_csv` – Generates the CSV file for an individual table.
+ `get_rows_columns_map` – Gets the rows and columns from the map.
+ `get_text` – Gets the text from a cell.

**To export tables into a CSV file**

1. Configure your environment. For more information, see [Prerequisites](examples-blocks.md#examples-prerequisites).

1. Save the following example code to a file named *textract_python_table_parser.py*. In the function `get_table_csv_results`, replace `profile-name` with the name of a profile that can call Amazon Textract, and replace `region` with the AWS Region in which you want to run the code.

   ```
   import webbrowser, os
   import json
   import boto3
   import io
   from io import BytesIO
   import sys
   from pprint import pprint
   
   
   def get_rows_columns_map(table_result, blocks_map):
       rows = {}
       scores = []
       for relationship in table_result['Relationships']:
           if relationship['Type'] == 'CHILD':
               for child_id in relationship['Ids']:
                   cell = blocks_map[child_id]
                   if cell['BlockType'] == 'CELL':
                       row_index = cell['RowIndex']
                       col_index = cell['ColumnIndex']
                       if row_index not in rows:
                           # create new row
                           rows[row_index] = {}
                       
                       # get confidence score
                       scores.append(str(cell['Confidence']))
                           
                       # get the text value
                       rows[row_index][col_index] = get_text(cell, blocks_map)
       return rows, scores
   
   
   def get_text(result, blocks_map):
       text = ''
       if 'Relationships' in result:
           for relationship in result['Relationships']:
               if relationship['Type'] == 'CHILD':
                   for child_id in relationship['Ids']:
                       word = blocks_map[child_id]
                       if word['BlockType'] == 'WORD':
                           if "," in word['Text'] and word['Text'].replace(",", "").isnumeric():
                               text += '"' + word['Text'] + '"' + ' '
                           else:
                               text += word['Text'] + ' '
                       if word['BlockType'] == 'SELECTION_ELEMENT':
                           if word['SelectionStatus'] =='SELECTED':
                               text +=  'X '
       return text
   
   
   def get_table_csv_results(file_name):
   
       with open(file_name, 'rb') as file:
           img_test = file.read()
           bytes_test = bytearray(img_test)
           print('Image loaded', file_name)
   
       # process using image bytes
       # get the results
       session = boto3.Session(profile_name='profile-name')
       client = session.client('textract', region_name='region')
       response = client.analyze_document(Document={'Bytes': bytes_test}, FeatureTypes=['TABLES'])
   
       # Get the text blocks
       blocks=response['Blocks']
       pprint(blocks)
   
       blocks_map = {}
       table_blocks = []
       for block in blocks:
           blocks_map[block['Id']] = block
           if block['BlockType'] == "TABLE":
               table_blocks.append(block)
   
       if len(table_blocks) <= 0:
           return "<b> NO Table FOUND </b>"
   
       csv = ''
       for index, table in enumerate(table_blocks):
           csv += generate_table_csv(table, blocks_map, index +1)
           csv += '\n\n'
   
       return csv
   
   def generate_table_csv(table_result, blocks_map, table_index):
       rows, scores = get_rows_columns_map(table_result, blocks_map)
   
       table_id = 'Table_' + str(table_index)
       
       # get cells.
       csv = 'Table: {0}\n\n'.format(table_id)
   
       for row_index, cols in rows.items():
           col_indices = len(cols.items())
           for col_index, text in cols.items():
               csv += '{}'.format(text) + ","
           csv += '\n'
           
       csv += '\n\n Confidence Scores % (Table Cell) \n'
       cols_count = 0
       for score in scores:
           cols_count += 1
           csv += score + ","
           if cols_count == col_indices:
               csv += '\n'
               cols_count = 0
   
       csv += '\n\n\n'
       return csv
   
   def main(file_name):
       table_csv = get_table_csv_results(file_name)
   
       output_file = 'output.csv'
   
       # replace content
       with open(output_file, "wt") as fout:
           fout.write(table_csv)
   
       # show the results
       print('CSV OUTPUT FILE: ', output_file)
   
   
   if __name__ == "__main__":
       file_name = sys.argv[1]
       main(file_name)
   ```

1. At the command prompt, enter the following command. Replace `file` with the name of the document image file that you want to analyze.

   ```
   python textract_python_table_parser.py file
   ```

When you run the example, the CSV output is saved in a file named `output.csv`.
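To see how the `RowIndex` and `ColumnIndex` fields on `CELL` blocks drive the CSV layout, you can run the same row/column mapping against a hand-made response fragment. The block IDs and cell text below are hypothetical; a real `AnalyzeDocument` response has the same structure:

```python
# Hand-made TABLE/CELL/WORD blocks (hypothetical IDs). The TABLE block lists
# its CELL children; each CELL carries RowIndex and ColumnIndex and lists
# its WORD children.
blocks = [
    {'Id': 't1', 'BlockType': 'TABLE',
     'Relationships': [{'Type': 'CHILD', 'Ids': ['c1', 'c2', 'c3', 'c4']}]},
    {'Id': 'c1', 'BlockType': 'CELL', 'RowIndex': 1, 'ColumnIndex': 1,
     'Relationships': [{'Type': 'CHILD', 'Ids': ['w1']}]},
    {'Id': 'c2', 'BlockType': 'CELL', 'RowIndex': 1, 'ColumnIndex': 2,
     'Relationships': [{'Type': 'CHILD', 'Ids': ['w2']}]},
    {'Id': 'c3', 'BlockType': 'CELL', 'RowIndex': 2, 'ColumnIndex': 1,
     'Relationships': [{'Type': 'CHILD', 'Ids': ['w3']}]},
    {'Id': 'c4', 'BlockType': 'CELL', 'RowIndex': 2, 'ColumnIndex': 2,
     'Relationships': [{'Type': 'CHILD', 'Ids': ['w4']}]},
    {'Id': 'w1', 'BlockType': 'WORD', 'Text': 'Item'},
    {'Id': 'w2', 'BlockType': 'WORD', 'Text': 'Qty'},
    {'Id': 'w3', 'BlockType': 'WORD', 'Text': 'Apples'},
    {'Id': 'w4', 'BlockType': 'WORD', 'Text': '3'},
]

block_map = {b['Id']: b for b in blocks}
table = block_map['t1']

# Build the same {row: {column: text}} map that get_rows_columns_map builds.
rows = {}
for rel in table['Relationships']:
    if rel['Type'] == 'CHILD':
        for cid in rel['Ids']:
            cell = block_map[cid]
            word_ids = cell['Relationships'][0]['Ids']
            text = ' '.join(block_map[w]['Text'] for w in word_ids)
            rows.setdefault(cell['RowIndex'], {})[cell['ColumnIndex']] = text

csv = '\n'.join(','.join(cols[c] for c in sorted(cols))
                for _, cols in sorted(rows.items()))
print(csv)
# Item,Qty
# Apples,3
```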

------
#### [ Asynchronous ]

In this example, you use two different scripts. The first script starts asynchronous analysis of a document with `StartDocumentAnalysis` and gets the `Block` information returned by `GetDocumentAnalysis`. The second script takes the returned `Block` information for each page, formats the data as a table, and saves the tables to a CSV file.

**To export tables into a CSV file**

1. Configure your environment. For more information, see [Prerequisites](examples-blocks.md#examples-prerequisites).

1. Ensure that you have followed the instructions given at [Configuring Amazon Textract for Asynchronous Operations](api-async-roles.md). The process documented on that page enables you to send and receive messages about the completion status of asynchronous jobs.

1. In the following code example, replace the value of `roleArn` with the ARN assigned to the role that you created in Step 2. Replace the value of `bucket` with the name of the Amazon S3 bucket that contains your document, and the value of `document` with the name of the document in that bucket. Replace the value of `region_name` with the AWS Region of your bucket.

   Save the following example code to a file named *start_doc_analysis_for_table_extraction.py*.

   ```
   import boto3
   import time
   
   class DocumentProcessor:
   
       jobId = ''
       region_name = ''
   
       roleArn = ''
       bucket = ''
       document = ''
   
       sqsQueueUrl = ''
       snsTopicArn = ''
       processType = ''
   
       def __init__(self, role, bucket, document, region):
           self.roleArn = role
           self.bucket = bucket
           self.document = document
           self.region_name = region
   
           self.textract = boto3.client('textract', region_name=self.region_name)
           self.sqs = boto3.client('sqs')
           self.sns = boto3.client('sns')
   
       def ProcessDocument(self):
   
           jobFound = False
   
           response = self.textract.start_document_analysis(DocumentLocation={'S3Object': {'Bucket': self.bucket, 'Name': self.document}},
                   FeatureTypes=["TABLES", "FORMS"], NotificationChannel={'RoleArn': self.roleArn, 'SNSTopicArn': self.snsTopicArn})
           print('Processing type: Analysis')
   
           print('Start Job Id: ' + response['JobId'])
   
           print('Done!')
   
       def CreateTopicandQueue(self):
   
           millis = str(int(round(time.time() * 1000)))
   
           # Create SNS topic
           snsTopicName = "AmazonTextractTopic" + millis
   
           topicResponse = self.sns.create_topic(Name=snsTopicName)
           self.snsTopicArn = topicResponse['TopicArn']
   
           # create SQS queue
           sqsQueueName = "AmazonTextractQueue" + millis
           self.sqs.create_queue(QueueName=sqsQueueName)
           self.sqsQueueUrl = self.sqs.get_queue_url(QueueName=sqsQueueName)['QueueUrl']
   
           attribs = self.sqs.get_queue_attributes(QueueUrl=self.sqsQueueUrl,
                                                   AttributeNames=['QueueArn'])['Attributes']
   
           sqsQueueArn = attribs['QueueArn']
   
           # Subscribe SQS queue to SNS topic
           self.sns.subscribe(TopicArn=self.snsTopicArn, Protocol='sqs', Endpoint=sqsQueueArn)
   
           # Authorize SNS to write SQS queue
           policy = """{{
      "Version":"2012-10-17",
         "Statement":[
           {{
             "Sid":"MyPolicy",
             "Effect":"Allow",
             "Principal" : {{"AWS" : "*"}},
             "Action":"SQS:SendMessage",
             "Resource": "{}",
             "Condition":{{
               "ArnEquals":{{
                 "aws:SourceArn": "{}"
               }}
             }}
           }}
         ]
       }}""".format(sqsQueueArn, self.snsTopicArn)
   
           response = self.sqs.set_queue_attributes(
               QueueUrl=self.sqsQueueUrl,
               Attributes={
                   'Policy': policy
               })
   
   def main():
       roleArn = 'role-arn'
       bucket = 'bucket-name'
       document = 'document-name'
       region_name = 'region-name'
   
       analyzer = DocumentProcessor(roleArn, bucket, document, region_name)
       analyzer.CreateTopicandQueue()
       analyzer.ProcessDocument()
   
   if __name__ == "__main__":
       main()
   ```

1. Run the code. The code prints a job ID (`JobId`). Copy this value; you need it in the next step.

1. Wait for your job to finish processing. After it has finished, copy the following code to a file named *get_doc_analysis_for_table_extraction.py*. Replace the value of `jobId` with the job ID that you copied earlier. Replace the value of `region_name` with the name of the AWS Region associated with your Amazon Textract role. Replace the value of `file_name` with the name that you want to give the output CSV file.

   ```
   import boto3
   from pprint import pprint
   
   jobId = ''
   region_name = ''
   file_name = ''
   
   textract = boto3.client('textract', region_name=region_name)
   
   # Display information about a block
   def DisplayBlockInfo(block):
       print("Block Id: " + block['Id'])
       print("Type: " + block['BlockType'])
       if 'EntityTypes' in block:
           print('EntityTypes: {}'.format(block['EntityTypes']))
   
       if 'Text' in block:
           print("Text: " + block['Text'])
   
       if block['BlockType'] != 'PAGE':
           print("Confidence: " + "{:.2f}".format(block['Confidence']) + "%")
   
   def GetResults(jobId, file_name):
       maxResults = 1000
       paginationToken = None
       finished = False
   
       while finished == False:
   
           response = None
   
           if paginationToken == None:
               response = textract.get_document_analysis(JobId=jobId, MaxResults=maxResults)
           else:
               response = textract.get_document_analysis(JobId=jobId, MaxResults=maxResults,
                                                              NextToken=paginationToken)
   
           blocks = response['Blocks']
           table_csv = get_table_csv_results(blocks)
           output_file = file_name + ".csv"
           # append the results for this batch of Block objects
           with open(output_file, "at") as fout:
               fout.write(table_csv)
           # show the results
           print('Detected Document Text')
           print('Pages: {}'.format(response['DocumentMetadata']['Pages']))
           print('OUTPUT TO CSV FILE: ', output_file)
   
           # Display block information
           for block in blocks:
               DisplayBlockInfo(block)
               print()
               print()
   
           if 'NextToken' in response:
               paginationToken = response['NextToken']
           else:
               finished = True
   
   
   def get_rows_columns_map(table_result, blocks_map):
       rows = {}
       for relationship in table_result['Relationships']:
           if relationship['Type'] == 'CHILD':
               for child_id in relationship['Ids']:
                   try:
                       cell = blocks_map[child_id]
                       if cell['BlockType'] == 'CELL':
                           row_index = cell['RowIndex']
                           col_index = cell['ColumnIndex']
                           if row_index not in rows:
                               # create new row
                               rows[row_index] = {}
   
                           # get the text value
                           rows[row_index][col_index] = get_text(cell, blocks_map)
                   except KeyError as err:
                       print("Error extracting Table data: {}".format(err))
       return rows
   
   
   def get_text(result, blocks_map):
       text = ''
       if 'Relationships' in result:
           for relationship in result['Relationships']:
               if relationship['Type'] == 'CHILD':
                   for child_id in relationship['Ids']:
                       try:
                           word = blocks_map[child_id]
                           if word['BlockType'] == 'WORD':
                               text += word['Text'] + ' '
                           if word['BlockType'] == 'SELECTION_ELEMENT':
                               if word['SelectionStatus'] == 'SELECTED':
                                   text += 'X '
                       except KeyError as err:
                           print("Error extracting Table data: {}".format(err))
   
       return text
   
   
   def get_table_csv_results(blocks):
   
       pprint(blocks)
   
       blocks_map = {}
       table_blocks = []
       for block in blocks:
           blocks_map[block['Id']] = block
           if block['BlockType'] == "TABLE":
               table_blocks.append(block)
   
       if len(table_blocks) <= 0:
           return "<b> NO Table FOUND </b>"
   
       csv = ''
       for index, table in enumerate(table_blocks):
           csv += generate_table_csv(table, blocks_map, index + 1)
           csv += '\n\n'
           # In order to generate separate CSV file for every table, uncomment code below
           #inner_csv = ''
           #inner_csv += generate_table_csv(table, blocks_map, index + 1)
           #inner_csv += '\n\n'
           #output_file = file_name + "___" + str(index) + ".csv"
           # replace content
           #with open(output_file, "at") as fout:
           #    fout.write(inner_csv)
   
       return csv
   
   
   def generate_table_csv(table_result, blocks_map, table_index):
       rows = get_rows_columns_map(table_result, blocks_map)
   
       table_id = 'Table_' + str(table_index)
   
       # get cells.
       csv = 'Table: {0}\n\n'.format(table_id)
   
       for row_index, cols in rows.items():
   
           for col_index, text in cols.items():
               csv += '{}'.format(text) + ","
           csv += '\n'
   
       csv += '\n\n\n'
       return csv
   
   GetResults(jobId, file_name)
   ```

1. Run the code.

   After you have obtained your results, be sure to delete the associated SNS and SQS resources, or you might accrue charges for them.
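The procedure above leaves two manual gaps: waiting for the job to finish and deleting the messaging resources afterward. Both can be scripted. The following sketch assumes the queue URL and topic ARN that `CreateTopicandQueue` created; SNS wraps the Amazon Textract notification (fields such as `JobId` and `Status`) in an envelope whose `Message` key holds a JSON string:

```python
import json


def parse_notification(sqs_message_body):
    # The Textract fields (JobId, Status, API, ...) arrive as a JSON string
    # under the SNS envelope's 'Message' key.
    notification = json.loads(json.loads(sqs_message_body)['Message'])
    return notification['JobId'], notification['Status']


def wait_for_job(sqs, queue_url, job_id):
    # Long-poll the queue until the completion message for job_id arrives.
    while True:
        resp = sqs.receive_message(QueueUrl=queue_url,
                                   MaxNumberOfMessages=1,
                                   WaitTimeSeconds=20)
        for message in resp.get('Messages', []):
            found_id, status = parse_notification(message['Body'])
            sqs.delete_message(QueueUrl=queue_url,
                               ReceiptHandle=message['ReceiptHandle'])
            if found_id == job_id:
                return status  # 'SUCCEEDED' or 'FAILED'


def delete_topic_and_queue(sns, sqs, topic_arn, queue_url):
    # Deleting the topic also removes its subscriptions.
    sns.delete_topic(TopicArn=topic_arn)
    sqs.delete_queue(QueueUrl=queue_url)

# Example wiring (ARN, URL, and job ID are placeholders):
# import boto3
# sqs = boto3.client('sqs', region_name='region-name')
# sns = boto3.client('sns', region_name='region-name')
# status = wait_for_job(sqs, 'queue-url', 'job-id')
# delete_topic_and_queue(sns, sqs, 'topic-arn', 'queue-url')
```

The clients are passed in as parameters so the helpers are easy to exercise without AWS credentials.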

------

# Detecting text with an AWS Lambda function
<a name="lambda"></a>

AWS Lambda is a compute service that you can use to run code without provisioning or managing servers. You can call Amazon Textract API operations from within an AWS Lambda function. The following instructions show how to create a Lambda function in Python that calls [DetectDocumentText](API_DetectDocumentText.md). 

The Lambda function returns a list of [Block](API_Block.md) objects with information about the detected words and lines of text. The instructions include example Python code that shows you how to call the Lambda function with a document supplied from an Amazon S3 bucket or your local computer. Images stored in Amazon S3 must be in single-page PDF or TIFF document format, or in JPEG or PNG format. Local images must be in single-page PDF or TIFF format. The Python code returns part of the JSON response for each Block type detected in the document.

For an example that uses Lambda functions to process documents at a large scale, see [Amazon Textract IDP CDK Constructs](https://github.com/aws-samples/amazon-textract-idp-cdk-constructs/) and [Use machine learning to automate and process documents at scale](https://s12d.com/aws-idp-scale-workshop).

**Topics**
+ [Step 1: Create an AWS Lambda function (console)](#example-lambda-create-function)
+ [Step 2: (Optional) Create a layer (console)](#example-lambda-create-layer)
+ [Step 3: Add Python code (console)](#example-lambda-add-code)
+ [Step 4: Try your Lambda function](#example-lambda-test)

## Step 1: Create an AWS Lambda function (console)
<a name="example-lambda-create-function"></a>

In this step, you create an empty AWS Lambda function and an IAM execution role that lets your function call the `DetectDocumentText` operation. If you are supplying documents from Amazon S3, this step also shows you how to grant access to the bucket that stores your documents.

Later you add the source code and optionally add a layer to the Lambda function.

**To create an AWS Lambda function (console)**

1. Sign in to the AWS Management Console and open the AWS Lambda console at [https://console.aws.amazon.com/lambda/](https://console.aws.amazon.com/lambda/).

1. Choose **Create function**. For more information, see [Create a Lambda Function with the Console](https://docs.aws.amazon.com/lambda/latest/dg/getting-started-create-function.html).

1. Choose the following options:
   + Choose **Author from scratch**. 
   + Enter a value for **Function name**.
   + For **Runtime**, choose **Python 3.9**.
   + For **Architecture**, choose **x86_64**.

1. Choose **Create function** to create the AWS Lambda function.

1. On the function page, choose the **Configuration** tab.

1. On the **Permissions** pane, under **Execution role**, choose the role name to open the role in the IAM console.

1. In the **Permissions** tab, choose **Add permissions** and then **Create inline policy**.

1. Choose the **JSON** tab and replace the policy with the following policy:

------
#### [ JSON ]


   ```
   {
       "Version":"2012-10-17",
       "Statement": [
           {
               "Action": "textract:DetectDocumentText",
               "Resource": "*",
               "Effect": "Allow",
               "Sid": "DetectDocumentText"
           }
       ]
   }
   ```

------

1. Choose **Review policy**.

1. Enter a name for the policy, for example *DetectDocumentText-access*.

1. Choose **Create policy**.

1. If you are storing documents for analysis in an Amazon S3 bucket, you must add an Amazon S3 access policy. To do this, repeat steps 7 to 11 in the AWS Lambda console and make the following changes. 

   1. For step 8, use the following policy. Replace *bucket/folder path* with the Amazon S3 bucket and folder path to the documents that you want to analyze. 

------
#### [ JSON ]


      ```
      {
          "Version":"2012-10-17",
          "Statement": [
              {
                  "Sid": "S3Access",
                  "Effect": "Allow",
                  "Action": "s3:GetObject",
                  "Resource": "arn:aws:s3:::bucket/folder path/*"
              }
          ]
      }
      ```

------

   1. For step 10, choose a different policy name, such as *S3Bucket-access*.

## Step 2: (Optional) Create a layer (console)
<a name="example-lambda-create-layer"></a>

To run this example, you don't need to perform this step. The `DetectDocumentText` operation is included in the default Lambda Python environment as part of AWS SDK for Python (Boto3). If other parts of your Lambda function require recent AWS service updates that aren't in the default Lambda Python environment, then perform this step to add the most recent Boto3 SDK release as a layer to your function. 

First, you create a zip file archive that contains the Boto3 SDK. Then, you create a layer and add the zip file archive to the layer. For more information, see [Using layers with your Lambda function](https://docs.aws.amazon.com/lambda/latest/dg/invocation-layers.html#invocation-layers-using).

**To create and add a layer (console)**

1. Open a command prompt and enter the following commands to create a deployment package with the most recent version of the AWS SDK.

   ```
   pip install boto3 --target python/.
   zip boto3-layer.zip -r python/
   ```

1. Note the name of the zip file (boto3-layer.zip), which you use in step 8 of this procedure.

1. Open the AWS Lambda console at [https://console.aws.amazon.com/lambda/](https://console.aws.amazon.com/lambda/).

1. In the navigation pane, choose **Layers**. 

1. Choose **Create layer**.

1. Enter values for **Name** and **Description**.

1. For **Code entry type**, choose **Upload a .zip file** and select **Upload**.

1. In the dialog box, choose the zip file archive (boto3-layer.zip) that you created in step 1 of this procedure.

1. For **Compatible runtimes**, choose **Python 3.9**.

1. Choose **Create** to create the layer.

1. Choose the navigation pane menu icon.

1. In the navigation pane, choose **Functions**.

1. In the resources list, choose the function that you created previously in [Step 1: Create an AWS Lambda function (console)](#example-lambda-create-function). 

1. Choose the **Code** tab.

1. In the **Layers** section, choose **Add a layer**.

1. Choose **Custom layers**.

1. In **Custom layers**, choose the layer name that you entered in step 6. 

1. In **Version**, choose the layer version, which should be 1.

1. Choose **Add**.

## Step 3: Add Python code (console)
<a name="example-lambda-add-code"></a>

In this step, you add Python code to your Lambda function by using the Lambda console code editor. The code detects text in a document with `DetectDocumentText` and returns a list of Block objects with information about the detected text. The document can be located in an Amazon S3 bucket or a local computer. Images stored in Amazon S3 must be single-page PDF or TIFF format documents or in JPEG or PNG format. Local images must be in single-page PDF or TIFF format. 

**To add Python code (console)**

1. Navigate to the **Code** tab.

1. In the code editor, replace the code in **lambda_function.py** with the following code: 

   ```
   # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
   # SPDX-License-Identifier: Apache-2.0
   
   """
   Purpose
   An AWS lambda function that analyzes documents with Amazon Textract.
   """
   import json
   import base64
   import logging
   import boto3
   
   from botocore.exceptions import ClientError
   
   # Set up logging.
   logger = logging.getLogger(__name__)
   
   # Get the boto3 client.
   textract_client = boto3.client('textract')
   
   
   def lambda_handler(event, context):
       """
       Lambda handler function
       param: event: The event object for the Lambda function.
       param: context: The context object for the lambda function.
       return: The list of Block objects recognized in the document
       passed in the event object.
       """
   
       try:
   
           # Determine document source.
           if 'image' in event:
               # Decode the image
               image_bytes = event['image'].encode('utf-8')
               img_b64decoded = base64.b64decode(image_bytes)
               image = {'Bytes': img_b64decoded}
   
   
           elif 'S3Object' in event:
               image = {'S3Object':
                        {'Bucket':  event['S3Object']['Bucket'],
                         'Name': event['S3Object']['Name']}
                        }
   
           else:
               raise ValueError(
                'Invalid source. Only base64-encoded image bytes or an S3Object are supported.')
   
   
           # Analyze the document.
           response = textract_client.detect_document_text(Document=image)
   
           # Get the Blocks
           blocks = response['Blocks']
   
           lambda_response = {
               "statusCode": 200,
               "body": json.dumps(blocks)
           }
   
       except ClientError as err:
           error_message = "Couldn't analyze image. " + \
               err.response['Error']['Message']
   
           lambda_response = {
               'statusCode': 400,
               'body': {
                   "Error": err.response['Error']['Code'],
                   "ErrorMessage": error_message
               }
           }
           logger.error("Error function %s: %s",
               context.invoked_function_arn, error_message)
   
       except ValueError as val_error:
           lambda_response = {
               'statusCode': 400,
               'body': {
                   "Error": "ValueError",
                   "ErrorMessage": format(val_error)
               }
           }
           logger.error("Error function %s: %s",
               context.invoked_function_arn, format(val_error))
   
       return lambda_response
   ```

1. Choose **Deploy** to deploy your Lambda function.

## Step 4: Try your Lambda function
<a name="example-lambda-test"></a>

Now that you’ve created your Lambda function, you can invoke it to detect text in a document. In this step, you use Python code on your computer to pass a local document or a document in an Amazon S3 bucket to your Lambda function. Documents passed from a local computer must be smaller than 6291456 bytes (6 MB). If your documents are larger, upload them to an Amazon S3 bucket and call the script with the Amazon S3 path to the document. For information about uploading files to an Amazon S3 bucket, see [Uploading objects](https://docs.aws.amazon.com/AmazonS3/latest/userguide/upload-objects.html).
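
As a quick guard before invoking the function, you might check the file size locally. The following sketch is illustrative (the helper name and constant are not part of the tutorial code) and compares against the 6291456-byte limit mentioned above:

```
import os

# Size limit for documents passed directly to the Lambda function (6 MB).
MAX_PAYLOAD_BYTES = 6291456

def fits_in_lambda_payload(file_path):
    """Returns True if a local document is small enough to send
    to the Lambda function; larger files should go through Amazon S3."""
    return os.path.getsize(file_path) < MAX_PAYLOAD_BYTES
```

Note that base64 encoding grows the payload by roughly a third, so files close to the limit may still exceed the maximum invocation payload once encoded.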

Make sure you run the code in the same [AWS Region](https://docs.aws.amazon.com/general/latest/gr/rande.html) in which you created the Lambda function. You can view the AWS Region for your Lambda function in the navigation bar of the function details page in the [Lambda console](https://console.aws.amazon.com/lambda/).

If the AWS Lambda function returns a timeout error, extend the timeout period for the Lambda function. For more information, see [Configuring function timeout (console)](https://docs.aws.amazon.com/lambda/latest/dg/configuration-function-common.html#configuration-timeout-console).

For more information about invoking a Lambda function from your code, see [Invoking AWS Lambda Functions](https://docs.aws.amazon.com/lambda/latest/dg/invoking-lambda-functions.html).

**To try your Lambda function**

1. If you haven't already done so, do the following:

   1. Make sure that the user has `lambda:InvokeFunction` permission for your Lambda function. 

      You can get the ARN for your Lambda function from the function overview in the [Lambda console](https://console.aws.amazon.com/lambda/).

      To provide access, add permissions to your users, groups, or roles:
      + Users and groups in AWS IAM Identity Center:

        Create a permission set. Follow the instructions in [Create a permission set](https://docs.aws.amazon.com/singlesignon/latest/userguide/howtocreatepermissionset.html) in the *AWS IAM Identity Center User Guide*.
      + Users managed in IAM through an identity provider:

        Create a role for identity federation. Follow the instructions in [Create a role for a third-party identity provider (federation)](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_create_for-idp.html) in the *IAM User Guide*.
      + IAM users:
        + Create a role that your user can assume. Follow the instructions in [Create a role for an IAM user](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_create_for-user.html) in the *IAM User Guide*.
        + (Not recommended) Attach a policy directly to a user or add a user to a user group. Follow the instructions in [Adding permissions to a user (console)](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_users_change-permissions.html#users_change_permissions-add-console) in the *IAM User Guide*.

   1. Install and configure the AWS SDK for Python. For more information, see [Step 2: Set Up the AWS CLI and AWS SDKs](setup-awscli-sdk.md).

1. Save the following code to a file named `client.py`: 

   ```
   # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
   # SPDX-License-Identifier: Apache-2.0
   
   """
   Purpose
   Test code for running the Amazon Textract Lambda
   function example code.
   """
   
   import argparse
   import logging
   import base64
   import json
   import boto3
   
   from botocore.exceptions import ClientError
   
   
   logger = logging.getLogger(__name__)
   
   
   def analyze_image(function_name, image):
       """Analyzes a document with an AWS Lambda function.
       :param function_name: The name of the AWS Lambda function.
       :param image: The document that you want to analyze.
       :return: The list of Block objects in JSON format.
       """
   
       lambda_client = boto3.client('lambda')
   
       lambda_payload = {}
   
       if image.startswith('s3://'):
           logger.info("Analyzing document from S3 bucket: %s", image)
           bucket, key = image.replace("s3://", "").split("/", 1)
           s3_object = {
               'Bucket': bucket,
               'Name': key
           }
   
           lambda_payload = {"S3Object": s3_object}
   
       else:
           with open(image, 'rb') as image_file:
               logger.info("Analyzing local document: %s ", image)
               image_bytes = image_file.read()
               data = base64.b64encode(image_bytes).decode("utf8")
   
               lambda_payload = {"image": data}
   
       # Call the lambda function with the document.
   
       response = lambda_client.invoke(FunctionName=function_name,
                                       Payload=json.dumps(lambda_payload))
   
       return json.loads(response['Payload'].read().decode())
   
   
   def add_arguments(parser):
       """
       Adds command line arguments to the parser.
       :param parser: The command line parser.
       """
   
       parser.add_argument(
           "function", help="The name of the AWS Lambda function that you want " \
           "to use to analyze the document.")
       parser.add_argument(
           "image", help="The document that you want to analyze.") 
   
   
   def main():
       """
       Entrypoint for script.
       """
       try:
           logging.basicConfig(level=logging.INFO,
                               format="%(levelname)s: %(message)s")
   
           # Get command line arguments.
           parser = argparse.ArgumentParser(usage=argparse.SUPPRESS)
           add_arguments(parser)
           args = parser.parse_args()
   
           # Get analysis results.
           result = analyze_image(args.function, args.image)
           status = result['statusCode']
   
           blocks = result['body']
           blocks = json.loads(blocks)
   
           if status == 200:
   
               for block in blocks:
                   print('Type: ' + block['BlockType'])
                   if block['BlockType'] != 'PAGE':
                       print('Detected: ' + block['Text'])
                       print('Confidence: ' + "{:.2f}".format(block['Confidence']) + "%")
   
                   print('Id: {}'.format(block['Id']))
                   if 'Relationships' in block:
                       print('Relationships: {}'.format(block['Relationships']))
                   print('Bounding Box: {}'.format(block['Geometry']['BoundingBox']))
                   print('Polygon: {}'.format(block['Geometry']['Polygon']))
                   print()
               print("Blocks detected: " + str(len(blocks)))
           else:
               print(f"Error: {result['statusCode']}")
               print(f"Message: {result['body']}")
   
       except ClientError as error:
           logging.error(error)
           print(error)
   
   
   if __name__ == "__main__":
       main()
   ```

1. Run the code. For the command line arguments, supply the Lambda function name and the document that you want to analyze. You can supply a path to a local document, or you can use the Amazon S3 path to a document stored in an Amazon S3 bucket. For example:

   ```
   python client.py function_name s3://bucket/path/document.jpg
   ```

   If the document is in an Amazon S3 bucket, make sure that it is the same bucket that you specified previously in step 12 of [Step 1: Create an AWS Lambda function (console)](#example-lambda-create-function).

   If successful, your code prints details from the JSON response for each Block object detected in the document.
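
Once you have the parsed list of Block objects, you can post-process it however you like. As a minimal sketch (the helper name is illustrative, not part of the tutorial code), the following pulls just the detected lines of text out of the list:

```
def get_detected_lines(blocks):
    """Returns the text of every LINE block from a list of Block
    dictionaries, such as the parsed body of the Lambda response."""
    return [block['Text'] for block in blocks
            if block.get('BlockType') == 'LINE']
```

For example, calling it on the `blocks` list produced in `client.py` yields one string per detected line.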

# Extracting and Sending Text to AWS Comprehend for Analysis
<a name="textract-to-comprehend"></a>

Amazon Textract lets you include document text detection and analysis in your applications. With Amazon Textract, you can extract text from a variety of document types using both synchronous and asynchronous document processing. The extracted text can then be saved to a file or database, or sent to another AWS service for further processing. 

In this tutorial you carry out a common end-to-end workflow. This workflow involves:
+ Processing numerous input documents with Amazon Textract
+ Providing the extracted text to Amazon Comprehend for analysis
+ Saving both the analyzed text and the analysis data to an Amazon Simple Storage Service (Amazon S3) bucket

You use the [AWS SDK for Python](https://aws.amazon.com/sdk-for-python/) for this tutorial. You can also see the AWS Documentation SDK examples [GitHub repo](https://github.com/awsdocs/aws-doc-sdk-examples) for more Python tutorials. 

## Prerequisites
<a name="tutorial-prerequisites"></a>

Before you begin this tutorial, you’ll need to install Python and complete the steps required to [set up the Python AWS SDK](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/quickstart.html). Beyond this, ensure that you have:
+ [Created an AWS account and an IAM role](https://docs.aws.amazon.com/rekognition/latest/dg/setting-up.html)
+ [Properly configured your AWS access credentials](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-quickstart.html)
+ [Created an Amazon S3 bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/create-bucket-overview.html)
+ [Configured Amazon Textract for asynchronous processing](https://docs.aws.amazon.com/textract/latest/dg/api-async-roles.html), copying down the Amazon Resource Name (ARN) of the IAM role you configured for use with Amazon Textract
+ [Granted your IAM role access to Amazon Comprehend](https://docs.aws.amazon.com/comprehend/latest/dg/security-iam.html#security_iam_access-manage) 
+ Selected a few documents for text extraction and analysis, and uploaded those documents to Amazon S3. Ensure that the files you select for analysis are in formats supported by Amazon Textract.
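
When selecting files to upload, a small helper like the following can screen out unsupported formats before they reach your bucket. This is an illustrative sketch; the extension set covers the PDF, TIFF, JPEG, and PNG formats that Amazon Textract supports:

```
import os

# File extensions for the document formats Amazon Textract supports.
SUPPORTED_EXTENSIONS = {".pdf", ".tif", ".tiff", ".jpg", ".jpeg", ".png"}

def is_supported_document(path):
    """Returns True if the file extension matches a format
    that Amazon Textract can process."""
    return os.path.splitext(path)[1].lower() in SUPPORTED_EXTENSIONS
```

You could filter your local file list with this check before uploading each file with `s3.upload_file`.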

## Starting Asynchronous Document Text Detection
<a name="tutorial-step-1"></a>

You can extract the text from your documents and then analyze the extracted text with a service like Amazon Comprehend. Amazon Textract supports the extraction of text from large, multipage documents through asynchronous operations. Processing a PDF file asynchronously lets your application complete other tasks while it waits for the process to complete. This section demonstrates how to import your documents from an Amazon S3 bucket and provide them to Textract’s asynchronous text detection operation. 

This tutorial assumes that you will be using Amazon S3 to store the files you want to extract text from. You’ll start by creating a class and functions that detect the text in your input documents. Your application will need to connect to the Textract client, as well as the Amazon SQS and Amazon SNS clients, to monitor the completion status of the asynchronous job.

1. Start by writing the code to create an Amazon SNS topic and Amazon SQS queue.

   The following code sample creates a `DocumentProcessor` class that connects to the three required services and then creates both an Amazon SQS queue and Amazon SNS topic. The Amazon SNS topic is used to provide information about the job completion status to an Amazon SQS queue, which will be polled to obtain the completion status of a job. There are also methods to delete the Amazon SQS queue and Amazon SNS topic once the job has been completed and the resources are no longer needed.

   ```
   import boto3
   import json
   import sys
   import time
   
   class DocumentProcessor:
   
       jobId = ''
       region_name = ''
   
       roleArn = ''
       bucket = ''
       document = ''
   
       sqsQueueUrl = ''
       snsTopicArn = ''
       processType = ''
   
       def __init__(self, role, bucket, document, region):
           self.roleArn = role
           self.bucket = bucket
           self.document = document
           self.region_name = region
   
           # Instantiates necessary AWS clients.
           # Replace 'profile-name' with the name of your AWS CLI profile.
           session = boto3.Session(profile_name='profile-name',
                                   region_name=self.region_name)
           self.textract = session.client('textract', region_name=self.region_name)
           self.sqs = session.client('sqs', region_name=self.region_name)
           self.sns = session.client('sns', region_name=self.region_name)
   
       def CreateTopicandQueue(self):
   
           millis = str(int(round(time.time() * 1000)))
   
           # Create SNS topic
           snsTopicName = "AmazonTextractTopic" + millis
   
           topicResponse = self.sns.create_topic(Name=snsTopicName)
           self.snsTopicArn = topicResponse['TopicArn']
   
           # create SQS queue
           sqsQueueName = "AmazonTextractQueue" + millis
           self.sqs.create_queue(QueueName=sqsQueueName)
           self.sqsQueueUrl = self.sqs.get_queue_url(QueueName=sqsQueueName)['QueueUrl']
   
           attribs = self.sqs.get_queue_attributes(QueueUrl=self.sqsQueueUrl,
                                                   AttributeNames=['QueueArn'])['Attributes']
   
           sqsQueueArn = attribs['QueueArn']
   
           # Subscribe SQS queue to SNS topic
           self.sns.subscribe(
               TopicArn=self.snsTopicArn,
               Protocol='sqs',
               Endpoint=sqsQueueArn)
   
           # Authorize SNS to write SQS queue
           policy = """{{
     "Version": "2012-10-17",
     "Statement":[
       {{
         "Sid":"MyPolicy",
         "Effect":"Allow",
         "Principal" : {{"AWS" : "*"}},
         "Action":"SQS:SendMessage",
         "Resource": "{}",
         "Condition":{{
           "ArnEquals":{{
             "aws:SourceArn": "{}"
           }}
         }}
       }}
     ]
   }}""".format(sqsQueueArn, self.snsTopicArn)
   
           response = self.sqs.set_queue_attributes(
               QueueUrl=self.sqsQueueUrl,
               Attributes={
                   'Policy': policy
               })
   
       def DeleteTopicandQueue(self):
           self.sqs.delete_queue(QueueUrl=self.sqsQueueUrl)
           self.sns.delete_topic(TopicArn=self.snsTopicArn)
   ```

1. Write the code to call the `StartDocumentTextDetection` operation and get the results of the operation.

   The `DocumentProcessor` class will also need methods to: 
   + Call the `StartDocumentTextDetection` operation
   + Poll an Amazon SQS queue for the job completion status
   + Retrieve the results of the job once it is done processing

   The following code creates the `ProcessDocument` and `GetResults` methods, which call `StartDocumentTextDetection` and get the extracted text, respectively.

   ```
       def ProcessDocument(self):
   
           # Tracks whether the matching job has been found
           jobFound = False
   
           # Starts the text detection operation on the document in the provided bucket.
           # Sends status to the supplied SNS topic ARN.
           response = self.textract.start_document_text_detection(
               DocumentLocation={'S3Object': {'Bucket': self.bucket, 'Name': self.document}},
               NotificationChannel={'RoleArn': self.roleArn, 'SNSTopicArn': self.snsTopicArn})
           print('Processing type: Detection')
   
           print('Start Job Id: ' + response['JobId'])
           dotLine = 0
           while jobFound == False:
               sqsResponse = self.sqs.receive_message(QueueUrl=self.sqsQueueUrl,
                                                      MessageAttributeNames=['ALL'],
                                                      MaxNumberOfMessages=10)
   
               # Waits until messages are found in the SQS queue
               if sqsResponse:
                   if 'Messages' not in sqsResponse:
                       if dotLine < 40:
                           print('.', end='')
                           dotLine = dotLine + 1
                       else:
                           print()
                           dotLine = 0
                       sys.stdout.flush()
                       time.sleep(5)
                       continue
   
                   # Checks for a completed job that matches the JobId in the response from
                   # StartDocumentTextDetection
                   for message in sqsResponse['Messages']:
                       notification = json.loads(message['Body'])
                       textMessage = json.loads(notification['Message'])
                       if str(textMessage['JobId']) == response['JobId']:
                           print('Matching Job Found:' + textMessage['JobId'])
                           jobFound = True
                           text_data = self.GetResults(textMessage['JobId'])
                           self.sqs.delete_message(QueueUrl=self.sqsQueueUrl,
                                                   ReceiptHandle=message['ReceiptHandle'])
                           return text_data
                       else:
                           print("Job didn't match:" +
                                 str(textMessage['JobId']) + ' : ' + str(response['JobId']))
                           # Delete the unknown message. Consider sending to a dead-letter queue.
                           self.sqs.delete_message(QueueUrl=self.sqsQueueUrl,
                                                   ReceiptHandle=message['ReceiptHandle'])
   
           print('Done!')
   
       # Gets the results of the completed text detection job.
       # Checks for pagination tokens to determine whether the input document has multiple pages.
       def GetResults(self, jobId):
           maxResults = 1000
           paginationToken = None
           finished = False
   
           # List to hold the detected text across all pages of results
           detected_text = []
   
           while finished == False:
               if paginationToken == None:
                   response = self.textract.get_document_text_detection(JobId=jobId,
                                                                        MaxResults=maxResults)
               else:
                   response = self.textract.get_document_text_detection(JobId=jobId,
                                                                        MaxResults=maxResults,
                                                                        NextToken=paginationToken)
   
               blocks = response['Blocks']
   
               # Add the text from each LINE block to the list
               for block in blocks:
                   if 'Text' in block and block['BlockType'] == "LINE":
                       detected_text.append(block['Text'])
   
               # If the response contains a next token, update the pagination token
               if 'NextToken' in response:
                   paginationToken = response['NextToken']
               else:
                   finished = True
   
           return detected_text
   ```

1. Save the above code in a file called `detectFileAsync.py`.

   You use this file in the next section to handle the detection of text in your input documents.

## Processing Your Documents and Sending the Text to Comprehend
<a name="tutorial-step-2"></a>

Your application will use the class you created in the preceding section to:
+ read documents from your Amazon S3 bucket
+ extract the text in those documents
+ send the text to Amazon Comprehend for analysis

You start by creating some functions that use Amazon Comprehend to analyze the text detected in your input documents. A common type of text analysis is sentiment analysis, which aims to determine the sentiment of a statement (whether it is positive, negative, or neutral). You can also carry out entity detection and key phrase detection on the data.

The following code takes the detected text and invokes the Amazon Comprehend `BatchDetectSentiment` operation to carry out sentiment analysis.

1. Write the code to carry out sentiment analysis on your detected text.

   ```
   from detectFileAsync import DocumentProcessor
   import boto3
   import pandas as pd
   
   # Detect sentiment
   def sentiment_analysis(detected_text, lang):
   
       comprehend = boto3.client("comprehend")
   
       detect_sent_response = comprehend.batch_detect_sentiment(
               TextList=detected_text, LanguageCode=lang)
   
       # Lists to hold sentiment labels and sentiment scores
       sentiments = []
       pos_score = []
       neg_score = []
       neutral_score = []
       mixed_score = []
   
       # For all results, add the sentiment label and sentiment scores to the lists
       for res in detect_sent_response['ResultList']:
           sentiments.append(res['Sentiment'])
           for key, val in res['SentimentScore'].items():
               if key == "Positive":
                   pos_score.append(val)
               if key == "Negative":
                   neg_score.append(val)
               if key == "Neutral":
                   neutral_score.append(val)
               if key == "Mixed":
                   mixed_score.append(val)
   
       return sentiments, pos_score, neg_score, neutral_score, mixed_score
   ```

   You may also want to perform other analysis operations, such as entity detection or key phrase detection, on your detected text. You can write functions to carry out these analysis operations on your text, just as you did for the preceding sentiment analysis operation.

1. Write the code to carry out entity detection on your detected text.

   ```
   # detect entities
   def entity_detection(detected_text, lang):
   
       comprehend = boto3.client("comprehend")
   
       # Detect entities with the BatchDetectEntities operation
       detect_ent_response = comprehend.batch_detect_entities(
           TextList=detected_text, LanguageCode=lang)
   
       # Lists to hold detected entities and entity types
       ents = []
       types = []
   
       # Get detected entities and types from the response returned by Comprehend
       for i in detect_ent_response['ResultList']:
           if len(i['Entities']) == 0:
               ents.append("N/A")
               types.append("N/A")
           else:
               sentence_ents = []
               sentence_types = []
               for entities in i['Entities']:
                   sentence_ents.append(entities['Text'])
                   sentence_types.append(entities['Type'])
               ents.append(sentence_ents)
               types.append(sentence_types)
   
       return ents, types
   ```

1. Write the code to carry out key phrase detection on your detected text.

   ```
   # Detect key phrases
   def key_phrases_detection(detected_text, lang):
   
       comprehend = boto3.client("comprehend")
   
       key_phrases = []
       detect_phrases_response = comprehend.batch_detect_key_phrases(
           TextList=detected_text, LanguageCode=lang)
       for i in detect_phrases_response['ResultList']:
           if len(i['KeyPhrases']) == 0:
               key_phrases.append("N/A")
           else:
               phrases = []
               for phrase in i['KeyPhrases']:
                   phrases.append(phrase['Text'])
               key_phrases.append(phrases)
   
       return key_phrases
   ```

   You need to create a function that invokes all of the code you’ve created so far. The function will use the `DocumentProcessor` class you created in your `detectFileAsync.py` file, and then save the detected text to a variable for input into the three functions you previously wrote that use Amazon Comprehend. The function will also need to construct a pandas DataFrame, into which the detected text and analysis data will be inserted. Finally, the DataFrame will be saved as a CSV file.

1. Write the code to process your input documents with Textract and pass the detected text to Comprehend.

   ```
   def process_document(roleArn, bucket, document, region_name):
   
       # Create an analyzer from the DocumentProcessor class, create a topic and queue,
       # use Textract to get the text, then delete the topic and queue
       analyzer = DocumentProcessor(roleArn, bucket, document, region_name)
       analyzer.CreateTopicandQueue()
       extracted_text = analyzer.ProcessDocument()
       analyzer.DeleteTopicandQueue()
   
       # Detect the dominant language by sampling the first few lines of text
       comprehend = boto3.client("comprehend")
       response = comprehend.detect_dominant_language(Text=str(extracted_text[:10]))
       lang = ""
       for i in response['Languages']:
           lang = i['LanguageCode']
       print("Detected language: " + lang)
   
       # Or you can enter a language code directly
       # lang = "en"
   
       print("Lines in detected text: " + str(len(extracted_text)))
   
       # Comprehend batch operations accept at most 25 documents per request,
       # so split the detected text into slices of up to 25 lines
       sliced_list = []
       start = 0
       end = 25
       while start < len(extracted_text):
           sliced_list.append(extracted_text[start:end])
           start += 25
           end += 25
       print(sliced_list)
   
       # Create lists to hold analytics data, these will be turned into columns
       all_sents = []
       all_scores = []
       all_ents = []
       all_types = []
       all_key_phrases = []
       all_pos_ratings = []
       all_neg_ratings = []
       all_neutral_ratings = []
       all_mixed_ratings = []
   
       # For every slice, get sentiment analysis, entity detection and key phrases, append results to lists
       for slice in sliced_list:
           slice_labels, pos_ratings, neg_ratings, neutral_ratings, mixed_ratings = sentiment_analysis(slice, lang)
           all_sents.append(slice_labels)
           all_pos_ratings.append(pos_ratings)
           all_neg_ratings.append(neg_ratings)
           all_neutral_ratings.append(neutral_ratings)
           all_mixed_ratings.append(mixed_ratings)
           slice_ents, slice_types = entity_detection(slice, lang)
           all_ents.append(slice_ents)
           all_types.append(slice_types)
           key_phrases = key_phrases_detection(slice, lang)
           all_key_phrases.append(key_phrases)
   
       # List comprehensions to flatten the lists of lists into single lists
       extracted_text = [line for sublist in sliced_list for line in sublist]
       all_sents = [sent for sublist in all_sents for sent in sublist]
       all_ents = [ents for sublist in all_ents for ents in sublist]
       all_types = [types for sublist in all_types for types in sublist]
       all_key_phrases = [kp for sublist in all_key_phrases for kp in sublist]
       all_mixed_ratings = [rating for sublist in all_mixed_ratings for rating in sublist]
       all_pos_ratings = [rating for sublist in all_pos_ratings for rating in sublist]
       all_neg_ratings = [rating for sublist in all_neg_ratings for rating in sublist]
       all_neutral_ratings = [rating for sublist in all_neutral_ratings for rating in sublist]
   
       print(len(extracted_text))
       print(len(all_sents))
       print(len(all_ents))
       print(len(all_types))
       print(len(all_key_phrases))
   
       print("List of Recognized Entities:")
   
       # Create dataframe and save as CSV
       df = pd.DataFrame({'Sentences':extracted_text, 'Sentiment':all_sents, 'SentPosScore':all_pos_ratings,
                          'SentNegScore':all_neg_ratings, 'SentNeutralScore':all_neutral_ratings, 'SentMixedRatings':all_mixed_ratings,
                          'Entities':all_ents, 'EntityTypes':all_types, 'KeyPhrases':all_key_phrases})
       analysis_results = str(document.replace(".","_") + "_" + "analysis" + ".csv")
       df.to_csv(analysis_results, index=False)
   
       print(df)
       print("Data written to file!")
   
       return extracted_text, analysis_results
   ```

1. Write the code to process your documents and upload the resulting data to Amazon S3. In the code sample below, replace the value of `roleArn` with the ARN of the IAM role you configured for use with Amazon Textract. Replace the value of `region_name` with the AWS Region your account operates in. Finally, replace the value of `bucket_name` with the name of the Amazon S3 bucket containing your documents.

   ```
   def main():
   
       # Initialize S3 client and set RoleArn, region name, and bucket name
       s3 = boto3.client("s3")
       roleArn = ''
       region_name = ''
       bucket_name = ''
   
       # initialize global corpus
       full_corpus = []
   
       # to hold all docs in bucket
       docs_list = []
   
       # loop through docs in bucket, get names of all docs
       s3_resource = boto3.resource("s3")
       bucket = s3_resource.Bucket(bucket_name)
       for bucket_object in bucket.objects.all():
           docs_list.append(bucket_object.key)
       print(docs_list)
   
       # For all the docs in the bucket, invoke the document processing function,
       # add the detected text to the corpus of all text in the batch of docs,
       # and upload a CSV of the Comprehend analysis data and Textract-detected text to S3
       for i in docs_list:
           detected_text, analysis_results = process_document(roleArn, bucket_name, i, region_name)
           full_corpus.append(detected_text)
           print("Uploading file: {}".format(str(analysis_results)))
           name_of_file = str(analysis_results)
           s3.upload_file(name_of_file, bucket_name, name_of_file)
   
       # print the global corpus
       print(full_corpus)
   
   if __name__ == "__main__":
       main()
   ```

1. Put the preceding code in this section into a Python file and run it. 

You have successfully extracted text using Amazon Textract, sent the text to Amazon Comprehend for analysis, and then saved the results in an Amazon S3 bucket.
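
As a quick sanity check on the results, you can download one of the `*_analysis.csv` files and tally its sentiment labels. This sketch is illustrative (the helper name is not part of the tutorial code) and assumes the `Sentiment` column name used by the DataFrame above:

```
import csv

def summarize_analysis_csv(csv_path):
    """Counts the rows per sentiment label in an analysis CSV
    produced by the tutorial script."""
    counts = {}
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            label = row.get("Sentiment", "UNKNOWN")
            counts[label] = counts.get(label, 0) + 1
    return counts
```

For example, after downloading a results file with `s3.download_file`, calling this helper on it returns a dictionary such as `{'POSITIVE': 12, 'NEUTRAL': 8}`.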

# Additional Code Samples
<a name="other-examples"></a>

The following table provides links to more Amazon Textract code examples.


| Example | Description | 
| --- | --- | 
|  [Amazon Textract Code Samples](https://github.com/aws-samples/amazon-textract-code-samples)  |  Shows various ways in which you can use Amazon Textract.  | 
|  [Large scale document processing with Amazon Textract](https://github.com/aws-samples/amazon-textract-serverless-large-scale-document-processing)  |  Shows a serverless reference architecture that processes documents at a large scale.  | 
|  [Amazon Textract Parser](https://github.com/aws-samples/amazon-textract-response-parser)  |  Shows how to parse the [Block](API_Block.md) objects returned by Amazon Textract operations.  | 
|  [Amazon Textract Documentation Code Examples](https://github.com/awsdocs/aws-doc-sdk-examples/tree/master/python/example_code/textract)  |  Code examples used in this guide.  | 
|  [Textractor](https://github.com/aws-samples/amazon-textract-textractor)  |  Shows how to convert Amazon Textract output into multiple formats.  | 
|  [Generate Searchable PDF documents with Amazon Textract](https://github.com/aws-samples/amazon-textract-searchable-pdf)  |  Shows how to create a searchable PDF document from different types of input documents such as JPG/PNG format images and scanned PDF documents.  | 