

# Running analysis jobs for custom entity recognition
<a name="detecting-cer"></a>

You can run an asynchronous analysis job to detect custom entities in a set of one or more documents.

**Before you begin**  
You need a custom entity recognition model (also known as a recognizer) before you can detect custom entities. For more information about these models, see [Training custom entity recognizer models](training-recognizers.md). 

A recognizer that is trained with plain-text annotations supports entity detection for plain-text documents only. A recognizer that is trained with PDF document annotations supports entity detection for plain-text documents, images, PDF files, and Word documents. For files other than text files, Amazon Comprehend performs text extraction before running the analysis. For information about the input files, see [Inputs for asynchronous custom analysis](idp-inputs-async.md).

If you plan to analyze image files or scanned PDF documents, your IAM policy must grant permissions to use two Amazon Textract API methods (DetectDocumentText and AnalyzeDocument). Amazon Comprehend invokes these methods during text extraction. For an example policy, see [Permissions required to perform document analysis actions](security_iam_id-based-policy-examples.md#security-iam-based-policy-perform-cmp-actions).
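For example, an IAM policy statement granting those two Textract actions might look like the following sketch (the broad `Resource` value is an assumption; scope it down as your security requirements dictate):

```
{
    "Effect": "Allow",
    "Action": [
        "textract:DetectDocumentText",
        "textract:AnalyzeDocument"
    ],
    "Resource": "*"
}
```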

To run an async analysis job, you perform the following overall steps:

1. Store the documents in an Amazon S3 bucket.

1. Use the API or console to start the analysis job.

1. Monitor the progress of the analysis job.

1. After the job runs to completion, retrieve the results of the analysis from the S3 bucket that you specified when you started the job.

**Topics**
+ [Starting a custom entity detection job (console)](detecting-cer-async-console.md)
+ [Starting a custom entity detection job (API)](detecting-cer-async-api.md)
+ [Outputs for asynchronous analysis jobs](outputs-cer-async.md)

# Starting a custom entity detection job (console)
<a name="detecting-cer-async-console"></a>

You can use the console to start and monitor an async analysis job for custom entity recognition.

**To start an async analysis job**

1. Sign in to the AWS Management Console and open the Amazon Comprehend console at [https://console.aws.amazon.com/comprehend/](https://console.aws.amazon.com/comprehend/).

1. From the left menu, choose **Analysis jobs** and then choose **Create job**.

1. Give the analysis job a name. The name must be unique within your account and current Region.

1. Under **Analysis type**, choose **Custom entity recognition**.

1. From **Recognizer model**, choose the custom entity recognizer to use.

1. From **Version**, choose the recognizer version to use.

1. (Optional) If you choose to encrypt the data that Amazon Comprehend uses while processing your job, choose **Job encryption**. Then choose whether to use a KMS key associated with the current account, or one from another account.
   + If you are using a key associated with the current account, choose the key ID for **KMS key ID**.
   + If you are using a key associated with a different account, enter the ARN for the key ID under **KMS key ARN**.
**Note**  
For more information on creating and using KMS keys and the associated encryption, see [Key management service (KMS)](https://docs.aws.amazon.com/kms/latest/developerguide/overview.html).

1. Under **Input data**, enter the location of the Amazon S3 bucket that contains your input documents or navigate to it by choosing **Browse S3**. This bucket must be in the same Region as the API that you are calling. The IAM role that you're using for access permissions for the analysis job must have reading permissions for the S3 bucket.

1. (Optional) For **Input format**, you can choose the format of the input documents: one document per file, or one document per line in a single file. The one document per line format applies only to text documents.

1. (Optional) For **Document read mode**, you can override the default text extraction actions. For more information, see [Setting text extraction options](idp-set-textract-options.md). 

1. Under **Output data**, enter the location of the Amazon S3 bucket where Amazon Comprehend should write the job's output data, or navigate to it by choosing **Browse S3**. This bucket must be in the same Region as the API that you are calling. The IAM role that you're using for access permissions for the analysis job must have write permissions for the S3 bucket.

1. (Optional) If you choose to encrypt the output results from your job, choose **Encryption**. Then choose whether to use a KMS key associated with the current account, or one from another account.
   + If you are using a key associated with the current account, choose the key alias or ID for **KMS key ID**.
   + If you are using a key associated with a different account, enter the ARN for the key alias or ID under **KMS key ID**.

1. (Optional) To launch your resources into Amazon Comprehend from a VPC, enter the VPC ID under **VPC** or choose the ID from the drop-down list. 

   1. Choose the subnet under **Subnet(s)**. After you select the first subnet, you can choose additional ones.

   1. Under **Security Group(s)**, choose the security group to use if you specified one. After you select the first security group, you can choose additional ones.
**Note**  
When you use a VPC with your analysis job, the `DataAccessRole` used for the Create and Start operations must have permissions for the VPC to access the output bucket.

1. Choose **Create job** to create the entity recognition job.

# Starting a custom entity detection job (API)
<a name="detecting-cer-async-api"></a>

You can use the API to start and monitor an async analysis job for custom entity recognition.

To start a custom entity detection job with the [StartEntitiesDetectionJob](https://docs.aws.amazon.com/comprehend/latest/APIReference/API_StartEntitiesDetectionJob.html) operation, you provide the EntityRecognizerArn, which is the Amazon Resource Name (ARN) of the trained model. You can find this ARN in the response to the [CreateEntityRecognizer](https://docs.aws.amazon.com/comprehend/latest/APIReference/API_CreateEntityRecognizer.html) operation. 

**Topics**
+ [Detecting custom entities using the AWS Command Line Interface](#detecting-cer-async-api-cli)
+ [Detecting custom entities using the AWS SDK for Java](#detecting-cer-async-api-java)
+ [Detecting custom entities using the AWS SDK for Python (Boto3)](#detecting-cer-async-api-python)
+ [Overriding API actions for PDF files](#detecting-cer-api-pdf)

## Detecting custom entities using the AWS Command Line Interface
<a name="detecting-cer-async-api-cli"></a>

Use the following example for Unix, Linux, and macOS environments. For Windows, replace the backslash (\) Unix continuation character at the end of each line with a caret (^). To detect custom entities in a document set, use the following request syntax:

```
aws comprehend start-entities-detection-job \
     --entity-recognizer-arn "arn:aws:comprehend:region:account number:entity-recognizer/test-6" \
     --job-name infer-1 \
     --data-access-role-arn "arn:aws:iam::account number:role/service-role/AmazonComprehendServiceRole-role" \
     --language-code en \
     --input-data-config "S3Uri=s3://Bucket Name/Bucket Path" \
     --output-data-config "S3Uri=s3://Bucket Name/Bucket Path/" \
     --region region
```

Amazon Comprehend responds with the `JobId` and `JobStatus`, and writes the job's output to the S3 bucket that you specified in the request.

## Detecting custom entities using the AWS SDK for Java
<a name="detecting-cer-async-api-java"></a>

For Amazon Comprehend examples that use Java, see [Amazon Comprehend Java examples](https://github.com/awsdocs/aws-doc-sdk-examples/tree/main/javav2/example_code/comprehend).

## Detecting custom entities using the AWS SDK for Python (Boto3)
<a name="detecting-cer-async-api-python"></a>

This example creates a custom entity recognizer, trains the model, and then runs it in an entity detection job using the AWS SDK for Python (Boto3).

Import the required modules and instantiate the SDK for Python client. The `sys` and `time` modules are used by the polling loop later in this example.

```
import sys
import time
import uuid

import boto3

comprehend = boto3.client("comprehend", region_name="region")
```

Create an entity recognizer: 

```
response = comprehend.create_entity_recognizer(
    RecognizerName="Recognizer-Name-Goes-Here-{}".format(str(uuid.uuid4())),
    LanguageCode="en",
    DataAccessRoleArn="Role ARN",
    InputDataConfig={
        "EntityTypes": [
            {
                "Type": "ENTITY_TYPE"
            }
        ],
        "Documents": {
            "S3Uri": "s3://Bucket Name/Bucket Path/documents"
        },
        "Annotations": {
            "S3Uri": "s3://Bucket Name/Bucket Path/annotations"
        }
    }
)
recognizer_arn = response["EntityRecognizerArn"]
```

List all recognizers: 

```
response = comprehend.list_entity_recognizers()
```

Wait for the entity recognizer to reach TRAINED status: 

```
```
while True:
    response = comprehend.describe_entity_recognizer(
        EntityRecognizerArn=recognizer_arn
    )

    status = response["EntityRecognizerProperties"]["Status"]
    if status == "IN_ERROR":
        sys.exit(1)
    if status == "TRAINED":
        break

    time.sleep(10)
```

Start a custom entities detection job: 

```
response = comprehend.start_entities_detection_job(
    EntityRecognizerArn=recognizer_arn,
    JobName="Detection-Job-Name-{}".format(str(uuid.uuid4())),
    LanguageCode="en",
    DataAccessRoleArn="Role ARN",
    InputDataConfig={
        "InputFormat": "ONE_DOC_PER_LINE",
        "S3Uri": "s3://Bucket Name/Bucket Path/documents"
    },
    OutputDataConfig={
        "S3Uri": "s3://Bucket Name/Bucket Path/output"
    }
)
```
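The `start_entities_detection_job` response includes a `JobId`. To wait for the detection job itself, you can poll the [DescribeEntitiesDetectionJob](https://docs.aws.amazon.com/comprehend/latest/APIReference/API_DescribeEntitiesDetectionJob.html) operation with a loop like the one used for training above. The following helper is a minimal sketch; the 10-second interval is only a suggestion:

```
import time

# Terminal states for an asynchronous entities detection job.
TERMINAL_STATES = {"COMPLETED", "FAILED", "STOPPED"}

def wait_for_entities_job(comprehend, job_id, poll_seconds=10):
    """Poll DescribeEntitiesDetectionJob until the job finishes,
    then return the final status."""
    while True:
        response = comprehend.describe_entities_detection_job(JobId=job_id)
        status = response["EntitiesDetectionJobProperties"]["JobStatus"]
        if status in TERMINAL_STATES:
            return status
        time.sleep(poll_seconds)
```

For example, `wait_for_entities_job(comprehend, response["JobId"])` returns when the job's output is available in the S3 bucket.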

## Overriding API actions for PDF files
<a name="detecting-cer-api-pdf"></a>

For image files and PDF files, you can override the default extraction actions using the `DocumentReaderConfig` parameter in `InputDataConfig`.

The following example defines a JSON file named myInputDataConfig.json to set the `InputDataConfig` values. It sets `DocumentReaderConfig` to use the Amazon Textract `DetectDocumentText` API for all PDF files.

**Example**  

```
"InputDataConfig": {
  "S3Uri": "s3://Bucket Name/Bucket Path",
  "InputFormat": "ONE_DOC_PER_FILE",
  "DocumentReaderConfig": {
      "DocumentReadAction": "TEXTRACT_DETECT_DOCUMENT_TEXT",
      "DocumentReadMode": "FORCE_DOCUMENT_READ_ACTION"
  }
}
```

In the `StartEntitiesDetectionJob` operation, specify the myInputDataConfig.json file as the `InputDataConfig` parameter:

```
  --input-data-config file://myInputDataConfig.json  
```

For more information about the `DocumentReaderConfig` parameters, see [Setting text extraction options](idp-set-textract-options.md).
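If you start the job with the SDK for Python (Boto3) instead of the CLI, the same override is expressed as a nested dictionary in the `InputDataConfig` argument. A sketch with placeholder bucket values:

```
# InputDataConfig with a text-extraction override, as passed to
# start_entities_detection_job. Bucket name and path are placeholders.
input_data_config = {
    "S3Uri": "s3://Bucket Name/Bucket Path",
    "InputFormat": "ONE_DOC_PER_FILE",
    "DocumentReaderConfig": {
        "DocumentReadAction": "TEXTRACT_DETECT_DOCUMENT_TEXT",
        "DocumentReadMode": "FORCE_DOCUMENT_READ_ACTION",
    },
}
```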

# Outputs for asynchronous analysis jobs
<a name="outputs-cer-async"></a>

After an analysis job completes, it stores the results in the S3 bucket that you specified in the request.

## Outputs for text inputs
<a name="outputs-cer-async-text"></a>

For text input files, the output consists of a list of entities for each input document.

The following example shows the output for two documents from an input file named **50_docs**, using the one document per line format.

```
{
    "File": "50_docs",
    "Line": 0,
    "Entities":
    [
        {
            "BeginOffset": 0,
            "EndOffset": 22,
            "Score": 0.9763959646224976,
            "Text": "John Johnson",
            "Type": "JUDGE"
        }
    ]
}
{
    "File": "50_docs",
    "Line": 1,
    "Entities":
    [
        {
            "BeginOffset": 11,
            "EndOffset": 15,
            "Score": 0.9615424871444702,
            "Text": "Thomas Kincaid",
            "Type": "JUDGE"
        }
    ]
}
```
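In the output file, each result is a single JSON object on its own line (the sample above is pretty-printed for readability). A short sketch, not part of the service, that groups the detected entity strings by input line number:

```
import json

def entities_by_line(output_lines):
    """Map each input line number to the list of detected entity texts."""
    results = {}
    for raw in output_lines:
        record = json.loads(raw)
        results[record["Line"]] = [e["Text"] for e in record["Entities"]]
    return results
```

You might call it as `entities_by_line(open("output"))` after downloading and extracting the job's output from the S3 bucket.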

## Outputs for semi-structured inputs
<a name="outputs-cer-async-other"></a>

For semi-structured input documents, the output can include the following additional fields:
+ DocumentMetadata – Extraction information about the document. The metadata includes a list of pages in the document, with the number of characters extracted from each page. This field is present in the response if the request included the `Bytes` parameter.
+ DocumentType – The document type for each page in the input document. This field is present in the response for a request that included the `Bytes` parameter.
+ Blocks – Information about each block of text in the input document. Blocks can nest within a block. A page block contains a block for each line of text, which contains a block for each word. This field is present in the response for a request that included the `Bytes` parameter.
+ BlockReferences – A reference to each block for this entity. This field is present in the response for a request that included the `Bytes` parameter. The field isn't present for text files.
+ Errors – Page-level errors that the system detected while processing the input document. The field is empty if the system encountered no errors.

For more details about these output fields, see [DetectEntities](https://docs.aws.amazon.com/comprehend/latest/APIReference/API_DetectEntities.html) in the *Amazon Comprehend API Reference*.

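Because each entity's BlockReferences point back into the Blocks list by BlockId, you can join the two to locate an entity in the document. A minimal sketch; the field names follow the PDF output example below:

```
def entity_pages(record):
    """Map each entity's text to the set of page numbers it appears on,
    by resolving its BlockReferences against the Blocks list."""
    block_page = {block["Id"]: block["Page"] for block in record.get("Blocks", [])}
    return {
        entity["Text"]: {block_page[ref["BlockId"]] for ref in entity.get("BlockReferences", [])}
        for entity in record.get("Entities", [])
    }
```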
The following example shows the output for a one-page native PDF input document.

**Example output from a custom entity recognition analysis of a PDF document**  

```
{
        "Blocks":
        [
            {
                "BlockType": "LINE",
                "Geometry":
                {
                    "BoundingBox":
                    {
                        "Height": 0.012575757575757575,
                        "Left": 0.0,
                        "Top": 0.0015063131313131314,
                        "Width": 0.02262091503267974
                    },
                    "Polygon":
                    [
                        {
                            "X": 0.0,
                            "Y": 0.0015063131313131314
                        },
                        {
                            "X": 0.02262091503267974,
                            "Y": 0.0015063131313131314
                        },
                        {
                            "X": 0.02262091503267974,
                            "Y": 0.014082070707070706
                        },
                        {
                            "X": 0.0,
                            "Y": 0.014082070707070706
                        }
                    ]
                },
                "Id": "4330efed-6334-4fc4-ba48-e050afa95c8d",
                "Page": 1,
                "Relationships":
                [
                    {
                        "ids":
                        [
                            "f343ce48-583d-4abe-b84b-a232e266450f"
                        ],
                        "type": "CHILD"
                    }
                ],
                "Text": "S-3"
            },
            {
                "BlockType": "WORD",
                "Geometry":
                {
                    "BoundingBox":
                    {
                        "Height": 0.012575757575757575,
                        "Left": 0.0,
                        "Top": 0.0015063131313131314,
                        "Width": 0.02262091503267974
                    },
                    "Polygon":
                    [
                        {
                            "X": 0.0,
                            "Y": 0.0015063131313131314
                        },
                        {
                            "X": 0.02262091503267974,
                            "Y": 0.0015063131313131314
                        },
                        {
                            "X": 0.02262091503267974,
                            "Y": 0.014082070707070706
                        },
                        {
                            "X": 0.0,
                            "Y": 0.014082070707070706
                        }
                    ]
                },
                "Id": "f343ce48-583d-4abe-b84b-a232e266450f",
                "Page": 1,
                "Relationships":
                [],
                "Text": "S-3"
            }
        ],
        "DocumentMetadata":
        {
            "PageNumber": 1,
            "Pages": 1
        },
        "DocumentType": "NativePDF",
        "Entities":
        [
            {
                "BlockReferences":
                [
                    {
                        "BeginOffset": 25,
                        "BlockId": "4330efed-6334-4fc4-ba48-e050afa95c8d",
                        "ChildBlocks":
                        [
                            {
                                "BeginOffset": 1,
                                "ChildBlockId": "cbba5534-ac69-4bc4-beef-306c659f70a6",
                                "EndOffset": 6
                            }
                        ],
                        "EndOffset": 30
                    }
                ],
                "Score": 0.9998825926329088,
                "Text": "0.001",
                "Type": "OFFERING_PRICE"
            },
            {
                "BlockReferences":
                [
                    {
                        "BeginOffset": 41,
                        "BlockId": "f343ce48-583d-4abe-b84b-a232e266450f",
                        "ChildBlocks":
                        [
                            {
                                "BeginOffset": 0,
                                "ChildBlockId": "292a2e26-21f0-401b-a2bf-03aa4c47f787",
                                "EndOffset": 9
                            }
                        ],
                        "EndOffset": 50
                    }
                ],
                "Score": 0.9809727537330395,
                "Text": "6,097,560",
                "Type": "OFFERED_SHARES"
            }
        ],
        "File": "example.pdf",
        "Version": "2021-04-30"
    }
```