

# Annotations
<a name="cer-annotation"></a>

Annotations label entities in context by associating your custom entity types with the locations where they occur in your training documents.

By submitting annotations along with your documents, you can increase the accuracy of the model. With annotations, you're not only providing the location of each entity you're looking for; you're also providing context for the custom entity you're seeking.

For instance, if you're searching for the name John Johnson with the entity type JUDGE, providing your annotation might help the model learn that the person you want to find is a judge. If the model can use that context, Amazon Comprehend won't find people named John Johnson who are attorneys or witnesses. Without your annotations, Amazon Comprehend creates its own version of an annotation, but it won't be as effective at including only judges. Providing your own annotations can help you achieve better results and generate models that are better at leveraging context when extracting custom entities.

**Topics**
+ [Minimum number of annotations](#prep-training-data-ann)
+ [Annotation best practices](#cer-annotation-best-practices)
+ [Plain-text annotation files](cer-annotation-csv.md)
+ [PDF annotation files](cer-annotation-manifest.md)
+ [Annotating PDF files](cer-annotation-pdf.md)

## Minimum number of annotations
<a name="prep-training-data-ann"></a>

The minimum number of input documents and annotations required to train a model depends on the type of annotations. 

**PDF annotations**  
To create a model for analyzing image files, PDFs, or Word documents, train your recognizer using PDF annotations. For PDF annotations, provide at least 250 input documents and at least 100 annotations per entity.  
If you provide a test dataset, the test data must include at least one annotation for each of the entity types specified in the creation request. 

**Plain-text annotations**  
To create a model for analyzing text documents, you can train your recognizer using plain-text annotations.   
For plain-text annotations, provide at least three annotated input documents and at least 25 annotations per entity. If you provide fewer than 50 annotations total, Amazon Comprehend reserves more than 10% of the input documents to test the model (unless you provided a test dataset in the training request). Note that the minimum document corpus size is 5 KB.  
If your input contains only a few training documents, you may encounter an error that the training input data contains too few documents that mention one of the entities. Submit the job again with additional documents that mention the entity.  
If you provide a test dataset, the test data must include at least one annotation for each of the entity types specified in the creation request.  
For an example of how to benchmark a model with a small dataset, see [Amazon Comprehend announces lower annotation limits for custom entity recognition](https://aws.amazon.com/blogs/machine-learning/amazon-comprehend-announces-lower-annotation-limits-for-custom-entity-recognition/) on the AWS blog site.
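Before starting a training job, it can be useful to tally the annotations in a plain-text annotations CSV against these minimums. The following is a sketch of such a check; the CSV path and the default thresholds are illustrative, not part of any Amazon Comprehend API:

```
import csv
from collections import Counter

def check_annotation_minimums(csv_path, min_per_entity=25, min_documents=3):
    # Tally annotations per entity type from a plain-text annotations CSV
    # (columns: File, Line, Begin Offset, End Offset, Type).
    counts = Counter()
    annotated_files = set()
    with open(csv_path, encoding="utf-8") as f:
        for row in csv.DictReader(f, skipinitialspace=True):
            counts[row["Type"]] += 1
            annotated_files.add(row["File"])
    # Entity types that fall short of the per-entity minimum
    short = {t: n for t, n in counts.items() if n < min_per_entity}
    return counts, annotated_files, short
```

If `short` is non-empty, or the number of annotated files is below the document minimum, add more annotated data before submitting the training request.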

## Annotation best practices
<a name="cer-annotation-best-practices"></a>

There are a number of things to consider to get the best results when using annotations, including:
+ Annotate your data with care and verify that you annotate every mention of the entity. Imprecise annotations can lead to poor results.
+ Input data should not contain duplicates, such as a duplicate of a PDF that you are going to annotate. Duplicate samples can contaminate the test set and can negatively affect the training process, model metrics, and model behavior.
+ Make sure that all of your documents are annotated, and that any documents without annotations lack them because they contain no legitimate entities, not because of negligence. For example, if one document annotates "John Doe has been an engineer for 14 years" and another document says "J Doe has been an engineer for 14 years", you should annotate "J Doe" as well as "John Doe". Failing to do so confuses the model and can result in the model not recognizing "J Doe" as ENGINEER. Be consistent within the same document and across documents.
+ In general, more annotations lead to better results.
+ You can train a model with the [minimum number](guidelines-and-limits.md#limits-custom-entity-recognition) of documents and annotations, but adding data usually improves the model. We recommend increasing the volume of annotated data by 10% to increase the accuracy of the model. You can run inference on a test dataset that remains unchanged across model versions, and then compare the metrics for successive versions.
+ Provide documents that resemble real use cases as closely as possible. Synthesized data with repetitive patterns should be avoided. The input data should be as diverse as possible to avoid overfitting and help the underlying model better generalize on real examples.
+ Documents should be diverse in terms of word count. For example, if all documents in the training data are short, the resulting model may have difficulty predicting entities in longer documents.
+ Use the same data distribution for training that you expect to see at inference time, when you're actually detecting your custom entities. For example, if at inference time you expect to submit documents that contain no entities, your training document set should also include documents without entities.
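As an illustration of the duplicate check above, the following sketch hashes document contents to find byte-identical files before annotation. The directory layout is hypothetical:

```
import hashlib
from collections import defaultdict
from pathlib import Path

def find_duplicate_documents(doc_dir):
    # Group files by the SHA-256 of their contents; any group with more
    # than one member is a set of byte-identical duplicates.
    groups = defaultdict(list)
    for path in sorted(Path(doc_dir).rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            groups[digest].append(path.name)
    return [names for names in groups.values() if len(names) > 1]
```

Note that this only catches exact duplicates; near-duplicate documents (for example, re-exported PDFs) require fuzzier comparison.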

For additional suggestions, see [Improving custom entity recognizer performance](https://docs.aws.amazon.com/comprehend/latest/dg/cer-metrics.html#cer-performance).

# Plain-text annotation files
<a name="cer-annotation-csv"></a>

For plain-text annotations, you create a comma-separated value (CSV) file that contains a list of annotations. The CSV file must contain the following columns if your training file input format is **one document per line**.


| File | Line | Begin offset | End offset | Type | 
| --- | --- | --- | --- | --- | 
|  The name of the file containing the document. For example, if one of the document files is located at `s3://my-S3-bucket/test-files/documents.txt`, the value in the `File` column will be `documents.txt`. You must include the file extension (in this case '`.txt`') as part of the file name.  |  The line number containing the entity. Omit this column if your input format is one document per file.  |  The character offset in the input text (relative to the beginning of the line) that shows where the entity begins. The first character is at position 0.  |  The character offset in the input text that shows where the entity ends.  |  The customer-defined entity type. Entity types must be an uppercase, underscore-separated string. We recommend using descriptive entity types such as `MANAGER`, `SENIOR_MANAGER`, or `PRODUCT_CODE`. Up to 25 entity types can be trained per model.  | 

If your training file input format is **one document per file**, you omit the line number column and the **Begin offset** and **End offset** values are the offsets of the entity from the start of the document.

The following example is for one document per line. The file `documents.txt` contains four lines (rows 0, 1, 2, and 3):

```
Diego Ramirez is an engineer in the high tech industry.
Emilio Johnson has been an engineer for 14 years.
J Doe is a judge on the Washington Supreme Court.
Our latest new employee, Mateo Jackson, has been a manager in the industry for 4 years.
```

The CSV file with the list of annotations is as follows: 

```
File, Line, Begin Offset, End Offset, Type
documents.txt, 0, 0, 13, ENGINEER
documents.txt, 1, 0, 14, ENGINEER
documents.txt, 3, 25, 38, MANAGER
```

**Note**  
In the annotations file, line numbers start at 0. In this example, the CSV file contains no entry for line 2 because line 2 of `documents.txt` contains no entity.
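Offsets are easy to get wrong by one character. A quick way to validate annotations like these is to slice each document line with the begin and end offsets and confirm that the result is the entity mention you intended:

```
# Validate the example annotations by slicing documents.txt with the offsets.
lines = [
    "Diego Ramirez is an engineer in the high tech industry.",
    "Emilio Johnson has been an engineer for 14 years.",
    "J Doe is a judge on the Washington Supreme Court.",
    "Our latest new employee, Mateo Jackson, has been a manager in the industry for 4 years.",
]
annotations = [
    (0, 0, 13, "ENGINEER"),   # line, begin offset, end offset, type
    (1, 0, 14, "ENGINEER"),
    (3, 25, 38, "MANAGER"),
]
for line_no, begin, end, entity_type in annotations:
    mention = lines[line_no][begin:end]
    print(f"{entity_type}: {mention}")
```

If a slice returns a clipped or shifted string (for example, `"iego Ramirez "`), the corresponding offsets in the CSV are off.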

**Creating your data files**

It's important to put your annotations in a properly configured CSV file to reduce the risk of errors. To manually configure your CSV file, the following must be true:
+ UTF-8 encoding must be explicitly specified, even if it's used as a default in most cases.
+ The first line contains the column headers: `File`, `Line` (optional), `Begin Offset`, `End Offset`, `Type`.

We highly recommend that you generate the CSV input files programmatically to avoid potential issues.

The following example uses Python to generate a CSV for the annotations shown earlier:

```
import csv 
with open("./annotations/annotations.csv", "w", encoding="utf-8") as csv_file:
    csv_writer = csv.writer(csv_file)
    csv_writer.writerow(["File", "Line", "Begin Offset", "End Offset", "Type"])
    csv_writer.writerow(["documents.txt", 0, 0, 13, "ENGINEER"])
    csv_writer.writerow(["documents.txt", 1, 0, 14, "ENGINEER"])
    csv_writer.writerow(["documents.txt", 3, 25, 38, "MANAGER"])
```

# PDF annotation files
<a name="cer-annotation-manifest"></a>

For PDF annotations, you use SageMaker AI Ground Truth to create a labeled dataset in an augmented manifest file. Ground Truth is a data labeling service that helps you (or a workforce that you employ) to build training datasets for machine learning models. Amazon Comprehend accepts augmented manifest files as training data for custom models. You can provide these files when you create a custom entity recognizer by using the Amazon Comprehend console or the [CreateEntityRecognizer](https://docs.aws.amazon.com/comprehend/latest/APIReference/API_CreateEntityRecognizer.html) API action. 

You can use the Ground Truth built-in task type, Named Entity Recognition, to create a labeling job to have workers identify entities in text. To learn more, see [Named Entity Recognition](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-named-entity-recg.html#sms-creating-ner-console) in the *Amazon SageMaker AI Developer Guide*. To learn more about Amazon SageMaker Ground Truth, see [Use Amazon SageMaker AI Ground Truth to Label Data](https://docs.aws.amazon.com/sagemaker/latest/dg/sms.html).

**Note**  
Using Ground Truth, you can define overlapping labels (text that you associate with more than one label). However, Amazon Comprehend entity recognition does not support overlapping labels.

Augmented manifest files are in JSON lines format. In these files, each line is a complete JSON object that contains a training document and its associated labels. The following example is an augmented manifest file that trains an entity recognizer to detect the professions of individuals who are mentioned in the text:

```
{"source":"Diego Ramirez is an engineer in the high tech industry.","NamedEntityRecognitionDemo":{"annotations":{"entities":[{"endOffset":13,"startOffset":0,"label":"ENGINEER"}],"labels":[{"label":"ENGINEER"}]}},"NamedEntityRecognitionDemo-metadata":{"entities":[{"confidence":0.92}],"job-name":"labeling-job/namedentityrecognitiondemo","type":"groundtruth/text-span","creation-date":"2020-05-14T21:45:27.175903","human-annotated":"yes"}}
{"source":"J Doe is a judge on the Washington Supreme Court.","NamedEntityRecognitionDemo":{"annotations":{"entities":[{"endOffset":5,"startOffset":0,"label":"JUDGE"}],"labels":[{"label":"JUDGE"}]}},"NamedEntityRecognitionDemo-metadata":{"entities":[{"confidence":0.72}],"job-name":"labeling-job/namedentityrecognitiondemo","type":"groundtruth/text-span","creation-date":"2020-05-14T21:45:27.174910","human-annotated":"yes"}}
{"source":"Our latest new employee, Mateo Jackson, has been a manager in the industry for 4 years.","NamedEntityRecognitionDemo":{"annotations":{"entities":[{"endOffset":38,"startOffset":25,"label":"MANAGER"}],"labels":[{"label":"MANAGER"}]}},"NamedEntityRecognitionDemo-metadata":{"entities":[{"confidence":0.91}],"job-name":"labeling-job/namedentityrecognitiondemo","type":"groundtruth/text-span","creation-date":"2020-05-14T21:45:27.174035","human-annotated":"yes"}}
```

Each line in this JSON lines file is a complete JSON object, where the attributes include the document text, the annotations, and other metadata from Ground Truth. The following example is a single JSON object in the augmented manifest file, but it's formatted for readability: 

```
{
  "source": "Diego Ramirez is an engineer in the high tech industry.",
  "NamedEntityRecognitionDemo": {
    "annotations": {
      "entities": [
        {
          "endOffset": 13,
          "startOffset": 0,
          "label": "ENGINEER"
        }
      ],
      "labels": [
        {
          "label": "ENGINEER"
        }
      ]
    }
  },
  "NamedEntityRecognitionDemo-metadata": {
    "entities": [
      {
        "confidence": 0.92
      }
    ],
    "job-name": "labeling-job/namedentityrecognitiondemo",
    "type": "groundtruth/text-span",
    "creation-date": "2020-05-14T21:45:27.175903",
    "human-annotated": "yes"
  }
}
```

In this example, the `source` attribute provides the text of the training document, and the `NamedEntityRecognitionDemo` attribute provides the annotations for the entities in the text. The name of the `NamedEntityRecognitionDemo` attribute is arbitrary, and you provide a name of your choice when you define the labeling job in Ground Truth.

In this example, the `NamedEntityRecognitionDemo` attribute is the *label attribute name*, which is the attribute that provides the labels that a Ground Truth worker assigns to the training data. When you provide your training data to Amazon Comprehend, you must specify one or more label attribute names. The number of attribute names that you specify depends on whether your augmented manifest file is the output of a single labeling job or a chained labeling job.

If your file is the output of a single labeling job, specify the single label attribute name that was used when the job was created in Ground Truth. 

If your file is the output of a chained labeling job, specify the label attribute name for one or more jobs in the chain. Each label attribute name provides the annotations from an individual job. You can specify up to 5 of these attributes for augmented manifest files that are produced by chained labeling jobs. 

In an augmented manifest file, the label attribute name typically follows the `source` key. If the file is the output of a chained job, there will be multiple label attribute names. When you provide your training data to Amazon Comprehend, provide only those attributes that contain annotations that are relevant for your model. Do not specify the attributes that end with "-metadata".
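For example, you could list the candidate label attribute names in a manifest record by excluding `source` and every key that ends with `-metadata`. The record below is a trimmed sketch of a chained-job line; the attribute names `FirstJob` and `SecondJob` are hypothetical:

```
import json

line = ('{"source": "Diego Ramirez is an engineer in the high tech industry.",'
        ' "FirstJob": {"annotations": {"entities": []}},'
        ' "FirstJob-metadata": {},'
        ' "SecondJob": {"annotations": {"entities": []}},'
        ' "SecondJob-metadata": {}}')

record = json.loads(line)
label_attribute_names = [key for key in record
                         if key != "source" and not key.endswith("-metadata")]
print(label_attribute_names)  # ['FirstJob', 'SecondJob']
```

From this candidate list, pass only the attribute names whose annotations are relevant for your model to Amazon Comprehend.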

For more information about chained labeling jobs, and for examples of the output that they produce, see [Chaining Labeling Jobs](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-reusing-data.html) in the *Amazon SageMaker AI Developer Guide*.

# Annotating PDF files
<a name="cer-annotation-pdf"></a>

Before you can annotate your training PDFs in SageMaker AI Ground Truth, complete the following prerequisites:
+ Install Python 3.8.x
+ Install [jq](https://stedolan.github.io/jq/download/)
+ Install the [AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html)

  If you're using the us-east-1 Region, you can skip installing the AWS CLI because it's already installed with your Python environment. In this case, you create a virtual environment to use Python 3.8 in AWS Cloud9.
+ Configure your [AWS credentials](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html)
+ Create a private [SageMaker AI Ground Truth workforce](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-workforce-private-use-cognito.html) to support annotation

  Make sure to record the workteam name you choose in your new private workforce, as you use it during installation.

**Topics**
+ [Setting up your environment](#cer-annotation-pdf-set-up)
+ [Uploading a PDF to an S3 bucket](#cer-annotation-pdf-upload)
+ [Creating an annotation job](#cer-annotation-pdf-job)
+ [Annotating with SageMaker AI Ground Truth](#w2aac35c23c21c19c15)

## Setting up your environment
<a name="cer-annotation-pdf-set-up"></a>

1. If using Windows, install [Cygwin](https://cygwin.com/install.html); if using Linux or Mac, skip this step.

1. Download the [annotation artifacts](http://github.com/aws-samples/amazon-comprehend-semi-structured-documents-annotation-tools) from GitHub. Unzip the file.

1. From your terminal window, navigate to the unzipped folder (**amazon-comprehend-semi-structured-documents-annotation-tools-main**). 

1. This folder includes a choice of `Makefiles` that you run to install dependencies, set up a Python virtualenv, and deploy the required resources. Review the **readme** file to make your choice.

1. The recommended option uses a single command that installs all dependencies into a virtualenv, builds the CloudFormation stack from the template, and deploys the stack to your AWS account with interactive guidance. Run the following command:

   `make ready-and-deploy-guided`

   This command presents a set of configuration options. Be sure your AWS Region is correct. For all other fields, you can either accept the default values or fill in custom values. If you modify the CloudFormation stack name, write it down as you need it in the next steps.  
![\[Terminal session showing CloudFormation configuration options.\]](http://docs.aws.amazon.com/comprehend/latest/dg/images/deploy_guided_anno.png)

   The CloudFormation stack creates and manages the [AWS Lambda](https://aws.amazon.com/lambda/) functions, [AWS IAM](https://aws.amazon.com/iam/) roles, and [Amazon S3](https://aws.amazon.com/s3/) buckets required for the annotation tool.

   You can review each of these resources in the stack details page in the CloudFormation console.

1. The command prompts you to start the deployment. CloudFormation creates all the resources in the specified Region.  
![\[Terminal session showing the deployed CloudFormation configuration.\]](http://docs.aws.amazon.com/comprehend/latest/dg/images/deploy_guided_anno_2.png)

   When the CloudFormation stack status transitions to CREATE_COMPLETE, the resources are ready to use.

## Uploading a PDF to an S3 bucket
<a name="cer-annotation-pdf-upload"></a>

In the [Setting up](#cer-annotation-pdf-set-up) section, you deployed a CloudFormation stack that creates an S3 bucket named **comprehend-semi-structured-documents-${AWS::Region}-${AWS::AccountId}**. You now upload your source PDF documents into this bucket.

**Note**  
This bucket contains the data required for your labeling job. The Lambda Execution Role policy grants permission for the Lambda function to access this bucket.  
You can find the S3 bucket name in the **CloudFormation Stack details** using the '**SemiStructuredDocumentsS3Bucket**' key.

1. Create a new folder in the S3 bucket. Name this new folder '**src**'. 

1. Add your PDF source files to your '**src**' folder. In a later step, you annotate these files to train your recognizer.

1. (Optional) Here's an AWS CLI example you can use to upload your source documents from a local directory into an S3 bucket:

   `aws s3 cp --recursive local-path-to-your-source-docs s3://deploy-guided/src/`

   Or, with your Region and Account ID:

   `aws s3 cp --recursive local-path-to-your-source-docs s3://deploy-guided-Region-AccountID/src/`

1. You now have a private SageMaker AI Ground Truth workforce and have uploaded your source files to the S3 bucket, **deploy-guided/src/**; you're ready to start annotating.

## Creating an annotation job
<a name="cer-annotation-pdf-job"></a>

The **comprehend-ssie-annotation-tool-cli.py** script in the `bin` directory is a simple wrapper command that streamlines the creation of a SageMaker AI Ground Truth labeling job. The Python script reads the source documents from your S3 bucket and creates a corresponding single-page manifest file with one source document per line. The script then creates a labeling job, which requires the manifest file as an input. 

The Python script uses the S3 bucket and CloudFormation stack that you configured in the [Setting up](#cer-annotation-pdf-set-up) section. Required input parameters for the script include:
+ **input-s3-path**: S3 URI of the source documents you uploaded to your S3 bucket. For example: `s3://deploy-guided/src/`. You can also add your Region and Account ID to this path. For example: `s3://deploy-guided-Region-AccountID/src/`.
+ **cfn-name**: The CloudFormation stack name. If you used the default value for the stack name, your cfn-name is **sam-app**.
+ **work-team-name**: The workforce name you created when you built out the private workforce in SageMaker AI Ground Truth.
+ **job-name-prefix**: The prefix for the SageMaker AI Ground Truth labeling job. Note that there is a 29-character limit for this field. A timestamp is appended to this value. For example: `my-job-name-20210902T232116`.
+ **entity-types**: The entities you want to use during your labeling job, separated by commas. This list must include all entities that you want to annotate in your training dataset. The Ground Truth labeling job displays only these entities for annotators to label content in the PDF documents. 

To view additional arguments the script supports, use the `-h` option to display the help content.
+ Run the following script with the input parameters as described in the previous list.

  ```
  python bin/comprehend-ssie-annotation-tool-cli.py \
  --input-s3-path s3://deploy-guided-Region-AccountID/src/ \
  --cfn-name sam-app \
  --work-team-name my-work-team-name \
  --region us-east-1 \
  --job-name-prefix my-job-name-20210902T232116 \
  --entity-types "EntityA, EntityB, EntityC" \
  --annotator-metadata "key=info,value=sample,key=Due Date,value=12/12/2021"
  ```

  The script produces the following output:

  ```
  Downloaded files to temp local directory /tmp/a1dc0c47-0f8c-42eb-9033-74a988ccc5aa
  Deleted downloaded temp files from /tmp/a1dc0c47-0f8c-42eb-9033-74a988ccc5aa
  Uploaded input manifest file to s3://comprehend-semi-structured-documents-us-west-2-123456789012/input-manifest/my-job-name-20220203-labeling-job-20220203T183118.manifest
  Uploaded schema file to s3://comprehend-semi-structured-documents-us-west-2-123456789012/comprehend-semi-structured-docs-ui-template/my-job-name-20220203-labeling-job-20220203T183118/ui-template/schema.json
  Uploaded template UI to s3://comprehend-semi-structured-documents-us-west-2-123456789012/comprehend-semi-structured-docs-ui-template/my-job-name-20220203-labeling-job-20220203T183118/ui-template/template-2021-04-15.liquid
  Sagemaker GroundTruth Labeling Job submitted: arn:aws:sagemaker:us-west-2:123456789012:labeling-job/my-job-name-20220203-labeling-job-20220203t183118
  ```

## Annotating with SageMaker AI Ground Truth
<a name="w2aac35c23c21c19c15"></a>

Now that you have configured the required resources and created a labeling job, you can log in to the labeling portal and annotate your PDFs.

1. Log in to the [SageMaker AI console](https://console.aws.amazon.com/sagemaker) using either the Chrome or Firefox web browser.

1. Select **Labeling workforces** and choose **Private**.

1. Under **Private workforce summary**, select the labeling portal sign-in URL that you created with your private workforce. Sign in with the appropriate credentials.

   If you don't see any jobs listed, don't worry—it can take a while to update, depending on the number of files you uploaded for annotation.

1. Select your task and, in the top right corner, choose **Start working** to open the annotation screen.

   You'll see one of your documents open in the annotation screen and, above it, the entity types you provided during setup. To the right of your entity types, there is an arrow you can use to navigate through your documents.  
![\[The Amazon Comprehend annotation screen.\]](http://docs.aws.amazon.com/comprehend/latest/dg/images/annotation_demo1.png)

   Annotate the open document. You can also remove, undo, or auto tag your annotations on each document; these options are available in the right panel of the annotation tool.  
![\[Available options in the Amazon Comprehend annotation right panel.\]](http://docs.aws.amazon.com/comprehend/latest/dg/images/data_annotation.png)

   To use auto tag, annotate an instance of one of your entities; all other instances of that specific word are then automatically annotated with that entity type.

   Once you've finished, select **Submit** on the bottom right, then use the navigation arrows to move to the next document. Repeat this until you've annotated all your PDFs.

After you annotate all the training documents, you can find the annotations in JSON format in the Amazon S3 bucket at this location:

```
/output/your labeling job name/annotations/
```

The output folder also contains an output manifest file, which lists all the annotations within your training documents. You can find your output manifest file at the following location.

```
/output/your labeling job name/manifests/
```
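Both output locations follow directly from the labeling job name, so you can derive the S3 key prefixes programmatically. A small sketch (the bucket itself is whatever your CloudFormation stack created):

```
def ground_truth_output_prefixes(labeling_job_name):
    # Build the S3 key prefixes where Ground Truth writes the per-document
    # annotations and the output manifest for a labeling job.
    return {
        "annotations": f"output/{labeling_job_name}/annotations/",
        "manifests": f"output/{labeling_job_name}/manifests/",
    }
```

For example, for the job name produced earlier (`my-job-name-20220203-labeling-job-20220203T183118`), the manifest prefix is `output/my-job-name-20220203-labeling-job-20220203T183118/manifests/`.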