Documents
Standard output for documents lets you set the response granularity you're interested in, as well as the output format and the text format of the results. Below are some of the outputs you can enable.
Note
BDA can process DOCX files by first converting them into PDFs. This means page number mapping will not work for DOCX files. Images of the converted PDFs will be uploaded to your output bucket if the JSON+ option and page granularity are selected.
Response Granularity
Response granularity determines what kind of response you want to receive from document text extraction. Each level of granularity gives you more and more separated responses, with page providing all of the text extracted together, and word providing each word as a separate response. The available granularity levels are:
- Page level granularity – This is enabled by default. Page level granularity provides each page of the document in the text output format of your choice. If you're processing a PDF, enabling this level of granularity will detect and return embedded hyperlinks.
- Element level granularity (Layout) – This is enabled by default. Provides the text of the document in the output format of your choice, separated into different elements, such as figures, tables, or paragraphs. These elements are returned in logical reading order based on the structure of the document. If you're processing a PDF, enabling this level of granularity will detect and return embedded hyperlinks.
- Word level granularity – Provides information about individual words without using broader context analysis. Provides you with each word and its location on the page.
Output Settings
Output settings determine the way your downloaded results will be structured. This setting is exclusive to the console. The options for output settings are:
- JSON – The default output structure for document analysis. Provides a JSON output file with the information from your configuration settings.
  - Async InvokeDataAutomationAsync API: JSON output for the async API is written to S3 only.
  - Sync InvokeDataAutomation API: JSON output can be sent to S3 or returned inline by using outputconfiguration. If S3 is selected, the output JSON goes to S3 only (not inline). If S3 is not provided, the sync API returns the JSON inline only.
- JSON+files – Only available for the Async InvokeDataAutomationAsync API. Using this setting generates both a JSON output and files that correspond with different outputs. For example, this setting gives you a text file for the overall text extraction, a markdown file for the text with structural markdown, and CSV files for each table that's found in the text. Figures located inside a document will be saved, as well as figure crops and rectified images. Also, if you are processing a DOCX file and have this option selected, the converted PDF of your DOCX file will be in the output folder. These outputs are located in standard_output/logical_doc_id/assets/ in your output folder.
Note
- The sync API does not output any additional files beyond the JSON. The output JSON contains only the text format that was selected as part of the Standard Output Text format. The sync API will not output figure crops or rectified images.
- DOCX files are not supported by the sync API.
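The rules above for where results end up can be summarized in a short sketch. This is not BDA code: `OutputPlan` and `plan_output` are hypothetical names introduced here for illustration only, and the function simply encodes the statements in this section (async writes to S3 only, sync returns inline unless S3 is configured, JSON+files and DOCX are async only).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OutputPlan:
    json_destination: str  # "s3" or "inline"
    extra_files: bool      # True only when JSON+files output is produced


def plan_output(api: str, s3_configured: bool,
                want_files: bool = False, is_docx: bool = False) -> OutputPlan:
    """Hypothetical helper: apply the output rules described in this section.

    `api` is "async" (InvokeDataAutomationAsync) or "sync" (InvokeDataAutomation).
    """
    if api == "async":
        # Async JSON output is S3 only, so an S3 location is required.
        if not s3_configured:
            raise ValueError("async output is written to S3 only")
        return OutputPlan("s3", want_files)
    if api == "sync":
        if is_docx:
            raise ValueError("DOCX is not supported by the sync API")
        if want_files:
            raise ValueError("JSON+files is only available for the async API")
        # Sync: S3 if configured, otherwise inline; never both.
        return OutputPlan("s3" if s3_configured else "inline", False)
    raise ValueError(f"unknown api: {api!r}")


if __name__ == "__main__":
    print(plan_output("sync", s3_configured=False))
    print(plan_output("async", s3_configured=True, want_files=True))
```

Raising an error for unsupported combinations, rather than silently downgrading the request, keeps the sketch aligned with the restrictions stated in the notes.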
Text Format
Text format determines the different kinds of texts that will be provided via various extraction operations. You can select any number of the following options for your text format.
- Plaintext – This setting provides a text-only output with no formatting or other markdown elements noted.
- Text with markdown – The default output setting for standard output. Provides text with markdown elements integrated.
- Text with HTML – Provides text with HTML elements integrated in the response.
- CSV – Provides a CSV structured output for tables within the document. This will only give a response for tables, and not other elements of the document.
Bounding Boxes and Generative Fields
For Documents, there are two response options that change their output based on the selected granularity. These are Bounding Boxes and Generative Fields. Selecting Bounding Boxes will provide a visual outline of the element or word you click on in the console response dropdown. This lets you track down particular elements of your response more easily. Bounding Boxes are returned in your JSON as the coordinates of the four corners of the box.
When you select Generative Fields, BDA generates a summary of the document in both a 10-word and a 250-word version. If you also select elements as a response granularity, BDA generates a descriptive caption for each figure detected in the document. Figures include things like charts, graphs, and images.
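Outside the console, the choices described so far (granularity, text format, bounding boxes, generative fields, and additional files) are usually expressed as one configuration object. The following is a minimal sketch that only builds a plain dictionary. The helper name `build_document_standard_output` is hypothetical, and the key names and enum values (`"PAGE"`, `"MARKDOWN"`, `"ENABLED"`, and so on) are assumptions modeled on the standard output configuration shape; check them against the current API reference before relying on them.

```python
import json


def build_document_standard_output(granularity=("PAGE", "ELEMENT"),
                                   text_formats=("MARKDOWN",),
                                   bounding_boxes=False,
                                   generative_fields=False,
                                   additional_files=False):
    """Hypothetical helper: assemble a document standard output config.

    Defaults mirror this section: page and element granularity, markdown text.
    Key names and enum values are assumptions, not confirmed by this page.
    """
    def state(flag):
        return {"state": "ENABLED" if flag else "DISABLED"}

    return {
        "document": {
            "extraction": {
                "granularity": {"types": list(granularity)},
                "boundingBox": state(bounding_boxes),
            },
            "generativeField": state(generative_fields),
            "outputFormat": {
                "textFormat": {"types": list(text_formats)},
                # Corresponds to the JSON+files output setting.
                "additionalFileFormat": state(additional_files),
            },
        }
    }


if __name__ == "__main__":
    config = build_document_standard_output(
        granularity=("PAGE", "ELEMENT", "WORD"),
        text_formats=("PLAIN_TEXT", "CSV"),
        bounding_boxes=True,
        generative_fields=True,
    )
    print(json.dumps(config, indent=2))
```

Keeping the builder free of any SDK call makes it easy to inspect or unit test the configuration before it is sent anywhere.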
Additional file format metadata JSON
When you receive your additional files from the additional file formats flag, you get a JSON file for any rectified images that are extracted. BDA rectifies rotated images by using a homography to rotate the image so that it sits at a 90-degree angle. An example of the JSON is below:
"asset_metadata": {
    "rectified_image": "s3://bucket/prefix.png",
    "rectified_image_width_pixels": 1700,
    "rectified_image_height_pixels": 2200,
    "corners": [
        [ 0.006980135689736235, -0.061692718505859376 ],
        [ 1.10847711439684, 0.00673927116394043 ],
        [ 0.994479346419327, 1.050548828125 ],
        [ -0.11249661383904497, 0.9942819010416667 ]
    ]
}
Corners represent the detected corners of an image, used to form a homography of the document. This homography is used to rotate the image while maintaining its other properties.
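As a rough illustration of reading this metadata, the sketch below parses the sample and scales each corner to pixel units. It rests on an assumption the text does not confirm: that the corner values are fractions of the image width and height (the sample values near 0 and 1 suggest this) and that scaling by the rectified dimensions is meaningful. The function name `corners_in_pixels` is hypothetical.

```python
import json

# The sample from above, wrapped in braces so it parses as a JSON object.
SAMPLE = """
{
  "asset_metadata": {
    "rectified_image": "s3://bucket/prefix.png",
    "rectified_image_width_pixels": 1700,
    "rectified_image_height_pixels": 2200,
    "corners": [
      [0.006980135689736235, -0.061692718505859376],
      [1.10847711439684, 0.00673927116394043],
      [0.994479346419327, 1.050548828125],
      [-0.11249661383904497, 0.9942819010416667]
    ]
  }
}
"""


def corners_in_pixels(asset_metadata):
    """Scale each (x, y) corner by the rectified width and height.

    Assumption: corners are normalized fractions of the image dimensions.
    Values below 0 or above 1 then fall outside the image frame.
    """
    width = asset_metadata["rectified_image_width_pixels"]
    height = asset_metadata["rectified_image_height_pixels"]
    return [(round(x * width), round(y * height))
            for x, y in asset_metadata["corners"]]


if __name__ == "__main__":
    metadata = json.loads(SAMPLE)["asset_metadata"]
    for corner in corners_in_pixels(metadata):
        print(corner)
```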