Setting text extraction options
By default, Amazon Comprehend performs the following actions to extract text from a file, based on the input file type:
- Word files – Amazon Comprehend parser extracts the text.
- Digital PDF files – Amazon Comprehend parser extracts the text.
- Image files and scanned PDF files – Amazon Comprehend uses the Amazon Textract DetectDocumentText API to extract the text.
For image files and PDF files, you can use the DocumentReaderConfig parameter to override these default extraction actions. This parameter is available when you use the Amazon Comprehend console or API for real-time or asynchronous custom analysis.
The DocumentReaderConfig parameter contains three fields:
- DocumentReadMode – Set to SERVICE_DEFAULT for Amazon Comprehend to perform the default actions. Set to FORCE_DOCUMENT_READ_ACTION to use Amazon Textract to parse digital PDF files.
- DocumentReadAction – Sets the Amazon Textract API (DetectDocumentText or AnalyzeDocument) to use when Amazon Comprehend uses Amazon Textract for text extraction.
- FeatureTypes – If you set DocumentReadAction to use the AnalyzeDocument API operation, you can add one or both of the FeatureTypes (TABLES, FORMS). These features provide additional information about the tables and forms in the document. For more information about these features, see Amazon Textract Document Analysis Response Objects.
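
As a minimal sketch, the three fields can be assembled into a single dictionary before it is passed to the API. The helper name `build_reader_config` is hypothetical and not part of any AWS SDK; only the field names and values come from this section. Rejecting FeatureTypes for the DetectDocumentText action is a choice made in this sketch, not documented service behavior.

```python
from typing import Dict, List, Optional, Union

# Values named in the field descriptions and use cases in this section.
READ_MODES = ("SERVICE_DEFAULT", "FORCE_DOCUMENT_READ_ACTION")
READ_ACTIONS = ("TEXTRACT_DETECT_DOCUMENT_TEXT", "TEXTRACT_ANALYZE_DOCUMENT")
FEATURE_TYPES = ("TABLES", "FORMS")

ReaderConfig = Dict[str, Union[str, List[str]]]


def build_reader_config(
    read_action: str,
    read_mode: str = "SERVICE_DEFAULT",
    feature_types: Optional[List[str]] = None,
) -> ReaderConfig:
    """Hypothetical helper: assemble a DocumentReaderConfig dictionary."""
    if read_mode not in READ_MODES:
        raise ValueError(f"Unknown DocumentReadMode: {read_mode}")
    if read_action not in READ_ACTIONS:
        raise ValueError(f"Unknown DocumentReadAction: {read_action}")

    config: ReaderConfig = {
        "DocumentReadMode": read_mode,
        "DocumentReadAction": read_action,
    }

    if feature_types:
        # Assumption of this sketch: FeatureTypes are accepted only with the
        # AnalyzeDocument action, because the text describes them only there.
        if read_action != "TEXTRACT_ANALYZE_DOCUMENT":
            raise ValueError("FeatureTypes apply only to TEXTRACT_ANALYZE_DOCUMENT")
        unknown = [f for f in feature_types if f not in FEATURE_TYPES]
        if unknown:
            raise ValueError(f"Unknown FeatureTypes: {unknown}")
        config["FeatureTypes"] = list(feature_types)

    return config
```

For example, `build_reader_config("TEXTRACT_ANALYZE_DOCUMENT", "FORCE_DOCUMENT_READ_ACTION", ["TABLES"])` yields a dictionary with all three fields set, while omitting `feature_types` leaves the FeatureTypes key out entirely.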
The following examples show how to configure DocumentReaderConfig for specific use cases:
- Use Amazon Textract for all PDF files.
  - DocumentReadMode – Set to FORCE_DOCUMENT_READ_ACTION.
  - DocumentReadAction – Set to TEXTRACT_DETECT_DOCUMENT_TEXT.
  - FeatureTypes – Not required.
- Use Amazon Textract AnalyzeDocument API for all PDF and image files.
  - DocumentReadMode – Set to FORCE_DOCUMENT_READ_ACTION.
  - DocumentReadAction – Set to TEXTRACT_ANALYZE_DOCUMENT.
  - FeatureTypes – Set to TABLES, FORMS, or both features.
- Use Amazon Textract AnalyzeDocument API for scanned PDF files and all image files.
  - DocumentReadMode – Set to SERVICE_DEFAULT.
  - DocumentReadAction – Set to TEXTRACT_ANALYZE_DOCUMENT.
  - FeatureTypes – Set to TABLES, FORMS, or both features.
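
The three use cases above can be written out as request dictionaries, sketched here in Python. The constant names, the function name `classify_file`, and its arguments are placeholders introduced for illustration. The call itself assumes the boto3 Comprehend client's `classify_document` operation, which accepts `Bytes` and `DocumentReaderConfig` for real-time custom classification; check the SDK reference for the operation you actually use.

```python
from typing import Any, Dict

# Use case 1: Amazon Textract for all PDF files.
TEXTRACT_FOR_ALL_PDFS = {
    "DocumentReadMode": "FORCE_DOCUMENT_READ_ACTION",
    "DocumentReadAction": "TEXTRACT_DETECT_DOCUMENT_TEXT",
    # FeatureTypes is not required for DetectDocumentText.
}

# Use case 2: AnalyzeDocument for all PDF and image files.
ANALYZE_ALL_PDFS_AND_IMAGES = {
    "DocumentReadMode": "FORCE_DOCUMENT_READ_ACTION",
    "DocumentReadAction": "TEXTRACT_ANALYZE_DOCUMENT",
    "FeatureTypes": ["TABLES", "FORMS"],
}

# Use case 3: AnalyzeDocument for scanned PDF files and all image files.
# Only one feature is requested here to show that either may be used alone.
ANALYZE_SCANNED_PDFS_AND_IMAGES = {
    "DocumentReadMode": "SERVICE_DEFAULT",
    "DocumentReadAction": "TEXTRACT_ANALYZE_DOCUMENT",
    "FeatureTypes": ["TABLES"],
}


def classify_file(
    path: str, endpoint_arn: str, reader_config: Dict[str, Any]
) -> Dict[str, Any]:
    """Send one file to a custom classification endpoint (placeholder names)."""
    import boto3  # imported lazily so the constants above need no AWS SDK

    client = boto3.client("comprehend")
    with open(path, "rb") as f:
        document_bytes = f.read()
    return client.classify_document(
        EndpointArn=endpoint_arn,
        Bytes=document_bytes,
        DocumentReaderConfig=reader_config,
    )
```

Keeping the configurations as plain dictionaries makes it easy to reuse the same settings across requests. Running `classify_file` requires AWS credentials and an existing custom classification endpoint.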
For more information about the Amazon Textract options, see DocumentReaderConfig.