GetDocumentContent Output Schema - Amazon Q Business

When you call the GetDocumentContent API with outputFormat set to EXTRACTED, the response returns the extracted text content as JSON. The output uses the following schema:

{
  // always V1 for now
  schemaVersionId: string;
  // always JSON for now
  outputFormat: string;
  // content for plain-text documents
  plainTextDocumentContent: string;
  // content for non-plaintext documents such as PDF, DOCX, PPTX, Audio, Video
  nonPlainTextDocumentContent: List<ExtractedDocumentBodyElement>;
}
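As a rough illustration of how the two content fields might be consumed, the following Python sketch reads an already-parsed payload and returns its text. The function name `extract_text` and the sample payload are hypothetical, and the sketch assumes that the unused content field is either absent or empty, which the schema does not state.

```python
import json


def extract_text(payload: dict) -> str:
    """Return the document text from a parsed EXTRACTED payload (illustrative helper)."""
    # Plain-text documents carry their content in a single string field.
    plain = payload.get("plainTextDocumentContent")
    if plain:
        return plain
    # Other document types carry a list of body elements; join their text fields.
    # Assumption: elements without a "text" value can be skipped.
    elements = payload.get("nonPlainTextDocumentContent") or []
    return "\n".join(e["text"] for e in elements if e.get("text"))


# Hypothetical sample payload, made up for illustration only.
sample = json.loads("""
{
  "schemaVersionId": "V1",
  "outputFormat": "JSON",
  "nonPlainTextDocumentContent": [
    {"elementType": "HEADER", "text": "Quarterly report"},
    {"elementType": "TEXT", "text": "Revenue grew in Q3."}
  ]
}
""")

print(extract_text(sample))
```

Checking the plain-text field first keeps the common case simple. Callers that need structure, such as tables or timestamps, would work with the element list directly instead of the joined text.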

For non-plaintext documents, each entry in nonPlainTextDocumentContent is an ExtractedDocumentBodyElement with the following fields:

{
  text: string;
  // Allowed values: TEXT, ARTICLE, SECTION, DIV, IMAGE_DESCRIPTION, CODE,
  // TABLE, LIST, URL, HEADER, FOOTER, FORM, MENU, AUDIO, VIDEO
  elementType: string;
  horizontalHeaderIndex: integer;
  verticalHeaderIndex: integer;
  htmlDocumentTitle: string;
  sectionTitle: string;
  sectionBody: string;
  tableCaption: string;
  tableFooter: string;
  tableRowHeaders: List<List<string>>;
  tableColumnHeaders: List<List<string>>;
  tableRows: List<List<string>>;
  tableRowsCount: integer;
  tableColumnsCount: integer;
  tableId: string;
  tokens: List<struct>;
  {
    value: string;
    startOffsets: integer;
    endOffsets: integer;
  }
  tableType: string;
  tableSummary: string;
  columnInfoList: List<struct>;
  {
    columnName: string;
    columnSummary: string;
    columnType: string;
    columnRepresentativeValues: List<string>
  }
  // Audio/Video specific fields below
  overallSummary: string;
  audioSummaryList: List<struct>;
  {
    summaryText: string;
    startTimeMilliseconds: string;
    endTimeMilliseconds: string;
  }
  videoSummaryList: List<struct>;
  {
    summaryText: string;
    startTimeMilliseconds: string;
    endTimeMilliseconds: string;
  }
  audioTranscriptList: List<struct>;
  {
    transcriptText: string;
    startTimeMilliseconds: string;
    endTimeMilliseconds: string;
  }
  videoTranscriptList: List<struct>;
  {
    transcriptText: string;
    startTimeMilliseconds: string;
    endTimeMilliseconds: string;
  }
}
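The Python sketch below shows one way a client might read the table and transcript fields of a single element. The helper names `table_to_records` and `transcript_segments` and the sample elements are hypothetical. It also rests on two assumptions that the schema does not confirm: that the last row of tableColumnHeaders names the columns, and that the millisecond fields hold decimal integers encoded as strings.

```python
def table_to_records(element: dict) -> list[dict]:
    """Turn a TABLE element into one dict per row, keyed by column header."""
    # Assumption: with multi-row headers, the last header row names the columns.
    header_rows = element.get("tableColumnHeaders") or []
    names = header_rows[-1] if header_rows else []
    return [dict(zip(names, row)) for row in element.get("tableRows") or []]


def transcript_segments(element: dict, key: str = "audioTranscriptList") -> list[tuple]:
    """Return (start_seconds, end_seconds, text) tuples from a transcript list."""
    segments = []
    for item in element.get(key) or []:
        # Assumption: timestamps are integer milliseconds encoded as strings.
        start = int(item["startTimeMilliseconds"]) / 1000
        end = int(item["endTimeMilliseconds"]) / 1000
        segments.append((start, end, item["transcriptText"]))
    return segments


# Hypothetical sample elements, made up for illustration only.
table_element = {
    "elementType": "TABLE",
    "tableColumnHeaders": [["Region", "Sales"]],
    "tableRows": [["EMEA", "120"], ["APAC", "95"]],
}
audio_element = {
    "elementType": "AUDIO",
    "audioTranscriptList": [
        {"transcriptText": "Welcome.", "startTimeMilliseconds": "0", "endTimeMilliseconds": "2500"},
    ],
}

print(table_to_records(table_element))
print(transcript_segments(audio_element))
```

Passing `key="videoTranscriptList"` reads video transcripts the same way, because both lists share the same entry shape in the schema.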