GetDocumentContent Output Schema
When you use the GetDocumentContent API with outputFormat set to EXTRACTED, the response returns the extracted text content in JSON format. The output schema is as follows:
{
    // always V1 for now
    schemaVersionId: string;
    // always JSON for now
    outputFormat: string;
    // content for plain-text documents
    plainTextDocumentContent: string;
    // content for non-plaintext documents such as PDF, DOCX, PPTX, Audio, Video
    nonPlainTextDocumentContent: List<ExtractedDocumentBodyElement>;
}
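As a rough illustration of how a client might consume this top-level schema, the Python sketch below picks whichever content field is populated. The function name `extract_text` and the sample payload are hypothetical, and the sketch assumes that the field that does not apply is absent, null, or empty, and that each element's `text` field holds its readable content; neither assumption is stated by the schema itself.

```python
import json


def extract_text(raw_json: str) -> str:
    """Return the document text from an EXTRACTED-format payload."""
    doc = json.loads(raw_json)
    # Assumption: the field that does not apply is absent, null, or empty.
    plain = doc.get("plainTextDocumentContent")
    if plain:
        return plain
    elements = doc.get("nonPlainTextDocumentContent") or []
    # Assumption: each element's `text` field carries its readable content.
    return "\n".join(e["text"] for e in elements if e.get("text"))


# Hypothetical payload, hand-written for illustration only.
sample = json.dumps({
    "schemaVersionId": "V1",
    "outputFormat": "JSON",
    "nonPlainTextDocumentContent": [
        {"elementType": "HEADER", "text": "Quarterly report"},
        {"elementType": "TEXT", "text": "Revenue grew."},
    ],
})
print(extract_text(sample))
```

Joining elements with newlines is only one possible choice; a caller that cares about structure would likely branch on `elementType` instead.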
Content for non-plaintext documents is returned as a list of ExtractedDocumentBodyElement structures, each of which has the following fields:
{
    text: string;
    // Allowed values: TEXT, ARTICLE, SECTION, DIV, IMAGE_DESCRIPTION, CODE,
    // TABLE, LIST, URL, HEADER, FOOTER, FORM, MENU, AUDIO, VIDEO
    elementType: string;
    horizontalHeaderIndex: integer;
    verticalHeaderIndex: integer;
    htmlDocumentTitle: string;
    sectionTitle: string;
    sectionBody: string;
    tableCaption: string;
    tableFooter: string;
    tableRowHeaders: List<List<string>>;
    tableColumnHeaders: List<List<string>>;
    tableRows: List<List<string>>;
    tableRowsCount: integer;
    tableColumnsCount: integer;
    tableId: string;
    tokens: List<struct>;
    {
        value: string;
        startOffsets: integer;
        endOffsets: integer;
    }
    tableType: string;
    tableSummary: string;
    columnInfoList: List<struct>;
    {
        columnName: string;
        columnSummary: string;
        columnType: string;
        columnRepresentativeValues: List<string>
    }
    // Audio/Video specific fields below
    overallSummary: string;
    audioSummaryList: List<struct>;
    {
        summaryText: string;
        startTimeMilliseconds: string;
        endTimeMilliseconds: string;
    }
    videoSummaryList: List<struct>;
    {
        summaryText: string;
        startTimeMilliseconds: string;
        endTimeMilliseconds: string;
    }
    audioTranscriptList: List<struct>;
    {
        transcriptText: string;
        startTimeMilliseconds: string;
        endTimeMilliseconds: string;
    }
    videoTranscriptList: List<struct>;
    {
        transcriptText: string;
        startTimeMilliseconds: string;
        endTimeMilliseconds: string;
    }
}
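The audio and video fields type their timestamps as strings, so a client will probably want to convert them before ordering or comparing segments. The Python sketch below shows one way to do that for the transcript lists. The helper name `transcript_segments` and the sample element are hypothetical, and the sketch assumes the timestamp strings contain plain decimal integers, which the schema does not confirm.

```python
def transcript_segments(element: dict) -> list[tuple[int, int, str]]:
    """Collect (start_ms, end_ms, text) tuples from one body element."""
    segments = []
    for key in ("audioTranscriptList", "videoTranscriptList"):
        # Either list may be missing or null for a given element.
        for item in element.get(key) or []:
            segments.append((
                # Assumption: timestamps are decimal integers stored as strings.
                int(item["startTimeMilliseconds"]),
                int(item["endTimeMilliseconds"]),
                item["transcriptText"],
            ))
    # Order by start time so the segments read chronologically.
    return sorted(segments, key=lambda seg: seg[0])


# Hypothetical element, hand-written for illustration only.
audio_element = {
    "elementType": "AUDIO",
    "audioTranscriptList": [
        {"transcriptText": "second", "startTimeMilliseconds": "1500",
         "endTimeMilliseconds": "3000"},
        {"transcriptText": "first", "startTimeMilliseconds": "0",
         "endTimeMilliseconds": "1500"},
    ],
}
for start_ms, end_ms, text in transcript_segments(audio_element):
    print(f"{start_ms}-{end_ms} ms: {text}")
```

The same pattern would apply to audioSummaryList and videoSummaryList, substituting summaryText for transcriptText.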