

The output of an SDK extract operation is a zip package containing the extracted JSON and a renditions folder for each element type selected. The folder name is either "tables" or "figures", depending on your specified element type. Each folder holds renditions with filenames that correspond to the element information in the extracted JSON.

The following is a summary of key elements in the extracted JSON:

Elements : Ordered list of semantic elements (like headings, paragraphs, tables, figures) found in the document, on the basis of position in the structure tree of the document. The output does not include headers or footers. In addition, headings that repeat across pages are reported for the first occurrence only.

Bounds : Bounding box enclosing the content items forming this element. Not reported for elements which don't have any content. The bounds are as per PDF specification coordinates. PDF pages are generally specified in inches (for example, an A4 page is 8.3 inches x 11.7 inches). If values are required in coordinates, we need a DPI (dots per inch) value. As per the PDF specification, 72 DPI is used when creating a PDF, so the width of an A4 page is specified to be ~= 598 units (8.3 inches x 72). All values reported in Extract use these 72 DPI based coordinates. Again as per the PDF spec, absolute values of bounds are in a coordinate system where the origin is (0,0) and the up and right directions are positive. Going by this coordinate system, for all rects reported in Extract, bottom < top and left < right.

... Heading which can define the whole document.

Text : Text for the element in UTF-8 format, only reported for text elements.

When inline elements are reported separately from the parent block element, then this value has references to those inline elements.

Figures : Identified as a Figure in the Path attribute, saved as a PNG in the figures folder with the filename identified in the filePaths attribute.

Tables : Identified as a Table in the Path attribute, saved as a PNG in the tables folder with the filename identified in the filePaths attribute.

filePaths : List of file paths to additional output files (images ...)

Pages : A list of properties for each page of the PDF including page ...

Reading Order : The reading order of content within columns, across ...
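As a rough illustration of working with the zip package, the sketch below opens an output archive held in memory, parses the extracted JSON, and groups the rendition files by folder. The JSON file name `structuredData.json`, the lowercase top-level key `elements`, and the helper names are assumptions made for this example, not taken from the text above; adjust them to match the actual output.

```python
import io
import json
import zipfile


def load_extracted_json(zip_bytes, json_name="structuredData.json"):
    """Parse the extracted JSON from an output zip held in memory.

    json_name is an assumed file name; pass the real one if it differs.
    """
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        with archive.open(json_name) as handle:
            return json.load(handle)


def list_renditions(zip_bytes):
    """Group rendition file names by top-level folder ("tables" or "figures")."""
    groups = {"tables": [], "figures": []}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            folder, _, filename = name.partition("/")
            # Skip bare directory entries and anything outside the two folders.
            if folder in groups and filename:
                groups[folder].append(filename)
    return groups


def text_elements(data, elements_key="elements"):
    """Yield (Path, Text) pairs for elements that carry a Text value.

    elements_key is an assumed name for the ordered element list.
    """
    for element in data.get(elements_key, []):
        if "Text" in element:
            yield element.get("Path"), element["Text"]
```

Only elements with a `Text` value are yielded, matching the note that text is reported for text elements only; figures and tables would instead be looked up through their rendition files.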

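The 72 DPI coordinate description can be made concrete with a small sketch. It converts inches to PDF units and maps a bounding box from the bottom-left-origin PDF coordinate system to a top-left-origin pixel box, as an image library would typically expect. The value order `[left, bottom, right, top]` and the function names are assumptions for illustration; check the order against real output before relying on it.

```python
PDF_UNITS_PER_INCH = 72  # the 72 DPI basis used for all reported values


def inches_to_units(inches):
    """Convert a length in inches to PDF units at 72 units per inch."""
    return inches * PDF_UNITS_PER_INCH


def bounds_to_pixels(bounds, page_height_units, dpi):
    """Map PDF-space bounds to a top-left-origin pixel box.

    Assumes bounds = [left, bottom, right, top] with the origin at the
    bottom-left corner of the page and up/right as positive directions.
    Returns (left, top, right, bottom) in pixels at the given dpi.
    """
    left, bottom, right, top = bounds
    if not (left < right and bottom < top):
        raise ValueError("expected left < right and bottom < top")
    scale = dpi / PDF_UNITS_PER_INCH
    # Flip the vertical axis: distance from the top edge of the page.
    return (
        left * scale,
        (page_height_units - top) * scale,
        right * scale,
        (page_height_units - bottom) * scale,
    )
```

For example, `round(inches_to_units(8.3))` gives 598, the A4 width quoted in the text. The vertical flip is needed because most raster image coordinates grow downward, while the PDF coordinates described here grow upward.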