Line Items
Understanding the line-items data format
What are line-items?
The lineitems data type is used for fields that extract tabular information from a specific type of table with pre-defined columns. There are many different lineitems fields tailored to different extraction use-cases.
For example, the invoice.lineitems field captures tables containing invoice line items, while the statement.transactions field returns credit and debit transaction rows from bank and credit card statements.
Specific AI products may contain additional functionality on top of basic table extraction for a given use-case. For example, ndis.lineitems includes inference of Support Item Reference Numbers from line-level description text. When a lineitems field matches your use-case, always prefer it over generic table extraction (i.e. generic.table).
Basic data structure
Fields with the lineitems data type return a common data structure. Each prediction is a list of tables, one for each table found in the source document. Each of these tables is a JSON object with three keys:
types: aligns each extracted column to a specific column type.
headers: identifies the specific text and position of header cells within the source document.
cells: contains the content of the table arranged as an array-of-arrays, organised as rows by columns.
The following sections unpack each part of this data structure in detail.
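As a rough orientation, the structure described above can be sketched as a Python literal. This is an illustrative sample only: the `id` and `confidence` key names inside each `types` entry, the empty `bounds` placeholder, and all values are assumptions made for the sketch, not a confirmed response payload.

```python
def cell(text):
    # "bounds" is left as an empty placeholder: its exact shape is not
    # described in this section.
    return {"text": text, "bounds": {}}


# Hypothetical prediction for a lineitems field: a list of tables.
prediction = [
    {
        "types": [
            # Assumed key names for the type identifier and confidence score.
            {"id": "sypht.invoice.lineitems.unitPrice", "confidence": 0.97},
            None,  # column that matches no pre-defined column type
        ],
        "headers": [cell("Item Price"), cell("Misc.")],
        "cells": [
            [cell("$50.00"), cell("Hello")],
            [cell("$100.00"), None],  # None marks an empty cell
        ],
    }
]


def describe(table):
    """Return (row count, column count) for one extracted table."""
    return len(table["cells"]), len(table["types"])


for table in prediction:
    n_rows, n_cols = describe(table)
    print(f"table with {n_rows} rows and {n_cols} columns")
```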
Types
Each item in the types array contains a type identifier and confidence score for the corresponding table column. These identifiers can be used to interpret the corresponding cell content for that column in the table. For example, a column labelled "Item Price" might be classified as a sypht.invoice.lineitems.unitPrice column and contain prices for each listed item.
A type of null indicates the corresponding column does not match a pre-defined column type for the field. Header and cell content is still returned for these columns.
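A common use of the types array is locating the column you care about by its identifier. The sketch below assumes each entry is either null or an object with `id` and `confidence` keys (these key names are assumptions), and the `column_index` helper and its `min_confidence` parameter are hypothetical names introduced for illustration.

```python
def column_index(table, wanted, min_confidence=0.0):
    """Return the index of the first column classified as `wanted`, else None."""
    for i, item in enumerate(table["types"]):
        # Skip unaligned columns (null type) before reading any keys.
        if not item:
            continue
        if item.get("id") == wanted and item.get("confidence", 0.0) >= min_confidence:
            return i
    return None


# Hypothetical table: the first column is unaligned, the second is a unit price.
table = {
    "types": [None, {"id": "sypht.invoice.lineitems.unitPrice", "confidence": 0.97}],
    "cells": [
        [{"text": "Foo"}, {"text": "$50.00"}],
        [{"text": "Bar"}, None],
    ],
}

idx = column_index(table, "sypht.invoice.lineitems.unitPrice")
# Empty cells are null, so guard before reading "text".
prices = [row[idx]["text"] if row[idx] else None for row in table["cells"]]
print(idx, prices)
```

Raising `min_confidence` is one way to ignore weakly classified columns; a suitable threshold depends on your own tolerance for misaligned columns.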
Headers
Headers encode information about the header cells detected on each table. In general, headers are not needed to interpret the content of the table for a lineitems field, but may be useful to understand the content of non-aligned columns and how the data was originally presented in the source document.
Each object contains text and bounds information used to locate headers in the source.
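One way headers help with non-aligned columns is as a fallback label when a column has no type. The following is a minimal sketch under assumptions: `headers` is taken to have one entry per column, type entries are assumed to carry an `id` key, and `column_labels` is a hypothetical helper name.

```python
def column_labels(table):
    """Label each column by type id, falling back to the original header text."""
    labels = []
    for i, (item, header) in enumerate(zip(table["types"], table["headers"])):
        type_id = item.get("id") if item else None
        if type_id:
            labels.append(type_id)
        elif header and header.get("text"):
            # Non-aligned column: reuse the text found in the source document.
            labels.append(header["text"])
        else:
            # Neither a type nor a header was detected for this column.
            labels.append(f"column_{i}")
    return labels


# Hypothetical table with one aligned column and two unaligned ones.
table = {
    "types": [{"id": "sypht.invoice.lineitems.unitPrice", "confidence": 0.97}, None, None],
    "headers": [
        {"text": "Item Price", "bounds": {}},
        {"text": "Misc.", "bounds": {}},
        None,
    ],
}

labels = column_labels(table)
print(labels)
```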
Cells
Each element in the cells array represents a row, and each row contains one item per column in the table. Row items may be null, indicating an empty cell for a given column. Rows with no extracted cells are omitted from the output.
When cells are present they contain a similar data structure to headers. This includes both text and bounds information.
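Because row items can be null, code that reads cell text needs a guard. A small sketch of flattening the cells array into a plain grid of strings follows; the sample data and the `cell_text` helper are hypothetical, and the `bounds` key is left out for brevity.

```python
def cell_text(cell, default=""):
    """Return the text of a cell, or `default` when the cell is empty (null)."""
    return cell["text"] if cell is not None else default


# Hypothetical cells array: two rows by three columns, with one empty cell.
table = {
    "cells": [
        [{"text": "1/1/2020"}, {"text": "Foo"}, {"text": "$50.00"}],
        [{"text": "1/1/2021"}, None, {"text": "$100.00"}],
    ]
}

# Rows by columns, with empty cells replaced by an empty string.
grid = [[cell_text(c) for c in row] for row in table["cells"]]
for row in grid:
    print(" | ".join(row))
```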
Examples
Line-item fields are a densely packed source of structured information. While there is a lot of information available, it's usually quite simple in practice to pull out the specific information you need.
Here we provide an end-to-end example uploading a document using the Sypht API and interpreting the results of a lineitems field in Python. We utilise the pandas library to format tabular results.
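A minimal sketch of the interpretation step is shown here. It does not reproduce the upload call: it starts from a hypothetical, already-fetched table so that no client method names have to be guessed. The `table_to_dataframe` helper and the sample data are assumptions for illustration, and `types` and `bounds` are omitted for brevity.

```python
import pandas as pd


def table_to_dataframe(table):
    """Build a DataFrame using the original document headers as column names."""
    # Assumes one header entry per column; a missing header becomes "".
    columns = [h["text"] if h else "" for h in table["headers"]]
    # Empty cells are null in the prediction and become None in the frame.
    rows = [[c["text"] if c else None for c in row] for row in table["cells"]]
    return pd.DataFrame(rows, columns=columns)


# Hypothetical table taken from a lineitems prediction.
table = {
    "headers": [
        {"text": "Date"},
        {"text": "Product Description"},
        {"text": "Misc."},
        {"text": "Total ($/AUD)"},
    ],
    "cells": [
        [{"text": "1/1/2020"}, {"text": "Foo"}, {"text": "Hello"}, {"text": "$50.00"}],
        [{"text": "1/1/2021"}, {"text": "Bar"}, {"text": "World"}, {"text": "$100.00"}],
    ],
}

df = table_to_dataframe(table)
print(df)
```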
Depending on the input file content, this sample produces a DataFrame with original document headers for columns and cell content in each row, e.g.:
Date | Product Description | Misc. | Total ($/AUD) |
1/1/2020 | Foo | Hello | $50.00 |
1/1/2021 | Bar | World | $100.00 |
Alternately we can use the aligned column types rather than raw text to construct a DataFrame like so:
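The sketch below shows one way to do this, again starting from a hypothetical, already-fetched table. Apart from the identifier scheme shown earlier (sypht.invoice.lineitems.unitPrice), the type identifiers used here are hypothetical, as are the `id` key name and the `unaligned_<n>` label chosen for columns whose type is null.

```python
import pandas as pd


def table_to_typed_dataframe(table):
    """Build a DataFrame whose columns are named by aligned column type."""
    columns = []
    for i, item in enumerate(table["types"]):
        type_id = item.get("id") if item else None
        # Columns with a null type are kept under a placeholder label.
        columns.append(type_id if type_id else f"unaligned_{i}")
    rows = [[c["text"] if c else None for c in row] for row in table["cells"]]
    return pd.DataFrame(rows, columns=columns)


# Hypothetical table; the type ids below are illustrative, not confirmed names.
table = {
    "types": [
        {"id": "sypht.invoice.lineitems.date", "confidence": 0.99},
        {"id": "sypht.invoice.lineitems.description", "confidence": 0.98},
        None,
        {"id": "sypht.invoice.lineitems.total", "confidence": 0.97},
    ],
    "cells": [
        [{"text": "1/1/2020"}, {"text": "Foo"}, {"text": "Hello"}, {"text": "$50.00"}],
        [{"text": "1/1/2021"}, {"text": "Bar"}, {"text": "World"}, {"text": "$100.00"}],
    ],
}

df = table_to_typed_dataframe(table)
print(df)
```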
This produces an equivalent table with columns aligned to specific invoice.lineitems types:
|
|
|
|
1/1/2020 | Foo | Hello | $50.00 |
1/1/2021 | Bar | World | $100.00 |