Smart document split

Add automatic document splitting to workflows.

How it works

When a single PDF file contains multiple underlying documents, smart-split automatically detects and segments the sub-documents, even when they vary in length and format.

When a source file is uploaded, it is processed and assigned a fileId. The results for the original file then contain one or more child document fileIds, each of which can be queried to obtain the corresponding sub-document results.

Document splitting can be used in conjunction with other standard workflows like prediction or validation.

Getting started

To automatically split files on upload, a few changes to the standard fileupload form-data parameters are required:

  • Specify the document-splitting workflow type by setting: workflowId=split

  • Specify a childWorkflowId to define what workflow to run on each generated sub-document

  • Optionally specify childWorkflowOptions to parameterise the workflow run on each generated sub-document

Split workflows are a BETA feature and subject to change. This guide is under construction.

1. Upload

workflowId = split

workflowOptions

Sample
{
    "prediction": {
        "childWorkflow": "prediction",
        "childWorkflowOptions": {
            "prediction": {
                "fieldSets": ["sypht.invoice"]
            }
        }
    }
}

Response:

{
    "fileId": "00000000-0000-0000-0000-000000000000",
    "uploadedAt": "2020-08-20T03:19:07.319Z",
    "status": "RECEIVED"
}
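
The upload parameters above can be assembled programmatically. The following is a minimal sketch, not a definitive client: the helper name build_split_upload_fields is hypothetical, and sending workflowOptions as a JSON-encoded string is an assumption. The upload endpoint and authentication are not shown in this guide, so the sketch stops at building the form-data fields.

```python
import json


def build_split_upload_fields(child_workflow, child_workflow_options):
    """Build form-data fields for a split upload (hypothetical helper).

    The nesting mirrors the workflowOptions sample in this guide.
    Encoding workflowOptions as a JSON string is an assumption.
    """
    workflow_options = {
        "prediction": {
            "childWorkflow": child_workflow,
            "childWorkflowOptions": child_workflow_options,
        }
    }
    return {
        "workflowId": "split",
        "workflowOptions": json.dumps(workflow_options),
    }


# Reproduce the sample: run prediction with the sypht.invoice field set
# on every generated sub-document.
fields = build_split_upload_fields(
    "prediction",
    {"prediction": {"fieldSets": ["sypht.invoice"]}},
)
```

The resulting dictionary would be sent alongside the file itself as form-data in whatever HTTP client you already use for uploads.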

2. Collect the split results

GET https://api.sypht.com/result/final/00000000-0000-0000-0000-000000000000

{
    "fileId": "815c63f6-...-f07223d057cb",
    "status": "FINALISED",
    "results": {
        "fields": [
            {
                "name": "components.children",
                "value": [
                    {"file_id": "aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa"},
                    {"file_id": "bbbbbbbb-bbbb-bbbb-bbbb-bbbbbbbbbbbb"}
                ]
            }
        ]
    }
}
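
Extracting the child fileIds from a split result like the one above could look like the sketch below. The function name child_file_ids is hypothetical. It assumes the children always arrive in a field named components.children whose entries carry a file_id key, as in the sample, and it returns an empty list when that field is absent.

```python
def child_file_ids(split_result):
    """Return the child file IDs listed in a split result (hypothetical helper)."""
    fields = split_result.get("results", {}).get("fields", [])
    for field in fields:
        if field.get("name") == "components.children":
            # Each entry is an object such as {"file_id": "..."}.
            return [child["file_id"] for child in field.get("value", [])]
    return []


# The split result from this guide, reduced to the parts the helper reads.
sample_split_result = {
    "status": "FINALISED",
    "results": {
        "fields": [
            {
                "name": "components.children",
                "value": [
                    {"file_id": "aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa"},
                    {"file_id": "bbbbbbbb-bbbb-bbbb-bbbb-bbbbbbbbbbbb"},
                ],
            }
        ]
    },
}

ids = child_file_ids(sample_split_result)
```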

3. Collect results for each sub-document

GET https://api.sypht.com/result/final/aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa

Response
{
    "fileId": "aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa",
    "status": "FINALISED",
    "results": {
        "timestamp": "2020-08-20T03:30:09.703Z",
        "fields": [
            {
                "name": "invoice.total",
                "value": "1485.00",
                "confidence": 0.9958282699555642,
                ...
            },
            ...
        ]
    }
}

GET https://api.sypht.com/result/final/bbbbbbbb-bbbb-bbbb-bbbb-bbbbbbbbbbbb

Response
{
    "fileId": "bbbbbbbb-bbbb-bbbb-bbbb-bbbbbbbbbbbb",
    "status": "FINALISED",
    "results": {
        "timestamp": "2020-08-20T03:30:09.703Z",
        "fields": [
            {
                "name": "invoice.total",
                "value": "2485.00",
                "confidence": 0.99582,
                ...
            },
            ...
        ]
    }
}
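
Collecting the result for every sub-document can be sketched as a loop over the child fileIds. This is an illustration under stated assumptions: the function names are hypothetical, the headers argument is a placeholder because this guide does not show how requests are authenticated, and only the result URL pattern is taken from the requests above.

```python
import json
from urllib.request import Request, urlopen

# URL pattern taken from the GET requests shown in this guide.
RESULT_URL = "https://api.sypht.com/result/final/{file_id}"


def result_url(file_id):
    """Return the final-result URL for a file ID."""
    return RESULT_URL.format(file_id=file_id)


def fetch_final_result(file_id, headers=None):
    """GET the final result for one file ID.

    `headers` is a hypothetical pass-through for whatever
    authentication your account requires.
    """
    request = Request(result_url(file_id), headers=headers or {})
    with urlopen(request) as response:
        return json.load(response)


def collect_sub_document_results(child_ids, fetch=fetch_final_result):
    """Fetch each child result and return them keyed by file ID."""
    return {file_id: fetch(file_id) for file_id in child_ids}


def field_value(result, name):
    """Return the value of the first field called `name`, or None."""
    for field in result.get("results", {}).get("fields", []):
        if field.get("name") == name:
            return field.get("value")
    return None
```

Passing the fetch function as a parameter keeps the loop testable without network access. For the first sub-document above, field_value(result, "invoice.total") would read the value stored under invoice.total.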

Limitations, Errors and Recommendations

  • Uploading a document to the split workflow does not enforce any page limit checks. You may upload a document of any size, but recent tests show that documents of more than 50 pages cannot currently be processed.

  • Each split document is checked against your page limit. To avoid rejections, ask to have your page limit increased to your expected maximum.

  • If a split document is rejected due to page or file size limits, the split workflow will eventually be marked as a failure. Some split documents may still upload successfully, however. This is not ideal and can be avoided by increasing your page limit as described above.
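
Because some sub-documents may succeed while others are rejected, it can help to separate the finished results from the rest before using them. The sketch below is an assumption-laden illustration: the function name is hypothetical, and since this guide only shows the RECEIVED and FINALISED statuses, anything that is not FINALISED is simply treated as not ready rather than matched against a specific failure status.

```python
def partition_by_status(results_by_id):
    """Split collected results into finalised and not-finalised groups.

    `results_by_id` maps a file ID to its result object. Only the
    FINALISED status is relied on; other status names are not assumed.
    """
    finalised = {}
    not_finalised = {}
    for file_id, result in results_by_id.items():
        if result.get("status") == "FINALISED":
            finalised[file_id] = result
        else:
            not_finalised[file_id] = result
    return finalised, not_finalised


# Illustrative data only: one finished child and one still in progress.
example = {
    "aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa": {"status": "FINALISED"},
    "bbbbbbbb-bbbb-bbbb-bbbb-bbbbbbbbbbbb": {"status": "RECEIVED"},
}
done, pending_or_failed = partition_by_status(example)
```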
