Processing a document
Before performing extraction on a document, you need to define a processing pipeline. Once the pipeline is created and the pipeline id is known, you can process the document either directly or by creating an inbox associated with the pipeline.
The approach involving inboxes persistently saves the document and its extraction in our platform. By contrast, if you send a document directly to a pipeline, we do not store the document or its extraction (except for asynchronous requests, in which case extraction results are retained for a certain period of time).
Synchronous and asynchronous processing
Synchronous processing simply means that the extraction is returned directly in response to your request. An asynchronous processing request, on the other hand, returns a job confirmation rather than the extraction result. In the latter case, the result becomes available once processing has finished in the background; at that point you can download the extraction result.
One advantage of asynchronous processing is that you don't have to wait for every document to be processed individually. You can simply submit a large batch of documents and retrieve the results at a later stage. An advantage of synchronous processing is that it only requires a single HTTP request; asynchronous processing requires polling for the results.
Running document processing pipelines synchronously
Running a pipeline in synchronous mode requires a POST request on
/pipelines/PIPELINE_ID/run
with your document attached:
curl -X 'POST' 'https://api.smartextract.ai/pipelines/PIPELINE_ID/run' \
-H 'Accept: application/json' \
-H 'Authorization: Bearer API_TOKEN' \
-F 'document=@DOCUMENT_PATH;type=application/pdf'
import httpx

with open('DOCUMENT_PATH', 'rb') as file:
    response = httpx.post(
        url='https://api.smartextract.ai/pipelines/PIPELINE_ID/run',
        headers={
            'Accept': 'application/json',
            'Authorization': 'Bearer API_TOKEN'
        },
        files={
            'document': (
                'DOCUMENT_PATH',
                file,
                'application/pdf'
            )
        }
    )

print(response.json())
The example assumes that your document is a PDF. If you want to submit an image, adjust the media type accordingly.
The response is a JSON object containing the extracted information. The extraction schema is documented here.
Asynchronous document processing
When processing documents asynchronously, you first submit the document to a queue and receive a job id. You are then able to query the job status, and when the job is finished you can download the extraction result.
Submitting a document
To submit a document for asynchronous processing, send the following POST request to
/pipelines/PIPELINE_ID/submit
:
curl -X 'POST' 'https://api.smartextract.ai/pipelines/PIPELINE_ID/submit' \
-H 'Accept: application/json' \
-H 'Authorization: Bearer API_TOKEN' \
-F 'document=@DOCUMENT_PATH;type=application/pdf'
import httpx

with open('DOCUMENT_PATH', 'rb') as file:
    response = httpx.post(
        url='https://api.smartextract.ai/pipelines/PIPELINE_ID/submit',
        headers={
            'Accept': 'application/json',
            'Authorization': 'Bearer API_TOKEN'
        },
        files={
            'document': (
                'DOCUMENT_PATH',
                file,
                'application/pdf'
            )
        }
    )

print(response.json())
The example assumes that your document is a PDF. If you want to submit an image, adjust the media type accordingly.
The request immediately returns a job id and initiates document processing in the background. Make sure to record the job id so that you can fetch the results later.
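When submitting many documents in a batch, it helps to keep a mapping from each document to its job id. The helper below sketches that bookkeeping; note that the key under which the submit response carries the job id (assumed here to be 'id') should be verified against an actual response from your account.

```python
def record_job(submit_payload: dict, registry: dict, document_path: str) -> str:
    """Store the job id for a submitted document so its result can be
    fetched later.  The field name 'id' is an assumption; inspect a
    real submit response to confirm where the job id lives."""
    job_id = submit_payload['id']
    registry[document_path] = job_id
    return job_id


# Example bookkeeping for a batch of submissions,
# where each payload is the parsed JSON of a submit response:
jobs = {}
record_job({'id': '123e4567'}, jobs, 'invoice-01.pdf')
```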
Querying job status
Once the job id is available, you can retrieve the job status with the following GET request on
/jobs/JOB_ID/status
:
curl -X 'GET' 'https://api.smartextract.ai/jobs/JOB_ID/status' \
-H 'Accept: application/json' \
-H 'Authorization: Bearer API_TOKEN'
import httpx

response = httpx.get(
    url='https://api.smartextract.ai/jobs/JOB_ID/status',
    headers={
        'Accept': 'application/json',
        'Authorization': 'Bearer API_TOKEN'
    }
)

status = response.text
print(status)
The response contains a string with the job status. The status is either
running
if the processing has not yet terminated,
failed
if the job did not finish successfully, or
finished
if the extraction is ready for download.
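Since a job only terminates when its status becomes finished or failed, a client typically polls the status endpoint in a loop. The sketch below keeps the status lookup injectable (any zero-argument callable returning a status string), so it works with the GET request shown above as well as with a stub in tests; the interval and timeout values are arbitrary choices, not API requirements.

```python
import time


def wait_for_job(get_status, poll_interval=2.0, timeout=300.0):
    """Call get_status() until it reports a terminal state.

    Returns 'finished' or 'failed'; raises TimeoutError if the job
    is still running when the timeout expires.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status()
        if status in ('finished', 'failed'):
            return status
        time.sleep(poll_interval)
    raise TimeoutError('job did not terminate within the timeout')
```

With httpx, get_status could be a small function that performs the status request from the example above and returns response.text.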
Downloading extraction result
To download the extraction result of a finished job, send a GET request on
/jobs/JOB_ID/result
:
curl -X 'GET' 'https://api.smartextract.ai/jobs/JOB_ID/result' \
-H 'Accept: application/json' \
-H 'Authorization: Bearer API_TOKEN'
import httpx

response = httpx.get(
    url='https://api.smartextract.ai/jobs/JOB_ID/result',
    headers={
        'Accept': 'application/json',
        'Authorization': 'Bearer API_TOKEN'
    }
)

print(response.json())
The response is a JSON object containing the extracted information. The extraction schema is documented here.