Processing a document
Before performing extraction on a document, you need to define a processing pipeline. Once the pipeline is created and the pipeline id is known, you can process the document either directly or by creating an inbox associated with the pipeline.
The approach involving inboxes persistently saves the document and its extraction in our platform. By contrast, if you send a document directly to a pipeline, we do not store the document or its extraction (except for asynchronous requests, in which case extraction results are retained for a certain period of time).
Synchronous and asynchronous processing
Synchronous processing simply means that the extraction is returned directly in response to your request. An asynchronous processing request, on the other hand, returns a job confirmation rather than the extraction result. In the latter case, the result becomes available once processing has finished in the background; at that point you can download the extraction result.
One advantage of asynchronous processing is that you don't have to wait for every document to be processed individually. You can simply submit a large batch of documents and retrieve the results at a later stage. An advantage of synchronous processing is that it only requires a single HTTP request; asynchronous processing requires polling for the results.
Running document processing pipelines synchronously
Running a pipeline in synchronous mode requires a POST request on
/pipelines/PIPELINE_ID/run
with your document attached:
curl -X 'POST' 'https://api.smartextract.ai/pipelines/PIPELINE_ID/run' \
-H 'Accept: application/json' \
-H 'Authorization: Bearer API_TOKEN' \
-F 'document=@DOCUMENT_PATH;type=application/pdf'
import httpx

with open('DOCUMENT_PATH', 'rb') as file:
    response = httpx.post(
        url='https://api.smartextract.ai/pipelines/PIPELINE_ID/run',
        headers={
            'Accept': 'application/json',
            'Authorization': 'Bearer API_TOKEN'
        },
        files={
            'document': (
                'DOCUMENT_PATH',
                file,
                'application/pdf'
            )
        }
    )

print(response.json())
The example assumes that your document is a PDF. If you want to submit an image, adjust the media type accordingly.
The response is a JSON object containing the extracted information. The extraction schema is documented here.
Asynchronous document processing
When processing documents asynchronously, you first submit the document to a queue and receive a job id. You are then able to query the job status, and when the job is finished you can download the extraction result.
Submitting a document
To submit a document for asynchronous processing, send the following POST request to
/pipelines/PIPELINE_ID/submit
:
curl -X 'POST' 'https://api.smartextract.ai/pipelines/PIPELINE_ID/submit' \
-H 'Accept: application/json' \
-H 'Authorization: Bearer API_TOKEN' \
-F 'document=@DOCUMENT_PATH;type=application/pdf'
import httpx

with open('DOCUMENT_PATH', 'rb') as file:
    response = httpx.post(
        url='https://api.smartextract.ai/pipelines/PIPELINE_ID/submit',
        headers={
            'Accept': 'application/json',
            'Authorization': 'Bearer API_TOKEN'
        },
        files={
            'document': (
                'DOCUMENT_PATH',
                file,
                'application/pdf'
            )
        }
    )

print(response.json())
The example assumes that your document is a PDF. If you want to submit an image, adjust the media type accordingly.
The request immediately returns a job id and initiates document processing in the background. Make sure to record the job id so that you can fetch the results later.
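When submitting many documents in a batch, it helps to keep a mapping from each document to its job id. The helper below sketches that bookkeeping; note that the key under which the submit response carries the job id (assumed here to be 'id') should be verified against an actual response from your account.

```python
def record_job(submit_payload: dict, registry: dict, document_path: str) -> str:
    """Store the job id for a submitted document so its result can be
    fetched later.  The field name 'id' is an assumption; inspect a
    real submit response to confirm where the job id lives."""
    job_id = submit_payload['id']
    registry[document_path] = job_id
    return job_id


# Example bookkeeping for a batch of submissions,
# where each payload is the parsed JSON of a submit response:
jobs = {}
record_job({'id': '123e4567'}, jobs, 'invoice-01.pdf')
```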
Querying job status
Once the job id is available, you can retrieve the job status with the following GET request on
/jobs/JOB_ID/status
:
curl -X 'GET' 'https://api.smartextract.ai/jobs/JOB_ID/status' \
-H 'Accept: application/json' \
-H 'Authorization: Bearer API_TOKEN'
import httpx

response = httpx.get(
    url='https://api.smartextract.ai/jobs/JOB_ID/status',
    headers={
        'Accept': 'application/json',
        'Authorization': 'Bearer API_TOKEN'
    }
)

status = response.text
print(status)
The response contains a string with the job status. The status is either
running
if the processing has not yet terminated,
failed
if the job did not finish successfully, or
finished
if the extraction is ready for download.
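Since a job only terminates when its status becomes finished or failed, a client typically polls the status endpoint in a loop. The sketch below keeps the status lookup injectable (any zero-argument callable returning a status string), so it works with the GET request shown above as well as with a stub in tests; the interval and timeout values are arbitrary choices, not API requirements.

```python
import time


def wait_for_job(get_status, poll_interval=2.0, timeout=300.0):
    """Call get_status() until it reports a terminal state.

    Returns 'finished' or 'failed'; raises TimeoutError if the job
    is still running when the timeout expires.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status()
        if status in ('finished', 'failed'):
            return status
        time.sleep(poll_interval)
    raise TimeoutError('job did not terminate within the timeout')
```

With httpx, get_status could be a small function that performs the status request from the example above and returns response.text.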
Downloading extraction result
To download the extraction result of a finished job, send a GET request on
/jobs/JOB_ID/result
:
curl -X 'GET' 'https://api.smartextract.ai/jobs/JOB_ID/result' \
-H 'Accept: application/json' \
-H 'Authorization: Bearer API_TOKEN'
import httpx

response = httpx.get(
    url='https://api.smartextract.ai/jobs/JOB_ID/result',
    headers={
        'Accept': 'application/json',
        'Authorization': 'Bearer API_TOKEN'
    }
)

print(response.json())
The response is a JSON object containing the extracted information. The extraction schema is documented here.