Pipelines
A pipeline defines a document processing procedure. It combines various processing components including OCR, AI models and data validation procedures to compute a structured data extraction from a document. Pipelines are highly customizable and enable flexible data extraction.
Typically you would create a pipeline by specifying the data structure you want to extract from documents. This data structure is also known as a template. After that, you can run the pipeline on individual documents or create an inbox based on a pipeline and upload documents to it for automatic processing. It is recommended to organize document extraction around inboxes.
Templates
An extraction template is a JSON object that describes all fields to be extracted, including their name, type, textual description and whether multiple extractions of a field are possible. The template schema is documented here.
For your convenience, smartextract has many predefined extraction templates
designed for specific use cases. These include templates for invoices, receipts,
bank statements, delivery notes and more. Where the API expects a TEMPLATE,
you may either provide an actual template (as a JSON object) or the name of a
predefined template (as a string).
To retrieve a full list of predefined templates, send a GET request to
pipelines/templates:
curl -X 'GET' 'https://api.smartextract.ai/templates?lang=en' \
  -H 'Accept: application/json' \
  -H 'Authorization: Bearer API_TOKEN'

import httpx
response = httpx.get(
    url='https://api.smartextract.ai/templates?lang=en',
    headers={
        'Accept': 'application/json',
        'Authorization': 'Bearer API_TOKEN'
    }
)
print(response.json())
You may specify a different template language via the lang query parameter. At
the moment, en for English and de for German are supported. To refer to a
predefined template, use the id.lang notation, for example invoice.en.
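The id.lang notation can be assembled with a small helper. This is a minimal sketch: the function name template_ref and the SUPPORTED_LANGS set are hypothetical names chosen for illustration, and the language check only reflects the two languages mentioned above.

```python
# Languages named in this documentation; the real service may add more.
SUPPORTED_LANGS = {"en", "de"}


def template_ref(template_id: str, lang: str = "en") -> str:
    """Build a predefined-template reference in id.lang notation."""
    if not template_id:
        raise ValueError("template id must not be empty")
    if lang not in SUPPORTED_LANGS:
        raise ValueError(f"unsupported template language: {lang!r}")
    return f"{template_id}.{lang}"


print(template_ref("invoice"))        # invoice.en
print(template_ref("invoice", "de"))  # invoice.de
```

The resulting string can be used wherever the API expects a TEMPLATE.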
Creating a pipeline
To create a template-based pipeline, send a POST request to /pipelines
including the template and a clear, descriptive pipeline name:
curl -X 'POST' 'https://api.smartextract.ai/pipelines' \
  -H 'Accept: application/json' \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer API_TOKEN' \
  -d '{
    "name": "PIPELINE_NAME",
    "template": "TEMPLATE"
  }'

import httpx
response = httpx.post(
    url='https://api.smartextract.ai/pipelines',
    headers={
        'Accept': 'application/json',
        'Authorization': 'Bearer API_TOKEN'
    },
    json={
        'name': 'PIPELINE_NAME',
        'template': 'TEMPLATE'
    }
)
print(response.json())
The response contains the id of the newly created pipeline.
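Because TEMPLATE may be either a predefined template name or a full template object, it can help to build the request in one place. The sketch below only assembles the request arguments and sends nothing. The function name create_pipeline_request is hypothetical, and the returned dictionary is shaped so that it could be passed to httpx.request as keyword arguments.

```python
import json

API_BASE = "https://api.smartextract.ai"


def create_pipeline_request(name, template, api_token):
    """Assemble the arguments for a POST request to /pipelines.

    `template` is either the name of a predefined template (str),
    such as "invoice.en", or a template object (dict).
    """
    if not isinstance(template, (str, dict)):
        raise TypeError("template must be a str (name) or a dict (template object)")
    return {
        "method": "POST",
        "url": f"{API_BASE}/pipelines",
        "headers": {
            "Accept": "application/json",
            "Authorization": f"Bearer {api_token}",
        },
        "json": {"name": name, "template": template},
    }


request = create_pipeline_request("Invoices", "invoice.en", "API_TOKEN")
print(json.dumps(request["json"], indent=2))
```

With httpx installed, the request could then be sent with httpx.request(**request). Reading the new pipeline id as response.json()["id"] is an assumption about the response layout, so check the actual response body first.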
Managing pipelines
Listing pipelines
When you create a pipeline, the pipeline id is returned. You can also retrieve a
listing of all pipelines you have access to with a GET request to /resources:
curl -X 'GET' 'https://api.smartextract.ai/resources?type=template_pipeline' \
  -H 'Accept: application/json' \
  -H 'Authorization: Bearer API_TOKEN'

import httpx
response = httpx.get(
    url='https://api.smartextract.ai/resources?type=template_pipeline',
    headers={
        'Accept': 'application/json',
        'Authorization': 'Bearer API_TOKEN'
    }
)
print(response.json())
Note that the template_pipeline resource type is specified as a query
parameter.
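Instead of writing the query string by hand, the URL can be built from a parameter dictionary so that values are escaped correctly. This is a small sketch using only the standard library, and the helper name list_resources_url is hypothetical.

```python
from urllib.parse import urlencode

API_BASE = "https://api.smartextract.ai"


def list_resources_url(resource_type: str) -> str:
    """Return the /resources URL filtered by the given resource type."""
    query = urlencode({"type": resource_type})
    return f"{API_BASE}/resources?{query}"


print(list_resources_url("template_pipeline"))
# https://api.smartextract.ai/resources?type=template_pipeline
```

httpx offers the same convenience through its params argument, for example httpx.get(url, params={"type": "template_pipeline"}, headers=...).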
Viewing details about a pipeline
To view the details about a given pipeline, including its template, send a GET
request to /pipelines/PIPELINE_ID:
curl -X 'GET' 'https://api.smartextract.ai/pipelines/PIPELINE_ID' \
  -H 'Accept: application/json' \
  -H 'Authorization: Bearer API_TOKEN'

import httpx
response = httpx.get(
    url='https://api.smartextract.ai/pipelines/PIPELINE_ID',
    headers={
        'Accept': 'application/json',
        'Authorization': 'Bearer API_TOKEN'
    }
)
print(response.json())
Modifying pipelines
To modify an existing pipeline, send a PATCH request to
/pipelines/PIPELINE_ID:
curl -X 'PATCH' 'https://api.smartextract.ai/pipelines/PIPELINE_ID' \
  -H 'Accept: application/json' \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer API_TOKEN' \
  -d '{
    "name": "New name"
  }'

import httpx
httpx.patch(
    url='https://api.smartextract.ai/pipelines/PIPELINE_ID',
    headers={
        'Accept': 'application/json',
        'Authorization': 'Bearer API_TOKEN'
    },
    json={
        'name': 'New name'
    }
)
The request payload may include any of the following entries:
- name: The display name of the pipeline.
- template: A JSON object describing the desired extraction template.
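Since both entries are optional, a PATCH payload should contain only the fields that actually change. The following sketch shows one way to build such a payload. The helper name pipeline_patch_payload is hypothetical, and rejecting an empty payload is a design choice of this example, not documented API behavior.

```python
def pipeline_patch_payload(name=None, template=None):
    """Build a PATCH payload containing only the entries to be changed."""
    payload = {}
    if name is not None:
        payload["name"] = name
    if template is not None:
        payload["template"] = template
    if not payload:
        # Assumption of this sketch: an empty update is treated as a mistake.
        raise ValueError("specify at least one of name or template")
    return payload


print(pipeline_patch_payload(name="New name"))  # {'name': 'New name'}
```

The result can be passed as the json argument of the httpx.patch call shown above.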
Sharing pipelines
A pipeline you own can be shared with another smartextract user using the
following POST request to /resources/PIPELINE_ID/permissions:
curl -X 'POST' 'https://api.smartextract.ai/resources/PIPELINE_ID/permissions' \
  -H 'Authorization: Bearer API_TOKEN' \
  -H 'Content-Type: application/json' \
  -d '{
    "user": "USER_EMAIL",
    "level": "view"
  }'

import httpx
httpx.post(
    url='https://api.smartextract.ai/resources/PIPELINE_ID/permissions',
    headers={
        'Authorization': 'Bearer API_TOKEN',
        'Content-Type': 'application/json'
    },
    json={
        'user': 'USER_EMAIL',
        'level': 'view'
    }
)
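To share one pipeline with several users, the same request can be repeated once per email address. The sketch below only prepares the request arguments and sends nothing. The function name share_requests is hypothetical, and "view" is the only permission level shown in this documentation, so any other value is an assumption to verify against the API reference.

```python
API_BASE = "https://api.smartextract.ai"


def share_requests(pipeline_id, emails, api_token, level="view"):
    """Prepare one POST request per user for the permissions endpoint."""
    url = f"{API_BASE}/resources/{pipeline_id}/permissions"
    headers = {
        "Authorization": f"Bearer {api_token}",
        "Content-Type": "application/json",
    }
    return [
        {
            "method": "POST",
            "url": url,
            "headers": headers,
            "json": {"user": email, "level": level},
        }
        for email in emails
    ]


for request in share_requests(
    "PIPELINE_ID", ["a@example.com", "b@example.com"], "API_TOKEN"
):
    print(request["json"])
```

Each prepared dictionary could be sent with httpx.request(**request).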