
Essential concepts

To use the smartextract API, it is important to understand the three main concepts involved.

  • A document is the input to the information extraction process. Currently, it can be a PDF file or an image (JPEG or PNG format). Documents may or may not be stored in our system. When a document is stored, it belongs to an inbox and is identified by a UUID.

  • A pipeline is a document processing procedure. It returns an extraction, which is the data computed from the document. The content and schema of the extraction depend on the pipeline. Under the hood, a pipeline combines various processing components, including OCR, AI models, and data validation procedures. Every pipeline is identified by a UUID.

  • An inbox is a repository where documents can be stored long-term. Every inbox has exactly one associated pipeline, but a single pipeline may be associated with multiple inboxes. When a document is added to an inbox, the associated pipeline computes its extraction. After that, it is possible to correct and validate the extraction, and to download extractions or send them to other services. Every inbox is identified by a UUID.
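
The relationships between these three concepts can be sketched as a small in-memory model. This is only an illustration of how the pieces relate, not the smartextract client or its API: the class names, fields, and the `add` method are all hypothetical, and the pipeline is stood in for by a plain Python function.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Callable, Optional
from uuid import UUID, uuid4

# Hypothetical model for illustration only; none of these names
# are taken from the smartextract API.


@dataclass
class Pipeline:
    # Stand-in for OCR + AI models + validation: maps the raw
    # document bytes to an extraction (a plain dict here).
    run: Callable[[bytes], dict[str, Any]]
    id: UUID = field(default_factory=uuid4)


@dataclass
class Document:
    content: bytes
    media_type: str  # "application/pdf", "image/jpeg" or "image/png"
    # The fields below are set only once the document is stored.
    id: Optional[UUID] = None
    inbox_id: Optional[UUID] = None
    extraction: Optional[dict[str, Any]] = None


@dataclass
class Inbox:
    pipeline: Pipeline  # every inbox has exactly one pipeline
    id: UUID = field(default_factory=uuid4)
    documents: dict[UUID, Document] = field(default_factory=dict)

    def add(self, doc: Document) -> Document:
        """Store the document and compute its extraction."""
        doc.id = uuid4()
        doc.inbox_id = self.id
        doc.extraction = self.pipeline.run(doc.content)
        self.documents[doc.id] = doc
        return doc


# A toy pipeline whose extraction is just the size of the input.
pipeline = Pipeline(run=lambda content: {"size_bytes": len(content)})

# One pipeline may serve several inboxes.
invoices = Inbox(pipeline=pipeline)
receipts = Inbox(pipeline=pipeline)

stored = invoices.add(
    Document(content=b"%PDF-1.7", media_type="application/pdf")
)
```

In this sketch, a `Document` has no UUID until it is added to an inbox, which mirrors the distinction between stored and non-stored documents described above.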