Extraction schema

An extraction consists of typed label-value pairs, called entities, organized in a two-level hierarchy. The example below shows a hypothetical receipt extraction with the following structure:

  • general information
    • receipt number: 110103
    • date of issue: 2023-08-28
  • payment details
    • net amount: 23.95 EUR
{
  "schema": "https://smartextract.ai/schemas/extraction/v0",
  "entities": [
    {
      "label": "general information",
      "type": "record",
      "value": [
        {
          "label": "receipt number",
          "type": "text",
          "value": "110103"
        },
        {
          "label": "date of issue",
          "type": "date",
          "value": "2023-08-28",
          "raw": "28.08.2023"
        }
      ]
    },
    {
      "label": "payment details",
      "type": "record",
      "value": [
        {
          "label": "net amount",
          "type": "quantity",
          "value": "23.95 EUR",
          "raw": "23,95 €"
        }
      ]
    }
  ]
}

As you can see, the extraction consists of a list of entities and each entity has a label key and a type key. Further details depend on the entity type:

  • record: These entities can appear only in the first hierarchical level. The value field is a list of subentities.
  • text: These are simple entities consisting of text extracted directly from the document under the value key.
  • date: These entities have a value key with a normalized date in YYYY-MM-DD format and a raw key containing the literal information as found in the document.
  • quantity: This entity type is used for numerical quantities with a unit of measurement. The value key contains the normalized information as a floating point number followed by a space followed by the unit of measurement. The raw property is the literal information as found in the document. Note in the example that the decimal comma in the raw property is converted to a decimal dot in the value field.
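
The normalized value strings described above are straightforward to turn into native types. The following is a minimal Python sketch, assuming the value formats are exactly as described (an ISO date, and a number, a single space, then the unit); the helper names `parse_quantity` and `parse_date` are hypothetical and not part of any smartextract client.

```python
from datetime import date


def parse_quantity(value: str) -> tuple[float, str]:
    """Split a normalized quantity such as "23.95 EUR" into (amount, unit)."""
    # Split on the first space only, so a unit containing spaces stays intact.
    amount_text, unit = value.split(" ", 1)
    return float(amount_text), unit


def parse_date(value: str) -> date:
    """Convert a normalized YYYY-MM-DD string into a datetime.date."""
    return date.fromisoformat(value)


# Entities copied from the receipt example.
net_amount = {"label": "net amount", "type": "quantity",
              "value": "23.95 EUR", "raw": "23,95 €"}
date_of_issue = {"label": "date of issue", "type": "date",
                 "value": "2023-08-28", "raw": "28.08.2023"}

amount, unit = parse_quantity(net_amount["value"])
issued = parse_date(date_of_issue["value"])
print(amount, unit, issued)  # 23.95 EUR 2023-08-28
```

For monetary amounts you may prefer `decimal.Decimal` over `float` to avoid binary rounding effects.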

Now suppose the receipt additionally includes a couple of line items:

  • line item 1
    • description: Item A
    • price: 4.00 EUR
  • line item 2
    • description: Item B
    • price: 19.95 EUR

Then our extraction object would be extended in the following way:

{
  "schema": "https://smartextract.ai/schemas/extraction/v0",
  "entities": [
    {
      "label": "general information",
      ...
    },
    {
      "label": "payment details",
      ...
    },
    {
      "label": "line item",
      "type": "record",
      "index": 0,
      "value": [
        {
          "label": "description",
          "type": "text",
          "value": "Item A"
        },
        {
          "label": "price",
          "type": "quantity",
          "value": "4.00 EUR",
          "raw": "€ 4,-"
        }
      ]
    },
    {
      "label": "line item",
      "type": "record",
      "index": 1,
      "value": [
        {
          "label": "description",
          "type": "text",
          "value": "Item B"
        },
        {
          "label": "price",
          "type": "quantity",
          "value": "19.95 EUR",
          "raw": "€19,95"
        }
      ]
    }
  ]
}

Note the following:

  • Each line item is represented as an entity of type record with "label": "line item".
  • Each line item record has a fixed structure, with a description text field and a price quantity field.
  • Zero or more line item records may appear and each of them has an index key indicating its order. Contrast this with the "general information" and "payment details" fields, which appear exactly once and don't have an index key.
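
Because repeated records share a label and differ only in their index key, a consumer typically collects them by label and orders them by index. The sketch below illustrates one way to do this in Python; the function name `records_by_label` is hypothetical, and treating a missing or null index as 0 is an assumption made for this example.

```python
def records_by_label(extraction: dict, label: str) -> list[dict]:
    """Return the records with the given label as {field label: value} dicts, ordered by index."""
    records = [
        entity for entity in extraction["entities"]
        if entity["type"] == "record" and entity["label"] == label
    ]
    # Records that appear exactly once have no index (or null); treat that as 0.
    records.sort(key=lambda entity: entity.get("index") or 0)
    return [
        {sub["label"]: sub["value"] for sub in entity["value"]}
        for entity in records
    ]


# Line items from the example, deliberately listed out of order.
extraction = {
    "schema": "https://smartextract.ai/schemas/extraction/v0",
    "entities": [
        {"label": "line item", "type": "record", "index": 1, "value": [
            {"label": "description", "type": "text", "value": "Item B"},
            {"label": "price", "type": "quantity", "value": "19.95 EUR", "raw": "€19,95"},
        ]},
        {"label": "line item", "type": "record", "index": 0, "value": [
            {"label": "description", "type": "text", "value": "Item A"},
            {"label": "price", "type": "quantity", "value": "4.00 EUR", "raw": "€ 4,-"},
        ]},
    ],
}

items = records_by_label(extraction, "line item")
for item in items:
    print(item["description"], item["price"])
```

A label with no matching records simply yields an empty list, which matches the "zero or more" rule for line items.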

Finally, we note that entity objects may contain additional fields, including:

  • confidence: An indication of how sure the model is about the accuracy of the extracted information.
  • page and box: The location in the document where the information was found.

These and other fields may be null or omitted, depending on your specific inbox and pipeline settings.
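
Since optional fields can be either absent or null, client code should handle both cases the same way. The following Python sketch flags entities for manual review; the threshold of 0.8, the assumption that a higher confidence value means a more reliable extraction, and the helper names are all illustrative and not documented behavior.

```python
from typing import Iterator


def iter_leaf_entities(entities: list[dict]) -> Iterator[dict]:
    """Yield every non-record entity, descending one level into records."""
    for entity in entities:
        if entity["type"] == "record":
            yield from entity["value"]
        else:
            yield entity


def needs_review(entity: dict, threshold: float) -> bool:
    """Flag an entity whose confidence is missing, null, or below the threshold."""
    # .get() returns None both when the key is omitted and when it is null.
    confidence = entity.get("confidence")
    return confidence is None or confidence < threshold


entities = [
    {"label": "payment details", "type": "record", "value": [
        {"label": "net amount", "type": "quantity", "value": "23.95 EUR",
         "raw": "23,95 €", "confidence": 0.42},
    ]},
    {"label": "general information", "type": "record", "value": [
        {"label": "receipt number", "type": "text", "value": "110103",
         "confidence": 0.97},
        {"label": "date of issue", "type": "date", "value": "2023-08-28",
         "raw": "28.08.2023", "confidence": None},
    ]},
]

REVIEW_THRESHOLD = 0.8  # hypothetical cut-off, not a documented value
flagged = [
    entity["label"]
    for entity in iter_leaf_entities(entities)
    if needs_review(entity, REVIEW_THRESHOLD)
]
print(flagged)  # ['net amount', 'date of issue']
```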

Complete extraction schema in JSON Schema format
{
  "$defs": {
    "Box": {
      "maxItems": 4,
      "minItems": 4,
      "prefixItems": [
        {
          "title": "L",
          "type": "number"
        },
        {
          "title": "T",
          "type": "number"
        },
        {
          "title": "R",
          "type": "number"
        },
        {
          "title": "B",
          "type": "number"
        }
      ],
      "type": "array"
    },
    "CompositeEntity": {
      "description": "An extracted entity corresponding to a record.",
      "properties": {
        "label": {
          "description": "The entity label, must be the `name` of a template field.",
          "title": "Label",
          "type": "string"
        },
        "value": {
          "description": "For each field in the template, the corresponding entity extraction.",
          "items": {
            "anyOf": [
              {
                "$ref": "#/$defs/TextEntity"
              },
              {
                "$ref": "#/$defs/QuantityEntity"
              },
              {
                "$ref": "#/$defs/DateEntity"
              }
            ]
          },
          "title": "Value",
          "type": "array"
        },
        "type": {
          "$ref": "#/$defs/FieldType",
          "description": "The type of the entity, matching the `type` of the corresponding template field."
        },
        "index": {
          "anyOf": [
            {
              "type": "integer"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "description": "If this entity corresponds to a field type that can appear multiple times, the index of this instance; otherwise null.",
          "title": "Index"
        },
        "page": {
          "anyOf": [
            {
              "type": "integer"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "description": "Document page where the information is found.",
          "title": "Page"
        },
        "box": {
          "anyOf": [
            {
              "$ref": "#/$defs/Box"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "description": "Normalized bounding box where the information is found."
        },
        "confidence": {
          "anyOf": [
            {
              "type": "number"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "description": "Confidence score of the extraction. Please interpret this information with care, as it does not imply a guarantee of correctness.",
          "title": "Confidence"
        }
      },
      "required": [
        "label",
        "value",
        "type"
      ],
      "title": "CompositeEntity",
      "type": "object"
    },
    "DateEntity": {
      "description": "An extracted entity corresponding to a date field.",
      "properties": {
        "label": {
          "description": "The entity label, must be the `name` of a template field.",
          "title": "Label",
          "type": "string"
        },
        "value": {
          "description": "The normalized value of the entity.",
          "title": "Value",
          "type": "string"
        },
        "type": {
          "$ref": "#/$defs/FieldType",
          "description": "The type of the entity, matching the `type` of the corresponding template field."
        },
        "index": {
          "anyOf": [
            {
              "type": "integer"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "description": "If this entity corresponds to a field type that can appear multiple times, the index of this instance; otherwise null.",
          "title": "Index"
        },
        "page": {
          "anyOf": [
            {
              "type": "integer"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "description": "Document page where the information is found.",
          "title": "Page"
        },
        "box": {
          "anyOf": [
            {
              "$ref": "#/$defs/Box"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "description": "Normalized bounding box where the information is found."
        },
        "confidence": {
          "anyOf": [
            {
              "type": "number"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "description": "Confidence score of the extraction. Please interpret this information with care, as it does not imply a guarantee of correctness.",
          "title": "Confidence"
        },
        "raw": {
          "description": "The information as found in the document.",
          "title": "Raw",
          "type": "string"
        }
      },
      "required": [
        "label",
        "value",
        "type",
        "raw"
      ],
      "title": "DateEntity",
      "type": "object"
    },
    "FieldType": {
      "description": "Enumeration of possible field types in a template.",
      "enum": [
        "text",
        "choice",
        "date",
        "quantity",
        "record"
      ],
      "title": "FieldType",
      "type": "string"
    },
    "QuantityEntity": {
      "description": "An extracted entity corresponding to a quantity field.",
      "properties": {
        "label": {
          "description": "The entity label, must be the `name` of a template field.",
          "title": "Label",
          "type": "string"
        },
        "value": {
          "description": "The extracted value of the entity.",
          "title": "Value",
          "type": "string"
        },
        "type": {
          "$ref": "#/$defs/FieldType",
          "description": "The type of the entity, matching the `type` of the corresponding template field."
        },
        "index": {
          "anyOf": [
            {
              "type": "integer"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "description": "If this entity corresponds to a field type that can appear multiple times, the index of this instance; otherwise null.",
          "title": "Index"
        },
        "page": {
          "anyOf": [
            {
              "type": "integer"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "description": "Document page where the information is found.",
          "title": "Page"
        },
        "box": {
          "anyOf": [
            {
              "$ref": "#/$defs/Box"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "description": "Normalized bounding box where the information is found."
        },
        "confidence": {
          "anyOf": [
            {
              "type": "number"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "description": "Confidence score of the extraction. Please interpret this information with care, as it does not imply a guarantee of correctness.",
          "title": "Confidence"
        },
        "raw": {
          "description": "The information as found in the document.",
          "title": "Raw",
          "type": "string"
        },
        "unit": {
          "description": "The measurement unit of the quantity",
          "title": "Unit",
          "type": "string"
        },
        "amount": {
          "description": "The pure numeric part of the quantity",
          "title": "Amount",
          "type": "number"
        }
      },
      "required": [
        "label",
        "value",
        "type",
        "raw",
        "unit",
        "amount"
      ],
      "title": "QuantityEntity",
      "type": "object"
    },
    "TextEntity": {
      "description": "An extracted entity corresponding to a simple field.",
      "properties": {
        "label": {
          "description": "The entity label, must be the `name` of a template field.",
          "title": "Label",
          "type": "string"
        },
        "value": {
          "description": "The extracted value of the entity.",
          "title": "Value",
          "type": "string"
        },
        "type": {
          "$ref": "#/$defs/FieldType",
          "description": "The type of the entity, matching the `type` of the corresponding template field."
        },
        "index": {
          "anyOf": [
            {
              "type": "integer"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "description": "If this entity corresponds to a field type that can appear multiple times, the index of this instance; otherwise null.",
          "title": "Index"
        },
        "page": {
          "anyOf": [
            {
              "type": "integer"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "description": "Document page where the information is found.",
          "title": "Page"
        },
        "box": {
          "anyOf": [
            {
              "$ref": "#/$defs/Box"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "description": "Normalized bounding box where the information is found."
        },
        "confidence": {
          "anyOf": [
            {
              "type": "number"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "description": "Confidence score of the extraction. Please interpret this information with care, as it does not imply a guarantee of correctness.",
          "title": "Confidence"
        }
      },
      "required": [
        "label",
        "value",
        "type"
      ],
      "title": "TextEntity",
      "type": "object"
    }
  },
  "description": "Extraction information returned to the user.",
  "properties": {
    "schema": {
      "const": "https://smartextract.ai/schemas/extraction/v0",
      "description": "Reference to the schema describing this data structure.",
      "title": "Schema",
      "type": "string"
    },
    "confidence": {
      "anyOf": [
        {
          "type": "number"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "description": "Overall confidence score of the extraction. Please interpret this information with care, as it does not imply a guarantee of correctness.",
      "title": "Confidence"
    },
    "entities": {
      "items": {
        "anyOf": [
          {
            "$ref": "#/$defs/TextEntity"
          },
          {
            "$ref": "#/$defs/QuantityEntity"
          },
          {
            "$ref": "#/$defs/DateEntity"
          },
          {
            "$ref": "#/$defs/CompositeEntity"
          }
        ]
      },
      "title": "Entities",
      "type": "array"
    }
  },
  "required": [
    "schema",
    "entities"
  ],
  "title": "Extraction",
  "type": "object"
}
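
The Box definition in the schema is an array of exactly four numbers titled L, T, R and B. As a rough illustration, the Python sketch below converts such a box to pixel coordinates. It assumes that the four numbers mean left, top, right and bottom, and that "normalized" means each coordinate is a fraction between 0 and 1 of the page width or height; neither assumption is confirmed by the schema itself, and `box_to_pixels` is a hypothetical helper.

```python
def box_to_pixels(
    box: list[float], page_width: int, page_height: int
) -> tuple[int, int, int, int]:
    """Scale a normalized [L, T, R, B] box to integer pixel coordinates."""
    if len(box) != 4:
        raise ValueError("a box must contain exactly four numbers")
    left, top, right, bottom = box
    # Assumption: horizontal values are fractions of the page width,
    # vertical values are fractions of the page height.
    return (
        round(left * page_width),
        round(top * page_height),
        round(right * page_width),
        round(bottom * page_height),
    )


# A box covering the middle of a page rendered at 1000 x 800 pixels.
pixels = box_to_pixels([0.25, 0.5, 0.75, 0.625], page_width=1000, page_height=800)
print(pixels)  # (250, 400, 750, 500)
```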