Document QA

Use the DocVision API to submit a document image and ask a question about it. The DocVision API is well suited to answering open-ended questions and extracting information from documents. It processes simple documents easily and at low cost.

Available models

Model	Release date	Context Length	Description
solar-docvision `preview`	2024-09-10	8192	A model specialized for Document Visual Question Answering. `solar-docvision` supports English only at this time. `solar-docvision` is an alias for our latest Solar DocVision model (currently `solar-docvision-preview-240910`).

Capabilities

Solar DocVision is trained to perform question-answering tasks on documents by extracting relevant information. Our model supports two main functionalities:

Extractive Question Answering (Extractive QA)
Key Information Extraction (KIE)

Extractive Question Answering (Extractive QA)

Extractive QA is a task that involves extracting appropriate answers from documents based on given questions.

For example, when presented with a business card and asked "What is the phone number?", Solar DocVision can extract the correct phone number from the document.

Key Information Extraction (KIE)

In addition to Extractive QA, Solar DocVision can perform basic Key Information Extraction tasks.

Using specific prompts, Solar DocVision can extract structured data from documents and present it in JSON format.

For instance, given a business card image, you could use the following prompt:

Extract the information from the business card. Format the output as JSON.

Solar DocVision would produce a response like this:

{
  "name": "John Smith",
  "phone": "123-456-7890",
  "email": "john.smith@example.com",
  "company": "Tech Innovations Inc."
}

This capability allows for efficient extraction of multiple pieces of information from a single document.
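Because the KIE result arrives as text, a client usually has to parse it before using the fields. The following is a minimal sketch, assuming the model returns a single well-formed JSON object; `parse_kie_output` is a hypothetical helper name and not part of the API.

```python
import json


def parse_kie_output(raw: str) -> dict:
    """Parse the model's text output into a dict.

    Assumes the output is one JSON object, possibly padded with whitespace.
    """
    data = json.loads(raw.strip())
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    return data


# The sample business-card output from this section, as a text payload.
sample_output = """
{
  "name": "John Smith",
  "phone": "123-456-7890",
  "email": "john.smith@example.com",
  "company": "Tech Innovations Inc."
}
"""

card = parse_kie_output(sample_output)
print(card["name"], card["phone"])
```

If the model's output is not valid JSON, `json.loads` raises `json.JSONDecodeError`, which the caller can catch and handle, for example by retrying with a stricter prompt.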

Limitations

Solar DocVision does not currently support summarization, reasoning, or chat-based interactions.

Request

POST https://api.upstage.ai/v1/solar/chat/completions

Parameters

The messages parameter is a list of message objects. Each message object has a role (which must be "user" for the DocVision model) and content. Currently, the model accepts only one message with the "user" role.

A "user" message is where you place your question and document image. The message.content list contains both the question and the image. For details, see the parameter and example sections below.
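As an illustration of the message structure described above, here is a minimal sketch that builds the single-element messages list. The helper name `build_messages` and the example URL are hypothetical; the field names follow the request body parameters on this page.

```python
def build_messages(question: str, image_url: str) -> list:
    """Return the messages list for one question about one document image."""
    return [
        {
            "role": "user",  # the only role the DocVision model accepts
            "content": [
                # One content part carrying the document image ...
                {"type": "image_url", "image_url": {"url": image_url}},
                # ... and one content part carrying the question.
                {"type": "text", "text": question},
            ],
        }
    ]


messages = build_messages(
    "What is the phone number?",
    "https://example.com/business-card.png",  # placeholder URL
)
print(messages[0]["content"][1]["text"])
```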

Request headers

Authorization string Required
Authentication token, format: Bearer API_KEY

Request body

messages list Required
A list of messages comprising the conversation so far. Currently, the model accepts only one message with the "user" role.

messages[].content list Optional
The contents of the user message, given as a list of content parts.

messages[].content[].type string Required
The type of the content part. Either "text" or "image_url".
If type is "text", the messages[].content[].text field is required; if type is "image_url", the messages[].content[].image_url field is required.

messages[].content[].text string Optional
The question to the document.

messages[].content[].image_url object Optional
An object that contains the url field.

messages[].content[].image_url.url string Optional
Either the URL of the image or the base64-encoded image.

messages[].role string Required
The role of the message's author. Must be "user".

model string Required
The name of the model used to generate the completion.

max_tokens integer Optional
An optional parameter that limits the maximum number of tokens to generate. If max_tokens is set, the sum of the input tokens and max_tokens must be less than or equal to the model's context length. Default value is inf.

stream boolean Optional
An optional parameter that specifies whether the response should be sent as a stream. If set to true, partial message deltas are sent as data-only server-sent events. Default value is false.

temperature float Optional
An optional parameter to set the sampling temperature. The value should lie between 0 and 2. Higher values like 0.8 result in a more random output, whereas lower values such as 0.2 enhance focus and determinism in the output. Default value is 0.7.

top_p float Optional
An optional parameter that controls nucleus sampling. Only the tokens comprising the top_p probability mass are considered; for example, setting this value to 0.1 considers only the tokens that make up the top 10% of the probability mass.
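To make the top_p description concrete, the sketch below applies the textbook definition of nucleus sampling to a toy distribution. It is not the service's implementation; the function name `nucleus_filter` and the probabilities are made up for illustration.

```python
def nucleus_filter(probs: dict, top_p: float) -> dict:
    """Keep the most probable tokens until their cumulative probability
    reaches top_p, then renormalize the kept probabilities to sum to 1."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept = []
    cumulative = 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


# Toy next-token distribution (illustrative values only).
toy = {"4.50": 0.6, "4.5": 0.25, "5.00": 0.1, "n/a": 0.05}

narrow = nucleus_filter(toy, 0.1)  # only the single most likely token survives
wide = nucleus_filter(toy, 0.8)    # the two most likely tokens survive
print(list(narrow), list(wide))
```

A low top_p therefore narrows sampling to the most likely tokens, which tends to suit extraction tasks where one answer is expected.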

Requirements

Supported image formats: JPEG, PNG
Maximum image size: 16MB
Maximum image dimensions: 4096 pixels for both width and height
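The requirements above can be checked on the client before a request is sent. This is a minimal sketch under stated assumptions: "16MB" is read as 16 × 1024 × 1024 bytes, the caller already knows the image's format and dimensions, and a base64 image is passed as a data URL (the exact base64 format is not specified on this page). `check_image` and `to_data_url` are hypothetical helper names.

```python
import base64

SUPPORTED_MIME_TYPES = {"image/jpeg", "image/png"}
MAX_IMAGE_BYTES = 16 * 1024 * 1024  # assumption: "16MB" interpreted as 16 MiB
MAX_SIDE_PIXELS = 4096


def check_image(mime_type: str, size_bytes: int, width: int, height: int) -> list:
    """Return a list of problems; an empty list means the image passes."""
    problems = []
    if mime_type not in SUPPORTED_MIME_TYPES:
        problems.append(f"unsupported format: {mime_type}")
    if size_bytes > MAX_IMAGE_BYTES:
        problems.append(f"file too large: {size_bytes} bytes")
    if width > MAX_SIDE_PIXELS or height > MAX_SIDE_PIXELS:
        problems.append(f"dimensions too large: {width}x{height}")
    return problems


def to_data_url(image_bytes: bytes, mime_type: str) -> str:
    """Encode raw image bytes as a data URL (assumed format, see lead-in)."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


print(check_image("image/png", 1024, 800, 600))                  # passes
print(check_image("image/gif", 17 * 1024 * 1024, 5000, 100))     # three problems
```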

Response

Return values

Returns a chat.completion object, or a streamed sequence of chat.completion.chunk objects if the request is streamed.

The chat completion object

id string
A unique identifier for the chat completion. Each chunk has the same ID.

object string
The object type, which is always chat.completion.

created integer
The Unix timestamp (in seconds) of when the chat completion was created. Each chunk has the same timestamp.

model string
A string representing the version of the model being used.

system_fingerprint null
This field is not yet available.

choices list
A list of chat completion choices.

choices[].finish_reason string
The reason the model stopped generating tokens. This is stop if the model hit a natural stop point or a provided stop sequence, or length if the maximum number of tokens specified in the request was reached.

choices[].index integer
The index of the choice in the list of choices.

choices[].message object
A chat completion message generated by the model.

choices[].message.content string
The contents of the message.

choices[].message.role string
The role of the author of this message.

choices[].logprobs null
This field is not yet available.

usage object
Usage statistics for the completion request.

usage.completion_tokens integer
Number of tokens in the generated completion.

usage.prompt_tokens integer
Number of tokens in the prompt.

usage.total_tokens integer
Total number of tokens used in the request (prompt + completion).
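The usage fields can be sanity-checked and used to size max_tokens for later requests. The sketch below uses the usage figures from the example response on this page and the 8192-token context length listed in the model table; the helper names are hypothetical.

```python
CONTEXT_LENGTH = 8192  # context length listed for solar-docvision


def usage_is_consistent(usage: dict) -> bool:
    """total_tokens should equal prompt_tokens + completion_tokens."""
    return usage["total_tokens"] == usage["prompt_tokens"] + usage["completion_tokens"]


def remaining_budget(prompt_tokens: int, context_length: int = CONTEXT_LENGTH) -> int:
    """Largest max_tokens that keeps prompt + max_tokens within the context length."""
    return max(context_length - prompt_tokens, 0)


usage = {"prompt_tokens": 1907, "completion_tokens": 8, "total_tokens": 1915}
consistent = usage_is_consistent(usage)
budget = remaining_budget(usage["prompt_tokens"])
print(consistent, budget)
```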

The chat completion chunk object

id string
A unique identifier for the chat completion. Each chunk has the same ID.

object string
The object type, which is always chat.completion.chunk.

created integer
The Unix timestamp (in seconds) of when the chat completion was created. Each chunk has the same timestamp.

model string
A string representing the version of the model being used.

system_fingerprint null
This field is not yet available.

choices list
A list of chat completion choices.

choices[].index integer
The index of the choice in the list of choices.

choices[].delta object
A partial message delta generated by the model.

choices[].delta.content string
The contents of the message.

choices[].delta.role string or null
The role of the author of this message.

choices[].logprobs null
This field is not yet available.

Example

Request

curl --location 'https://api.upstage.ai/v1/solar/chat/completions' \
  --header 'Authorization: Bearer UPSTAGE_API_KEY' \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "solar-docvision",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "image_url",
            "image_url": {
              "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/0/0b/ReceiptSwiss.jpg/340px-ReceiptSwiss.jpg"
            }
          },
          {
            "type": "text",
            "text": "How much is Latte Macchiato?"
          }
        ]
      }
    ]
}'
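The same request can be assembled in Python. This is a minimal sketch using only the standard library; `build_request` is a hypothetical helper, and `UPSTAGE_API_KEY` is a placeholder for a real key. The block only builds the request object and does not send it.

```python
import json
import urllib.request

API_URL = "https://api.upstage.ai/v1/solar/chat/completions"


def build_request(api_key: str, image_url: str, question: str) -> urllib.request.Request:
    """Build a POST request equivalent to the curl example."""
    body = {
        "model": "solar-docvision",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": question},
                ],
            }
        ],
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_request(
    "UPSTAGE_API_KEY",
    "https://upload.wikimedia.org/wikipedia/commons/thumb/0/0b/ReceiptSwiss.jpg/340px-ReceiptSwiss.jpg",
    "How much is Latte Macchiato?",
)
# To send it:
#   with urllib.request.urlopen(request) as resp:
#       result = json.load(resp)
print(request.get_method(), request.full_url)
```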

Response

{
    "id": "b3773198-1280-4bc4-ba8c-ea5d907fdff9",
    "object": "chat.completion",
    "created": 1725432431,
    "model": "solar-docvision-preview-240910",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": " 4.50\n\n"
            },
            "logprobs": null,
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 1907,
        "completion_tokens": 8,
        "total_tokens": 1915
    },
    "system_fingerprint": null
}
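The answer text in this response carries leading and trailing whitespace, so a client will usually want to trim it. A minimal sketch follows, using the response above as a Python dict; `extract_answer` is a hypothetical helper name.

```python
def extract_answer(response: dict) -> str:
    """Return the first choice's message content with whitespace trimmed."""
    return response["choices"][0]["message"]["content"].strip()


# The example response above, written as a Python dict.
response = {
    "id": "b3773198-1280-4bc4-ba8c-ea5d907fdff9",
    "object": "chat.completion",
    "created": 1725432431,
    "model": "solar-docvision-preview-240910",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": " 4.50\n\n"},
            "logprobs": None,
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 1907, "completion_tokens": 8, "total_tokens": 1915},
    "system_fingerprint": None,
}

answer = extract_answer(response)
print(answer)
```

Checking finish_reason before trusting the answer is also reasonable: a value of length means generation stopped at the token limit and the text may be cut off.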

Migrating from Layout Analysis Document OCR