Using AI to Automate Invoice Processing at Enterprise Scale

Invoice parsing has long been one of the more challenging tasks in enterprise AI. Even with tools like Tesseract OCR, parsing an invoice is complicated because every vendor uses a different format. But what if we could harness a capable AI system like Google’s Gemini Vision model to process invoices? That’s what we are going to do in this article.

This prototype was part of Skcript's S1 AI Suite, where you can get pre-built AI solutions for your enterprise and run them at scale in days.

Before you jump into this article, remember that what we are showing here is just the beginning. We are implementing complex invoice processing at near-real-time speed for enterprise AI use cases, and if you are looking for a pre-built AI model built specifically for invoice processing at scale, reach out to us.

What is Google’s Gemini Vision?

Gemini Vision is a multimodal model that accepts both text and images as input and returns text as output. It is a pre-trained model that can be used for use cases such as invoice processing and document processing. It is available through Google’s MakerSuite (now Google AI Studio) and can be tried for free.

What are prompts?

Prompts are custom instructions that guide Gemini Vision toward specific data points within an invoice. Like pointing at a spot on a map, a prompt directs the model to the vendor name, the due date, or any other information you need. Think of them as personalized cheat codes for unlocking precise data extraction.

Contents of an Invoice

Although invoices come in many formats and designs, they share common fields that we typically want to extract and then post to other systems or record in a database. The most common ones are:

Invoice Number
Invoice Date
Due Date
Vendor Name
Vendor Address
Vendor Phone Number
Vendor Email

For this article I will use the following invoice as an example:

[Image: sample invoice used in this article]

Our goal is to extract these parameters from the invoice and store them in a database or post them to other systems. Before that, we need to understand how to use Google’s Gemini Vision model to extract them.

Talking to the model

Because Gemini Vision is multimodal, we send it two things together: the invoice image and a text prompt describing what to extract. The simplest way to test the model and see how it works is the playground:

Navigate to Google's AI Studio
Create an account and navigate to https://makersuite.google.com/app/prompts/new_freeform
Then choose the model as Gemini Pro Vision

Insert your invoice image in the playground and add a prompt. In this case, we'll use the prompt below:

Parse this invoice into the following JSON structure and dont create new keys and leave the keys empty when the value is not present
```json
{ "invoiceId": "", "invoiceDate": "", "merchantName": "", "address": "", "lineItems": [{ "description": "", "quantity": "(How many number of items in this type)", "perUnit": "", "total":"" }] }
```

Notice that I'm telling the model exactly what I want from the invoice and how I want it structured. This is the power of prompts.
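
If the same structure is needed in more than one place, the prompt text can be generated from a template instead of being typed by hand. The following is a minimal sketch, not part of the original prototype: `INVOICE_TEMPLATE` and `build_prompt` are hypothetical names, and the instruction wording is copied from the prompt above.

```python
import json

# Hypothetical template mirroring the JSON skeleton used in the prompt above.
INVOICE_TEMPLATE = {
    "invoiceId": "",
    "invoiceDate": "",
    "merchantName": "",
    "address": "",
    "lineItems": [
        {
            "description": "",
            "quantity": "(How many number of items in this type)",
            "perUnit": "",
            "total": "",
        }
    ],
}


def build_prompt(template: dict) -> str:
    """Return the instruction followed by the JSON skeleton the model should fill in."""
    instruction = (
        "Parse this invoice into the following JSON structure and dont create "
        "new keys and leave the keys empty when the value is not present"
    )
    fence = "`" * 3  # Markdown code fence, built here to keep this listing readable
    return f"{instruction}\n{fence}json\n{json.dumps(template)}\n{fence}"


if __name__ == "__main__":
    print(build_prompt(INVOICE_TEMPLATE))
```

Keeping the skeleton as a Python dictionary means the keys you ask for and the keys you later validate can come from one place.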

Now when you run the model, you'll get the output as follows:

{
  "invoiceId": "US-001",
  "invoiceDate": "11/02/2019",
  "merchantName": "East Repair Inc.",
  "address": "1912 Harvest Lane\nNew York, NY 12210",
  "lineItems": [
    {
      "description": "Front and rear brake cables",
      "quantity": 1,
      "perUnit": 100.0,
      "total": 100.0
    },
    {
      "description": "New set of pedal arms",
      "quantity": 2,
      "perUnit": 15.0,
      "total": 30.0
    },
    {
      "description": "Labor 3hrs",
      "quantity": 3,
      "perUnit": 15.0,
      "total": 45.0
    }
  ]
}

And that's how simple it is to use the model. You don't need to train anything; just supply the prompt and the model does the rest.
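
Before the extracted values go anywhere else, it helps to turn the model's text reply into a Python dictionary and confirm it only contains the keys that were requested. The sketch below is an illustration under assumptions, not the original implementation: `parse_invoice_json` and `EXPECTED_KEYS` are hypothetical names, and the function assumes the reply may or may not arrive wrapped in a Markdown code fence like the one used in the prompt.

```python
import json

# Top-level keys requested in the prompt above.
EXPECTED_KEYS = {"invoiceId", "invoiceDate", "merchantName", "address", "lineItems"}


def parse_invoice_json(text: str) -> dict:
    """Parse the model's reply into a dict, tolerating an optional Markdown fence."""
    fence = "`" * 3
    cleaned = text.strip()
    if cleaned.startswith(fence):
        # Drop the opening fence line (for example the fence followed by "json").
        first_newline = cleaned.find("\n")
        cleaned = cleaned[first_newline + 1:] if first_newline != -1 else ""
        # Drop the closing fence if there is one.
        if cleaned.rstrip().endswith(fence):
            cleaned = cleaned.rstrip()[: -len(fence)]

    data = json.loads(cleaned)  # raises a ValueError subclass on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("Expected a JSON object at the top level")

    unexpected = set(data) - EXPECTED_KEYS
    if unexpected:
        raise ValueError(f"Unexpected keys in model output: {sorted(unexpected)}")

    # Fill in anything the model left out so downstream code sees a stable shape.
    for key in EXPECTED_KEYS:
        data.setdefault(key, [] if key == "lineItems" else "")
    return data
```

Rejecting unexpected keys is a deliberate choice here, since the prompt asks the model not to create new ones; a more lenient pipeline could simply ignore them.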

How do we do it in our systems instead of the playground? Well, Google has provided a simple API to do the same.

Building an API out of it

Wrapping Gemini Vision in an API lets you call it from your own systems. Developers can use Google's pre-built functionality as-is or tailor the API to their specific needs, which makes it easier to integrate with other systems, streamline invoice processing, and put the extracted data to use.

In Google's AI Studio, click the "Get API Key" button.
Copy the API key; you will use it in the code below.
Create a simple Python project and install the package below:

pip install google-generativeai

Place your invoice image in the same folder as the Python file and name it invoice.png.
Create a new index.py file and paste in the code below:

from pathlib import Path

import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

# Generation settings passed to the model
generation_config = {
    "temperature": 0.4,
    "top_p": 1,
    "top_k": 32,
    "max_output_tokens": 4096,
}

# Set up the model
model = genai.GenerativeModel(
    model_name="gemini-pro-vision",
    generation_config=generation_config,
)

# Validate that the image is present before calling the API
image_path = Path("invoice.png")
if not image_path.exists():
    raise FileNotFoundError(f"Could not find image: {image_path}")

# The MIME type must match the file format (a PNG here)
image_part = {
    "mime_type": "image/png",
    "data": image_path.read_bytes(),
}

# Text prompt: the instruction followed by the JSON skeleton to fill in
prompt = (
    "\nParse this invoice into the following JSON structure and dont create new keys "
    "and leave the keys empty when the value is not present\n```json \n"
    '{ "invoiceId": "", "invoiceDate": "", "merchantName": "", "address": "", '
    '"lineItems": [{ "description": "", '
    '"quantity": "(How many number of items in this type)", '
    '"perUnit": "", "total":"" }] }'
    "\n```"
)

# Send the image and the prompt together, then print the model's reply
response = model.generate_content([image_part, prompt])
print(response.text)

Just replace the API key and the image path, then run the code. The script prints the model's JSON response, in the same structure as the playground output shown earlier.
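
Pasting the key into the source file is fine for a quick test, but it is easy to commit by accident. One common alternative, sketched here as an assumption rather than as part of the original prototype, is to read the key from an environment variable. The variable name `GEMINI_API_KEY` and the helper `load_api_key` are hypothetical.

```python
import os


def load_api_key(env_var: str = "GEMINI_API_KEY") -> str:
    """Read the API key from the environment and fail early if it is missing."""
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable before running the script")
    return api_key
```

With this helper in place, the configuration line in index.py could become `genai.configure(api_key=load_api_key())`.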

Now you can simply feed this data into your invoice processing pipeline and do the rest of the processing.
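
A cheap sanity check before the data enters a pipeline is to confirm that each line item's quantity times unit price equals its total. The following sketch assumes the JSON structure used in this article; `inconsistent_line_items` is a hypothetical helper, and the sample values are taken from the playground output above.

```python
from decimal import Decimal


def inconsistent_line_items(invoice: dict) -> list:
    """Return descriptions of line items where quantity * perUnit != total."""
    problems = []
    for item in invoice.get("lineItems", []):
        try:
            # Go through str() so floats such as 15.0 convert to exact decimals.
            quantity = Decimal(str(item["quantity"]))
            per_unit = Decimal(str(item["perUnit"]))
            total = Decimal(str(item["total"]))
        except (KeyError, ArithmeticError):
            # Missing or non-numeric values are flagged for manual review.
            problems.append(item.get("description", "<unknown>"))
            continue
        if quantity * per_unit != total:
            problems.append(item.get("description", "<unknown>"))
    return problems


if __name__ == "__main__":
    sample = {
        "lineItems": [
            {"description": "Front and rear brake cables", "quantity": 1, "perUnit": 100.0, "total": 100.0},
            {"description": "New set of pedal arms", "quantity": 2, "perUnit": 15.0, "total": 30.0},
            {"description": "Labor 3hrs", "quantity": 3, "perUnit": 15.0, "total": 45.0},
        ]
    }
    print(inconsistent_line_items(sample))  # prints []
```

Items that fail the check are not necessarily wrong (discounts or taxes may apply per line), so treating them as candidates for review rather than rejecting them outright is the safer default.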

A New Era of Invoice Processing

Google Gemini Vision marks a paradigm shift in invoice parsing. Its multimodal approach promises to:

Reduce manual processing: Automation significantly minimizes tedious manual data entry, saving time and resources.
Boost accuracy: Multimodal processing minimizes errors and inconsistencies, leading to more reliable financial and inventory data.
Unlock deeper insights: Extracted data readily integrates with existing systems, enabling insightful analysis and reporting.
Streamline workflows: Automated invoice processing fosters faster approvals, payments, and other workflow steps.

Need to implement AI-based invoice processing at enterprise scale? We're here to help. Reach out to us and we'll get back to you within 24 hours.
