Invoice Information extraction using OCR and Deep Learning

Sourav Ghosh
Published in Analytics Vidhya · Jan 14, 2021

Document information extraction is a major challenge in computer vision, involving a combination of object classification and object localization within a scene. Modern advances in deep learning have led to significant progress in object detection, with the majority of research focusing on designing increasingly complex detection networks for improved accuracy, such as SSD, R-CNN, Mask R-CNN, and their extended variants. This project aims to extract information from invoices using the latest deep learning techniques available for object detection, and introduces a deep convolutional neural network model for this detection task.

Key Benefits

Reducing Cost: Automated extraction helps the organization cut the cost of hiring manpower for manual data entry, and employees can focus on other, more productive work.

Reducing Error: Extracting information from invoices is difficult because they come in many different formats. Human error is another big problem, leading to data loss and inaccuracy. OCR helps reduce human error and makes extraction more accurate.

Ready Availability: OCR does not require human intervention for the extraction and validation process; once an invoice is fed to the system, the text is extracted and pushed to the inventory in the same flow.

Security: A fully automated extraction process provides data-level security to the organization, and the data is not easily visible to the outside world.

Study of Current Literature

This paper focuses on object detection in images using Convolutional Neural Networks (CNNs). An image classification problem predicts the label of an image from a set of predefined labels, under the assumption that there is a single object of interest covering a significant portion of the image. The detection task is not only about assigning a class to that object but also about localizing its extent within the image.

Block-wise orientation histogram features (SIFT or HOG) were previously used for object detection; they encode only very low-level characteristics of an object, which limits how well different labels can be distinguished. Deep convolutional neural networks have since become the state of the art in object detection.

Convolutional Neural Network Architecture

The connectivity pattern of a CNN is inspired by that of neurons in the human brain. A CNN is adept at capturing spatial and temporal dependencies in an image through different filters, so a network trained with convolutional layers can learn the sophistication of an image. A convolutional network consists of two main parts: feature learning (the hidden layers), comprising Convolution, ReLU, and Pooling, and the classification part, comprising fully connected (FC) and Softmax layers.

Technically, in a ConvNet each image passes through a series of convolution layers with multiple kernels (filters), pooling layers, and fully connected layers; at the end of the network a Softmax function outputs probabilities in [0, 1] to classify the objects in the image. Figure 1 shows the CNN pipeline that processes an input image and classifies the objects based on these values.

Architecture of a CNN (https://www.datascience.com/blog/convolutional-neural-network)
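Below is a minimal Keras sketch of the pipeline described above: Convolution, ReLU and Pooling repeated, then a fully connected layer and a Softmax output. The layer sizes and the three-class output are illustrative only, not the network used later in this project.

# Minimal sketch of the CNN pipeline: Conv -> ReLU -> Pool, then FC -> Softmax.
# Layer sizes and the 3-class output are illustrative only.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(224, 224, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),   # fully connected layer
    layers.Dense(3, activation="softmax"),  # class probabilities in [0, 1]
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()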

Region-Based Convolutional Network (R-CNN)

Instead of dealing with a huge number of regions, the R-CNN network creates a set of bounding boxes in the image and checks whether any of those boxes contains an object. R-CNN uses Selective Search to create these bounding boxes, or region proposals. Selective Search slides windows of different sizes over the image and groups adjacent pixels by scale, color, texture, and enclosure (a region-proposal sketch follows the figures below).

R-CNN architecture (https://towardsdatascience.com/r-cnn-fast-r-cnn-faster-r-cnn-yolo-object-detection-algorithms-36d53571365e)
Creation of region proposals.(https://arxiv.org/abs/1311.2524.)
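For reference, region proposals similar to those above can be generated with the Selective Search implementation in OpenCV's contrib module (the opencv-contrib-python package); this is only a convenient stand-in for the implementation used in the original R-CNN paper, and the input file name is a placeholder.

# Sketch: generating region proposals with Selective Search (requires opencv-contrib-python).
import cv2

image = cv2.imread("invoice.jpg")  # placeholder input image
ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(image)
ss.switchToSelectiveSearchFast()   # trade proposal quality for speed
rects = ss.process()               # array of (x, y, w, h) proposals
print(len(rects), "region proposals, first 5:", rects[:5])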

The steps R-CNN follows to detect objects in an image are:

  • Transfer learning is a key concept in the deep learning paradigm, so we take a pre-trained convolutional neural network and re-train its final layer for the classes that need to be detected (see the sketch after this list).
  • Next, we compute the regions of interest (ROIs) for each image and reshape all of them to match the CNN input size.
  • Once all the regions are computed, we train a Support Vector Machine (SVM) as a binary classifier to separate objects from background.
  • Finally, we use a linear regression model to output tighter coordinates for the boxes.
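A rough sketch of these steps, assuming the region proposals have already been warped to the CNN input size: features come from a pre-trained Keras VGG16 backbone and the classifier is a scikit-learn linear SVM. The crops and labels are placeholders.

# Sketch of the R-CNN steps: warp proposals to the CNN input size, extract
# features with a pre-trained network, then train an SVM on those features.
import numpy as np
from tensorflow.keras.applications import VGG16
from tensorflow.keras.applications.vgg16 import preprocess_input
from sklearn.svm import LinearSVC

backbone = VGG16(weights="imagenet", include_top=False, pooling="avg")

def extract_features(crops):
    """crops: array of region proposals resized to 224x224x3."""
    batch = preprocess_input(np.asarray(crops, dtype="float32"))
    return backbone.predict(batch)

# crops, labels = ...  (proposals warped to 224x224 and their class labels)
# features = extract_features(crops)
# svm = LinearSVC().fit(features, labels)  # object vs. background classifier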

Faster-R-CNN (FRCN)

Faster R-CNN is a state-of-the-art object detection framework based on deep convolutional networks; it combines a Region Proposal Network (RPN) with an object detection network. The two networks share convolutional layers, which makes training and testing fast. The RPN computes full-image convolutional features that the detection network reuses, so region proposals are nearly cost-free, and each object proposal is output together with an objectness score.

FRCNN architecture — (https://towardsdatascience.com/deep-learning-for-object-detection-a-comprehensive-review-73930816d8d9)

Region Proposal Network (RPN)

  • The RPN slides a 3x3 window across the feature map and maps each window to a lower-dimensional feature (see the anchor sketch after the figure below).
  • At each sliding-window location it generates k fixed anchor boxes of different shapes and sizes.
  • For every anchor box, the RPN computes the softmax probability that the box contains an object.
  • Then, to fit the object better, bounding box regression adjusts the anchor coordinates.
RPN architecture — https://www.analyticsvidhya.com/blog/2018/10/a-step-by-step-introduction-to-the-basic-object-detection-algorithms-part-1/
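A small NumPy sketch of the anchor generation step: k anchor boxes of different scales and aspect ratios centred on one sliding-window position. The scales and ratios here are illustrative, not the ones used in training.

# Sketch: k anchor boxes of different scales and aspect ratios at one position.
import numpy as np

def anchors_at(cx, cy, scales=(64, 128, 256), ratios=(0.5, 1.0, 2.0)):
    boxes = []
    for s in scales:
        for r in ratios:               # r is the width/height aspect ratio
            w, h = s * np.sqrt(r), s / np.sqrt(r)
            boxes.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(boxes)             # k = len(scales) * len(ratios) anchors

print(anchors_at(300, 300).shape)      # (9, 4) -> 9 anchors per location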

Once we have all the region proposals, the next step is to feed them to the Fast R-CNN network, which consists of a pooling layer, some fully connected layers, and finally a softmax classification layer and a bounding box regressor.

Single Shot MultiBox Detector (SSD)

At the end of 2016, Christian Szegedy and his co-authors introduced the Single Shot MultiBox Detector (SSD), reporting results on standard datasets such as PascalVOC and COCO, with a mean average precision of about 74% on PascalVOC. SSD needs only an input image and ground-truth boxes for each object during training. It is based on a feed-forward convolutional network that produces a fixed-size collection of bounding boxes and corresponding scores for the presence of object class instances in those boxes, followed by a non-maximum suppression step that produces the final detections.

Single Shot MultiBox detector architecture (https://arxiv.org/pdf/1512.02325.pdf)
  • Architecture: VGG-16 is one of the strongest networks for image classification, offering high performance and high quality, so the SSD authors built the architecture on VGG-16 but discarded its fully connected layers. A set of auxiliary convolutional layers is added to progressively decrease the size of the input to each subsequent layer. The VGG-16 architecture is shown below.
VGG architecture with input 224x224x3
  • MultiBox Detector: The bounding box regression of SSD follows Szegedy's MultiBox method. It takes a feature layer of size m x n with p channels (m x n x p), and for each location it produces k default bounding boxes of different sizes and aspect ratios, for example vertical boxes for people and horizontal boxes for cars. For each box it then computes the class scores and the 4 coordinate offsets with respect to the ground-truth bounding box.
  • Two critical components of MultiBox's loss function made their way into SSD. Confidence loss: cross-entropy is used to measure the softmax loss over the class confidences. Localization loss: smooth L1 is used to compute the loss between the predicted box and the ground-truth box, covering the offsets for the centre point (cx, cy), width (w), and height (h) of the bounding box (see the sketch after the formula below).

multibox_loss = confidence_loss + alpha * location_loss
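A sketch of that formula in TensorFlow: softmax cross-entropy for the class confidences plus an alpha-weighted smooth L1 loss on the (cx, cy, w, h) offsets of the matched (positive) boxes. Hard negative mining, which the full SSD loss also applies, is omitted for brevity.

# Sketch of the multibox loss: confidence_loss + alpha * location_loss.
import tensorflow as tf

def smooth_l1(x):
    absx = tf.abs(x)
    return tf.where(absx < 1.0, 0.5 * x ** 2, absx - 0.5)

def multibox_loss(cls_logits, cls_targets, loc_preds, loc_targets, pos_mask, alpha=1.0):
    # cls_logits: (N, num_classes); cls_targets: (N,) integer class ids
    # loc_preds/loc_targets: (N, 4) box offsets; pos_mask: (N,) 1.0 for matched anchors
    confidence_loss = tf.reduce_sum(
        tf.nn.sparse_softmax_cross_entropy_with_logits(labels=cls_targets, logits=cls_logits))
    location_loss = tf.reduce_sum(pos_mask[:, None] * smooth_l1(loc_preds - loc_targets))
    num_pos = tf.maximum(tf.reduce_sum(pos_mask), 1.0)
    # Hard negative mining, used by the real SSD loss, is omitted for brevity.
    return (confidence_loss + alpha * location_loss) / num_pos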

Multiple Bounding Boxes for Localization and Confidence (https://arxiv.org/pdf/1512.02325.pdf)
  • Intersection over Union (IoU): IoU is an evaluation measure for the accuracy of an object detection model. The two components needed to compute it are the ground-truth bounding boxes and the predicted bounding boxes output by the model. In practice the predicted coordinates never match the ground-truth coordinates exactly, so we set an IoU threshold to identify predicted boxes that overlap heavily with the ground truth. This ensures that our predicted bounding boxes match the ground-truth boxes as closely as possible.
Intersection Over Union (https://www.pyimagesearch.com/2016/11/07/intersection-over-union-iou-for-object-detection/)
  • Non-Maximum Suppression: NMS is used at inference time to prune most of the bounding boxes generated by the model. By setting a confidence threshold of 0.01 and an IoU threshold of 0.5, we can filter out most of the boxes so that only N predictions are kept. This optimizes the model output during inference and removes noisy predictions (see the sketch after the figure below).
Non-maximum suppression example (https://medium.com/@yusuken/object-detction-1-nms-ed00d16fdcf9)
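The two ideas above fit in a few lines of NumPy: an IoU function over (xmin, ymin, xmax, ymax) boxes and a greedy NMS that applies the confidence and IoU thresholds just mentioned.

# Sketch: IoU between two boxes and a simple greedy NMS.
import numpy as np

def iou(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)

def nms(boxes, scores, conf_thresh=0.01, iou_thresh=0.5):
    order = [i for i in np.argsort(scores)[::-1] if scores[i] >= conf_thresh]
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep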

Pipeline Design

High Level Architecture

At the beginning of this use case we tried a few classical computer vision techniques to identify table structure in a PDF or image file: morphological operations for edge detection with two kernels, one to detect horizontal lines and one to detect vertical lines. This approach fails when the table does not contain any horizontal or vertical lines. Structure extraction is now an active research area in deep learning, so we decided to proceed with a deep neural network for our use case. The use case is structured in three sections: 1. region detection using deep learning, 2. text extraction from the detected regions using an OCR tool, and 3. text analytics to identify the relations between the extracted texts and store them in a repository.

Pipeline Architecture

Image Pre-processing: Here the images are prepared for training and testing. First we convert the PDF invoices to 600x600x3 JPG images at 300 DPI, followed by the different pre-processing techniques mentioned in section [6]. Once all the images are collected and processed, they are passed to the deep learning models for training (FRCN and SSD are being used).
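A sketch of this conversion step, assuming the pdf2image package (the original pipeline may use a different converter): each PDF page is rendered at 300 DPI, converted to RGB, resized to 600x600, and saved as JPEG. The file names are placeholders.

# Sketch: render PDF invoice pages to 300 DPI, 600x600 RGB JPEGs.
from pdf2image import convert_from_path  # requires poppler installed

pages = convert_from_path("invoice.pdf", dpi=300)  # placeholder input file
for i, page in enumerate(pages):
    page = page.convert("RGB").resize((600, 600))
    page.save(f"invoice_page_{i}.jpg", "JPEG", dpi=(300, 300))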

Structure Extraction Using Deep Learning: We have divided the extraction part into two categories described below.

  • Detection Modelling: Once the dataset is ready, we pass it to the detection model to identify the tables, paragraphs, and forms in the input images. We are currently working with both FRCN and SSD models and will select the final model based on accuracy. All measurements and comparisons will be presented in the final document.
  • Text Extraction: The next component in the pipeline is text extraction. We are currently working with the open-source OCR tool Tesseract-OCR to extract text from the detected regions. Note that the extraction module is still in progress, and we might shift to another approach for better performance and accuracy.

Text Categorization: The text categorization consists of two parts: 1. Natural Language Processing techniques to identify the relationships between the extracted texts as key-value pairs, and 2. a classification algorithm to measure the accuracy of the final output.

Development Environment: Anaconda3, Google Cloud Platform, protobuf-compiler, python-pil, python-lxml, python-tk

Python Packages: Cython, matplotlib, PIL (Pillow), tensorflow-gpu, keras, labelImg, imgaug, spaCy

Preparation of Training Dataset

Image Collection and Preparation

Almost 1K images have been collected from different sources, such as Google, Bing, and a few vendor invoices; 30% of the images are kept for testing and validation of the models. Image augmentation techniques have been adopted to strengthen model accuracy.

  • Image Normalization: Image normalization changes the range of pixel intensity values to improve the contrast of the image. Histogram equalization and contrast stretching are widely used normalization mechanisms; in this use case we employ contrast stretching.
Image normalization. Original vs. Contrast stretching
  • Image Resize: We train the deep neural network on 600x600-pixel images. The resize function of the PIL Python module is used for resizing.
  • Image Conversion: All images are converted to RGB (3 channels) and encoded as JPEG. The OpenCV package is used for the conversion.
  • Dots per Inch (DPI) Conversion: Throughout this project we work with high-resolution images; to achieve this, all images are converted to 300 DPI (a sketch of these preparation steps follows this list).
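A sketch of these preparation steps with Pillow (the write-up uses OpenCV for the RGB/JPEG conversion; Pillow is used here for brevity, and the file names are placeholders):

# Sketch: contrast stretching, resize to 600x600, RGB conversion, 300 DPI JPEG.
from PIL import Image, ImageOps

img = Image.open("raw_invoice.png")              # placeholder input
img = ImageOps.autocontrast(img.convert("RGB"))  # contrast stretching
img = img.resize((600, 600))
img.save("invoice_prepared.jpg", "JPEG", dpi=(300, 300))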

Image Labelling

Labelling images is one of the most important tasks in object detection. We manually annotated 1000 images in three categories, paragraph, table, and form, using the GUI-based labelling tool labelImg. All annotations are saved as XML files in a separate directory in PascalVOC format. A custom Python script prepares a CSV file from all the XML files, containing the filename, image size, bounding box coordinates, and class, which is used later when creating TFRecords (a sketch of this script follows the annotation details below).

The XML file contains all the details about the image and annotations.

  • Image location.
  • Size and channel (RGB)
  • Classes (paragraph, table, form)
  • Bounding box coordinates as xmin, ymin, xmax, ymax.
Image annotation using labelImg
XML file with bounding box coordinates
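The custom script mentioned above can be as simple as the following sketch, which walks a directory of PascalVOC XML files (the directory and output names are placeholders) and writes one CSV row per bounding box:

# Sketch: flatten PascalVOC XML annotations into a single CSV file.
import csv, glob
import xml.etree.ElementTree as ET

with open("annotations.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["filename", "width", "height", "class", "xmin", "ymin", "xmax", "ymax"])
    for path in glob.glob("annotations/*.xml"):
        root = ET.parse(path).getroot()
        size = root.find("size")
        for obj in root.findall("object"):
            box = obj.find("bndbox")
            writer.writerow([
                root.find("filename").text,
                size.find("width").text, size.find("height").text,
                obj.find("name").text,
                box.find("xmin").text, box.find("ymin").text,
                box.find("xmax").text, box.find("ymax").text,
            ])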

Image Augmentation

A deep network needs a large amount of training data to perform well. Since we collected only 1000 images, image augmentation has been employed to boost the performance of our network. The Python library imgaug helped us generate augmented images along with the corresponding bounding box shifts. The following augmenters are used in our use case (a sketch follows the figure below):

  • Affine Translation: We translate images by 40 pixels on the x-axis and 60 pixels on the y-axis, and scale them to 50-70% of their original size. This affects the bounding box locations, so we shift the bounding box coordinates of the augmented image accordingly.
  • Brightness: To brighten the images, all pixels are multiplied by a factor in (1.2, 1.5).
  • Gaussian Blur: A Gaussian kernel with a sigma of 1.5 is used to blur the images.
  • Horizontal Flip: The Fliplr ("Flipper") augmenter is used with a value of 1.0 to flip images horizontally.
Image augmentation
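A sketch of these augmenters with imgaug, applying the image transform and the bounding box shift together; the sample box and input file are placeholders, and the values mirror the list above.

# Sketch: imgaug augmenters applied to an image and its bounding boxes together.
import imageio
import imgaug.augmenters as iaa
from imgaug.augmentables.bbs import BoundingBox, BoundingBoxesOnImage

image = imageio.imread("invoice.jpg")  # placeholder input
bbs = BoundingBoxesOnImage(
    [BoundingBox(x1=50, y1=80, x2=400, y2=300, label="table")], shape=image.shape)

seq = iaa.Sequential([
    iaa.Affine(translate_px={"x": 40, "y": 60}, scale=(0.5, 0.7)),  # affine translation + scaling
    iaa.Multiply((1.2, 1.5)),                                       # brightness
    iaa.GaussianBlur(sigma=1.5),                                    # blur
    iaa.Fliplr(1.0),                                                # horizontal flip
])
image_aug, bbs_aug = seq(image=image, bounding_boxes=bbs)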

Model Building

As stated earlier, Faster R-CNN combines an RPN (in place of Selective Search) with the Fast R-CNN framework. Selective Search uses SIFT and HOG descriptors to generate object proposals and takes about 2 seconds per image on a CPU, whereas Faster R-CNN runs at about 5 fps (frames per second) with VGGNet or ResNeXt as the backbone network.

Keras offers several state-of-the-art architectures with pre-trained weights that can be loaded to build a custom model. These models are pre-trained on the ImageNet dataset with 1000 classes (the ResNet50 backbone, for example, has 25,636,712 parameters). We implement transfer learning, using the pre-trained network both as a standalone feature extractor and as a weight initializer for our custom model, with both VGG16 and ResNet50 backbones downloaded from the internet. The model can also be trained with other backbones, including Inception v3.

Optimization

The loss function could be optimized over all anchors, but that would bias it toward negative samples, since most of the image does not contain any object. As in the FRCN paper, we randomly sample 256 anchors to compute the loss of a mini-batch, with positive and negative anchors at a ratio of up to 1:1. New layers are initialized from a zero-mean Gaussian distribution with standard deviation 0.01. We use an initial learning rate of 0.001 for the first 23k mini-batches and 0.0001 for the next 20k mini-batches on the PASCAL dataset.
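A small sketch of that sampling step: pick up to 128 positive anchors and fill the rest of the 256-anchor mini-batch with negatives, keeping the ratio near 1:1.

# Sketch: sample 256 anchors per mini-batch with roughly balanced pos/neg.
import numpy as np

def sample_anchors(labels, batch_size=256, positive_fraction=0.5):
    """labels: 1 for positive anchors, 0 for negatives, -1 for ignored."""
    pos = np.where(labels == 1)[0]
    neg = np.where(labels == 0)[0]
    num_pos = min(len(pos), int(batch_size * positive_fraction))
    num_neg = min(len(neg), batch_size - num_pos)
    rng = np.random.default_rng()
    return np.concatenate([rng.choice(pos, num_pos, replace=False),
                           rng.choice(neg, num_neg, replace=False)])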

Hyper-parameters

  • Initial learning rate: 0.001 for the first 23k steps
  • Scheduled learning rate: 0.0001 for the next 20k steps
  • Momentum optimizer value: 0.9
  • Use moving average: False
  • Gradient clipping by norm: 10.0
  • Weight initialization standard deviation: 0.01
  • Epochs: 25
  • Steps/epoch: 1000
  • nms_iou_threshold: 0.8

Final Loss Calculation

As mentioned earlier, we trained this model for 25 epochs of 1000 steps each, and the final loss is computed as the sum of the RPN loss and the detection loss. The RPN bounding box classifier reached 88.9% accuracy.

Visualization of Activation Layer Outputs

Visualization of Activation layers

Final Segmented Image

Text Extraction using Tesseract:

Optical Character Recognition, often abbreviated OCR, has been a hot topic in computer science for over two decades. OCR detects text content in images and translates the images into machine-encoded text that a computer can manipulate. The steps involved in OCR are described below.

  • Images are scanned and converted into a bitmap, which is a matrix of black and white dots.
  • To enhance accuracy, the images are pre-processed with brightness and contrast adjustments.
  • A segmentation algorithm then finds the regions of interest where text is located in the image.
  • The regions of interest are further cut down into lines, words, and characters, and a comparison and detection algorithm is used to match the characters.

We selected Google's Tesseract-OCR engine, which was originally developed by Hewlett-Packard in the 1980s; Google took over its development in 2006. Tesseract-OCR is deep-learning-based, open-source software that supports about 130 languages and over 35 scripts. We use PyTesseract, a Python wrapper for the Tesseract-OCR engine, for text extraction.

Image Pre-processing

To increase the accuracy of Tesseract-OCR, the input image needs to be processed. Several pre-processing techniques are available; we use some basic ones, namely greyscale conversion, resizing, and brightness adjustment. The Python package Pillow (a fork of the Python Imaging Library, PIL) is used for this pre-processing.

Content Extraction

The FRCN inference graph accepts an image and returns detection boxes, detection scores, and detection classes as a Python dictionary, where the detection boxes hold the bounding box coordinates. We convert these coordinates to xmin, ymin, xmax, ymax, crop the bounding boxes with OpenCV in Python, and pass the crops to pytesseract for text extraction, as sketched below.
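A sketch of this step, with placeholder detection output: normalized boxes are scaled to pixel coordinates, cropped with OpenCV, converted to greyscale and passed to pytesseract.

# Sketch: crop detected regions and OCR them with pytesseract.
import cv2
import pytesseract

image = cv2.imread("invoice.jpg")  # placeholder input
h, w = image.shape[:2]

# Placeholder inference output: boxes are (ymin, xmin, ymax, xmax) in [0, 1].
detection_boxes = [(0.10, 0.05, 0.45, 0.95)]
detection_scores = [0.92]

for (ymin, xmin, ymax, xmax), score in zip(detection_boxes, detection_scores):
    if score < 0.5:  # illustrative confidence cut-off
        continue
    x1, y1, x2, y2 = int(xmin * w), int(ymin * h), int(xmax * w), int(ymax * h)
    crop = cv2.cvtColor(image[y1:y2, x1:x2], cv2.COLOR_BGR2GRAY)  # greyscale for OCR
    print(pytesseract.image_to_string(crop))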

Python Packages used: Pytesseract, Numpy, Matplotlib, OpenCV.

Text Classification using Named Entity Recognition

Named Entity Recognition (NER), also known as entity identification, entity chunking, or entity extraction, is a text classification technique used in Natural Language Processing. It locates named entities mentioned in unstructured text and classifies them into pre-defined categories such as name, country, organization, currency, and so on. NLTK and spaCy are the most widely used Python packages for NER. We built our NER model with spaCy, which provides support for English, Spanish, French, Italian, Dutch, and multi-language NER. We took the output of the OCR step and trained a custom NER model to classify the following entities.

  1. Invoice Number
  2. Invoice Date
  3. Currency
  4. Total Amount
  5. Purchase Order

Dataset Preparation

We used a spaCy NER annotation tool to label all the text extracted from the training images; it generates a consolidated .json file with all the annotations. spaCy expects a list of tuples for training, so we wrote Python code to convert the JSON data into the spaCy format (a conversion sketch follows the figure below).

NER Annotation tool used for data pre-processing
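A sketch of the JSON-to-spaCy conversion; the exact JSON schema depends on the annotation tool, so here each record is assumed to carry the raw text plus a list of (start, end, label) character spans, and the file name is a placeholder.

# Sketch: convert annotation JSON into spaCy training tuples.
import json

def to_spacy_format(path):
    train_data = []
    with open(path, encoding="utf-8") as f:
        for record in json.load(f):
            entities = [(e["start"], e["end"], e["label"]) for e in record["entities"]]
            train_data.append((record["text"], {"entities": entities}))
    return train_data

TRAIN_DATA = to_spacy_format("ner_annotations.json")  # placeholder file name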

Training NER model

NER training begins by creating a blank model and adding an NER component to its pipeline. spaCy's minibatch utility is used to build the training batches. As we are training a new model, the weights are reset and initialized randomly. The model was trained for 200 iterations with a dropout of 0.5 and mini-batch sizes compounding from 4.0 to 32.0 at a rate of 1.001 (a sketch of the training loop follows).
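A sketch of that training loop using the spaCy 2.x API; the entity label names are illustrative and TRAIN_DATA is the list built in the conversion sketch above.

# Sketch: spaCy 2.x training loop for a blank NER model.
import random
import spacy
from spacy.util import minibatch, compounding

nlp = spacy.blank("en")
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)
for label in ("INVOICE_NUMBER", "INVOICE_DATE", "CURRENCY", "TOTAL_AMOUNT", "PURCHASE_ORDER"):
    ner.add_label(label)               # label names are illustrative

optimizer = nlp.begin_training()       # new model: weights initialized randomly
for itn in range(200):
    random.shuffle(TRAIN_DATA)
    losses = {}
    for batch in minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001)):
        texts, annotations = zip(*batch)
        nlp.update(texts, annotations, drop=0.5, sgd=optimizer, losses=losses)
    print(itn, losses)

nlp.to_disk("invoice_ner_model")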

NER Model Evaluation

We used spaCy's built-in GoldParse and Scorer APIs to evaluate our NER model (a sketch of the evaluation follows). The precision score is 93.33, recall is 94.27, and the F1 score is 94.12.
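A sketch of that evaluation using the spaCy 2.x GoldParse and Scorer classes, on a held-out set in the same (text, {"entities": [...]}) format as the training data.

# Sketch: evaluate the NER model with spaCy 2.x GoldParse and Scorer.
from spacy.gold import GoldParse
from spacy.scorer import Scorer

def evaluate(nlp, examples):
    scorer = Scorer()
    for text, annotations in examples:
        gold = GoldParse(nlp.make_doc(text), entities=annotations["entities"])
        scorer.score(nlp(text), gold)
    return scorer.scores["ents_p"], scorer.scores["ents_r"], scorer.scores["ents_f"]

# precision, recall, f1 = evaluate(nlp, DEV_DATA)  # DEV_DATA: held-out examples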

Testing and Validation

After training, the model is stored in a predefined directory. The trained model classifies the input text extracted from the FRCN inference graph into the aforementioned entities, along with tokenization. The displaCy component is used to visualize the classification output. The extracted text is stored in a dictionary, which can then be written to a CSV file or pushed directly to an SAP system.

Visualizing final output

Conclusion and Future Work

This paper covers work done over a 16-week duration, in which we studied different state-of-the-art deep neural network models, selected a few of them for our use case, and carried out all the data pre-processing required for training. We built a real-life dataset to train the models, trained them for region detection, and decided to proceed with FRCN based on its accuracy and time complexity. As future work, we are exploring Graph Convolutional Neural Networks to better capture the relationships between table headers, rows, and cells and build a graphical model for higher accuracy.
