The paperless office has been discussed for nearly 50 years, but few offices made the move until recently. In recent years it has become mainstream, driven by smartphones, tablet computers, various types of cloud computing, and collaboration software. Documents containing large amounts of important information are now stored electronically, and integrating that information into the digital workflow is the essence of the paperless office. Doing so greatly simplifies office processes, reduces operating costs, and improves work efficiency. Because the key is accessing this information quickly and accurately, Optical Character Recognition (OCR) becomes crucial.
OCR identifies text in pictures and scanned copies and converts it into editable text, with no need for manual conversion. In effect, it gives the machine an "eyesight" and reading ability similar to a human's. OCR is not a new concept: research started in the 1960s and 1970s. However, due to the high cost of hardware, slow turnaround, and strict requirements on input quality, OCR was not widely used. In recent years, advances in artificial intelligence and image recognition have improved the efficiency and accuracy of text recognition enough for commercial use.
As a company with millions of sales orders around the world, Huawei has found it difficult to process large volumes of documents quickly and accurately. Huawei combines basic capabilities such as AI, IoT, computing, and storage to solve this problem.
Challenges
Considering the diversity of customers and application scenarios, OCR faces several challenges.
- Scanned documents usually suffer from line interference, missing formatting, tilt, low light, distortion, and noise, all of which make it difficult to locate characters.
- Font, font size, color, and stroke width are not fixed, and the stroke direction is arbitrary. Decimal points, similar-looking English letters and numbers, special symbols, conjunctions, and coined words are easily missed or misidentified.
- Documents use a wide variety of languages, often mixed within one document (such as Chinese and English), which makes it difficult to identify the language.
- Forms often have seals that cover words and ill-defined lines (text overflowing its cell, intersecting table lines). These interfere with text recognition and greatly reduce accuracy.
- Photographing text causes problems such as noise, blurring, light changes, deformation, and complex background interference. This is a challenge for accurate text location and recognition.
Breaking Down the Challenges One by One
HUAWEI CLOUD OCR addresses the preceding challenges with image preprocessing, form text positioning, certificate text positioning and extraction, and ensemble learning technologies. Together, these have produced measurable improvements.
First, the latest deep learning model is used in preprocessing to separate table lines and seals from the text. This removes their interference along with other noise, greatly simplifying subsequent text identification and layout analysis and improving accuracy. The recognition rate and recall rate of HUAWEI CLOUD OCR are now at the top of the industry because of its advanced deep learning model, transfer learning model optimization, and hundreds of millions of training samples.
Second, multiple techniques, such as tilt correction, maximum contour extraction, table line interference removal, and text box positioning, are used to process form text. The integrated detection (ITE) algorithm is proposed for identification and information extraction, and a major portion of the information extraction is carried out in the text location phase. This greatly improves efficiency and accuracy when extracting structured data. To better adapt to rotation, distortion, complex backgrounds, uneven light, and noise during text detection and recognition, images are processed with black edge processing, automatic correction, noise removal, automatic image rotation, and multiple binarization methods. In this way, documents such as forms and invoices can be quickly identified and output in a structured form, helping you digitalize paper documents quickly and conveniently. In addition, OCR services can be customized to meet individual customer requirements.
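The text does not say which binarization methods are used. As an illustration only, the following is a minimal sketch of one widely used method, Otsu's thresholding, in plain Python. The function names `otsu_threshold` and `binarize` are hypothetical, and the image is assumed to be a flat list of 8-bit grey levels rather than any format the service actually uses.

```python
def otsu_threshold(pixels):
    """Return the grey level t (0-255) that best separates dark from light.

    Pixels with value <= t form one class and the rest form the other;
    t is chosen to maximize the between-class variance (Otsu's method).
    This is an illustrative sketch, not the HUAWEI CLOUD OCR implementation.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels in the dark class so far
    sum0 = 0    # sum of grey levels in the dark class so far
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue            # dark class still empty
        w1 = total - w0
        if w1 == 0:
            break               # light class empty from here on
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between_var = w0 * w1 * (mean0 - mean1) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


def binarize(pixels, threshold):
    """Map each pixel to 0 (dark, likely ink) or 1 (light, likely paper)."""
    return [1 if p > threshold else 0 for p in pixels]
```

For a scan with dark ink on a light page, `binarize(pixels, otsu_threshold(pixels))` yields a two-level image in which the ink pixels are 0. A single global threshold copes poorly with uneven lighting, which is one plausible reason to combine several binarization methods as the text describes.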
Third, a word library and ensemble learning are used to collect data on common words, and the Levenshtein distance algorithm is used to correct misrecognized words. For key numeric fields, the results of multiple image post-processing methods are combined through ensemble learning to produce a final "confidence" rating and flag possible errors.
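The two ideas in this step can be sketched in a few lines of Python. The Levenshtein distance below is the standard dynamic-programming algorithm; everything else is an assumption made for illustration: the helper names `correct_word` and `vote`, the `max_distance` cutoff, and the use of a simple majority vote as a stand-in for the ensemble that produces the "confidence" rating.

```python
from collections import Counter


def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))          # distances from "" to b[:j]
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def correct_word(word, lexicon, max_distance=2):
    """Return the closest lexicon entry, or the word unchanged if nothing
    in the lexicon is within max_distance edits (hypothetical cutoff)."""
    best = min(lexicon, key=lambda entry: levenshtein(word, entry))
    return best if levenshtein(word, best) <= max_distance else word


def vote(readings):
    """Combine several readings of the same field by majority vote.

    Returns (value, confidence), where confidence is the share of
    readings that agree with the winning value. A simplified stand-in
    for the ensemble described in the text.
    """
    value, hits = Counter(readings).most_common(1)[0]
    return value, hits / len(readings)
```

With a lexicon of `["invoice", "customs", "amount"]`, `correct_word("lnvoice", lexicon)` returns `"invoice"`, because the two strings differ by a single substitution. For a numeric field read three times after different post-processing, `vote(["1,250.00", "1,250.00", "1.250.00"])` returns `"1,250.00"` with a confidence of two thirds, low enough that a caller could flag the field for manual review.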
Fourth, the latest big data cluster technology is adopted to ensure a stable, reliable back end and fast system response (within milliseconds).
HUAWEI CLOUD OCR is used to automatically collect key data, establish data assets, and analyze big data, effectively reducing operating expense (OPEX) and improving service efficiency. Across more than 170 subsidiaries worldwide, it saves Huawei the equivalent of 200 employees' work. By analyzing key information such as customs valuation, Huawei has brought financial exposure of tens of millions of US dollars under control each year and increased the rate of business process automation by 50%.
Unified APIs
HUAWEI CLOUD OCR leverages the advantages of Huawei cloud computing, such as loose coupling, high reusability, and easy maintenance. It exposes unified APIs so that external application systems can use OCR services conveniently and compatibly.
HUAWEI CLOUD OCR has been successfully applied worldwide in the healthcare, customs, logistics, finance, insurance, government, transportation, automobile, and traditional manufacturing industries. For example, recognition of insurance policies and medical documents helps insurance companies improve work efficiency and speed up claims. In the medical field, companies can recognize pharmaceutical instructions and quickly build a database of them. In the logistics field, Huawei has helped multiple top courier enterprises improve efficiency by automating the recognition of various documents.