PDF Extraction using Microservices Architecture

One of the key hindrances in supply chain digitization is dealing with information stored in PDF files. That information is mostly read-only, which makes processing it difficult and inefficient. We have partnered with Fung Academy to crack the case: extracting sketches and information from tech packs in PDF format using an architecture made up of a series of microservices running in parallel.

As shown in the diagram above, the tech packs in PDF format are first fed into the architecture, where the controller checks whether they have been loaded before. If they have not, the system performs raw extraction on these files using Tabula and PyPDF2, both available as Python libraries. The raw extracts then pass through a series of microservices/algorithms that reconstruct the information and make it comprehensible and useful.
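The flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual implementation: the `Controller` class, the content-hash check, and the thread pool are hypothetical choices, and the raw extractor is passed in as a plain callable so the sketch runs without Tabula or PyPDF2 installed.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


class Controller:
    """Skips PDFs that were already loaded, otherwise extracts and fans out."""

    def __init__(self, raw_extractor, services):
        # raw_extractor: callable taking PDF bytes and returning a raw extract.
        # In practice this is where Tabula / PyPDF2 would be wrapped (assumption).
        self.raw_extractor = raw_extractor
        # services: dict mapping a service name to a callable(raw) -> result.
        self.services = services
        # Content hashes of files that have already been processed.
        self.seen = set()

    def process(self, pdf_bytes):
        digest = hashlib.sha256(pdf_bytes).hexdigest()
        if digest in self.seen:
            return None  # already loaded, nothing to do
        self.seen.add(digest)

        raw = self.raw_extractor(pdf_bytes)

        # Run every microservice on the same raw extract in parallel.
        with ThreadPoolExecutor() as pool:
            futures = {name: pool.submit(fn, raw)
                       for name, fn in self.services.items()}
            return {name: future.result() for name, future in futures.items()}
```

Hashing the file content rather than the file name is one possible way to detect a repeat upload; the original system may use a different check.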

Because raw PDF extracts can be reconstructed in many different ways, the output of each microservice is evaluated and rated with a confidence level. This ensures that only sensible information is passed on to the next stage, evaluation, which combines and assesses the outputs of the various microservices to make the extracted information useful. The extracted information can then be stored in a designated database as serialized data or exported to other systems for subsequent processing.
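A minimal sketch of the confidence rating and evaluation step follows. Everything here is assumed for illustration: the `Candidate` structure, the 0.0 to 1.0 confidence scale, the 0.5 threshold, the rule of keeping the highest-confidence value per field, and JSON as the serialization format.

```python
import json
from dataclasses import dataclass


@dataclass
class Candidate:
    service: str       # which microservice produced this value
    field: str         # e.g. "style_no" (hypothetical field name)
    value: str
    confidence: float  # 0.0 to 1.0, assigned by the producing service


def evaluate(candidates, threshold=0.5):
    """Keep the highest-confidence value per field, dropping weak candidates."""
    best = {}
    for c in candidates:
        if c.confidence < threshold:
            continue  # not sensible enough to pass on
        if c.field not in best or c.confidence > best[c.field].confidence:
            best[c.field] = c
    return {field: c.value for field, c in best.items()}


def serialize(record):
    """Turn the combined record into a string ready for storage or export."""
    return json.dumps(record, sort_keys=True)
```

With two services disagreeing on the same field, the combined record keeps the more confident reading, and any candidate below the threshold never reaches the database.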

This architecture is flexible in that it can be expanded to incorporate more microservices/algorithms that specialize in extracting certain types of information, for instance product classes and sketches. This information may in turn be used to optimize vendor management activities such as vendor allocation and capability development.
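One way such extensibility could look in code is a simple registry to which new extractors are added without touching the rest of the pipeline. The `register` decorator, the `SERVICES` dictionary, and the keyword rule for product classes are all hypothetical and only illustrate the idea.

```python
SERVICES = {}


def register(name):
    """Decorator that adds an extractor to the registry under the given name."""
    def decorator(fn):
        SERVICES[name] = fn
        return fn
    return decorator


@register("product_class")
def product_class(raw_text):
    # Hypothetical keyword rule, for illustration only.
    text = raw_text.lower()
    for label in ("jacket", "dress", "shirt"):
        if label in text:
            return label
    return None


def run_all(raw_text):
    """Apply every registered extractor to the same raw text."""
    return {name: fn(raw_text) for name, fn in SERVICES.items()}
```

Adding a new specialization, such as a sketch extractor, would then mean writing one more decorated function.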
