A proof-of-concept system for the automatic extraction of bioinformatics-focused tabular data from spreadsheets and PDF documents is designed and implemented. Image analysis and heuristic techniques are used to determine the dimensions of tables found in PDFs.
Optical character recognition (OCR) is considered as a novel approach because it sidesteps the internal structure of the PDF file.