De-obfuscation of published scientific data
Information extraction using optical character recognition and heuristic techniques

A proof-of-concept system for the automatic extraction of bioinformatics-focused tabular data from spreadsheets and PDF documents is designed and implemented. Image analysis and heuristic techniques are used to determine the dimensions of tables found in PDFs.

Optical character recognition (OCR) is investigated as a novel approach because it works on the rendered page image and so sidesteps the inner workings of the PDF.
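As an illustration of the kind of image-based heuristic described above, the following minimal sketch estimates table dimensions from a binarised page region using projection profiles. It is not the system's actual method: the function names (`count_runs`, `estimate_table_dimensions`), the gap thresholds, and the assumption that the table has already been cropped and binarised (1 = ink, 0 = background) are all hypothetical choices made for this example.

```python
from typing import List, Sequence, Tuple


def count_runs(profile: Sequence[int], min_gap: int = 1) -> int:
    """Count groups of non-empty positions separated by >= min_gap empty ones."""
    runs = 0
    gap = min_gap  # treat the start of the profile as preceded by a full gap
    for value in profile:
        if value > 0:
            if gap >= min_gap:
                runs += 1  # enough whitespace seen: a new row/column starts
            gap = 0
        else:
            gap += 1
    return runs


def estimate_table_dimensions(
    image: List[List[int]],
    min_row_gap: int = 1,  # assumed: blank pixel rows needed between table rows
    min_col_gap: int = 2,  # assumed: blank pixel columns needed between table columns
) -> Tuple[int, int]:
    """Return (rows, columns) estimated from a binary image (1 = ink)."""
    row_profile = [sum(row) for row in image]         # ink per pixel row
    col_profile = [sum(col) for col in zip(*image)]   # ink per pixel column
    return (
        count_runs(row_profile, min_row_gap),
        count_runs(col_profile, min_col_gap),
    )


# Toy 3 x 11 "page": two text lines, three blocks of ink per line.
sample = [
    [1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1],
    [0] * 11,
    [1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1],
]
print(estimate_table_dimensions(sample))  # (2, 3)
```

The column gap threshold is set larger than the row gap threshold so that narrow spaces between characters within a cell are less likely to be mistaken for column boundaries. A real page would need these thresholds tuned to its resolution and font size.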

Luke Darlow, BSc
Contact Details

Luke Darlow

Department of Computer Science

Rhodes University