De-obfuscation of published scientific data
Information extraction using optical character recognition and heuristic techniques

A proof-of-concept system for the automatic extraction of bioinformatics-focused tabular data from spreadsheets and PDF documents is designed and implemented. Image analysis and heuristic techniques are used to determine the dimensions of tables found in PDFs.

Optical character recognition (OCR) is considered a novel approach here because it sidesteps the inner workings of the PDF.
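
As an illustration of how image analysis might yield table dimensions, the sketch below counts text bands in a binarised page image using projection profiles. It is a minimal example under stated assumptions, not the system's actual method: the image is assumed to be a list of pixel rows (1 for ink, 0 for background), rows and columns are assumed to be separated by fully blank gaps, and the names count_runs and estimate_table_dimensions are hypothetical.

```python
def count_runs(profile, threshold=0):
    """Count maximal runs of profile entries whose value exceeds threshold."""
    runs = 0
    inside = False
    for value in profile:
        if value > threshold:
            if not inside:
                # Entering a new band of ink after a blank gap.
                runs += 1
                inside = True
        else:
            inside = False
    return runs


def estimate_table_dimensions(image):
    """Estimate (rows, columns) of a binarised table image.

    Assumption: image is a list of equal-length pixel rows,
    where 1 marks an ink pixel and 0 marks background.
    """
    # Ink count per pixel row: blank rows separate table rows.
    row_profile = [sum(row) for row in image]
    # Ink count per pixel column: blank columns separate table columns.
    col_profile = [sum(col) for col in zip(*image)]
    return count_runs(row_profile), count_runs(col_profile)


# Toy 6x8 image: three bands of ink in each direction.
page = [
    [1, 1, 0, 0, 1, 1, 0, 1],
    [1, 1, 0, 0, 1, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 1, 1, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 1, 0, 1],
]

dimensions = estimate_table_dimensions(page)
```

On real scans the threshold would likely need to be raised above zero to tolerate noise, and ruled table borders would have to be removed or handled separately before profiling.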


Luke Darlow, BSc
Contact Details


Luke Darlow

g10d0410@campus.ru.ac.za

Department of Computer Science

Rhodes University

Grahamstown