Gareth Dwyer

The project of building a web corpus of South African English is being undertaken as part of the requirements for my Bachelor of Arts Honours Degree in Computer Science at Rhodes University, Grahamstown, South Africa.

I am working with the Rhodes University Linguistics Department and the Dictionary Unit for South African English with the aim of building a South African English web corpus. The system will monitor dozens of identified sources of South African English, such as online news sites and magazines, and retrieve all new content. It will include functionality to automatically remove irrelevant text ('boilerplate'), and to detect and remove both exact and near-duplicate content. It will also provide a front-end offering various linguistic analysis tools, such as a concordancer and a collocation tool.
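To illustrate what a concordancer does, here is a minimal keyword-in-context sketch in plain Python. It is not the project's implementation: the function names `tokenize` and `concordance`, the simple regex tokenizer, and the sample sentence are all assumptions made for illustration.

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def concordance(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for each occurrence of keyword."""
    keyword = keyword.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            # Take up to `width` tokens on either side of the match.
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


# Hypothetical sample text, not taken from the corpus.
sample = "The braai was great. We had a braai at the beach, and the braai ran late."

for left, kw, right in concordance(tokenize(sample), "braai", width=2):
    # Right-align the left context so the keyword lines up in a column.
    print(f"{left:>15} [{kw}] {right}")
```

Aligning the keyword in a column lets a researcher scan the surrounding words quickly. A real tool would likely use a more careful tokenizer, such as one from NLTK.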

The project is built largely with Python and MongoDB, and uses the Flask web framework for the UI. It incorporates many pre-built tools, such as NLTK (the Natural Language Toolkit), as well as some that were created during the development of the project.

The abstract of my Honours thesis, "Towards Automated Creation and Management of a South African English Web Corpus", can be found below, with a link to the full document here.

A corpus is a large collection of classified text, from which knowledge about how natural language is used can be extracted. Corpora are used by linguists and lexicographers to analyse language and to compile dictionaries. Although corpora can be built from a variety of data sources, the World Wide Web is particularly suitable as data exist in large quantities and are already in digital form. For this research, a system was implemented which gathers and stores data from South African Web sites, and furthermore monitors a set of online feeds for new data, and provides a variety of analysis functionality through a web interface. Using the system, an evolving web corpus of South African English was created. This research details a breakdown of steps and components which were required to build and maintain an evolving language-specific web corpus, including the gathering and storing of data, the cleaning or 'boilerplate removal' of Web pages, near and exact deduplication of content, and the various analysis tools implemented which are commonly used by language researchers.
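As a rough illustration of the exact and near deduplication step, the sketch below hashes normalised text to catch exact duplicates and compares word shingles with Jaccard similarity to catch near duplicates. This is one common approach and is not necessarily the method used in the thesis. The function names, the shingle size `k`, the threshold, and the example sentences are all assumptions.

```python
import hashlib
import re


def normalise(text):
    """Lowercase the text and reduce it to space-separated word tokens."""
    return " ".join(re.findall(r"\w+", text.lower()))


def exact_fingerprint(text):
    """Hash of the normalised text; equal hashes indicate exact duplicates."""
    return hashlib.sha1(normalise(text).encode("utf-8")).hexdigest()


def shingles(text, k=3):
    """Set of overlapping k-word sequences ('shingles') in the text."""
    words = normalise(text).split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(text_a, text_b, threshold=0.8, k=3):
    """True if the shingle sets overlap at least as much as the (assumed) threshold."""
    return jaccard(shingles(text_a, k), shingles(text_b, k)) >= threshold


# Hypothetical example documents.
doc_a = "The quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox jumps over the lazy dog!"
doc_c = "The quick brown fox jumps over the lazy cat"

print(exact_fingerprint(doc_a) == exact_fingerprint(doc_b))  # differ only in case and punctuation
print(jaccard(shingles(doc_a), shingles(doc_c)))             # high overlap, not identical
```

Comparing every pair of documents this way does not scale to a large corpus. In practice, techniques such as MinHash are often used to approximate the same similarity more cheaply.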