Social scientists increasingly base their research on data sets that contain natural language text, which are becoming available through a variety of initiatives to digitise historical population records. Analysis of such records, for example historical occupations or causes of death, is facilitated by classifying them into standard coding systems such as ICD-10 or HISCO.
The availability of high-quality codings for occupations and causes of death will allow historians and demographers to study historical populations in more detail and shed new light on changes in past and modern societies, for example by discovering new genetic factors of certain diseases or correlations between occupations and causes of death.
The objectives of this student project are to develop novel, advanced text classification techniques and to evaluate them on large, real-world collections of historical death and occupation data.
Specifically, due to low data quality (typographical errors and variations in the descriptions of death causes and occupations, as well as non-standardised descriptions), a variety of data pre-processing and feature generation schemes will need to be explored for their suitability for classifying such data.
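As a rough illustration of what such pre-processing and feature generation could look like, the following minimal sketch normalises a description and turns it into character trigram counts, which tend to be more tolerant of spelling variations than whole-word features. The function names (normalise, char_ngrams, dice) and the choice of trigrams with a Dice overlap score are assumptions made for this example only, not a prescribed approach for the project.

```python
import re
import unicodedata
from collections import Counter


def normalise(text):
    """Strip accents, lowercase, replace punctuation with spaces, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", no_accents.lower())
    return re.sub(r"\s+", " ", cleaned).strip()


def char_ngrams(text, n=3):
    """Count overlapping character n-grams of the normalised, space-padded text."""
    padded = f" {normalise(text)} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def dice(a, b):
    """Dice coefficient between two n-gram count vectors (0.0 if both are empty)."""
    overlap = sum((a & b).values())
    total = sum(a.values()) + sum(b.values())
    return 2 * overlap / total if total else 0.0


if __name__ == "__main__":
    reference = char_ngrams("pneumonia")
    # A misspelt variant still shares many trigrams with the reference string.
    print(dice(reference, char_ngrams("Pnuemonia")))
    # An unrelated string shares none.
    print(dice(reference, char_ngrams("blacksmith")))
```

Features of this kind could feed a classifier directly or be used to match noisy descriptions against a list of known terms; which option works better is part of what the project would investigate.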
Because a single description may mention several causes of death or several occupations, multi-label classification techniques (which assign more than one code to a record) might have to be employed.
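Assigning several codes to one description is usually called multi-label classification. One simple way to approach it is to make an independent yes/no decision for each code (often called binary relevance). The sketch below illustrates that decomposition with a deliberately naive rule: a code is assigned if the description contains a token that was seen in training only together with that code. The codes CODE_A, CODE_B and CODE_C, the tiny training set, and the function names are all hypothetical placeholders, not real ICD-10 or HISCO entries.

```python
from collections import Counter, defaultdict

STOPWORDS = {"and", "with", "of", "the"}


def tokens(text):
    """Lowercase the text, split on non-alphanumeric characters, drop stopwords."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return {t for t in cleaned.split() if t not in STOPWORDS}


def train(examples):
    """For every code, count tokens in descriptions that carry it (pos) or do not (neg)."""
    codes = set().union(*(labels for _, labels in examples))
    pos, neg = defaultdict(Counter), defaultdict(Counter)
    for text, labels in examples:
        toks = tokens(text)
        for code in codes:
            (pos if code in labels else neg)[code].update(toks)
    return codes, pos, neg


def predict(model, text):
    """Decide each code independently, so zero, one, or several codes may be returned."""
    codes, pos, neg = model
    toks = tokens(text)
    return {
        code
        for code in codes
        if any(pos[code][t] > 0 and neg[code][t] == 0 for t in toks)
    }


# Hypothetical training data: descriptions paired with sets of placeholder codes.
EXAMPLES = [
    ("pneumonia", {"CODE_A"}),
    ("heart failure", {"CODE_B"}),
    ("old age", {"CODE_C"}),
    ("pneumonia and heart failure", {"CODE_A", "CODE_B"}),
]
model = train(EXAMPLES)

if __name__ == "__main__":
    print(sorted(predict(model, "Heart failure with pneumonia")))
```

In practice the per-code rule would be replaced by a trained binary classifier, and other multi-label strategies may suit the data better; the point here is only that the output is a set of codes rather than a single class.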
This project is available as a one-semester Computer Science project for either undergraduate or MComp students, or as a one-year CS or MComp honours project (with an extended scope).
Students interested in undertaking this project should have good programming skills (ideally in Python) and knowledge of areas such as algorithms and data structures, machine learning and data mining, and string processing.
It is an advantage if a student has successfully completed courses on databases, data mining, machine learning, or document computing, and has received high marks in these courses.
This is an exciting and challenging project that will involve the processing and analysis of real-world data. A successful outcome of this project has the potential to make a significant impact on the way social scientists analyse historical death and occupation data collections.
Historical death data, occupation data, text classification, machine learning, data mining, Python