Entity resolution of authors and institutions in bibliographic databases

Description

The aim: the sociological theory of "Social Capital" (by Ronald S Burt) hypothesizes that the competitive advantage of individuals or organisations can be derived from their social ties with other individuals or organisations, and from their relative positions within the social network. Success does not depend only on how much talent (human or knowledge capital) or financial capital one has. In some cases, those who can access key individuals within the social network can produce valuable outcomes. To apply the theory to understand and evaluate the state of Australian research in the global arena, we need to know our (individual researchers' and research institutions') relative positions and social ties to the rest of the world (other individual researchers and research institutions). Having a reliable means of identifying the individuals and institutions within the network is an important step towards this aim.

Goals

This one-semester (or Honours) project will help to achieve this overall aim. The ANU has a large database containing the details of millions of scientific publications. The way authors and institutions are recorded in this database is not standardised, and as a result the same name can represent different authors, different name variations can represent the same author, and several name variations can occur for the same institution. The objectives of this project are: (1) to investigate the use of a variety of approximate string matching and entity resolution techniques in order to identify which name variations correspond to the same real-world authors and institutions; and (2) to systematically improve the accuracy of matching results by capturing domain-specific knowledge into a formal framework that enables the detection of hidden mistakes and the sharing of knowledge. Depending upon student progress, further work into the areas of network analysis and knowledge reasoning is possible.
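As a rough illustration of objective (1), the sketch below groups name variations using approximate string matching. It is a minimal example under stated assumptions, not the method the project prescribes: the function names (`normalise`, `similarity`, `group_variations`), the use of Python's standard `difflib.SequenceMatcher` as the similarity measure, and the 0.85 threshold are all illustrative choices, and the sample names are simply taken from this page.

```python
import re
import unicodedata
from difflib import SequenceMatcher


def normalise(name: str) -> str:
    """Lower-case a name, strip accents and punctuation, and sort its tokens.

    Sorting the tokens makes "Christen, Peter" and "Peter Christen" identical.
    """
    ascii_name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode()
    tokens = re.findall(r"[a-z]+", ascii_name.lower())
    return " ".join(sorted(tokens))


def similarity(a: str, b: str) -> float:
    """Approximate string similarity in [0, 1] between two normalised names."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def group_variations(names, threshold=0.85):
    """Group names whose pairwise similarity reaches the (assumed) threshold.

    Uses a small union-find structure so that matches are transitive.
    """
    parent = list(range(len(names)))

    def find(i):
        # Follow parent links to the group representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Compare every pair of names; merge the groups of sufficiently similar pairs.
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if similarity(names[i], names[j]) >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i, name in enumerate(names):
        groups.setdefault(find(i), []).append(name)
    return list(groups.values())


if __name__ == "__main__":
    sample = ["Christen, Peter", "Peter Christen", "Ronald S. Burt", "Burt, Ronald S"]
    for group in group_variations(sample):
        print(group)
```

Comparing every pair of names is only workable for small samples; a database with millions of publications would need some form of blocking or indexing to limit the comparisons. String similarity alone also cannot separate different authors who share the same name, which is where additional evidence and the domain knowledge of objective (2) would come in.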

Requirements

This project is available either as a one-semester Computer Science project (undergraduate or MComp) or as a one-year CS or MComp honours project. Students interested in undertaking this project should have good programming skills (ideally including familiarity with Python) and knowledge of areas such as algorithms and data structures, string handling, and databases. It is an advantage if a student has successfully completed courses on databases, data mining, or document computing.

Background Literature

The recently published book "Data Matching" (see URL below) by Peter Christen provides an ideal broad introduction to most topics related to this project.

Gain

This is an exciting and challenging project that will involve the analysis of real-world data, cutting-edge technologies, advanced scientific techniques, and cross-sectoral collaboration between academia (Research School of Computer Science) and university administration (Central Research Office).

Updated:  8 September 2015/Responsible Officer:  Dean, CECS/Page Contact:  CECS Marketing