In many data-intensive projects, information from multiple sources needs to be integrated, combined, or linked to allow more detailed analysis. One aspect of data integration is the identification of records that belong to the same real-world entity. The aim of this data linkage process (also called data matching or entity resolution) is to identify all records that relate to the same entity, such as a customer, student, taxpayer, or patient. Many different techniques have been developed in the past decade to solve this problem. Some only compare pairs of records, while others generate graphs (with records as nodes and similarities between records as edges) and view the problem as a clustering of the records such that each cluster corresponds to one entity. A different approach is to combine rule-based logic with probabilistic learning. Known as Markov logic networks, these techniques have been shown to be declarative and to achieve high linkage quality. However, their computational complexity is high, so they do not scale to large databases.
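The graph view described above can be illustrated with a minimal sketch. It is not the method of either listed paper: it assumes records are plain strings, uses Python's standard `difflib` ratio as a stand-in similarity measure, and treats every pair at or above a hypothetical `threshold` as an edge. Clusters are then the connected components of that graph, found with a simple union-find. The names `similarity` and `cluster_records` are illustrative only.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Return a case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cluster_records(records: list[str], threshold: float = 0.8) -> list[list[int]]:
    """Group record indices into clusters (connected components).

    Records are nodes; an edge joins two records whose similarity is
    at least `threshold` (an assumed, illustrative cut-off).
    """
    parent = list(range(len(records)))

    def find(i: int) -> int:
        # Follow parent links to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Compare all pairs of records; this is quadratic in the number of records.
    for i, j in combinations(range(len(records)), 2):
        if similarity(records[i], records[j]) >= threshold:
            parent[find(i)] = find(j)  # merge the two components

    clusters: dict[int, list[int]] = {}
    for i in range(len(records)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```

Because every pair of records is compared, this sketch is only practical for small inputs. It is meant to show the idea of "one cluster per entity", not a scalable or collective matching approach.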
The aim of this project is to investigate entity resolution techniques based on Markov logic networks (MLNs), and to:
- analyze a data set using the techniques described in the paper "Entity Resolution with Markov Logic" (see below);
- explore whether the weights and rules learned with an MLN approach can be used as the basis of a faster, more scalable matching approach;
- investigate how to combine MLNs with scalable collective approaches, such as the one described in the paper "Large-scale collective entity matching" (see below).
This project is available as a one-semester Computer Science project for either undergraduate or MComp students, or as a one-year CS or MComp honours project (with an extended scope). Students interested in undertaking this project should have good programming skills and knowledge of areas such as algorithms and data structures, first-order logic, and string processing. It is an advantage if a student has successfully completed courses on databases, data mining, logic, machine learning, or document computing.
The recently published book "Data Matching" (see URL below) by Peter Christen provides a broad introduction to most topics related to this project, while the two listed papers provide specific background for the project.
This is an exciting and challenging project that will involve the analysis of real-world data, cutting-edge technologies, and advanced scientific techniques.