Thesis topics for Students

The following list contains possible topics for BSc and MSc students enrolled at IT University of Copenhagen. Each topic usually contains a theoretical and a practical component. The balance can be adjusted to the student’s needs.

If you wish to write a thesis in algorithms: 1) Find someone to work with (we prefer thesis work as a group), 2) Write a short description of the kind of thesis you imagine writing, 3) Contact me to set up a meeting.

Proposed Topics

The list below contains thesis topics that I gave some thought. If you find none of them interesting, but have another algorithmic question in mind, feel free to contact me with your proposal. I am interested in supervising topics that survey existing algorithmic work, conduct a reproducibility study of algorithm engineering approaches, work on explaining results of algorithms (e.g., for a recommender system) or fairness issues in machine learning and algorithms (e.g., see this book project and our paper for a concrete example).

Last Update: July 01, 2020

List Of Suggested Topics

Meta-Study on Reproducibility of Database Research

Many conferences are adopting a reproducibility track to increase and popularize reproducibility of research, see for example this initiative. This project aims to analyze accepted papers at SIGMOD and VLDB under the focus of evaluation methodology and reproducibility. While being mainly a literature study, it will nevertheless require automated mining of papers, some form of automated analysis, and reading a lot of source code. The result should be a meta-study on reproducibility in database research. It can also cover re-implementation of some selected algorithms that were published at these venues and which do not provide source code.

PUFFINN is an LSH-based nearest neighbor search algorithm with provable guarantees on the correctness of the result. Unfortunately, it only supports inner product and set similarity distance functions. The goal of this project is to add other LSH variants to PUFFINN, making it more generally useful. This project can be carried out on all different levels, but requires good knowledge of C++. Another project would be to improve parallel index building, which requires additional knowledge about parallel programming.

Literature:

Distributed Similarity Join

Our research group is currently working on carrying out large-scale similarity joins as for example discussed in (Fier et al., 2018). The source is available here. If you are a group willing to learning Rust and distributed systems, many implementation tasks can be carried out within this project. It can be carried out on all levels, but this project probably best fits a combination of a research project (getting known to the setting) and a subsequent Master thesis.

Literature:

Engineering Low-Dimensional Nearest Neighbor Search Algorithms

Nearest neighbor search in 2 dimensios is a well-studied problem with many efficient algorithms, such as kd-trees. This project will take a look at current state-of-the-art algorithms described by (Chan, Tsakalidis, 2016). The goal is to create an efficient implementation of their ideas and compare them to empirical fast algorithms, such as the ones described by (Birn et al., 2010).

A recent google project https://github.com/google-research/google-research/tree/master/scann seems to provide the state-of-the-art nearest neighbor search algorithm. The goal of this project is to explain the algorithm and provide a prototypical implementation to reproduce their results and enable further testing of ideas.

  • Ruiqi Guo, Philip Sun, Erik Lindgren, Quan Geng, David Simcha, Felix Chern, Sanjiv Kumar: Accelerating Large-Scale Inference with Anisotropic Vector Quantization, PDF

Avoiding leaks of user information is crucial for the reputation of social networks and other large-scale systems. Hence, all aspects of data processing and managing should be carefully analyzed with regard to possible leaks. The thesis aims at studying privacy issues and privacy-preserving techniques for similarity search systems, a key primitive in several applications (e.g., recommender systems, machine learning, information retrieval) that often handle sensitive data like user preferences or medical records.

Work can be carried out in different directions:

  1. Attack scenarios on similarity search tools. Describe attack scenarios on similarity search systems and evaluate their impact. This would entail studying different paradigms for similarity search such as locality-sensitive hashing, tree-based techniques, and similarity graphs within their application areas (e.g., recommender systems or deep learning).
  2. Privacy-preserving similarity search. Investigate techniques for privacy-preserving similarity search and study their security guarantees. The goal is to build a similarity-search library that offers provable security guarantees.
  3. Empirical comparison of privacy-preserving similarity search. Compare two existing methods in terms of their privacy guarantees and running time. For example, (Pagh, Stausholm, 2020) can be used to solve the problem discussed in (Aumüller et al., 2020).

There were already two Master theses (see (Aumüller et al., 2020) for an idea) that worked towards building a privacy-preserving similarity search tool. Extending such a tool could be an option.

A Bachelor thesis/Research project could be a survey of existing technologies.

See the referenced literature for more information.

Literature:

Analyzing and Engineering Graph-based Similarity Search Algorithms

Algorithms based on nearest neighbor graphs are currently the fastest algorithm to solve the nearest neighbor problem, see ann-benchmarks.com or the paper (Aumüller et al., 2019). However, the usually lack a solid theoretical foundation and/or easy-accessible implementations to investigate the effects of different optimizations.

A research project or bachelor thesis would survey existing technologies and provide a prototypical implementation that can be used to analyze the behavior of nearest neighbor search algorithms based on nearest neighbor graphs. A Master thesis would investigate new directions of building, analyzing and querying the nearest neighbor graph efficiently, which can both be of theoretical or empirical nature.

Literature:

Outlier-detection using LSH Variants

Outlier detection is a classical problem, in which we want to identify points that deviate so much from other data points in an input set that we suspect them to be generated through a different mechanism.

This topic is concerned with applying Locality-Sensitive Hashing to solve (versions of) the outlier detection problem. It was shown in (Kirner et al.) that for some applications, a standard LSH approach does not perform for the outlier detection problem. On starting point of the thesis is to check, how methods described in (Aumüller et al.) (“anti-LSH” and “annulus LSH”) can be applied in the context of outlier detection, and compare these approaches with existing methods.

Literature:

Influence of Local Intrinsic Dimensionality in Machine Learning and Indexing Problems

In the last year, the notion of local intrinsic dimensionality has found use in solving many different conceptual problems. For example, it has been used in outlier classification (Houle et al., 2018), characterizing adversarial subspaces in machine learning (Ma et al., 2018a/b), and for dataset diversification for machine learning tasks (Aumüller, Ceccarello, 2019).

Possible directions of a thesis can be applications of the LID measure to other areas, for example in work such as (Li et al., 2020) or a reproducibility study of the effects of local intrinsic dimensionality.

Literature:

Learned index data structures

In very recent work, (Kraska et al., 2018) discuss replacing classical data structures, such as B-trees and hashing algorithms, with indexes that learn a model of the data from the data. The mentioned paper shows that this can, sometimes, greatly reduce the index size and/or running times.

What the paper lacks is a theoretical foundation of the possibilities and limitations of such learned indexes. Here, (Mitzenmacher, 2018) made a proposal for the case of Bloom filters and related data structures.

Thesis work could go into two directions: 1) Implement and compare learned indexes to other indexing methods, such as Cuckoo hashing or locality-sensitive hash functions. 2) Further develop the theoretical foundations provided in (Mitzenmacher, 2018) to a different context.

Literature: