Thesis topics for Students
The following list contains possible topics for BSc and MSc students enrolled at IT University of Copenhagen. Each topic usually contains a theoretical and a practical component. The balance can be adjusted to the student’s needs.
If you wish to write a thesis in algorithms: 1) Find someone to work with (we prefer thesis work as a group), 2) Write a short description of the kind of thesis you imagine writing, 3) Contact me to set up a meeting.
Proposed Topics
The list below contains thesis topics that I gave some thought. If you find none of them interesting, but have another algorithmic question in mind, feel free to contact me with your proposal. I am interested in supervising topics that survey existing algorithmic work, conduct a reproducibility study of algorithm engineering approaches, work on explaining results of algorithms (e.g., for a recommender system) or fairness issues in machine learning and algorithms (e.g., see this book project and our paper for a concrete example).
Last Update: August 23, 2019.
Analyzing and Engineering Graphbased Similarity Search Algorithms
Algorithms based on nearest neighbor graphs are currently the fastest algorithm to solve the nearest neighbor problem, see annbenchmarks.com or the paper (Aumüller et al., 2019). However, the usually lack a solid theoretical foundation and/or easyaccessible implementations to investigate the effects of different optimizations.
A research project or bachelor thesis would survey existing technologies and providing a prototypical implementation that can be used to analyze the behavior of nearest neighbor search algorithms based on nearest neighbor graphs. A Master thesis would investigate new directions of building and querying the nearest neighbor graph efficiently, which can both be of theoretical or empirical nature.
Literature:
 (Aumüller et al., 2019) Martin Aumüller, Erik Bernhardsson, Alexander Faithful, ANNBenchmarks: A Benchmarking Tool for Approximate Nearest Neighbor Algorithms
 (Malkov, Yashunin, 2018) Yu. A. Malkov, D. A. Yashunin, Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs
 (Iwasaki, Miyazaki, 2018) Masajiro Iwasaki, Daisuke Miyazaki, Optimization of Indexing Based on kNearest Neighbor Graph for Proximity Search in Highdimensional Data
Privacypreserving Similarity Search
Avoiding leaks of user information is crucial for the reputation of social networks and other largescale systems. Hence, all aspects of data processing and managing should be carefully analyzed with regard to possible leaks. The thesis aims at studying privacy issues and privacypreserving techniques for similarity search systems, a key primitive in several applications (e.g., recommender systems, machine learning, information retrieval) that often handle sensitive data like user preferences or medical records.
Work can be carried out in different directions:
 Attack scenarios on similarity search tools. Describe attack scenarios on similarity search systems and evaluate their impact. This would entail studying different paradigms for similarity search such as localitysensitive hashing, treebased techniques, and similarity graphs within their application areas (e.g., recommender systems or deep learning).
 Privacypreserving similarity search. Investigate techniques for privacypreserving similarity search and study their security guarantees. The goal is to build a similaritysearch library that offers provable security guarantees.
There was already a Master thesis (see (Bourgeat et al., 2019)) that worked towards building a privacypreserving similarity search tool. Extending such a tool could be an option.
A Bachelor thesis/Research project could be a survey of existing technologies.
See the referenced literature for more information.
Literature:

(Aumüller et al., 2018) M. Aumüller, T. Christiani, R. Pagh, F. Silvestri. Distancesensitive Hashing. Proc. 37th Symposium on Principles of Database Systems, 2018.

(Riazi et al., 2017) M. S. Riazi, B. Chen, A. Shrivastava, D. Wallach, F. Koushanfar. Sublinear Privacypreserving Search with Untrusted Server and Semihonest Parties. Under submission (Arxiv 1612.01835).

(Bourgeat et al., 2019) Anders Bourgeat, Jana Schmurr, Martin Aumüller. Local Differential Private Estimation of Jaccard Similarity

Dhaliwal, J., So, G., ParkerWood, A., Beck, M.: Utility preserving secure private data release
Outlierdetection using LSH Variants
Outlier detection is a classical problem, in which we want to identify points that deviate so much from other data points in an input set that we suspect them to be generated through a different mechanism.
This topic is concerned with applying LocalitySensitive Hashing to solve (versions of) the outlier detection problem. It was shown in (Kirner et al.) that for some applications, a standard LSH approach does not perform for the outlier detection problem. On starting point of the thesis is to check, how methods described in (Aumüller et al.) (“antiLSH” and “annulus LSH”) can be applied in the context of outlier detection, and compare these approaches with existing methods.
Literature:

(Aumüller et al., 2018) M. Aumüller, T. Christiani, R. Pagh, F. Silvestri. Distancesensitive Hashing. Proc. 37th Symposium on Principles of Database Systems, 2018.

(Kirner et al., 2017) E. Kirner, E. Schubert, A. Zimek: Good and Bad Neighborhood Approximations for Outlier Detection Ensembles. Similarity Search and Applications  10th International Conference.
Influence of Local Intrinsic Dimensionality in Machine Learning and Indexing Problems
In the last year, the notion of local intrinsic dimensionality has found use in solving many different conceptual problems. For example, it has been used in outlier classification (Houle et al., 2018), characterizing adversarial subspaces in machine learning (Ma et al., 2018a/b), and for dataset diversification for machine learning tasks (Aumüller, Ceccarello, 2019).
Possible directions of a thesis can be applications of the LID measure to other areas or a reproducibility study of the effects of local intrinsic dimensionality.
Literature:
 (Houle et al., 2018) Michael E. Houle, Erich Schubert, Arthur Zimek, On the Correlation Between Local Intrinsic Dimensionality and Outlierness
 (Ma et al., 2018a) Xingjun Ma, Bo Li, Yisen Wang, Sarah M. Erfani, Sudanthi N. R. Wijewickrema, Grant Schoenebeck, Dawn Song, Michael E. Houle, James Bailey, Characterizing Adversarial Subspaces Using Local Intrinsic Dimensionality
 (Ma et al., 2018b) Xingjun Ma, Yisen Wang, Michael E. Houle, Shuo Zhou, Sarah M. Erfani, ShuTao Xia, Sudanthi N. R. Wijewickrema, James Bailey, DimensionalityDriven Learning with Noisy Labels
 (Aumüller, Ceccarello, 2019) Martin Aumüller, Matteo Ceccarello, The Role of Local Intrinsic Dimensionality in Benchmarking Nearest Neighbor Search
Learned index data structures
In very recent work, (Kraska et al., 2018) discuss replacing classical data structures, such as Btrees and hashing algorithms, with indexes that learn a model of the data from the data. The mentioned paper shows that this can, sometimes, greatly reduce the index size and/or running times.
What the paper lacks is a theoretical foundation of the possibilities and limitations of such learned indexes. Here, (Mitzenmacher, 2018) made a proposal for the case of Bloom filters and related data structures.
Thesis work could go into two directions: 1) Implement and compare learned indexes to other indexing methods, such as Cuckoo hashing or localitysensitive hash functions. 2) Further develop the theoretical foundations provided in (Mitzenmacher, 2018) to a different context.
Literature:

(Kraska et al., 2018) Kraska, T., Beutel, A., Chi, E. H., Dean, J., & Polyzotis, N. (2018, May). The case for learned index structures. In Proceedings of the 2018 International Conference on Management of Data (pp. 489504). ACM.

(Mitzenmacher, 2018) M. Mitzenmacher, A Model for Learned Bloom Filters and Related Structures, eprint arXiv:1802.00884.