Hi, I'm Thomas. I am a graduate student in Theoretical Computer Science with Rasmus Pagh, since 2015 at the IT University of Copenhagen.
My research has primarily involved the theoretical foundations of massive data, similarity search (which is also my primary funding through Scalable Similarity Search), high dimensional geometry, communication and conditional lower bounds, sublinear algorithms and intrinsic dimensionality.
Previously I have been at University of Copenhagen and University of Oxford.
TA, M Aumüller, R Pagh — 2017
Parameter-free Locality Sensitive Hashing for Spherical Range Reporting. Proceedings of Symposium on Discrete Algorithms, (pdf, arxiv, slides)We present a data structure for spherical range reporting on a point set S, i.e., reporting all points in S that lie within radius r of a given query point q. Our solution builds upon the Locality-Sensitive Hashing (LSH) framework of Indyk and Motwani, which represents the asymptotically best solutions to near neighbor problems in high dimensions. While traditional LSH data structures have several parameters whose optimal values depend on the distance distribution from q to the points of S, our data structure is parameter-free, except for the space usage, which is configurable by the user. Nevertheless, its expected query time basically matches that of an LSH data structure whose parameters have been optimally chosen for the data and query in question under the given space constraints. In particular, our data structure provides a smooth trade-off between hard queries (typically addressed by standard LSH) and easy queries such as those where the number of points to report is a constant fraction of S, or where almost all points in S are far away from the query point. In contrast, known data structures fix LSH parameters based on certain parameters of the input alone.
The algorithm has expected query time bounded by O(t(n/t)^ρ), where t is the number of points to report and ρ∈(0,1) depends on the data distribution and the strength of the LSH family used. We further present a parameter-free way of using multi-probing, for LSH families that support it, and show that for many such families this approach allows us to get expected query time close to O(n^ρ+t), which is the best we can hope to achieve using LSH. The previously best running time in high dimensions was Ω(tn^ρ). For many data distributions where the intrinsic dimensionality of the point set close to q is low, we can give improved upper bounds on the expected query time.
TA, R Pagh, I Razenshteyn, F Silvestri — 2016
On the Complexity of Inner Product Similarity Join. Symposium on Principles of Database Systems, (pdf, arxiv, slides)A number of tasks in classification, information retrieval, recommendation systems, and record linkage reduce to the core problem of inner product similarity join (IPS join): identifying pairs of vectors in a collection that have a sufficiently large inner product. IPS join is well understood when vectors are normalized and some approximation of inner products is allowed. However, the general case where vectors may have any length appears much more challenging. Recently, new upper bounds based on asymmetric locality-sensitive hashing (ALSH) and asymmetric embeddings have emerged, but little has been known on the lower bound side. In this paper we initiate a systematic study of inner product similarity join, showing new lower and upper bounds. Our main results are:
* Approximation hardness of IPS join in subquadratic time, assuming the strong exponential time hypothesis.
* New upper and lower bounds for (A)LSH-based algorithms. In particular, we show that asymmetry can be avoided by relaxing the LSH definition to only consider the collision probability of distinct elements.
* A new indexing method for IPS based on linear sketches, implying that our hardness results are not far from being tight.
Our technical contributions include new asymmetric embeddings that may be of independent interest. At the conceptual level we strive to provide greater clarity, for example by distinguishing among signed and unsigned variants of IPS join and shedding new light on the effect of asymmetry.