The following projects are proposed for graduate (M.Sc., Ph.D.) and undergraduate students in the faculty. For more details or for registration, please contact the listed advisor by mail.
Standard machine-learning algorithms assume that their input data is represented as numerical vectors. Applying these algorithms to the analysis of non-numerical data requires embedding the data units into a (typically) finite-dimensional Euclidean vector space. The embedding needs to be “faithful” to the semantics. In particular, similar entities should be mapped to vectors that are geometrically close and, conversely, geometrically close vectors should correspond to similar entities. In some modalities, the input comes with a useful embedding to begin with; for example, an image can be represented by the RGB intensities of its pixels. In others, semantics-aware embeddings have to be devised, and indeed have been devised, such as Word2Vec and GloVe for words of natural language, Node2Vec for the nodes of a graph, TransE for the entities of a knowledge graph, and Mol2Vec for molecule structures. Naturally, generic embeddings have also been devised for relational databases. Such embeddings have enabled the application of machine-learning architectures to traditional database tasks that have been pain points of data management for decades, including quality enhancement (e.g., data imputation, data cleaning, column prediction) and data integration (e.g., entity resolution, record linking, schema matching).
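To make the notion of a “faithful” embedding concrete, the following is a minimal sketch in Python. It is not one of the embedding methods named above: the three-dimensional vectors in `toy_embedding` are made up by hand purely for illustration, and the helper names `cosine_similarity` and `nearest` are hypothetical. The sketch only shows how geometric closeness between vectors can be measured and used to find the most similar entity.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical embedding: hand-made vectors, NOT the output of a real model.
# Two cities are placed close together; an unrelated word is placed far away.
toy_embedding = {
    "Paris": [0.9, 0.1, 0.0],
    "Berlin": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}


def nearest(entity, embedding):
    """Return the other entity whose vector is most similar to `entity`."""
    candidates = [name for name in embedding if name != entity]
    return max(
        candidates,
        key=lambda name: cosine_similarity(embedding[entity], embedding[name]),
    )


if __name__ == "__main__":
    # Under a faithful embedding, the nearest neighbor of a city is a city.
    print(nearest("Paris", toy_embedding))
```

Cosine similarity is only one possible measure of closeness; Euclidean distance would serve equally well for this illustration.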
The goal of this project is to develop embedding techniques for databases, some based on novel ideas invented by the laboratory and its partners, and to apply them to database tasks, with the aim of considerably outperforming state-of-the-art solutions. Additional applications of embeddings that may be included in this project involve aspects of fairness and general data ethics.
This project is carried out in collaboration with the group of Prof. Martin Grohe from RWTH Aachen University and is funded by the German-Israeli Project Cooperation (DIP) program of the DFG.