The following projects are proposed for graduate (M.Sc., Ph.D.) and undergraduate students in the faculty. For more details or to register, please contact the listed advisor by mail.
The objective of this research is to apply a recent database approach to text analysis, namely the framework of “document spanners,” to optimize core processes in bioinformatics and the medical sciences. Specifically, we target the important operation of oligo design. Oligos (oligonucleotides) are short DNA or RNA molecules that are synthesized for a wide range of applications, from basic medical research to diagnostic uses such as COVID-19 testing. Oligo design takes place before the oligos are synthesized in the lab: the oligos are planned on a computer by algorithms that analyze and simulate potential interactions between the oligo and different DNA molecules in different cells.
Yet, computational limitations in oligo design are holding back the field of medicine. These limitations fall into three categories:
- Current software libraries fall short of enabling domain experts to translate their insights on oligo design into efficient algorithms.
- Some novel tasks in oligo design require solving computational problems where efficient algorithms are still unknown.
- Basic software-engineering practices, such as runtime optimization and code modularization, are highly challenging to incorporate into biological software.
We propose to address these challenges by adopting the framework of document spanners, which views text analysis (typically in the context of Natural Language Processing) as centered around two core subtasks: (a) the extraction of basic relations from text, and (b) the relational manipulation of the extracted relations. This approach enables us to adopt and adapt decades of advances in database optimization to the task at hand. Within this framework, we plan to develop novel algorithms that will enable researchers to design oligos more quickly and accurately. With these algorithms, we aim to improve the performance of core tasks in oligo design and to devise efficient algorithms for new (and unsolved) ones. Moreover, we aim not only to facilitate the existing practice of oligo design, but also to enable new medical research and clinical tools that require even more sophisticated oligo design.
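To make the two subtasks concrete, here is a minimal sketch of the spanner idea applied to a toy DNA string. It is an illustration under stated assumptions, not the project's method: the function names (`extract`, `exclude_overlapping`), the example sequence, and the design rule (discard candidate 6-mers that overlap a run of four or more T's) are all hypothetical. Step (a) uses regular expressions to extract relations of spans from the text; step (b) combines those relations with an ordinary relational operation, here an anti-join on span overlap.

```python
import re
from typing import NamedTuple


class Span(NamedTuple):
    start: int  # inclusive offset into the text
    end: int    # exclusive offset into the text


def extract(pattern: str, text: str) -> set[Span]:
    """Subtask (a): extract a relation of spans from the text.

    A lookahead is used so that overlapping matches are found; at each
    start position only the single match chosen by the regex engine is kept.
    """
    regex = re.compile(f"(?=({pattern}))")
    return {Span(m.start(1), m.end(1)) for m in regex.finditer(text)}


def overlaps(a: Span, b: Span) -> bool:
    """True if the two half-open intervals share at least one position."""
    return a.start < b.end and b.start < a.end


def exclude_overlapping(candidates: set[Span], blocked: set[Span]) -> set[Span]:
    """Subtask (b): relational manipulation (an anti-join on overlap)."""
    return {c for c in candidates if not any(overlaps(c, b) for b in blocked)}


# Hypothetical toy input, chosen only for illustration.
SEQUENCE = "ATGCGCATTTTTGACCGTA"

candidates = extract(r"[ACGT]{6}", SEQUENCE)   # every 6-mer window
blocked = extract(r"T{4,}", SEQUENCE)          # runs of four or more T's
kept = sorted(exclude_overlapping(candidates, blocked))

for span in kept:
    print(span, SEQUENCE[span.start:span.end])
```

Because the extracted spans are plain relations, the filtering step could equally be expressed as a database query, which is what opens the door to standard query-optimization techniques.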
This project is done in collaboration with the group of Prof. Naama Geva Zatorski in the department of Medicine, Technion.
Standard machine-learning algorithms assume that their input data is represented as numerical vectors. Applying these algorithms to non-numerical data requires embedding the data units into a (typically) finite-dimensional Euclidean vector space. The embedding needs to be “faithful” to the semantics. In particular, similar entities should be mapped to vectors that are close geometrically, and vice versa. In some modalities, the input comes with a useful embedding to begin with; for example, an image can be represented by the RGB intensities of its pixels. In others, semantics-aware embeddings have to be devised, and indeed have been devised, such as Word2Vec and GloVe for words of natural language, Node2Vec for the nodes of a graph, TransE for the entities of a knowledge graph, and Mol2Vec for molecule structures. Naturally, generic embeddings have also been devised for relational databases. Such embeddings have enabled the deployment of machine-learning architectures for traditional database tasks that have been pain points of data management for decades, including quality enhancement (e.g., data imputation, data cleaning, column prediction) and data integration (e.g., entity resolution, record linkage, schema matching).
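As a small illustration of what “close geometrically” can mean, the sketch below compares embedding vectors by cosine similarity and retrieves the nearest neighbor of an entity. The three-dimensional vectors and entity names are made up for this example and do not come from any trained model; cosine similarity is only one common choice of closeness measure, assumed here for simplicity.

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(entity: str, embedding: dict[str, list[float]]) -> str:
    """Return the other entity whose vector is most similar to `entity`'s."""
    others = (name for name in embedding if name != entity)
    return max(others, key=lambda name: cosine_similarity(embedding[entity], embedding[name]))


# Hypothetical toy embedding: two city values and one unrelated value.
toy_embedding = {
    "Haifa": [0.9, 0.1, 0.0],
    "Tel Aviv": [0.8, 0.2, 0.1],
    "oligonucleotide": [0.0, 0.1, 0.9],
}

print(nearest("Haifa", toy_embedding))
```

In a faithful embedding, the two city values end up closer to each other than either is to the unrelated value, which is the property that downstream tasks such as entity resolution rely on.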
The goal of this project is to develop embedding techniques for databases, some based on novel ideas invented by the laboratory and its partners, and to apply them to database tasks with the aim of considerably outperforming state-of-the-art solutions. Additional applications of embeddings that may be included in this project involve aspects of fairness and general data ethics.
This project is done in collaboration with the group of Prof. Martin Grohe from RWTH Aachen University and is funded by the German-Israeli Project Cooperation (DIP) program of DFG.