Halil İbrahim Kuru

Graph Embeddings on Protein Interaction Networks

Protein-protein interaction (PPI) networks represent the set of possible interactions among proteins and, by extension, among the genes that encode them. By integrating isolated signals on single genes, such as mutations or differential expression patterns, PPI networks have enabled a range of biological discoveries. Moreover, the connectivity patterns of proteins in these networks have themselves proven highly informative for various prediction tasks involving proteins or genes. These tasks, however, require task-specific feature engineering. Graph embedding techniques, which learn a deep representation of the nodes in the network, provide a powerful alternative and obviate the need for this extensive feature engineering. In this study, we apply graph embedding techniques to PPI networks in two independent machine learning tasks. The first part of this work focuses on predicting gene essentiality. Using two node embedding techniques, node2vec and DeepWalk, we present a classifier that takes only node embeddings as input and show that it achieves an AUC of up to 88% in predicting human gene essentiality.
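
To make the embedding step more concrete, the following is a minimal sketch of the walk-generation stage of a DeepWalk-style method. It is an illustration under stated assumptions, not the pipeline used in this work: the function name uniform_random_walks, its parameters, and the toy network are hypothetical. In DeepWalk, walks like these are treated as sentences and passed to a skip-gram model, which learns one vector per node; those vectors would then serve as the only input features of an essentiality classifier.

```python
import random


def uniform_random_walks(adjacency, walk_length=5, walks_per_node=2, seed=0):
    """Generate DeepWalk-style uniform random walks over an undirected graph.

    adjacency maps each node to a list of its neighbors.
    """
    rng = random.Random(seed)
    walks = []
    nodes = sorted(adjacency)
    for _ in range(walks_per_node):
        rng.shuffle(nodes)  # visit start nodes in a new random order each pass
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                neighbors = adjacency[walk[-1]]
                if not neighbors:
                    break  # isolated node: the walk cannot continue
                walk.append(rng.choice(neighbors))  # uniform choice, no bias
            walks.append(walk)
    return walks


# Hypothetical toy network; the edges are illustrative, not curated interactions.
toy_ppi = {
    "TP53": ["MDM2", "EP300"],
    "MDM2": ["TP53", "EP300"],
    "EP300": ["TP53", "MDM2", "CREBBP"],
    "CREBBP": ["EP300"],
}

walks = uniform_random_walks(toy_ppi)
for walk in walks[:3]:
    print(" -> ".join(walk))
```

node2vec follows the same outline but replaces the uniform neighbor choice with a biased, second-order choice controlled by a return parameter and an in-out parameter.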

The second part of the thesis proposes a novel representation of patients based on the pairwise rank order of patient protein expression values and on protein interactions, which we abbreviate as PRER. Specifically, we use each patient's protein expression values to generate a patient-specific embedding that captures the expression of a protein relative to the other proteins in its neighborhood. The neighborhood is derived using a biased random-walk strategy. For a given tumor, we first check whether each protein is more or less expressed than the other proteins in its neighborhood. From these comparisons we build a representation that not only captures the dysregulation patterns among the proteins but also accounts for their molecular interactions. To test the effectiveness of this representation, we apply PRER to the problem of patient survival prediction. Compared with representing patients by their individual protein expression features, the PRER representation demonstrates significantly superior predictive performance in 8 out of 10 cancer types. Proteins that emerge as important in PRER, but not when individual expression values are used, provide a valuable set of biomarkers with high prognostic value. They also highlight other proteins whose dysregulation patterns merit further investigation.
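
The PRER construction described above can be sketched roughly as follows. This is a simplified illustration, not the thesis implementation: the helper names (biased_walk, neighborhoods, pairwise_rank_features), the neighborhood size, the walk parameters, the binary encoding, the handling of ties, and the toy data are all assumptions made for the example. The biased walk follows the node2vec scheme with a return parameter p and an in-out parameter q.

```python
import random
from collections import Counter


def biased_walk(adjacency, start, length, p, q, rng):
    """One node2vec-style second-order random walk starting at `start`."""
    walk = [start]
    while len(walk) < length:
        current = walk[-1]
        neighbors = adjacency[current]
        if not neighbors:
            break
        if len(walk) == 1:
            walk.append(rng.choice(neighbors))  # first step is unbiased
            continue
        previous = walk[-2]
        weights = []
        for candidate in neighbors:
            if candidate == previous:
                weights.append(1.0 / p)  # return to the previous node
            elif candidate in adjacency[previous]:
                weights.append(1.0)      # stay near the previous node
            else:
                weights.append(1.0 / q)  # move further away
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk


def neighborhoods(adjacency, size=2, walks=20, length=4, p=1.0, q=0.5, seed=0):
    """Assumed rule: keep the `size` most frequently visited nodes per protein."""
    rng = random.Random(seed)
    result = {}
    for node in sorted(adjacency):
        visits = Counter()
        for _ in range(walks):
            visits.update(biased_walk(adjacency, node, length, p, q, rng)[1:])
        visits.pop(node, None)  # a protein is not its own neighbor
        ranked = sorted(visits, key=lambda n: (-visits[n], n))
        result[node] = ranked[:size]
    return result


def pairwise_rank_features(expression, hoods):
    """1 if a protein is more expressed than a neighbor, else 0 (ties give 0)."""
    return {
        (protein, other): int(expression[protein] > expression[other])
        for protein, hood in hoods.items()
        for other in hood
    }


# Hypothetical toy network and one hypothetical patient's expression values.
toy_ppi = {
    "A": ["B", "C"],
    "B": ["A", "C"],
    "C": ["A", "B", "D"],
    "D": ["C"],
}
patient_expression = {"A": 2.1, "B": 0.4, "C": 1.3, "D": 1.9}

hoods = neighborhoods(toy_ppi)
prer = pairwise_rank_features(patient_expression, hoods)
for pair, value in sorted(prer.items()):
    print(pair, value)
```

Because the neighborhoods depend only on the network, they are computed once and shared across patients; only the pairwise comparisons are patient-specific. The resulting feature vector would then be passed to a survival model in place of the raw expression values.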