Research Projects

Here are the research projects I worked on during my PhD:

Cell-signaling and metabolic networks modeled as hypergraphs

Cell-signaling and metabolic pathways are cornerstones in cellular and molecular biology. They are often implicated as the cause of many diseases, including cancer.  Many forms of cancer arise from under-expression or over-expression of proteins in a certain pathway.  Usually the disrupted pathway is identified by the detection of downstream effects (something many steps away from the underlying problem). Finding the cause of the pathway disruption is difficult, because annotation of pathways is time consuming for biologists, so many biologically relevant pathways remain unknown.

Cellular reaction networks are traditionally modeled using graphs, complex graphs, and boolean networks.  These models are inadequate at representing complex multiway reactions (like most cellular reactions). Hypergraphs are a generalization of graphs, where hyperedges now connect two sets of vertices, instead of connecting single vertices.  Directed hypergraphs model multiway reactions correctly as a hyperedge from the set of one or more reagents to the set of one or more products.

Many algorithms for ordinary graphs have not yet been generalized to hypergraphs, and are often computationally hard.  Even finding the shortest hyperpath on a hypergraph is NP-Complete.

We have developed a heuristic for the general shortest s,t-hyperpaths problem, which allows cycles (where previous approaches only find acyclic hyperpaths). Our heuristic has a guaranteed to be efficient, and we have shown through comprehensive experimentation that it is optimal on over 99% of problem instances from two standard reaction network databases. The paper was published in the proceedings of the 21st International Workshop on Algorithms in Bioinformatics (WABI 2021), and is available here.
A more complete journal version of this paper appears in Algorithms for Molecular Biology (AMB) and is available here.

We have also developed an exact algorithm for the general shortest s,t-hyperpaths problem that uses a novel characterization of s,t-hyperpaths in an integer linear programming (ILP) formulation of the problem. We can solve this ILP fast in practice using a new cutting plane algorithm. We show on these same network databases that it is now possible to compute optimal shortest hyperpaths fast: with a median run time under 15 seconds and a maximum run time under 30 minutes. The paper was published in the proceedings of the 27th International Conference on Research in Computational Molecular Biology (RECOMB 2023) and is available here.

Factories are similar to hyperpaths, but require stoichiometry constraints to be met for each of the reactions represented by hyperedges in the factory. We were the first to present the minimum-hyperedge factory problem, which we tackled by a mixed-integer linear programming approach. We also incorporated negative regulation into hypergraph models of cellular reaction networks for the first time. This paper appears in the proceedings of Intelligent Systems for Molecular Biology (ISMB 2022), and it is available here.

Source code for the heuristic, a hyperpath enumeration algorithm that allows tractable enumeration of all possible s,t-hyperpaths, and an implementation of our minimum-hyperedge factory approach is available at http://hyperpaths.cs.arizona.edu.

Protein secondary structure prediction

A protein’s primary structure is its linear sequence of amino acids (which is usually represented computationally as a string over the ~20 letter amino acid alphabet). Proteins fold into large three dimensional structures, based on hydrogen bonding between the amino acids in their primary structure. Proteins also have secondary structure, where each amino acid in their primary sequence has a corresponding secondary structure state, taken from three or eight basic types. (For 3-state prediction, these are alpha helix, beta strand, and random coil.) Determining the secondary structure of a protein is difficult and time-consuming, as it requires crystalizing the protein and then determining its three-dimensional structure. This three-dimensional structure is then converted to secondary structure using the DSSP algorithm.

We predict the secondary structure of a protein avoiding traditional database homology search (which is used by all state-of-the-art methods) and instead using nearest neighbor search. Our method Nnessy determines state probabilities for each type of secondary structure for each amino acid in the protein via an overlapping words procedure. These state probabilities are then used as an input to a maximum-likelihood dynamic programming algorithm that outputs physically-valid secondary structure predictions. We also use aspects of the prediction procedure to obtain an estimated accuracy of our secondary structure prediction, which is output to the user.

We combine Nnessy (which is a template-based method, since the secondary structure states of the words in our nearest neighbor database are known) with a template-free method to obtain a high-accuracy hybrid method. We compare the estimated accuracy of Nnessy’s prediction to a threshold, and if it exceeds the threshold, we return Nnessy’s prediction, otherwise the template-free tool’s prediction is returned. This hybrid method exceeds the accuracy of state-of-the-art tools by 3% accuracy on standard benchmark datasets for the 3-state model and by more than 8% accuracy for the 8-state model.

This work is published in the proceedings of the 24th Conference on Intelligent Systems for Molecular Biology (ISMB 2020) and available here.

We also have created accuracy estimators for other commonly-used tools: JPred, PSIPRED, Porter, MUFOLD, DeepCNF, and SSpro. We explored many different ensemble methods to combine heterogeneous predictions from these tools into a final ensemble prediction. We found that the hybrid method between Nnessy and Porter outperformed all ensemble methods. However, with more accurate accuracy estimators, an ensemble that estimates the accuracy of each prediction and takes the prediction with highest estimated accuracy would surpass the accuracy of the two-tool hybrid.

This ensemble work is published in the proceedings of the 11th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM-BCB 2020) and available here.

Multi-level vertex-weighted steiner tree

I contributed to a paper with other students from Stephen Kobourov’s 600-level course in Spring 2018.  I helped write much of the paper with an emphasis on the ILP formulation of the problem and how to solve the ILP quickly using a cutting-plane approach.

A link to the arxiv version: https://arxiv.org/abs/1811.11700
css.php