My research focuses on algorithms and data structures for “big data,” with applications to computational biology and more recently astronomy, among other areas. I am particularly interested in the “manifold hypothesis” and how interesting geometric and topological properties of data can enable more efficient algorithms for search and analysis. I am also interested in computational topology, and approaches for making topological data analysis tractable on large data sets.
I run the URI Algorithms for Big Data group (URI-ABD). Our most recent research has involved the development of CLAM (Clustered Learning of Approximate Manifolds). CLAM is a clustering and graph-induction approach that enables fast search (ρ-nearest neighbors and k-nearest neighbors) and data compression; together, these tools are called CLAM-CAKES (CLAM-Accelerated K-nearest-neighbor Entropy-scaling Search). CLAM’s graph induction also enables anomaly detection through an ensemble of simple algorithms called CHAODA (Clustered Hierarchical Anomaly and Outlier Detection Algorithms, and yes, it is pronounced “chowda”). We are currently working on a graph visualization tool to better understand graph embeddings of manifolds in high-dimensional spaces. CLAM also enables object recognition in three dimensions, an ongoing project currently funded by the Office of Naval Research.
With collaborators at Tufts, URI, and Drexel, I have been developing the MEDFORD metadata language. The medford parser is available. Our paper in OUP’s Database journal is also available.
Here are some selected publications:
- Transfer of Knowledge from Model Organisms to Evolutionarily Distant Non-Model Organisms: The Coral Pocillopora damicornis Membrane Signaling Receptome (with Lokender Kumar, Nathanael Brenner, Samuel Sledzieski, Monsurat Olaosebikan, Matthew Lynn-Goin, Bonnie Berger, Hollie Putnam, Jinkyu Yang, Nastassja Lewinski, Rohit Singh, Lenore Cowen, and Judith Klein-Seetharaman) (2022) PLoS ONE
- MEDFORD: A human- and machine-readable metadata markup language (with Polina Shpilker, John Freeman, Hailey McKelvie, Jill Ashey, Jay-Miguel Fonticella, Hollie Putnam, Jane Greenberg, Lenore Cowen, and Alva Couch) (2022) Database (The Journal of Biological Databases and Curation)
- Clustered Hierarchical Anomaly and Outlier Detection Algorithms (with Najib Ishaq and Thomas Howard, III) IEEE Big Data 2021 and its supplement
- MEtaData Format for Open Reef Data (MEDFORD) (with Polina Shpilker, Jack Freeman, Hailey McKelvie, Jill Ashey, Jay-Miguel Fonticella, Hollie Putnam, Jane Greenberg, Lenore J. Cowen, and Alva Couch), International Conference on Metadata and Semantics Research (MTSR) 2021
- Clustered Hierarchical Entropy-Scaling Search (with Najib Ishaq and George Student) IEEE Big Data 2019
- Assortative Mixture of English Parts of Speech (with Timothy Leonard, Lutz Hamel, and Natallia Katenka) International Conference on Complex Networks and their Applications 2017
- Computational Biology in the 21st Century: Scaling with Compressive Algorithms (with Bonnie Berger and Y. William Yu) Communications of the ACM
- Entropy-scaling search of massive biological data (with Y. William Yu, David Christian Danko, and Bonnie Berger) Cell Systems
- MRFy: Approximate Markov random fields for remote homology detection (with Andrew Gallant, Norman Ramsey, and Lenore Cowen) IEEE/ACM TCBB 2015 preprint
- Compressive genomics for protein databases (with Andrew Gallant, Jian Peng, Lenore Cowen, Michael Baym, Bonnie Berger) ISMB 2013 Research project page
- Remote homology detection in proteins using graphical models (my dissertation) ArXiv 2013 / preprint
- Experience Report: Haskell in Computational Biology (with Andrew Gallant and Norman Ramsey) ICFP 2012 / preprint
- Formatt: Correcting Protein Multiple Structural Alignments by Sequence Peeking (with Shilpa Nadimpalli and Lenore Cowen) BMC Bioinformatics 2012 / preprint Software
- SMURFLite: combining simplified Markov random fields with simulated evolution improves remote homology detection for beta-structural proteins into the twilight zone (with Raghu Hosur, Bonnie Berger, Lenore Cowen) Bioinformatics 2012 / preprint Webserver
- Touring Protein Space with Matt (with Anoop Kumar, Matt Menke, Lenore Cowen) IEEE/ACM TCBB 2012 / preprint Mattbench website
Recent advances in technology have led to exponential growth in data, sometimes outpacing available computing power. This explosion in data promises new discoveries, if only we can mine it. I am interested in developing algorithms to help data scientists from fields as diverse as molecular biology, astronomy, chemistry, the social sciences, global trade, and finance make discoveries based upon data. In the past, my research has focused on protein structure prediction, remote homology detection in proteins, and function prediction in protein-protein interaction networks. More recently, I have developed compressive algorithms for speeding up approximate search in biological systems, and I am interested in extending these ideas to domains outside of biology.
I also have a strong interest in functional programming, and I have enjoyed implementing some of my research software in Haskell. I am interested in how more powerful programming language features, such as algebraic type systems, higher-order functions, and ease of parallelization, can make the implementation of data-science algorithms faster, more correct, and more productive for the programmer.