2021 BRITE REU Faculty Projects

Genetic variation and linkage to phenotype
Gary Benson
Departments of Biology and Computer Science

The Benson lab develops algorithms and software for biological sequence comparison and repeat detection in genomic sequences. The focus is on understanding the occurrence and functional effects of tandem repeats (TRs), especially those with variable copy number, also known as variable number of tandem repeats (VNTRs). The lab has developed an analysis tool, VNTRseek, that identifies VNTRs from high-throughput sequencing data, but it is limited to TRs that fit within a single sequencing read. This project will develop new algorithmic and statistical methods to permit detection of longer VNTRs and the use of longer-read sequencing technologies. Additionally, an online database will be created to store and analyze the variant data. Students will gain knowledge of human genetic variability and DNA repeats, and skills in analyzing high-throughput sequencing data, algorithm design and testing, and database development.
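To make the idea of a variable copy number concrete, here is a minimal Python sketch that finds exact tandem repeats with a regular expression and compares the copy number of one repeat between two sequences. It is a toy illustration only and is not the VNTRseek algorithm; the function name find_tandem_repeats, its parameters, and the example sequences are all hypothetical.

```python
import re

def find_tandem_repeats(seq, min_unit=2, max_unit=6, min_copies=3):
    """Return (start, unit, copies) for exact tandem repeats in seq.

    Toy approach: for each unit length, match a unit followed by at
    least (min_copies - 1) further exact copies of itself.
    """
    hits = []
    for unit_len in range(min_unit, max_unit + 1):
        pattern = re.compile(
            r"([ACGT]{%d})\1{%d,}" % (unit_len, min_copies - 1)
        )
        for match in pattern.finditer(seq):
            copies = len(match.group(0)) // unit_len
            hits.append((match.start(), match.group(1), copies))
    return hits

# Hypothetical reference and sample alleles that differ only in the
# number of CAG copies (4 versus 6).
reference = "TTGA" + "CAG" * 4 + "TT"
sample = "TTGA" + "CAG" * 6 + "TT"

ref_hits = find_tandem_repeats(reference)
sample_hits = find_tandem_repeats(sample)
print(ref_hits)
print(sample_hits)
```

Real VNTR detection must tolerate mismatches and indels within the repeat and must work from sequencing reads rather than assembled alleles, which is why exact matching like this is only a starting point.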

Quantifying cross-scale interaction in complex natural systems
Ethan Deyle
Department of Biology

Many of the tools scientists use to quantitatively study the world were developed for engineered systems and laboratory experiments, where a single cause produces a single effect independent of other variables (“linear separability”). Natural systems, whether individual cells or entire ocean ecosystems, do not always follow these expectations. Instead, interactions are often state-dependent: the action of a cause depends on the context around it (i.e., on the state of other variables). This nonlinear state-dependence can interfere with the comfortable, correlative approaches to studying systems, but it also presents rich opportunities. This project will center on applying nonlinear causal inference to identify interactions between scales of complexity in natural systems (e.g., single fish populations and ecosystem functioning, or single-cell expression and organism physiology). Options are available to focus on applied data study of neuronal gene expression, aquatic food webs, or marine fishery management. It is also possible to focus entirely on numerical simulation data. Students will gain hands-on experience in data processing, non-parametric statistics, and time-series analysis using R or Python (based on preference). Previous coding experience is not a strict requirement but will affect the scope of the project.
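The way state-dependence can defeat a correlative analysis can be seen in a small simulation. The Python sketch below is a toy example under stated assumptions, not the lab's causal inference method: a hypothetical context variable z flips the sign of the effect of x on y, so the overall correlation between x and y is near zero even though x strongly drives y within each context. The names simulate and pearson, and the noise level, are illustrative choices.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def simulate(n=2000, seed=0):
    """Simulate y = z * x + small noise, where z is a context of +1 or -1."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    z = [rng.choice([-1, 1]) for _ in range(n)]
    y = [zi * xi + 0.1 * rng.gauss(0, 1) for xi, zi in zip(x, z)]
    return x, y, z

x, y, z = simulate()

# Correlation ignoring the context variable.
overall = pearson(x, y)

# Correlation within each context (z = +1 and z = -1).
pos = [(xi, yi) for xi, yi, zi in zip(x, y, z) if zi == 1]
neg = [(xi, yi) for xi, yi, zi in zip(x, y, z) if zi == -1]
within_pos = pearson([p[0] for p in pos], [p[1] for p in pos])
within_neg = pearson([p[0] for p in neg], [p[1] for p in neg])

print(overall, within_pos, within_neg)
```

In this toy system the pooled correlation is weak while the within-context correlations are strong and of opposite sign, which is the kind of pattern that motivates methods designed for state-dependent interactions.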

Fine-mapping of genetic loci for quantitative traits
Josee Dupuis
Department of Biostatistics

The Dupuis lab develops statistical approaches to identify specific genes or genetic variants that influence complex phenotypes through their associated quantitative traits, which are traits that can be measured numerically, such as height or blood pressure in humans, and seed size or oil content in plants. This project involves developing statistical analyses that combine genome-wide association results with prior information from “omics” studies (gene variant functionality, gene expression, methylation, metabolomic data, and proteomic data) to determine regions with common or rare genetic variants that are potentially causally associated with traits of interest. Students will become familiar with genetic studies and software for genetic analysis, and will explore publicly available databases to assign putative function to sets of variants.

Profiling human microbial communities
W. Evan Johnson
Departments of Biostatistics and Medicine

The Johnson lab studies the human microbiome, i.e., microbial communities which live in and on the human body and play a vital role in health and disease. This project involves the development of statistical tools and software for jointly analyzing microbial and host data from sequencing experiments, in order to determine community content, microbe-microbe interactions, and host-microbe relationships. Students will help develop tools and workflows to compile annotated libraries of genes and genomes, curate functional associations between genes and microbes in metabolism, and link microbial abundance to host gene/pathway expression and other outcomes.

Ecological forecasting: Predicting changes in soil microorganisms
Jennifer Bhatnagar
Department of Biology

The Bhatnagar lab studies soil microbiome variation in the context of changing environmental conditions. Soil microorganisms perform a variety of essential roles, including acting as plant symbionts, animal pathogens, and free-living decomposers that recycle nutrients and carbon through the biosphere. Yet it is unclear whether soil microorganisms will persist in a changing world. This project will develop new bioinformatics tools to predict which soil microorganisms will endure and remain active over space and time. Microbial DNA sequence data will be collected from soils obtained through a national sampling initiative, the National Ecological Observatory Network (NEON). The data will be analyzed for gene clusters involving biochemical pathways affecting microbial ecology (e.g., the ability to serve as pathogens or symbionts) and used to develop statistical models that predict variance in microbial function based on location, time, temperature, precipitation, soil nutrient content, and plant biomass. Students will learn key steps in metagenome analysis and methods for data visualization.

Near-term ecological forecasting
Michael Dietze
Department of Earth and Environment

The Dietze lab uses a combination of ecological theory, informatics, statistics, and cyberinfrastructure development to advance the field of predictive ecology and iterative forecasting, in which new data are used to refine predictive models. Current application areas include soil microbes, vegetation phenology, land carbon and water fluxes, aquatic productivity, and algal blooms. This project involves the development of computational forecasting workflows, including modules for expansion to new data repositories and forecasting data types, statistical model calibration and validation, and forecast visualization. Students will learn about ecological forecasting, high-performance and cloud computing, software containerization, real-time workflow automation, databases, and the statistics of iterative model-data assimilation.

Single cell transcriptomics
Joshua Campbell
Department of Computational Biomedicine

The Campbell lab focuses on developing computational methods for characterizing cellular heterogeneity in gene expression using single-cell RNA sequencing. Tools include CELDA (CEllular Latent Dirichlet Allocation), which identifies hidden transcriptional states and cellular subpopulations in count-based, single-cell RNA-seq data, and DecontX, which estimates contamination by ambient RNA in single-cell data. This project involves analyzing publicly available single-cell datasets to test and develop new methods of single-cell analysis. Students will learn about RNA sequencing for bulk tissue and single-cell samples, and will help develop analysis pipelines and data visualizations in the R programming language with the R/Shiny graphical user interface.

Genes and pathways regulating symbiosis in corals
Sarah W. Davies
Department of Biology

The Davies lab studies how corals and their symbiotic algae maintain and lose symbiosis under varying environmental conditions. Corals meet the majority of their energy needs through life-long symbiotic relationships with single-celled algae. Loss of this relationship leads to coral bleaching and, eventually, colony death. Some corals have a facultative relationship with their algal symbionts, wherein both the host and symbiont can be cultured independently and manipulated in and out of symbiosis. This project involves analyzing gene expression, and in particular, orthologous gene covariance, in such facultative systems under baseline and stress conditions, to help elucidate maintenance and loss of symbiosis. It will use a holobiont (coral + algae) transcriptome developed in the Davies lab. Students will develop knowledge and skills related to ecology, evolution, and RNA-seq analysis. Since stress experiments are ongoing, the project may include an experimental component.

Determining modes of protein-membrane interaction
Karen Allen
Department of Chemistry

The Allen lab explores the relationship between protein structure and function using X-ray diffraction and enzyme kinetic studies. In bacteria, a principal mechanism for glycan (complex sugar molecule) assembly on the cytoplasmic face of cell membranes involves polyprenol phosphate (PrenP) phosphoglycosyl transferases (PGTs). PGTs catalyze transfer of a phosphosugar to a membrane-bound PrenP acceptor. Recently, the lab solved the X-ray crystallographic structure of the PGT PglC from Campylobacter concisus, showing that it contains a re-entrant membrane helix (RMH) that penetrates only one leaflet of the bilayer and then re-emerges on the cytoplasmic face. This contradicts the computational prediction that the RMH is a transmembrane helix. This project involves developing hidden Markov models (HMMs) to predict these “misannotated” helices in other protein families, using data from an in vivo cysteine labeling method to assess whether the N-terminus lies on the cytoplasmic or periplasmic side of the membrane. Students will gain exposure to protein chemistry, enzyme functional studies, chemoinformatic library analysis, sequence and structural alignment methods, and HMM modeling techniques. Since the cysteine labeling studies are ongoing, this project may include an experimental component.

Multi-Study Feature Selection
Prasad Patil
Department of Biostatistics

I work with collections of datasets that measure the same outcome and overlapping sets of features in multiple patient cohorts. Generally, these are datasets in which patient survival (outcome) and gene expression measurements of 20,000+ genes (features) are recorded for ovarian or breast cancer patients. The goal is to train a prediction rule for risk of cancer progression or recurrence that performs well across datasets and generalizes well to new patients. Often there are far more predictors than patients in each dataset, so feature filtering and selection prior to training a prediction rule is a necessary step. A prevailing question is how to ensure that we select features that predict well across studies and avoid features that perform exceedingly well only within a single study. Students will gain experience working with high-dimensional genomic datasets and with feature selection and machine learning approaches implemented in the R programming language.
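One simple way to think about cross-study feature selection is to keep only features whose association with the outcome has the same direction and adequate strength in every study. The Python sketch below illustrates that idea on made-up data; it is an assumption-laden toy, not the method used in this project. It uses a continuous outcome and Pearson correlation in place of survival modeling, and the names select_consistent_features, threshold, and the gene labels are hypothetical.

```python
def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5

def select_consistent_features(studies, threshold=0.5):
    """Keep features shared by all studies whose correlation with the
    outcome has the same sign and magnitude >= threshold in each study.

    studies: list of (features, outcome) pairs, where features maps a
    feature name to a list of values aligned with the outcome list.
    """
    shared = set.intersection(*(set(feats) for feats, _ in studies))
    selected = []
    for name in sorted(shared):
        cors = [pearson(feats[name], outcome) for feats, outcome in studies]
        same_sign = all(c > 0 for c in cors) or all(c < 0 for c in cors)
        if same_sign and min(abs(c) for c in cors) >= threshold:
            selected.append(name)
    return selected

# Made-up toy data: geneA tracks the outcome in both studies, geneB
# flips direction between studies, and geneC is measured in only one.
study_1 = ({"geneA": [1.1, 2.0, 2.9, 4.2, 5.1],
            "geneB": [5.0, 4.0, 3.0, 2.0, 1.0]},
           [1.0, 2.0, 3.0, 4.0, 5.0])
study_2 = ({"geneA": [1.0, 2.1, 2.9, 4.0],
            "geneB": [1.0, 2.0, 3.0, 4.0],
            "geneC": [4.0, 3.0, 2.0, 1.0]},
           [2.0, 4.0, 6.0, 8.0])

selected = select_consistent_features([study_1, study_2])
print(selected)
```

A feature like geneB would look excellent within either study alone, which is exactly the single-study behavior the selection step is meant to guard against.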

Genetic and Lifestyle Factors for Complex Phenotypes
Chunyu Liu
Department of Biostatistics

The Liu lab develops statistical approaches and applies those methodologies to identify genetic and lifestyle factors that influence complex phenotypes. Two projects are available:

1) Mitochondrial DNA (mtDNA) sequencing project: Mitochondria are the powerhouses of human cells, and mtDNA is involved in the major pathway for energy production. Students will have the opportunity to use publicly available software to identify mutations in the mitochondrial genome (mtDNA) from whole genome sequencing data in humans. In addition, they will have the opportunity to perform association analyses of the mtDNA mutations with cardiovascular disease.

2) Gene expression and alcohol consumption project: Gene expression is the process by which information in a gene is used to generate messenger RNA (mRNA) for protein production. Students will have the opportunity to identify genes that are related to alcohol consumption. In addition, students will explore gene pathway analyses to identify gene networks that are related to alcohol consumption and cardiovascular disease.

Identifying Cell-Types Across Treatments in Single-cell RNA Sequencing Data
Cynthia Bradham
Department of Biology

Single-cell RNA sequencing technologies are extraordinarily powerful for dissecting cell compositions in heterogeneous cell mixtures in single biological conditions. However, complications arise when multiple biological conditions are introduced, such as disease status or drug treatment. We have developed a novel algorithm, ICAT, to more accurately identify shared, as well as distinct, cell-types between treatments in scRNAseq data. Potential BRITE students could look forward to contributing new features to ICAT, including identifying stably expressed genes between treatments, testing performance across different implementations, and creating a Seurat wrapper. Students would get first-hand experience using machine learning to parse large datasets, implementing high-performance Python code, and exposing Python packages to R using the reticulate package. Students will also make heavy use of the Linux command line and Git. This project will be perfect for students interested in algorithm development, Python and R programming, and machine learning, while working at the intersection of mathematics, computer science, and developmental biology.