REU Faculty Past Projects
2021
Gary Benson
Departments of Biology and Computer Science
The Benson lab develops algorithms and software for biological sequence comparison and repeat detection in genomic sequences. The focus is understanding the occurrence and functional effects of tandem repeats (TRs), and especially, those with variable copy number, also known as variable number of tandem repeats (VNTRs). The lab has developed an analysis tool, VNTRseek, to identify VNTRs, using high-throughput sequencing data, but it is limited to those TRs that fit within a sequencing read. This project will develop new algorithmic and statistical methods to permit detection of longer VNTR repeats and the use of longer read sequencing technologies. Additionally, an online database will be created to store and analyze the variant data. Students will gain knowledge in human genetic variability and DNA repeats, and skills in analyzing high-throughput sequencing data, algorithm design and testing, and database development.Genetic variation and linkage to phenotype
Ethan Deyle
Department of Biology
Many of the tools scientists use to quantitatively study the world were developed for engineered systems and laboratory experiments, where a single cause produces a single effect independent of other variables (“linear separability”). Natural systems, whether individual cells or entire ocean ecosystems, do not always follow these expectations. Instead, interactions are often state-dependent, where the action of a cause depends on the context around it (i.e. it depends on the state of other variables). This nonlinear state-dependence can interfere with the comfortable, correlative approaches to studying systems, but also presents rich opportunities. This project will center on applying nonlinear causal inference to identify interaction between scales of complexity in natural systems (e.g. single fish populations and ecosystem functioning or single cell expression and organism physiology). Options are available to focus on applied data study of neuronal gene expression, aquatic food-webs, or marine fishery management. It is also possible to focus entirely on numeric simulation data. Students will gain hands-on experience in data processing, non-parametric statistics, and time-series analysis using R or Python (based on preference). Previous coding experience is not a strict requirement but will affect the scope of the project.Quantifying cross-scale interaction in complex natural systems
Josee Dupuis
Department of Biostatistics
The Dupuis lab develops statistical approaches to identify specific genes or genetic variants that influence complex phenotypes through their associated quantitative traits, which are traits that can be measured numerically, such as height or blood pressure in humans, and seed size or oil content in plants. This project involves developing statistical analyses which combine genome wide association results with prior information from “omics” studies (gene variant functionality, gene expression, methylation, metabolomic data, and proteomic data) to determine regions with common or rare genetic variants that are potentially causally associated with traits of interest. Students will become familiar with genetic studies and software for genetic analysis, and will explore publicly available databases to assign putative function to sets of variants.Fine-mapping of genetic loci for quantitative traits
W. Evan Johnson
Departments of Biostatistics and Medicine
The Johnson lab studies the human microbiome, i.e., microbial communities which live in and on the human body and play a vital role in health and disease. This project involves the development of statistical tools and software for jointly analyzing microbial and host data from sequencing experiments, in order to determine community content, microbe-microbe interactions, and host-microbe relationships. Students will help develop tools and workflows to compile annotated libraries of genes and genomes, curate functional associations between genes and microbes in metabolism, and link microbial abundance to host gene/pathway expression and other outcomes.Profiling human microbial communities
Jennifer Bhatnagar
Department of Biology
The Bhatnagar lab studies soil microbiome variation in the context of changing environmental conditions. Soil microorganisms perform a variety of essential roles, including acting as plant symbionts, animal pathogens, and free-living decomposers that recycle nutrients and carbon through the biosphere. Yet, it is unclear that soil microorgamisms will persist in a changing world. This project will develop new bioinformatics tools to predict which soil microorganisms will endure and remain active over space and time. Microbial DNA sequence data will be collected from soils obtained through a national sampling initiative – the National Ecological Observatory Network (NEON). The data will be analyzed for gene clusters involving biochemical pathways affecting microbial ecology (e.g., the ability to serve as pathogens or symbionts) and used to develop statistical models that predict variance in microbial function based on location, time, temperature, precipitation, soil nutrient content, and plant biomass. Student will learn key steps in metagenome analysis and methods for data visualization.Ecological forecasting: Predicting changes in soil microorganisms
Michael Dietze
Department of Earth and Environment
The Dietze lab uses a combination of ecological theory, informatics, statistics, and cyberin-frastructure development to advance the field of predictive ecology, and iterative forecasting, in which new data are used to refine predictive models. Current application areas include: soil microbes, vegetation phenology, land carbon and water fluxes, aquatic productivity, and algal blooms. This project involves the development of computational forecasting workflows, including modules for expansion to new data repositories and forecasting data types, statistical model calibration and validation, and forecast visualization. Students will learn about ecological forecasting, high-performance and cloud computing, software containerization, real-time workflow automation, databases, and the statistics of iterative model-data assimilation.Near-term ecological forecasting
Joshua Campbell
Department of Computational Biomedicine
The Campbell lab focuses on developing computational methods for characterizing cellular heterogeneity in gene expression using single cell RNA sequencing. Tools include CELDA (CEllular Latent Dirichlet Allocation), which identifies hidden transcriptional states and cellular subpopulations in count-based, single-cell RNA-seq data, and DecontX, which estimates contamination by ambient RNA in single cell data. This project involves analyzing publicly available single cell datasets to test and develop new methods of single cell analysis. Students will learn about RNA sequencing for bulk tissue and single cell samples, and will help develop analysis pipelines and data visualizations in the R programming language with the R/shiny graphical user interface.Single cell transcriptomics
Sarah W. Davies
Department of Biology
The Davies lab studies how corals and their symbiotic algae maintain and lose symbiosis under varying environmental conditions. Corals meet the majority of their energy needs through life-long symbiotic relationships with single-celled algae. Loss of this relationship leads to coral bleaching and, eventually, colony death. Some corals have a facultative relationship with their algal symbionts, wherein both the host and symbiont can be cultured independently and manipulated in and out of symbiosis. This project involves analyzing gene expression, and in particular, orthologous gene covariance, in such facultative systems under baseline and stress conditions, to help elucidate maintenance and loss of symbiosis. It will use a holobiont (coral + algae) transcriptome developed in the Davies lab. Students will develop knowledge and skills related to ecology, evolution, and RNA-seq analysis. Since stress experiments are ongoing, the project may include an experimental component.Genes and pathways regulating symbiosis in corals
Karen Allen
Department of Chemistry
The Allen lab explores the relationship between protein structure and function using X-ray diffraction and enzyme kinetic studies. In bacteria, a principal mechanism for glycan (complex sugar molecule) assembly on the cytoplasmic face of cell membranes involves polyprenol phosphate (PrenP) phosphoglycosyl transferases (PGTs). PGTs catalyze transfer of a phosphosugar to a membrane-bound PrenP acceptor. Recently, the lab solved the X-ray crystallographic structure of the PGT PglC from Campylobacter concisus, showing that it contains a re-entrant membrane helix (RMH) that penetrates only one leaflet of the bilayer then re-emerges on the cytoplasmic face. This contradicts computational prediction that the RMH is a transmembrane helix. This project involves developing hidden Markov models (HMM) to predict these “misannotated” helices in other protein families using data from an in vivo cysteine labeling method to assess whether the N-terminus lies on the cytoplasm or periplasm side of the membrane. Students will gain exposure to protein chemistry, enzyme functional studies, chemoinformatic library analysis, sequence and structural alignment methods and HMM modeling techniques. Since the cysteine labeling studies are ongoing, this project may include an experimental component.Determining modes of protein-membrane interaction
Prasad Patil
Department of Biostatistics
I work with sets of datasets that measure the same outcome and overlapping sets of features in multiple patient cohorts. Generally, these are datasets within which patient survival (outcome) and gene expression measurements of 20,000+ genes (features) are recorded for ovarian or breast cancer patients. The goal is to train a prediction rule for risk of cancer progression or recurrence that performs well across datasets and generalizes well for new patients. Oftentimes, there are far more predictors than patients in each dataset, so feature filtration and selection prior to training a prediction rule is a necessary step. A prevailing question is how to ensure that we select features which predict well across studies and avoid features that only perform exceedingly well within a single study. Students will gain experience working with high-dimensional genomic datasets and feature selection and machine learning approaches implemented in the R programming language.Multi-Study Feature Selection
Chunyu Liu
Department of Biostatistics
The Liu lab develops statistical approaches and applies those methodologies to identify genetic and life style factors that influence complex phenotypes. Two projects are available: 1) Mitochondrial DNA (mtDNA) sequencing project: Mitochondria are power house in human cells. mtDNA is involves in the major pathway for power production. Students will have the opportunity to use publicly available software to identify mutations in the mitochondrial genome (mtDNA) from whole genome sequencing data in human. In addition, they will also have the opportunities to perform association analysis of the mtDNA mutations with cardiovascular disease. 2) Gene expression and alcohol consumption project: Gene expression is the process by which information in a gene is used to generate messenger RNA (mRNA) for protein production. Students will have the opportunity to identify genes that are related to alcohol consumption. In addition, students will explore gene pathway analyses to identify gene networks that are related to alcohol consumption and cardiovascular disease.Genetic and Life Style Factors for Complex Phenotypes
Cynthia Bradham
Department of Biology
Single-cell RNA sequencing technologies are extraordinarily powerful for dissecting cell compositions in heterogeneous cell mixtures in single biological conditions. However, complications arise when multiple biological conditions are introduced — such as disease status or drug treatment. We have developed a novel algorithm ICAT to more accurately identify shared, as well as distinct, cell-types between treatments in scRNAseq data. Potential BRITE students could look forward to contributing to new features in ICAT including identifying stably expressed genes between treatments, testing performance across different implementations, and creating a Seurat wrapper. Students would get first-hand experience using machine learning to parse large datasets, implementing high performance Python code, and exposing Python packages to R using the reticulate package. Students will also make heavy use of the linux command line and git. This project will be perfect for students interested in algorithm development, Python and R programming, and machine learning, while working at the very intersection of mathematics, computer science, and developmental biology.Identifying Cell-Types Across Treatments in Single-cell RNA Sequencing Data
2019
Joshua Campbell
Department of Medicine
Division of Computational Biomedicine
High-throughput genomic technologies are rapidly evolving including the areas of DNA and RNA sequencing. Novel types of complex data are being quickly generated and require novel methods for quality control and analysis. We are currently focused on developing and/or applying methods for identifying genomic alterations in cancer, quantifying the mutagenic effect of carcinogens, and characterizing cellular heterogeneity using single cell RNA sequencing. For example, we have developed the Celda framework (CEllular Latent Dirichlet Allocation), which can be used to identify hidden transcriptional states and cellular populations in count-based single-cell RNA-seq data. This project will include application of Celda to new and/or publicly available single cell datasets using the Single Cel ToolKit – an interactive R/shiny app.Single Cell Sequencing Analysis
Sarah Davies
Department of Biology
Our lab is interested in understanding the diversity and dynamics of coral-algal symbiosis. Unfortunately, this symbiotic relationship breaks down in response to thermal stressors associated with climate change in a phenomenon known as coral bleaching. The current project will utilize RNA-sequencing data to determine how gene expression responds to thermal stress in coral and their algal symbionts. This project will utilize various bioinformatic tools and statistical approaches to disentangle which genes and gene networks are impacted due to thermal stress. The student will ultimately help address questions as to how algal-coral symbiosis operates, and why this relationship breaks down under thermal stress, which is relevant now more than ever under a changing climate.Coral symbiosis and Climate Change
W. Evan Johnson
Departments of Medicine and Biostatistics
Microbial communities which live in and on the human body play a vital role in health and disease. The simultaneous study of microbial function and associated host response is key for understanding metabolism in healthy people and pathogenesis in diseased individuals. Recent innovations in sequencing technologies have enabled the profiling of microbial communities, their function, and the interrogation of host-pathogen interactions at a deeper resolution than ever before. However, due to the heterogeneity of the microbiome across individuals and the complexity of microbes’ interactions with each other and their hosts, these analyses require advanced computational and statistical techniques and current tools for this purpose are not sufficient. We are developing flexible and powerful statistical tools and software for jointly analyzing microbial and host data from microbiome experiments. These tools include workflows to compile annotated libraries of genes and genomes, the curation of functional associations between genes and microbes in metabolism and immune response, and linkage of microbial abundance to host gene/pathway expression and other outcomes. Our software package will also provide a user-friendly R/Shiny interface with interactive graphics and analysis tools, making the pipeline accessible to users without strong computational backgrounds. We are applying our methods in the context of multiple existing and prospective studies in obesity, kidney disease, lung cancer, and infectious diseases such as HIV and tuberculosis. These studies will result in a deliberate and focused effort to develop a greater understanding of the microbial communities the cohabitate human systems, including the community content, microbe-microbe interactions, and host-microbe relationships. Researchers on this project will aid in the development of statistical tools and software, and support the analysis of data from microbiome studies from multiple diseases.Tools and software for profiling microbial communities in multiple human diseases
Paola Sebastiani
Department of Biostatistics
Dr Sebastiani is the statistician of two studies of human longevity: the New England Centenarian Study 1 and the Long Life Family Study2. Both studies aim to discover the genetic and environmental factors that promote long and healthy lives. The two studies are complementary in terms of data structure (one is a population based study with more than 2,000 centenarians and one is a family based study with approximately 550 families demonstrating clustering for exceptional longevity). Both studies, funded by the National Institute on Aging, investigate genetic and environmental factors that affect aging and individual susceptibility to age-related diseases and disability. Key long term objectives of the studies are (1) Identify genetic risk factor as well as life styles that can effect aging and use this information to design interventions that reduce the burden of morbidity and mortality in older people.Biomarkers of Healthy Aging
(2) Translate the discoveries from genetic studies into risk prediction models that can help identify genetic signatures of healthy aging and their interaction with the environment.
(3) Identify biomarkers of aging that can be used as prognostic and diagnostic tools.
Jennifer Bhatnagar
Department of Biology
Soil microorganisms have a variety of functions in our natural ecosystems – from acting as plant symbionts to animal pathogens to free-living decomposers that break down dead material (like plant litter) and recycle nutrients and carbon through the biosphere. This project will develop new bioinformatics tools to answer the questions, “how well will our natural ecosystems function in the coming century and do we know enough about microorganisms in the environment to predict which ones will persist – and how active they will be – over space and time?” We will obtain microbial DNA sequence data collected from soils across a new national sampling initiative – the National Ecological Observatory Network (NEON). This data will be analyzed for biosynthetic clusters that code for specific biochemical pathways in microbial genomes that affect their ecology (e.g. ability to serve as pathogens or symbionts). The data will be used to develop a statistical model that partitions the variance in microbial function on orthogonal axes of space and time, as well as across environmental variables like climate (temperature, precipitation), soil nutrient content, and plant biomass.Predicting changes in the Earth microbiome; near-term ecological forecasting of critical soil microorganisms
Andrew Emili
Department of Biochemistry
Dr. Emili develops and applies advanced proteomic, molecular genetic, genomic and bioinformatic technologies to investigate the molecular associations and biological roles of the many varied macromolecules present in different cells, tissues and organisms. His group aims to generate comprehensive maps of the physical and functional “interactomes” of informative model organisms. These maps are expected to lead to breakthrough mechanistic understanding of how cells and tissues function at a fundamental molecular level, and serve as valuable resources for the broader research community. Ultimately, the aim is to translate this basic knowledge into novel diagnostic and therapeutic tools, with an emphasis on cancer, cardiovascular disease and neurodegenerative disorders. Students on this project will aid in the development of software and tools that support the analysis of interactome data.Biomolecular Interactomes
2018
Gary Benson
Departments of Biology and Computer Science
The Benson lab develops algorithms to detect genetic variation using next-generation sequencing data. Our main focus is human genetic variation in tandem repeat sequences, called variable number of tandem repeats (VNTRs) and large structural variations (SVs), including inversions, translocations, duplications, insertions and deletions. We use publicly available human sequencing data, especially high coverage, long read data, to detect and characterize variants. We are interested in where the variants occur, their frequency in populations, and how they may affect gene function or chromosome function. Besides detection tools, we also develop online databases to store and share information about the variants we detect. A variety of student projects related to this work are available. Algorithms for detecting genetic variation
Michael Dietze
Department Earth & Environment
The world is changing in many ways that impact ecosystems and the essential services they provide to human health and well-being. In the face of such change, it is imperative that scientists provide the best available information about likely impacts, and responses to decision alternatives, to help society both anticipate and adapt to change. Ecological forecasting is a transformative, interdisciplinary research area concerned with predicting the future states and distributions of ecosystems and their services to humans. This project involves an automated workflow for integrating multiple data streams into predictive models. Students will have to opportunity to contribute to the overall cyberinfrastructure (R packages, Docker+RabbitMQ stack, database) and to active forecast projects in: tick-borne disease, ticks, and small mammal hosts; harmful algal blooms; land surface CO2 and water fluxes; leaf phenology. Near-term forecasting of ecological processes: integrating multiple data streams and models
Kirill Korolev
Department of Physics
Humans like all other animals and plants are colonized by thousands of microbial species. This microbiome helps us digest food, trains our immune system, protects us from pathogens, but also plays an important role in disease. Detecting species responsible for diseases is a major goal in microbiome research. This project will study how the microbiome changes along the GI tract in controls and patients with different inflammatory diseases of the digestive system. One of the goals will be to develop prognostic and diagnostic biomarkers that can distinguish disease subtypes or guide the choice of treatment. Temporal and spatial variation of microbiota in inflammatory bowel disease
Paola Sebastiani
Department of Biostatistics
Dr Sebastiani is the statistician of two studies of human longevity: the New England Centenarian Study 1 and the Long Life Family Study2. Both studies aim to discover the genetic and environmental factors that promote long and healthy lives. The two studies are complementary in terms of data structure (one is a population based study with more than 2,000 centenarians and one is a family based study with approximately 550 families demonstrating clustering for exceptional longevity). Both studies, funded by the National Institute on Aging, investigate genetic and environmental factors that affect aging and individual susceptibility to age-related diseases and disability. Key long term objectives of the studies are Biomarkers of Healthy Aging
(1) Identify genetic risk factor as well as life styles that can effect aging and use this information to design interventions that reduce the burden of morbidity and mortality in older people.
(2) Translate the discoveries from genetic studies into risk prediction models that can help identify genetic signatures of healthy aging and their interaction with the environment.
(3) Identify biomarkers of aging that can be used as prognostic and diagnostic tools.
Jennifer Bhatnagar
Department of Biology
Soil microorganisms have a variety of functions in our natural ecosystems – from acting as plant symbionts to animal pathogens to free-living decomposers that break down dead material (like plant litter) and recycle nutrients and carbon through the biosphere. This project will develop new bioinformatics tools to answer the questions, “how well will our natural ecosystems function in the coming century and do we know enough about microorganisms in the environment to predict which ones will persist – and how active they will be – over space and time?” We will obtain microbial DNA sequence data collected from soils across a new national sampling initiative – the National Ecological Observatory Network (NEON). This data will be analyzed for biosynthetic clusters that code for specific biochemical pathways in microbial genomes that affect their ecology (e.g. ability to serve as pathogens or symbionts). The data will be used to develop a statistical model that partitions the variance in microbial function on orthogonal axes of space and time, as well as across environmental variables like climate (temperature, precipitation), soil nutrient content, and plant biomass. Predicting changes in the Earth microbiome; near-term ecological forecasting of critical soil microorganisms
Joe Zaia
Department of Biochemistry
The Zaia lab studies protein glycosylation, a mechanism whereby glycans (complex sugar molecules) are attached to proteins through a series of biosynthetic enzyme reactions. Because the reactions do not go to completion, the result is populations of mature protein molecules heterogeneous with respect to glycosylation and diverse with respect to biological activities. Many families of proteins contain domains (known as lectin domains) that bind to glycan substructures (known as epitopes, containing 1-5 monosaccharide units). Glycosylation of a protein strongly influences its binding to other proteins via lectin domains. The lab is studying the mechanisms whereby viruses alter glycosylation of surface glycoproteins in order to escape host immune response. We have generated large datasets measuring the expression of glycans and glycoproteins from virus evolution experiments, using liquid chromatography-mass spectrometry and tandem mass spectrometry. Glycan Abundance Imputation through Clustering and Machine Learning
REU students will design and implement computer programs to cluster glycan compositions, which they will then use as composite baselines for imputing missing glycopeptide abundances. They will perform and evaluate the imputation through machine learning algorithms; this will build off of published data and currently unpublished work within the lab. Students will learn how glycans modulate interactions with glycan binding proteins, and will gain experience with data structures, statistical evaluation, and scientific programming in the context of clustering algorithms and machine learning methods.
2017
The project involves the engineering of biological systems in a field called synthetic biology. A key challenge in synthetic biology is to control numerous environmental conditions and genetic network interactions. Microfluidics has the potential to address this challenge when used as the platform on which synthetic biological systems are explicitly specified, designed, built, and tested. Undergraduate researchers will work on algorithmic methods for designing microfluidic primitives by extracting analytical models from the datasets of mass-prototyped microfluidic chips developed in the CIDAR Lab. Data-Driven Methods for Automated Design of Lab on a Chip Devices
An integrated genomics information resources platform
Genome-wide association analyses have identified a large number of genetic variants associated with complex traits. However, most susceptibility variants discovered to date fall outside of gene regions, and identifying the true causal genetic variants remains a challenge. The goal of this project is to build an annotation pipeline that will integrate information from various public genomics databases to help prioritize functional studies on genetic variants found to be associated with one or several traits of interest.
Temporal and spatial variation of microbiota in inflammatory bowel disease
Humans like all other animals and plants are colonized by thousands of microbial species. This microbiome helps us digest food, trains our immune system, protects us from pathogens, but also plays an important role in disease. Detecting species responsible for diseases is a major goal in microbiome research. This project will study how the microbiome contributes to inflammatory bowel disease (IBD) and develop prognostic and diagnostic biomarkers for this disease. Unlike most previous studies, we will analyze not only bacterial, but also fungal and archaeal communities because we previously found strong microbial imbalance in the fungal as well as bacterial communities.
A structural map of the human genome
Humans like all other animals and plants are colonized by thousands of microbial species. This microbiome helps us digest food, trains our immune system, protects us from pathogens, but also plays an important role in disease. Detecting species responsible for diseases is a major goal in microbiome research. This project will study how the microbiome contributes to inflammatory bowel disease (IBD) and develop prognostic and diagnostic biomarkers for this disease. Unlike most previous studies, we will analyze not only bacterial, but also fungal and archaeal communities because we previously found strong microbial imbalance in the fungal as well as bacterial communities.
High-throughput differential gene expression database
Humans like all other animals and plants are colonized by thousands of microbial species. This microbiome helps us digest food, trains our immune system, protects us from pathogens, but also plays an important role in disease. Detecting species responsible for diseases is a major goal in microbiome research. This project will study how the microbiome contributes to inflammatory bowel disease (IBD) and develop prognostic and diagnostic biomarkers for this disease. Unlike most previous studies, we will analyze not only bacterial, but also fungal and archaeal communities because we previously found strong microbial imbalance in the fungal as well as bacterial communities.
2016
Synthetic ecology of microbes is a young, fast-developing research area concerned with the design, construction and understanding of engineered microbial consortia. The idea of designing microbial consortia is inspired by the ubiquitous presence of microbial communities on our planet, and the key role that these communities play in many aspects of human life, including biogeochemical cycles, animal and plant physiology, and metabolic engineering. Synthetic ecology may allow us to perform specific tasks by understanding and embracing – rather than avoiding – properties that seem often inherent in the natural microbial world, such as diversity, resilience, competition for resources, division of labor, and obligate interdependence. Moreover, an engineered community of organisms may perform tasks that no individual species could possibly perform on its own. In recent years, the Segrè lab has pioneered new computational methods for studying metabolic dynamics in natural and engineered microbial ecosystems, based on the knowledge of all metabolic reactions encoded in an organism’s genome. One of these approaches, an open source platform for the Computation of Microbial Ecosystems in Time and Space (COMETS), has been successfully tested on small synthetic communities. The specific project suggested for this summer program will involve simulating computationally and testing experimentally the interaction between different bacterial species on a Petri dish. In particular, based on available literature data on co-occurring species in natural microbial communities, and on predictions previously generated in the Segrè lab, the student will select two or more species to use in the experiments. COMETS simulations will predict the growth of the species on their own and in co-culture, including possible changes in colony morphology. The predictions will be compared with experimental measurements of colony growth on a Petri dish, using an established protocol for automated acquisition and analysis of images taken with a regular slide scanner connected to a computer and an Arduino microcontroller. Thus the student will have a chance to learn about genome-scale models of artificial microbial communities, and to test predictions using a simple but quantitative and instructive experimental setup. The project will have the potential to probe mechanistically putative inter-species microbial interactions, and can be easily extended to multiple organisms and conditions. Synthetic ecology of microbes
Bit-Parallel Sequence Alignment Algorithms for Tandem Repeat Detection
The REU project involves the development of new bit-parallel alignment algorithms and their application to characterize VNTR occurrence in whole human genome sequencing data. The student will help develop and implement a bit-parallel algorithm for tandem global alignment (used when detecting tandem repeats of unknown copy number) using SIMD instructions (single instruction, multiple data) and possibly parallel instructions designed for Graphical Processing Units (GPUs). The student will also help analyze VNTRseek results from long sequencing reads (~1000 bp per read) generated with Pacific Biosciences sequencing technology. These tasks will ultimately help in the development of an internet accessible public database for VNTRs. The student will gain knowledge in human genetic variability and DNA repeats, and skills in analyzing high-throughput sequencing data sets, algorithm design and testing, and parallel programming.
Work on copy number versus gene expression biomarkers in cancer
The use of copy number information in the analysis, separation, and prognostication for cancer subtypes is becoming a common approach in cancer biomarker analysis. However, the comparison of predictions arising from copy number biomarkers in tissue samples to those arising from gene expression information has had some difficulties. Among these is the fact that these two information subtypes are quite incompatible in their formats. Our group has developed a formatting procedure for copy number information that makes it similar in format to gene expression information. This will allow the importation of a large number of gene expression analysis tools to the study of copy number information, now in a parallel fashion. One of the goals of this project will be to analyze the application of both toolkits (gene expression and copy number) to subtyping and outcome prediction in cancer, in order to compare their effectiveness as well as find whether they synergize as predictive tools. A second goal will be to determine the methods’ usefulness in unsupervised learning. This involves the discovery of cancer subtypes from larger groups of cancer samples using clustering and related methods. The use of parallel data formats for both copy number and gene expression data may have some interesting implications for such subtyping.
This project will initially involve the student’s development of skills in implementing machine learning programs for prediction of subtypes and outcomes based on feature vectors involving genomic and or copy number information. Once the skills involving tools such as support vector machine and random forest have been developed, the student will be expected to apply these tools two predict outcomes of ovarian and other cancer classes based on the two information types. There will be a need for computational skills and for mathematical understanding of basic concepts. In addition to the application of machine learning to the prediction of cancer outcomes and identification of subtypes, the participant will be expected to attend the laboratory meetings of students/postdocs affiliated with the DeLisi group at BU, which is involved largely with uses of machine learning in computational biology.
HALOALKANOIC ACID DEHALOGENASE SUPERFAMILY (HADSF)
The HAD superfamily is a large enzyme family (~120,000 non-redundant sequences) of phosphotransferases (phosphomutases, ATPases and phosphatases) represented in all three kingdoms of life, and, within each cell, by numerous homologues (28 in E. coli; 35 in Salmonella typhimurium; 31 in Pseudomonas aeruginosa; 30 in Mycobacterium tuberculosis; 84 in Caenorhabditis elegans; 169 in Arabidopsis thaliana; and 183 in human). Approximately 80-90% of the HAD superfamily members are phosphatases and it is estimated that 40% of the bacterial metabolome is comprised of phosphorylated compounds. Although the HADSF fold is dominant among eukaryotic and prokaryotic phosphatases, it has yet to be truly exploited for inhibitor discovery. Such inhibitors would be invaluable to discovery and study of metabolic pathways involving phosphorylated metabolites. This is in contrast to the phosphotyrosine phosphatase family of phosphatases, for which great progress has been made in inhibitor design and focused library-screening-based discovery. To date there are just two reports of HADSF phosphatase inhibitor discovery. We aim to focus on targeting the region of the HAD proteins responsible for phosphoryl-group binding (contributed by the catalytic domain). This may have the added benefit of producing a moregeneralizable phospho-mimetic, one of the “holy grails” of phosphoryl transfer.
In order to identify such a mimetic, we will leverage a number of atomic resolution (~1 Å) structures of HAD members liganded to transition-state analogues. These enzymes invariably form a trigonal bipyramid with the phosphoryl group together with an apical ligand from the nucleophilic aspartate in the phosphatase. The REU student will utilize this data to make a template molecular model of an inhibitor scaffold defined by hydrogen bond donors and acceptors which ignore the phosphorus atom itself. This model scaffold will then be utilized to mine databases of known binding fragments and inhibitors. The student will also utilize chemi-informatics and protein mapping algorithms in order to analyze the chemical diversity of “hits” and the match to the biophysical properties of the corresponding binding sites. Ultimately, the compounds will be tested experimentally on a set of HAD phosphotranferases for inhibitory activity and successful compounds will be studied for binding mode by obtaining X-ray crystal structures of the complexes with the HAD enzymes. Through these studies, students will gain exposure to chemi-informatic library searching and analysis, in silico docking and structural analysis (with the possibility of experimental kinetics and structure analysis).
Living Computing Project (LCP)
The Living Computing Project (www.programmingbiology.org) investigates computing paradigms in living organisms. Specifically, it explores if digital, analog, memory, and communication concepts can be implemented in cellular environments. Understanding if quantitative approaches and standardized metrics can broadly be applied to these systems will help us develop the formalized mechanisms we can use to specify, design, and verify these systems. Solutions in medicine, materials, sensing, and manufacturing will be able to be more easily created, efficiently implemented, and broadly distributed if computing paradigms are found to be applicable.
The collection of 10 UROP students for the summer of 2016 will be aiding in this research. Specifically they will be involved in one of four efforts:
2016 Boston University Wet Lab iGEM Team – Four Students – These students will take basic DNA building blocks and assemble them into genetic circuits. These circuits will act as either digital or memory based computing elements. The students will then characterize these circuits to extract quantitative metrics related to their performance. This data will be archived physically and electronically along with the biological DNA information to begin to curate a library of computational components for the LCP. These components will be housed in the LCP Inventory of Composable Elements (ICE), the Synthetic Biology Open Language (SBOL) Stack, and BTSync (for flow cytometer data). This collection of information will be used to augment existing design software to predict the performance of future circuits and search for optimized designs. This project will require molecular biology skills and bioinformatics analysis. These students will be supervised by BME graduate student Divya Israni.
2016 Boston University Hardware iGEM Team – Three Students – These students will be creating a microfluidic design environment to automate the testing of genetic logic circuits. The fabrication, control, and data extraction for this platform will be automated with the use of software tools. The team will be creating a genetic system that interfaces with off the shelf sensors and hardware so that it can be quickly reconfigured for numerous environments and designs. It will consist of a set of input locations, intermediate locations, switch fabric, and output locations. This will allow for a generic device that is differentiated experiment by experiment. This project will involve embedded systems design, CNC milling, 3D printing, and software programming. These students will be supervised by ECE graduate student Ryan Silva.
Phagebook and CIDAR Software – Two Students – Synthetic biology software includes tools for specification, design, assembly, verification, and data management activities. CIDAR lab (www.cidarlab.org) has numerous software packages that need to be made either more robust, user friendly, or more widely tested. These include a design environment for functional specification and assembly of genetic circuits (Phoenix) as well as a social media platform for synthetic biology (Phagebook). This project will require web design, Java/Javascript, cloud computing, and database skills. These students will be supervised by ECE graduate student Prashant Vaidyanathan.
Living Computing Project Research Intern – One Student – This student will work on fundamental research questions related to models of computation in synthetic biology (e.g. state machines, data flow networks) and how they can be formalized and assigned to biological elements. This project will require computational interests, programming skills, and some computer science exposure. This student will be supervised by ECE Research Assistant Professor Dr. Swapnil Bhatia.
Understanding variation in microbial community composition in both space and time
New DNA sequencing technology has fundamentally transformed our understanding of microbial communities. We can now rapidly census the species composition of microbial communities in complex systems like soil, and relate them to the spatial and environmental factors that structure these communities. This technology has enabled us to test classic ecological theories about how microbial communities change across space (Fierer & Jackson, 2006). However, little work has characterized how microbial communities change over time (Shade et al., 2013). Filling this knowledge gap will increase our chances of accurately forecasting how microbial systems will respond to disturbance in the future. This is a critical need in Earth system science, because we are beginning to find that specific microbial species have unique activities in the cycling of elements and energy within the biosphere. Nevertheless, to date there is no work testing the relative importance of space vs. time in shaping microbial community composition and activity.
Approach and learning outcomes
We propose a meta-analysis approach to determining the time vs. space variation in microbial community composition. To do this, we will collect DNA sequence data from already identified publications that have resolution in both space and time. This sequence data comes from high-throughput sequencing platforms (e.g. 454-pyrosequencing, Illumina MiSeq runs) that generate Gb of sequence data for each sample set for a publication.
Once collected, this data will be analyzed for community composition using the QIIME bioinformatic pipeline (Caporaso et al., 2010) through the BU SCC. The community data will then be used to develop a statistical model that partitions the variance in community composition data on orthogonal axes of space and time.
We seek an REU student to collect DNA sequence data published online, work with the data through bioinformatics pipelines, and analyze the data using statistical software (R). The project will involve development of coding skills using Jupyter Notebooks and statistical training in analysis and visualization of microbial community composition data.
A successful project will result in training of:
1) Front-to-back analysis of large DNA sequence-based microbial community datasets;
2) Analysis and visualization of data using R software packages (e.g. ggplot2);
3) Communicating results in a presentation and draft of a manuscript by the end of the internship