CCGB: Abstracts for WWWGLS

Abstracts for Weekly Wednesday Wartik Genomics Lecture Series (WWWGLS)

Spring 2012

April 25: Brett McKinney (Univ. of Tulsa)

Epistasis network prioritization of genes for pathway enrichment in GWAS

Most pathway and gene set enrichment methods prioritize genes by their main effect; however, this prioritization does not account for variation due to interactions in the pathway. There is evidence that apparent missing heritability in GWAS may be discoverable by accounting for additive genetic variability and gene-gene interactions. In this study we aggregate gene-gene interaction information with main effect associations through an Evaporative Cooling machine learning filter and regression-based epistasis network analysis. We prioritize the importance of the genes in the epistasis network with the centrality algorithm SNPrank and apply pathway enrichment analysis to the rank list. We validate this approach in a two-stage (discovery/replication) analysis of GWAS of bipolar disorder. In the discovery stage, we apply the analysis strategy to the Wellcome Trust Case Control Consortium (WTCCC) GWAS of Bipolar Disorder (BD), and then we repeat the steps in the NIMH GWAS of BD. We replicate plausible pathways for BD that show consistent enrichment significance when genes are prioritized by their epistasis network centrality. These results provide evidence that numerous small interactions among common alleles may contribute to the diathesis for BD and demonstrate the importance of including information from the network of gene-gene interactions when prioritizing genes for pathway analysis for complex diseases.

April 18: Mark Shriver

Modeling 3D facial appearance in relation to sex, genetic ancestry, and individual genes

The genes determining normal-range variation in human faces are arguably some of the most intrinsically interesting and fastest evolving. However, so far, little work has been focused on discovering these genes. Working under the hypothesis that genes causing Mendelian craniofacial dysmorphologies also may be important in determining normal-range facial-feature variation, and that those genes associated with population differences in facial features should have experienced greater levels of evolution (change in allele frequency), we have taken an admixture mapping/selection scan approach to identifying and studying the genes directly affecting facial features. We have applied the methods of automated quasi-landmark analyses, partial least squares regression, and individual genomic ancestry estimates to explore the distribution of facial features across two groups of human populations -- West Africans and Europeans. Using three samples of admixed subjects (American, N=159; Brazilian, N=197; and Cape Verdean, N=248) we have modeled facial variation in the parental populations and examined how closely estimates of ancestry from the face match genomic-ancestry estimates. We also have tested six selection-nominated craniofacial candidate genes for functional effects on facial features using admixture mapping. In objective tests, two of these six genes (FGFR1 and TRPS1) show significant effects on facial features. In addition, human-observer ratings of the similarity between subjects and allele-specific facial morphs show the same effects for these two genes. Finally, exaggerated allele-specific morphs based on normal-range variation in these genes recapitulate the syndromic facies of the craniofacial dysmorphologies with which they are associated.

April 11: Mary Poss (CIDD)

Endogenous retroviruses and genomic plasticity

Retroviruses integrate into the genome of the infected cell as a normal part of the virus life cycle. Although there are no specific target sites for integration, regions close to actively transcribed genes are most accessible. The position of virus integration is a 'mark' carried by all subsequent progeny of the cell. If a retrovirus infects a germ cell -- not a typical target of virus infection -- the virus essentially becomes a new host gene (an endogenous retrovirus, or ERV) and is represented in all cells in the organism. The additional genomic material can have profound consequences for the host organism and has been implicated in a wide array of phenotypes, from speciation events to cancer. There is no polymorphism among ERV integration sites in humans because colonizations are ancient events in primates. Thus it is difficult to determine how ERVs affect evolutionary and functional genomics. We have recently discovered a new ERV in mule deer that is insertionally polymorphic and transcriptionally active in its host. This provides an exceptional opportunity -- and significant challenges -- to determine if and how ERVs affect the evolution of local genomic regions.

April 4: Kamesh Madduri

De novo assembly of short-read metagenomic data

I will present a new parallel method for the de novo assembly of large-scale metagenomic data sets on clusters of multicore systems. Our approach belongs to the family of Eulerian path-based methods for de novo genome assembly, and involves construction, traversal, and simplification of a large de Bruijn graph. I will discuss parallelization strategies and optimizations for various steps of the assembly process, focusing on techniques for scalable de Bruijn graph construction and simplification. We demonstrate the parallel efficiency and scalability of our assembler with the analysis of 192 Gbp of metagenomic DNA from microbes adherent to plant fiber, incubated in cow rumen. Comparisons to existing assemblers such as Velvet and ABySS indicate that our approach is orders of magnitude faster, and can generate contigs with similar accuracy.
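As a rough illustration of the de Bruijn graph construction and simplification mentioned above, the following is a minimal serial sketch in Python. It is not the speaker's assembler: the function names (`build_de_bruijn`, `extend_contig`), the toy reads, and the choice of k are all assumptions made for this example, and it ignores sequencing errors, reverse complements, and parallelism.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, edges are distinct k-mers."""
    edges = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer links its (k-1)-prefix to its (k-1)-suffix.
            edges[kmer[:-1]].add(kmer[1:])
    return edges


def in_degrees(edges):
    """Count incoming edges for every node that has at least one."""
    indeg = defaultdict(int)
    for successors in edges.values():
        for node in successors:
            indeg[node] += 1
    return indeg


def extend_contig(edges, start):
    """Walk the unbranched path from `start`, spelling out one contig.

    The walk stops at a branch (out-degree != 1), at a node with more
    than one incoming edge, or when it would revisit a node.
    """
    indeg = in_degrees(edges)
    contig, node, visited = start, start, {start}
    while len(edges.get(node, ())) == 1:
        (nxt,) = edges[node]
        if indeg[nxt] != 1 or nxt in visited:
            break
        contig += nxt[-1]  # each step adds one new base
        visited.add(nxt)
        node = nxt
    return contig


# Hypothetical overlapping reads drawn from the toy sequence ATGGCGTGCA.
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
graph = build_de_bruijn(reads, k=4)
print(extend_contig(graph, "ATG"))
```

Collapsing unbranched paths in this way is one common simplification step; a production assembler would also have to remove error-induced tips and bubbles and distribute the graph across nodes.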

March 28: PJ (George) Perry

Evolutionary and conservation genomic analyses of the aye-aye, a nocturnal lemur from Madagascar

The only surviving representative of the primate family Daubentoniidae is the aye-aye (Daubentonia madagascariensis), a nocturnal lemur with unusual, derived traits including an elongated, thin, highly flexible middle finger, a pair of relatively huge, continuously-growing incisors, and the largest relative brain size of any strepsirrhine (lemurs and lorises) primate. These features are likely adaptations that facilitate complex, extractive foraging strategies to obtain grubs from cavities gnawed in tree bark (there are no woodpeckers on Madagascar) and seeds of hard-shelled ramy nuts. Aye-ayes have the most extensive geographical distribution of any lemur on Madagascar, but they may be among the most susceptible to regional extinction as they have very large individual home range size requirements, low population densities, and the lowest nuclear genetic diversity of any primate yet studied. I will present results from a short series of aye-aye comparative and population genetic studies to illustrate how analyses of genomic-scale data can benefit conservation planning efforts and our understanding of the evolutionary biology and ecology of an endangered species.

March 21: Jason Moore (Dartmouth)

Systems genetics approaches to human disease susceptibility

The sequencing of the human genome has made it possible to identify an informative set of more than one million single nucleotide polymorphisms (SNPs) across the genome that can be used to carry out genome-wide association studies (GWAS). The availability of massive amounts of GWAS data has necessitated the development of new biostatistical methods for quality control, imputation, and analysis issues including multiple testing. This work has been successful and has enabled the discovery of new associations that have been replicated in multiple studies. However, it is now recognized that most SNPs discovered via GWAS have small effects on disease susceptibility and thus may not be suitable for improving healthcare through genetic testing. One likely explanation for the mixed results of GWAS is that the current biostatistical analysis paradigm is, by design, agnostic or unbiased in that it ignores all prior knowledge about disease pathobiology. Further, the linear modeling framework that is employed in GWAS often considers only one SNP at a time thus ignoring their genomic and environmental context. There is now a shift away from the biostatistical approach toward a more holistic machine learning approach that recognizes the complexity of the genotype-phenotype relationship that is characterized by significant heterogeneity and gene-gene and gene-environment interaction. We argue here that machine learning has an important role to play in addressing the complexity of the underlying genetic basis of common human diseases. We present here an approach that extends machine learning results to large networks of interacting SNPs thus revealing additional complexity in the genotype-phenotype map. These results suggest that a systems approach may be more fruitful for understanding the genetic basis of common human diseases.

March 14: Marilyn Roossinck (CIDD)

Understanding plant virus ecology through biodiversity studies

We have been doing "ecogenomics" studies of plant viruses in the US and Costa Rica, and have collected about 12,000 plant samples to date. Of these, over 8,000 have been enriched for viruses and sequenced by 454, but we are still looking for better tools to analyze this vast amount of data. I will discuss the studies, study sites, what we know so far, and the bioinformatics challenges that are still to be worked out.

February 29: Will Bush (Vanderbilt)

Linking genetic variants to the universe of biological data

After many years of effort, the field of human genetics has established a near-complete catalog of commonly occurring single nucleotide polymorphisms (SNPs), and has begun a major initiative to identify more rarely occurring variation as well. Furthermore, genome-wide association studies (GWAS) have discovered thousands of new relationships between SNPs and human phenotypes. The vast majority of these findings, however, have small effect sizes and do not point to an obvious biological mechanism. The problem of this missing biology dramatically reduces the ability to translate GWAS discoveries into clinical therapies. Hundreds of biological databases have been created over the last ten years documenting various attributes of proteins, genes, or genetic variants, yet very few of these resources relate these attributes to genetic variation. Also, because resources are fractured across multiple databases, flat files, and supplemental publication data, it is difficult to examine relationships among genomic attributes without considerable effort. In this presentation, I will provide an overview of techniques for exploring the biological function of SNPs, and discuss new ways to incorporate SNP level data into the analysis and annotation of GWAS data.

February 22: Ross Hardison

Genomics of gene regulation: case studies from mammalian hematopoiesis

The folks in my laboratory and those of our many collaborators conduct experiments on several aspects of the genomics of gene regulation, almost always in the context of hematopoiesis (most frequently erythropoiesis). We search for cis-regulatory modules using a variety of techniques, ranging from genome-wide determination of epigenetic marks (histone modifications, transcription factor occupancy, DNase sensitivity) to statistical modeling of patterns in the evolution of regulatory modules. The biochemical and computational approaches lead to predictions about regulatory regions that we test in the laboratory. I plan to give a survey of the major topics under investigation, and leave plenty of material for the folks in the lab to cover in their own talks. Here are two vignettes that are examples of what we are doing.

(1) Dynamics of epigenetic landscapes during hematopoietic commitment and differentiation

Interplays among lineage-specific nuclear proteins, chromatin modifying enzymes, and the basal transcription machinery establish the epigenetic landscape and govern cellular differentiation. However, the dynamics of changes in the epigenetic landscape and how those dynamics affect transcriptional control are not fully understood. To determine the predominant roles of chromatin states and factor occupancy in directing gene regulation during differentiation, we mapped chromatin accessibility, histone modifications, and nuclear factor occupancy genome-wide during mouse erythroid differentiation dependent on the master regulatory transcription factor GATA1. Notably, despite extensive changes in gene expression, the chromatin state profiles and accessibility remain largely unchanged during GATA1-induced erythroid differentiation. In contrast, gene induction and repression are strongly associated with changes in patterns of transcription factor occupancy. Our results indicate that during erythroid differentiation, the broad features of chromatin states are established at the stage of lineage commitment, largely independently of GATA1. These determine permissiveness for expression, with subsequent induction or repression mediated by distinctive combinations of transcription factors. Current studies extend these analyses to the sister cell lineage, megakaryocytes, earlier bipotential progenitors, and prepro-B-lymphocytes.

Wu W, et al. (2011) Dynamics of the epigenetic landscape during erythroid differentiation after GATA1 restoration. Genome Res. 21:1659-1671.

(2) Epigenetic landscapes and the prediction of functional genetic variants

Key aspects of the epigenetic landscape can be used to generate productive hypotheses about which phenotype-associated SNPs (e.g. from GWAS) are likely to be functional. Variation in regulation of gene expression is likely to be a major contributor to the genetic component of complex traits, such as disease susceptibility. We show that a substantial fraction of the lead SNPs in the GWAS catalog overlap with likely regulatory regions (using ENCODE data). When the epigenetic marks around all SNPs in linkage disequilibrium with the lead SNPs are interrogated, a signal consistent with variation in gene regulation is found for a majority of the complex phenotypes in the GWAS catalog.

ENCODE Project Consortium (2011) A user's guide to the encyclopedia of DNA elements (ENCODE). PLoS Biol. 9:e1001046.
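The overlap test described in vignette (2) can be sketched as a simple interval lookup. The Python example below is only an illustration under stated assumptions: the function name `fraction_overlapping`, the coordinates, and the half-open interval convention are hypothetical, the regions are assumed to be sorted and non-overlapping on a single chromosome, and nothing here is taken from the analysis presented in the talk.

```python
from bisect import bisect_right


def fraction_overlapping(snp_positions, regions):
    """Return the fraction of SNP positions that fall inside any region.

    `regions` is a list of half-open (start, end) intervals on one
    chromosome, assumed sorted by start and non-overlapping.
    """
    if not snp_positions:
        return 0.0
    starts = [start for start, _ in regions]
    hits = 0
    for pos in snp_positions:
        # Find the last region whose start is <= pos, if any.
        i = bisect_right(starts, pos) - 1
        if i >= 0 and pos < regions[i][1]:
            hits += 1
    return hits / len(snp_positions)


# Made-up coordinates for illustration only.
candidate_regions = [(100, 200), (500, 650)]
lead_snps = [50, 150, 199, 200, 600]
print(fraction_overlapping(lead_snps, candidate_regions))
```

Extending the same lookup from lead SNPs to all SNPs in linkage disequilibrium with them would only change the list of positions passed in; judging whether the observed fraction is larger than expected would additionally need a matched background set.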

February 15: Francesca Chiaromonte (joint work from the Chiaromonte and Makova labs)

Statistical characterizations of genome dynamics

Many studies of sequenced genomes have shown that the rates of different types of mutations tend to vary and co-vary along the nuclear DNA. We investigate statistically this variation structure and its association with features of the local genomic landscape. To do so, we use a number of multivariate techniques on data from primate alignments. In the past, these included regression, linear and non-linear Principal Components, and linear and non-linear Canonical Correlations. More recently (work in progress) we have been using multivariate Hidden Markov Models to segment the genome based on mutation behavior -- with promising results.

February 8: Webb Miller, Belinda Giardine, Cathy Riemer

SNP tools on Galaxy

Several labs in the CCGB have begun a project to produce computational tools to support genomics investigations at the College of Medicine (Hershey). We anticipate that efforts will be organized around various kinds of data, e.g., genome sequences, exome sequences, protein-binding data, GWAS data, SNP array data, etc. In each case, the goals are to (1) write and/or borrow tools that meet Hershey's data-analysis needs and put them on Galaxy, and (2) document those tools. The first part of the presentation will describe a new suite of tools for analyzing SNPs (single-nucleotide polymorphisms) determined by low-coverage sequencing of multiple individuals. These tools, including principal-components analysis and searches for positive selection, will be applied to sequences with roughly 7-fold average coverage of 13 human genomes. The second part will focus on some of the existing tools available on Galaxy that are useful for analyzing human variation. We trace step-by-step through an example illustrating several methods for examining a single full-coverage genome to look for SNPs that are known to be associated with disease.

February 1: Marylyn Ritchie

Meta-dimensional analysis of phenotypes to dissect the architecture of complex traits

The efforts of the human genome project are beginning to provide important findings for human health. Technological advances in the laboratory, particularly in characterizing human genomic variation, have created new approaches for studying the human genome. However, current statistical and computational strategies are taking only partial advantage of this wealth of information. In the quest for disease susceptibility genes for common, complex disease, we are faced with many challenges. Selecting genetic, clinical, and environmental factors important for the trait of interest is increasingly difficult as high throughput data generation technologies are developed. We know that genes do not act in isolation, thus numerous other factors are likely important in complex disease phenotypes. Ultimately, we want to know what factors are important to provide superior prevention, diagnosis, and treatment of human disease. Unfortunately, interpretation of statistical models in a meaningful way for biomedical research has been lacking due to the inherent difficulty in making such connections. Thus, a technology that embraces the complexity of human disease and integrates multiple data sources, including biological knowledge from the public domain, through a powerful analytical framework is essential for dissecting the architecture of common diseases. ATHENA, the Analysis Tool for Heritable and Environmental Network Associations, is a novel framework that incorporates variable selection, modeling, and interpretation to learn more about diseases of public health interest. As the field gains experience in analyzing large scale genomic data, it is crucial that we learn from each other and develop and utilize the best strategies.