The human genome encodes the blueprint of life, but the function of the vast majority of its nearly three billion bases is unknown. The Encyclopedia of DNA Elements (ENCODE) project has systematically mapped regions of transcription, transcription factor association, chromatin structure and histone modification. These data enabled us to assign biochemical functions for 80% of the genome, in particular outside of the well-studied protein-coding regions. Many discovered candidate regulatory elements are physically associated with one another and with expressed genes, providing new insights into the mechanisms of gene regulation. The newly identified elements also show a statistical correspondence to sequence variants linked to human disease, and can thereby guide interpretation of this variation. Overall, the project provides new insights into the organization and regulation of our genes and genome, and is an expansive resource of functional annotations for biomedical research.
BACKGROUND: The binding of transcription factors to specific locations in the genome is integral to the orchestration of transcriptional regulation in cells. To characterize transcription factor binding site function on a large scale, we predicted and mutagenized 455 binding sites in human promoters. We carried out functional tests on these sites in four different immortalized human cell lines using transient transfections with a luciferase reporter assay, primarily for the transcription factors CTCF, GABP, GATA2, E2F, STAT, and YY1.
RESULTS: In each cell line, between 36% and 49% of binding sites made a functional contribution to the promoter activity; the overall rate for observing function in any of the cell lines was 70%. Transcription factor binding resulted in transcriptional repression in more than a third of functional sites. When compared with predicted binding sites whose function was not experimentally verified, the functional binding sites had higher conservation and were located closer to transcriptional start sites (TSSs). Among functional sites, repressive sites tended to be located further from TSSs than were activating sites. Our data provide significant insight into the functional characteristics of YY1 binding sites, most notably the detection of distinct activating and repressing classes of YY1 binding sites. Repressing sites were located closer to, and often overlapped with, translational start sites and presented a distinctive variation on the canonical YY1 binding motif.
CONCLUSIONS: The genomic properties that we found to associate with functional TF binding sites on promoters -- conservation, TSS proximity, motifs and their variations -- point the way to improved accuracy in future TFBS predictions.
Sequence features and chromatin structure around the genomic regions bound by 119 human transcription factors
Chromatin immunoprecipitation coupled with high-throughput sequencing (ChIP-seq) has become the dominant technique for mapping transcription factor (TF) binding regions genome-wide. We performed an integrative analysis centered around 457 ChIP-seq data sets on 119 human TFs generated by the ENCODE Consortium. We identified highly enriched sequence motifs in most data sets, revealing new motifs and validating known ones. The motif sites (TF binding sites) are highly conserved evolutionarily and show distinct footprints upon DNase I digestion. We frequently detected secondary motifs in addition to the canonical motifs of the TFs, indicating tethered binding and cobinding between multiple TFs. We observed significant position and orientation preferences between many cobinding TFs. Genes specifically expressed in a cell line are often associated with a greater occurrence of nearby TF binding in that cell line. We observed cell-line-specific secondary motifs that mediate the binding of the histone deacetylase HDAC2 and the enhancer-binding protein EP300. TF binding sites are located in GC-rich, nucleosome-depleted, and DNase I sensitive regions, flanked by well-positioned nucleosomes, and many of these features show cell type specificity. The GC-richness may be beneficial for regulating TF binding because, when unoccupied by a TF, these regions are occupied by nucleosomes in vivo. We present the results of our analysis in a TF-centric web repository Factorbook (http://factorbook.org) and will continually update this repository as more ENCODE data are generated.
Community-wide evaluation of methods for predicting the effect of mutations on protein-protein interactions
Community-wide blind prediction experiments such as CAPRI and CASP provide an objective measure of the current state of predictive methodology. Here we describe a community-wide assessment of methods to predict the effects of mutations on protein-protein interactions. Twenty-two groups predicted the effects of comprehensive saturation mutagenesis for two designed influenza hemagglutinin binders and the results were compared with experimental yeast display enrichment data obtained using deep sequencing. The most successful methods explicitly considered the effects of mutation on monomer stability in addition to binding affinity, carried out explicit side-chain sampling and backbone relaxation, evaluated packing, electrostatic, and solvation effects, and correctly identified around a third of the beneficial mutations. Much room for improvement remains for even the best techniques, and large-scale fitness landscapes should continue to provide an excellent test bed for continued evaluation of both existing and new prediction methodologies.
We compared the performance of template-free (docking) and template-based methods for the prediction of protein-protein complex structures. We found similar performance for a template-based method based on threading (COTH) and another template-based method based on structural alignment (PRISM). The template-based methods showed similar performance to a docking method (ZDOCK) when the latter was allowed one prediction for each complex, but when the same number of predictions was allowed for each method, the docking approach outperformed template-based approaches. We identified strengths and weaknesses in each method. Template-based approaches were better able to handle complexes that involved conformational changes upon binding. Furthermore, the threading-based and docking methods were better than the structural-alignment-based method for enzyme-inhibitor complex prediction. Finally, we show that the near-native (correct) predictions were generally not shared by the various approaches, suggesting that integrating their results could be the superior strategy.
We report the first assessment of blind predictions of water positions at protein-protein interfaces, performed as part of the critical assessment of predicted interactions (CAPRI) community-wide experiment. Groups submitting docking predictions for the complex of the DNase domain of colicin E2 and Im2 immunity protein (CAPRI Target 47) were invited to predict the positions of interfacial water molecules using the method of their choice. The predictions (20 groups submitted a total of 195 models) were assessed by measuring the recall fraction of water-mediated protein contacts. Of the 176 high- or medium-quality docking models (a very good docking performance per se), only 44% had a recall fraction above 0.3, and a mere 6% above 0.5. The actual water positions were in general predicted to an accuracy level no better than 1.5 Å, and even in good models about half of the contacts represented false positives. This notwithstanding, three hotspot interface water positions were quite well predicted, and so was one of the water positions that is believed to stabilize the loop that confers specificity in these complexes. Overall, the best interface water predictions were achieved by groups that also produced high-quality docking models, indicating that accurate modelling of the protein portion is a determinant factor. The use of established molecular mechanics force fields, coupled to sampling and optimization procedures, also seemed to confer an advantage. Insights gained from this analysis should help improve the prediction of protein-water interactions and their role in stabilizing protein complexes.
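The recall fraction used in this assessment can be illustrated with a minimal sketch. It assumes that a water-mediated contact is represented as a pair of residue labels (one from each protein) bridged by a water molecule. The function names, the residue labels and the example contacts are hypothetical and are not taken from the CAPRI evaluation code.

```python
def recall_fraction(predicted, reference):
    """Fraction of reference water-mediated contacts recovered by a prediction."""
    reference = set(reference)
    if not reference:
        raise ValueError("reference contact set is empty")
    return len(set(predicted) & reference) / len(reference)


def false_positive_fraction(predicted, reference):
    """Fraction of predicted contacts that are absent from the reference."""
    predicted = set(predicted)
    if not predicted:
        return 0.0
    return len(predicted - set(reference)) / len(predicted)


# Hypothetical contacts: (receptor residue, ligand residue) bridged by a water.
reference = {("R:54", "L:33"), ("R:75", "L:38"), ("R:86", "L:55"), ("R:98", "L:56")}
predicted = {("R:54", "L:33"), ("R:75", "L:38"), ("R:12", "L:40"), ("R:20", "L:41")}

recall = recall_fraction(predicted, reference)
fp = false_positive_fraction(predicted, reference)
print(f"recall = {recall:.2f}, false-positive fraction = {fp:.2f}")
```

In this toy case two of the four reference contacts are recovered and two of the four predicted contacts are spurious, which mirrors the situation described above in which even good models contain about half false positives.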
Conformational entropy is an important component of protein-protein interactions; however, there is no reliable method for computing this parameter. We have developed a statistical measure of residual backbone entropy in folded proteins by using the φ-ψ distributions of the 20 amino acids in common secondary structures. The backbone entropy patterns of amino acids within helix, sheet or coil form clusters that recapitulate the branching and hydrogen bonding properties of the side chains in the secondary structure type. The same types of residues in coil and sheet have identical backbone entropies, while helix residues have much smaller conformational entropies. We estimated the backbone entropy change for immunoglobulin complementarity-determining regions (CDRs) from the crystal structures of 34 low-affinity T-cell receptors and 40 high-affinity Fabs as a result of the formation of protein complexes. Surprisingly, we discovered that the computed backbone entropy loss of only the CDR3, but not all CDRs, correlated significantly with the kinetic and affinity constants of the 74 selected complexes. Consequently, we propose a simple algorithm to introduce proline mutations that restrict the conformational flexibility of CDRs and enhance the kinetics and affinity of immunoglobulin interactions. Combining the proline mutations with rationally designed mutants from a previous study led to a 2400-fold increase in the affinity of the A6 T-cell receptor for Tax-HLAA2. However, this mutational scheme failed to induce significant binding changes in the already-high-affinity C225-Fab/huEGFR interface. Our results will serve as a roadmap to formulate more effective target functions to design immune complexes with improved biological functions.
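As an illustration of how an entropy can be derived from a φ-ψ distribution, the sketch below computes the Shannon entropy of binned backbone dihedral angles. This is a generic estimator chosen for illustration. The abstract does not specify the exact statistical measure used in the study, and the 30-degree bin width, the function name and the example angles are all assumptions.

```python
import math
from collections import Counter


def backbone_entropy(phi_psi, bin_deg=30):
    """Shannon entropy (in units of k_B) of a binned phi-psi distribution.

    phi_psi is a list of (phi, psi) angles in degrees within [-180, 180].
    """
    if not phi_psi:
        raise ValueError("no phi-psi observations supplied")
    # Assign each observation to a 2D bin; the modulo wraps +180 onto -180.
    counts = Counter(
        (int(((phi + 180) % 360) // bin_deg), int(((psi + 180) % 360) // bin_deg))
        for phi, psi in phi_psi
    )
    n = len(phi_psi)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


# Hypothetical observations: a tight helical cluster versus scattered coil angles.
helix_like = [(-65, -40), (-63, -42), (-70, -45), (-62, -41)]
coil_like = [(-60, -45), (-120, 130), (60, 45), (-80, 170)]

s_helix = backbone_entropy(helix_like)
s_coil = backbone_entropy(coil_like)
print(f"helix-like: {s_helix:.3f} k_B, coil-like: {s_coil:.3f} k_B")
```

With these toy inputs the helical angles fall in a single bin and yield zero entropy, whereas the four scattered angles occupy four bins and yield ln 4. This is qualitatively consistent with the smaller conformational entropies reported for helix residues, although real estimates require large structural samples.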
Excision of introns from pre-mRNAs is mediated by the spliceosome, a multi-megadalton complex consisting of U1, U2, U4/U6, and U5 snRNPs plus scores of associated proteins. Spliceosome assembly and disassembly are highly dynamic processes involving multiple stable intermediates. In this study, we utilized a split TAP-tag approach for large-scale purification of an abundant endogenous U2.U5.U6 complex from Schizosaccharomyces pombe. RNAseq revealed this complex to largely contain excised introns, indicating that it is primarily ILS (intron lariat spliceosome) complexes. These endogenous ILS complexes are remarkably resistant to both high-salt and nuclease digestion. Mass spectrometry analysis identified 68, 45, and 43 proteins in low-salt-, high-salt-, and micrococcal nuclease-treated preps, respectively. The protein content of an S. pombe ILS complex strongly resembles that previously reported for human spliced product (P) and Saccharomyces cerevisiae ILS complexes assembled on single pre-mRNAs in vitro. However, the ATP-dependent RNA helicase Brr2 was either substoichiometric in low-salt preps or completely absent from high-salt and MNase preps. Because Brr2 facilitates spliceosome disassembly, its relative absence may explain why the ILS complex accumulates in logarithmically growing cultures and why S. pombe extracts are unable to support in vitro splicing.
SUMMARY: Protein-protein interactions are essential to cellular and immune function, and in many cases, because of the absence of an experimentally determined structure of the complex, these interactions must be modeled to obtain an understanding of their molecular basis. We present a user-friendly protein docking server, based on the rigid-body docking programs ZDOCK and M-ZDOCK, to predict structures of protein-protein complexes and symmetric multimers. With a goal of providing an accessible and intuitive interface, we provide options for users to guide the scoring and the selection of output models, in addition to dynamic visualization of input structures and output docking models. This server enables the research community to easily and quickly produce structural models of protein-protein complexes and symmetric multimers for their own analysis.
AVAILABILITY: The ZDOCK server is freely available to all academic and non-profit users at: http://zdock.umassmed.edu. No registration is required.
Insertions and excisions of transposable elements (TEs) affect both the stability and variability of the genome. Studying the dynamics of transposition at the population level can provide crucial insights into the processes and mechanisms of genome evolution. Pooling genomic materials from multiple individuals followed by high-throughput sequencing is an efficient way of characterizing genomic polymorphisms in a population. Here we describe a novel method named TEMP, specifically designed to detect TE movements present with a wide range of frequencies in a population. By combining the information provided by paired-end reads and split reads, TEMP is able to identify both the presence and absence of TE insertions in genomic DNA sequences derived from heterogeneous samples, accurately estimate the frequencies of transposition events in the population, and pinpoint junctions of high-frequency transposition events at nucleotide resolution. Simulation data indicate that TEMP outperforms other algorithms such as PoPoolationTE, RetroSeq, VariationHunter and GASVPro. TEMP also performs well on whole-genome human data derived from the 1000 Genomes Project. We applied TEMP to characterize the TE frequencies in a wild Drosophila melanogaster population and study the inheritance patterns of TEs during hybrid dysgenesis. We also identified sequence signatures of TE insertion and possible molecular effects of TE movements, such as altered gene expression and piRNA production.
TEMP is freely available on GitHub: https://github.com/JialiUMassWengLab/TEMP.git.
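The idea of estimating the population frequency of a TE insertion from pooled reads can be sketched as follows. This is a simplified illustration rather than TEMP's actual estimator, which the abstract does not describe: it assumes the frequency is the share of informative reads at a site that support the insertion, as opposed to reads that span the intact reference sequence. The function name, site names and read counts are hypothetical.

```python
def insertion_frequency(support_reads, reference_reads):
    """Estimate insertion frequency as supporting reads over all informative reads.

    support_reads: reads (paired-end or split) that support the TE insertion.
    reference_reads: reads that span the site without any insertion.
    """
    if support_reads < 0 or reference_reads < 0:
        raise ValueError("read counts must be non-negative")
    total = support_reads + reference_reads
    if total == 0:
        raise ValueError("no informative reads at this site")
    return support_reads / total


# Hypothetical read counts per site: (supporting, reference-spanning).
sites = {"site_A": (18, 2), "site_B": (3, 27)}
freqs = {name: insertion_frequency(s, r) for name, (s, r) in sites.items()}
for name, freq in freqs.items():
    print(f"{name}: estimated insertion frequency = {freq:.2f}")
```

Under this assumption, site_A would be a high-frequency insertion (0.90) and site_B a low-frequency one (0.10). A real pipeline would also need to handle mapping ambiguity and coverage biases.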
Crystal structure of Streptococcus pyogenes EndoS, an immunomodulatory endoglycosidase specific for human IgG antibodies
To evade host immune mechanisms, many bacteria secrete immunomodulatory enzymes. Streptococcus pyogenes, one of the most common human pathogens, secretes a large endoglycosidase, EndoS, which removes carbohydrates in a highly specific manner from IgG antibodies. This modification renders antibodies incapable of eliciting host effector functions through either complement or Fc gamma receptors, providing the bacteria with a survival advantage. On account of this antibody-specific modifying activity, EndoS is being developed as a promising injectable therapeutic for autoimmune diseases that rely on autoantibodies. Additionally, EndoS is a key enzyme used in the chemoenzymatic synthesis of homogenously glycosylated antibodies with tailored Fc gamma receptor-mediated effector functions. Despite the tremendous utility of this enzyme, the molecular basis of EndoS specificity for, and processing of, IgG antibodies has remained poorly understood. Here, we report the X-ray crystal structure of EndoS and provide a model of its encounter complex with its substrate, the IgG1 Fc domain. We show that EndoS is composed of five distinct protein domains, including glycosidase, leucine-rich repeat, hybrid Ig, carbohydrate binding module, and three-helix bundle domains, arranged in a distinctive V-shaped conformation. Our data suggest that the substrate enters the concave interior of the enzyme structure, is held in place by the carbohydrate binding module, and that concerted conformational changes in both enzyme and substrate are required for subsequent antibody deglycosylation. The EndoS structure presented here provides a framework from which novel endoglycosidases could be engineered for additional clinical and biotechnological applications.
piRNAs guide an adaptive genome defense system that silences transposons during germline development. The Drosophila HP1 homolog Rhino is required for germline piRNA production. We show that Rhino binds specifically to the heterochromatic clusters that produce piRNA precursors, and that binding directly correlates with piRNA production. Rhino colocalizes to germline nuclear foci with Rai1/DXO-related protein Cuff and the DEAD box protein UAP56, which are also required for germline piRNA production. RNA sequencing indicates that most cluster transcripts are not spliced and that rhino, cuff, and uap56 mutations increase expression of spliced cluster transcripts over 100-fold. LacI::Rhino fusion protein binding suppresses splicing of a reporter transgene and is sufficient to trigger piRNA production from a trans combination of sense and antisense reporters. We therefore propose that Rhino anchors a nuclear complex that suppresses cluster transcript splicing and speculate that stalled splicing differentiates piRNA precursors from mRNAs.
The growing list of mutations implicated in monogenic disorders of the developing brain includes at least seven genes (ARX, CUL4B, KDM5A, KDM5C, KMT2A, KMT2C, KMT2D) with loss-of-function mutations affecting proper regulation of histone H3 lysine 4 methylation, a chromatin mark which on a genome-wide scale is broadly associated with active gene expression, with its mono-, di- and trimethylated forms differentially enriched at promoter and enhancer and other regulatory sequences. In addition to these rare genetic syndromes, dysregulated H3K4 methylation could also play a role in the pathophysiology of some cases diagnosed with autism or schizophrenia, two conditions which on a genome-wide scale are associated with H3K4 methylation changes at hundreds of loci in a subject-specific manner. Importantly, the reported alterations for some of the diseased brain specimens included a widespread broadening of H3K4 methylation profiles at gene promoters, a process that could be regulated by the UpSET(KMT2E/MLL5)-histone deacetylase complex. Furthermore, preclinical studies identified maternal immune activation, parental care and monoaminergic drugs as environmental determinants for brain-specific H3K4 methylation. These novel insights into the epigenetic risk architectures of neurodevelopmental disease will be highly relevant for efforts aimed at improved prevention and treatment of autism and psychosis spectrum disorders.
To broaden our understanding of the evolution of gene regulation mechanisms, we generated occupancy profiles for 34 orthologous transcription factors (TFs) in human-mouse erythroid progenitor, lymphoblast and embryonic stem-cell lines. By combining the genome-wide transcription factor occupancy repertoires, associated epigenetic signals, and co-association patterns, here we deduce several evolutionary principles of gene regulatory features operating since the mouse and human lineages diverged. The genomic distribution profiles, primary binding motifs, chromatin states, and DNA methylation preferences are well conserved for TF-occupied sequences. However, the extent to which orthologous DNA segments are bound by orthologous TFs varies both among TFs and with genomic location: binding at promoters is more highly conserved than binding at distal elements. Notably, occupancy-conserved TF-occupied sequences tend to be pleiotropic; they function in several tissues and also co-associate with many TFs. Single nucleotide variants at sites with potential regulatory functions are enriched in occupancy-conserved TF-occupied sequences.
piPipes: a set of pipelines for piRNA and transposon analysis via small RNA-seq, RNA-seq, degradome- and CAGE-seq, ChIP-seq and genomic DNA sequencing
MOTIVATION: PIWI-interacting RNAs (piRNAs), 23-36 nt small silencing RNAs, repress transposon expression in the metazoan germ line, thereby protecting the genome. Although high-throughput sequencing has made it possible to examine the genome and transcriptome at unprecedented resolution, extracting useful information from gigabytes of sequencing data still requires substantial computational skills. Additionally, researchers may analyze and interpret the same data differently, generating results that are difficult to reconcile. To address these issues, we developed a coordinated set of pipelines, 'piPipes', to analyze piRNA and transposon-derived RNAs from a variety of high-throughput sequencing libraries, including small RNA, RNA, degradome or 7-methyl guanosine cap analysis of gene expression (CAGE), chromatin immunoprecipitation (ChIP) and genomic DNA-seq. piPipes can also produce figures and tables suitable for publication. By facilitating data analysis, piPipes provides an opportunity to standardize computational methods in the piRNA field.
SUPPLEMENTARY INFORMATION: Supplementary information, including flowcharts and example figures for each pipeline, is available at Bioinformatics online.
AVAILABILITY AND IMPLEMENTATION: piPipes is implemented in Bash, C++, Python, Perl and R. piPipes is free, open-source software distributed under the GPLv3 license and is available at http://bowhan.github.io/piPipes/.
CONTACT: Phillip.Zamore@umassmed.edu or Zhiping.Weng@umassmed.edu
Epigenetic dysregulation of hairy and enhancer of split 4 (HES4) is associated with striatal degeneration in postmortem Huntington brains
To investigate epigenetic contributions to Huntington's disease (HD) pathogenesis, we carried out genome-wide mapping of the transcriptional mark, trimethyl-histone H3-lysine 4 (H3K4me3) in neuronal nuclei extracted from prefrontal cortex of HD cases and controls using chromatin immunoprecipitation followed by deep-sequencing. Neuron-specific mapping of the genome-wide distribution of H3K4me3 revealed 136 differentially enriched loci associated with genes implicated in neuronal development and neurodegeneration, including GPR3, TMEM106B, PDIA6 and the Notch signaling genes hairy and enhancer of split 4 (HES4) and JAGGED2, supporting the view that the neuronal epigenome is affected in HD. Importantly, loss of H3K4me3 at CpG-rich sequences on the HES4 promoter was associated with excessive DNA methylation, reduced binding of nuclear proteins to the methylated region and altered expression of HES4 and the HES4-targeted genes MASH1 and P21, which are involved in striatal development. Moreover, hypermethylation of HES4 promoter sequences was strikingly correlated with measures of striatal degeneration and age-of-onset in a cohort of 25 HD brains (r = 0.56, P = 0.006). Lastly, shRNA knockdown of HES4 in human neuroblastoma cells altered MASH1 and P21 mRNA expression and markedly increased mutated HTT-induced aggregates and cell death. These findings, taken together, suggest that epigenetic dysregulation of HES4 could play a critical role in modifying HD pathogenesis and severity.