Generation of DNase I hypersensitivity maps
DNase I assays were generally performed according to a protocol detailed previously42. This protocol involves treatment of intact nuclei with the small enzyme DNase I which is able to penetrate the nuclear pore and cleave exposed DNA. Small (<1 kb) fragments are isolated from lysed nuclei following DNase I treatment, linkers are added, and the resulting library is sequenced. Because tissue and cell culture, isolation, and handling protocols differ for different biosamples, these are indexed in Supplementary Table 1. Additional information on the procurement of biosample material and DNase-seq biosample selection and data processing is available in the Supplementary Methods.
Index of consensus human DHSs
DHSs were detected in individual biosample datasets and integrated across all 733 datasets to yield a set of 3.59 million consensus DHS delineations. These elements were subsequently annotated with estimates of their centre-of-mass, positional stability across datasets and confidence scores. A detailed explanation of this procedure is provided in the Supplementary Methods.
Overlap of the DHS index with genomic annotations
To assess the overlap of our DHS consensus elements with repetitive elements (Extended Data Fig. 2b), we obtained RepeatMasker43 annotations downloaded from the University of California Santa Cruz (UCSC) Table Browser44, and considered the various repeat classes and (sub)families as provided. To perform analogous analyses for human gene annotations (Extended Data Fig. 2c), we obtained GENCODE45 v.28 Basic annotations. We defined exons as specified in the GENCODE annotation, promoters as the TSS of genes ±1 kb, and introns as the rest of the gene body. Intergenic regions were defined as those not covered by gene bodies or defined promoters. We assigned index DHSs to these annotations requiring at least a 1 bp overlap, choosing the annotation with the largest overlap in case of multiple overlapping annotations.
TOPMed within-human sequence variation data were obtained from the Bravo website (https://bravo.sph.umich.edu/freeze5/hg38/download, Freeze 5, hg38, VCF format). We converted 495.6 million single-base substitutions to nucleotide diversity scores (π), with a score of zero implied for every genomic base position with no variants. Per base, phyloP46 sequence conservation scores were downloaded as-is (http://hgdownload.cse.ucsc.edu/goldenpath/hg38/phyloP100way/). Within-human sequence variation data (π × 104) and phyloP conservation scores were aligned relative to DHS centroids using 20-bp non-overlapping windows tiled across a 1-kb region centred on each centroid (Extended Data Fig. 2g). For each window offset relative to the DHS centroid, genome-wide per-base scores were subsetted using bedops47 and averaged with GNU datamash.
Saturation and extendability of DHS index
For random subsamples of sizes ranging from 1 to 733 biosamples, we estimated the mean number of novel DHSs added by a new dataset as a function of total number of datasets sampled (Extended Data Fig. 2h). To extrapolate these estimates to future biosample sets, we fitted a log-log model to the data. From the saturation analysis, we expect the overwhelming majority of DHSs identified in any new dataset to be represented already in the index, to which they will contribute additional confidence and precision. Incremental datasets can be added to the index by re-delineating DHSs using the original per-dataset DHS calls permanently recorded at the ENCODE DCC (Supplementary Table 1).
Construction of a DHS vocabulary
We used NMF28,29 for the decomposition of a binary matrix consisting of the presence or absence calls of m DHSs across n DNase-seq datasets into a smaller set of k components. As with other dimensionality reduction methods, NMF does not guarantee a total recapitulation of the original data; instead we chose to allow information loss in exchange for a more interpretable result. Therefore, we considered using a much smaller number of k components than the lower of the two dimensions of our input matrix (733 DNase-seq datasets). To keep the reconstruction error in check, we used an objective function that is minimized subject to the Frobenius norm (Extended Data Fig. 3a). NMF typically uses a random initialization step, leading to unstable results. To alleviate this, we performed the initialization step using singular value decomposition (SVD)48,49, leading to consistent results while maintaining a performance that is on par with randomly initialized instances. A more detailed rationale for the component-wise description of DHSs, as well as details on the implementation and execution of the decomposition, is provided in the Supplementary Methods.
Labelling of NMF components and DHSs
To aid interpretation of the 16 NMF-derived components, we used two orthogonal approaches to assign labels to components, based on (i) biosample properties and (ii) DHS sequence features.
First, for each component we selected the top biosamples based on component-specific NMF loadings present in their datasets (Extended Data Fig. 4a). These maximal NMF loadings across datasets were generally strong across components (Extended Data Fig. 4b). In general, a clear pattern emerged of shared properties of biosamples most strongly associated with specific components. To formalize this, we performed one-sided Mann–Whitney U tests to assess whether NMF loadings for biosamples sharing certain metadata categories (Supplementary Table 1) are greater than those for biosamples not in the given metadata category (Extended Data Fig. 4c). In particular, we assessed metadata categories corresponding to human organ systems and the cancer status of biosamples. P values were corrected for multiple hypothesis testing using the Bonferroni correction method. A post hoc analysis of biosample-to-component assignment for values of k < 16 provided insight into the genesis of our k = 16 component model, showing junctures after which separate cell type lineages are captured by distinct components (Extended Data Fig. 4d).
Second, for each component we obtained DHSs with maximal NMF loadings for that component, and subsequently performed enrichment analyses for TF binding site motifs (Extended Data Fig. 4e). We used a wide array of TF motifs and used FIMO50 (match threshold P < 10−5), to search for motif instances in the human genome. We tested the association of motif occurrences with specific NMF components using Fisher’s exact test. We used clusters of similar motifs (http://www.mauranolab.org/CATO/weblogos/main.html) for the purpose of summarization and visualization. The results show strong enrichments for component-specific motifs, suggesting preferential binding of component-relevant transcription factors (Extended Data Fig. 4f).
The strong associations of 1) biosample properties and 2) TF binding site occurrences with specific components enabled us to label each NMF component, resulting in a DHS vocabulary (Fig. 2d), further detailed in the Supplementary Note. For downstream analyses, we labelled each DHS with its strongest NMF component (Fig. 2e, bottom).
Robustness of component interpretation
To test the effect of inducing additional sparsity in the NMF model, we systematically increased the L1 penalization setting while tracking F1 scores and the fraction of non-zero parameters used in the model (Extended Data Fig. 5a–c). The top 15 component-contributing biosamples per component remain mostly consistent with Fig. 2e and Extended Data Fig. 4a without L1 penalization, indicating that enforcing additional sparsity does not impact the interpretation of model components.
To test the effect of possible over/under-representation of certain cell types, we removed 44 (40%) haematopoietic biosamples, consisting of the highest quality datasets representative of unique cellular conditions (Supplementary Table 1). After building a new NMF model, we observe that although the remaining (lower quality) haematopoietic biosamples are now captured by a single component instead of two, the interpretation of the remaining non-haematopoietic components does not change (Extended Data Fig. 5d).
Regulatory annotation of human genes
Per-component genomic distribution of DHSs
We compared the average distance between same-component DHSs against empirical distributions based on random assignment of component labels to DHSs and sampling the same number of DHSs 1,000 times (Extended Data Fig. 6a).
Per-component meta-DNase tracks
To illustrate the regional diversity of DHS component data, we generated meta-DNase tracks representing each of the 16 DHS components (Extended Data Fig. 6b) by averaging genome-wide DNase-seq signal profiles of the top 15 biosamples most strongly associated with each component (Extended Data Fig. 4a). For visual conciseness, we provide aggregate tracks that overlay the meta-DNase tracks of all DHS components (for example, Fig. 3c, Extended Data Figs. 6b, c, 7a–h, 8a–c).
Definition of regulatory landscapes
We defined the regulatory landscape of a gene as the set of DHSs within the gene body, plus DHSs in flanking regions of maximally 5 kb upstream and maximally 1 kb downstream of the gene body, or up until halfway through to the gene upstream, whichever value is smaller (Fig. 3a, Extended Data Fig. 6e–g). This captures approximately 65% of all DHSs (Extended Data Fig. 6d) and prevents flanking region DHSs from being routinely assigned to the regulatory landscapes of multiple genes, alleviating mixing of regulatory signals.
Association of genes with DHS components
We tested the association of all 56,832 annotated GENCODE genes (Fig. 3b) with each DHS component separately. Under the null hypothesis that DHS components are randomly distributed across gene regulatory landscapes, we used the binomial distribution to test whether the proportion of DHSs annotated with a given component is higher among DHSs within a particular gene regulatory landscape than outside. We controlled the FDR at 5% by calculating q values51 across the total of all genes and components. Further details are provided in the Supplementary Methods. To study the differences between a gene-centric and TSS-centric approach, we calculated component associations for 10-kb regions centred around the TSS (that is, TSS ± 5 kb) and assessed the number and type of genes annotated (Extended Data Fig. 6h, i).
Annotations for GENCODE genes and pseudo-gene types
GENCODE v.28 (Basic) annotations were used for all analyses. For the purpose of labelling and visualizing genes, for each gene we used its longest transcript as its representative region. Pseudo-gene annotations were obtained from psiCube52, http://pseudogene.org/psicube/data/gencode.v10.pgene.parents.txt.
Visualization of gene regulatory annotations
We used t-SNE to visualize the enrichment ratios of gene regulatory landscapes for DHS components (Fig. 3d, Extended Data Fig. 8a–c). Each dot shown represents a gene found to be significantly associated with one or more DHS components, and the union of these are the genes used to calculate the 2D embedding. The R (http://www.r-project.org) implementation as provided in the Rtsne package was used, with default parameters. Genes are coloured according to their (most strongly enriched) significant DHS component.
Construction and use of gene expression compendium dataset
We used the full human ARCHS4 dataset (downloaded 26 June 2018)31 and selected relevant tissue and cell types for each DHS vocabulary component (Supplementary Methods). This resulted in a total of 33,733 unique gene expression datasets, with expression information for 35,238 genes. For each gene, we obtained the 95th percentile value across datasets selected for each DHS component as the representative value in that component, to not be led by outliers in the data, while still being sensitive for cell type selective expression levels. For each DHS component, we calculated average expression levels across genes labelled with that component (observed), as well as across all component-labelled genes (expected). Resulting values are reported as log2-transformed enrichment ratios (Fig. 3g).
Annotation and visualization of pathway labellings
A curated set of canonical pathways was obtained from the MSigDB Collections (http://software.broadinstitute.org/gsea/msigdb/genesets.jsp?collection=CP). Pathway enrichment analyses (Extended Data Fig. 8d, e) were performed analogously to gene enrichment analyses, by pooling DHSs in neighbourhoods of all pathway-associated genes. We used the KEGG35 REST API (https://www.kegg.jp/kegg/rest/keggapi.html) to download and graphically annotate KEGG pathway representations.
Prioritization of TF-associated DHSs
We obtained DHSs with loadings for a single component only. For each component-labelled TF gene with a known sequence binding motif, we obtained the subset of DHSs that (i) are annotated with the same component as the TF, (ii) contain a TF-matching motif, and (iii) are footprinted in a biosample associated with the same component25 (Fig. 3h). Although the above analysis identified a small minority of DHSs owing to stringent filtering, motifs with variable information content, and the smaller range of biosamples for which footprinting data are available, this approach could be recapitulated with less extreme parameters to identify larger sets of DHSs at reasonable confidence.
Genetic variation analyses
GWAS traits and summary statistics
We obtained GWAS summary statistics data from the UK Biobank project as processed by the Neale lab (http://www.nealelab.is/uk-biobank/). In addition, we obtained GWAS summary statistics calculated using BOLT-LMM v2.353, as used in recent work38.
Estimates of SNP-based heritability
GWAS traits were curated by removing those with a narrow-sense SNP-based heritability54 (hg2) of less than 1%. Although ideally we would quantify heritability by considering the true causal effects of variants, in reality we do not observe these. Instead, we are limited to GWAS summary statistics, which essentially describe the marginal trait-correlation for each variant, consisting of both causal effects and effects due to LD, plus statistical noise. Recently proposed methods such as LD score regression (LDSC)55 are able to estimate heritability while explicitly considering the underlying LD structure. For continuous traits, in case both raw and inverse-ranked normalized (irnt) versions were available, we retained the latter only. This yielded a total of 1,316 traits for subsequent analyses with an hg2 of at least 1%.
Quantitative trait associations
For quantitative trait-versus-component analyses (Fig. 4a, Extended Data Fig. 9a–c), we assessed the correspondence between trait association strength (GWAS variant association P value) and the component annotations of variant-containing DHSs, for increasingly stringent subsets of GWAS variants. Enrichment P values were calculated using a binomial distribution, as done previously6. We explicitly control for large scale LD structure, using a form of LD clumping56, by selecting a single variant-containing DHS for each of 1,708 approximately independent LD blocks57. Namely, for each LD block, the variant with the lowest GWAS association P value that overlaps a DHS was selected for subsequent analysis. In case multiple such variant-containing DHSs existed, we gave preference to the DHS with the highest confidence score (mean signal) in our DHS index.
Stratified LD-score regression
To estimate hg2 with maximal statistical power, we used LD score regression (LDSC)55 to explicitly take into account LD structure. In particular, we used a stratified version of LDSC (S-LDSC)36 to partition heritability estimates according to pre-defined sets of genome-wide annotations (Fig. 4b, c, Extended Data Fig. 9d, e), consisting of our annotated DHSs in addition to a wide range of 85 genome-wide functional ‘baseline’ annotations (baseline-LD model v.2.1). The v.2.1 baseline set consists of a total of 86 genome-wide annotations, building upon the 76 annotations used in the v.2.0 set and several additional annotations58. These ‘baseline’ annotations encode whether SNPs fall inside protein-coding or non-coding regions, regions with increased levels of evolutionary conservation, regions predicted or confirmed to have enhancer activity, and so on. Their breadth provides a robust36 baseline model along which to test trait heritability contributions of our DHS components. We express the heritability enrichment of an annotation as the ratio of its proportion of per-trait hg2 and the proportion of SNPs covered by the annotation (Fig. 4b).
Variants included in the analysis are those registered in HapMap3, with a minimal minor allele frequency (MAF) of 5%, and excluding the human major histocompatibility complex (MHC) locus. Baseline LD scores were computed from 1000 Genomes Phase 3 data from European ancestry populations and corresponding allele frequencies (as used previously58 and available from the LDSC reference downloads page, along with the baselineLD annotation set: https://data.broadinstitute.org/alkesgroup/LDSCORE/).
Heritability enrichments for DHS vocabulary components
We applied S-LDSC to our DHS vocabulary components as follows. In brief, each DHS was assigned to its majority DHS component and (when possible) assigned to overlapping variants. For the resulting vocabulary-based annotations, LD scores were calculated. We then performed S-LDSC separately for each of the selected 1,316 traits, relative to these vocabulary-based annotations and the baselineLD model described above. For each trait versus annotation combination, we obtained estimates of its heritability enrichment36, expressed as the ratio of its proportion of hg2 and the proportion of SNPs covered by the annotation (Fig. 4b, c). We considered heritability enrichments statistically significant at an estimated FDR of less than 5% calculated across all considered traits and DHS components. This is more stringent than the commonly used per-trait correction for multiple hypothesis testing.
Unique per-annotation contributions to SNP-based heritability
Estimates of heritability enrichment can be confounded by contributions of multiple (overlapping) genomic annotations included in S-LDSC models. To quantify unique per-annotation contributions to heritability, we obtained the average per-SNP increase in heritability ascribed to that component, after controlling for all other annotations in the model (baseline annotations and DHS components)36. From the reported coefficients and their standard errors, we derived z-scores, one-sided P values and FDR-corrected q values for each trait-versus-component combination (Fig. 4d, e). For the heritability analysis in component concordant genic DHSs (Fig. 4f), we further stratified DHSs based on whether they are component concordant, component discordant, inside non-annotated genes (genic controls), or inter-genic. Figure 4f shows z-scores for the maximally enriched components identified in Fig. 4c.
To quantify the heritability contribution of per-dataset DHSs, we performed a variation on the standard S-LDSC procedure, as described previously36. Specifically, we built upon the baselineLD model by iteratively considering annotations derived from individual datasets only. These individual datasets were collected by selecting for each trait the 15 datasets most informative to each DHS component (Extended Data Fig. 4a). Annotations consist of DHSs observed in those datasets, as well as their complement, that is, the remainder of index DHSs. We report the contribution to heritability based on the former, expressed as z-scores (Fig. 4d, e).
Extendability of the DHS vocabulary
Addition of novel unseen datasets
New datasets may be added to the current NMF model while retaining the same interpretation of components (Extended Data Fig. 10a). In brief, 0.1% FDR variable-width peak calls are obtained from new datasets of interest, mapped to DHS index elements using bedops47 and projected into the existing component space using standard NMF routines (see code for more details).
DHS index element identification without de novo peak identification
We used bedops47 to look up DNase-seq signal levels of a dataset of interest over index elements, to determine whether a given element is actuated in the dataset. Expressed as a classification problem, using the existing 0.1% FDR variable-width peak calls as the groundtruth set, we assess precision and recall of peak recovery. For all 733 biosamples we find area under precision recall curve (AUPRC) values ranging from 0.33 to 0.83 (median, 0.71; IQR, 0.64–0.75), with a trophoblast biosample (ENCODE DCC identifier ENCBS576QRR) shown as an example (Extended Data Fig. 10a). The large difference between AUPRC values of matched versus non-matched biosamples allows the identification of the original biosample (Extended Data Fig. 10b), while showing that biosamples with similar AUPRC ranks share the same biological characteristics (Extended Data Fig. 10c). This procedure can also be followed for unseen datasets (Extended Data Fig. 10d), in particular datasets that are less deeply profiled or would otherwise be too sparse to call peaks on de novo—such as single cell chromatin profiling data.
Further information on research design is available in the Nature Research Reporting Summary linked to this paper.