2 ATAC-Seq Data Analysis

2.1 Methods

In this tutorial we will explore methods for measuring the central value:

Mean
Median

Methods for measuring the variance or dispersion of the data:

Methods for data exploration and visualization:

Methods for genomic count data normalization and differential peak calling:

Methods for gene set and pathway enrichment and visualizing differential peak calling results:

2.2 R libraries

We need a couple of libraries for the exercises. Let’s load them all upfront:

library(ggplot2)
library(reshape2)
library(pander)
library(Hmisc)
library(pastecs)
library(DESeq2)
library(ChIPseeker)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(clusterProfiler)
library(org.Hs.eg.db)
library(ReactomePA)
library(pheatmap)

2.3 Load the ATAC-Seq count matrix

Let’s load the count matrix and the sample information with the blood type and donor.

df = read.table("atac.data.gz", header=T)
si = read.table("blood.samples", header=F)
colnames(si) = c("sample", "celltype", "donor")
rownames(si) = si$sample
si$donor = factor(si$donor)
pander(dim(df), "Data dimensions")

590650 and 14

pander(head(df))

Table continues below
Chr	Start	End	Bcell_13A	CD4_9A	CD4_9B	CD8_10B
chr1	10025	10525	0	0	2	0
chr1	13252	13752	0	4	0	2
chr1	16019	16519	1	0	2	0
chr1	96376	96876	0	0	0	0
chr1	115440	115940	0	2	0	0
chr1	235393	235893	2	4	2	0

Table continues below
Erythroblast_15A	Erythroblast_15B	NK_11A	Nkcell_11B
0	0	3	2
0	0	2	0
2	0	2	2
0	0	0	0
4	2	0	0
0	0	2	4

Bcell_13B	CD8_10A
0	0
0	0
0	2
0	0
0	0
0	0

pander(head(si))

	sample	celltype	donor
Bcell_13A	Bcell_13A	Bcell	5483
Bcell_13B	Bcell_13B	Bcell	5483
CD4_9A	CD4_9A	CD4Tcell	5483
CD4_9B	CD4_9B	CD4Tcell	5483
CD8_10A	CD8_10A	CD8Tcell	5483
CD8_10B	CD8_10B	CD8Tcell	5483

print(summary(si))

##        sample      celltype  donor   
##  Bcell_13A:1   Bcell   :2   5483:11  
##  Bcell_13B:1   CD4Tcell:2            
##  CD4_9A   :1   CD8Tcell:2            
##  CD4_9B   :1   Ery     :3            
##  CD8_10A  :1   NKcell  :2            
##  CD8_10B  :1                         
##  (Other)  :5

2.4 Removing missing peaks

The original count matrix had many more samples and thus, we have now a number of peaks in our matrix that are actually not present anymore. Let’s remove all the rows where not a single sample has more than 50 reads.

df = df[apply(df[,4:ncol(df)], 1, max) > 50,]
pander(dim(df), "Data dimensions")

57891 and 14

cm = df[,4:ncol(df)]
pander(quantile(rowSums(cm)))

0%	25%	50%	75%	100%
71	260	444	944	16021

pander(quantile(rowMeans(cm)))

0%	25%	50%	75%	100%
6.455	23.64	40.36	85.82	1456

pander(quantile(apply(cm, 1, max)))

0%	25%	50%	75%	100%
51	71	111	209	2471

2.5 Data Exploration

With the cleaned data, we can of course now compute simple statistics for each sample like the mean read count and standard deviation of counts across peaks. Box plots are useful for summarizing the distribution of read counts.

print(mean(df$NK_11A))

## [1] 89.05225

print(sd(df$NK_11A))

## [1] 138.2255

sim = rnorm(1000, mean=100, sd=10)
boxplot(sim)

hist(sim)

pander(quantile(sim))

0%	25%	50%	75%	100%
68.79	92.95	99.63	106.3	134.2

boxplot(df$NK_11A)

hist(df$NK_11A)

pander(quantile(df$NK_11A))

0%	25%	50%	75%	100%
0	12	43	102	2184

A common technique for labelling outliers uses the 25% and 75% quantiles.

pw = df$NK_11A
uq = quantile(pw, 0.75)
print(mean(pw > 1.5 * uq))

## [1] 0.1635487

iqr = IQR(pw)
print(mean(pw > 3 * iqr))

## [1] 0.07711043

We obviously have a very skewed distribution with a long tail. This is very common in Genomics and the standard approach to account for such a skewed distribution is some kind of log-transformation of the count data.

pw = log(df$NK_11A + 1)
boxplot(pw)

hist(pw)

The other major problem in genomic count data sets is that they often show heteroscedasticity which means in our case that different peaks show different levels of variabilities in the number of reads. This is a major problem for differential peak calling.

rowsummary = data.frame(rowmeans = apply(df[, 4:ncol(df)], 1, mean), rowsds = apply(df[, 4:ncol(df)], 1, sd))
ggplot(data=rowsummary, aes(x=rowmeans, y=rowsds)) + geom_point() + xlab("Peak means") + ylab("Peak SDs")

2.6 Data normalization

Extensions of the simple log-transformation such as rlog or the variance stabilizing transformation have been developed and are often applied to count data sets. DESeq2 also provides a method to compute normalized counts that account for library size and variance-mean dependencies.

counts = df[,4:ncol(df)]
dds = DESeqDataSetFromMatrix(countData = counts[,order(colnames(counts))], colData = si, design = ~ celltype)
dds = DESeq(dds)
cm = data.frame(counts(dds, normalized=TRUE))
rownames(cm) = paste0(df$Chr, '_', df$Start, '_', df$End)

2.7 Data visualization

Let’s explore the normalized counts.

lf = melt(cm, id.vars=c())
pander(head(lf))

variable	value
Bcell_13A	120.3
Bcell_13A	28.59
Bcell_13A	36.76
Bcell_13A	36.31
Bcell_13A	24.51
Bcell_13A	41.3

ggplot(data=lf, aes(x=variable, y=value)) + geom_boxplot(aes(group=variable)) + xlab("Sample") + ylab("Normalized Count") + coord_flip()

ggplot(data=lf, aes(x=value)) + geom_freqpoly(aes(group=variable, color=variable), bins=30) + xlab("Sample") + ylab("Normalized Count")

libsize = data.frame(x=sizeFactors(dds), y=colSums(assay(dds)))
ggplot(data=libsize, aes(x=x, y=y)) + geom_point() + geom_smooth(method="lm") + xlab("Estimated size factor") + ylab("Library size")

2.8 Principal component analysis (PCA)

Let’s do a PCA on the normalized counts and project the cell-type information onto the PCA plot.

pca = prcomp(t(cm))
print(summary(pca))

## Importance of components:
##                              PC1       PC2       PC3       PC4       PC5
## Standard deviation     7401.6259 4460.2806 3302.6565 2.011e+03 1.698e+03
## Proportion of Variance    0.5654    0.2053    0.1126 4.175e-02 2.977e-02
## Cumulative Proportion     0.5654    0.7708    0.8833 9.251e-01 9.549e-01
##                              PC6       PC7       PC8       PC9      PC10
## Standard deviation     1.602e+03 837.51387 638.98215 610.83234 567.93309
## Proportion of Variance 2.649e-02   0.00724   0.00421   0.00385   0.00333
## Cumulative Proportion  9.814e-01   0.98861   0.99282   0.99667   1.00000
##                             PC11
## Standard deviation     2.079e-11
## Proportion of Variance 0.000e+00
## Cumulative Proportion  1.000e+00

pcaData = as.data.frame(pca$x)
pcaData$sample=rownames(pcaData)
pcaData=merge(pcaData, si)
percentVar = round(100 * (pca$sdev^2 / sum( pca$sdev^2 ) ))
p=ggplot(data=pcaData, aes(x = PC1, y = PC2, color=celltype)) + geom_point(size=3)
p=p+xlab(paste0("PC1: ", percentVar[1], "% variance"))
p=p+ylab(paste0("PC2: ", percentVar[2], "% variance"))
print(p)

q=ggplot(data=pcaData, aes(x = PC3, y = PC4, color=celltype)) + geom_point(size=3)
q=q+xlab(paste0("PC3: ", percentVar[3], "% variance"))
q=q+ylab(paste0("PC4: ", percentVar[4], "% variance"))
print(q)

We can also check the proportion of variance explained by each PC.

varexp = data.frame(x=1:length(percentVar), y=percentVar)
varexp$x = factor(varexp$x)
ggplot(data=varexp, aes(x=x, y=y)) + geom_bar(stat="identity") + xlab("Principal Component") + ylab("Proportion of variation (%)")

Lastly we can inspect the loadings for each PC. That is we can investigate which peaks contribute most to the separation of the individual cell types.

loadings = abs(pca$rotation)
contribution = as.data.frame(sweep(loadings, 2, colSums(loadings), "/"))
contribution = contribution[with(contribution, order(-PC1)),]
pander(head(contribution))

Table continues below
	PC1	PC2	PC3	PC4
chr12_92987578_92988078	0.0004424	0.0001124	1.766e-05	7.004e-05
chr1_224691015_224691515	0.0003718	0.0001519	3.07e-05	0.0002056
chr8_38795902_38796402	0.0003584	0.0003164	0.0001248	5.385e-05
chr12_92270808_92271308	0.0003584	0.0002103	5.986e-05	9.303e-06
chr12_51295467_51295967	0.0003429	0.0001555	8.909e-06	9.567e-06
chr15_74902825_74903325	0.0003399	0.0001579	9.641e-06	3.776e-06

Table continues below
	PC5	PC6	PC7	PC8
chr12_92987578_92988078	1.955e-05	5.95e-05	4.679e-05	0.0001663
chr1_224691015_224691515	6.718e-05	0.0001275	1.757e-05	2.673e-05
chr8_38795902_38796402	0.0001721	6.097e-05	6.327e-05	0.000128
chr12_92270808_92271308	3.164e-05	4.929e-05	0.0001448	0.0001997
chr12_51295467_51295967	6.043e-06	0.0001456	8.594e-07	1.615e-05
chr15_74902825_74903325	2.996e-05	0.0003589	3.861e-05	7.074e-06

	PC9	PC10	PC11
chr12_92987578_92988078	0.0002022	5.283e-06	0.0001728
chr1_224691015_224691515	6.669e-06	3.329e-06	0.0001102
chr8_38795902_38796402	1.797e-05	0.0001382	4.786e-05
chr12_92270808_92271308	7.393e-05	6.688e-05	0.0001987
chr12_51295467_51295967	1.024e-05	4.807e-07	3.969e-05
chr15_74902825_74903325	4.397e-06	4.686e-06	0.0001824

2.9 Annotate genomic context

Let’s annotate the genomic context of each peak such as nearby genes a given peak may regulate.

gr = makeGRangesFromDataFrame(df, keep.extra.columns=T)
peakAnno = annotatePeak(gr, tssRegion=c(-1000, 1000), TxDb=TxDb.Hsapiens.UCSC.hg19.knownGene, annoDb="org.Hs.eg.db")

## >> preparing features information...      2019-05-05 02:13:04 PM 
## >> identifying nearest features...        2019-05-05 02:13:04 PM 
## >> calculating distance from peak to TSS...   2019-05-05 02:13:05 PM 
## >> assigning genomic annotation...        2019-05-05 02:13:05 PM 
## >> adding gene annotation...          2019-05-05 02:13:16 PM 
## >> assigning chromosome lengths           2019-05-05 02:13:17 PM 
## >> done...                    2019-05-05 02:13:17 PM

The genomic context of all peaks can be ploted using:

plotAnnoPie(peakAnno)

Let’s have a look at the 500 peaks with the highest loading for PC1. Looking at the PCA plot PC1 seems to separate Erythroid cells from the others.

dfPA = as.data.frame(peakAnno)
rownames(dfPA) = paste0(dfPA$seqnames, '_', dfPA$start, '_', dfPA$end)
selpeaks = dfPA[rownames(head(contribution, 500)),]
pathway1 = enrichPathway(selpeaks[abs(selpeaks$distance) < 5000,]$geneId)
pander(head(pathway1))

Table continues below
	ID	Description	GeneRatio
R-HSA-6798695	R-HSA-6798695	Neutrophil degranulation	16/117
R-HSA-189451	R-HSA-189451	Heme biosynthesis	3/117

Table continues below
	BgRatio	pvalue	p.adjust	qvalue
R-HSA-6798695	480/10619	6.996e-05	0.03267	0.03267
R-HSA-189451	11/10619	0.0002017	0.04709	0.04709

Table continues below
	geneID
R-HSA-6798695	210/8694/10383/847/5553/54472/23197/3043/3071/5870/272/353189/8621/2992/4033/51646
R-HSA-189451	210/3145/2235

	Count
R-HSA-6798695	16
R-HSA-189451	3

dotplot(pathway1)

## wrong orderBy parameter; set to default `orderBy = "x"`

Heme metabolism is important during erythropoiesis and the neutrophil degranulation pathway is important for cells of the immune system. Hence, it makes sense that these peaks separate the Erythroid cells from the white blood cells.

2.10 Differential peak calling

We can also identify differential peaks between two cell types using DESeq2.

res = results(dds, lfcThreshold=1, contrast=c("celltype", "Bcell", "Ery"))
print(mcols(res, use.names=T))

## DataFrame with 6 rows and 2 columns
##                        type                                   description
##                 <character>                                   <character>
## baseMean       intermediate     mean of normalized counts for all samples
## log2FoldChange      results log2 fold change (MLE): celltype Bcell vs Ery
## lfcSE               results         standard error: celltype Bcell vs Ery
## stat                results         Wald statistic: celltype Bcell vs Ery
## pvalue              results      Wald test p-value: celltype Bcell vs Ery
## padj                results                          BH adjusted p-values

print(summary(res))

## 
## out of 57891 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 1.00 (up)    : 14782, 26%
## LFC < -1.00 (down) : 2132, 3.7%
## outliers [1]       : 0, 0%
## low counts [2]     : 0, 0%
## (mean count < 3)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
## 
## NULL

The histogram of p-values is

hist(res$pvalue, breaks=0:20/20, col="grey50", border="white", xlim=c(0,1), main="Histogram of p-values", xlab="p-value")

The log-fold changes can be visualized using

plotMA(res, ylim = c(-5, 5))

Let’s plot the significant results in a heatmap:

print(sum(res$padj < 0.01 & abs(res$log2FoldChange) > 1))

## [1] 9105

mat = cm[which(res$padj < 0.01 & abs(res$log2FoldChange) > 1),]
mat = mat - rowMeans(mat)
anno = as.data.frame(colData(dds)[, c("sample", "celltype")])
rownames(mat) = NULL
pheatmap(mat, annotation_col = anno, scale="row")

For the up- and down-regulated peaks we can again perform pathway enrichment. Please note that up-regulated peaks are higher in B cells whereas down-regulated peaks are higher in Erythroid cells given our contrast of B-cells vs. Erythroid cells.

selpeaks = dfPA[rownames(cm[which(res$padj < 0.1 & res$log2FoldChange>0),]),]
pathwayUp = enrichPathway(selpeaks[abs(selpeaks$distance) < 5000,]$geneId)
pander(head(pathwayUp))

Table continues below
	ID	Description	GeneRatio
R-HSA-194840	R-HSA-194840	Rho GTPase cycle	59/1664
R-HSA-983695	R-HSA-983695	Antigen activates B Cell Receptor (BCR) leading to generation of second messengers	22/1664
R-HSA-2029480	R-HSA-2029480	Fcgamma receptor (FCGR) dependent phagocytosis	39/1664
R-HSA-4420097	R-HSA-4420097	VEGFA-VEGFR2 Pathway	41/1664
R-HSA-449147	R-HSA-449147	Signaling by Interleukins	122/1664
R-HSA-9006934	R-HSA-9006934	Signaling by Receptor Tyrosine Kinases	120/1664

Table continues below
	BgRatio	pvalue	p.adjust	qvalue
R-HSA-194840	138/10619	2.048e-14	2.737e-11	2.367e-11
R-HSA-983695	32/10619	2.261e-11	1.51e-08	1.306e-08
R-HSA-2029480	86/10619	6.262e-11	2.789e-08	2.413e-08
R-HSA-4420097	99/10619	6.401e-10	2.138e-07	1.85e-07
R-HSA-449147	463/10619	1.123e-09	3.001e-07	2.596e-07
R-HSA-9006934	458/10619	2.24e-09	4.698e-07	4.065e-07

Table continues below
	geneID
R-HSA-194840	55160/998/10451/389/9181/257106/2665/94134/9886/391/143872/57569/397/121512/54509/90627/8874/23263/394/11214/29/201176/396/23526/7409/23370/4650/51291/64857/9138/388/6654/23433/9938/55843/57580/613/5880/9901/50650/57514/8997/26084/116984/399/83478/10144/79658/7204/23092/221472/117289/5879/23221/84904/89846/7410/395/393
R-HSA-983695	5293/29760/118788/6786/5777/3709/801/5336/974/7409/933/973/6654/3708/4690/27071/5295/2534/80228/640/4067/6850
R-HSA-2029480	998/10163/2268/10451/3709/3071/122618/4644/5336/4641/71/7525/7409/4650/5581/10097/10096/7456/10109/3055/63916/5594/3708/5580/4690/5290/5295/26999/2534/60/5879/9844/3984/10095/4067/5747/6850/7410/7454
R-HSA-4420097	5590/998/10163/5567/10451/117145/9047/4688/9261/3845/3709/3071/5829/801/5579/1535/71/6093/7409/208/8440/3685/63916/4689/3708/5580/4690/5290/253260/5295/26999/2534/60/5879/9844/857/2185/5747/79109/7410/1536
R-HSA-449147	5293/998/163702/6195/5724/3932/3725/1901/1435/3570/5451/356/9261/3586/29949/7431/6935/3688/3185/240/29760/9049/9846/3587/7132/5777/51561/11213/1848/2322/3146/2308/3936/4792/57161/2353/302/4088/3603/4205/3566/50615/246778/3684/3687/3965/6352/6348/6351/5608/9021/7525/5771/596/7409/2208/3383/9466/3718/3594/2323/5443/6654/57162/3557/26469/6775/2335/3667/3055/351/6647/3588/3689/6285/23765/5594/4282/5008/3162/3568/729230/1234/941/942/3592/5290/604/5602/4790/3558/2247/3600/3575/5295/4208/84868/23291/8878/3662/1026/5292/6885/2309/2534/53832/6196/7291/3569/5898/5478/2185/1846/4067/3574/4609/4507/6850/2889/6197/4478/3561
R-HSA-9006934	5590/998/6195/10163/3932/5567/10451/117145/9047/26052/3915/4688/127124/9261/3688/3185/143282/10312/9846/80310/5777/5800/3845/3709/3071/1848/3479/10019/5829/2254/2322/8874/801/161742/6692/5607/1445/4205/5579/57610/1535/84951/3675/71/7525/5771/6093/3909/7409/208/2323/29924/6654/5581/200734/8440/26469/10254/2591/3685/8828/2335/3667/1286/1285/23767/83737/63916/1299/5753/6285/5594/135/4689/9402/3708/5580/26018/4690/5290/2247/2549/253260/55914/5295/4208/2241/9542/1729/5159/26999/2534/5796/2099/6196/60/5879/9844/5898/64759/10603/857/673/26281/2185/1846/2260/4067/55824/54845/5440/5747/5774/9550/79109/2889/5900/7410/6197/1536

	Count
R-HSA-194840	59
R-HSA-983695	22
R-HSA-2029480	39
R-HSA-4420097	41
R-HSA-449147	122
R-HSA-9006934	120

selpeaks = dfPA[rownames(cm[which(res$padj < 0.1 & res$log2FoldChange<0),]),]
pathwayDown = enrichPathway(selpeaks[abs(selpeaks$distance) < 5000,]$geneId)
pander(head(pathwayDown))

Table continues below
	ID	Description	GeneRatio	BgRatio
R-HSA-189451	R-HSA-189451	Heme biosynthesis	5/437	11/10619

Table continues below
	pvalue	p.adjust	qvalue	geneID
R-HSA-189451	4.339e-05	0.04538	0.04476	7389/7390/3145/2235/210

	Count
R-HSA-189451	5

There are many other methods for functional enrichments such as GREAT or DAVID. This concludes the ATAC-Seq tutorial. I hope you enjoyed it.