3. CpG_anno_probe.py

3.1. Description

This program adds comprehensive annotation information to each 450K/850K array probe ID. It will add 17 columns to the original input data file. These 17 columns include (from left to right):

Header Name Description
hg19_pos The genomic position of the CpG on human genome assembly hg19 (or GRCh37)
hg38_pos The genomic position of the CpG on human genome assembly hg38 (or GRCh38).
strand Strand of the CpG. Value - “R” (reverse strand) or “F” (forward strand).
geneSymbol Genes the CpG has been assigned to. “N/A” indicates no genes were found. This is retrieved from the Illumina MethylationEpic v1.0 B4 manifest file.
CpGisland The CpG island (CGI) that overlaps with this CpG. “N/A” indicates no CGIs were found.
with_450K Boolean indicating whether this CpG probe is also included in 450K. “0” - No, “1”- Yes.
SNP_ID SNPs (rsID) that are close to this CpG. Multiple SNPs are separated by “;”. “N/A” indicates no SNPs were found.
SNP_distance The nucleotide distances between SNPs and the CpG.
SNP_MAF The minor allele frequencies (MAF) of SNPs.
Cross_Reactive Boolean (“0” - No, “1”- Yes) indicating whether this CpG could be affected by cross-hybridization or underlying genetic variation as reported by this paper.
ENCODE_TF_ChIP Transcription factor (TF) binding sites identified from ChIP-seq experiments performed by the ENCODE project. Peaks from 1264 experiments representing 338 transcription factors in 130 cell types are combined (N = 10,560,472). BED format file was downloaded from the UCSC Tabel Browser, and a detailed description is provided here.
ENCODE_DNaseI DNase I hypersensitivity sites identified from ENCODE DNase-seq experiments. Peaks from 125 cell types are combined (N - 1,867,665). BED format file was downloaded from the UCSC Table Browser, and a detailed description is provided here.
ENCODE_H3K27ac_ChIP H3K27ac peaks identified from ENCODE histone ChIP-seq experiments. Peaks from 11 cell types (GM12878, H1-hESC, HMEC, HSMM, HUVEC, HeLaS3, HepG2, K562, Monocytes-CD14+_RO01746, NHEK, NHLF) are combined (N = 665,650)
ENCODE_H3K4me1_ChIP H3K4me1 peaks identified from ENCODE histone ChIP-seq experiments. Peaks from 11 cell types (GM12878, H1-hESC, HMEC, HSMM, HUVEC, HeLaS3, HepG2, K562, Monocytes-CD14+_RO01746, NHEK, NHLF) are combined (N = 1,435,550)
ENCODE_H3K4me3_ChIP H3K4me3 peaks identified from ENCODE histone ChIP-seq experiments. Peaks from 11 cell types (GM12878, H1-hESC, HMEC, HSMM, HUVEC, HeLaS3, HepG2, K562, Monocytes-CD14+_RO01746, NHEK, NHLF) are combined (N = 525,824)
ENCODE_chromHMM Chromatin State Segmentation by chromHMM from ENCODE. Chromatin states across 9 cell types (GM12878, H1-hESC, K562, HepG2, HUVEC, HMEC, HSMM, NHEK, NHLF) were learned by computationally by integrating 9 factors (CTCF, H3K27ac, H3K27me3, H3K36me3, H3K4me1, H3K4me2, H3K4me3, H3K9ac, H4K20me1 ) plus input. A total of 15 states were identified, include: State-1 (Active Promoter), state-2 (Weak Promoter), state-3 (Inactive/poised Promoter), state-4 and 5 (Strong enhancer), state-6 and 7 (Weak/poised enhancer), state-8 (insulator), state-9 (Transcriptional transition), state-10 (Transcriptional elongation), state-11 (Weak transcribed), state-12 (Polycomb-repressed), state-13 (Heterochromatin or low signal), state-14 and 15 (Repetitive/Copy Number Variation). Orignal chromatin state BED file was downloaded from UCSC Table Browser, and detailed description is provided here.
FANTOM_enhancer PHANTOM5 human enhancers downloaded from here.

3.2. Notes

  • For peaks identified from ENCODE ChIP-seq and DNase-seq (ENCODE_TF_ChIP, ENCODE_H3K27ac_ChIP, ENCODE_H3K4me1_ChIP, ENCODE_H3K4me3_ChIP, and ENCODE_DNaseI), we require the probe must be located in the 100 bp window centered on the middle of the peak.

3.3. Options

--version show program’s version number and exit
-h, --help show this help message and exit
-i INPUT_FILE, --input_file=INPUT_FILE
 Input data file (Tab-separated) with a certain column containing 450K/850K array CpG IDs. This file can be a regular text file or compressed file (.gz, .bz2).
-a ANNO_FILE, --annotation=ANNO_FILE
 Annotation file. This file can be a regular text file or compressed file (.gz, .bz2).
-o OUT_FILE, --output=OUT_FILE
 Prefix of the output file.
-p PROBE_COL, --probe_column=PROBE_COL
 The number specifying which column contains probe IDs. Note: the column index starts with 0. default-0.
-l, --header Input data file has a header row.

3.5. Command

# probe IDs are located in the 4th column (-p 3)

$CpG_anno_probe.py -p 3 -l -a MethylationEPIC_CpGtools.tsv -i test_01.hg19.bed6 -o output

or (take gzipped files as input)

$CpG_anno_probe.py -p 3 -l -a MethylationEPIC_CpGtools.tsv.gz -i test_01.hg19.bed6.gz -o output

@ 2019-06-28 09:12:41: Read annotation file "../epic/MethylationEPIC_CpGtools.tsv" ...
@ 2019-06-28 09:12:52: Add annotation information to "test_01.hg19.bed6" ...

3.6. Output files

  • output.anno.txt