1. Overview

The CpGtools package provides a number of Python programs to annotate, QC, visualize, and analyze DNA methylation data generated from the Illumina HumanMethylation450 BeadChip (450K) or MethylationEPIC BeadChip (850K) arrays, or from reduced representation bisulfite sequencing (RRBS) or whole-genome bisulfite sequencing (WGBS).

These programs can be divided into four classes:

  • CpG position analysis modules

  • CpG signal analysis modules

  • Differential CpG analysis modules

  • Predictive modules (under development)

1.1. CpG position analysis modules

These modules are primarily used to analyze the genomic locations of CpGs.

Name

Description

CpG_aggregation.py

Aggregate proportion values of CpGs located in given genomic regions (e.g., CpG islands, promoters, exons).

CpG_anno_position.py

Add annotation information to CpGs according to their genomic coordinates.

CpG_anno_probe.py

Add annotation information to 450K/850K probes.

CpG_density_gene_centered.py

Generate the CpG density (count) profile over gene body and the up/down-stream intergenic regions.

CpG_distrb_chrom.py

Calculate the distribution of CpG over chromosomes.

CpG_distrb_gene_centered.py

Calculate the distribution of CpG over gene-centered genomic regions.

CpG_distrb_region.py

Calculate the distribution of CpG over user-specified genomic regions.

CpG_logo.py

Generate a DNA motif logo and matrices for a given set of CpGs.

CpG_to_gene.py

Assign CpGs to their putative target genes, using an algorithm similar to GREAT.
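
As a rough illustration of the kind of summary these position modules produce, the sketch below counts CpGs per chromosome. It is not taken from the CpGtools source. The function name `count_cpgs_by_chrom` and the input layout (BED-like lines with the chromosome in the first column) are assumptions made for this example.

```python
from collections import Counter


def count_cpgs_by_chrom(bed_lines):
    """Count CpGs per chromosome from BED-like lines (hypothetical helper).

    Assumes each data line starts with the chromosome name, followed by
    whitespace-separated coordinates. Comment and header lines are skipped.
    """
    counts = Counter()
    for line in bed_lines:
        line = line.strip()
        # Skip blank lines and common BED header/comment lines
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom = line.split()[0]
        counts[chrom] += 1
    return counts


# Toy input: three CpG positions on two chromosomes
example = ["chr1\t100\t101", "chr1\t250\t251", "chr2\t40\t41"]
counts = count_cpgs_by_chrom(example)
for chrom in sorted(counts):
    print(chrom, counts[chrom])
```

For the actual input formats and options accepted by CpG_distrb_chrom.py, refer to that program's own documentation.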

1.2. CpG signal analysis modules

These modules are primarily used to analyze the DNA methylation beta values of CpGs.

Name

Description

beta_PCA.py

Perform PCA (principal component analysis) for samples.

beta_jitter_plot.py

Generate a jitter plot (a.k.a. strip chart) and a bean plot for each sample.

beta_m_conversion.py

Convert Beta-values into M-values, or vice versa.

beta_profile_gene_centered.py

Calculate the methylation profile (i.e., average beta value) for genomic regions around genes.

beta_profile_region.py

Calculate the methylation profile (i.e., average beta value) around user-specified genomic regions.

beta_stacked_barplot.py

Create a stacked barplot for each sample, showing the proportions of CpGs whose beta values fall into the intervals [0, 0.25], [0.25, 0.5], [0.5, 0.75], and [0.75, 1].

beta_stats.py

Summarize basic information on CpGs located in each genomic region.

beta_tSNE.py

Perform t-SNE (t-Distributed Stochastic Neighbor Embedding) analysis for samples.

beta_topN.py

Select the top N most variable CpGs (according to standard deviation) from the input file.

beta_trichotmize.py

Use a Bayesian Gaussian Mixture model to trichotomize beta values into three statuses: ‘Un-methylated’, ‘Semi-methylated’, and ‘Full-methylated’. Values that fit none of the three are labeled ‘unassigned’.

beta_UMAP.py

Perform UMAP (Uniform Manifold Approximation and Projection) for samples.

beta_selectNbest.py

Select the K best features using ANOVA, mutual information, or the chi-squared statistic.
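
The Beta-value/M-value conversion listed above is commonly defined as M = log2(Beta / (1 - Beta)), with the inverse Beta = 2^M / (2^M + 1). The sketch below shows that relationship only. It is not the implementation in beta_m_conversion.py, and the function names and the `eps` clamping parameter are assumptions made for this example.

```python
import math


def beta_to_m(beta, eps=1e-6):
    """Convert a Beta-value in [0, 1] to an M-value (hypothetical helper).

    The value is clamped to [eps, 1 - eps] so that Beta = 0 or 1 does not
    produce an infinite M-value; eps is an illustrative choice.
    """
    b = min(max(beta, eps), 1.0 - eps)
    return math.log2(b / (1.0 - b))


def m_to_beta(m):
    """Convert an M-value back to a Beta-value (hypothetical helper)."""
    return 2.0 ** m / (2.0 ** m + 1.0)


# Round trip on a few Beta-values
for beta in (0.1, 0.5, 0.8):
    m = beta_to_m(beta)
    print(beta, round(m, 4), round(m_to_beta(m), 4))
```

A Beta-value of 0.5 maps to an M-value of 0, and values above or below 0.5 map to positive or negative M-values respectively.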

1.3. Differential CpG analysis modules

These modules are primarily used to identify CpGs that are differentially methylated between groups.

Name

Description

dmc_Bayes.py

Differential CpG analysis using the Bayesian approach. (for 450K/850K data)

dmc_bb.py

Differential CpG analysis using the beta-binomial model. (for RRBS/WGBS count data)

dmc_fisher.py

Differential CpG analysis using Fisher’s Exact Test. (for RRBS/WGBS count data)

dmc_glm.py

Differential CpG analysis using a generalized linear model (GLM). (for 450K/850K data)

dmc_logit.py

Differential CpG analysis using a logistic regression model. (for RRBS/WGBS count data)

dmc_nonparametric.py

Differential CpG analysis using the Mann-Whitney U test for two-group comparisons and the Kruskal-Wallis H-test for comparisons among multiple groups.

dmc_ttest.py

Differential CpG analysis using the t-test. (for 450K/850K data)
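
To make the count-based tests above more concrete, the sketch below computes a two-sided Fisher's exact test p-value for a single CpG from a 2x2 table of methylated and unmethylated read counts in two groups. It is a minimal standard-library illustration, not the code used by dmc_fisher.py. The function name, the argument layout, and the read counts are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows are the two groups; columns are methylated and unmethylated
    read counts at one CpG (an assumed layout for this example).
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x methylated reads in group 1,
        # with all row and column totals held fixed
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed
    # one; the small tolerance guards against floating-point ties
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))
    return min(p, 1.0)


# Made-up counts: group 1 has 18 methylated / 2 unmethylated reads,
# group 2 has 6 methylated / 14 unmethylated reads
p_value = fisher_exact_two_sided(18, 2, 6, 14)
print(p_value)
```

In practice, `scipy.stats.fisher_exact` provides an established implementation of the same test.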

1.4. Predictive modules

These modules are primarily used to predict phenotypes from DNA methylation data.

Name

Description

predict_sex.py

Predict sex based on the semi-methylation (also known as genomic imprinting) ratio.