1. Overview

The CpGtools package provides a collection of Python programs to annotate, perform quality control (QC) on, visualize, and analyze DNA methylation data. The modules described below cover Illumina 450K/850K array data and RRBS/WGBS count data.

The CpGtools modules are organized into four main categories:

  • CpG position analysis modules

  • CpG signal analysis modules

  • Differential CpG analysis modules

  • Predictive modules (under development)

1.1. CpG Position Analysis Modules

These modules focus on analyzing CpG genomic locations and their annotations.

Name

Description

CpG_aggregation.py

Aggregate proportion values of CpGs located in given genomic regions (e.g., CpG islands, promoters, exons).

CpG_anno_position.py

Add annotation information to CpGs according to their genomic coordinates.

CpG_anno_probe.py

Add annotation information to 450K/850K probes.

CpG_density_gene_centered.py

Generate the CpG density (count) profile over gene bodies and their upstream and downstream intergenic regions.

CpG_distrb_chrom.py

Calculate the distribution of CpG over chromosomes.

CpG_distrb_gene_centered.py

Calculate the distribution of CpG over gene-centered genomic regions.

CpG_distrb_region.py

Calculate the distribution of CpG over user-specified genomic regions.

CpG_logo.py

Generate a DNA motif logo and matrices for a given set of CpGs.

CpG_to_gene.py

Assign CpGs to their putative target genes using an algorithm similar to GREAT.
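As a rough illustration of what "distribution of CpG over chromosomes" means, the sketch below counts CpG records per chromosome. It is not the code of CpG_distrb_chrom.py; the BED-like input layout (chromosome name in the first whitespace-separated column) and the function name `count_cpgs_per_chrom` are assumptions made for this example.

```python
from collections import Counter


def count_cpgs_per_chrom(bed_lines):
    """Count CpG records per chromosome from BED-like lines.

    Assumes (for this sketch only) that the first column of each
    line holds the chromosome name.
    """
    counts = Counter()
    for line in bed_lines:
        line = line.strip()
        # Skip blank lines, comments, and common BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom = line.split()[0]
        counts[chrom] += 1
    return counts


# Hypothetical CpG positions, one per line.
example = [
    "chr1\t10468\t10469",
    "chr1\t10470\t10471",
    "chrX\t2700156\t2700157",
]
counts = count_cpgs_per_chrom(example)
for chrom, n in sorted(counts.items()):
    print(chrom, n)
```

Dividing each count by the total number of records would turn these counts into per-chromosome proportions.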

1.2. CpG Signal Analysis Modules

These modules analyze CpG methylation beta values across samples and genomic regions.

Name

Description

beta_PCA.py

Perform PCA (principal component analysis) for samples.

beta_jitter_plot.py

Generate a jitter plot (a.k.a. strip chart) and a bean plot for each sample.

beta_m_conversion.py

Convert Beta-value into M-value or vice versa.

beta_profile_gene_centered.py

Calculate the methylation profile (i.e., average beta value) for genomic regions around genes.

beta_profile_region.py

Calculate the methylation profile (i.e., average beta value) around user-specified genomic regions.

beta_stacked_barplot.py

Create a stacked barplot for each sample, showing the proportions of CpGs whose beta values fall into [0, 0.25], [0.25, 0.5], [0.5, 0.75], and [0.75, 1].

beta_stats.py

Summarize basic information on CpGs located in each genomic region.

beta_tSNE.py

Perform t-SNE (t-Distributed Stochastic Neighbor Embedding) analysis for samples.

beta_topN.py

Select the top N most variable CpGs (according to standard deviation) from the input file.

beta_trichotmize.py

Use a Bayesian Gaussian mixture model to trichotomize beta values into three methylation states: 'Un-methylated', 'Semi-methylated', and 'Full-methylated'. The remaining values are labeled 'unassigned'.

beta_UMAP.py

Perform UMAP (Uniform Manifold Approximation and Projection) for samples.

beta_selectNbest.py

Select the K best features using ANOVA, mutual information, or the chi-squared statistic.

beta_combat.py

Correct batch effects using the ComBat algorithm.
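The Beta-value / M-value conversion listed above can be illustrated with the commonly used logit relationship, M = log2(Beta / (1 - Beta)) and Beta = 2^M / (2^M + 1). The sketch below is a minimal illustration of that relationship, not the code of beta_m_conversion.py; the function names and the `eps` clamp that keeps Beta away from exactly 0 and 1 are assumptions of this example.

```python
import math


def beta_to_m(beta, eps=1e-6):
    """Convert a Beta-value in [0, 1] to an M-value.

    The clamp to [eps, 1 - eps] is an assumption of this sketch; it
    avoids infinite M-values when beta is exactly 0 or 1.
    """
    b = min(max(beta, eps), 1.0 - eps)
    return math.log2(b / (1.0 - b))


def m_to_beta(m):
    """Convert an M-value back to a Beta-value."""
    two_m = 2.0 ** m
    return two_m / (two_m + 1.0)


for beta in (0.1, 0.5, 0.8):
    m = beta_to_m(beta)
    print(f"beta={beta:.2f}  M={m:.3f}  back={m_to_beta(m):.2f}")
```

A Beta-value of 0.5 maps to an M-value of 0, and the two functions invert each other for Beta-values strictly between `eps` and `1 - eps`.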

1.3. Differential CpG Analysis Modules

These modules identify CpGs that are differentially methylated between experimental or biological groups.

Name

Description

dmc_Bayes.py

Differential CpG analysis using the Bayesian approach. (for 450K/850K data)

dmc_bb.py

Differential CpG analysis using the beta-binomial model. (for RRBS/WGBS count data)

dmc_fisher.py

Differential CpG analysis using Fisher’s Exact Test. (for RRBS/WGBS count data)

dmc_glm.py

Differential CpG analysis using the generalized linear model (GLM). (for 450K/850K data)

dmc_logit.py

Differential CpG analysis using the logistic regression model. (for RRBS/WGBS count data)

dmc_nonparametric.py

Differential CpG analysis using the Mann-Whitney U test for two-group comparisons and the Kruskal-Wallis H-test for multiple-group comparisons.

dmc_ttest.py

Differential CpG analysis using the t-test. (for 450K/850K data)
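To make the count-based tests above more concrete, the following sketch applies a two-sided Fisher's exact test to a single CpG, using methylated and unmethylated read counts from two groups. It is a standard-library illustration of the test itself, not the code of dmc_fisher.py; the function name, argument layout, and example counts are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(meth1, unmeth1, meth2, unmeth2):
    """Two-sided Fisher's exact test on a 2x2 table of read counts.

    Rows are the two groups; columns are methylated / unmethylated reads.
    Returns the p-value: the total probability of all tables with the
    same margins that are no more likely than the observed one.
    """
    row1 = meth1 + unmeth1
    row2 = meth2 + unmeth2
    col1 = meth1 + meth2
    denom = comb(row1 + row2, col1)

    def prob(a):
        # Hypergeometric probability of seeing `a` methylated reads in group 1.
        return comb(row1, a) * comb(row2, col1 - a) / denom

    p_obs = prob(meth1)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # A small relative tolerance guards against floating-point ties.
    p = sum(prob(a) for a in range(lo, hi + 1) if prob(a) <= p_obs * (1 + 1e-7))
    return min(p, 1.0)


# Hypothetical counts: group 1 has 8 methylated / 2 unmethylated reads,
# group 2 has 1 methylated / 5 unmethylated reads.
p_value = fisher_exact_two_sided(8, 2, 1, 5)
print(f"p = {p_value:.4f}")
```

In a genome-wide analysis this test would be repeated for every CpG, so the resulting p-values generally need multiple-testing correction.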

1.4. Predictive Modules

These modules aim to predict phenotypes or biological attributes from DNA methylation profiles.

Name

Description

predict_sex.py

Predict sex based on the semi-methylation (also known as genomic imprinting) ratio.