8. CpG_logo.py

8.1. Description

This program generates a DNA motif logo for a given set of CpGs, to answer the question: “What is the genomic context of a given list of CpGs?” It first extracts the genomic sequences around each C position, and then generates motif matrices, including:

  • position frequency matrix (PFM)

  • position probability matrix (PPM)

  • position weight matrix (PWM)

  • MEME format matrix

  • Jaspar format matrix

It also generates a motif logo using WebLogo.
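The relationship between the first three matrices can be illustrated with a minimal sketch. It is not the code used by CpG_logo.py: the function names, the `pseudocount` parameter, and the uniform background of 0.25 are assumptions made for illustration, and the program's own conventions may differ. The sketch counts bases per position (PFM), normalizes the counts to probabilities (PPM), and takes log2 odds against the background (PWM).

```python
import math
from collections import Counter

BASES = "ACGT"

def position_frequency_matrix(seqs):
    """Count each base at each position (PFM). All sequences must share one length."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must have equal length")
    columns = [Counter(s[i].upper() for s in seqs) for i in range(length)]
    return {b: [col[b] for col in columns] for b in BASES}

def position_probability_matrix(pfm, pseudocount=0.0):
    """Convert counts to per-position probabilities (PPM).

    The pseudocount is a hypothetical parameter; a column with no A/C/G/T
    counts and a zero pseudocount would raise ZeroDivisionError.
    """
    length = len(pfm["A"])
    totals = [sum(pfm[b][i] for b in BASES) + 4 * pseudocount for i in range(length)]
    return {b: [(pfm[b][i] + pseudocount) / totals[i] for i in range(length)]
            for b in BASES}

def position_weight_matrix(ppm, background=0.25):
    """Convert probabilities to log2 odds against an assumed uniform background (PWM)."""
    return {b: [math.log2(p / background) if p > 0 else float("-inf") for p in ppm[b]]
            for b in BASES}

# Toy sequences, each centered on a CpG; not real data.
seqs = ["ACGT", "ACGA", "TCGT", "ACGT"]
pfm = position_frequency_matrix(seqs)
ppm = position_probability_matrix(pfm)
pwm = position_weight_matrix(ppm)
```

With a zero pseudocount, a base that never occurs at a position gets a weight of negative infinity; a small positive pseudocount avoids this.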

Notes

  • The input BED file must have strand information.

8.2. Options

--version

show program’s version number and exit

-h, --help

show this help message and exit

-i INPUT_FILE, --input_file=INPUT_FILE

BED file specifying the C position. This BED file should have at least six columns (Chrom, ChromStart, ChromEnd, name, score, strand). Note: correct strand information must be provided. This file can be a regular text file or a compressed file (.gz, .bz2).

-r GENOME_FILE, --refgenome=GENOME_FILE

Reference genome sequences in FASTA format. The file must be indexed using the samtools “faidx” command.

-e EXTEND_SIZE, --extend=EXTEND_SIZE

Number of bases to extend upstream and downstream of the C position. default=5 (bp)

-n MOTIF_NAME, --name=MOTIF_NAME

Motif name. default=motif

-o OUT_FILE, --output=OUT_FILE

The prefix of the output file.
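The sketch below illustrates why strand information matters and what the extension size controls. It is not the program's implementation: the helper names and the toy in-memory genome are hypothetical, and it assumes that ChromStart is the 0-based position of the C. For a minus-strand record the extracted window is reverse-complemented, so that the C sits at the center of every sequence.

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def parse_bed_line(line):
    """Read the coordinates and strand from a BED line with at least six columns."""
    chrom, start, end, name, score, strand = line.rstrip("\n").split("\t")[:6]
    return chrom, int(start), int(end), strand

def extract_context(genome, chrom, start, strand, extend=5):
    """Return the C plus `extend` bases on each side, oriented by strand.

    `genome` is a plain dict of chromosome name -> sequence (a stand-in for
    an indexed FASTA file). Returns None if the window runs off the sequence.
    """
    left = start - extend
    right = start + 1 + extend
    if left < 0 or right > len(genome[chrom]):
        return None
    seq = genome[chrom][left:right].upper()
    return reverse_complement(seq) if strand == "-" else seq

# Toy genome and BED records; not real data.
genome = {"chrT": "TTACGATCCGGA"}
plus_record = parse_bed_line("chrT\t3\t4\tcg1\t0\t+")
minus_record = parse_bed_line("chrT\t9\t10\tcg2\t0\t-")

plus_seq = extract_context(genome, plus_record[0], plus_record[1], plus_record[3], extend=2)
minus_seq = extract_context(genome, minus_record[0], minus_record[1], minus_record[3], extend=2)
```

Each extracted sequence has length 2 × extend + 1, so the default of 5 gives 11-base windows under these assumptions.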

8.3. Input files (examples)

8.4. Command

$CpG_logo.py -i 450_CH.hg19.bed.gz -r hg19.fa -o 450_CH

8.5. Output files

  • 450_CH.logo.fa

  • 450_CH.logo.jaspar

  • 450_CH.logo.meme

  • 450_CH.logo.pfm

  • 450_CH.logo.ppm

  • 450_CH.logo.pwm

  • 450_CH.logo.logo.pdf

Motif logo image: ../_images/450_CH.logo.png