23. dmc_Bayes.py

23.1. Description

Different from statistical testing, this program tries to estimates “how different the means between the two groups are” using the Bayesian approach. An MCMC is used to estimate the “means”, “difference of means”, “95% HDI (highest posterior density interval)”, and the posterior probability that the HDI does NOT include “0”.

It is similar to John Kruschke’s BEST algorithm (Bayesian Estimation Supersedes T test)

Notes

This program is much slower than T-test due to MCMC (Markov chain Monte Carlo) step. Running it with multiple threads is highly recommended.

23.2. Options

--version

show program’s version number and exit

-h, --help

show this help message and exit

-i INPUT_FILE, --input_file=INPUT_FILE

Data file containing beta values with the 1st row containing sample IDs (must be unique) and the 1st column containing CpG positions or probe IDs (must be unique). Except for the 1st row and 1st column, any non-numerical values will be considered as “missing values” and ignored. This file can be a regular text file or compressed file (.gz, .bz2).

-g GROUP_FILE, --group=GROUP_FILE

Group file defining the biological group of each sample. It is a comma-separated 2 columns file with the 1st column containing sample IDs, and the 2nd column containing group IDs. It must have a header row. Sample IDs should match to the “Data file”. Note: Only for two group comparison.

-n N_ITER, --niter=N_ITER

Iteration times when using MCMC Metropolis-Hastings’s agorithm to draw samples from the posterior distribution. default=5000

-b N_BURN, --burnin=N_BURN

Number of simulated samples to discard. Thes initial samples are usually not completely valid because the Markov Chain has not stabilized to the stationary distribution. default=500.

-p N_PROCESS, --processor=N_PROCESS

The number of processes. default=1

-s SEED, --seed=SEED

The seed used by the random number generator. default=99

-o OUT_FILE, --output=OUT_FILE

The prefix of the output file.

23.3. Input files (examples)

23.4. Command

$  dmc_Bayes.py -i test_05_TwoGroup.tsv.gz -g test_05_TwoGroup.grp.csv.gz -p 10 -o dmc_output

23.5. Output files

dmc_output.bayes.tsv: this file consists of 6 columns:

ID : CpG ID

mu1 : Mean methylation level estimated from group1

mu2 : Mean methylation level estimated from gropu2

mu_diff : Difference between mu1 and mu2

mu_diff (95% HDI) : 95% of “High Density Interval” of mu_diff. The HDI indicates which points of distribution are most credible. This interval spans 95% of mu_diff’s distribution.

The probability that mu1 and mu2 are different.

$head -10 dmc_output.bayes.tsv

ID     mu1     mu2     mu_diff mu_diff (95% HDI)       Probability
cg00001099     0.775209        0.795404        -0.020196       (-0.065148,0.023974)    0.811024
cg00000363     0.610565        0.469523        0.141042        (0.030769,0.232965)     0.994665
cg00000884     0.845973        0.873761        -0.027787       (-0.051976,-0.004398)   0.984882
cg00000714     0.190868        0.199233        -0.008365       (-0.030071,0.014006)    0.816141
cg00000957     0.772905        0.827528        -0.054623       (-0.092116,-0.016465)   0.995327
cg00000292     0.748394        0.766326        -0.017932       (-0.051286,0.012583)    0.889729
cg00000807     0.729162        0.683732        0.045430        (-0.001523,0.086588)    0.981551
cg00000721     0.935903        0.935080        0.000823        (-0.013210,0.018628)    0.508686
cg00000948     0.898609        0.897536        0.001073        (-0.020663,0.026813)    0.518238

mu1 and mu2 can be considered as significantly different if the 95% HDI does NOT include zero.