21. dmc_Bayes.py

21.1. Description

Different from statistical testing, this program tries to estimates “how different the means between the two groups are” using the Bayesian approach. An MCMC is used to estimate the “means”, “difference of means”, “95% HDI (highest posterior density interval)”, and the posterior probability that the HDI does NOT include “0”.

It is similar to John Kruschke’s BEST algorithm (Bayesian Estimation Supersedes T test)

Notes

  • This program is much slower than T-test due to MCMC (Markov chain Monte Carlo) step. Running it with multiple threads is highly recommended.

21.2. Options

--version show program’s version number and exit
-h, --help show this help message and exit
-i INPUT_FILE, --input_file=INPUT_FILE
 Data file containing beta values with the 1st row containing sample IDs (must be unique) and the 1st column containing CpG positions or probe IDs (must be unique). Except for the 1st row and 1st column, any non-numerical values will be considered as “missing values” and ignored. This file can be a regular text file or compressed file (.gz, .bz2).
-g GROUP_FILE, --group=GROUP_FILE
 Group file defining the biological group of each sample. It is a comma-separated 2 columns file with the 1st column containing sample IDs, and the 2nd column containing group IDs. It must have a header row. Sample IDs should match to the “Data file”. Note: Only for two group comparison.
-n N_ITER, --niter=N_ITER
 Iteration times when using MCMC Metropolis-Hastings’s agorithm to draw samples from the posterior distribution. default=5000
-b N_BURN, --burnin=N_BURN
 Number of simulated samples to discard. Thes initial samples are usually not completely valid because the Markov Chain has not stabilized to the stationary distribution. default=500.
-p N_PROCESS, --processor=N_PROCESS
 The number of processes. default=1
-s SEED, --seed=SEED
 The seed used by the random number generator. default=99
-o OUT_FILE, --output=OUT_FILE
 The prefix of the output file.

21.4. Command

$  dmc_Bayes.py -i test_05_TwoGroup.tsv.gz -g test_05_TwoGroup.grp.csv.gz -p 10 -o dmc_output

21.5. Output files

  • dmc_output.bayes.tsv: this file consists of 6 columns:
  1. ID : CpG ID
  2. mu1 : Mean methylation level estimated from group1
  3. mu2 : Mean methylation level estimated from gropu2
  4. mu_diff : Difference between mu1 and mu2
  5. mu_diff (95% HDI) : 95% of “High Density Interval” of mu_diff. The HDI indicates which points of distribution are most credible. This interval spans 95% of mu_diff’s distribution.
  6. The probability that mu1 and mu2 are different.
$head -10 dmc_output.bayes.tsv

ID     mu1     mu2     mu_diff mu_diff (95% HDI)       Probability
cg00001099     0.775209        0.795404        -0.020196       (-0.065148,0.023974)    0.811024
cg00000363     0.610565        0.469523        0.141042        (0.030769,0.232965)     0.994665
cg00000884     0.845973        0.873761        -0.027787       (-0.051976,-0.004398)   0.984882
cg00000714     0.190868        0.199233        -0.008365       (-0.030071,0.014006)    0.816141
cg00000957     0.772905        0.827528        -0.054623       (-0.092116,-0.016465)   0.995327
cg00000292     0.748394        0.766326        -0.017932       (-0.051286,0.012583)    0.889729
cg00000807     0.729162        0.683732        0.045430        (-0.001523,0.086588)    0.981551
cg00000721     0.935903        0.935080        0.000823        (-0.013210,0.018628)    0.508686
cg00000948     0.898609        0.897536        0.001073        (-0.020663,0.026813)    0.518238

mu1 and mu2 can be considered as significantly different if the 95% HDI does NOT include zero.