21. beta_trichotmize.py
21.1. Description
Rather than using a hard threshold to call “methylated” or “unmethylated” CpGs or regions, this program uses a probability approach (Bayesian Gaussian Mixture model) to trichotmize beta values into three status:
- Un-methylatedlabeled as “0” in the result file
Both the homologous chromosomes (i.e. The maternal and paternal chromosomes) are unmethylated.
- Semi-methylatedlabeled as “1” in the result file
Only one of the homologous chromosomes is methylated. This is also called allele-specific methylation or imprinting. Note: semi-methylation here is different from hemimethylation, which refers to “one of two (complementary) strands is methylated”.
- Full-methylatedlabeled as “2” in the result file
Both the homologous chromosomes (i.e., The maternal and paternal chromosomes) are methylated.
- unassignedlabeled as “-1” in the result file
CpGs failed to assigned to the three categories above.
21.2. Algorithm
As described above, in somatic cells, most CpGs can be grouped into 3 categories including “Un-methylated”, “Semi-methylated (imprinted)” and “Full-methylated”. Therefore, the Beta distribution of CpGs can be considered as the mixture of 3 Gaussian distributions (i.e. components). beta_trichotmize.py first estimates the parameters (mu1, mu2, mu3) and (s1, s2, s3) of the 3 components using expectation–maximization (EM) algorithm, then it calculates the posterior probabilities ( p0, p1, and p2) of each component given the beta value of a CpG.
- p0
the probability that the CpG belongs to un-methylated component.
- p1
the probability that the CpG belongs to semi-methylated component.
- p2
the probability that the CpG belongs to full-methylated component.
The classification will be made using rules:
if p0 == max(p0, p1, p2):
un-methylated
elif p2 == max(p0, p1, p2):
full-methylated
elif p1 == max(p0, p1, p2):
if p1 >= prob_cutoff:
semi-methylated
else:
unknown/unassigned
21.3. Options
- --version
show program’s version number and exit
- -h, --help
show this help message and exit
- -i INPUT_FILE, --input_file=INPUT_FILE
Input plain text file containing beta values with the 1st row containing sample IDs (must be unique) and the 1st column containing probe IDs (must be unique).
- -c PROB_CUTOFF, --prob-cut=PROB_CUTOFF
Probability cutoff to assign a probe into “semi- methylated” class. default=0.99
- -r, --report
If True, generates “summary_report.txt” file. default=False
- -s RANDOM_STATE, --seed=RANDOM_STATE
The seed used by the random number generator. default=99
21.4. Input files (examples)
21.5. Command
$beta_trichotmize.py -i test_05_TwoGroup.tsv -r
21.6. Output files
.results.txt for each sample
summary_report.txt
$ head CirrHCV_01.results.txt
#Prob_of_0: Probability of CpG belonging to un-methylation group
#Prob_of_1: Probability of CpG belonging to semi-methylation group
#Prob_of_2: Probability of CpG belonging to full-methylation group
#Assigned_lable: -1 = 'unassigned', 0 = 'un-methylation', 1 = 'semi-methylation', 2 = 'full-methylation'
Probe_ID Beta_value Prob_of_1 Prob_of_0 Prob_of_2 Assigned_lable
cg00000109 0.8776539440000001 0.05562534330044164 3.673659573888142e-93 0.9443746566995583 2
cg00000165 0.239308082 0.999222373166152 0.0007776268338481155 1.3380168478281785e-21 1
cg00000236 0.8951333909999999 0.052142920095512614 3.5462722261710256e-97 0.9478570799044873 2
cg00000292 0.783661275 0.22215555206863843 1.46921724055509e-72 0.7778444479313614 2
cg00000321 0.319783971 0.9999999909047641 9.09523558157906e-09 1.4703488768311725e-16 1
$ cat summary_report.txt
#means of components
Subject_ID Unmethyl SemiMethyl Methyl
CirrHCV_01 0.0705891104729628 0.4949428535816466 0.8694861885234295
CirrHCV_02 0.06775600800214297 0.5018649959502874 0.8731195740516192
CirrHCV_03 0.07063205540113326 0.49795240946021674 0.8730234341971185
...
#Weights of components
Subject_ID Unmethyl SemiMethyl Methyl
CirrHCV_01 0.27231055290074735 0.35186129618859385 0.3758281509106588
CirrHCV_02 0.2623073658620772 0.36736674559925425 0.37032588853866855
CirrHCV_03 0.2659211619015646 0.3563058727320757 0.37777296536635974
...
#Converge status and n_iter
Subject_ID Converged n_iter
CirrHCV_01 True 35
CirrHCV_02 True 37
CirrHCV_03 True 34
Below histogram and piechart showed the proportion of CpGs assigned to “Un-methylated”, “Semi-methylated” and “Full-methylated”.