12. beta_combat.py
12.1. Overview
beta_combat.py takes a CpG × sample beta matrix and a sample→batch mapping, applies
ComBat, and writes an adjusted matrix plus before/after QC boxplots. If the input contains missing values, KNN imputation is performed prior to batch correction.
12.2. Input files
12.2.1. Beta matrix (TSV or TSV.GZ)
Delimiter: tab
Header: first row contains sample IDs
Index: first column contains CpG IDs
Values: beta values in [0, 1]
Missing values: if the file contains missing values, KNN will be used for imputation.
Example:
CpG_ID Sample_01 Sample_02 Sample_03 Sample_04
cg_001 0.831035 0.878022 0.794427 0.880911
cg_002 0.249544 0.209949 0.234294 0.236680
cg_003 0.845065 0.843957 0.840184 0.824286
...
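As a rough illustration of the layout described above, the sketch below parses a small tab-separated beta matrix using only the standard library, checks that observed values lie in [0, 1], and counts missing cells. The helper name read_beta_matrix and the set of tokens treated as missing ("", "NA", "NaN", "nan") are assumptions made for this example; this section does not state how beta_combat.py encodes or detects missing values.

```python
import csv
import io
import math

# Toy stand-in for a beta matrix file. A real file could be opened with
# open(path) or, for .tsv.gz, gzip.open(path, "rt").
TOY_TSV = (
    "CpG_ID\tSample_01\tSample_02\tSample_03\n"
    "cg_001\t0.831035\t0.878022\t0.794427\n"
    "cg_002\t0.249544\tNA\t0.234294\n"
)

def read_beta_matrix(handle, na_tokens=("", "NA", "NaN", "nan")):
    """Return (sample_ids, {cpg_id: [beta values, NaN where missing]}).

    na_tokens is an assumption for this sketch, not documented behavior.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[1:]          # first row: sample IDs
    rows = {}
    for rec in reader:
        cpg, cells = rec[0], rec[1:]   # first column: CpG ID
        if len(cells) != len(samples):
            raise ValueError(f"{cpg}: expected {len(samples)} values, got {len(cells)}")
        values = [math.nan if c in na_tokens else float(c) for c in cells]
        for v in values:
            if not math.isnan(v) and not 0.0 <= v <= 1.0:
                raise ValueError(f"{cpg}: beta value {v} outside [0, 1]")
        rows[cpg] = values
    return samples, rows

samples, betas = read_beta_matrix(io.StringIO(TOY_TSV))
n_missing = sum(math.isnan(v) for vals in betas.values() for v in vals)
print(samples, n_missing)
```

Running a check like this before batch correction makes it clear up front whether the KNN imputation step will be triggered.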
12.2.2. Batch map (CSV)
Delimiter: comma
Columns: Sample,Group
Sample IDs: must match the beta-matrix header exactly (case-sensitive)
Grouping: each sample belongs to exactly one batch (e.g., plate, chip)
Example:
Sample,Group
Sample_01,plate_1
Sample_02,plate_1
Sample_03,plate_2
Sample_04,plate_2
...
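Because sample IDs must match the beta-matrix header exactly, a quick pre-flight comparison can catch mismatches such as differing case. The following is a minimal sketch using only the standard library; the function name check_batch_map and the toy data are hypothetical and are not part of beta_combat.py.

```python
import csv
import io

def check_batch_map(matrix_samples, batch_csv_handle):
    """Compare beta-matrix sample IDs against a Sample,Group CSV.

    Returns (batch_of, missing, extra, duplicates):
      missing    - matrix samples with no row in the batch map
      extra      - batch-map samples absent from the matrix header
      duplicates - samples listed more than once in the batch map
    """
    batch_of = {}
    duplicates = []
    for row in csv.DictReader(batch_csv_handle):
        sample = row["Sample"]
        if sample in batch_of:
            duplicates.append(sample)
        batch_of[sample] = row["Group"]
    known = set(matrix_samples)
    missing = [s for s in matrix_samples if s not in batch_of]
    extra = [s for s in batch_of if s not in known]
    return batch_of, missing, extra, duplicates

# Toy data: "sample_04" differs from "Sample_04" only by case,
# so the exact (case-sensitive) comparison reports both.
TOY_MAP = (
    "Sample,Group\n"
    "Sample_01,plate_1\n"
    "Sample_02,plate_1\n"
    "Sample_03,plate_2\n"
    "sample_04,plate_2\n"
)
header_samples = ["Sample_01", "Sample_02", "Sample_03", "Sample_04"]
batch_of, missing, extra, duplicates = check_batch_map(
    header_samples, io.StringIO(TOY_MAP)
)
print("missing:", missing, "extra:", extra, "duplicates:", duplicates)
```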
12.2.3. Example input files
test_12_threebatch.beta.100K_NAs.tsv.gz (with 100,000 missing values)
12.3. Options
--version show program's version number and exit
-h, --help show this help message and exit
-i INPUT_FILE, --input_file=INPUT_FILE
Tab-separated data frame with the 1st row as sample IDs
and the 1st column as CpG IDs. Accepts .tsv or .tsv.gz.
-k N_NEIGHBORS Number of neighbors to use for imputation. default=3
--axis=AXIS_CHOICE For KNN imputation when the input has missing values:
1: search columns to find the nearest neighbors.
0: search rows to find the nearest neighbors.
default=1
-g GROUP_FILE, --group=GROUP_FILE
Comma-separated file mapping samples to batch groups.
-o OUT_FILE, --output=OUT_FILE
Output prefix for all generated files.
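To give a feel for what -k and --axis control, here is a small pure-Python sketch of KNN imputation. It is not the code used by beta_combat.py. It assumes that --axis=1 means the neighbors are other samples (columns), measures distance only over CpGs observed in both samples, and fills a missing cell with the mean of the k nearest samples that have a value at that CpG. The function name knn_impute_columns is hypothetical.

```python
import math

def knn_impute_columns(matrix, k=3):
    """Impute NaN cells from the k nearest columns (samples).

    matrix: list of rows (CpGs), each a list of floats, NaN = missing.
    Returns a new matrix; the input is left unchanged.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    cols = [[matrix[r][c] for r in range(n_rows)] for c in range(n_cols)]

    def distance(a, b):
        # Root mean squared difference over CpGs observed in both samples.
        shared = [(x - y) ** 2 for x, y in zip(a, b)
                  if not (math.isnan(x) or math.isnan(y))]
        return math.sqrt(sum(shared) / len(shared)) if shared else math.inf

    filled = [row[:] for row in matrix]
    for c in range(n_cols):
        # Rank all other samples by distance to sample c.
        ranked = sorted((distance(cols[c], cols[o]), o)
                        for o in range(n_cols) if o != c)
        for r in range(n_rows):
            if math.isnan(matrix[r][c]):
                # Take the k nearest samples that have a value at this CpG.
                donors = [matrix[r][o] for _, o in ranked
                          if not math.isnan(matrix[r][o])][:k]
                if donors:
                    filled[r][c] = sum(donors) / len(donors)
    return filled

toy = [
    [0.80, 0.82, 0.20],
    [0.30, 0.32, 0.90],
    [math.nan, 0.52, 0.10],
]
filled_k1 = knn_impute_columns(toy, k=1)
filled_k2 = knn_impute_columns(toy, k=2)
print(filled_k1[2][0], filled_k2[2][0])
```

Under the same assumption, searching rows instead (the --axis=0 case) would amount to transposing the matrix before and after the call. A larger k averages over more neighbors, which smooths the imputed values but can pull them toward less similar samples.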
12.4. Command examples
12.4.1. No missing values
$ beta_combat.py \
-i test_12_threebatch.beta.tsv.gz \
-g test_12_threebatch.batch.csv \
-o output
12.4.2. With missing values (KNN imputation applied first)
$ beta_combat.py \
-i test_12_threebatch.beta.100K_NAs.tsv.gz \
-g test_12_threebatch.batch.csv \
-o output
12.5. Outputs
12.5.1. Input without missing values
<prefix>.combat.tsv: beta matrix after ComBat batch correction
<prefix>.boxplot.png: distribution before batch correction
<prefix>.boxplot_combat.png: distribution after batch correction
12.5.2. Input with missing values
<prefix>.combat.tsv: beta matrix after ComBat (missing values imputed via KNN)
<prefix>.combat_withNAs.tsv: beta matrix after ComBat without imputation (original NAs retained)
<prefix>.boxplot.png: distribution before batch correction
<prefix>.boxplot_combat.png: distribution after batch correction
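One way to picture how the two corrected matrices relate is as a masking step: keep each adjusted value where the original input was observed, and put a missing value back where it was not. The sketch below illustrates that idea on small in-memory lists. It is an assumption about the relationship between the files, not a description of how beta_combat.py produces them, and the function name restore_missing is hypothetical.

```python
import math

def restore_missing(adjusted, original):
    """Return a copy of `adjusted` with NaN wherever `original` was NaN.

    Both arguments are lists of rows with identical shape.
    """
    return [
        [math.nan if math.isnan(o) else a for a, o in zip(adj_row, orig_row)]
        for adj_row, orig_row in zip(adjusted, original)
    ]

# Toy values: `original` has one missing cell; `adjusted` stands in for a
# fully imputed, batch-corrected matrix.
original = [[0.40, math.nan], [0.90, 0.80]]
adjusted = [[0.50, 0.60], [0.70, 0.80]]
with_nas = restore_missing(adjusted, original)
print(with_nas)
```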
12.6. Figures
12.7. Notes & tips
Ensure every sample ID in the beta matrix appears exactly once in the batch map.
Batch labels in Group can be any strings (e.g., plate_1, chip_B) as long as they consistently identify batches.
If biological covariates need to be adjusted for, incorporate them upstream (this wrapper performs basic ComBat only).
12.8. Reference
Johnson, W.E., Li, C., & Rabinovic, A. (2007). Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics, 8(1), 118–127. PMID: 16632515.