12. beta_combat.py

12.1. Overview

beta_combat.py takes a CpG × sample beta matrix and a sample→batch mapping, applies ComBat, and writes an adjusted matrix plus before/after QC boxplots. If the input contains missing values, KNN imputation is performed prior to batch correction.
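
To give a feel for what batch correction does to a single CpG, the sketch below applies a plain per-batch location/scale adjustment. This is a deliberately simplified stand-in, not ComBat itself: ComBat additionally shrinks the per-batch estimates with empirical Bayes across CpGs (Johnson et al., 2007), and nothing here is taken from beta_combat.py. The function name location_scale_adjust is hypothetical.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean


def location_scale_adjust(values, batches):
    """Shift and rescale one CpG's values so every batch shares the pooled mean and SD.

    Simplified illustration only: no empirical Bayes shrinkage, no covariates.
    """
    by_batch = defaultdict(list)
    for value, batch in zip(values, batches):
        by_batch[batch].append(value)

    grand_mean = mean(values)
    batch_mean = {b: mean(v) for b, v in by_batch.items()}
    # Population SD within each batch.
    batch_sd = {
        b: sqrt(mean([(x - batch_mean[b]) ** 2 for x in v]))
        for b, v in by_batch.items()
    }
    # Pooled within-batch SD across all samples.
    pooled_sd = sqrt(
        mean([(x - batch_mean[b]) ** 2 for x, b in zip(values, batches)])
    )

    adjusted = []
    for x, b in zip(values, batches):
        # Leave the spread untouched if a batch has zero variance.
        scale = pooled_sd / batch_sd[b] if batch_sd[b] > 0 else 1.0
        adjusted.append(grand_mean + (x - batch_mean[b]) * scale)
    return adjusted


values = [0.80, 0.82, 0.70, 0.72]
batches = ["plate_1", "plate_1", "plate_2", "plate_2"]
adjusted = location_scale_adjust(values, batches)
print([round(x, 3) for x in adjusted])
```

After the adjustment both plates have the same mean for this CpG, which is the effect the before/after boxplots are meant to show across all CpGs.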

12.2. Input files

12.2.1. Beta matrix (TSV or TSV.GZ)

  • Delimiter: tab

  • Header: first row contains sample IDs

  • Index: first column contains CpG IDs

  • Values: beta values in [0, 1]

  • Missing values: allowed. They are imputed with KNN before batch correction (see -k and --axis under Options).

Example:

CpG_ID   Sample_01  Sample_02  Sample_03  Sample_04
cg_001   0.831035   0.878022   0.794427   0.880911
cg_002   0.249544   0.209949   0.234294   0.236680
cg_003   0.845065   0.843957   0.840184   0.824286
...
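
Before running the script it can help to confirm that a file matches the layout above. The following is a minimal sketch using only the standard library; the helper names and the set of tokens treated as missing (empty cell, NA, NaN) are assumptions, not the script's documented behavior.

```python
import csv
import gzip
import io

# Assumption: these tokens mark a missing value in the input file.
MISSING_TOKENS = {"", "NA", "NaN", "nan"}


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt", newline="")
    return open(path, "r", newline="")


def check_beta_matrix(handle):
    """Return (sample_ids, n_cpgs, n_missing, out_of_range) for a beta-matrix TSV."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    sample_ids = header[1:]  # the first column holds the CpG IDs
    n_cpgs = 0
    n_missing = 0
    out_of_range = []  # (cpg_id, sample_id, value) for values outside [0, 1]
    for row in reader:
        if not row:
            continue
        n_cpgs += 1
        cpg_id, cells = row[0], row[1:]
        if len(cells) != len(sample_ids):
            raise ValueError(
                f"{cpg_id}: expected {len(sample_ids)} values, got {len(cells)}"
            )
        for sample_id, cell in zip(sample_ids, cells):
            if cell.strip() in MISSING_TOKENS:
                n_missing += 1
                continue
            value = float(cell)
            if not 0.0 <= value <= 1.0:
                out_of_range.append((cpg_id, sample_id, value))
    return sample_ids, n_cpgs, n_missing, out_of_range


# Small in-memory example with one missing value and one invalid beta value.
demo = io.StringIO(
    "CpG_ID\tSample_01\tSample_02\n"
    "cg_001\t0.831035\t0.878022\n"
    "cg_002\tNA\t1.209949\n"
)
samples, n_cpgs, n_missing, bad = check_beta_matrix(demo)
print(samples, n_cpgs, n_missing, bad)
```

For a real file, pass the handle returned by open_text(path) instead of the in-memory example.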

12.2.2. Batch map (CSV)

  • Delimiter: comma

  • Columns: Sample,Group

  • Sample IDs: must match the beta-matrix header exactly (case-sensitive)

  • Grouping: each sample belongs to exactly one batch (e.g., plate, chip)

Example:

Sample,Group
Sample_01,plate_1
Sample_02,plate_1
Sample_03,plate_2
Sample_04,plate_2
...
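
Because sample IDs must match the beta-matrix header exactly, a quick cross-check of the two files can catch mismatches early. The sketch below is one way to do that with the standard library; check_batch_map and the keys of its report are hypothetical names chosen for illustration.

```python
import csv
import io
from collections import Counter


def check_batch_map(matrix_samples, batch_handle):
    """Compare batch-map sample IDs with the beta-matrix header (case-sensitive)."""
    reader = csv.DictReader(batch_handle)
    if reader.fieldnames is None or not {"Sample", "Group"} <= set(reader.fieldnames):
        raise ValueError("batch map needs 'Sample' and 'Group' columns")
    counts = Counter()
    batch_of = {}
    for row in reader:
        counts[row["Sample"]] += 1
        batch_of[row["Sample"]] = row["Group"]
    return {
        # Samples listed more than once in the batch map.
        "duplicated": sorted(s for s, n in counts.items() if n > 1),
        # Matrix samples that have no batch assignment.
        "missing_from_map": sorted(set(matrix_samples) - set(counts)),
        # Batch-map entries that do not occur in the matrix header.
        "not_in_matrix": sorted(set(counts) - set(matrix_samples)),
        # Number of distinct samples per batch label.
        "batch_sizes": dict(Counter(batch_of.values())),
    }


matrix_samples = ["Sample_01", "Sample_02", "Sample_03", "Sample_04"]
batch_csv = io.StringIO(
    "Sample,Group\n"
    "Sample_01,plate_1\n"
    "Sample_02,plate_1\n"
    "Sample_03,plate_2\n"
    "sample_04,plate_2\n"  # lower-case 's': does not match Sample_04
)
report = check_batch_map(matrix_samples, batch_csv)
print(report)
```

In this example the lower-case ID is reported both as a matrix sample without a batch and as a batch-map entry without a matrix column.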

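12.2.3. Example input files

The command examples in section 12.4 use the following files:

  • test_12_threebatch.beta.tsv.gz

  • test_12_threebatch.beta.100K_NAs.tsv.gz

  • test_12_threebatch.batch.csv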

12.3. Options

--version             show program's version number and exit
-h, --help            show this help message and exit

-i INPUT_FILE, --input_file=INPUT_FILE
                      Tab-separated data frame with the 1st row as sample IDs
                      and the 1st column as CpG IDs. Accepts .tsv or .tsv.gz.

-k N_NEIGHBORS        Number of neighbors to use for imputation. default=3
--axis=AXIS_CHOICE    For KNN imputation when the input has missing values:
                      1: search columns to find the nearest neighbors.
                      0: search rows to find the nearest neighbors.
                      default=1

-g GROUP_FILE, --group=GROUP_FILE
                      Comma-separated file mapping samples to batch groups.

-o OUT_FILE, --output=OUT_FILE
                      Output prefix for all generated files.
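
To illustrate how -k and --axis interact, here is a small pure-Python sketch of KNN imputation on a matrix with rows as CpGs and columns as samples. It is an assumption-laden illustration, not the imputer that beta_combat.py actually calls: the distance (root-mean-square difference over jointly observed positions), the unweighted mean of donors, and the name knn_impute are all choices made for this example.

```python
import math


def _distance(u, v):
    """Root-mean-square difference over positions observed in both vectors."""
    pairs = [(a, b) for a, b in zip(u, v) if a is not None and b is not None]
    if not pairs:
        return math.inf
    return math.sqrt(sum((a - b) ** 2 for a, b in pairs) / len(pairs))


def knn_impute(matrix, k=3, axis=1):
    """Fill None cells with the mean of the k nearest neighbours' values.

    axis=1: neighbours are columns (samples).
    axis=0: neighbours are rows (CpGs).
    """
    if axis == 0:
        # Work on the transpose so that rows take the place of columns.
        transposed = [list(col) for col in zip(*matrix)]
        return [list(col) for col in zip(*knn_impute(transposed, k, axis=1))]

    columns = [list(col) for col in zip(*matrix)]
    filled = [list(row) for row in matrix]  # copy; the input is not modified
    for j, target in enumerate(columns):
        # Rank all other columns by their distance to column j.
        ranked = sorted(
            (c for c in range(len(columns)) if c != j),
            key=lambda c: _distance(target, columns[c]),
        )
        for i, value in enumerate(target):
            if value is not None:
                continue
            # Take the k closest columns that have an observed value in row i.
            donors = [columns[c][i] for c in ranked if columns[c][i] is not None][:k]
            if donors:
                filled[i][j] = sum(donors) / len(donors)
    return filled


matrix = [
    [0.80, 0.82, 0.20, None],
    [0.30, 0.32, 0.70, 0.69],
    [0.50, 0.52, 0.90, 0.91],
]
by_columns = knn_impute(matrix, k=1, axis=1)
by_rows = knn_impute(matrix, k=1, axis=0)
print(by_columns[0][3], by_rows[0][3])
```

With axis=1 the missing cell borrows from the most similar sample (the third column), while with axis=0 it borrows from the most similar CpG (the third row), so the two settings can give quite different imputed values.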

12.4. Command examples

12.4.1. No missing values

$ beta_combat.py \
    -i test_12_threebatch.beta.tsv.gz \
    -g test_12_threebatch.batch.csv \
    -o output

12.4.2. With missing values (KNN imputation applied first)

$ beta_combat.py \
    -i test_12_threebatch.beta.100K_NAs.tsv.gz \
    -g test_12_threebatch.batch.csv \
    -o output

12.5. Outputs

12.5.1. Input without missing values

  • <prefix>.combat.tsv — beta matrix after ComBat batch correction

  • <prefix>.boxplot.png — distribution before batch correction

  • <prefix>.boxplot_combat.png — distribution after batch correction

12.5.2. Input with missing values

  • <prefix>.combat.tsv — beta matrix after ComBat (missing values imputed via KNN)

  • <prefix>.combat_withNAs.tsv — beta matrix after ComBat without imputation (original NAs retained)

  • <prefix>.boxplot.png — distribution before batch correction

  • <prefix>.boxplot_combat.png — distribution after batch correction
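
In a pipeline, the file lists above can be turned into a simple post-run check. The sketch below builds the expected file names from the output prefix; the function names are hypothetical, and the naming pattern is taken directly from the lists in this section.

```python
from pathlib import Path


def expected_outputs(prefix, has_missing):
    """List the files this section says are written for a given output prefix."""
    names = [f"{prefix}.combat.tsv"]
    if has_missing:
        # Only produced when the input matrix contained missing values.
        names.append(f"{prefix}.combat_withNAs.tsv")
    names += [f"{prefix}.boxplot.png", f"{prefix}.boxplot_combat.png"]
    return [Path(name) for name in names]


def missing_outputs(prefix, has_missing):
    """Return the expected output files that do not exist on disk."""
    return [p for p in expected_outputs(prefix, has_missing) if not p.is_file()]


print([p.name for p in expected_outputs("output", has_missing=True)])
```

Calling missing_outputs("output", has_missing=False) after a run returns an empty list when all three files for the no-missing-values case are present.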

12.6. Figures

  • Boxplot of beta values before ComBat (<prefix>.boxplot.png)

  • Boxplot of beta values after ComBat (<prefix>.boxplot_combat.png)

12.7. Notes & tips

  • Ensure every sample ID in the beta matrix appears exactly once in the batch map.

  • Batch labels in Group can be any strings (e.g., plate_1, chip_B) as long as they consistently identify batches.

  • This wrapper performs basic ComBat only and does not take biological covariates. If covariates need to be adjusted for, handle them in an upstream step.

12.8. Reference

Johnson, W.E., Li, C., & Rabinovic, A. (2007). Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics, 8(1), 118–127. DOI: 10.1093/biostatistics/kxj037. PMID: 16632515.