beta_combat.py
==============

Overview
--------

``beta_combat.py`` takes a CpG × sample beta matrix and a sample→batch mapping,
applies ComBat, and writes an adjusted matrix plus before/after QC boxplots.
If the input contains missing values, KNN imputation is performed prior to
batch correction.

Input files
-----------

Beta matrix (TSV or TSV.GZ)
~~~~~~~~~~~~~~~~~~~~~~~~~~~

- **Delimiter:** tab
- **Header:** first row contains sample IDs
- **Index:** first column contains CpG IDs
- **Values:** beta values in [0, 1]
- **Missing values:** if the file contains missing values, KNN will be used
  for imputation.

Example::

    CpG_ID  Sample_01  Sample_02  Sample_03  Sample_04
    cg_001  0.831035   0.878022   0.794427   0.880911
    cg_002  0.249544   0.209949   0.234294   0.236680
    cg_003  0.845065   0.843957   0.840184   0.824286
    ...

Batch map (CSV)
~~~~~~~~~~~~~~~

- **Delimiter:** comma
- **Columns:** ``Sample,Group``
- **Sample IDs:** must match the beta-matrix header exactly (case-sensitive)
- **Grouping:** each sample belongs to exactly one batch (e.g., plate, chip)

Example::

    Sample,Group
    Sample_01,plate_1
    Sample_02,plate_1
    Sample_03,plate_2
    Sample_04,plate_2
    ...

Example input files
~~~~~~~~~~~~~~~~~~~

- ``test_12_threebatch.beta.tsv.gz``
- ``test_12_threebatch.beta.100K_NAs.tsv.gz`` (with 100,000 missing values)
- ``test_12_threebatch.batch.csv``

Options
-------

::

    --version             show program's version number and exit
    -h, --help            show this help message and exit
    -i INPUT_FILE, --input_file=INPUT_FILE
                          Tab-separated data frame with the 1st row as sample
                          IDs and the 1st column as CpG IDs. Accepts .tsv or
                          .tsv.gz.
    -k N_NEIGHBORS        Number of neighbors to use for imputation. default=3
    --axis=AXIS_CHOICE    For KNN imputation when the input has missing values:
                          1: search columns to find the nearest neighbors.
                          0: search rows to find the nearest neighbors.
                          default=1
    -g GROUP_FILE, --group=GROUP_FILE
                          Comma-separated file mapping samples to batch groups.
    -o OUT_FILE, --output=OUT_FILE
                          Output prefix for all generated files.

Command examples
----------------

No missing values
~~~~~~~~~~~~~~~~~

::

    $ beta_combat.py \
        -i test_12_threebatch.beta.tsv.gz \
        -g test_12_threebatch.batch.csv \
        -o output

With missing values (KNN imputation applied first)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

::

    $ beta_combat.py \
        -i test_12_threebatch.beta.100K_NAs.tsv.gz \
        -g test_12_threebatch.batch.csv \
        -o output

Outputs
-------

Input without missing values
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- ``.combat.tsv``: beta matrix after ComBat batch correction
- ``.boxplot.png``: distribution **before** batch correction
- ``.boxplot_combat.png``: distribution **after** batch correction

Input with missing values
~~~~~~~~~~~~~~~~~~~~~~~~~

- ``.combat.tsv``: beta matrix after ComBat (missing values imputed via KNN)
- ``.combat_withNAs.tsv``: beta matrix after ComBat **without** imputation
  (original NAs retained)
- ``.boxplot.png``: distribution **before** batch correction
- ``.boxplot_combat.png``: distribution **after** batch correction

Figures
-------

.. image:: ../_static/output.boxplot.png
   :height: 400px
   :width: 600px
   :alt: Boxplot of beta values before ComBat

.. image:: ../_static/output.boxplot_combat.png
   :height: 400px
   :width: 600px
   :alt: Boxplot of beta values after ComBat

Notes & tips
------------

- Ensure every sample ID in the beta matrix appears exactly once in the batch
  map.
- Batch labels in ``Group`` can be any strings (e.g., ``plate_1``, ``chip_B``)
  as long as they consistently identify batches.
- If biological covariates need to be adjusted for, incorporate them upstream
  (this wrapper performs basic ComBat only).

Reference
---------

Johnson, W.E., Li, C., & Rabinovic, A. (2007). *Adjusting batch effects in
microarray expression data using empirical Bayes methods.* **Biostatistics**,
8(1), 118–127. DOI: see PubMed 16632515.
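
Example: checking sample IDs
----------------------------

The requirement that every sample ID in the beta matrix appears exactly once
in the batch map can be checked before running the script. The following is a
minimal sketch, not part of ``beta_combat.py``: the function names
``read_beta_samples``, ``read_batch_samples``, and ``check_sample_ids`` are
hypothetical, and the code assumes the file layouts described under
"Input files" (tab-separated matrix with sample IDs in the header row,
comma-separated batch map with a ``Sample`` column).

```python
import csv
import gzip
from collections import Counter


def read_beta_samples(path):
    """Return sample IDs from the header row of a tab-separated beta matrix."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", newline="") as handle:
        header = next(csv.reader(handle, delimiter="\t"))
    return header[1:]  # the first column holds CpG IDs, not a sample


def read_batch_samples(path):
    """Return the Sample column of a comma-separated batch map, in file order."""
    with open(path, newline="") as handle:
        return [row["Sample"] for row in csv.DictReader(handle)]


def check_sample_ids(beta_samples, batch_samples):
    """Compare the two ID lists; IDs are matched exactly (case-sensitive)."""
    counts = Counter(batch_samples)
    beta_set = set(beta_samples)
    return {
        "missing_from_batch_map": sorted(s for s in beta_set if counts[s] == 0),
        "duplicated_in_batch_map": sorted(s for s, n in counts.items() if n > 1),
        "not_in_beta_matrix": sorted(s for s in counts if s not in beta_set),
    }


if __name__ == "__main__":
    # Inline toy data instead of files, so the sketch runs as-is.
    problems = check_sample_ids(
        ["Sample_01", "Sample_02", "Sample_03"],
        ["Sample_01", "Sample_02", "Sample_02", "sample_03"],
    )
    for name, ids in problems.items():
        print(name, ids)
```

With real files, the two lists would come from
``read_beta_samples("test_12_threebatch.beta.tsv.gz")`` and
``read_batch_samples("test_12_threebatch.batch.csv")``. Three empty lists
mean the IDs are consistent. Note that ``sample_03`` is reported as a
mismatch in the toy data because the comparison is case-sensitive.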