beta_combat.py
==============
Overview
--------
``beta_combat.py`` takes a CpG × sample beta matrix and a sample-to-batch mapping, applies
ComBat batch correction (Johnson et al., 2007), and writes the adjusted matrix plus before/after QC boxplots. If the input contains missing values, they are imputed with k-nearest neighbors (KNN) before batch correction.
Input files
-----------
Beta matrix (TSV or TSV.GZ)
~~~~~~~~~~~~~~~~~~~~~~~~~~~
- **Delimiter:** tab
- **Header:** first row contains sample IDs
- **Index:** first column contains CpG IDs
- **Values:** beta values in [0, 1]
- **Missing values:** allowed; if any are present, they are imputed with KNN before batch correction (see ``-k`` and ``--axis`` under Options).
Example::

    CpG_ID    Sample_01    Sample_02    Sample_03    Sample_04
    cg_001    0.831035     0.878022     0.794427     0.880911
    cg_002    0.249544     0.209949     0.234294     0.236680
    cg_003    0.845065     0.843957     0.840184     0.824286
    ...

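
The format rules above can be checked before running the tool. The following is a minimal sketch using only the Python standard library; the function names and the set of tokens treated as missing (``NA``, ``NaN``, empty cell) are assumptions of this example, not something ``beta_combat.py`` documents.

```python
import csv
import gzip
import io
import math


def open_beta_file(path):
    """Open a .tsv or .tsv.gz file in text mode."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt", newline="")
    return open(path, "r", newline="")


def read_beta_matrix(handle):
    """Parse a tab-separated beta matrix from an open text handle.

    Returns (sample_ids, rows); rows maps each CpG ID to a list of floats,
    with missing entries stored as float('nan').
    """
    # Assumption of this sketch: these tokens mark a missing value.
    missing = {"", "NA", "NaN", "nan"}
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    sample_ids = header[1:]  # the first cell labels the CpG ID column
    rows = {}
    for record in reader:
        if not record:
            continue  # skip blank lines
        cpg_id, cells = record[0], record[1:]
        if len(cells) != len(sample_ids):
            raise ValueError(
                f"{cpg_id}: expected {len(sample_ids)} values, got {len(cells)}"
            )
        values = [float("nan") if c in missing else float(c) for c in cells]
        for v in values:
            if not math.isnan(v) and not 0.0 <= v <= 1.0:
                raise ValueError(f"{cpg_id}: beta value {v} outside [0, 1]")
        rows[cpg_id] = values
    return sample_ids, rows


# Small in-memory demonstration; a real file would go through open_beta_file().
demo = (
    "CpG_ID\tSample_01\tSample_02\n"
    "cg_001\t0.831035\t0.878022\n"
    "cg_002\tNA\t0.209949\n"
)
samples, betas = read_beta_matrix(io.StringIO(demo))
n_missing = sum(math.isnan(v) for vals in betas.values() for v in vals)
print(samples, len(betas), n_missing)
```

Counting missing entries up front also indicates whether the KNN imputation step will be triggered.
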
Batch map (CSV)
~~~~~~~~~~~~~~~
- **Delimiter:** comma
- **Columns:** ``Sample,Group``
- **Sample IDs:** must match the beta-matrix header exactly (case-sensitive)
- **Grouping:** each sample belongs to exactly one batch (e.g., plate, chip)
Example::

    Sample,Group
    Sample_01,plate_1
    Sample_02,plate_1
    Sample_03,plate_2
    Sample_04,plate_2
    ...

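
Because sample IDs must match the beta-matrix header exactly, a quick consistency check can catch case mismatches and duplicates early. This is a hedged sketch with hypothetical helper names (``read_batch_map``, ``check_batch_map``); it also flags batch-map samples that are absent from the matrix, which is stricter than the rules above require.

```python
import csv
import io
from collections import Counter


def read_batch_map(handle):
    """Return a list of (sample_id, batch) pairs from a ``Sample,Group`` CSV."""
    reader = csv.DictReader(handle)
    if not reader.fieldnames or not {"Sample", "Group"} <= set(reader.fieldnames):
        raise ValueError("batch map must have 'Sample' and 'Group' columns")
    return [(row["Sample"], row["Group"]) for row in reader]


def check_batch_map(sample_ids, pairs):
    """Return a list of problems; an empty list means the files are consistent."""
    counts = Counter(sample for sample, _ in pairs)
    problems = []
    for sample in sample_ids:
        if counts[sample] == 0:
            problems.append(f"{sample}: missing from batch map")
        elif counts[sample] > 1:
            problems.append(f"{sample}: listed {counts[sample]} times")
    known = set(sample_ids)
    for sample in counts:
        if sample not in known:
            # Extra check of this sketch, not a documented requirement.
            problems.append(f"{sample}: not in beta-matrix header")
    return problems


header_samples = ["Sample_01", "Sample_02", "Sample_03"]
batch_csv = "Sample,Group\nSample_01,plate_1\nSample_02,plate_1\nsample_03,plate_2\n"
pairs = read_batch_map(io.StringIO(batch_csv))
problems = check_batch_map(header_samples, pairs)
for line in problems:
    print(line)
```

In this demonstration the lowercase ``sample_03`` is reported twice: once as missing for ``Sample_03`` and once as an unknown ID, which is the typical signature of a case mismatch.
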
Example input files
~~~~~~~~~~~~~~~~~~~
- ``test_12_threebatch.beta.tsv.gz``
- ``test_12_threebatch.beta.100K_NAs.tsv.gz`` (with 100,000 missing values)
- ``test_12_threebatch.batch.csv``
Options
-------
::

    --version             show program's version number and exit
    -h, --help            show this help message and exit
    -i INPUT_FILE, --input_file=INPUT_FILE
                          Tab-separated data frame with the 1st row as sample IDs
                          and the 1st column as CpG IDs. Accepts .tsv or .tsv.gz.
    -k N_NEIGHBORS        Number of neighbors to use for imputation. default=3
    --axis=AXIS_CHOICE    For KNN imputation when the input has missing values:
                          1: search columns to find the nearest neighbors.
                          0: search rows to find the nearest neighbors.
                          default=1
    -g GROUP_FILE, --group=GROUP_FILE
                          Comma-separated file mapping samples to batch groups.
    -o OUT_FILE, --output=OUT_FILE
                          Output prefix for all generated files.

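
To illustrate what the ``-k`` and ``--axis`` options control, here is a minimal pure-Python sketch of KNN imputation. It is not the tool's implementation (this page does not say which imputer the tool uses): the distance metric, tie handling, and function names are assumptions. The sketch searches for neighbors among rows; transposing the matrix first searches among columns instead, and no claim is made here about which orientation corresponds to which ``--axis`` value.

```python
import math


def knn_impute_rows(matrix, k=3):
    """Fill NaNs in each row with the mean of the k most similar rows.

    The distance between two rows is the mean squared difference over the
    positions where both rows are observed (an assumption of this sketch).
    """
    def distance(a, b):
        shared = [(x - y) ** 2 for x, y in zip(a, b)
                  if not math.isnan(x) and not math.isnan(y)]
        return sum(shared) / len(shared) if shared else math.inf

    filled = [row[:] for row in matrix]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if not math.isnan(value):
                continue
            # Candidate neighbors: other rows with an observed value in column j.
            candidates = sorted(
                (distance(row, other), other[j])
                for r, other in enumerate(matrix)
                if r != i and not math.isnan(other[j])
            )
            nearest = [v for d, v in candidates[:k] if d != math.inf]
            if nearest:  # leave the cell as NaN if no usable neighbor exists
                filled[i][j] = sum(nearest) / len(nearest)
    return filled


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


nan = float("nan")
beta = [
    [0.80, 0.82, nan],   # CpG with one missing sample
    [0.81, 0.83, 0.90],
    [0.20, 0.22, 0.30],
    [0.79, 0.81, 0.88],
]
by_rows = knn_impute_rows(beta, k=2)                           # neighbors are CpGs
by_cols = transpose(knn_impute_rows(transpose(beta), k=2))     # neighbors are samples
print(by_rows[0][2], by_cols[0][2])
```

The two orientations generally give different values for the same cell, which is why the choice of axis matters when samples and CpGs differ in how similar they are to one another.
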
Command examples
----------------
No missing values
~~~~~~~~~~~~~~~~~
::

    $ beta_combat.py \
        -i test_12_threebatch.beta.tsv.gz \
        -g test_12_threebatch.batch.csv \
        -o output

With missing values (KNN imputation applied first)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
::

    $ beta_combat.py \
        -i test_12_threebatch.beta.100K_NAs.tsv.gz \
        -g test_12_threebatch.batch.csv \
        -o output

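
When the tool is called from a pipeline, the same commands can be assembled programmatically. The sketch below builds the argument list from the options documented above (``-i``, ``-g``, ``-o``, ``-k``, ``--axis``); the helper name ``build_beta_combat_cmd`` is hypothetical, and the command is only printed, not executed.

```python
import shlex


def build_beta_combat_cmd(input_file, group_file, out_prefix,
                          n_neighbors=None, axis=None):
    """Return the argument list for a beta_combat.py run."""
    cmd = ["beta_combat.py", "-i", input_file, "-g", group_file, "-o", out_prefix]
    if n_neighbors is not None:
        cmd += ["-k", str(n_neighbors)]
    if axis is not None:
        if axis not in (0, 1):
            raise ValueError("axis must be 0 or 1")
        cmd.append(f"--axis={axis}")
    return cmd


cmd = build_beta_combat_cmd(
    "test_12_threebatch.beta.100K_NAs.tsv.gz",
    "test_12_threebatch.batch.csv",
    "output",
    n_neighbors=5,
    axis=0,
)
print(shlex.join(cmd))
# To execute: subprocess.run(cmd, check=True)
```

Passing a list to ``subprocess.run`` avoids shell quoting problems with unusual file names.
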
Outputs
-------
Input without missing values
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- ``<prefix>.combat.tsv``: beta matrix after ComBat batch correction
- ``<prefix>.boxplot.png``: distribution **before** batch correction
- ``<prefix>.boxplot_combat.png``: distribution **after** batch correction
Input with missing values
~~~~~~~~~~~~~~~~~~~~~~~~~
- ``<prefix>.combat.tsv``: beta matrix after ComBat (missing values imputed via KNN)
- ``<prefix>.combat_withNAs.tsv``: beta matrix after ComBat **without** imputation (original NAs retained)
- ``<prefix>.boxplot.png``: distribution **before** batch correction
- ``<prefix>.boxplot_combat.png``: distribution **after** batch correction
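
A downstream script can verify that a run produced everything it should. This sketch assumes each output name is the ``-o`` prefix followed directly by the suffix listed above (consistent with ``output.boxplot.png`` in the Figures section); the function names are hypothetical.

```python
from pathlib import Path


def expected_outputs(prefix, has_missing_values):
    """List the output file names expected for a given -o prefix."""
    names = [f"{prefix}.combat.tsv"]
    if has_missing_values:
        names.append(f"{prefix}.combat_withNAs.tsv")
    names += [f"{prefix}.boxplot.png", f"{prefix}.boxplot_combat.png"]
    return names


def missing_outputs(prefix, has_missing_values, directory="."):
    """Return the expected files that do not exist in ``directory``."""
    return [
        name for name in expected_outputs(prefix, has_missing_values)
        if not (Path(directory) / name).exists()
    ]


for name in expected_outputs("output", has_missing_values=True):
    print(name)
```
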
Figures
-------
.. image:: ../_static/output.boxplot.png
   :height: 400px
   :width: 600px
   :alt: Boxplot of beta values before ComBat

.. image:: ../_static/output.boxplot_combat.png
   :height: 400px
   :width: 600px
   :alt: Boxplot of beta values after ComBat

Notes & tips
------------
- Ensure every sample ID in the beta matrix appears exactly once in the batch map.
- Batch labels in ``Group`` can be any strings (e.g., ``plate_1``, ``chip_B``) as long as they consistently identify batches.
- This wrapper performs basic ComBat only; if biological covariates need to be adjusted for, handle them upstream.
Reference
---------
Johnson, W.E., Li, C., & Rabinovic, A. (2007). *Adjusting batch effects in microarray expression
data using empirical Bayes methods.* **Biostatistics**, 8(1), 118–127. PubMed ID: 16632515.