17. beta_selectNBest.py

17.1. Description

Select the K features (CpGs) with the highest scores. Scores can be computed with one of:

ANOVA F-value between the label and each feature, for classification tasks.
Mutual information for a discrete target.
Chi-squared statistic of non-negative features, for classification tasks.

This script is similar to beta_topN.py, but it requires a group file because every score measures the dependency between a feature and the sample labels.
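The idea of scoring each feature against the group labels and keeping the K highest can be sketched as follows. This is a dependency-free illustration of the ANOVA F-value option only, not the script's own implementation. The function names `anova_f` and `select_k_best`, the CpG names, the beta values, and the group labels are all hypothetical.

```python
def anova_f(values, labels):
    """One-way ANOVA F statistic of one feature across the label groups."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    n = len(values)          # number of samples
    k = len(groups)          # number of groups (needs k >= 2 and n > k)
    grand_mean = sum(values) / n
    means = {label: sum(g) / len(g) for label, g in groups.items()}
    # Variation explained by group membership
    ss_between = sum(len(g) * (means[label] - grand_mean) ** 2
                     for label, g in groups.items())
    # Residual variation inside each group
    ss_within = sum((v - means[label]) ** 2
                    for label, g in groups.items() for v in g)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def select_k_best(features, labels, k):
    """Return the names of the k features with the highest F statistic."""
    scored = [(anova_f(values, labels), name) for name, values in features.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:k]]


# Hypothetical beta values: one list per CpG, one value per sample.
features = {
    "cg_A": [0.10, 0.12, 0.11, 0.90, 0.88, 0.91],  # groups clearly separated
    "cg_B": [0.50, 0.52, 0.49, 0.51, 0.50, 0.48],  # no group difference
    "cg_C": [0.30, 0.35, 0.32, 0.45, 0.40, 0.42],  # moderate difference
}
labels = ["ctrl", "ctrl", "ctrl", "case", "case", "case"]  # hypothetical groups

best = select_k_best(features, labels, k=2)
print(best)
```

Here the two CpGs whose values differ between the groups are kept, and the CpG with no group difference is dropped. The chi-squared and mutual information options follow the same pattern with a different scoring function.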

Example of input

CpG_ID  Sample_01       Sample_02       Sample_03       Sample_04
cg_001  0.831035        0.878022        0.794427        0.880911
cg_002  0.249544        0.209949        0.234294        0.236680
cg_003  0.845065        0.843957        0.840184        0.824286
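A table laid out like the example above stores CpGs in rows and samples in columns, whereas feature selection usually expects one row per sample. The following minimal sketch, using only the Python standard library, reads such a tab-separated table and transposes it. The helper name `read_beta_table` is hypothetical, and the sketch assumes every cell holds a numeric value.

```python
import csv
import io

# Toy copy of the example table, written with explicit tab separators.
EXAMPLE = (
    "CpG_ID\tSample_01\tSample_02\tSample_03\tSample_04\n"
    "cg_001\t0.831035\t0.878022\t0.794427\t0.880911\n"
    "cg_002\t0.249544\t0.209949\t0.234294\t0.236680\n"
    "cg_003\t0.845065\t0.843957\t0.840184\t0.824286\n"
)


def read_beta_table(handle):
    """Return (sample_ids, cpg_ids, by_sample); by_sample[i] holds sample i's beta values."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    sample_ids = header[1:]          # 1st row: sample IDs
    cpg_ids = []
    by_cpg = []
    for fields in reader:
        if not fields:               # skip blank lines
            continue
        cpg_ids.append(fields[0])    # 1st column: CpG ID
        by_cpg.append([float(v) for v in fields[1:]])
    # Transpose from CpG-by-sample to sample-by-CpG.
    by_sample = [list(column) for column in zip(*by_cpg)]
    return sample_ids, cpg_ids, by_sample


sample_ids, cpg_ids, by_sample = read_beta_table(io.StringIO(EXAMPLE))
print(sample_ids[0], by_sample[0])
```

After the transpose, each row of `by_sample` is one sample and each column is one CpG, in the order given by `cpg_ids`.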

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -i INPUT_FILE, --input_file=INPUT_FILE
                        Tab-separated data frame file containing beta values
                        with the 1st row containing sample IDs and the 1st
                        column containing CpG IDs.
  -g GROUP_FILE, --group=GROUP_FILE
                        Comma-separated group file defining the biological
                        groups of each sample.
  -c CPG_COUNT, --topK=CPG_COUNT
                        Number of top features to select. default=100
  -s SCORE_FUNCTION, --score-function=SCORE_FUNCTION
                        Scoring function used to measure the dependency
                        between features scores and labels. Must be "chisq"
                        (chi-squared statistic), "anova" (ANOVA F-value), or
                        "mi" (mutual information). default=chisq
  -o OUT_FILE, --output=OUT_FILE
                        The prefix of the output file.

17.2. Input files (examples)

test_05_TwoGroup.tsv.gz
test_05_TwoGroup.grp.csv

17.3. Command

$ beta_selectNBest.py -i test_05_TwoGroup.tsv -g test_05_TwoGroup.grp.csv -o test_05_out

@ 2024-09-05 09:29:38: Reading input file: "test_05_TwoGroup.tsv"
@ 2024-09-05 09:29:38: 0 rows with missing values were removed.
@ 2024-09-05 09:29:38: Transposing data matrix ...
@ 2024-09-05 09:29:38: Total number of features: 10000
@ 2024-09-05 09:29:38: Reading group file: "test_05_TwoGroup.grp.csv"
@ 2024-09-05 09:29:38: Using Chi Square statistic to select features ...
@ 2024-09-05 09:29:39: Total number of selected features : 100
@ 2024-09-05 09:29:39: Writing to file: "test_05_out.selectedFeatures.tsv"

17.4. Output file

test_05_out.selectedFeatures.tsv