.. role:: raw-math(raw)
    :format: latex html

Input File and Data Format
==========================

BED File
--------

The **BED** (Browser Extensible Data) format is commonly used to describe blocks of genomic regions. Each line in a BED file represents one genomic feature and contains between **3 and 12 columns** of data.

The BED format is **0-based** and **half-open**: ``chromStart`` is inclusive and counts from **0** (the first base of a chromosome is numbered 0), while ``chromEnd`` is exclusive. Expressed in 1-based positions, this is the same as a **left-open, right-closed** interval. For example, the BED entry ``chr1 10 15`` corresponds to the **11th to 15th bases** of chromosome 1 (i.e., bases 11–15 inclusive, five bases in total).

**BED Variants**

- **BED12 file**

  The standard BED format containing all 12 fields. Each line represents a gene or a set of disconnected genomic regions. Detailed specifications are available `here `_.

- **BED3 file**

  Contains only the first three required fields: ``chrom``, ``chromStart``, and ``chromEnd``. Each line represents a single genomic region for which *score* and *strand* information is not required.

- **BED3+ file**

  Contains at least three columns (``chrom``, ``chromStart``, ``chromEnd``). Any additional columns will be **ignored**.

- **BED6 file**

  Includes the first six fields: ``chrom``, ``chromStart``, ``chromEnd``, ``name``, ``score``, and ``strand``. Each line represents a single genomic region and may include strand information or associated scores.

- **BED6+ file**

  Contains at least six columns (``chrom``, ``chromStart``, ``chromEnd``, ``name``, ``score``, ``strand``). Any columns beyond these six will be **ignored**.

---

Proportion Values
-----------------

In `bisulfite sequencing `_ (e.g., RRBS or WGBS), the methylation level of a CpG site or region is represented by a **proportion value**.
A proportion value is a pair of integers separated by a comma (``m,n``), where:

- **m** = number of methylated reads (``0 ≤ m ≤ n``)
- **n** = total number of reads (``n ≥ 0``)

For example:

::

    0,10  1,27  2,159    # three hypo-methylated loci
    7,7  17,19  30,34    # three hyper-methylated loci

---

Beta Values
-----------

The **Beta-value** represents the proportion of methylation for a given CpG or locus. It ranges from **0 to 1** and can be interpreted as an approximation of the **percentage of methylation**. A proportion value can be converted to a Beta-value, but **not vice versa**, because the Beta-value does not retain the read counts.

In the equation below:

- **C** = probe intensity or read count of the methylated allele
- **U** = probe intensity or read count of the unmethylated allele

.. math::

    \beta = \frac{C}{U + C}, \quad (0 \leq \beta \leq 1)

---

M Values
--------

The **M-value** represents the log2 ratio of methylated versus unmethylated probe intensities (or read counts). In the equation below:

- **C** = probe intensity or read count of the methylated allele
- **U** = probe intensity or read count of the unmethylated allele
- **w** = offset (pseudo count) added to both numerator and denominator to prevent division by zero and reduce noise in low-coverage regions

.. math::

    M = \log_{2}\left(\frac{C + w}{U + w}\right)

---

Convert Beta-value to M-value or *vice versa*
---------------------------------------------

When the offset **w** is 0, the relationship between **Beta-value** and **M-value** can be expressed as:

.. math::

    \beta = \frac{2^{M}}{2^{M} + 1} \quad ; \quad M = \log_{2}\left(\frac{\beta}{1 - \beta}\right)

The following figure illustrates this relationship:

.. image:: _static/beta_vs_M_curve.png
    :align: center
    :height: 400px
    :width: 400px
    :scale: 80%
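
The formulas above can be sketched in a few lines of Python. This is a minimal illustration, not part of any particular tool: the function names (``parse_proportion``, ``beta_value``, ``m_value``, ``beta_to_m``, ``m_to_beta``) and the default offset ``w=1.0`` are assumptions made for this example. Mapping a proportion value ``m,n`` to ``C = m`` and ``U = n - m`` assumes read counts rather than probe intensities.

.. code-block:: python

    import math


    def parse_proportion(text):
        """Parse a proportion value 'm,n' into a pair of integers (m, n)."""
        m_str, n_str = text.split(",")
        m, n = int(m_str), int(n_str)
        if not 0 <= m <= n:
            raise ValueError(f"expected 0 <= m <= n, got {text!r}")
        return m, n


    def beta_value(c, u):
        """Beta = C / (U + C). Undefined when C + U == 0 (no coverage)."""
        if c + u == 0:
            raise ValueError("Beta-value is undefined when C + U == 0")
        return c / (u + c)


    def m_value(c, u, w=1.0):
        """M = log2((C + w) / (U + w)). The default w is an assumption."""
        return math.log2((c + w) / (u + w))


    def beta_to_m(beta):
        """M = log2(beta / (1 - beta)). Only defined for 0 < beta < 1."""
        if not 0.0 < beta < 1.0:
            raise ValueError("beta must lie strictly between 0 and 1")
        return math.log2(beta / (1.0 - beta))


    def m_to_beta(m):
        """beta = 2^M / (2^M + 1)."""
        return 2.0 ** m / (2.0 ** m + 1.0)


    # Example: the hyper-methylated locus "17,19" from the list above.
    m, n = parse_proportion("17,19")
    c, u = m, n - m                      # methylated and unmethylated reads
    print(beta_value(c, u))              # 17 / 19
    print(m_value(c, u))                 # log2((17 + 1) / (2 + 1))
    print(m_to_beta(beta_to_m(0.8)))     # round trip, close to 0.8

For the locus ``17,19`` this gives a Beta-value of 17/19 (about 0.89) and, with ``w = 1``, an M-value of log2(6) (about 2.6). Note that ``beta_to_m`` rejects Beta-values of exactly 0 or 1, where the logarithm diverges; the offset **w** in the M-value formula avoids this problem when working from counts.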