LUMC · sndrtj · Jan 30, 2019 · Dec 31, 2018
diff --git a/.travis.yml b/.travis.yml
@@ -0,0 +1,19 @@
+language: python
+matrix:
+  include:
+    - python: "3.6"
+      dist: xenial
+    - python: "3.7"
+      dist: xenial
+install:
+  - pip install codecov
+  - pip install -r requirements.txt
+  - pip install -r requirements-dev.txt
+  - python setup.py install
+script:
+  - flake8 --statistics tests rna_cd
+  - coverage run --source=rna_cd -m py.test -v tests
+  - coverage xml
+  - coverage report -m
+after_success:
+  - codecov
diff --git a/README.md b/README.md
@@ -1,10 +1,12 @@
-# Mouse-RNA Detection
+[![Build Status](https://travis-ci.org/LUMC/rna_cd.svg?branch=master)](https://travis-ci.org/LUMC/rna_cd) [![codecov](https://codecov.io/gh/LUMC/rna_cd/branch/master/graph/badge.svg)](https://codecov.io/gh/LUMC/rna_cd)
+# RNA Contamination Detection
 
-Detect contaminations of mouse RNA in human DNA Illumina reads. 
+Detect contamination by mouse or human RNA in human DNA Illumina reads.
 
-Mouse RNA gives a spike of reads with large amounts of softclips in chrM. 
-This is likely due to an expressed mitochondrial gene. 
-We can use this behaviour to detect the presence of mouse RNA in our data.
+Mouse or human RNA contamination in a DNA sample produces a spike of reads
+with many soft-clipped bases on chrM. This is likely due to an expressed
+mitochondrial gene. We can use this behaviour to detect the presence of
+contaminating RNA in our data.
 
 The mitochondrial chromosome is usually covered completely in both exome and
 whole-genome sequencing experiments, and can thus be used for both approaches.
@@ -18,6 +20,79 @@ We will possibly provide a pre-trained model in the future.
 * click
 * scikit-learn
 * pysam
+* matplotlib
+
+## Installation
+
+`python setup.py install`
+
+## Usage
+
+### Training
+
+```
+Usage: rna_cd-train [OPTIONS]
+
+Options:
+  --chunksize INTEGER             Chunksize in bases. Default = 100
+  -c, --contig TEXT               Name of mitochondrial contig in your BAM
+                                  files. Default = chrM
+  -pd, --positives-dir DIRECTORY  Path to directory containing positive BAM
+                                  files. Mutually exclusive with --positives-
+                                  list
+  -nd, --negatives-dir DIRECTORY  Path to directory containing negative BAM
+                                  files. Mutually exclusive with --negatives-
+                                  list
+  -pl, --positives-list FILE      Path to file containing a list of paths to
+                                  positive BAM files. Mutually exclusive with
+                                  --positives-dir
+  -nl, --negatives-list FILE      Path to file containing a list of paths to
+                                  negative BAM files. Mutually exclusive with
+                                  --negatives-dir
+  --cross-validations INTEGER     Number of folds for cross validation run.
+                                  Default = 3
+  --verbosity INTEGER             Verbosity value for cross validation step.
+                                  Default = 1
+  -j, --cores INTEGER             Number of cores to use for processing of BAM
+                                  files and cross validations. Default = 1
+  --plot-out PATH                 Optional path to PCA plot.
+  -o, --model-out PATH            Path where model will be stored.  [required]
+  --help                          Show this message and exit.
+```
+
+For example:
+
+```bash
+rna_cd-train -pl pos.list -nl neg.list -j 8 --plot-out out.png -o model.out
+```
+
+### Classification
+
+```
+Usage: rna_cd-classify [OPTIONS]
+
+Options:
+  --chunksize INTEGER        Chunksize in bases. Default = 100
+  -c, --contig TEXT          Name of mitochondrial contig in your BAM files.
+                             Default = chrM
+  -j, --cores INTEGER        Number of cores to use for processing of BAM
+                             files. Default = 1
+  -d, --directory DIRECTORY  Path to directory with BAM files to be tested.
+                             Mutually exclusive with --list-items
+  -l, --list-items FILE      Path to file containing list of paths to BAM
+                             files to be tested. Mutually exclusive with
+                             --directory
+  -m, --model FILE           Path to model.
+  -o, --output PATH          Path to output file containing classifications.
+                             [required]
+  --help                     Show this message and exit.
+```
+
+For example:
+
+```bash
+rna_cd-classify -m model.out -l samples.list -j 8 -o pred.out 
+```
 
 ## License
 AGPLv3
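
Review note on the README rationale: the soft-clip signal it describes can be illustrated without a BAM file. The sketch below assumes CIGAR data in the `(operation, length)` tuple form that pysam exposes as `cigartuples`, where operation code 4 is a soft clip. The helper name `count_softclips` and the sample reads are made up for illustration and are not part of this change.

```python
# Hypothetical helper, not part of rna_cd: counts soft-clipped bases in
# CIGAR tuples of the (operation, length) form used by pysam.
from typing import Iterable, List, Optional, Tuple

SOFT_CLIP = 4  # BAM CIGAR operation code for a soft clip ("S")

Cigar = Optional[List[Tuple[int, int]]]


def count_softclips(cigars: Iterable[Cigar]) -> int:
    """Sum the lengths of all soft-clip operations over a set of reads."""
    total = 0
    for cigar in cigars:
        if cigar is None:  # reads without an alignment carry no CIGAR
            continue
        total += sum(length for op, length in cigar if op == SOFT_CLIP)
    return total


# Made-up reads: operation 0 is a match (M), operation 4 is a soft clip (S).
reads = [
    [(0, 100)],          # 100M
    [(4, 30), (0, 70)],  # 30S70M
    [(0, 60), (4, 40)],  # 60M40S
    None,                # no CIGAR available
]
print(count_softclips(reads))
```

A region whose reads look like the second and third entries contributes a high soft-clip total, which is the kind of spike the README refers to.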
diff --git a/requirements-dev.txt b/requirements-dev.txt
@@ -0,0 +1,3 @@
+pytest>=4.1.0
+flake8>=3.6.0
+pytest-cov>=2.6.1
diff --git a/requirements.txt b/requirements.txt
@@ -0,0 +1,6 @@
+click>=7.0
+pysam>=0.15.1
+scikit-learn>=0.20.2
+matplotlib>=3.0.2
+semver>=2.8.1
+joblib>=0.13.0
diff --git a/rna_cd/__init__.py b/rna_cd/__init__.py
@@ -0,0 +1,18 @@
+"""
+rna_cd
+Copyright (C) 2018-2019  Leiden University Medical Center, Sander Bollen
+
+This program is free software: you can redistribute it and/or modify
+it under the terms of the GNU Affero General Public License as published
+by the Free Software Foundation, either version 3 of the License, or
+(at your option) any later version.
+
+This program is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+GNU Affero General Public License for more details.
+
+You should have received a copy of the GNU Affero General Public License
+along with this program.  If not, see <http://www.gnu.org/licenses/>.
+"""
+from .utils import VERSION  # noqa
diff --git a/rna_cd/bam_process.py b/rna_cd/bam_process.py
@@ -0,0 +1,114 @@
+"""
+rna_cd
+Copyright (C) 2018-2019  Leiden University Medical Center, Sander Bollen
+
+This program is free software: you can redistribute it and/or modify
+it under the terms of the GNU Affero General Public License as published
+by the Free Software Foundation, either version 3 of the License, or
+(at your option) any later version.
+
+This program is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+GNU Affero General Public License for more details.
+
+You should have received a copy of the GNU Affero General Public License
+along with this program.  If not, see <http://www.gnu.org/licenses/>.
+
+bam_process.py
+~~~~~~~~~~~~~~
+
+Process BAM files into numpy arrays for classification.
+"""
+from functools import partial
+from multiprocessing import Pool
+from pathlib import Path
+from typing import Iterator, Tuple, Callable, List, Any
+
+import numpy as np
+from pysam import AlignmentFile
+
+from .utils import echo
+
+
+def chop_contig(size: int, chunksize: int) -> Iterator[Tuple[int, int]]:
+    """
+    For a contig of given size, generate regions of at most chunksize bases.
+    Regions are 0-based, half-open (start, end) tuples.
+    """
+    pos = 0
+    while pos < size:
+        end = pos + chunksize
+        if end < size:
+            yield (pos, end)
+        else:
+            yield (pos, size)
+        pos = end
+
+
+def softclip_bases(reader: AlignmentFile, contig: str,
+                   region: Tuple[int, int]) -> int:
+    """Count soft-clipped bases (CIGAR op 4) of reads overlapping a region"""
+    it = reader.fetch(contig=contig, start=region[0], stop=region[1])
+    s = 0
+    for read in it:
+        if read.cigartuples is not None:
+            s += sum(amount for op, amount in read.cigartuples if op == 4)
+    return s
+
+
+def coverage(reader: AlignmentFile, contig: str, region: Tuple[int, int],
+             method: Callable = np.mean) -> float:
+    """Calculate average/median/etc coverage for a region"""
+    covs = reader.count_coverage(contig=contig, start=region[0],
+                                 stop=region[1])
+    # covs holds one per-base array per nucleotide; sum them to get depth
+    return method(np.sum(covs, axis=0))
+
+
+def process_bam(path: Path, chunksize: int = 100,
+                contig: str = "chrM") -> np.ndarray:
+    """Process bam file to an ndarray"""
+    echo("Calculating features for {0}".format(path.name))
+    reader = AlignmentFile(str(path))
+    try:
+        ctg_idx = reader.references.index(contig)
+    except ValueError:
+        raise ValueError("Contig {0} does not exist in BAM file".format(
+            contig
+        ))
+    contig_size = reader.lengths[ctg_idx]
+
+    arr = []
+    tot_reads = 0
+    for region in chop_contig(contig_size, chunksize):
+        block = []
+        n_reads = reader.count(contig=contig, start=region[0], stop=region[1])
+        tot_reads += n_reads
+        cov = coverage(reader, contig, region)
+        softclip = softclip_bases(reader, contig, region)
+        block += [n_reads, cov, softclip]
+        arr += block
+    # normalize every feature by the total read count on the contig
+    normalized = np.array(arr) / tot_reads
+    echo("Done calculating features for {0}".format(path.name))
+    return normalized
+
+
+def make_array_set(bam_files: List[Path], labels: List[Any],
+                   chunksize: int = 100,
+                   contig: str = "chrM",
+                   cores: int = 1) -> Tuple[np.ndarray, np.ndarray]:
+    """
+    Make set of numpy arrays corresponding to data and labels.
+    I.e. train/testX and train/testY in scikit-learn parlance.
+
+    :param bam_files: List of paths to bam files
+    :param labels: list of labels.
+    :param cores: number of cores to use for processing
+    :return: tuple of X and Y numpy arrays. X = 2d, Y = 1d
+    """
+    proc_func = partial(process_bam, chunksize=chunksize, contig=contig)
+    with Pool(cores) as pool:
+        arr_X = pool.map(proc_func, bam_files)
+    return np.array(arr_X), np.array(labels)
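
Review note on the feature layout: the length of the vector returned by `process_bam` follows directly from `chop_contig`, since each region contributes three values (read count, coverage, soft-clipped bases). The sketch below restates `chop_contig` in condensed form with the same behaviour and computes the resulting vector length. `feature_count` is a hypothetical helper, and the contig sizes are illustrative (16,569 is the length of the rCRS human mitochondrial reference; your reference may differ).

```python
from typing import Iterator, Tuple

FEATURES_PER_REGION = 3  # n_reads, coverage, softclip (as in process_bam)


def chop_contig(size: int, chunksize: int) -> Iterator[Tuple[int, int]]:
    """Condensed restatement of chop_contig from this diff (0-based)."""
    pos = 0
    while pos < size:
        end = pos + chunksize
        yield (pos, min(end, size))  # the last region may be shorter
        pos = end


def feature_count(size: int, chunksize: int) -> int:
    """Hypothetical helper: length of the flattened feature vector."""
    n_regions = sum(1 for _ in chop_contig(size, chunksize))
    return FEATURES_PER_REGION * n_regions


regions = list(chop_contig(250, 100))
print(regions)
print(feature_count(250, 100))
print(feature_count(16569, 100))
```

Because the vector length depends on both the contig length and the chunk size, training and classification presumably need the same `--chunksize` and the same contig for the model input to line up.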