Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Init #1

Merged
merged 49 commits into from
Jan 30, 2019
Merged
Show file tree
Hide file tree
Changes from 39 commits
Commits
Show all changes
49 commits
Select commit Hold shift + click to select a range
8c091bb
file stubs
sndrtj Dec 31, 2018
0e03103
process bam file to array
sndrtj Dec 31, 2018
fa53c5b
train svm model function
sndrtj Dec 31, 2018
523ca4f
offload array generation to bam_process
sndrtj Jan 2, 2019
e91b7a5
train cli function
sndrtj Jan 2, 2019
65b8f5d
train cli in setup.py
sndrtj Jan 2, 2019
aeacaf6
use multiprocessing for bam processing
sndrtj Jan 2, 2019
35e1bf1
plot results
sndrtj Jan 2, 2019
44f3452
dump model to disk
sndrtj Jan 2, 2019
29da7e0
attempt to make contour plot
sndrtj Jan 2, 2019
1d41c80
rename package
sndrtj Jan 4, 2019
0416b13
remove contour attempts, move to separate branch
sndrtj Jan 4, 2019
69607eb
read obj from disk
sndrtj Jan 4, 2019
c8cb1b5
predict from model
sndrtj Jan 4, 2019
451294c
predict
sndrtj Jan 4, 2019
ff619ee
write predictions to disk
sndrtj Jan 7, 2019
02c5275
remove confusing plot elements
sndrtj Jan 7, 2019
c92e8eb
fix flake8 errors
sndrtj Jan 7, 2019
5b815b1
test stubs
sndrtj Jan 7, 2019
a1444c3
add help to train cli
sndrtj Jan 16, 2019
baed5a3
help messages for classify cli
sndrtj Jan 16, 2019
0924e77
install and usages in readme
sndrtj Jan 16, 2019
4852d20
utils testing
sndrtj Jan 17, 2019
3ec451d
some cli testing
sndrtj Jan 23, 2019
f79434d
add path callback testing
sndrtj Jan 23, 2019
0e3ddf7
train cli errors test
sndrtj Jan 23, 2019
5522162
add chop contig test
sndrtj Jan 23, 2019
6103152
classify cli errors testing
sndrtj Jan 23, 2019
2c2c0cd
add travis yaml
sndrtj Jan 23, 2019
501060f
fix travis
sndrtj Jan 23, 2019
c182a12
fix flake errors
sndrtj Jan 23, 2019
b5f63f0
add pytest cov to requirements-dev
sndrtj Jan 23, 2019
83be8b6
reorganize travis yaml
sndrtj Jan 23, 2019
e5c92e9
fix coverage?
sndrtj Jan 23, 2019
3b43b54
fix coverage?
sndrtj Jan 23, 2019
7ce0418
use coverage module directly
sndrtj Jan 23, 2019
127af8f
reinstate install
sndrtj Jan 23, 2019
be16f82
codecov with travis
sndrtj Jan 23, 2019
5b6d83f
add badges in readme
sndrtj Jan 23, 2019
653ce10
fix headers and test python 3.5 as well
sndrtj Jan 28, 2019
9916af9
python requires in setup py
sndrtj Jan 28, 2019
42cd785
unpack tuple and some extra comments
sndrtj Jan 28, 2019
365fde8
unpack more tuples and chance some more names
sndrtj Jan 28, 2019
1384de7
clarification on return types
sndrtj Jan 28, 2019
07d28bf
sanity checks on chop contig.
sndrtj Jan 28, 2019
dc00a4e
extra docs on callbacks
sndrtj Jan 28, 2019
a8dd866
docstring on training and extra sanity check on empty lists
sndrtj Jan 28, 2019
833b582
more tuple unpacking
sndrtj Jan 28, 2019
fa02ca1
better name for variable
sndrtj Jan 28, 2019
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
19 changes: 19 additions & 0 deletions .travis.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,19 @@
language: python
matrix:
include:
- python: "3.6"
dist: xenial
sndrtj marked this conversation as resolved.
Show resolved Hide resolved
- python: "3.7"
dist: xenial
install:
- pip install codecov
- pip install -r requirements.txt
- pip install -r requirements-dev.txt
- python setup.py install
script:
- flake8 --statistics tests rna_cd
- coverage run --source=rna_cd -m py.test -v tests
- coverage xml
- coverage report -m
after_success:
- codecov
85 changes: 80 additions & 5 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,10 +1,12 @@
# Mouse-RNA Detection
[![Build Status](https://travis-ci.org/LUMC/rna_cd.svg?branch=master)](https://travis-ci.org/LUMC/rna_cd) [![codecov](https://codecov.io/gh/LUMC/rna_cd/branch/master/graph/badge.svg)](https://codecov.io/gh/LUMC/rna_cd)
# RNA Contamination Detection

Detect contaminations of mouse RNA in human DNA Illumina reads.
Detect contaminations of mouse or human RNA in human DNA Illumina reads.

Mouse RNA gives a spike of reads with large amounts of softclips in chrM.
This is likely due to an expressed mitochondrial gene.
We can use this behaviour to detect the presence of mouse RNA in our data.
Mouse and human RNA contamination in a DNA sample gives a spike of reads with
large amounts of softclips in chrM. This is likely due to an expressed
mitochondrial gene. We can use this behaviour to detect the presence of mouse
RNA in our data.

The mitochondrial chromosome is usually covered completely in both exome and
whole-genome sequencing experiments, and can thus be used for both approaches.
Expand All @@ -18,6 +20,79 @@ We will possibly provide a pre-trained model in the future.
* click
* scikit-learn
* pysam
* matplotlib

## Installation

`python setup.py install`

## Usage

### Training

```
Usage: rna_cd-train [OPTIONS]
sndrtj marked this conversation as resolved.
Show resolved Hide resolved

Options:
--chunksize INTEGER Chunksize in bases. Default = 100
-c, --contig TEXT Name of mitochrondrial contig in your BAM
files. Default = chrM
-pd, --positives-dir DIRECTORY Path to directory containing positive BAM
files. Mutually exclusive with --positives-
list
-nd, --negatives-dir DIRECTORY Path to directory containing negative BAM
files. Mutually exlusive with --negatives-
list
-pl, --positives-list FILE Path to file containing a list of paths to
positive BAM files. Mutually exclusive with
--positives-dir
-nl, --negatives-list FILE Path to file containing a list of paths to
negative BAM files. Mutuallly exclusive with
--negatives-dir
--cross-validations INTEGER Number of folds for cross validation run.
Default = 3
--verbosity INTEGER Verbosity value for cross validation step.
Default = 1
-j, --cores INTEGER Number of cores to use for processing of BAM
files and cross validations. Default = 1
--plot-out PATH Optional path to PCA plot.
-o, --model-out PATH Path where model will be stored. [required]
--help Show this message and exit.
```

For example:

```bash
rna_cd-train -pl pos.list -nl neg.list -j 8 --plot-out out.png -o model.out
```

### Classification

```
Usage: rna_cd-classify [OPTIONS]

Options:
--chunksize INTEGER Chunksize in bases. Default = 100
-c, --contig TEXT Name of mitochrondrial contig in your BAM files.
Default = chrM
-j, --cores INTEGER Number of cores to use for processing of BAM
files. Default = 1
-d, --directory DIRECTORY Path to directory with BAM files to be tested.
Mutually exclusive with --list-items
-l, --list-items FILE Path to file containing list of paths to BAM
files to be tested. Mutually exclusive with
--directory
-m, --model FILE Path to model.
-o, --output PATH Path to output file containing classifications.
[required]
--help Show this message and exit.
```

For example:

```bash
rna_cd-classify -m model.out -l samples.list -j 8 -o pred.out
```

## License
AGPLv3
sndrtj marked this conversation as resolved.
Show resolved Hide resolved
3 changes: 3 additions & 0 deletions requirements-dev.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
pytest>=4.1.0
flake8>=3.6.0
pytest-cov>=2.6.1
6 changes: 6 additions & 0 deletions requirements.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
click>=7.0
pysam>=0.15.1
scikit-learn>=0.20.2
matplotlib>=3.0.2
semver>=2.8.1
joblib>=0.13.0
18 changes: 18 additions & 0 deletions rna_cd/__init__.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,18 @@
"""
rna_cd
Copyright (C) 2018-2019 Leiden University Medical Center, Sander Bollen
sndrtj marked this conversation as resolved.
Show resolved Hide resolved

This program is free software: you can redistribute it and/or modify
sndrtj marked this conversation as resolved.
Show resolved Hide resolved
it under the terms of the GNU Affero General Public License as published
by the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.
"""
from .utils import VERSION # noqa
sndrtj marked this conversation as resolved.
Show resolved Hide resolved
114 changes: 114 additions & 0 deletions rna_cd/bam_process.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,114 @@
"""
rna_cd
Copyright (C) 2018-2019 Leiden University Medical Center, Sander Bollen

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU Affero General Public License as published
by the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.

bam_process.py
~~~~~~~~~~~~~~

Process bam file to numpy array for classifications.
"""
from functools import partial
sndrtj marked this conversation as resolved.
Show resolved Hide resolved
from multiprocessing import Pool
sndrtj marked this conversation as resolved.
Show resolved Hide resolved
from pathlib import Path
sndrtj marked this conversation as resolved.
Show resolved Hide resolved
from typing import Iterator, Tuple, Callable, List, Any

import numpy as np
sndrtj marked this conversation as resolved.
Show resolved Hide resolved
from pysam import AlignmentFile
sndrtj marked this conversation as resolved.
Show resolved Hide resolved

from .utils import echo
sndrtj marked this conversation as resolved.
Show resolved Hide resolved


def chop_contig(size: int, chunksize: int) -> Iterator[Tuple[int, int]]:
sndrtj marked this conversation as resolved.
Show resolved Hide resolved
"""
For a contig of given size, generate regions maximally chunksize long
We use _0_ based indexing
"""
pos = 0
sndrtj marked this conversation as resolved.
Show resolved Hide resolved
while pos < size:
end = pos + chunksize
if end < size:
yield (pos, end)
sndrtj marked this conversation as resolved.
Show resolved Hide resolved
else:
yield (pos, size)
pos = end
sndrtj marked this conversation as resolved.
Show resolved Hide resolved


def softclip_bases(reader: AlignmentFile, contig: str,
region: Tuple[int, int]) -> int:
"""Calculate amount of softclip bases for a region"""
it = reader.fetch(contig=contig, start=region[0], stop=region[1])
s = 0
for read in it:
if read.cigartuples is not None:
s += sum(amount for op, amount in read.cigartuples if op == 4)
sndrtj marked this conversation as resolved.
Show resolved Hide resolved
return s


def coverage(reader: AlignmentFile, contig: str, region: Tuple[int, int],
method: Callable = np.mean) -> int:
"""Calculate average/median/etc coverage for a region"""
covs = reader.count_coverage(contig=contig, start=region[0],
sndrtj marked this conversation as resolved.
Show resolved Hide resolved
stop=region[1])

return method(np.sum(covs))


def process_bam(path: Path, chunksize: int = 100,
contig: str = "chrM") -> np.ndarray:
sndrtj marked this conversation as resolved.
Show resolved Hide resolved
"""Process bam file to an ndarray"""
sndrtj marked this conversation as resolved.
Show resolved Hide resolved
echo("Calculating features for {0}".format(path.name))
reader = AlignmentFile(str(path))
try:
ctg_idx = reader.references.index(contig)
sndrtj marked this conversation as resolved.
Show resolved Hide resolved
except ValueError:
raise ValueError("Contig {0} does not exist in BAM file".format(
contig
))
contig_size = reader.lengths[ctg_idx]

arr = []
sndrtj marked this conversation as resolved.
Show resolved Hide resolved
tot_reads = 0
for region in chop_contig(contig_size, chunksize):
block = []
n_reads = reader.count(contig=contig, start=region[0], stop=region[1])
tot_reads += n_reads
cov = coverage(reader, contig, region)
softclip = softclip_bases(reader, contig, region)
block += [n_reads, cov, softclip]
arr += block
# add normalization step
normalized = np.array(arr) / tot_reads
echo("Done calculating features for {0}".format(path.name))
return normalized


def make_array_set(bam_files: List[Path], labels: List[Any],
chunksize: int = 100,
contig: str = "chrM",
cores: int = 1) -> Tuple[np.ndarray, np.ndarray]:
"""
Make set of numpy arrays corresponding to data and labels.
I.e. train/testX and train/testY in scikit-learn parlance.

:param bam_files: List of paths to bam files
:param labels: list of labels.
:param cores: number of cores to use for processing
:return: tuple of X and Y numpy arrays. X = 2d, Y = 1d
"""
pool = Pool(cores)
sndrtj marked this conversation as resolved.
Show resolved Hide resolved
proc_func = partial(process_bam, chunksize=chunksize, contig=contig)
arr_X = pool.map(proc_func, bam_files)
sndrtj marked this conversation as resolved.
Show resolved Hide resolved
return np.array(arr_X), np.array(labels)
Loading