
Use python-isal's igzip_threaded module to replace igzip piping. #131

Merged
rhpvorderman merged 15 commits into main from isalthreads on Oct 18, 2023

Conversation

rhpvorderman
Collaborator

Hi Marcel,

Sorry I haven't worked on my dnaio PR lately. I have been sick and taking sick leave. Currently I am taking an algorithms for genomics course in Delft, which is very enjoyable. Four hours of travel every day, yet I am still full of energy from pure excitement.

I have been working a lot on python-isal lately because sequali was taking a lot of time to handle a gzipped dataset and I wanted to use more threads. Unfortunately, with xopen it is not possible to track progress through the file, because the file is opened by an external process. There is also quite a bit of overhead in the piped solution.

So I tried to write a multithreaded solution. Unfortunately it proved hard to escape the GIL, so I rewrote the entire gzip reading process in C (pycompression/python-isal#151). This has the fortunate side effect of removing most Python overhead for single-threaded decompression as well, and it also makes BGZIP-format decompression a lot faster.

After that I wrote multithreaded readers and writers (pycompression/python-isal#153). Then I needed several polishing PRs to get a satisfying result, but I think I got there.
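For illustration, a minimal sketch of how the threaded API can be used, based on my reading of pycompression/python-isal#153 (the file names are made up):

from isal import igzip_threaded

# Reading: decompression runs in a background thread.
with igzip_threaded.open("input.fastq.gz", "rb", threads=1) as f:
    for line in f:
        pass  # process each line

# Writing: one compression thread handles the gzip stream.
with igzip_threaded.open("output.fastq.gz", "wb", threads=1) as f:
    f.write(b"@read1\nACGT\n+\nIIII\n")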

Here are the benchmarks for reading, using sequali. Sequali is compute-bound, so spawning a separate gzip thread significantly decreases wall-clock time.

Using the current main branch of xopen:

Benchmark 1: sequali ~/test/5millionreads_R1.fastq.gz
  Time (mean ± σ):      4.735 s ±  0.188 s    [User: 6.366 s, System: 1.034 s]
  Range (min … max):    4.518 s …  4.934 s    5 runs

Using this branch of xopen:

Benchmark 1: sequali ~/test/5millionreads_R1.fastq.gz
  Time (mean ± σ):      4.510 s ±  0.028 s    [User: 6.126 s, System: 0.261 s]
  Range (min … max):    4.478 s …  4.550 s    5 runs

As you can see it is a slight win in wall-clock time, but also a massive win in system time: the system no longer has to manage a pipe, and the program can take advantage of shared memory.

For comparison: decompression using no threads, and the same benchmark on uncompressed data.

Benchmark 1: sequali ~/test/5millionreads_R1.fastq.gz
  Time (mean ± σ):      5.824 s ±  0.073 s    [User: 5.588 s, System: 0.234 s]
  Range (min … max):    5.748 s …  5.913 s    5 runs

Benchmark 1: sequali ~/test/5millionreads_R1.fastq
  Time (mean ± σ):      4.108 s ±  0.017 s    [User: 3.805 s, System: 0.303 s]
  Range (min … max):    4.085 s …  4.123 s    5 runs

For the writing benchmarks I tried to use cutadapt, but since that uses multiprocessing internally, things got a little messy. So I wrote a simple dnaio read-write script that only uses xopen threads:

import dnaio
import sys

# Copy all records from the input file (argv[1]) to the output file (argv[2]),
# using one extra xopen thread each for reading and for writing.
with dnaio.open(sys.argv[1], mode="r", open_threads=1, qualities=True) as reader:
    with dnaio.open(sys.argv[2], mode="w", open_threads=1, qualities=True) as writer:
        for record in reader:
            writer.write(record)

Before, using the main branch:

Benchmark 1: python dnaio_read_and_write.py ~/test/5millionreads_R1.fastq ramdisk/test.fastq.gz
  Time (mean ± σ):      3.326 s ±  0.034 s    [User: 4.875 s, System: 1.078 s]
  Range (min … max):    3.298 s …  3.384 s    5 runs

After, using this branch:

Benchmark 1: python dnaio_read_and_write.py ~/test/5millionreads_R1.fastq ramdisk/test.fastq.gz
  Time (mean ± σ):      3.164 s ±  0.014 s    [User: 5.035 s, System: 0.577 s]
  Range (min … max):    3.151 s …  3.187 s    5 runs

That is with threads=1, but using more than one thread is also possible. Below is threads=2:

Benchmark 1: python dnaio_read_and_write.py ~/test/5millionreads_R1.fastq ramdisk/test.fastq.gz
  Time (mean ± σ):      2.984 s ±  0.016 s    [User: 5.660 s, System: 0.757 s]
  Range (min … max):    2.964 s …  3.007 s    5 runs

There is a slight increase in user time and a decrease in system time when using one thread. Overall the threaded solution saves a small percentage of compute time compared to the piped solution. Two threads can further decrease wall-clock time, but that is only advisable when the bottleneck is quite severe. Since very few applications will break the 500 MiB/s barrier, I think one thread is a very sane default.
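For illustration, this is roughly how the new behavior is exercised through xopen itself; the threads parameter is xopen's existing API, the file name is made up:

from xopen import xopen

# With this branch, gzip writing with threads=1 should go through
# igzip_threaded instead of a pipe to an external igzip process.
with xopen("out.fastq.gz", "wb", threads=1) as f:
    f.write(b"@read1\nACGT\n+\nIIII\n")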

Main advantages of this change:

  • Less compute time in all situations.
  • python-isal has a much better install base than the external igzip binary.
  • The changes from python-isal can be carried over to python-zlib-ng. Then I can redo Add zlib-ng and refactor gzip open functions. #124 in a much simpler fashion. This will also give access to multithreaded zlib-ng compression at level 5, finally eliminating one of cutadapt's bottlenecks on default settings (see the speculative sketch after this list).
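To make the last point concrete, here is a purely speculative sketch of a zlib-ng analogue; the module and function names are assumptions that simply mirror igzip_threaded, not an existing API:

from zlib_ng import gzip_ng_threaded  # speculative import, mirrors isal.igzip_threaded

# Multithreaded compression at zlib-ng level 5, as mentioned above.
with gzip_ng_threaded.open("out.fastq.gz", "wb", compresslevel=5, threads=4) as f:
    f.write(b"data\n")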

It was quite a lot of work to get everything working efficiently in python-isal, but I think all this work will pay off in the long run. All these compressions and decompressions do add up over time.

@rhpvorderman
Collaborator Author

I had to make a change requiring at least Python 3.8, because python-isal has the same requirement. Given that 3.7 is currently out of support, I think this is a valid change. Generally I'd like to maintain support as long as it is not annoying to do so, and given the extra typing requirements for 3.7, I think it is convenient to get rid of it.

@marcelm
Collaborator

marcelm commented Oct 18, 2023

I looked into why coverage goes down. We now never call _open_external_gzip_reader because the isal import never fails. We have an extra test in the test matrix with optional-dependencies: false where we don't install the system isal libraries, but I guess that has no effect because python-isal bundles libisal already; is that correct?

I’m ok with leaving it as-is (with decreased coverage) since the platforms we care about are covered. We can consider re-adding tests later (or possibly decide that we want isal to be a hard dependency?).

Can you please:

  • Check that the description in the README regarding “what happens when” is still correct with this PR?
  • Add a changelog entry.

Then feel free to merge and make a release. I guess this should be called 1.8.0, but I’ll leave that up to you.

Thanks!

@rhpvorderman rhpvorderman force-pushed the isalthreads branch 2 times, most recently from 51afbd4 to 790b988 on October 18, 2023 at 12:23
@rhpvorderman
Collaborator Author

> I looked into why coverage goes down. We now never call _open_external_gzip_reader because the isal import never fails. We have an extra test in the test matrix with optional-dependencies: false where we don't install the system isal libraries, but I guess that has no effect because python-isal bundles libisal already; is that correct?

It does not have an effect, because igzip_threaded.open takes precedence and it is indeed always installed; for good reason, I have to say, as it is always better than the alternatives ;-). I made a somewhat clunky tox edit where I remove python-isal before running. I also tried forcing no installation of isal by adding --no-deps to the install_command, but that fails because then the coverage and pytest dependencies are not installed either. So: clunky, but it works, it was not too much effort, and coverage is up again.
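For illustration, the kind of tox tweak meant here could look roughly like this; a hypothetical environment, not the exact edit in this PR:

[testenv:no-isal]
# Uninstall python-isal before running so the gzip fallback paths are exercised.
commands_pre = pip uninstall -y isal
commands = pytest tests/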

I think we should discuss at some point how many external readers we still want in xopen. Arguably only gzip is worth keeping as a fallback; all the other use cases can eventually be handled by python-zlib-ng and python-isal. This should make the _open_gz function a bit simpler. We can still keep the old classes around for backward compatibility.
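To sketch what such a simplified fallback could look like (illustrative only, not the actual _open_gz code; open_gz is a made-up name):

import gzip

try:
    from isal import igzip_threaded
except ImportError:
    igzip_threaded = None

def open_gz(path, mode="rb", threads=1):
    # Prefer python-isal's threaded reader/writer when available,
    # otherwise fall back to the (single-threaded) stdlib gzip module.
    if igzip_threaded is not None:
        return igzip_threaded.open(path, mode, threads=threads)
    return gzip.open(path, mode)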

@rhpvorderman rhpvorderman merged commit 97b29a6 into main Oct 18, 2023
19 checks passed
@rhpvorderman rhpvorderman deleted the isalthreads branch on October 18, 2023 at 12:31
@marcelm
Collaborator

marcelm commented Oct 18, 2023

Looks good, thanks for fixing the coverage issue. Agreed that we should consider reducing the number of external readers.

@rhpvorderman
Collaborator Author

Released! Thanks for your quick review!

@marcelm
Collaborator

marcelm commented Oct 18, 2023

Awesome!
