
Use python-isal's igzip_threaded module to replace igzip piping. #131

Merged
rhpvorderman merged 15 commits into main from isalthreads on Oct 18, 2023

Conversation

rhpvorderman
Collaborator

Hi Marcel,

Sorry I haven't worked on my dnaio PR lately. I have been sick and taking sick leave. Currently I am taking an algorithms for genomics course in Delft, which is very enjoyable. Four hours of travel every day, yet I am still full of energy from pure excitement.

I have been working a lot on python-isal lately because sequali was taking a lot of time to handle a gzipped dataset and I wanted to use more threads. Unfortunately, with xopen it is not possible to track progress through the file, because the file is opened by an external process. There is also quite a bit of overhead in the piped solution.

So I tried to write a multithreaded solution. Unfortunately it proved hard to escape the GIL, so I rewrote the entire gzip reading process in C (pycompression/python-isal#151). This has the fortunate side effect of removing most Python overhead for single-threaded decompression as well, and it also makes BGZIP-format decompression a lot faster.

After that I wrote multithreaded readers and writers (pycompression/python-isal#153). Then I needed several polishing PRs to get a satisfying result, but I think I got there.
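For illustration, a minimal sketch of how the threaded API can be used, based on my reading of pycompression/python-isal#153 (the file names are made up):

from isal import igzip_threaded

# Reading: decompression runs in a background thread.
with igzip_threaded.open("input.fastq.gz", "rb", threads=1) as f:
    for line in f:
        pass  # process each line

# Writing: one compression thread handles the gzip stream.
with igzip_threaded.open("output.fastq.gz", "wb", threads=1) as f:
    f.write(b"@read1\nACGT\n+\nIIII\n")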

Here are the benchmarks for reading, using sequali. Sequali is compute-bound, so spawning a separate gzip thread significantly decreases wall-clock time.

Using the current main branch of xopen:

Benchmark 1: sequali ~/test/5millionreads_R1.fastq.gz
  Time (mean ± σ):      4.735 s ±  0.188 s    [User: 6.366 s, System: 1.034 s]
  Range (min … max):    4.518 s …  4.934 s    5 runs

Using this branch of xopen:

Benchmark 1: sequali ~/test/5millionreads_R1.fastq.gz
  Time (mean ± σ):      4.510 s ±  0.028 s    [User: 6.126 s, System: 0.261 s]
  Range (min … max):    4.478 s …  4.550 s    5 runs

As you can see it is a slight win in wall-clock time, but also a massive win in system time: the system no longer has to manage a pipe, and the program can take advantage of shared memory.

For comparison: decompression using no threads, and the same benchmark on uncompressed data.

Benchmark 1: sequali ~/test/5millionreads_R1.fastq.gz
  Time (mean ± σ):      5.824 s ±  0.073 s    [User: 5.588 s, System: 0.234 s]
  Range (min … max):    5.748 s …  5.913 s    5 runs

Benchmark 1: sequali ~/test/5millionreads_R1.fastq
  Time (mean ± σ):      4.108 s ±  0.017 s    [User: 3.805 s, System: 0.303 s]
  Range (min … max):    4.085 s …  4.123 s    5 runs

For the writing benchmarks I tried to use cutadapt, but since that uses multiprocessing internally, things got a little messy. So I wrote a simple dnaio read-write script that only uses xopen threads:

import dnaio
import sys

# Copy all records from the input file (argv[1]) to the output file (argv[2]),
# using one extra xopen thread each for reading and for writing.
with dnaio.open(sys.argv[1], mode="r", open_threads=1, qualities=True) as reader:
    with dnaio.open(sys.argv[2], mode="w", open_threads=1, qualities=True) as writer:
        for record in reader:
            writer.write(record)

Before, using the main branch:

Benchmark 1: python dnaio_read_and_write.py ~/test/5millionreads_R1.fastq ramdisk/test.fastq.gz
  Time (mean ± σ):      3.326 s ±  0.034 s    [User: 4.875 s, System: 1.078 s]
  Range (min … max):    3.298 s …  3.384 s    5 runs

After, using this branch:

Benchmark 1: python dnaio_read_and_write.py ~/test/5millionreads_R1.fastq ramdisk/test.fastq.gz
  Time (mean ± σ):      3.164 s ±  0.014 s    [User: 5.035 s, System: 0.577 s]
  Range (min … max):    3.151 s …  3.187 s    5 runs

That is with threads=1, but using more than one thread is also possible. Below is threads=2:

Benchmark 1: python dnaio_read_and_write.py ~/test/5millionreads_R1.fastq ramdisk/test.fastq.gz
  Time (mean ± σ):      2.984 s ±  0.016 s    [User: 5.660 s, System: 0.757 s]
  Range (min … max):    2.964 s …  3.007 s    5 runs

There is a slight increase in user time and a decrease in system time when using one thread. Overall the threaded solution saves a small percentage of compute time compared to the piped solution. Two threads can further decrease wall-clock time, but that is only advisable when the bottleneck is quite severe. Since very few applications will break the 500 MiB/s barrier, I think one thread is a very sane default.
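For illustration, this is roughly how the new behavior is exercised through xopen itself; the threads parameter is xopen's existing API, the file name is made up:

from xopen import xopen

# With this branch, gzip writing with threads=1 should go through
# igzip_threaded instead of a pipe to an external igzip process.
with xopen("out.fastq.gz", "wb", threads=1) as f:
    f.write(b"@read1\nACGT\n+\nIIII\n")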

Main advantages of this change:

  • Less compute time in all situations.
  • python-isal has a much better install base than the external igzip binary.
  • The changes from python-isal can be carried over to python-zlib-ng. Then I can redo Add zlib-ng and refactor gzip open functions. #124 in a much simpler fashion. This will also give access to multithreaded zlib-ng compression at level 5, finally eliminating one of cutadapt's bottlenecks on default settings (see the speculative sketch after this list).
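To make the last point concrete, here is a purely speculative sketch of a zlib-ng analogue; the module and function names are assumptions that simply mirror igzip_threaded, not an existing API:

from zlib_ng import gzip_ng_threaded  # speculative import, mirrors isal.igzip_threaded

# Multithreaded compression at zlib-ng level 5, as mentioned above.
with gzip_ng_threaded.open("out.fastq.gz", "wb", compresslevel=5, threads=4) as f:
    f.write(b"data\n")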

It was quite a lot of work to get everything working efficiently in python-isal, but I think all this work will pay off in the long run. All these compressions and decompressions do add up over time.

@rhpvorderman
Collaborator Author

I had to make a change requiring at least Python 3.8, because python-isal has the same requirement. Given that 3.7 is currently out of support, I think this is a valid change. Generally I'd like to maintain support as long as it is not annoying to do so, and given the extra typing requirements for 3.7, I think it is convenient to get rid of it.

@marcelm
Collaborator

marcelm commented Oct 18, 2023

I looked into why coverage goes down. We now never call _open_external_gzip_reader because the isal import never fails. We have an extra test in the test matrix with optional-dependencies: false where we don't install the system isal libraries, but I guess that has no effect because python-isal bundles libisal already; is that correct?

I’m ok with leaving it as-is (with decreased coverage) since the platforms we care about are covered. We can consider re-adding tests later (or possibly decide that we want isal to be a hard dependency?).

Can you please:

  • Check that the description in the README regarding “what happens when” is still correct with this PR?
  • Add a changelog entry.

Then feel free to merge and make a release. I guess this should be called 1.8.0, but I’ll leave that up to you.

Thanks!

@rhpvorderman rhpvorderman force-pushed the isalthreads branch 2 times, most recently from 51afbd4 to 790b988 on October 18, 2023 at 12:23
@rhpvorderman
Collaborator Author

> I looked into why coverage goes down. We now never call _open_external_gzip_reader because the isal import never fails. We have an extra test in the test matrix with optional-dependencies: false where we don't install the system isal libraries, but I guess that has no effect because python-isal bundles libisal already; is that correct?

It does not have an effect, because igzip_threaded.open takes precedence and it is indeed always installed; for good reason, I have to say, as it is always better than the alternatives ;-). I made a somewhat clunky tox edit where I remove python-isal before running. I also tried forcing no installation of isal by adding --no-deps to the install_command, but that fails because then the coverage and pytest dependencies are not installed either. So: clunky, but it works, it was not too much effort, and coverage is up again.
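For illustration, the kind of tox tweak meant here could look roughly like this; a hypothetical environment, not the exact edit in this PR:

[testenv:no-isal]
# Uninstall python-isal before running so the gzip fallback paths are exercised.
commands_pre = pip uninstall -y isal
commands = pytest tests/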

I think we should discuss at some point how many external readers we still want in xopen. Arguably only gzip is worth keeping as a fallback; all the other use cases can eventually be handled by python-zlib-ng and python-isal. This should make the _open_gz function a bit simpler. We can still keep the old classes around for backward compatibility.
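To sketch what such a simplified fallback could look like (illustrative only, not the actual _open_gz code; open_gz is a made-up name):

import gzip

try:
    from isal import igzip_threaded
except ImportError:
    igzip_threaded = None

def open_gz(path, mode="rb", threads=1):
    # Prefer python-isal's threaded reader/writer when available,
    # otherwise fall back to the (single-threaded) stdlib gzip module.
    if igzip_threaded is not None:
        return igzip_threaded.open(path, mode, threads=threads)
    return gzip.open(path, mode)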

@rhpvorderman rhpvorderman merged commit 97b29a6 into main Oct 18, 2023
19 checks passed
@rhpvorderman rhpvorderman deleted the isalthreads branch on October 18, 2023 at 12:31
@marcelm
Collaborator

marcelm commented Oct 18, 2023

Looks good, thanks for fixing the coverage issue. Agreed that we should consider reducing the number of external readers.

@rhpvorderman
Collaborator Author

Released! Thanks for your quick review!

@marcelm
Collaborator

marcelm commented Oct 18, 2023

Awesome!
