Use python-isal's igzip_threaded module to replace igzip piping. #131
Conversation
I had to make a change requiring at least Python 3.8, because python-isal has the same requirement. Given that 3.7 is currently out of support, I think this is a valid change. Generally I'd like to maintain support for old versions as long as it is not annoying to do so, and given the extra typing requirements for 3.7, it is convenient to drop it.
This reverts commit b2a7a7b.
I looked into why coverage goes down. We now never call _open_external_gzip_reader, because the isal import never fails (a sketch of this fallback logic is below). We have an extra test in the test matrix that should exercise this path. I'm ok with leaving it as-is (with decreased coverage) since the platforms we care about are covered. We can consider re-adding tests later (or possibly decide that we want isal to be a hard dependency?). Can you please:
Then feel free to merge and make a release. I guess this should be called 1.8.0, but I'll leave that up to you. Thanks!
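For context, here is a minimal sketch of the fallback logic under discussion. Only `igzip_threaded.open` and `_open_external_gzip_reader` are named in this thread; the rest is illustrative, with stdlib `gzip` standing in for the external-process fallback so the sketch runs:

```python
import gzip

try:
    from isal import igzip_threaded
except ImportError:  # never happens in the test matrix: python-isal is always installed
    igzip_threaded = None

def _open_gz(filename, mode="rb", threads=1):
    if igzip_threaded is not None:
        # Preferred path: threaded isal reader/writer, no external process.
        return igzip_threaded.open(filename, mode, threads=threads)
    # Fallback path -- in xopen this is _open_external_gzip_reader, which
    # pipes through an external gzip process; it is never reached while
    # python-isal is installed, hence the coverage drop.
    return gzip.open(filename, mode)
```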
(force-pushed from 51afbd4 to 790b988)
(force-pushed from 790b988 to 8e9a993)
It does not have an effect, because igzip_threaded.open takes precedence and python-isal is indeed always installed. For good reason, I have to say, as it is always better than the alternatives ;-). I made a bit of a clunky tox edit where I remove python-isal before running the tests. I also tried forcing no installation of isal by adding --no-deps to the install_command, but that fails because then the coverage and pytest dependencies are not installed either. So: clunky, but it works, it is not much effort, and coverage is up again.

I think we should discuss at some point how many external readers we still want in xopen. Arguably only gzip is worth keeping as a fallback; all the other use cases can eventually be covered by python-zlib-ng and python-isal. That should make the _open_gz function a bit simpler. We can still keep the old classes around for backward compatibility.
Looks good, thanks for fixing the coverage issue. Agreed that we should consider reducing the number of external readers.
Released! Thanks for your quick review!
Awesome! |
Hi Marcel,
Sorry I haven't worked on my dnaio PR lately. I have been sick and taking sick leave. Currently I am taking an algorithms for genomics course in Delft, which is very enjoyable. Four hours of travel every day, yet I am still full of energy due to pure excitement.
I have been working a lot on python-isal lately because sequali was taking a lot of time to handle a gzipped dataset and I wanted to use more threads. Unfortunately, with xopen it is not possible to track the progress in the file, because it is opened by an external process. There is also quite some overhead in the piped solution.
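To illustrate the progress-tracking point: if the file object is opened in Python rather than handed to an external process, its position can be polled. A minimal sketch, assuming `igzip_threaded.open` accepts an already-open binary file object like the other gzip-style `open` functions (the filename is a placeholder):

```python
from isal import igzip_threaded

with open("reads.fastq.gz", "rb") as raw:
    with igzip_threaded.open(raw, "rb") as fh:
        while fh.read(128 * 1024):
            # Position in the underlying compressed stream; approximate,
            # since the reader thread reads ahead of the consumer.
            compressed_pos = raw.tell()
```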
So I tried to write a multithreaded solution. Unfortunately it proved hard to escape the GIL, so I rewrote the entire gzip reading process in C (pycompression/python-isal#151). This has the fortunate side effect of also removing most of the Python overhead for single-threaded decompression, and it makes BGZIP-format decompression a lot faster.
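For reference, the single-threaded path that benefits from the C rewrite is python-isal's drop-in replacement for the stdlib `gzip` module (the filename is a placeholder):

```python
from isal import igzip

# isal.igzip mirrors gzip.open; after the C rewrite, most of the per-read
# Python overhead is gone even without any extra threads.
with igzip.open("reads.fastq.gz", "rb") as fh:
    data = fh.read()
```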
After that I wrote multithreaded readers and writers. pycompression/python-isal#153
Then I needed to do some polishing with several PRs to get a satisfying result but I think I got there.
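A minimal usage sketch of the threaded API from those PRs (filenames and thread counts are illustrative):

```python
from isal import igzip_threaded

# Writing: compression runs in a background thread while the main
# thread keeps producing data.
with igzip_threaded.open("out.txt.gz", "wt", threads=1) as fh:
    fh.write("hello world\n")

# Reading: decompression likewise happens in a background thread.
with igzip_threaded.open("out.txt.gz", "rt", threads=1) as fh:
    print(fh.read(), end="")
```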
Here are the benchmarks for reading using sequali. Sequali is compute-bound, so spawning a separate gzip thread significantly decreases wall-clock time.
Using xopen's current main branch:
Using this branch of xopen:
As you can see, it is a slight win in wall-clock time, but also a massive win in system time, because the system does not have to manage a pipe and the program can take advantage of shared memory.
For comparison, here is decompression using no threads, as well as a run on uncompressed data.
Writing benchmarks: I tried to use cutadapt for this, but since it uses multiprocessing internally, that got a little messy. So I wrote a simple dnaio read/write script that only uses xopen threads.
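That script is not reproduced in this thread; a minimal stand-in that exercises the same path, using xopen directly with one (de)compression thread per stream (filenames are placeholders), could look like this:

```python
from xopen import xopen

# Decompress in.fastq.gz and recompress it to out.fastq.gz; threads=1
# gives each stream a single background (de)compression thread.
with xopen("in.fastq.gz", "rb", threads=1) as src, \
        xopen("out.fastq.gz", "wb", threads=1) as dst:
    while True:
        chunk = src.read(128 * 1024)
        if not chunk:
            break
        dst.write(chunk)
```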
Before, using the main branch:
After, using this branch:
That is with threads=1, but using more than one thread is also possible. Below, threads=2:
There is a slight increase in user time and a decrease in system time when using one thread. Overall, the threaded solution saves a small percentage of compute time compared to the piped solution. Two threads can decrease wall-clock time further, but that is only advisable when the bottleneck is quite severe. Since it seems that very few applications will break the 500 MiB/s barrier, I think one thread is a very sane default.
Main advantages of this change:

- No external process is needed, so progress in the input file can be tracked.
- No pipe to manage, which shows up as a large drop in system time.
- The program can take advantage of shared memory instead of copying data through a pipe.
It was quite a lot of work to get everything working in an efficient way in python-isal, but I think all this work will pay off in the long run. All these compressions and decompressions do add up over time.