Utility scripts for Deblur #119

cuttlefishh · 2016-12-06T21:20:45Z

Two scripts are included here:

verify_amplicon_type.py -- uses the first 4 nt to guess which kind of amplicon the OTU sequences come from. Currently supports 16S, 18S, and ITS. While all sequences in a given study should start at the same place, and this should also be true for the same primer set, it is helpful to be able to check that all the studies in a meta-analysis start at the same 5' position. Additionally, one might have some sequences of unknown origin. This script helps identify what they are without resorting to BLAST and similar tools.
summarize_otu_distributions.py -- gives, for each OTU in a biom table, the number, fraction, and rank of samples in which the OTU is found, and the abundance, fraction, and rank of observations represented by that OTU. It also provides the taxonomy and a list of all the samples the OTU is found in. Importantly, the script expects the user to supply a rarefied biom table. This code was developed for the 'OTU sequence lookup' effort, and is mostly a wrapper for some biom commands, but I think it should have general utility for Deblur users.

mortonjt · 2016-12-06T21:25:38Z

scripts/summarize_otu_distributions.py

+              help="Output OTU summary (.tsv)")
+
+def make_otu_summary(input_biom_fp, output_summary_fp):
+    """Summarize distribution information about each OTU (sequnece) in a Deblur


sequnece -> sequence

cuttlefishh · 2016-12-06T21:28:00Z

Thanks! Fixed.


wasade · 2016-12-06T22:53:23Z

scripts/summarize_otu_distributions.py

+              file_okay=True), 
+              help="Output OTU summary (.tsv)")
+
+def make_otu_summary(input_biom_fp, output_summary_fp):


wouldn't it make sense for this to be part of biom?

wasade · 2016-12-06T22:53:39Z

scripts/summarize_otu_distributions.py

+    samples = table.ids(axis='sample')
+    otus = table.ids(axis='observation')
+    for idx, cdat in enumerate(table.iter_data(axis='observation')):
+        otu_total_obs[otus[idx]] = np.sum(cdat)


This for loop can be replaced with calls to Table.sum.
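To illustrate the idea behind this suggestion, here is a minimal sketch that replaces a per-OTU Python loop with a single axis-wise sum. It uses a plain NumPy array as a stand-in for the table so it runs without biom installed; the OTU names and counts are made up for illustration. With a real biom table, `table.sum(axis='observation')` should return the equivalent per-OTU totals.

```python
import numpy as np

# Hypothetical dense counts: rows are OTUs (observations), columns are samples.
otus = ["seqA", "seqB", "seqC"]
counts = np.array([
    [0, 2, 3],
    [1, 0, 0],
    [4, 4, 4],
])

# Loop form, mirroring the structure of the diff under review.
otu_total_obs_loop = {}
for idx, row in enumerate(counts):
    otu_total_obs_loop[otus[idx]] = np.sum(row)

# Vectorized form: one sum across the sample axis gives every OTU total.
otu_total_obs = dict(zip(otus, counts.sum(axis=1)))

# Number of samples each OTU is found in, via a presence/absence mask.
otu_num_samples = dict(zip(otus, (counts > 0).sum(axis=1)))
```

Both forms produce the same totals; the vectorized one simply avoids indexing into the ID array by hand on every iteration.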

cuttlefishh · 2016-12-06T22:57:12Z

Yeah, I was thinking that also. I think the only part that is specific to Deblur is calling the OTU identifier a "sequence" in the column header and maybe a few other places. That can easily be changed. Should I issue a PR to biom?


wasade · 2016-12-06T23:00:39Z

Sure. Deblur could later import the object and revise the column name afterward.

coveralls · 2016-12-06T23:05:36Z

Coverage remained the same at 89.322% when pulling 58ec12f on cuttlefishh:otuscripts into b377d4a on biocore:master.

coveralls · 2016-12-06T23:23:05Z

Coverage remained the same at 89.322% when pulling 887ec13 on cuttlefishh:otuscripts into b377d4a on biocore:master.

coveralls · 2016-12-07T18:32:38Z

Coverage remained the same at 89.322% when pulling 47e5f80 on cuttlefishh:otuscripts into b377d4a on biocore:master.

cuttlefishh · 2016-12-07T18:35:20Z

I have deleted summarize_otu_distributions.py and issued a PR to biom-format.

For verify_amplicon_type.py, it would be prudent to wait until we have tested Deblur with ITS and 18S, so we know it works and the 5' tetramer frequencies are known. Deblur seems to leave the ITS 5' tetramer frequencies intact, but the 18S frequencies are changing, and it's unclear what the underlying causes are.

cuttlefishh · 2016-12-15T01:24:53Z

I think 18S is working now, so we should be good. After positive filtering with Silva 18S, the top three 5' tetramers are GCTA, GCTC, and ACAC, which are found in slightly >50% of OTUs.

coveralls · 2016-12-15T01:29:10Z

Coverage decreased (-0.2%) to 89.162% when pulling 3ba33bf on cuttlefishh:otuscripts into b377d4a on biocore:master.

wasade · 2017-10-17T01:14:36Z

Going through old PRs. I think it would be fantastic to get this merged, but in my opinion the following few items would be great if possible:

shift the primary logic to library code, and wrap with unit tests
parameterize the amplicon to detect. It would be great if a few more tetramers were available
for the provided tetramers, would it be possible to have citable material (either publications or analysis) which provides support for reporting the amplicon type?
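As a rough sketch of what the first two items could look like, the snippet below puts the tetramer logic in a plain function that takes the tetramer table as a parameter, followed by a small unittest case. Everything here is an assumption for illustration rather than the script's actual code: the function names, the `min_fraction` threshold of 0.5, and the test sequences are hypothetical. Only the three 18S tetramers are taken from this thread; 16S and ITS entries would have to be supplied by the caller.

```python
import unittest
from collections import Counter

# 18S tetramers as reported earlier in this thread; other amplicon types
# are intentionally left out and can be passed in by the caller.
KNOWN_TETRAMERS = {
    "18S": ("GCTA", "GCTC", "ACAC"),
}


def count_tetramers(sequences):
    """Count the first four nucleotides of each sequence (case-insensitive)."""
    return Counter(seq[:4].upper() for seq in sequences if len(seq) >= 4)


def guess_amplicon_type(sequences, tetramers=KNOWN_TETRAMERS, min_fraction=0.5):
    """Return the best-matching amplicon type, or None if no type reaches
    min_fraction of the sequences (a hypothetical cutoff)."""
    counts = count_tetramers(sequences)
    total = sum(counts.values())
    if total == 0:
        return None
    best_type, best_fraction = None, 0.0
    for amplicon, kmers in tetramers.items():
        fraction = sum(counts[k] for k in kmers) / total
        if fraction > best_fraction:
            best_type, best_fraction = amplicon, fraction
    return best_type if best_fraction >= min_fraction else None


class GuessAmpliconTypeTests(unittest.TestCase):
    def test_majority_18s(self):
        seqs = ["GCTACCTG", "GCTCAAAG", "ACACTTTT", "TTTTGGGG"]
        self.assertEqual(guess_amplicon_type(seqs), "18S")

    def test_unknown_returns_none(self):
        self.assertIsNone(guess_amplicon_type(["TTTTGGGG"]))
```

Keeping the tetramer table as an argument means new amplicon types can be added, along with whatever supporting evidence is gathered for them, without touching the function body.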

cuttlefishh · 2017-10-17T01:24:59Z

These all sound reasonable. Could someone help me with the unit tests? I'm a noob. Daniel, maybe I can bribe you with beer?


cuttlefishh added 4 commits December 6, 2016 05:26

new script to calculate OTU distribution statistics

daca1f5

using more standard variable names

8ba14be

script to check amplicon type using deblur fasta

54b3c91

added support for 18S and ITS, now explicitly calling k-mers tetramers

58ec12f

mortonjt reviewed Dec 6, 2016

fixed typo

887ec13

wasade reviewed Dec 6, 2016

deleting summarize_otu_distributions.py and adding to biom-format

47e5f80

updated 18S tetramers after positive filtering with Silva 18S

3ba33bf
