Utility scripts for Deblur #119

Open
wants to merge 7 commits into
base: master

cuttlefishh

Two scripts are included here:

  • verify_amplicon_type.py -- uses the first 4 nt to guess which kind of amplicon the OTU sequences come from. Currently supports 16S, 18S, and ITS. All sequences in a given study should start at the same place, as should sequences from any study using the same primer set, but it is helpful to be able to check that all the studies in a meta-analysis start at the same 5' position. Additionally, one might have some sequences of unknown origin; this helps identify what they are without resorting to BLAST or similar tools.
  • summarize_otu_distributions.py -- reports, for each OTU in a biom table, the number, fraction, and rank of samples in which the OTU is found, and the abundance, fraction, and rank of observations represented by that OTU. It also provides the taxonomy and a list of all the samples in which the OTU is found. Importantly, the script asks the user to supply a rarefied biom table. This code was developed for the 'OTU sequence lookup' effort and is mostly a wrapper for some biom commands, but I think it should have general utility for Deblur users.
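The 5' check described for verify_amplicon_type.py can be illustrated with a short sketch. This is not the script from the PR; the function name `tetramer_frequencies`, its `k` parameter, and the example sequences are assumptions made up for illustration. It simply tallies how often each leading 4-mer occurs, which is enough to see whether a set of sequences starts at a consistent 5' position.

```python
from collections import Counter


def tetramer_frequencies(sequences, k=4):
    """Return the fraction of sequences that start with each 5' k-mer.

    Hypothetical helper for illustration; sequences shorter than k are skipped.
    """
    counts = Counter(seq[:k].upper() for seq in sequences if len(seq) >= k)
    total = sum(counts.values())
    if total == 0:
        return {}
    # most_common() orders the k-mers from most to least frequent
    return {kmer: n / total for kmer, n in counts.most_common()}


# Made-up sequences, purely illustrative (not real amplicon data)
example = ["TACGGAGG", "TACGTAGG", "tacggagg", "GCTACTTG"]
freqs = tetramer_frequencies(example)
```

If one tetramer dominates across every study in a meta-analysis, the sequences most likely share a start position; a mixed profile suggests differing primers or trimming.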

help="Output OTU summary (.tsv)")

def make_otu_summary(input_biom_fp, output_summary_fp):
    """Summarize distribution information about each OTU (sequnece) in a Deblur
Collaborator

sequnece -> sequence

@cuttlefishh
Author

cuttlefishh commented Dec 6, 2016 via email

file_okay=True),
help="Output OTU summary (.tsv)")

def make_otu_summary(input_biom_fp, output_summary_fp):
Member

wouldn't it make sense for this to be part of biom?

samples = table.ids(axis='sample')
otus = table.ids(axis='observation')
for idx, cdat in enumerate(table.iter_data(axis='observation')):
    otu_total_obs[otus[idx]] = np.sum(cdat)
Member

this forloop can be replaced with calls to Table.sum
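To make the suggestion concrete, here is a minimal sketch using a plain NumPy array as a stand-in for the table's data (rows as OTUs, columns as samples); the array contents and OTU names are made up. With a real biom Table, the per-OTU totals should be available from `table.sum(axis='observation')`, which is the call this comment appears to refer to.

```python
import numpy as np

# Stand-in for the table: rows are OTUs, columns are samples (made-up counts)
otus = ["otu_a", "otu_b", "otu_c"]
data = np.array([[1, 0, 3],
                 [0, 0, 2],
                 [5, 5, 0]])

# Loop version, mirroring the code under review
loop_totals = {}
for idx, row in enumerate(data):
    loop_totals[otus[idx]] = np.sum(row)

# Vectorized version: a single sum across the sample axis
vector_totals = dict(zip(otus, data.sum(axis=1)))
```

Both versions give the same totals; the vectorized form avoids the Python-level loop and the index bookkeeping.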

@cuttlefishh
Author

cuttlefishh commented Dec 6, 2016 via email

@wasade
Member

wasade commented Dec 6, 2016

Sure. Deblur could later import the object and revise the column name afterward.

@coveralls

Coverage Status

Coverage remained the same at 89.322% when pulling 58ec12f on cuttlefishh:otuscripts into b377d4a on biocore:master.

@coveralls

Coverage Status

Coverage remained the same at 89.322% when pulling 887ec13 on cuttlefishh:otuscripts into b377d4a on biocore:master.

@coveralls

Coverage Status

Coverage remained the same at 89.322% when pulling 47e5f80 on cuttlefishh:otuscripts into b377d4a on biocore:master.

@cuttlefishh
Author

cuttlefishh commented Dec 7, 2016

I have deleted summarize_otu_distributions.py and issued a PR to biom-format.

For verify_amplicon_type.py, it would be prudent to wait until we have tested Deblur with ITS and 18S, so we know it works and the 5' tetramer frequencies are known. Deblur seems to leave the ITS 5' tetramer frequencies intact, but the 18S frequencies are changing, and it's unclear what the underlying causes are.

@cuttlefishh
Author

I think 18S is working now, so we should be good. After positive filtering with Silva 18S, the top three 5' tetramers are GCTA, GCTC, and ACAC, which are found in slightly >50% of OTUs.

@coveralls

Coverage Status

Coverage decreased (-0.2%) to 89.162% when pulling 3ba33bf on cuttlefishh:otuscripts into b377d4a on biocore:master.

@wasade
Member

wasade commented Oct 17, 2017

Going through old PRs. I think it would be fantastic to get this merged, but in my opinion the following few items would be great if possible:

  • shift the primary logic to library code, and wrap with unit tests
  • parameterize the amplicon to detect. It would be great if a few more tetramers were available
  • for the provided tetramers, would it be possible to have citable material (either publications or analysis) which provides support for reporting the amplicon type?
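One way the first two items could look is sketched below: the detection logic lives in a plain function that takes the tetramer table as a parameter, with unit tests alongside. This is an illustration only, not the PR's code. The name `guess_amplicon_type`, the `tetramers_by_amplicon` argument, and the `min_fraction` threshold are hypothetical; the only tetramers used are the three 18S ones reported earlier in this thread, and the test sequences are made up.

```python
import unittest


def guess_amplicon_type(sequences, tetramers_by_amplicon, min_fraction=0.5):
    """Guess the amplicon type from the first 4 nt of each sequence.

    tetramers_by_amplicon maps an amplicon name to a set of 5' tetramers
    (hypothetical structure). Returns the best-matching name, or None if no
    amplicon's tetramers cover at least min_fraction of the sequences.
    """
    seqs = [s.upper() for s in sequences if len(s) >= 4]
    if not seqs:
        return None
    best_name, best_fraction = None, 0.0
    for name, tetramers in tetramers_by_amplicon.items():
        # Fraction of sequences whose leading tetramer is listed for this amplicon
        fraction = sum(s[:4] in tetramers for s in seqs) / len(seqs)
        if fraction > best_fraction:
            best_name, best_fraction = name, fraction
    return best_name if best_fraction >= min_fraction else None


class GuessAmpliconTypeTests(unittest.TestCase):
    # 18S tetramers as reported in this thread; sequences below are made up
    table = {"18S": {"GCTA", "GCTC", "ACAC"}}

    def test_majority_match(self):
        seqs = ["GCTACC", "GCTCAA", "TTTTGG"]
        self.assertEqual(guess_amplicon_type(seqs, self.table), "18S")

    def test_no_match_returns_none(self):
        self.assertIsNone(guess_amplicon_type(["TTTTGG"], self.table))
```

Passing the table in as an argument means additional amplicons and tetramers can be supplied without changing the function, and each entry can be traced to whatever publication or analysis supports it.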

@cuttlefishh
Author

cuttlefishh commented Oct 17, 2017 via email
