Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Too much focus on transformation in the RNA-Seq exploration section #10

Open
TomSmithCGAT opened this issue May 10, 2022 · 4 comments
Open

Comments

@TomSmithCGAT
Copy link
Member

TomSmithCGAT commented May 10, 2022

The 'RNA-seq Data Exploration' notebook covers:

  1. Reading in the salmon quant and sample info, including exploring data structure
  2. Importing scaled TPM (excercise)
  3. Filtering the genes
  4. Raw count distributions
  5. log2 transformation
  6. vst
  7. rlog (exercise)
  8. PCA
  9. Plotting different PCs (exercise)
  10. Reveal sample swap and update sample info

My concerns are that

  • Much of the material is focused on data transformation, but ultimately, only the rlog is used, and the only exploration of the data is the PCA.
  • At no point do we explore the expression of a single gene, which I expect will be a more intuitive way to explore the data for many participants
  • 2 extra packages are introduced just for the PCA (ggfortify, ggrepel)
  • The introduction beforehand focuses on DESeq2, but they don't use this package and explore the object structure until much later (excepting vst and rlog).

I would favour the following, with some of the below potentially being recovered from the 'Additional Data Exploration' notebook.

  1. Reading in the salmon quant and sample info
  2. Creating a DESeq2DataSet, including exploring data structure (with explanation that the design argument will be fully explored in a later session)
  3. Exploring raw counts by looking at the overall distibution and one example gene (DESeq2::plotCounts)
  4. Repeat 3 but for log transformed counts (exercise)
  5. Hierachical clustering of raw counts (a few lines with pheatmap should be suffficient, as suggested in DESeq2 vignette, or alternatively, just base R hclust, with appropriate sample naming)
  6. Repeat 6 with log-transformed counts to demonstrate the value of transformation (exercise; separation should now be clearer)
  7. Reveal sample swap and update sample info
  8. rlog
  9. PCA using DESeq2::plotPCA
  10. What does the PCA 'say' (discussion)
@TomSmithCGAT
Copy link
Member Author

I'm aware that introducing DESeq2DataSet at this point might be considered too early and over-complicated. Would be simple enough to do all the above just from tximport output obviously. Just a few more lines for the single gene plotting and keep 'ggfortify' for the PCA.

@tavareshugo
Copy link
Contributor

I like the idea of introducing the DESeqDataSet early on, since that's the main data structure to become familiar with.
I don't think it's necessary to interact with the txi object directly.

I like this kind of approach:

assay(dds, "rlog") <- assay(rlog(dds))

To include the normalised counts within the main object. So that they can saveRDS() the object for other downstream analysis later on.
This way you could do the exploration by fetching the matrices as assay(dds, "counts") for raw counts or assay(dds, "rlog") for normalised counts.

For example, this is what the popular workflow nf-core/rnaseq does, creating a DESeqDataSet object with those assays (and a null design design(dds) ~ 1), so if users or their bioinfo core use that workflow for their analysis, they will then be able to follow on from our course and apply all the downstream analysis we cover.

Incidentally, doing the rlog from the DESeqDataSet object takes into account transcript length normalisation factors (generated by salmon and stored in assay(dds, "avgTxLength") from objects created by DESeqDataSetFromTximport() - see estimateSizeFactors() docs), whereas the way we are doing it now (directly from the txi$counts matrix) does not.


Exploring the expression of a single gene seems very useful too. It maybe requires a bit of gymnastics, but probably worth covering.
I would usually do something like:

assay(dds[c("gene1", "gene2"), ], "rlog") |> 
  as_tibble(rownames = "gene") |> 
  pivot_longer(-gene, names_to = "sample", values_to = "expr") |> 
  left_join(as_tibble(colData(dds)), by = "sample") |> 
  ggplot(aes(treatment, expr)) +
  geom_jitter() +
  facet_wrap(~ gene)

Which can probably be broken into simpler steps for teaching. Or maybe there's a simpler way of doing this?

@AbbiEdwards
Copy link
Contributor

Alot of what you suggest is actually covered in the later sections (deseq2 objects/hierarchical clustering/heatmaps etc.). In fact a huge chunk at the start of the deseq2 section is devoted to the parts that make up the deseq2 object and how it works. I am open to discussing maybe moving some of it earlier though if people think it would be more useful but my feeling is that this section is for the little qc/explorations you do before you get to deseq2. There was a section of the visualisation section which deals with looking at a single gene but so much had to be removed to make it fit into teaching remotely unfortunately it got cut but we have it in supplementary if we wanted to bring it back. I worry that diving into their favourite gene too early might increase the likelihood of cherrypicking outcomes but I guess if there is a control gene they definitely know should do something...? Food for thought...

I do take your point on transformations, although I didn't feel like I spent much time on it at all last time I taught that bit lol! Maybe we could rationalise the materials here. Personally I have also been thinking that this section needs a refresh but I didn't have time between Feb when I taught it last and now.

I don't think its an issue packages get introduced just for one task (although incidentally I think those are also in later sections) as thats the nature of how things work in the real world and this isn't a R beginners course.

I'll set up a debrief meeting so we can get everyone's feedback on any changes we want to make over the summer break but if you have anymore thoughts keep adding them here/as issues its good to have a record.

@TomSmithCGAT
Copy link
Member Author

TomSmithCGAT commented May 12, 2022

I think hierachical clustering and sample correlations (both in the sup materials for this notebook), would complement the PCA, and give the participants another angle for the data exploration to understand if the experiment has 'worked', which seems like it should be the major aim of this section to me.

Whether to do this from a DESeqDataSet object though? 🤔 I can definitely both sides on that one.

I would only suggest adding in the single gene example if we want to demonstrate the effect of the transformation more clearly, since I think many participants will find that helpful. Agree that we might not want them to focus on a single gene at this stage though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants