Tuesday, 12 November 2013

FAQ and clarifications for the ExomeDepth R package

With a few colleagues, we published in 2012 a paper describing ExomeDepth, an R package designed to call CNVs from exome sequencing experiments. The ExomeDepth paper has received a fair number of citations (not hundreds, but still something substantial), which is obviously something I am happy to see. When a user asks a question, I always suspect there may be others with the same issue. I have therefore decided to put together a small FAQ, which I will edit as time goes by and as the package is updated. The main point I want to make is that I encourage people to get in touch with me and report issues and results that seem incorrect. I am really happy to help if I can.



Q: ExomeDepth is advertised for exome sequencing. But can it be used on smaller targets than exomes?


Yes, it will work without any substantial difference, but you need a substantial number of genes (or rather exons) to get the parameters estimated properly. My guess is that ~20 genes will do the job.


Q: How do I know if my ExomeDepth results look fine?


For exome data, I like to look at the overlap with the common CNVs flagged in the Conrad et al. paper. This dataset is already part of the ExomeDepth package, and the vignette shows how to use it. In my initial tests using exome data generated by the BGI, about 70-80% of CNV calls had at least one Conrad CNV overlapping at least 50% of the call. In these cases I had about 180 CNV calls per sample, two-thirds of them deletions. This is really the best-case scenario, and in such situations the CNV calls are quite reliable. Anything close to these numbers probably means that ExomeDepth ran properly and gave a meaningful output.
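For reference, here is a minimal sketch of that check, following the vignette: it assumes all.exons is the object returned by CallCNVs, and that AnnotateExtra fills the new column with NA for calls that have no matching Conrad CNV.

    data(Conrad.hg19)  ## the Conrad et al. common CNVs ship with ExomeDepth
    all.exons <- AnnotateExtra(x = all.exons,
                               reference.annotation = Conrad.hg19.common.CNVs,
                               min.overlap = 0.5,
                               column.name = 'Conrad.hg19')
    ## proportion of calls overlapped (>= 50%) by a known Conrad CNV
    mean(!is.na(all.exons@CNV.calls$Conrad.hg19))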


If you consider a smaller set of targeted genes, it is not absurd (though a very rough approximation) to get a prior estimate of the number of CNV calls by scaling the genome-wide number down to the size of your gene set (i.e. 200 genes should give about 2 CNV calls per sample on average, though of course it will depend on the genes being targeted).
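In R terms the scaling is trivial; a sketch using the ~180 calls per exome and ~20,000 genes figures quoted above:

    ## expected CNV calls for a 200-gene panel, scaled from exome-wide numbers
    180 * 200 / 20000  ## ~1.8 calls per sample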


Q: But sometimes things really don’t look right. My number of CNVs is off, and the overlap with common known CNVs is too small. Why is that?


It may be an issue with the data. For example, the correlation between the test sample and the combined reference must be very high (a correlation coefficient of at least 0.98 is recommended); otherwise the inference simply cannot work. In the next version of the package I plan to return warnings when that is not the case.
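Checking this takes one line; a sketch assuming my.test and my.reference.selected are the test and aggregate reference count vectors fed to the model (as in the vignette):

    ## correlation between test and reference counts; below ~0.98 the
    ## aggregate reference is probably a poor match for the test sample
    cor(my.test, my.reference.selected)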


This being said, I have seen data that looked right by every obvious measure but returned too few CNVs (about 40-50 per sample, and not a great overlap with the Conrad set). I do not have an answer to explain such poor results; the correlations between test and reference looked high… just odd. I wonder whether it is an issue with recent library prep and capture experiments. This is something I need to investigate, and feedback is welcome.


Q: What is the main difference between CoNIFER and ExomeDepth?


CoNIFER (as far as I understand, and the authors should correct me otherwise!) corrects for differences between the "test" exome and the "reference" ones using a principal component type of analysis. It is a perfectly valid approach, of course, but it probably requires a substantial number of exome samples to do this correction reliably. Moreover, my personal view is that these technical artifacts are so complex and difficult to correct genome-wide that the task is really challenging. ExomeDepth takes the alternative approach of mining through the "reference" exomes to find the one(s) best matched to the "test" exome, and then compares the two. So rather than correcting for differences, I propose to start with something with minimum variability. Which one works best will probably depend on the data, but I think CoNIFER tends to be more conservative. In fact, it may be possible to combine both strategies.
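Concretely, the matching step in ExomeDepth is the select.reference.set() function; a minimal sketch following the vignette, assuming my.test is the test-sample count vector, my.reference.set a matrix of candidate reference counts, and ExomeCount.dafr the counts data frame built in the earlier steps:

    my.choice <- select.reference.set(test.counts = my.test,
                                      reference.counts = my.reference.set,
                                      bin.length = (ExomeCount.dafr$end - ExomeCount.dafr$start)/1000,
                                      n.bins.reduced = 10000)
    print(my.choice$reference.choice)  ## the best matched reference exome(s)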


Q: I have just read a paper reporting what looks like a high false positive rate. Is that true?


It goes without saying that I am not very convinced by (nor happy with!) this paper. First of all, ExomeDepth expects exomes generated in the same batch, and I have no idea what was actually done in that case. The authors may have done something a bit silly with the tool, like comparing exomes generated at different dates (or perhaps not, I just can’t say). They also report the validation rate on the whole set of CNVs (1.4K CNVs in 12 samples, which seems about right), but they compare these numbers with CoNIFER, which reports a much smaller number (I see n = 38 in Table 2). CNV calls returned by ExomeDepth come with a Bayes factor that supports the calls: the higher the Bayes factor, the more confident the call. If the authors had ranked CNVs by Bayes factor and reported validation for the confidently called large CNVs, I presume things would have been much closer to the CoNIFER numbers. The next vignette will highlight the use of Bayes factors; I think this point is not sufficiently advertised.
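In the meantime, the ranking step is simple; a sketch assuming all.exons is the object returned by CallCNVs (this mirrors the sorting shown in the vignette):

    ## rank CNV calls by Bayes factor, most confidently supported first
    my.calls <- all.exons@CNV.calls
    my.calls <- my.calls[order(my.calls$BF, decreasing = TRUE), ]
    head(my.calls)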


I have no problem with comparisons, negative or positive; I just feel this one is a bit random. I can absolutely see that ExomeDepth will sometimes deliver unreliable results, but I don’t think this paper shows the pros and cons well.


Q: I never see a CNV call in the first exon of a chromosome. Is this a bug?


Not really. The explanation is that I had issues with CNV calls overlapping chromosome boundaries: the whole downstream analysis, in particular the annotation steps, was negatively affected. Therefore, in the coding of the underlying Markov chain, I decided to force the absence of a CNV (i.e. normal copy number 2) at the start of each chromosome. It is not an optimal way to address the problem, but at least it is clean to code. This is not a high priority, but some fix should be implemented at some point.
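Conceptually (this is an illustration of the idea, not the actual package internals), the chain's initial state distribution is pinned to the normal copy number state:

    ## illustrative only: all initial probability mass sits on 'normal',
    ## so a chromosome can never start inside a deletion or duplication
    initial.state.probs <- c(deletion = 0, normal = 1, duplication = 0)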


Q: Will ExomeDepth be updated? And when?


I plan to. For one thing, some packages ExomeDepth depends on (like aod) have changed their syntax, so I will probably need to update ExomeDepth to use aods3 instead. Beyond a few small things to improve, I would mainly like to add more diagnostic tools to understand when and why ExomeDepth returns results with lower accuracy. That is the main task really; other things are more minor incremental improvements.

9 comments:

  1. Hi Vincent,
    I'm trying to install ExomeDepth on R version 3.2.1 with the command
    > install.packages("ExomeDepth")
    But I get the error message:
    ERROR: dependencies ‘Biostrings’, ‘IRanges’, ‘Rsamtools’, ‘GenomicRanges’, ‘GenomicAlignments’ are not available for package ‘ExomeDepth’
    installation of package ‘ExomeDepth’ had non-zero exit status.

    Any suggestion?
    Thanks
    C.

  2. Yes, these are the Bioconductor requirements, and they most likely need to be installed separately. See:
    http://bioconductor.org/packages/release/bioc/html/Rsamtools.html
    Installing Rsamtools will most likely pull in the other Bioconductor dependencies.
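    A minimal install sequence would look like this (using the biocLite installer that was standard for that generation of R; later R versions use BiocManager instead):

    source("http://bioconductor.org/biocLite.R")
    biocLite(c("Biostrings", "IRanges", "Rsamtools",
               "GenomicRanges", "GenomicAlignments"))
    install.packages("ExomeDepth")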

  3. Will ExomeDepth be available for R version 3.2.2 in the near future?

  4. In some cases I am seeing the name of an input BAM changed by prepending an "X" to it when converting a counts output to a data frame. I am trying to figure out why this is happening; it does not seem to be an issue with the vignette data. Any insight would be appreciated, see the example below.

    > colnames(my.counts)
    [1] "1358242663.bam" "Reference.bam"
    > my.df <- as(my.counts[, colnames(my.counts)], 'data.frame')
    > print(head(my.df))
      space start   end width        names X1358242663.bam Reference.bam
    1  chr1 35139 35174    36    FAM138A_3               0             0
    2  chr1 35278 35481   204    FAM138A_2               0             0
    3  chr1 35722 35736    15    FAM138A_1               0             0
    4  chr1 53050 53067    18 AL627309.2_1               0             0
    5  chr1 54831 54936   106 AL627309.2_2               0             0
    6  chr1 69092 70008   917      OR4F5_1               0             0
    >

  5. R column names, I think, should not start with a number, so the "X" gets added to take care of that. I don't think it should have any impact on the final results of ExomeDepth?
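    This is R's standard name mangling; assuming the conversion goes through make.names() (or the equivalent check.names logic), you can reproduce it directly:

    make.names("1358242663.bam")
    ## [1] "X1358242663.bam"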

  6. This comment has been removed by the author.

  7. Thanks for the clarification; I am new to using R, so I am still working out some of the details. I have an additional question about building the reference set: following the vignette (section 5), I am a little perplexed by these statements:

    my.matrix <- as.matrix( ExomeCount.dafr[, my.choice$reference.choice, drop = FALSE])
    my.reference.selected <- apply (X = my.matrix, MAR = 1, FUN = sum)

    It seems that the final my.reference.selected has the counts from the selected references summed. I thought the gist was to find a single best matching reference. If the counts are summed across the reference candidates, that seems to inflate the expected count at that location, which I would think impacts the sample-versus-reference comparison... or am I missing something? I could see using the average read depth across the matching candidates.

    > head(ExomeCount$Exome1)
    [1] 0 0 118 198 516 272
    > head(ExomeCount$Exome3)
    [1] 0 0 116 104 530 336
    > head(ExomeCount$Exome2)
    [1] 0 0 242 48 1112 762
    > head(my.reference.selected)
    [1] 0 0 476 350 2158 1370
    >

  8. Hello Vincent,

    I am a big fan of ExomeDepth!
    I am working on WES data with ExomeDepth, and it really works well.

    I now have one question related to the reference set.
    I know that ExomeDepth tries to find the best reference set, sometimes one sample, sometimes several.

    What I found is that, depending on my reference pool, the extracted CNVs were dramatically different. In my case, I put in about 10 WES samples and allowed ExomeDepth to find the best reference set.

    So now I have a question: is there any recommendation on which WES samples to put in my reference set? For example, should the reference WES samples come from the same experiment batch? Or is there any other restriction on the reference set? Since the extracted CNVs are very different depending on my reference pool, I am afraid that I might end up with FP CNVs.

    Could you please give any suggestions?


  9. Hello,

    Greetings!
    I am working with exome data. While annotating the CNV calls with Conrad.hg19, I am getting the following error; on the other hand, if I use exons.hg19 as the reference annotation, everything works fine.

    Error in GenomicRanges::findOverlaps(query = my.calls.GRanges, subject = reference.annotation) :
    object 'Conrad.hg19.common.CNVs' not found
    Calls: AnnotateExtra -> AnnotateExtra ->

    As per the manual this should not happen, as Conrad.hg19 has been integrated into ExomeDepth. I ran the analysis with a different data set and got the same error. Could you please give some insight into the probable reasons for this error? I am using version 1.1.10 (R > 3.3.0).

    Regards,
    Ashwani
