Tuesday, 12 November 2013

FAQ and clarifications for the ExomeDepth R package

With a few colleagues, I published in 2012 a paper describing ExomeDepth, an R package designed to call CNVs from exome sequencing experiments. The ExomeDepth paper has received a fair number of citations (not hundreds, but still something substantial), which is obviously something I am happy to see. When a user asks a question, I always suspect there may be others with the same issue. I have therefore put together a small FAQ, which I will edit as time goes by and as the package is updated. The main point I want to make is that I encourage people to get in touch with me and to report issues and results that seem incorrect. I am really happy to help if I can.

Q: ExomeDepth is advertised for exome sequencing. But can it be used on smaller targets than exomes?

Yes, it will work essentially unchanged, but you need a reasonable number of genes (or rather exons) for the parameters to be estimated properly. My guess is that ~20 genes will do the job.

Q: How do I know if my ExomeDepth results look fine?

For exome data, I like to look at the overlap with the common CNVs flagged in the Conrad et al paper. This dataset is already part of the ExomeDepth package, and the vignette shows how to use it. In my initial tests using exome data generated by the BGI, about 70-80% of CNV calls had at least one Conrad CNV overlapping at least 50% of the call. In these cases I had about 180 CNV calls per sample, two-thirds of them deletions. This is really the best-case scenario, and in such situations the CNV calls are quite reliable. Anything close to these numbers probably means that ExomeDepth ran properly and gave meaningful output.
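As a concrete sketch of this check: the vignette's AnnotateExtra() step adds a column to the CNV calls (for instance 'Conrad.hg19') that is NA whenever no common Conrad CNV overlaps the call at the chosen threshold, so the sanity check reduces to the fraction of non-NA annotations. The data frame below is mock data standing in for real annotated output, not something ExomeDepth produces verbatim.

```r
# Mock annotated CNV calls: the 'Conrad.hg19' column mimics what
# AnnotateExtra() adds (NA = no common Conrad CNV overlaps the call).
calls <- data.frame(
  id = paste0("cnv", 1:5),
  Conrad.hg19 = c("gains_1", NA, "losses_3", "gains_7", NA),  # mock annotations
  stringsAsFactors = FALSE
)

# Fraction of calls supported by a known common CNV
conrad.overlap.rate <- mean(!is.na(calls$Conrad.hg19))
conrad.overlap.rate  # 0.6 with this mock data; ~0.7-0.8 is what I saw with good exome data
```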

If you consider a smaller set of targeted genes, it is not absurd (though a very rough approximation) to get a prior estimate of the number of CNV calls by scaling the genome-wide number down to the size of the gene set of interest (i.e. 200 genes should give about 2 CNV calls per sample on average, though of course this will depend on the genes being targeted).
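The back-of-envelope scaling is just a proportion, assuming the ~180 calls per sample quoted above and a ~20,000-gene exome:

```r
# Rough prior for a targeted panel: scale the genome-wide call count
# down by the fraction of genes on the panel. The 180 and 20,000 figures
# are the approximate numbers quoted in the text above.
exome.calls.per.sample <- 180
exome.genes <- 20000
panel.genes <- 200

expected.panel.calls <- exome.calls.per.sample * panel.genes / exome.genes
expected.panel.calls  # 1.8, i.e. about 2 calls per sample
```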

Q: But sometimes things really don’t look right. My numbers of CNVs is off, and the overlap with common known CNVs is too small. Why is that?

It may be an issue with the data. For example, the correlation between the test sample and the combined reference must be very high (a correlation coefficient of at least 0.98 is recommended), otherwise the inference simply cannot work. In the next version of the package I plan to return warnings when that is not the case.
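A minimal sketch of that sanity check, using simulated exon counts (in a real analysis the two vectors would be the test sample's counts and the aggregate reference counts, e.g. from getBamCounts() and the reference set selection):

```r
# Simulate per-exon read counts for a well-matched test/reference pair.
# Capture efficiency varies a lot between exons, which is exactly why the
# correlation between matched samples should be very high.
set.seed(1)
exon.mean <- rlnorm(n = 500, meanlog = 5, sdlog = 1)   # per-exon capture efficiency
reference.counts <- rpois(n = 500, lambda = exon.mean)  # mock reference counts
test.counts <- rpois(n = 500, lambda = exon.mean)       # mock well-matched test sample

correlation <- cor(test.counts, reference.counts)
if (correlation < 0.98) {
  warning("Test/reference correlation below 0.98: CNV calls will be unreliable")
}
```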

This being said, I have seen data that looked right by every obvious measure but for which ExomeDepth returned too few CNVs (about 40-50 per sample, and not a great overlap with the Conrad set). I do not have an explanation for such poor results: the correlations between test and reference looked high… it is just odd. I wonder if it is an issue with recent library preparation and capture experiments. This is something I need to investigate, and feedback is welcome.

Q: What is the main difference between CoNIFER and ExomeDepth?

CoNIFER (as far as I understand, and the authors should correct me otherwise!) corrects for differences between the "test" exome and the "reference" ones using a principal-component-type analysis. It is a perfectly valid approach, of course, but it probably requires a substantial number of exome samples to do the correction reliably. Moreover, my personal view is that these technical artifacts are so complex and variable that correcting them genome-wide is a really challenging task. ExomeDepth takes the alternative approach of mining the "reference" exomes to find the one(s) best matched to the "test" exome, and then compares the two. So rather than correcting for differences, I propose to start from a comparison with minimum variability. Which one works best will probably depend on the data, but I think CoNIFER tends to be more conservative. It may in fact be possible to combine both strategies.

Q: I have just read a paper reporting what looks like a high false positive rate. Is that true?

It goes without saying that I am not very convinced (nor happy!) with this paper. First of all, ExomeDepth expects exomes generated in the same batch, and I have no idea whether that was the case here. The authors may have done something a bit silly with the tool, like comparing exomes generated at different dates (or perhaps not, I just cannot say). They also report the validation rate on the whole set of CNV calls (1.4K CNVs in 12 samples, which seems about right), but they compare these numbers with CoNIFER, which reports a much smaller set (I see n = 38 in Table 2). CNV calls returned by ExomeDepth come with a Bayes factor that quantifies the support for each call: the higher the Bayes factor, the more confident the call. If the authors had ranked CNVs by Bayes factor and reported the validation rate for the confidently called large CNVs, I presume the numbers would have been much closer to the CoNIFER ones. The next vignette will highlight the use of Bayes factors; I think this point is not sufficiently advertised.
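Ranking by Bayes factor is a one-liner. With a real run the calls sit in the CNV.calls slot of the ExomeDepth object, a data frame with a BF column (log10 scale); the mock data frame below stands in for it, and the threshold of 10 is arbitrary, purely for illustration:

```r
# Mock CNV calls mimicking the shape of the CNV.calls data frame
cnv.calls <- data.frame(
  id = paste0("cnv", 1:6),
  type = c("deletion", "duplication", "deletion",
           "deletion", "duplication", "deletion"),
  BF = c(3.2, 25.7, 8.9, 1.1, 47.3, 14.0),  # mock log10 Bayes factors
  stringsAsFactors = FALSE
)

# Most confident calls first, then keep only the strongly supported ones
ranked <- cnv.calls[order(cnv.calls$BF, decreasing = TRUE), ]
confident <- ranked[ranked$BF > 10, ]  # arbitrary illustrative threshold
confident$id  # "cnv5" "cnv2" "cnv6"
```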

I have no problem with comparisons, negative or positive; I just feel it is a bit random in this case. I can absolutely see that ExomeDepth will sometimes deliver unreliable results, but I do not think this paper shows the pros and cons well.

Q: I never see a CNV call in the first exon of a chromosome. Is this a bug?

Not really. The explanation is that I had issues with CNV calls overlapping chromosome boundaries: the whole downstream analysis, in particular the annotation steps, was negatively affected. Therefore, in the coding of the underlying Markov chain, I decided to force each chromosome to start in the normal state (i.e. copy number 2). It is not an optimal way to address the problem, but at least it is clean to code. This is not a high priority, but a proper fix should be implemented at some point.

Q: Will ExomeDepth be updated? And when?

I plan to. For one thing, some packages ExomeDepth depends on (like aod) have changed their syntax, and I will probably need to update ExomeDepth to use aods3 instead. Beyond a few small fixes, I would mainly like to add more diagnostic tools to understand when and why ExomeDepth returns results with lower accuracy. That is the main thing to do; the rest is minor incremental improvement.