Tuesday, 12 November 2013

FAQ and clarifications for the ExomeDepth R package

With a few colleagues, we published in 2012 a paper describing ExomeDepth, a R package designed to call CNVs from exome sequencing experiments. The ExomeDepth paper has received some citations (not hundreds though, but still something substantial) which is obviously something I am happy to see. When a user asks questions, I always feel like there may be others with the same issues. Therefore, I decided to put together a small FAQ which I will edit as time goes by and when the package is updated. The main point I want to make is really that I encourage people to get in touch with me and feedback issues and results that seem incorrect. I am really happy to help if I can.

Q: ExomeDepth is advertised for exome sequencing. But can it be used on smaller targets than exomes?

Yes it will work without any substantial difference, but you need a substantial number of genes (or rather exons) to get the parameter estimated properly. I guess that ~ 20 genes will do the job.

Q: How do I know if my ExomeDepth results look fine?

For exome data, I like to look at the overlap with the common CNVs flagged in the Conrad et al paper. This dataset is already part of the ExomeDepth package and the vignette shows how to use it. In my initial tests using exome data generated by the BGI, about 70-80% of CNV calls had at least one Conrad CNV overlapping at least 50% of the call. In these cases I had about 180 CNV calls per sample, two third of them deletions. This is really the best case scenario and in such situations the CNV calls are quite reliable. Anything close to these numbers probably means that ExomeDepth ran properly and gave a meaningful output.

If you consider a smaller set of targeted genes, it is not absurd (but a very rough approximation) to get a prior estimate of the number of CNV calls by scaling the genome-wide number down to the smaller gene set of interest to you (i.e. 200 genes should give about 2 CNV calls per sample on average, but of course it will depend on the genes being targeted).

Q: But sometimes things really don’t look right. My numbers of CNVs is off, and the overlap with common known CNVs is too small. Why is that?

It may be an issue with the data. For example the correlation between the test sample and the combined reference must be very high (correlation coefficient at least 0.98 is recommended) otherwise the inference simply cannot work. In the next version of the package I plan to return warnings when that is not the case.

This being said I have seen data that look right by any obvious measure but which obviously was returning too few CNVs (about 40-50 per sample, and not a great overlap with the Conrad set). I do not have an answer to explain such poor results. Correlations between tests and reference looked high… just odd. I am wondering if it is an issue with recent library prep and capture experiments. This is something I need to investigate, and feedback is welcome.

Q: What is the main difference between CoNIFER and ExomeDepth?

CoNIFER (as far as I understand, and the authors should correct me otherwise!) corrects for differences between the "test" exome and the "reference" ones using a principal component type analysis. It is a perfectly valid approach of course. However,  it probably requires a substantial number of exome samples to do this correction reliably. Moreover, my personal view is that usually these technical artifacts are so complex and difficult to correct genome-wide that this task is really challenging. ExomeDepth takes the alternative approach of mining through the "reference" exomes to find the one(s) that are best matched to the "test" exome, and performs a comparison between both. So rather than correcting for differences, I propose to start with something with minimum variability. Which one works best will probably depend on the data, but I think CoNIFER tends to be more conservative. It may be possible to combine both strategies in fact.

Q: I have just read a paper reporting what looks like a high false positive rate. Is that true?

It goes without saying that I am not very convinced (nor happy!) with this paper. First of all, ExomeDepth expects exomes generated from the same batch, and I have no idea what was actually done in that case. The authors may have done something a bit silly with the tool, like comparing exomes generated at different dates (or perhaps not, I just can’t say). They also report the validation rate on the whole set of CNVs (1.4K CNVs in 12 samples, which seems about right). But they compare numbers with CoNIFER which reports a much smaller number (I see n = 38 in Table 2). CNVs calls returned by ExomeDepth come with a Bayes factor that support the calls. The higher the Bayes Factor, the more confident the call is. If the authors had ranked CNVs by Bayes factor and reported validation for the confidently called large CNVs, I presume things would have been much close to the CoNIFER numbers. The next vignette will highlight the use of Bayes Factors, I think this point is not sufficiently advertised.

I have no problem with comparisons, negative or positive, I just feel like it’s a bit random in this case. I can absolutely see  that ExomeDepth will sometimes delivers unreliable results, but I don’t think this paper shows the pros and cons well.

Q: I never see a CNV call in the first exon of a chromosome. Is this a bug?

Not really. The explanation for this is that I had issues with CNVs overlapping chromosome boundaries. The whole downstream analysis, in particular the annotation steps, were negatively affected. Therefore, in the coding of the underlying Markov Chain, I decided to force an absence of CNV (i.e. normal copy number 2) to start a chromosome. It is not an optimum way to address the problem but at least it is clean to code. This is not a high priority but some fix should be implemented at some point.

Q: Will ExomeDepth be updated? And when?

I plan to. For one thing some packages ExomeDepth depends on (like aod) have changed the syntax. I will probably need to update ExomeDepth using aods3 instead. While there are a few small things to improve, I would like to add more diagnostic tools to understand when and why ExomeDepth returns results with lower accuracy. That is the main thing to do really, other things are more minor incremental improvements.

Monday, 20 May 2013

Latest paper: Bayesian test for co-localisation between pairs of genetic association studies using summary statistics

In this latest paper (just submitted to arXiv, led by PhD student Claudia Giambartolomei) we want to answer the following question: given two genetic association studies both showing some association signal at a locus, how likely is it that the same variant is responsible for both associations?  
We care about this because a shared causal variant is likely to imply an etiological link between the traits being considered. An obvious application consists of comparing a gene expression study and a disease trait. If one can show that the same variant is affecting both measurements, then it is very likely that the expression of this gene is affecting disease pathogenesis. It also provides information about the tissue type where the effect is mediated. This is a key information to inform a drug design process.
Previous work that led to this manuscript
A while back, I started a discussion with my colleague (and co-author on this manuscript) Eric Schadt about the involvement of a gene name RPS26 in type 1 diabetes. We came up with tests of co-localisation, which were later improved by my colleague (and co-author as well) Chris Wallace, based in Cambridge. These tests are somewhat dated now. The earliest version considered situations with very small number of SNPs, and was not well suited for densely typed regions, in particular as a result of imputation procedures.

This SNP density problem can be overcome to some extent, and Chris Wallace discusses how to do this here. However, a more fundamental issue is the Bayesian/frequentist difference. These earlier tests were testing the null hypothesis of a shared causal variant. Failing to reject the null could be the result of either a lack of power, or a true shared causal variant. In this newer Bayesian framework, the probability of each scenario is computed, including the “lack of power” case. It then becomes easier to interpret the outcome of the test. The tests are about to be released in the latest version of the coloc package (which is maintained by Chris Wallace).

In this latest paper, the underlying model is closely linked to the one proposed by Matthew Stephens and colleagues in a recent PLoS Genetics paper. However, co-localisation was more a side story in this paper, whereas it is the central point of our work. In particular, we show that it is possible to use single SNP P-values to obtain a very good approximation of the correct answer. As discussed below, this has important practical applications.
Another closely related work is the software Sherlock. Sherlock also uses P-values, and also tries to match a gene expression dataset with another GWAS. However, Sherlock does not really perform a co-localisation test but rather a general matching between a gene and a GWAS. In particular, in the Sherlock framework, only the variants significantly associated with gene expression contribute to the final test statistic. In contrast, a variant flat for the expression trait but strongly associated with disease provides strong support against co-localisation. Our work incorporates this information, by adding support to the “distinct association peaks” scenario.
A warning about the interpretation
As always in statistics, correlation does not imply causality. And what we quantify here are correlations. We can find very strong evidence that the same variant is affecting two traits, but what we cannot conclude without doubt is that the two traits (say, expression of a gene and disease outcome) are causally related. It may be likely, but we are not testing this.
An illustration of the complexity of this is the commonly observed case where a single variant (or haplotype) appears to affect the expression of a group of genes in the same chromosome region. Our test may, in such a situation, provide strong evidence of co-localisation for several of these genes with a disease GWAS. However, most of the time the expression of a single of these genes will actually causally affect the disease trait of interest. It does not mean that the test is wrong but one just has to understand what it is actually testing. Precisely, two traits affected by the same causal variant may suggest a causal link between both, but it does not have to be the case.
Two limitations of this approach
There are two additional limitations to mention. One is that the causal variant should be typed or imputed. We use simulations to show that if this is not the case, the behaviour of the test becomes very conservative.

A second issue is the presence of more than one association for the same trait at a locus. If both associations have approximately the same level of significance, the test can misbehave. In addition, identifying co-localisation with the secondary association requires conditional P-values. We give a nice example of this in the paper. However, if only P-values are available (which is key for what we want to do), this requires using approximate methods. Things are much easier if the genotype level data are available and a proper conditional regression can be implemented.

Why it is important to use summary statistics
Data sharing is always a contentious issue in human genetics. I am incredibly frustrated by the lack of willingness displayed by some groups to share data, even though the claim is that they do. It is a topic for another post. Eric Schadt has been extremely helpful by sharing the liver gene expression dataset with us, but this is a rather uncommon behaviour. In most cases, data are hidden between various “regulations” and “data access committees” that rarely meet and extensively delay the process of data sharing.
Given this frustration, being able to base tests on P-values makes it much easier to interact with other groups and share data. The success of large scale meta-analyses is an example of this. This is why we worked out the statistics so that P-values alone are sufficient to derive the probabilities for each scenario.
A practical implication is that it becomes possible to build a web-based server that will take P-values uploaded by users, compare these P-values with a set of GWAS datasets stored on the server (typically expression studies but perhaps other data types) and return statistics about the overlapping association signals.
We have initiated that process and the coloc server is now live (http://coloc.cs.ucl.ac.uk/coloc/), with a lot of help from the Computer Science department at UCL. We have only loaded the liver dataset that we used in this preprint as of now, but we are in the process of adding a brain gene expression study, led by my colleagues Mike Weale, John Hardy and Mina Ryten. We very much welcome collaborations, and if other datasets, for gene expression or any other relevant traits, are available, we would love to collaborate and incorporate these data into our server.
From genome-wide to “phenome-wide”
What we really want to do with this tool in the near future is mine dozens of GWAS studies using single variant P-values summary data, and search for connections that have been missed by previous investigators. Perhaps there are lipid traits that can be linked to neurodegenerative conditions, like the well known APOE result? Perhaps some T cell genes have an unexpected effect on a cardiovascular trait? Obviously these are not likely events but the genome-wide analysis of many association studies is likely to show several results of this type. The idea is to not only work genome-wide but also “phenome-wide”, comparing as many pairs of traits as possible. Again, this is definitely a collaborative work and we would be excited if we could bring more datasets to make these comparisons more powerful. So don’t hesitate to get in touch.

Wednesday, 1 May 2013

We're hiring

We just advertised a new two year position available to work at the UGI on a range of medical genetics research projects. The main source of funding is a collaboration between University of Virginia, the Wellcome Trust Sanger Institute and the University of Cambridge on the genetics of type 1 diabetes. But this is only part of the story and the aim is to integrate the successful applicant into a broader research program that includes a collaboration with the Institute of Neurology on the genetics of ALS.

A range of applicants are welcome, and the applicant's field can range from computer science to more abstract mathematics. Some experience with programming and scientific computation is expected however.

The link for this application is here. Please get in touch with me if you are potentially interested in working with me at UCL and have any question about the application process.

Tuesday, 19 March 2013

23AndMe replicates a common variant association with chordoma

I am quite excited (and frankly a bit relieved) by the fact that 23AndMe replicated a finding which we published last year. In that first study we associated a common coding variant located in the gene brachyury with much increased odds of developing a rare bone cancer called chordoma (odds ratio around 5 which is really unusual for a common variant). This research received support from the Chordoma foundation which also has a report on the first paper.

Why I was worried

This finding was really a bit of a miracle. As a statistician, I am the person supposed to tell clinicians that small underpowered association studies are pointless. This one pushed the limit quite far, with 40 patients in the discovery set. It was even a candidate study because the initial sequencing work only considered the brachyury gene. This gene is a strong candidate for chordoma owing to the role of duplications in familial forms of the disease. When my colleague Nischalan Pillay came to me with these results, I did not believe them at all. It took me a while and several failures at challenging the result to start believing. The addition of exome sequence data helped rule out technical artifacts. But even with the paper accepted, I must say I had doubts. Now it's OK.

What 23AndMe did

23AndMe is in this amazing position where they can gather information a posteriori on rare diseases like chordoma in a very large cohort of patients. They could rapidly put together a panel of 22 cases (and many controls obviously) that confirmed our result. They describe the work in their blog post which is well worth a quick read. I am really excited by the possibilities offerred by the 23AndMe experimental design. There is really much to do with this ressource. It is also a good but somewhat puzzling thought that any association result published can be almost right away challenged and checked by 23AndMe. This reminds me of the ongoing position of Decode which could publish an incredible amount of results by taking advantage of a ressource that no one else could match.

What it means for chordoma research

The finding for the research on chordoma is quite significant. From a heritability point of view, it is obviously a massive chunk explained. A rs2305089-CC individual has almost no chance of developing chordoma, whereas a TT individual has substantially higher risk (about 25 fold compared to CC).  In fact the 23AndMe description is a bit misleading when they state that CC represents the typical odds of developing chordoma: the variant is so common that the heterozygous group is probably a better representation of the typical odds. This being said the disease is still very rare, so even a TT genotype (like me in fact) should not panic. This is still a very unlikely disease to develop.

Now does that lead to a treatment? Far from that, and in fact the effect of that high risk variant may occur early in development and be absolutely impossible to target from a therapeutic point of view. But still, this is a possibility to explore and it is one important piece of the pattern of inheritance of bone cancers.

Update (June 9 2013)
Nick Ericksson from 23AndMe kindly communicated the exact numbers from the replication work.
Genotypes: (AA/AG/GG)
Cases: 12/10/2
Controls: 68/131/41

The 240 controls were chosen to not report any cancer and to provide the best ancestry match for the cases (as determined by the first 5 components of a PCA analysis). A simple Fisher exact test, which in R looks like this:

fisher.test(matrix(nrow = 2, ncol = 2, data = c(12*2 + 10, 10 + 2*2, 68*2 + 131, 131+ 41*2), byrow = TRUE))

yields a two-tailed P-value of 0.047 (and half of that of course to test the exact replication hypothesis of the Nature Genetics paper).  I note that the odds ratio is lower than our estimates (point estimate OR = 2 in this replication set). It may well be that we have a case of "winner's curse" for our Nat Gen paper but only the future will tell.

Very many thanks to Nick Eriksson and the 23andMe team to share these data.

Monday, 11 February 2013

Genetics of cardiomyopathies: it's not easy...

Congratulation to Luis Lopes (PhD at the UCL Heart Hospital) who just published his paper (open access in the Journal of Medical Genetics) on a high throughput sequencing screen of a cohort of hypertrophic cardiomyopathy (HCM) patients. Luis (and others, including me) scanned 41 genes either known or candidates for being involved in more than 200 HCM cases and used these data to infer what variants are likely to be causal, and with what level of confidence.

Genetics of inherited cardiac disorders are really complex
This scan highlights the complexity associated with interpreting these data. The issue is that quite clearly not all of the likely causal variants are fully penetrant (ie. the disease may happen, but definitely not systematically). Relatively late onset is also not helping out the interpretation. In addition some of the associated genes are quite polymorphic.  The literature has reported plenty of variants associated with HCM but it is hard to know what to do with these, and I suspect that many false positive have been published.

Lastly, it is plausible that not one but multiple variants, some common some rare, act together to define the phenotype and the disease outcome. We will need incredibly large panels to dissect this complexity. I am not completely convinced by this hypothesis, as the disease may show variable outcome due to factors what have nothing to do with genetics, but this is plausible.

The attitude of cardiologists toward genetic diagnosis is changing
This paper would definitely benefit from showing the stream of review, because it was not easy getting it published. Some of the comments were fair and I am in no way arguing that the reviews missed the point. But, while I am rather new to this disease area, my experience with this paper and my reading of the literature suggests that there is a bit of a turning point: in the past, clinicians have often relied on large pedigrees to obtain very reliable evidence that a variant is causal. While this is a worthy approach, it is often impractical in a clinical context, simply because relatives may not be available or the infrastructure of going through the cascade testing may simply be too complicated.

My point is that to be we need to be able to make statements on the basis of what we find in a single case if we want it to be relevant in the clinic. And that must mean not being completely sure in some cases, and being comfortable making probabilistic statements (such as: "on the basis of the data we see, we estimate that this variant is causing the disease with 80% probability").

Clinicians are, of course, not always happy with uncertainty. It's never very satisfying to tell a patient that we cannot be sure. But this is also not unusual: doctors always need to deal with "maybe" when making decisions. It is perhaps less accepted when it comes to genetic diagnosis because we have in mind a very deterministic view of what genetics can do. But uncertainty is unavoidable, so we will have to deal with it.

Last comment: thank you to the UK10K project
The control samples for this paper were provided by the UK10K, which is great because we need large panels of well matched and well phenotyped sequenced samples to understand what is going on in cases. I am pretty sure that there will be quite a few papers using the UK10K data as controls in the near future. Studies like ours would not be possible without these. As the sample size is growing and the phenotyping becomes more thorough, we plan to publish more on this theme and we will reliably continue using the UK10K as a ressource for controls. I am excited by the prospect of these data becoming readily available (which is not quite the case yet I think, as we had to fill a data acces request).

Friday, 1 February 2013

Occult macular dystrophy: neither Mendelian nor complex, but somewhere in between

With this first post I want to briefly discuss a recently published paper, led by Alice Davidson (UCL Institute of Ophthalmology). This is a nice paper with two stories related to two different eye diseases, with one of them intriguing enough to motivate this post.

The main story: RP1L1 and retinitis pigmentosa

This paper started as a study of the genetics of retinitis pigmentosa (RP). As an aside, RP is an incredibly complex disease with a ridiculously long list of associated genes, each (or at least most) in a Mendelian type manner. In fact, it is probably fair to say that after the brain the eye must be the most complex human organ, at least based on the incredibly complex genetic architecture of these developmental traits.

So, in this study, exome sequencing in a RP patient identified a homozygous loss-of-function variant in a massive candidate gene (owing to homology with another RP gene). So it is quite certain that we have found yet again a novel gene for RP, which is the key finding of the paper.

The second story: occult macular dystrophy

Owing to previous reports of association between RP1L1 variants and another eye disease called   occult macular dystrophy (OCMD), Alice also scanned this gene in a cohort of 28 OCMD case. The results were convincing and consistent with previous work: 5 out of 28 harboured the same rare variant (pArg45Trp) which had been reported before in a Japanese study of 3 OCMD families (Akahori et al., 2010). In contrast the NHLBI exome sequencing dataset only reports one pArg45Trp call in more than 5,000 individuals. So there is no doubt that pArg45Trp is associated with OCMD. And we found other rare variants in this cohort, together explaining probably 9/28 cases.

Occult macular dystrophy is not a simple Mendelian disease

So far so good. But the story became rapidly more complex. Intriguingly, we found in the family of these probands 9 patients with the same RP1L1 clearly causal variants but unaffected (and old enough to have potentially developed the disease). So the RP1L1 variants are probably not fully penetrant (i.e. being a carrier does not mean one will develop the disease). As pointed out above, we also found 19/28 OCMD cases without any RP1L1 variants. SO RP1L1 is neither sufficient nor necessary to cause OCMD, in spite of the massive association.

A middle ground between complex and Mendelian?

The emerging picture is that OCMD is half-way between a complex multi-factorial trait and a Mendelian one: there are very strongly associated variants in RP1L1, almost like a Mendelian trait, but this is not quite enough to explain the heritability that we observe. Nevertheless, the genetic architecture may still be simple enough to obtain in the future a more thorough, if not complete, dissection of the trait.

Now is that a common situation?  I suspect it really depends on the disease field but it's definitely unusual for me. Cardiologists who work with diseases like hypertrophic cardiomyopathy are very familiar with these non fully penetrant variants and these complicated pedigrees that do not look fully Mendelian. But in my past experience I have mostly worked on Mendelian disorders with a very clear pattern of inheritance, or alternatively very complex multifactorial traits (like type 1 diabetes). It is somewhat rare for me to see such disorders with an intriguing architecture but perhaps simple enough to solve fully in the future, with the help of modern DNA sequencing technologies.

What next?

So where do we go from this point? As mentioned earlier in this post, it may be possible to further characterise the architecture of OCMD. We are working toward that (with some support from Fight for Sight), by sequencing additional cases without detected RP1L1 variants. We'll see how far we can go with this but I would hope that these intermediate-complexity traits will eventually give us insights into more complex ones. Indeed, it will probably be easier to dissect the role of genetics and environment for diseases like OCMD than for much more complex traits.

Tuesday, 8 January 2013

Blogging from and about a statistical genetics lab at UCL

Welcome. This is not quite a true blog post yet but it is a starting point. My name is Vincent Plagnol, and I am a lecturer in statistical genetics at University College London, UK. Like many researchers, I feel like the pace of peer review is not quite as fast as it should and this blog is meant to address this by pointing out to new publications or ongoing work in a more reactive way, and discussing scientific issues which I think are worth posting. It is part of new year resolution to use public preprint servers whenever I feel it is appropriate. Basically to use more modern means of communication for my work, and see how far I can go with this.

So will this blog be used frequently? Ot will it be one of these new year resolutions that end up being forgotten? I am not sure yet but I will surely give it a go. So more soon on this.