Last week, my lab found out that the LeafCutter bioRxiv paper was updated to include extensive comparisons with other methods, including our own method, MAJIQ, for detecting and quantifying differential splicing.
UPDATE: We just found out that the Li et al bioRxiv preprint was published as a technical report today, Dec 11th 2017, in Nature Genetics.
Before I delve into the details, some background for those who are not necessarily into differential RNA splicing: We published MAJIQ in eLife on Feb 1st, 2016. We constantly maintain the software and a lively user group, frequently pushing updates and bug fixes. The original LeafCutter bioRxiv paper was posted on March 16th 2016 and was used to derive the sQTL results in the Li et al Science paper, which came out on Apr 29th 2016 and was titled “RNA splicing is a primary link between genetic variation and disease”. That bioRxiv paper cited MAJIQ extensively, but did not include any comparisons to it, or to any other method, nor did it include any quantitative assessment of the method or independent experimental validations. The updated preprint, which was posted in Sep 2017, is a very different paper, though the basic algorithm seems to have stayed the same. Besides the addition of an extensive evaluation of splicing and sQTL in GTEx samples (which is very interesting and novel, but is not the focus of this post), the new bioRxiv version included an extensive comparison to MAJIQ, rMATS, and Cufflinks2. Opening that new bioRxiv with nervous excitement and going through the evaluations, figures, and claims, I quickly got the following feeling:
(GIF shamelessly stolen from the wonderful r.i.p.)
To be fair, comparing to other algorithms is hard. Furthermore, authors always want to highlight the relative advantage of their method, and in many cases the other methods end up being run with default params while authors spend many days tweaking their own method to prove improvement. This results in an inflation of “best ever” methods and the known ML rule: Your safest bet is the method that robustly came second in others’ papers. In this case, comparison is even harder: As Li et al note in LeafCutter’s bioRxiv preprint, the compared algorithms “speak different languages” in terms of what they define as transcript variations. Nonetheless, after a careful reading of the updated preprint and extensive reanalysis, we wish to respond to several specific points. The rest of the post is organized as follows: (a) specific critiques of the evaluations performed in the updated preprint, (b) alternative approaches for evaluating DS with the matching results, based on our previous work in Vaquero et al 2016 and Norton et al 2017 (which was posted as a bioRxiv preprint in Jan 2017 and updated in May 2017), followed by (c) a discussion.
We hope this post and subsequent discussion will help not only improve our respective works but also other future works in the field and more generally how we as a community handle some of the issues raised here.
1. Main issues we identified with the evaluation in Li et al 2017:
The comparison of Li et al to other methods in the main text is concentrated entirely in Fig 3a-c. There are additional comparisons in the supplement, but many of those share issues with Fig 3a-c. We only refer here to rMATS and MAJIQ and not cuffdiff2, the latter being less relevant to the points made here. Let’s go through the main points we observed:
- What do you compare against?
The versions of the software used are not mentioned anywhere, but they are clearly outdated. The bioRxiv revision was posted on Sep 7th 2017, but rMATS already had its rMATS-turbo version released on May 1st 2017, and MAJIQ had a major upgrade to version 1.0.0 on May 10th. Both were designed to handle large datasets with significant improvements in running time and memory footprint. Yet Li et al claim that those methods “do not scale well over 30 samples”, take dozens of GB of memory, and are excruciatingly slow (Fig 3a, Fig S9).
- How do you evaluate performance?
The authors offer two measures of performance:
- P-value accuracy (Fig 3b): To evaluate p-value accuracy and the control of false positives (FP), the authors compare the distribution of p-values observed in a tissue1 vs tissue2 comparison with the distribution observed when the two tissues have been mixed and split into two random groups (i.e., a contrast with no expected differences). There is one major problem with this comparison, however: MAJIQ does not produce a p-value. It does not use a null model but rather a Bayesian posterior for P(dPSI > C), with the default criterion set to P(dPSI > 20%) > 0.95. The authors acknowledge that, yet plot 1-P(dPSI > C) against the p-value line, just like all the other methods, and state that it’s not “well calibrated”. Many readers looking at these graphs are likely to quickly conclude “Oh, that MAJIQ method is *terrible*” while in fact they should be going: “??”. The definition of P(dPSI > C) is clearly not that of a p-value, and we hold that using it as such is simply wrong. The shape of the graph simply reflects that, rather than a problem in calibration as suggested by the authors. We will get back to what this plot really means for MAJIQ later in the post.
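To make the distinction concrete, here is a sketch of the two quantities being conflated (C is the dPSI threshold, 0.2 by default; T and t_obs stand for a generic test statistic and its observed value, which are our notation, not the paper's):

```latex
% A p-value is a tail probability under an explicit null model H_0:
\[ p \;=\; P\big(T \geq t_{\mathrm{obs}} \mid H_0\big) \]

% MAJIQ instead reports a Bayesian posterior probability of a substantial splicing change:
\[ P\big(\Delta\Psi > C \mid \text{data}\big), \qquad C = 0.2 \text{ by default} \]
```

Under a null contrast, a well-calibrated p-value is uniform by construction, whereas the posterior of a large change simply concentrates near zero (so 1-P(dPSI > C) piles up near one). That is exactly the shape one would expect from the definition of the quantity, not evidence of miscalibration.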
- Evaluation on synthetic data: Here the authors use synthetic data to measure accuracy in classifying changing vs. non-changing events using ROC curves (Fig 3c). This is a more delicate point, as results will depend greatly on (1) how the synthetic data are defined and (2) what you define as significant changes. We will ignore (1) for now and focus on the latter, as this is crucial for understanding the flaw in this comparison. Li et al defined changes by altering the expression of a specific isoform by 10%, 30%, etc. They then measure how many of those changes are detected by each method. The problem with this is simple: A change of 10% in the expression of an isoform does not translate to a specific level of change in *splicing*. Remember: We are not talking about methods to quantify expression, but changes in splicing. Methods such as rMATS and MAJIQ measure splicing changes as the difference in ratios of inclusion of RNA segments. In the best case scenario, an isoform whose expression is changed is solely responsible for a specific junction’s inclusion. So, for a 10% change of its expression, if the original junction inclusion was X/Y = Z% (X junction-spanning reads from that isoform out of Y total relevant reads), it will now be included at 1.1X/(Y + 0.1X) = Z’%. As you can see, dPSI = Z’ - Z really depends on X vs. Y. For example, if X=1 and Y=9 (which means the changed isoform is expressed at a reasonable ~10% of the gene’s level) then dPSI is less than 1%! Remember: This is the actual splicing change, and MAJIQ is now run on this with default parameters, which means it is looking for high-confidence events of dPSI > 20%. Again, look at those ROC graphs in Fig 3c and your spidey senses should be tingling: Notice how they completely plateau very quickly for MAJIQ? You are basically asking it to find changes in splicing that are bigger than 20% while defining as positive events changes in *expression* of 10%, 25%, 50%, or 300%. Indeed, when you require 300%, an isoform expressed at 10% of the gene could achieve (depending on the splice graph specifics) a dPSI >= 20%, and in that case MAJIQ with its default params does just as well as LeafCutter. And what happens if you actually evaluate on changes in splicing? The answer is given below. Finally, we note that Li et al seem to be aware of at least some of these issues regarding expression changes (p. 28, top paragraph), but unfortunately these observations do not propagate to the overall analysis.
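Plugging in the numbers from the example above (treating X as the junction-spanning reads coming from the altered isoform and Y as the total relevant reads, per the formula in the text):

```latex
\[ \Psi = \frac{X}{Y} = \frac{1}{9} \approx 11.1\%, \qquad
   \Psi' = \frac{1.1X}{Y + 0.1X} = \frac{1.1}{9.1} \approx 12.1\%, \qquad
   \Delta\Psi = \Psi' - \Psi \approx 0.98\% \;<\; 1\% \]
```

So a simulated “positive” defined by a 10% change in expression can correspond to a splicing change more than an order of magnitude below MAJIQ’s default dPSI > 20% threshold.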
To summarize, the comparisons in the recent Li et al preprint use software that was highly outdated at the time of posting, and metrics that are not compatible with the algorithms’ output/definitions. The combination of those creates what we assert to be a highly misleading picture of algorithm performance. Furthermore, we assert these comparisons fail to actually evaluate performance on changes in *splicing* and lack experimental validation.
2. Our comparative evaluation of LeafCutter and MAJIQ.
2.1 Running time/memory:
Here is an evaluation we ran with the current versions, which should be the same as or similar to the ones available then, except for various (minor) bug fixes. Time is in minutes, run with 16 threads on a Linux machine.

While we don’t have the exact memory usage, it is a few hundred MB per thread for MAJIQ and rMATS (see also Norton et al 2017). Two points are worth making here: First, how do you actually compare execution time? For these algorithms it is more complicated than just the numbers above, a point we raise and discuss in Norton et al. Briefly, the algorithms operate differently. For example, if you want to do multiple comparisons of groups of 3 with the 20 samples of the 10 vs 10 above, MAJIQ will run the builder *once*, then the quantifier multiple times for each 3 vs 3 comparison. In rMATS, each run is independent. The second point is that execution time is tied to what you quantify. In LeafCutter’s case that does not include intron retention events (see more on this below), which makes a huge difference.
In summary, rMATS is faster than LeafCutter, and both MAJIQ and rMATS are completely capable of handling large datasets, contrary to the picture presented in Li et al 2017. Again, the fact that no software versions or execution parameters are documented makes the analysis in Li et al 2017 both misleading and harder to reproduce.
2.2 Synthetic data:
Synthetic datasets are useful for evaluating algorithms under varying controlled conditions, when you know what the “truth” is. The downside is that they are, well, synthetic. Specifically, for the problem of DS they commonly involve strong simplifying assumptions about expression levels, distribution of variation in and between the groups compared, read errors, coverage levels, underlying known transcriptome, and variants that involve only predefined “classical” AS event types such as cassette exons.
In order to create more realistic synthetic data for DS analysis, we employed real data from biological replicates. If you are interested in how the data was generated, we encourage you to read Norton et al 2017, but for the discussion here I will only point out the following key points: First, expression levels and transcriptome variations are not based on a synthetic model but on real data/samples. Second, we use the most complex splicing events we can find based on reliably detected (multiple reads at multiple positions) junction-spanning reads from STAR. This enables us to define a lower bound on each gene’s transcriptome complexity. The expression level of each isoform to be simulated is then set by the gene’s overall FPKM and the raw junction-spanning read ratios in the real sample, to avoid biasing estimations towards any specific algorithm. This means we can simulate not just cassette exons or binary events but also complex variations, which both MAJIQ and LeafCutter are geared to capture. The simulated data was generated by our colleague Greg Grant, who is a known leader in the field of evaluating methods for RNA-Seq analysis.
How do we evaluate DS methods using this data? Following a common criterion in the RNA splicing field, we used a change in inclusion level >= 20% (dPSI >= 0.2) to define a positive event and an inclusion change < 5% as a negative event. Importantly, we use each method’s “native language”, so each method’s set of events is evaluated by its own definition of what an event is. This is markedly different from the procedure used by Li et al, where they state that they “collapsed all events in rMATS and MAJIQ that shared a single splice site into a single event (as is done in LeafCutter)”. The consequence of this collapsing for PSI/dPSI and for the evaluation is not clear, as details are missing.
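As an illustration of this labeling scheme, here is a minimal sketch in Python (the table layout and column names are hypothetical placeholders, not the actual Norton et al 2017 pipeline, which is linked at the end of this post):

```python
import pandas as pd

# Hypothetical per-event table in a method's own "native language":
# the true (simulated) dPSI for each event and whether the method reported it as changing.
events = pd.DataFrame({
    "true_dpsi":  [0.35, 0.02, 0.12, 0.50, 0.01, 0.25],
    "called_sig": [True, False, True, True, False, False],
})

# Labeling: |dPSI| >= 0.2 is a positive, |dPSI| < 0.05 is a negative,
# and events in the 5-20% range are left out of the evaluation.
positives = events["true_dpsi"].abs() >= 0.20
negatives = events["true_dpsi"].abs() < 0.05

tp = (positives & events["called_sig"]).sum()   # true changes that were reported
fn = (positives & ~events["called_sig"]).sum()  # true changes that were missed
fp = (negatives & events["called_sig"]).sum()   # non-changes that were reported

sensitivity = tp / (tp + fn)
fdr = fp / (tp + fp) if (tp + fp) else 0.0
print(f"sensitivity = {sensitivity:.2f}, FDR = {fdr:.2f}")
```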
Let’s look at the results if we use the p-value based approach which the authors advocate and on which they base their subsequent analysis in both Li et al 2017 and Li et al 2016 (sQTL included):

The first 3 columns represent the number of changing (> 20%), not changing (< 5%), and in-between (5-20%) events, according to each method’s “language”. This is why they are so different and hard to compare directly. The stats on the right are computed with respect to those. As you can see, LeafCutter is much more permissive than MAJIQ: It reports almost all the true positive events (sensitivity of 93% vs. MAJIQ’s 83%), but this comes at a high price in false positives (FDR of 21% vs. MAJIQ’s 1%).
Now, it’s important to note we view the above evaluation as problematic: A p-value tests our confidence that there is a change, not the actual magnitude of that change. So what happens if we adjust LeafCutter to use the p-value for confidence but add a requirement that its dPSI estimates are >= 20%? Here are the results:

As you can see, the results now make much more sense: LeafCutter’s FDR drops to 2%, but its sensitivity drops to 79%, with MAJIQ still comparing favorably. We would like to point out that if you look at our recent Norton et al 2017, you will find we tried many different configurations for all the algorithms we used to make sure the evaluation is as fair as reasonably possible. We also tried a more permissive threshold of dPSI > 10% for positive events, with similar results.
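For completeness, the adjustment above is simply a conjunction of the two criteria. A minimal sketch (the column names are hypothetical placeholders, not LeafCutter’s actual output format):

```python
import pandas as pd

# Hypothetical per-event summary: an adjusted p-value and the estimated dPSI.
lc = pd.DataFrame({
    "p_adjust": [0.001, 0.030, 0.200, 0.0004],
    "dpsi":     [0.35,  0.08,  0.45,  0.22],
})

# The p-value-only criterion: significance alone.
sig_only = lc["p_adjust"] < 0.05

# Adjusted criterion: significance AND a substantial estimated effect size.
sig_and_effect = sig_only & (lc["dpsi"].abs() >= 0.20)

print(sig_only.sum(), "events pass p-value alone;",
      sig_and_effect.sum(), "events pass p-value + |dPSI| >= 20%")
```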
2.3 Reproducibility ratio (RR) plots:
RR plots follow the same kind of logic and a similar procedure to that of the irreproducible discovery rate (IDR), used extensively to evaluate ChIP-Seq peak callers. Briefly, we ask the following simple question: Given an algorithm A and a dataset D, if you rank all the events that algorithm A identifies as differentially spliced 1…N_A, how many would be reproduced if you repeat this with dataset D’, made of similar experiments using biological/technical replicates? The RR plot is the fraction of those events that are reproduced (y-axis) as a function of n <= N_A (x-axis). Here are the results from the tissue data we used for the Norton et al and Vaquero et al papers, comparing groups (3 vs 3) of cerebellum and liver samples:

We note that here we use the same p-value criterion used by the LeafCutter authors throughout their paper(s). However, when we tried a new criterion, where events were first screened for significance (p-value < 0.05) and then ranked by dPSI, LeafCutter’s RR improved significantly, from 62% to 77%, yet was still lower than MAJIQ’s 83%.
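For readers who want to generate this kind of plot themselves, here is a minimal sketch of the RR computation (event identifiers and the exact reproducibility rule are simplified; the full pipeline we used is linked at the end of this post):

```python
def rr_curve(ranked_d, reported_d_prime):
    """Reproducibility ratio curve.

    ranked_d:         event IDs reported on dataset D, ranked by the chosen
                      confidence criterion (e.g. p-value, or dPSI after a
                      significance screen).
    reported_d_prime: event IDs reported on the replicate dataset D'.
    Returns a list whose n-th entry is the fraction of the top-n events
    from D that are also reported in D'.
    """
    reproduced_in_prime = set(reported_d_prime)
    curve, hits = [], 0
    for n, event in enumerate(ranked_d, start=1):
        if event in reproduced_in_prime:
            hits += 1
        curve.append(hits / n)
    return curve

# Toy usage: 3 of the 4 top-ranked events in D are reproduced in D'.
print(rr_curve(["e1", "e2", "e3", "e4"], ["e2", "e1", "e4", "e9"]))
```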
2.4 Intra-to-Inter Ratio (IIR):
Reproducibility alone is not sufficient to establish accuracy. For example, an algorithm can be extremely reproducible but highly biased. To get a better sense of possible levels of false positives we devised the following test: We compare similarly sized groups of the same condition (e.g. brain vs. brain or liver vs. liver) and compute the ratio between the number of events identified as significantly changing in such a setting (N’_A) and the number of events identified between groups of different conditions (N_A, e.g. brain vs. liver). We term this the intra-to-inter ratio (IIR). This test is similar to the one used by Li et al to test for FP (Fig 3b); in their setting, they *mix* the two groups. We postulate that the IIR is better for testing the ratio between the natural variation within groups/conditions which an algorithm deems significant and the variation between conditions (i.e., the biological signal of interest).
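The IIR itself is just a ratio of event counts; a minimal sketch (the counts below are made up for illustration, not results from our analysis):

```python
def iir(n_intra, n_inter):
    """Intra-to-inter ratio: the number of events a method calls as changing
    between two groups of the SAME condition (n_intra, e.g. liver vs. liver)
    divided by the number it calls between DIFFERENT conditions
    (n_inter, e.g. cerebellum vs. liver). Lower is better."""
    return n_intra / n_inter if n_inter else float("nan")

# Toy usage with made-up counts:
print(f"IIR = {iir(n_intra=30, n_inter=1000):.3f}")   # 0.030
```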
Here are the IIR results we get for the tissue comparison data we used in Norton et al 2017:

Notice that there are two numbers for each method, as N_A is compared to the N’_A from within each group (liver and cerebellum). These results are in line with the synthetic data and point to significant levels of false positives by LeafCutter.
Two final points about IIR and FP: First, just as we did with the RR plots above, we also tried to improve LeafCutter’s results by using the more conservative test of p-value combined with dPSI > 20%. LeafCutter’s IIR improves significantly (0.048, 0.035), but MAJIQ still compares favorably. Second, we point readers to the small print on each panel in Fig 3b of Li et al: They refer to this test ratio as an estimate of FDR (whether this is a true measure of FDR or not is arguable given the points made above), and you can see that at both thresholds MAJIQ outperforms LeafCutter (245 vs 0 when using MAJIQ’s criterion of P(dPSI > 20%) > 95%; 766 vs 721 when they set P(dPSI > 20%) > 80%). Yet this point is not mentioned or discussed. Instead, the authors focus on claiming a non-calibrated p-value test, which is irrelevant for MAJIQ (see above).
2.5 RT-PCR:
While the above tests are important and informative, they are focused on dPSI and do not address the question of how accurately the actual PSI levels are quantified by each algorithm. This is particularly relevant for the kind of analysis Li et al focus on, i.e. sQTL, as this is done using PSI in each sample and not dPSI between groups. Moreover, they do not supply experimental validations or an independent measure of accuracy (see the point above about possible inherent biases). For these, RT-PCR experiments are considered the gold standard in the RNA field. To be sure, these too are not free of potential pitfalls and biases (see a discussion about this in Norton et al): many RT-PCR results available from publications are only usable qualitatively (i.e. “change” vs “no change”), so one must be careful in simply downloading and using them; their selection can be highly biased (e.g. only cassette exons); and these are low-throughput experiments. Nonetheless, including them for methods that make strong claims about splicing quantification is important if not crucial. Below is the correlation to RT-PCR for LeafCutter compared to our results in Norton et al for MAJIQ and rMATS. We note that all the data for these RT-PCR experiments were made available in Vaquero et al 2016, when MAJIQ was published.



The above figures show that both MAJIQ and rMATS outperform LeafCutter (R2 = 0.923 vs. 0.972, 0.967). Things become more interesting when you try to validate PSI (these are the PSI values for the two conditions underlying the dPSI above):




We note that LeafCutter’s PSI quantification is significantly more biased and less accurate (R2 = 0.806 vs. 0.87, 0.906 vs. 0.936). We suspect the reduced accuracy may have to do with how it defines splicing events. LeafCutter, similar to MAJIQ, relies on junction-spanning reads, but LeafCutter’s events are intron-centered and collapse all overlapping sets into an “intron cluster” for which the inclusion levels are normalized and quantified. While these clusters are a convenient mathematical construction for the purpose they were originally designed for (detecting splicing changes for sQTL, as in Li et al 2016, 2017), there is no real guarantee that in the underlying transcriptome they actually represent a set whose combined inclusion should sum to a constant (1, when normalized to compute PSI). In fact, even for a simple cassette exon LeafCutter does not report the exon inclusion but instead reports 3 PSI values (which sum to 1) for the 3 possible intronic regions (C1-A, A-C2, and C1-C2, where “A” is the alternative exon and C1, C2 are the upstream/downstream exons respectively). It took us time to figure this out and correct for it, and these were all simple cassette exons. For the more general case, we are not sure how LeafCutter’s output is to be translated to quantitative measures of inclusion that can be validated experimentally.
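To illustrate the kind of correction we had to apply, here is one plausible conversion for a simple cassette exon; this reflects our interpretation (averaging the two inclusion junctions is our assumption), not an official LeafCutter utility:

```python
def cassette_exon_psi(psi_c1_a, psi_a_c2, psi_c1_c2):
    """Convert cluster-normalized ratios for a cassette exon into a
    conventional exon-inclusion PSI.

    psi_c1_a, psi_a_c2: ratios of the two inclusion junctions (C1-A, A-C2).
    psi_c1_c2:          ratio of the skipping junction (C1-C2).
    The three values are assumed to sum to ~1 within the intron cluster.
    """
    inclusion = (psi_c1_a + psi_a_c2) / 2.0   # assumption: average the two inclusion junctions
    return inclusion / (inclusion + psi_c1_c2)

# Toy example: cluster ratios of 0.4, 0.4, 0.2 correspond to an exon-inclusion PSI of ~0.67.
print(f"PSI = {cassette_exon_psi(0.4, 0.4, 0.2):.2f}")
```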
These issues require further investigation, which is beyond the scope of this post. We note, though, that PSI quantification is highly relevant, as the sQTL pipelines employed in Li et al 2016, 2017 are all based on feeding PSI quantifications (and not dPSI) into standard QTL tools such as fastQTL.
3. Discussion
We presented here several issues we identified with LeafCutter’s comparison to other methods, which we believe lead to a severe misrepresentation of relative performance. We then followed with suggested alternative metrics and evaluation procedures, including suggestions for how LeafCutter’s performance could be improved compared to the procedures used by the authors.
Another important point of comparison glossed over in the Li et al paper is that LeafCutter does not offer detection of alternative first/last exons, or intron retention. This is in sharp contrast to MAJIQ, which offers detection of both known and de-novo intron retention. De-novo intron retention carries a *huge* computational overhead which the other methods (LeafCutter, rMATS) do not pay. This distinction is almost absent in Li et al (p. 27, top of the supplementary). Consequently, we did not include intron retention in the evaluations above. We note that besides the fact that these events are not included in the output for users, not taking them into account may have considerable effects on splicing variation detection and quantification, given how commonly they occur in human (Braunschweig et al, GR 2014).
Several other points are worth making. First, our analysis supports Li et al’s assertion that LeafCutter can perform well (especially if the adaptations we suggested are used) and can be used to derive new insights into splicing and sQTL. We reiterate that this post completely ignores the novel and significant contribution of Li et al of extensive splicing and sQTL analysis in GTEx. We congratulate the authors for this achievement. We also point to the extensive effort made by the authors to show concordance between methods and even the effect of mappers (STAR vs. OLEGO), as detailed in the long supplementary notes.
We believe that for DS, and for many similar problems in genomics and ML for that matter, there is no “silver bullet”, even though we always want to have one (and preferably a really fast one, for that matter…). As we discuss in Norton et al, different algorithms may have different advantages depending on the needs and the settings. For example, Li et al did not review SUPPA, which is far faster than all the other methods, allowing users to quickly build a large-scale genomic landscape of splicing, even though its quantification may be less accurate (see comparative analysis in Norton et al). Similarly, VAST-TOOLS offers a dedicated pipeline sensitive to detecting micro-exons. As for future methods development, we have made all the data and pipelines we used to derive the results presented here available at https://majiq.biociphers.org/norton_et_al_2017/.
Beyond the technical points raised about DS and evaluation methods, there are some important broader discussions that arise. For one, who is responsible for running updated versions of alternative software? The authors? The reviewers? We think both. Second, how should a discussion around preprints occur? Posts? Tweets? Matching preprints? Direct contact by phone/email? We thought a lot about this. Previously, when we observed other clear inaccuracies, we contacted the authors of that preprint directly with a detailed list. Here, given the fact that we do not know the authors as well, the nature of the issues, and the long time that has passed (the preprint was posted in September and we only noticed it last week), we opted for posting a detailed response and hoping for a constructive discussion. We also informed the authors immediately. But more generally, what should the new standard for such cases be? What should the common courtesy be?
Finally, we want to raise the issue of implications. A bioRxiv preprint is exactly that: a public statement of in-progress, non-peer-reviewed work. It is similar to a talk, but more concrete and detailed. It can be cited, acknowledged, and inform the greater scientific community. But the reality is that it can also have severe ramifications. People read bioRxiv papers and their opinions are shaped by what they read. For example, others in the field could get the inaccurate impression that rMATS and MAJIQ are flawed. They might then end up as your reviewers (OK, you get where this is going). We hope that by having constructive discussions around preprints, as we aim to achieve here, both our science and our scientific practices will improve as a community.