LeafCutter vs. MAJIQ and comparing differential splicing algorithms

Last week, my lab found out that the LeafCutter BioRxiv paper was updated to include extensive comparisons with other methods, including our own method, MAJIQ, for detecting and quantifying differential splicing.
UPDATE: We just found out the Li et al bioRxiv preprint was released as a technical report today, Dec 11th 2017, in Nature Genetics.

Before I delve into the details, some background for those who are not necessarily into differential RNA splicing: We published MAJIQ in eLife on Feb 1st, 2016. We constantly maintain the software and a lively user group, frequently pushing updates and bug fixes. The original LeafCutter BioRxiv paper was posted on March 16th 2016 and was used to derive the sQTL results in the Yang et al Science paper, which came out on Apr 29th 2016 and was titled “RNA splicing is a primary link between genetic variation and disease”. This BioRxiv paper cited MAJIQ extensively, but did not include any comparisons to it, or to any other method, nor did it include any quantitative assessment of the method or independent experimental validations. The updated preprint, which was posted in Sep 2017, is a very different paper, though the basic algorithm seems to have stayed the same. Besides the addition of extensive evaluation of splicing and sQTL in GTEx samples (which is very interesting and novel, but is not the focus of this post), the new BioRxiv version included an extensive comparison to MAJIQ, rMATS, and Cufflinks2. Opening that new bioRxiv with nervous excitement and going through the evaluations, figures, and claims, I quickly got the following feeling: [GIF: InTheRain] (GIF shamelessly stolen from the wonderful r.i.p.)

To be fair, comparing to other algorithms is hard. Furthermore, authors always want to highlight the relative advantage of their method, and in many cases the other methods end up being run with default params while authors spend many days tweaking their own method to prove improvement. This results in an inflation of “best ever” methods and the known ML rule: Your safest bet is the method that robustly came second in others’ papers. In this case, comparison is even harder: As Li et al note in LeafCutter’s bioRxiv preprint, the compared algorithms “speak different languages” in terms of what they define as transcript variations. Nonetheless, after a careful reading of the updated preprint and extensive reanalysis, we wish to respond to several specific points. The rest of the post is organized as follows: (a) specific critiques of the evaluations performed in the updated preprint, (b) alternative approaches for evaluation of DS, with the matching results based on our previous work in Vaquero et al 2016 and Norton et al 2017 (which was posted on bioRxiv in Jan 2017 and updated in May 2017), followed by (c) a discussion.

We hope this post and subsequent discussion will help not only improve our respective works but also other future works in the field and more generally how we as a community handle some of the issues raised here.

1. Main issues we identified with the evaluation in Li et al 2017:

The comparison of Li et al to other methods in the main text is all concentrated in Fig3a-c. There are additional comparisons in the supplement, but many of those share issues related to Fig3a-c. We only refer here to rMATS and MAJIQ and not cuffdiff2, the latter being less relevant to the points made here. Let’s go through the main points we observed:

  1. What do you compare against?
    The versions of the software used are not mentioned anywhere, but are clearly outdated. The bioRxiv revision was posted on Sep 7th 2017, but rMATS already had its rMATS-turbo version released on May 1st 2017, and MAJIQ had a major upgrade to version 1.0.0 on May 10th. Both were designed to handle large datasets with significant improvement in running time and memory footprint. Yet Li et al claim that those methods “do not scale well over 30 samples”, taking dozens of GB of memory, and are excruciatingly slow (Fig3a, FigS9).
  2. How do you evaluate performance?
    The authors offer two measures of performance:

    1. P-value accuracy (Fig3b): To evaluate p-value accuracy and the control for FP, the authors compare the distribution of p-values observed in a tissue1 vs tissue2 comparison with the distribution observed when the 2 tissues have been mixed and split into 2 random groups (i.e., a contrast with no expected differences). There is one major problem with this comparison however: MAJIQ does not produce a p-value. It does not use a null model but a Bayesian posterior for P(dPSI > C), with the default set to P(dPSI > 20%) > 0.95. The authors admit that, yet plot 1-P(dPSI > C) against the p-value line, just like all the other methods, and state that it’s not “well calibrated”. Many readers looking at these graphs are likely to quickly conclude “Oh, that MAJIQ method is *terrible*” while in fact they should be going: “??”. The definition of P(dPSI>C) is clearly not of a p-value, and we hold its usage as such is simply wrong. The shape of the graph simply reflects that rather than a problem in calibration as suggested by the authors. We will get back to what this plot really means for MAJIQ later in the post.
    2. Evaluation of synthetic data: Here the authors use synthetic data to measure accuracy in classifying changing vs. non changing events using ROC curves (Fig 3c): This is a more delicate point as results will depend greatly on (1) how the synthetic data are defined and (2) what you define as significant changes. We will ignore (1) for now and focus on the latter as this is crucial to understand the flaw in this comparison. Li et al defined changes by altering a specific isoform expression by 10%, 30% etc. They measure how many of those changes are then detected by each method. The problem with this is simple: A change of 10% in the expression of an isoform does not translate to a specific level of change in *splicing*. Remember: We are not talking about methods to quantify expression, but changes in splicing. Methods such as rMATS and MAJIQ measure splicing changes as the difference in ratios of inclusion of RNA segments. In the best case scenario an isoform whose expression is changed is solely responsible for a specific junction inclusion. So, for a 10% change of its expression, if the original junction inclusion was at X/Y = Z%, it will now be included at 1.1X/(Y+0.1X) = Z’%. As you can see, dPSI = Z’-Z really depends on X vs. Y. For example, if X=1 and Y=9 (which means the changed isoform is expressed at a reasonable ~10% of the gene’s level) then dPSI is less than 1%! Remember: This is the actual splicing change, and now MAJIQ is run on this with default parameters, which means it is looking for high confidence events of dPSI > 20%. Again, look at those ROC graphs in Fig3c and your spidey senses should be tingling: Notice how they completely plateau very quickly for MAJIQ? You are basically asking it to find changes in splicing that are bigger than 20% while defining as positive events changes in *expression* by 10%, 25%, 50%, or 300%.
Indeed, when you require 300%, an isoform expressed at 10% of the gene could achieve (depending on the splice graph specifics) a dPSI >= 20%, and in that case MAJIQ with its default params does just as well as LeafCutter. And what happens if you actually evaluate it on changes in splicing? The answer is given below. Finally, we note that Li et al seem to be aware of at least some of these issues (p.28 top paragraph) regarding expression changes, but unfortunately, these observations do not propagate to the overall analysis.
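To make the arithmetic above concrete, here is a minimal sketch in Python. It assumes the best case scenario described above (the changed isoform is the only one using the junction), reads X as that isoform's expression and Y as the denominator of the inclusion ratio before the change, and reads "a change of 300%" as an increase of 3 times X. The function name and these readings are ours for illustration, not anything taken from Li et al.

```python
def dpsi_after_expression_change(x, y, increase):
    """Change in junction inclusion (dPSI) when one isoform's expression grows.

    x        -- expression of the isoform that carries the junction (assumed)
    y        -- denominator of the inclusion ratio before the change (assumed)
    increase -- fractional increase in x, e.g. 0.10 for +10%, 3.0 for +300%
    """
    psi_before = x / y
    # The isoform grows by increase * x, and so does the denominator.
    psi_after = (1 + increase) * x / (y + increase * x)
    return psi_after - psi_before


if __name__ == "__main__":
    # Expression changes of the kind used as "positives" in the simulation.
    for increase in (0.10, 0.25, 0.50, 3.00):
        dpsi = dpsi_after_expression_change(1, 9, increase)
        print(f"expression +{increase:.0%} -> dPSI = {dpsi:.1%}")
```

With X=1 and Y=9, only the largest expression change crosses the 20% dPSI threshold that MAJIQ's defaults look for; the smaller ones stay far below it.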

To summarize, the comparisons in the recent Li et al preprint use software that was already highly outdated at the time of posting, and metrics that are not compatible with algorithm output/definitions. The combination of those creates what we assert to be a highly misleading picture about algorithm performance. Furthermore, we assert these comparisons fail to actually evaluate performance on changes in *splicing* and lack experimental validation.

2. Our comparative evaluation of LeafCutter and MAJIQ.

2.1 Running time/memory:

Here is an evaluation we ran with current versions, which should be similar or identical to the ones available then except for various (minor) bug fixes. Time is in minutes, run with 16 threads on a Linux machine.

[Figure: LeafCutter run time comparison]

While we don’t have the exact memory usage, it is a few hundred MB per thread for MAJIQ and rMATS (see also Norton et al 2017). Two points are worth making here: First, how do you actually compare execution time? For these algorithms it is more complicated than just the numbers above, a point we raise and discuss in Norton et al. Briefly, the algorithms operate differently. For example, if you want to do multiple comparisons of groups of 3 with the 20 samples of the 10vs10 above, MAJIQ will run the builder *once*, then the quantifier multiple times for 3vs3. In rMATS, each run is independent. The second point is that execution time is tied to what you quantify. In LeafCutter’s case that does not include intron retention events (see more on this below), which makes a huge difference.

In summary, rMATS is faster than LeafCutter, and both MAJIQ and rMATS are completely capable of handling large datasets, contrary to the picture presented in Li et al 2017. Again, the fact that no software versions or execution params are documented makes the analysis in Li et al 2017 both misleading and harder to reproduce.

2.2 Synthetic data:

Synthetic datasets are useful for evaluating algorithms under varying controlled conditions, when you know what the “truth” is. The downside is that they are, well, synthetic. Specifically, for the problem of DS they commonly involve strong simplifying assumptions about expression levels, distribution of variation in and between the groups compared, read errors, coverage levels, underlying known transcriptome, and variants that involve only predefined “classical” AS event types such as cassette exons.

In order to create more realistic synthetic data for DS analysis, we employed real data from biological replicates. If you are interested in how the data was generated we encourage you to read Norton et al 2017, but for the discussion here I would only point out the following key points: First, expression levels and transcriptome variations are not based on a synthetic model but on real data/samples. Second, we use the most complex splicing event we can find based on reliably detected (multi reads in multi positions) junction spanning reads from STAR. This enables us to define a lower bound on each gene’s transcriptome complexity. The expression level of each isoform to be simulated is then set by the gene’s overall FPKM and the raw junction spanning read ratios in the real sample to avoid biasing estimations towards any specific algorithm. This means we can simulate not just cassette exons or binary events but also complex variations, which both MAJIQ and LeafCutter are geared to capture. Simulated data was generated by our colleague Greg Grant, who is a known leader in the field of evaluating methods for RNA-Seq analysis.

How do we evaluate DS methods using this data? Following a common criterion in the RNA splicing field we used a change in inclusion level >= 20% (dPSI >= 0.2) to define a positive event and an inclusion change < 5% as a negative event. Importantly, we use each method’s “native language” so each method’s set of events is evaluated by its definition of what an event is. This is markedly different from the procedure used by Li et al where they state that they “collapsed all events in rMATS and MAJIQ that shared a single splice site into a single event (as is done in LeafCutter)”. The consequence of such collapsing for PSI/dPSI and for the evaluation is not clear, as details are missing.

Let’s look at the results if we use the p-value based approach which the authors advocate and on which they base their subsequent analysis in both Li et al 2017 and Li et al 2016 (sQTL included):

The first 3 columns represent the number of changing (>20%), not changing (< 5%), and in between (5-20%) events, according to each method’s “language”. This is why they are so different and hard to compare directly. The stats on the right are computed with respect to those. As you can see, LeafCutter is much more permissive than MAJIQ: It reports almost all the TP events (Sens of 93% vs. MAJIQ’s 83%) but this comes with a high price in FP (FDR of 21% vs. MAJIQ’s 1%).
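For readers who want to see how such stats follow from the thresholds, here is a minimal Python sketch. It uses the 20% and 5% cutoffs stated above; ignoring the in between events when computing sensitivity and FDR is our simplifying assumption for this illustration, and the event list is made up rather than taken from the simulation.

```python
def label_truth(true_dpsi, pos=0.20, neg=0.05):
    """Assign a ground-truth label from the simulated dPSI of an event."""
    magnitude = abs(true_dpsi)
    if magnitude >= pos:
        return "changing"
    if magnitude < neg:
        return "not_changing"
    return "in_between"


def sensitivity_and_fdr(events):
    """events: iterable of (true_dpsi, called) pairs for ONE method.

    `called` is True when the method reported the event as changing.
    In-between events are skipped (an assumption of this sketch).
    """
    tp = fp = fn = 0
    for true_dpsi, called in events:
        label = label_truth(true_dpsi)
        if label == "changing":
            if called:
                tp += 1
            else:
                fn += 1
        elif label == "not_changing" and called:
            fp += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    fdr = fp / (tp + fp) if (tp + fp) else float("nan")
    return sensitivity, fdr


if __name__ == "__main__":
    # Made-up toy events: (true dPSI, did the method call it?)
    toy_events = [(0.35, True), (0.25, True), (0.22, False),
                  (0.01, True), (0.02, False), (0.10, True)]
    sens, fdr = sensitivity_and_fdr(toy_events)
    print(f"sensitivity = {sens:.2f}, FDR = {fdr:.2f}")
```

Because each method defines events in its own “language”, a function like this would be run separately on each method’s own event list.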

Now, it’s important to note we view the above evaluation as problematic: A p-value is used to test our belief in a change, not for the actual magnitude of it. So what happens if we adjust LeafCutter to use p-value for confidence but add a requirement that their dPSI estimates are >= 20%? Here are the results:

As you can see, now the results make much more sense: LeafCutter’s FDR drops to 2%, but sensitivity drops to 79%, with MAJIQ still comparing favorably. We would like to point out that if you look at our recent Norton et al 2017 you will find we tried many different configurations for all the algorithms we used to make sure the evaluation is as fair as reasonably possible. We also tried a more permissive threshold of dPSI >10% for positive events, with similar results.

2.3 Reproducibility plots (RR):

RR plots follow the same kind of logic and similar procedure to that of irreproducible discovery rate (IDR), used extensively to evaluate ChIP-Seq peak callers. Briefly, we ask the following simple question: Given an algorithm A and a dataset D, if you rank all the events that algorithm A identifies as differentially spliced 1…N_A, how many would be reproduced if you repeat this with dataset D’, made of similar experiments using biological/technical replicates? The RR plot is the fraction of those events that are reproduced (y axis) as a function of n<=N_A (x-axis). Here are the results from tissue data we used for the Norton et al and Vaquero et al papers, comparing groups (3vs3) of Cerebellum and Liver samples:

We note that here we use the same p-value criteria used by the LeafCutter authors throughout their paper(s). However, when we tried a new criterion, where events were instead screened for significance (p-value < 0.05) but then ranked by dPSI, LeafCutter’s RR improved significantly from 62% to 77%, yet was still lower than MAJIQ’s 83%.
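The RR curve described above can be sketched in a few lines of Python. This is an illustration only: it assumes an event counts as “reproduced” if it appears anywhere in the list the same algorithm reports on D’, which is a simplification of ours (a stricter variant would require it to be among the top n of D’ as well). The event identifiers are made up.

```python
def rr_curve(ranked_events, replicate_events):
    """Fraction of the top-n events from dataset D that reappear in D'.

    ranked_events    -- events reported on D, best-ranked first
    replicate_events -- events the same algorithm reports on D'
    Returns a list whose (n-1)-th entry is the reproduced fraction at n.
    """
    replicate = set(replicate_events)
    curve = []
    hits = 0
    for n, event in enumerate(ranked_events, start=1):
        if event in replicate:
            hits += 1
        curve.append(hits / n)
    return curve


if __name__ == "__main__":
    # Hypothetical event identifiers, for illustration only.
    ranked_on_d = ["geneA:e1", "geneB:e4", "geneC:e2", "geneD:e7"]
    found_on_d_prime = ["geneA:e1", "geneC:e2", "geneZ:e9"]
    for n, frac in enumerate(rr_curve(ranked_on_d, found_on_d_prime), start=1):
        print(n, round(frac, 2))
```

The ranking passed in is what changes between the two criteria discussed above: by p-value in one case, by dPSI among significant events in the other.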

2.4 Intra to Inter Ratio (IIR):

Reproducibility alone is not sufficient to establish accuracy. For example, an algorithm can be extremely reproducible but highly biased. To get a better sense of possible levels of false positives we devised the following test: We compare similarly sized groups of the same condition (e.g. brain vs. brain or liver vs. liver) and compute the ratio between the number of events identified as significantly changing in such a setting (N’_A) and the number of events identified between the groups of different conditions (N_A, e.g. brain vs. liver). We term this the intra to inter ratio (IIR). This test is similar to the one used by Li et al to test for FP (Fig3b). In their setting, they *mix* the two groups. We postulate that the IIR is better for testing the ratio between natural variations within groups/conditions which an algorithm deems as significant and variations between conditions (i.e. the biological signal of interest).

Here are the IIR results we get for the tissue comparison data we used in Norton et al 2017:

Notice there are two numbers for each method as the N_A is compared to the N’_A from within each group (liver and cerebellum). These results are in line with the synthetic data and point to significant levels of FP by LeafCutter.
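As a small illustration of why each method gets two IIR numbers, here is a Python sketch of the ratio N’_A / N_A computed once per within-condition comparison. The counts are invented for the example and are not results from any of the methods discussed.

```python
def iir(n_within, n_between):
    """Intra to inter ratio: events called within one condition (N'_A)
    divided by events called between the two conditions (N_A)."""
    if n_between == 0:
        raise ValueError("no events called between conditions; IIR undefined")
    return n_within / n_between


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    n_between = 400                                  # e.g. liver vs. cerebellum
    n_within = {"liver": 12, "cerebellum": 8}        # same-condition comparisons
    for tissue, count in n_within.items():
        print(tissue, iir(count, n_between))
```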

Two final points about IIR and FP: First, just as we did with the RR plots above, we also tried to improve LeafCutter’s results by applying a more conservative test that combines the p-value with dPSI > 20%. LeafCutter’s IIR improves significantly (0.048, 0.035) but MAJIQ still compares favorably. Second, we point the readers to the small print on each panel in Fig3b of Li et al: They refer to this test ratio as an estimate of FDR (whether this is a true measure of FDR or not is arguable given the points made above), and you can see that at both thresholds MAJIQ outperforms LeafCutter (245 vs 0 when using MAJIQ’s criteria of P(dPSI > 20%) > 95%; 766 vs 721 when they set P(dPSI > 20%) > 80%). Yet this point is not mentioned or discussed. Instead the authors focus on claiming a non calibrated p-value test, which is irrelevant for MAJIQ (see above).

2.5 RT-PCR:

While the above tests are important and informative, these are focused on dPSI and do not address the question of how accurately the actual PSI levels are quantified by the algorithm. This is particularly relevant for the kind of analysis Li et al focus on, i.e. sQTL, as this is done using PSI in each sample and not dPSI between groups. Moreover, they do not supply experimental validations or independent measure of accuracy (see point above about possible inherent biases). For these, RT-PCR experiments are considered the gold standard in the RNA field. To be sure, these too are not free of potential pitfalls and biases (see a discussion about this in Norton et al): many available RT-PCR results from publications are only usable qualitatively (i.e. “change” vs “no change”), so one must be careful in simply downloading and using these; their selection can be highly biased (e.g. only cassette exons); and these are low throughput experiments. Nonetheless, including them for methods that make strong claims about splicing quantification is important if not crucial. This is the correlation to RT-PCR by LeafCutter compared to our results in Norton et al for MAJIQ and rMATS. We note that all the data for these RT-PCR experiments has been available since Vaquero et al 2016, when MAJIQ was published.

The above figures show both MAJIQ and rMATS outperform LeafCutter (R2 = 0.923 vs. 0.972, 0.967). Things become more interesting when you try to validate PSI (these are PSI for the two conditions underlying the dPSI above):

We note that LeafCutter’s PSI quantification is significantly more biased and less accurate (R2 = 0.806 vs. 0.87, 0.906 vs 0.936). We suspect the reduced accuracy may have to do with how it defines splicing events. LeafCutter, similar to MAJIQ, relies on junction spanning reads but LeafCutter’s events are intron centered and collapse all overlapping sets to an “intron cluster” for which the inclusion levels are normalized and quantified. While these clusters are a convenient mathematical construction for the purpose they were originally designed for (detecting splicing changes for sQTL, as in Li et al 2016, 2017), there is no real guarantee that in the underlying transcriptome these actually represent a set whose combined inclusion is supposed to sum to a constant (1, when normalized to compute PSI). In fact, even for a simple cassette exon LeafCutter does not report the exon inclusion but reports 3 PSI values (which sum to 1) for the 3 possible intronic regions (C1-A, A-C2, and C1-C2, where “A” is the alternative exon and C1, C2 are the upstream/downstream exons respectively). It took us time to figure this out and correct for this, and these were all simple cassette exons. For the more general case, we are not sure how LeafCutter’s output is to be translated to quantitative measures of inclusion that can be validated experimentally.
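To illustrate the mismatch for a simple cassette exon, here is a minimal Python sketch contrasting the two views. The first function just normalizes raw junction counts within the cluster so they sum to 1, which is a simplification of ours and not LeafCutter’s actual statistical model. The second uses a common junction-based convention for exon inclusion (average of the two inclusion junctions over that average plus the skipping junction); it is not necessarily the correction we applied for the RT-PCR comparison. The read counts are made up.

```python
def intron_cluster_ratios(junction_counts):
    """Normalize junction read counts within one intron cluster to sum to 1.

    A simplified stand-in for intron-centered quantification.
    """
    total = sum(junction_counts.values())
    if total == 0:
        raise ValueError("cluster has no reads")
    return {name: count / total for name, count in junction_counts.items()}


def cassette_exon_psi(c1_a, a_c2, c1_c2):
    """Exon inclusion from junction reads, using a common convention:
    mean of the two inclusion junctions vs. the skipping junction."""
    inclusion = (c1_a + a_c2) / 2
    if inclusion + c1_c2 == 0:
        raise ValueError("event has no reads")
    return inclusion / (inclusion + c1_c2)


if __name__ == "__main__":
    # Made-up read counts for one cassette exon (C1, A, C2).
    counts = {"C1-A": 50, "A-C2": 50, "C1-C2": 50}
    print(intron_cluster_ratios(counts))
    print(cassette_exon_psi(counts["C1-A"], counts["A-C2"], counts["C1-C2"]))
```

With equal reads on all three junctions, each intron in the cluster gets one third, while the exon inclusion level that an RT-PCR experiment would measure is one half, so none of the three reported values can be compared to the gel directly.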

These issues require further investigation, which is beyond the scope of this post. We note though that PSI quantification is highly relevant as the sQTL pipelines employed in Li et al 2016, 2017 are all based on feeding PSI quantification (and not dPSI) into standard QTL tools such as fastQTL.

3. Discussion

We presented here several issues we identified with LeafCutter’s comparison to other methods, which we believe lead to severe misrepresentation of relative performance. We then followed with suggested alternative metrics and evaluation procedures, including suggestions for how LeafCutter’s performance could be improved compared to the procedures used by the authors.

Another important point of comparison glossed over in the Li et al paper is that LeafCutter does not offer detection of alternative first/last exons, or intron retention. This is in sharp contrast to MAJIQ which offers both known and de-novo intron retention. De-novo intron retention has a *huge* computational overhead which the other methods (LeafCutter, rMATS) do not pay. This distinction is almost absent in Li et al (p. 27 top in the supplementary). Consequently, we did not include intron retention in the evaluations above. We note that besides the fact these are not included in the output for the users, not taking those into account may have considerable effects on splicing variation detection and quantification given how commonly these occur in human (Braunschweig et al, GR 2014).

Several other points are worth making. First, our analysis supports Li et al’s assertion that LeafCutter can perform well (especially if the adaptations we suggested are used), and can be used to derive new insights into splicing and sQTL. We reiterate that this post completely ignores the novel and significant contribution of Li et al of extensive splicing and sQTL analysis in GTEx. We congratulate the authors for this achievement. We also point to the extensive effort made by the authors to show concordance between methods and even the effect of mappers (STAR vs. OLEGO), as detailed in the long supplementary notes.

We believe that for DS, and for many similar problems in Genomics and ML for that matter, there is no “silver bullet”, even though we always want to have one (and preferably, a really fast one for that matter…). As we discuss in Norton et al, different algorithms may have different advantages depending on the needs and the settings. For example, Li et al did not review SUPPA, which is far faster than all the other methods, allowing users to quickly build a large scale genomic landscape of splicing even though quantification may be less accurate (see comparative analysis in Norton et al). Similarly, VAST-TOOLS offers a dedicated pipeline sensitive to detecting micro-exons. As for future methods development, we have made all the data and pipelines we used to derive the results presented here available at https://majiq.biociphers.org/norton_et_al_2017/.

Beyond the technical points raised about DS and evaluation methods etc. there are some important discussions that arise. For one, who is responsible for running updated versions of alternative software? The authors? The reviewers? We think both. Second, how should a discussion around archive papers occur? Posts? Tweets? Matching archive papers? Direct contact by phone/email? We thought a lot about this. Previously, when we observed other clear inaccuracies we contacted the authors of that archive preprint directly with a detailed list. Here, given the fact we do not know the authors as well, the nature of the issues, and the long time that has passed (preprint was posted in Sep, we only noticed it last week), we opted for posting a detailed response and hope for a constructive discussion. We also informed the authors immediately. But more generally, what should be the new standard for such cases? The common courtesy?

Finally, we want to raise the issue of implications. A bioRxiv preprint is exactly that – a public statement of in progress, non-peer reviewed, work. Similar to a talk, but more concrete and detailed. It can be cited, acknowledged, and inform the greater scientific community. But the reality is that it can also have severe ramifications. People read BioRxiv papers and their opinions are shaped by what they read. For example, others in the field could get the inaccurate impression that rMATS and MAJIQ are flawed. They might then end up as your reviewers (OK, you get where this is going). We hope that by having constructive discussions around preprints, as we aim to achieve here, both our science and our scientific practices improve as a community.

Should I (Bio)Rxiv?

“So what is your view on BioRxiv?” Donny Licatalosi asked me last week on one of those long nights over beers at the CSH RNA Processing meeting (which, btw, is a *terrific* meeting I highly recommend). Donny is one of the bright new PIs in the RNA field (see here), and I believe his question represents many other young PIs in a similar stage, trying to decide whether to jump on the archive wagon. After laying out my views on Rxiv pros and cons (with the help of more beer, of course…) we agreed I should probably do what I said I’d do a long time ago, and actually write about this in my blog, so here goes….

For the record, I should state that the last four papers from my lab have all been posted on BioRxiv. This already tells you I am generally in favor so I’ll start with the benefits, at the risk of stating the obvious:

  1. There are ideological points behind it which are worth supporting:
    1. Make science progress faster by increasing communication.
    2. Open access science. Give everyone equal access to cutting-edge science, whether you’re a curious student in a rural countryside or you’re a tenure-track professor in a large research institution.
    3. Help change the current landscape of publishing. This system is bound to change, and BioRXiv can be seen as a catalyst of this change. Think of taxis: We needed those to get around because cars were expensive, unreliable, and required expert knowledge (driving and getting around). Now we have cheap, reliable cars with GPS, cars/driving are common, and we are all connected via smartphones – so now we have Uber/Lyft. The publishing landscape is just as archaic, with us scientists doing all the editorial work for free, feeding private companies billions of dollars of taxpayer money to do jobs which are for the most part no longer needed (publishing/distribution).
  2. There are practical reasons/cases where it can benefit you:
    1. Spread the word about your advances/get faster recognition, i.e. the personal version of points 1.1/2. This may be particularly useful when you are a young PI, starting your lab takes a long time, and you carry little weight with editors. This is, of course, a hypothetical, totally fictional situation. Any resemblance to real life characters is totally coincidental.
    2. Get a citable point of reference in a grant/related paper. While reviewers of your grant can ignore it they may also have read it or at least acknowledge it as proof-of-concept/progress (sure beats writing “manuscript in preparation”…). It may also help/not hurt when you are submitting your tenure package (another totally fictional scenario).
    3. Can help lay claim to new results/avoid scooping/possibly IP problems – this may be particularly relevant to situations where you know someone else is trying to beat you to it. By posting it on BioRxiv you clearly put it in the public domain (not patentable by others).
    4. You could get valuable feedback from the community (I think this is still in its infancy, but there is potential)
    5. You could possibly get some points for being a good citizen, and really why not? Which brings me to the next part….

If you ask my milieu of young PIs such as the color Caseys (the Brown and the Greene) the answer would be a resounding “you should do it!”. It’s definitely a common practice in the Genomics and Machine Learning fields – two communities I belong to. But it’s not a common practice in Biology such as in the RNA field which I’m also a (proud) member of (with notable exceptions such as Brent Graveley, also being very “Genomics”). So, for me, the more interesting part is maybe the arguments against preprints:

  1. The big supporters of archive papers like to raise claims in the spirit of: “The only difference between a published and archive paper is that 3 more people read it”. I’m sorry, but that’s just not true. In a properly functioning system (I know, I know…) these 3 are not random people but independent experts in your field. That’s huge. And because you know that, you put much more effort into it. You are also held accountable for the content (in a proper system, I know, I know…). So yes, the current peer review system is problematic and definitely requires serious amendments, but I don’t like the claim it’s just the same. We are getting to a point where some even make use of this claim. One researcher told me his postdoc did not bother to publish a method paper and moved on to other projects because the method was already used for X (X being the important project they needed it for), and “people are using it already, so why bother”. Well, for one, if the method was not validated/tested in the original archive paper, there is no way to enforce it (or evaluate the consequent results/claims). I am definitely sympathetic to constraints such as the timing of projects etc. but the result of this approach, even when not intentional, is problematic for our field. Which leads me to the next point.
  2. Researchers, whether intentionally or not, sometimes abuse the archive system. I have seen multiple cases where a paper is submitted/published completely relying on an unpublished method only available as an archive paper. In theory, this is still legit if reviewers get all the info needed. Indeed, sometimes it is of little relevance – e.g. you use a method to find something and then you validate it experimentally. In such cases, the original detection method is almost irrelevant as the result holds. But in some cases, main results rely on a separate preprint, which can be problematic. First, this puts an unfair load on the reviewers who, in order to do a proper job, are now required to review two papers (and that second one may not even fit their expertise). Second, the entire premise of the paper being reviewed may be wrong: The archive methods paper may lack validations, proper evaluation of FDR etc. because after all, it’s only an archive paper (see point 1 above….). And editors are to blame for this as well because they allow this to happen and play along. Again, I totally understand that project/paper timing is an issue, but things need to be done properly in order for us to trust papers, and this new practice is not making things better. Finally, another version of abuse is a form of “salami publishing,” where a whole salami (scientific story) posted on BioRxiv is sliced into pieces that the authors try to publish separately even though these are highly dependent, without proper acknowledgment.
  3. While the above two points are basically me ranting about possible preprint abuse, I think the following is a more interesting point to consider personally: Biomedical research papers are inherently different from those of CS/Math. In CS if you have a good idea you write it down and can put it out there quickly via arXiv, soon to be followed by a matching (quick, short) conference paper. In biology, time constants are usually larger and the paper can change dramatically between the first submission and the final version as the story is driven first and foremost by the results (as opposed to by the math/model in typical Math/CS papers). A senior PI told me recently she does not like the archive craze because she does not want to be known by those initial versions. Think: How many people who read your initial paper bother to read the final one when it comes out??
  4. For a Bioinformatics method developer, the combination of making things publicly available as quickly as possible yet still wanting a traditional publication (for grants, tenure etc.) may be lethal: You post your paper + method, but by the time your paper is actually reviewed someone else already shows their method is much better. Now go publish that as a significant contribution…..
  5. You might get scooped: I wrote this as the last point because I think this may be an overrated concern for biologists against using bioRxiv, but it still exists. In general, I think this applies if you have something very cool in a hot topic where simply posting the finding may lead specific people whom you do not trust to quickly replicate & submit while your own paper is being delayed (intentionally or not). Notice there are a lot of conditions in the previous sentence, and you need to think about (a) whether they hold and (b) whether these people would do that given that the rest of the world and their lab (may) know about your archive paper. Still, this is a clear case where biology and CS differ, as the former is (a) much more discovery/finding oriented and has generally longer and more variable (b) development time and (c) review time.

The mistrust in archive papers has already garnered some attention in blog posts and twitter feeds – see for example discussions here and here, the latter leading to the provocative title to Boycott BioRxiv (to be clear if you actually read it, it does not really call for it). Suffice to say there is clear evidence supporting a healthy dose of mistrust, regardless of who the author is and where it finally ends up. So one clear take home message is that we should teach ourselves and our students exactly that.

But beyond mistrust, I think there is something to be said about well-thought reviews and paper write ups, taking the time for your work to mature. I definitely see the great value of archive papers and I used it myself (see opening paragraph) but somehow it also represents for me something of the current times, where people are in constant search for instant gratification as the number one priority/value. Or maybe I’m just getting old. In any case, I think that at least considering the perils listed above and avoiding the abuse of the archive system is worthwhile for all of us as we try to advance Science and make our environment a better place.

I’d like to finish with two points. First, I think that preprints are part of a more general question about the future of Scientific publishing which I hope to cover a bit in my next blog post.

Second, there are also interesting questions regarding how preprints should be treated. For example, there was a lot of discussion about whether these should be included in NIH grant applications and how (e.g. here). Also, how should you as an author/reviewer/editor treat those? Are you expected to know about those? Compare/cite those? What should an editor do if she knows about a similar paper already on BioRxiv? In NIPS, for example, we had to declare as reviewers whether we already saw the paper on arXiv (NIPS is double blind). Given the timelines and emphasis on discovery in biomedical research (see above) it’s likely these fields require different approaches than CS. I would love to see people address those questions in the comments below, and maybe these should form another blog post too.

In summary, as I told Donny that night at CSH, I see BioRxiv as a tool in our toolset – be aware of your options and think given the above whether this is the right tool for that specific work/paper. Good luck with it and may the force be with you!

 

Update 9/9/2017:

  1. Here is a paper that does a nice job comparing the effect of enforced, voluntary, and no submissions to archive on subsequent paper citations. This serves to put the numbers behind positive point 2.1 I listed above. As it happens a different paper with a much more catchy message (500% increase in citations) but no such control for confounding factors appeared as an archived paper and created the consequent buzz – nicely illustrating some of the other points I was trying to make….
  2. Brent Graveley responded to my blog (here) stating all the positive effects posting on BioRxiv had for him (including reversing an editor decision!) and stating no negatives occurred. This is, of course, a case of proof by example, but still, a nice illustration of the benefits to be had.

What does it take to get a PhD?

“What does it mean for you to be/earn a PhD?” “What does it take?” I often pose that question to students at various stages.  I think this is an important question one should ask oneself, during/before training (or when you train others) as this will affect the place/environment one chooses, and what they do during that time.

Today I want to focus on my answer to this question, in the hope that it will help current and future students assess their way. I note that my answers are focused on the research heavy environment, naturally in the computational/CS/biomedical area. But some general principles apply nonetheless.

Anyone who has followed my blog even a little would know that as its name implies, I tend to find connections between the Martial Arts and everyday life. In this case, it so happens that I earned my PhD about the same time I earned my black belt and could not help but see the similarities. For one, a black belt can take 5-10 years to get if you train seriously. Also, people from the “outside” think that if you got a black belt you are “a master”, but any serious practitioner knows it’s hardly the case. You get your black belt when you finish your basic training and then you set out on your way to *start* and become a master. It’s a lifelong journey and what actually matters is the Way, not the target. A PhD is pretty much the same, but what you are training for is to be an independent researcher.

But what does it actually mean to finish basic training to become an independent researcher? It means that if you are given a research question you can formulate it, develop methods to tackle it, assess their results/performance, iterate and finally converge to a solution which you are then able to reproduce, write up in a scientific paper, and present. In order to do that you should establish knowledge of the field you work in (what was done, main problems etc.) and the techniques used (experimental, computational – does not matter), and develop the ability to find what it is you are missing and learn those missing components as part of the process. You should be able to critique your work and others’ and express this clearly. This means for example that you should be learning how to write reviews as a guided process with your PI so by the end of your PhD you can do it well without them. The observed outputs of this process are talks, papers, reviews etc. but these are byproducts, not the goal. Which is why I don’t like when PhDs are defined by those.

Why is this important? Because if you adopt this view you can take a step back, look at your PhD and think: OK, what am I missing? What haven’t I done yet? What am I relatively weak at? Then start working towards these so by the end of your PhD you have gotten through the entire “checklist” and feel you can go through the research process by yourself. Too many students seem to be thinking instead along the lines of “I need to get X papers”. They look at the end-point (the black belt, the paper), not the Way, and therefore actually miss the point.

You might be asking yourselves at this point what about other things? Working in a team? Managing/Teaching/Mentoring?

I think these are great to have and try to get my students practice in those as I think it prepares them better for the modern jobs wherever they choose to go. As a student, you can be proactive about it, look for opportunities, emphasize this to your PI/Thesis committee to maximize exposure etc. But strictly speaking, a PhD at the core is about the ability to execute independent research and not these.

So, what about “knowing the landscape of open problems”  that you can consider going after? Learning how to define the questions in the first place? Get funding?

My view is that these are all good things to learn/know, but again they are not part of the PhD definition. After you finish your basic training you go on a postdoc. As a postdoc, you keep learning more techniques/methods (possibly in new areas), and practice your ability to do independent research. By the end of that, you have proved yourself completely capable of independent research, gained knowledge and experience, built a track record and developed the elements mentioned above (mentoring, view of interesting problems etc) so now you are ready to become a PI. Networking, gaining experience in managing projects/people (and your time…;) are all a big plus. You should strive to build those capabilities to make your life easier regardless of whether you stay in Academia or not.

In practice, some come out of their PhD much more mature than others [1]. A good example is my friend and colleague, John Calarco. John was a PhD student when I was a postdoc at the Blencowe lab. He came out of his PhD with a dozen papers including first author papers in obscure journals such as Cell and Nature. Yes, I know it’s not a great metric but the point is he came out of his PhD with a view, a vision of what he wants to do. He became a Bauer Fellow at Harvard, which meant running his own small lab for a limited time (4 years) with guaranteed funding, and recently got a PI position back at UofT. I see John as a great example of rare maturity at the end of a PhD. I, for one, was not even close to that when I finished my PhD. I did not even know what I wanted to do when I grow up (still working on that one… ;). All I knew at the time was that *if* I ever wanted to become a PI (and that was a very big if), then I should do a postdoc. So I went looking for a good place for a postdoc (that’s a topic for another post). However, I did get my good basic training at the dojo of my Sensei, I mean advisor, Nir Friedman. And btw, as I noted in a previous blog, “Sensei” does not mean “Master”, it literally translates as “that who has come before”, pointing to the endless chain we create by training the next generation.

So, my view is that if at the end of your PhD you don’t have a comprehensive plan/view of research problems, how to get funding etc. that’s OK. You should not stress about that. But make sure you get your basic training right. Generally speaking, you will *not* get that chance as a postdoc, where you are expected to deliver. Getting your base is intimately linked to choosing an advisor who is a good fit – see a recent post about that from the great Eleazar Eskin here. Regardless, and even if you do not agree with some/all of my points, I hope the above discussion got you thinking and helps you on your “Way” of PhD training or training others!

 


[1] My impression is that students in biomedical fields are generally more mature in that respect than the typical CS ones. Not sure why – maybe it has to do with the time CS students spend on technical abilities, maybe it’s their nature, maybe it’s the culture – I really don’t know, but that has been my general impression through the years.

Follow up on “Proper upbringing vs negligence of computational students in biomedical research environments”

So it turns out my previous post struck a chord with quite a few readers. Some of them contacted me directly to comment and share their stories. I decided to include three. The first is an update I previously posted quickly after a senior PI misinterpreted my post. The other two came later and reflect personal experiences which I thought are important to share both for the PIs out there that think there is no problem and of course for the aspiring compbio students/postdocs who read this. I think those stories speak for themselves and it’s interesting to see the comments and feedback coming from basically all over the world, so this seems to be somewhat of a universal problem in our field. Of note, those stories are not about evil PIs exploiting students (the wrong impression the senior PI from the first update was worried about), but rather various forms of negligence and nonsupportive environments, which is what I was actually describing. For obvious reasons, I removed all details that may identify the people involved (and got prior authorization to post those here).

PREVIOUS UPDATE – ORIGINALLY POSTED 1/16/2017:

It seems this post got a lot of views but was also misinterpreted by some who got back to me with legitimate concerns and criticism. Specifically, a senior PI wrote me they read this as “data generation labs are exploiting the students”. That was never my intention. Let me clarify, and I’ll use Penn’s GCB graduate group to make the point. GCB stands for “Genomics and Computational Biology”. I think the creators of GCB were wise to define it as such. It means GCB caters to a wide range of students who want to get exposed to “real life data/problems”. Some are more into methods development to derive hypotheses (hence “Computational Biology”), others are more into actually generating the data and analyzing it themselves (hence “Genomics”). These are crude distinctions of course but the point is not every student is interested in methods development, not every student requires co-advising. And sometimes a student may need co-advising/collaboration for a specific project/problem and that’s all. As the PI rightfully wrote me “there is no one size fits all”. Indeed. And students that are becoming experts in a certain field while using/producing Genomic data are not “exploited.” As that PI wrote me: “I’d be better off hiring a good bioinformatician than taking on an untrained grad student who typically needs close supervision and mentorship.” That’s a fair point. My worry, and what sparked this blog in the first place, is with students who want to do more “methods development” at some level and do not get to do that because (a) they haven’t realized that’s what they actually want to do (b) they did not articulate it (see my suggestions above) (c) the system/lab they are in does not support it.

PERSONAL STORY 1:

Your recent blog post strikes a deep personal chord with me because, during grad school, I was one of the “computationally oriented students [that were] basically used as in-house bioinformaticians to solve the bioinformatics needs of a data generating lab”.
Before we go on, I should say that my grad advisor is a very nice person, excellent scientist, absolutely looks out for me, and we have a great ongoing relationship. So, this is definitely a classic case of asking the student (me) to “go explore and tell me what you may want to do” “[w]ith all the good intentions.”
I joined a genetics lab with a lot of interest in computational biology but, being a naive undergrad, I did not realize that, although the science was really cool, my advisor will not be able to advise me on the computational aspects of my work. After I started my work this slowly dawned on me when problems were being posed to me and I was being asked to “solve” them without being given any starting point or subsequent guidance. This was still my first year and I found it very hard to cope with.
I struggled day and night to find relevant papers & reviews, read them end-to-end, read online tutorials, improve both my programming and analysis skills, and started working on the given problems. Then, I started seeking out other bioinformatics/computational-biology faculty on campus to interact with and attend journal clubs with, and I was also doing my best to identify one of them to be my co-advisor.
But, the latter – engaging with other computational faculty – was not easy at all due to complicated politics from all parties involved: my advisor only wanted to ‘collaborate’ and did not want me to be partially subsumed into another faculty’s group, distracting me from my main work; he also did not have good experience/relationship with a few bioinformatics faculty whom he wanted to work with, and so, he decided to “grow the expertise” in his own lab and liked to tout that he had in-house bioinformatics capabilities.
I survived by working very hard, making hundreds of mistakes, interacting with folks far and wide across the campus, and finding a couple of “shadow” mentors whom I could go to for general guidance when things really were not looking well. Along the way (just like you pointed out), I also managed to mildly enjoy the part of being in a lab that was generating data and interacting closely with experimental biologists, both helping me tremendously in my scientific development.
So, in spite of my survival and subsequent success, I couldn’t agree more with your post. Now as a faculty myself, I cannot emphasize enough the value of “advice”, “training”, “guidance”, and “well-rounded professional growth” of my students and I’m committed to “improving the upbringing of our future generation of scientists”.
Thank you for your post. This is a super-important issue and I’m glad you brought it up.
PERSONAL STORY 2:
Your recent post on computational training has touched me deeply.
I have read this almost 50 times and this completely echoes my sentiments.
We all acknowledge the misuse of computational trainees as in-house bioinformaticians, but your post also talks about the “benign form of negligence”, i.e. not knowing what to do with a computational trainee.
I am currently in the same situation, figuring out what to do next. Unfortunately, most people never realize this problem until it is too late.
Thank you again for this post.

Proper upbringing vs negligence of computational students in biomedical research environments

So, today I want to write about a topic I feel strongly about which is how we raise the next generation of computational biologists.

To start with, I think that in many ways we have made great progress compared to the state of affairs when I started my graduate studies: There is a much better understanding of what it is students should actually know, there are dedicated courses, books, online material, etc.

I also want to emphasize that I’m not advocating that computational students do not train/work in biomedical environments. Unless what you really want is only do CS/Math you may miss out *a lot* in terms of real life data/problems (domain specific data science if you will), how biologists think about problems (quite different I tell you, and there is a lot to learn there!), or thinking about the next set of problems/challenges to tackle. Not to mention cutting edge biomedical research you get exposed to can be absolutely fascinating even when no computational problems are involved!

But I’m not here to discuss all that but rather the not uncommon situation where computationally oriented students are basically used as in-house bioinformaticians to solve the bioinformatics needs of a data generating lab. And sure I understand it’s not black or white and there is great value in getting your hands dirty with real data, and that it’s important to help each other, be a good citizen, etc. That’s not what I’m talking about. I’m talking about students with computational aspirations that end up doing all the bioinformatics work in the lab because (a) it’s really needed (b) they can (c) they are much cheaper and easier to get than a Bioinformatician. Sure, these students may end up on great papers representing great science from great labs. But I argue that’s not enough, and that cannot be an excuse. Why? Because they come to *train* and it’s our responsibility to train them. And if you think that just by making them solve your Bioinformatics problems you are giving them proper computational training you are *wrong*: They will not necessarily develop the technical skills in algorithms, proper coding, data analysis, thinking about computational modeling and many more things they should be getting. And don’t tell me that the fact they are coming out to a market that will now snatch them is enough. Because if they have the proper training they can easily grow, do something else entirely, etc. But if they don’t then they are much more likely to get stuck at a lower level, not mature as independent compbio researchers that are sought after in Academia/Industry.

I should also mention a “lighter form” of negligence: When a PI gets a highly computational student but does not necessarily know how to guide her. With all the good intentions this results in “go explore and tell me what you may want to do.” It sounds great in theory, but the problem is that (a) these students commonly lack a strong biomedical base and (b) even if they are computationally savvy they don’t know how to actually translate something they hear/read about to a computationally framed problem. They often don’t even know what questions to ask.

Naturally, I meet many researchers during my work, and some PIs acknowledge the problem. I talked to one such senior PI in a meeting last summer who told me: “you are right. They are desperately needed in the labs, we try to make the best of it, but I know it’s not always good for them”. But not all are like that. I had a quite different exchange with another senior PI. During a social event, the conversation drifted to this, and I said it’s a problem we need to deal with. She said it’s totally fine (using the argument above about having job offers). I reiterated our obligation to train them properly computationally and that otherwise we are not doing it right. At which point she said, jokingly, “Well, you are lucky I’m not on your tenure committee.” I could not agree more, and joke or not, I don’t like that kind of humor [1]. Regardless, I see that “everything is fine” answer as representative of a too common approach in biomedical research labs.

So what should we do? There are several things I can think of:

  1. As an institution/graduate program: Make the effort to have computational students be advised properly. So if the PI is not up to it/interested, get a co-advisor [2] and make sure the computational skills development is on the student’s todo list.
  2. As a student:
    1. Same as (1) above regarding skill development and/or co-advising.
    2. Think carefully beforehand about which institute/program/lab you want to spend your time in. Think what it is you actually want, ask questions, shop around. Maybe do research in a lab for a year to get the hang of it and see for yourself before you commit for 5 or so years.
    3. Be Proactive – do not just count on your program/mentor/whatever to take good care of you/your interests. Maybe your interests are not theirs, or not high enough on their priority list, or they are just too busy or do not know any better. We are brought up in a system where we follow what the teachers tell us, get good grades, and constantly look for their approval. Ph.D. students are in a period where they are still training but also transitioning towards independence, the job market, etc. You should still focus on doing a great job, but don’t follow blindly everything else.

The above points also relate to some of my previous posts about finding yourself a good mentor (or Sensei…). At the very least if we all become more aware of this issue I think there is a good chance of improving the upbringing of our future generation of scientists.

UPDATE 1/16/2017:

So, it seems this post got a lot of views but was also misinterpreted by some who got back to me with legitimate concerns and criticism. Specifically, a senior PI wrote me they read this as “data generation labs are exploiting the students”. That was never my intention. Let me clarify, and I’ll use Penn’s GCB graduate group to make the point. GCB stands for “Genomics and Computational Biology”. I think the creators of GCB were wise to define it as such. It means GCB caters to a wide range of students who want to get exposed to “real life data/problems”. Some are more into methods development to derive hypotheses (hence “Computational Biology”), others are more into actually generating the data and analyzing it themselves (hence “Genomics”). These are crude distinctions of course but the point is not every student is interested in methods development, not every student requires co-advising. And sometimes a student may need co-advising/collaboration for a specific project/problem and that’s all. As the PI rightfully wrote me “there is no one size fits all”. Indeed. And students that are becoming experts in a certain field while using/producing Genomic data are not “exploited.” As that PI wrote me: “I’d be better off hiring a good bioinformatician than taking on an untrained grad student who typically needs close supervision and mentorship.” That’s a fair point. My worry, and what sparked this blog in the first place, is with students who want to do more “methods development” at some level and do not get to do that because (a) they haven’t realized that’s what they actually want to do (b) they did not articulate it (see my suggestions above) (c) the system/lab they are in does not support it.


[1] This reminded me of a joke my father always liked to tell when I was little: Two guys cross each other on the street. The big guy suddenly slaps the little guy out of nowhere. The little guy looks at him intensely and says: “What was that? Was that a joke or something?” To which the big guy replies: “No, I was serious.” “Oh, you’re lucky then,” says the little guy, “because I really don’t like that kind of humor.”

[2] Co-advising is a solution used in GCB [Genomics and Computational Biology] here at Penn. I was fortunate to be co-mentored through some of my PhD and it was instrumental during my postdoc years.

 

Training DNN and Training BJJ – What’s the Connection??

Brazilian Jiu Jitsu (BJJ) is a grappling martial art focused on submitting your opponent by chokes and joint locks (arm bar, knee bar etc.). Deep Neural Networks (DNN) are a class of models in machine learning which you train on data for specific tasks (e.g. object recognition in images). Is there any connection between the two??

In today’s post I want to get back to one of my favorite themes, drawing connections between Martial Arts and research or everyday life. I usually identify some principle/idea in my Martial Arts, then find a parallel in everyday life/research (see for example here). But this time it started with our current efforts to train deep neural networks. Taming the DNN beasts is somewhat of an art, but that’s not the point I want to make today. Instead, I want to focus on what we optimize, how we optimize, and for what. Sure, you can be technical at this point and talk about the details of your SGD, the learning rate, momentum, dropout, Adam and whatnot. But I want to look at it more, well, philosophically: What you really want to achieve is *generalization*, “mastery of the domain” if you like. And how do you go about it? You try to optimize some sort of temporal cost function. I call it temporal because in typical DNN training the sample set is huge and your SGD is based on your model’s current experience (the sample/mini batch) and its current state (model parameters), with the hope that when the model sees something new in the future (test sample) it will be able to handle it well. The reaction to the current sample is based on the function you set to optimize, i.e. the cost function. While in some cases the cost function naturally arises, it is generally something we *make up*. Sometimes it’s a reasonable approximation of what we are truly after (think of #mistakes vs. say cross entropy or hinge loss), sometimes it’s rather crude (e.g. maximum likelihood of a simplistic probabilistic model for a biological process). Another point is that in typical DNN training we don’t even try to achieve the magical “global optimum”, settling for a “good enough” reduction in test error. And if you try to “rush” your learning going too fast for your temporal loss function (e.g. learning rate) you do not do as well.
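For readers who like to see this in code, here is a minimal sketch of the "made up" cost function idea. It is a toy, not how we train our actual DNNs: a one-parameter-pair logistic model on synthetic data, where mini-batch SGD descends cross entropy while the thing we truly care about is the fraction of mistakes on samples the model has not seen. Everything in it (the `make_data` task, the 10% label noise, the learning rate, the function names) is an assumption made up for illustration.

```python
import math
import random

def make_data(n, rng):
    """Hypothetical toy task: label is 1 when x > 0, with 10% of labels flipped."""
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        y = 1 if x > 0 else 0
        if rng.random() < 0.1:
            y = 1 - y
        data.append((x, y))
    return data

def predict(w, b, x):
    """Logistic model: probability that the label is 1."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def cross_entropy(w, b, data):
    """The surrogate cost that SGD actually descends."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for x, y in data:
        p = predict(w, b, x)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1.0 - p + eps)
    return total / len(data)

def mistakes(w, b, data):
    """What we are truly after: the fraction of wrong calls (0-1 loss)."""
    return sum((predict(w, b, x) > 0.5) != (y == 1) for x, y in data) / len(data)

def sgd(train, lr=0.5, batch_size=16, epochs=20, seed=0):
    """Mini-batch SGD: each step only sees the current batch and current state."""
    rng = random.Random(seed)
    train = list(train)   # copy so the caller's list is not reordered
    w, b = -1.0, 0.0      # deliberately bad starting point
    for _ in range(epochs):
        rng.shuffle(train)
        for i in range(0, len(train), batch_size):
            batch = train[i:i + batch_size]
            # For a logistic model the cross-entropy gradient is (p - y) * x
            gw = sum((predict(w, b, x) - y) * x for x, y in batch) / len(batch)
            gb = sum(predict(w, b, x) - y for x, y in batch) / len(batch)
            w -= lr * gw
            b -= lr * gb
    return w, b

if __name__ == "__main__":
    rng = random.Random(1)
    train, test = make_data(400, rng), make_data(400, rng)
    w, b = sgd(train)
    print("test cross entropy:", cross_entropy(w, b, test))
    print("test mistakes:", mistakes(w, b, test))
```

Note that `mistakes` never appears inside `sgd`: the optimizer only ever reacts to the local, differentiable surrogate, and we hope (and then check on held-out data) that this carries over to the goal we actually care about.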

And what happens when you train BJJ? Your temporal/local optimization function is winning/losing when you train fighting with someone. But that’s not your *real* goal. Just like your DNN you want to achieve mastery of your domain (BJJ). And you may have other goals as well: Good health/shape, self-defense, fun, etc. But if you focus too much on the local function (winning) you are going to miss out. How? For one, if all you care about is winning/losing you will not push yourself into hard situations (which may cost you the fight), limiting your exploration and therefore slowing yourself on your way to master the vast space of positions/states within the art. If you think of every time you get submitted/choked as merely a (negative) loss you will likely lose not just good lessons, but also much of the fun. But too much focus on winning/losing can have more subtle effects. Instead of emphasizing good technique and accepting a loss when you fail to execute you will insist on muscling yourself out of bad situations. This eventually leads to injuries which again slow you down, or completely stop you. And the funny thing is it does not need to involve a heroic move/submission or an opponent that goes crazy and breaks your arm. It can be as simple as you exploding out of a bad position for a split second, when your body is already tired after say two hours of training with a bunch of opponents that are all better than you. So you (partially) tear your knee’s MCL, have to recover for 3 months and have to go to your conference (ISMB 2015, Dublin) on crutches. Now you could argue that’s a good way to build name recognition (“I remember you, you are the guy with the crutches from last year” – I got this at ISMB 2016 in Orlando this year…) but I would strongly advise against it*.

In summary, it turns out that good practices in training DNN translate to good practices in training BJJ, which can in turn have great impact on your mood, happiness, and health…. Who knew? So just keep in mind in your own DNN/BJJ/whatever training to not let your local optimization function make you lose sight of your true goals….. and good luck!


*When on crutches at a conference, try to, say, make it on time for a talk, or squeeze into a row for an empty seat when you finally arrive late. After the talk, you can stand in the line for coffee, then get “here is your coffee sir”, then realize you can’t do anything with it and you have to rely on the good help of others. The good news is that it can be temporary and can hopefully go away. Having crutches is a humbling experience, but that’s a topic for another post….

 

 

A window into Research

In my previous post I discussed how Scientific meetings/conferences can be viewed as a Window into Research. Keeping up with the light summertime blog post theme, I wanted to present another interpretation for Window into Research, an “Arts & Science” project I have been working on in the last year.

Almost a year ago our lab moved to its new location in the newly renovated Richards Building. The building, designed by the famous architect Louis Kahn, is a designated National Historic Landmark and (as we found out) a constant tourist attraction. Because it’s a historical site we were not allowed to modify the walls – but no one said anything about the windows….. And so, shortly after we moved in, we turned the big windows into another whiteboard for the lab. The colorful result brought further attention (we recently had the dean visit with donors, admiring our new art form…;) and even Michael Schade, the architect responsible for the renovation, was pleasantly surprised by it (I told him that’s what happens if you let people into your plans…;)

At some point I decided to turn the lab’s corner window into my own mini “Arts & Science” project: A view into our lab’s research/daily life via stills of the window. The original idea was to constantly capture the window area across times of the day/week/month/year. While I failed to systematically do that (for now…), I did manage to collect quite a few images through the year which I turned into the clip below. I plan to keep updating this as the story of our lab’s research unfolds….

A few things worth noting:

  1. The theme song was deliberately chosen. First, it’s a nice song by Mark Knopfler. Second, its name, “Sailing to Philadelphia”, matches well the lab’s location at UPenn. But the third reason is maybe the most intriguing one: It revolves around Charles Mason and Jeremiah Dixon. Many people know about their contribution: Mapping the Mason-Dixon line between Maryland, Pennsylvania, and Delaware. This line later became famous as the border between the South and the North, thus representing much more than just a line. But what people do not know is that Mason and Dixon were English Scientists (astronomers, cartographers) that came to America to help map the then uncharted territories of the colonies with their new tools/methods. A few centuries after Mason & Dixon we too are in Philadelphia, mapping uncharted territories but in Genomics and Genetics….
  2. The clip gives a biased view of our research as it depicts more frequently the people actually sitting beside the window (Jordi, Anu). Other people have the misfortune of using plain whiteboards…..
  3. Finally, based on the information contained in the clip:
    1. Can you tell what type of data/problems we commonly work on? (hint: molecules like DNA, but cooler, and a process that rhymes with “rhyming” 😉)
    2. Can you guess what our family pet is??

 

And now, for the actual clip… Enjoy!