Sottaku Doji in Science

The other day I was chatting over lunch with a visiting colleague who told me the following story, not uncommon in our field: Several years ago, as they were just starting their independent career, they approached a senior PI with a proposal for what they thought would be a great research project involving technology development in their lab. The senior PI shrugged them off. A few years later, the same PI met my colleague and told them: “Hey, maybe you can do X, that would be exciting” (where X is exactly what my friend proposed before). But by then it was too late: similar technology had already come out, and my friend had invested in newer directions.

Well, you can sum up the above story as “timing is everything,” but it reminded me of a more nuanced version of that idea, used in the Martial Arts but applicable in Science (and everyday life) as well. It’s called Sottaku Doji.

Sottaku means “a pecking noise.” Doji means “simultaneously” [1]. The term refers to the critical moment when a chick hatches. The mother hears the delicate sound of the chick pecking against the shell and then, and only then, helps it break the shell. Too early – and the chick will die, as it is underdeveloped. Too late – and it may not be able to break free, and will die as well. I find it marvelous that Japanese actually has a term for this. Even more, the term was borrowed into the Martial Arts to refer to the need for a willing trainee, an attentive teacher, and mentoring given at just the right time. You can clearly see the relevance for Science, especially for advisors/mentors working with trainees. Sure, it’s not a life/death situation as it is for the bird (or possibly in old school kenjutsu), but we too need to make the observation/comment at the right time so that it sinks in (or, admittedly, sometimes we need to hear things ourselves at the right time, as in the story above).

I first learned about Sottaku Doji when I read Dave Lowry’s book Moving Toward Stillness [1]. I highly recommend it for those interested in inspiring stories about the connections between the Martial Arts and everyday life. In the book, Dave recounts how he failed at a certain kenjutsu kata for many months. One day, while observing him fail yet again, his (very traditional Japanese) Sensei told him “just shift your weight at this point” – and everything worked. Frustrated and angry, Dave asked the Sensei’s wife, “Couldn’t he have told me this before?” to which she replied: “And if he did, would you have listened?”

We live in a fast-paced world, with so much information available from online courses, tutorials, and how-to videos. Yet in Scientific research, as in the Martial Arts, there is no replacement for individualized, attentive training. The term for that already exists. All we need to do as mentors is, well, pay attention. Sottaku Doji.

[1] Lowry, Dave. Moving Toward Stillness (p. 47). Tuttle Publishing.


Should I do a PhD?

One of the fortunate aspects of being a Professor in Academia is that you get to meet the future on a regular basis: Part of your job is to interview PhD candidates and/or present to undergraduates contemplating research programs (Penn’s SUIP is a terrific example, but of course there are many others). For me, this means presenting to undergrads all across the country as part of the Diversity and Outreach Committee I chair, and interviewing candidates for multiple PhD programs that draw from diverse backgrounds: Computer Science (Engineering School), Genomics and CompBio, Genetics & Epigenetics, and last but not least MSTP (Penn’s MD/PhD program).

One optimistic observation I can make is that our future looks very bright: The students are top-notch. Their background knowledge and experience are so much more advanced than they used to be even 5-10 years ago (not to mention in our era as students…;), and they are motivated, smart, and ask outstanding questions. This leads me to today’s post: Should I do a PhD?

I think this is a question that many undergraduate students don’t spend enough time contemplating, or are sometimes unaware of some of the considerations. I will focus here on CompBio/Applied ML/Genomics, which are the areas of my work, but many of the points discussed below are more general.

A common approach to this question is to treat it as a business transaction: How much time will it take, what is the financial cost, what will be the gain in terms of increased salary/opportunities, etc. This is definitely a valid approach and an important consideration: You don’t want to run into a “glass ceiling” of opportunities and salaries blocked due to lack of a higher degree, and from what I hear, this is definitely the case in various industries involving both Genomics/CompBio and ML. The reasonable assumption is that a PhD may cost a few years of low income but can enable you to attain research-intensive and leading roles that are mostly open to those with a PhD. That said, trying just to maximize potential future revenue could also backfire: If you are not particularly good at research and end up with a formal PhD that didn’t get you far, you may find yourself “stuck” with it: You are considered overqualified for jobs you would actually love to get, but not a good fit (or not interested) in positions available to a PhD. Keep in mind that even if that is the case, you could still use the skills you learned during your PhD and transition to a non-research-centered position that would make you happy: Teaching at a high school/college, working for the government, publishing, not-for-profit organizations, etc. A PhD done right should help you develop many skills that transfer to other domains – coding (for comp people), communication, teaching, writing, managing projects, prioritizing tasks, etc.

While all of the above are valid considerations, these are not the ones I want to focus on here. Instead, I want to focus on the “Is a PhD really for me?” aspect. My impression is that many undergraduates have unrealistic expectations of what a PhD is about. So let me highlight a few issues and dispel some common misconceptions:

  • “I’m going to cure cancer with machine learning”: This is something I see more from students capable in CS/math but with little hands-on experience in genomics and genetics research. They get excited about “ML for good” and curing disease with their newly acquired skills. I very much sympathize with this motivation, but being inexperienced in such research, they think they are going to wow a problem into being solved by a cool DL algorithm. They don’t yet realize what research in Applied ML/CompBio actually entails, which is what I discuss next.
  • “Comp bio is 95% ‘dirty work’”: Cleaning data, getting your code to work, getting your model to work, analyzing the results, and chasing validations that don’t pan out. Not so much thinking up glorious models/algorithms. (And no, I don’t have a quantitative assessment to base the 95% number on, but there is a lot of that; same for the next item below.)
  • Research is 95% failure, and it’s a long game: Many young students are used to continuous success – in their courses, exams, and sports tournaments – and these require only bursts of effort. But a PhD is a long game that requires stamina and grit. And you don’t want to just survive it – you want to shine and do great work. Yet, paradoxically, you have to accept that most of the time you fail. Which leads us to the next point.
  • Papers are completely misleading: You only read about the eventual successes, and they are always laid out in a wonderful logical progression. But that is an unrealistic and misleading view of how Science actually works. You try A to prove B, A doesn’t work, and you end up doing C (or, more likely, C1/C2/C3 etc.) which shows D. Then you end up with a paper describing how you set out to show D with this elegant C3 model or experiment. Uri Alon has an inspiring talk about getting lost in the research cloud and finding your way. And while the talk is insightful and inspiring (highly recommended), it is built to help scientists, not to answer our question (“Should I do a PhD?”). Which brings me to the next point.
  • Research is dealing with uncertainty and unknowns. All the time. You may not like it: Many tend to think exploring the unknown is exciting, something they would want to do if just given the option. But in reality, I suspect the majority of people would be much happier with certainty: Certainty in our society and world view, our income, and our work. We want to know that if we do a good job solving well-defined X as set by our boss(es), then we will progress, get a bonus, feel pride in our work, etc. We don’t want to spend years on a research question which turns out to be unsolvable, or whose results suck. We don’t want to figure out all the questions needed to build a research program. We would much rather have someone else do that. And that’s 100% totally OK: open-ended, volatile research questions are not everyone’s cup of tea. You just need to recognize if that’s the case for you; maybe, for example, go become an incredibly capable/creative developer instead – happier, more fulfilled, and more accomplished.

I realize the above list may be demoralizing for some. To be clear, I am not advocating against doing a PhD. I love what I do, and I think that if you ask around my lab you will find people love doing research (while suffering through the above…;). But I do think it’s not necessarily for everyone. Maybe the best advice I can give young undergraduates is to get some hands-on research experience, preferably more than one stint, to see if this is for them (see the point above about SUIP and similar programs). More generally, as someone who didn’t grow up here, I find that North American culture cultivates rushing for external achievements as fast/as soon as possible. Young people rush to college before they figure out what they want to do/study (while at the same time they are not used to failing, see the point above). Then they rush to a PhD. In many cases, they are better off growing up a little first – figuring out what they like/want to do and getting some valuable experience in research or in industry. Such experience can do wonders for your coding skills, prioritization, time management, etc., which will help you get more out of your PhD. It could also help you figure out who you are. But that’s maybe a topic for another blog post in the future.

I hope to follow up on some of the points I made above in the near future. I want to explore answers to great questions I got asked by students this recruitment season, so stay tuned!

Before I end, let me just mention a few other past posts by me and others that may be useful for students contemplating their next step in research or towards a PhD – I hope you find these useful!

Navigating Grad School Admissions in Science | by Talia Lerner

Why do you want to be a scientist?

Thoughts on picking a rotation lab – Avasthi Lab

What does it take to get a PhD?

The Art in Science – Part III: “What problem should I choose to work on?”

Do Not Seek Your Goal

Advice for Picking a Graduate Advisor | ZarLab

Proper upbringing vs negligence of computational students in biomedical research environments

Follow up on “Proper upbringing vs negligence of computational students in biomedical research environments”

(this last one is not about PhDs but about Bioinformaticians in wet labs. It was recommended to me by a student I interviewed this year – I thought it was a great recommendation and promised to share it next time I write about related topics…😀)

A guide for the lonely bioinformatician

MAJIQ-HET is out!

We are thrilled to release the new version of MAJIQ (2.3 for those keeping count), which contains a slew of new features and capabilities. The release is accompanied by a pre-print posted last week. This blog post will give a general overview of the new MAJIQ features and why we are super excited about this – hopefully you all are too…. 😃

The new MAJIQ is built to scale from a few samples up to large heterogeneous datasets with thousands of samples and dozens of conditions (GTEx anyone? TCGA? TARGET? ENCODE? Or maybe you want to build your own combined dataset?). It came out of our own research, after we realized there were no available tools for the related tasks (e.g. differential splicing, clustering) on such data. Yes, there are efficient tools that can process large data, but large heterogeneous datasets require more than raw efficiency to accurately detect, quantify, and visualize splicing variations.


First, some background is due for the newcomers (those who know/use MAJIQ are welcome to skip this): We released MAJIQ v0.9 in February 2016 (>5.5 years ago!). MAJIQ introduced a new vocabulary for how RNA alternative splicing (AS) is formulated by defining local splicing variations, or LSVs, which generalize over the classical binary AS events of cassette exons, 3’/5’ alternative splice sites, and intron retention. It was the first tool to detect, quantify, and visualize complex (involving more than 2 alternative junctions) and de-novo (unannotated) junctions, exons, and intron retention events. Using this new formulation of splicing variations we were able to show that over 30% of annotated splicing variations in human/mouse are complex, and that including de-novo events can boost detection of differential splicing by >30% while retaining the same reproducibility.

Since its publication in Feb 2016 we have released accompanying tools for automated RT-PCR primer design for validating LSVs and, more recently, for correcting known and unknown confounding factors in RNA splicing analysis. Two years ago we also released MAJIQ 2.0, which had much more efficient algorithms/code, retaining MAJIQ’s advantages in accuracy and features while making it as fast/efficient as the most efficient algorithms (rMATS-turbo, LeafCutter). We even released a pre-print showing it compares favorably and that previous comparisons misrepresented MAJIQ, but eventually passed on publishing that as a paper (buy me a beer sometime and I’ll tell you how a paper gets stuck as a pre-print). Instead, we opted to complete a whole new set of features and an accompanying paper to showcase these new capabilities. Fast forward through professional developers coming and going, students graduating, a few babies, one global pandemic, etc., and suddenly it’s November 2021. So…. What’s new??

Incremental build:

I wish I had Shaggy’s super powers, but not quite….. In a nutshell, incremental build allows users to process each sample only once. This is critical for large datasets to which you may want to add more samples later (e.g. updating with a new release of GTEx, or adding additional datasets to existing compendia). What you don’t want is to (a) have to save all the raw data and/or re-download it, or (b) reprocess the data. To let users avoid this, MAJIQ builds a database of all identified junctions and intron retentions (IR) for each sample and saves all the necessary statistics, so you don’t need to keep the original BAM files and re-analyze them.
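The underlying pattern is a simple process-once, cache-forever scheme. Here is a minimal sketch of the idea in Python (the function name, file format, and stats structure are made up for illustration – this is not MAJIQ’s actual code or storage format):

```python
import json
import os

def get_sample_stats(sample_id, bam_path, cache_dir, extract_fn):
    """Return junction/intron-retention stats for a sample, computing
    them only once: on the first call the BAM is processed and the stats
    are cached to disk; later builds reuse the cache, so neither the BAM
    nor a re-download is needed."""
    cache_file = os.path.join(cache_dir, f"{sample_id}.json")
    if os.path.exists(cache_file):
        with open(cache_file) as fh:
            return json.load(fh)      # cheap: reuse previously saved stats
    stats = extract_fn(bam_path)      # expensive: parse the reads once
    os.makedirs(cache_dir, exist_ok=True)
    with open(cache_file, "w") as fh:
        json.dump(stats, fh)
    return stats
```

Adding new samples to an existing compendium then only pays the extraction cost for the samples not seen before.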

The simplifier:

Splicing can get complex pretty fast, especially when you are processing hundreds or thousands of samples. To remove “noise” (e.g. lowly used junctions, or junctions not actually observed in your data) we implemented the simplifier – both at the splice graph building stage and later at the visualization level. Below is an illustration of the simplifier in action. We released the simplifier a while ago and have a separate blog post about it here.


MAJIQ HET:

OK, so not this kind of hat, though we do have a few more surprises for you down below…😊

HET stands for “heterogeneous” and refers to the new, (mostly) rank-based statistics used for more robust detection of differential splicing between groups of samples. The original MAJIQ was built for (small) groups of biological replicates and assumes the group shares an (unknown) PSI value per junction, per LSV. Instead, MAJIQ HET uses robust rank-based statistics (e.g. Wilcoxon, TNOM), combined with the estimated dPSI between the medians of each group, to filter and rank differential splicing events. Check out how HET compares to other algorithms in reproducibility plots (x-axis: “how many events are reported as changing”; y-axis: “how many are reproduced when repeating with different groups of the same size and tissues”) using real data (GTEx), when comparing groups of 3, 15, or 50 (!!) samples. We also compared many other metrics (see the pre-print). Also, all the scripts and data are available as a package for the community. This means future developers can use it for comparative analysis, and we encourage potential users to run it on *their* data to see how the different algorithms perform in those settings.
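To make the HET idea concrete, here is a toy sketch of a rank-based two-group call: a Wilcoxon rank-sum test (normal approximation, written from scratch so the sketch stays self-contained) combined with a dPSI filter on the group medians. The thresholds and the exact combination rule are illustrative, not MAJIQ HET’s actual procedure:

```python
import math
from statistics import median

def rank_sum_test(xs, ys):
    """Two-sided Wilcoxon rank-sum test using a normal approximation
    (no tie-variance correction) -- a from-scratch sketch."""
    n, m = len(xs), len(ys)
    pooled = list(xs) + list(ys)
    order = sorted(range(n + m), key=lambda i: pooled[i])
    ranks = [0.0] * (n + m)
    i = 0
    while i < n + m:                       # assign average ranks to ties
        j = i
        while j + 1 < n + m and pooled[order[j + 1]] == pooled[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    w = sum(ranks[:n])                     # rank sum of the first group
    mu = n * (n + m + 1) / 2
    sigma = math.sqrt(n * m * (n + m + 1) / 12)
    z = (w - mu) / sigma
    return z, math.erfc(abs(z) / math.sqrt(2))   # two-sided p-value

def het_style_call(psi_a, psi_b, min_dpsi=0.2, alpha=0.05):
    """Flag a junction as changing between two sample groups: the dPSI
    between group medians must be large AND the rank test significant.
    Thresholds here are illustrative defaults, not MAJIQ's."""
    dpsi = abs(median(psi_a) - median(psi_b))
    _, p = rank_sum_test(psi_a, psi_b)
    return dpsi >= min_dpsi and p < alpha
```

The appeal of rank-based statistics here is robustness: a few outlier samples in a large heterogeneous group shift ranks far less than they shift means.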

VOILA 2.0:

How do you simplify complex splice graphs to remove “noise” (e.g. lowly used junctions)? How do you visualize hundreds or thousands of samples? How do you do multiple condition comparisons? Easy – you run VOILA 2.0 😁!

VOILA 2.0 can be run on Mac/Win/Unix but also as a server, so you can share results with your collaborators without file transfers (but yes, due to popular demand we did also bring back an option to save HTML files for sharing/saving visualizations of specific results). VOILA 2.0 allows you to query the results for specific genes, junctions, etc. Look at the result below for a cerebellum-specific splicing change in the splice factor PTBP1, where every point is a GTEx V8 brain sub-region sample (2,335 in total). If you want to see the visualization “live”, click here.

MAJIQ – Modulizer:

The modulizer is a completely new feature/algorithm which, on the one hand, helps users simplify splicing analysis while, on the other, gives them the ability to dissect types of splicing variations like never before.

The idea is simple: First, we create an updated splice graph for each gene using all the known annotation and de-novo elements (this part existed before). Next, we break down each gene’s splice graph into separate alternative splicing “modules” (ASMs). ASMs were proposed by Yin et al., but we do not follow their recursive definition or associated algorithms. Instead, you can simply think of disjoint regions that have a single point of entry or exit in the gene splice graph, as shown below:
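To make the single entry/exit idea concrete, here is a toy sketch over a simplified, linear splice graph (exons numbered in genomic order, junctions as exon pairs). It only illustrates where module boundaries fall – the real modulizer works on full splice graphs with intron retention, de-novo exons, etc.:

```python
def split_into_modules(num_exons, junctions):
    """Split a gene splice graph into alternative splicing modules.

    Exons are numbered 0..num_exons-1 in genomic order; junctions are
    (source_exon, target_exon) pairs. A gap between consecutive exons
    crossed by at most one junction is a single entry/exit point, so it
    separates modules. A simplified sketch, not the modulizer's actual
    algorithm."""
    modules, start = [], 0
    for e in range(num_exons - 1):
        n_span = sum(1 for src, dst in junctions if src <= e < dst)
        if n_span <= 1:                  # graph narrows to a single edge
            modules.append((start, e))
            start = e + 1
    modules.append((start, num_exons - 1))
    return modules
```

For a gene with two cassette exons connected by one constitutive junction, this yields two modules, each of which can then be typed and decomposed into binary AS events as described below.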

Now, with the ASMs defined as above, we can do two main things: One, we can define ASM types and find out which are the most common ones. For example, when analyzing cerebellum specific splicing changes we found the following types of changes most common:

But maybe what’s more interesting is that the modulizer allows you to break down a complex ASM to its subcomponents and enumerate all two-way alternative splice (AS) events that occur within an ASM. Here is an example of two ASMs in a gene broken down into all possible binary AS events:

You can also take a specific splicing change of interest – cassette exon in the example below – and study differences between types of ASMs that incorporate it. For example, we broke down differential cassette exon inclusion/skipping in cerebellum by the most common ASM types that include it:

Focusing on ASMs w/wo intron retention, we found specific regulatory features that are shared and others that are distinct (see details in the pre-print). Again, we are not aware of previous tools/methods that allowed users to perform this kind of analysis, and we hope it will help make new discoveries related to RNA splicing regulation.

As a final comment, I would like to mention, for the record, that I very much wanted to call this new tool the modlizanator, or at least the Combine-inator, but for some reason my suggestion was overruled… something to do with misinterpreting our efforts as evil and ineffective 🤔.

Being very negative, in a positive way…

I’ll finish this blog post with another small addition which we implemented many years ago but which somehow failed to make it into a public release. We found that in our work we sometimes want not just to find splicing events that are changing, but also to compare them (e.g. their regulatory features) to splicing events that are definitely *not* changing, i.e. a high confidence negative set. Surprisingly, all available tools focus only on high confidence true positives, not true negatives. So we implemented this as an option to get high confidence non-changing LSVs (users can get those by using the --show-all flag).
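For intuition, a high confidence negative can be phrased as requiring the evidence to concentrate near dPSI = 0. Here is an illustrative sketch over posterior dPSI samples (the thresholds and the criterion itself are made up for illustration – not necessarily the rule behind --show-all):

```python
def high_confidence_nonchanging(dpsi_samples, max_dpsi=0.05, min_prob=0.95):
    """Decide whether an event is confidently NOT changing: the
    estimated probability that |dPSI| stays below max_dpsi must exceed
    min_prob. Illustrative thresholds, not MAJIQ's exact rule."""
    within = sum(1 for d in dpsi_samples if abs(d) < max_dpsi)
    return within / len(dpsi_samples) >= min_prob
```

Note the asymmetry with the usual significance call: failing to be called “changing” is not the same as being confidently non-changing, which is exactly why a dedicated negative set is useful.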

Tell us what you think

We built the new MAJIQ based on many years of experience performing RNA splicing analysis and research, combined with continuous input/suggestions from our user base. We want it to be the best tool it can be, so please give it a try and send us your feedback and suggestions!

And finally – yes, there are more features/capabilities coming…. So stay tuned!

RNA Splicing? There is an App for that!

In a previous post we showed how splicing analysis can be greatly simplified by using the MAJIQ simplifier. In this post we continue with this theme of making RNA splicing analysis more accessible and easy to use, introducing an app for it:

OK, so it doesn’t run on your smartphone with such cool animation (yet!;) but it does work on Mac/Win/Linux and the visualization is pretty nice too…

We created this app in response to users’ requests: The previous VOILA (1.x) ran in the Chrome browser using HTML5 with interactive animation based on D3. This was great – anyone could run it with no special installation. But the HTML5 code could only do so much in terms of interactive usage (e.g. search options) and was also running out of steam for large datasets. So, in MAJIQ/VOILA 2.x we switched to a servlet-based visualization, giving us many more capabilities. But we lost the ease of installation – no free lunches… Or maybe not? The new app saves users from needing any sophisticated installation – simply drag and drop:

Now, let’s say your student wants to send you their results – simple! They send you the VOILA file, you open the app on your laptop, upload the file, and…. Voila!

More info about the VOILA App can be found here. So try it out and let us know what you think. And if you have more ideas for how to make MAJIQ/VOILA even better shoot us an email or even better: post on the user support group here.

Time to KIS-AS: Keep It Simple for Alternative Splicing

In my previous post I stated that exciting things are coming for all you MAJIQ users doing splicing detection/quantification/visualization. But you did not suspect it would involve KIS-AS, right?

OK, I guess some explanation is in order here about why we need to Keep It Simple for Alternative Splicing (KIS-AS)….

Well, anyone who works on alternative splicing will tell you it gets complicated very fast: too many junctions obscure what is actually simple. And the more samples you add, the more things become hairy, like in this example:

These many junctions are likely NOT the result of technical errors (e.g. in mapping): in this case we see they are marked as dashed grey/red, which means they are in the transcriptome annotation but never observed in your data/specific sample. In other cases junctions may be observed at very low levels (either de-novo or known junctions), i.e. even if they are indeed real they represent some form of biological “noise” which is known to occur in splicing. The deeper you sequence, the more junctions you will find, but rarely/lowly expressed junctions are likely not that relevant for your work…. Consequently, such lowly included junctions “obscure the view” and make it hard to see the forest for the trees. But here is the dilemma: On the one hand, you do not want to ignore complex events, possibly with de-novo junctions, which may be extremely important for the process you are studying. On the other hand, you can’t see the forest for the trees. So, what do you do? Well, it’s time to KIS-AS, using our new MAJIQ/VOILA simplifier!

MAJIQ’s simplifier is part of the builder pipeline (the part that builds all the splice graphs etc.), and it can be triggered using the --simplify argument. This argument enables the simplification process, which can be tweaked using a set of specific parameters:

  • --simplify [psi_threshold]: Enables the simplification step; the PSI ratio used to simplify is specified by psi_threshold (0.01 by default).
  • --simplify-denovo [readnum]: Simplify all de-novo junctions whose total number of raw reads is lower than readnum (default 0).
  • --simplify-annotated [readnum]: Simplify all annotated junctions whose total number of raw reads is lower than readnum (default 0).
  • --simplify-ir [readnum]: Simplify all intron retentions whose total number of reads is lower than readnum (default 0).

To make things simpler (😉), only the --simplify argument is required to activate the simplification, in which case the rest of the arguments are set to their default values. Note, though, that the above flags give users a lot of power/flexibility. For example, users can sacrifice very little in PSI accuracy by removing junctions whose inclusion level (PSI) is minor. Junctions that do not pass the simplifier threshold(s) are deemed (putatively) irrelevant.
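Putting the thresholds above together, the filtering logic can be sketched roughly as follows (the field names and the PSI-ratio computation are illustrative – this is a sketch of the described behavior, not MAJIQ’s internal code):

```python
def simplify_junctions(junctions, psi_threshold=0.01,
                       denovo_reads=0, annotated_reads=0):
    """Drop 'irrelevant' junctions following the simplifier thresholds.

    Each junction is a dict with 'reads' (total raw reads) and 'denovo'
    (bool); the PSI ratio is taken here as the junction's share of reads
    among the junctions passed in (e.g. one local splicing variation)."""
    total = sum(j["reads"] for j in junctions) or 1
    kept = []
    for j in junctions:
        if j["reads"] / total < psi_threshold:
            continue          # below the --simplify PSI-ratio threshold
        min_reads = denovo_reads if j["denovo"] else annotated_reads
        if j["reads"] < min_reads:
            continue          # below --simplify-denovo / --simplify-annotated
        kept.append(j)
    return kept
```

A junction filtered this way is exactly what the text calls “(putatively) irrelevant”: still recoverable by re-running the builder with looser thresholds.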

But if you use MAJIQ’s simplifier, what happens to the LSVs? Good question!

After the simplification step, all the LSVs are defined and quantified without the irrelevant junctions. The output is “lighter” (e.g. in file size), but note that any future quantification of the removed junctions would require re-running the builder step with different settings to include them. “But if I need to do that again it will take long!” – Well, not really. First, MAJIQ 2.0 is now significantly faster (see here), and we are also working on more goodies to make all these processes much faster….

OK, so now the next question: What happens if you run with the simplifier but then your PI’s favorite junction becomes irrelevant? (**GASP**) Or maybe you just want to see what you might have missed? Fret not, the VOILA visualization provides two views: 1) a simplified view of only the relevant junctions that were actually quantified by MAJIQ or 2) a non-simplified view showing all the irrelevant and relevant junctions in your data.

The effects of the simplification are visible using a new button that can be toggled to show the splice graph structure with or without the irrelevant junctions. They say an image is worth a thousand words, so let’s use one to show the effect of the new built-in simplifier:


We hope you like the new MAJIQ/VOILA simplifier. It’s already available on the recent MAJIQ 2.1 release. You can send us feedback/suggestions on the user support group (here). And as promised, more cool features are coming soon so stay tuned!

MAJIQ 2 is Out!

MAJIQ 2 is out! – What’s all the excitement?

So…. We *finally* released MAJIQ 2.0, our tool for splicing detection, quantification, and visualization. What’s the excitement all about?


I mean, it’s not like there is a shortage of tools/pipelines for splicing quantifications from RNA-Seq…. But there are many excellent reasons to get excited about MAJIQ in general, and about the 2.0 version in particular. I discuss some of these reasons below.

*Hint: Those of you who already know what MAJIQ can do can skip straight to Reason 3

Reason 1: Ability to detect, quantify, and visualize complex and de-novo splicing variations from RNASeq.

Most software for splicing quantification relies solely on an annotated transcriptome given as input. This means you are bound to miss everything not in the annotation. “That’s minor,” you say? Well, we have shown that even if you just compare normal mouse tissues and use Ensembl, you gain ~30% more significant (>20% change in inclusion) differentially spliced events that are just as reproducible as the annotated ones. And if you are studying disease (cancer, anyone? Cryptic splicing, maybe?) or other perturbations, you may not want to overlook those de-novo variants…

Also, most software tools only handle “classic” splicing events like exon skipping or 3’/5’ alternative splice sites. We have shown that complex events, i.e. those involving 3 or more alternative junctions, are extremely common (>30%!) in human, mouse, and other species. And yes, that complexity is actually used even in normal tissues (see our eLife paper for details). So, again, if you are studying splicing and only using annotated and “classical” events, you may be missing a lot of the things you are looking for….

Dealing with complex splicing variations requires the ability to visualize them, validate them, connect them back to the raw read rates, etc. Fortunately, we have taken care of these things with a visualization package, VOILA, and a web-tool for automated RT-PCR primer design for LSVs, called MAJIQ-SPEL (we kept with the magic theme…).

Recently, many tools have been released that advertise the ability to detect de-novo and complex splicing. Examples include Whippet and LeafCutter. A comprehensive comparison is beyond the scope of this blog post (don’t worry, we will have comparisons in an upcoming paper…), but at this point we would only like to point out that each of those methods lacks some features/capabilities compared to MAJIQ. For example, Whippet only detects connections between known junctions, LeafCutter is unable to detect intron retention, etc. The bottom line is that we are not aware of another tool that offers such a package of capabilities for detection, quantification, visualization, and validation of complex and de-novo events, including intron retention (IR), de-novo junctions, and de-novo exons.

Here is an illustration of what MAJIQ can help you get: Splice graphs of Col11a2, a key gene in ear development. The splice graphs, the visualization of the event, and the quantification of the change are all part of the output. Green arcs/rectangles are de-novo junctions/exons capturing mis-splicing caused by knockdown (KD) of the splice factor ESRP1. This beautiful study, led by Alex Rohacek, originated from a case of a deaf child with rare mutations in ESRP1.


Reason 2: MAJIQ’s accuracy compares favorably to other algorithms.

OK, so maybe you are convinced MAJIQ might be useful for your data analysis, but is it actually accurate??

Good question. We did extensive testing with both real and “realistic” synthetic data using a variety of metrics and found MAJIQ compares favorably to other tools. Here is an illustration of those evaluations:

The graph on the left shows reproducibility ratio (RR) plots. This is the ratio of events called as differentially spliced between two conditions and reproduced (y-axis) as a function of their relative ranking/confidence (x-axis). The two conditions are cerebellum vs. liver, 5 samples each, from GTEx. Reproducibility is assessed by repeating the analysis with a different set of samples (same set size, same tissues). The right plot, called IIR (intra-to-inter ratio), is a proxy for the ratio of putative false positive events (percent, y-axis). We previously performed extensive comparisons to other algorithms as well, such as DEXSeq, SUPPA2, and MISO – including evaluation on “realistic” synthetic data and comparison to triplicates of RT-PCR, the gold standard in the RNA field (see here, here, and here). Again, a more extensive analysis will appear in a paper we are writing. I will only add that in the only case where claims were made that MAJIQ does not perform well, we showed, clearly and extensively, that those claims were false and based on critical misuse of our software (see here).
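For those who want to apply the RR metric to their own tool and data, the computation itself is simple. A sketch of the metric as described above (ours for illustration, not the evaluation package’s exact code):

```python
def reproducibility_ratio(ranked_events, reproduced_events, n):
    """RR(n): the fraction of the top-n events (ranked by confidence)
    from one comparison that are also reported when the comparison is
    repeated on a different, non-overlapping set of samples from the
    same tissues."""
    top_n = ranked_events[:n]
    return sum(1 for event in top_n if event in reproduced_events) / n
```

Sweeping n from 1 to the number of reported events traces the RR curve: the x-axis is n (how many events you report), the y-axis is RR(n) (how many of them reproduce).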

NOTE: while we are obviously promoting MAJIQ here and we think it generally performs well it’s still possible that for *your* data/usage case MAJIQ really sucks…. That’s why we advocate in our papers/talks the usage of several metrics such as RR and IIR described above and make our code/data available so you can try those evaluations on your data and see for yourself what works….

Reason 3: MAJIQ 2 is *way* faster and more memory- and I/O-efficient

MAJIQ’s main disadvantage compared to alternative methods was speed. Let me emphasize that we believe the added ability to detect complex, de-novo variations and intron retention events is in many cases worth the extra time, even more so given the improved accuracy. Or, to put the accuracy vs. speed trade-off into a visual summary:


Nonetheless, we do admit that speed is important and we worked hard to make 2.0 much faster. How much faster? Well, they say a picture is worth a thousand words, so here is a comparative analysis we did with GTEx samples:

As you can see, compared to MAJIQ 1.1, MAJIQ 2.0 is ~10 times faster and is now comparable to rMATS and LeafCutter, all with a very efficient implementation in terms of memory consumption (e.g. in all of the above tests peak memory consumption hovers stably around 2GB RAM, i.e. you can run it on a laptop).

So, with 2.0 we brought MAJIQ to be similar to, and in some cases faster than, the current most efficient methods. To be clear, we are not aiming to build the fastest tool. There is substantial overhead (which results in additional compute time) in modeling de-novo events, especially de-novo intron retention. That overhead is not going away. But we are aiming to make 2.0 as useful as possible, and for that speed is important.

Btw, the speed improvements are due not only to code improvements and porting to C++. Amongst other things we significantly revised the algorithm for intron retention. To be honest, IR has been a *huge* headache for us. It was basically *the* thing that prevented us from releasing 2.0 last year. So much so, that Jordi, the main developer behind MAJIQ, announced on the group chat that if anyone else reports another bug with intron retention he will be doing this:


Good news: He didn’t. And we got it to work 😁

Reason 4: New visualization (VOILA 2.0)

VOILA 2.0 has changed significantly. Much of it is right now only under the hood (though see below about new features coming soon…). The two main differences you will observe now are, first, that VOILA runs as a local service, which allows for additional features such as search that were not available before, and second, that the improved implementation requires less disk space and is more responsive for large datasets.

Here is an example of how it looks (notice the search box on the top left)

Reason 5: Ability to analyze hundreds and thousands of samples

The new speed and new visualization also mean MAJIQ supports the analysis of hundreds and thousands of samples for large datasets such as GTEx and TCGA. There are more features and algorithmic improvements that will build on this and improve such analyses even more (see below).

Reason 6: Why so negative? (Support for a confident negative set)

One feature we found missing in practically all other tools is the ability to get a confident set of events that are *not* changing. Most tools focus on protecting from false positives, trying to find things that are confidently changing. However, as RNA researchers, we and our collaborators have found it very useful to be able to get a set to compare against – a very “clean” set of things we are confident are not changing.

Reason 7: Wait! There is more!

Finally, a major reason we are excited about MAJIQ 2.0 is that it sets the code base for many new exciting algorithmic and visualization improvements, with applications to new research questions – so stay tuned!


I want to thank the whole BioCiphers Lab, who worked hard on this as a great team effort led by the relentless Jordi Vaquero. We also thank all MAJIQ users for their input that helps us make MAJIQ better. We hope you all enjoy the new MAJIQ 2.0 – and don’t forget to subscribe to the user forum!

LeafCutter vs. MAJIQ and comparing differential splicing algorithms

Last week, my lab found out that the LeafCutter BioRxiv paper was updated to include extensive comparisons with other methods, including our own method, MAJIQ, for detecting and quantifying differential splicing.
UPDATE: We just found out the Li et al bioRxiv was published as a technical report today, Dec 11th 2017, in Nature Genetics.

Before I delve into the details, some background for those who are not necessarily into differential RNA splicing: We published MAJIQ in Elife on Feb 1st, 2016. We constantly maintain the software and a lively user group, frequently pushing updates and bug fixes. The original LeafCutter BioRxiv paper was posted on March 16th 2016 and was used to derive the sQTL results in the Li et al Science paper, which came out on Apr 29th 2016 and was titled “RNA splicing is a primary link between genetic variation and disease”. This BioRxiv paper cited MAJIQ extensively, but did not include any comparisons to it, or to any other method, nor did it include any quantitative assessment of the method or independent experimental validations. The updated preprint, which was posted in Sep 2017, is a very different paper, though the basic algorithm seems to have stayed the same. Besides the addition of extensive evaluation of splicing and sQTL in GTEx samples (which is very interesting and novel, but is not the focus of this post), the new bioRxiv version included an extensive comparison to MAJIQ, rMATS, and Cufflinks2. Opening that new bioRxiv with nervous excitement and going through the evaluations, figures, and claims, I quickly got the following feeling (GIF shamelessly stolen from the wonderful r.i.p.):

To be fair, comparing to other algorithms is hard. Furthermore, authors always want to highlight the relative advantage of their method, and in many cases the other methods end up being run with default params while authors spend many days tweaking their own method to prove improvement. This results in an inflation of “best ever” methods and the known ML rule: your safest bet is the method that robustly came second in others’ papers. In this case, comparison is even harder: as Li et al note in LeafCutter’s bioRxiv preprint, the compared algorithms “speak different languages” in terms of what they define as transcript variations. Nonetheless, after a careful reading of the updated preprint and extensive reanalysis, we wish to respond to several specific points. The rest of the post is organized as follows: (a) specific critiques of the evaluations performed in the updated preprint, (b) alternative approaches for evaluation of DS with the matching results based on our previous work in Vaquero et al 2016 and Norton et al 2017 (which was posted as a bioRxiv in Jan 2017 and updated in May 2017), followed by (c) a discussion.

We hope this post and subsequent discussion will help not only improve our respective works but also other future works in the field and more generally how we as a community handle some of the issues raised here.

1. Main issues we identified with the evaluation in Li et al 2017:

The comparison of Li et al to other methods in the main text is all concentrated in Fig3a-c. There are additional comparisons in the supplement, but many of those share the issues of Fig3a-c. We only refer here to rMATS and MAJIQ and not Cuffdiff2, the latter being less relevant to the points made here. Let’s go through the main points we observed:

  1. What do you compare against?
    The versions of the software used are not mentioned anywhere, but are clearly outdated. The bioRxiv revision was posted on Sep 7th 2017, but rMATS already had its rMATS-turbo version released on May 1st 2017, and MAJIQ had a major upgrade to version 1.0.0 on May 10th. Both were designed to handle large datasets with significant improvements in running time and memory footprint. Yet Li et al claim that those methods “do not scale well over 30 samples”, taking dozens of GB of memory and being excruciatingly slow (Fig3a, FigS9).
  2. How do you evaluate performance?
    The authors offer two measures of performance:

    1. P-value accuracy (Fig3b): To evaluate p-value accuracy and the control for FP, the authors compare the distribution of p-values observed in a tissue1 vs tissue2 comparison with the distribution observed when the 2 tissues have been mixed and split into 2 random groups (i.e., a contrast with no expected differences). There is one major problem with this comparison, however: MAJIQ does not produce a p-value. It does not use a null model but a Bayesian posterior for P(dPSI > C), with the default set to P(dPSI > 20%) > 0.95. The authors admit that, yet plot 1-P(dPSI > C) against the p-value line, just like all the other methods, and state that it’s not “well calibrated”. Many readers looking at these graphs are likely to quickly conclude “Oh, that MAJIQ method is *terrible*” while in fact they should be going: “??”. The definition of P(dPSI > C) is clearly not that of a p-value, and we hold that its usage as such is simply wrong. The shape of the graph simply reflects that, rather than a problem in calibration as suggested by the authors. We will get back to what this plot really means for MAJIQ later in the post.
    2. Evaluation of synthetic data: Here the authors use synthetic data to measure accuracy in classifying changing vs. non-changing events using ROC curves (Fig 3c). This is a more delicate point, as results will depend greatly on (1) how the synthetic data are defined and (2) what you define as significant changes. We will ignore (1) for now and focus on the latter, as this is crucial to understand the flaw in this comparison. Li et al defined changes by altering a specific isoform’s expression by 10%, 30%, etc. They then measure how many of those changes are detected by each method. The problem with this is simple: a change of 10% in the expression of an isoform does not translate to a specific level of change in *splicing*. Remember: we are not talking about methods to quantify expression, but changes in splicing. Methods such as rMATS and MAJIQ measure splicing changes as the difference in ratios of inclusion of RNA segments. In the best-case scenario, an isoform whose expression is changed is solely responsible for a specific junction’s inclusion. So, for a 10% change of its expression, if the original junction inclusion was X/Y = Z% (X reads supporting the junction out of Y total over the event), it will now be included at 1.1X/(Y+0.1X) = Z’%. As you can see, dPSI = Z’-Z really depends on X vs. Y. For example, if X=1 and Y=10 (which means the changed isoform is expressed at a reasonable 10% of the gene’s level) then dPSI is less than 1%! Remember: this is the actual splicing change, and MAJIQ is run on this with default parameters, which means it is looking for high-confidence events of dPSI > 20%. Again, look at those ROC graphs in Fig3c and your spidey senses should be tingling: notice how they completely plateau very quickly for MAJIQ? You are basically asking it to find changes in *splicing* that are bigger than 20% while defining as positives events whose *expression* changed by 10%, 25%, 50%, or 300%.
Indeed, when you require 300%, an isoform expressed at 10% of the gene could achieve (depending on the splice graph specifics) a dPSI >= 20%, and in that case MAJIQ with its default params does just as well as LeafCutter. And what happens if you actually evaluate on changes in splicing? The answer is given below. Finally, we note that Li et al seem to be aware of at least some of these issues regarding expression changes (p.28, top paragraph), but unfortunately these observations do not propagate to the overall analysis.
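To make the arithmetic concrete, here is a minimal sketch (plain Python; the numbers follow the X=1, 10%-of-gene example above, and the function name is ours) of how a change in isoform *expression* translates into a much smaller change in *splicing* (dPSI):

```python
def psi_after_expression_change(x, y, fold):
    """PSI of a junction carried only by the changed isoform.

    x:    reads supporting the isoform's junction
    y:    total reads over the event (includes x)
    fold: multiplicative change in the isoform's expression (1.1 = +10%)
    """
    psi_before = x / y
    # Only the changed isoform's reads scale; the rest stay fixed.
    psi_after = fold * x / (y + (fold - 1) * x)
    return psi_before, psi_after

# Isoform at 10% of the gene's expression, bumped up by 10%:
z, z_prime = psi_after_expression_change(x=1, y=10, fold=1.1)
dpsi = z_prime - z
print(f"PSI: {z:.1%} -> {z_prime:.2%}, dPSI = {dpsi:.2%}")
# dPSI comes out under 1% -- far below MAJIQ's default 20% threshold
```

So an "easy" positive by the expression-change definition can be an essentially undetectable splicing change by construction.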

To summarize, the comparisons in the recent Li et al preprint use software that was highly outdated at the time of posting, and metrics that are not compatible with the algorithms’ output/definitions. The combination of these creates what we assert to be a highly misleading picture of algorithm performance. Furthermore, we assert these comparisons fail to actually evaluate performance on changes in *splicing* and lack experimental validation.

2. Our comparative evaluation of LeafCutter and MAJIQ.

2.1 Running time/memory:

Here is an evaluation we ran with current versions, which should be similar or identical to the ones available then except for various (minor) bug fixes. Time is in minutes, run with 16 threads on a Linux machine.


While we don’t have the exact memory usage, it is a few hundred MB per thread for MAJIQ and rMATS (see also Norton et al 2017). Two points are worth making here. First, how do you actually compare execution time? For these algorithms it is more complicated than just the numbers above, a point we raise and discuss in Norton et al. Briefly, the algorithms operate differently. For example, if you want to do multiple comparisons of groups of 3 with the 20 samples of the 10vs10 above, MAJIQ will run the builder *once*, then the quantifier multiple times for 3vs3; in rMATS, each run is independent. The second point is that execution time is tied to what you quantify. In LeafCutter’s case that does not include intron retention events (see more on this below), which makes a huge difference.

In summary, rMATS is faster than LeafCutter, and both MAJIQ and rMATS are completely capable of handling large datasets, contrary to the picture presented in Li et al 2017. Again, the fact that no software versions or execution params are documented makes the analysis in Li et al 2017 both misleading and harder to reproduce.

2.2 Synthetic data:

Synthetic datasets are useful for evaluating algorithms under varying controlled conditions, when you know what the “truth” is. The downside is that they are, well, synthetic. Specifically, for the problem of DS they commonly involve strong simplifying assumptions about expression levels, distribution of variation in and between the groups compared, read errors, coverage levels, underlying known transcriptome, and variants that involve only predefined “classical” AS event types such as cassette exons.

In order to create more realistic synthetic data for DS analysis, we employed real data from biological replicates. If you are interested in how the data were generated, we encourage you to read Norton et al 2017, but for the discussion here I would only point out the following key points. First, expression levels and transcriptome variations are not based on a synthetic model but on real data/samples. Second, we use the most complex splicing event we can find, based on reliably detected (multiple reads at multiple positions) junction-spanning reads from STAR. This enables us to define a lower bound on each gene’s transcriptome complexity. The expression level of each isoform to be simulated is then set by the gene’s overall FPKM and the raw junction-spanning read ratios in the real sample, to avoid biasing estimations towards any specific algorithm. This means we can simulate not just cassette exons or binary events but also complex variations, which both MAJIQ and LeafCutter are geared to capture. Simulated data were generated by our colleague Greg Grant, who is a known leader in the field of evaluating methods for RNA-Seq analysis.

How do we evaluate DS methods using these data? Following a common criterion in the RNA splicing field, we used a change in inclusion level >= 20% (dPSI >= 0.2) to define a positive event and an inclusion change < 5% to define a negative event. Importantly, we use each method’s “native language”, so each method’s set of events is evaluated by its own definition of what an event is. This is markedly different from the procedure used by Li et al, where they state that they “collapsed all events in rMATS and MAJIQ that shared a single splice site into a single event (as is done in LeafCutter)”. The consequence of such a procedure on PSI/dPSI and the evaluation is not clear, as details are missing.

Let’s look at the results if we use the p-value based approach which the authors advocate and on which they base their subsequent analyses in both Li et al 2017 and Li et al 2016 (sQTL included):

The first 3 columns represent the number of changing (>20%), non-changing (<5%), and in-between (5-20%) events according to each method’s “language”. This is why they are so different and hard to compare directly. The stats on the right are computed with respect to those. As you can see, LeafCutter is much more permissive than MAJIQ: it reports almost all the TP events (sensitivity of 93% vs. MAJIQ’s 83%), but this comes at a high price in FP (FDR of 21% vs. MAJIQ’s 1%).
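For reference, the stats in such a table reduce to simple set arithmetic. Here is a minimal sketch (plain Python, with made-up toy counts – not the actual numbers above) of classifying events by their true dPSI, per the thresholds described earlier, and computing sensitivity and FDR from a method’s calls:

```python
def sensitivity_and_fdr(true_dpsi, called):
    """Score a method's calls against true inclusion changes.

    true_dpsi: dict event -> true change in inclusion (0..1)
    called:    set of events the method reports as changing
    Positives: |dPSI| >= 0.20; negatives: |dPSI| < 0.05
    (events in between are excluded, as in the post).
    """
    pos = {e for e, d in true_dpsi.items() if abs(d) >= 0.20}
    neg = {e for e, d in true_dpsi.items() if abs(d) < 0.05}
    tp = len(pos & called)
    fp = len(neg & called)
    sens = tp / len(pos)                  # fraction of positives found
    fdr = fp / len(called & (pos | neg))  # fraction of calls that are FP
    return sens, fdr

# Hypothetical toy example (e5 is "in between" and ignored):
truth = {"e1": 0.30, "e2": 0.25, "e3": 0.01, "e4": 0.00, "e5": 0.10}
sens, fdr = sensitivity_and_fdr(truth, called={"e1", "e3"})
print(f"sensitivity={sens:.0%}, FDR={fdr:.0%}")  # sensitivity=50%, FDR=50%
```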

Now, it’s important to note that we view the above evaluation as problematic: a p-value tests our belief that there is a change, not the actual magnitude of it. So what happens if we adjust LeafCutter to use the p-value for confidence but add a requirement that its dPSI estimate is >= 20%? Here are the results:

As you can see, the results now make much more sense: LeafCutter’s FDR drops to 2%, but sensitivity drops to 79%, with MAJIQ still comparing favorably. We would like to point out that if you look at our recent Norton et al 2017 you will find we tried many different configurations for all the algorithms we used, to make sure the evaluation is as fair as reasonably possible. We also tried a more permissive threshold of dPSI > 10% for positive events, with similar results.

2.3 Reproducibility plots (RR):

RR plots follow the same kind of logic and a similar procedure to that of the irreproducible discovery rate (IDR), used extensively to evaluate ChIP-Seq peak callers. Briefly, we ask the following simple question: given an algorithm A and a dataset D, if you rank all the events that algorithm A identifies as differentially spliced 1…N_A, how many would be reproduced if you repeat this with dataset D’, made of similar experiments using biological/technical replicates? The RR plot shows the fraction of those events that are reproduced (y-axis) as a function of n <= N_A (x-axis). Here are the results from the tissue data we used for the Norton et al. and Vaquero et al. papers, comparing groups (3vs3) of cerebellum and liver samples:

We note that here we use the same p-value criterion used by the LeafCutter authors throughout their paper(s). However, when we tried a new criterion, where events were instead screened for significance (p-value < 0.05) but then ranked by dPSI, LeafCutter’s RR improved significantly, from 62% to 77%, yet was still lower than MAJIQ’s 83%.
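The RR computation described above is straightforward to sketch. Here is a minimal illustration (plain Python, with hypothetical event IDs) of ranking events from dataset D and checking reproducibility against an independent D':

```python
def reproducibility_ratio(ranked_d1, detected_d2):
    """RR curve: fraction of the top-n events from dataset D
    that are reproduced in an independent dataset D'.

    ranked_d1:   events called changing in D, most confident first
    detected_d2: set of events called changing in D'
    Returns a list of (n, RR(n)) points.
    """
    curve, hits = [], 0
    for n, event in enumerate(ranked_d1, start=1):
        hits += event in detected_d2  # bool counts as 0/1
        curve.append((n, hits / n))
    return curve

# Toy example: 4 ranked events from D, 3 of which reproduce in D'
curve = reproducibility_ratio(["a", "b", "c", "d"], {"a", "b", "d"})
print(curve[-1])  # (4, 0.75)
```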

2.4 Intra-to-Inter Ratio (IIR):

Reproducibility alone is not sufficient to establish accuracy. For example, an algorithm can be extremely reproducible but highly biased. To get a better sense of possible levels of false positives we devised the following test: we compare similarly sized groups of the same condition (e.g. brain vs. brain or liver vs. liver) and compute the ratio between the number of events identified as significantly changing in such a setting (N’_A) and the number of events identified between groups of different conditions (N_A, e.g. brain vs. liver). We term this the intra-to-inter ratio (IIR). This test is similar to the one used by Li et al to test for FP (Fig3b); in their setting, they *mix* the two groups. We postulate that the IIR is better suited for testing the ratio between natural variation within groups/conditions which an algorithm deems significant and variation between conditions (i.e. the biological signal of interest).
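In code, the IIR is just the ratio of the two call-set sizes; here is a minimal sketch (plain Python, hypothetical counts):

```python
def iir(intra_events, inter_events):
    """Intra-to-inter ratio (IIR): a proxy for false positives.

    intra_events: events called 'changing' between two groups of the
                  *same* condition (expected: mostly noise, N'_A)
    inter_events: events called 'changing' between groups of
                  *different* conditions (N_A)
    """
    return len(intra_events) / len(inter_events)

# Hypothetical toy call sets: 30 'changes' within liver vs. liver,
# 1000 changes between liver and cerebellum
intra = {f"ev{i}" for i in range(30)}
inter = {f"ev{i}" for i in range(1000)}
print(f"IIR = {iir(intra, inter):.3f}")  # IIR = 0.030
```

A low IIR means few of the calls between conditions are attributable to within-condition variability the algorithm mistakes for signal.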

Here are the IIR results we get for the tissue comparison data we used in Norton et al 2017:

Notice there are two numbers for each method, as N_A is compared to the N’_A from within each group (liver and cerebellum). These results are in line with the synthetic data and point to significant levels of FP by LeafCutter.

Two final points about IIR and FP. First, just as we did with the RR plots above, we also tried to improve LeafCutter’s results by making the test more conservative: p-value combined with dPSI > 20%. LeafCutter’s IIR improves significantly (0.048, 0.035) but MAJIQ still compares favorably. Second, we point readers to the small print on each panel in Fig3b of Li et al: they refer to this test ratio as an estimate of FDR (whether this is a true measure of FDR is arguable given the points made above), and you can see that at both thresholds MAJIQ outperforms LeafCutter (245 vs 0 when using MAJIQ’s criterion of P(dPSI > 20%) > 95%; 766 vs 721 when they set P(dPSI > 20%) > 80%). Yet this point is not mentioned or discussed. Instead the authors focus on claiming a non-calibrated p-value test, which is irrelevant for MAJIQ (see above).

2.5 RT-PCR:

While the above tests are important and informative, they focus on dPSI and do not address how accurately the actual PSI levels are quantified by each algorithm. This is particularly relevant for the kind of analysis Li et al focus on, i.e. sQTL, as it is done using PSI in each sample and not dPSI between groups. Moreover, they do not supply experimental validations or an independent measure of accuracy (see the point above about possible inherent biases). For these, RT-PCR experiments are considered the gold standard in the RNA field. To be sure, these too are not free of potential pitfalls and biases (see a discussion in Norton et al): many available RT-PCRs from publications are only usable qualitatively (i.e. “change” vs. “no change”), so one must be careful in simply downloading and using them; their selection can be highly biased (e.g. only cassette exons); and these are low-throughput experiments. Nonetheless, including them for methods that make strong claims about splicing quantification is important if not crucial. Below is the correlation to RT-PCR by LeafCutter compared to our results in Norton et al for MAJIQ and rMATS. We note that all the data for these RT-PCRs were made available in Vaquero et al 2016, when MAJIQ was published.

The above figures show both MAJIQ and rMATS outperform LeafCutter (R2 = 0.923 vs. 0.972, 0.967). Things become more interesting when you try to validate PSI (these are PSI for the two conditions underlying the dPSI above):

We note that LeafCutter’s PSI quantification is significantly more biased and less accurate (R2 = 0.806 vs. 0.87, 0.906 vs 0.936). We suspect the reduced accuracy may have to do with how it defines splicing events. LeafCutter, similar to MAJIQ, relies on junction-spanning reads, but LeafCutter’s events are intron-centered and collapse all overlapping sets into an “intron cluster” for which the inclusion levels are normalized and quantified. While these clusters are a convenient mathematical construction for the purpose they were originally designed for (detecting splicing changes for sQTL, as in Li et al 2016, 2017), there is no real guarantee that in the underlying transcriptome they actually represent a set whose combined inclusion sums to a constant (1, when normalized to compute PSI). In fact, even for a simple cassette exon LeafCutter does not report the exon inclusion but reports 3 PSI values (which sum to 1) for the 3 possible intronic regions (C1-A, A-C2, and C1-C2, where “A” is the alternative exon and C1, C2 are the upstream/downstream exons respectively). It took us time to figure this out and correct for it, and these were all simple cassette exons. For the more general case, we are not sure how LeafCutter’s output is to be translated into quantitative measures of inclusion that can be validated experimentally.
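To illustrate the cassette exon case, here is a hedged sketch (plain Python; this is *not* necessarily the exact correction we applied, just one plausible back-conversion) assuming the three cluster PSIs behave as normalized junction read fractions, with an exon included at level psi contributing reads to both C1-A and A-C2 and a skipped exon contributing to C1-C2:

```python
def exon_inclusion_from_cluster(p_c1a, p_ac2, p_c1c2):
    """Back out cassette-exon inclusion (psi) from LeafCutter-style
    cluster ratios for the three intronic regions C1-A, A-C2, C1-C2.

    Assumed model: reads scale as C1-A ~ psi, A-C2 ~ psi,
    C1-C2 ~ (1 - psi), so after normalization to sum to 1 the
    skipping ratio is p_c1c2 = (1 - psi) / (1 + psi).
    Solving for psi: psi = (1 - p_c1c2) / (1 + p_c1c2).
    """
    assert abs(p_c1a + p_ac2 + p_c1c2 - 1.0) < 1e-9  # cluster PSIs sum to 1
    return (1 - p_c1c2) / (1 + p_c1c2)

# An exon included at 60% yields cluster ratios 0.375, 0.375, 0.25:
print(exon_inclusion_from_cluster(0.375, 0.375, 0.25))  # 0.6
```

Note that 60% exon inclusion maps to a skipping-region "PSI" of 25%, not 40% – exactly the kind of mismatch that makes direct validation of cluster PSIs against RT-PCR problematic.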

These issues require further investigation that is beyond the scope of this post. We note, though, that PSI quantification is highly relevant, as the sQTL pipelines employed in Li et al 2016, 2017 are all based on feeding PSI quantifications (and not dPSI) into standard QTL tools such as fastQTL.

3. Discussion

We presented here several issues we identified with LeafCutter comparison to other methods, which we believe lead to severe misrepresentation of relative performance. We then followed with suggested alternative metrics and evaluation procedures, including suggestions for how LeafCutter’s performance could be improved compared to the procedures used by the authors.

Another important point of comparison glossed over in the Li et al paper is that LeafCutter does not offer detection of alternative first/last exons or intron retention. This is in sharp contrast to MAJIQ, which offers both known and de-novo intron retention. De-novo intron retention has a *huge* computational overhead which the other methods (LeafCutter, rMATS) do not pay. This distinction is almost absent in Li et al (p. 27, top of the supplementary). Consequently, we did not include intron retention in the evaluations above. We note that besides the fact these events are not included in the output for users, not taking them into account may have considerable effects on splicing variation detection and quantification given how commonly they occur in human (Braunschweig et al, GR 2014).

Several other points are worth making. First, our analysis supports Li et al’s assertion that LeafCutter can perform well (especially if the adaptations we suggested are used) and can be used to derive new insights into splicing and sQTL. We reiterate that this post completely ignores the novel and significant contribution of Li et al: the extensive splicing and sQTL analysis in GTEx. We congratulate the authors on this achievement. We also point to the extensive effort made by the authors to show concordance between methods and even the effect of mappers (STAR vs. OLEGO), as detailed in the long supplementary notes.

We believe that for DS, and for many similar problems in Genomics and ML for that matter, there is no “silver bullet”, even though we always want to have one (and preferably, a really fast one for that matter…). As we discuss in Norton et al., different algorithms may have different advantages depending on the needs and the settings. For example, Li et al did not review SUPPA, which is far faster than all the other methods, allowing users to quickly build a large-scale genomic landscape of splicing even though quantification may be less accurate (see comparative analysis in Norton et al). Similarly, VAST-TOOLS offers a dedicated pipeline sensitive to detecting micro-exons. As for future methods development, we have made all the data and pipelines we used to derive the results presented here available at

Beyond the technical points raised about DS and evaluation methods, there are some important discussions that arise. For one, who is responsible for running updated versions of alternative software? The authors? The reviewers? We think both. Second, how should a discussion around archive papers occur? Posts? Tweets? Matching archive papers? Direct contact by phone/email? We thought a lot about this. Previously, when we observed other clear inaccuracies, we contacted the authors of that archive preprint directly with a detailed list. Here, given that we do not know the authors as well, the nature of the issues, and the long time that has passed (the preprint was posted in Sep; we only noticed it last week), we opted for posting a detailed response and hope for a constructive discussion. We also informed the authors immediately. But more generally, what should be the new standard for such cases? What is the common courtesy?

Finally, we want to raise the issue of implications. A bioRxiv preprint is exactly that – a public statement of in-progress, non-peer-reviewed work. Similar to a talk, but more concrete and detailed. It can be cited, acknowledged, and inform the greater scientific community. But the reality is that it can also have severe ramifications. People read BioRxiv papers and their opinions are shaped by what they read. For example, others in the field could get the inaccurate impression that rMATS and MAJIQ are flawed. They might then end up as your reviewers (OK, you get where this is going). We hope that by having constructive discussions around preprints, as we aim to achieve here, both our science and our scientific practices improve as a community.

Should I (Bio)Rxiv?

“So what is your view on BioRxiv?” Donny Licatalosi asked me last week on one of those long nights over beers at the CSH RNA Processing meeting (which, btw, is a *terrific* meeting I highly recommend). Donny is one of the bright new PIs in the RNA field (see here), and I believe his question represents that of many other young PIs at a similar stage, trying to decide whether to jump on the archive wagon. After laying out my views on Rxiv pros and cons (with the help of more beer, of course…) we agreed I should probably do what I said I’d do a long time ago and actually write about this in my blog, so here goes….

For the record, I should state that the last four papers from my lab have all been posted on BioRxiv. This already tells you I am generally in favor, so I’ll start with the benefits, at the risk of stating the obvious:

  1. There are ideological points behind it which are worth supporting:
    1. Make science progress faster by increasing communication.
    2. Open access science. Give everyone equal access to cutting-edge science, whether you’re a curious student in a rural countryside or you’re a tenure-track professor in a large research institution.
    3. Help change the current landscape of publishing. This system is bound to change, and BioRxiv can be seen as a catalyst of this change. Think of taxis: we needed those to get around because cars were expensive, unreliable, and required expert knowledge (driving and getting around). Now we have cheap, reliable cars with GPS, cars/driving are common, and we are all connected via smartphones – so now we have Uber/Lyft. The publishing landscape is just as archaic, with us scientists doing all the editorial work for free, feeding private companies billions of dollars of taxpayer money to do jobs which are for the most part no longer needed (publishing/distribution).
  2. There are practical reasons/cases where it can benefit you:
    1. Spread the word about your advances / get faster recognition, i.e. the personal version of points 1.1/1.2. This may be particularly useful when you are a young PI: starting your lab takes a long time, and you carry little weight with editors. This is, of course, a hypothetical, totally fictional situation. Any resemblance to real-life characters is totally coincidental.
    2. Get a citable point of reference in a grant/related paper. While reviewers of your grant can ignore it, they may also have read it or at least acknowledge it as proof of concept/progress (sure beats writing “manuscript in preparation”…). It may also help/not hurt when you are submitting your tenure package (another totally fictional scenario).
    3. Can help lay claim to new results / avoid scooping / possible IP problems – this may be particularly relevant to situations where you know someone else is trying to beat you to it. By posting it on BioRxiv you clearly put it in the public domain (not patentable by others).
    4. You could get valuable feedback from the community (I think this is still in its infancy, but there is potential).
    5. You could possibly get some points for being a good citizen, and really why not? Which brings me to the next part….

If you ask my milieu of young PIs, such as the color Caseys (the Brown and the Greene), the answer would be a resounding “you should do it!”. It’s definitely common practice in the Genomics and Machine Learning fields – two communities I belong to. But it’s not common practice in Biology, such as in the RNA field, of which I’m also a (proud) member (with notable exceptions such as Brent Graveley, also being very “Genomics”). So, for me, the more interesting part is maybe the arguments against preprints:

  1. The big supporters of archive papers like to raise claims in the spirit of: “The only difference between a published and an archived paper is that 3 more people read it”. I’m sorry, but that’s just not true. In a properly functioning system (I know, I know…) these 3 are not random people but independent experts in your field. That’s huge. And because you know that, you put much more effort into it. You are also held accountable for the content (in a proper system, I know, I know…). So yes, the current peer review system is problematic and definitely requires serious amendments, but I don’t like the claim that it’s just the same. We are getting to a point where some even act on this claim. One researcher told me his postdoc did not bother to publish a method paper and moved on to other projects because the method was already used for X (X being the important project they needed it for), and “people are using it already, so why bother”. Well, for one, if the method was not validated/tested in the original archive paper, there is no way to enforce that (or evaluate the consequent results/claims). I am definitely sympathetic to constraints such as the timing of projects etc., but the result of this approach, even when not intentional, is problematic for our field. Which leads me to the next point.
  2. Researchers, whether intentionally or not, sometimes abuse the archive system. I have seen multiple cases where a paper is submitted/published while relying completely on an unpublished method only available as an archive paper. In theory, this is still legit if reviewers get all the info they need. Indeed, sometimes it is of little relevance – e.g. you use a method to find something and then validate it experimentally. In such cases, the original detection method is almost irrelevant as the result holds. But in other cases, the main results rely on a separate preprint, which can be problematic. First, this puts an unfair load on the reviewers who, in order to do a proper job, are now required to review two papers (and that second one may not even fit their expertise). Consequently, the entire premise of the paper being reviewed may be wrong. The archived methods paper may lack validations, proper evaluation of FDR etc., because after all, it’s only an archive paper (see point 1 above…). And editors are to blame for this as well, because they allow it to happen and play along. Again, I totally understand that project/paper timing is an issue, but things need to be done properly in order for us to trust papers, and this new practice is not making things better. Finally, another version of abuse is a form of “salami publishing,” where a whole salami (scientific story) posted on bioRxiv is sliced into pieces that the authors try to publish separately, even though these are highly dependent, without proper acknowledgment.
  3. While the above two points are basically me ranting about possible preprint abuse, I think the following is a more interesting point to consider personally: biomedical research papers are inherently different from those of CS/Math. In CS, if you have a good idea you write it down and can put it out there quickly via arXiv, soon to be followed by a matching (quick, short) conference paper. In biology, time constants are usually larger and the paper can change dramatically between the first submission and the final version, as the story is driven first and foremost by the results (as opposed to the math/model in typical Math/CS papers). A senior PI told me recently she does not like the archive craze because she does not want to be known by those initial versions. Think: how many people who read your initial paper bother to read the final one when it comes out?
  4. For a Bioinformatics method developer, the combination of making things publicly available as quickly as possible yet still wanting a traditional publication (for grants, tenure etc.) may be lethal: you post your paper + method, but by the time your paper is actually reviewed, someone else has already shown their method is much better. Now go publish that as a significant contribution…
  5. You might get scooped: I wrote this as the last point because I think it may be an overrated concern for biologists against using bioRxiv, but it still exists. In general, I think this applies if you have something very cool in a hot topic, where simply posting the finding may lead specific people whom you do not trust to quickly replicate & submit while your own paper is being delayed (intentionally or not). Notice there are a lot of conditions in the previous sentence; you need to think about whether (a) they actually hold and (b) those people would really do that, given that the rest of the world and their lab (may) know about your archive paper. Still, this is a clear case where biology and CS differ, as the former is (a) much more discovery/finding oriented and (b) its development and review times are generally longer and more variable.

The mistrust in archive papers has already garnered some attention in blog posts and twitter feeds – see for example discussions here and here, the latter leading to the provocatively titled Boycott BioRxiv (to be clear, if you actually read it, it does not really call for that). Suffice it to say there is clear evidence supporting a healthy dose of mistrust, regardless of who the author is and where the paper finally ends up. So one clear take-home message is that we should teach ourselves and our students exactly that.

But beyond mistrust, I think there is something to be said for well-thought-out reviews and paper write-ups, taking the time for your work to mature. I definitely see the great value of archive papers and have used them myself (see opening paragraph), but somehow they also represent for me something of the current times, where people are in constant search of instant gratification as the number one priority/value. Or maybe I’m just getting old. In any case, I think that at least considering the perils listed above and avoiding abuse of the archive system is worthwhile for all of us as we try to advance Science and make our environment a better place.

I’d like to finish with two points. First, I think that preprints are part of a more general question about the future of Scientific publishing, which I hope to cover a bit in my next blog post.

Second, there are also interesting questions regarding how preprints should be treated. For example, there was a lot of discussion about whether these should be included in NIH grant applications and how (e.g. here). Also, how should you as an author/reviewer/editor treat them? Are you expected to know about them? Compare to/cite them? What should an editor do if she knows about a similar paper already on bioRxiv? At NIPS, for example, we as reviewers had to declare whether we had already seen the paper on arXiv (NIPS is double-blind). Given the timelines and emphasis on discovery in biomedical research (see above), it’s likely these fields require different approaches than CS. I would love to see people address these questions in the comments below, and maybe they should form another blog post too.

In summary, as I told Donny that night at CSH, I see bioRxiv as a tool in our toolset – be aware of your options and think, given the above, whether this is the right tool for that specific work/paper. Good luck with it and may the force be with you!


Update 9/9/2017:

  1. Here is a paper that does a nice job comparing the effect of enforced, voluntary, and no submission to the archive on subsequent paper citations. This serves to put the numbers behind positive point 2.1 I listed above. As it happens, a different paper with a much catchier message (a 500% increase in citations) but no such control for confounding factors appeared as an archive paper and created the consequent buzz – nicely illustrating some of the other points I was trying to make…
  2. Brent Graveley responded to my blog (here), stating all the positive effects posting on bioRxiv had for him (including reversing an editor’s decision!) and that no negatives occurred. This is, of course, a case of proof by example, but still, a nice illustration of the benefits to be had.

What does it take to get a PhD?

“What does it mean for you to be/earn a PhD?” “What does it take?” I often pose these questions to students at various stages. I think this is an important question to ask oneself, before and during training (or when you train others), as it will affect the place/environment one chooses and what one does during that time.

Today I want to focus on my answer to this question, in the hope that it will help current and future students assess their way. I note that my answers are focused on the research-heavy environment, naturally in the computational/CS/biomedical area, but some general principles apply nonetheless.

Anyone who has followed my blog even a little would know that, as its name implies, I tend to find connections between the Martial Arts and everyday life. In this case, it so happens that I earned my PhD at about the same time I earned my black belt and could not help but see the similarities. For one, a black belt can take 5–10 years to get if you train seriously. Also, people from the “outside” think that if you got a black belt you are “a master”, but any serious practitioner knows that’s hardly the case. You get your black belt when you finish your basic training, and then you set out on your way to *start* becoming a master. It’s a lifelong journey, and what actually matters is the Way, not the target. A PhD is pretty much the same, but what you are training for is to be an independent researcher.

But what does it actually mean to finish basic training to become an independent researcher? It means that if you are given a research question you can formulate it, develop methods to tackle it, assess their results/performance, iterate, and finally converge to a solution which you are then able to reproduce, write up in a scientific paper, and present. In order to do that you should establish knowledge of the field you work in (what was done, main problems etc.) and of the techniques used (experimental, computational – does not matter), developing the ability to find what it is you are missing and to learn those missing components as part of the process. You should be able to critique your own work and others’, and express this clearly. This means, for example, that you should be learning how to write reviews as a guided process with your PI, so that by the end of your PhD you can do it well without them. The observed outputs of this process are talks, papers, reviews etc., but these are byproducts, not the goal. Which is why I don’t like it when PhDs are defined by those.

Why is this important? Because if you adopt this view you can take a step back, look at your PhD and think: OK, what am I missing? What haven’t I done yet? What am I relatively weak at? Then start working towards these so that by the end of your PhD you have gotten through the entire “checklist” and feel you can go through the research process by yourself. Too many students seem to be thinking instead along the lines of “I need to get X papers”. They look at the end point (the black belt, the paper), not the Way, and therefore actually miss the point.

You might be asking yourselves at this point what about other things? Working in a team? Managing/Teaching/Mentoring?

I think these are great to have, and I try to give my students practice in those as I think it prepares them better for modern jobs, wherever they choose to go. As a student, you can be proactive about it: look for opportunities, emphasize this to your PI/thesis committee to maximize exposure, etc. But strictly speaking, a PhD at its core is about the ability to execute independent research, not these.

So, what about “knowing the landscape of open problems” that you can consider going after? Learning how to define the questions in the first place? Getting funding?

My view is that these are all good things to learn/know, but again they are not part of the PhD definition. After you finish your basic training you go on to a postdoc. As a postdoc, you keep learning more techniques/methods (possibly in new areas) and practice your ability to do independent research. By the end of that, you have proved yourself fully capable of independent research, gained knowledge and experience, built a track record, and developed the elements mentioned above (mentoring, a view of interesting problems etc.), so now you are ready to become a PI. Networking and gaining experience in managing projects/people (and your time… ;) are all a big plus. You should strive to build those capabilities to make your life easier regardless of whether you stay in Academia or not.

In practice, some come out of their PhD much more mature than others [1]. A good example is my friend and colleague, John Calarco. John was a PhD student when I was a postdoc at the Blencowe lab. He came out of his PhD with a dozen papers, including first-author papers in obscure journals such as Cell and Nature. Yes, I know it’s not a great metric, but the point is he came out of his PhD with a view, a vision of what he wants to do. He became a Bauer Fellow at Harvard, which meant running his own small lab for a limited time (4 years) with guaranteed funding, and recently got a PI position back at UofT. I see John as a great example of rare maturity at the end of a PhD. I, for one, was not even close to that when I finished my PhD. I did not even know what I wanted to do when I grow up (still working on that one… ;). All I knew at the time was that *if* I ever wanted to become a PI (and that was a very big if), then I should do a postdoc. So I went looking for a good place for a postdoc (that’s a topic for another post). However, I did get my good basic training at the dojo of my Sensei, I mean advisor, Nir Friedman. And btw, as I noted in a previous blog, “Sensei” does not mean “Master”; it literally translates as “one who has come before”, pointing to the endless chain we create by training the next generation.

So, my view is that if at the end of your PhD you don’t have a comprehensive plan/view of research problems, how to get funding etc. that’s OK. You should not stress about that. But make sure you get your basic training right. Generally speaking, you will *not* get that chance as a postdoc, where you are expected to deliver. Getting your base is intimately linked to choosing an advisor who is a good fit – see a recent post about that from the great Eleazar Eskin here. Regardless, and even if you do not agree with some/all of my points, I hope the above discussion got you thinking and helps you on your “Way” of PhD training or training others!


[1] My impression is that students in biomedical fields are generally more mature in that respect than the typical CS ones. Not sure why – maybe it has to do with the time CS students spend on technical abilities, maybe it’s their nature, maybe it’s the culture – I really don’t know, but that has been my general impression through the years.

Follow up on “Proper upbringing vs negligence of computational students in biomedical research environments”

So it turns out my previous post struck a chord with quite a few readers. Some of them contacted me directly to comment and share their stories, and I decided to include three. The first is an update I posted quickly after a senior PI misinterpreted my post. The other two came later and reflect personal experiences which I thought were important to share, both for the PIs out there who think there is no problem and, of course, for the aspiring compbio students/postdocs who read this. I think these stories speak for themselves, and it’s interesting to see the comments and feedback coming from basically all over the world – this seems to be somewhat of a universal problem in our field. Of note, these stories are not about evil PIs exploiting students (the wrong impression the senior PI from the first update was worried about), but rather about various forms of negligence and nonsupportive environments, which is what I was actually describing. For obvious reasons, I removed all details that may identify the people involved (and got prior authorization to post the stories here).


It seems this post got a lot of views but was also misinterpreted by some, who got back to me with legitimate concerns and criticism. Specifically, a senior PI wrote me that they read this as “data generation labs are exploiting the students”. That was never my intention. Let me clarify, and I’ll use Penn’s GCB graduate group to make the point. GCB stands for “Genomics and Computational Biology”. I think the creators of GCB were wise to define it as such. It means GCB caters to a wide range of students who want to get exposed to “real life data/problems”. Some are more into methods development to derive hypotheses (hence “Computational Biology”); others are more into actually generating the data and analyzing it themselves (hence “Genomics”). These are crude distinctions of course, but the point is that not every student is interested in methods development, and not every student requires co-advising. And sometimes a student may need co-advising/collaboration for a specific project/problem and that’s all. As the PI rightfully wrote me, “there is no one size fits all”. Indeed. And students who are becoming experts in a certain field while using/producing Genomic data are not “exploited.” As that PI wrote me: “I’d be better off hiring a good bioinformatician than taking on an untrained grad student who typically needs close supervision and mentorship.” That’s a fair point. My worry, and what sparked this blog in the first place, is with students who want to do more “methods development” at some level and do not get to do that because (a) they haven’t realized that’s what they actually want to do, (b) they did not articulate it (see my suggestions above), or (c) the system/lab they are in does not support it.


Your recent blog post strikes a deep personal chord with me because, during grad school, I was one of the “computationally oriented students [that were] basically used as in-house bioinformaticians to solve the bioinformatics needs of a data generating lab”.
Before we go on, I should say that my grad advisor is a very nice person, excellent scientist, absolutely looks out for me, and we have a great ongoing relationship. So, this is definitely a classic case of asking the student (me) to “go explore and tell me what you may want to do” “[w]ith all the good intentions.”
I joined a genetics lab with a lot of interest in computational biology but, being a naive undergrad, I did not realize that, although the science was really cool, my advisor would not be able to advise me on the computational aspects of my work. After I started my work, this slowly dawned on me as problems were being posed to me and I was being asked to “solve” them without being given any starting point or subsequent guidance. This was still my first year, and I found it very hard to cope with.
I struggled day and night to find relevant papers & reviews, read them end-to-end, read online tutorials, improve both my programming and analysis skills, and started working on the given problems. Then, I started seeking out other bioinformatics/computational-biology faculty on campus to interact with and attend journal clubs with, and I was also doing my best to identify one of them to be my co-advisor.
But, the latter – engaging with other computational faculty – was not easy at all due to complicated politics from all parties involved: my advisor only wanted to ‘collaborate’ and did not want me to be partially subsumed into another faculty’s group, distracting me from my main work; he also did not have a good experience/relationship with a few bioinformatics faculty whom he wanted to work with, and so, he decided to “grow the expertise” in his own lab and liked to tout that he had in-house bioinformatics capabilities.
I survived by working very hard, making hundreds of mistakes, interacting with folks far and wide across the campus, and finding a couple of “shadow” mentors whom I could go to for general guidance when things really were not looking well. Along the way (just like you pointed out), I also managed to mildly enjoy the part of being in a lab that was generating data and interacting closely with experimental biologists, both helping me tremendously in my scientific development.
So, in spite of my survival and subsequent success, I couldn’t agree more with your post. Now as a faculty myself, I cannot emphasize enough the value of “advice”, “training”, “guidance”, and “well-rounded professional growth” of my students and I’m committed to “improving the upbringing of our future generation of scientists”.
Thank you for your post. This is a super-important issue and I’m glad you brought it up.
Your recent post on computational training has touched me deeply.
I have read this almost 50 times and this completely echoes my sentiments.
We all acknowledge the misuse of computational trainees as in-house bioinformaticians, but your post also talks about the “benign form of negligence”, i.e. not knowing what to do with a computational trainee.
I am currently in the same situation, figuring out what to do next. Unfortunately, most people never realize this problem until it is too late.
Thank you again for this post.