We are thrilled to release the new version of MAJIQ (2.3 for those keeping count), which contains a slew of new features and capabilities. The release is accompanied by a pre-print posted last week. This blog post gives a general overview of the new MAJIQ features and why we are super excited about them. Hopefully you all are too… 😃
The new MAJIQ is built to scale from a few samples to large heterogeneous datasets with thousands of samples and dozens of conditions (GTEx anyone? TCGA? TARGET? ENCODE? Or maybe you want to build your own combined dataset?). It grew out of our own research, after we realized there were no available tools for the related tasks (e.g. differential splicing, clustering). Yes, there are efficient tools that can process large data, but large heterogeneous datasets require more than that if you want to detect, quantify, and visualize splicing variations efficiently and accurately.
First, some background for the newcomers (those who know/use MAJIQ are welcome to skip this): We released MAJIQ v0.9 in February 2016 (> 5.5 years ago!). MAJIQ introduced a new vocabulary for how RNA alternative splicing (AS) is formulated by defining local splicing variations, or LSVs, which generalize the classical binary AS events of cassette exons, alternative 3’/5’ splice sites, and intron retention. It was the first tool to detect, quantify, and visualize complex (involving more than 2 alternative junctions) and de-novo (unannotated) junctions, exons, and intron retention events. Using this new formulation of splicing variations we were able to show that over 30% of already annotated splicing variations in human/mouse are complex, and that including de-novo events can boost detection of differential splicing by > 30% while retaining the same reproducibility.
Since its publication in Feb 2016 we have released accompanying tools for automated RT-PCR primer design for validating LSVs and, more recently, for correcting known and unknown confounding factors in RNA splicing analysis. Two years ago we also released MAJIQ 2.0, which had much more efficient algorithms/code, retaining its advantages in accuracy and features while making it as fast/efficient as the most efficient algorithms (rMATS-turbo, LeafCutter). We even released a pre-print showing it compares favorably and that previous comparisons misrepresented MAJIQ, but eventually held off on publishing a paper about that (buy me a beer sometime and I’ll tell you how a paper gets stuck as a pre-print). Instead, we opted to complete a whole new set of features and an accompanying paper to showcase these new capabilities. Fast forward through professional developers coming/going, students graduating, a few babies, one global pandemic, etc., and we are suddenly in November 2021. So… what’s new??
I wish I had Shaggy’s super powers, but not quite… In a nutshell, incremental build allows users to process each sample only once. This is critical with large datasets to which you may want to add more samples later (e.g. updating with a new release of GTEx, or adding additional datasets to existing compendia). What you don’t want to do is (a) have to save all the raw data and/or redownload it, or (b) reprocess the data. To avoid this, MAJIQ builds a database of all identified junctions and intron retentions (IR) for each sample and saves all the necessary statistics for them, so you don’t need to keep the original BAM files or re-analyze them.
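To make the "process each sample only once" idea concrete, here is a minimal sketch in Python. It is not MAJIQ's code or file format: the function names, the JSON cache files, and the string junction IDs are all assumptions made for illustration. The point is only that a per-sample summary is written once and later builds read the summary instead of the raw data.

```python
import json
from pathlib import Path


def summarize_sample(raw_reads):
    """Stand-in for the expensive step: count reads per junction ID."""
    counts = {}
    for junction in raw_reads:
        counts[junction] = counts.get(junction, 0) + 1
    return counts


def incremental_build(samples, cache_dir):
    """Merge per-sample junction counts, reusing cached summaries.

    samples: dict mapping sample name -> zero-argument callable that loads
             the raw reads (hypothetical; only called on a cache miss).
    Returns (merged, processed), where merged maps
    junction -> {sample: count} and processed lists the samples that
    actually had to be summarized in this run.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    merged, processed = {}, []
    for name, load_raw in samples.items():
        cache_file = cache_dir / f"{name}.json"
        if cache_file.exists():
            # Seen before: the raw data is never touched again.
            counts = json.loads(cache_file.read_text())
        else:
            counts = summarize_sample(load_raw())
            cache_file.write_text(json.dumps(counts))
            processed.append(name)
        for junction, n in counts.items():
            merged.setdefault(junction, {})[name] = n
    return merged, processed
```

Calling `incremental_build` a second time with an extra sample only invokes the loader for the new sample; everything else comes from the cache directory, which is why the original raw files can be discarded.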
Splicing can get quite complex pretty fast, especially when you are processing hundreds or thousands of samples. To remove “noise” (e.g. lowly used junctions or those not actually observed in your data) we implemented the simplifier – both at the level of the splice graph building stage and later at the level of visualization. Below is an illustration of the simplifier in action. We released the simplifier a while ago and have a separate blog post about it here.
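For readers who like to see the idea in code, here is a rough sketch of what "removing lowly used junctions" can look like. The rule used here (drop a junction if it carries less than `min_ratio` of the reads leaving its source exon in every sample) and all names are assumptions for illustration, not the simplifier's actual criteria or defaults.

```python
def simplify(junction_reads, min_ratio=0.01):
    """Drop junctions that are barely used in every sample.

    junction_reads: {source_exon: {junction_id: [reads per sample]}}
    A junction is kept if, in at least one sample, it carries at least
    `min_ratio` of all reads leaving its source exon (an assumed rule).
    """
    kept = {}
    for exon, junctions in junction_reads.items():
        if not junctions:
            kept[exon] = {}
            continue
        n_samples = len(next(iter(junctions.values())))
        # Total reads leaving this exon, per sample.
        totals = [
            sum(reads[i] for reads in junctions.values())
            for i in range(n_samples)
        ]
        kept[exon] = {
            junction: reads
            for junction, reads in junctions.items()
            if any(t > 0 and r / t >= min_ratio for r, t in zip(reads, totals))
        }
    return kept
```

Note that a junction only needs to pass in one sample to survive, so a junction that is rare overall but well used in a single tissue is not thrown away.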
OK, so not this kind of hat, though we do have a few more surprises for you down below…😊
HET stands for “heterogeneous” and refers to the new, (mostly) rank-based statistics used for more robust detection of differential splicing between groups of samples. The original MAJIQ was built for (small) groups of biological replicates and assumes each group shares an (unknown) PSI value per junction, per LSV. MAJIQ HET drops that assumption: it uses robust rank-based statistics (e.g. Wilcoxon, TNOM), combined with the estimated dPSI between the group medians, to filter and rank differential splicing events. Check out how HET compares to all other algorithms in reproducibility plots (x-axis: “how many events are reported as changing”; y-axis: “how many are reproduced when repeating with different groups of the same size and tissues”) using real data (GTEx), when comparing groups of 3, 15 or 50 (!!) samples. We also compared many other metrics (see pre-print). All the scripts and data are also available as a package for the community. This means future developers can use it for comparative analysis, and we encourage potential users to run it on *their* data to see how the different algorithms perform in those settings.
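Here is a small, self-contained sketch of the kind of rank-based summary described above, assuming we already have one PSI point estimate per sample for a given junction. It computes the Mann-Whitney U statistic (the statistic behind the Wilcoxon rank-sum test), the TNOM score (the fewest samples misclassified by any single threshold), and the dPSI between group medians. The function names are ours, p-values are left out, and MAJIQ HET's actual implementation differs; see the pre-print for the real thing.

```python
from statistics import median


def mann_whitney_u(group_a, group_b):
    """U statistic for group_a: pairs where a > b, with ties counted as 0.5."""
    return sum((a > b) + 0.5 * (a == b) for a in group_a for b in group_b)


def tnom(group_a, group_b):
    """Threshold Number Of Misclassifications.

    Fewest samples put in the wrong group by any rule of the form
    'values at or below the threshold belong to one group, the rest to the other'.
    """
    labeled = sorted([(x, "a") for x in group_a] + [(x, "b") for x in group_b])
    n_a, n_b = len(group_a), len(group_b)
    best = min(n_a, n_b)  # threshold below all data: one label for everyone
    a_left = b_left = 0
    for k, (value, label) in enumerate(labeled):
        if label == "a":
            a_left += 1
        else:
            b_left += 1
        is_last = k == len(labeled) - 1
        # A threshold can only sit between two distinct values.
        if is_last or labeled[k + 1][0] != value:
            errors_a_low = b_left + (n_a - a_left)  # rule: low values -> a
            errors_b_low = a_left + (n_b - b_left)  # rule: low values -> b
            best = min(best, errors_a_low, errors_b_low)
    return best


def het_summary(psi_a, psi_b):
    """Rank-based evidence plus effect size for one junction."""
    return {
        "u": mann_whitney_u(psi_a, psi_b),
        "tnom": tnom(psi_a, psi_b),
        "dpsi_median": median(psi_a) - median(psi_b),
    }
```

Because both statistics depend only on the ordering of the samples, a single outlier cannot dominate them, while the median dPSI lets you require a minimum effect size before calling an event differential.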
How do you simplify complex splice graphs to remove “noise” (e.g. lowly used junctions)? How do you visualize hundreds or thousands of samples? How do you do multiple condition comparisons? Easy – you run VOILA 2.0 😁!
VOILA 2.0 can be run on Mac/Win/Unix, but also as a server, so you can share results with your collaborators without file transfers (but yes, due to popular demand we did also bring back an option to save HTML files for sharing/saving visualizations of specific results). VOILA 2.0 allows you to query the results, looking for specific genes, junctions, etc. Look at the result below for a cerebellum-specific splicing change in the splice factor PTB1, where every point is a GTEx V8 brain sub-region sample (total 2,335). If you want to see the visualization “live” click here.
MAJIQ – Modulizer:
The modulizer is a completely new feature/algorithm which should help users simplify splicing analysis while at the same time giving them the ability to dissect types of splicing variations like never before.
The idea is simple: first, we create an updated splice graph for a gene using all the known annotation and de-novo elements (this part existed before). Next, we break down each gene splice graph into separate alternative splice “modules” (ASMs). ASMs were proposed by Yin et al but we do not follow their recursive definition or associated algorithms. Instead, you can simply think of disjoint regions that have a single point of entry or exit in the gene splice graph as shown below:
Now, with the ASMs defined as above, we can do two main things. One, we can define ASM types and find out which are the most common ones. For example, when analyzing cerebellum-specific splicing changes we found the following types of changes most common:
But maybe what’s more interesting is that the modulizer allows you to break down a complex ASM to its subcomponents and enumerate all two-way alternative splice (AS) events that occur within an ASM. Here is an example of two ASMs in a gene broken down into all possible binary AS events:
You can also take a specific splicing change of interest – cassette exon in the example below – and study differences between types of ASMs that incorporate it. For example, we broke down differential cassette exon inclusion/skipping in cerebellum by the most common ASM types that include it:
Focusing on ASMs with and without intron retention, we found specific regulatory features that are shared and others that are distinct (see details in the pre-print). Again, we are not aware of previous tools/methods that allowed users to perform this kind of analysis, and we hope it will help make new discoveries related to RNA splicing regulation.
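If you want a feel for the "disjoint regions with a single point of entry or exit" idea described earlier, the toy sketch below splits a splice graph into modules. It is a deliberate simplification and not the modulizer's algorithm: exons are just integers in genomic order, junctions are `(source_exon, target_exon)` pairs, intron retention is ignored, and an exon counts as a module boundary when no junction skips over it. All names are hypothetical.

```python
def split_into_modules(n_exons, junctions):
    """Split a toy splice graph into alternative splicing modules.

    n_exons:   number of exons, indexed 0..n_exons-1 in genomic order.
    junctions: list of (i, j) pairs with i < j; repeat a pair to model
               alternative splice sites between the same two exons.
    """
    # Boundary exons: no junction jumps over them, so every path
    # through this part of the gene has to pass through them.
    anchors = [
        k for k in range(n_exons)
        if not any(i < k < j for i, j in junctions)
    ]
    modules = []
    for start, end in zip(anchors, anchors[1:]):
        inside = [(i, j) for i, j in junctions if start <= i and j <= end]
        # A single junction between two anchors is constitutive, not a module.
        if len(inside) > 1:
            modules.append({"start": start, "end": end, "junctions": inside})
    return modules
```

For a six-exon gene with two cassette exons (exons 1 and 4) separated by a constitutive junction, this returns two modules, one spanning exons 0 to 2 and one spanning exons 3 to 5.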
As a final comment I would like to mention that, for the record, I very much wanted to call this new tool the modlizanator or at least the Combine-inator, but for some reason my suggestion was overruled… something to do with misinterpreting our efforts as evil and ineffective 🤔.
Being very negative, in a positive way…
I’ll finish this blog post with another small addition which we implemented many years ago but which somehow never made it into the public release. We found that in our work we sometimes want to find not just splicing events that are changing, but also compare them (e.g. their regulatory features) to splicing events that are definitely *not* changing, i.e. a high confidence negative set. Surprisingly, all available tools focus only on high confidence true positives, not true negatives. So we implemented this as an option to get high confidence non-changing LSVs (users can get those by using the --show-all flag).
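To illustrate why "not called as changing" is not the same as "confidently not changing", here is a small sketch that sorts an event into three bins from a set of dPSI estimates (for example, draws from a posterior). The thresholds, the function name, and the input format are assumptions for illustration only; they are not MAJIQ's defaults or its actual procedure.

```python
def classify_event(dpsi_samples, changing=0.2, non_changing=0.05, confidence=0.95):
    """Label an event 'changing', 'non_changing', or 'ambiguous'.

    dpsi_samples: dPSI estimates for one event (assumed input format).
    An event is 'changing' if at least `confidence` of the estimates have
    |dPSI| >= changing, and 'non_changing' if at least `confidence` of them
    have |dPSI| <= non_changing. Everything else is 'ambiguous'.
    """
    n = len(dpsi_samples)
    if n == 0:
        raise ValueError("need at least one dPSI estimate")
    p_changing = sum(abs(d) >= changing for d in dpsi_samples) / n
    p_stable = sum(abs(d) <= non_changing for d in dpsi_samples) / n
    if p_changing >= confidence:
        return "changing"
    if p_stable >= confidence:
        return "non_changing"
    return "ambiguous"
```

The "ambiguous" bin is the important part: events with too little evidence either way stay out of both the positive set and the negative set, which is what makes the negative set high confidence.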
Tell us what you think
We built the new MAJIQ based on many years of experience performing RNA splicing analysis and research, combining continuous input/suggestions from our user base. We want it to be the best tool it can be so please give it a try and send us your feedback and suggestions!
And finally – yes, there are more features/capabilities coming… So stay tuned!