In an earlier post I discussed how we might benefit from learning, from other cultures or fields, a term or concept that we never had, and then implementing it in our everyday life. I also suggested a way we can actually hope to achieve that: importing the (native) term into our own vocabulary, then actively looking for opportunities to observe or implement it. The examples I gave came from Hindi, living in Canada, and my army service. In this post I want to expand on the idea of “the naming of things”, connecting it to ideas in linguistics and our recent research work.
In linguistics, the Sapir-Whorf hypothesis states that language determines, or at least influences, our thought. This idea has been pursued in many works, from the study of cultures to programming languages. Franz Boas, for example, popularized the idea that the Eskimo languages have a rich vocabulary to describe the many possible forms of snow. Kenneth Iverson, the Turing Award winner and developer of APL (the historical ancestor of Matlab and other languages built around mathematical operations), argued for this idea in the context of programming languages.
I know this all sounds theoretical and ancient, but I was thinking about Sapir-Whorf when we ran into some surprising findings during our research. Our lab is interested in RNA processing and post-transcriptional regulation. Historically, the study of RNA splicing variations has focused on two main approaches: studying whole transcripts or quantifying alternative splicing (AS) “events”. The latter have been identified in model systems and categorized into basic subtypes such as a cassette exon (including/skipping an exon), intron retention, and alternative 3′/5′ splice sites. The common wisdom, supported by high-throughput studies since 2008, was that these are the most common forms of splicing variations. And so the field’s “vocabulary” was set for the decades that followed, and subsequent works studied either whole transcripts or these AS “events”.
When we started working on MAJIQ (Modeling Alternative Junction Inclusion Quantification) 2.5 years ago, we just wanted to better quantify alternative splicing “events” from RNA-Seq. But working on MAJIQ led us to define LSVs, or local splicing variations. LSVs can be thought of as splits in a gene’s splice graph coming from or going into a single exon (hence “local”). Here is a simple illustration of such LSVs on a splice graph:
Besides the intuitive definition, one nice thing about LSVs is that while they are graph based, they correspond directly to biology (i.e. which RNA segments the spliceosome should splice together at any given point) as well as to direct experimental evidence (the junction-spanning reads). The second nice thing about LSVs is that, unlike full transcripts, they can be directly inferred from short RNA-Seq reads, yet they still offer a much more expressive language than the previously defined AS “events”. Specifically, the previously defined AS “types” appear as special cases of binary LSVs, but much more complex variations can be captured as well. Here is a simple illustration of this:
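The same idea can also be sketched in a few lines of Python. This is a toy model under simplifying assumptions, not how MAJIQ is implemented: a junction is written as a hypothetical `(source_exon, target_exon)` pair, so variations that differ only in splice site within the same exon pair (alternative 3′/5′ splice sites) and intron retention are not represented. The names `find_lsvs` and `is_binary` are made up for this illustration.

```python
from collections import defaultdict


def find_lsvs(junctions):
    """Find splits in a toy splice graph.

    `junctions` is a list of (source_exon, target_exon) pairs.
    An exon with more than one outgoing junction gives a "source" LSV;
    an exon with more than one incoming junction gives a "target" LSV.
    """
    outgoing = defaultdict(set)
    incoming = defaultdict(set)
    for src, dst in junctions:
        outgoing[src].add(dst)
        incoming[dst].add(src)

    lsvs = []
    for exon, targets in outgoing.items():
        if len(targets) > 1:
            lsvs.append(("source", exon, sorted(targets)))
    for exon, sources in incoming.items():
        if len(sources) > 1:
            lsvs.append(("target", exon, sorted(sources)))
    return sorted(lsvs)


def is_binary(lsv):
    """A binary LSV has exactly two alternative junctions."""
    return len(lsv[2]) == 2


# Classic cassette exon: E2 is either included (E1-E2-E3) or skipped (E1-E3).
cassette = [("E1", "E2"), ("E2", "E3"), ("E1", "E3")]

# A more complex case: from E1 the transcript can continue to E2, E3, or E4.
complex_case = [("E1", "E2"), ("E1", "E3"), ("E1", "E4"),
                ("E2", "E4"), ("E3", "E4")]

cassette_lsvs = find_lsvs(cassette)
complex_lsvs = find_lsvs(complex_case)
```

In this toy setting the cassette exon shows up as two binary LSVs (one split leaving E1, one merge entering E3), while the second graph yields LSVs with three junctions each, which the classic binary “event” vocabulary has no name for.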
Now that we suddenly had this richer language at hand, we could actually study the full spectrum of LSVs: identify, quantify, and visualize them. We realized that complex, non-binary variations made up over a third of the variations in mouse and human and were highly enriched in regulated variations. In our recent paper published in eLife we started characterizing the spectrum of LSVs and how they are relevant to gene regulation, development, and disease. But this is really just the tip of the iceberg, and we hope that with the tools we created (MAJIQ and the matching visualization package VOILA), the full effect of LSVs will be discovered by the greater scientific community. Or, to put it in linguistic terms: now that we have the ability to name those variations, we can bring them into the focus of our attention.