The Art in Science Part II: How to build models for real life problems?

In my previous post I explored some “art” aspects of scientific work that have to do with esthetics, creativity, and self-expression. Another “art” aspect of ML is deciding what kind of model/algorithm we should build for a given real-life problem. There is no specific formula/recipe for that and, like many things in life, getting good at modeling takes time and practice, making it more of an “art” as referred to in Neil Lawrence’s post. Nonetheless, just as in Martial Arts (yes!), there are some basic principles/guidelines we should follow. Some of those I can think of include:

  1. Do not skip steps. When we want to computationally solve a real life problem there are basic steps we have to go through. These can be defined as:
    (a) Thinking how to formulate the problem. This includes the basic entities, the relations between them, the feature space, what we may be able to generalize from etc.
    (b) Deciding what kind of function we should optimize.
    (c) Deciding how to go about performing the optimization (i.e. the learning algorithm).
    (d) Thinking how we can evaluate success/accuracy and what a “good” model would give us.
    True, the above “steps” are highly connected. Still, way too often we jump to (b) and (c), the steps that are generally more technical and the focus of most course work. While we cannot do without those, I find that in many real-life problems step (a) can be 90% of the fight. This is similar to jumping ahead to code something before you have fully thought it through (a “sin” we have all committed, and likely more than once…). And just like in coding (or, for that matter, elbow escaping from a mount in BJJ), jumping ahead may work but is more likely to yield subpar results.
  2. Let the data guide you. Eyeballing the data to see where the main issues arise and to get clues for model preferences is priceless. But it’s more than that. Michelangelo is known to have claimed that a stone has a statue inside it and the sculptor’s role is simply to discover it. Of course, if we take a rational scientific approach to this statement it makes little sense. But I see it as an insightful comment about the process: by taking this approach we are more likely to shed our preconceptions, biases, and egos so as to see more clearly what is in front of us and what is required. So let the data (or stone) tell you its story. It’s the data’s story, not yours.
  3. Keep it simple (unless you have a good reason not to). This is basically restating Occam’s razor. As ML researchers, we commonly want to build fancy models with all the bells and whistles we just learned about or thought about. But as researchers in Computational Biology who handle real-life problems we should curb that urge. Besides practicality, there is ML theory behind keeping it simple (cf. chapter 2 in the great Kearns & Vazirani [1]), and even beauty. And, again, simplicity and minimalism are common themes related to esthetics and beauty in the (Martial) Arts.
  4. Consider different lines of attack. Our first solution is not necessarily the optimal one. Especially if things do not work well, try to open your mind, take a step back, and think about what other approaches may work better.
  5. Iterate. This is also related to the above point. That book you love may look like a beautiful result of ingenuity and creativity, but is nonetheless the end product of many iterations, small insights and little victories. So will the model you develop.
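
To make steps (a) through (d) and the “keep it simple” principle a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption made up for illustration: the synthetic data, the squared-error objective, and the function names (`make_data`, `fit_line`, `mse`, `evaluate`) do not come from any real project. The point is only that each step gets an explicit decision, and that a trivial baseline is part of the evaluation.

```python
import random
import statistics


def make_data(n, seed=0):
    """Step (a): formulate the problem. Here: one feature x, one target y.

    The data are synthetic (an assumed noisy linear relation), purely for illustration.
    """
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [2.0 * x + 1.0 + rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys


def mse(preds, ys):
    """Step (b): the function we choose to optimize, mean squared error."""
    return statistics.fmean([(p - y) ** 2 for p, y in zip(preds, ys)])


def fit_line(xs, ys):
    """Step (c): the optimization itself, closed-form least squares for y = slope * x + intercept."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def evaluate(n=200, seed=0):
    """Step (d): evaluate on held-out data, against the simplest possible baseline."""
    xs, ys = make_data(n, seed)
    split = int(0.8 * n)
    train_x, test_x = xs[:split], xs[split:]
    train_y, test_y = ys[:split], ys[split:]

    # Baseline: always predict the training mean (no model at all).
    baseline = statistics.fmean(train_y)
    baseline_mse = mse([baseline] * len(test_y), test_y)

    # Simple model: a straight line fit on the training data only.
    slope, intercept = fit_line(train_x, train_y)
    line_mse = mse([slope * x + intercept for x in test_x], test_y)

    return {
        "baseline_mse": baseline_mse,
        "line_mse": line_mse,
        "slope": slope,
        "intercept": intercept,
    }


if __name__ == "__main__":
    print(evaluate())
```

If a fancier model cannot clearly beat the line (or the line cannot beat the mean) on held-out data, that is usually a hint to go back to step (a) rather than to add more bells and whistles.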

One issue with the above principles is that they are rarely articulated to students. Worse, our current educational system is not geared to teach them (more on that in a future post). Their “artistic nature” also means they are easier to grasp and master through personal instruction and by closely watching someone who “has come before us”. That, btw, is the literal meaning of the Japanese word “Sensei”. So if you are a student, go find yourself a good Sensei. Oops, sorry – I meant advisor 😉

[1] An Introduction to Computational Learning Theory – Kearns & Vazirani, 1994.

The Art in Science – Part I

I was recently reading Neil Lawrence’s excellent post on how computer science degrees should be adapted given today’s challenges. Neil nicely points out that “Teaching programming alone is like teaching someone how to write without giving them something to say”, and discusses the need to understand diverse systems – unstructured documents, speech, vision, Bioinformatics etc. Then, one implicit point in Neil’s post caught my attention. Neil states:

Sitting at the core of each of these areas is machine learning: the art of processing and assimilating a range of unstructured data sources into a single model.

I found the choice of words quite interesting: a highly accomplished scientist makes a claim about artistic elements in science. Is that really so? Is there Art in Science?

Art involves “the expression or application of human creative skill and imagination”. It also relates to a notion of beauty and esthetics. Indeed, after spending some time in the field of ML you start seeing the beauty and creativity in elegant formulations for a specific real-life problem, as well as the distinct personal signatures of those formulating the solutions. Examples I recall include learning about Shannon’s information theory for the first time, the generalization of EM by Neal & Hinton [1], and the “magic” of boosting followed by its probabilistic interpretation by Friedman, Hastie and Tibshirani with subsequent discussions [2]. So, perhaps surprisingly similar to (yes!) Martial Arts, ML requires high technical skills but skills alone are not enough: you need to be creative in order to really push the boundary of what can be achieved, and at a certain level you make the techniques your own, expressing your character.

This brings me to another important aspect of “the art in ML” which may have been alluded to in Neil’s post: What kind of models should you build? And more generally – what kind of questions should you be asking as a scientist? I’ll discuss this in my next post. In the meantime, anyone who has a nice personal example of where she/he found beauty and personal expression in ML papers is welcome to leave it as a comment – it could make for an interesting reading list…

[1] A View of the EM Algorithm that Justifies Incremental, Sparse, and other Variants – Neal & Hinton, 1998
[2] Additive logistic regression: A statistical view of boosting – Friedman, Hastie and Tibshirani, 2000

Note Taking and Squirrelly Software

In the past year or so I have been growing discontented with my method of note taking. I started using Evernote during my postdoc years, though admittedly with some reservations: putting all my notes in some new company’s proprietary database that can only be accessed via dedicated software seemed problematic. Still, it helped me (and apparently many others) get better organized. And so my list of notes grew longer through the years, as did Evernote’s feature list. Recently though, Evernote seems to have shifted its focus to business solutions: integrated group chat, collaborative note editing, etc. As a result, the software seems to have bloated, becoming slower and buggier. Moreover, many features you get for free from your OS, like offline note taking or searching in PDFs, are only available for premium customers. So I started wondering: if I am interested simply in note taking, is Evernote the way to go? The last straw was when I was writing a post for this blog and Evernote managed to sync it out of existence, with no hope of recovery. Writing is painful enough without having my notes deleted, thank you very much.

So what went wrong with Evernote for me? Besides the basic reservations about a proprietary database etc., it seems to have gone from a sharp tool for a specific task (note taking) to a dull one that does not excel at anything. Sound familiar? Yes, it can be seen as another example of the squirrelly approach to Budo, interdisciplinary research or, in this case, software development. Not surprisingly, I was not the only one feeling discomfort. My grievances with Evernote were crystallized in Alex Payne’s excellent post, where he calls Evernote and similar software an “Everything Bucket”. One of his rules for achieving computing bliss is to “not use software that does many things poorly”, i.e. “Squirrelly Software”.

What did I end up doing to solve my quandary about note taking? I followed Adam Pash’s recommendation for SimpleNote combined with nvALT. I get fast and reliable note syncing, in a format that is also searchable directly on my local disk, with matching apps on all OS and mobile platforms. You can easily hook nvALT to your favorite text editor (Emacs with a markdown extension? Vi?) or directly start notes in the synced directory with your editor of choice. So far I am a happy camper. Good luck with your note taking, and beware of squirrelly software!

The squirrelly approach to Budo, interdisciplinary science, and software development

In his book Moving Toward Stillness [1], Dave Lowry discusses the squirrelly approach to Budo*, citing ancient writings by Hsun Tsu**:

The squirrel can do five things: He can climb a tree, swim, dig a hole, jump, and run. All these are within its capacity, yet he does none well.

The analogy is to people who try to train in many different Martial Arts but end up not excelling at any, with a superficial understanding of all.

I find that in interdisciplinary fields like Computational Biology we, and more worryingly our students, may end up like the squirrel. Admittedly, I find quite a few papers in the Bioinformatics field to be like that: yet another method which is not particularly interesting computationally, accompanied by a shallow understanding of the underlying biology. Such papers end up not really advancing our methods, tools, or our biological understanding. Many are well intended I’m sure, but the end result is not great. So what are we to do in our own scientific practice and when raising the next generation of scientists?

Dave’s advice is to concentrate on a single discipline in which you gain significant expertise and deep understanding. In Martial Arts, that can take a good ten years or so. However, Martial Arts tend to have many shared principles (more on those in later posts) and so by identifying and internalizing those one can later more easily learn from other Martial Arts, bringing more insights and depth to his/her original practice. Practicing hard and earnestly also teaches you *how* to learn, an ability that serves you well when you later expand to other disciplines.

The analogy in Science is to have a good foundation in some area, then add to it. If not, we run the risk of creating Bioinformaticians (including ourselves) who will have a hard time pushing the boundary of current knowledge.

Now, with all that said, to be perfectly honest squirrels do seem to excel at something (a point Hsun Tsu may not have realized or may have chosen to ignore): They are very good at being squirrels. In fact, squirrels are one of the few mammalian families endemic to Eurasia, Africa, North America and South America, starting some 36 million years ago in North America [2]. So, while they may not have excelled at Hsun Tsu’s five tasks, they certainly have been around far longer than us and have their place in the grand scheme of things. And they probably don’t care much whether some philosophers think highly of them or not.

P.S: Wait, didn’t I promise a connection to software development as well?
Well, this post has grown long already, so this will have to wait for the next time.

*Bu – Martial, Do – way; a Japanese term referring to the Martial Arts.
**An influential Chinese Confucian philosopher from the third century BC

[1] Moving Toward Stillness: Lessons in Daily Life from the Martial Ways of Japan, Dave Lowry, Tuttle Publishing, 1999
[2] The effects of Cenozoic global change on squirrel phylogeny, J.M. Mercer & V.L. Roth, Science, 2003

What is this blog about?

About a year ago I was riding the train back home from work and bumped into my colleague and friend, Arjun Raj. Arjun writes a popular blog about everything science related that I highly recommend [1]. He described how liberating it was to write a blog and how he realized it actually reached people. I found this quite interesting and opposite to my experience: I struggle with writing (let’s face it, this is not even my native language) and for as long as I can remember I did not like to say things in public unless I was absolutely sure I got it right. On the other hand, I found from conversations with students and colleagues that they found some of the observations I made quite useful. And so, I decided to step out of my comfort zone and start putting my musings about the world out there. Like so many things in Science, it just took me another year.

So why “Martial Arts Life Science and everything in between”?
Research, specifically in Life and Computer Science, is what I do and love doing. Martial Arts is another key component of my life experience. Besides the obvious physical/defense aspects, I see it as a way to learn about ourselves and the world; I like the philosophical aspects of it; and I like to find how I can bring insights from my Martial Arts practice to my everyday life to make me a better researcher, mentor, father, and person. So here we go.