Deep Learning - How it works under the hood

by Kunal Mathur, Senior AI Consultant, Finnovion

Deep learning is a rich field of research comprising many different approaches, and there's still a lot to figure out about which of them will transfer well to any particular problem. For example, the Campagne lab's approach starts from reads and, in their own words, "maps structured data (alignment) to a small number of informative features, which can be used for efficient DL model training". In my work I've chosen to treat the problem as a variant filtering question (a binary classification, in machine learning terms): I take the output of the caller and use a deep learning model to filter that callset. The Google team's approach seems similar in that their software involves a first step that looks a lot like the HaplotypeCaller (which makes sense, since Mark and Ryan wrote most of the HC's original code), then applies a deep learning classifier to the variant candidates.

My approach uses variants and annotations encoded as tensors that carry the precise read and reference sequences, read flags, and base and mapping qualities. The model architecture is designed specifically to process these read tensors, so it avoids the extensive use of max-pooling found in many image-classification CNNs (max-pooling is motivated by the smoothness prior of natural images, where neighboring pixels tend to be similar; DNA sequences are not smooth, so we reserve max-pooling for the deeper layers of the net). Because the tensor dimensions contain different types of data, I convolve each separately, which lets the deeper layers of the net exploit these meaningful differences. This architecture also integrates the annotations seamlessly into the classification, which dramatically shortens training times and opens the door to unsupervised training down the road.
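To make this concrete, here's a minimal Keras sketch of a network along these lines. This is not the actual production model: the tensor shapes, channel counts, annotation count, and layer sizes below are all illustrative assumptions.

```python
# Illustrative sketch only: shapes, channel counts, and layer sizes
# are assumptions, not the actual production model.
import tensorflow as tf
from tensorflow.keras import layers, Model

N_READS, WINDOW = 128, 71   # reads in pileup x bases in window (assumed)
SEQ_CHANNELS = 10           # one-hot read + reference bases (assumed)
META_CHANNELS = 5           # base/mapping qualities, read flags (assumed)
N_ANNOTATIONS = 7           # variant-level annotations (assumed)

read_bases = layers.Input(shape=(N_READS, WINDOW, SEQ_CHANNELS), name="read_bases")
read_meta = layers.Input(shape=(N_READS, WINDOW, META_CHANNELS), name="read_metadata")
annotations = layers.Input(shape=(N_ANNOTATIONS,), name="annotations")

# Convolve each kind of data separately so the deeper layers
# see the distinction between sequence and quality/flag channels.
x_seq = layers.Conv2D(32, (1, 5), padding="same", activation="relu")(read_bases)
x_meta = layers.Conv2D(32, (1, 5), padding="same", activation="relu")(read_meta)
x = layers.concatenate([x_seq, x_meta])

# No early max-pooling: DNA lacks the smoothness prior of natural
# images, so pooling is reserved for the deeper layers.
x = layers.Conv2D(64, (3, 3), padding="same", activation="relu")(x)
x = layers.Conv2D(64, (3, 3), padding="same", activation="relu")(x)
x = layers.MaxPooling2D((2, 2))(x)
x = layers.Flatten()(x)

# Fold the variant annotations directly into the classification head.
a = layers.Dense(16, activation="relu")(annotations)
x = layers.concatenate([x, a])
x = layers.Dense(64, activation="relu")(x)
output = layers.Dense(1, activation="sigmoid", name="is_real_variant")(x)

model = Model(inputs=[read_bases, read_meta, annotations], outputs=output)
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC()])
```

Note the (1, 5) kernels in the first layer convolve along the sequence axis within each read, and the annotation branch joins the convolutional features only at the dense head, so read-level and variant-level evidence are learned jointly but represented separately early on.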


Feel free to contact me by email or leave a comment below if you have any questions or would like pointers to additional material.

3 Comments

What's your class imbalance for your classification tasks? I expect there is significant imbalance, since variants are rare events compared to the size of the genome. If that is the case, auROC can be very misleading; you'd want to report area under the precision-recall curve and recall at specific precisions, e.g. 70, 80, 90%. Had posted this on Twitter. You guys should get on there. -Anshul Devansh

Hi Anshul, Thanks for the comment. The AUC table was made from a test set of variants enriched for negative examples so that the classes are nearly balanced. -Payal

If one uses binary cross-entropy loss and optimizes for F1-score, you can try replacing the -log(x) in binary cross-entropy with (-log(x))^2 or (-log(x))^3. This way the loss penalizes wrong predictions (past 0.5) more heavily, since F1 operates on discrete classes, while the penalty for points close to the correct probability is lower than with the original binary cross-entropy. - Keerthy
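For concreteness, Keerthy's suggestion amounts to raising the negative log to a power k > 1. Here is a minimal sketch of such a loss; the exponent, epsilon, and function name are illustrative choices, not a prescribed implementation.

```python
# A sketch of the loss Keerthy describes: replace -log(p) with
# (-log(p))**k for k > 1. The exponent and epsilon are illustrative.
import tensorflow as tf

def powered_bce(k=2.0, eps=1e-7):
    """Binary cross-entropy raised to the k-th power (k=1 is standard BCE)."""
    def loss(y_true, y_pred):
        y_true = tf.cast(y_true, y_pred.dtype)
        p = tf.clip_by_value(y_pred, eps, 1.0 - eps)
        # Probability the model assigned to the true class.
        p_true = y_true * p + (1.0 - y_true) * (1.0 - p)
        # (-log p)^k grows faster than -log p for confident mistakes
        # and shrinks faster for near-correct predictions.
        return tf.reduce_mean(tf.pow(-tf.math.log(p_true), k))
    return loss
```

With a Keras model this would drop in as, e.g., model.compile(optimizer="adam", loss=powered_bce(k=2.0)).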
