RNA-sequencing data can provide valuable insights into physiological and disease processes. However, the sheer size and noisy nature of sequencing data make it challenging to analyze and to find meaningful differences between groups. Traditional methods, such as differential expression analysis, require large sample sizes, rely on arbitrary significance thresholds, and struggle to model complex gene interactions. In contrast, machine learning can help overcome these hurdles by identifying the informative features (genes) that best discriminate between sample groups. These features are then used to develop predictive models that can classify new data, for example in disease diagnosis or prognosis.
In our protocol, we provide a user-friendly approach for data preprocessing, exploration, feature selection, and creation of a hierarchical classification model. Accompanied by open-source code and an interactive ensemble feature selection app, our approach encourages users to explore their data and interpret results from a biological perspective. Because it requires minimal programming experience, the protocol allows researchers from diverse backgrounds to apply machine learning in their work. Additionally, the protocol is not limited to RNA-sequencing data and can be extended to other omics datasets (e.g., proteomics, metabolomics, or microarray data).
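To make the idea of ensemble feature selection concrete, the following is a minimal sketch in Python using only the standard library. It is not the code of the protocol or of its app. The two scoring criteria (a Welch-style t statistic and an AUC-based separation score), the rank-averaging rule, the function names, and the toy expression values are all assumptions chosen for illustration. The sketch scores each gene with both criteria, converts the scores to ranks, and keeps the genes with the best average rank.

```python
from statistics import mean, stdev


def welch_t_score(a, b):
    """Absolute Welch-style t statistic between two groups of values."""
    se = (stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b)) ** 0.5
    return abs(mean(a) - mean(b)) / se if se > 0 else 0.0


def auc_score(a, b):
    """Distance from 0.5 of the probability that a value in a exceeds one in b."""
    wins = sum((x > y) + 0.5 * (x == y) for x in a for y in b)
    return abs(wins / (len(a) * len(b)) - 0.5)


def to_ranks(scores):
    """Map each gene to its rank (1 = highest score); ties broken by gene name."""
    ordered = sorted(scores, key=lambda gene: (-scores[gene], gene))
    return {gene: position + 1 for position, gene in enumerate(ordered)}


def ensemble_select(expr, labels, top_k):
    """Return the top_k genes by mean rank across both scoring criteria."""
    groups = sorted(set(labels))
    if len(groups) != 2:
        raise ValueError("this sketch assumes exactly two sample groups")

    def split(values):
        first = [v for v, lab in zip(values, labels) if lab == groups[0]]
        second = [v for v, lab in zip(values, labels) if lab == groups[1]]
        return first, second

    t_ranks = to_ranks({g: welch_t_score(*split(v)) for g, v in expr.items()})
    auc_ranks = to_ranks({g: auc_score(*split(v)) for g, v in expr.items()})
    mean_rank = {g: (t_ranks[g] + auc_ranks[g]) / 2 for g in expr}
    return sorted(expr, key=lambda g: (mean_rank[g], g))[:top_k]


# Hypothetical toy data: four genes measured in four control and four disease samples.
labels = ["control"] * 4 + ["disease"] * 4
expr = {
    "GENE_A": [5.0, 5.2, 4.9, 5.1, 9.0, 9.3, 8.8, 9.1],
    "GENE_B": [7.0, 7.4, 6.8, 7.1, 7.2, 6.9, 7.3, 7.0],
    "GENE_C": [3.0, 3.5, 2.8, 3.2, 4.6, 4.1, 5.0, 4.4],
    "GENE_D": [6.0, 8.0, 5.5, 7.5, 6.2, 7.8, 5.9, 7.7],
}

selected = ensemble_select(expr, labels, top_k=2)
print(selected)
```

Averaging ranks rather than raw scores keeps criteria with different scales comparable, which is one common reason to combine selectors in an ensemble. In practice, the selected genes would then be passed to a classifier and evaluated on held-out samples.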
The protocol will support researchers in identifying important markers from biomedical data, creating prediction models with clinical relevance, and guiding hypothesis generation for future experimental work. For more information on this protocol, please see: https://www.sciencedirect.com/science/article/pii/S2666166723006287