
Graph Random Forest: a Graph Embedded Algorithm for Identifying Highly Connected Important Features

EasyChair Preprint no. 8913

12 pages. Date: October 3, 2022

Abstract

Random Forest (RF) is a widely used machine learning method that performs well on classification and regression tasks. It can be trained on overparameterized datasets, which benefits applications in biology. For example, gene expression data typically has far more features $(p)$ than samples $(n)$. Although RF achieves high predictive accuracy, selecting important genes from a large number of features remains problematic: the important genes selected by RF are usually scattered across the gene network, which conflicts with the biological assumption that effective features are connected. To better apply random forests in biological settings where external topological information between features is available, we propose the Graph Random Forest (GRF), which identifies highly connected important features by incorporating an interaction network into the construction of the forest. The algorithm identifies effective features that form a highly connected sub-graph and achieves classification accuracy equivalent to that of RF. To evaluate the proposed method, we conducted simulation experiments and applied it to two real datasets: non-small cell lung cancer RNA-seq data from The Cancer Genome Atlas and a human embryonic stem cell RNA-seq dataset (GSE93593). The resulting high classification accuracy, connectivity of the selected sub-graph, and interpretable feature selection results suggest that the method is a helpful addition to graph-based classification models and feature selection procedures.
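The notion of selected features being "scattered" versus "highly connected" on a network can be made concrete with a small sketch. The code below is not the GRF algorithm and is not taken from the preprint; it only illustrates one plausible way to quantify connectivity, namely the size of the largest connected component among the selected features in the induced sub-graph. The function name, the adjacency-dictionary representation, and the toy gene network are all hypothetical.

```python
from collections import deque


def largest_connected_component(adjacency, selected):
    """Return the largest connected set of selected features.

    `adjacency` maps each feature to its neighbors in the network;
    only edges between two selected features are followed, i.e. the
    search runs on the sub-graph induced by `selected`.
    """
    selected = set(selected)
    seen = set()
    best = set()
    for start in selected:
        if start in seen:
            continue
        # Breadth-first search restricted to selected features.
        component = {start}
        seen.add(start)
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency.get(node, ()):
                if neighbor in selected and neighbor not in seen:
                    seen.add(neighbor)
                    component.add(neighbor)
                    queue.append(neighbor)
        if len(component) > len(best):
            best = component
    return best


# Hypothetical toy network: a chain g1-g2-g3-g4 plus a separate pair g5-g6.
toy_network = {
    "g1": ["g2"], "g2": ["g1", "g3"], "g3": ["g2", "g4"], "g4": ["g3"],
    "g5": ["g6"], "g6": ["g5"],
}
scattered = ["g1", "g3", "g5"]   # no two selected genes are adjacent
connected = ["g1", "g2", "g3"]   # selected genes form a path

print(len(largest_connected_component(toy_network, scattered)))
print(len(largest_connected_component(toy_network, connected)))
```

Under this illustrative measure, a scattered selection yields only singleton components, while a connected selection yields one component covering every selected feature.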

Keyphrases: feature selection, gene network, Random Forest

BibTeX entry
BibTeX does not have the right entry for preprints. This is a hack for producing the correct reference:
@Booklet{EasyChair:8913,
  author = {Leqi Tian and Tianwei Yu},
  title = {Graph Random Forest: a Graph Embedded Algorithm for Identifying Highly Connected Important Features},
  howpublished = {EasyChair Preprint no. 8913},

  year = {EasyChair, 2022}}