Background: This work seeks to develop a methodology for identifying reliable biomarkers of disease activity,
progression and outcome by detecting significant associations between high-throughput flow
cytometry (FC) data and interstitial lung disease (ILD), a systemic sclerosis (SSc, or scleroderma) clinical phenotype
that is the leading cause of morbidity and mortality in SSc. A specific aim of the work is to develop a
clinically useful screening tool that could yield accurate assessments of disease state, such as the risk or presence of
SSc-ILD, the activity of lung involvement and the likelihood of response to therapeutic intervention. Ultimately, this
instrument could facilitate a refined stratification of SSc patients into clinically relevant subsets at the time of
diagnosis and subsequently during the course of the disease, and thus help prevent poor outcomes from
disease progression or unnecessary treatment side effects.
Methods: The work draws on: (1) clinical and peripheral blood flow cytometry data (Immune Response
In Scleroderma, IRIS) from consented patients followed at the Johns Hopkins Scleroderma Center; (2) machine
learning (Conditional Random Forests, CRF) coupled with Gene Set Enrichment Analysis (GSEA) to identify subsets
of FC variables that are highly effective in classifying ILD patients; and (3) stochastic simulation to design, train and
validate ILD risk screening tools.
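As a rough illustration of the GSEA component only, the sketch below computes an unweighted running-sum enrichment score for a candidate subset of FC variables. It assumes the variables have already been ranked from most to least important, for instance by a random forest importance measure. The function name, the variable names and the ranking are hypothetical, and the code is not the CRF-GSEA procedure used in this work.

```python
def enrichment_score(ranked_vars, var_set):
    """Unweighted GSEA-style running-sum enrichment score.

    ranked_vars: variable names ordered from most to least important.
    var_set: candidate subset of variables to score against the ranking.
    Returns the running-sum value with the largest absolute deviation from zero.
    """
    members = set(var_set)
    n_hits = sum(1 for v in ranked_vars if v in members)
    n_miss = len(ranked_vars) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("set must contain some, but not all, ranked variables")

    running, best = 0.0, 0.0
    for v in ranked_vars:
        # Step up on a set member, step down otherwise; the walk ends at zero.
        running += 1.0 / n_hits if v in members else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Hypothetical FC variable names in an assumed importance ranking.
ranking = ["CD4_naive", "CD8_memory", "B_cells", "NK_cells", "Tregs", "monocytes"]
top_heavy = enrichment_score(ranking, {"CD4_naive", "CD8_memory"})
bottom_heavy = enrichment_score(ranking, {"Tregs", "monocytes"})
```

A subset concentrated at the top of the ranking scores close to +1 and one concentrated at the bottom scores close to -1. In practice the significance of a score would be judged against a permutation-based null distribution, which this sketch omits.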
Results: Our hybrid analysis approach (CRF-GSEA) predicted SSc patient ILD status with a high
degree of success (>82 % correct classification in validation; 79 patients in the training data set, 40 patients in the
validation data set).
Conclusions: IRIS flow cytometry data provide useful information for assessing the ILD status of SSc patients. Our
new approach combining Conditional Random Forests and Gene Set Enrichment Analysis identified
a subset of flow cytometry variables from which we created a screening tool that proved effective in correctly
identifying ILD patients in the training and validation data sets. More broadly, the
identification of subsets of flow cytometry variables that exhibit coordinated movement (i.e., multi-variable up or
down regulation) may lead to insights into possible effector pathways and thereby improve understanding
of systemic sclerosis pathogenesis.
BMC Bioinformatics (2015) 16:293