Deep neural network analyses of spirometry for structural phenotyping of chronic obstructive pulmonary disease

Participants enrolled in a large multicenter study (COPDGene) were included. A deep-learning model was trained on the data points from expiratory flow-volume curves to predict structural phenotypes of chronic obstructive pulmonary disease (COPD) on CT, and results were compared with traditional spirometry metrics and an optimized random forest classifier. Area under the receiver operating characteristic curve (AUC) and weighted F-score were used to measure the discriminative accuracy of a fully convolutional neural network, random forest, and traditional spirometry metrics to phenotype CT as normal, emphysema-predominant (>5% emphysema), airway-predominant (Pi10 > median), and mixed phenotypes. Similar comparisons were made for the detection of functional small airway disease phenotype (>20% on parametric response mapping).


Introduction
Chronic obstructive pulmonary disease (COPD) is an inflammatory disease of the lungs that is associated with substantial respiratory morbidity and health care costs and is now the fourth leading cause of death in the United States (1). COPD is defined by persistent airflow obstruction on spirometry, the result of a combination of 2 distinct structural processes: emphysema characterized by alveolar destruction and poor elastic recoil of the lungs as well as airway disease characterized by airway narrowing and remodeling (2,3). Although spirometric measures of airflow obstruction correlate strongly with CT measures of both emphysema and airway disease, spirometry does not discern the relative contributions of these structural disease processes to overall airflow obstruction. Furthermore, recent studies demonstrate that approximately half of current and former smokers, with no evidence of spirometric airflow obstruction according to traditional criteria, have evidence of emphysema and/or airway disease (4,5). These findings suggest that the existing spirometry criteria for airflow obstruction are not sensitive to the contributory structural changes.
The inability to accurately and easily differentiate predominant emphysema from predominant airway disease hinders the development of targeted therapies (6). Furthermore, these structural changes have significant consequences beyond those due to lung function impairment. The degree of emphysema and airway wall thickening on CT are both independently associated with worse respiratory quality of life, dyspnea, and mortality (7-13). Despite these associations, CT is often not recommended for diagnosis in clinical practice due to concerns about high costs and risk of radiation. There are currently no low-cost, low-risk tools to phenotype the structural components of COPD, and even its diagnosis relies on demonstrating abnormalities in discrete components of spirometry, such as the forced expiratory volume in the first second (FEV1) and the ratio of FEV1 to the forced vital capacity (FEV1/FVC). Specific components of the spirometric flow-volume and volume-time curves have been analyzed to identify early and mild COPD but have not been successful in distinguishing emphysema-predominant disease from airway disease predominance (14-20).
We hypothesized that machine-learning approaches trained on all the data points contained in the expiratory flow-volume curve would accurately distinguish individuals with predominant emphysema from those with predominant airway disease. We used a fully convolutional network (FCN) and a random forest classifier to test our hypothesis.

C L I N I C A L M E D I C I N E
Classification results for phenotyping emphysema/airway disease
Training. Results presented here are aggregated means from 10 replications of Monte-Carlo cross-validation in the training data set. The average out-of-bag error for the random forest model was 0.45 (95% CI 0.44 to 0.46), whereas the average validation loss for the neural network was 1.01 (95% CI 1.00 to 1.02). The parameters and weights of the model with minimum validation loss (neural network) and minimum out-of-bag error (random forest) among the 10 training/validation splits were used to evaluate the model performance on the held-out test set.
Held-out test data set. In the test data (20% of the cohort; 1796 participants), 617 participants (34.3%) were normal, 641 (35.6%) had predominant airway disease, 278 (15.4%) had emphysema-predominant disease, and 260 (14.4%) had a mixed phenotype. For the prediction of structural phenotypes, the AUCs for FEV1 % predicted and FEV1/FVC were 0.70 (95% CI 0.68 to 0.71) and 0.71 (95% CI 0.69 to 0.71), respectively. The AUC for the random forest classification was 0.78 (95% CI 0.77 to 0.79). The neural network outperformed traditional measures of spirometry and also the optimized random forest classifier, with an AUC of 0.80 (95% CI 0.79 to 0.81) (Table 2). The F1 score for the neural network was 0.56 compared with 0.45, 0.43, and 0.54 for FEV1 % predicted, FEV1/FVC, and random forest classifier, respectively (Figure 2 and
Classification results for phenotyping emphysema/functional small airway disease
Training. The average out-of-bag error for the random forest model was 0.19 (95% CI 0.18 to 0.20), whereas the average validation loss for the neural network was 0.57 (95% CI 0.55 to 0.59).

Discussion
In a large multicenter cohort of current and former smokers, machine-learning approaches, including deep-learning methods, trained on spirometry data outperformed traditional spirometry measures for the phenotyping of COPD into its structural components, including predominant small airway disease, and provided flow-volume curve signatures for predominant structural disease categories. These results will enhance patient identification for phenotypic characterization and targeting therapies.
Spirometric impairment is a summary metric, and the development of targeted therapies is hindered by the inability to identify predominant COPD phenotypes. Existing threshold-based spirometry criteria are also insensitive to early and mild damage in the lungs. As much as 20%-25% of the lung may be affected by emphysema before these changes manifest on spirometry (21). Substantial airway remodeling and loss also occur before the development of significant spirometric impairment (22,23). Because airflow in the middle part of the flow-volume curve disproportionately results from flow in the small airways, multiple studies have evaluated various methods of analyzing the mid and distal parts of the curve. These include the forced expiratory flow in the 25th to the 75th percentile, forced expiratory flow in the first 3 seconds, the shape of the maximum expiratory curve, and change in angle of flow during forced exhalation (14-20, 24-29). None of these studies, however, validated their measures against structural lung disease and, hence, were unable to separate emphysema from airway predominant disease.
Distinguishing predominant emphysema and airway disease is relevant for optimizing and advancing clinical care. Current therapy for COPD includes bronchodilators and inhaled corticosteroids, and only half of patients treated with these medications have a clinically meaningful improvement in their respiratory quality of life (30). These therapies target airway tone and inflammation and do not target emphysema. Although no specific pharmacologic therapies are currently approved that separately target emphysema and airway disease, interventional and pharmacologic therapies are being developed that are likely to benefit carefully phenotyped and selected individuals. Surgical and bronchoscopic lung volume reduction procedures are approved for severe emphysema. New interventions that target chronic bronchitis and airway remodeling are being developed, and there are ongoing trials that specifically target emphysema (NCT02696564). The results of this study can help identify these patients for clinical trials and eventually therapy.
All values are expressed as mean (SD) unless specified otherwise. FEV1, forced expiratory volume in the first second; FVC, forced vital capacity; GOLD, Global Initiative for Chronic Obstructive Lung Disease; PRISm, preserved ratio impaired spirometry; Pi10, square root of wall area of a theoretical airway with 10-mm luminal perimeter.
Several aspects related to deep learning that are pertinent to this study should be considered. The prediction of structural lung disease from a sequence of flow values derived from a forced expiratory effort is effectively a sequence classification task. Capitalizing on recent advances in deep learning, several studies implemented convolutional neural networks and long short-term memory models to classify sequential data from natural language processing and speech recognition tasks. In the current study, we applied a fully convolutional network (FCN) as well as a random forest classifier on the flow sequence generated from the expiratory flow-volume curve to phenotype structural lung disease as the outcome. Wang et al. proposed the use of FCNs to analyze sequential data and achieved robust results on several data sets from the University of California, Riverside, time series classification archive (31, 32). The FCN architecture was initially proposed for semantic image segmentation tasks; as used here, it is composed of 3 computation blocks, each of which performs convolution operations followed by batch normalization and ReLU activation layers. The resulting output from the 3 convolution operations is fed into a global average pooling layer, which drastically reduces the number of training parameters and further enables the visualization of class activations specific to each class. The minimal requirement for preprocessing and feature crafting before classification, and the visualization of feature activations specific to each class through the pooling layer, make FCN an effective choice for classification of sequential or time series data in the medical domain. Nonetheless, a random forest classifier with optimized parameters performed almost as well as the neural network, suggesting that a number of other machine-learning and deep-learning algorithms may be applied to the sequence of raw data points that constitute the spirometry curves. Random forest relies on decision points and avoids correlated points, whereas FCN uses near neighbors that may be correlated. Although further improvement in accuracy may be possible with other algorithms, the overall results reflect the frequently observed overlapping and interrelated structural changes in both airways and the parenchyma in varying proportions that occur in the majority of smokers. There is considerable airway-parenchymal interdependence; the presence of emphysema can untether airways and result in a predisposition to airway collapse, and peribronchial fibrosis and airway loss can result in distal emphysema. Although this inherent biological complexity limits the information that can be ascertained from spirometry alone, the probability scores that result from the FCN model for each individual raise the likelihood of accurately identifying the predominant structural category and represent a substantial advance in the identification of structural phenotypes in COPD.
The study has several strengths. Data from a large multicenter cohort of participants whose disease spanned the range of severity were included. Extensive CT phenotyping was performed with stringent quality control of both CT and spirometry. The structural phenotypes were classified using quantitative CT data that are more objective than labels applied by experts and are less subject to variability. COPDGene (Genetic Epidemiology of COPD) included a substantial number of African Americans and women. The training of the neural network and subsequent hyperparameter optimization was performed over 10 replications of Monte-Carlo cross-validation to ensure robustness of the model. The final evaluation of the classifier was performed on a hold-out test data set, which was not seen by the model previously.
Limitations. The study also has several limitations. First, COPDGene included current and former smokers, and hence, these results should be validated in cohorts that include nonsmokers with and at risk for COPD. Second, CT scans were not spirometry gated. Participants were, however, coached to reproducibly achieve maximal inhalation. Third, the outcome variables were numeric values for CT parameters, and it is not known how factors that cause variability in CT assessments, such as scanner type and field of view, affect the performance of the machine-learning models. These aspects need further analysis. Fourth, although the top 5 flows at a given volume that are associated with each of the phenotypic classes were identified, we are unable to ascribe a physiologic explanation to these findings. The flow values in combination appear to reflect structural processes that are not detected when discrete single values are used. Although we used SHAP to identify the top features, interpretation is limited, as these features may differ slightly in a different data set, and hence, the inherent black box nature of FCN remains (33). Fifth, machine-learning algorithms can be affected by underfitting and overfitting biases (34), but we obtained similar results in a hold-out test data set.
Conclusions. Structural phenotypes of COPD can be identified from spirometry using a deep neural network and machine-learning approaches, demonstrating their potential to identify individuals for targeted therapies. Further research is necessary to evaluate the applicability of the deep-learning model to improve COPD outcomes.

Methods
Study population and physiologic assessments. Spirometry data from participants enrolled in the COPDGene study were included (35). COPDGene is a large multicenter cohort study of current and former smokers aged between 45 and 80 years, with a smoking history of at least 10 pack-years; the details of this study have been previously published. All participants underwent a standard protocol, which included prebronchodilator and postbronchodilator spirometry using the New Diagnostic Design Easy-One spirometer per the American Thoracic Society (ATS) criteria. Postbronchodilator spirometry was performed 20 minutes after administration of 180 μg albuterol HFA with a spacer (Aerochamber, Monaghan Medical Corporation). Quality control was performed by including only those spirometry efforts that met at least grade 2 ATS standards (repeatable between 100 and 150 mL). The postbronchodilator ratio of FEV1/FVC < 0.70 was used to confirm the presence of airflow obstruction (36), and FEV1 % predicted was used to estimate the severity of airflow obstruction per Global Initiative for Chronic Obstructive Lung Disease (GOLD) recommendations (37). Participants with FEV1/FVC ≥ 0.70 but with FEV1 % predicted < 80% were categorized as having preserved ratio impaired spirometry (PRISm) (38). We selected the postbronchodilator effort with the highest sum of FEV1 and FVC for the analysis as per ATS criteria. The raw data points that constitute the expiratory flow-volume curve were decomposed incrementally as flow data at every 30 mL volume exhaled (39,40).
CT-based phenotyping. Quantitative CT scans were acquired at maximal inspiration (total lung capacity). Emphysema was quantified on inspiratory CT as the percentage of low attenuation areas < -950 Hounsfield units using Slicer 3D software (11). Clinically significant emphysema was defined as ≥5% low attenuation areas. This threshold was selected as there appears to be an inflection point at 5%, above which the frequency of exacerbations and mortality increases considerably (12).
Large and medium size airway disease was quantified by the Pi10, the square root of the wall area of a hypothetical airway with internal perimeter of 10 mm, using Apollo Software (VIDA Diagnostics) (11). Because there is no established threshold for clinically significant airway wall disease, Pi10 > median in the COPDGene cohort was used for categorization as significant airway disease. Functional small airway disease (fSAD) phenotype was quantified by >20% lung affected by small airway disease measured on parametric response mapping, where fSAD is nonemphysematous air trapping and, hence, an indirect measure of small airway disease (41,42). Using these emphysema and Pi10 thresholds, we classified participants into 1 of 4 CT categories: normal, <5% emphysema and < median Pi10; airway predominant, <5% emphysema but with Pi10 ≥ median; emphysema predominant, ≥5% emphysema and Pi10 < median; and mixed emphysema/airway, ≥5% emphysema and ≥ median Pi10. We also classified participants into 4 groups based on fSAD: normal, <5% emphysema and <20% fSAD; airway predominant, <5% emphysema but with fSAD ≥20%; emphysema predominant, ≥5% emphysema and fSAD <20%; and mixed emphysema/airway, ≥5% emphysema and fSAD ≥20%.
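The four-way categorization described above can be sketched in a few lines of Python. This is only an illustration of the stated thresholds, not the study's code: the function name `ct_phenotype` and the example values are hypothetical, and the airway cutoff is passed in explicitly because the cohort median Pi10 is not reported here. The same function covers the fSAD-based grouping when the airway measure is percent fSAD and the cutoff is 20.

```python
EMPHYSEMA_CUTOFF_PCT = 5.0  # >=5% low attenuation areas (from the text)
FSAD_CUTOFF_PCT = 20.0      # >=20% fSAD on parametric response mapping (from the text)


def ct_phenotype(emphysema_pct, airway_value, airway_cutoff):
    """Assign 1 of 4 CT categories from emphysema and an airway measure.

    airway_value / airway_cutoff are either Pi10 and the cohort median Pi10,
    or percent fSAD and FSAD_CUTOFF_PCT. Values at the cutoff count as
    disease, following the ">= median" wording of the category definitions.
    """
    has_emphysema = emphysema_pct >= EMPHYSEMA_CUTOFF_PCT
    has_airway_disease = airway_value >= airway_cutoff
    if has_emphysema and has_airway_disease:
        return "mixed"
    if has_emphysema:
        return "emphysema predominant"
    if has_airway_disease:
        return "airway predominant"
    return "normal"


# Hypothetical participants; 2.0 stands in for the (unreported) median Pi10.
example_median_pi10 = 2.0
print(ct_phenotype(1.2, 1.8, example_median_pi10))   # normal
print(ct_phenotype(1.2, 2.4, example_median_pi10))   # airway predominant
print(ct_phenotype(12.0, 1.8, example_median_pi10))  # emphysema predominant
print(ct_phenotype(12.0, 25.0, FSAD_CUTOFF_PCT))     # mixed (fSAD-based grouping)
```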
Deep neural network. FCN was developed for image segmentation tasks and has shown significant improvements in efficiency and overall performance as compared with traditional deep convolutional networks. In this study, we used FCN as a feature extractor for time series (or sequential) data, where these features were further fed into a global average pooling layer and a soft-max layer to classify the sequences into different labels. The basic architecture of FCN includes 3 stacked computation blocks, where each block consists of a 1D convolutional layer followed by a batch normalization layer and a rectified linear unit activation layer (Supplemental Figure 3). Convolution on the 1D input sequence was performed by the convolutional layers, each followed by a batch normalization layer to improve generalizability and speed convergence. The penultimate global average pooling layer reduces the number of weights and prevents overfitting. This FCN architecture has been previously shown to achieve superior performance in several 1D sequence classification tasks.
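To make the data flow through such a network concrete, the following is a minimal pure-Python sketch of a forward pass: three stacked 1D convolution blocks with ReLU, global average pooling, and a soft-max output over 4 classes. It is an illustration under stated assumptions, not the study's model: batch normalization is omitted for brevity, the weights are random rather than trained, the filter counts (4, 8, 4) are deliberately tiny, and all function names are hypothetical. A real implementation would use a deep-learning framework.

```python
import math
import random


def conv_block(x, kernels, biases):
    """1D convolution (cross-correlation, zero 'same' padding) followed by ReLU.

    x: list of input channels, each a list of floats of equal length.
    kernels[o][i]: odd-length weight list linking input channel i to output o.
    Batch normalization is omitted in this sketch.
    """
    length = len(x[0])
    out = []
    for per_input_kernels, bias in zip(kernels, biases):
        half = len(per_input_kernels[0]) // 2
        channel = []
        for t in range(length):
            total = bias
            for c_in, kernel in enumerate(per_input_kernels):
                for j, weight in enumerate(kernel):
                    idx = t + j - half
                    if 0 <= idx < length:  # positions outside the sequence count as 0
                        total += weight * x[c_in][idx]
            channel.append(max(0.0, total))  # ReLU
        out.append(channel)
    return out


def global_average_pool(x):
    """Collapse each channel to its mean, giving one feature per filter."""
    return [sum(channel) / len(channel) for channel in x]


def softmax(z):
    peak = max(z)
    exps = [math.exp(v - peak) for v in z]
    total = sum(exps)
    return [v / total for v in exps]


def fcn_forward(flow, blocks, head_weights, head_biases):
    """Flow sequence -> 3 conv blocks -> global average pooling -> soft-max."""
    x = [list(flow)]  # a single input channel
    for kernels, biases in blocks:
        x = conv_block(x, kernels, biases)
    pooled = global_average_pool(x)
    logits = [sum(w * a for w, a in zip(row, pooled)) + b
              for row, b in zip(head_weights, head_biases)]
    return softmax(logits)


def random_block(rng, n_in, n_out, kernel_size):
    kernels = [[[rng.uniform(-0.5, 0.5) for _ in range(kernel_size)]
                for _ in range(n_in)] for _ in range(n_out)]
    return kernels, [0.0] * n_out


rng = random.Random(0)
# Illustrative filter counts; kernel sizes 9, 5, 3 follow the training description.
blocks = [random_block(rng, 1, 4, 9), random_block(rng, 4, 8, 5), random_block(rng, 8, 4, 3)]
head_weights = [[rng.uniform(-0.5, 0.5) for _ in range(4)] for _ in range(4)]
head_biases = [0.0] * 4
toy_flow = [math.sin(math.pi * i / 199) for i in range(200)]  # synthetic curve, not real data
probs = fcn_forward(toy_flow, blocks, head_weights, head_biases)
print([round(p, 3) for p in probs])  # 4 class probabilities that sum to 1
```

Because global average pooling reduces each filter's output to a single number, the final classification layer needs only one weight per filter per class, which is the parameter reduction the paragraph above refers to.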
Model training and evaluation. The flow data points in each expiratory flow-volume curve were used as a 1D input sequence, and each sequence was standardized to a length of 200 points by padding with zeros at the end of the sequence. The expiratory flow data were divided into input (80%) and hold-out test (20%) data sets. All possible combinations of the number of filters (32, 64, 128, or 256) in the convolutional layers, learning rates in the range of 0.00001-0.1, and batch sizes of 64, 128, and 256 were evaluated on the training set to select the best hyperparameters. Hyperparameter tuning was performed using the TALOS library in Python. The model with the best hyperparameters, in which the 3 convolutional layers had 128, 256, and 128 filters with corresponding kernel sizes of 9, 5, and 3, trained at a learning rate of 0.0001 with a batch size of 64 over 100 epochs, was selected for further evaluation. The input data set was further divided into 10 random splits of training (80%) and validation (20%) to train the FCN model. The weights of the neural network with minimum loss on the validation set were used for subsequent evaluation on the hold-out test data set. Early stopping of the training was implemented when there was no decline in the validation loss for at least 25 epochs. The learning rate was reduced by a factor of 0.01 after 15 epochs of no decline in the validation loss. The primary outcome was classification of each participant into 1 of the 4 structural disease categories on quantitative CT. Supplemental Figure 4 shows the visualization of the FCN training process with the chosen hyperparameters to classify spirometry data into the 4 different structural COPD phenotypes.
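The input preparation and data partitioning described above might look roughly like the sketch below. It assumes simple random shuffling with a fixed seed and uses hypothetical helper names (`pad_flow`, `holdout_split`, `monte_carlo_splits`). How sequences longer than 200 points were handled is not stated in the text, so the sketch raises an error in that case rather than guessing.

```python
import random

TARGET_LENGTH = 200  # points per sequence after zero padding (from the text)


def pad_flow(flow, target_len=TARGET_LENGTH):
    """Append zeros to the end of a flow sequence up to target_len points."""
    if len(flow) > target_len:
        # Handling of longer sequences is not described in the source.
        raise ValueError("sequence is longer than the target length")
    return list(flow) + [0.0] * (target_len - len(flow))


def holdout_split(n_participants, test_fraction=0.2, seed=0):
    """Randomly partition participant indices into input and hold-out test sets."""
    rng = random.Random(seed)
    indices = list(range(n_participants))
    rng.shuffle(indices)
    n_test = round(n_participants * test_fraction)
    return indices[n_test:], indices[:n_test]


def monte_carlo_splits(input_indices, n_splits=10, val_fraction=0.2, seed=1):
    """Draw repeated random training/validation splits of the input set."""
    rng = random.Random(seed)
    splits = []
    for _ in range(n_splits):
        shuffled = list(input_indices)
        rng.shuffle(shuffled)
        n_val = round(len(shuffled) * val_fraction)
        splits.append((shuffled[n_val:], shuffled[:n_val]))
    return splits


# Toy example with 100 hypothetical participants.
input_idx, test_idx = holdout_split(100)
splits = monte_carlo_splits(input_idx)
print(len(input_idx), len(test_idx))         # 80 20
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # 10 64 16
print(pad_flow([4.1, 3.0, 1.2], target_len=5))  # [4.1, 3.0, 1.2, 0.0, 0.0]
```

The hold-out indices are set aside before any of the 10 training/validation splits are drawn, so test participants never contribute to model selection.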
The performance of the FCN was compared with that of an optimized random forest model applied to the same input sequences (parameters were chosen by 5-fold cross-validation, and the model with the minimum out-of-bag error was selected) and with that of the traditional spirometry variables (FEV1/FVC and FEV1 % predicted). Computation of feature importance using SHAP values is described in the Supplemental Methods.
Statistics. AUC analyses were computed to evaluate the accuracy of the FCN and the random forest classifier. Their discriminative accuracies were compared with 2 traditional spirometry measurements (FEV1/FVC and FEV1 % predicted) based on logistic regression. Sensitivity, specificity, Youden index (sensitivity + specificity − 1), and F1 score for structural disease classification were tested for each model (43). The nonparametric DeLong test was used to compare AUCs between the models (44). A 2-tailed P value of < 0.05 was considered significant for all analyses. Analyses were performed using Python ≥ 3.0, R version ≥ 3.6.0 (R Project for Statistical Computing), and MedCalc Statistical Software.
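As a worked illustration of the classification metrics named above, the sketch below computes sensitivity, specificity, the Youden index, and F1 from one-vs-rest confusion counts, and a support-weighted F1 across classes. The function names and the toy labels are hypothetical, and the choice of one-vs-rest counting with support weighting is an assumption about how a weighted F-score would typically be computed for 4 classes; it does not reproduce the study's analysis code.

```python
def binary_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, Youden index, and F1 from confusion counts."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "youden": sensitivity + specificity - 1,
        "f1": f1,
    }


def weighted_f1(y_true, y_pred):
    """Average of per-class (one-vs-rest) F1, weighted by class frequency."""
    total = 0.0
    for label in set(y_true):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        tn = len(y_true) - tp - fp - fn
        support = tp + fn  # number of true members of this class
        total += binary_metrics(tp, fp, fn, tn)["f1"] * support
    return total / len(y_true)


# Toy labels only; these are not study data.
truth = ["normal", "normal", "airway", "airway", "emphysema", "mixed"]
preds = ["normal", "airway", "airway", "airway", "emphysema", "emphysema"]
print(binary_metrics(tp=8, fp=2, fn=2, tn=8))
print(round(weighted_f1(truth, preds), 3))
```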
Study approval. All participants provided written informed consent before enrollment, and the COPDGene study protocol was approved by the University of Alabama at Birmingham Institutional Review Board (IRB) for human use (F070712014). The