AI-assisted discovery of an ethnicity-influenced driver of cell transformation in esophageal and gastroesophageal junction adenocarcinomas

Although Barrett’s metaplasia of the esophagus (BE) is the only known precursor lesion to esophageal adenocarcinomas (EACs), drivers of cellular transformation in BE remain incompletely understood. We use an artificial intelligence–guided network approach to study EAC initiation and progression. Key predictions are subsequently validated in a human organoid model, in patient-derived biopsy specimens of BE, a case-control study of genomics of BE progression, and in a cross-sectional study of 113 patients with BE and EACs. Our model classified healthy esophagus from BE and BE from EACs in several publicly available gene expression data sets (n = 932 samples). The model confirmed that all EACs must originate from BE and pinpointed a CXCL8/IL8↔neutrophil immune microenvironment as a driver of cellular transformation in EACs and gastroesophageal junction adenocarcinomas. This driver is prominent in White individuals but is notably absent in African Americans (AAs). Network-derived gene signatures, independent signatures of neutrophil processes, CXCL8/IL8 expression, and an absolute neutrophil count (ANC) are associated with risk of progression. SNPs associated with changes in ANC by ethnicity (e.g., benign ethnic neutropenia [BEN]) modify that risk. Findings define a racially influenced immunological basis for cell transformation and suggest that BEN in AAs may be a deterrent to BE→EAC progression.


Supplementary Text i. Creation of a Boolean map of metaplastic progression in the esophagus
We used Boolean Network Explorer (BoNE) 1 to first create a model of progressive gene regulatory events that occur during metaplastic transition (Figure 2A). For model training and development, we used the largest (to our knowledge) well annotated transcriptomic dataset [n = 76: GSE100843 2 ] derived from BE and proximal matched normal mucosa from squamous esophagus from 18 BE patients. Gene expression patterns were first simplified into 'clusters' of genes equivalent to each other (Figure 2A-2). The clusters (nodes) were connected to one another based on the pattern of relationships between the clusters (edges), conforming to one of the six possible Boolean implication relationships (BIRs; Figure 2A-3). These efforts helped chart numerous Boolean paths (Figure 2A, 4-left) within a network with directed edges. Each cluster was then evaluated for whether they belong to the healthy esophagus or diseased side (BE) depending on whether the average gene expression value of a cluster in heathy samples is up or down, respectively. Each path of connected gene clusters indicates a certain hierarchy in gene expression events, which translates to a progressive series of gene down/upregulation events, predicted to occur in sequence during the metaplastic process (Figure 2A, 4-right).
We next introduced in BoNE machine learning that seeks to identify which of the gene clusters (nodes) connected by Boolean implication relationships (edges) are most optimal in distinguishing healthy from diseased samples. BoNE computes a score that naturally orders the samples; this score can be thought of as a continuum of states. A set of two clusters emerged as most robust, which was further refined by an additional filtering step through a second 'training dataset' [GSE39491 3 ; see Supplemental Information 1] which is comprised of BE and normal esophageal matched samples from 43 patients. Both training datasets were analyzed independently throughout the process. The resultant model of metaplastic transition pinpointed a time series of metaplasia (BE)-associated invariant events, in which downregulation of expression of 220 genes (SPINK7-cluster; Figure  2B-C) was invariably associated with a concomitant upregulation of 24 genes (SLC44A4-cluster, Figure 2B-C) in all samples in the training datasets. SPINK7 (serine peptidase inhibitor, kazal type 7), is a key checkpoint in the esophageal keratinocyte stem cell, which regulates mucosal differentiation, barrier function, and inflammatory responses 4 . SLC44A4 encodes a specific high-affinity regulated carrier-mediated uptake system for TPP in human colonocytes, involved in regulation of microbiota-generated thiamine 5 .
Reactome pathway analysis of the upregulated SLC44A4-and downregulated SPINK7-clusters along the path continuum revealed the most important biological processes that they control ( Figure S1). The downregulated pathways ( Figure S1A) were cellular processes that are inherently associated with squamous epithelium, e.g., keratinization, cornified envelope formation, as expected. Other notable changes were TP53 expression and cell-cell adhesion proteins. These findings are consistent with emerging evidence from numerous independent studies which agree that aberrant TP53 IHC highly correlated with TP53 mutation status (90.6% agreement) and was strongly associated with higher risk of neoplastic progression regardless of the presence/absence of dysplasia [6][7][8] . The findings are also in keeping with the reduction observed by IHC in cell adhesion proteins in BE lesions [E-cadherin, P-cadherin and the catenins which serve as adaptor proteins that enable the cadherins to achieve cell adhesion 9 ]. The most notable cellular processes that were upregulated ( Figure S1B) were related to oxygen delivery to the tissue, consistent with reports of Warburg and Crabtree effects in BE tissues 10 .

ii. Currently available animal models of BE→EAC transformation rarely recapitulate human disease
Animal models of diseases have both merits and limitations 11,12 . Because 'mice are not men' 13,14 , especially when it comes to their innate immune system 14,15 , and EACs and GEJ-ACs are associated with a prominent immune signature, we asked how well currently available EAC models recapitulate the human disease ( Figure  S5A). To model how EAC-associated risk factors, i.e., obesity/BMI and IL8-induction enhance cell transformation, mice challenged with high-fat diet (HFD [16][17][18] or overexpressing IL8 18 have been developed. Neither model induce our network-derived BE/EAC signatures ( Figure S5B, row i-iii; Figure S5D-E); nor did they display induction of the neutrophil signatures we observed in human tissues (compare human- Figure 4G with murine- Figure S5C, row i-iii). The signatures were, however, induced in a transgenic interleukin1-β (IL1β)overexpression model (GSE24931; Figure S5B-C; rows iv-vi; Figure S5F-H); in this model, the human IL1β cDNA was inserted downstream of an Epstein-Barr virus (ED-L2) promoter that targets the oral cavity, esophagus, and squamous forestomach 19 . These mice develop chronic inflammation that progresses to BE and EAC; progression was accelerated by exposure to bile acids. Findings show that a combination of inflammation and bile acids, the latter are components of gastroduodenal reflux that has been linked to BE→EAC progression 19,20 . Most importantly, the bile acid-accelerated model (GSE24931) recapitulated the neutrophil processes that were encountered in most human datasets of EACs and GEJ-ACs (Figure 4F').

An AI-assisted study design that uses Boolean approach to build transcriptomic networks
We chose a Boolean approach to building transcriptomic networks 21 because of its ability to pinpoint with precision cellular states in tissues. For example, it helped pinpoint branchpoints in B/T cell differentiation 22,23 , define progenitor cell hierarchy in blood [24][25][26][27][28] , normal and neoplastic cell states in colorectal cancers 29,30 , bladder cancers [31][32][33][34] , and prostate cancers 35,36 , and identify NK cell exhaustive states 37 , universal cell proliferative 38 and macrophage 39 markers, and cell states in the mucosal barrier in IBD 40 . Because the Boolean approach relies on invariant relationships that are conserved despite heterogeneity in the samples used for the analysis, which often represent maximum possible diversity, i.e., the relationships can be thought of as general relationships among pairs of genes across all samples irrespective of their origin (normal or disease), laboratories/cohorts, different perturbations, and sometimes in multiple species including human, mouse and rat, and hence, considered conserved invariants. It is assumed that such 'invariants' are likely to be fundamentally important for any given process.

Barrett's and Esophageal Adenocarcinoma datasets used for network analysis
One microarray dataset (GSE100843; n = 76, 36 Normal esophageal squamous mucosa and 40 Barrett [41][42][43] or European Molecular Biology Laboratory (EMBL) European Bioinformatics Institute (EMBL-EBI) ArrayExpress website 44 . All gene expression datasets (Supplementary Information 1) were processed separately using the Hegemon data analysis framework 29,30,32 . We did not combine datasets that belong to two different platforms. See Supplemental Information 1 for the degree of heterogeneity among samples in the datasets used in this work.

Test cohort selection
Two different test cohorts were used to build the network and perform machine learning for BE: GSE100843 and GSE39491. Both GSE100843 and GSE39491 are microarray datasets that included reasonable number (>= 30) of normal esophageal squamous mucosa and reasonable number (>= 40) of BE samples. Since the number of samples in these cohort are less than 100, which is on the lower side for comprehensive Boolean analysis, network is built on GSE100843 (n = 76) and machine learning is performed on an independent dataset GSE39491 (n = 80) to cover the entire spectrum of gene expression dataset from different microarray platforms. Surprisingly, variation in gene expression were good enough to use our standard BooleanNet statistic (S > 3 and p < 0.1) to identify Boolean Implication relationships with n = 76. Only one cohort was used to build the network and perform machine learning for EAC: E-MTAB-4054. E-MTAB-4054 (n = 63) is the only large RNASeq dataset available that provided high-quality measurements of mRNA extracted from normal, BE and EAC tissue samples. Since all these cohorts have small number of samples, reliability of Boolean analysis is low, and the results need to be supported by large and strong groups of validation datasets.

Boolean Analysis
Boolean logic is a simple mathematic relationship of two values, i.e., high/low, 1/0, or positive/negative. The Boolean analysis of gene expression data requires first the conversion of expression levels into two possible values. The StepMiner algorithm is reused to perform Boolean analysis of gene expression data 21 . The Boolean analysis is a statistical approach which creates binary logical inferences that explain the relationships between phenomena. Boolean analysis is performed to determine the relationship between the expression levels of pairs of genes. The StepMiner algorithm is applied to gene expression levels to convert them into Boolean values (high and low). In this algorithm, first the expression values are sorted from low to high and a rising step function is fitted to the series to identify the threshold. Middle of the step is used as the StepMiner threshold. This threshold is used to convert gene expression values into Boolean values. A noise margin of 2fold change is applied around the threshold to determine intermediate values, and these values are ignored during Boolean analysis. In a scatter plot, there are four possible quadrants based on Boolean values: (low, low), (low, high), (high, low), (high, high).

Invariant Boolean implication relationships
A Boolean implication relationship is observed if any one of the four possible quadrants or two diagonally opposite quadrants are sparsely populated. Based on this rule, there are six different kinds of Boolean implication relationships. Two of them are symmetric: equivalent (corresponding to the highly positively correlated genes), opposite (corresponding to the highly negatively correlated genes). Four of the Boolean relationships are asymmetric, and each corresponds to one sparse quadrant: (low => low), (high => low), (low => high), (high => high). BooleanNet statistics (Equations listed below) is used to assess the sparsity of a quadrant and the significance of the Boolean implication relationships 21,22 . Given a pair of genes A and B, four quadrants are identified by using the StepMiner thresholds on A and B by ignoring the Intermediate values defined by the noise margin of 2-fold change (+/-0.5 around StepMiner threshold). Number of samples in each quadrant are defined as a00, a01, a10, and a11. Total number of samples where gene expression values for A and B are low is computed using following equations.
Total number of samples considered is computed using following equation.

= + + +
Expected number of samples in each quadrant is computed by assuming independence between A and B. For example, expected number of samples in the bottom left quadrant e00 = is computed as probability of A low ((a00 + a01)/total) multiplied by probability of B low ((a00 + a10)/total) multiplied by total number of samples. Following equation is used to compute the expected number of samples.
To check whether a quadrant is sparse, a statistical test for (e00 > a00) or ( > ) is performed by computing S00 and p00 using following equations. A quadrant is considered sparse if S00 is high ( > ) and p00 is small.
A threshold of S00 > sthr and p00 < pthr to check sparse quadrant. A Boolean implication relationship is identified when a sparse quadrant is discovered using following equation.
Boolean Implication = (Sij > sthr, pij < pthr) A relationship is called Boolean equivalent if top-left and bottom-right quadrants are sparse.
Boolean opposite relationships have sparse top-right (a11) and bottom-left (a00) quadrants. A low => B high is discovered if bottom-left (a00) quadrant is sparse and this relationship satisfies following conditions.
A high => B high Boolean implication is established if bottom-right (a10) quadrant is sparse as described below.
Boolean implication A high => B low is found if top-right (a11) quadrant is sparse using following equation.
For each quadrant, a statistic Sij and an error rate pij is computed. Sij > 3 and pij < 0.1 are the thresholds used on the BooleanNet statistics to identify Boolean implication relationships (BIRs). False discovery rate is computed by randomly shuffling each gene and computing the ratio of the number of Boolean implication relationship discovered in the randomized dataset and original dataset. The false discovery rate for BE and EAC dataset was less than 0.001.
Boolean Implication analysis looks for invariant relationship across all the different types of samples regardless of the conditions and treatment protocols. Therefore, it does not distinguish the sample types when discovering Boolean implication relationships. We assume that there are fundamental invariant Boolean implication formula that are satisfied by every sample regardless of their type.

Construction of BE/EAC Boolean Implication Networks
A Boolean implication network (BIN) is created by identifying all significant pairwise Boolean implication relationships (BIRs) 44,45 . The Boolean implication network contains the six possible Boolean relationships between genes in the form of a directed graph with nodes as genes and edges as the Boolean relationship between the genes. The nodes in the BIN are genes and the edges correspond to BIRs. Equivalent and Opposite relationships are denoted by undirected edges and the other four types (low => low; high => low; low => high; high => high) of BIRs are denoted by having a directed edge between them. The network of equivalences seems to follow a scale-free trend; however, other asymmetric relations in the network do not follow scale-free properties. BIR is strong and robust when the sample sizes are usually more than 200 (from our experience of using Boolean Implication for more than 10 years). All our previous papers use thousands of diverse samples to establish Boolean implication relationships. However, Boolean Implication analysis is carried out in such low number of samples such as the selected GSE100843 (n =76) and E-MTAB-4054 (n = 63) datasets. We have demonstrated that we have a reasonable False Discovery Rate (< 0.001) when S > 3 and p < 0.1 are used. Both GSE100843 and E-MTAB-4054 dataset were prepared for Boolean analysis by filtering genes that had a reasonable dynamic range of expression values. When the dynamic range of expression values was small, it was difficult to distinguish if the values were all low or all high or there were some high and some low values. Thus, it was determined to be best to ignore them during Boolean analysis. The filtering step was performed by analyzing the fraction of high and low values identified by the StepMiner algorithm 46 . Any probe set or genes which contained less than 5% of high or low values were dropped from the analysis.

Generation of Clustered Boolean Implication network
Clustering was performed in the Boolean implication network to dramatically reduce the complexity of the network. A clustered Boolean implication network (CBIN) was created by clustering nodes in the original BIN by following the equivalent BIRs. One approach is to build connected components in an undirected graph of Boolean equivalences. However, because of noise, the connected components become internally inconsistent e.g., two genes opposite to each other become part of the same connected component. In addition, the size of clusters became unusually big with almost everything in one cluster.
To avoid such a situation, we need to break the component by removing the weak links. To identify the weakest links, we first computed a minimum spanning tree for the graph and computed the Jaccard similarity coefficient for every edge in this tree. Ideally if two members are part of the same cluster, they should share as many connections as possible. A threshold is considered for the Jaccard similarity coefficient (0.8 for BE network and 0.5 for the EAC network) below which the edges are dropped from further analysis. Thus, many weak equivalences were dropped using the above algorithm leaving the clusters internally consistent. We removed all edges that have Jaccard similarity coefficient less than the selected threshold and built the connected components with the rest. The connected components were used to cluster the BIN which is converted to the nodes of the CBIN. The choice of the threshold on the Jaccard similarity coefficient play an important role in determining the size and the number of clusters as well as whether they are internally consistent. A new graph was built that connected the individual clusters to each other using Boolean relationships. The link between two clusters (A, B) was established by using the top representative node from A that was connected to most of the members of A and sampling 6 nodes from cluster B and identifying the overwhelming majority of BIRs between the nodes from each cluster.
A CBIN was created using the GSE100843 and E-MTAB-4054 datasets. The edges between the clusters represented the Boolean relationships that are color-coded as follows: orange for low => high, dark blue for low => low, green for high => high, red for high => low, light blue for the equivalent and black for the opposite. A subnetwork is selected using low=>low (blue), high => low (red) and opposite (merged with high=>low as red) edges among the top 10 clusters.

Charting Boolean paths
Boolean paths have been explored before to predict the underlying time series events in biological processes such as B cell differentiation 22,23 and early differentiation events in cancer stem cell 29,30,32,35 . This algorithm is called MiDReG (Mining Developmentally Regulated Genes) that uses two seed genes to identify intermediate genes in a biological process. MiDReG infer intermediate states using a sequence of asymmetric BIRs. Here, using MiDReg algorithm/concept to traverse the Boolean Implication network that identifies paths of clusters where the start and end clusters in the clustered Boolean implication network mark the end points of a possible set of events from healthy to disease. The asymmetric BIRs provide a unique dimension to the network that is fundamentally different from any other gene expression networks in the literature. Traversing a set of nodes in a directed graph of the Boolean network constitutes a Boolean path. A simple Boolean path involves two nodes and the directed edge between them. A complex Boolean path involves more than two nodes and the edges between them.

Ordering samples based on composite score of Boolean path
A Boolean path contains one or more clusters. A composite score is computed for each cluster and combined later. To compute the final score, first the genes present in each cluster were normalized and averaged. Gene expression values were normalized according to a modified Z-score approach centered around StepMiner threshold (formula = (expr -SThr)/3/stddev). A weighted linear combination of the averages from the clusters of a Boolean path was used to create a score for each sample. The weights along the path either monotonically increased or decreased to make the sample order consistent with the logical order based on BIR. The samples were ordered based on the final weighted and linearly combined score. A cluster highly expressed in a disease setting received a weight of 1 and healthy setting received a weight of -1.

Summary of genes in the clusters
Reactome pathway analysis of each cluster along the top continuum paths was performed to identify the enriched pathways 47 . The pathway description was used to summarize at a high-level what kind of biological processes are enriched in a particular cluster. These can be accessed in Supplementary Information 2 (for BE) and Supplementary Information 3 (for EAC).

Measurement of classification strength or prediction accuracy
Receiver operating characteristic (ROC) curves were computed by simulating a score based on the ordering of samples that illustrates the diagnostic ability of binary classifier system as its discrimination threshold is varied along with the sample order. The ROC curves were created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The area under the curve (often referred to as simply the AUC) is equal to the probability that a classifier will rank a randomly chosen IBD samples higher than a randomly chosen healthy samples. In addition to ROC AUC, other classification metrics such as accuracy ((TP + TN)/N; TP: True Positive; TN: True Negative; N: Total Number), precision (TP/(TP+FP); FP: False Positive), recall (TP/(TP+FN); FN: False Negative) and f1 (2 * (precision * recall)/(precision + recall)) scores were computed. Precision score represents how many selected items are relevant and recall score represents how many relevant items are selected. Fisher exact test is used to examine the significance of the association (contingency) between two different classification systems (one of them can be ground truth as a reference).

AI guided discovery of Boolean paths
A Boolean path is converted to a path score as mentioned above using a linear combination of normalized gene expression values. The strength of classification of normal and BE/EAC samples using this score is computed by the ROC-AUC measurement. We ranked the clusters based on the ROC-AUC values in the cohort (GSE39491, n = 80, 40 normal, 40 BE) for BE network and E-MTAB-4054 (27 BE vs 17 EAC). Multivariate regression is performed to select the best clusters. The clusters for BE network are filtered by enrichment of differentially expression genes from a recently published BE model (GSE153129). Only two clusters (SPINK7 and SLC44A4) are selected based on this filter: SPINK7 cluster is down-regulated in BE and SLC44A4 cluster is up-regulated in BE.

Training and Validation Datasets
A Boolean path is selected after machine learning to construct a Boolean model. The Boolean model is tested in several human and mouse datasets, each comprised of a heterogeneous collection of samples (as mentioned in Supplementary  Information 1) to demonstrate reproducibility. Selected Boolean path score is computed as mentioned in section "Ordering samples based on composite score of Boolean path". The sample order using the Boolean path score is evaluated using the sample annotation (normal vs BE; BE vs EAC) by ROC-AUC analysis. We tested how the SPINK7-SLC44A4 path score with weight -1, 1 respectively distinguishes normal and BE samples as they are annotated in training datasets (GSE100843, GSE39491), and validation datasets (GSE65013, GSE64894, GSE49292, GSE26886, GSE34619, GSE13083, E-MTAB-4054). Top two EAC network clusters are selected based on the best ROC-AUC values using the training cohort (E-MTAB-4054): IL10RA, LILRB3. Both these clusters are upregulated in EAC compared to BE samples. The IL10RA-LILRB3 path score is tested with weight 1, 1 respectively to distinguish BE vs EAC samples in training dataset (E-MTAB-4054) and validation datasets (GSE26886, GSE37200, GSE77563). The EAC score is also tested to distinguish normal vs EAC+GEJAC in GSE74553 cohort. We have collected publicly available gene expression datasets derived from mouse models of BE/EAC (GSE24931, GSE103616, GSE158116; Supplementary Data 1) to test whether human Boolean models perform well in mice. The gene name conversion from human to the mouse is performed using human genome GRCh38.95 ensembl IDs and mapping data exported from ensemble BioMart web-interface.  Outcome studies:

Patient cohort
We retrospectively analyzed a previously assembled dataset 49 of patients with biopsies reporting BE between 2013 and 2017 and with a complete blood count within 6 months from the endoscopy, as well as patients with esophageal adenocarcinoma (EAC). Cases (n = 113) were classified as non-dysplastic BE (NDBE, n = 72), dysplastic BE (DBE, n = 11) and EAC (n = 30).
Briefly, to enroll the patients with BE, medical records of all patients undergoing upper endoscopy at a tertiary care center in Brazil (Hospital de Clínicas dePorto Alegre) between January 2013 and September 2017 who had columnar epithelium visualized in the distal esophagus were retrospectively analyzed. All endoscopic examinations that fulfilled criteria were included, even though belonging to the same patient in a different period of the surveillance. Exclusion criteria were absence of confirmed intestinal metaplasia on histology (goblet cells on Alcian-Blue staining), immunosuppression by drugs or chronic diseases, active or recent (< 6 months) infectious disease, history of cancer or any hematological or autoimmune disease, and previous surgery of the gastrointestinal tract (except fundoplication for GERD). To be considered eligible, cases also must have had a complete blood count (CBC) collected between the period of 6 months before and 6 months after endoscopy and necessarily outside of a context of clinical emergency (for example, on the emergency room, for any reason) or invasive procedures (such as surgery or esophageal dilation). Those criteria were assessed through a thorough exam of electronic medical records. In the presence of more than one eligible CBC, the one closest to the day of the endoscopy was selected. If an included patient had any endoscopy performed before 2013, those exams were also analyzed for inclusion according to the criteria stated above.
Patients with EAC were selected from hospital discharge diagnostic records between January 2005 and December 2017 if they had the following codes, according to the international classification of diseases (ICD-10; C15: malignant neoplasm of the esophagus): C15.2 (abdominal esophagus), C15.5 (lower third), C15.8 (overlapping sites), and C15.9 (unspecified). Squamous cell carcinoma and gastroesophageal junction (GEJ) tumor types 2 or 3 of Siewert classification14 were excluded. Same exclusion criteria used for BE cases were applied, except the need of confirmed intestinal metaplasia.
Demographic, clinical, and endoscopic data collected (as published before 49 ). Data collected from endoscopic histopathological reports were histologic diagnosis and adherence to Seattle Protocol, that was evaluated comparing endoscopic descriptions with pathology reports (adherence was considered if sent to pathology separated samples with at least 4 biopsy fragments for every 2 cm of columnar epithelia). Histopathological diagnosis of intestinal metaplasia, dysplasia, and adenocarcinoma was carried out by expert pathologists in the institution in a clinical routine fashion.
All cases were classified into three groups based on histopathological diagnosis: non-dysplastic BE (NDBE), dysplastic BE (DBE)-with either low-or high-grade dysplasia-and esophageal adenocarcinoma (EAC

Univariate and multivariate analyses
Prediction of progression from NDBE to DBE to EAC is analyzed using univariate and multivariate analyses based on several clinical parameters such as SEX, ALC, ANC, and PLAT. Univariate and multivariate analyses is performed using Ordinary Least Squares regression (OLS) python statsmodels (version 0.12.2) package.

Survival analysis in TCGA-EASC and TCGA-STAD dataset
TCGA-EASC and TCGA-STAD datasets were used to study the relationship between the composite expression score of different genes (or single CXCL8 gene) from various signatures and patient clinical outcomes for esophageal adenocarcinoma (EAC), esophageal squamous cell carcinomas (ESCC) and gastric adenocarcinomas (GC). We StepMiner tool was used to compare the predictive value of several gene combinations in patients with low and high expressions. Patients were divided into two groups, high vs. low expression, based on the StepMiner threshold +/-noise margin of gene expression or the composite expression score. Noise margin of a composite signature is estimated by the square root of the sum of squares of the scaled down (1/3/stddev) version of the individual noise margin of the gene which is 0.5 based on the BooleanNet approach. The OS Kaplan-Meier plots are presented with the log-rank p-value. For the composite signature scores of CD16 (FCGR3A and FCGR3B) in EAC, threshold was computed by using StepMiner twice, first on the whole scores and then using all the values lower than the first StepMiner threshold. For ESCC, threshold for composite score of CD16 (FCGR3A and FCGR3B) was computed by using StepMiner twice, first on the whole scores and then using all the values greater than the first StepMiner threshold.

Patient samples
Whole-genome sequencing data for 320 esophageal biopsies from 80 patients with Barrett's esophagus (BE) were analyzed (de-identified data publicly available, dbGaP Study Accession: phs001912 and phs001654). Longitudinal data from patients were collected as part of a case-control study design performed at the Fred Hutchinson Cancer Research Center. Demographics for patient cohort have been previously described 50 . Progressors (cases) were defined as patients who progressed to esophageal adenocarcinoma (EAC) during surveillance (n=40 total) in the Seattle Barrett's Esophagus Program. Non-progressors (controls) were defined as those who did not progress to EAC throughout long-term follow-up (n=40 total).

Mutational analysis
After downloading the BE genomics data, quality control checks were performed for each sample to remove any sequencing files that do not meet our stringent criteria. Briefly, samples were greylisted when: (i) FASTQ file(s) failed one or more of the criteria from FastQC1; (ii) a tumor-normal pair had less than 95% concordance as estimated by Conpair2; (iii) samples exhibited low tumor purity, as estimated by ascatNgs3; (iv) samples exhibited low sequencing coverage for either the cancer or the normal tissue. Additionally, after mutation calling was completed, any sample with more than 10% of mutations attributed to signatures of known sequencing artifacts was greylisted. Only two of the original BE samples failed these quality control checks. Note that while somatic mutations and germline variants were identified in all greylisted samples, these samples were not used for any subsequent analyses. Somatic mutations and germline variants were identified using our ensemble variant calling pipeline. Specifically, FASTQ files were aligned to human genome build GRCh38.d1.vd1 using BWA-MEM5. Duplicate reads were marked using Picard MarkDuplicate6. The alignment was further refined through local indel realignment and base score recalibration using the GATK toolkit7,8. Next, the paired cancer and normal BAM files were subjected to somatic mutation calling analysis using four state-of-the-art computational tools. More specifically, single base substitutions (SBSs) and small insertions and deletions (indels) were identified independently using Mutect29, VarScan210, Strelka211, and MuSE12. Germline single-nucleotide polymorphisms (SNPs) were detected using VarScan2. Copy-number alterations were identified using ascatNgs3. Somatic mutations marked as "PASS" by Mutect2 and Strelka2 were filtered based on their mutation confidence scores: TLOD score >= 10 (Mutect2) and SomaticEVS >= 15 (Strelka2). Any somatic mutations found in 2 or more of the 4 variant callers were considered as high confidence mutations; this optimum combination strategy is based on the previous experience from pan-cancer analysis performed by TCGA MC313 and ICGC PCAWG14. Subsequent filters were used to produce the final set of mutations for downstream analysis. The following filters addressed most artifact mutations introduced during sample preparation and/or sequencing: (i) DKFZbiasFiter15 removed SBSs with biased variant read support, which can be a result of unbalanced PCR amplification, biased reading during sequencing, and DNA shearing leading to 8-oxoguanine lesions during sample preparation16; (ii) panel of normal (PON) filter, generated using Mutect2, removed additional technical artifacts recurrently appearing in genomic locations of normal samples; (iii) variant allele frequency threshold was applied to all SBSs and indels to remove any low frequency mutations, which were commonly introduced as part of the DNA sequencing process17. Overall, this analysis generated a catalog of harmonized bona fide DNA somatic mutations and germline variants in each BE sample.
For downstream analyses, we further filtered for somatic mutations from this catalog that were found by all 4 callers described above. For each patient, we then considered the set of unique somatic mutations across their 4 samples from 2 timepoints (including 2 sampling locations of the upper esophagus and lower esophagus regions at each timepoint).

Statistical analysis
To calculate the number of occurrences of mutations in the cluster of genes identified from the BE→EAC map (Figure S8A), we cycled through merged VCF files from EAC Progressors (n=40) and summed the total number of mutations in each patient across the genes in each cluster as shown in blue boxplots. The same calculations were performed for nonprogressor patients (n=40) and total mutations are shown in red boxplots. P-values from two-sided Mann-Whitney U-tests are reported above each plot comparing Progressor and Non-progressor patients for each cluster (statistical tests performed using scipy.stats.mannwhitneyu in Python v3.9.7).
To calculate the number of occurrences of mutations in neutrophil function associated genes, we cycled through the merged VCF files from patients who were EAC Progressors (n=40) and summed the total count for each gene. These totals by gene were then compared to totals counted in the same way within the non-progressor patients (n=40), see Figure S8B.

Immunohistochemistry of patient esophageal samples
Formalin-fixed, paraffin-embedded (FFPE) tissue sections of 4 µm thickness were cut and placed on glass slides coated with poly-L-lysine, followed by deparaffinization and hydration. Heat-induced epitope retrieval was performed using tris EDTA (pH 9.0) in a pressure cooker using a rolling boil for 15 minutes. Tissue sections were incubated with 3% hydrogen peroxidase for 5 minutes to block endogenous peroxidase activity, followed by a one-hour incubation with 2.5% horse serum. Slides were then incubated with primary antibodies for 1.5 hours in a humidified chamber at room temperature. After primary incubation, slides were incubated with secondary antibodies (horse, anti-rabbit) for 30 minutes at room temperature and washed. Antibodies used for immunostaining; SUPT6H [1:250], TP63 [1:500]. Immunostaining was visualized using 3,3′-diaminobenzidine chromogen and counterstained with Mayer's Hematoxylin. Samples were quantitatively analyzed and scored based on the presence (positive) or absence (negative) of staining.

IHC Quantification
IHC images were randomly sampled at different 300x300 pixel regions of interest (ROI). The ROIs were analyzed using IHC Profiler 51 . IHC Profiler uses a spectral deconvolution method of DAB/hematoxylin color spectra by using optimized optical density vectors of the color deconvolution plugin for proper separation of the DAB color spectra. The histogram of the DAB intensity was divided into 4 zones: high positive (0 to 60), positive (61 to 120), low positive (121 to 180) and negative (181 to 235). High positive, positive, and low positive percentages were combined to compute the final percentage positive for each region of interest (ROI). The range of values for the percent positive is compared among different experimental groups. Data is displayed as percent positive stain.

Statistical analysis
Statistical significance between experimental groups was determined using two-tailed Mann-Whitney test (IHC score). For all tests, a p-value of 0.05 was used as the cutoff to determine significance. All experiments were repeated a least three times, and p-values are indicated in each figure. All statistical analysis was performed using GraphPad prism 8.4.3. C. Schematic summarizing the key findings in gene expression and epithelial morphology observed and reported earlier 52 , upon depletion of SPT6 in keratinocyte stem cells by siRNA 52 . While control keratinocytes formed stratified squamous epithelium, siRNA mediated transient depletion of SPT6 in keratinocytes (SPT6i) grew as 'intestine-like' monolayers. RNA seq studies on those monolayers confirmed that this model recapitulates metaplastic gene signatures in BE, and not normal gut differentiation that is observed in the healthy gut lining.

D.
Gene clusters from the BE-map were analyzed for overlap with those affected in the organoid BE model by loss or gain of access due to chromatin remodeling upon SPT6 loss, as identified by ATAC Seq. Only significant p values, as determined using hypergeometric analyses, are displayed.    Top: Correlation tests were calculated in GSE159721 (123 pairs of normal GEJ and GEJ-ACs) and displayed as scatter plots using python seaborn lmplots with the p-values between the 18-gene TIS signature on the Y axis and EAC signature on the X axis (left), neutrophil degranulation signature on the Y axis and the EAC signature on the X axis (middle) and the 18-gene TIS signature on the Y axis and neutrophil degranulation signature in the X axis (right).
Bottom: "r" values from correlation tests calculated on numerous independent cohorts are displayed as heatmap.