Abstract Text: T-cells function as a computational nexus in the immune system: they detect antigens from all major pathogens and, based on environmental cues and prior experiences, execute gene expression programs (GEPs) to carry out adaptive functions (e.g., proliferation, cytotoxicity, cytokine production). Painstaking flow cytometric analysis in controlled models has defined core, mutually exclusive response programs – canonically Th1, Th2, and Th17 – that helper T-cells enact upon encountering antigen. However, dozens of contemporary scRNA-Seq experiments have revealed a continuum of T-cell states without distinct clusters corresponding to these subtypes.
Here, we resolve this discrepancy by advancing an alternative analysis strategy, consensus non-negative matrix factorization (cNMF), which learns GEPs from scRNA-Seq data and represents each cell as a mixture of programs. Applying cNMF to seven datasets spanning over 2,000,000 T-cells from 700 individuals across tissues and diseases, we identify 60 GEPs, including 50 that are reproducible across two or more datasets and many that are associated with diseases including COVID-19 and cancer. We discover GEPs reflecting the core known functions of T-cells, including proliferation, cytotoxicity, exhaustion, and Th1/Th2/Th17 effector states, as well as ten novel programs. Simultaneously quantifying the activities of all active GEPs in each T-cell, rather than only the single strongest one (as in hard clustering), better reveals the developmental lineages and activation states that underlie a cell’s expression profile. We provide our GEP catalog and software – T-Cell AnnoTator (TCAT) – to infer GEP activities in new datasets. Our approach generalizes to all cell types and provides quantitative measures of GEP activity for applications including eQTL analysis and disease associations.
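To make the contrast between mixture-based and hard-clustering views concrete, the following is a minimal sketch, not the cNMF or TCAT implementation. It assumes a small, already-known GEP matrix `H` (programs by genes) and infers non-negative per-cell usages `W` such that `X ≈ W @ H`, using standard multiplicative NMF updates with `H` held fixed. The function name `fit_usage`, the toy matrices, and the choice of update rule are all illustrative assumptions; the toy programs are deliberately non-overlapping so the result is easy to check by eye.

```python
import numpy as np

def fit_usage(X, H, n_iter=200, eps=1e-12):
    """Infer non-negative usages W with X ~ W @ H, keeping the programs H fixed.

    Uses the standard multiplicative update for the Frobenius-norm NMF
    objective, applied to W only (hypothetical helper, for illustration).
    """
    rng = np.random.default_rng(0)
    W = rng.random((X.shape[0], H.shape[0])) + 0.1  # strictly positive start
    HHt = H @ H.T
    XHt = X @ H.T
    for _ in range(n_iter):
        W *= XHt / (W @ HHt + eps)
    return W

# Toy GEP matrix: 3 programs x 6 genes (assumed, non-overlapping for clarity).
H = np.array([
    [5.0, 5.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 5.0, 5.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 5.0, 5.0],
])

# Three toy cells: one pure state and two mixtures of programs.
true_W = np.array([
    [1.0, 0.0, 0.0],
    [0.6, 0.4, 0.0],
    [0.2, 0.3, 0.5],
])
X = true_W @ H  # noiseless expression matrix (cells x genes)

W = fit_usage(X, H)
usage = W / W.sum(axis=1, keepdims=True)  # each cell as a mixture summing to 1
hard = usage.argmax(axis=1)               # hard-clustering view: strongest GEP only

print(np.round(usage, 3))
print(hard)
```

In this toy case the second cell is assigned to program 0 under the hard-clustering view, which discards the 40% contribution of program 1 that the mixture representation retains. Real data would be noisy and the programs would share genes, so the recovered usages would only approximate the underlying mixture.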