Our study aims to refine Seahorse bioenergetic flux analysis using machine learning (ML) and artificial neural network (ANN) models for the metabolic phenotyping of human nucleus pulposus (NP) cells. Our initial modeling efforts concentrated on optimizing key experimental variables, including seeding density and FCCP concentration, as a baseline application of ML to Seahorse data. While these analyses provide useful technical validation, their direct biological relevance is limited. To address this limitation, we developed an approach designed to classify predefined bioenergetic phenotypes: unsupervised clustering was first employed to identify reproducible phenotypes from Seahorse parameters, and an ANN was then trained for supervised classification of these biologically meaningful categories. This strategy enables automatic phenotype detection from metabolic readouts, improving both efficiency and accuracy. By saving time and resources, it enhances metabolic phenotyping of NP cells and supports deeper insights into cellular bioenergetics under healthy and diseased conditions.
Human nucleus pulposus cells were purchased from ScienCell Research Laboratories (#4800, ScienCell, Carlsbad, CA, USA) and cultured in Nucleus Pulposus Cell Medium (NPCM; ScienCell) supplemented with 2% fetal bovine serum (FBS; ScienCell), 1% penicillin/streptomycin (P/S; ScienCell), and nucleus pulposus cell growth supplement (NPCGS; ScienCell) in poly-D-lysine-coated culture plates. The cells were maintained in an incubator with 5% CO₂ at 37 °C. The medium was changed every three days until the cells reached about 70% confluency, and then every other day until they reached about 90% confluency. Cells were subcultured at 90-95% confluency by a mild DPBS rinse, dissociation with 0.05% trypsin/EDTA, neutralization, and replating at 5,000 cells/cm². Cryopreserved cells were thawed rapidly and handled gently to reduce stress, avoiding centrifugation and refreezing. All procedures were performed aseptically in accordance with standard biosafety guidelines.
Cells were seeded on Seahorse XF96 cell culture microplates (10,000, 20,000, 30,000, and 40,000 cells/well) and subjected to metabolic profiling in sextuplicate two days after seeding. For Mito Stress Tests, the growth medium was replaced with Seahorse XF DMEM Base Medium (#103575-100, Agilent Technologies; pH adjusted to 7.4) supplemented with 10 mM D-glucose, 2 mM L-glutamine, and 1 mM sodium pyruvate, and the cells were then incubated for 45 min in a CO₂-free incubator at 37 °C. OCR and ECAR were monitored under basal conditions and after sequential injections of 2.5 µM oligomycin (Oligo) to block the mitochondrial ATP synthase, FCCP (1.25, 2.5, or 5 µM) to uncouple oxidative phosphorylation, and 0.5 µM rotenone plus antimycin A (Rot/AA) to fully inhibit mitochondrial respiration. The experimental groups were defined by cell density and FCCP concentration (Table 1). OCR and ECAR were measured in cycles of 3 min of mixing and 3 min of measurement at 37 °C on a Seahorse XFe96 Analyzer (Agilent Technologies), and the datasets were analyzed in Wave software (Agilent Technologies). OCR and ECAR were expressed as pmol/min and mpH/min, respectively, and normalized to the protein concentration of each well as determined by BCA assay. After the Seahorse run, cells were lysed in RIPA lysis and extraction buffer (#89900, Thermo Fisher Scientific) supplemented with 1× Halt Protease and Phosphatase Inhibitor Cocktail (#78441, Thermo Fisher Scientific), and total protein concentration was measured with the Pierce BCA Protein Assay Kit (Thermo Fisher Scientific) in accordance with the manufacturer's instructions. Protein data were entered into the Wave software, which automatically normalized the OCR and ECAR values to the protein content of each well and reported the results as OCR or ECAR per µg of protein, accounting for variations in cell number between wells.
The input data were normalized to bring all features into a common range, typically [0, 1] or [-1, 1]. This normalization ensures that all features contribute equally to the machine learning model, preventing features with larger ranges from dominating the learning process. We used the widely adopted Min-Max scaling method, which rescales each feature individually based on its minimum and maximum values, as described in Eq. (1):
$$X_{\mathrm{norm}} = \frac{X - X_{\min}}{X_{\max} - X_{\min}}\,(u - l) + l \tag{1}$$

In this context, $X_{\mathrm{norm}}$ represents the normalized data, while $X$ denotes the original data. The symbols $X_{\max}$ and $X_{\min}$ indicate the maximum and minimum values of the original vector, respectively. The variables $u$ and $l$ represent the upper and lower bounds of the new range for the normalized data, with typical values being $u = 1$ and $l = 0$.
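As an illustration, Eq. (1) can be applied in a few lines of MATLAB; here `X`, `u`, and `l` are placeholder names for the feature matrix and range bounds rather than variables from our pipeline:

```matlab
% Minimal sketch of Eq. (1): rescale each feature (column) of X to [l, u].
% X is a placeholder samples-by-features matrix; u = 1 and l = 0 as in the text.
u = 1;  l = 0;
Xmin  = min(X, [], 1);                          % per-feature minima
Xmax  = max(X, [], 1);                          % per-feature maxima
Xnorm = (X - Xmin) ./ (Xmax - Xmin) * (u - l) + l;
% The built-in mapminmax applies the same transform, but row-wise:
% Xnorm = mapminmax(X', l, u)';
```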
We evaluated the initial accuracy of various supervised learning algorithms using the standard workflow for training classification models, or classifiers, in the MATLAB Classification Learner App (Fig. 3a). The algorithms examined included decision trees, discriminant analysis, naive Bayes, support vector machines (SVM), K-nearest neighbors (KNN), and ensemble classifiers. To obtain an accurate estimate of performance and to prevent overfitting, we employed 5-fold cross-validation, a standard method for evaluating a model's overall performance and its ability to generalize. In 5-fold cross-validation, the dataset is divided into five equal parts, known as folds; the model is trained on four folds and tested on the fifth, and this process is repeated five times so that each fold serves as the test set exactly once. The accuracy is then averaged over the five iterations, providing a reliable assessment of the model's effectiveness. Models were trained on an experimental dataset of 144 samples. After training, the app reports each model's validation accuracy, calculated as shown in Eq. (2):
$$\mathrm{Accuracy}\;(\%) = \frac{TP + TN}{TP + TN + FP + FN} \times 100 \tag{2}$$

where $TP$, $TN$, $FP$, and $FN$ denote true positives, true negatives, false positives, and false negatives, respectively; their sum is the total number of instances.
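The 5-fold estimate reported by the app can also be reproduced in script form. The following sketch uses a decision tree as one representative classifier; `Xnorm` and `labels` are assumed placeholder variables for our predictors and class labels:

```matlab
% Minimal sketch of the 5-fold cross-validated accuracy in Eq. (2), shown
% for a decision tree as one representative classifier; Xnorm (predictors)
% and labels (class labels) are placeholders for the 144-sample dataset.
rng(1);                                     % reproducible fold assignment
tree     = fitctree(Xnorm, labels);         % train a classification tree
cvTree   = crossval(tree, 'KFold', 5);      % 5-fold cross-validation
accuracy = (1 - kfoldLoss(cvTree)) * 100;   % Eq. (2) as a percentage
fprintf('5-fold CV accuracy: %.1f%%\n', accuracy);
```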
As part of a thorough assessment, we evaluated the predictive accuracy of various regression algorithms for OCR and ECAR using large numerical datasets. The MATLAB "Regression Learner App" was employed to develop regression models and assess their predictive capabilities through a standard workflow for training regression models (Fig. 1Sa).
The regression techniques explored included linear regression, decision tree regression, SVM regression, ensemble regression, and Gaussian process regression (GPR). To evaluate the effectiveness of these models and compare them, we utilized the model statistics available in the Current Model Summary pane. These statistics were gathered after training the regression models in the Regression Learner App.
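For readers who prefer scripting over the app, the sketch below shows the same train-and-validate pattern for one representative model, Gaussian process regression; `Xnorm` and `y` are assumed placeholders for the predictors and the OCR or ECAR response:

```matlab
% Minimal sketch of scripting one representative regression model
% (Gaussian process regression) with 5-fold cross-validation; Xnorm
% (predictors) and y (OCR or ECAR response) are placeholder variables.
rng(1);                                     % reproducible fold assignment
gpr   = fitrgp(Xnorm, y, 'KernelFunction', 'squaredexponential');
cvGpr = crossval(gpr, 'KFold', 5);          % 5-fold cross-validation
rmse  = sqrt(kfoldLoss(cvGpr));             % kfoldLoss returns the MSE
fprintf('GPR 5-fold RMSE: %.3f\n', rmse);
```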
To prevent overfitting and ensure accurate performance evaluation, we again applied cross-validation, splitting the dataset into distinct folds and calculating the accuracy for each fold to assess the model's ability to generalize. The dataset used for this evaluation consisted of 144 samples. After training the models in the Regression Learner App, we selected the model with the best overall score on the validation data, judged by the R-squared (R²), mean absolute error (MAE), mean squared error (MSE), and root mean square error (RMSE) metrics, as detailed in Eqs. (3)-(6). These metrics indicate how well the trained model is likely to perform on new testing data. The main goal of this assessment was to identify the regression strategy that most accurately predicts OCR and ECAR from the provided dataset, combining model statistics, cross-validation, and performance metrics to make well-informed comparisons among the tested regression models.
$$R^2 = 1 - \frac{\sum_{i=1}^{N}\left(X_i - Y_i\right)^2}{\sum_{i=1}^{N}\left(X_i - \bar{X}\right)^2} \tag{3}$$

$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|X_i - Y_i\right| \tag{4}$$

$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(X_i - Y_i\right)^2 \tag{5}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(X_i - Y_i\right)^2} \tag{6}$$

In these equations, $X_i$, $Y_i$, and $N$ denote the observed (experimental) value, the corresponding anticipated (predicted) value, and the number of experimental data points, respectively, and $\bar{X}$ represents the mean of the observed values.
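For reference, Eqs. (3)-(6) translate directly into MATLAB; in this sketch `X` (observed values) and `Y` (predictions) are placeholder vectors:

```matlab
% Minimal sketch of Eqs. (3)-(6); X holds observed OCR or ECAR values and
% Y the corresponding model predictions (placeholder column vectors).
N    = numel(X);
R2   = 1 - sum((X - Y).^2) / sum((X - mean(X)).^2);  % Eq. (3)
MAE  = sum(abs(X - Y)) / N;                          % Eq. (4)
MSE  = sum((X - Y).^2) / N;                          % Eq. (5)
RMSE = sqrt(MSE);                                    % Eq. (6)
```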
The ANN model was designed with a structured architecture, including an input layer, hidden layers, and an output layer. In this study, the Multilayer Perceptron (MLP) neural network employs the backpropagation algorithm to improve accuracy by minimizing the error between predicted and actual outputs (Fig. 1a). Initially, the model calculates the output using the current network weights and compares it with the actual target, generating an error term. This error, representing the difference between the network's prediction and the true output, is propagated backward from the output layer to the hidden layers. During this backward pass, the algorithm adjusts the network weights to reduce errors. This iterative process continues as the model recalculates the output with updated weights, checks the error, and further refines the weights until the error is minimized or reaches an acceptable level. The backpropagation algorithm optimizes the neural network's weights through repeated iterations, enhancing its predictive accuracy. The ANN structure was developed using MATLAB's Neural Network Toolbox (nntool) to explore the relationship between input factors -- cell density, FCCP dose, and time interval -- and output variables -- oxygen consumption rate and extracellular acidification rate (Fig. 1b). A total of 144 data samples were used to train and test the neural networks; 108 samples (75%) were selected for the training subset and 36 samples (25%) for the test subset. Various network topologies (Supplementary Table 1) were evaluated, featuring one, two, or three hidden layers with different neuron counts per layer. The hyperbolic tangent sigmoid (tansig) activation function Eq. (7) was applied in the hidden layers to introduce non-linearity into the model.
$$\mathrm{tansig}(n) = \frac{2}{1 + e^{-2n}} - 1 \tag{7}$$

where $n$ represents the weighted sum of inputs to a neuron and $\mathrm{tansig}(n)$ denotes the function's output, i.e., the transformed value of the input $n$.
Two independent training techniques, Levenberg-Marquardt (TRAINLM) and Bayesian regularization (TRAINBR) backpropagation, were employed to optimize the network. TRAINLM and TRAINBR utilize distinct formulations for adjusting weights, often incorporating second-order derivative information or regularization terms to prevent overfitting. Both training methods were evaluated using the mean absolute error (MAE) Eq. (4) as the performance metric (Fig. 1c), enabling an effective assessment of the model's predictive accuracy in estimating OCR and ECAR across various experimental scenarios. As illustrated in Fig. 1c, the MLP model with a Network-12 configuration (i.e., 3-12-10-2 structure, with 3 neurons in the input layer, 12 neurons in the first hidden layer, 10 neurons in the second hidden layer, and 2 neurons in the output layer) achieved the lowest MAE of 0.0503. The error for the training and testing data of the proposed ANN model was assessed using mean squared error (MSE) Eq. (5) and root mean square error (RMSE) Eq. (6) (Table 2).
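A script equivalent of this nntool workflow is sketched below under the stated 75/25 split; the variable names `inputs` and `targets` are placeholders, and the layer sizes follow the Network-12 topology:

```matlab
% Minimal sketch of the Network-12 MLP (3-12-10-2) described above;
% inputs (3-by-144: cell density, FCCP dose, time interval) and targets
% (2-by-144: OCR, ECAR) are placeholder matrices.
net = feedforwardnet([12 10], 'trainlm');   % or 'trainbr' for Bayesian regularization
net.layers{1}.transferFcn = 'tansig';       % Eq. (7) in hidden layer 1
net.layers{2}.transferFcn = 'tansig';       % Eq. (7) in hidden layer 2
net.divideParam.trainRatio = 0.75;          % 108 training samples
net.divideParam.valRatio   = 0;
net.divideParam.testRatio  = 0.25;          % 36 test samples
[net, tr] = train(net, inputs, targets);
pred = net(inputs(:, tr.testInd));          % predictions on the test subset
mae  = mean(abs(targets(:, tr.testInd) - pred), 'all');  % Eq. (4)
```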
In our integrated framework, unsupervised clustering for phenotype discovery was combined with supervised ANN classification for accurate phenotype prediction. Seahorse metabolic parameters -- including basal respiration (BR), maximal respiration (MT), ATP production (ATP), baseline OCR (bOCR), baseline ECAR (bECAR), stressed OCR (sOCR), and stressed ECAR (sECAR) -- were normalized using the Z-score transformation Eq. (8) to ensure that each parameter contributed equally to the clustering analysis.
$$Z = \frac{X - \mu}{\sigma} \tag{8}$$

where $X$ represents the raw parameter value, $\mu$ is the mean, and $\sigma$ is the standard deviation across all samples.
To determine the optimal number of clusters (K), we employed three complementary methods: the elbow method to minimize within-cluster sum of squares (WSS), silhouette analysis to evaluate average silhouette values, and the adjusted Rand index (ARI) to assess clustering consistency across four algorithms (K-means, Spectral Clustering, GMM, and DBSCAN). Statistical validation was performed using one-way ANOVA with cluster assignment (phenotype) as the independent variable, followed by Tukey's HSD post-hoc test for pairwise comparisons. Significance was defined as p < 0.05. The three resulting phenotypes were further confirmed through multidimensional scatter plots with convex hulls and correlation pattern analysis. Based on parameter amplitude and their coordinated patterns, clusters were named Energetic (high values, strong associations), Quiescent (low values, atypical patterns), and Intermediate (moderate values).
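A condensed script version of this workflow, shown for K-means as one of the four algorithms, might look as follows; `params` is an assumed placeholder for the samples-by-7 matrix of Seahorse parameters:

```matlab
% Minimal sketch of the clustering workflow for K-means, one of the four
% algorithms compared; params is a placeholder samples-by-7 matrix of the
% BR, MT, ATP, bOCR, bECAR, sOCR, and sECAR values.
Z = zscore(params);                              % Eq. (8), per parameter
wss = zeros(1, 8);
for k = 1:8                                      % elbow method over candidate K
    [~, ~, sumd] = kmeans(Z, k, 'Replicates', 20);
    wss(k) = sum(sumd);                          % within-cluster sum of squares
end
eva = evalclusters(Z, 'kmeans', 'silhouette', 'KList', 2:8);
K   = eva.OptimalK;                              % silhouette-optimal K
idx = kmeans(Z, K, 'Replicates', 20);            % final phenotype assignments
[p, ~, stats] = anova1(Z(:, 1), idx, 'off');     % one-way ANOVA, e.g. for BR
c = multcompare(stats, 'Display', 'off');        % Tukey-Kramer HSD pairwise tests
```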
The ANN classification model was implemented in MATLAB R2019b using the Deep Learning Toolbox. The fully connected feedforward architecture (patternnet) consisted of an input layer of seven neurons (the normalized metabolic parameters), one hidden layer of ten neurons with hyperbolic tangent activation functions, and an output layer of three neurons with a softmax activation function for probabilistic multi-class classification. Stratified random sampling was applied to split the dataset into training (70%), validation (15%), and testing (15%) subsets, preserving the original phenotype distribution. Training was carried out with the scaled conjugate gradient backpropagation algorithm (trainscg) to minimize cross-entropy loss. To reduce overfitting, L2 weight regularization was applied with a coefficient (λ) of 0.001.
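A minimal script along these lines is sketched below; note that the default `dividerand` split is random rather than stratified, so reproducing the stratified 70/15/15 split exactly would require custom indices via `divideind`. `Z` and `phenotype` are placeholders carried over from the clustering step:

```matlab
% Minimal sketch of the classification network described above; Z (z-scored
% samples-by-7 parameter matrix) and phenotype (vector of cluster labels,
% 1-3) are placeholders from the clustering step.
X = Z';                                     % patternnet expects features-by-samples
T = full(ind2vec(phenotype(:)'));           % one-hot targets, 3-by-samples
net = patternnet(10, 'trainscg');           % one hidden layer of 10 neurons
net.layers{1}.transferFcn = 'tansig';       % hyperbolic tangent hidden layer
net.performFcn = 'crossentropy';            % cross-entropy loss (patternnet default)
net.performParam.regularization = 0.001;    % L2 weight penalty (lambda)
net.divideParam.trainRatio = 0.70;          % 70/15/15 split (dividerand is
net.divideParam.valRatio   = 0.15;          % random, not stratified; a
net.divideParam.testRatio  = 0.15;          % stratified split needs divideind)
[net, tr] = train(net, X, T);
pred = vec2ind(net(X(:, tr.testInd)));      % predicted phenotype classes
```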