This function trains and evaluates an XGBoost classification model on cytokine data. It supports hyperparameter tuning and optional cross-validation, and plots feature importance.

Usage

cyt_xgb(
  data,
  group_col,
  train_fraction = 0.7,
  nrounds = 500,
  max_depth = 6,
  learning_rate = 0.1,
  nfold = 5,
  cv = FALSE,
  objective = "multi:softprob",
  early_stopping_rounds = NULL,
  eval_metric = "mlogloss",
  min_split_loss = 0,
  colsample_bytree = 1,
  subsample = 1,
  min_child_weight = 1,
  top_n_features = 10,
  plot_roc = FALSE,
  output_file = NULL,
  progress = NULL
)

Arguments

data

A data frame containing the cytokine data, with one column as the grouping variable and the rest as numerical features.

group_col

A string representing the name of the column with the grouping variable (i.e., the target variable for classification).

train_fraction

A numeric value between 0 and 1 representing the proportion of data to use for training (default is 0.7).

nrounds

An integer specifying the number of boosting rounds (default is 500).

max_depth

An integer specifying the maximum depth of the trees (default is 6).

learning_rate

A numeric value representing the learning rate (default is 0.1).

nfold

An integer specifying the number of folds for cross-validation (default is 5).

cv

A logical value indicating whether to perform cross-validation (default is FALSE).

objective

A string specifying the XGBoost objective function (default is "multi:softprob" for multi-class classification).

early_stopping_rounds

An integer specifying the number of rounds with no improvement after which training stops early; NULL (the default) disables early stopping.

eval_metric

A string specifying the evaluation metric (default is "mlogloss").

min_split_loss

A numeric value for the minimum loss reduction required to make a further partition (default is 0).

colsample_bytree

A numeric value specifying the subsample ratio of columns when constructing each tree (default is 1).

subsample

A numeric value specifying the subsample ratio of the training instances (default is 1).

min_child_weight

A numeric value specifying the minimum sum of instance weight needed in a child (default is 1).

top_n_features

An integer specifying the number of top features to display in the importance plot (default is 10).

plot_roc

A logical value indicating whether to plot the ROC curve and calculate the AUC for binary classification (default is FALSE).

output_file

Optional. A file path to save the outputs as a PDF file. If provided, outputs are written to the file and results are returned invisibly.

progress

Optional. A Shiny Progress object for reporting progress updates.
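
Within a Shiny app, progress can be used to report training status to the UI. A minimal sketch, assuming the code runs inside a Shiny server context and that data_df is prepared as in the Examples below:

# Create a progress reporter and make sure it is closed on exit
progress <- shiny::Progress$new()
on.exit(progress$close())
progress$set(message = "Training XGBoost model", value = 0)

# Pass the reporter through the progress argument
res <- cyt_xgb(data = data_df, group_col = "Group", progress = progress)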

Value

A list containing:

summary_text

A character string summarizing key results (interactive mode only).

model

The trained XGBoost model.

confusion_matrix

The confusion matrix computed on the test set.

importance

The feature importance data for the top features.

class_mapping

A named vector showing the mapping from class labels to numeric values used for training.

cv_results

Cross-validation results if performed (otherwise NULL).

plot

A ggplot object showing the feature importance plot.

roc_plot

The ROC curve plot (included when plot_roc = TRUE and the outcome is binary).
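
The components are plain R objects; for instance, class_mapping can be used to translate the model's numeric predictions back to group labels. A minimal sketch, assuming res holds the result of a call like the one in the Examples below:

res <- cyt_xgb(data = data_df, group_col = "Group")

# Trained booster and test-set confusion matrix
res$model
res$confusion_matrix

# Translate numeric labels (0, 1, ...) back to group names
decode <- setNames(names(res$class_mapping), res$class_mapping)
decode[c("0", "1")]

# Redraw the stored feature importance plot
print(res$plot)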

Examples

# Example usage:
data_df0 <- ExampleData1
# Keep the metadata columns and log2-transform the cytokine measurements
data_df <- data.frame(data_df0[, 1:3], log2(data_df0[, -c(1:3)]))
# Drop the second and third metadata columns, keeping Group
data_df <- data_df[, -c(2, 3)]
# Remove the "ND" group so the comparison is binary (needed for plot_roc)
data_df <- dplyr::filter(data_df, Group != "ND")

cyt_xgb(
  data = data_df, group_col = "Group",
  nrounds = 500, max_depth = 4, learning_rate = 0.05,
  nfold = 5, cv = FALSE, eval_metric = "mlogloss",
  early_stopping_rounds = NULL, top_n_features = 10,
  plot_roc = TRUE
)

#> $summary_text
#> [1] "### XGBOOST RESULTS ###\n\n1) Group -> Numeric Label Mapping:\nPreT2D    T2D \n     0      1 \n\n2) Confusion Matrix on Test Set:\n          Reference\nPrediction  0  1\n         0 25  7\n         1  4 22\n\nTest Accuracy: 0.81 \n\nSensitivity: 0.862069 \nSpecificity: 0.7586207 \n\nAUC: 0.914 \n\n3) Top 10 Important Features:\n          Feature       Gain      Cover  Frequency\n           <char>      <num>      <num>      <num>\n 1:         TNF.A 0.18457678 0.10535233 0.09103448\n 2:         IL.22 0.15117263 0.16064281 0.09885057\n 3:     IL.12.P70 0.09955532 0.12855158 0.12321839\n 4:         IL.33 0.09384522 0.08514973 0.07908046\n 5:         IL.1B 0.07717208 0.03835908 0.05149425\n 6:          IL.9 0.07554566 0.06955229 0.07080460\n 7:         IL.15 0.04705740 0.05853239 0.03310345\n 8:         IL.23 0.04440949 0.02533791 0.02988506\n 9: CCL.20.MIP.3A 0.03678675 0.05623800 0.05287356\n10:         IL.13 0.02785650 0.02633047 0.03678161"
#> 
#> $model
#> ##### xgb.Booster
#> call:
#>   xgboost::xgb.train(params = params, data = dtrain, nrounds = nrounds, 
#>     evals = list(train = dtrain, test = dtest), verbose = FALSE, 
#>     early_stopping_rounds = early_stopping_rounds)
#> # of features: 25 
#> # of rounds:  500 
#> callbacks:
#>    evaluation_log 
#> evaluation_log:
#>   iter train_mlogloss test_mlogloss
#>  <int>          <num>         <num>
#>      1     0.66160263     0.6754284
#>      2     0.63290945     0.6599301
#>    ---            ---           ---
#>    499     0.01423204     0.4264843
#>    500     0.01421502     0.4263731
#> 
#> $confusion_matrix
#>           Reference
#> Prediction  0  1
#>          0 25  7
#>          1  4 22
#> 
#> $importance
#>           Feature       Gain      Cover  Frequency
#>            <char>      <num>      <num>      <num>
#>  1:         TNF.A 0.18457678 0.10535233 0.09103448
#>  2:         IL.22 0.15117263 0.16064281 0.09885057
#>  3:     IL.12.P70 0.09955532 0.12855158 0.12321839
#>  4:         IL.33 0.09384522 0.08514973 0.07908046
#>  5:         IL.1B 0.07717208 0.03835908 0.05149425
#>  6:          IL.9 0.07554566 0.06955229 0.07080460
#>  7:         IL.15 0.04705740 0.05853239 0.03310345
#>  8:         IL.23 0.04440949 0.02533791 0.02988506
#>  9: CCL.20.MIP.3A 0.03678675 0.05623800 0.05287356
#> 10:         IL.13 0.02785650 0.02633047 0.03678161
#> 
#> $class_mapping
#> PreT2D    T2D 
#>      0      1 
#> 
#> $cv_results
#> NULL
#> 
#> $plot
#> (ggplot object: feature importance plot)
#>
#> $roc_plot
#> (ROC curve plot with AUC)
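
Setting cv = TRUE runs nfold-fold cross-validation before the final model is fit, and supplying output_file writes the outputs to a PDF with the results returned invisibly. A variant of the call above as a sketch (the file path is illustrative):

res_cv <- cyt_xgb(
  data = data_df, group_col = "Group",
  nrounds = 500, max_depth = 4, learning_rate = 0.05,
  nfold = 5, cv = TRUE, eval_metric = "mlogloss",
  early_stopping_rounds = 20, top_n_features = 10,
  plot_roc = TRUE, output_file = "cyt_xgb_results.pdf"  # illustrative path
)

# With cv = TRUE, cross-validation results are populated instead of NULL
res_cv$cv_results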