Basic SVM using simulated multiplex cytokine data

The goal of this document is to provide a basic structure for applying Support Vector Machine (SVM) algorithms to Luminex multiplex data generated on the xMAP system. This R Markdown document is designed to be scalable; the only requirement is that the data is clean and has a category column for training. The data is divided into training and testing sets. Once the SVM is trained, it is applied to the testing set to obtain basic performance statistics and a confusion matrix. If the standard svmLinear kernel is acceptable, the trained model can be saved and used to predict new samples. This is meant to be a very basic example, and the data may not be realistic.

Document summary:

Basic SVM Modeling in R for use with Luminex data
Written by: Eric W. Olle
For use with Luminex generated data
CC-BY-SA license (For the RMD document)
No warranty provided.
Created: February 2022
Edited: Aug. 14, 2022

GitHub repository for the R Markdown document and the data. The Rmd document text may differ slightly from the blog.

NOTE: Uploaded using knitr and RWordPress packages.

Loading Packages

The following document is meant to be a basic overview of how to use SVM to categorize two or more groups (e.g., normal vs. disease) or different sub-types of a similar disease based upon a standard Luminex 10-plex panel. The approach can be applied to larger panels and different data sets. This is just meant to be a very basic “off the shelf” solution for a standard problem. There are several excellent online resources such as DataCamp, as well as some excellent books dedicated to machine learning. The primary references used here were Pattern Recognition and Machine Learning by Christopher M. Bishop (2006) and Mastering Machine Learning with R by Cory Lesmeister, along with the R package manuals. For additional information on the R code, the appropriate package manuals and vignettes were used.

library(tidyverse)  # Loading the individual packages (dplyr, tidyr, ggplot2) also works
library(kernlab)    # SVM implementations used by caret's svmLinear
library(caret)      # Data splitting, training control, and the confusion matrix

For more information on tidyverse, kernlab, or caret, click the package name and a new window will open with the CRAN or package-specific website.

Loading data

In this part the data is loaded from a CSV, Excel, or RDS file. For the purpose of teaching, a “manufactured” data set was created to be similar to a standard human 10-plex assay. The data is not meant to be realistic and is used for teaching purposes only. This generated data has two categories of samples: control and a “made up” autoimmune disease (AID).

In general there are three steps. First, the data is loaded from an appropriate format. Second, any values that are not appropriate (e.g., a negative cytokine concentration) are replaced with NA. Third, a clean data set is created by removing any rows with NA values, which can skew the results. Once clean, the data set can be visualized and then divided into training and testing sub-groups for Support Vector Machine analysis.
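
As a compact sketch, the three steps can also be chained in a single dplyr pipeline. The miniature data frame below is made up purely for illustration (it is not the real assay data, which is loaded in the next section):

```r
library(dplyr)

# Hypothetical two-analyte example, NOT the real assay data: one row holds
# an impossible negative concentration that should be flagged and dropped.
raw <- data.frame(IL_6     = c(302.1, 120.4, 298.5),
                  TNF_a    = c(295.3, -54.1, 114.9),
                  category = c("control", "control", "control"))

clean <- raw %>%
    mutate(across(where(is.numeric), ~ replace(.x, .x < 0, NA))) %>%  # step 2: invalid -> NA
    na.omit()                                                         # step 3: drop incomplete rows

nrow(clean)  # 2 -- the row with the negative TNF_a value is gone
```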

This section verifies that the data set loaded by checking the first six records. Notice that there are records with negative values; this data needs to be cleaned using dplyr (part of the tidyverse).

# NOTE: The file type can also be Excel or CSV; it just requires a different read command.

dataset_raw <- readRDS(file = "ml_dataset_raw")

# The following is for other types of files.  Will need to uncomment and correctly configure.
#-----------------------------------------------------------------------------#
# For CSV: in RStudio, use the Import Dataset tool, verify, and copy the command.

# dataset_raw <- read.csv(<file>, header = TRUE, sep = ",", quote = "\"", dec = ".")

# For excel files (xls and xlsx)
#library(readxl)
#read_excel(<path>) 
#-----------------------------------------------------------------------------#

head(dataset_raw)
## # A tibble: 6 × 11
##   GM_CSF IFN_g IL_1b  IL_2  IL_4  IL_5  IL_6  IL_8 IL_10  TNF_a category
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl> <chr>   
## 1   63.1  9.02  6.93  24.7  3.96  42.7  302.  52.3 55.6  295.   control 
## 2   58.9  7.04  8.55  34.5  3.96  61.6  223.  71.2  8.64 128.   control 
## 3   56.2 13.6   6.80  24.4  4.80  50.4  120.  65.2 47.7  -54.1  control 
## 4   46.9  8.83  7.86  35.9  3.00  42.4  298.  68.7 38.9  114.   control 
## 5   40.2  9.70  8.47  18.9  3.12  72.9  103.  25.9 25.4    7.30 control 
## 6   52.2 11.1   6.26  40.5  3.98  56.7  153.  68.4 26.0  127.   control

The above output shows that the data contains some negative numbers and that the category column is not a factor. These will be corrected in the following sections. Please note that the column names use an underscore instead of a dash and have had any special characters (e.g., gamma) removed; this makes future referencing easier. For an example of a negative number, check the TNF_a column.
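
The renaming itself can be done programmatically rather than by hand; a minimal sketch (the messy headers below are hypothetical examples of what instrument software often exports):

```r
# Hypothetical export-style headers with dashes; check.names = FALSE keeps them as-is
messy <- data.frame(`IFN-g` = 1, `TNF-a` = 2, check.names = FALSE)

# Replace any run of non-alphanumeric characters with a single underscore
names(messy) <- gsub("[^A-Za-z0-9]+", "_", names(messy))

names(messy)  # "IFN_g" "TNF_a"
```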

dataset_raw <- dataset_raw %>%
    replace(., dataset_raw < 0, NA)

# Looking at the number of rows in the data set prior to NA removal
nrow(dataset_raw)

# Filtering out rows that contain NA
svm_dataset <- dataset_raw %>%
    na.exclude()

# Looking at the percent remaining after NA removal
nrow(svm_dataset) / nrow(dataset_raw)

head(svm_dataset)
## # A tibble: 6 × 11
##   GM_CSF IFN_g IL_1b  IL_2  IL_4  IL_5  IL_6  IL_8 IL_10  TNF_a category
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl> <chr>   
## 1   63.1  9.02  6.93  24.7  3.96  42.7 302.   52.3 55.6  295.   control 
## 2   58.9  7.04  8.55  34.5  3.96  61.6 223.   71.2  8.64 128.   control 
## 3   46.9  8.83  7.86  35.9  3.00  42.4 298.   68.7 38.9  114.   control 
## 4   40.2  9.70  8.47  18.9  3.12  72.9 103.   25.9 25.4    7.30 control 
## 5   52.2 11.1   6.26  40.5  3.98  56.7 153.   68.4 26.0  127.   control 
## 6   56.4 11.4   7.44  10.1  2.62  83.7  17.3  54.1  7.08 225.   control

The above data set has had the negative numbers converted to NAs; the same approach works for any value that is not appropriate for analysis. The data set is then converted into the SVM data set for training and testing by removing the NAs that may affect the algorithm. The SVM data set retains 76.5% of the raw data set.

Concluding thoughts on data set cleaning: a clean data set is required to train any machine learning or AI model. However, be careful not to “over clean” the data to fit potential individual or group biases. Do just enough, such as removing missing values; with very large data sets, consider removing outliers as well. Be very careful using outlier removal if the data are not normally distributed or if the sample size is small.
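
If outlier removal is warranted, one common (but assumption-laden) screen is the 1.5 × IQR rule; the vector below is a made-up single-analyte example:

```r
# Hypothetical concentrations for one analyte, with one extreme value
x <- c(42, 48, 51, 55, 61, 480)

q    <- quantile(x, c(0.25, 0.75))
iqr  <- diff(q)
keep <- x >= q[1] - 1.5 * iqr & x <= q[2] + 1.5 * iqr

x[keep]  # 42 48 51 55 61 -- the 480 falls outside the fences
```

As noted above, apply this per group and only after checking the distribution shape; the fences assume roughly symmetric data.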

Once the data is clean/“tidy”, the category column needs to be converted to a factor. In this example there are two categories: control and AID (autoimmune disease). Remember, this is “fake” data, so the categories can be given any name.

svm_dataset$category <- as.factor(svm_dataset$category)

head(svm_dataset)
## # A tibble: 6 × 11
##   GM_CSF IFN_g IL_1b  IL_2  IL_4  IL_5  IL_6  IL_8 IL_10  TNF_a category
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl> <fct>   
## 1   63.1  9.02  6.93  24.7  3.96  42.7 302.   52.3 55.6  295.   control 
## 2   58.9  7.04  8.55  34.5  3.96  61.6 223.   71.2  8.64 128.   control 
## 3   46.9  8.83  7.86  35.9  3.00  42.4 298.   68.7 38.9  114.   control 
## 4   40.2  9.70  8.47  18.9  3.12  72.9 103.   25.9 25.4    7.30 control 
## 5   52.2 11.1   6.26  40.5  3.98  56.7 153.   68.4 26.0  127.   control 
## 6   56.4 11.4   7.44  10.1  2.62  83.7  17.3  54.1  7.08 225.   control

The next step is to perform a basic visual inspection of the data, using a box plot to show the overall distribution and outliers. Remember, this data set was generated from random normal variables, so apparent “outliers” are just part of the model.

# Need to gather the data from column format to a long format for use
# with ggplot.  This is why the gather() is used with the SVM data
# set.

ggplot(gather(svm_dataset,
               key = "Analyte" ,
               value = "Concentration",
              1:10), aes(x = factor(Analyte),
                         y = Concentration,
                         fill = category)) +
    geom_boxplot() +
    labs(title = "Box Plot of Standard 10-plex",
         subtitle = "Data set = Random Generated.",
         fill = "Category") +
    xlab("Cytokine") +
    ylab("Concentration")
Figure 1. Initial data set used for training and testing of the SVM model.

Creating the SVM model

First, divide the data into training and testing sets. THIS IS ABSOLUTELY REQUIRED. Using the same data to train and test can lead to over-fitting and/or overestimating the model’s fit. See the caret package for more information on dividing data into training and validation sets.

# NOTE: using the caret package to make this easier.
# NOTE: p (the proportion placed in training) can be changed between 0.6 and 0.8 at user discretion.

training <- caret::createDataPartition(y = svm_dataset$category,
                                       p = 0.70,
                                       list = FALSE)



svm_training <- svm_dataset[ training,]
svm_testing  <- svm_dataset[-training,]

The next step is a very basic SVM model with a linear kernel. This uses kernlab (via caret) and allows for easier repeats using a training control. See the kernlab package for more information. In this example, 10-fold cross-validation with three repeats is performed.

trnctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)

svm  <- train(category ~., data = svm_training, method = "svmLinear",
                    trControl = trnctrl,
                    preProcess =  c("center", "scale"),
                    tuneLength = 10)
print(svm)
## Support Vector Machines with Linear Kernel 
## 
## 215 samples
##  10 predictor
##   2 classes: 'AID', 'control' 
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 193, 193, 194, 193, 193, 193, ... 
## Resampling results:
## 
##   Accuracy  Kappa
##   1         1    
## 
## Tuning parameter 'C' was held constant at a value of 1

The above shows how the SVM was trained. The next step is to test it against the testing partition.

svm_pred <- predict(svm , svm_testing)

## Another plus of using caret is the easier confusion matrix generation

confusionMatrix(table(svm_pred, svm_testing$category))
## Confusion Matrix and Statistics
## 
##          
## svm_pred  AID control
##   AID      41       0
##   control   0      50
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9603, 1)
##     No Information Rate : 0.5495     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0000     
##             Specificity : 1.0000     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 1.0000     
##              Prevalence : 0.4505     
##          Detection Rate : 0.4505     
##    Detection Prevalence : 0.4505     
##       Balanced Accuracy : 1.0000     
##                                      
##        'Positive' Class : AID        
## 

The confusion matrix and its statistics show how the model performs. This was a simple model; in more complex models, missed calls are common and will be shown in the confusion matrix and SVM statistics.
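
If a linear boundary does miss calls, caret makes it easy to swap in a nonlinear kernel by changing the method string. This sketch (an illustration, not part of the original analysis) reuses the trnctrl, svm_training, and svm_testing objects from above:

```r
# Radial basis function (RBF) kernel via kernlab, driven by caret.
# tuneLength lets caret search a grid over the cost parameter.
svm_rbf <- train(category ~ ., data = svm_training, method = "svmRadial",
                 trControl  = trnctrl,
                 preProcess = c("center", "scale"),
                 tuneLength = 10)

confusionMatrix(table(predict(svm_rbf, svm_testing), svm_testing$category))
```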

Using the SVM model to predict the category

In real-world scenarios, the SVM model would be saved, and this next section would be a separate R Markdown document (or R script) used just for prediction. For ease of teaching, prediction is shown as part of the same R Markdown file.
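
Saving and reloading the trained model is one line each way with RDS files; the file name below is illustrative, not from the original analysis:

```r
# Persist the trained caret model object (file name is an example)
saveRDS(svm, file = "svm_linear_10plex.rds")

# ...later, in the separate prediction script or Rmd:
svm_model <- readRDS("svm_linear_10plex.rds")
# predict(svm_model, new_samples)  # new_samples needs the same ten analyte columns
```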

A new data set was created with a different random generation seed. It has the same underlying distributions as the original data but was created separately. This set kept the category column, but it will be “blinded” before testing. The data was cleaned prior to saving to make this section easier. The data is blinded (i.e., the category column removed) prior to adding the SVM prediction column. The category column could instead be kept and a secondary confusion matrix generated using data never seen by the initial model.

#loading the data set
new_data <- readRDS("ml_predict_set")
#blinding the data 
blind_new_data <- select(new_data, -category)
#predicting the category
new_data_predict <- predict(svm,  blind_new_data)

# column binding the new data predictions 

full_new_data <- cbind(blind_new_data, Prediction = new_data_predict)

## Looking at the full table

knitr::kable(full_new_data, digits = 2)
| GM_CSF | IFN_g | IL_1b | IL_2 | IL_4 | IL_5 | IL_6 | IL_8 | IL_10 | TNF_a | Prediction |
|---|---|---|---|---|---|---|---|---|---|---|
| 45.15 | 10.73 | 9.64 | 27.26 | 4.63 | 22.77 | 103.30 | 5.41 | 32.28 | 150.85 | control |
| 56.76 | 9.60 | 7.86 | 20.98 | 3.06 | 48.04 | 153.07 | 49.66 | 45.54 | 199.97 | control |
| 39.00 | 14.66 | 11.63 | 14.93 | 3.79 | 68.88 | 185.20 | 108.02 | 67.67 | 175.95 | control |
| 48.61 | 13.06 | 8.58 | 34.99 | 4.79 | 20.66 | 164.18 | 103.51 | 33.07 | 231.79 | control |
| 90.66 | 9.54 | 376.32 | 168.62 | 3.87 | 49.44 | 6.69 | 1253.82 | 55.31 | 1480.62 | AID |
| 47.68 | 10.04 | 9.70 | 41.41 | 2.60 | 83.00 | 144.63 | 102.30 | 21.07 | 226.53 | control |
| 47.02 | 22.77 | 826.63 | 540.84 | 6.00 | 23.17 | 18.70 | 735.88 | 36.29 | 1525.37 | AID |
| 59.06 | 6.88 | 7.32 | 15.64 | 4.19 | 41.25 | 170.61 | 9.66 | 48.63 | 212.37 | control |
| 49.69 | 7.86 | 5.74 | 12.84 | 5.65 | 30.43 | 32.91 | 127.76 | 42.81 | 236.62 | control |
| 120.93 | 26.57 | 744.32 | 515.47 | 3.15 | 55.91 | 28.72 | 665.02 | 10.23 | 1315.09 | AID |
| 77.11 | 17.22 | 1060.71 | 193.29 | 2.70 | 68.65 | 16.02 | 807.98 | 18.45 | 1412.12 | AID |
| 117.26 | 14.24 | 476.25 | 192.52 | 5.07 | 64.48 | 22.67 | 1055.61 | 23.72 | 1546.70 | AID |
| 93.12 | 2.58 | 149.57 | 133.90 | 2.40 | 54.45 | 31.86 | 264.05 | 37.49 | 1468.69 | AID |
| 44.18 | 7.64 | 8.92 | 30.49 | 4.54 | 30.89 | 201.93 | 86.71 | 39.14 | 81.04 | control |
| 55.81 | 8.61 | 7.75 | 14.10 | 3.54 | 35.54 | 273.52 | 105.32 | 44.53 | 106.76 | control |
| 107.48 | 6.24 | 775.88 | 287.56 | 4.29 | 36.06 | 26.60 | 867.58 | 23.80 | 1708.22 | AID |
| 85.42 | 20.05 | 1083.64 | 507.51 | 4.65 | 53.44 | 19.93 | 245.13 | 67.78 | 1678.81 | AID |
| 123.01 | 24.84 | 658.85 | 287.20 | 4.22 | 2.64 | 9.86 | 613.03 | 58.70 | 1571.94 | AID |
| 44.82 | 38.64 | 515.85 | 373.81 | 4.26 | 37.58 | 7.40 | 134.69 | 11.72 | 1943.04 | AID |
| 50.69 | 12.30 | 8.49 | 23.97 | 3.15 | 48.20 | 119.94 | 132.93 | 23.66 | 204.96 | control |
| 104.38 | 1.03 | 567.50 | 259.82 | 3.80 | 67.48 | 26.97 | 908.13 | 64.89 | 1686.21 | AID |
| 77.82 | 9.49 | 770.94 | 152.51 | 2.66 | 48.31 | 18.15 | 1032.94 | 54.89 | 1496.37 | AID |
| 42.06 | 9.07 | 6.42 | 20.12 | 3.32 | 16.04 | 140.56 | 125.10 | 47.20 | 225.16 | control |
| 52.35 | 8.96 | 6.36 | 17.99 | 3.24 | 25.95 | 188.72 | 18.97 | 17.48 | 137.01 | control |
| 39.78 | 11.65 | 8.18 | 27.06 | 3.81 | 53.80 | 211.01 | 99.86 | 31.68 | 186.67 | control |
| 86.75 | 24.53 | 1459.89 | 318.42 | 4.88 | 59.93 | 29.95 | 339.68 | 22.43 | 1663.64 | AID |
| 45.44 | 9.61 | 8.13 | 25.72 | 3.60 | 43.93 | 107.71 | 95.43 | 11.14 | 233.01 | control |
| 61.25 | 9.59 | 5.93 | 37.21 | 3.74 | 47.34 | 25.09 | 41.54 | 55.82 | 144.40 | control |
| 49.36 | 10.41 | 7.49 | 12.87 | 2.78 | 5.14 | 113.20 | 53.77 | 15.59 | 155.97 | control |
| 80.88 | 18.11 | 712.40 | 25.41 | 4.28 | 43.92 | 13.54 | 788.88 | 61.19 | 1486.55 | AID |
| 38.36 | 6.89 | 6.93 | 16.18 | 2.81 | 43.97 | 38.02 | 104.68 | 24.34 | 283.90 | control |
| 46.78 | 7.09 | 8.80 | 25.47 | 3.94 | 47.31 | 273.56 | 65.29 | 36.03 | 96.89 | control |
| 93.62 | 19.72 | 539.32 | 126.67 | 4.64 | 54.85 | 16.15 | 850.38 | 18.50 | 1552.92 | AID |
| 100.72 | 5.36 | 1586.33 | 481.88 | 4.36 | 49.57 | 4.42 | 526.38 | 31.77 | 1482.74 | AID |
| 52.84 | 22.02 | 376.19 | 131.67 | 2.64 | 25.64 | 35.09 | 604.77 | 31.96 | 1445.43 | AID |
| 39.43 | 10.92 | 5.74 | 16.90 | 4.32 | 66.06 | 86.62 | 54.89 | 56.16 | 142.36 | control |
| 46.31 | 10.29 | 8.11 | 13.88 | 3.57 | 50.75 | 281.64 | 92.47 | 36.92 | 241.68 | control |
| 102.07 | 24.93 | 893.10 | 558.56 | 2.13 | 49.80 | 34.97 | 1617.69 | 30.42 | 1393.30 | AID |
| 83.78 | 40.86 | 825.13 | 130.94 | 3.27 | 62.37 | 22.84 | 941.08 | 28.59 | 1554.44 | AID |
| 34.85 | 10.69 | 7.40 | 20.33 | 3.97 | 34.92 | 40.37 | 90.17 | 14.86 | 129.51 | control |
| 42.97 | 12.69 | 6.42 | 36.56 | 2.83 | 60.02 | 224.31 | 155.31 | 12.39 | 171.33 | control |
| 73.12 | 25.33 | 835.94 | 308.63 | 4.00 | 63.71 | 39.73 | 1408.66 | 25.67 | 1614.76 | AID |
| 39.53 | 13.15 | 7.59 | 33.78 | 3.31 | 12.61 | 185.66 | 63.87 | 28.72 | 208.48 | control |
| 129.87 | 29.52 | 1334.98 | 140.73 | 2.47 | 30.68 | 29.22 | 573.86 | 31.18 | 1604.78 | AID |
| 45.85 | 11.57 | 7.68 | 28.60 | 4.57 | 13.68 | 87.83 | 94.26 | 9.48 | 154.67 | control |
| 107.39 | 48.65 | 779.78 | 394.60 | 1.70 | 92.83 | 22.86 | 1192.12 | 25.88 | 1524.06 | AID |
| 74.77 | 41.76 | 801.34 | 110.19 | 4.80 | 55.02 | 33.84 | 414.64 | 80.28 | 1633.50 | AID |
| 65.93 | 15.20 | 10.21 | 24.39 | 3.78 | 46.39 | 186.77 | 137.14 | 27.69 | 50.78 | control |
| 75.21 | 24.53 | 446.47 | 125.66 | 2.34 | 42.10 | 14.89 | 327.23 | 50.97 | 1579.51 | AID |
| 45.86 | 6.01 | 6.70 | 23.67 | 4.07 | 20.28 | 50.54 | 64.46 | 25.64 | 161.60 | control |
Table 1. Using the SVM model to predict classification of the blinded data set.

The above table shows the predictions attached to the data. Because the data was blinded, this experiment can be repeated without removing the classifier, and a confusion matrix generated to determine model efficacy.
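
Since new_data still carries its true category column, that secondary scoring is a one-liner (a sketch reusing the objects above):

```r
# Secondary confusion matrix: predictions vs. the never-blinded true labels
confusionMatrix(table(new_data_predict, new_data$category))
```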

Finally, the full new data set with predictions can be plotted as above.

## Visualizing the predicted new data

ggplot(gather(full_new_data,
               key = "Analyte" ,
               value = "Concentration",
              1:10), aes(x = factor(Analyte),
                         y = Concentration,
                         fill = Prediction)) +
    geom_boxplot() +
    labs(title = "Box Plot of Standard 10-plex After Classification",
         subtitle = "Data set = Random Generated.",
         fill = "Prediction") +
    xlab("Cytokine") +
    ylab("Concentration")
Figure 2. Plotting the second generated data set. None of the data was used in training or validation.

SVM is a robust method for classification and can easily be applied to cytokine multiplex data generated on instruments like the Luminex bead-based array system.
