Problem
We have medical insurance data for a set of patients. We will use the independent variables to build a machine learning model that estimates the insurance charges. Since the medical charge is a numeric value, this is a regression problem.
charges is the dependent variable, and these are the independent variables:
age: Integer indicating the age of the patient
sex: The patient's gender, either male or female
bmi: Body Mass Index (BMI), equal to weight (in kilograms) divided by height (in meters) squared. It provides a sense of how over- or underweight a person is relative to their height.
children: An integer indicating the number of children/dependents covered by the insurance plan
smoker: Whether the patient regularly smokes tobacco
region: The patient's place of residence in the U.S.
charges: The yearly health insurance charge billed to the patient
Load Library and data
library(dplyr)
library(caret)
insurance <- readRDS("insurance.rds")
Exploratory Data Analysis (EDA)
This was already covered in the data analysis section, so here we only take a quick look at the data.
knitr::kable(head(insurance))
age | sex | bmi | children | smoker | region | charges |
---|---|---|---|---|---|---|
19 | female | 27.900 | 0 | yes | southwest | 16884.924 |
39 | male | 33.770 | 1 | no | southeast | 1725.552 |
28 | male | 33.000 | 3 | no | southeast | 4449.462 |
33 | male | 22.705 | 0 | no | northwest | 21984.471 |
32 | male | 28.880 | 0 | no | northwest | 3866.855 |
31 | female | 25.740 | 0 | no | southeast | 3756.622 |
str(insurance)
## 'data.frame': 1338 obs. of 7 variables:
## $ age : int 19 39 28 33 32 31 46 37 37 60 ...
## $ sex : Factor w/ 2 levels "female","male": 1 2 2 2 2 1 1 1 2 1 ...
## $ bmi : num 27.9 33.8 33 22.7 28.9 ...
## $ children: int 0 1 3 0 0 0 1 3 2 0 ...
## $ smoker : Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 1 1 1 1 ...
## $ region : Factor w/ 4 levels "northeast","northwest",..: 4 3 3 2 2 3 3 2 1 2 ...
## $ charges : num 16885 1726 4449 21984 3867 ...
Algorithm
Since this is a regression problem, we will use the multiple linear regression algorithm to build the medical insurance predictive model.
Simple Linear Regression
Simple linear regression is a simple method for predicting a quantitative value and studying the relationship between two continuous variables, say \(X\) and \(Y\). Mathematically, simple linear regression can be written as:
\[Y = a + bX + e\]
where \(Y\) is the dependent variable, \(X\) is the independent variable, \(a\) is the intercept, \(b\) is the slope of \(X\), and \(e\) is the error term.
Linear regression's main task is to find the best-fitting straight line through the \(Y\) and \(X\) points.
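As a toy illustration (with made-up data, not the insurance set), lm() in R recovers \(a\) and \(b\) from simulated points:

```r
# Simulate Y = a + b*X + e with a = 2, b = 3, and small noise e
set.seed(42)
x <- 1:100
y <- 2 + 3 * x + rnorm(100, sd = 0.5)
fit <- lm(y ~ x)
coef(fit)  # intercept near 2, slope near 3
```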
Multiple Linear Regression
Multiple linear regression attempts to model the relationship between two or more explanatory variables and a response variable by fitting a linear equation to observed data.
Multiple linear regression uses multiple predictors. The equation for multiple linear regression looks like:
\[Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + e\]
where:
\(Y\) is the response or dependent variable, \(\beta_0\) is the intercept, \(x_1\) and \(x_2\) are predictors or independent variables, \(\beta_1\) and \(\beta_2\) are the coefficients of \(x_1\) and \(x_2\) respectively, and \(e\) is the error term.
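The same idea with two predictors, again on made-up data: lm() estimates \(\beta_0\), \(\beta_1\), and \(\beta_2\) jointly:

```r
# Simulate Y = 1 + 2*x1 - 4*x2 + e and recover the coefficients
set.seed(1)
x1 <- runif(200)
x2 <- runif(200)
y  <- 1 + 2 * x1 - 4 * x2 + rnorm(200, sd = 0.1)
fit <- lm(y ~ x1 + x2)
round(coef(fit), 1)  # approximately 1, 2, -4
```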
Splitting the data
In machine learning we require training data to train the model and testing data to evaluate its performance. We will use 75% of the data for training and the remainder for testing.
set.seed(101) # Set Seed so that same sample can be reproduced in future also
sample <- sample.int(n = nrow(insurance), size = floor(.75 * nrow(insurance)), replace = FALSE)
train_df <- insurance[sample, ] # 75% data for the training
test_df <- insurance[-sample, ] # remaining data for testing
dim(train_df)
## [1] 1003 7
dim(test_df)
## [1] 335 7
We have 1003 rows in the training set and 335 rows in the testing set.
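An alternative worth knowing about is caret's createDataPartition(), which additionally stratifies the split on quantiles of the outcome. A hedged sketch on a toy data frame (standing in for the insurance data):

```r
library(caret)

# Toy stand-in for the insurance data
df <- data.frame(charges = runif(100, 1000, 50000),
                 age     = sample(18:64, 100, replace = TRUE))

set.seed(101)
# Stratified ~75/25 split on the outcome; returns row indices for training
idx <- createDataPartition(df$charges, p = 0.75, list = FALSE)
train_strat <- df[idx, ]
test_strat  <- df[-idx, ]
nrow(train_strat)  # about 75 rows
```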
Train the Model
We will use a linear regression model from the caret package for our insurance problem.
knitr::kable(head(test_df))
 | age | sex | bmi | children | smoker | region | charges |
---|---|---|---|---|---|---|---|
2 | 39 | male | 33.770 | 1 | no | southeast | 1725.552 |
4 | 33 | male | 22.705 | 0 | no | northwest | 21984.471 |
6 | 31 | female | 25.740 | 0 | no | southeast | 3756.622 |
8 | 37 | female | 27.740 | 3 | no | northwest | 7281.506 |
15 | 27 | male | 42.130 | 0 | yes | southeast | 39611.758 |
26 | 59 | female | 27.720 | 3 | no | southeast | 14001.134 |
real_charges <- select(test_df,charges)
test_df <- select(test_df,-charges)
ins_model <- train(charges ~ .
,train_df
,method = "lm"
)
ins_model
## Linear Regression
##
## 1003 samples
## 6 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 1003, 1003, 1003, 1003, 1003, 1003, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 6364.451 0.7248959 4416.144
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
summary(ins_model)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28647 -3039 -1072 1814 29678
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -10886.35 1183.76 -9.196 <2e-16 ***
## age 227.74 14.42 15.798 <2e-16 ***
## sexmale 274.18 396.51 0.691 0.4894
## bmi 347.64 34.23 10.155 <2e-16 ***
## children 425.81 169.09 2.518 0.0119 *
## smokeryes 23553.52 492.65 47.810 <2e-16 ***
## regionnorthwest -930.97 574.91 -1.619 0.1057
## regionsoutheast -1312.03 562.47 -2.333 0.0199 *
## regionsouthwest -1258.10 575.88 -2.185 0.0291 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6249 on 994 degrees of freedom
## Multiple R-squared: 0.7337, Adjusted R-squared: 0.7315
## F-statistic: 342.3 on 8 and 994 DF, p-value: < 2.2e-16
The output of summary() shows eight coefficients even though we passed only six features. This happened because caret applied a technique called dummy coding to each of the factor-type variables.
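model.matrix() makes this dummy coding visible. In this small hypothetical snippet, a four-level factor becomes three 0/1 columns, with the first level absorbed into the intercept as the baseline:

```r
df <- data.frame(region = factor(c("northeast", "northwest", "southeast", "southwest")))
model.matrix(~ region, df)
# Columns: (Intercept), regionnorthwest, regionsoutheast, regionsouthwest
```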
The residuals section provides summary statistics for the errors in our predictions. A residual equals the true value minus the predicted value.
The p-value, denoted Pr(>|t|), indicates the strength of evidence for a relationship between the dependent variable and an independent variable. The level of significance (threshold value) is 0.05; independent variables with a p-value below this level are considered statistically significant.
The multiple R-squared value provides a measure of how well our model as a whole explains the values of the dependent variable. The closer the value is to 1.0, the better the model explains the data.
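For reference, the RMSE, MAE, and R-squared figures that caret reports can be computed by hand; a tiny worked example with made-up numbers:

```r
actual <- c(100, 200, 300)
pred   <- c(110, 190, 310)

rmse <- sqrt(mean((actual - pred)^2))  # root mean squared error: 10
mae  <- mean(abs(actual - pred))       # mean absolute error: 10
r2   <- cor(actual, pred)^2            # squared correlation, close to 1
c(RMSE = rmse, MAE = mae, Rsquared = r2)
```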
cor(predict(ins_model,test_df), real_charges)
## charges
## [1,] 0.8055643
real_charges$prediction <- predict(ins_model,test_df)
knitr::kable(head(real_charges))
 | charges | prediction |
---|---|---|
2 | 1725.552 | 9123.411 |
4 | 21984.471 | 3865.569 |
6 | 3756.622 | 3809.918 |
8 | 7281.506 | 7530.165 |
15 | 39611.758 | 32424.453 |
26 | 14001.134 | 12152.521 |
Improve Model
BMI may have little impact on medical expenditure for individuals in the normal weight range, but it may be strongly related to higher costs for obese people, i.e., BMI >= 30.
We can model this relationship by creating a binary obesity indicator variable: 1 if BMI >= 30 and 0 if BMI < 30.
train_df$bmi30 <- ifelse(train_df$bmi >= 30, 1, 0)
test_df$bmi30 <- ifelse(test_df$bmi >= 30, 1, 0)
In linear regression, the relationship between an independent variable and the dependent variable is assumed to be linear, but this may not always be true.
The effect of age on medical expenditure may not be constant across age groups; treatment may become disproportionately expensive for the oldest patients.
To capture a non-linear age effect we would add the square of age. Note that in R's formula syntax a bare age^2 is collapsed to just age, so the term must be written I(age^2) (or poly(age, 2)) to actually enter the model; the formula below therefore fits only the linear age term.
When two features have a combined effect, this is known as an interaction.
We suspect that the obesity indicator (bmi30) and the smoker indicator interact with each other. We will add bmi30*smoker to the model and test this hypothesis.
new_model <- train(charges ~ age+ age^2+ children + bmi+ bmi30*smoker + region
,train_df
,method = "lm"
)
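As a side note, the bmi30*smoker shorthand in an R formula expands to the two main effects plus their interaction, which is why the model summary reports a bmi30:smokeryes coefficient. A small hypothetical check:

```r
df <- data.frame(bmi30  = c(0, 1, 0, 1),
                 smoker = factor(c("no", "no", "yes", "yes")))
# a*b expands to a + b + a:b
colnames(model.matrix(~ bmi30 * smoker, df))
# "(Intercept)" "bmi30" "smokeryes" "bmi30:smokeryes"
```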
Compare the Two Models
summary(new_model)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37254 -1952 -1143 -38 30576
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4451.53 1172.52 -3.797 0.000156 ***
## age 239.93 11.17 21.488 < 2e-16 ***
## children 531.72 130.97 4.060 5.3e-05 ***
## bmi 140.07 42.84 3.269 0.001115 **
## bmi30 -1022.20 529.17 -1.932 0.053683 .
## smokeryes 12727.41 573.17 22.205 < 2e-16 ***
## regionnorthwest -658.96 445.12 -1.480 0.139082
## regionsoutheast -1065.91 435.82 -2.446 0.014628 *
## regionsouthwest -1501.03 445.86 -3.367 0.000790 ***
## `bmi30:smokeryes` 19357.92 764.94 25.307 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4837 on 993 degrees of freedom
## Multiple R-squared: 0.8406, Adjusted R-squared: 0.8391
## F-statistic: 581.8 on 9 and 993 DF, p-value: < 2.2e-16
summary(ins_model)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28647 -3039 -1072 1814 29678
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -10886.35 1183.76 -9.196 <2e-16 ***
## age 227.74 14.42 15.798 <2e-16 ***
## sexmale 274.18 396.51 0.691 0.4894
## bmi 347.64 34.23 10.155 <2e-16 ***
## children 425.81 169.09 2.518 0.0119 *
## smokeryes 23553.52 492.65 47.810 <2e-16 ***
## regionnorthwest -930.97 574.91 -1.619 0.1057
## regionsoutheast -1312.03 562.47 -2.333 0.0199 *
## regionsouthwest -1258.10 575.88 -2.185 0.0291 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6249 on 994 degrees of freedom
## Multiple R-squared: 0.7337, Adjusted R-squared: 0.7315
## F-statistic: 342.3 on 8 and 994 DF, p-value: < 2.2e-16
The fitted equation of our first multiple linear regression model (ins_model) is:
\[charges = -10886.35 + 227.74\,age + 274.18\,sexmale + 347.64\,bmi + 425.81\,children + 23553.52\,smokeryes - 930.97\,regionnorthwest - 1312.03\,regionsoutheast - 1258.10\,regionsouthwest\]
cor(predict(new_model,test_df), real_charges)
## charges prediction
## [1,] 0.8729587 0.923047
real_charges$prediction_new <- predict(new_model,test_df)
knitr::kable(head(real_charges,10))
 | charges | prediction | prediction_new |
---|---|---|---|
2 | 1725.552 | 9123.4114 | 8079.622 |
4 | 21984.471 | 3865.5690 | 5987.571 |
6 | 3756.622 | 3809.9177 | 5525.875 |
8 | 7281.506 | 7530.1648 | 9247.715 |
15 | 39611.758 | 32424.4532 | 37925.052 |
26 | 14001.134 | 12152.5206 | 14116.453 |
28 | 12268.632 | 12954.1099 | 12717.822 |
36 | 1625.434 | -115.4669 | 2309.169 |
41 | 3046.062 | 3826.7110 | 5032.727 |
54 | 37742.576 | 31797.3293 | 39005.883 |
Save Model
saveRDS(new_model,"Final_model.rds")
Model Deployment
We can deploy the model as a web service in R using the Shiny framework made by RStudio.
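A minimal sketch of such a Shiny app is below; the widget choices, layout, and the assumption that "Final_model.rds" sits next to the app are illustrative, not a tested deployment:

```r
library(shiny)

# Load the model saved above (assumed to be in the app directory)
model <- readRDS("Final_model.rds")

ui <- fluidPage(
  titlePanel("Insurance Charge Estimator"),
  numericInput("age", "Age", value = 30, min = 18, max = 100),
  numericInput("bmi", "BMI", value = 25),
  numericInput("children", "Children", value = 0),
  selectInput("smoker", "Smoker", choices = c("no", "yes")),
  selectInput("region", "Region",
              choices = c("northeast", "northwest", "southeast", "southwest")),
  textOutput("charge")
)

server <- function(input, output) {
  output$charge <- renderText({
    # Assemble one row matching the predictors the model was trained on
    newdata <- data.frame(
      age      = input$age,
      bmi      = input$bmi,
      children = input$children,
      smoker   = factor(input$smoker, levels = c("no", "yes")),
      region   = factor(input$region,
                        levels = c("northeast", "northwest", "southeast", "southwest")),
      bmi30    = ifelse(input$bmi >= 30, 1, 0)
    )
    paste("Estimated yearly charge:", round(predict(model, newdata), 2))
  })
}

shinyApp(ui, server)
```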