#list.files() # prints the files present in the working directory
Asking for help in R:
#?function
#help(function or Robject)
This is a simulated dataset containing information about customers. It poses a regression problem: we need to analyse the data, find some insights, and create a machine learning model.
library(dplyr) #used for data manipulation/wrangling
library(ggplot2) #used for data visualization
# load the data
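credit <- read.csv("credit.csv", stringsAsFactors = TRUE) # R >= 4.0 reads strings as character by default; the output below shows factors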
head(credit) # shows the first 6 rows
*.ID - Identification
*.Income - Income in $10,000’s
*.Limit - Credit limit
*.Rating - Credit rating
*.Cards - Number of credit cards
*.Age - Age in years
*.Education - Number of years of education
*.Gender - A factor with levels Male and Female
*.Student - A factor with levels No & Yes indicating whether an individual was a student
*.Married - A factor with levels No & Yes indicating whether an individual was married
*.Ethnicity - A factor with levels African American, Asian, and Caucasian indicating an individual's ethnic group
*.Balance - Average credit card balance in $.
We need to analyse the data and create a machine learning model that predicts a customer's bank balance.
tail(credit) # shows the last 6 rows by default
View() shows the whole dataset in a new tab.
glimpse(credit)
## Observations: 400
## Variables: 12
## $ ID <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1...
## $ Income <dbl> 14.891, 106.025, 104.593, 148.924, 55.882, 80.180, 2...
## $ Limit <int> 3606, 6645, 7075, 9504, 4897, 8047, 3388, 7114, 3300...
## $ Rating <int> 283, 483, 514, 681, 357, 569, 259, 512, 266, 491, 58...
## $ Cards <int> 2, 3, 4, 3, 2, 4, 2, 2, 5, 3, 4, 3, 1, 1, 2, 3, 3, 3...
## $ Age <int> 34, 82, 71, 36, 68, 77, 37, 87, 66, 41, 30, 64, 57, ...
## $ Education <int> 11, 15, 11, 11, 16, 10, 12, 9, 13, 19, 14, 16, 7, 9,...
## $ Gender <fct> Male, Female, Male, Female, Male, Male, Female, ...
## $ Student <fct> No, Yes, No, No, No, No, No, No, No, Yes, No, No, No...
## $ Married <fct> Yes, Yes, No, No, Yes, No, No, No, No, Yes, Yes, No,...
## $ Ethnicity <fct> Caucasian, Asian, Asian, Asian, Caucasian, Caucasian...
## $ Balance <int> 333, 903, 580, 964, 331, 1151, 203, 872, 279, 1350, ...
#View(credit)
dim(credit) #prints dimension of the dataframe (Row, Column)
## [1] 400 12
#rownames(credit) #prints row names
names(credit) #prints column headers
## [1] "ID" "Income" "Limit" "Rating" "Cards"
## [6] "Age" "Education" "Gender" "Student" "Married"
## [11] "Ethnicity" "Balance"
#colnames(credit) #prints column headers
str(credit) #internal structure of the data
## 'data.frame': 400 obs. of 12 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Income : num 14.9 106 104.6 148.9 55.9 ...
## $ Limit : int 3606 6645 7075 9504 4897 8047 3388 7114 3300 6819 ...
## $ Rating : int 283 483 514 681 357 569 259 512 266 491 ...
## $ Cards : int 2 3 4 3 2 4 2 2 5 3 ...
## $ Age : int 34 82 71 36 68 77 37 87 66 41 ...
## $ Education: int 11 15 11 11 16 10 12 9 13 19 ...
## $ Gender : Factor w/ 2 levels " Male","Female": 1 2 1 2 1 1 2 1 2 2 ...
## $ Student : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 1 1 1 2 ...
## $ Married : Factor w/ 2 levels "No","Yes": 2 2 1 1 2 1 1 1 1 2 ...
## $ Ethnicity: Factor w/ 3 levels "African American",..: 3 2 2 2 3 3 1 2 3 1 ...
## $ Balance : int 333 903 580 964 331 1151 203 872 279 1350 ...
library(knitr)
kable(summary(credit)) #basic statistics
ID | Income | Limit | Rating | Cards | Age | Education | Gender | Student | Married | Ethnicity | Balance |
---|---|---|---|---|---|---|---|---|---|---|---|
Min. : 1.0 | Min. : 10.35 | Min. : 855 | Min. : 93.0 | Min. :1.000 | Min. :23.00 | Min. : 5.00 | Male :193 | No :360 | No :155 | African American: 99 | Min. : 0.00 |
1st Qu.:100.8 | 1st Qu.: 21.01 | 1st Qu.: 3088 | 1st Qu.:247.2 | 1st Qu.:2.000 | 1st Qu.:41.75 | 1st Qu.:11.00 | Female:207 | Yes: 40 | Yes:245 | Asian :102 | 1st Qu.: 68.75 |
Median :200.5 | Median : 33.12 | Median : 4622 | Median :344.0 | Median :3.000 | Median :56.00 | Median :14.00 | NA | NA | NA | Caucasian :199 | Median : 459.50 |
Mean :200.5 | Mean : 45.22 | Mean : 4736 | Mean :354.9 | Mean :2.958 | Mean :55.67 | Mean :13.45 | NA | NA | NA | NA | Mean : 520.01 |
3rd Qu.:300.2 | 3rd Qu.: 57.47 | 3rd Qu.: 5873 | 3rd Qu.:437.2 | 3rd Qu.:4.000 | 3rd Qu.:70.00 | 3rd Qu.:16.00 | NA | NA | NA | NA | 3rd Qu.: 863.00 |
Max. :400.0 | Max. :186.63 | Max. :13913 | Max. :982.0 | Max. :9.000 | Max. :98.00 | Max. :20.00 | NA | NA | NA | NA | Max. :1999.00 |
In this section we will work on a single variable (one column of the data) and use some basic data manipulation functions from dplyr.
income<-select(credit,Income) #selection of the column
length(income) # income is a one-column data frame, so its length is the number of columns
## [1] 1
dim(income)
## [1] 400 1
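The length of 1 above comes from `select()` returning a data frame, whose length is its number of columns. Here is a minimal base R sketch of the difference between a one-column data frame and a plain vector. The data frame `toy` and its values are made up for illustration and are not read from credit.csv.

```r
# Hypothetical toy data; not the credit dataset
toy <- data.frame(Income = c(14.9, 106.0, 104.6),
                  Limit  = c(3606, 6645, 7075))

income_df  <- toy["Income"]    # single brackets: still a data frame with one column
income_vec <- toy[["Income"]]  # double brackets: a plain numeric vector

length(income_df)          # number of columns
length(income_vec)         # number of values
dim(income_df)             # rows and columns
is.null(dim(income_vec))   # a plain vector has no dim attribute
```

With dplyr, `pull(credit, Income)` should give the column as a vector in the same way.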
head(income)
names(income) <-"Dollar"
head(income)
income<-rename(income,Income = Dollar)
income
min(income) #minimum of the data
## [1] 10.354
max(income) #maximum of the data
## [1] 186.634
### Missing values
head(is.na(income)) #gives logical value
## Income
## [1,] FALSE
## [2,] FALSE
## [3,] FALSE
## [4,] FALSE
## [5,] FALSE
## [6,] FALSE
sum(is.na(income))
## [1] 0
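The credit data has no missing values, so the checks above all come back empty. As a sketch of what the same checks look like when values are missing, the example below uses a small made-up data frame (`toy` is hypothetical, not part of the dataset).

```r
# Hypothetical toy data with two missing values; not the credit dataset
toy <- data.frame(Income = c(14.9, NA, 104.6, 55.9),
                  Age    = c(34, 82, NA, 68))

na_per_column <- colSums(is.na(toy))   # missing values in each column
na_total      <- sum(is.na(toy))       # missing values in the whole data frame
complete      <- na.omit(toy)          # drop every row that contains an NA

na_per_column
na_total
nrow(complete)
mean(toy$Income, na.rm = TRUE)         # skip NAs when summarising
```

Whether to drop or impute incomplete rows depends on the analysis; `na.omit()` is only the simplest option.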
arrange(income,desc(Income)) # sorts by Income in decreasing order; drop desc() for increasing order
ggplot(income)+geom_histogram(aes(x=Income))
ggplot(income)+geom_density(aes(x=Income))
ggplot(income)+geom_boxplot(aes(x="",y=Income))
summary(income)
## Income
## Min. : 10.35
## 1st Qu.: 21.01
## Median : 33.12
## Mean : 45.22
## 3rd Qu.: 57.47
## Max. :186.63
In this section we will work with multiple variables, analyse the data, and create some visuals using ggplot2 and base R.
plot(credit)
#### Distribution of data
The density plot shows the distribution of the data. Skewness and kurtosis describe the shape of that distribution: its asymmetry and the weight of its tails.
ggplot(credit,aes(x=Balance))+geom_density()
library(moments)
skewness(credit$Balance)
## [1] 0.5824006
kurtosis(credit$Balance)
## [1] 2.463755
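Skewness measures a dataset's symmetry, or lack of it; a perfectly symmetrical dataset has a skewness of 0. The value of about 0.58 here is positive, so Balance has a longer right tail. Kurtosis is often called the degree of peakedness of a distribution, but it is really a measure of how heavy the tails are. The kurtosis() function in moments returns the non-excess value, for which a normal distribution scores 3, so 2.46 indicates slightly lighter tails than a normal distribution.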
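To make the two measures concrete, here is a minimal base R sketch of the moment-based definitions. The helper names `skew_manual` and `kurt_manual` and the sample vectors are made up for illustration; to my understanding the formulas match what `moments::skewness()` and `moments::kurtosis()` compute.

```r
# Moment-based skewness: third central moment scaled by the variance^(3/2)
skew_manual <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Moment-based (non-excess) kurtosis: fourth central moment over variance^2
kurt_manual <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2
}

x_sym   <- c(1, 2, 3, 4, 5)       # hypothetical symmetric sample
x_right <- c(1, 1, 2, 2, 3, 10)   # hypothetical sample with a long right tail

skew_manual(x_sym)     # 0 for a symmetric sample
skew_manual(x_right)   # positive for a right-skewed sample
kurt_manual(x_sym)     # 1.7
```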
#### Scatterplot
A scatterplot plots two variables against each other and shows the relationship (correlation) between them.
db1 <- select(credit,Income,Limit)
ggplot(db1)+geom_point(aes(x=Income,y=Limit))+ggtitle("Scatterplot Income vs Limit")
We can show more than two variables in one plot by mapping a third variable to color or shape.
ggplot(data = credit)+geom_point(aes(Age,Balance,color= Gender))+ labs(title="Age vs Bank Balance")+theme_minimal()
ggplot(data = credit)+geom_point(aes(Income,Balance,color= Married))+ labs(title="Income vs Bank Balance")+theme_minimal()
cat("Stats of Age:",summary(credit$Age))
## Stats of Age: 23 41.75 56 55.6675 70 98
cat("\nStats of Balance:",summary(credit$Balance))
##
## Stats of Balance: 0 68.75 459.5 520.015 863 1999
Correlation is the degree and type of linear relationship between two quantities. Its value lies between -1 and 1. A value of 0 means no linear relationship between the two variables, -1 means a perfect negative correlation, and +1 means a perfect positive correlation.
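As a small illustration of these boundary values, the sketch below computes the Pearson correlation from its definition and compares it with base R's `cor()`. The function name `pearson_manual` and the vectors are made up for the example.

```r
# Pearson correlation from its definition
pearson_manual <- function(x, y) {
  dx <- x - mean(x)
  dy <- y - mean(y)
  sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
}

x      <- c(1, 2, 3, 4, 5)   # hypothetical values
y_up   <- 2 * x + 1          # exact increasing linear relation
y_down <- -3 * x + 10        # exact decreasing linear relation

pearson_manual(x, y_up)      # 1
pearson_manual(x, y_down)    # -1
all.equal(pearson_manual(x, y_up), cor(x, y_up))   # agrees with cor()
```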
cls <- sapply(credit,class)
cls
## ID Income Limit Rating Cards Age Education
## "integer" "numeric" "integer" "integer" "integer" "integer" "integer"
## Gender Student Married Ethnicity Balance
## "factor" "factor" "factor" "factor" "integer"
credit%>% select(which(cls!="factor"))
num_credit<-credit%>% select(which(cls!="factor"))
num_credit<- as.data.frame(num_credit)
cor(num_credit)
## ID Income Limit Rating Cards
## ID 1.000000000 0.03720258 0.02417249 0.02198547 -0.03630425
## Income 0.037202580 1.00000000 0.79208834 0.79137763 -0.01827261
## Limit 0.024172487 0.79208834 1.00000000 0.99687974 0.01023133
## Rating 0.021985470 0.79137763 0.99687974 1.00000000 0.05323903
## Cards -0.036304251 -0.01827261 0.01023133 0.05323903 1.00000000
## Age 0.058603022 0.17533840 0.10088792 0.10316500 0.04294829
## Education -0.001415034 -0.02769198 -0.02354853 -0.03013563 -0.05108422
## Balance 0.006064108 0.46365646 0.86169727 0.86362516 0.08645635
## Age Education Balance
## ID 0.058603022 -0.001415034 0.006064108
## Income 0.175338403 -0.027691982 0.463656457
## Limit 0.100887922 -0.023548534 0.861697267
## Rating 0.103164996 -0.030135627 0.863625161
## Cards 0.042948288 -0.051084217 0.086456347
## Age 1.000000000 0.003619285 0.001835119
## Education 0.003619285 1.000000000 -0.008061576
## Balance 0.001835119 -0.008061576 1.000000000
plot(num_credit)
In the plots below we use scatterplots to show the relationship between pairs of variables. Limit and Rating are very strongly positively correlated (about 0.997 in the matrix above).
g<- ggplot(num_credit)
g+ geom_point(aes(x=Limit, y= Rating))+labs(title="Limit vs Rating Scatterplot",x="Limit",y="Rating")+theme_minimal()
g+ geom_point(aes(x=Income,y=Balance))+labs(title="Income vs Balance Scatterplot")+theme_classic()
g+ geom_point(aes(x=Rating,y=Balance))+labs(title="Rating vs Balance Scatterplot")+ theme_light()
#### Boxplot
A boxplot shows the quartiles of the data and highlights outliers.
ggplot(credit)+geom_boxplot(aes(x= Gender,y=Balance))
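The numbers behind a boxplot can be computed directly. The sketch below uses a made-up vector and assumes the common convention, which ggplot2 documents for `geom_boxplot()`, that points more than 1.5 times the interquartile range beyond the box are drawn as outliers.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)   # hypothetical values with one extreme point

q   <- quantile(x, c(0.25, 0.5, 0.75))  # quartiles (R's default quantile type)
iqr <- q[["75%"]] - q[["25%"]]          # interquartile range: height of the box

lower_fence <- q[["25%"]] - 1.5 * iqr   # points below this count as outliers
upper_fence <- q[["75%"]] + 1.5 * iqr   # points above this count as outliers

outliers <- x[x < lower_fence | x > upper_fence]
outliers   # 50
```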
#### Lineplot
A line plot is mostly used for time series data, but here we use it to show how many people of each age are in the dataset.
age_df<-as.data.frame(table(credit$Age))
names(age_df)<-c("Age","Count")
ggplot(age_df,aes(Age,Count))+geom_line(aes(group=1))+geom_point()+scale_x_discrete(breaks=c(23,33,43,53,63,73,83,98))+theme_minimal()
A bar plot (count plot) shows the number of observations in each category.
ggplot(credit)+geom_bar(aes(x=Gender,fill=Gender))+labs(title="Countplot of Gender")+
theme_minimal()+theme(legend.position = "none")
credit$Cards<- as.factor(credit$Cards)
# after_stat(count) replaces the deprecated ..count.. notation (ggplot2 >= 3.3.0)
ggplot(credit,aes(x=Cards))+geom_bar(aes(y = after_stat(count/sum(count)),fill=Cards))+
geom_text(aes(y = after_stat(count/sum(count)), label = after_stat(scales::percent(count/sum(count)))), stat = "count", vjust = -0.25)+
labs(title= "Cards count",y="Percentage")+
theme_minimal()+theme(legend.position = "none")
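The shares drawn on the bars can also be computed without plotting. This sketch uses only the first ten Cards values printed by glimpse() earlier, so the resulting shares are illustrative and not those of the full dataset.

```r
# First ten Cards values from the glimpse() output; not the full column
cards <- c(2, 3, 4, 3, 2, 4, 2, 2, 5, 3)

counts <- table(cards)         # how often each value occurs
shares <- prop.table(counts)   # counts divided by their total

counts
paste0(round(100 * shares), "%")   # shares formatted as percentage labels
```

On the full data, `prop.table(table(credit$Cards))` should give the proportions that the bar chart displays.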
aggregate(Income~Ethnicity,credit,mean)
#### Multiple plots
In this section we will analyse the data and combine several panels in one figure to show our insights.
aggregate(Balance~Student+Gender,credit,mean)
*.The aggregate() function splits the data into subsets, computes a summary statistic for each, and returns the result in a convenient form.
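The split, summarise, and combine steps are easier to follow on a few rows. The data frame `toy` below is made up for illustration; only the call pattern mirrors the one used on the credit data.

```r
# Hypothetical toy data; not the credit dataset
toy <- data.frame(
  Student = c("No", "No", "Yes", "Yes", "No", "Yes"),
  Gender  = c("Male", "Female", "Male", "Female", "Male", "Female"),
  Balance = c(300, 500, 900, 700, 500, 1100)
)

# One row per Student/Gender combination, holding the mean Balance of that group
by_group <- aggregate(Balance ~ Student + Gender, data = toy, FUN = mean)
by_group
```

A dplyr version of the same idea would be `toy %>% group_by(Student, Gender) %>% summarise(Balance = mean(Balance))`.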
ggplot(data= aggregate(Balance~Student+Gender,credit,mean),aes(x=Gender,y=Balance))+geom_col(aes(fill=Gender))+facet_grid(~Student)
On average, males have a higher bank balance than females.
ggplot(data=aggregate(Balance~Gender+Married,credit,mean),aes(Gender,Balance))+
geom_col(aes(fill=Gender))+ggtitle("Barplot Gender vs Balance on basis of Marital Status") +facet_grid(~Married)+theme_minimal()+theme(legend.position = "none")
This plot shows the relationship between bank balance, gender, and marital status. In this dataset, married women appear to have a higher average balance than married men.
ggplot(data=aggregate(Balance~Ethnicity+Gender,credit,mean),aes(Ethnicity,Balance))+
geom_col(aes(fill=Ethnicity))+ggtitle("Barplot Ethnicity vs Balance on basis of Gender") +facet_grid(~Gender)+theme_bw()+theme(legend.position = "none")
That concludes the data analysis. I showed a basic data analysis process, and there are many more insights to be found in this dataset. Next, we will move on to reporting and look at how to build a report with R and RStudio.