Intro

This is an analysis of wage and related data for a group of 3000 male workers in the Mid-Atlantic region of the USA. The dataset has the following columns:

  • year Year that wage information was recorded

  • age Age of worker

  • maritl A factor with levels Never Married Married Widowed Divorced and Separated indicating marital status

  • race A factor with levels White Black Asian and Other indicating race

  • education A factor with levels < HS Grad HS Grad Some College College Grad and Advanced Degree indicating education level

  • region Region of the country (mid-atlantic only)

  • jobclass A factor with levels Industrial and Information indicating type of job

  • health A factor with levels <=Good and >=Very Good indicating health level of worker

  • health_ins A factor with levels 1. Yes and 2. No indicating whether worker has health insurance

  • logwage Log of the worker's wage

  • wage Worker's raw wage

Loading Library and Data

A library (package) is a collection of R functions that makes common analysis tasks faster and easier, so we do not have to write them ourselves.

#Load Library
library(dplyr)##Data Manipulation
library(ggplot2)##Data Visualization
##Load Dataset
wage<- read.csv("Wage.csv",stringsAsFactors = FALSE)
#write.csv(wage,"Wage.csv")

We use a CSV (comma-separated values) dataset in this analysis, and R has a built-in function, read.csv(), to load it. In R versions before 4.0.0, read.csv() converts string columns to factors by default; setting stringsAsFactors = FALSE keeps them as character vectors.
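The effect of stringsAsFactors can be seen on a tiny inline CSV. This is a minimal sketch: the two-row text and the names csv_text, as_factor and as_char are made up for illustration and are not part of the Wage data.

```r
# Hypothetical two-row CSV supplied as text, so no file is needed
csv_text <- "race,wage\nWhite,75.0\nAsian,154.7"

# Strings converted to factors (the default before R 4.0.0)
as_factor <- read.csv(text = csv_text, stringsAsFactors = TRUE)

# Strings kept as plain character vectors
as_char <- read.csv(text = csv_text, stringsAsFactors = FALSE)

class(as_factor$race)
class(as_char$race)
```

Keeping strings as characters makes value replacement during cleaning simpler, because assigning a new label to a factor requires that label to exist as a level first.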

The file contains an extra index column, X, that we do not need, so we drop it with select() from dplyr.

wage<- select(wage,-X)
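An X column like this typically appears because write.csv() stores the row names as an unnamed first column. The sketch below, using a made-up two-row data frame and temporary files, shows the effect and how row.names = FALSE avoids it when you control how the file is written.

```r
# Hypothetical data frame, written twice to temporary files
toy <- data.frame(age = c(25, 40), wage = c(70.5, 131))
path_with <- tempfile(fileext = ".csv")
path_without <- tempfile(fileext = ".csv")

write.csv(toy, path_with)                        # row names are written too
write.csv(toy, path_without, row.names = FALSE)  # row names are skipped

cols_with <- names(read.csv(path_with))
cols_without <- names(read.csv(path_without))

cols_with      # includes the extra "X" column
cols_without   # only the real columns
```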

R can also load other file types, such as SPSS, Excel and JSON files. The commented examples below show the package and function needed for each.

SPSS

#library(foreign)
#SPSSData <- read.spss("filename.sav")
# to display in dataframe 
#SPSSData <- read.spss("filename.sav",to.data.frame=TRUE,use.value.labels=FALSE)

EXCEL

#library(readxl)
 
#df <- read_excel("Advertising.xlsx", sheet ='sheetname')

JSON data

#library(rjson)
#jsondata <- fromJSON(file = "filename.json")

dim() shows the dimensions of an R object.

#dimension of the data
dim(wage)
## [1] 3000   11

nrow() and ncol() give the number of rows and columns of the object. cat() concatenates the given values and prints them.

##Number of rows and columns
cat("No of rows:",nrow(wage))
## No of rows: 3000
cat("\nNo of Columns:",ncol(wage))
## 
## No of Columns: 11

We use colnames() or names() to get the column names.

##Column heading or name of columns
#names(wage)
colnames(wage)
##  [1] "year"       "age"        "maritl"     "race"       "education" 
##  [6] "region"     "jobclass"   "health"     "health_ins" "logwage"   
## [11] "wage"

row.names() gives the row names.

##Row heading or name of Rows
#row.names(wage)

We can display the first rows of the data using head().

#Shows first data in datasets
head(wage)

tail() shows the last rows of the dataset; the second argument sets how many.

#Shows last data in datasets
tail(wage,3)

View() shows the whole dataset in a new tab.

#Shows whole data 
View(wage)

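table() counts how often each value occurs. Here it also reveals a level with a stray numeric prefix, "5. Advanced Degree", which the last line renames.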

table(wage$education)
## 
##          < HS Grad 5. Advanced Degree       College Grad 
##                268                424                684 
##            HS Grad       Some College 
##                967                646
wage$education[wage$education=="5. Advanced Degree"] <- "Advanced Degree"

Cleaning

In this section we look for missing and unwanted data and then remove or clean it.

##Any missing values?
#is.na(wage)

sum(is.na(wage))
## [1] 55
table(is.na(wage))
## 
## FALSE  TRUE 
## 32945    55

is.na() checks whether each value is NA, i.e. not available. It returns TRUE where the value is NA and FALSE otherwise, so sum() over the result counts the missing values.

colSums(is.na(wage))
##       year        age     maritl       race  education     region 
##          0          0          0          0         11         15 
##   jobclass     health health_ins    logwage       wage 
##          0          0         10          0         19

colSums() sums the values in each column, which here gives the number of missing values per column. Because education and health_ins are qualitative variables, we impute their NA values with the mode (the most frequent value).

wage$education[is.na(wage$education)]<- names(table(wage$education)) [table(wage$education)== max(table(wage$education))]
wage$health_ins[is.na(wage$health_ins)]<- names(table(wage$health_ins)) [table(wage$health_ins)== max(table(wage$health_ins))]
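The mode expression above returns every tied value when two levels share the top count, which would make the assignment recycle values. A small helper that always returns a single value is one way around this. This is only a sketch; stat_mode() and impute_mode() are hypothetical names, not functions from base R or dplyr.

```r
# Most frequent non-missing value; ties are broken by taking the first level
stat_mode <- function(x) {
  counts <- table(x)                 # table() ignores NA by default
  names(counts)[which.max(counts)]   # which.max() returns one index only
}

# Replace NA entries of a character vector with its mode
impute_mode <- function(x) {
  x[is.na(x)] <- stat_mode(x)
  x
}

education_toy <- c("HS Grad", "College Grad", NA, "HS Grad", NA)
impute_mode(education_toy)
```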

wage is a quantitative variable, so we replace its NA values with the median wage.

wage$wage[is.na(wage$wage)]<- median(wage$wage,na.rm = TRUE)
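The na.rm = TRUE argument matters here: without it, a single NA makes median() return NA and nothing useful would be imputed. A minimal sketch on a made-up vector (wages_toy and med are hypothetical names):

```r
wages_toy <- c(70.5, NA, 131.0, 75.0, 154.7)   # hypothetical values

median(wages_toy)                      # NA, because one value is missing
med <- median(wages_toy, na.rm = TRUE) # median of the observed values only

wages_toy[is.na(wages_toy)] <- med     # fill the gap with the median
wages_toy
```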

Let's check whether any missing values remain.

colSums(is.na(wage))
##       year        age     maritl       race  education     region 
##          0          0          0          0          0         15 
##   jobclass     health health_ins    logwage       wage 
##          0          0          0          0          0

Only one column, region, still has missing values. Since we know all the data comes from the Mid-Atlantic region, the column carries no information and we remove it from the data frame.

wage<- select(wage,-region)

Let's see whether anything else needs cleaning.

head(wage,3)

The health_ins column has two values, yes and no, but each carries a numeric prefix. We replace “1. Yes” with “Yes” and “2. No” with “No”, which reads more clearly.

wage$health_ins[wage$health_ins=="2. No"]<- "No"
wage$health_ins[wage$health_ins=="1. Yes"]<- "Yes"
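Both replacements remove a numeric prefix from a label. If many levels carried such prefixes, a regular expression could strip them in one pass. This is a sketch with a hypothetical helper name, strip_prefix(), and made-up input; it is not part of the original workflow.

```r
# Remove a leading "<digits>. " prefix, leaving other values untouched
strip_prefix <- function(x) {
  sub("^[0-9]+\\.[[:space:]]*", "", x)
}

labels_toy <- c("1. Yes", "2. No", "5. Advanced Degree", "HS Grad", NA)
strip_prefix(labels_toy)
```

Values without a prefix and NA entries pass through unchanged, so the helper can be applied to a whole column safely.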

Let's see how the data looks after the cleaning steps above.

head(wage,3)

Analysis

In this section we analyse the data with some basic statistics in R.

Descriptive Analyses

In this section we walk through some basic statistics and explore our data.

  • max() gives the maximum value in the given column of data.

#maximum value
max(wage$wage)
## [1] 318.3424
  • min() gives the minimum value in the given column of data.
#minimum value
min(wage$age)
## [1] 18
  • length() gives the length of the vector.
#length of the vector
length(wage$year)
## [1] 3000
  • mean() calculates the mean value
#mean
mean(wage$wage)
## [1] 111.5287
  • median() calculates the median
#Median
median(wage$age)
## [1] 42
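  • R has no built-in function for the statistical mode (mode() returns the storage type of an object), so this expression picks the most frequent value from table().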
#mode
#mode(wage$race)
names(table(wage$race)) [table(wage$race)== max(table(wage$race))]
## [1] "White"
  • range() gives the minimum and maximum values of the given vector.
#Range
range(wage$wage)
## [1]  20.08554 318.34243
  • quantile() gives the quantiles of the data at the requested probabilities.
##Quantiles
quantile(wage$age,c(0.0,0.25,0.50,0.75,1))
##   0%  25%  50%  75% 100% 
##   18   34   42   50   80
  • var() gives the variance, i.e. how far the data is spread out around the mean.
##Variance
var(wage$wage)
## [1] 1729.491
  • sd() gives the standard deviation, i.e. the square root of the variance.
#Standard deviation
sd(wage$wage)
## [1] 41.58715
  • summary() is one of the most useful functions in R. For numeric columns it shows the minimum, maximum, mean, median and quartiles in a single call.
##Summary
summary(wage)
##       year           age           maritl              race          
##  Min.   :2003   Min.   :18.00   Length:3000        Length:3000       
##  1st Qu.:2004   1st Qu.:34.00   Class :character   Class :character  
##  Median :2006   Median :42.00   Mode  :character   Mode  :character  
##  Mean   :2006   Mean   :42.38                                        
##  3rd Qu.:2008   3rd Qu.:50.00                                        
##  Max.   :2009   Max.   :80.00                                        
##   education           jobclass            health         
##  Length:3000        Length:3000        Length:3000       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##   health_ins           logwage           wage       
##  Length:3000        Min.   :3.000   Min.   : 20.09  
##  Class :character   1st Qu.:4.447   1st Qu.: 85.38  
##  Mode  :character   Median :4.653   Median :104.92  
##                     Mean   :4.655   Mean   :111.53  
##                     3rd Qu.:4.857   3rd Qu.:128.68  
##                     Max.   :5.763   Max.   :318.34
  • str() displays the internal structure of an R object.
str(wage)
## 'data.frame':    3000 obs. of  10 variables:
##  $ year      : int  2006 2004 2003 2003 2005 2008 2009 2008 2006 2004 ...
##  $ age       : int  18 24 45 43 50 42 44 30 41 52 ...
##  $ maritl    : chr  "Never Married" "Never Married" "Married" "Married" ...
##  $ race      : chr  "White" "White" "White" "Asian" ...
##  $ education : chr  "< HS Grad" "College Grad" "Some College" "College Grad" ...
##  $ jobclass  : chr  "Industrial" "Information" "Industrial" "Information" ...
##  $ health    : chr  "<=Good" ">=Very Good" "<=Good" ">=Very Good" ...
##  $ health_ins: chr  "No" "No" "Yes" "Yes" ...
##  $ logwage   : num  4.32 4.26 4.88 5.04 4.32 ...
##  $ wage      : num  75 70.5 131 154.7 75 ...
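To make the var() and sd() bullets concrete, the sample variance can be computed by hand and compared with the built-in functions. The vector x_toy is made up for illustration.

```r
x_toy <- c(75, 70.5, 131, 154.7, 75)   # hypothetical wages

n <- length(x_toy)
# Sample variance: squared deviations from the mean, divided by n - 1
var_manual <- sum((x_toy - mean(x_toy))^2) / (n - 1)
# Standard deviation: square root of the variance
sd_manual <- sqrt(var_manual)

all.equal(var_manual, var(x_toy))
all.equal(sd_manual, sd(x_toy))
```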

Relation between the Variables

  • cor() displays the correlation between the quantitative variables of the data frame. Correlation shows whether and how strongly pairs of variables are linearly related; it ranges from -1 to +1.
#Relation and Correlation
cor(wage%>% select(age,logwage,wage))
##               age   logwage      wage
## age     1.0000000 0.2140623 0.1960569
## logwage 0.2140623 1.0000000 0.9448977
## wage    0.1960569 0.9448977 1.0000000
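cor() computes Pearson's correlation coefficient by default. As a sketch of what that means, the coefficient can be rebuilt from deviations around the means; the age_toy and wage_toy vectors are invented for illustration.

```r
age_toy  <- c(18, 24, 45, 43, 50)        # hypothetical values
wage_toy <- c(75, 70.5, 131, 154.7, 75)

dx <- age_toy - mean(age_toy)
dy <- wage_toy - mean(wage_toy)
# Pearson's r: covariation scaled by the spread of each variable
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

all.equal(r_manual, cor(age_toy, wage_toy))
```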

cor() cannot be used to measure the relation between qualitative variables. Instead we use the chi-square test, whose null hypothesis is that the two variables are independent. We first test the relation between jobclass and the education of the workers.

## Hypothesis testing
chisq.test(table(wage$jobclass,wage$education))
## 
##  Pearson's Chi-squared test
## 
## data:  table(wage$jobclass, wage$education)
## X-squared = 283.14, df = 4, p-value < 2.2e-16

The p-value is below the cut-off of 0.05, so we reject the null hypothesis of independence in favour of the alternative. That means jobclass and education are related.
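The statistic reported by chisq.test() compares observed counts with the counts expected under independence. A minimal sketch on a made-up 2 x 3 table (tab, expected, stat, dof and p_value are hypothetical names) reproduces it by hand:

```r
# Hypothetical contingency table of counts (filled column by column)
tab <- matrix(c(30, 20,
                25, 25,
                10, 40),
              nrow = 2,
              dimnames = list(jobclass = c("Industrial", "Information"),
                              education = c("HS", "College", "Advanced")))

# Expected counts if rows and columns were independent
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)

# Pearson's chi-squared statistic, degrees of freedom and p-value
stat <- sum((tab - expected)^2 / expected)
dof <- (nrow(tab) - 1) * (ncol(tab) - 1)
p_value <- pchisq(stat, dof, lower.tail = FALSE)

res <- chisq.test(tab)
all.equal(stat, unname(res$statistic))
all.equal(p_value, res$p.value)
```

The degrees of freedom follow (rows - 1) x (columns - 1), which is why a table of 2 job classes by 5 education levels gives df = 4.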

## Hypothesis testing
chisq.test(table(wage$maritl,wage$health))
## 
##  Pearson's Chi-squared test
## 
## data:  table(wage$maritl, wage$health)
## X-squared = 12.939, df = 4, p-value = 0.01158

The p-value (0.01158) is again below 0.05, so this output shows that the marital status and health of the workers are related.

Exploratory Analyses

In this section we analyse the data and create visualizations. We use ggplot2, the visualization package of the tidyverse. ggplot2 is a powerful and flexible package for making high-quality visualizations.

Distribution of data

The distribution describes how the values of a variable are spread. Density plots and histograms help us see the distribution of the data.

#distribution
g<-ggplot(data = wage,aes(wage))
g+geom_density()

Skewness and Kurtosis

library(moments)
normal<-as.data.frame(rnorm(500))
names(normal)<- c("Norm")
ggplot(normal)+geom_density(aes(Norm))

skewness(normal)
##       Norm 
## 0.07798921

This code draws 500 random values from a normal distribution. The normal distribution is bell-shaped and symmetric, with the center of the curve at the mean: half of the values lie to the left of the mean and half to the right. Its skewness is therefore close to 0.

g<-ggplot(data = wage,aes(age))
g+geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

g+geom_density()

skewness(wage$age)
## [1] 0.1480543
kurtosis(wage$age)
## [1] 2.571931

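The skewness of age is slightly positive, so the distribution is mildly right-skewed. kurtosis() from the moments package returns Pearson's kurtosis, for which a normal distribution scores 3. The value of 2.57 is below 3, so the distribution of age is platykurtic, i.e. somewhat flatter than a normal distribution.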
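Both measures are ratios of central moments. The sketch below computes the moment-based (Pearson) versions in base R, which is the definition the moments package is understood to use; central_moment(), skew_manual() and kurt_manual() are hypothetical helper names. On this scale a normal distribution has kurtosis 3.

```r
# k-th central moment of a numeric vector
central_moment <- function(x, k) mean((x - mean(x))^k)

# Skewness: third central moment scaled by the variance to the power 3/2
skew_manual <- function(x) central_moment(x, 3) / central_moment(x, 2)^1.5
# Kurtosis: fourth central moment scaled by the squared variance
kurt_manual <- function(x) central_moment(x, 4) / central_moment(x, 2)^2

flat <- 1:9           # evenly spread, symmetric values
skew_manual(flat)     # 0: the values are symmetric around the mean
kurt_manual(flat)     # below 3: flatter than a normal distribution
```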

Barplots

Bar plots show the count of each value of a variable.

ggplot(as.data.frame(table(wage$race)),aes(Var1,Freq))+geom_col(aes(fill=Var1))

Lineplot

Line plots are mostly used to display time series data. Here we use one to show the number of workers at each age.

ggplot(as.data.frame(table(wage$age),stringsAsFactors = FALSE),aes(Var1,Freq))+geom_point()+geom_line(aes(group=1))+scale_x_discrete(breaks=c(18,28,38,48,58,68,80))

Boxplot

A boxplot, or box-and-whisker diagram, shows the quartiles in a single plot. It also shows outliers, which are extremely high or extremely low values.

#Boxplot
boxplot(wage%>% select(age,wage))
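By default, base R's boxplot() draws a point as an outlier when it lies more than 1.5 box lengths beyond the hinges. The sketch below applies that rule to a made-up vector, first through boxplot.stats() and then by hand; ages_toy, hinges, fence and outliers are hypothetical names.

```r
ages_toy <- c(10, 12, 13, 14, 15, 16, 18, 95)   # hypothetical values, one extreme

stats <- boxplot.stats(ages_toy)   # the computation behind boxplot()
stats$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
stats$out     # points drawn individually as outliers

# The same rule by hand: beyond 1.5 box lengths from the hinges
hinges <- fivenum(ages_toy)[c(2, 4)]
fence <- 1.5 * diff(hinges)
outliers <- ages_toy[ages_toy < hinges[1] - fence | ages_toy > hinges[2] + fence]
outliers
```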

Visualization

After this introduction to the basic plots, we try to create more insightful visualizations in the section below.

maritl<-as.data.frame(table(wage$maritl,wage$age),stringsAsFactors = FALSE)%>% group_by(Var1)
g<-ggplot(maritl,aes(Var2,Freq))+geom_line(aes(group=Var1,color=Var1))
#theme(axis.text.x = element_text(angle = 90, hjust = 1))
g

g+theme(axis.text.x = element_text(angle = 90,hjust = 1))

g<-g+scale_x_discrete(breaks=c(18,28,38,48,58,68,80))+ggtitle("Marital Status and the age of the worker")+xlab("Age")+ylab("Count")
g+scale_color_discrete(name="Marital Status")

This line plot shows the number of workers at each age, with colour indicating marital status.

##Most data collected in which year
ggplot(as.data.frame(table(wage$year)),aes(Var1,Freq))+geom_col()+ggtitle("Count of the people in survey over the year")+xlab("Years")+ylab("Count")

This bar plot shows the number of records collected each year. The most data was collected in 2003.

race<-as_tibble(table(wage$race))

A tibble is a modern version of the data frame.

head(race,3)

The race object has two columns, named Var1 and n. These names do not describe the values they hold, so we rename the columns.

#Renaming
names(race)<- c("Race","Count")
names(race)
## [1] "Race"  "Count"

The column names now describe the data they hold.
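An alternative that avoids the rename step is to name the variable inside table() and set the count column's name when converting. This base R sketch uses a made-up vector; race_toy and race_counts are hypothetical names.

```r
race_toy <- c("White", "Black", "White", "Asian", "White")

# Naming the table dimension gives the first column its name;
# responseName sets the name of the count column
race_counts <- as.data.frame(table(Race = race_toy),
                             responseName = "Count",
                             stringsAsFactors = FALSE)
names(race_counts)
race_counts
```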

g<-ggplot(race,aes(Race,Count))
g+geom_col(aes(fill=Race))+labs(title="Workers count on basis of Race")+theme(legend.position = "none")+geom_text(aes(Race,Count,label=Count))

This plot shows that most workers in the data are White.

edu<-as_tibble(table(wage$education))
names(edu)<- c("Education","Count")
g<- ggplot(edu,aes(Education,Count))
g + geom_col(aes(fill=Education))+ggtitle("Count of Workers on basis of Education")+theme(legend.position = "none")+geom_text(aes(Education,Count,label=Count))

This bar plot shows that the largest group of workers has HS Grad education.

table(wage$education)
## 
##       < HS Grad Advanced Degree    College Grad         HS Grad 
##             268             424             684             978 
##    Some College 
##             646
health_age<-as_tibble(table(wage$age,wage$health_ins))
names(health_age)<- c("Age","Health_insurance","Count")
names(health_age)
## [1] "Age"              "Health_insurance" "Count"

In the plot below we plot the count of workers by age and health insurance.

g<- ggplot(health_age,aes(Age,Count))
g+geom_line(aes(group=Health_insurance,color=Health_insurance))+geom_point(aes(color=Health_insurance))+scale_x_discrete(breaks=c(18,28,38,48,58,68,80))

We can see that few workers aged 18 to 24 have health insurance. The largest number of insured workers is in the 38 to 48 age range.

ggplot(wage,aes(age))+geom_density()

This density plot shows that the largest share of workers is in the 38 to 48 age range.

indus_edu_hea<-as_tibble(table(wage$jobclass,wage$education,wage$health))
names(indus_edu_hea)<-c("Job","Education","Health","Count")
g<- ggplot(indus_edu_hea,aes(Education,Count))
g+geom_col(aes(fill=Education))+facet_grid(.~Health)+theme(axis.text.x = element_blank())+labs(title="Health Conditions and Education of the worker")

This bar plot shows several variables in a single plot. It is split into two panels by the health of the workers, and the colour indicates the education level.

g<- ggplot(indus_edu_hea,aes(Education,Count))
g+geom_col(aes(fill=Education))+facet_grid(.~Job)+theme(axis.text.x = element_blank())+labs(title="Education and Job of the worker")

This plot shows that many workers with HS Grad education work in the industrial sector.

mari_health <- as.data.frame(table(wage$maritl,wage$health_ins))
names(mari_health)<- c("Marital","Health_ins","Count")
names(mari_health)
## [1] "Marital"    "Health_ins" "Count"
ggplot(mari_health)+geom_col(aes(Marital,Count,fill=Marital))+facet_grid(~Health_ins)+theme_minimal()+theme(axis.text.x = element_blank())

This plot shows that most of the workers are married and have health insurance.

table(wage$maritl,wage$health_ins)
##                
##                   No  Yes
##   Divorced        53  151
##   Married        586 1491
##   Never Married  250  395
##   Separated       20   35
##   Widowed          5   14
race_health_ins <- as.data.frame(table(wage$race,wage$health_ins))
names(race_health_ins)<- c("Race","Health_ins","Count")
names(race_health_ins)
## [1] "Race"       "Health_ins" "Count"
ggplot(race_health_ins)+geom_col(aes(Race,Count,fill=Race))+facet_grid(~Health_ins)+theme_minimal()+theme(axis.text.x = element_blank())

Wage

We now focus on wage and compute the average wage across the other columns.

range(wage$wage)
## [1]  20.08554 318.34243

The maximum wage is about 318 and the minimum wage is about 20.

ggplot(aggregate(wage~age,wage,mean))+geom_line(aes(x=age,y=wage),color="red")+theme_grey()+labs(title="Average Wage and Age of worker")

This line plot shows the average wage at each age.
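aggregate() with a formula such as wage ~ age computes one summary value per group. A minimal sketch on a made-up data frame (toy_wages and avg are hypothetical names):

```r
toy_wages <- data.frame(
  education = c("HS Grad", "HS Grad", "College Grad", "College Grad"),
  wage = c(80, 100, 120, 140)
)

# Mean wage within each education level; the result is a data frame
avg <- aggregate(wage ~ education, data = toy_wages, FUN = mean)
avg

# tapply() gives the same means as a named vector
tapply(toy_wages$wage, toy_wages$education, mean)
```

Because the result is an ordinary data frame with one row per group, it can be passed straight to ggplot(), as the plots in this section do.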

ggplot(aggregate(wage~education,wage,mean))+geom_col(aes(education,wage,fill=education))+theme_minimal()+theme(legend.position = "none")+labs(title="Wage and Education of the Workers",x="Education",y="Wage")

We can see from this plot that workers with an Advanced Degree earn more than the others. Workers with less than HS Grad education are the lowest paid in this survey.

ggplot(aggregate(wage~race,wage,mean))+geom_col(aes(race,wage,fill=race))+theme_minimal()+theme(legend.position = "none")+labs(x="Race",y="Wage",title="Average wage of the workers from different Race")

The average wage of Asian workers is higher than that of the other groups, and the Other group has the lowest average wage.

ggplot(aggregate(wage~maritl,wage,mean))+geom_col(aes(maritl,wage,fill=maritl))+theme(legend.position = "none")

ggplot(aggregate(wage~jobclass,wage,mean))+geom_col(aes(jobclass,wage,fill=jobclass))+theme(legend.position = "none")

Workers in the information sector earn more on average than industrial workers.

ggplot(aggregate(wage~education+jobclass,wage,mean))+geom_col(aes(education,wage,fill=education))+facet_grid(~jobclass)+theme_minimal()+theme(axis.text.x = element_blank())

ggplot(aggregate(wage~race+jobclass,wage,mean))+geom_col(aes(race,wage,fill=race))+facet_grid(~jobclass)+theme(legend.position = "none")

ggplot(aggregate(wage~maritl+health,wage,mean))+geom_col(aes(maritl,wage,fill=maritl))+facet_grid(~health)+scale_fill_discrete(name="Marital Status")+theme_minimal()+theme(axis.text.x = element_blank())

Report Creation

The analysis produced several insights, and we now need to report them. R Markdown (Rmd) makes it easy to create a report in several formats (PDF, HTML, Word). I will create the report of this wage analysis using Rmd and the ggplot2 graphs.