Data Sources
- Sensor
- Text
- Image
- Videos
- Survey
Data Types
Basic structured data types: 1. Numeric 2. Categorical
- Continuous (numeric): wind speed
- Discrete (numeric): count of the occurrences of an event
- Categorical: type of TV screen
- Binary (categorical): yes/no, 0/1
- Ordinal (categorical): 1, 2, 3, 4, 5
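As a rough sketch, these types map onto R's own classes as follows. The vector names and values are made up for illustration and do not come from the mpg data.

```r
# Illustrative values only; none of these come from the mpg data.
wind_speed  <- c(3.2, 7.5, 12.1)                 # continuous  -> numeric (double)
event_count <- c(0L, 2L, 5L)                     # discrete    -> integer
screen      <- factor(c("LCD", "LED", "plasma")) # categorical -> factor
answered    <- c(TRUE, FALSE, TRUE)              # binary      -> logical
rating      <- factor(c(3, 1, 5), levels = 1:5, ordered = TRUE)  # ordinal -> ordered factor

class(wind_speed)
class(event_count)
class(screen)
class(answered)
class(rating)

# Ordered factors support comparisons; plain factors do not.
rating[1] < rating[3]
```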
Loading Data
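mpg <- read.csv("mpg.csv", stringsAsFactors = TRUE) ## R >= 4.0 reads strings as character by default; factors match the str() output below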
Fuel economy data from 1999 and 2008 for 38 popular models of car
Description
This dataset contains a subset of the fuel economy data that the EPA makes available on http://fueleconomy.gov. It contains only models which had a new release every year between 1999 and 2008 - this was used as a proxy for the popularity of the car.
Data Exploration
str(mpg)
## 'data.frame': 234 obs. of 11 variables:
## $ manufacturer: Factor w/ 15 levels "audi","chevrolet",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ model : Factor w/ 38 levels "4runner 4wd",..: 2 2 2 2 2 2 2 3 3 3 ...
## $ displ : num 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
## $ year : int 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
## $ cyl : int 4 4 4 4 6 6 6 4 4 4 ...
## $ trans : Factor w/ 10 levels "auto(av)","auto(l3)",..: 4 9 10 1 4 9 1 9 4 10 ...
## $ drv : Factor w/ 3 levels "4","f","r": 2 2 2 2 2 2 2 1 1 1 ...
## $ cty : int 18 21 20 21 16 18 18 18 16 20 ...
## $ hwy : int 29 29 31 30 26 26 27 26 25 28 ...
## $ fl : Factor w/ 5 levels "c","d","e","p",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ class : Factor w/ 7 levels "2seater","compact",..: 2 2 2 2 2 2 2 2 2 2 ...
str() shows the structure of the data frame: the number of observations and variables, plus each column's type and first few values.
Chop the Head and Tail
head(mpg)
manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl | class |
---|---|---|---|---|---|---|---|---|---|---|
audi | a4 | 1.8 | 1999 | 4 | auto(l5) | f | 18 | 29 | p | compact |
audi | a4 | 1.8 | 1999 | 4 | manual(m5) | f | 21 | 29 | p | compact |
audi | a4 | 2.0 | 2008 | 4 | manual(m6) | f | 20 | 31 | p | compact |
audi | a4 | 2.0 | 2008 | 4 | auto(av) | f | 21 | 30 | p | compact |
audi | a4 | 2.8 | 1999 | 6 | auto(l5) | f | 16 | 26 | p | compact |
audi | a4 | 2.8 | 1999 | 6 | manual(m5) | f | 18 | 26 | p | compact |
head() gives the first 6 rows of the dataframe by default.
tail(mpg)
tail() gives the last 6 rows of the dataframe by default.
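Both functions take an n argument to change the number of rows. A minimal sketch on a toy data frame (df and its columns are made up for illustration):

```r
# Toy data frame, used only to show the n argument.
df <- data.frame(id = 1:10, value = (1:10)^2)

first_three <- head(df, n = 3)    # first 3 rows
last_two    <- tail(df, n = 2)    # last 2 rows
all_but_8   <- head(df, n = -8)   # a negative n drops that many rows from the end

first_three
last_two
all_but_8
```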
Central Tendency
A basic first step in exploring data is to estimate where most of the values are located.
Mean
mean(mpg$cyl) ## $ selects a column of the data frame
## [1] 5.888889
Median
median(mpg$cyl)
## [1] 6
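Mode
Base R has no built-in function for the statistical mode (mode() returns an object's storage mode); a helper function is defined under Exploring Binary and Categorical Data.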
Estimates of Variability
Variance
var(mpg$displ)
## [1] 1.669158
Standard Deviation (SD)
sd(mpg$cyl)
## [1] 1.611534
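The two estimates are directly related: the standard deviation is the square root of the variance. A small sketch checking this by hand, using the six displ values printed by head(mpg) above:

```r
# displ values of the first six rows of mpg, as printed by head(mpg).
x <- c(1.8, 1.8, 2.0, 2.0, 2.8, 2.8)
n <- length(x)

# Sample variance: sum of squared deviations from the mean, divided by n - 1.
manual_var <- sum((x - mean(x))^2) / (n - 1)
manual_sd  <- sqrt(manual_var)

all.equal(manual_var, var(x))
all.equal(manual_sd, sd(x))
```

Because var() is in squared units, sd() is usually easier to interpret: it is on the same scale as the data.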
Range
range(mpg$displ)
## [1] 1.6 7.0
Percentile
quantile(mpg$displ)
## 0% 25% 50% 75% 100%
## 1.6 2.4 3.3 4.6 7.0
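quantile() also accepts a probs argument for other percentiles, and IQR() gives the difference between the 75th and 25th percentiles (for mpg$displ, 4.6 - 2.4 = 2.2 from the output above). A short sketch on the six displ values printed by head(mpg):

```r
# displ values of the first six rows of mpg, as printed by head(mpg).
x <- c(1.8, 1.8, 2.0, 2.0, 2.8, 2.8)

# 10th and 90th percentiles instead of the default quartiles.
quantile(x, probs = c(0.1, 0.9))

# Interquartile range, computed directly and from the quartiles.
iqr_direct <- IQR(x)
iqr_manual <- unname(quantile(x, 0.75) - quantile(x, 0.25))
all.equal(iqr_direct, iqr_manual)
```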
Summary
summary(mpg)
## manufacturer model displ year
## dodge :37 caravan 2wd : 11 Min. :1.600 Min. :1999
## toyota :34 ram 1500 pickup 4wd: 10 1st Qu.:2.400 1st Qu.:1999
## volkswagen:27 civic : 9 Median :3.300 Median :2004
## ford :25 dakota pickup 4wd : 9 Mean :3.472 Mean :2004
## chevrolet :19 jetta : 9 3rd Qu.:4.600 3rd Qu.:2008
## audi :18 mustang : 9 Max. :7.000 Max. :2008
## (Other) :74 (Other) :177
## cyl trans drv cty hwy
## Min. :4.000 auto(l4) :83 4:103 Min. : 9.00 Min. :12.00
## 1st Qu.:4.000 manual(m5):58 f:106 1st Qu.:14.00 1st Qu.:18.00
## Median :6.000 auto(l5) :39 r: 25 Median :17.00 Median :24.00
## Mean :5.889 manual(m6):19 Mean :16.86 Mean :23.44
## 3rd Qu.:8.000 auto(s6) :16 3rd Qu.:19.00 3rd Qu.:27.00
## Max. :8.000 auto(l6) : 6 Max. :35.00 Max. :44.00
## (Other) :13
## fl class
## c: 1 2seater : 5
## d: 5 compact :47
## e: 8 midsize :41
## p: 52 minivan :11
## r:168 pickup :33
## subcompact:35
## suv :62
summary() gives basic statistics for numeric columns and counts for categorical columns.
Explore Data Distribution
Loading Packages
library(ggplot2)
ggplot2 is a visualization package from the tidyverse. The plots in this section use base R graphics; ggplot2 is used in the Data visualisation section.
Boxplot
boxplot(mpg$displ)
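A boxplot summarizes a distribution by its percentiles. The top and bottom of the box are the 75th and 25th percentiles, respectively, and the median is shown by the horizontal line in the box. The dashed lines, referred to as whiskers, extend from the top and bottom of the box to indicate the range for the bulk of the data; by default R draws them to the most extreme points within 1.5 times the interquartile range of the box and plots anything beyond that as individual points.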
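The numbers behind a boxplot can be inspected with boxplot.stats(). A sketch on a toy vector (the six displ values from head(mpg) plus the maximum, 7.0, chosen here so that one point falls outside the whiskers):

```r
# Toy vector: six displ values from head(mpg) plus the maximum displ, 7.0.
x <- c(1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 7.0)

stats <- boxplot.stats(x)

# Lower whisker, lower hinge, median, upper hinge, upper whisker.
stats$stats

# Points beyond the whiskers, drawn individually as outliers.
stats$out
```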
Histogram
hist(mpg$displ)
A histogram is a visualization of a frequency table, with bins on the x-axis and the count of observations in each bin on the y-axis.
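The bins can be set explicitly with the breaks argument, and plot = FALSE returns the underlying counts instead of drawing. A minimal sketch on the six displ values printed by head(mpg); the break points are an arbitrary choice for illustration:

```r
# displ values of the first six rows of mpg, as printed by head(mpg).
x <- c(1.8, 1.8, 2.0, 2.0, 2.8, 2.8)

# Bins of width 0.5; by default each bin includes its right edge.
h <- hist(x, breaks = seq(1.5, 3.0, by = 0.5), plot = FALSE)

h$breaks   # bin edges
h$counts   # number of observations per bin
```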
DensityPlot
hist(mpg$displ,freq = FALSE)
lines(density(mpg$displ),col="red",lwd=2)
Exploring Binary and Categorical Data
Mode
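Mode <- function(x) {
  ## which.max() gives the position of the largest count; names() turns it into the value itself
  names(which.max(table(x)))
}
Mode(mpg$manufacturer)
## [1] "dodge"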
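When two values are tied for most frequent, which.max() reports only the first. A minimal sketch of a tie-aware variant; all_modes is a hypothetical helper name, not a base R function, and the input vector is made up:

```r
# Hypothetical helper: returns every value tied for the highest count.
all_modes <- function(x) {
  counts <- table(x)
  names(counts)[counts == max(counts)]
}

# Made-up drive types in which "4" and "f" both occur twice.
modes <- all_modes(c("f", "f", "4", "4", "r"))
modes
```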
Bar Chart
barplot(table(mpg[["drv"]]))
Correlation
cor(mpg[c("displ", "cyl", "cty", "hwy")]) ## same columns as positions 3, 5, 8, 9
## displ cyl cty hwy
## displ 1.0000000 0.9302271 -0.7985240 -0.7660200
## cyl 0.9302271 1.0000000 -0.8057714 -0.7619124
## cty -0.7985240 -0.8057714 1.0000000 0.9559159
## hwy -0.7660200 -0.7619124 0.9559159 1.0000000
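By default cor() returns Pearson's correlation coefficient: the sum of the products of the deviations from the means, divided by the square root of the product of the summed squared deviations. A sketch that checks the formula by hand on the six displ and cty values printed by head(mpg):

```r
# displ and cty for the first six rows of mpg, as printed by head(mpg).
displ <- c(1.8, 1.8, 2.0, 2.0, 2.8, 2.8)
cty   <- c(18, 21, 20, 21, 16, 18)

dx <- displ - mean(displ)
dy <- cty - mean(cty)

# Pearson's r from its definition.
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

all.equal(r_manual, cor(displ, cty))
```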
library(corrplot)
## corrplot 0.84 loaded
corrplot(cor(mpg[c("displ", "cyl", "cty", "hwy")]))
Scatterplot
plot(mpg$displ,mpg$cty,xlab= "displ", ylab="cty")
The standard way to visualize the relationship between two measured data variables is with a scatterplot.
Exploring Two or More Variables (Multivariate)
Categorical and Numeric Data
boxplot(displ ~ drv, data = mpg)
title("Boxplot of Drive Type and Displacement")
Visualizing Multiple Variables
plot(displ~cty,data =mpg)
title("Displacement vs. City Miles per Gallon")
Basics of Data Transformation and Visualization
Prerequisites
In this section we will learn how to transform data using the dplyr package from the tidyverse.
library(dplyr)
Filter rows with filter()
filter() allows you to subset observations based on their values.
tail(filter(mpg,drv=="f", year == 2008))
row | manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl | class |
---|---|---|---|---|---|---|---|---|---|---|---|
44 | volkswagen | jetta | 2.5 | 2008 | 5 | manual(m5) | f | 21 | 29 | r | compact |
45 | volkswagen | new beetle | 2.5 | 2008 | 5 | manual(m5) | f | 20 | 28 | r | subcompact |
46 | volkswagen | new beetle | 2.5 | 2008 | 5 | auto(s6) | f | 20 | 29 | r | subcompact |
47 | volkswagen | passat | 2.0 | 2008 | 4 | auto(s6) | f | 19 | 28 | p | midsize |
48 | volkswagen | passat | 2.0 | 2008 | 4 | manual(m6) | f | 21 | 29 | p | midsize |
49 | volkswagen | passat | 3.6 | 2008 | 6 | auto(s6) | f | 17 | 26 | p | midsize |
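Conditions separated by commas are combined with "and". For "or", use | or %in%. A sketch on a small data frame built from the first six rows printed by head(mpg); mpg_head is a name chosen here for illustration:

```r
library(dplyr)

# First six rows of mpg as printed by head(mpg), a subset of the columns.
mpg_head <- data.frame(
  displ = c(1.8, 1.8, 2.0, 2.0, 2.8, 2.8),
  cyl   = c(4L, 4L, 4L, 4L, 6L, 6L),
  trans = c("auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto(l5)", "manual(m5)"),
  cty   = c(18L, 21L, 20L, 21L, 16L, 18L)
)

# Rows with 6 cylinders OR city mileage above 20.
six_or_efficient <- filter(mpg_head, cyl == 6 | cty > 20)

# Rows whose transmission is one of several values.
manuals <- filter(mpg_head, trans %in% c("manual(m5)", "manual(m6)"))

six_or_efficient
manuals
```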
Arrange rows with arrange()
arrange() changes the order of the rows, sorting them by the given columns.
head(arrange(mpg,drv,displ))
manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl | class |
---|---|---|---|---|---|---|---|---|---|---|
audi | a4 quattro | 1.8 | 1999 | 4 | manual(m5) | 4 | 18 | 26 | p | compact |
audi | a4 quattro | 1.8 | 1999 | 4 | auto(l5) | 4 | 16 | 25 | p | compact |
audi | a4 quattro | 2.0 | 2008 | 4 | manual(m6) | 4 | 20 | 28 | p | compact |
audi | a4 quattro | 2.0 | 2008 | 4 | auto(s6) | 4 | 19 | 27 | p | compact |
subaru | impreza awd | 2.2 | 1999 | 4 | auto(l4) | 4 | 21 | 26 | r | subcompact |
subaru | impreza awd | 2.2 | 1999 | 4 | manual(m5) | 4 | 19 | 26 | r | subcompact |
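Sorting is ascending by default; wrapping a column in desc() reverses it, and later columns break ties. A sketch on the first six rows printed by head(mpg) (mpg_head is a name chosen here for illustration):

```r
library(dplyr)

# cty and hwy for the first six rows of mpg, as printed by head(mpg).
mpg_head <- data.frame(
  cty = c(18L, 21L, 20L, 21L, 16L, 18L),
  hwy = c(29L, 29L, 31L, 30L, 26L, 26L)
)

# Highest highway mileage first; ties broken by ascending city mileage.
sorted <- arrange(mpg_head, desc(hwy), cty)
sorted
```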
Select columns with select()
select() lets you look at a subset of the data by naming the columns to keep.
head(select(mpg,class,trans,year,displ))
class | trans | year | displ |
---|---|---|---|
compact | auto(l5) | 1999 | 1.8 |
compact | manual(m5) | 1999 | 1.8 |
compact | manual(m6) | 2008 | 2.0 |
compact | auto(av) | 2008 | 2.0 |
compact | auto(l5) | 1999 | 2.8 |
compact | manual(m5) | 1999 | 2.8 |
If you put “-” before a column name, that column is dropped.
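A short sketch of dropping columns and of selecting by name pattern, on the first six rows printed by head(mpg). mpg_head is a name chosen here for illustration; starts_with() is one of the selection helpers that dplyr exports.

```r
library(dplyr)

# A subset of the columns for the first six rows of mpg, as printed by head(mpg).
mpg_head <- data.frame(
  displ = c(1.8, 1.8, 2.0, 2.0, 2.8, 2.8),
  year  = c(1999L, 1999L, 2008L, 2008L, 1999L, 1999L),
  cyl   = c(4L, 4L, 4L, 4L, 6L, 6L),
  cty   = c(18L, 21L, 20L, 21L, 16L, 18L)
)

without_year <- select(mpg_head, -year)             # drop one column
without_two  <- select(mpg_head, -c(year, cyl))     # drop several columns
c_columns    <- select(mpg_head, starts_with("c"))  # keep columns whose name starts with "c"

names(without_year)
names(without_two)
names(c_columns)
```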
Add new variables with mutate()
mutate() adds new columns at the end of the data frame, computed from the values of existing columns.
head(mutate(mpg,dmpg = hwy - cty))
manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl | class | dmpg |
---|---|---|---|---|---|---|---|---|---|---|---|
audi | a4 | 1.8 | 1999 | 4 | auto(l5) | f | 18 | 29 | p | compact | 11 |
audi | a4 | 1.8 | 1999 | 4 | manual(m5) | f | 21 | 29 | p | compact | 8 |
audi | a4 | 2.0 | 2008 | 4 | manual(m6) | f | 20 | 31 | p | compact | 11 |
audi | a4 | 2.0 | 2008 | 4 | auto(av) | f | 21 | 30 | p | compact | 9 |
audi | a4 | 2.8 | 1999 | 6 | auto(l5) | f | 16 | 26 | p | compact | 10 |
audi | a4 | 2.8 | 1999 | 6 | manual(m5) | f | 18 | 26 | p | compact | 8 |
Here dmpg is the difference between highway and city miles per gallon (hwy - cty).
Grouped summaries with summarise()
summarise() collapses the data frame to a single row.
summarise(mpg,avg_hwy=mean(hwy))
avg_hwy |
---|
23.44017 |
When summarise() is used with group_by(), the unit of analysis is changed from the complete dataset to individual groups.
grouped_mpg<- group_by(mpg,year,cyl)
summarise(grouped_mpg,avg_hwy = mean(hwy))
## # A tibble: 7 x 3
## # Groups: year [?]
## year cyl avg_hwy
## <int> <int> <dbl>
## 1 1999 4 28.4
## 2 1999 6 22.3
## 3 1999 8 17.0
## 4 2008 4 29.3
## 5 2008 5 28.8
## 6 2008 6 23.5
## 7 2008 8 18.0
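The same grouped summary is often written as a pipeline, and n() adds the group size next to the mean. A sketch on the first six rows printed by head(mpg); mpg_head is a name chosen here for illustration, and the .groups argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# year, cyl and hwy for the first six rows of mpg, as printed by head(mpg).
mpg_head <- data.frame(
  year = c(1999L, 1999L, 2008L, 2008L, 1999L, 1999L),
  cyl  = c(4L, 4L, 4L, 4L, 6L, 6L),
  hwy  = c(29L, 29L, 31L, 30L, 26L, 26L)
)

by_year_cyl <- mpg_head %>%
  group_by(year, cyl) %>%
  summarise(avg_hwy = mean(hwy), n = n(), .groups = "drop")  # "drop" returns an ungrouped result

by_year_cyl
```

Reporting n alongside an average is a useful habit: a mean based on two rows deserves less trust than one based on fifty.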
Data visualisation
“The simple graph has brought more information to the data analyst’s mind than any other device.” — John Tukey
Prerequisites
In this section we will use ggplot2, one of the core members of the tidyverse.
ScatterPlot
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class))
Barplot
ggplot(data = mpg) + geom_bar(aes(x=class))
ggplot(data = mpg) + geom_bar(aes(x=class, fill = class))
DensityPlot and Histogram
ggplot(data = mpg) + geom_histogram(aes(x= displ))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data = mpg) + geom_density(aes(x=displ))
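The stat_bin() message above asks for an explicit bin width. A sketch that sets one; the value 0.5 is an arbitrary choice for illustration, and the block uses the copy of mpg that ships with ggplot2 so that it runs without mpg.csv.

```r
library(ggplot2)

# ggplot2::mpg is the packaged copy of the dataset (234 rows).
p <- ggplot(data = ggplot2::mpg) +
  geom_histogram(aes(x = displ), binwidth = 0.5)  # bins of 0.5 litres instead of the default 30 bins

p
```

Trying a few values of binwidth is worthwhile, since bins that are too wide hide structure and bins that are too narrow show mostly noise.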
Boxplot
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
geom_boxplot()
Facets
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ class, nrow = 2)