Module 2 Mini Case Studies

Problem 1

Question a

First we use read.csv() function to read in the US retail sales dataset.

library(forecast)

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

# read in the dataset.
retail <- read.csv("Module 1/US_Retail_sales_2.csv")

There are 181 observations and 5 variables. It is a monthly data.

monthly_sale_ts <- ts(retail$SALES...millions., start = c(2001,1),end = c(2015, 12), frequency = 12)
autoplot(monthly_sale_ts, main = "Monthly Retail Sales", ylab = "Sales in Millions")

Question b

We first use window() function to extract the data for the last four years, then we can use ggseasonal or seasonal to create seasonal plot.

last_four_yrs <- window(monthly_sale_ts, start = c(2012, 1), end = c(2015, 12))
ggseasonplot(last_four_yrs, main = "Seasonal plot of last 4 years", xlab = "Month", ylab = "Sales in Millions")

seasonplot(last_four_yrs, main = "Seasonal plot of last 4 years", xlab = "Month", ylab = "Sales in Millions", year.labels = T)

The sales achieves the peak on December each year and the sales in January is the lowest of each year.

Problem 2

Question a

To make a scatter plot, we can use function plot(Y ~ X). Please remember to add the main title and labels to the axis.

injury <- read.csv("Module 1/Rail_safety.csv")
plot(injury$Injuries ~ injury$Train.miles, main = "Scatter plot of injury and train miles", xlab = "Train miles", ylab = "Number of Injuries")

Question b

To make the time series plots, we need first creat time series objects in R through function ts(). As the it is the annual data, the frequency should be 1 (1 observation per year).

injury_ts <- ts(injury$Injuries, start = 1990, end = 2009, frequency = 1)
miles_ts <- ts(injury$Train.miles, start = 1990, end = 2009, frequency = 1)
injury_per_mile_ts <- ts(injury$Injuries.per.T.M, start = 1990, end = 2009, frequency = 1)

Then we can use autoplot() function to create time series plots. Please remember to add the titles and labels to axis.

autoplot(injury_ts, main = "Time Series Plot of Injuries", xlab = "Year", ylab = "Number of Injuries")

autoplot(miles_ts, main = "Time Series Plot of Miles", xlab = "Year", ylab = "Number of Miles")

autoplot(injury_per_mile_ts, main = "Time Series Plot of Injuries per Mile", xlab = "Year", ylab = "Number of Injuries per Miles")

Question c

Yes, the number of injuries is increasing over the time, and it achieved the peak in the year of 2007.

Question d

Please note that the build-in function mad() is calculate the Median Absolute Deviation. Here we are calculate the Mean Absolute Deviation.

MAD_func <- function(x){
  return(mean(abs(x - mean(x))))
}

# Injury
mean(injury$Injuries)
median(injury$Injuries)
MAD_func(injury$Injuries)
sd(injury$Injuries)

# Miles
mean(injury$Train.miles)
median(injury$Train.miles)
MAD_func(injury$Train.miles)
sd(injury$Train.miles)

# Injury per miles
mean(injury$Injuries.per.T.M)
median(injury$Injuries.per.T.M)
MAD_func(injury$Injuries.per.T.M)
sd(injury$Injuries.per.T.M)

Injuries: mean: 726.55, median: 629.5, MAD: 233.26, S: 303.1238

Miles: mean: 83.95, median: 83, MAD: 7.85, S: 9.10451

Injuries per mile: mean: 843.4324, median: 776.9231, MAD: 197.4803, S: 265.6596

Problem 3

Question a

Again, we can use plot(Y ~ X) function to draw a scatter plot.

electricity <- read.csv("Module 1/Electricity.csv")
plot(electricity$Forecast ~ electricity$Actual, main = "Electricity Consumption", xlab = "Actual Consumption", ylab = "Predicted Consumption")

Question b

We can use cor() function to calculate the linear correlation coefficient. As the correlation coefficient is 0.95, there is a strong positive linear correlation between the predicted and the actual electricity consumption.

cor(electricity$Actual, electricity$Forecast)

## [1] 0.9547757

Problem 4

prediction_1 <- c(13.1, 15.4, 11.6, 20.7, 9.1, 11.9, 19.5, 14.2, 14.7, 7.4, 18.1, 19.3, 22.3, 11.4, 10.5, 21.1, 15.6, 22.0, 14.8, 16.8)

prediction_2 <- c(18.0, 13.8, 7.8, 22.1, 11.5, 13.1, 15.0, 14.7, 12.0, 6.3, 24.0, 16.4, 23.4, 15.2, 5.5, 21.2, 20.8, 20.2, 15.9, 16.7)

actual <- c(12.5, 16.3, 11.6, 21.4, 10.3, 12.9, 18.2, 14.9, 12.2, 8.0, 20.0, 18.1, 21.6, 12.7, 9.8, 20.2, 14.8, 19.7, 12.6, 16.6)

Question a

We can use scale() function to get the Z-scores for all the observations, and then extract the 5th.

scale(actual)[5]

## [1] -1.196853

We can also manually calculate the Z-score by \(\frac{obs - mean}{sd}\).

(actual[5] - mean(actual)) / sd(actual)

## [1] -1.196853

Question b

sqrt(mean( (actual - prediction_1)^2 ))

## [1] 1.269449

sqrt(mean( (actual - prediction_2)^2 ))

## [1] 2.838486

Question c

Model 1 is better, as it has a lower RMSE, which indicates a more accurate prediction.

Question d

As we are asked to construct a 95% confidence prediction itnerval, the corresponding Z score we are going to use is qnorm(1 - 5%/2), 1.96. We can simply apply the following fomula for the upper and lower bound: \[ \text{Lower Bound} = \text{Point Estimation} - Z * \text{RMSE} \\ \text{Upper Bound} = \text{Point Estimation} + Z * \text{RMSE} \]

19.5 - 1.96 * 1.269449

## [1] 17.01188

19.5 + 1.96 * 1.269449

## [1] 21.98812

The 95% confidence prediction interval is [ 17.01, 21.99].

Module 2 Mini Case Studies

Jiantong Wang

Problem 1

Question a

Question b

Problem 2

Question a

Question b

Question c

Question d

Problem 3

Question a

Question b

Problem 4

Question a

Question b

Question c

Question d