Module 1 Mini Cases

Problem 1

The average monthly temperatures for Boulder, Colorado, from January 1991 to December 2015 are given in Boulder_2.xlsx (Source: U.S. Department of Commerce, National Oceanic and Atmospheric Administration). Plot the time series, and create a seasonal plot for the last four years of the series. Comment on your results.

1.Read in the data

Set working directory, and download the dataset in your working directory.

temperature <- read.csv("Boulder_2.csv")
str(temperature)

## 'data.frame':    300 obs. of  3 variables:
##  $ Year       : int  1991 1991 1991 1991 1991 1991 1991 1991 1991 1991 ...
##  $ Month      : chr  "JAN" "FEB" "MAR" "APR" ...
##  $ Temperature: num  29.9 40.9 42.8 47.8 58.2 66.6 70.5 69.2 61.7 52.1 ...

tail(temperature)

##     Year Month Temperature
## 295 2015   JUL        70.3
## 296 2015   AUG        70.5
## 297 2015   SEP        69.4
## 298 2015   OCT        56.2
## 299 2015   NOV        40.8
## 300 2015   DEC        33.1

2. Create time series object

Before using the ts() function, remember to library the package you want to use.

library(forecast)

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

temp.ts <- ts(temperature$Temperature, start = c(1991,1), end = c(2015,12), frequency = 12)
temp.ts

##       Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec
## 1991 29.9 40.9 42.8 47.8 58.2 66.6 70.5 69.2 61.7 52.1 36.8 35.3
## 1992 35.9 40.6 43.3 54.3 59.1 62.9 68.3 66.3 64.4 54.1 34.1 29.2
## 1993 28.3 30.6 42.4 47.6 57.5 64.5 69.5 67.3 58.8 48.7 35.6 35.4
## 1994 35.5 31.9 43.9 47.6 60.8 70.0 71.2 70.9 65.0 50.6 36.6 36.1
## 1995 34.5 38.3 42.1 45.1 50.9 62.4 70.5 74.0 60.4 50.5 45.0 36.3
## 1996 29.7 37.7 37.9 50.4 58.9 66.9 71.5 69.5 60.8 53.1 40.6 36.5
## 1997 31.3 32.8 45.5 42.8 57.4 66.5 71.4 68.7 64.0 52.7 37.9 33.9
## 1998 36.5 36.4 38.7 46.5 58.8 62.1 72.8 70.7 67.1 50.4 44.0 32.2
## 1999 36.4 42.1 46.0 44.5 55.6 64.8 73.5 69.3 58.5 51.9 48.0 36.9
## 2000 36.4 41.0 42.9 51.2 61.0 67.4 74.7 73.0 63.1 49.6 31.4 31.2
## 2001 33.0 32.3 40.8 50.7 58.4 68.9 75.0 71.8 65.0 53.8 43.9 35.0
## 2002 33.1 36.0 37.5 52.9 56.2 70.5 76.8 71.3 64.0 45.9 40.3 36.6
## 2003 40.2 31.6 43.7 50.6 57.4 62.8 75.9 72.9 60.5 57.4 39.0 36.4
## 2004 35.4 33.6 48.2 49.2 59.9 62.7 69.2 66.4 62.9 51.9 39.7 36.5
## 2005 35.5 37.9 42.0 48.4 57.6 65.4 75.1 69.7 66.5 52.6 45.0 33.3
## 2006 40.7 33.7 39.4 53.9 61.0 71.6 74.4 71.6 58.4 51.0 43.4 35.3
## 2007 27.2 34.6 47.6 47.9 58.0 67.7 74.8 73.7 64.5 55.2 44.9 30.2
## 2008 31.6 36.1 40.8 47.8 57.1 66.1 75.0 69.6 60.9 46.0 46.0 31.1
## 2009 38.2 39.7 44.3 47.3 59.3 63.0 69.6 69.6 63.1 44.5 43.8 26.7
## 2010 33.0 30.1 42.7 48.8 53.9 66.9 72.5 72.4 66.6 55.0 39.8 37.2
## 2011 33.1 32.0 45.2 48.9 53.7 67.6 73.5 75.1 63.4 52.9 42.3 32.2
## 2012 38.9 32.3 50.8 54.3 60.2 74.2 74.8 73.2 66.0 50.8 46.1 33.7
## 2013 33.0 32.1 40.5 43.8 57.7 69.9 72.2 72.2 65.1 47.8 43.2 31.5
## 2014 34.6 32.0 43.6 49.8 56.6 66.2 72.2 69.2 63.8 55.3 38.3 33.9
## 2015 36.5 36.6 46.1 50.1 52.4 68.3 70.3 70.5 69.4 56.2 40.8 33.1

3. Make time series plot

Remember to add main title to your plot and add labels to X-axis and Y-axis.

autoplot(temp.ts,main = "Everage tempurature", xlab = "Time", ylab = "°F")

4. Create seasonal plot for the last your years

First we need to extract the last four years data. We can use window() function to extract a part of time series object. The first argument is the time series object from which you want to extract. The second and third argument is the start date and end date.

temp.last4.ts <- window(temp.ts,c(2012,1),c(2015,12))

Then we can use seasonplot() or ggseasonplot() to create a season plot. We can set argument year.labels = T to enable labels for each year, and we can use col to specify each year’s color. Here we also set year.label.left to T to plot labels on the left side of the season plot.

seasonplot(temp.last4.ts,year.labels = T, year.labels.left = T,col = 1:4, main = "Season Plot for Average Tempurature", ylab = "°F", xlab = "Month")

ggseasonplot(temp.last4.ts,year.labels = T, year.labels.left = T, main = "Season Plot for Average Tempurature", ylab = "°F", xlab = "Month")

Problem 2

For the temperature data (Boulder_2.xlsx) in Exercise 2.1, compute the summary statistics (mean, median, MAD, and S) overall and for each month. Comment on your results. Does it make sense to compute summary statistics across all values, rather than month by month? Explain why or why not.

1. Read in the dataset

We’ve already read in the data set in Problem 1.

2. Calculate overall summary statistics

mean

mean(temperature$Temperature)

## [1] 51.58367

median

median(temperature$Temperature)

## [1] 50.55

MAD

We need to calculate deviation first, and then calculate the mean absolute deviation.

temp_deviation <- temperature$Temperature - mean(temperature$Temperature)
mean(abs(temp_deviation))

## [1] 12.43013

Standard deviation

sd(temperature$Temperature)

## [1] 14.12608

3. Summary Statistics by Month

We can use apply() function to help us calculate the monthly summary statistics.

First, we need to convert the time-series object to a matrix. We make each row as a year and each column as a month, and rename the rows and columns by function colnames() and rownames().

temp_mat <- matrix(temp.ts,nrow = 25,ncol = 12, byrow = T)
colnames(temp_mat) <- 1:12
rownames(temp_mat) <- 1991:2015
temp_mat

##         1    2    3    4    5    6    7    8    9   10   11   12
## 1991 29.9 40.9 42.8 47.8 58.2 66.6 70.5 69.2 61.7 52.1 36.8 35.3
## 1992 35.9 40.6 43.3 54.3 59.1 62.9 68.3 66.3 64.4 54.1 34.1 29.2
## 1993 28.3 30.6 42.4 47.6 57.5 64.5 69.5 67.3 58.8 48.7 35.6 35.4
## 1994 35.5 31.9 43.9 47.6 60.8 70.0 71.2 70.9 65.0 50.6 36.6 36.1
## 1995 34.5 38.3 42.1 45.1 50.9 62.4 70.5 74.0 60.4 50.5 45.0 36.3
## 1996 29.7 37.7 37.9 50.4 58.9 66.9 71.5 69.5 60.8 53.1 40.6 36.5
## 1997 31.3 32.8 45.5 42.8 57.4 66.5 71.4 68.7 64.0 52.7 37.9 33.9
## 1998 36.5 36.4 38.7 46.5 58.8 62.1 72.8 70.7 67.1 50.4 44.0 32.2
## 1999 36.4 42.1 46.0 44.5 55.6 64.8 73.5 69.3 58.5 51.9 48.0 36.9
## 2000 36.4 41.0 42.9 51.2 61.0 67.4 74.7 73.0 63.1 49.6 31.4 31.2
## 2001 33.0 32.3 40.8 50.7 58.4 68.9 75.0 71.8 65.0 53.8 43.9 35.0
## 2002 33.1 36.0 37.5 52.9 56.2 70.5 76.8 71.3 64.0 45.9 40.3 36.6
## 2003 40.2 31.6 43.7 50.6 57.4 62.8 75.9 72.9 60.5 57.4 39.0 36.4
## 2004 35.4 33.6 48.2 49.2 59.9 62.7 69.2 66.4 62.9 51.9 39.7 36.5
## 2005 35.5 37.9 42.0 48.4 57.6 65.4 75.1 69.7 66.5 52.6 45.0 33.3
## 2006 40.7 33.7 39.4 53.9 61.0 71.6 74.4 71.6 58.4 51.0 43.4 35.3
## 2007 27.2 34.6 47.6 47.9 58.0 67.7 74.8 73.7 64.5 55.2 44.9 30.2
## 2008 31.6 36.1 40.8 47.8 57.1 66.1 75.0 69.6 60.9 46.0 46.0 31.1
## 2009 38.2 39.7 44.3 47.3 59.3 63.0 69.6 69.6 63.1 44.5 43.8 26.7
## 2010 33.0 30.1 42.7 48.8 53.9 66.9 72.5 72.4 66.6 55.0 39.8 37.2
## 2011 33.1 32.0 45.2 48.9 53.7 67.6 73.5 75.1 63.4 52.9 42.3 32.2
## 2012 38.9 32.3 50.8 54.3 60.2 74.2 74.8 73.2 66.0 50.8 46.1 33.7
## 2013 33.0 32.1 40.5 43.8 57.7 69.9 72.2 72.2 65.1 47.8 43.2 31.5
## 2014 34.6 32.0 43.6 49.8 56.6 66.2 72.2 69.2 63.8 55.3 38.3 33.9
## 2015 36.5 36.6 46.1 50.1 52.4 68.3 70.3 70.5 69.4 56.2 40.8 33.1

To get the monthly summary statistics, we want to calculate each column’s mean, median, MAD and sd. So we can use apply() function. The first argument of apply() is a matrix or data.frame. The second argument “MARGIN” is to indicate whether we want to calculate each column’s or each row’s data. MARGIN = 1, means calculating by each row. MARGIN = 2, means calculating by each column.

Here, for we want monthly summary statistics, we will set MARGIN = 2. The third argument is the function you want to use, for example, mean(), median(), sd()

apply(temp_mat, 2, mean)

##      1      2      3      4      5      6      7      8      9     10     11 
## 34.336 35.316 43.148 48.888 57.504 66.636 72.608 70.724 63.356 51.600 41.060 
##     12 
## 33.828

apply(temp_mat, 2, median)

##    1    2    3    4    5    6    7    8    9   10   11   12 
## 34.6 34.6 42.9 48.8 57.7 66.6 72.5 70.7 63.8 51.9 40.8 33.9

apply(temp_mat, 2, sd)

##        1        2        3        4        5        6        7        8 
## 3.467453 3.649621 3.211843 3.080357 2.604080 3.131145 2.354060 2.308910 
##        9       10       11       12 
## 2.822983 3.268792 4.159928 2.739605

To get MAD, we need to create a new function by ourselves.

MAD_func <- function(vec){
  d_vec <- vec - mean(vec)
  MAD <- mean(abs(d_vec))
  return(MAD)
}

apply(temp_mat, 2, MAD_func)

##       1       2       3       4       5       6       7       8       9      10 
## 2.75968 3.16064 2.43392 2.37152 1.90752 2.42144 2.00032 1.87296 2.22528 2.54400 
##      11      12 
## 3.43040 2.21664

Problem 3

Suppose you have two forecasting models: Model 1 and Model 2. The prediction of Model 1 is:

prediction_1 <- c(21.6 , 21.4 , 21.9 , 22.8 , 19.4 , 17.8 , 13.8 , 24.5 , 22.5 , 18.1 , 15.5 , 16.5 , 16.9 , 15.7 , 9.7)

The prediction of Model 2 is:

prediction_2 <- c(20.2 , 21.3 , 23.3 , 20.9 , 21.4 , 21.1 , 13.5 , 23.2 , 20.8 , 20.9 , 18.9 , 16 , 18 , 15.6 , 11.8)

The actual values are:

actual <- c(21 , 21 , 22.8 , 21.4 , 18.7 , 18.1 , 14.3 , 24.4 , 22.8 , 19.2 , 17.8 , 16.4 , 17.3 , 15.2 , 10.4)

Please calculate the RMSE of Model 1 and Model 2
Which model is better? Why?

a. Calculate RMSE

prediction_1 <- c(21.6 , 21.4 , 21.9 , 22.8 , 19.4 , 17.8 , 13.8 , 24.5 , 22.5 , 18.1 , 15.5 , 16.5 , 16.9 , 15.7 , 9.7)
prediction_2 <- c(20.2 , 21.3 , 23.3 , 20.9 , 21.4 , 21.1 , 13.5 , 23.2 , 20.8 , 20.9 , 18.9 , 16 , 18 , 15.6 , 11.8)
actual <- c(21 , 21 , 22.8 , 21.4 , 18.7 , 18.1 , 14.3 , 24.4 , 22.8 , 19.2 , 17.8 , 16.4 , 17.3 , 15.2 , 10.4)

prediction_error_1 <- actual - prediction_1
prediction_error_2 <- actual - prediction_2


RMSE_1 <- sqrt(mean(prediction_error_1^2))
RMSE_2 <- sqrt(mean(prediction_error_2^2))

c(RMSE_1, RMSE_2)

## [1] 0.8805301 1.4252485

b. Which one is better?

Model 1 is better. Model 1 has a smaller RMSE, and a smaller RMSE means the prediction is more accurate.

Problem 4

We have a table of the average temperatures (in °F) on the day (January 20) of the president’s inauguration in Washington, DC (Inauguration.csv).

Summarize the data numerically (mean, median, MAD and SD) and graphically.
Create a 95 percent prediction interval for the inaugurations of President Obama in 2009 and in 2013. Please use the mean and standard deviation of all past years, and based on normal distribution to make the prediction interval.
The actual values were 28º and 40°. Does that come as a surprise, given the width of the prediction interval?

a. Summary

DC_temp <- read.csv("Inauguration.csv")
str(DC_temp)

## 'data.frame':    20 obs. of  2 variables:
##  $ Year       : int  1937 1941 1945 1949 1953 1957 1961 1965 1969 1973 ...
##  $ Temperature: int  33 29 35 38 49 44 22 38 35 42 ...

summary(DC_temp$Temperature)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     7.0    32.0    35.0    35.9    40.5    55.0

sd(DC_temp$Temperature)

## [1] 10.57753

DC_temp_ts <- ts(DC_temp$Temperature, start = 1937, frequency = 0.25)
autoplot(DC_temp_ts, main = "Average temperatures (in °F)", xlab = "Year", ylab = "°F")

b. Prediction Interval

Exclude the year of 2009 and 2013. Then calculate the mean and SD.

DC_temp_past <- DC_temp$Temperature[-c(19,20)]
mean(DC_temp_past)

## [1] 36.11111

sd(DC_temp_past)

## [1] 10.96995

To make a 95% percent prediction interval, we need the Z-score corresponding to 97.5%. 1 - (1 - 95%)/2 = 97.5%

qnorm(0.975)

## [1] 1.959964

mean(DC_temp_past) - 1.96 * sd(DC_temp_past)

## [1] 14.61

mean(DC_temp_past) + 1.96 * sd(DC_temp_past)

## [1] 57.61222

Module 1 Mini Cases

Jiantong Wang

Problem 1

Problem 2

Problem 3

Problem 4