Predict Electricity Consumption using Time Series analysis

Time series models forecast future events based on previous events observed (and data collected) at regular time intervals. We will take a small forecasting problem and work through it end to end, learning time series forecasting along the way. Time series forecasting is a technique for predicting events through a sequence of time; it is used across many fields of study, from geology to behavioral science to economics. These techniques predict future events by analyzing past trends, on the assumption that future trends will resemble historical trends.

Problem Statement

Predict electricity consumption using time series analysis with an ARIMA model.



Stages in Time Series Forecasting:

Solving a time series problem is a little different from a regular modeling task. A basic journey through a time series problem can be broken into the following stages. We will cover the tasks to perform at each stage, along with the Python implementation of each one.


Step 1: Visualizing the time series

Step 2: Stationarising the time series

Step 3: Finding the best parameters for our model

Step 4: Fitting the model

Once we have our optimal model parameters, we can fit an ARIMA model to learn the pattern of the series. Always remember that time series algorithms work on stationary data only; making a series stationary is therefore an essential step.

Step 5: Making predictions

After fitting the model, we will predict future values in this stage. Now that we are familiar with the basic flow of solving a time series problem, let us get to the implementation.

Scatter plot of time series data points

We can also visualize the data in our series through a distribution.
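The plotting code itself is not shown in this article, so here is a minimal sketch. It assumes the consumption data is available as a monthly pandas Series; since the original data file is not included, a synthetic stand-in series with trend and seasonality is generated in its place.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, safe for scripts
import matplotlib.pyplot as plt

# Hypothetical stand-in for the electricity-consumption series;
# substitute your own data, e.g. via pd.read_csv(..., parse_dates=True).
rng = np.random.default_rng(0)
idx = pd.date_range("1985-01", periods=120, freq="MS")
trend = np.linspace(50, 80, 120)
seasonal = 10 * np.sin(2 * np.pi * (np.arange(120) % 12) / 12)
series = pd.Series(trend + seasonal + rng.normal(0, 2, size=120), index=idx)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 3))
ax1.plot(series.index, series.values)   # the series over time
ax1.set_title("Electricity consumption")
ax2.hist(series.values, bins=20)        # the distribution of its values
ax2.set_title("Distribution")
fig.tight_layout()
fig.savefig("consumption.png")
```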

Stationarising the time series

First, we need to check if a series is stationary or not.

ADF (Augmented Dickey-Fuller) Test

The Dickey-Fuller test is one of the most popular statistical tests. It determines the presence of a unit root in the series, and hence helps us understand whether the series is stationary. The null and alternate hypotheses of this test are:

Null Hypothesis: The series has a unit root (the autoregressive coefficient a = 1).

Alternate Hypothesis: The series has no unit root.

If we fail to reject the null hypothesis, we treat the series as non-stationary. A non-stationary series may be trend stationary or difference stationary (we will understand more about difference stationarity in the next section).


Results of Dickey-Fuller test

We see that the p-value is greater than 0.05, so we cannot reject the null hypothesis. Also, the test statistic is greater than the critical values. So, the data is non-stationary.

To get a stationary series, we need to eliminate the trend and seasonality from the series.

After finding the mean, we take the difference between the series and the mean at every point in the series.
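As a sketch of this step, using a 12-month rolling mean (the window size is an assumption tied to monthly data, and the linear series here is a stand-in for the trending consumption data):

```python
import numpy as np
import pandas as pd

# Stand-in trending series (purely linear, for illustration).
idx = pd.date_range("1985-01", periods=120, freq="MS")
series = pd.Series(np.linspace(50, 80, 120), index=idx)

rolling_mean = series.rolling(window=12).mean()   # 12-month moving average
detrended = (series - rolling_mean).dropna()      # series minus its mean
```

For a purely linear trend, the detrended result is constant, confirming the trend has been removed.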

From the above graph, we observe that the series has attained stationarity. We also see that the test statistic and the critical values are now roughly equal.

There can be cases when there is a high seasonality in the data. In those cases, just removing the trend will not help much. We need to also take care of the seasonality in the series. One such method for this task is differencing.

Differencing is a method of transforming a time series dataset.

It can be used to remove the series dependence on time, so-called temporal dependence. This includes structures like trends and seasonality. Differencing can help stabilize the mean of the time series by removing changes in the level of a time series, thus eliminating (or reducing) trend and seasonality.

Differencing is performed by subtracting the previous observation from the current observation.
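On a toy series, first-order differencing looks like this; a larger lag such as `diff(12)` would similarly remove a 12-month seasonal pattern from monthly data.

```python
import pandas as pd

s = pd.Series([10, 12, 15, 14, 18])
diff1 = s.diff().dropna()        # s[t] - s[t-1] at every point
print(diff1.tolist())            # [2.0, 3.0, -1.0, 4.0]
# Seasonal differencing uses a larger lag, e.g. s.diff(12) for monthly data.
```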

Perform the Dickey-Fuller (ADF) test once again on the differenced series.

An ARIMA model takes three parameters: p (the autoregressive order), d (the differencing order, which we already know from the steps above), and q (the moving-average order). The values of p and q come from the ACF and PACF plots, so let us understand both!

Autocorrelation Function(ACF)

Statistical correlation summarizes the strength of the relationship between two variables. Pearson’s correlation coefficient is a number between -1 and 1, where -1 indicates a perfect negative correlation and 1 a perfect positive correlation. A value of zero indicates no correlation.

We can calculate the correlation for time series observations with previous time steps, called lags. Because the correlation of the time series observations is calculated with values of the same series at previous times, this is called a serial correlation, or an autocorrelation.

A plot of the autocorrelation of a time series by lag is called the AutoCorrelation Function, or the acronym ACF. This plot is sometimes called a correlogram or an autocorrelation plot.

Partial Autocorrelation Function(PACF)

A partial autocorrelation is a summary of the relationship between an observation in a time series and observations at prior time steps with the relationships of intervening observations removed.

The partial autocorrelation at lag k is the correlation that results after removing the effect of any correlations due to the terms at shorter lags.

The autocorrelation between an observation and an observation at a prior time step comprises both direct and indirect correlations. It is these indirect correlations that the partial autocorrelation function seeks to remove.

The code below plots both the ACF and PACF for us:

Fitting model

To find the p and q values from the above graphs, we check where each plot crosses zero (drops inside the confidence band) for the first time. In our graphs, both cut-offs occur close to lag 3 (draw a line down to the x-axis), so p and q are approximately 3. We now have the p, d, and q values to substitute into the ARIMA model and examine the output.

The lower the RSS (residual sum of squares), the better the model fits. Try different orders, such as (2,1,0) and (3,1,1), and look for the smallest RSS.

Training Dataset:

The following code helps us forecast electricity consumption for the next 6 years.

From the above graph, we obtained future predictions up to 2024. The greyed-out area is the confidence interval: the range within which the actual values are expected to fall at the chosen confidence level, not a hard boundary that they cannot cross.

Finally, we were able to build an ARIMA model and actually forecast for a future time period.