We have started a biweekly series of challenges in this Study Circle. After considering the topics you have suggested in the comments, we are kicking off with Time Series.
Why this Data topic?
This morning, I read the Economist Espresso on India's pollution season, and I thought it was a good idea to start the series of challenges with this topic.
Getting the Data
I chose Delhi to be the city we will analyze in this challenge.
Executing the following lines of code will produce the DataFrame we'll work with:
```python
import pandas as pd

df = pd.read_csv('anand-vihar, delhi-air-quality.csv',
                 parse_dates=['date'], index_col=0)
df
```
I processed the raw data into a workable dataset in the following way:
```python
# Remove whitespace from the column names
df.columns = df.columns.str.strip()

# Extract the numeric part of each row (some rows were only whitespace)
# and convert it to numbers so it can be averaged
series = pd.to_numeric(df['pm25'].str.extract(r'(\w+)', expand=False),
                       errors='coerce')

# 30-day rolling average to smooth the daily readings
series_monthly = series.rolling(30).mean()

# Drop the missing values introduced by the rolling window
series_monthly = series_monthly.dropna()

# Fill remaining gaps by linear interpolation
series_monthly = series_monthly.interpolate(method='linear')

# Sort the index to later make a reasonable plot
series_monthly = series_monthly.sort_index()

# Aggregate the information by month
series_monthly = series_monthly.to_period('M').groupby(level='date').mean()

# Convert back to timestamps to avoid errors with statsmodels' functions
series_monthly = series_monthly.to_timestamp()

# Set an explicit monthly frequency (also for statsmodels),
# interpolating any months that are still missing
series_monthly = series_monthly.asfreq('MS').interpolate()

# Rename the pandas Series
series_monthly.name = 'air pollution pm25'
```
As we don't know the coding skills of the Study Circle members, we'll start with simple ARIMA models. From this point, we will iterate the procedure and improve the approach.
To take on the challenge and, hopefully, receive some feedback, you should fork this repository to your GitHub account. Otherwise, you can download this script.
The end goal is to develop an ARIMA model and plot the predictions against the actual data, resulting in a plot like this one.
Nevertheless, you can develop this challenge in any way you find attractive. The essential point of this Study Circle is the interaction between members to generate value and knowledge.
From your feedback, we could later work on different use cases. For example, we could later create a geospatial map in Python with the predictions.
So, let's get on and good luck!
Check out the following materials to learn how you could develop the challenge:

- Video Tutorial: How to develop ARIMA models to predict Stock Price

Start the challenge

You start with the following object:

```
date
2014-01-01    286.023457
2014-02-01    281.428205
...
2022-08-01    115.487097
2022-09-01    143.713333
Freq: MS, Name: air pollution pm25, Length: 105, dtype: float64
```
Looking at the numbers is not the same as seeing them in a chart:
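A quick line plot makes the pattern visible. Here is a minimal sketch; since the real `series_monthly` depends on the downloaded CSV, it uses a simulated monthly series with the same index and name as a stand-in:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt

# Simulated stand-in for series_monthly: a yearly cycle plus noise
rng = np.random.default_rng(1)
idx = pd.date_range("2014-01-01", periods=105, freq="MS")
series_sim = pd.Series(
    150 + 80 * np.sin(2 * np.pi * idx.month / 12) + rng.normal(0, 10, len(idx)),
    index=idx,
    name="air pollution pm25",
)

# A simple line plot of the monthly values
ax = series_sim.plot(figsize=(8, 4), title="Monthly PM2.5 (simulated)")
ax.set_ylabel("pm25")
plt.tight_layout()
plt.savefig("pm25_monthly.png")
```

With the real data, replace `series_sim` with `series_monthly` and the same two plotting lines work unchanged.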
We aim to compute a mathematical equation that we will later use to calculate predictions, as we can see in the following chart:
There are many types of mathematical equations; the one we will use is ARIMA. Don't worry about the maths: there is a Python function that does it all for us.
```python
from statsmodels.tsa.arima.model import ARIMA
```
The parameters of this class ask for two objects: the data (`endog`) and the `order` tuple `(p, d, q)`:

- endog: the data
- p: the first significant lag in the Partial Autocorrelation Plot
- d: the differencing needed to make our data stationary
- q: the first significant lag in the Autocorrelation Plot
d | Differencing to make the data stationary
The first thing we need to check about our data is stationarity. We use the Augmented Dickey-Fuller test, intending to reject the null hypothesis that the data is non-stationary. If we can't reject it, we need to difference the time series and set `d=1` in the `order=(p, d, q)` parameter.
```python
from statsmodels.tsa.stattools import adfuller

result = adfuller(series_monthly)
```
The p-value is given by the second element of the tuple the function returns.
The p-value is greater than 0.05. Therefore, we can't reject the null hypothesis.
Are we done here?
- No, we can difference the Time Series by one lag and test again:
```python
series_monthly_diff_1 = series_monthly.diff().dropna()

result = adfuller(series_monthly_diff_1)
result
```
We can reject the null hypothesis and say that our data is stationary after differencing once. Therefore, we need to set `d=1` in the `order` parameter of the ARIMA model.
q | Autocorrelation Plot
Now we need to determine `q` based on the first significant lag of the autocorrelation plot:
```python
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf

plot_acf(series_monthly_diff_1, lags=50)
plt.xlabel('Lag');
```
The first significant lag is at 2, which means that our differenced (monthly) data is correlated every two months. We set `q=2`.
p | Partial Autocorrelation Plot
We follow the same procedure to choose a number for `p`, but this time we use another type of plot: the Partial Autocorrelation.
```python
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_pacf

plot_pacf(series_monthly_diff_1, lags=50, method='ywm')
plt.xlabel('Lag');
```
We see the first significant lag at 2. Therefore, we set `p=2`.
We already know which numbers to set in the `order` parameter: `order=(2, 1, 2)`. So, let's fit the mathematical equation of the model.
```python
model = ARIMA(series_monthly, order=(2, 1, 2))
result = model.fit()
result.summary()
```
And calculate the predictions:
```python
import matplotlib.pyplot as plt

plt.figure(figsize=(6, 4))
series_monthly.plot(label='Actual Data')
result.predict().plot(label='Predicted Data')
plt.legend()
plt.xticks(rotation=45);
```