What is correlation analysis?
The concept of correlation is the same as in non-time-series data: identify and quantify the relationship between two variables. Because time series data is continuous and chronologically ordered, the observations of a series are likely to be correlated with each other to some degree. In the context of time series analysis, measuring and analyzing correlation can be approached from two different aspects: lag analysis, which looks at the correlation between a series and its own past values, and causality analysis, which looks at the correlation between one series and the lags of another.
Looking at these characteristics can be very useful for finding new features to use in the modeling step, and for understanding patterns of behavior over time. Finding a way to do this in R can be overwhelming because of the vast number of packages, each of which uses a different kind of object. The objective of this article is to walk you through three different ways of doing correlation analysis; the last one is a general (tidy) way, and it is the one I prefer.
A short disclaimer before we start: this article is not meant to explain all the theory behind correlation analysis. For a more complete explanation, you can consult this great reference: Forecasting: Principles and Practice.
The Different Approaches
Let’s set up our environment:
Data
We’ll use a dataset from Stack Overflow that has the number of questions asked each month from 2009 to 2019, across different topics. You can access the data using this link. The dataset contains 9 different features regarding keywords used in Stack Overflow questions, but here we’ll use just two of them: r and python.
I’ll apply one basic transformation to this dataset that will help us throughout the tutorial, in all three methods.
Now we can start!
First Approach
The first method that I want to show you uses the
Now you can see all information
about the time series that was created with the
With this summary, we can see that we have two variables representing two different time series. Each time series represents one feature (r and python). Let’s visualize the time series.
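For readers who want to follow along without the original file, a monthly two-series object of this shape can be sketched in base R. The values below are synthetic random walks, not the Stack Overflow counts, and building the object with `ts()` and `frequency = 12` is an assumption about how such an object might be created, not necessarily what the article's own code does. The name `so_ts` is hypothetical.

```r
# Synthetic stand-in data: NOT the real Stack Overflow counts.
set.seed(123)
n_months <- 132  # Jan 2009 to Dec 2019

# Two upward-drifting random walks play the role of the r and python series.
questions <- cbind(
  r      = cumsum(rnorm(n_months, mean = 30, sd = 40)),
  python = cumsum(rnorm(n_months, mean = 120, sd = 90))
)

# A multivariate monthly ts object: one column per series.
so_ts <- ts(questions, start = c(2009, 1), frequency = 12)

print(frequency(so_ts))  # observations per year
print(start(so_ts))      # first period as c(year, month)
print(end(so_ts))        # last period as c(year, month)
print(colnames(so_ts))   # one name per series

# Draw one panel per series when a graphics device is available.
if (interactive()) plot(so_ts, main = "Synthetic monthly series")
```

Keeping both series in a single object means they share one time index, which is convenient for the lag and cross-correlation steps that follow.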
With this graph, we see that python has a more pronounced trend than R in the number of questions asked on the Stack Overflow platform. The time series don’t have a seasonal pattern, but it’s possible to see some cyclicality in both. Let’s look at the correlation plots!
Lags Analysis I
The Obviously, you can go through the data and try to plot in the
For R time series:
For Python time series:
In all graphics we have similar characteristics that need explanation:
Note that in the ACF plots, for both the R and Python time series, the correlation is greatest at the most recent lags and fades as the lag increases. At more distant lags, the correlation with recent data becomes negative.
The partial autocorrelation is a little different. A “partial” correlation between two variables is the amount of correlation between them that is not explained by their mutual correlations with a specified set of other variables. For example, if we are regressing a variable Y on other variables X1, X2, and X3, the partial correlation between Y and X3 is the amount of correlation between Y and X3 that is not explained by their common correlations with X1 and X2. To summarize, when we look at the PACF plot, we want to find the lags that carry relevant information to use as predictors in a future forecast. The larger the PACF score in absolute value, the better. In our plots, both time series have one lag (close to lag 1) that may be useful.
Just knowing which lag has the highest correlation score is not enough; it’s important to see visually how these points are distributed. In that,
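The ideas in this section can be sketched with base R alone. The snippet below is a minimal illustration on a simulated AR(1) series (synthetic data, not the Stack Overflow counts), using `acf()`, `pacf()` and `lag.plot()` from the `stats` package; it is not the code the article originally used, and the variable names are hypothetical.

```r
# Synthetic AR(1) series: each value depends on the previous one.
set.seed(42)
x <- ts(as.numeric(arima.sim(model = list(ar = 0.8), n = 132)),
        start = c(2009, 1), frequency = 12)

# plot = FALSE returns the numbers instead of drawing the correlogram.
acf_out  <- acf(x, lag.max = 12, plot = FALSE)
pacf_out <- pacf(x, lag.max = 12, plot = FALSE)

# acf() starts at lag 0 (always 1); pacf() starts at lag 1.
# For a monthly ts, the $lag component is expressed in years (lag / 12).
acf_vals  <- as.numeric(acf_out$acf)
pacf_vals <- as.numeric(pacf_out$acf)

# Approximate 95% bound under a white-noise assumption, the same
# quantity the default correlogram draws as dashed lines.
bound <- qnorm(0.975) / sqrt(length(x))

# Lags (1, 2, ...) whose partial autocorrelation exceeds the bound.
useful_lags <- which(abs(pacf_vals) > bound)
print(round(acf_vals[1:4], 2))
print(useful_lags)

# Scatter plots of the series against its own lags 1 to 4.
if (interactive()) lag.plot(x, lags = 4)
```

The lag scatter plots complement the scores: a tight diagonal cloud at a given lag suggests a roughly linear relationship, while a shapeless cloud suggests the score at that lag carries little usable information.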
Causality Analysis I
Just looking at the past pattern within the series is not always a good idea. The main pitfall of this method is that it will fail whenever the changes in the series derive from exogenous factors. The goal of causality analysis, in the context of time series analysis, is to identify whether a causal relationship exists between the series we wish to forecast and other potential exogenous factors. Keep in mind that correlation doesn’t imply causation; we are just looking for something that could help the forecast model. In our current case, we can only check whether some lag of the R time series is correlated with the Python time series.
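As a rough illustration of this kind of check, base R's `ccf()` computes the cross-correlation between two series at a range of lags. The sketch below uses two simulated series in which one trails the other by two steps; the data and the names `leader` and `follower` are hypothetical, and this is not the function call the article originally showed.

```r
set.seed(1)
n <- 132

# Hypothetical "leader" series, and a "follower" that copies it two
# steps later with some noise added.
leader   <- as.numeric(arima.sim(model = list(ar = 0.5), n = n))
follower <- c(rnorm(2), leader[1:(n - 2)]) + rnorm(n, sd = 0.3)

# ccf(x, y) at lag k estimates the correlation between x[t + k] and y[t].
cc <- ccf(follower, leader, lag.max = 12, plot = FALSE)

# Collect lags and scores in a data frame and pick the strongest one.
cc_table <- data.frame(lag = as.numeric(cc$lag), ccf = as.numeric(cc$acf))
best <- cc_table[which.max(abs(cc_table$ccf)), ]
print(best)
```

The sign of the lag depends on the argument order, so it is worth checking the convention in the comment above before reading a plot as "x leads y" or the reverse.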
We see that the highest scores are at the most recent lags, and that some scores at lags between 5 and 10 are negative. The plot tells us that currently (as of the date of the dataset) the growth of R questions is highly correlated with that of Python questions.
Second Approach
This second approach uses what is called a tsibble, which is like a tibble with a time index. So, let’s convert our tibble to a tsibble.
Unlike before, here the data was converted to a long format, which is better for plotting. Looking at the object created, we see the interval of the observations, [1M] (monthly). We also see the key variable, which indicates how many time series exist in the object. It’s possible to combine different features to define new time series in the object.
With the objects ready, we can now perform the correlation analysis using this second approach. Note that the plots were already interpreted above, so here I’ll just explain the differences between the methods.
Lag Analysis II
Unlike A problem with these functions is that they may behave differently than expected. Passing the data to the acf and pacf functions, we automatically get faceted plots for all the time series in the object. But if we simply try to visualize the series over time, faceting is not possible. It’s not a big problem, and you can plot using ggplot as usual (see the code).
Causality Analysis II
Third Approach (Tidy Approach)
Until now we have changed our original object several times, but using the
Lag Analysis III
You basically use two functions to perform all the correlation analysis, with much less code and more flexibility. The only thing you need to do is put the dataset in a long format.
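The package and the two functions the author refers to did not survive in this text, so the snippet below does not reproduce them. It is only a base R sketch of the underlying idea: reshape the data to a long format (one row per month and topic), then apply the same lag computation to each series. The data is synthetic and the helper `acf_by_lag()` is hypothetical.

```r
set.seed(7)
# Synthetic wide data: one column per series (not the real counts).
wide <- data.frame(
  month  = seq(as.Date("2009-01-01"), by = "month", length.out = 132),
  r      = cumsum(rnorm(132)),
  python = cumsum(rnorm(132))
)

# Wide -> long: one row per (month, topic) pair.
long <- data.frame(
  month     = rep(wide$month, times = 2),
  topic     = rep(c("r", "python"), each = nrow(wide)),
  questions = c(wide$r, wide$python)
)

# Hypothetical helper: autocorrelations of one series as a data frame.
acf_by_lag <- function(values, lag_max = 12) {
  out <- acf(values, lag.max = lag_max, plot = FALSE)
  data.frame(lag = 0:lag_max, acf = as.numeric(out$acf))
}

# Apply the helper to each topic (in date order) and stack the results.
pieces <- lapply(split(long, long$topic), function(d) {
  cbind(topic = d$topic[1], acf_by_lag(d$questions[order(d$month)]))
})
acf_long <- do.call(rbind, pieces)
rownames(acf_long) <- NULL
print(head(acf_long))
```

Because the result is itself in long format, adding a third or fourth topic needs no change to the computation, which is the flexibility the long layout buys.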
Causality Analysis III
In the real world, the time series you want to compare will probably not cover the same range as they do here, and you will probably need to transform them so that they share the same interval before calculating the ccf score.
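Reading "same range" as "same time window", one way to do this alignment in base R is `ts.intersect()`, which keeps only the periods present in every series. The sketch below uses two simulated monthly series with different start and end dates; the data and variable names are hypothetical, and this is not the transformation the article originally showed.

```r
set.seed(99)
# Two synthetic monthly series covering different periods.
a <- ts(as.numeric(arima.sim(model = list(ar = 0.5), n = 132)),
        start = c(2009, 1), frequency = 12)  # Jan 2009 to Dec 2019
b <- ts(as.numeric(arima.sim(model = list(ar = 0.5), n = 120)),
        start = c(2012, 1), frequency = 12)  # Jan 2012 to Dec 2021

# Keep only the months present in both series.
both <- ts.intersect(a, b)
print(start(both))
print(end(both))

# Cross-correlation on the shared window; columns picked by position.
cc <- ccf(both[, 1], both[, 2], lag.max = 12, plot = FALSE)
print(length(cc$acf))
```

If the series also differ in sampling frequency (say, daily against monthly), they would first need to be aggregated to a common frequency, which this sketch does not cover.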
So, the main takeaways of this tutorial are that the
I hope you find something useful in this tutorial. If you want to contact me, check the links:
=]