One of the early steps in the data science development cycle is to understand and explore the data for the problem you’re solving. EDA is a crucial step for a better data science workflow, and Pandas Profiling has been my preferred choice to get it done quickly, with a single line of code, while providing the outputs I need to better understand the data and uncover meaningful insights.
You have probably been using Pandas Profiling (now called YData Profiling) for structured tabular data, which is commonly the first type of data we learn to explore. We all know the Iris dataset, right? However, in real-world applications there’s another type of data that we find in our day to day: from traffic, to our daily trajectories, to our electricity and water consumption, all of them have one thing in common, temporal dependency.
Time-series or sequential data have become one of the most valuable commodities in a world that is more and more data-driven, which makes performing EDA on time-series data, and mining it, a much needed skill for data science practitioners.
Due to the nature of time-series data, the type of analysis needed when exploring the dataset is different from the one used when all records are considered independent. The complexity of the analysis grows when more than one entity is present within the same dataset.
In this blog post I’ll explore some key steps in the analysis of a dataset while leveraging the time-series features of pandas-profiling. The dataset explored refers to the Air Quality in the USA and can be downloaded from the EPA website.
The full code and examples can be found in the GitHub repository, so you can follow along with the tutorial.
Analyzing multiple entities in a time-series dataset
The data description says it’s the air quality data collected at outdoor monitors across the United States, Puerto Rico, and the U.S. Virgin Islands. With that information, we understand this is a multivariate time-series dataset with several entities that we will need to take into consideration.
Knowing this, I have some follow-up questions: How many locations are available for the pollutant measurements? Do all the sensors collect the same amount of data over the same timespan? How are the collected measurements distributed in time and location?
Some of these questions can be easily answered with a heatmap comparing all the measurements and locations against time, as depicted by the code snippet and image below:
The diagram above showcases the data points for each entity over time. We see that not all stations started collecting data at the same time and, based on the intensity of the heatmap, we can see that some stations have more data points than others for a given time period.
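A minimal sketch of how the matrix behind such a heatmap could be built with plain pandas. The column names "Site Num" and "Date Local" and the toy values are assumptions made for illustration, and this is not necessarily how the library draws its own heatmap.

```python
import pandas as pd

def entity_time_counts(df, entity_col, time_col, freq="M"):
    """Count records per entity (rows) and per time bin (columns)."""
    periods = pd.to_datetime(df[time_col]).dt.to_period(freq)
    return pd.crosstab(df[entity_col], periods)

# Toy records standing in for the air quality data (hypothetical values).
toy = pd.DataFrame({
    "Site Num": [1, 1, 1, 2, 2, 3],
    "Date Local": ["2020-01-05", "2020-01-20", "2020-02-11",
                   "2020-02-03", "2020-03-09", "2020-03-15"],
})

counts = entity_time_counts(toy, "Site Num", "Date Local")
print(counts)
```

Each cell holds the number of records an entity has in a given month, so zeros reveal stations that started late or stopped reporting. The resulting table can be handed to any plotting library that renders a matrix as a heatmap.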
This means when modeling the time series, having dynamic timestamps for the training and test datasets might be better than having pre-determined timestamps. We also will have to further investigate the missing records and the scope for imputing records.
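One possible reading of "dynamic timestamps" is to place the train/test cut relative to each entity's own timeline instead of using a single global date. The sketch below assumes that reading; the function name, the column names, and the 80% fraction are hypothetical choices.

```python
import pandas as pd

def split_per_entity(df, entity_col, time_col, train_frac=0.8):
    """Split each entity's records chronologically, relative to its own timeline."""
    ordered = df.sort_values([entity_col, time_col])
    grouped = ordered.groupby(entity_col)
    position = grouped.cumcount()               # 0, 1, 2, ... within each entity
    size = grouped[time_col].transform("size")  # record count of that entity
    is_train = position < size * train_frac
    return ordered[is_train], ordered[~is_train]

# Two hypothetical stations that start at different dates and differ in length.
toy = pd.DataFrame({
    "Site Num": ["A"] * 5 + ["B"] * 10,
    "Date Local": list(pd.date_range("2020-01-01", periods=5, freq="D"))
                  + list(pd.date_range("2021-06-01", periods=10, freq="D")),
})

train, test = split_per_entity(toy, "Site Num", "Date Local")
print(train["Site Num"].value_counts())
print(test["Site Num"].value_counts())
```

With this approach a station that started reporting late still contributes its earliest records to training and its latest ones to testing.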
With that basic understanding of what the time distribution of our entities looks like, we can start to deep-dive into the data profiling for more insights. Since there are multiple time series, let’s have a look at the behavior of each entity.
The support for time series can be enabled by passing the parameter tsmode=True, and the library will automatically identify the presence of features with autocorrelation (more on this later). For the analysis to work properly, the dataframe needs to be sorted by entity columns and time; otherwise, you can always leverage the sortby parameter.
Here’s what the output report looks like in time-series mode:
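A hedged sketch of these two steps, assuming the current package name ydata_profiling and the column names "Site Num", "Date Local", and "NO2 Mean". The helper names are hypothetical, and the profiling import is kept inside the function so the sorting part runs even without the library installed.

```python
import pandas as pd

def sort_for_profiling(df, entity_col, time_col):
    """Sort by entity, then time, so each entity's records are contiguous and ordered."""
    out = df.copy()
    out[time_col] = pd.to_datetime(out[time_col])
    return out.sort_values([entity_col, time_col]).reset_index(drop=True)

def profile_site(df, site, entity_col="Site Num", time_col="Date Local"):
    """Build a time-series report for a single entity (needs ydata-profiling)."""
    from ydata_profiling import ProfileReport  # optional dependency, imported lazily
    site_df = df[df[entity_col] == site]
    return ProfileReport(site_df, tsmode=True, sortby=time_col, title=f"Site {site}")

# Toy records in scrambled order (hypothetical values).
toy = pd.DataFrame({
    "Site Num": [2, 1, 2, 1],
    "Date Local": ["2020-01-02", "2020-01-02", "2020-01-01", "2020-01-01"],
    "NO2 Mean": [4.0, 3.0, 2.0, 1.0],
})

sorted_toy = sort_for_profiling(toy, "Site Num", "Date Local")
print(sorted_toy)
```

With the library installed, something like profile_site(sorted_toy, 1).to_file("site_1.html") would write the report for one station to disk.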
Seasonal and Non-stationary alerts
Specific to time-series analysis, we can spot 2 new warnings: NON_STATIONARY and SEASONAL. The easiest way to get a quick grasp of your time series is to look at the warnings section. For this particular use case, each profile report depicts the behavior of one USA location with respect to its pollutant measurements.
Here’s how the warnings look in our report:
A time series is said to be stationary when its statistical properties (such as mean and variance) do not change over the time at which the series is observed. Conversely, a time series is non-stationary when its statistical properties depend on time. For instance, time series with trends and seasonality (more on this later) are not stationary, because these phenomena affect the value of the time series at different times.
Stationary processes are comparatively easier to analyze, as there is a static relationship between time and the variables. In fact, stationarity has become a common assumption for most time-series analysis.
While there are models for non-stationary time series, most ML algorithms expect a static relationship between the input features and the output. When the time series is not stationary, the accuracy of a model fitted to the data will vary at different points in time. This means the modeling choices are affected by the stationary or non-stationary nature of the time series, and different data-preparation steps apply when you want to convert the time series into a stationary one.
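To make the definition concrete, here is a rough illustration that compares the mean and variance of the first and second half of a series. It is only a sketch of the idea on synthetic data, not the statistical test behind the library's alert, and the helper name is hypothetical.

```python
import numpy as np
import pandas as pd

def half_stats(series):
    """Mean and variance of the first and second half of a series."""
    mid = len(series) // 2
    first, second = series.iloc[:mid], series.iloc[mid:]
    return pd.DataFrame(
        {"mean": [first.mean(), second.mean()],
         "var": [first.var(), second.var()]},
        index=["first_half", "second_half"],
    )

trend = pd.Series(np.arange(100, dtype=float))  # steadily increasing values
flat = pd.Series(np.tile([1.0, -1.0], 50))      # oscillates around zero

trend_stats = half_stats(trend)
flat_stats = half_stats(flat)
print(trend_stats)
print(flat_stats)
```

The trending series has a mean that shifts from one half to the other, while the oscillating series keeps the same mean and variance in both halves, which is what stationarity asks for.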
So this alert will help you identify such columns and pre-process the time series accordingly.
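One common pre-processing step, swapped in here as an example since the text does not name a specific one, is first-order differencing: replacing each value with its change from the previous step. The synthetic series below is an assumption chosen so the effect is easy to see.

```python
import numpy as np
import pandas as pd

t = np.arange(50)
trending = pd.Series(2.0 * t + 5.0)     # linear trend, so the mean depends on time

# First-order differencing: value at t minus value at t-1.
differenced = trending.diff().dropna()  # the first entry has no predecessor

print(trending.head().tolist())
print(differenced.head().tolist())
```

The differenced series is constant, so the trend is gone. On real data the result is noisier, and whether one round of differencing is enough has to be checked again.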
Seasonality in time series is a scenario in which the data experiences regular and predictable changes that recur over a defined cycle. This seasonality may obscure the signal that we wish to model and, even worse, it may provide a strong signal of its own to the models. This alert can help you identify such columns and prompt you to address the seasonality.
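As a small illustration on synthetic data, a repeating cycle shows up as a high correlation between the series and a copy of itself shifted by one period, and seasonal differencing (subtracting the value one period earlier) is one way to remove it. The period of 4 and the pattern values are assumptions made for the example.

```python
import numpy as np
import pandas as pd

period = 4
pattern = pd.Series(np.tile([0.0, 3.0, 1.0, 2.0], 25))  # repeats every 4 steps
series = pattern + 0.5 * np.arange(100)                 # same cycle plus a trend

# Correlation of the cycle with itself shifted by one full period.
seasonal_lag_corr = pattern.autocorr(lag=period)

# Seasonal differencing: value at t minus value at t - period.
deseasonalized = series.diff(period).dropna()

print(seasonal_lag_corr)
print(deseasonalized.unique())
```

After seasonal differencing only the constant increment of the trend over one period remains, so the repeating pattern no longer dominates the series.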
More information on the time-dependent features
The first difference you will notice is that the line plot will replace the histogram for the column that was identified as time-dependent. Using the line plot, we can better understand the trajectory and the nature of the selected column. For this NO2 mean line plot, we see a downward trend in the trajectory, with continuous seasonal variations, with the maximum value recorded in the initial stages of the series.
Next, when we toggle for more details of the column (as shown in the figure above), we’ll see a new tab with autocorrelation and partial-autocorrelation plots.
For time series, the autocorrelation shows how the present value of a series relates to its previous values. Partial autocorrelation is the autocorrelation of a time series at a given lag after removing the effect of the shorter lags. This means these plots are crucial for gauging the autoregressive order of the series under analysis, as well as its moving-average order.
The ACF and PACF plots above are a bit ambiguous, as expected. Looking through our warnings, we can see that NO2 mean is a non-stationary time variable, which makes these plots hard to interpret. Nevertheless, the ACF plot is useful to confirm what we already suspected, that NO2 mean is non-stationary: the ACF values decrease very slowly instead of dropping quickly to zero, as would be expected for a stationary series.
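The slow decay can be reproduced on synthetic data with a small sample-autocorrelation function written in NumPy. This is a sketch for intuition only: the function name and the two toy series are assumptions, and it does not cover the partial autocorrelation.

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelation for lags 0..max_lag, using the global mean."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    denom = np.sum(centered ** 2)
    return np.array([
        np.sum(centered[: len(x) - k] * centered[k:]) / denom
        for k in range(max_lag + 1)
    ])

rng = np.random.default_rng(0)
noise = rng.normal(size=200)          # stationary: independent draws
trend = np.arange(200, dtype=float)   # non-stationary: linear trend

acf_noise = sample_acf(noise, 10)
acf_trend = sample_acf(trend, 10)
print(np.round(acf_noise, 2))
print(np.round(acf_trend, 2))
```

For the noise series the values fall close to zero right after lag 0, while for the trending series they shrink only gradually, which matches the behavior described for NO2 mean.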
The information gathered from the data profiling, the nature of the time series, and alerts such as non-stationarity and seasonality give you a head start in understanding the time-series data you have at hand. This doesn’t mean you’re done with the exploratory data analysis: the goal is to use these insights as a starting point and work on further in-depth data analysis and data preparation steps.
From profiling the air quality dataset, we see several columns that are constant, which may not add much value when modeled. From the missing values chart, we see that the SO2 and CO2 air quality indexes have missing data, so we should further explore the impact of this and the scope for imputation or for dropping these columns altogether. Several columns were found with non-stationary and seasonality alerts. The next steps would be either to make them stationary or to ensure that the models we’ll be using can handle non-stationary data.
You get the idea: as data scientists, it’s important to use profiling tools to quickly get an overall view of the data (in our case, time series), and then to inspect further and make informed decisions in the data pre-processing and modeling stages.
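These follow-up checks can also be run directly in pandas. The sketch below flags constant columns and measures the share of missing values per column; the helper name, column names, and toy values are hypothetical.

```python
import numpy as np
import pandas as pd

def column_summary(df):
    """Share of missing values and a constant-column flag for every column."""
    return pd.DataFrame({
        "missing_frac": df.isna().mean(),
        "is_constant": df.nunique(dropna=True) <= 1,
    })

# Toy slice for a single station (hypothetical values).
toy = pd.DataFrame({
    "State": ["X", "X", "X", "X"],
    "NO2 Mean": [10.0, 12.0, 11.0, 9.0],
    "SO2 AQI": [np.nan, 3.0, np.nan, 4.0],
})

summary = column_summary(toy)
print(summary)
```

Columns flagged as constant are candidates for dropping, while the missing fraction helps decide between imputing a column and discarding it.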
Conclusions
The motto of Pandas Profiling has always been the same: “Read the data? Pause. Generate the Pandas Profiling report, and inspect the data. Now start cleaning and re-iterate on exploring the data.”
Though structured tabular data remains the most common kind of data when taking the first steps in data science, time-series data is widely used and is core to the development of many business and advanced data-driven solutions. Because records in a time series depend on time and influence future occurrences, data scientists look for different kinds of insights during the exploratory data analysis phase.
Thus, it was a matter of time before the Pandas Profiling library incorporated a time-series analysis mode to uncover these insights. In this article we covered the changes required from the user to obtain the time-series-specific profiling report, the new alerts that flag concerns in the data, and the line plots and correlation graphs that are specific to time-series analysis.
But the metrics and analyses explored today are only the beginning! More questions remain to be answered. And for you, what is your usual approach when analyzing time-series data? What do you miss the most when working with sequential datasets?