A machine learning model is trained using a dataset that includes input variables (features) and corresponding output variables (labels). The model learns to map the features to the labels, and the goal of training is to find the best set of parameters for this mapping.
In many applications, the features consist of descriptive information about a user, a transaction, a login, and so on. Most of these scenarios include timestamp information: the time of the event, the day of the week, or the day of the month. If the goal is to predict an event based on past events, timestamps can be used as features. For example, the time of day, day of the week, or month of the year could be fed to a model that predicts traffic volume or energy consumption.
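As a quick illustration (the timestamps below are made up, not taken from the dataset used later), pandas exposes these scalar time features through the `dt` accessor:

```python
import pandas as pd

# Hypothetical event timestamps for illustration
events = pd.Series(pd.to_datetime([
    "2020-01-01 03:09:57",
    "2020-01-03 16:37:52",
    "2020-02-15 19:16:12",
]))

# Scalar time features commonly derived from a timestamp
features = pd.DataFrame({
    "hour": events.dt.hour,            # 0-23
    "dayofweek": events.dt.dayofweek,  # 0 = Monday
    "month": events.dt.month,          # 1-12
})
```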
However, the best approach for dealing with time in a machine learning problem will depend on the specific problem you're trying to solve and the structure of your data.
In this tutorial, we show how to analyze timestamps as circular variables, and how to generate an approximate distribution that can be used as part of the feature engineering step in building a machine learning model.
import pandas as pd
import pycircular
df = pycircular.datasets.load_transactions()['data']
df['date'] = pd.to_datetime(df['date'])
dates = df.loc[df['user'] == 1, 'date']
dates.head()
1     2020-01-01 03:09:57
6     2020-01-01 16:37:52
9     2020-01-01 19:16:12
12    2020-01-01 19:16:58
15    2020-01-01 19:17:48
Name: date, dtype: datetime64[ns]
Analyzing the transaction dates using traditional data analysis
dates.describe(datetime_is_numeric=True)
count                              349
mean     2020-04-03 01:32:31.352435712
min                2020-01-01 03:09:57
25%                2020-02-09 08:44:55
50%                2020-03-27 01:00:01
75%                2020-05-25 00:36:32
max                2020-07-29 22:39:35
Name: date, dtype: object
dates.groupby(dates.dt.hour).count().plot(kind="bar")
When dealing with hour of the day as a scalar variable, there are a few issues that can arise.
One issue is that hour of the day is cyclical in nature, meaning that the value at the end of the day (24:00) is related to the value at the beginning of the day (00:00). However, when hour of the day is treated as a scalar variable, this cyclical relationship is not considered, which can lead to inaccurate or misleading results.
Another issue is that hour of the day is often correlated with other variables, such as day of the week or season. For example, there may be more traffic during rush hour on a weekday than on a weekend. However, when hour of the day is treated as a scalar variable, these correlations are not taken into account and can lead to biased or misleading results.
A third issue is that hour of the day can be affected by different factors such as season, day of the week, or even holidays. These factors can greatly impact the behavior and patterns of hour of the day. So, if this information is not taken into account when using hour of the day as a scalar variable, it can lead to inaccurate conclusions.
To overcome these issues, one solution is to use a cyclical encoding technique, such as sine and cosine encoding, to incorporate the cyclical nature of the data. Another solution is to include other relevant variables, such as day of the week or season, in the model to account for potential correlations. Additionally, it's important to consider the impact of different factors on hour of the day when analyzing data.
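As a minimal sketch of the sine/cosine encoding mentioned above (independent of pycircular), each hour is mapped to a point on the unit circle, so that 23:00 and 00:00 end up close together:

```python
import numpy as np
import pandas as pd

# Illustrative hours of the day
hours = pd.Series([0, 6, 12, 18, 23])
radians = 2 * np.pi * hours / 24
encoded = pd.DataFrame({
    "hour_sin": np.sin(radians),
    "hour_cos": np.cos(radians),
})
# With the scalar encoding, the distance between 23:00 and 00:00 is
# |23 - 0| = 23; with the sine/cosine encoding, the Euclidean distance
# between the two points is only about 0.26.
```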
We can model the time of day as a circular variable. To do so, we must first convert the time of day into radians and then plot it on a 24-hour clock.
time_segment = 'hour'  # 'hour', 'dayweek', 'daymonth'
freq_arr, times = pycircular.utils.freq_time(dates, time_segment=time_segment)
fig, ax1 = pycircular.plots.base_periodic_fig(freq_arr[:, 0], freq_arr[:, 1], time_segment=time_segment)
ax1.legend(bbox_to_anchor=(-0.3, 0.05), loc="upper left", borderaxespad=0)
We could now create variables such as the mean time of the transaction and the standard deviation of the time of the transaction.
dates_mean = times.values.mean()
fig, ax1 = pycircular.plots.base_periodic_fig(freq_arr[:, 0], freq_arr[:, 1], time_segment=time_segment)
ax1.bar([dates_mean], [1], width=0.1, label='Arithmetical Mean Hour')
ax1.legend(bbox_to_anchor=(-0.3, 0.05), loc="upper left", borderaxespad=0)
The issue when dealing with time, specifically when analyzing a feature such as the mean time, is that it is easy to make the mistake of using the arithmetic mean. The arithmetic mean is not a correct way to average time because, as shown in the figure above, it does not consider the periodic behavior of the time feature. For example, the arithmetic mean of the times of four transactions made at 2:00, 3:00, 22:00 and 23:00 is 12:30, which is counter-intuitive since no transaction was made close to that time.
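To make the pitfall concrete, here is a small NumPy sketch (independent of pycircular) contrasting the arithmetic mean with the circular mean for the four transactions above; the circular mean is obtained by averaging the points on the unit circle and taking the angle of the resulting vector:

```python
import numpy as np

hours = np.array([2.0, 3.0, 22.0, 23.0])

arithmetic_mean = hours.mean()  # 12.5, i.e. 12:30, far from every event

# Map hours onto the unit circle and average the resulting vectors
radians = 2 * np.pi * hours / 24
circular_mean_rad = np.arctan2(np.sin(radians).sum(), np.cos(radians).sum())
circular_mean_hour = (circular_mean_rad * 24 / (2 * np.pi)) % 24
# circular_mean_hour is 0.5, i.e. 00:30, right between the two clusters
```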
We can overcome this limitation by modeling the time of the transaction as a periodic variable, in particular using the von Mises distribution (Fisher, 1996). The von Mises distribution, also known as the circular normal distribution, is the distribution of a wrapped normally distributed variable on the circle. The von Mises distribution of a set of examples $$ D=\{t_1,t_2,\cdots,t_N\}$$ is defined as \begin{equation} D \sim vonmises\left( \mu_{vM} , \frac{1}{\sigma_{vM}} \right), \end{equation} where $\mu_{vM}$ and $\sigma_{vM}$ are the periodic mean and periodic standard deviation, respectively. In [Feature Engineering Strategies for Credit Card Fraud Detection](https://albahnsen.github.io/files/Feature%20Engineering%20Strategies%20for%20Credit%20Card%20Fraud%20Detection_published.pdf) we present the calculation of $\mu_{vM}$ and $\sigma_{vM}$.
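As an aside, the same distribution is available in SciPy, using the concentration $\kappa = 1/\sigma_{vM}$ as above. A minimal sketch with made-up parameters (mu and sigma here are illustrative, not fitted to the data):

```python
import numpy as np
from scipy.stats import vonmises

# Hypothetical parameters for illustration only
mu, sigma = 0.0, 0.8    # periodic mean and std, in radians
kappa = 1.0 / sigma     # concentration, matching the parameterization above
theta = np.linspace(-np.pi, np.pi, 361)
pdf = vonmises.pdf(theta, kappa, loc=mu)
# The density peaks at theta = mu and integrates to 1 over the circle
```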
radians = pycircular.utils._date2rad(dates, time_segment='hour')
mean, std = pycircular.stats.periodic_mean_std(radians)
fig, ax1 = pycircular.plots.base_periodic_fig(freq_arr[:, 0], freq_arr[:, 1], time_segment=time_segment)
ax1.bar([mean], [1], width=0.1, label='Periodical Mean Hour')
ax1.legend(bbox_to_anchor=(-0.3, 0.05), loc="upper left", borderaxespad=0)
Using the circular mean and standard deviation, we can calculate the von Mises distribution.
# Calculate the von Mises distribution
x, p = pycircular.stats.von_mises_distribution(mean, std)
fig, ax1 = pycircular.plots.base_periodic_fig(freq_arr[:, 0], freq_arr[:, 1], time_segment='hour')
ax1 = pycircular.plots.clock_vonmises_distribution(ax1, mean, x, p)
This method gives us a good approximation of the distribution of the time of the events. However, a statistical distribution with only one mode may fail to model the data accurately if it is not a good fit for the data set. In particular, if the data set is multi-modal (i.e., has multiple peaks), a single-mode distribution will not capture all the variations in the data, which can lead to poor predictions or inferences based on the model.
One way to overcome the limitations of a single-mode distribution is to use a kernel-based method, such as kernel density estimation (KDE) with a von Mises kernel.
KDE is a non-parametric method for estimating the probability density function of a random variable. It replaces the point mass at each data point with a smooth, symmetric kernel function, such as the von Mises kernel; the resulting estimate of the PDF is the sum of the kernel functions centered at each data point.
By using a kernel function, KDE can capture multiple modes in the data, making it a more flexible method for modeling multi-modal data sets. Additionally, KDE is non-parametric, so it does not make any assumptions about the underlying distribution of the data.
However, it's worth noting that choosing the right kernel is important, and there are some challenges when working with KDE such as the choice of bandwidth and the curse of dimensionality.
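As a sketch of the idea (independent of pycircular's implementation; the `vonmises_kde` helper below is hypothetical), a von Mises KDE simply averages one kernel per observation:

```python
import numpy as np
from scipy.stats import vonmises

def vonmises_kde(samples, kappa, grid_size=360):
    """Hypothetical helper: average one von Mises kernel per
    observation to estimate the circular density."""
    grid = np.linspace(-np.pi, np.pi, grid_size, endpoint=False)
    # kappa acts as an inverse bandwidth: larger kappa -> narrower kernels
    densities = vonmises.pdf(grid[:, None], kappa, loc=samples[None, :])
    return grid, densities.mean(axis=1)

# Illustrative bimodal data: events around 03:00 and 19:00, in radians
rng = np.random.default_rng(0)
samples = np.concatenate([
    vonmises.rvs(4, loc=2 * np.pi * 3 / 24, size=100, random_state=rng),
    vonmises.rvs(4, loc=2 * np.pi * 19 / 24 - 2 * np.pi, size=100, random_state=rng),
])
grid, density = vonmises_kde(samples, kappa=8)
```

Because each kernel integrates to one over the circle, the averaged estimate is itself a valid circular density, and with bimodal data it shows two peaks where a single von Mises fit would show one.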
y = pycircular.circular.kernel(radians.values)
fig, ax1 = pycircular.plots.plot_kernel(freq_arr[:, 0], freq_arr[:, 1], y, time_segment=time_segment)
In summary, using a kernel-based method such as KDE with von Mises can help overcome the issues of using a statistical distribution with only one mode by allowing for a more flexible and robust modeling of multi-modal data sets.
Finally, we can apply the kernel to new observations and create a new feature that can be used as an input for a machine learning model.
y_test = pd.DataFrame(pd.to_datetime([
'2023-01-01 12:00:00',
'2023-01-01 03:00:00',
'2023-01-01 18:00:00',
]), columns=['dates'])
radians = pycircular.utils._date2rad(y_test['dates'], time_segment='hour')
y_test['prob'] = pycircular.circular.predict_proba(radians, y)
y_test
|   | dates               | prob     |
|---|---------------------|----------|
| 0 | 2023-01-01 12:00:00 | 0.017654 |
| 1 | 2023-01-01 03:00:00 | 0.838482 |
| 2 | 2023-01-01 18:00:00 | 0.500127 |
To understand these probabilities, let's observe the new events on the kernel plot.
fig, ax1 = pycircular.plots.plot_kernel(freq_arr[:, 0], freq_arr[:, 1], y, time_segment=time_segment)
for i in range(y_test.shape[0]):
ax1.bar([radians[i]], [1], width=0.05, label=y_test.loc[i, 'dates'])
ax1.legend(bbox_to_anchor=(-0.3, 0.05), loc="upper left", borderaxespad=0)
We can see that the observation at noon has a very low probability (0.017) because there were no observations around that time when the kernel was trained.
In conclusion, this methodology allows us to effectively deal with timestamps by creating more robust representations of the temporal information in the data. By using the kernel of von Mises during feature engineering, we can generate new features that accurately capture the nuances of temporal patterns in the data. This approach can overcome the limitations of treating dates as a scalar variable and lead to improved performance in machine learning models.
We will show how to deal with these issues in a following tutorial.