Time-series Synthetic Data: A GAN approach

January 28, 2021

Generate synthetic sequential data with TimeGAN

Time-series or sequential data can be defined as any data that has a time dependency. Cool, huh, but where can I find sequential data? Well, a bit everywhere: from credit card transactions, my everyday routine and whereabouts, to medical records such as ECGs and EEGs. Although sequential data is common and highly useful, there are many reasons why it ends up not being leveraged, from privacy regulations to its sheer scarcity.

In one of my previous posts, I covered the ability of Generative Adversarial Networks (GANs) to learn and generate new synthetic data that preserves the utility and fidelity of a real dataset. Nevertheless, generating tabular data is far simpler than generating datasets that must preserve temporal dynamics. To model time-series data successfully, a model must not only capture the distribution of the dataset’s features within each time-point, it must also capture the complex dynamics of those features across time. We must also not forget that each time sequence has a variable length associated with it.
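To make the difference between per-time-point distributions and temporal dynamics concrete, here is a minimal sketch in plain Python. The sine wave, the `lag1_autocorr` helper and the shuffling experiment are illustrative assumptions of mine, not part of TimeGAN: shuffling a sequence keeps exactly the same set of values (so the marginal distribution is untouched) while destroying the dependency between consecutive time-points.

```python
import math
import random


def lag1_autocorr(series):
    """Lag-1 autocorrelation: how strongly each value depends on the previous one."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    cov = sum((series[t] - mean) * (series[t + 1] - mean) for t in range(n - 1))
    return cov / var


# Toy "real" sequence: a noiseless sine wave sampled at 200 time-points.
real = [math.sin(2 * math.pi * t / 50) for t in range(200)]

# Same values in random order: identical marginal distribution,
# but the temporal structure is gone.
rng = random.Random(0)
shuffled = real[:]
rng.shuffle(shuffled)

same_marginal = sorted(real) == sorted(shuffled)
real_ac = lag1_autocorr(real)          # close to 1: strong temporal dependency
shuffled_ac = lag1_autocorr(shuffled)  # close to 0: no temporal dependency

print(same_marginal, round(real_ac, 3), round(shuffled_ac, 3))
```

A generator that only matched the distribution of values would be unable to tell these two sequences apart, which is why a time-series model also has to be judged on how well it reproduces the dynamics across time.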

But being a challenging task does not mean it is impossible! In 2019, Jinsung Yoon, Daniel Jarrett and Mihaela van der Schaar proposed a novel GAN architecture to model sequential data, TimeGAN, which I’ll be covering with a practical example throughout this blog post.

Time-series Generative Adversarial Networks

TimeGAN, or Time-series Generative Adversarial Networks, was proposed in 2019 as a GAN-based framework able to generate realistic time-series data in a variety of domains, that is, sequential data with different observed behaviors. Unlike other GAN architectures (e.g. WGAN), where an unsupervised adversarial loss is applied to both real and synthetic data, the TimeGAN architecture introduces the concept of a supervised loss: the model is encouraged to capture the time-conditional distribution within the data by using the original data as supervision. It also introduces an embedding network that is responsible for reducing the dimensionality of the adversarial learning space.
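The idea behind the supervised loss can be sketched in a few lines of plain Python. This is a simplified illustration under my own assumptions, not the actual TimeGAN implementation: the latent sequence stands in for the output of the embedding network, and the `doubling` and `identity` functions are hypothetical stand-ins for what would, in practice, be a trained recurrent network. The loss simply measures how well the next latent step is predicted from the current one, with the real sequence acting as the target.

```python
def supervised_loss(latent_seq, supervisor):
    """Mean squared error between each real next latent step and the
    supervisor's one-step-ahead prediction of it."""
    errors = []
    for t in range(len(latent_seq) - 1):
        predicted_next = supervisor(latent_seq[t])
        actual_next = latent_seq[t + 1]
        errors.extend((p - a) ** 2 for p, a in zip(predicted_next, actual_next))
    return sum(errors) / len(errors)


# Toy latent sequence with 2 latent dimensions (an assumption for illustration):
# every step is exactly twice the previous one.
latent = [[1.0, 0.5], [2.0, 1.0], [4.0, 2.0], [8.0, 4.0]]


def doubling(h):
    """Hypothetical supervisor that has learned the true step-wise dynamics."""
    return [2.0 * v for v in h]


def identity(h):
    """Hypothetical supervisor that ignores the dynamics entirely."""
    return list(h)


loss_good = supervised_loss(latent, doubling)
loss_bad = supervised_loss(latent, identity)

print(loss_good, loss_bad)
```

The supervisor that captures the step-to-step dynamics gets a loss of zero, while the one that ignores them is penalized. Computing the loss on a low-dimensional latent sequence rather than on the raw features also hints at why an embedding network is useful: the temporal relationships are learned in a smaller space.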