GAN

GANs for tabular data

Authors

Introduction

In this assignment, we were given two tasks:

(1) create a Generative Adversarial Network (GAN) that can produce tabular samples from two given datasets, and

(2) build a general generative model that receives a black box as a discriminator and can still generate samples from the tabular data. This is done by attempting to predict the scores given by the black-box model.

We implemented this assignment mainly using Keras and scikit-learn.

Dataset Analysis

An example for the ‘Adults’ dataset:

An example for the ‘Bank-full’ dataset:

Code Design:

Our code consists of three scripts:

  1. nt_exp.py - runs the experiments that search for the optimal parameters for our GAN.
  2. nt_gan.py - defines the GAN class and trains and tests the models.
  3. nt_gg.py - defines the general generator class and trains and tests the models.

Generative Adversarial Networks (Part 1)

Architecture:

We used a fairly standard architecture for both the generator and discriminator.

Generator:

| Layer # | Layer Type | Input Size | Activation Function | Notes |
|---|---|---|---|---|
| 1 | Noise | int(x/2), where x is the number of features in the dataset | - | |
| 2 | Dense | x*2 | LeakyReLU | |
| 3 | Dense | x*4 | LeakyReLU | Has a dropout (see details below) |
| 4 | Dense | x*2 | LeakyReLU | Has a dropout (see details below) |
| 5 | Output | x | Sigmoid | |
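As an illustration, a minimal Keras sketch of this generator, interpreting the sizes in the table as layer widths (the dropout rate is a placeholder, not the tuned value):

```python
from tensorflow.keras.layers import Dense, Dropout, Input, LeakyReLU
from tensorflow.keras.models import Sequential

def build_generator(x, dropout_rate=0.2):
    """Generator sketch; x is the number of features, dropout_rate is assumed."""
    return Sequential([
        Input(shape=(x // 2,)),          # noise vector of size int(x/2)
        Dense(x * 2), LeakyReLU(),
        Dense(x * 4), LeakyReLU(), Dropout(dropout_rate),
        Dense(x * 2), LeakyReLU(), Dropout(dropout_rate),
        Dense(x, activation="sigmoid"),  # one output per (scaled) feature
    ])
```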

Discriminator:

| Layer # | Layer Type | Input Size | Activation Function | Notes |
|---|---|---|---|---|
| 1 | Input | x, where x is the number of features in the dataset | - | |
| 2 | Dense | x*2 | LeakyReLU | |
| 3 | Dense | x*4 | LeakyReLU | Has a dropout (see details below) |
| 4 | Dense | x*2 | LeakyReLU | Has a dropout (see details below) |
| 5 | Output | 1 | Sigmoid | |
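The discriminator can be sketched the same way, under the same assumptions:

```python
from tensorflow.keras.layers import Dense, Dropout, Input, LeakyReLU
from tensorflow.keras.models import Sequential

def build_discriminator(x, dropout_rate=0.2):
    """Discriminator sketch; x is the number of features, dropout_rate is assumed."""
    return Sequential([
        Input(shape=(x,)),               # a real or generated record
        Dense(x * 2), LeakyReLU(),
        Dense(x * 4), LeakyReLU(), Dropout(dropout_rate),
        Dense(x * 2), LeakyReLU(), Dropout(dropout_rate),
        Dense(1, activation="sigmoid"),  # probability that the record is real
    ])
```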

Results Evaluation:

We devised a few metrics to judge our models:

  1. Mean Minimum Euclidean Distance - For each generated record, this approach finds the most similar record in the real data and computes the Euclidean distance between them; the metric is the mean of all these distances. We want low values for the fooled samples and higher values for the not-fooled samples (see the sketch after this list).
  2. Principal Component Analysis (PCA) Distribution - This approach transforms the original data with a two-component PCA and then applies the same transformation to the fooled and not-fooled samples. The output plot tells us four things:
    1. How similar the fooled samples are to the real data.
    2. How similar the not-fooled samples are to the real data.
    3. How similar the fooled and not-fooled samples are to each other.
    4. How the discriminator decides whether a sample is real.
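A minimal sketch of both metrics using scikit-learn (array and function names such as real, fooled, and not_fooled are ours, not from the assignment code):

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.metrics import pairwise_distances

def mean_min_distance(generated, real):
    # For each generated record, the Euclidean distance to its nearest real
    # record; the metric is the mean of these minima.
    return pairwise_distances(generated, real).min(axis=1).mean()

def plot_pca_distribution(real, fooled, not_fooled):
    # Fit a 2-component PCA on the real data, then project all three sets
    # into the same plane so their distributions can be compared visually.
    pca = PCA(n_components=2).fit(real)
    for data, label in [(real, "real"), (fooled, "fooled"), (not_fooled, "not fooled")]:
        pts = pca.transform(data)
        plt.scatter(pts[:, 0], pts[:, 1], label=label, alpha=0.5, s=10)
    plt.legend()
    plt.show()
```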

Network Hyper-Parameters Tuning:

NOTE: Here we explain the reasoning behind the parameter choices.

After implementing our GAN, we optimized the different parameters it uses. For some parameters, such as the batch size, it is very hard to predict what will work best, so an empirical search is the most practical way to find good values.

We evaluated the parameters based on the first metric we presented - the mean minimum Euclidean distance.

MMDF = mean minimum Euclidean distance for the fooled samples

MMDNF = mean minimum Euclidean distance for the not-fooled samples

MMDG = MMDF - MMDNF

NFL = number of not-fooled samples out of the 100 generated samples

Weights: W1 = 0.33, W2 = 0.33, W3 = 0.34

We use the following score to determine the parameters:

NTSCORE = W1 * (MMDF + MMDNF) + W2 * MMDG + W3 * (NFL / 100)

The lower the value of this score, the better the model is.
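A minimal sketch of how NTSCORE could be computed, reusing the mean-minimum-distance idea above (function and argument names are ours):

```python
from sklearn.metrics import pairwise_distances

def nt_score(fooled, not_fooled, real, w1=0.33, w2=0.33, w3=0.34):
    # Mean minimum Euclidean distance of each generated set to the real data.
    mmdf = pairwise_distances(fooled, real).min(axis=1).mean()
    mmdnf = pairwise_distances(not_fooled, real).min(axis=1).mean()
    mmdg = mmdf - mmdnf
    nfl = len(not_fooled)  # number of not-fooled samples out of 100 generated
    return w1 * (mmdf + mmdnf) + w2 * mmdg + w3 * (nfl / 100)
```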

Each combination takes a long time to train (5-15 minutes), so we tried only a few values for each parameter:

Experimental Results:

The best results are shown in bold -

For the ‘Adults’ dataset, the results of the model were:

For the ‘Bank-full’ dataset, the results of the model were:

General Generator (Part 2)

In this section we were tasked with creating a black-box discriminator (in our case, a random-forest model) and building only a generator that can create samples based on the confidence scores given by the black-box discriminator. As before, the input to the generator is a vector of random noise, but in addition we also provide it with a sample of the probability the black-box model assigns to class 1 (there are only 2 classes, so the probabilities simply sum to 1). The goal is for the generator to learn the distribution of these probabilities in addition to creating good synthetic samples.

Architecture:

We used nearly the same generator architecture, coupled with a Random Forest using its default hyper-parameters (a sketch of the black-box setup follows the table below).

Generator:

| Layer # | Layer Type | Input Size | Activation Function | Notes |
|---|---|---|---|---|
| 1 | Noise (Input) | int(x/2) + 1, where x is the number of features in the dataset | - | +1 for the desired confidence |
| 2 | Dense | x*2 | LeakyReLU | |
| 3 | Dense | x*4 | LeakyReLU | Has dropout |
| 4 | Dense | x*2 | LeakyReLU | Has dropout |
| 5 | Output | x | Sigmoid | Loss: categorical cross-entropy |
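For the black-box side, a minimal sketch of training the random forest with scikit-learn defaults; X_real and y_real are hypothetical stand-ins for the preprocessed features and the last-column label, filled with random placeholders here:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Placeholders for the preprocessed dataset (names and shapes are ours).
rng = np.random.default_rng(0)
X_real = rng.random((1000, 14))          # e.g. 14 features scaled to [0, 1]
y_real = rng.integers(0, 2, size=1000)   # binary label (the last column)

# The black-box discriminator: a random forest with default hyper-parameters.
blackbox = RandomForestClassifier().fit(X_real, y_real)

# Confidence scores the generator tries to mimic: probability of class 1.
confidences = blackbox.predict_proba(X_real)[:, 1]
```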

Training Phase:

The generator must be given a desired probability and then generate a sample that indeed receives that probability from the black-box model. Therefore, it must be penalized when the resulting probability is far from the requested one. The process is as follows (a sketch appears after this list):

  1. The generator creates N samples (N = the batch size). The last column of each sample is the probability for a single class (class 1 in our case).
  2. These samples are fed to the (trained) black-box discriminator, which outputs a probability for each class.
  3. We aim to mimic these probabilities, so they are fed back into the generator alongside a noise vector.
  4. We then run this input through the generator, where the loss function is the binary cross-entropy between the probability provided and the probability column of each sample created.
  5. The weights are adjusted accordingly and a new batch is generated.
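One possible reading of this loop as a Keras sketch; this is our interpretation, not the code of nt_gg.py, and all names are ours. The custom loss compares only the probability columns:

```python
import numpy as np
import tensorflow as tf

def last_column_bce(y_true, y_pred):
    # Binary cross-entropy between the requested confidence and the
    # generator's probability column (both stored in the last column).
    return tf.keras.losses.binary_crossentropy(y_true[:, -1:], y_pred[:, -1:])

def train_step(generator, blackbox, batch_size, noise_dim, n_features):
    noise = np.random.normal(size=(batch_size, noise_dim))
    conf = np.random.uniform(0, 1, size=(batch_size, 1))
    samples = generator.predict(np.hstack([noise, conf]), verbose=0)   # step 1
    bb_conf = blackbox.predict_proba(samples[:, :-1])[:, 1:2]          # step 2
    # Steps 3-5: feed the black-box probabilities back in with the noise and
    # penalize the gap between them and the generated probability column.
    target = np.hstack([np.zeros((batch_size, n_features - 1)), bb_conf])
    return generator.train_on_batch(np.hstack([noise, bb_conf]), target)

# The generator is assumed compiled beforehand, e.g.:
# generator.compile(optimizer="adam", loss=last_column_bce)
```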

Network Hyper-Parameters Tuning:

As before, we performed hyper-parameter tuning to achieve the best results. We tried various combinations:

Results Evaluation:

In this part, the main goal was for the distribution of confidence probabilities to be uniform, since we sampled 1,000 confidence scores uniformly as required; a good model is one that indeed produces such a distribution. We designated the last column as the label. It must be noted, however, that the classes are imbalanced in the ‘adult’ dataset (about 75% of the data is class 1 while 25% is class 0).

Discriminator Results:

Adult dataset:

Bank dataset:

Class distribution:

Generator Results:

Here we first uniformly sampled 1,000 confidence rates from [0, 1]. Then, based on these rates, we generated 1,000 samples; the goal is for the discriminator’s confidence rates to also be uniformly distributed (see the sketch below). This is of course a hard task considering how skewed the confidence rates are, as seen above, since class 1 is much more likely (3 times more likely for the first dataset and 8 times more for the second).
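A hedged sketch of this evaluation (names are ours): sample the target confidences uniformly, generate one record per target, and plot the black box’s resulting confidences, which should ideally be uniform:

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_confidence_distribution(generator, blackbox, noise_dim, n=1000):
    target_conf = np.random.uniform(0, 1, size=(n, 1))  # uniform targets
    noise = np.random.normal(size=(n, noise_dim))
    samples = generator.predict(np.hstack([noise, target_conf]), verbose=0)
    actual_conf = blackbox.predict_proba(samples[:, :-1])[:, 1]
    # A flat histogram would mean the generator hits the requested confidences.
    plt.hist(actual_conf, bins=20, range=(0, 1))
    plt.xlabel("black-box confidence for class 1")
    plt.ylabel("count")
    plt.show()
```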

Adult Dataset:

Bank Dataset:

Discussion

For the adult dataset, the confidence rates of the generated samples are not completely random, but they are not uniformly distributed either; they clearly skew towards the original distribution, which makes sense. However, this is not the case for the bank dataset. Perhaps this can be explained by the high loss and the extreme imbalance in the original data.

For both datasets, our generator indeed suffered from mode collapse and only generated samples from the class with more instances in the training set. This obviously hindered the results. Since only class 1 was generated, obviously we only have results for that one so it is difficult to compare results between classes. Perhaps a better approach would be to generate synthetic samples (e.g, SMOTE) to ensure better training data, or choose a feature that distributes more evenly.