This blog post is about the mathematical formulation and solution of the GAN framework.

Check out the introductory blog post of this series here to get a wider picture about GANs.

The content of this blog post is inspired by the NIPS tutorial on GANs.

### Maximum Likelihood Estimation

Consider a model $M$ defined by its parameters $\theta$. Maximum Likelihood Estimation (MLE) simply says to choose the parameters that maximize the likelihood of the training data. For a dataset of $m$ training examples, each denoted $x^{(i)}$, the likelihood given parameters $\theta$ is

$p = \Pi_{i=1}^{m} p_{model}(x^{(i)};\theta)$

Since the probabilities are being multiplied in the equation above, the easiest way to do MLE is in log space:

$\theta^{*} = \arg\max_{\theta} \log{\Pi_{i=1}^{m}p_{model}(x^{(i)}; \theta)}$

which can also be written as

$\theta^{*} = \arg\max_{\theta} \Sigma_{i=1}^{m} \log{p_{model}(x^{(i)}; \theta)}$

Because the log function is strictly increasing, moving the logarithm inside the argmax does not change the location of the maximum. Maximum likelihood estimation can also be expressed as minimization of the KL divergence, a measure of the distance between two distributions:

$\theta^{*} = \arg\min_{\theta} D_{KL}(p_{data}(x) \| p_{model}(x;\theta))$
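As a small, hypothetical illustration of the MLE formulas above (the model and numbers are my own, not from the tutorial), the sketch below fits the mean of a unit-variance Gaussian by maximizing the summed log-likelihood over a grid of candidate parameters. For this model the maximum-likelihood estimate happens to be the sample mean, so we can check the answer directly:

```python
import numpy as np

# Toy MLE sketch: fit the mean of a Gaussian with known variance 1
# by maximizing the log-likelihood of the observed data.
rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=1000)

def log_likelihood(theta, x):
    # sum_i log p_model(x^(i); theta) for a unit-variance Gaussian
    return np.sum(-0.5 * (x - theta) ** 2 - 0.5 * np.log(2 * np.pi))

# Evaluate over a grid of candidate means; the argmax is the MLE.
thetas = np.linspace(0.0, 4.0, 401)
lls = np.array([log_likelihood(t, data) for t in thetas])
theta_star = thetas[np.argmax(lls)]

# For a unit-variance Gaussian, the MLE of the mean equals the sample mean.
print(theta_star, data.mean())
```

A grid search stands in here for the gradient ascent mentioned later in the post; with a differentiable log-likelihood, following the gradient uphill reaches the same maximum.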

### So why all this math?

What we just did above is formulate the problem: a probabilistic model from which we could generate data. But we did not discuss whether the problem is tractable. When we have low-dimensional data, it is relatively simple to develop an explicit density function, and such explicit density functions can be maximized easily by following the gradient uphill. In general, however, it's difficult to design an explicit density model that captures the complexity of the data while remaining tractable.

There are two major approaches to tackle these challenges. One is to develop the model strategically so that its structure guarantees tractability; the other is to use tractable approximations to the likelihood and its gradients.

So, what else is used when a problem is intractable? Neural Networks.

Neural networks are used extensively where it is either difficult to define the model, or the model, in general, is intractable. This is the implicit density model category, exactly where GANs fall. With an implicit density model, the neural network does not explicitly represent a probability distribution over the space in which the training data lies. Now we have some idea why GANs are important and what the motivation to use them is.

Let’s get into some more GAN specific math.

In the previous blog post, I already talked about the GAN structure. GANs are a structured probabilistic model. We have two neural networks, a generator and a discriminator. The generator is a differentiable function that takes in a random noise vector $z$ and uses parameters $\theta^{(G)}$, while the discriminator is a differentiable function that takes in samples and uses parameters $\theta^{(D)}$.
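To make this concrete, here is a minimal NumPy sketch of the two networks as differentiable functions with separate parameter sets $\theta^{(G)}$ and $\theta^{(D)}$. The layer sizes and architectures are my own arbitrary choices for illustration, not part of the GAN framework itself:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy dimensions (my choice, purely illustrative).
noise_dim, data_dim, hidden = 4, 2, 8

# theta^(G): parameters of a one-hidden-layer generator, noise z -> sample.
theta_G = {"W1": rng.normal(size=(noise_dim, hidden)) * 0.1,
           "W2": rng.normal(size=(hidden, data_dim)) * 0.1}

# theta^(D): parameters of a discriminator, sample -> probability it is real.
theta_D = {"W1": rng.normal(size=(data_dim, hidden)) * 0.1,
           "w2": rng.normal(size=(hidden, 1)) * 0.1}

def G(z, p):
    # Differentiable in theta^(G): composition of matrix products and tanh.
    return np.tanh(z @ p["W1"]) @ p["W2"]

def D(x, p):
    # Sigmoid output squashes the logit into (0, 1).
    logits = np.tanh(x @ p["W1"]) @ p["w2"]
    return 1.0 / (1.0 + np.exp(-logits))

z = rng.normal(size=(5, noise_dim))   # a batch of 5 noise vectors
fake = G(z, theta_G)                  # 5 generated samples
print(fake.shape, D(fake, theta_D).shape)
```

Each network can only adjust the entries of its own parameter dictionary, which is exactly the constraint that turns training into a game.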

The generator and the discriminator have their own cost functions, $J^{(G)}(\theta^{(G)}, \theta^{(D)})$ and $J^{(D)}(\theta^{(G)}, \theta^{(D)})$ respectively. As shown above, each cost function is defined in terms of both networks' parameters, but each network can only control its own parameters. Hence, this results in a game between the two networks: each tries to minimize its own cost function while controlling only its own parameters. This is no longer an optimization problem.

Just as the solution to an optimization problem is a local minimum, the solution to a game is a Nash equilibrium.

### Attaining the Nash Equilibrium

There are several ways to design the two loss functions for GANs. The simplest version of a game is a zero-sum game, in which the players' costs sum to zero. A zero-sum game in GANs can be formulated accordingly:

$J^{(G)} = -J^{(D)}$

Notice that we defined the generator's cost as the negative of the discriminator's cost, because we do not have an explicit way to evaluate the generator's cost directly. The discriminator's loss, on the other hand, is easy to formulate: since the discriminator classifies whether a sample is real or not, the problem is a standard binary classification problem, and hence the standard cross-entropy loss can be used. Let's define the discriminator's cost:

$J^{(D)}(\theta^{(G)}, \theta^{(D)}) = -\frac{1}{2}[\mathbb{E}_{x \sim p_{data}} \log{D(x)} + \mathbb{E}_{z} \log{(1 - D(G(z)))}]$

For all true samples, the label is 1, while the label for samples produced by the generator is 0.
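Under those label conventions, the discriminator's cost can be sketched in a few lines of NumPy (a toy illustration with made-up probabilities, not a training loop). A well-performing discriminator, one that assigns $D(x)$ near 1 to real samples and $D(G(z))$ near 0 to fakes, incurs a small cost:

```python
import numpy as np

def discriminator_cost(d_real, d_fake):
    # J^(D) = -1/2 [ E_x log D(x) + E_z log(1 - D(G(z))) ]
    # d_real: discriminator outputs on real samples, d_fake: on generated ones.
    return -0.5 * (np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake)))

confident = discriminator_cost(np.array([0.9, 0.95]),   # real scored near 1
                               np.array([0.05, 0.1]))   # fakes scored near 0
unsure = discriminator_cost(np.array([0.5, 0.5]),       # coin-flip outputs
                            np.array([0.5, 0.5]))
print(confident, unsure)
```

A maximally unsure discriminator (outputting 0.5 everywhere) pays a cost of $\log 2 \approx 0.693$, while the confident one pays far less.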

One thing to notice is that this loss does not give strong insight into when the models have converged to the Nash equilibrium, nor does this formulation lead to stable training (gradient descent works well mainly for ordinary optimization problems). Several other loss functions were later formulated to make the loss more interpretable and, at the same time, make training more stable. For example, instead of setting the generator's loss to the negative of the discriminator's loss, we can use a cross-entropy loss for the generator as well:

$J^{(G)} = -\frac{1}{2}\mathbb{E}_{z} \log{D(G(z))}$
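This non-saturating generator loss can be sketched the same way (again a toy illustration with made-up discriminator outputs): the cost is small when the generator fools the discriminator, i.e. when $D(G(z))$ is close to 1, and large when the fakes are caught:

```python
import numpy as np

def generator_cost(d_fake):
    # J^(G) = -1/2 E_z log D(G(z))
    # d_fake: discriminator outputs on generated samples.
    return -0.5 * np.mean(np.log(d_fake))

fooling = generator_cost(np.array([0.9, 0.95]))  # discriminator fooled -> low cost
caught = generator_cost(np.array([0.05, 0.1]))   # fakes detected -> high cost
print(fooling, caught)
```

Note that the gradient of this cost stays large when $D(G(z))$ is near 0, which is exactly the early-training regime where the zero-sum formulation's gradient vanishes; this is one reason it trains more stably.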

In subsequent blog posts, I will discuss different types of GANs. While some are focused on better formulations, others are interesting applications of GANs.