
Understanding Gaussian Processes for Machine Learning


Introduction

Gaussian processes are versatile tools in machine learning and statistics for modeling distributions over functions. Named after the German mathematician Carl Friedrich Gauss, a Gaussian process is a collection of random variables, any finite number of which follows a joint Gaussian distribution.

Gaussian processes have a long history, dating back to the early 20th century. They grew out of studies of random processes and covariance functions, with early formulations expressed in terms of conditional Gaussian distributions; over time, a range of statistical and mathematical algorithms were developed for working with them.

Gaussian processes find broad application in regression, classification, time series forecasting, and optimization. They have become even more prominent with recent advances in machine learning and artificial intelligence, and are especially valued for their ability to quantify uncertainty, analyze variability, and build data-driven models.

Mathematical Foundations

The Gaussian Process (GP) is a stochastic surrogate modeling technique that approximates an unknown function \(f(x)\) from single-fidelity data comprising input-output pairs \(D = \{(x_i, y_i)\}\), \(i = 1,\dots,N\). The latent function \(f(x)\) is modeled as a Gaussian process with mean function \(m(x)\) and covariance function \(k\), as expressed in Equation (1).

\[f(\boldsymbol{x}) \sim GP(m(\boldsymbol{x}), k(\boldsymbol{x},\boldsymbol{x'}; \boldsymbol{\theta})) \quad(1)\]

where,

  • \(\boldsymbol{x}\) and \(\boldsymbol{x'}\) are input vectors,
  • \(m(\boldsymbol{x})\) is the mean function,
  • \(k(\boldsymbol{x},\boldsymbol{x'};\theta)\) is the covariance function, also known as the kernel,
  • \(\theta\) is the hyper-parameter vector of the covariance function.

Hence, a Gaussian Process (GP) is fully defined by its mean function and its kernel, given in Equations (2) and (3), respectively. In practice, it is common to set the mean function to zero and center the data; this simplifies the model and works well when the focus is on the local behavior of the function.

\[m(\boldsymbol{x})=\mathbb{E}[f(\boldsymbol{x})] \quad(2)\]

\[k\left(\boldsymbol{x}, \boldsymbol{x}^{\prime}\right)=\mathbb{E}\left[(f(\boldsymbol{x})-m(\boldsymbol{x}))\left(f\left(\boldsymbol{x}^{\prime}\right)-m\left(\boldsymbol{x}^{\prime}\right)\right)\right] \quad(3)\]

In general, the chosen covariance function contains free parameters, known as hyperparameters in the context of Gaussian Processes (GPs). The hyperparameters strongly influence the predictions of the GP: visually, they define its shape and regulate how closely it fits the data. Many covariance functions exist; for the squared-exponential covariance function (Equation (4)), the hyperparameters \(\theta = (l, \sigma_f^2)\) are the signal variance \(\sigma_f^2\) and the length-scale \(l\).

\[k(\boldsymbol{x}, \boldsymbol{x'}; \theta) = \sigma_f^2 \exp\Bigg(- \frac{1}{2} \sum_{i=1}^{D}\frac{(x_i-x'_i)^2}{l^2}\Bigg) \quad(4)\]

The squared-exponential covariance function is among the most frequently used covariance functions. Samples from a Gaussian Process (GP) with a squared-exponential covariance function are both continuous and infinitely differentiable. Continuity yields smooth curves, facilitating result interpretation, while infinite differentiability allows for integrating prior knowledge regarding probable values of the derivatives.
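
To build intuition for how the hyperparameters shape a GP, the short sketch below draws samples from a zero-mean GP prior with the squared-exponential kernel for two different length-scales. This is an illustrative, self-contained snippet (the helper name se_kernel and the chosen length-scales are only for demonstration and are not part of the implementation developed later); shorter length-scales produce wigglier samples, longer ones smoother samples.

    
        import numpy as np
        import matplotlib.pyplot as plt
        
        def se_kernel(X1, X2, l=1.0, sigma_f=1.0):
            # Squared-exponential kernel of Equation (4) with an isotropic length-scale.
            sqdist = np.sum(X1**2, 1).reshape(-1, 1) + np.sum(X2**2, 1) - 2 * np.dot(X1, X2.T)
            return sigma_f**2 * np.exp(-0.5 / l**2 * sqdist)
        
        X = np.linspace(-5, 5, 100).reshape(-1, 1)
        
        for l in (0.5, 2.0):
            # Prior covariance at the plotting locations; a small jitter keeps it positive definite.
            K = se_kernel(X, X, l=l) + 1e-8 * np.eye(len(X))
            samples = np.random.multivariate_normal(np.zeros(len(X)), K, size=3)
            plt.plot(X.ravel(), samples.T, alpha=0.7)
        
        plt.xlabel('$x$')
        plt.ylabel('$f(x)$')
        plt.show()
    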

The hyper-parameters \(\theta\) are obtained by minimizing the negative log marginal likelihood (NLML) given in Equation (5), where \(K = k(\boldsymbol{x},\boldsymbol{x};\theta)+\sigma^2 I\).

\[NLML(\theta) = \frac{1}{2} y^T K^{-1} y + \frac{1}{2} \log|K| + \frac{N}{2} \log(2\pi) \quad(5)\]

Once the hyper-parameters have been optimized on the given data set, the surrogate model is constructed, and the predictive mean and variance of the surrogate at a new test point \(\boldsymbol{x^*}\) are computed using Equations (6) and (7).
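
These predictive equations follow from the standard rule for conditioning a multivariate Gaussian. Under the zero-mean assumption, the training outputs \(\boldsymbol{y}\) and the test value \(f(\boldsymbol{x^*})\) are jointly Gaussian,

\[\begin{bmatrix} \boldsymbol{y} \\ f(\boldsymbol{x^*}) \end{bmatrix} \sim N\left(\boldsymbol{0}, \begin{bmatrix} K & k(\boldsymbol{x},\boldsymbol{x^*}) \\ k(\boldsymbol{x^*},\boldsymbol{x}) & k(\boldsymbol{x^*},\boldsymbol{x^*}) \end{bmatrix}\right)\]

and conditioning this joint distribution on the observed \(\boldsymbol{y}\) gives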

\[f(\boldsymbol{x^*})|y \sim N(k(\boldsymbol{x^*},\boldsymbol{x}) K^{-1}y, k(\boldsymbol{x^*},\boldsymbol{x^*})-k(\boldsymbol{x^*},\boldsymbol{x})K^{-1}k(\boldsymbol{x},\boldsymbol{x^*}))\]

\[\mu(\boldsymbol{x^*}) = k(\boldsymbol{x^*},\boldsymbol{x}) K^{-1}y \quad(6)\]

\[cov(\boldsymbol{x^*}) = k(\boldsymbol{x^*},\boldsymbol{x^*})-k(\boldsymbol{x^*},\boldsymbol{x})K^{-1}k(\boldsymbol{x},\boldsymbol{x^*}) \quad(7)\]

Implementing Gaussian Process Regression in Python

In this section, we will first implement the subfunctions, and then demonstrate the results with an example.

* First, we import the necessary Python libraries.

    
        import numpy as np
        from numpy.linalg import inv, det, lstsq, cholesky
        from scipy.optimize import minimize
        import matplotlib.pyplot as plt
    

* Implementation of the covariance function (Equation (4)),

The length parameter \(l\) governs the smoothness of the function, while \(\sigma_f\) controls the vertical variation. To maintain simplicity, we employ the same length parameter \(l\) for all input dimensions, a configuration known as an isotropic kernel.

    
        def kernel(X1, X2, l=1.0, sigma_f=1.0):
            '''
            Squared exponential kernel. 
            
            Inputs:
                X1: Array of m points (m x d).
                X2: Array of n points (n x d).
                l: Kernel length parameter.
                sigma_f: Kernel vertical variation parameter.
        
            Returns:
                Covariance matrix (m x n).
            '''
            sqdist = np.sum(X1**2, 1).reshape(-1, 1) + np.sum(X2**2, 1) - 2 * np.dot(X1, X2.T)
            
            return sigma_f**2 * np.exp(-0.5 / l**2 * sqdist)
    
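
As a quick sanity check, we can evaluate the kernel on a couple of hand-picked points (the values below are arbitrary and purely illustrative): the result has one entry per pair of points, and coincident points with sigma_f = 1 give a covariance of exactly 1.

    
        X1 = np.array([[0.0], [1.0]])
        X2 = np.array([[0.0], [1.0], [2.0]])
        
        K = kernel(X1, X2, l=1.0, sigma_f=1.0)
        print(K.shape)   # (2, 3): one covariance value per pair of points
        print(K[0, 0])   # 1.0, since the first points of X1 and X2 coincide
    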

* Implementation of the negative log marginal likelihood function (Equation (5)),

The hyperparameters are estimated by maximum likelihood: the marginal likelihood of the training data serves as the objective, and the optimal hyperparameters are those that minimize its negative logarithm (the NLML of Equation (5)).

    
        def nlml(X_train, Y_train, noise, naive=True):
            '''
            Computes the negative log marginal likelihood 
            for training data X_train and Y_train and given 
            noise level.
            
            Inputs:
                X_train: training locations (m x d).
                Y_train: training targets (m x 1).
                noise: known noise level of Y_train.
        
            Outputs:
                A function that computes the negative log marginal
                likelihood for hyperparameters theta = (l, sigma_f).
            '''
        
            def nll_naive(theta):
                # Naive implementation: uses the explicit inverse and
                # determinant of K, which can be numerically unstable
                # when K is ill-conditioned.
                K = kernel(X_train, X_train, l=theta[0], sigma_f=theta[1]) + \
                    noise**2 * np.eye(len(X_train))
                y = Y_train.ravel()

                NLML = 0.5 * np.log(det(K)) + \
                    0.5 * y.dot(inv(K).dot(y)) + \
                    0.5 * len(X_train) * np.log(2*np.pi)

                return NLML

            def nll_stable(theta):
                # Numerically stable implementation based on the
                # Cholesky decomposition K = L L^T.
                K = kernel(X_train, X_train, l=theta[0], sigma_f=theta[1]) + \
                    noise**2 * np.eye(len(X_train))
                L = cholesky(K)
                y = Y_train.ravel()

                NLML = np.sum(np.log(np.diagonal(L))) + \
                    0.5 * y.dot(lstsq(L.T, lstsq(L, y, rcond=None)[0], rcond=None)[0]) + \
                    0.5 * len(X_train) * np.log(2*np.pi)

                return NLML
            
            if naive:
                return nll_naive
            else:
                return nll_stable
    

* To compute the sufficient statistics, i.e. the mean and covariance of the posterior predictive distribution, we implement Equations (6) and (7); a more numerically robust variant is sketched after the code.

    
        def posterior_predictive(X_s, X_train, Y_train, l=1.0, sigma_f=1.0, sigma_y=1e-8):
            '''
            Computes the sufficient statistics of
            the GP posterior predictive distribution.
            
            Args:
                X_s: New input locations (n x d).
                X_train: Training locations (m x d).
                Y_train: Training targets (m x 1).
                l: Kernel length parameter.
                sigma_f: Kernel vertical variation parameter.
                sigma_y: Noise parameter.
            
            Returns:
                Posterior mean vector (n x 1) and covariance matrix (n x n).
            '''
            K = kernel(X_train, X_train, l, sigma_f) + sigma_y**2 * np.eye(len(X_train))
            K_s = kernel(X_train, X_s, l, sigma_f)
            K_ss = kernel(X_s, X_s, l, sigma_f) + 1e-8 * np.eye(len(X_s))
            K_inv = inv(K)
            
            # Equation (6)
            mu_s = K_s.T.dot(K_inv).dot(Y_train)
        
            # Equation (7)
            cov_s = K_ss - K_s.T.dot(K_inv).dot(K_s)
            
            return mu_s, cov_s
    
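
The implementation above inverts K explicitly with inv(K), which is simple but can be numerically fragile as the training set grows or the kernel matrix becomes ill-conditioned. A common alternative is to solve the required linear systems through a Cholesky factorization. The sketch below is one such variant (it reuses the kernel function defined earlier and scipy.linalg's cho_factor/cho_solve); it is an illustrative option, not part of the original implementation.

    
        from scipy.linalg import cho_factor, cho_solve
        
        def posterior_predictive_chol(X_s, X_train, Y_train, l=1.0, sigma_f=1.0, sigma_y=1e-8):
            '''
            Same computation as posterior_predictive, but K^{-1} y and
            K^{-1} K_s are obtained via Cholesky solves instead of an
            explicit matrix inverse.
            '''
            K = kernel(X_train, X_train, l, sigma_f) + sigma_y**2 * np.eye(len(X_train))
            K_s = kernel(X_train, X_s, l, sigma_f)
            K_ss = kernel(X_s, X_s, l, sigma_f) + 1e-8 * np.eye(len(X_s))
            
            c, low = cho_factor(K)
            alpha = cho_solve((c, low), Y_train)   # K^{-1} y
            v = cho_solve((c, low), K_s)           # K^{-1} K_s
            
            mu_s = K_s.T.dot(alpha)                # Equation (6)
            cov_s = K_ss - K_s.T.dot(v)            # Equation (7)
            
            return mu_s, cov_s
    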

* An applied example,

For this example, we will generate a set of 10 random numbers between -5 and +5 as our X variable. Then, we will compute the sine of these values to obtain our Y variable.

    
        np.random.seed(123)  # Fix seed for reproducibility
        X_train = np.random.uniform(-5., 5., (10, 1)) # Generate vector of X_train values
        y_train = np.sin(X_train) # Generate vector of y_train values
        
        X_test = np.arange(-5, 5, 0.2).reshape(-1, 1)
        y_test = np.sin(X_test)
    

* Obtain the optimum hyperparameters for the training dataset,

    
        res = minimize(nlml(X_train, y_train, 0.1), [1, 1], 
                bounds=((1e-5, None), (1e-5, None)),
                method='L-BFGS-B')
        l_opt, sigma_f_opt = res.x
        print('Optimum length scale is %f' % l_opt)
        print('Optimum signal variance is %f' % sigma_f_opt)
    

Optimum length scale is 1.844217
Optimum signal variance is 1.060633

* Make predictions for the test dataset using the optimum hyper-parameters,

    
        mu_s, cov_s = posterior_predictive(X_test, X_train, y_train, l=l_opt, sigma_f=sigma_f_opt, sigma_y=0.1)
    

* Plot the results,

    
        def plot_gp(mu, cov, X, X_train=None, Y_train=None, Y_test=None):
            X = X.ravel()
            mu = mu.ravel()
            uncertainty = 1.96 * np.sqrt(np.diag(cov))
            
            plt.plot(X, mu, label='GP prediction')
            plt.plot(X, Y_test, label='Exact result')
            plt.fill_between(X, mu + uncertainty, mu - uncertainty, alpha=0.1, label='Confidence Interval')
            
            
            if X_train is not None:
                plt.plot(X_train, Y_train, 'kx', label='Training data')
            plt.legend()
            plt.xlabel('$x$')
            plt.ylabel('$f(x)$')

        plot_gp(mu_s, cov_s, X_test, X_train=X_train, Y_train=y_train, Y_test=y_test)
        plt.show()  # display the figure when running as a script
    