itsonlyamodel

MLE Gradient Derivation

Numerical Optimization Improvements for Maximum Likelihood Estimation of interpolated Argo Ocean Floats.

Ocean properties such as temperature and salinity are measured globally by the Argo float array, collecting more than 2 million profiles over a decade. It is enough data to create a climatological product. It is an active area of research and many such products exist. But there are concerns about the uncertainty of these methods. When making gridded products, (aka objective mapping, Krieging, etc) The ocean covariance structure is often estimated by ad hoc means. This underlying 'guess' impacts the uncertainty of climatology models and is a problem when deciding which model to use.

A solution proposed by Kuusela Stein's paper here takes data in a certain region and time and models ocean covariance as a Gaussian process. The parameters that describe this model are estimated using the statistician's Swiss army knife: maximum likelihood estimation (MLE). In this article, we try to make improvements to the estimation process, cleaning up the code to perform faster, and with fewer errors.

Problem formulation

An oceanic property residual (measurement - mean) at a certain depth in the ocean indexed by $latitude_i, longitude_i, time_i$ has the estimated value $y_{i,j} = f_i(x_{i,j}, t_{i,j}) + \epsilon_{i,j}$ where $f_i$ is a Gaussian process $f_i \overset{iid}{\sim} GP(0, k(\Delta lat_{j,j}, \Delta lon_{j,j}, \Delta t_{j,j}; \theta)$. Index $j$ represent the Argo measurement data found around the space-time region $W_i$. In this case, $W_i$ is a circle with a 500 km radius centered at point $latitude_i, longitude_i$ and contains data from an entire year. So for example, for a given year, there were $j$ profiles found in $W_i$. $\Delta lat_{j,j}, \Delta lon_{j,j}, \Delta t_{j,j}$ represent the pairwise difference between each of these points. The further a float is to point $i$, the less it influences its error.

An anisotropic exponential space-time covariance function $$ k(\Delta lat_{j,j}, \Delta lon_{j,j}, \Delta t_{j,j}; \theta) = \phi exp(-d(\Delta lat_{j,j}, \Delta lon_{j,j}, \Delta t_{j,j} )) $$ where the GP variance $\phi > 0$ and $$ d= \sqrt{ \left( \frac{\Delta lat_{j,j}}{\theta_{lat}} \right)^2 + \left( \frac{\Delta lon_{j,j}}{\theta_{lon}} \right)^2 + \left( \frac{\Delta t{j,j}}{\theta_{t}} \right)^2} $$

The symbol $\epsilon_{i,j}$ represents the 'Nugget' effect: fine-scale variation not affected by ocean properties. We set this to a Normal distribution with mean zero $\epsilon_{i,j} \overset{iid}{\sim} N(0, \sigma_i)$.

An anisotropic exponential space-time covariance function $$ k(\Delta lat_{j,j}, \Delta lon_{j,j}, \Delta t_{j,j}; \theta) = \phi exp(-d(\Delta lat_{j,j}, \Delta lon_{j,j}, \Delta t_{j,j} )) $$ where the GP variance $\phi > 0$ and $$ d= \sqrt{ \left( \frac{\Delta lat_{j,j}}{\theta_{lat}} \right)^2 + \left( \frac{\Delta lon_{j,j}}{\theta_{lon}} \right)^2 + \left( \frac{\Delta t_{j,j}}{\theta_{t}} \right)^2} $$

The problem is to find the $\theta$ and $\sigma$ for each point $i$ that best fits the covariance structure. Moreover, the method of solving it needs to be fast!

Implementing the Maximum Likelihood Estimation Solver in Python

My group and I cloned the Kuusela-Stein GitHub repostory and tried implementing it on our own, with a few changes. They estimate the covariance function's $\theta$ parameters using MLE. The code provided is in Matlab a language I would rather avoid for numerous reasons such as

  1. Money, both the base and the additional parallel toolbox
  2. Transparency, the algorithms used to solve are hidden
  3. There are better languages. Matlab was my first language, but it's no longer the late 2000's. there are other languages out there that run for free with comparable performance. The language Matlab is owned by MathWorks, who holds sole authority in making changes. This puts the language at a severe disadvantage with open-sourced languages like R, Python, or Julia. The Open source community is walled off from writing a better parallelization toolbox or writing a better IDE.
  4. Licences. I am a CU Boulder employee with access to the Cheyenne and Casper supercomputing facilities. Included in their cluster are a handful of Matlab licenses. Whenever I want to run Matlab, A license needs to be available, otherwise, I need to wait. This minor annoyance gets tricky when scheduling a large computation spanning weeks.

Thus, after running into these issues I find it necessary to translate the MLE portion of code from Matlab to Python.

A part of this effort involves replacing Matlab's unconstrained solver function fminunc with the Scipy's minimize function. Kuusela-Stein used the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm to find the optimal $\theta$. The BFGS method in Python has the option to pass the gradient function of the objective function $l(\theta)$ $$ \nabla f = \begin{bmatrix} \frac{df}{d\theta{1}} \\ \frac{df}{d\theta{2}} \\ \vdots \\ \frac{df}{d\theta{n}} \end{bmatrix} $$

where i are data indexes and j are the parameters $ \theta = [\phi, \theta_{lat}, \theta_{lon}, \theta_t, \sigma]$. $\sigma]$. We lumped $\sigma$ in with the remaining parameters since we estimate that too. Matlab's fminunc estimates the gradient using finite differencing. We will go a step further and calculate it ourselves and should expect much better performance in the MLE solver with the gradient defined.

To do this we need to look at the log-likelihood function

$$ l(\theta) = -\frac{1}{2}\sum\limits_{n=0}^{total\ years} log\ p(y_n | \theta) $$

Each year indexed by $n$ is calculated separately.

$$ l(\theta) = -\frac{1}{2}\sum\limits_{n=0}^{total\ years} l_n(\theta) $$

$$ l_n(\theta) = log (det(K_n(\theta) + \sigma\ I)) + y_n^T(K_n(\theta) + \sigma^2\ I)^{-1}y_n + m_n log(2\pi) $$

$K_n(\theta)$ is the matrix form of $(x_1 , t_1 , x_2 , t_2 ; \theta)$ for every data point pair combination $(x_i,x_j)$ Note that $k(x_1 , t_1 , x_2 , t_2 ; \theta) = k(x_2 , t_2 , x_1 , t_1 ; \theta)$ making this matrix symmetric. We also assume the matrix is positive definite (symmetric and all of its eigenvalues are positive). $K_n(\theta) \in \mathbb{R}^{m_n\ \times \ m_n }$, $m_n$ is the amount of data for the year ($length\ (y_n)$). $I$ is the identity matrix.

$ l_n(\theta) $ Has a gradient

$$ \nabla l(\theta) = -\frac{1}{2}\sum\limits_{n=0}^{total\ years} \left[ \frac{d}{d\theta_j}l_n(\theta)_n\right]_j $$

where $\nabla l(\theta) $ is a vector $\in \ \mathbb{R}^{5}$

$K_n(\theta) + \sigma I$ is symmetric (positive definite). We shall use the following matrix identies (See the Matrix Cookbook.)

  • $\partial (log\ det(A)) = A^{-1}$ See Boyd and Vandenberghe book Appendix A.4 for proof.

  • $A = A^T$ For symmetric matrix $A$

  • $\frac{\partial a^T A^{-1} b}{\partial A} = -A^{-T}ab^TA^{-T}$

  • The chain rule of a function $\frac{\partial f(g(\theta))}{\partial \theta} = \frac{\partial f}{\partial g} \frac{\partial g}{\partial \theta}$

  • Chain rule for functions of matricies has the additional property

$ df(X) = tr(\frac{\partial f}{\partial X} dX) $

  • An $n \times n$ symmetric matrix $A$ has $n$ distinct eigenvalues $\lambda_i$ and its trace is

$tr(A) = \sum\limits_{i}^n a_{ii} = \sum\limits_{i}^n \lambda_i$

Remember that that order matters in matrix multiplication!

Jacobian of $K_n(\theta) + \sigma^2\ I$.

First step is finding each partial derivative of $\kappa_n = K_n(\theta) + \sigma^2 I$, $\nabla \kappa = \frac{d\kappa_{n}(\theta)}{d\theta_j}$. Note that $\kappa_n(\theta)$ is $m \times m$ since we are making a pairwise difference of $x \in \mathbb{R}^{m}$.

$$ \kappa_n(\theta) = \phi exp(-d) + \sigma I $$

$$\frac{\partial \kappa_n}{\partial \phi} = exp(-d)$$ Easy right? Lets do $\sigma$ next.

$$ \frac{\partial \kappa_n}{\partial \sigma} = 2 \sigma I $$ The remaining partial derivatives have the same structure. Starting with $\theta_{lat}$ will need to implement the chain rule.

$$ \frac{\partial \kappa_n}{\partial \theta_{lat}} = \frac{\partial \kappa_n}{\partial d(\theta)} \frac{\partial d(\theta)}{\partial P(\theta) } \frac{\partial P(\theta)}{\partial \theta} $$

Where

$$ d= \sqrt{ P }, P = \left( \frac{x_{lat,1} - x_{lat,2}}{\theta_{lat}} \right)^2 + \left( \frac{x_{lon,1} - x_{lon,2}}{\theta_{lon}} \right)^2 + \left( \frac{x_{t,1} - x_{t,2}}{\theta_{t}} \right)^2 $$

Which have derivatives

$$ \frac{\partial \kappa_n}{\partial d} = -\frac{\partial d}{\partial P} \phi exp(-d), \\ \frac{\partial d}{\partial P} = \frac{1}{2}P^{-\frac{1}{2}}, \\ \frac{\partial P(\theta_{lat})}{\partial \theta_{lat}} = -2\frac{(x_{lat,1} - x_{lat,2})^2}{\theta_{lat}^{3}} $$

Substitution yields

$$ \frac{\partial \kappa_n}{\partial \theta_{lat}} = \phi exp(-d)\frac{(x_{lat,1} - x_{lat,2})^2}{\theta_{lat}^{3}} \left( \left( \frac{x_{lat,1} - x_{lat,2}}{\theta_{lat}} \right)^2 + \left( \frac{x_{lon,1} - x_{lon,2}}{\theta_{lon}} \right)^2 + \left( \frac{x_{t,1} - x_{t,2}}{\theta_{t}} \right)^2 \right)^{-\frac{1}{2}} $$

The remaining partial derivatves are found in an identical fashion as $\frac{\partial \kappa_n}{\partial \theta_{lat}}$. Just replace $\theta_{lat}^3$ with $\theta_{lon}^3$ and $\theta_{t}^3$ for $\frac{\partial \kappa_n}{\partial \theta_{lon}}$ and $\frac{\partial \kappa_n}{\partial \theta_{t}}$ respectively.

Gradient of the negative log-likelihood function

We use the Jacobian of $\kappa_i(\theta)$ to compute the gradient of $l(\theta)$. First we distribute the partial derivative across $l_n(\theta)$ and solve each component. $$ \nabla l(\theta) = -\frac{1}{2}\sum\limits_{n=0}^{total\ years} \left( \frac{\partial}{\partial \theta} log( det(\kappa_n(\theta)) + \frac{\partial}{\partial \theta} y_n^T(\kappa_n(\theta))^{-1}y_n(\theta) + \frac{\partial}{\partial \theta}m_n log(2\pi) \right) $$

The first term borrows directly borrows from the identity $ \partial (log\ det(A)) = A^{-1}$ and the chain rule. $$ \frac{\partial}{\partial \theta} log( det(\kappa_n(\theta)) = tr\left( \frac{\partial}{\partial \kappa_n(\theta)}log( det(\kappa_n(\theta) )\frac{ \partial \kappa_n(\theta) } {\partial \theta} \right) = tr\left( \kappa_n(\theta)^{-1}\frac{ \partial \kappa_n(\theta) }{\partial \theta} \right) $$

The second term uses the identity $\frac{\partial a^T A^{-1} b}{\partial A} = -A^{-T}ab^TA^{-T}$ combined with the chain rule. Again, $\kappa_n(\theta)$ is symmetric.

$$ \frac{\partial}{\partial \theta} y_n^T(\kappa_n(\theta))^{-1}y_n(\theta) = tr\left(\kappa_n(\theta)^{-1} y_ny_n^T \kappa_n(\theta)^{-1} \frac{\partial \kappa_n(\theta) }{\partial \theta} \right) $$

The third term has no $\theta$. Its contribution to the gradient is zero. Gathering everything, the gradient is expressed as

$$ \nabla l(\theta) = -\frac{1}{2}\sum\limits_{n=0}^{total\ years} tr\left( \left(\kappa_n(\theta)^{-1} - \kappa_n(\theta)^{-T} y_ny_n^T \kappa_n(\theta)^{-T}\right)\frac{ \partial \kappa_n(\theta) }{\partial \theta} \right) $$

Implementing MLE solver in Python

We take advantage of the fact that a system of equations $Ax=b$ has the solution $x = A^{-1}b$. Calculating the inverse of $A$ is numerically expensive. We avoid this by taking advantage of the Scipy library's linalg.solve(A,b) function, whos documentation is found here. In our example, we solve for $x$ by inputting $A$ and $b$ into solve, which efficiently returns $x$ via LU decomposition. Cholesky or QR Decomposition or SVD may also work.

If we stare at our Jacobian long enough, we see this pattern again and again. linalg.solve is our hammer; our Jacobian, a box of nails.

$$ \kappa_n(\theta)^{-1}\frac{ \partial \kappa_n(\theta) }{\partial \theta} = solve(\kappa_n(\theta), \frac{\partial \kappa_n(\theta) }{\partial \theta}) $$

$$ \kappa_n(\theta)^{-T} y_ny_n^T \kappa_n(\theta)^{-T}\frac{\partial \kappa_n(\theta) }{\partial \theta} = solve(\kappa_n(\theta)^{T}, y_ny_n^T) \cdot solve(\kappa_n(\theta), \frac{ \partial \kappa_n(\theta) }{\partial \theta}) $$

We first are going to create the classes needed to compute $l(\theta)$.

In [1]:
from scipy.linalg import solve, eig, eigh, inv, det
from scipy.optimize import minimize, least_squares
import numpy as np
import time
import pickle
import pdb

class Agg(object):
    def __init__(self, lat, lon, res, julDay):
        self.lat = lat
        self.lon = lon
        self.res = res
        self.julDay = julDay
        self.dataLabels = ['lat', 'lon', 'res', 'julDay']

    def arr(self):
        return [self.lat, self.lon, self.res, self.julDay]
    
    def sort_by_time(self):
        sortIdx = np.argsort(self.julDay) #ascending
        self.lat = np.take_along_axis(self.lat, sortIdx, axis=0)
        self.lon = np.take_along_axis(self.lon, sortIdx, axis=0)
        self.julDay = np.take_along_axis(self.julDay, sortIdx, axis=0)
        self.res = np.take_along_axis(self.res, sortIdx, axis=0)
        
    def validate_data(self):
        m = len(self.lat)
        lengthsEqual = [ len(x) == m for x in self.arr()]
        if m <= 0:
            logging.info('empty aggr')
            return False 
        if not all(lengthsEqual):
            logging.warning('data arrays must be of equal length.')
            return False
        return True
        
class Data(Agg):
    def __init__(self, lat, lon, res, julDay):
        super(Data,self).__init__( lat, lon, res, julDay )
        self.sort_by_time()
        self.dt = self._pairwise_difference(self.julDay)
        self.m = len(self.lat)
        self.dlat = self._pairwise_difference(self.lat)
        self.dlon = self._pairwise_difference(self.lon)
        self._convert_dlng_dlat_to_Mm()
        self.yyT = self._pairwise_product(self.res)
        
    def _check_lengths(self):
        m = len(self.lat)
        lengthsEqual = [ len(x) == m for x in self.dataLabels]
        assert all(lengthsEqual), "data arrays must be of equal length."

    @staticmethod
    def _pairwise_difference(arr):
        '''calculates difference between each point'''
        cp = np.array(np.meshgrid(arr, arr)).T
        diff = cp[:,:,0].flatten() - cp[:,:,1].flatten()
        return diff.reshape(len(arr), len(arr))
    
    @staticmethod
    def _pairwise_product(arr):
        '''calculates product between each point'''
        cp = np.array(np.meshgrid(arr, arr)).T
        pwp = cp[:,:,0].flatten() * cp[:,:,1].flatten()
        return pwp.reshape(len(arr), len(arr))

    def _convert_dlng_dlat_to_Mm(self):
        self.dlat = np.deg2rad(self.dlat) * RAD_EARTH
        phi = np.deg2rad(np.array(np.meshgrid(self.lat, self.lat)).T)[:, :, 0]
        self.dLon = np.cos(phi) * np.abs(self.dlon) * RAD_EARTH
    
    def reduce_by_time(self, dtLimit=90):
        '''reduce profiles whose are more than dtLmiit days apart'''
        tooFar = np.abs(self.dt) > dtLimit
        self.dt[tooFar] = 0
        self.dlat[tooFar] = 0
        self.dlon[tooFar] = 0
        self.yyT[tooFar] = 0
        self.tooFar = tooFar
    
class Theta(object):
    def __init__(self, phi, lat, lon, t, sig):
        self.theta_names = ['phi', 'lat', 'lon', 't', 'sig']
        self.phi = np.abs(phi)
        self.lat = np.abs(lat)
        self.lon = np.abs(lon)
        self.t = np.abs(t)
        self.sig = np.abs(sig)
        
    def __str__(self):
        return f'phi: {self.phi}, lat:{self.lat}, lon: {self.lon}, t:{self.t}, sig:{self.sig}'
        
    def arr(self):
        return [self.phi, self.lat, self.lon, self.t, self.sig]
    
    def check_values(self):
        values = self.arr()
        tooLarge = [np.abs(x) > LARGEVAL for x in values]
        tooSmall = [np.abs(x) < SMALLVAL for x in values]
        if any(tooLarge):
            raise Exception(f'thetas too large {self.__str__()}')
        if any(tooSmall):
            raise Exception(f'thetas to small {self.__str__()}')
        

sq_div = lambda arr, theta: np.power( np.divide(arr, theta), 2 )
kappa_fun = lambda theta, data: space_time_covariance_exp(theta, data).reshape(data.m, data.m) +  np.multiply(theta.sig ** 2, np.eye(data.m) )
p_fun = lambda theta, data: np.add.reduce( [sq_div(data.dlat, theta.lat), sq_div(data.dlon, theta.lon), sq_div(data.dt, theta.t)] )

def space_time_covariance_exp(theta, data):
    dist = np.sqrt( p_fun(theta, data ) )
    return theta.phi * np.exp(-1 * dist)

def neg_log_liklihood(theta, data):
    kappa = kappa_fun(theta, data)
    eigVal, eigVectors = eigh(kappa)
    firstTerm = np.sum(np.log(eigVal))
    try:
        inversePortion = np.dot(data.res, solve(kappa, data.res, assume_a='pos'))
    except:
        pdb.set_trace()
    return firstTerm + inversePortion + np.log(2*np.pi)*data.m
    

Let's do some testing on some toy data

In [2]:
np.random.seed(seed=2)
def make_data(mLen=40):
    mLen = 20
    lat = np.random.uniform(-10,10, mLen)
    lon = np.random.uniform(-10,10, mLen)
    res = np.random.normal(0, 2.5, mLen)
    julDay = np.random.uniform(0, 365, mLen)
    agg = Agg(lat, lon, res, julDay)
    return agg

def make_agg_years():
    aggYears = {}
    for year in range(2007, 2018+1):
        agg = make_data(100)
        data = Data(*agg.arr())
        aggYears[str(year)] = data
    return aggYears
    
LARGEVAL = 10**8
SMALLVAL = np.finfo(np.float32).eps
RAD_EARTH =  6.371 #mega meters
thetaInitGuess = [np.log(x) for x in [400**2, 5, 5, 5, 100]]
theta = Theta(*thetaInitGuess)
data = Data(*make_data().arr())
aggYears = make_agg_years()

Next, we write the jacobian functions.

note that $ l_n(\theta) $ Has the gradient of $l_i(\theta)$ multiplied by $\theta$.

$$ \nabla l(\theta) = \frac{1}{2}\sum\limits_{n=0}^{total\ years} \left[ \frac{d}{d\theta_j}l_n(\theta)_n\right]_j \theta $$

We also are setting the $\theta$ values to $\zeta = exp(\theta)$ to make this an unconstrained optimization problem (inputting, negative values will cause errors). This will have a slight modification on the negative log liklihood gradient.

$$ \nabla l(\zeta) = \nabla l(exp(\theta)) \odot \theta $$

Where $\odot$ is the pairwise product operator.

In [8]:
def sum_nll(thetaArr, filtAggs):
    theta = Theta(*thetaArr)
    years = filtAggs.keys()
    nllSum = 0
    for iYear in years:
        data = Data(*filtAggs[iYear].arr())
        if data.validate_data():
            nllSum += neg_log_liklihood(theta, data)
    return .5 * nllSum

def grad_sum_nll(thetaArr, filtAggs):
    theta = Theta(*thetaArr)
    years = filtAggs.keys()
    gradSum = np.zeros(5)
    for iYear in years:
        data = Data(*filtAggs[iYear].arr())
        if data.validate_data():
            gradSum = np.add(gradSum, grad_nll(theta, data))
    return .5 * gradSum

def grad_nll(theta, data):
    gradKappa = grad_kappa(theta, data)
    kappa = kappa_fun(theta, data)
    gradNll = []
    for idx in range(gradKappa.shape[0]):
        firstTerm = solve(kappa, gradKappa[idx], assume_a='pos')
        secondTerm = solve(kappa, data.yyT, assume_a='pos')
        gradNll.append(np.trace(firstTerm - np.dot(secondTerm, firstTerm)))
    return np.array(gradNll)

partial_p_fun = lambda arr, theta: np.divide( arr ** 2, theta ** 3)

def grad_kappa(theta, data):
    spc = space_time_covariance_exp(theta, data)
    grad_phi = spc / theta.phi
    grad_sig = np.multiply(2 * theta.sig, np.eye(data.m))
    P = p_fun(theta, data)
    P[P == 0] = np.inf # divide by zero correction
    grad_prefix = np.multiply(spc, np.power(P, -.5))
    grad_lat = np.multiply(grad_prefix, partial_p_fun(data.dlat, theta.lat))
    grad_lon = np.multiply(grad_prefix, partial_p_fun(data.dlon, theta.lon))
    grad_t   = np.multiply(grad_prefix, partial_p_fun(data.dt, theta.t))
    gradList = [grad_phi, grad_lat, grad_lon, grad_t, grad_sig]
    gradKappa = np.array(gradList)
    return gradKappa

minimize on $nll$

We wish to maximize the negative log-likelihood function (or minimize the negative log-likelihood function). We can create a timeit function decorator that will come in handy for comparing different methods.

In [13]:
def timeit(method):
    def timed(*args, **kw):
        ts = time.time()
        result = method(*args, **kw)
        te = time.time()
        print('%r  %2.2f ms' % \
              (method.__name__, (te - ts) * 1000))
        return result
    return timed

obj_fun = lambda th, dat: sum_nll([np.exp(x) for x in th], dat)
def obj_grad(th, dat): 
    eth = [np.exp(x) for x in th]
    return np.multiply(grad_sum_nll(eth, dat), np.array(eth))

options = {'gtol': 10**-4, 'disp': True, 'maxiter': 1000 }

@timeit
def min_obj_no_grad():
    return minimize(obj_fun, thetaInitGuess, args=aggYears, method='BFGS', tol=10**-4, options=options)

@timeit
def min_obj_grad():
    return minimize(obj_fun, thetaInitGuess, args=aggYears, method='BFGS', tol=10**-4, jac=obj_grad, options=options)

xOpt = min_obj_no_grad()
xOptFast = min_obj_grad()
Optimization terminated successfully.
         Current function value: 564.508521
         Iterations: 26
         Function evaluations: 192
         Gradient evaluations: 32
'min_obj_no_grad'  2872.34 ms
Optimization terminated successfully.
         Current function value: 564.508521
         Iterations: 26
         Function evaluations: 32
         Gradient evaluations: 32
'min_obj_grad'  1602.62 ms
In [14]:
xOpt
Out[14]:
      fun: 564.5085208770715
 hess_inv: array([[ 1.82146738, -1.2750836 , -0.848378  , -1.10501658, -0.23291553],
       [-1.2750836 ,  4.78727323,  0.39755456, -0.49214321,  0.15796763],
       [-0.848378  ,  0.39755456,  1.2029338 ,  0.65909107,  0.11245871],
       [-1.10501658, -0.49214321,  0.65909107,  2.52415383,  0.15545682],
       [-0.23291553,  0.15796763,  0.11245871,  0.15545682,  0.03320347]])
      jac: array([0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 7.62939453e-06,
       0.00000000e+00])
  message: 'Optimization terminated successfully.'
     nfev: 192
      nit: 26
     njev: 32
   status: 0
  success: True
        x: array([ 0.32005396, -0.18915801,  1.39279416,  3.83529965,  0.81783682])
In [15]:
xOptFast
Out[15]:
      fun: 564.5085208770255
 hess_inv: array([[ 1.8675055 , -1.13645913, -0.91386524, -1.30360793, -0.23922858],
       [-1.13645913,  4.54822494,  0.39494673, -0.43980675,  0.14324829],
       [-0.91386524,  0.39494673,  1.21723963,  0.72231417,  0.11916036],
       [-1.30360793, -0.43980675,  0.72231417,  2.70242171,  0.17678517],
       [-0.23922858,  0.14324829,  0.11916036,  0.17678517,  0.0339842 ]])
      jac: array([-6.92383911e-06, -2.25649866e-06, -3.73310065e-06, -1.33861259e-06,
       -3.55709515e-05])
  message: 'Optimization terminated successfully.'
     nfev: 32
      nit: 26
     njev: 32
   status: 0
  success: True
        x: array([ 0.32006737, -0.18915984,  1.39278024,  3.83527794,  0.81783489])
In [16]:
assert xOpt.fun >= xOptFast.fun, 'check objective function'
assert np.isclose(xOpt.x, xOptFast.x, rtol=10**-3).all(), 'check objective function gradient'

Conclusion

Not only is it faster, but calculating the gradient negative log-likelihood by hand can lead to accurate solutions to estimating $\theta$. Generating MLE terms for the global oceans will require that these function evaluations be performed millions of times for data larger than our toy set, potentially saving days of running these on a supercomputer. The headache of calculating the gradient is worth it. One takeaway idea: Would we get better performance from deriving the Hessian? If possible, then other optimization schemes can be used (Newton-CG, dogleg, trust-ncg, trust-krylov, trust-exac, etc.) this paper shows promise by approximating the Hessian with the Fisher information matrix.

Thank you all for reading. I hope you find a use for this very interesting method. Please email me at tyler.tucker@colorado.edu if you have any questions.

Extra pain: Verify the gradient

We verify the derived gradients by estimating the partial derivatives of our function by taking a forward difference approximation. If the approximate and explicit are off by 4 significant figures, we raise an exception.

In [18]:
from scipy.optimize import check_grad, approx_fprime
from scipy.linalg import inv, det

def obj_fun(thetaArr, data):
    th = Theta(*thetaArr)
    return neg_log_liklihood(th, data)
    
def grad_fun(thetaArr, data):
    th = Theta(*thetaArr)
    return grad_nll(th, data)

data = Data(*make_data().arr())
thetaInitGuess = [np.log(x) for x in [400**2, 5, 5, 5, 100]]
theta=Theta(*thetaInitGuess)
eps = 10**-6
sig = 4
check_grad(obj_fun, grad_fun, *[thetaInitGuess, data], epsilon=eps)
Out[18]:
6.915344985895695e-08
In [19]:
check_grad(sum_nll, grad_sum_nll, *[thetaInitGuess, aggYears], epsilon=eps)
Out[19]:
2.3375569427921718e-07

check_grad returns the square root of the sum of squares (i.e., the 2-norm) of the difference between grad(x0, *args) and the finite difference approximation of grad using func at the points x0.

It should be close to zero.

In [20]:
assert all(grad_nll(theta, data) == grad_fun(thetaInitGuess, data))
assert obj_fun(thetaInitGuess, data) == neg_log_liklihood(theta, data)
In [21]:
grad_fun(thetaInitGuess, data)
Out[21]:
array([ 5.35307425e-01, -1.11927386e-03, -1.29740961e-02,  1.55594289e-02,
        4.93630069e+00])

check $\frac{dl_n}{d\sigma}$

We check the gradient term and the Jacobian of $\kappa_n(\theta)$

In [40]:
kappa = kappa_fun(theta, data)
gradKappa = grad_kappa(theta, data)

spc = space_time_covariance_exp(theta, data).reshape(data.m, data.m)
j_K_sigma = np.multiply(2 * theta.sig, np.eye(data.m))
firstTerm = solve( kappa.reshape(data.m, data.m), gradKappa[4], assume_a='pos')
secondTerm = solve( kappa.reshape(data.m, data.m), data.yyT.reshape(data.m, data.m), assume_a='pos')
parSigma = np.trace(firstTerm - np.dot(secondTerm, firstTerm))
assert grad_nll(theta, data)[4] == parSigma, 'check sigma grad'
assert all(j_K_sigma.flatten() == grad_kappa(theta, data)[4].flatten()), 'check jacobian of Kappa for sigma '
assert np.isclose(np.trace(gradKappa[4]), data.m * 2 * theta.sig), 'check jacobian of Kappa for sigma '
In [41]:
t1 = Theta(*thetaInitGuess)
t2 = Theta(*thetaInitGuess)
t2.sig = t2.sig + eps
estParSigma = ( neg_log_liklihood(t2, data) - neg_log_liklihood(t1, data) ) / eps
print(estParSigma)
print(parSigma)
np.testing.assert_approx_equal(estParSigma,parSigma,significant=sig, err_msg='check partial derivative of phi',verbose=True)
4.936300626923185
4.936300690544839

We can see there is a negligable difference. Next we check the remaining partial derivatives.

check $\frac{dl_n}{d\phi}$

In [42]:
spc = space_time_covariance_exp(theta, data).reshape(data.m, data.m)
j_K_phi = spc / theta.phi
firstTerm = solve( kappa.reshape(data.m, data.m), gradKappa[0], assume_a='pos')
secondTerm = solve( kappa.reshape(data.m, data.m), data.yyT.reshape(data.m, data.m), assume_a='pos')
parPhi = np.trace(firstTerm - np.dot(secondTerm, firstTerm))

kappa = kappa_fun(theta, data)
assert grad_nll(theta, data)[0] == parPhi
In [43]:
assert all(j_K_phi.flatten() == grad_kappa(theta, data)[0].flatten())
In [44]:
t1 = Theta(*thetaInitGuess)
t2 = Theta(*thetaInitGuess)
t2.phi = t2.phi + eps
estParPhi = ( neg_log_liklihood(t2, data) - neg_log_liklihood(t1, data) ) / eps
np.testing.assert_approx_equal(estParPhi,parPhi,significant=sig,err_msg='check partial derivative of phi',verbose=True)

check $\frac{dl_n}{d\theta_{lat}}$

In [45]:
t1 = Theta(*thetaInitGuess)
t2 = Theta(*thetaInitGuess)
t2.lat = t2.lat + eps
estParLat = ( neg_log_liklihood(t2, data) - neg_log_liklihood(t1, data) ) / eps

firstTerm = solve( kappa.reshape(data.m, data.m), gradKappa[1], assume_a='sym')
secondTerm = solve( kappa.reshape(data.m, data.m), data.yyT.reshape(data.m, data.m), assume_a='sym')
parLat = np.trace(firstTerm - np.dot(secondTerm, firstTerm))

np.testing.assert_approx_equal(estParLat,parLat,significant=sig,err_msg='check partial derivative of lat',verbose=True)
In [46]:
spc = space_time_covariance_exp(theta, data).reshape(data.m, data.m)
P = p_fun(theta, data).reshape(data.m, data.m)
P[P == 0] = np.inf # divide by zero correction
grad_prefix = np.multiply(spc, np.power(P, -.5))
grad_lat = np.multiply(grad_prefix, partial_p_fun(data.dlat.reshape(data.m, data.m), theta.lat))
grad_lon = np.multiply(grad_prefix, partial_p_fun(data.dlon.reshape(data.m, data.m), theta.lon))
grad_t = np.multiply(grad_prefix, partial_p_fun(data.dt.reshape(data.m, data.m), theta.t))
In [47]:
assert all(grad_lat.flatten() == grad_kappa(theta, data)[1].flatten())
In [48]:
firstTerm = np.dot( inv(kappa.reshape(data.m, data.m)), grad_lat)
secondTerm = np.dot( inv(kappa.reshape(data.m, data.m)), data.yyT.reshape(data.m, data.m) )
parLat = np.trace(firstTerm - np.dot(secondTerm, firstTerm))

np.testing.assert_approx_equal(estParLat,parLat,significant=sig,err_msg='check partial derivative of lat',verbose=True)

check $\frac{dl_n}{d\theta_{lon}}$

In [49]:
t1 = Theta(*thetaInitGuess)
t2 = Theta(*thetaInitGuess)
t2.lon = t2.lon + eps
estParLon = ( neg_log_liklihood(t2, data) - neg_log_liklihood(t1, data) ) / eps
firstTerm = solve( kappa.reshape(data.m, data.m), grad_lon, assume_a='sym', overwrite_a=True, overwrite_b=True)
secondTerm = solve( kappa.reshape(data.m, data.m), data.yyT.reshape(data.m, data.m), assume_a='sym', overwrite_a=True, overwrite_b=True)
parLon = np.trace(firstTerm - np.dot(secondTerm, firstTerm))

np.testing.assert_approx_equal(estParLon,parLon,significant=sig,err_msg='check partial derivative of lon',verbose=True)
In [50]:
t1 = Theta(*thetaInitGuess)
t2 = Theta(*thetaInitGuess)
t2.lon = t2.lon + eps
estParLon = ( neg_log_liklihood(t2, data) - neg_log_liklihood(t1, data) ) / eps

firstTerm = solve( kappa.reshape(data.m, data.m), grad_lon, assume_a='pos')
secondTerm = solve( kappa.reshape(data.m, data.m), data.yyT.reshape(data.m, data.m), assume_a='pos')
parLon = np.trace(firstTerm - np.dot(secondTerm, firstTerm))

np.testing.assert_approx_equal(estParLon,parLon,significant=sig,err_msg='check partial derivative of lon', verbose=True)
In [51]:
spc = space_time_covariance_exp(theta, data).reshape(data.m, data.m)
P = p_fun(theta, data).reshape(data.m, data.m)
P[P == 0] = np.inf # divide by zero correction
grad_prefix = np.multiply(spc, np.power(P, -.5))
grad_lon = np.multiply(grad_prefix, partial_p_fun(data.dlon.reshape(data.m, data.m), theta.lon))
In [52]:
assert all(grad_lon.flatten() == grad_kappa(theta, data)[2].flatten())
In [53]:
spc = space_time_covariance_exp(theta, data).reshape(data.m, data.m)
firstTerm = np.dot( inv(kappa.reshape(data.m, data.m)), gradKappa[2])
secondTerm = np.dot( inv(kappa.reshape(data.m, data.m)), data.yyT.reshape(data.m, data.m) )
parLon2 = np.trace(firstTerm - np.dot(secondTerm, firstTerm))
In [54]:
firstTerm = solve( kappa.reshape(data.m, data.m), grad_lon)
secondTerm = solve( kappa.reshape(data.m, data.m), data.yyT.reshape(data.m, data.m) )
parLon = np.trace(firstTerm - np.dot(secondTerm, firstTerm))

np.testing.assert_approx_equal(estParLon,parLon,significant=sig,err_msg='check partial derivative of lat',verbose=True)

check $\frac{dl_n}{d\theta_{t}}$

In [55]:
t1 = Theta(*thetaInitGuess)
t2 = Theta(*thetaInitGuess)
t2.t = t2.t + eps
estParT = ( neg_log_liklihood(t2, data) - neg_log_liklihood(t1, data) ) / eps

firstTerm = np.dot( inv(kappa.reshape(data.m, data.m)), gradKappa[3])
secondTerm = np.dot( inv(kappa.reshape(data.m, data.m)), data.yyT.reshape(data.m, data.m) )
parT = np.trace(firstTerm - np.dot(secondTerm, firstTerm))

np.testing.assert_approx_equal(estParT,parT,significant=sig,err_msg='check partial derivative of t',verbose=True)