# MLE Space-Time Numerical Optimization II: Hessian¶

In my last post, we saw the speedup gained by providing SciPy's minimize solver with an analytical gradient. We will take this a step further and see what happens when we also provide the Hessian. The Hessian is the matrix of second partial derivatives of a function $f(x)$; in our case, the log-likelihood of a Gaussian process with an anisotropic covariance function.

An anisotropic exponential space-time covariance function for a given year $n$, $$k_n(x_1 , t_1 , x_2 , t_2 ; \theta) = \phi \exp(-d(x_1 , t_1 , x_2 , t_2 ))$$ where the GP variance is $\phi > 0$ and $$d= \sqrt{ P }, \qquad P = \left( \frac{\Delta lat}{\theta_{lat}} \right)^2 + \left( \frac{\Delta lon}{\theta_{lon}} \right)^2 + \left( \frac{\Delta t}{\theta_{t}} \right)^2,$$

has, up to a factor of $\frac{1}{2}$, the negative log-likelihood

$$l_n(\theta) = \log\left(\det\left(K_n(\theta) + \sigma^2 I\right)\right) + y_n^T\left(K_n(\theta) + \sigma^2 I\right)^{-1}y_n + m_n \log(2\pi)$$

where $m_n$ is the number of observations in year $n$.

The total negative log-likelihood over all years is the sum

$$l(\theta) = \frac{1}{2}\sum\limits_{n=0}^{total\ years} l_n(\theta)$$

Where $\kappa_n = K_n(\theta) + \sigma^2 I$
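
To make the notation concrete, here is a minimal NumPy sketch that builds $\kappa_n$ for one toy year of data and evaluates $l_n(\theta)$. It is independent of the classes used below; the function and variable names are illustrative, not the ones from `local_mle_space_time`.

import numpy as np

def kappa_matrix(lat, lon, t, phi, th_lat, th_lon, th_t, sigma):
    """Anisotropic exponential space-time covariance plus noise: K_n + sigma^2 I."""
    dlat = lat[:, None] - lat[None, :]
    dlon = lon[:, None] - lon[None, :]
    dt = t[:, None] - t[None, :]
    P = (dlat / th_lat) ** 2 + (dlon / th_lon) ** 2 + (dt / th_t) ** 2
    return phi * np.exp(-np.sqrt(P)) + sigma ** 2 * np.eye(len(lat))

def nll_year(y, kappa):
    """l_n(theta) = log det(kappa) + y^T kappa^{-1} y + m_n log(2 pi)."""
    _, logdet = np.linalg.slogdet(kappa)
    return logdet + y @ np.linalg.solve(kappa, y) + len(y) * np.log(2 * np.pi)

rng = np.random.default_rng(0)
m = 8
lat, lon = rng.uniform(-10, 10, m), rng.uniform(-10, 10, m)
t, y = rng.uniform(0, 365, m), rng.normal(0, 2.5, m)
print(nll_year(y, kappa_matrix(lat, lon, t, 400**2, 5., 5., 5., 100.)))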

In our last entry, we found the gradient to be

$$\nabla l(\theta) = \frac{1}{2}\sum\limits_{n=0}^{total\ years} tr\left( \left(\kappa_n(\theta)^{-1} - \kappa_n(\theta)^{-1} y_ny_n^T \kappa_n(\theta)^{-1}\right)\frac{ \partial \kappa_n(\theta) }{\partial \theta} \right)$$

We are interested in finding the $\theta_{opt}$ that minimizes $l(\theta)$, and we found that providing the analytical gradient $\nabla l(\theta)$ greatly speeds up the BFGS minimization. But there are algorithms that converge in even fewer steps if you also provide the Hessian. For a single year,

$$\nabla^2 l_n(\theta) = tr\left( \kappa_n^{-1}\, \partial^2\kappa_n\, \kappa_n^{-1}\left(\kappa_n - y_ny_n^T\right)\right) + 2\,tr\left( \kappa_n^{-1}\, \partial \kappa_n\, \kappa_n^{-1}\,\partial\kappa_n\, \kappa_n^{-1}\left(y_ny_n^T-\frac{1}{2}\kappa_n\right)\right)$$
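
For reference, this expression follows from differentiating the gradient with the standard identities $\partial \log\det\kappa = tr\left(\kappa^{-1}\partial\kappa\right)$ and $\partial\left(\kappa^{-1}\right) = -\kappa^{-1}\left(\partial\kappa\right)\kappa^{-1}$ (a sketch, suppressing the year subscript on $y$ and writing $\partial_i$ for $\partial/\partial\theta_i$):

$$\begin{aligned} \partial_i l &= tr\left(\kappa^{-1}\partial_i\kappa\right) - y^T\kappa^{-1}\partial_i\kappa\,\kappa^{-1}y \\ \partial_j\partial_i l &= tr\left(\kappa^{-1}\partial^2_{ij}\kappa\right) - tr\left(\kappa^{-1}\partial_j\kappa\,\kappa^{-1}\partial_i\kappa\right) - y^T\kappa^{-1}\partial^2_{ij}\kappa\,\kappa^{-1}y + 2\,y^T\kappa^{-1}\partial_j\kappa\,\kappa^{-1}\partial_i\kappa\,\kappa^{-1}y \\ &= tr\left(\kappa^{-1}\partial^2_{ij}\kappa\,\kappa^{-1}\left(\kappa - yy^T\right)\right) + 2\,tr\left(\kappa^{-1}\partial_j\kappa\,\kappa^{-1}\partial_i\kappa\,\kappa^{-1}\left(yy^T - \tfrac{1}{2}\kappa\right)\right) \end{aligned}$$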

In this notebook, we shall compute the Hessian by hand and use a few different methods that find the optimal values for $\theta$ in far fewer steps. Also, recall that we are working with a constrained optimization problem ($\theta > 0$), but unconstrained optimization is possible with the transformation $\theta = \exp(\zeta)$, which gives

$$\nabla_{\zeta}\, l = \nabla_{\theta}\, l \odot \theta$$

$$\nabla^2_{\zeta}\, l = \nabla^2_{\theta}\, l \odot \theta\theta^T$$

where $\odot$ denotes the elementwise (Hadamard) product, taken against the outer product $\theta\theta^T$ in the Hessian case. We shall see that this step is not needed for the constrained optimization algorithms.
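
To see where these factors come from, write the transformation elementwise and apply the chain rule, with $\theta = \exp(\zeta)$ so that $\partial\theta_i/\partial\zeta_i = \theta_i$:

$$\frac{\partial l}{\partial \zeta_i} = \frac{\partial l}{\partial \theta_i}\,\theta_i, \qquad \frac{\partial^2 l}{\partial \zeta_i\,\partial \zeta_j} = \frac{\partial^2 l}{\partial \theta_i\,\partial \theta_j}\,\theta_i\theta_j + \delta_{ij}\,\frac{\partial l}{\partial \theta_i}\,\theta_i$$

The Hessian formula above keeps only the first (outer-product) term; the extra diagonal term vanishes at a stationary point, where $\nabla_{\theta}\, l = 0$, and is dropped in the implementation below.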

For

$$\theta = \begin{bmatrix} \phi \\ \theta_{lat} \\ \theta_{lon} \\ \theta_{t} \\ \sigma \end{bmatrix}$$

The Hessian $\nabla^2 l(\theta)$ is a $5 \times 5$ matrix, as is $\nabla^2 l_n(\theta)$. However, $\nabla^2 \kappa_n$ is a rank-4 tensor, $\nabla^2 \kappa_n \in \mathbb{R}^{5 \times 5 \times m_n\ \times \ m_n }$, revealing a daunting Hessian

$$\Large \nabla^2\kappa = \begin{bmatrix} \frac{\partial^2 \kappa}{\partial\phi^2} & \frac{\partial^2 \kappa}{\partial\phi\,\partial \theta_{lat}} & \frac{\partial^2 \kappa}{\partial\phi\,\partial \theta_{lon}} & \frac{\partial^2 \kappa}{\partial\phi\,\partial \theta_{t}} & \frac{\partial^2 \kappa}{\partial\phi\,\partial \sigma} \\ \frac{\partial^2 \kappa}{\partial\theta_{lat}\,\partial\phi} & \frac{\partial^2 \kappa}{\partial\theta_{lat}^2} & \frac{\partial^2 \kappa}{\partial\theta_{lat}\,\partial \theta_{lon}} & \frac{\partial^2 \kappa}{\partial\theta_{lat}\,\partial \theta_{t}} & \frac{\partial^2 \kappa}{\partial\theta_{lat}\,\partial \sigma} \\ \frac{\partial^2 \kappa}{\partial\theta_{lon}\,\partial\phi} & \frac{\partial^2 \kappa}{\partial\theta_{lon}\,\partial \theta_{lat}} & \frac{\partial^2 \kappa}{\partial\theta_{lon}^2} & \frac{\partial^2 \kappa}{\partial\theta_{lon}\,\partial \theta_{t}} & \frac{\partial^2 \kappa}{\partial\theta_{lon}\,\partial \sigma} \\ \frac{\partial^2 \kappa}{\partial \theta_{t}\,\partial\phi} & \frac{\partial^2 \kappa}{\partial\theta_{t}\,\partial \theta_{lat}} & \frac{\partial^2 \kappa}{\partial\theta_{t}\,\partial \theta_{lon}} & \frac{\partial^2 \kappa}{\partial\theta_{t}^2} & \frac{\partial^2 \kappa}{\partial\theta_{t}\,\partial \sigma} \\ \frac{\partial^2 \kappa}{\partial \sigma\, \partial \phi} & \frac{\partial^2 \kappa}{\partial \sigma\, \partial \theta_{lat}} & \frac{\partial^2 \kappa}{\partial \sigma\, \partial \theta_{lon}} & \frac{\partial^2 \kappa}{\partial \sigma\, \partial \theta_{t}} & \frac{\partial^2 \kappa}{\partial \sigma^2} \end{bmatrix}$$

Wow. Fortunately, we shall see that this matrix is symmetric and many of its terms are zero. The nonzero terms are

$$\Large \frac{\partial^2 \kappa}{\partial\theta_{lat}^2} = \frac{\phi\,\Delta lat^2}{\theta_{lat}^4} \exp\left(-\sqrt{P}\right) P^{-\frac{1}{2}}\left(-3 + \frac{\Delta lat^2}{\theta_{lat}^2}P^{-\frac{1}{2}}\left( 1 + P^{-\frac{1}{2}}\right)\right)\\ \frac{\partial^2 \kappa}{\partial\theta_{lon}^2} = \frac{\phi\,\Delta lon^2}{\theta_{lon}^4} \exp\left(-\sqrt{P}\right) P^{-\frac{1}{2}}\left(-3 + \frac{\Delta lon^2}{\theta_{lon}^2}P^{-\frac{1}{2}}\left( 1 + P^{-\frac{1}{2}}\right)\right)\\ \frac{\partial^2 \kappa}{\partial\theta_{t}^2} = \frac{\phi\,\Delta t^2}{\theta_{t}^4} \exp\left(-\sqrt{P}\right) P^{-\frac{1}{2}}\left(-3 + \frac{\Delta t^2}{\theta_{t}^2}P^{-\frac{1}{2}}\left( 1 + P^{-\frac{1}{2}}\right)\right)\\ \frac{\partial^2 \kappa}{\partial\sigma^2} = 2I\\ \frac{\partial^2 \kappa}{\partial \phi\, \partial\theta_{lat}} = \frac{\partial^2 \kappa}{\partial \theta_{lat}\,\partial\phi} = \frac{\Delta lat^2}{\theta_{lat}^3}\exp\left(-\sqrt{P}\right)P^{-\frac{1}{2}} \\ \frac{\partial^2 \kappa}{\partial \phi\, \partial\theta_{lon}} = \frac{\partial^2 \kappa}{\partial \theta_{lon}\,\partial\phi} = \frac{\Delta lon^2}{\theta_{lon}^3}\exp\left(-\sqrt{P}\right)P^{-\frac{1}{2}} \\ \frac{\partial^2 \kappa}{\partial \phi\, \partial\theta_{t}} = \frac{\partial^2 \kappa}{\partial \theta_{t}\,\partial\phi} = \frac{\Delta t^2}{\theta_{t}^3}\exp\left(-\sqrt{P}\right)P^{-\frac{1}{2}} \\ \frac{\partial^2 \kappa}{\partial \theta_{lon}\,\partial\theta_{lat}} = \frac{\partial^2 \kappa}{\partial \theta_{lat}\,\partial \theta_{lon}} = \frac{\phi\, \Delta lon^2\, \Delta lat^2}{\theta_{lon}^3\theta_{lat}^3}\exp\left(-\sqrt{P}\right)P^{-1}\left( 1 + P^{-\frac{1}{2}} \right) \\ \frac{\partial^2 \kappa}{\partial \theta_{t}\, \partial\theta_{lat}} = \frac{\partial^2 \kappa}{\partial \theta_{lat}\, \partial\theta_{t}} = \frac{\phi\, \Delta lat^2\, \Delta t^2 }{\theta_{lat}^3\theta_{t}^3}\exp\left(-\sqrt{P}\right)P^{-1}\left( 1 + P^{-\frac{1}{2}} \right) \\ \frac{\partial^2 \kappa}{\partial \theta_{t}\, \partial\theta_{lon}} = \frac{\partial^2 \kappa}{\partial \theta_{lon}\, \partial\theta_{t}} = \frac{\phi\, \Delta lon^2\, \Delta t^2 }{\theta_{lon}^3\theta_{t}^3}\exp\left(-\sqrt{P}\right)P^{-1}\left( 1 + P^{-\frac{1}{2}} \right)$$
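
Before implementing anything, one of these expressions can be sanity-checked symbolically. Below is a minimal sympy sketch (sympy is an extra dependency used only for this check; none of these names come from `local_mle_space_time`) comparing the hand-derived $\frac{\partial^2 \kappa}{\partial\theta_{lat}^2}$ against sympy's own differentiation of a single covariance entry:

import sympy as sp

phi, th_lat, th_lon, th_t, dlat, dlon, dt = sp.symbols(
    'phi theta_lat theta_lon theta_t dlat dlon dt', positive=True)

P = (dlat / th_lat) ** 2 + (dlon / th_lon) ** 2 + (dt / th_t) ** 2
kappa = phi * sp.exp(-sp.sqrt(P))  # a single off-diagonal entry of kappa

# second partial derivative computed by sympy
d2_auto = sp.diff(kappa, th_lat, 2)

# the hand-derived expression from the text above
d2_hand = (phi * dlat**2 / th_lat**4) * sp.exp(-sp.sqrt(P)) / sp.sqrt(P) * (
    -3 + (dlat**2 / th_lat**2) / sp.sqrt(P) * (1 + 1 / sp.sqrt(P)))

print(sp.simplify(d2_auto - d2_hand))  # 0 if the hand derivation is correct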

## Implementing in Python¶

Here we check our analytical functions for $\nabla^2 \kappa$, $\nabla^2 l_n$, and $\nabla^2 l$ against their numerical approximations. We reuse some of the classes and functions from the last post.

In [2]:
from local_mle_space_time import *
import itertools

In [3]:
np.random.seed(seed=2)
def make_data(mLen=40):
    lat = np.random.uniform(-10, 10, mLen)
    lon = np.random.uniform(-10, 10, mLen)
    res = np.random.normal(0, 2.5, mLen)
    julDay = np.random.uniform(0, 365, mLen)
    agg = Agg(lat, lon, res, julDay)
    return agg

n = 10
LARGEVAL = 10**8
SMALLVAL = np.finfo(np.float64).eps
thetaInitGuess = [np.log(x) for x in [400**2, 5, 5, 5, 100]]
theta = Theta(*thetaInitGuess)
data = Data(*make_data(n).arr())

In [4]:
def hess_kappa(theta, data):
    H = np.zeros((5, 5, data.m, data.m))
    P = p_fun(theta, data)
    P[P == 0] = np.inf # divide by zero correction
    Pnsq = P ** -.5
    Pchunk = 1/P * (1 + Pnsq)

    stChunk = lambda th1, th2: Pchunk / (th1 ** 3 * th2 ** 3)
    stcov = space_time_covariance_exp(theta, data)
    stcovdp = stcov / theta.phi  # partial of the covariance w.r.t. phi: exp(-sqrt(P))

    # d^2 kappa / d sigma^2 = 2I
    H[4, 4, :, :] = 2 * np.eye(data.m)

    # mixed phi / length-scale terms
    H[0, 1, :, :] = stcovdp * partial_p_fun(data.dlat, theta.lat) * Pnsq
    H[1, 0, :, :] = H[0, 1, :, :]
    H[0, 2, :, :] = stcovdp * partial_p_fun(data.dlon, theta.lon) * Pnsq
    H[2, 0, :, :] = H[0, 2, :, :]
    H[0, 3, :, :] = stcovdp * partial_p_fun(data.dt, theta.t) * Pnsq
    H[3, 0, :, :] = H[0, 3, :, :]

    # mixed length-scale / length-scale terms
    dlatsdlons = data.dlat ** 2 * data.dlon ** 2
    dlatsdts = data.dlat ** 2 * data.dt ** 2
    dlonsdts = data.dt ** 2 * data.dlon ** 2

    latlngChunk = stChunk(theta.lat, theta.lon)
    lattChunk = stChunk(theta.lat, theta.t)
    lontChunk = stChunk(theta.t, theta.lon)

    H[1, 2, :, :] = stcov * dlatsdlons * latlngChunk
    H[2, 1, :, :] = stcov * dlatsdlons * latlngChunk

    H[3, 1, :, :] = stcov * dlatsdts * lattChunk
    H[1, 3, :, :] = stcov * dlatsdts * lattChunk

    H[3, 2, :, :] = stcov * dlonsdts * lontChunk
    H[2, 3, :, :] = stcov * dlonsdts * lontChunk

    # pure second derivatives of the length scales
    ddChunk = lambda dx, th: -3 + (dx / th) ** 2 * Pnsq * (1 + Pnsq)
    mChunk = stcov * Pnsq

    H[1, 1, :, :] = mChunk * data.dlat ** 2 / theta.lat ** 4 * ddChunk(data.dlat, theta.lat)
    H[2, 2, :, :] = mChunk * data.dlon ** 2 / theta.lon ** 4 * ddChunk(data.dlon, theta.lon)
    H[3, 3, :, :] = mChunk * data.dt ** 2 / theta.t ** 4 * ddChunk(data.dt, theta.t)
    return H


Our numerical Hessian approximation functions are defined here.

In [5]:
def hessian(x, the_func, eps=10**-8):
    """Numerical approximation to the Hessian
    Parameters
    ------------
    x: array-like
        The evaluation point
    the_func: function
        The function. We assume that the function returns the function value and
        the associated gradient as the second return element
    eps: float
        The size of the step
    """
    N = x.size
    H = np.zeros((N, N))
    df_0 = the_func(x)[1]
    for idx in range(N):
        xx0 = 1. * x[idx]
        x[idx] = xx0 + eps
        df_1 = the_func(x)[1]
        H[idx, :] = (df_1 - df_0) / eps
        x[idx] = xx0
    return H

def hessian_4D(x, the_func, eps=10**-8):
    """Numerical approximation to the 4D Hessian
    Parameters
    ------------
    x: array-like
        The evaluation point
    the_func: function
        The function. We assume that the function returns the function value and
        the associated gradient as the second return element
    eps: float
        The size of the step
    """
    N = x.size
    H = np.zeros((N, N, data.m, data.m))
    df_0 = the_func(x)[1]
    for idx in range(N):
        xx0 = 1. * x[idx]
        x[idx] = xx0 + eps
        df_1 = the_func(x)[1]
        H[idx, :, :, :] = (df_1 - df_0) / eps
        x[idx] = xx0
    return H
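
Both helpers use a one-sided (forward) difference of the gradient, so the truncation error is $O(\epsilon)$. If more accuracy were ever needed, a central-difference variant along these lines (a hypothetical helper, not used in this notebook) would reduce the error to $O(\epsilon^2)$ at twice the cost:

def hessian_central(x, the_func, eps=10**-6):
    """Central-difference Hessian approximation from a gradient-returning function."""
    N = x.size
    H = np.zeros((N, N))
    for idx in range(N):
        xx0 = 1. * x[idx]
        x[idx] = xx0 + eps
        df_plus = the_func(x)[1]
        x[idx] = xx0 - eps
        df_minus = the_func(x)[1]
        H[idx, :] = (df_plus - df_minus) / (2 * eps)
        x[idx] = xx0
    return H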

In [6]:
kappaHess = hess_kappa(theta, data)
my_fun = lambda th: (kappa_fun(Theta(*th), data), grad_kappa(Theta(*th), data))
kappaHessNum = hessian_4D(np.array(thetaInitGuess), my_fun)
for idx, jdx in itertools.product(range(5), range(5)):
    pdd = kappaHess[idx, jdx, :, :]
    pddx = kappaHessNum[idx, jdx, :, :]
    if np.allclose(pdd, pddx, 10**-5):
        continue
    for kdx, ldx in itertools.product(range(pdd.shape[0]), range(pdd.shape[1])):
        pdde = pdd[kdx, ldx]
        pddex = pddx[kdx, ldx]
        print(f'elem: {idx}, {jdx}, {kdx}, {ldx}, analytic: {pdde}, numeric: {pddex}, ratio: {pdde/pddex}')


The analytic function agrees with the numeric approximation.

## $nll_n(\theta)$ Hessian¶

In [7]:
def hess_nll(theta, data):
    kappa = kappa_fun(theta, data)
    gradKappa = grad_kappa(theta, data)
    hessKappa = hess_kappa(theta, data)
    nllHess = np.zeros((5, 5))
    for idx, jdx in itertools.product(range(5), range(5)):
        ft1 = solve(kappa, hessKappa[idx, jdx, :, :])
        ft2 = solve(kappa, kappa - data.yyT)
        firstTerm = np.dot(ft1, ft2)
        st1 = solve(kappa, gradKappa[idx, :, :])
        st2 = solve(kappa, gradKappa[jdx, :, :])
        st3 = solve(kappa, data.yyT - .5 * kappa)
        secondTerm = np.dot(np.dot(st1, st2), st3)
        nllHess[idx, jdx] = np.trace(firstTerm) + 2 * np.trace(secondTerm)
    return nllHess
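
A note on cost: every one of the 25 $(i, j)$ pairs above calls `solve(kappa, ...)` three times on the same `kappa`. Since $\kappa_n$ is symmetric positive definite, a single Cholesky factorization could be shared across all pairs. A sketch of that variant (same result, assuming the `kappa_fun`, `grad_kappa`, and `hess_kappa` functions above):

from scipy.linalg import cho_factor, cho_solve

def hess_nll_chol(theta, data):
    kappa = kappa_fun(theta, data)
    gradKappa = grad_kappa(theta, data)
    hessKappa = hess_kappa(theta, data)
    c = cho_factor(kappa)                        # factor once, reuse everywhere
    kinvGrad = [cho_solve(c, gradKappa[i, :, :]) for i in range(5)]
    kinvA = cho_solve(c, kappa - data.yyT)       # kappa^{-1} (kappa - y y^T)
    kinvB = cho_solve(c, data.yyT - .5 * kappa)  # kappa^{-1} (y y^T - kappa / 2)
    nllHess = np.zeros((5, 5))
    for idx, jdx in itertools.product(range(5), range(5)):
        firstTerm = np.dot(cho_solve(c, hessKappa[idx, jdx, :, :]), kinvA)
        secondTerm = np.dot(np.dot(kinvGrad[idx], kinvGrad[jdx]), kinvB)
        nllHess[idx, jdx] = np.trace(firstTerm) + 2 * np.trace(secondTerm)
    return nllHess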

In [8]:
nll_fun = lambda th: (neg_log_liklihood(Theta(*th), data), grad_nll(Theta(*th), data))
nllHessNum = hessian(np.array(thetaInitGuess), nll_fun)
nllHess = hess_nll(theta, data)
for idx, jdx in itertools.product(range(5), range(5)):
    pdd = nllHess[idx, jdx]
    pddx = nllHessNum[idx, jdx]
    if np.isclose(pdd, pddx, 10**-4):
        continue
    print(f'elem: {idx}, {jdx}, analytic: {pdd}, numeric: {pddx}, ratio: {pdd/pddx}')

elem: 1, 4, analytic: 1.0090298510675597e-08, numeric: 4.440892098500626e-08, ratio: 0.22721332306367845


Occasionally an element disagrees, but we can safely ignore the disagreement when the values themselves are this small.
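
One way to make that judgment explicit (a small aside, not what the loop above does) is to pair the relative tolerance with an absolute floor, so near-zero elements like the one printed above pass automatically:

# rtol covers elements of ordinary magnitude, atol absorbs the near-zero ones
print(np.allclose(nllHess, nllHessNum, rtol=10**-4, atol=10**-6))  # True for the run above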

## Verify $nll(\theta)$¶

In [9]:
def make_agg_years():
    aggYears = {}
    for year in range(2007, 2019):
        data = Data(*make_data(10).arr())
        aggYears[str(year)] = data
    return aggYears

def pairwise_product(arr):
    '''calculates the product between each pair of elements'''
    cp = np.array(np.meshgrid(arr, arr)).T
    pwp = cp[:, :, 0].flatten() * cp[:, :, 1].flatten()
    return pwp.reshape(len(arr), len(arr))

def sum_nll(thetaArr, filtAggs):
    theta = Theta(*thetaArr)
    years = filtAggs.keys()
    nllSum = 0
    for iYear in years:
        data = Data(*filtAggs[iYear].arr())
        nllSum += neg_log_liklihood(theta, data)
    return .5 * nllSum

def grad_sum_nll(thetaArr, aggYears):
    theta = Theta(*thetaArr)
    years = aggYears.keys()
    gradSum = np.zeros(5)
    for iYear in years:
        data = Data(*aggYears[iYear].arr())
        gradSum += grad_nll(theta, data)
    return .5 * gradSum

def hess_sum_nll(thetaArr, aggYears):
    theta = Theta(*thetaArr)
    years = aggYears.keys()
    hessSum = np.zeros((5, 5))
    for iYear in years:
        data = aggYears[iYear]
        hessSum += hess_nll(theta, data)
    return .5 * hessSum
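
As an aside, `pairwise_product(arr)` computes the same matrix as `np.outer(arr, arr)`, which can be checked directly:

arr = np.array([1.0, 2.0, 3.0])
print(np.allclose(pairwise_product(arr), np.outer(arr, arr)))  # expect True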

In [10]:
aggYears = make_agg_years()
sum_nll_fun = lambda th: (sum_nll(th, aggYears), grad_sum_nll(th, aggYears))
sumNllHessNum = hessian(np.array(thetaInitGuess), sum_nll_fun, 10**-6)

In [11]:
sumNllHess = hess_sum_nll(thetaInitGuess, aggYears)
for idx, jdx in itertools.product(range(5), range(5)):
    pdd = sumNllHess[idx, jdx]
    pddx = sumNllHessNum[idx, jdx]
    if np.isclose(0, pdd, 10**-4) or np.isclose(0, pddx, 10**-4):
        continue
    if np.isclose(pdd, pddx, 10**-4):
        continue
    print(f'elem: {idx}, {jdx}, analytic: {pdd}, numeric: {pddx}, ratio: {pdd/pddx}')


Excellent! Numerical approximations and analytical results are roughly in agreement.

## Unconstrained optimization¶

Remember that this is a constrained problem: the $\theta$ values must be positive, and passing negative values to the covariance function will cause errors. We make the problem unconstrained by optimizing over $\zeta$, with $\theta = \exp(\zeta)$. This slightly modifies the negative log-likelihood gradient and Hessian:

$$\large \nabla_{\zeta}\, l = \nabla_{\theta}\, l \odot \theta$$ and $$\large \nabla^2_{\zeta}\, l = \nabla^2_{\theta}\, l \odot \theta\theta^T$$

where $\odot$ is the elementwise (Hadamard) product.

We assign our objective functions and apply the algorithms available in SciPy's minimize documentation. The methods that did not work are omitted.

In [12]:
def obj_fun(thetaArr, aggYears):
    eth = [np.exp(x) for x in thetaArr]
    return sum_nll(eth, aggYears)

def grad_fun(thetaArr, aggYears):
    eth = [np.exp(x) for x in thetaArr]
    return np.multiply(grad_sum_nll(eth, aggYears), eth)

def hess_fun(thetaArr, aggYears):
    eth = [np.exp(x) for x in thetaArr]
    return np.multiply(hess_sum_nll(eth, aggYears), pairwise_product(eth))

def timeit(method):
    def timed(*args, **kw):
        ts = time.time()
        result = method(*args, **kw)
        te = time.time()
        print('%r  %2.2f ms' % \
              (method.__name__, (te - ts) * 1000))
        return result
    return timed

options = {'disp': True, 'maxiter': 1000 }

def vanilla_cb(x):
    print(x)

def is_pos_def(x):
    return np.all(np.linalg.eigvals(x) > 0)

print('Hessian positive definite?: {}'.format(is_pos_def(hess_fun(thetaInitGuess, aggYears))))
aggYears = make_agg_years()
@timeit
def min_obj_bfgs():
    return minimize(obj_fun, thetaInitGuess, args=aggYears, method='BFGS', tol=10**-4, options=options)

@timeit
def min_obj_bfgs_jac():
    return minimize(obj_fun, thetaInitGuess, args=aggYears, method='BFGS', tol=10**-4, jac=grad_fun, options=options)

@timeit
def min_obj_nm():
    return minimize(obj_fun, thetaInitGuess, args=aggYears, method='Nelder-Mead', tol=10**-4, options=options)

@timeit
def min_obj_NCG():
    return minimize(obj_fun, thetaInitGuess, args=aggYears, method='Newton-CG', jac=grad_fun, hess=hess_fun, tol=10**-4, options=options)

@timeit
def min_obj_cg():
    return minimize(obj_fun, thetaInitGuess, args=aggYears, method='CG', jac=grad_fun, hess=hess_fun, tol=10**-4, options=options)

xOptsBFGS = min_obj_bfgs()
xOptsBFGSJac = min_obj_bfgs_jac()
xOptsNM = min_obj_nm()
xOptsCG = min_obj_cg()
xOptsNCG = min_obj_NCG()

Hessian positive definite?: False
Optimization terminated successfully.
Current function value: 278.898117
Iterations: 108
Function evaluations: 786
Optimization terminated successfully.
Current function value: 278.898116
Iterations: 104
Function evaluations: 120
Optimization terminated successfully.
Current function value: 278.945299
Iterations: 116
Function evaluations: 220

/home/tyler/anaconda3/envs/argo/lib/python3.6/site-packages/scipy/optimize/_minimize.py:523: RuntimeWarning: Method CG does not use Hessian information (hess).
RuntimeWarning)

Optimization terminated successfully.
Current function value: 278.945330
Iterations: 6
Function evaluations: 19
'min_obj_cg'  572.18 ms
Optimization terminated successfully.
Current function value: 278.945982
Iterations: 16
Function evaluations: 21
Hessian evaluations: 16
'min_obj_NCG'  3071.64 ms


Each run terminated successfully. We also see how a few of the other minimization routines documented for SciPy's minimize performed: the conjugate gradient (CG) and Newton conjugate gradient (Newton-CG) methods, which use the analytical gradient (and, for Newton-CG, the Hessian), converged in far fewer iterations and function evaluations than either BFGS or Nelder-Mead.

## Constrained Optimization¶

We know at heart that this is a constrained optimization problem, so why insist on the $\theta = \exp(\zeta)$ transformation? The sequential least squares programming method (SLSQP) handles the bounds directly, only needs the gradient, and converges very quickly.

In [19]:
def obj_fun(thetaArr, aggYears):
    return sum_nll(thetaArr, aggYears)

def hess_fun(thetaArr, aggYears):
    return hess_sum_nll(thetaArr, aggYears)

maxB = 10**10
minB = 10**-6
bounds = [(minB, maxB) for x in range(5)]

@timeit
def min_obj_slsqp():
    return minimize(obj_fun, thetaInitGuess, args=aggYears, method='SLSQP', tol=10**-4, jac=grad_fun, bounds=bounds, options=options)

xOptsSLSQP = min_obj_slsqp()


Optimization terminated successfully    (Exit mode 0)
Current function value: 278.73682561966064
Iterations: 28
Function evaluations: 38