itsonlyamodel

Python Time Series Fit

Nonlinear least squares fitting of time series in Python

This post is the third part of a three-part series on Argovis.

  1. Argovis API introduces my website www.argovis.com and shows how users can access data using a Python API.
  2. Linear time series analysis in R shows how time series models can be used to fit ocean temperatures from Argo data.
  3. Linear time series model fitting in python gives further detail on how non-linear least squares fitting is typically done on time series.

As we saw in the previous notebook, R has an arima() function that can perform fits on our time series generated from Argo floats around Gibraltar. The seasonal trend was removed, leaving us with a stationary model. I decided to keep the two models AR(1) and AR(3), and compare them with our own parameter-finding algorithm in Python. I will also compare our method with scipy's own minimization algorithms.

I know of three methods to estimate time series parameters: the method of moments, exact likelihood methods, and conditional or unconditional likelihood methods.

The method of moments is not very accurate and is usually used as an initial guess for the more rigorous methods. The closed form of the exact likelihood is complicated, and I didn't want to get lost in the details, as the scope of this notebook is to optimize an objective function. I decided to go with the unconditional sum of squares using the backcasting method.

Unconditional likelihood estimation for ARMA(p,q) processes.

Consider a time series modeled by an ARMA(p,q) process

$$ \phi_p(B)Z_t = \theta_q(B) a_t $$

It turns out that maximum likelihood estimation based on the conditional expectation of $a_t$ gives a better estimate of $\hat{\phi}_p$ and $\hat{\theta}_q$.

The estimated parameters are found by maximizing the likelihood function, which involves the conditional expectation of $a_t$ given the time series $Z$, mean $\mu$, and weights $\phi_p$, $\theta_q$,

$$ E(a_t \mid Z, \phi_p, \theta_q, \mu) = E\left(\frac{\phi_p(B)}{\theta_q(B)} Z_t\right) $$

Shocks $a_t$ are assumed to be an independently distributed white noise process,

$$ a_t \sim WN(0, \sigma_a^2) $$

This sort of problem is solved by minimizing a sum of squares derived from the log-likelihood function.

If we skip some details, the log-likelihood function is maximized by minimizing the function below.

$$f(\phi, \theta, \mu) = \frac{1}{2}\sum^n_{t=1} E(a_t|Z, \phi, \theta, \mu)^2$$
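The skipped details are brief: assuming the shocks are Gaussian white noise, the log-likelihood is

$$ \ln L(\phi, \theta, \mu, \sigma_a^2) = -\frac{n}{2}\ln(2\pi\sigma_a^2) - \frac{1}{2\sigma_a^2}\sum^n_{t=1} E(a_t|Z, \phi, \theta, \mu)^2 $$

so, for a fixed $\sigma_a^2$, maximizing the log-likelihood over $(\phi, \theta, \mu)$ is the same as minimizing the sum of squares in $f$.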

One challenge with time series is estimating the initial conditions, i.e., what are $Z_0, Z_{-1}, \ldots$? One simple way would be to estimate the initial conditions as $$a_t = E(a_t) = 0$$ $$Z_t = E(Z_t) = \bar{Z}$$

At this point, I should note that we aren't solving for the variance $\sigma_a^2$. It is found after solving for $\hat{\phi}, \hat{\mu}, \hat{\theta}$.

Variance estimate $$\hat{\sigma}^2_a = \frac{f(\hat{\phi}, \hat{\mu}, \hat{\theta})}{df}$$

The degrees of freedom are

$$ df = n - (2p + q + 1) $$
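As a minimal sketch, the variance estimate can be computed once the optimizer has returned the minimized objective value. The helper below is hypothetical (it is not used elsewhere in this notebook) and simply follows the two formulas above.

def estimate_shock_variance(f_min, n, p, q=0):
    # f_min: minimized objective, n: series length, p, q: AR and MA orders
    df = n - (2 * p + q + 1)
    return f_min / df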

For example, we can define an AR(1) model conditional likelihood as

$$ a_t = Z_t - \mu - \phi(Z_{t-1} - \mu) = Z_t - \mu - \phi Z_{t-1} + \phi \mu $$

The model is non-linear in the parameters due to the term $\phi\mu$. One way to deal with the missing value $Z_{t-1}$ at the start of the series is to replace it with its expected value given the series $Z$,

$$Z_{t-1} \sim E(Z_{t-1}|Z)$$

After importing the libraries used by this notebook, we write the Python method shock_AR1 for computing these shocks.

In [2]:
import numpy as np
import pdb
import scipy.linalg as spla
import scipy.optimize as opt
import pandas as pd
import numdifftools as nd
import cmocean
import matplotlib.pylab as plt
%matplotlib inline
In [3]:
z = pd.read_csv('docs/argo/desGilT.csv')['ts'].values
print(z.shape)
(401,)
In [4]:
def shock_AR1(z, phi=.3, mu=0):
    a = np.zeros(len(z))
    # initial shock: replace the unknown previous value with its expectation
    a[0] = (1-phi)*(np.mean(z) - mu)
    for tdx in range(1, len(z)):
        a[tdx] = z[tdx] - mu - phi * (z[tdx-1] - mu )
    return a
AR1Shocks = shock_AR1(z, .5, 15)
AR1Shocks.shape
Out[4]:
(401,)

Backcasting method for unconditional likelihood estimation

A better way to estimate the initial values is the backcasting method: previous values are estimated until $z_{t} - z_{t-1} < .005$ for $t = 0, -1, \ldots, M$, where $M \geq -(t+1)$. You are then able to create a backcasted vector of shocks $a \in \mathbb{R}^{s}$, $s = n + M$, where $n$ is the original length of the time series $Z_t$.
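Concretely, for a zero-mean AR(1) process the backcasted values follow the backward recursion

$$ \hat{z}_{t-1} = \phi\, \hat{z}_{t}, \qquad t = 0, -1, \ldots $$

which is what shock_backcasting_AR1 below prepends to the series until two successive backcasts differ by less than .005.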

For example, the following time series of length 401 is extended to a length of 409 when backcasted as an AR(1) process with the parameter values used below.

Backcasting is performed once before the start of our minimization algorithm. A non-stationary time series may cause the backcasting algorithm to go on indefinitely, and since our optimization framework doesn't account for stationarity constraints, we avoid backcasting inside the optimization itself.

In [4]:
def shock_backcasting_AR1(z, phi, mu):
    zdot = [elem-mu for elem in z]
    tb = 0
    zBack = zdot[0:2]
    while ((np.abs(zBack[1] - zBack[0])) >= .005):
        zBack = [ phi * zBack[0]] + zBack
        tb -= 1
    M = -(tb+1)
    zdot = zBack[:-2] + zdot[2:]
    shock = np.zeros(len(zdot))
    for tdx in range(1, len(zdot)):
        shock[tdx] = zdot[tdx] - phi * zdot[tdx-1]
    zBC = np.add(zdot, mu)
    return shock[1:], zBC

bcAR1, zBC = shock_backcasting_AR1(z, .5, 15)
print(len(bcAR1))
print(len(zBC))
408
409

From a numerical optimization standpoint, the shocks $a_t$ can be written as residuals $r_t(x)$, where $x$ is the vector of model parameters $[\phi, \mu, \ldots]$. For an objective function

$$ f(x) = \frac{1}{2}\sum^n_{t=1} E(a_t|Z, \phi, \mu)^2 = \frac{1}{2}\sum^n_{t=1} r_t(x)^2 $$

where $x = [\phi, \mu]$. If we treat the residuals as a vector $(r_1(x), r_2(x), \ldots, r_n(x))$, then their derivatives are described by the Jacobian

$$ J_{ti}(x) = \left[ \frac{\partial r_t(x)}{\partial x_i}\right] $$
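From the residual vector $r(x)$ and its Jacobian, the gradient and the Gauss-Newton approximation of the Hessian of $f$ follow as

$$ \nabla f(x) = J(x)^T r(x), \qquad \nabla^2 f(x) \approx J(x)^T J(x) $$

These are the quantities that the helper functions myGrad and myHess compute later in the notebook.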

For an AR(1) model, the shocks $a_t$ are the residuals $r_t(x)$,

$$ a_t = Z_t - \mu - \phi_1(Z_{t-1} - \mu) = r_t(x) $$

There are two unknown parameters, $\phi$ and $\mu$, thus $J_{ti}(x) \in \mathbb{R}^{n \times 2}$.

$$J_{ti}(x) = \begin{bmatrix} \frac{\partial r_t}{\partial \phi}, \frac{\partial r_t}{\partial \mu} \end{bmatrix} = \begin{bmatrix} -Z_{t-1} + \mu ,\ -1 + \phi \end{bmatrix} $$

One important feature of this Jacobian is that the backcasted entries should be set to zero. Thus the first row of the AR(1) Jacobian is all zeros:

$$J_{0i}(x) = \begin{bmatrix} \frac{\partial r_0}{\partial \phi} = 0, \frac{\partial r_0}{\partial \mu} = 0 \end{bmatrix} $$

In [5]:
def JacAr1(z, phi=.5, mu=15):
    z1 = z[0:-1]
    dPhi = np.add(-1 * z1, mu)
    dMu = (-1+phi) * np.ones(len(z1))
    
    # Jacobian is zero at t=0
    dPhi[0] = 0
    dMu[0] = 0
    return np.array([dPhi, dMu]).T

Jac0 = JacAr1(zBC, phi=.5, mu=15)
Jac0.shape
Out[5]:
(408, 2)

We can check whether our Jacobian is correct by comparing it with the numerical approximation from the Python library numdifftools.

In [6]:
myFun = lambda x: shock_AR1(zBC[1:], x[0], x[1]).T
JacApprox = nd.Jacobian(myFun)

spla.norm(JacAr1(zBC, .5, 15) - JacApprox([.5, 15]))
Out[6]:
4.4305349724175143

Pretty good!

Minimizing f(x)

Working with real data can make the Jacobian ill-conditioned, making the Hessian approximation singular. The pure Newton step may no longer work for determining the step $p$.

Gauss-Newton step

Anticipating a singular matrix, we introduce the Gauss-Newton step function GNP.

Gauss-Newton is a minimization procedure that finds a step $p^{GN}$ by solving the linear least squares problem

$$ p^{GN} = \arg \min_{p \in \mathbb{R}^n} \|J(x_k)p + r(x_k)\|^2 $$

Finding the optimal step can be computationally expensive. Hence the simpler Cholesky decomposition is attempted first, followed by more reliable yet more expensive methods.

This function relies on exception handling. If a singular matrix raises an error, the step is found using QR decomposition. If QR decomposition fails, then the step is found using an SVD approach.
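For reference, the three branches compute the same Gauss-Newton step in different factored forms. With the Cholesky factorization $J^TJ = R^TR$, the column-pivoted QR factorization $JP = QR$, and the SVD $J = U\Sigma V^T$, the step is, respectively,

$$ p^{GN} = -(R^TR)^{-1}J^Tr, \qquad p^{GN} = -P R^{-1} Q^T r, \qquad p^{GN} = -\sum_{i} \frac{u_i^T r}{\sigma_i} v_i $$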

In [7]:
def GNP(x, shockFun, gradFun, jacFun, hessFun):
    #Cholsesky attempt
    try:
        R = spla.cholesky(hessFun(x), lower=False)
        RTR = spla.inv(np.dot(R.T,R))
        pChol = -1 * np.dot(RTR, gradFun(x))
        return pChol
    except np.linalg.linalg.LinAlgError as err:
        pass
    # QR decomposition attempt: p = -P R^{-1} Q^T r
    try:
        Q, R, P = spla.qr(jacFun(x), pivoting=True, mode='economic')
        P = get_perm_matrix(P)
        negPinvR = -1 * np.dot(P, spla.inv(R))
        QTr = np.dot(Q.T, shockFun(x))
        pQR = np.dot(negPinvR, QTr)
        if np.isnan(pQR).any():
            raise ValueError('nan in QR step')
        return pQR
    except np.linalg.linalg.LinAlgError as err:
        pass
    except ValueError:
        pass
    # SVD attempt: p = -sum_i (u_i^T r / sigma_i) v_i, v_i the i-th row of VT
    try:
        U, S, VT = spla.svd(jacFun(x))
        UTr = np.dot(U.T, shockFun(x))

        pSVD = 0
        for col in range(VT.shape[0]):
            vCol = VT[col, :]
            sigCol = S[col]
            tau = UTr[col] / sigCol
            pSVD -= tau * vCol
        return pSVD
    except np.linalg.linalg.LinAlgError as err:
        if 'singular matrix' in str(err):
            print('svd did not work.')
        else:
            raise

def get_perm_matrix(P):
    I = np.identity(P.shape[0])
    M = np.identity(P.shape[0])
    for idx, col in enumerate(P):
        M[:, idx] = I[:, col]
    return M

The following helper functions are defined for an initial guess point x0.

In [8]:
def init_functions(x0):
    zBC = shock_backcasting_AR1(z, x0[0], x0[1])[1]
    print('length of z: {}'.format(len(z)))
    print('length of backcasted z: {}'.format(len(zBC)))
    myFun = lambda x: shock_AR1(zBC[1:], x[0], x[1])
    myJac = lambda x: JacAr1(zBC, x[0], x[1])
    myGrad = lambda x: np.dot(myJac(x).T, myFun(x))
    myHess = lambda x: np.dot(myJac(x).T, myJac(x))
    myObj = lambda x: .5 * np.dot(myFun(x), myFun(x).T)
    pGN = lambda x: GNP(x, myFun, myGrad, myJac, myHess)
    return myFun, myJac, myGrad, myHess, myObj, pGN

x0 = np.array([.9, 25])
myFun, myJac, myGrad, myHess, myObj, pGN = init_functions(x0)
p0 = pGN(x0)
J0 = myJac(x0)
F0 = myObj(x0)
g0 = myGrad(x0)
H0 = myHess(x0)
length of z: 401
length of backcasted z: 446

Visualization of the AR(1) objective function sample space

It is helpful to view the sample space of our objective function

$$ f(x) = \frac{1}{2}\Sigma^n_{t=1} r_t(x)^2 $$

around the point

$$ x^{arima}_{opt} = [\phi = 0.8141,\ \mu = 19.4998] $$

In [9]:
def get_contour_data(x0, ndim, delta):
    x_2 = np.linspace(x0[1] - delta,x0[1] + delta, ndim)
    x_1 = np.linspace(x0[0] - delta,x0[0] + delta, ndim)
    XX = np.meshgrid(x_1, x_2)
    XX = np.dstack([XX[0], XX[1]]).reshape((ndim**2,2))
    x = x0
    m = []
    for xGrid in XX:
        m.append(myObj(xGrid))
    m = np.array(m).reshape(ndim, ndim)
    return x_1, x_2, m

def contourPlot(x_1, x_2, m, figNum):
    fig = plt.figure(figNum)
    axes = plt.axes()
    cax = axes.contourf(x_1, x_2, m, cmap=cmocean.cm.ice)
    axes.set_title('Contour of r(t)')
    axes.set_ylabel(r'$\mu$')
    axes.set_xlabel(r'$\phi$')
    cbar = fig.colorbar(cax, cmap = cmocean.cm.ice)
    return fig

xArimaSol = np.array([0.8141, 19.4998])
x_1, x_2, m = get_contour_data(xArimaSol, 200, 1)
fig0 = contourPlot(x_1, x_2, m, 0)

The convexity of this contour plot is encouraging. We should be able to find a solution at the bottom of this blue basin.

In [10]:
print(F0)
print(J0.shape)
print(g0.shape)
print(H0.shape)
print(p0.shape)
83.1650196822
(445, 2)
(2,)
(2, 2)
(2,)

Minimizing AR(1) fit f(x) using a Gauss-Newton method

Numerical Optimization by Nocedal and Wright suggests using a Gauss-Newton line search method for non-linear least squares fitting. For the Gauss-Newton step, I am computing the inverse of the approximate Hessian. As long as the approximate Hessian isn't singular, this step is valid.

In [11]:
def myGaussNewton(x0, objFun, gradFun, hessFun, dirFun, timeIt=False, eps=10**-8, dprint=10**3):
    dim = x0.shape[0]
    I = np.identity(dim)
    gradF = gradFun(x0)
    normGrad = spla.norm(gradF)
    H = hessFun(x0)
    xCurr = x0
    k=0
    FOut = []
    gradOut = []
    phiOut = []
    muOut = []

    while normGrad > eps:
        pDir = dirFun(xCurr)
        xNext = xCurr + pDir
        gradFNext = gradFun(xNext)
        HNext = hessFun(xNext)
        k+=1
        if (k % dprint == 0):
            #pdb.set_trace()
            F = objFun(xCurr)
            gradNorm = spla.norm(gradFNext)
            print('on {}th step'.format(k))
            print('fOut: {0}\ngradNorm: {1}\n'.format(F, gradNorm))

        #update values
        H = HNext
        xCurr = xNext
        gradF = gradFNext
        try:
            normGrad = spla.norm(gradF)
        except ValueError:
            pdb.set_trace()
            print('should not have nan or inf in gradient')

        gradOut.append(normGrad)
        FOut.append(objFun(xCurr))
        phiOut.append(xCurr[0])
        muOut.append(xCurr[1])
        
        if k > 3000:
            print('not converging. exiting while loop')
            break
        
    outDict = {'F': FOut, 'normGrad': gradOut, 'phi': phiOut, 'mu': muOut}
    F = objFun(xCurr)
    gradNorm = spla.norm(gradFNext)
    if not timeIt:
        print('algorithm has terminated after {} steps'.format(k))
        print('fOut: {0}\ngradNorm: {1}'.format(F, gradNorm))
        print('xMin: {}\n'.format(xCurr))
    return xCurr, outDict
In [12]:
xGN, GNDict = myGaussNewton(x0, myObj, myGrad, myHess, pGN)
algorithm has terminated after 2 steps
fOut: 21.936583099149388
gradNorm: 2.9754436402122983e-13
xMin: [  0.9645027   19.66902258]

The Gauss-Newton method converges after 2 steps, with a solution that overestimates the arima fit...

Backcasting introduces estimated z values into the procedure, so the objective function itself depends on the initial point $x_0$. If we vary the initial point, our solution changes.

Clearly we need to come up with an initial guess that best approximates the solution.

In [13]:
x0Array = np.array([[.1, 0], [.5, 10], [.99, 30]])

for x0 in x0Array:
    print('initial coordinates for x0: {}'.format(x0))
    myFun, myJac, myGrad, myHess, myObj, pGN = init_functions(x0)
    myGaussNewton(x0, myObj, myGrad, myHess, pGN)
initial coordinates for x0: [ 0.1  0. ]
length of z: 401
length of backcasted z: 404
algorithm has terminated after 2 steps
fOut: 134.74582946533877
gradNorm: 3.397907021896483e-11
xMin: [  0.76420615  19.51851071]

initial coordinates for x0: [  0.5  10. ]
length of z: 401
length of backcasted z: 410
algorithm has terminated after 2 steps
fOut: 33.890500289415414
gradNorm: 6.878599883316829e-12
xMin: [  0.90668832  19.55928539]

initial coordinates for x0: [  0.99  30.  ]
length of z: 401
length of backcasted z: 709
algorithm has terminated after 2 steps
fOut: 22.051923365729113
gradNorm: 1.9662055882241343e-12
xMin: [  0.99612495  19.395851  ]

Estimating starting point $x_0$ using method of moments.

The time series model parameters depend on the initial coordinates, so our initial guess can no longer be arbitrary. Instead, we can estimate the initial parameters $x_0$ using the method of moments.

The method of moments estimate for the mean is $\hat{\mu}=\bar{Z}$

For $\phi_i$ we use the autocorrelation function estimate, $\hat{\rho}_k$.

$$ \hat{\rho}_k = \frac{ \Sigma^{n-k}_{t=1} (Z_t - \bar{Z}) (Z_{t+k} - \bar{Z})}{\Sigma^n_{t=1} (Z_t - \bar{Z})^2}, k = 1, 2, ..., p $$

An AR(1) $\phi$ weight is given by

$$ \hat{\phi} = \hat{\rho}_1 $$

The starting point is defined as

$$ x_0 = [\hat{\phi}, \hat{\mu}] $$

In [14]:
def get_inital_estimate(z,k=1):
    N = len(z)
    meanZ = np.mean(z)
    # make vector Z_t - meanZ
    lVect = z[k:]-meanZ
    # make vector Z_{t-1} - meanZ
    rVect = z[0:-k] - meanZ
    auto_cov = np.dot(lVect, rVect.T)/np.dot(z-meanZ,z.T-meanZ)
    return np.array([auto_cov, meanZ])

x0 = get_inital_estimate(z)
print('inital x0 estimate: {}'.format(x0))
inital x0 estimate: [  0.80372595  19.48976062]

A consistent initial guess based on the data eliminates the unwanted change in the objective function due to backcasting.

AR(1) Model performance

In [15]:
x0 = get_inital_estimate(z,1)
myFun, myJac, myGrad, myHess, myObj, pGN = init_functions(x0)
xGN, GNDict = myGaussNewton(x0, myObj, myGrad, myHess, pGN)
length of z: 401
length of backcasted z: 415
algorithm has terminated after 2 steps
fOut: 19.954174586982298
gradNorm: 1.0959762715111268e-14
xMin: [  0.81585758  19.49976398]

In [16]:
GNDF = pd.DataFrame(GNDict)
In [17]:
GNDF['F'].plot()
Out[17]:
In [18]:
GNDF['normGrad'].plot()
Out[18]:
In [19]:
x_1, x_2, m = get_contour_data(xGN, 100, .002)
fig0 = contourPlot(x_1, x_2, m, 1)
axes = plt.axes()
solScat = axes.scatter(GNDF['phi'].values,GNDF['mu'].values, marker = 'o', color='yellow', label='Next point')
In [21]:
print(myObj(xGN))
print(myObj(xArimaSol))
19.954174587
19.9543522124

The ballpark estimate brings us to within three decimal places of the arima result, and yet we shouldn't expect them to be the same. arima performs exact likelihood estimation with a Kalman filter, whereas our method uses unconditional likelihood estimation with the backcasting method.

$x^{GN}_{opt} = [\phi = 0.8159, \mu = 19.4998]$

$x^{arima}_{opt} = [\phi = 0.8141, \mu = 19.4998]$

Comparison to Scipy's minimization libraries.

The scipy library offers a way to compare our Gauss-Newton method with BFGS, CG, and a least squares fitting algorithm.

In [55]:
def compare_scipy_minimization(x0, myObj, myGrad, myHess, pGN):
    print('Solutions:')
    print('My Gauss Newton: {}'.format(myGaussNewton(x0, myObj, myGrad, myHess, pGN, timeIt=True)[0]))
    print('CG: {}'.format(opt.minimize(myObj, x0, method='CG').x))
    print('BFGS: {}'.format(opt.minimize(myObj, x0, method='BFGS').x))
    print('BFGS with Gradient: {}'.format(opt.minimize(myObj, x0, method='BFGS', jac=myGrad).x))
    # note: opt.least_squares expects a residual function; passing the scalar
    # objective here makes it treat f(x) as a single residual
    print('least squares fit: {}\n'.format(opt.least_squares(myObj, x0, jac=myGrad).x))

def time_scipy_minimization(x0, myObj, myGrad, myHess, pGN):
    print('Timeit function outputs for convergence:')
    print('My Gauss Newton: ')
    %timeit myGaussNewton(x0, myObj, myGrad, myHess, pGN, timeIt=True)
    print('CG:')
    %timeit opt.minimize(myObj, x0, method='CG').x
    print('BFGS:')
    %timeit opt.minimize(myObj, x0, method='BFGS').x
    print('BFGS with Gradient:')
    %timeit opt.minimize(myObj, x0, method='BFGS', jac=myGrad).x
    print('least squares fit: ')
    %timeit opt.least_squares(myObj, x0, jac=myGrad).x
    
compare_scipy_minimization(x0, myObj, myGrad, myHess, pGN)
time_scipy_minimization(x0, myObj, myGrad, myHess, pGN)
Solutions:
My Gauss Newton: [  0.9645027   19.66902258]
CG: [  0.96450269  19.66963493]
BFGS: [  0.96450271  19.66963035]
BFGS with Gradient: [  0.96449872  19.66958286]
least squares fit: [  0.99891309  24.99816482]

Timeit function outputs for convergence:
My Gauss Newton: 
100 loops, best of 3: 2.62 ms per loop
CG:
10 loops, best of 3: 82.1 ms per loop
BFGS:
10 loops, best of 3: 49.6 ms per loop
BFGS with Gradient:
10 loops, best of 3: 63.4 ms per loop
least squares fit: 
100 loops, best of 3: 13.7 ms per loop

Scipy's BFGS method with the gradient defined comes closest to our Gauss-Newton function. If, for some reason, the Hessian is singular, the BFGS method can help approximate a solution. Otherwise, our function will get us to a suitable solution quickly.

Minimizing without the gradient.

It's also possible to minimize without deriving the gradient by hand, since we can approximate the Jacobian numerically.

In [23]:
myJacApprox = nd.Jacobian(myFun)
myNumGrad = lambda x: np.dot(myJacApprox(x).T, myFun(x))
myNumHess = lambda x: np.dot(myJacApprox(x).T, myJacApprox(x))
pNumN = lambda x: -1 * np.dot(spla.inv(myNumHess(x)), myNumGrad(x))
In [24]:
xGNNum, GNDictNum = myGaussNewton(x0, myObj, myNumGrad, myNumHess, pNumN)
algorithm has terminated after 2 steps
fOut: 19.954174579543107
gradNorm: 1.333118926049019e-13
xMin: [  0.81585752  19.49973142]

In [25]:
GNNumDF = pd.DataFrame(GNDictNum)
x_1, x_2, m = get_contour_data(xGNNum, 100, .005)
fig1 = contourPlot(x_1, x_2, m, 1)
axes = plt.axes()
solScat = axes.scatter(GNNumDF['phi'].values,GNNumDF['mu'].values, marker = 'o', color='yellow', label='Next point')

The numerical approximations also terminate after two steps, yet calculating numerical gradients is costly. Let's compare them with the built-in timeit function.

In [26]:
print('Analytical Jacobian used:')
%timeit myGaussNewton(x0, myObj, myGrad, myHess, pGN, timeIt=True)
print('Jacobian approximate used:')
%timeit myGaussNewton(x0, myObj, myNumGrad, myNumHess, pNumN, timeIt=True)
Analytical Jacobian used:
100 loops, best of 3: 2.33 ms per loop
Jacobian approximate used:
1 loop, best of 3: 1.32 s per loop

The numerical approximation makes the computation about 550 times slower. Clearly it is worth the effort to derive the Jacobian.

Minimizing an AR(3).

We saw that R's arima fit for an AR(3) performed slightly better for the Gibraltar series. Let's compare answers from our Gauss-Newton method.

An AR(3) model has the form:

$$ a_t = Z_t - \mu - \phi_1(Z_{t-1} - \mu) - \phi_2(Z_{t-2} - \mu) - \phi_3(Z_{t-3} - \mu) $$

The conditional expected value for the shocks is

$$ E(a_t | Z, \mu, \phi_1, \phi_2, \phi_3) = (E(Z_t) - \mu)(1 - \phi_1 - \phi_2 - \phi_3) $$

In [27]:
def shock_AR3(z, phi1, phi2, phi3, mu):
    shock = np.zeros(len(z))
    shock[:3] = (np.mean(z) - mu)*(1 - phi1 - phi2 - phi3)
    zdot = np.subtract(z,mu)
    phis = np.array([ -phi3, -phi2, -phi1, 1])
    #pdb.set_trace()
    for tdx in range(4, len(z)+1):
        zVect = zdot[tdx-4:tdx]
        shock[tdx-1] = np.dot(zVect, phis.T)
    #pdb.set_trace()
    return shock

x03 = np.array([0.8,  -0.10,  0.03, 0])
AR3Shocks = shock_AR3(z, x03[0], x03[1], x03[2], x03[3])
AR3Shocks.shape
Out[27]:
(401,)

Getting the initial coordinates requires estimating the additional weights $\phi_p$ by solving a linear system of equations; see Yule-Walker estimators for further details. Writing out the autocorrelation functions, the AR(p) parameters are estimated by

$$ \begin{bmatrix} \hat{\phi}_{1} \\ \hat{\phi}_{2} \\ \vdots \\ \hat{\phi}_{p} \end{bmatrix} = \begin{bmatrix} 1 & \hat{\rho}_{1} & \hat{\rho}_{2} & \dots & \hat{\rho}_{p-1} \\ \hat{\rho}_{1} & 1 & \hat{\rho}_{1} & \dots & \hat{\rho}_{p-2} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \hat{\rho}_{p-1} & \hat{\rho}_{p-2} & \hat{\rho}_{p-3} & \dots & 1 \end{bmatrix}^{-1} \begin{bmatrix} \hat{\rho}_{1} \\ \hat{\rho}_{2} \\ \vdots \\ \hat{\rho}_{p} \end{bmatrix} $$

In [29]:
def get_inital_estimate(z,k=1):
    N = len(z)
    meanZ = np.mean(z)
    # make vector Z_t - meanZ
    lVect = z[k:]-meanZ
    # make vector Z_{t-1} - meanZ
    rVect = z[0:-k] - meanZ
    auto_cov = np.dot(lVect, rVect.T)/np.dot(z-meanZ,z.T-meanZ)
    return np.array([auto_cov, meanZ])

def get_ar3_estimate(z):
    mu = get_inital_estimate(z, 1)[1]
    rhos = np.zeros(3)
    for k in range(1,4):
        rhos[k-1] = get_inital_estimate(z, k)[0]
    P = np.matrix([[1, rhos[0], rhos[1]],
                   [rhos[0], 1, rhos[0]],
                   [rhos[1], rhos[0], 1]])
    R = spla.cholesky(P, lower=False)
    RTR = spla.inv(np.dot(R.T,R))
    phi1, phi2, phi3 = np.dot(RTR, rhos)
    return np.array([phi1, phi2, phi3, mu])

x03 = get_ar3_estimate(z)
print('inital x0 estimate: {}'.format(x03))
inital x0 estimate: [  0.82279846  -0.11763953   0.11818254  19.48976062]

The backcasted AR3 function is written below.

In [30]:
def shock_backcasting_AR3(z, phi1, phi2, phi3, mu):
    zdot = [elem-mu for elem in z]
    tb = 0
    zBack = zdot[0:3]

    while ((np.abs(zBack[1] - zBack[0])) >= .005):
        zBackStep = [ phi1 * zBack[0]
                     + phi2 * zBack[1]
                     + phi3 * zBack[2]]
        zBack = zBackStep + zBack
        tb -= 1
    M = -(tb+1)
    zdot = zBack + zdot[3:]
    shock = np.zeros(len(zdot))
    zdot = np.subtract(z,mu)
    phis = np.array([-phi3, -phi2, -phi1, 1])
    for tdx in range(4, len(z)+1):
        zVect = zdot[tdx-4:tdx]
        shock[tdx-1] = np.dot(zVect, phis.T)
    zBC = np.add(zdot, mu)
    return shock[4:], zBC

bcAR1, zBC = shock_backcasting_AR3(z, x03[0], x03[1], x03[2], x03[3])
print(len(bcAR1))
print(len(zBC))
415
401

The Jacobian for an AR(3) process has four columns.

There are four unknown parameters, $\phi_1$, $\phi_2$, $\phi_3$, and $\mu$, thus $J_{ti}(x) \in \mathbb{R}^{m \times 4}$.

$$J_{ti}(x) = \begin{bmatrix} \frac{\partial r_t}{\partial \phi_1}, \frac{\partial r_t}{\partial \phi_2}, \frac{\partial r_t}{\partial \phi_3}, \frac{\partial r_t}{\partial \mu} \end{bmatrix} = \\ \begin{bmatrix} (-Z_{t-1} + \mu), (-Z_{t-2} + \mu), (-Z_{t-3} + \mu), (-1 + \phi_1 + \phi_2 + \phi_3) \end{bmatrix} $$

In [31]:
def Jac_AR3(z, phi1, phi2, phi3, mu):
    z0 = z[4:]
    z1 = z[3:-1]
    z2 = z[2:-2]
    z3 = z[1:-3]
    dPhi1 = np.add(-1*z1, mu)
    dPhi2 = np.add(-1*z2, mu)
    dPhi3 = np.add(-1*z3, mu)
    dMu = (-1 + phi1 + phi2 + phi3) * np.ones(len(z0))
    
    Jac = np.array([dPhi1, dPhi2, dPhi3, dMu]).T
    
    Jac[0:3, :] = np.zeros([3, 4])
    return Jac

Jac0 = Jac_AR3(zBC, x03[0], x03[1], x03[2], x03[3])
Jac0.shape
Out[31]:
(397, 4)

We again initialize our AR(3) functions

In [32]:
def init_functions(x0):
    zBC = shock_backcasting_AR3(z, x0[0], x0[1], x0[2], x0[3])[1]
    print('length of z: {}'.format(len(z)))
    print('length of backcasted z: {}'.format(len(zBC)))
    myFun = lambda x: shock_AR3(zBC[4:], x[0], x[1], x[2], x[3])
    myJac = lambda x: Jac_AR3(zBC, x[0], x[1], x[2], x[3])
    myGrad = lambda x: np.dot(myJac(x).T, myFun(x))
    myHess = lambda x: np.dot(myJac(x).T, myJac(x))
    myObj = lambda x: .5 * np.dot(myFun(x), myFun(x).T)
    pGN = lambda x: GNP(x, myFun, myGrad, myJac, myHess)
    return myFun, myJac, myGrad, myHess, myObj, pGN


myFun3, myJac3, myGrad3, myHess3, myObj3, pGN3 = init_functions(x03)
fun0 = myFun3(x03)
J0 = myJac3(x03)
F0 = myObj3(x03)
g0 = myGrad3(x03)
H0 = myHess3(x03)
p0 = pGN3(x03)
length of z: 401
length of backcasted z: 401
In [34]:
print(F0)
print(fun0.shape)
print(J0.shape)
print(g0.shape)
print(H0.shape)
print(p0.shape)
18.7004435656
(397,)
(397, 4)
(4,)
(4, 4)
(4,)

Approximating the Jacobian

Again, we should check whether our analytical Jacobian agrees with the numerical approximation.

In [39]:
JacApprox3 = nd.Jacobian(myFun3)
In [40]:
JacApprox3(x03).shape
Out[40]:
(397, 4)
In [41]:
spla.norm(myJac3(x03) - JacApprox3(x03))
Out[41]:
0.30598450635147173

There shouldn't be anything left over... The difference of about 0.3 comes from the first three rows: our analytical Jacobian sets them to zero, while the numerical Jacobian returns small values of about -0.0004475 for the $\phi$ derivatives and about -0.177 for the $\mu$ derivative. We shall soon see that this does not affect the optimal value when we compare with scipy's BFGS method without providing the gradient.

In [43]:
JacApprox3(x03)[0:5, :]
Out[43]:
array([[ -4.47543376e-04,  -4.47543376e-04,  -4.47543376e-04,
         -1.76658536e-01],
       [ -4.47543376e-04,  -4.47543376e-04,  -4.47543376e-04,
         -1.76658536e-01],
       [ -4.47543376e-04,  -4.47543376e-04,  -4.47543376e-04,
         -1.76658536e-01],
       [ -6.92526767e-01,  -6.34118990e-01,   2.62100811e-01,
         -1.76658536e-01],
       [ -1.19638393e-01,  -6.92526767e-01,  -6.34118990e-01,
         -1.76658536e-01]])
In [44]:
myJac3(x03)[0:5, :]
Out[44]:
array([[ 0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ],
       [-0.69252677, -0.63411899,  0.26210081, -0.17665854],
       [-0.11963839, -0.69252677, -0.63411899, -0.17665854]])

AR(3) Model performance

In [48]:
xGN3, GN3Dict = myGaussNewton(x03, myObj3, myGrad3, myHess3, pGN3)
algorithm has terminated after 2 steps
fOut: 18.670361350895927
gradNorm: 1.3439775923736026e-14
xMin: [  0.85991845  -0.15030482   0.12875629  19.49622801]

Here the solution terminates after 2 steps. There is a slight difference between the arima function and our Gauss-Newton method.

$$ x^{arima}_{opt} = [0.8340 \pm 0.0497,\ -0.1145 \pm 0.0646 ,\ 0.1152 \pm 0.0499,\ 19.5094 \pm 0.0949] $$

It is important to consider that R includes a 95% confidence interval for the estimated parameters.

$$ x_{opt}^{GN} = [\phi_1 = 0.8599 ,\ \phi_2 = -0.1503,\ \phi_3 = 0.1288,\ \mu = 19.4962 ] $$

The confidence interval lends support to our Gauss-Newton method, because its solution $x^{GN}_{opt}$ falls within this interval.
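As a quick sanity check, a small sketch using only the numbers quoted above confirms that each Gauss-Newton estimate lands inside the corresponding interval.

import numpy as np

x_gn = np.array([0.8599, -0.1503, 0.1288, 19.4962])
x_arima = np.array([0.8340, -0.1145, 0.1152, 19.5094])
half_width = np.array([0.0497, 0.0646, 0.0499, 0.0949])

# prints [ True  True  True  True]
print(np.abs(x_gn - x_arima) <= half_width)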

The objective function as defined in this notebook sides with Gauss-Newton over the arima results.

In [49]:
x3arima = np.array([0.8340, -0.1145, 0.1152, 19.5094])
print(myObj3(xGN3))
print(myObj3(x3arima))
18.6703613509
18.6877692217
In [50]:
GNDF3 = pd.DataFrame(GN3Dict)
In [51]:
GNDF3['F'].plot()
Out[51]:
In [53]:
GNDF3['normGrad'].plot()
Out[53]:

Comparison to Scipy's minimization libraries.

Using scipy's BFGS minimization without including the Jacobian returns a solution that is comparable to arima's solution, yet noticeably off from it.

$$ x^{arima}_{opt} = [0.8340 \pm 0.0497,\ -0.1145 \pm 0.0646 ,\ 0.1152 \pm 0.0499,\ 19.5094 \pm 0.0949] $$

Scipy's comparison functions side with Gauss-Newton

In [56]:
compare_scipy_minimization(x03, myObj3, myGrad3, myHess3, pGN3)
time_scipy_minimization(x03, myObj3, myGrad3, myHess3, pGN3)
Solutions:
My Gauss Newton: [  0.85991845  -0.15030482   0.12875629  19.49622801]
CG: [  0.85991855  -0.15030493   0.12875613  19.49618251]
BFGS: [  0.85991843  -0.15030475   0.12875606  19.49618247]
BFGS with Gradient: [  0.85991843  -0.15030481   0.1287563   19.49622799]
least squares fit: [  0.85982442  -0.15015244   0.12866277  19.4960783 ]

Timeit function outputs for convergence:
My Gauss Newton: 
100 loops, best of 3: 2.52 ms per loop
CG:
10 loops, best of 3: 81.9 ms per loop
BFGS:
10 loops, best of 3: 50.6 ms per loop
BFGS with Gradient:
10 loops, best of 3: 66.3 ms per loop
least squares fit: 
100 loops, best of 3: 13.9 ms per loop

Note that the additional model complexity does not affect the run time very much.

Conclusion

I find it remarkable that we don't have to depend on constrained optimization algorithms. The model naturally settles on a stable solution, provided the initial starting point is agreeable.

The method of moments, although it is a low-quality estimator, puts the more refined estimators on track.

It is disconcerting that all of the models solved in Python disagree with R's arima function's best fit, but we must remember that arima uses a maximum likelihood method with a Kalman filter.

It's amazing how much the arima() function does. As we near the end of this notebook, we may recall the effort it took to find a specific solution. Although our method is the fastest in this context, I have a much deeper appreciation for the libraries provided by scipy and R. I am in awe of the mechanics behind these functions, their robustness, and their ability to generalize.