Mariano Murasawa JAE 2003


This is a replication file for Mariano and Murasawa (2003), which uses a relatively straightforward common-factor state-space model to create a business conditions index, but combines monthly and quarterly observables. (They also do the simpler model with the monthly data only.)

The four monthly data series are

1. employees on nonagricultural payrolls (EMP)

2. personal income less transfer payments (INC)

3. index of industrial production (IIP)

4. manufacturing and trade sales (SALES)

 

while the quarterly series is real GDP (RGDP). All series have been individually de-meaned after taking the log first difference of the raw data, which allows the series to have different trend rates. Note that it's only marginally more complicated to estimate the growth rates as part of the model rather than removing them from the data first; the specification of the model in the paper allows for that, but in the end they opt for the simpler approach.
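As a minimal sketch of that transformation (assuming a hypothetical raw payroll series named PAYROLLS, already read in on a monthly calendar), the EMP series used below could be constructed with:

*
* 100 x month-to-month log growth rate, with the sample mean removed
*
set emp = 100.0*log(payrolls/payrolls{1})
statistics(noprint) emp
set emp = emp-%mean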

 

A basic one-factor model takes the form:

\begin{equation} {\bf{Y}}_t = {\bf{\mu }}_t + {\bf{\beta }}f_t + {\bf{u}}_t \end{equation}

where \({\bf{Y}}_t \) are the observables, \({\bf{\mu }}_t \) are the means, \(f_t\) is the (scalar) factor series, \({\bf{\beta }}\) are the loadings and \({\bf{u}}_t \) is the error process.

 

\(f\) is assumed to be a univariate AR process. There's an identification issue between the scales of \(f\) and \({\bf{\beta }}\). In the combined model, they resolve this by pegging the loading on GDP at 1. In the model with monthly data only, it's resolved by pegging the variance of the shock to \(f\) at 1. (In models of this type, the latter method is simpler to implement.)

There is a more complicated identification issue because there are \(N+1\) relatively unstructured time series processes (\(f\) and the \(N\) components of \({\bf{u}}\)) which combine to produce only \(N\) observables. In theory, knowing the orders of the various processes would be enough; in practice, they aren't known, and, in fact, one of the steps in the modeling procedure is choosing the lag lengths. To avoid any issues, \({\bf{u}}_t \) is assumed to be composed of AR processes that are independent of each other and of \(f\). The independence from \(f\) is routine, since \(f\) is, in effect, the "model". The assumption that the \({\bf{u}}\) are univariate AR's with independent shocks is more than is needed: a full VAR with independent shocks identifies the model just as well, since the requirement that the shocks be independent forces the correlations into the factor term. In addition to the replication programs that use the univariate AR processes (as in the paper), we also include similar programs that allow for more general errors.

Analysis of Monthly Data

There are four programs for the monthly only data:

 

monthlyuv_ic

Runs a joint search over the number of lags of the factor process and of the error process using univariate AR error processes.

 

monthlyvar_ic

Runs a joint search over the number of lags of the factor process and of the error process using a full VAR error process.

 

monthlyuv

Estimates the chosen model with univariate AR error processes.

 

monthlyvar

Estimates the chosen model using a full VAR error process.

 

The programs are written so the differences between the two model types are minor.

 

If \(q\) is the number of lags in the error process, and \(p\) is the number in the factor process, the state vector can be written:

\begin{equation} {\bf{X}}_t = \left[ {\begin{array}{*{20}c} {{\bf{u}}_t } \\ \vdots \\ {{\bf{u}}_{t - q + 1} } \\ {f_t } \\ \vdots \\ {f_{t - p + 1} } \\ \end{array}} \right] \end{equation}

Note that there will always be at least one copy of \({\bf{u}}\) and one of \(f\) even if \(p\) or \(q\) is zero.

 

The system matrices are formed by concatenating the separate system components for \({\bf{u}}\) and \(f\). The system components for \({\bf{u}}\) are most easily handled by using the @VARDLM procedure (with a minor tweak to differentiate between the univariate AR and the full VAR processes) while those for \(f\) can be done using @ARDLM.

 

The common code (after reading the data) starts with:

 

compute n=4

*

dec frml[vect] yfrml

frml yfrml = ||emp,inc,iip,sales||

 

which defines the number of observables and creates a FRML[VECT] which will produce the observables for the later DLM instruction.

 

This will be the VECTOR of loadings from the factors to the variables:

 

dec vect beta(n)

 

This is the number of AR lags in the factor, a VECTOR which will hold the AR coefficients (it will be dimensioned later) and the variable for the variance of the shock to the factor:

 

dec integer p

*

dec vect fphi

dec real fsigsq

 

This is the number of AR/VAR lags in the error process, a VECT[RECT] which will hold the VAR lag coefficient matrices, a VECT[VECT] which will hold the AR coefficients (if those are used) and the VECTOR of error shock variances. UPHI(L) will be the \(N \times N\) matrix of VAR coefficients for lag L, while UPHIAR(L) will be the \(N\)-vector of AR coefficients for lag L. These will be dimensioned later once Q is set.

 

dec integer q

*

dec vect[rect] uphi

dec vect[vect] uphiar

dec vect usigsq(n)

 

These are advance declarations for the system matrices for the error process, and for the factor process:

 

dec rect au fu cu

dec rect af ff cf

 

This defines the set of PARMSET's for the model with univariate AR noise processes. For the full VAR, use UPHI rather than UPHIAR in the NOISEPARMS list. Note that this pegs FSIGSQ to 1.

 

nonlin(parmset=noiseparms) uphiar usigsq

nonlin(parmset=factorparms) fphi fsigsq=1.0

nonlin(parmset=loadings) beta

 

Most of the set up work for the system matrices is done by the following FUNCTION, which creates the system-wide \(\bf{A}\) (ADLM), \(\bf{F}\) (FDLM), \(\bf{C}\) (CDLM) and SW (SWDLM) matrices by combining the system matrices from the two submodels (in order, the error process and the factor process). This is written to use the univariate processes for the errors—it copies them to the full UPHI lag matrices by making those diagonal matrices from the univariate lag coefficients. Commenting out the three lines of the DO L loop will result in the full VAR lag matrices being used.

 

function DLMSetup

local integer l

*

* This is for using univariate AR's for the error process

*

do l=1,q

   compute uphi(l)=%diag(uphiar(l))

end do l

@VARDLM(a=au,poke) uphi

@ARDLM(a=af,poke) fphi

compute adlm=au~\af

compute fdlm=fu~\ff

compute swdlm=%diag(usigsq~~fsigsq)

compute cdlm=cu~~cf*tr(beta)

end DLMSetup

 

Note that all of these matrices are time-invariant—they depend only upon the free parameters. That considerably reduces the calculation time for the state-space model, as these can be computed just once per function evaluation (using the STARTUP option on DLM).

 

Given \(p\) and \(q\), the following dimensions the matrices, sets guess values and estimates the model:

 

This does the coefficients on the factor model (which are initialized to small random numbers):

 

dim fphi(p)

compute fphi=.01*%ranmat(p,1)

     

This does the coefficients for the AR processes on the errors by estimating univariate AR's on each dependent variable. The "VAR" process is guessed to be just the diagonal process from the autoregressions:

 

dim uphi(q)

dim uphiar(q)

ewise uphi(i)=%zeros(n,n)

ewise uphiar(i)=%zeros(n,1)

*

* Generate guess values

*

dofor s = emp inc iip sales

   linreg s

   # s{1 to q}

   compute b=%beta

   compute i=%doforpass

   compute usigsq(i)=%sigmasq

   do l=1,q

      compute uphi(l)(i,i)=b(l)

      compute uphiar(l)(i)=b(l)

   end do l

end dofor s

  

This does guess values when FSIGSQ is pegged to 1 (as is done in the monthly-only model). This assumes 25% of the variance of a series is from the factor part.

     

compute beta=.5*%sqrt(usigsq)

compute fsigsq=1.0

   

This sets up the VAR for the noise:

 

      @VARDLM(a=au,f=fu,c=cu,n=n) uphi

 

The N=N option is needed because it's possible for UPHI to have dimension zero (when testing \(q=0\)) which means that there is nothing else to provide the dimension of the process.

 

This sets up the AR for the factor:

 

      @ARDLM(a=af,f=ff,c=cf) fphi

 

and this does the estimation. The STARTUP option is executed first on each function evaluation to set the ADLM, FDLM, CDLM and SWDLM matrices. This does the full log likelihood maximization, so it produces results that are slightly different from the ones in the paper:

      

dlm(startup=DLMSetup(),y=yfrml,a=adlm,f=fdlm,c=cdlm,sw=swdlm,presample=ergodic,$

        method=bfgs,iters=500,parmset=factorparms+loadings+noiseparms,type=smooth) / fstates

 

The code for producing a table (organized using a REPORT) is:

 

report(use=ictable,action=define,$

  title="Table 5: Lag Order Selection for Factor Model Without Real GDP")

report(use=ictable,atrow=1,atcol=1) "p" "q" "LogL" "AIC" "SBC"

*

* Loop over combinations of p and q

*

do p=0,3

   do q=0,4

      * (code from above to estimate the model given p and q)

      @regcrits(title="With p="+p+" q="+q)

      report(use=ictable,row=new,atcol=1) p q %logl %aic %sbc

   end do q

end do p

*

report(use=ictable,action=format,atcol=3,picture="*.##")

report(use=ictable,action=format,atcol=4,tocol=4,tag=minimum,special=one)

report(use=ictable,action=format,atcol=5,tocol=5,tag=minimum,special=one)

report(use=ictable,action=format,atcol=4,tocol=5,picture="*.####",align=decimal)

report(use=ictable,action=show)

 

The TAG=MINIMUM and SPECIAL=ONE options star the minimum values in the AIC and SBC columns.

 

The results for the search over \(p\) and \(q\) for the univariate model are:

 

 p   q      LogL      AIC      SBC
 0   0   -1294.46   5.1788   5.2459
 0   1   -1262.16   5.0662   5.1669
 0   2   -1215.28   4.8958   5.0300
 0   3   -1201.95   4.8586   5.0265
 0   4   -1198.29   4.8600   5.0614
 1   0   -1219.80   4.8859   4.9614
 1   1   -1196.49   4.8091   4.9182
 1   2   -1158.08   4.6723   4.8149
 1   3   -1145.00   4.6362   4.8124
 1   4   -1141.85   4.6396   4.8493
 2   0   -1214.62   4.8693   4.9532
 2   1   -1193.17   4.7999   4.9174
 2   2   -1156.90   4.6716   4.8226
 2   3   -1144.35   4.6376   4.8222
 2   4   -1141.09   4.6405   4.8587
 3   0   -1213.97   4.8707   4.9630
 3   1   -1192.58   4.8015   4.9274
 3   2   -1156.19   4.6727   4.8322
 3   3   -1143.92   4.6398   4.8328
 3   4   -1140.65   4.6428   4.8693

 

The log likelihood is slightly different because they use the approximate likelihood (starting with a pre-sample state variance matrix of zero rather than the ergodic value). The paper also uses an alternative definition of the information criteria which does not include the -2 factor, so it looks for a maximum rather than a minimum. Their Table 5 has 1,2 rather than 1,3 as the best SBC, though in both calculations 1,2 and 1,3 are nearly identical. Since the detailed calculations in the paper use 1,2, we will do that as well.
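For reference, the tabulated values are consistent with the per-observation form of the criteria produced by @REGCRITS (the %AIC and %SBC used in the report), with \(k\) the number of free parameters and \(T\) the number of usable observations (503 here):

\begin{equation} AIC = \frac{{ - 2\log L + 2k}}{T},\quad SBC = \frac{{ - 2\log L + k\log T}}{T} \end{equation}

so smaller is better, while the paper's versions (without the -2 factor) are larger-is-better.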

 

The information criteria for the full VAR error process are:

 

 p   q      LogL      AIC      SBC
 0   0   -1294.46   5.1788   5.2459
 0   1   -1132.76   4.5995   4.8008
 0   2   -1093.42   4.5067   4.8423
 0   3   -1059.90   4.4370   4.9069
 0   4   -1041.51   4.4275   5.0316
 1   0   -1219.80   4.8859   4.9614
 1   1   -1121.68   4.5594   4.7691
 1   2   -1083.82   4.4724   4.8165
 1   3   -1059.16   4.4380   4.9163
 1   4   -1038.51   4.4195   5.0321
 2   0   -1214.62   4.8693   4.9532
 2   1   -1121.67   4.5633   4.7814
 2   2   -1083.79   4.4763   4.8287
 2   3   -1059.02   4.4415   4.9281
 2   4   -1038.44   4.4232   5.0442
 3   0   -1213.97   4.8707   4.9630
 3   1   -1121.51   4.5666   4.7932
 3   2   -1083.73   4.4801   4.8409
 3   3   -1057.92   4.4411   4.9361
 3   4   -1038.48   4.4274   5.0567

 

Despite the often much higher parameter count, the VAR achieves a better minimum SBC and a substantially better minimum AIC. However, when the monthlyuv and monthlyvar programs (which run the chosen models) are run, the VAR doesn't really have the desired property of picking up the business cycle: the first graph is the factor generated from the detrended data with the univariate models, while the second is generated with the VAR. The shaded zones are the NBER recessions; the first graph rather clearly has the factor declining (often sharply) during the recessions, while the second shows very little relationship between the movement in the factor and the recessions.
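Either graph can be generated from the smoothed states saved by the DLM instruction. A minimal sketch (the FACTOR and RECESS names are illustrative; the factor is the state just past the %ROWS(AU) error states, and @NBERCYCLES provides the recession shading):

set factor = fstates(t)(%rows(au)+1)
@nbercycles(down=recess)
graph(footer="Smoothed factor with NBER recessions shaded",shading=recess)
# factor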

[Graphs: smoothed factor from the univariate-AR-error model and from the VAR-error model, with NBER recessions shaded]

Analysis of Mixed Frequency Data

The DLM instruction is designed to allow for missing values in components of the \(\bf{Y}\) vector, so in simpler situations it would be possible to just input to DLM data for the quarterly series with missing values for the first two months of each quarter. That would only be correct, however, if the observed quarterly data represented information for the 3rd month of the quarter alone. GDP (like most quarterly National Income data) is instead a flow, representing an annualized measure of production for the entire quarter, not just the 3rd month. So that GDP can be modeled in a form similar to the monthly data, it is assumed that there is an unobservable monthly series \(y_t^*\) which represents the (annualized) growth of GDP during the month. The authors demonstrate that, with that definition, the observable (annualized) quarterly growth of GDP can be written as

\begin{equation} \log \,{Y_t} - \log {Y_{t - 3}} = \frac{1}{3}\left( {y_t^* + 2y_{t - 1}^* + 3y_{t - 2}^* + 2y_{t - 3}^* + y_{t - 4}^*} \right) \end{equation}
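To see where the weights come from (a quick check, using the paper's approximation that the quarterly level is the geometric mean of the three monthly levels \(Y_t^m\), so \(\log Y_t = \frac{1}{3}\left( {\log Y_t^m + \log Y_{t - 1}^m + \log Y_{t - 2}^m } \right)\), and writing \(y_t^* = \log Y_t^m - \log Y_{t - 1}^m\)):

\begin{equation} \log Y_t - \log Y_{t - 3} = \frac{1}{3}\sum\limits_{j = 0}^2 {\left( {\log Y_{t - j}^m - \log Y_{t - j - 3}^m } \right)} = \frac{1}{3}\sum\limits_{j = 0}^2 {\left( {y_{t - j}^* + y_{t - j - 1}^* + y_{t - j - 2}^* } \right)} \end{equation}

and collecting terms gives the 1, 2, 3, 2, 1 weights above.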

Note that this requires the current value and 4 lags of \({y_t^*}\). Note also that, as the state-space model is set up, \({y_t^*}\) is not actually one of the states: it can be derived from the actual states, \(f_t\) and \({\bf{u}}_t \). To create the measurement equation as shown in the paper would require that the state vector include at least the current value and four lags of \(f\) and \({\bf{u}}\), which is why they specify the model as using the current value and four lags for those states regardless of how many are actually needed. This can be done by setting those components up as AR(5) processes with any unused lags in the AR process zeroed out in the first line of the A matrix for a component. (The @ARDLM and @VARDLM procedures do that automatically if the LAGS option is used in setting up the matrices.) However, that still leaves a very messy measurement equation.

 

In this case, a simpler way of handling this is to add \({y^*}\) itself to the state vector (along with four lags), but to leave all the other processes at the size required by their lag count. This makes it simpler to move that part of the analysis to a different model. Now, \({y_t^*}\) is a function of the current \(f_t\) and \({\bf{u}}_t \). To make it part of a standard state-space model, it has to be defined using the lagged states and the current shocks. Since \(y_t^* = {\beta _1}{f_t} + {u_{1,t}}\), the row of the A and F matrices defining \({y_t^*}\) is built from \({\beta _1}\) times the elements defining \({f_t}\) plus 1 times the elements defining \({u_{1,t}}\).
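Spelled out (a sketch of the bookkeeping, with \(A_u\) and \(A_f\) the transition blocks for the \({\bf{u}}\) and \(f\) states, \({\bf{x}}_{u,t - 1}\) and \({\bf{x}}_{f,t - 1}\) the corresponding lagged state blocks, and \(\varepsilon _{1,t}\), \(\varepsilon _{f,t}\) the current shocks to \(u_1\) and \(f\)):

\begin{equation} y_t^* = {\beta _1}{f_t} + {u_{1,t}} = A_u (1,:)\,{\bf{x}}_{u,t - 1} + {\beta _1}A_f (1,:)\,{\bf{x}}_{f,t - 1} + \varepsilon _{1,t} + {\beta _1}\varepsilon _{f,t} \end{equation}

which is exactly the row that the ADLM and FYSTAR lines in the function below build.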

 

There are two programs for analyzing the data with the mixed frequencies:

 

mixed_ic

Runs a joint search over the number of lags of the factor process and of the error process.

 

mixed

Estimates the chosen model.

 

As with the monthly models, the two programs are quite similar to each other. They are also quite similar to the "monthlyuv" programs. Among the differences:

 

*

* Number of observables

*

compute n=5

*

* FRML to fetch the observables

*

dec frml[vect] yfrml

frml yfrml = ||rgdp,emp,inc,iip,sales||

 

RGDP is 100 x the quarter-to-quarter log growth rate, mean removed, which is defined only for the 3rd month within each quarter. The others, as before, are 100 x the month-to-month log growth rates with the mean removed.

 

For the purpose of computing guess values (particularly variances) it's useful to create a simple monthly version of RGDP. That can be done with the ESMOOTH instruction:

 

esmooth(alpha=.8,smoothed=rgdpm) rgdp

set rgdpm = rgdpm/3.0

 

The divide by 3 converts it into rough month-to-month growth rates. Again, this is used solely to create guess values.

 

This sets the number of lags (including current) needed for y* states to make the measurement equation:

 

compute ylags=5

 

This creates the C matrix for loading the y*'s onto the observed GDP data. (It has to include columns for all the observables, but the other columns are zeros). This is a fixed matrix, independent of the number of states in the rest of the model.

 

dec rect cystar(ylags,n)

ewise cystar(i,j)=%if(j==1,%if(i<=3,i/3.0,(6-i)/3.0),0.0)
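With YLAGS=5, the first column of CYSTAR works out to \(\left( {1/3,\;2/3,\;1,\;2/3,\;1/3} \right)'\), matching the weights in the aggregation equation above; the other columns are zero.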

 

This sets up the "transition" matrix for the y*'s. It doesn't do anything but handle the lag shifts. The top row is then deleted, since it will be defined using the other states.

 

@ARDLM(a=aystar,lags=ylags)

compute aystar=%xsubmat(aystar,2,%rows(aystar),1,%cols(aystar))

 

And this declares the other system matrices for y*, which will be created later.

 

dec rect fystar

 

The setup of the subprocesses for \(f\) and \({\bf{u}}\) is the same as for the monthly model.

 

The PARMSETs for estimation are defined somewhat differently, since for the mixed model the authors pegged the loading for GDP at 1 and left FSIGSQ free. Note that it's simpler and safer to use the FSIGSQ peg, since that's a single parameter which has to be non-zero anyway; however, we follow the authors' lead.

 

nonlin(parmset=noiseparms) uphiar usigsq

nonlin(parmset=factorparms) fphi fsigsq

nonlin(parmset=loadings) beta beta(1)=1.0

 

The DLMSetup FUNCTION starts out the same, but adds extra lines to define y*:

 

The F matrix for the y* process is relatively simple: there are \(n+1\) fundamental shocks, but only the shocks to \(u_1\) and to \(f\) enter, and only into the current y*; the rest of the matrix is zero. The overall F matrix is then the diagonal concatenation of FU and FF (as is appropriate when "adding" submodels) vertically concatenated with FYSTAR, since y* adds no new shocks but just uses the existing ones.

 

The A matrix requires inserting a row for defining how y* is generated from the lag states of \(\bf{u}\) (first row of AU) and the lag states of \(f\) (BETA(1) times the first row of AF). ADLM is created from that diagonally concatenated with the AYSTAR matrix which takes care of the lags of y*.

 

SWDLM is the same as before since there are no new shocks.

 

CDLM needs a minor adjustment: we do not want the standard u(1) + beta(1) x factor loading on the first observable, since the CYSTAR loadings onto y* and its lags take care of that. So we temporarily zero out BETA(1) when creating CDLM and eliminate the unit weight on u(1).

 

function DLMSetup

local integer l

local real actualBeta

*

* This is for using univariate AR's for the error process

*

do l=1,q

   compute uphi(l)=%diag(uphiar(l))

end do l

*

@VARDLM(a=au,f=fu,poke) uphi

@ARDLM(a=af,f=ff,poke) fphi

*

* Create the F matrix which loads the noises onto the y*

*

compute fystar=%zeros(ylags,%cols(fu)+%cols(ff))

compute fystar(1,1)=1.0,fystar(1,n+1)=beta(1)

*

compute adlm=(au~\af)~~(%xrow(au,1)~beta(1)*%xrow(af,1))~\aystar

compute fdlm=(fu~\ff)~~fystar

compute swdlm=%diag(usigsq~~fsigsq)

*

* Knock out the standard loadings for the factor (temporarily zero out

* beta(1)) and for u(1) since the observable GDP growth uses the weights

* on the y*s (through the CYSTAR matrix).

*

compute cu(1,1)=0.0

compute actualBeta=beta(1),beta(1)=0.0

compute cdlm=cu~~cf*tr(beta)~~cystar

compute beta(1)=actualBeta

*

end

 

The only other change compared to the monthly models is the calculation of the guess values. First, the guess values for GDP are done using the smoothed monthly version. (Again, these are guess values, so extreme accuracy isn't required.) Also, because this model chooses a different scale normalization, the guess values for BETA and FSIGSQ are computed as before, but then backed out to make BETA(1) equal to 1 and FSIGSQ free:

 

compute beta=.5*%sqrt(usigsq)

compute pegbeta=beta(1)

compute fsigsq=pegbeta^2

compute beta=beta/pegbeta
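This rescaling leaves each product \(\beta _i^2 \sigma _f^2\) (the factor's contribution to the variance of series \(i\)) unchanged, so the guess values imply the same covariance structure under either normalization.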

 

The search for lag length produces the following:

 

 p   q      LogL      AIC      SBC
 0   0   -1436.98   5.7574   5.8497
 0   1   -1389.47   5.5884   5.7226
 0   2   -1340.85   5.4149   5.5911
 0   3   -1324.05   5.3680   5.5861
 0   4   -1320.43   5.3735   5.6336
 1   0   -1366.48   5.4810   5.5817
 1   1   -1327.99   5.3479   5.4905
 1   2   -1284.95   5.1966   5.3812
 1   3   -1272.66   5.1676   5.3942
 1   4   -1265.17   5.1577   5.4262
 2   0   -1363.19   5.4719   5.5810
 2   1   -1325.48   5.3419   5.4929
 2   2   -1283.62   5.1953   5.3883
 2   3   -1271.78   5.1681   5.4030
 2   4   -1264.29   5.1582   5.4351
 3   0   -1362.73   5.4741   5.5915
 3   1   -1325.23   5.3449   5.5043
 3   2   -1282.76   5.1959   5.3973
 3   3   -1271.17   5.1697   5.4130
 3   4   -1263.74   5.1600   5.4453

 

The log likelihoods aren't comparable with the paper's because the calculations in the paper added an extra term to fill out the missing values in GDP; that doesn't affect the inference, but it does change the values of the log likelihood. (This isn't necessary with RATS, as DLM is designed to handle partially missing data vectors.)

 

Estimating the model with the best SBC values (p=1, q=2) yields:

 

DLM - Estimation by BFGS
Convergence in 50 Iterations. Final criterion was 0.0000093 <= 0.0000100
Monthly Data From 1959:02 To 2000:12
Usable Observations           503
Rank of Observables          2179
Log Likelihood         -1284.9512

    Variable          Coeff    Std Error     T-Stat    Signif
 1. FPHI(1)           0.5636      0.0450    12.5352    0.0000
 2. FSIGSQ            0.0738      0.0093     7.9167    0.0000
 3. BETA(1)           1.0000      0.0000     0.0000    0.0000
 4. BETA(2)           0.5025      0.0408    12.3011    0.0000
 5. BETA(3)           0.8128      0.0630    12.9005    0.0000
 6. BETA(4)           2.1631      0.1319    16.3974    0.0000
 7. BETA(5)           1.7503      0.1174    14.9074    0.0000
 8. UPHIAR(1)(1)      0.7649      0.1603     4.7701    0.0000
 9. UPHIAR(1)(2)      0.0984      0.0455     2.1647    0.0304
10. UPHIAR(1)(3)     -0.0489      0.0553    -0.8853    0.3760
11. UPHIAR(1)(4)     -0.0419      0.0622    -0.6741    0.5002
12. UPHIAR(1)(5)     -0.4114      0.0494    -8.3219    0.0000
13. UPHIAR(2)(1)     -0.6204      0.0743    -8.3526    0.0000
14. UPHIAR(2)(2)      0.4503      0.0482     9.3377    0.0000
15. UPHIAR(2)(3)      0.0367      0.0573     0.6401    0.5221
16. UPHIAR(2)(4)     -0.0610      0.0597    -1.0208    0.3074
17. UPHIAR(2)(5)     -0.1946      0.0471    -4.1309    0.0000
18. USIGSQ(1)         0.0413      0.0157     2.6316    0.0085
19. USIGSQ(2)         0.0174      0.0016    10.5870    0.0000
20. USIGSQ(3)         0.0883      0.0061    14.4349    0.0000
21. USIGSQ(4)         0.2518      0.0249    10.1179    0.0000
22. USIGSQ(5)         0.6130      0.0436    14.0547    0.0000

 

This pretty much reproduces what the authors show, except for \({\phi _{u,1}}(GDP)\) (UPHIAR(1)(1) in the RATS output) and \({\phi _{u,2}}(GDP)\) (UPHIAR(2)(1)). \({\phi _{u,1}}(GDP)\) might be a typo in the paper, since the noise process for y* would be expected to be positively serially correlated.

 

The final step is creating the "New Coincident Index", which takes the smoothed factor out of the detrended data, adds back the (monthly) mean growth rate of GDP, accumulates that and exponentiates.

 

set f = fstates(t)(%rows(au)+1)

set f = (f+.80/3)/100.0

acc f / newbci

set newbci = exp(newbci)

*

@nbercycles(down=recess)

graph(footer="Figure 1: New Coincident Index",shading=recess)

# newbci

 


Copyright © 2026 Thomas A. Doan