Diagnostics in Large Samples
If you do quite a bit of work with large data sets (typical of financial models), you may find it frustratingly difficult to get a model which “passes” a standard set of diagnostic tests for model adequacy. This, however, is a well-known (though rarely discussed) problem when dealing with real-world data. In Berkson (1938), the author points out that tests for Normality will almost always reject in very large data sets, with p-values “beyond any usual limit of significance.” The key is that we’re talking about real-world data. There have been dozens, if not hundreds, of papers published in the literature which propose a test, demonstrate that it has (asymptotically) the correct size when simulated under the null, demonstrate that it has power against simulated alternatives, and ... that’s the end of the paper. What’s missing in many of these papers is any attempt to demonstrate how the test works with real data, not simulated data. If the new test doesn’t produce qualitatively different results from an existing (probably simpler) test, is there any real point to it?
A very readable discussion of what’s happening with tests in large data sets is provided in Leamer (1974), Chapter 4. Effectively no model is “correct” when applied to real-world data. Consider, for instance, Berkson’s example of a test for Normality. In practice, actual data is likely to be bounded—and the Normal distribution isn’t. Various Central Limit Theorems show that sums of many relatively small independent (or weakly correlated) components approach the Normal, but in practice, some of the components will probably be a bit more dominant than the CLT conditions allow.
While Berkson was talking about using a chi-squared test, it’s simpler to look at what happens with the commonly used Jarque-Bera test. Suppose we generate a sample from a \(t\) with 50 degrees of freedom (thus close to, but not quite, Normal). The following is a not atypical set of results from applying the Jarque-Bera test to samples of varying sizes:
\begin{tabular}{rrrr}
N & Excess Kurtosis & JB & Signif \\
\hline
100 & 0.243 & 0.448 & 0.799469 \\
500 & 0.287 & 4.071 & 0.130635 \\
1000 & 0.230 & 2.638 & 0.267441 \\
2000 & 0.208 & 3.676 & 0.159127 \\
5000 & 0.201 & 8.507 & 0.014212 \\
10000 & 0.201 & 16.883 & 0.000216 \\
100000 & 0.152 & 96.544 & 0.000000
\end{tabular}
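If you want to see this effect for yourself, the following is a minimal sketch of the same kind of experiment in Python (using numpy and scipy rather than the software that produced the table, so the specific numbers will differ with the random draws, though the collapse of the p-values as N grows should reproduce):
\begin{verbatim}
# Sketch: draw t(50) samples of increasing size and apply the
# Jarque-Bera test. The pattern to watch is the p-value heading
# toward zero as the sample size grows.
import numpy as np
from scipy import stats

rng = np.random.default_rng(53531)   # arbitrary seed, for reproducibility

for n in (100, 500, 1000, 2000, 5000, 10000, 100000):
    x = rng.standard_t(df=50, size=n)   # close to, but not quite, Normal
    jb, pval = stats.jarque_bera(x)     # JB statistic and its p-value
    excess_kurt = stats.kurtosis(x)     # Fisher definition: Normal => 0
    print(f"{n:>7d}  {excess_kurt:7.3f}  {jb:8.3f}  {pval:.6f}")
\end{verbatim}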
The N=100000 line would be an example of the “beyond any usual limit of significance.” The theoretical excess kurtosis for a t(50) is roughly .13 (it’s \(6/(\nu-4)=6/46\))—not a particularly heavy-tailed distribution, and one which is barely distinguishable in any meaningful way from the Normal. If we stick with the conventional significance level of .05 regardless of sample size, we are choosing to hold the probability of a Type I error at .05 while driving the probability of a Type II error effectively to zero. As we get more and more data, we should be able to push down the probabilities of both types of errors, and a testing procedure which doesn't is hard to justify.
The JB test has an asymptotic \(\chi^2_2\) distribution, which has a .05 critical value of 5.99, a .01 critical value of 9.21 and a .001 critical value of 13.81. In effect, JB estimates two extra parameters (the skewness and excess kurtosis) and asks whether they differ from the 0 values that they would have if the distribution were Normal. If we look at how the SBC (or BIC) would handle the decision about allowing for those two extra parameters, its "critical value" (the point at which we would choose the larger model) is \(2\log T\), which is 9.2 for T=100, 13.8 for T=1000, 18.4 for T=10000 and 23.0 for T=100000. Eventually (by T=100000), the data evidence in favor of the (correct) non-Normal distribution is strong enough to cause us to choose it; at T=10000 the decision is somewhat marginal (16.883 vs a critical value of 18.4); but at the smaller sample sizes, we would rather strongly favor the simpler model—the difference between the t(50) and the Normal just isn't apparent enough until we have a truly enormous amount of data.
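For readers who want to verify these cutoffs, here is a short illustrative Python sketch computing the \(\chi^2_2\) critical values and the SBC-style \(2\log T\) thresholds:
\begin{verbatim}
import numpy as np
from scipy import stats

# chi-squared(2) critical values quoted above
for p in (0.05, 0.01, 0.001):
    print(f"chi-sq(2) {p} critical value: {stats.chi2.ppf(1 - p, df=2):.2f}")

# SBC/BIC-style cutoff of 2*log(T) for adding two parameters
for T in (100, 1000, 10000, 100000):
    print(f"T = {T:>6d}: 2*log(T) = {2 * np.log(T):.1f}")
\end{verbatim}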
Now JB has degrees of freedom fixed at 2. Let’s look at a typical diagnostic test in time series analysis like the Ljung-Box Q test for serial correlation.
\begin{equation} Q = T(T + 2)\sum_{k = 1}^{h} \frac{\hat\rho_k^2}{T - k} \end{equation}
where \(\hat \rho _k \) is the lag \(k\) autocorrelation (typically of residuals in some form). This is easier to examine more carefully if we take out some of the small-sample corrections and look at the simpler (asymptotically equivalent) Box-Pierce test
\begin{equation} Q = T\sum_{k = 1}^{h} \hat\rho_k^2 \label{eq:bigdiagnostics_boxpierce} \end{equation}
In practice, \(h\) is often fairly large (10 or more) and, in fact, the recommendation is that it increase (slowly!) with \(T\). However, let's fix \(h=25\) and suppose we have T=2500 (a not uncommon size for a GARCH model). The .05 critical value for a \(\chi^2_{25}\) is 37.7 and the .01 critical value is 44.3. Because of the \(T\) multiplier in \eqref{eq:bigdiagnostics_boxpierce}, the size of the correlations needed to trigger a "significant" result is quite small—if the typical autocorrelation is a mere .03, it generates a \(Q\) of 56.3, which has a p-value of .0003. That's despite the fact that autocorrelations of .03 are probably not correctable by any change you could make to the model: they're statistically significant, but practically insignificant. This is a common problem with high-degree-of-freedom tests. The \(h\log T\) "critical value" suggested by the SBC (which would be 195.6 here) is a bit extravagant, since we aren't really estimating a model with 25 extra parameters, but we should not be at all surprised to see a fairly reasonable-looking model on a data set of this size produce a \(Q\) statistic at least 2 or 3 times \(h\), without the test suggesting any change that would improve the model.
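To make the arithmetic concrete, the following Python sketch implements the two Q statistics directly from the formulas above and reproduces the back-of-the-envelope calculation for T=2500, h=25 and uniform autocorrelations of .03 (the function names are purely illustrative):
\begin{verbatim}
# The Box-Pierce and Ljung-Box statistics written directly from the
# formulas above, plus the calculation described in the text.
import numpy as np
from scipy import stats

def box_pierce(rho, T):
    """Q = T * sum of squared autocorrelations (rho = lags 1..h)."""
    rho = np.asarray(rho, dtype=float)
    return T * np.sum(rho ** 2)

def ljung_box(rho, T):
    """Q = T(T+2) * sum of rho_k^2 / (T - k)."""
    rho = np.asarray(rho, dtype=float)
    k = np.arange(1, len(rho) + 1)
    return T * (T + 2) * np.sum(rho ** 2 / (T - k))

T, h = 2500, 25
rho = np.full(h, 0.03)               # "typical" autocorrelation of .03
Q = box_pierce(rho, T)
print(f"Q = {Q:.1f}, p-value = {stats.chi2.sf(Q, df=h):.4f}")
print(f".05 critical value = {stats.chi2.ppf(0.95, df=h):.1f}")
print(f"SBC-style cutoff h*log(T) = {h * np.log(T):.1f}")
\end{verbatim}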
This is not to suggest that you simply brush aside results of the diagnostic tests. For instance, if residual autocorrelations are large on the small lags (1 and 2, particularly), that suggests that you need longer lags in the mean model—it’s a "significant result" triggered by (relatively) large correlations at odd locations like 7 and 17 that is unlikely to be improved by a change to the model. One simple way to check whether there is anything to those is to compute the diagnostic on a split sample. What you would typically see is that the pattern of "large" autocorrelations is completely different on the subsamples and thus isn’t a systematic failure of the model. Remember that the point of the diagnostics is to either suggest a better model or warn you of a serious problem with the one you’re using. This type of "significant" test does neither.
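Here is a sketch of the split-sample check just described, assuming resids holds the residuals from whatever model you're working with (below it is just a placeholder series); the idea is to see whether the lags with the largest autocorrelations match up across the two halves:
\begin{verbatim}
# Split-sample check: compute the residual autocorrelations on each half
# of the sample and compare where the "large" ones show up.
import numpy as np

def sample_acf(x, h):
    """Autocorrelations at lags 1..h (standard estimator)."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = np.sum(x ** 2)
    return np.array([np.sum(x[k:] * x[:-k]) / denom for k in range(1, h + 1)])

def largest_lags(x, h, top=5):
    """The lags with the largest absolute autocorrelations."""
    acf = sample_acf(x, h)
    order = np.argsort(-np.abs(acf))[:top]
    return [(int(k) + 1, round(float(acf[k]), 3)) for k in order]

h = 25
# Placeholder residuals; replace with the residuals from your own model.
resids = np.random.default_rng(0).standard_normal(2500)
half = len(resids) // 2
print("First half :", largest_lags(resids[:half], h))
print("Second half:", largest_lags(resids[half:], h))
# If the "large" lags are completely different across the halves, the
# significant Q is unlikely to reflect a systematic failure of the model.
\end{verbatim}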