RATS 11

Monte Carlo integration is fairly straightforward if the density \(f\) is an easy one from which to generate draws. However, many models do not produce the convenient Normal-gamma or Normal-Wishart posteriors. One method which can be used for more difficult densities is known as importance sampling. For technical details beyond those provided here, see Geweke (1989).

 

Importance sampling attacks an expectation over an unwieldy density function using

\begin{equation}
\begin{split}
E_f \left( h\left( x \right) \right) &= \int h\left( x \right)f\left( x \right)dx \\
&= \int h\left( x \right)\frac{f\left( x \right)}{g\left( x \right)}\,g\left( x \right)dx \\
&= E_g \left( hf/g \right)
\end{split}
\end{equation}

where \(f\) and \(g\) are density functions, with \(g\) having convenient Monte Carlo properties (that is, a density from which it is easy to make random draws).

 

The main problem is to choose a proper \(g\). In most applications, it’s crucial to avoid choosing a \(g\) with tails that are too thin relative to \(f\). If \(f\) and \(g\) are true density functions (that is, they integrate to 1), we can use

\begin{equation}
\hat h = \frac{1}{n}\sum h\left( x_i \right)\frac{f\left( x_i \right)}{g\left( x_i \right)}
\label{eq:boot_importestimate}
\end{equation}

to estimate the desired expectation, where the \(x_i\) are drawn independently from \(g\).
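
For example, here is a minimal RATS sketch of \eqref{eq:boot_importestimate} (illustrative only, not one of the distributed examples). It takes \(f\) to be Normal with mean zero and variance 1/4, \(g\) a standard Normal, and \(h(x)=x^2\), so the target is \(E_f \left( x^2 \right) = .25\). %RAN and %LOGDENSITY are the standard Normal draw and log density functions; the variable names are arbitrary.

compute ndraws=100000
allocate ndraws
*
* Draws from the importance density g = N(0,1)
set x 1 ndraws = %ran(1.0)
*
* Weights f(x)/g(x), computed from the two log densities
set w 1 ndraws = exp(%logdensity(0.25,x)-%logdensity(1.0,x))
*
* h(x)f(x)/g(x); its sample mean estimates E_f[h(x)] = .25
set hw 1 ndraws = x*x*w
sstats(mean) 1 ndraws hw>>hbar
disp "Importance sampling estimate of E[x^2]" hbar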

 

If

\begin{equation}
\int \left| h\left( x \right) \right| f\left( x \right)dx < \infty , \text{ then } \int \left| h\left( x \right)\frac{f\left( x \right)}{g\left( x \right)} \right| g\left( x \right)dx < \infty
\end{equation}

so the strong law of large numbers applies to \eqref{eq:boot_importestimate}, and \(\hat h\) will converge a.s. to \(E_f \left( h\left( x \right) \right)\). However, while the sample means may converge, if the variance of \(hf/g\) doesn’t exist, the convergence may be extremely slow. If, on the other hand, that variance does exist, the Central Limit Theorem will apply, and we can expect convergence at the rate \(n^{1/2} \).

 

To take a trivial example, where the properties can be determined analytically, suppose \(h\left( x \right) = x\), and \(f\) is a Normal with mean zero and variance 4. Suppose we choose as \(g\) a standard Normal. Then

\begin{equation}
f/g = \frac{1}{2}\exp \left( + \frac{3}{8}x^2 \right)
\end{equation}

and

\begin{equation}
\int \left( \frac{h\left( x \right)f\left( x \right)}{g\left( x \right)} \right)^2 g\left( x \right)dx = \int \frac{x^2}{4\sqrt{2\pi}}\exp \left( + \frac{1}{4}x^2 \right)dx = \infty
\end{equation}

Convergence of \(\hat h\) with this choice of \(g\) will be painfully slow.
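
A quick simulation (a hypothetical sketch, not part of the distributed examples) makes the problem visible: repeating \eqref{eq:boot_importestimate} a few times with this pair of densities gives estimates of \(E_f \left( x \right) = 0\) that bounce around erratically, because an occasional draw far out in the tail of \(g\) receives an enormous weight.

compute nobs=10000
allocate nobs
* f = N(0,4), g = N(0,1), h(x) = x; the true value of E_f[x] is 0
do rep=1,5
   set x  1 nobs = %ran(1.0)
   set hw 1 nobs = x*exp(%logdensity(4.0,x)-%logdensity(1.0,x))
   sstats(mean) 1 nobs hw>>hbar
   disp "Replication" rep "estimate" hbar
end do rep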

 

Suppose now that \(f\) is a Normal with mean zero and variance 1/4 and again we choose as \(g\) a standard Normal. Now

\begin{equation}
\int \left( \frac{h\left( x \right)f\left( x \right)}{g\left( x \right)} \right)^2 g\left( x \right)dx = \int \frac{4x^2}{\sqrt{2\pi}}\exp \left( - \frac{7}{2}x^2 \right)dx = \frac{4}{7\sqrt 7} \approx .216
\end{equation}

For this particular \(h\), not only does the importance sampling work, but, in fact, it works better than independent draws from the true \(f\) density. The standard error of the importance sampling estimate is \(\sqrt{.216/n}\), while that for draws from \(f\) would be \(\sqrt{.25/n}\). This result depends on the shape of the \(h\) function: in this case, the importance function gives a lower variance by oversampling the values where \(h\) is larger. If \(h\) goes to zero in the tails, sampling from \(g\) will still work, but won’t do better than draws from \(f\). (A thin-tailed \(g\) works respectably only for such an \(h\).)
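
The same kind of hypothetical sketch can be used to check this comparison: simulate both estimators of \(E_f \left( x \right)\) and compare their numerical standard errors with \(\sqrt{.216/n}\) and \(\sqrt{.25/n}\).

compute ndraws=100000
allocate ndraws
* Importance sampling: draws from g = N(0,1), weighted by f/g with f = N(0,1/4)
set x  1 ndraws = %ran(1.0)
set hw 1 ndraws = x*exp(%logdensity(0.25,x)-%logdensity(1.0,x))
sstats(mean) 1 ndraws hw>>hbar hw*hw>>h2bar
* Direct draws from f for comparison
set z 1 ndraws = %ran(0.5)
sstats(mean) 1 ndraws z>>zbar
disp "IS estimate" hbar "std error" sqrt((h2bar-hbar*hbar)/ndraws)
disp "Direct estimate" zbar "std error" sqrt(.25/ndraws)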

 

The lesson to be learned from this is that, in practice, it’s probably a good idea to be conservative in the choice of \(g\): when in doubt, scale variances up, switch to a fatter-tailed distribution, or both. For instance, the most typical choice for an importance function is the asymptotic distribution from maximum likelihood estimates. You’re likely to get better results if you use a fatter-tailed \(t\) rather than the Normal.
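
As a concrete (again hypothetical) illustration of the first piece of advice, the pathological example above can be repaired simply by overdispersing \(g\): with \(g\) a Normal with variance 9, the ratio \(f/g\) is bounded and the same calculation behaves well. A fat-tailed \(t\) importance function is handled the same way, with the \(t\) log density in place of %LOGDENSITY for \(g\).

compute ndraws=100000
allocate ndraws
* f = N(0,4) as before, but now g = N(0,9), so the weights f/g are bounded
set x  1 ndraws = %ran(3.0)
set hw 1 ndraws = x*exp(%logdensity(4.0,x)-%logdensity(9.0,x))
sstats(mean) 1 ndraws hw>>hbar
disp "Estimate of E[x] with overdispersed g" hbar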

 

All of the above was based upon \(f\) and \(g\) being true density functions. In reality, the integrating constants are either unknown or very complicated. If we don’t know \(f/g\), just \(w = f^*/g^* = cf/g\), where \(c\) is an unknown constant, then

\begin{equation}
\begin{split}
\frac{1}{n}\sum h\left( x_i \right)w\left( x_i \right) &= \frac{1}{n}\sum h\left( x_i \right)\frac{cf\left( x_i \right)}{g\left( x_i \right)} \to cE_f \left( h\left( x \right) \right) \\
\frac{1}{n}\sum w\left( x_i \right) &= \frac{1}{n}\sum \frac{cf\left( x_i \right)}{g\left( x_i \right)} \to cE_f \left( 1 \right) = c \\
\frac{1}{n}\sum \left( h\left( x_i \right)w\left( x_i \right) \right)^2 &= \frac{1}{n}\sum \left( h\left( x_i \right)\frac{cf\left( x_i \right)}{g\left( x_i \right)} \right)^2 \to c^2 E_g \left[ \left( h\left( x \right)\frac{f\left( x \right)}{g\left( x \right)} \right)^2 \right]
\end{split}
\end{equation}

Using the second of these to estimate \(c\) gives the key results

\begin{equation}
\begin{split}
\hat h &= \frac{\sum h\left( x_i \right)w\left( x_i \right)}{\sum w\left( x_i \right)} \\
s_{\hat h}^2 &= \frac{\sum \left( h\left( x_i \right)w\left( x_i \right) \right)^2}{\left( \sum w\left( x_i \right) \right)^2} - \frac{\hat h^2}{n}
\end{split}
\end{equation}
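
In code, these two formulas require only the draws, the weights, and a handful of sums. The following hypothetical sketch uses an unnormalized \(N(0,1/4)\) kernel for \(f^*\) (its integrating constant is deliberately dropped), a standard Normal \(g\), and \(h(x)=x\); the variable names are arbitrary.

compute ndraws=100000
allocate ndraws
* Draws from g = N(0,1)
set x 1 ndraws = %ran(1.0)
* w = f*/g, with f* = exp(-2x^2), a N(0,1/4) kernel without its constant
set w 1 ndraws = exp(-2.0*x*x-%logdensity(1.0,x))
set hw 1 ndraws = x*w
* Sums needed for the ratio estimate and its variance
sstats 1 ndraws w>>sumw hw>>sumhw hw*hw>>sumhw2
compute hhat = sumhw/sumw
compute s2   = sumhw2/(sumw*sumw)-hhat*hhat/ndraws
disp "Weighted estimate" hhat "numerical std error" sqrt(s2)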

Numerical Issues

There’s one additional numerical problem that needs to be avoided in implementing this. Particularly for large parameter spaces, the omitted integrating constants in the density functions can be huge. As a result, a direct calculation of \(w\) can produce machine overflows or underflows. (The typical computer can handle real numbers only up to around \(10^{300}\).) To avoid this, we would advise computing \(w\) by

\begin{equation} \exp \left( {\log f^* \left( x \right) - \log f_{\max }^* - \log g^* \left( x \right) + \log g_{\max }^* } \right) \end{equation}

where \({\log f_{\max }^* }\) and \({\log g_{\max }^* }\) are the (logs of the) maximum values taken by the two kernel functions.
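
Continuing the sketch above, this just means building the weights from the two log kernels after subtracting off their maxima over the draws (this assumes SSTATS accepts the MAX option; the EXTREMUM instruction could be used instead):

* Log kernels at each draw; logf is the log of the unnormalized f* above
set logf 1 ndraws = -2.0*x*x
set logg 1 ndraws = %logdensity(1.0,x)
* Subtract the maxima before exponentiating so exp() stays in machine range
sstats(max) 1 ndraws logf>>logfmax logg>>loggmax
set w 1 ndraws = exp(logf-logfmax-logg+loggmax)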

 

Weighted Quantiles and Density Estimates

All of the above shows how to compute the expectation of a measurable function of the random variable \(x\). If you want to compute quantiles, you need to use the function %WFRACTILES. Quantiles are estimated by sorting the generated values and locating the smallest value for which the cumulated normalized weights exceed the requested quantile. The proof of this is in the Geweke article. If you want to estimate a density function, add the WEIGHT option to your DENSITY instruction.
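
For instance (a hypothetical fragment only; check the exact argument order of %WFRACTILES and the DENSITY option names in the Reference Manual before relying on it), with the draws in X and the weights in W:

* Weighted 5%, 50% and 95% quantiles of the draws
* (assumed argument order: values, weights, fractile points)
compute quants = %wfractiles(x,w,||.05,.50,.95||)
disp "Weighted quantiles" quants
* Weighted kernel density estimate over a grid
density(weight=w) x 1 ndraws xgrid fgrid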

Examples

GARCHIMPORT.RPF is a (relatively) simple example of importance sampling for a univariate GARCH model. MONTESVAR.RPF is an example of importance sampling for analyzing a structural VAR.

 


Copyright © 2025 Thomas A. Doan