Causal Forests

Aims and Assumptions

Support causal inference, including sound formulation of statistical statements, e.g. calculation of confidence intervals. Hence, requires refactoring random forests as a statistical estimator (specifically, aiming for an asymptotic sampling distribution of the ATE). In this regard, the standard assumptions apply, chiefly unconfoundedness - conditional random assignment. There are also two key technical conditions:

  1. Features are continous and bounded.

  2. The conditional mean function E[yX=x]\mathbb{E}[y|X = x] is Lipschitz continuous.

The actual body of research relating to causal forests is essentially defining the statistics of random forests, in a way that leads to the level of statistical foundation that exists for nearest-neighbour or regression.


Honest Trees

Honesty is an underpinning constraint of CFs. Fundamentally, honesty requires splitting the training data, using separate parts for growing the trees versus generating labels (predictions at the leaves). Intuitively, this separates model structure (partitioning of the covariate space) from estimation on that structure (estimating the treatment effect).^[A little bit of a joke, since this is an old principle. What can you expect from business-school stats professors?] This trades sample size for the safety of estimates now based on psuedo-exogenous partitions. Mathematically, predictions are now asymptotically normal (with a large enough sample size - not via the CLT), allowing clean derivation of variance (the author, Susan Athey, conveniently calls it "perfect standard errors").

In the end, this is just a step towards emulating kernel methods as a weighting over observations. I.e. the random forests generates a weighting, allowing all observations to contribute/feed into a treatment effect estimation.

Ergo, attempting to quantify the missing data issue that demarcates prediction (where a ground-truth can be available) vs inference (no observations of the alternative for any individual).

Adaptive Nearest Neighbour Matching

k-NN but matching only on important features. In the case of random forests, this involves adaptively choosing an optimal local metric. The only point to note here is that it's proven that they converge to an asymptotically normal sampling distribution, but the equations only give valid variance estimates and not confidence intervals (see Mentch and Hooker, 2016). The limitation on confidence intervals is that they relate to the expected prediction, and not to the underlying function.

Normally, this is a sample size issue, but approaching infinity is not enough here as it requires consistentcy which due to the nature of random forest, partitioning in tree generation is endogenous and inconsistent. The paper itself abstracts away from this, assuming an "idealised" tree. The only explicit specification is small subsamples for tree generation (versus large random, with replacement, samples).


Improving on k-NN

Causal forests attempt to get assumption free random forests by using sample splitting. By sample splitting, you lose bias (the sample used to build the tree is independent of the samples used in estimation) and need only worry about variance. This is different to standard random forests where you assume each tree is the final estimator, and so you need to optimise in respect of the bias-variance tradeoff. Athey and Imbens (2016) explicitly show this, as well as the fact that the loss in accuracy by splitting your sample is regained by significantly improved coverage in the confidence intervals.

MSE-based Splitting

Unlike standard decision trees, the splits in grf attempt to penalise variance-increasing partitions (based on leaf predictions) and reward partitions with better heterogeneity in treatment effects (this is especially relevant to social science/economic modelling with their grisly naturally clustered data).

This is mostly achieved by the estimated expected MSE of the treatment effect criteria they define, however I'm actually unclear on some of the terms they use. For example, they compute the variance across leaves and then subtract a variance estimator (i.e. a measure of uncertainty), but are not clear on how these are calculated?


Based roughly on the grf implementation:

  1. Separate data into two mutually exclusive samples AA and BB, where N=NA+NBN = N_A + N_B and defaulting to NA=0.5NN_A = 0.5\cdot N.

  2. Generate a tree predicting yy on XX from sample AA (stopping is defined parametrically, typically by capping leaves to some low kk observations). Typically, the tree partitions a subset of the full (observed) covariate space, using a random subsample of AA.

  3. Now use BB to generate predictions. ATE can now be estimated as a difference between the mean of the treatment cases in BB and the mean of the control cases.

Actually Estimating the Variance

They use the infitesimal jackknife. To jackknife is to omit one observation, recompute the estimate, and repea. In the infinitesimal case, omitting one weights that unit to zero, a better approach is to simply reduce it's weight. Too much to discuss here as it's actually a directional gradient corresponding to the hyperplane tangent of an approximation of the statistic being estimated as a multi-dimensional function (nn dimensions where random sample size is nn, input space is the space of all random samples).

I lack the necessary background probably, but the paper's proofs seem a bit hand-wavy? Anyways, the formula of interest is

σ^2=n1n(nns)2i=1nCov[μ^b(x),Nbi]2\hat{\sigma}^2 = \frac{n-1}{n}\left(\frac{n}{n-s}\right)^2\sum_{i=1}^n \mathrm{Cov}_*[\hat{\mu}_b^*(x), N_{bi}^*]^2

where bb is a specific tree in the forest with estimated response μ^b(x)\hat{\mu}_b^*(x) and NbiN_{bi}^* the number of times the observation ii was used in bb.

Asymptotic Normality

The key assumptions mentioned at the start were honesty - building the tree is independent from computing predictions, and Lipschitz continuity on lower order moments. A few further conditions are necessary for the proof:

  1. Every feature has a large enough probability to be used for tree splitting.

  2. That trees are fully grown up to a minimum leaf size.

  3. Every child node does not at a minimum require >20% of the observations of it's parent.

Tau is semiparametrically efficient under unconfoundedness (and CATE = ATE)

Under unconfoudedness τ(x)=E[Y(1)Y(0)X=x]\tau(x) = \mathbb{E}[Y(1) - Y(0) \mid X = x] should be constant across xx, netting

τ^=1ni=1n(Yim^1(Xi))(Wie^1(Xi))1ni=1n(Wie^i(Xi))2\hat{\tau} = \frac{\frac{1}{n}\sum_{i=1}^n\left(Y_i - \hat{m}^{-1}(X_i)\right)\left(W_i - \hat{e}^{-1}(X_i)\right)}{\frac{1}{n}\sum_{i=1}^n\left(W_i - \hat{e}^{-i}(X_i)\right)^2}

e^\hat{e} gives a propensity score and m^\hat{m} the conditional mean of the outcome YY. The former is used to help orthagonalize in respect of propensity to be treated. In GRF it is estimated using a regression forest. The full term

Wie^1(Xi)W_i - \hat{e}^{-1}(X_i)

normalises WiW_i by subtracting from the treatment the variation corresponding to propensity. For example, if the treatment were perfectly binomial, then e^P(WX)=1/2\hat{e} \approx P(W | X) = 1/2 and so the binary treatment WW is shifted (which has a nontrivial effect on leaf means and, thus, the ATE).

Now m^E[YX]\hat{m} \approx \mathcal{E}[Y|X] corresponds to the mean outcome when marginalised by the treatment (i.e. what is the contribution of XX independent of treatment WW, achieved by summing across WW in the marginal distribution). Hence, the term

Yim^1(Xi)Y_i - \hat{m}^{-1}(X_i)

has the effect of subtracting out covariate specific effects. Hence, both terms combine provide a sort of "double debiasing".

A note on dubiousness

It isn't clear, and is possibly reflected in my writing, how causal forests are (under the assumptions posed) unbiased estimators of treatment effect over other estimators on the expected values of yTy|T... The two papers aiming to cover this lacked explicit clarity in this regard.

C.f. alternative methods coupled with Belloni's work.