Support causal inference, including sound formulation of statistical statements, e.g. calculation of confidence intervals. Hence, this requires recasting random forests as a statistical estimator (specifically, aiming for an asymptotic sampling distribution of the ATE). In this regard, the standard assumptions apply, chiefly unconfoundedness, i.e. conditionally random assignment. There are also two key technical conditions:
- Features are continuous and bounded.
- The conditional mean function is Lipschitz continuous.
The actual body of research relating to causal forests is essentially defining the statistics of random forests, in a way that leads to the level of statistical foundation that exists for nearest-neighbour or regression.
Honesty is an underpinning constraint of CFs. Fundamentally, honesty requires splitting the training data, using separate parts for growing the trees versus generating labels (predictions at the leaves). Intuitively, this separates model structure (partitioning of the covariate space) from estimation on that structure (estimating the treatment effect).^[A little bit of a joke, since this is an old principle. What can you expect from business-school stats professors?] This trades sample size for the safety of estimates now based on pseudo-exogenous partitions. Mathematically, predictions are now asymptotically normal (with a large enough sample size - not via the CLT), allowing clean derivation of variance (the author, Susan Athey, conveniently calls it "perfect standard errors").
In the end, this is just a step towards emulating kernel methods as a weighting over observations. I.e. the random forest generates a weighting, allowing all observations to contribute to (feed into) a treatment effect estimate.
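The weighting view can be sketched in a few lines. This is a minimal illustration and not grf's implementation: the function names `forest_weights` and `weighted_effect`, the leaf-id inputs, and the toy data are all hypothetical. Each training observation is weighted by how often it shares a leaf with the query point (scaled by leaf size and averaged over trees), and the effect is taken as a weighted difference in means, which is a simplification chosen for brevity.

```python
def forest_weights(train_leaves, query_leaves):
    """train_leaves[b][i]: leaf id of training observation i in tree b.
    query_leaves[b]: leaf id of the query point in tree b.
    Returns one weight per training observation (hypothetical helper)."""
    n_trees = len(train_leaves)
    n_obs = len(train_leaves[0])
    alpha = [0.0] * n_obs
    for leaves, q in zip(train_leaves, query_leaves):
        members = [i for i, leaf in enumerate(leaves) if leaf == q]
        for i in members:
            # Each co-leaf member shares the tree's vote equally.
            alpha[i] += 1.0 / (len(members) * n_trees)
    return alpha


def weighted_effect(alpha, y, w):
    """Weighted mean outcome of treated (w=1) minus that of control (w=0)."""
    t_wt = sum(a for a, wi in zip(alpha, w) if wi == 1)
    c_wt = sum(a for a, wi in zip(alpha, w) if wi == 0)
    treated = sum(a * yi for a, yi, wi in zip(alpha, y, w) if wi == 1) / t_wt
    control = sum(a * yi for a, yi, wi in zip(alpha, y, w) if wi == 0) / c_wt
    return treated - control


# Toy input: two trees, four training observations.
train_leaves = [[0, 0, 1, 1], [0, 1, 1, 1]]
query_leaves = [0, 1]
y = [1.0, 3.0, 2.0, 4.0]
w = [0, 1, 0, 1]

alpha = forest_weights(train_leaves, query_leaves)
effect = weighted_effect(alpha, y, w)
print(alpha, effect)
```

Observations that never share a leaf with the query point get zero weight, which is what makes this behave like an adaptive kernel.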
Ergo, attempting to quantify the missing data issue that demarcates prediction (where a ground-truth can be available) vs inference (no observations of the alternative for any individual).
k-NN but matching only on important features. In the case of random forests, this involves adaptively choosing an optimal local metric. The only point to note here is that it's proven that they converge to an asymptotically normal sampling distribution, but the equations only give valid variance estimates and not confidence intervals (see Mentch and Hooker, 2016). The limitation on confidence intervals is that they relate to the expected prediction, and not to the underlying function.
Normally, this is a sample size issue, but letting the sample size approach infinity is not enough here: it also requires consistency, and by the nature of random forests the partitioning in tree generation is endogenous and inconsistent. The paper itself abstracts away from this, assuming an "idealised" tree. The only explicit specification is small subsamples for tree generation (versus large random samples drawn with replacement).
Causal forests attempt to get assumption-free random forests by using sample splitting. By sample splitting, you remove the bias (the sample used to build the tree is independent of the sample used in estimation) and need only worry about variance. This is different to standard random forests, where you assume each tree is the final estimator and so need to optimise with respect to the bias-variance tradeoff. Athey and Imbens (2016) explicitly show this, as well as the fact that the loss in accuracy from splitting your sample is regained through significantly improved coverage of the confidence intervals.
Unlike standard decision trees, the splits in grf attempt to penalise variance-increasing partitions (based on leaf predictions) and reward partitions with better heterogeneity in treatment effects (this is especially relevant to social science/economic modelling with their grisly naturally clustered data).
This is mostly achieved by the estimated expected MSE of the treatment effect criterion they define; however, I'm actually unclear on some of the terms they use. For example, they compute the variance across leaves and then subtract a variance estimator (i.e. a measure of uncertainty), but are not clear on how these are calculated.
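One possible reading of that criterion, sketched under assumptions: the reward is the average squared leaf effect over the tree-building sample, and the penalty sums within-leaf outcome variances for treated and control units, scaled by the treatment share `p` and by `1/n_tr + 1/n_est`. The function names, argument layout, and toy data are hypothetical, and this may not match the authors' exact definitions.

```python
from statistics import mean, variance


def leaf_effect(y, w):
    """Difference in mean outcome between treated and control units in a leaf."""
    treated = [yi for yi, wi in zip(y, w) if wi == 1]
    control = [yi for yi, wi in zip(y, w) if wi == 0]
    return mean(treated) - mean(control)


def honest_split_score(leaves, n_est, p):
    """leaves: list of (y, w) pairs, one per leaf, from the tree-building sample.
    n_est: size of the estimation sample; p: marginal treatment probability.
    Higher is better. Assumed form of the criterion, not a verified one."""
    n_tr = sum(len(y) for y, _ in leaves)
    # Reward: heterogeneity, as the average squared leaf effect.
    reward = sum(len(y) * leaf_effect(y, w) ** 2 for y, w in leaves) / n_tr
    # Penalty: sample variances within each leaf (needs >= 2 units per arm).
    penalty = 0.0
    for y, w in leaves:
        s2_treated = variance([yi for yi, wi in zip(y, w) if wi == 1])
        s2_control = variance([yi for yi, wi in zip(y, w) if wi == 0])
        penalty += s2_treated / p + s2_control / (1 - p)
    return reward - (1 / n_tr + 1 / n_est) * penalty


# Toy data: leaf A has a large effect, leaf B has none.
leaf_a = ([1.0, 2.0, 5.0, 6.0], [0, 0, 1, 1])
leaf_b = ([1.0, 2.0, 1.0, 2.0], [0, 0, 1, 1])
pooled = (leaf_a[0] + leaf_b[0], leaf_a[1] + leaf_b[1])

split_score = honest_split_score([leaf_a, leaf_b], n_est=8, p=0.5)
unsplit_score = honest_split_score([pooled], n_est=8, p=0.5)
print(split_score, unsplit_score)
```

On this toy data the split separates the two effect sizes, so the reward rises while the within-leaf variances fall, and the split scores higher than leaving the node whole.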
Based roughly on the
1. Separate the data into two mutually exclusive samples $\mathcal{J}$ and $\mathcal{I}$, where the relative sizes are a tuning parameter, defaulting to an even split.
2. Generate a tree predicting $Y$ on $X$ from sample $\mathcal{J}$ (stopping is defined parametrically, typically by capping leaves at some small number of observations). Typically, the tree partitions a subset of the full (observed) covariate space, using a random subsample of the covariates $X$.
3. Now use $\mathcal{I}$ to generate predictions. The ATE can now be estimated as the difference between the mean of the treatment cases in $\mathcal{I}$ and the mean of the control cases.
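The steps above can be sketched on simulated data. Everything here is an assumption made for illustration: a single covariate, a one-split "tree" (a stump), a simple effect-gap splitting rule, and the names `diff_in_means`, `grow_stump`, `build` and `estimate`. It is not the procedure from the paper or from grf, only the honesty idea in miniature.

```python
import random
from statistics import mean


def diff_in_means(rows):
    """Mean outcome of treated rows minus mean outcome of control rows."""
    treated = [y for _, w, y in rows if w == 1]
    control = [y for _, w, y in rows if w == 0]
    return mean(treated) - mean(control)


def grow_stump(rows, candidates):
    """Pick the threshold on x whose two leaves differ most in estimated effect."""
    def score(c):
        left = [r for r in rows if r[0] <= c]
        right = [r for r in rows if r[0] > c]
        gap = diff_in_means(left) - diff_in_means(right)
        return len(left) * len(right) * gap ** 2
    return max(candidates, key=score)


rng = random.Random(0)
data = []
for _ in range(4000):
    x = rng.random()
    w = rng.randint(0, 1)              # randomised treatment
    tau = 2.0 if x > 0.5 else 0.0      # true effect, known only in simulation
    data.append((x, w, tau * w + rng.gauss(0, 0.5)))

# Step 1: two mutually exclusive halves.
rng.shuffle(data)
build, estimate = data[:2000], data[2000:]

# Step 2: the structure (split point) comes from the build half only.
threshold = grow_stump(build, [k / 10 for k in range(1, 10)])

# Step 3: leaf effects come from the estimation half only.
tau_left = diff_in_means([r for r in estimate if r[0] <= threshold])
tau_right = diff_in_means([r for r in estimate if r[0] > threshold])
print(threshold, round(tau_left, 2), round(tau_right, 2))
```

Because the estimation half played no part in choosing the threshold, the leaf estimates are ordinary differences in means on a partition that is fixed from their point of view.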
They use the infinitesimal jackknife. To jackknife is to omit one observation, recompute the estimate, and repeat. Omitting an observation amounts to setting its weight to zero; in the infinitesimal case, the better approach is to simply reduce its weight. Too much to discuss here as it's actually a directional gradient corresponding to the hyperplane tangent of an approximation of the statistic being estimated as a multi-dimensional function ($n$ dimensions where random sample size is $n$, input space is the space of all random samples).
I probably lack the necessary background, but the paper's proofs seem a bit hand-wavy? Anyway, the formula of interest is

$$\hat{V}_{IJ}(x) = \sum_{i=1}^{n} \operatorname{Cov}_b\left[N_{bi},\, T_b(x)\right]^2$$

where $b$ is a specific tree in the forest with estimated response $T_b(x)$, and $N_{bi}$ is the number of times observation $i$ was used in $b$.
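As a rough sketch, the infinitesimal jackknife variance for a forest can be computed as the sum, over observations, of the squared covariance (taken across trees) between how often an observation was used in a tree and that tree's prediction. The function name `ij_variance`, the input layout, and the toy numbers are assumptions, and any finite-sample correction factor is deliberately left out.

```python
def ij_variance(inclusion, predictions):
    """inclusion[b][i]: number of times observation i was used in tree b.
    predictions[b]: tree b's prediction at the query point.
    Returns sum_i Cov_b(inclusion[., i], predictions)^2, uncorrected."""
    n_trees = len(predictions)
    n_obs = len(inclusion[0])
    pred_bar = sum(predictions) / n_trees
    total = 0.0
    for i in range(n_obs):
        use_bar = sum(inclusion[b][i] for b in range(n_trees)) / n_trees
        # Covariance across trees between usage of i and the tree's prediction.
        cov = sum(
            (inclusion[b][i] - use_bar) * (predictions[b] - pred_bar)
            for b in range(n_trees)
        ) / n_trees
        total += cov ** 2
    return total


# Toy input: four trees, two observations.
inclusion = [[1, 0], [0, 1], [1, 1], [0, 0]]
predictions = [1.0, 3.0, 2.0, 2.0]
v_ij = ij_variance(inclusion, predictions)
print(v_ij)
```

An observation whose presence in a tree moves that tree's prediction a lot contributes a large covariance, so the estimate is driven by the most influential observations.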
The key assumptions mentioned at the start were honesty - building the tree is independent from computing predictions, and Lipschitz continuity on lower order moments. A few further conditions are necessary for the proof:
- Every feature has a large enough probability of being used for tree splitting.
- Trees are fully grown, up to a minimum leaf size.
- Every child node retains at least some minimum fraction of its parent's observations, and that minimum is no more than 20%.
Under unconfoundedness should be constant across , netting
$e(x) = \mathbb{E}[W \mid X = x]$ gives a propensity score and $m(x) = \mathbb{E}[Y \mid X = x]$ the conditional mean of the outcome $Y$. The former is used to help orthogonalize with respect to the propensity to be treated. In GRF it is estimated using a regression forest. The full term

$$W - e(X)$$

normalises by subtracting from the treatment the variation corresponding to propensity. For example, if the treatment were perfectly binomial, then $e(x) = 0.5$ and so the binary treatment is shifted (which has a nontrivial effect on leaf means and, thus, the ATE).

Now $m(x)$ corresponds to the mean outcome when marginalised over the treatment (i.e. what is the contribution of $X$ independent of treatment $W$, achieved by summing across $W$ in the marginal distribution). Hence, the term

$$Y - m(X)$$

has the effect of subtracting out covariate-specific effects. Hence, both terms combined provide a sort of "double debiasing".
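A small simulation illustrates why the two centring steps help. The notation is an assumption: `W` is a binary treatment, `X` a single covariate, `Y` the outcome, `e(x)` the propensity score and `m(x)` the conditional mean of the outcome. Both functions are taken as known here, purely to keep the sketch short, whereas in practice they would be estimated. The effect is a residual-on-residual regression slope, compared against a naive difference in means.

```python
import random

rng = random.Random(1)
TAU = 1.5                                  # true effect, known only in simulation


def e(x):
    """Propensity score E[W | X = x] (assumed known for this sketch)."""
    return 0.2 + 0.6 * x


def m(x):
    """Conditional mean E[Y | X = x], marginalised over treatment."""
    return 3.0 * x + TAU * e(x)


rows = []
for _ in range(20000):
    x = rng.random()
    w = 1 if rng.random() < e(x) else 0    # treatment depends on x (confounding)
    y = 3.0 * x + TAU * w + rng.gauss(0, 1)
    rows.append((x, w, y))

# Naive comparison: ignores that treated units tend to have larger x.
treated = [y for _, w, y in rows if w == 1]
control = [y for _, w, y in rows if w == 0]
naive = sum(treated) / len(treated) - sum(control) / len(control)

# Centred comparison: regress (Y - m(X)) on (W - e(X)).
num = sum((w - e(x)) * (y - m(x)) for x, w, y in rows)
den = sum((w - e(x)) ** 2 for x, w, y in rows)
tau_hat = num / den
print(round(naive, 2), round(tau_hat, 2))
```

The naive estimate absorbs the part of the outcome driven by the covariate, since treated units have larger `x` on average; centring both the treatment and the outcome removes it.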
It isn't clear, and is possibly reflected in my writing, how causal forests are (under the assumptions posed) unbiased estimators of treatment effect over other estimators on the expected values of ... The two papers aiming to cover this lacked explicit clarity in this regard.
Cf. alternative methods coupled with Belloni's work.