Number of observations to report, how many? #1463

Open
ninetale opened this issue Oct 20, 2024 · 2 comments

@ninetale

Hello,

I am currently using the causal_forest function from the grf package and have some questions about how the observations are used. Let's assume there are 1,000 observations.

I am using the following code:

cf <- causal_forest(X, Y, W, Y.hat = Y_hat_re, W.hat = W_hat_re, honesty = TRUE, tune.parameters = "all", sample.fraction = 0.8, num.trees = 20000)

  1. From my understanding, this would allocate 400 observations each to the training (splitting) set and the estimation set. Is my understanding correct?

  2. Furthermore, should I assume that the 'test_calibration(cf)' function operates based on a test set of 200 observations?

  3. Lastly, I would like to know how many observations the best_linear_projection(cf, X, target.sample = "overlap") call targets. (Both calls are written out below for reference.)
    *I understand that "overlap" implies weighting, not the exclusion of observations.
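For reference, here are the two follow-up calls from questions 2 and 3 as they would be run, assuming the forest cf fitted above:

```r
# Question 2: calibration test on the fitted forest.
test_calibration(cf)

# Question 3: best linear projection with overlap weighting.
best_linear_projection(cf, X, target.sample = "overlap")
```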

Thank you for developing and maintaining such a useful package.

@erikcs
Member

erikcs commented Oct 25, 2024

Hi @ninetale, the effective number of samples used for estimation in both 1 and 2 is n = 1000, but in 1 each tree sets aside honesty.fraction * sample.fraction * n = 400 samples for "honest" splitting. 3: yes, but the weights can be zero.
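A minimal sketch of the per-tree sample accounting this implies, assuming grf's default honesty.fraction = 0.5:

```r
# Per-tree sample accounting for the call above (n = 1000).
n <- 1000
sample.fraction <- 0.8
honesty.fraction <- 0.5                    # grf default when honesty = TRUE

subsample  <- n * sample.fraction          # 800 observations drawn per tree
split_set  <- subsample * honesty.fraction # 400 used to place splits
honest_set <- subsample - split_set        # 400 used to fill the leaves

# Out-of-bag predictions are still produced for all n = 1000 observations,
# which is why test_calibration(cf) is based on n = 1000, not a held-out 200.
```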

@ninetale
Author

Thanks for the reply, @erikcs.
Then my understanding is correct.

That is, in

cf <- causal_forest(X, Y, W, Y.hat = Y_hat_re, W.hat = W_hat_re, honesty = TRUE, tune.parameters = "all", sample.fraction = 0.8, num.trees = 20000)

it uses 400 observations for forest construction and 400 for estimation.
And since 200 observations are held out, I guess test_calibration(cf) uses those 200.

---

However, the question I still have is: does best_linear_projection(cf, X, target.sample = "overlap") use all of the original 1,000 observations?
(That is, of the 1,000 observations: 400 for forest construction, 400 for estimation, and 200 for the calibration test,
and then all 1,000 again for the effect-heterogeneity analysis?)

If not, do I have to split the original 1,000 observations for best_linear_projection along the following lines?

For example:

  1. Split the observations into two sets of 600 and 400.
  2. Use 200 of the 600 for forest construction and 200 for estimation (in causal_forest).
  3. Use the remaining 200 of the 600 for test_calibration.
  4. Examine the heterogeneity of effects on the 400 set aside in the initial split.

Should I follow this procedure?

I am confused because this procedure differs somewhat from the usual train/validation/test splitting in machine learning or deep learning for prediction.

Thank you.
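A minimal sketch of the workflow the reply above implies, under the assumption that no manual split is needed (honest splitting happens per tree inside the forest, and each call below receives all n = 1000 observations):

```r
library(grf)

# Fit on all 1,000 observations; each tree internally subsamples 800
# (sample.fraction = 0.8) and splits them 400/400 for honest estimation.
cf <- causal_forest(X, Y, W,
                    Y.hat = Y_hat_re, W.hat = W_hat_re,
                    honesty = TRUE, tune.parameters = "all",
                    sample.fraction = 0.8, num.trees = 20000)

# Calibration test, based on out-of-bag predictions for all 1,000 observations.
test_calibration(cf)

# Best linear projection with overlap weighting: all 1,000 observations are
# passed in; "overlap" reweights them (weights can be zero) rather than
# dropping rows.
best_linear_projection(cf, X, target.sample = "overlap")
```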
