Proposal: Add data-used to posterior .json files where relevant #260

Open
JasonPekos opened this issue Jun 14, 2024 · 0 comments

Proposal:

Modify the posterior .json files to specify which data from the dataframe are actually used as input to the model.

Rationale:

Some models only use a subset of their data. For example, earn-height uses the earnings data:

  N      =>  1192
  earn   =>  [50000, 60000, 30000,...
  height =>  [74, 66, 64, 63, 63, 64,...
  male   =>  [1, 0, 0, 0, 0, 0, 0,...

Of this data, earn-height only uses a subset: N, earn, height. This is fine for Stan, which will automatically discard data that doesn't match variables defined in the data block.

Unfortunately, this is frustrating when trying to port PosteriorDB models to other PPLs. Many PPLs — notably Turing, but I think also PyMC, NumPyro, Gen, and so on — use some sort of overloaded function definition to define a probabilistic program, e.g.:

# generic-ppl-pseudocode:

@make_model function model_name(data_1, data_2, data_3){
     prior ~ dist()
     data_1 ~ dist(prior, smth ...)
}

In this setup, the data arguments need to exactly match the columns of the dataframe, and so the dataframe must be filtered beforehand to extract the relevant columns. To make this easier, it would be helpful to have a field in the posterior .json file listing the data actually used (data_used).
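As a rough illustration of the filtering step described above, here is a minimal Python sketch of what a porting script could do if the proposed `data_used` field existed. The helper name `select_used_data` is hypothetical (it is not part of PosteriorDB), the data values are a truncated toy stand-in for the earnings data, and the fallback to "use everything" when the field is absent is an assumption about how the field would be interpreted.

```python
from typing import Any, Mapping


def select_used_data(posterior_info: Mapping[str, Any],
                     data: Mapping[str, Any]) -> dict:
    """Keep only the data entries named in the (proposed) `data_used` field.

    Hypothetical helper: if `data_used` is absent, assume the model uses
    every entry and return the data unchanged.
    """
    used = posterior_info.get("data_used")
    if used is None:
        return dict(data)
    # Fail loudly if the field names something the data does not contain.
    missing = [name for name in used if name not in data]
    if missing:
        raise KeyError(f"data_used names not found in data: {missing}")
    return {name: data[name] for name in used}


# Toy stand-in for the earnings data (three rows only, for illustration).
data = {
    "N": 3,
    "earn": [50000, 60000, 30000],
    "height": [74, 66, 64],
    "male": [1, 0, 0],
}
posterior_info = {
    "name": "earnings-earn_height",
    "data_name": "earnings",
    "data_used": ["N", "earn", "height"],  # the proposed field
}

model_inputs = select_used_data(posterior_info, data)
print(sorted(model_inputs))
```

The filtered dictionary could then be splatted into a model constructor whose arguments are exactly the names in `data_used`.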


Example addition:

{
  "name": "earnings-earn_height",
  "keywords": ["arm book", "stan examples"],
  "urls": "https://github.com/stan-dev/example-models/tree/master/ARM/Ch.4",
  "model_name": "earn_height",
  "data_name": "earnings",
  "reference_posterior_name": "earnings-earn_height",
  "references": "gelman2006data",
  "dimensions": {
    "beta": 2,
    "sigma": 1
  },
  "added_date": "2020-01-17",
  "added_by": "Oliver Järnefelt"
}

would become:

{
  "name": "earnings-earn_height",
  "keywords": ["arm book", "stan examples"],
  "urls": "https://github.com/stan-dev/example-models/tree/master/ARM/Ch.4",
  "model_name": "earn_height",
  "data_name": "earnings",
  "data_used": ["N", "earn", "height"],            # <--------------- the change is here
  "reference_posterior_name": "earnings-earn_height",
  "references": "gelman2006data",
  "dimensions": {
    "beta": 2,
    "sigma": 1
  },
  "added_date": "2020-01-17",
  "added_by": "Oliver Järnefelt"
}

This change would only need to occur for models where the provided dataframe is a strict superset of the data the model actually uses.
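One way to decide which posteriors need the field is to compare the names a model uses against the names in its data. The sketch below assumes the list of used names is already known (for example, read off the Stan data block by hand); the function name `unused_data_names` is hypothetical.

```python
def unused_data_names(data_used, data_names):
    """Return the data names the model does not use, sorted.

    An empty result means the data is not a strict superset of what the
    model uses, so no `data_used` field would be needed.
    """
    unknown = set(data_used) - set(data_names)
    if unknown:
        raise ValueError(f"names not present in the data: {sorted(unknown)}")
    return sorted(set(data_names) - set(data_used))


extra = unused_data_names(
    ["N", "earn", "height"],
    ["N", "earn", "height", "male"],
)
print(extra)
```

For the earnings example this reports `male` as the only unused entry, so that posterior would get a `data_used` field.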
