Proposal: Add `data-used` to posterior `.json` files where relevant #260

JasonPekos · 2024-06-14T16:46:06Z

Proposal:

Modify the posterior .json files to specify which data from the dataframe is actually used as input to the model.

Rationale:

Some models only use a subset of their data. For example, earn-height uses the earnings data:

  N      => 1192
  earn   => [50000, 60000, 30000,...
  height => [74, 66, 64, 63, 63, 64,...
  male   => [1, 0, 0, 0, 0, 0, 0,...

Of this data, earn-height only uses a subset: N, earn, height. This is fine for Stan, which will automatically discard data that doesn't match variables defined in the data block.

Unfortunately, this is frustrating when trying to port PosteriorDB models to other PPLs. Many PPLs — notably Turing, but I think also PyMC, NumPyro, Gen, and so on — use some sort of overloaded function definition to define a probabilistic program, e.g.:

# generic-ppl-pseudocode:

@make_model function model_name(data_1, data_2, data_3){
     prior ~ dist()
     data_1 ~ dist(prior, smth ...)
}

In this setup, the data arguments need to exactly match the columns of the dataframe, and so the dataframe must be filtered beforehand to extract the relevant columns. To make this easier, it would be helpful to have a field in the posterior .json file specifying the data used.
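The filtering step described above could look roughly like the following Python sketch. It assumes the data is available as a plain dict and that the names the model consumes are known as a list. The helper `select_used_data`, the placeholder `earn_height` function, and the truncated data values are all hypothetical and only illustrate the idea; none of them is part of PosteriorDB or of any PPL.

```python
from typing import Any, Dict, Mapping, Sequence


def select_used_data(data: Mapping[str, Any], data_used: Sequence[str]) -> Dict[str, Any]:
    """Return only the entries of `data` named in `data_used`, in that order."""
    missing = [name for name in data_used if name not in data]
    if missing:
        raise KeyError(f"data_used names not present in data: {missing}")
    return {name: data[name] for name in data_used}


# Truncated, illustrative stand-in for the earnings data (not the real values).
earnings = {
    "N": 3,
    "earn": [50000, 60000, 30000],
    "height": [74, 66, 64],
    "male": [1, 0, 0],
}


def earn_height(N, earn, height):
    """Placeholder for a PPL model whose arguments must match the data exactly."""
    return {"N": N, "n_earn": len(earn), "n_height": len(height)}


used = select_used_data(earnings, ["N", "earn", "height"])
result = earn_height(**used)  # earn_height(**earnings) would raise TypeError on `male`
```

Raising on unknown names, rather than silently skipping them, is a design choice here: it would surface a typo in the list early instead of producing a model call with missing arguments.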

Example addition:

{
  "name": "earnings-earn_height",
  "keywords": ["arm book", "stan examples"],
  "urls": "https://github.com/stan-dev/example-models/tree/master/ARM/Ch.4",
  "model_name": "earn_height",
  "data_name": "earnings",
  "reference_posterior_name": "earnings-earn_height",
  "references": "gelman2006data",
  "dimensions": {
    "beta": 2,
    "sigma": 1
  },
  "added_date": "2020-01-17",
  "added_by": "Oliver Järnefelt"
}

would become:

{
  "name": "earnings-earn_height",
  "keywords": ["arm book", "stan examples"],
  "urls": "https://github.com/stan-dev/example-models/tree/master/ARM/Ch.4",
  "model_name": "earn_height",
  "data_name": "earnings",
  "data_used": ["N", "earn", "height"],            # <--------------- the change is here
  "reference_posterior_name": "earnings-earn_height",
  "references": "gelman2006data",
  "dimensions": {
    "beta": 2,
    "sigma": 1
  },
  "added_date": "2020-01-17",
  "added_by": "Oliver Järnefelt"
}

This change would only need to occur for models where the provided dataframe is a superset of the data the model actually uses.
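Since the field would only appear on some posteriors, a consumer would presumably have to treat it as optional. A minimal Python sketch of that fallback, assuming the field is named `data_used` as in the example above and that a missing field means "the model uses everything"; the helper `used_names` is hypothetical and the JSON is abbreviated to the relevant keys:

```python
import json

# Abbreviated posterior info, keeping only the keys relevant here.
posterior_info = json.loads("""
{
  "name": "earnings-earn_height",
  "model_name": "earn_height",
  "data_name": "earnings",
  "data_used": ["N", "earn", "height"]
}
""")


def used_names(info, data):
    """Names the model consumes; every data key if `data_used` is absent."""
    names = info.get("data_used")
    return list(data) if names is None else list(names)


data = {"N": 3, "earn": [], "height": [], "male": []}
with_field = used_names(posterior_info, data)
without_field = used_names({"name": "some-other-posterior"}, data)
```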

JasonPekos mentioned this issue Jun 20, 2024

Adds a folder for Turing models, and a few models for testing, #262

Open
