Preserving column order for categoricals #27

jseabold · 2013-10-09T09:59:28Z

https://groups.google.com/forum/#!topic/pystatsmodels/ZvsyZag3xaw

import patsy
import pandas as pd
cps = pd.read_csv("http://www.mosaic-web.org/go/datasets/cps.csv")
patsy.dmatrix("age + educ + married", data=cps)


njsmith · 2015-04-14T22:34:52Z

@josef-pkt: I think this is the issue that you were trying to think of today that involved setting up column order.

For reference, the issue is that patsy does impose some constraints on column order: specifically, it groups terms so that those containing the same combination of continuous factors go together, and then within each group it puts lower-order interactions before higher-order interactions. The user wanted to do a type-I anova, and statsmodels only supported (don't know if this is still true?) type-I anovas where the columns were entered from left to right. In particular, they wanted to have some categorical terms, then some continuous terms, then some more categorical terms, which violates the "grouping" constraint above.
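To make the grouping constraint concrete, here is a minimal pure-Python sketch of the rule as described in this comment. It is an illustration only, not patsy's actual implementation: `order_terms` is a hypothetical helper, terms are modelled as tuples of factor names, and placing the purely categorical group first is an assumption, since the comment does not say how the groups themselves are ordered.

```python
def order_terms(terms, continuous):
    """Order terms by the grouping rule described above (illustration only).

    terms: list of tuples of factor names, e.g. ("a", "x") stands for a:x
    continuous: set of factor names that are continuous
    """
    # Group terms by the combination of continuous factors they contain.
    # A dict keeps the groups in order of first appearance.
    groups = {}
    for term in terms:
        key = frozenset(f for f in term if f in continuous)
        groups.setdefault(key, []).append(term)

    # Assumption: the group with no continuous factors comes first;
    # the other groups keep their first-appearance order (sorted is stable).
    keys = sorted(groups, key=lambda k: len(k) > 0)

    ordered = []
    for key in keys:
        # Within a group, lower-order interactions come before higher-order ones.
        ordered.extend(sorted(groups[key], key=len))
    return ordered


# y ~ a + x + b, with x continuous: b is moved ahead of x.
simple = order_terms([("a",), ("x",), ("b",)], continuous={"x"})
print([":".join(t) for t in simple])  # ['a', 'b', 'x']
```

Under this rule the requested sequence a, x, b cannot be produced, because a and b share the same (empty) combination of continuous factors and so always end up adjacent.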

My current feeling (also expressed more thoughtfully in that thread) is that the best solution is just for the statsmodels type-I anova code to support explicit specification of the order in which you want to enter the terms. Doing this in patsy is hard because (a) I actually think the current behaviour is nicer for most use cases, so I am reluctant to de-optimize the user experience in general just to improve type-I anovas (which are almost never the right thing anyway, and rarely used outside of introductory classes), and (b) it's not clear that patsy can fix this entirely, since in general type-I anovas might want almost any ordering of columns, and patsy can't really support that without extreme contortions. Allowing y ~ a + x + b would not be too hard, but y ~ a:b + a:x + a + b:x would be very difficult and intrusive (and it's not even clear how it would work).

josef-pkt · 2015-04-14T22:53:29Z

Yes, I think that's what I remembered.

I don't really know the details, but there are also other use cases where column order is relevant. One is in handling multicollinearity, where R does pivoting, and statsmodels will also do a sequential check for perfect correlation.

For type 1 ANOVA:
I don't think it would be difficult to process the anova sequence in a different order, but I don't know how the user would have to specify terms. AFAIR, anova_lm could loop over the list of terms in a pretty arbitrary rearrangement, but there are no names for the terms, i.e. the user would want to write
term_sequence = ["age", "educ", "married", "educ:married"]
instead of
term_sequence = [0, 2, 1, 3]
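The translation between the two forms is small once term names are available. The following is a rough sketch, not anything anova_lm provides: `names_to_indices` is a hypothetical helper, and the `model_terms` order is made up so that it matches the indices in the example above.

```python
def names_to_indices(model_terms, term_sequence):
    """Translate a user-supplied sequence of term names into term indices."""
    position = {name: i for i, name in enumerate(model_terms)}
    missing = [name for name in term_sequence if name not in position]
    if missing:
        raise ValueError(f"unknown terms: {missing}")
    return [position[name] for name in term_sequence]


# Hypothetical order in which the fitted model stores its terms.
model_terms = ["age", "married", "educ", "educ:married"]

# Order in which the user wants the terms entered.
term_sequence = ["age", "educ", "married", "educ:married"]

indices = names_to_indices(model_terms, term_sequence)
print(indices)  # [0, 2, 1, 3]
```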

njsmith · 2015-04-14T23:08:31Z

I guess column order affects results in multicollinearity cases, but do users actually need fine-grained control over this? I guess if you find a case where they do then post a comment on this bug? :-)

Patsy does provide the ability to look up terms by name, so I guess you should just teach anova_lm to use those? Let me know if there's something on patsy's side that needs doing here...
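As a sketch of what a name-based lookup makes possible: patsy's `DesignInfo` exposes a mapping from term names to column slices (`term_name_slices`). The example below uses a plain dict as a stand-in for that mapping so it runs without patsy. The helper `columns_in_term_order` and the slice values are hypothetical.

```python
def columns_in_term_order(term_slices, term_sequence):
    """Return column indices arranged so terms appear in the requested order."""
    columns = []
    for name in term_sequence:
        s = term_slices[name]  # KeyError here means an unknown term name
        columns.extend(range(s.start, s.stop))
    return columns


# Stand-in for a term-name -> column-slice mapping of a design matrix
# built from y ~ a + x + b, where a is a three-level categorical
# (two columns) and b is a two-level categorical (one column).
term_slices = {
    "Intercept": slice(0, 1),
    "a": slice(1, 3),
    "b": slice(3, 4),
    "x": slice(4, 5),
}

# Enter a, then x, then b, regardless of how the columns are stored.
order = columns_in_term_order(term_slices, ["Intercept", "a", "x", "b"])
print(order)  # [0, 1, 2, 4, 3]
```

A type-I anova routine could then walk the columns in this order without patsy having to change how it lays out the design matrix.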


josef-pkt mentioned this issue Apr 14, 2015

ENH: anova_lm type 1: options for term sequence statsmodels/statsmodels#2358

Open
