Preserving column order for categoricals #27

Open
jseabold opened this issue Oct 9, 2013 · 3 comments

@jseabold

jseabold commented Oct 9, 2013

https://groups.google.com/forum/#!topic/pystatsmodels/ZvsyZag3xaw

import patsy
import pandas as pd
cps = pd.read_csv("http://www.mosaic-web.org/go/datasets/cps.csv")
patsy.dmatrix("age + educ + married", data=cps)
@njsmith

njsmith commented Apr 14, 2015

@josef-pkt: I think this is the issue that you were trying to think of today that involved setting up column order.

For reference, the issue is that patsy does impose some constraints on column order: specifically, it groups together terms so that those which contain the same combination of continuous factors go together, and then within each group it puts lower-order interactions before higher-order interactions. The user request was that they wanted to do a type-I anova, and statsmodels only supported (don't know if this is still true?) type-I anovas where each column was entered from left to right. So they wanted to in particular have some categorical terms, then some continuous terms, then some categorical terms, which violates that "grouping" constraint above.

My current feeling (also expressed more thoughtfully in that thread) is that the best solution is just for the statsmodels type-I anova code to support explicit specification of the order in which you want to enter the terms. Doing this in patsy is hard because (a) I actually think the current behaviour is nicer for most use cases, so I am reluctant to de-optimize the user experience in general just to improve type-I anovas (which are almost never the right thing anyway, and rarely used outside of introductory classes), and (b) it's not clear that patsy can fix this entirely, since in general type-I anovas might want almost any ordering of columns, and patsy can't really support that without extreme contortions. Allowing y ~ a + x + b would not be too hard, but y ~ a:b + a:x + a + b:x would be very difficult and intrusive (and it's not even clear how it would work).
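
The grouping constraint described above can be seen on a small example. This is only a minimal sketch with made-up data: the names `a`, `b`, `x` and their values are assumptions for illustration, not the CPS dataset from the original report. It uses `patsy.dmatrix` and the `term_names` and `column_names` attributes of the resulting `design_info`.

```python
import patsy

# Made-up data (illustrative only): a and b are categorical, x is continuous.
data = {
    "a": ["a1", "a2", "a1", "a2"],
    "b": ["b1", "b1", "b2", "b2"],
    "x": [1.0, 2.5, 3.0, 4.5],
}

# The formula asks for a, then x, then b ...
dm = patsy.dmatrix("a + x + b", data)

# ... but patsy groups the purely categorical terms together, ahead of
# the term that involves the continuous factor x.
term_names = dm.design_info.term_names
column_names = dm.design_info.column_names
print(term_names)
print(column_names)
```

With this input, the term for `x` should come out after the term for `b`, even though the formula lists it earlier, which is exactly the ordering a left-to-right type-I anova cannot work around.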

@josef-pkt

Yes, I think that's what I remembered.

I don't really know the details, but there are also other use cases where column order is relevant. One is in handling multicollinearity, where R does pivoting, and statsmodels will also do a sequential check for perfect correlation.

For type 1 ANOVA:
I don't think it would be difficult to process the anova sequence in a different order, but I don't know how the user would have to specify terms. AFAIR, anova_lm could loop over the list of terms in a pretty arbitrary rearrangement, but there are no names for the terms, i.e.
term_sequence = ["age", "educ", "married", "educ:married"]
instead of
term_sequence = [0, 2, 1, 3]

@njsmith

njsmith commented Apr 14, 2015

I guess column order affects results in multicollinearity cases, but do
users actually need fine-grained control over this? I guess if you find a
case where they do then post a comment on this bug? :-)

Patsy does provide the ability to look up terms by name, so I guess you
should just teach anova_lm to use those? Let me know if there's something
on patsy's side that needs doing here...
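
As a rough illustration of looking terms up by name, the sketch below resolves a user-supplied list of term names into column indices through `design_info.term_name_slices` and reorders the design matrix accordingly. The data, the `term_sequence` list, and the idea of feeding the reordered matrix to a sequential anova are assumptions for illustration; this is not how `anova_lm` is actually implemented.

```python
import numpy as np
import patsy

# Made-up data (illustrative only): a and b are categorical, x is continuous.
data = {
    "a": ["a1", "a2", "a1", "a2"],
    "b": ["b1", "b1", "b2", "b2"],
    "x": [1.0, 2.5, 3.0, 4.5],
}
dm = patsy.dmatrix("a + x + b", data)
info = dm.design_info

# Hypothetical user-specified entry order, given by term name.
term_sequence = ["Intercept", "a", "x", "b"]

# term_name_slices maps each term name to the slice of columns it occupies.
all_columns = np.arange(dm.shape[1])
column_order = []
for name in term_sequence:
    column_order.extend(all_columns[info.term_name_slices[name]].tolist())

# Reorder the columns (and their names) to follow the requested sequence.
reordered = np.asarray(dm)[:, column_order]
reordered_names = [info.column_names[i] for i in column_order]
print(reordered_names)
```

Working from term names rather than integer positions keeps the request readable and lets a multi-column categorical term move as a single unit.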

Nathaniel J. Smith -- http://vorpus.org
