Categorical names (again) #40

Open
jseabold opened this issue May 6, 2014 · 4 comments

Comments

@jseabold
Member

jseabold commented May 6, 2014

Do we really need, say, the reference level in the Treatment contrast? I'm not sure it adds enough information vs. the complexity it adds to the names to warrant inclusion. Thoughts? AFAICT, it only appears if you specify a reference level. If you specify one, then surely you know what you specified.

[~/]
[7]: dmatrix('~C(A, Treatment)', data=pd.DataFrame([['some really long name'], ['other name'], ['other name']], columns=['A']))
[7]: 
DesignMatrix with shape (3, 2)
Intercept  C(A, Treatment)[T.some really long name]
        1                                         1
        1                                         0
        1                                         0
Terms:
    'Intercept' (column 0)
    'C(A, Treatment)' (column 1)

[~/]
[8]: dmatrix("~C(A, Treatment('some really long name'))", data=pd.DataFrame([['some really long name'], ['other name'], ['other name']], columns=['A']))
[8]: 
DesignMatrix with shape (3, 2)
Intercept  C(A, Treatment('some really long name'))[T.other name]
        1                                                       0
        1                                                       1
        1                                                       1
Terms:
    'Intercept' (column 0)
    "C(A, Treatment('some really long name'))" (column 1)
@njsmith
Member

njsmith commented May 6, 2014

The problem is that as far as patsy is concerned, "C(A, Treatment('some
really long name'))" is an opaque blob of arbitrary Python code (which
happens to return a special object that patsy knows how to interpret as a
categorical column). So I'm pretty hesitant to get into the business of
trying to parse code like this to try and guess which parts can be thrown
away :-/

Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org

@jseabold
Member Author

jseabold commented May 6, 2014

I understand the hesitancy to fix things that aren't really broken, but IMO this is a pretty bad usability issue that has come up before.

To be clear, I think I'm only talking about what's in design_info.column_names. Does anything rely on this later? Can't the builder retain the reference information without showing it to us all the time? I'm writing a lot of code just to get back sensible names and to manipulate DataFrames that hold designs built by patsy. E.g., I either have to regex the names back to something sensible or type out things like this (actual use case):

X[["C(dialect_region, Treatment('East Central German'))[T.North German]",
 "C(dialect_region, Treatment('East Central German'))[T.West Central German]", 
 ...]]

Typing this kind of stuff out is brutal. Even just being able to leave out the reference would be an improvement IMO.
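
The regex route mentioned above could look roughly like the sketch below. It is a heuristic, not something patsy provides: `simplify_column_name` is a hypothetical helper, and it assumes column names of the form `C(<variable>, <anything>)[<level>]` where the first argument to `C` is a plain variable name. It does not handle interaction terms and can be fooled by level names that themselves contain `)[`.

```python
import re

# Heuristic pattern for names such as
#   C(dialect_region, Treatment('East Central German'))[T.North German]
# Assumption: the first argument to C() is a plain variable name.
_CATEGORICAL_NAME = re.compile(
    r"^C\(\s*(?P<var>[^,()]+?)\s*(?:,.*)?\)\[(?P<level>.*)\]$"
)


def simplify_column_name(name):
    """Drop the contrast arguments from a C(...) column name (hypothetical helper)."""
    match = _CATEGORICAL_NAME.match(name)
    if match is None:
        # Not a C(...)[...] column (e.g. 'Intercept'): leave it unchanged.
        return name
    return "{}[{}]".format(match.group("var"), match.group("level"))


columns = [
    "Intercept",
    "C(dialect_region, Treatment('East Central German'))[T.North German]",
    "C(dialect_region, Treatment('East Central German'))[T.West Central German]",
]
short = [simplify_column_name(c) for c in columns]
```

With a pandas DataFrame `X`, `X.rename(columns=simplify_column_name)` would apply the same mapping to every column. Note that two codings of the same variable would collapse to the same short name, so this only suits designs where each variable appears once.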

@jseabold
Member Author

jseabold commented May 6, 2014

Part of the solution on my end is to use easier variable names, but that doesn't get around having to type the reference category each time I want to use the name.

@njsmith
Member

njsmith commented May 10, 2014

I totally agree about the usability issue -- I'm not being hesitant to fix things that aren't broken, I'm being hesitant to start writing code to accomplish something that I'm not sure is even possible in principle :-/. Patsy doesn't know what C is, it's just an arbitrary Python function call. In fact Patsy doesn't even know Python syntax, so it doesn't even know there's a function call there...

Of course the best solution would be to have a proper way to represent categorical data (like R's factors) so that dialect_region could know its own reference level and preferred coding scheme and suchlike. In the meantime...

Some possible approaches:

  • Add some sort of generic string-mangling for long names. R has something like this: in some cases (can't track it down right now) it starts throwing away spaces and vowels, I think. So you end up with stuff like C(dlctrgnTrtment(EstCntrlGrmn)). I... guess that doesn't really help much. But we could come up with a better one that, say, truncates to the unique prefix or something?
  • Add a name= argument to C, which sets an attr on the returned categorical object telling patsy what display name to use? Sort of awkward to write C(dialect_region, Treatment('East Central German'), name='dialect_region'), but at least it would work.
  • ...?

(BTW as a stupid workaround you can avoid typing the reference category by renaming it so it's alphabetically first.)
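
As a rough illustration of the "truncate to the unique prefix" idea from the first bullet, here is a minimal sketch. `shortest_unique_prefixes` is a hypothetical helper, not part of patsy, and applying it to level names rather than to whole column names is an assumption; the comment above does not say which strings would be shortened.

```python
def shortest_unique_prefixes(names):
    """Map each name to its shortest prefix that no other name starts with.

    Hypothetical helper. A name that is itself a prefix of another name
    cannot be shortened and is returned whole.
    """
    unique_names = list(dict.fromkeys(names))  # drop duplicates, keep order
    result = {}
    for name in unique_names:
        others = [other for other in unique_names if other != name]
        for length in range(1, len(name) + 1):
            prefix = name[:length]
            if not any(other.startswith(prefix) for other in others):
                result[name] = prefix
                break
        else:
            # Every prefix collides with some other name: keep the full name.
            result[name] = name
    return result


levels = ["East Central German", "North German", "West Central German"]
abbreviations = shortest_unique_prefixes(levels)
```

Applied to full column names instead, this would save almost nothing, because names built from the same factor share the entire `C(...)` code and only diverge inside the `[T. ...]` part.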
