ENH: Describe : add shortest, longest, avg/max/min length #59897
Comments
Thanks for the request. I'm curious about the use cases of wanting to know the min/max/average length of strings. In the examples you give, I view these as labels for which the length of the strings is not particularly important (e.g. What's in a name?). cc @WillAyd
@rhshadrach Yeah, the example wasn't exactly a use-case example, you're right about that. Now let's have a few use cases: To add some personal context: I'm a longtime Alteryx user, and this is a feature of their data investigation tools: very common, very useful, and I was surprised that describe doesn't cover it. Plus, there is a very nice project, Amphi, that aims to be a visual data preparation/ETL tool built on Python, and I would like it to incorporate a data investigation tool. Having it all in describe would definitely help a lot. Best regards, and thanks for your prompt answer to my issue. Simon
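In the first two of your examples, it seems to me you wish to validate data. I think describe is meant to output summary statistics, the idea being that a user can get a sense of what data the DataFrame contains at a glance.
These seem quite uncommon uses to me. I am negative on expanding the API here.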
Hello @rhshadrach "In the first two of your examples, it seems to me you wish to validate data. I think describe is meant to output summary statistics, the idea being that a user can get a sense of what data the DataFrame contains at a glance" Validating data would be something else, such as checking what happens if field X doesn't follow rule Y. Here, it's more in this spirit: do I have surprises in this dataset, or is the data quality good? But it can also serve other purposes, like finding the maximum string length in order to choose the right type when sending the data to a database (a varchar(10) is not the same as a varchar(32)). Moreover, the goal of the pandas describe function is
And when I ask ChatGPT about it, here is the answer:
So, the 4th point is not out of scope, as you can see. Best regards, Simon
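The database-sizing use case mentioned in this comment could be sketched roughly as follows. This is only an illustration: the helper name `suggest_varchar_sizes` and the sample data are hypothetical, and it counts characters rather than bytes, which may not match how a given database measures `varchar` length.

```python
import pandas as pd
from pandas.api.types import is_string_dtype

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David"],
    "city": ["New York", "Los Angeles", "San Francisco", "Chicago"],
    "age": [30, 25, 41, 37],
})


def suggest_varchar_sizes(frame: pd.DataFrame) -> dict:
    """Map each string column to a VARCHAR(n) wide enough for its longest value."""
    sizes = {}
    for col in frame.columns:
        if not is_string_dtype(frame[col]):
            continue  # skip numeric and other non-string columns
        lengths = frame[col].dropna().astype(str).str.len()
        if lengths.empty:
            continue  # nothing to measure in an all-missing column
        sizes[col] = f"VARCHAR({int(lengths.max())})"
    return sizes


sizes = suggest_varchar_sizes(df)
print(sizes)
```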
Hello @rhshadrach, for your information, it was added to skimpy today: aeturrell/skimpy#840 (comment)
Feature Type
Adding new functionality to pandas
Changing existing functionality in pandas
Removing existing functionality in pandas
Problem Description
Hello,
As of now, describe is mainly oriented toward numerical analysis. It's less useful when you have text (string) values.
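For reference, here is a small sketch of what describe reports today for a DataFrame containing only string columns. The sample data is made up for illustration; the summary contains the rows count, unique, top and freq, and nothing about string lengths.

```python
import pandas as pd

# Made-up sample data containing only string columns.
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David"],
    "country": ["USA", "USA", "USA", "USA"],
})

# With no numeric columns present, describe() falls back to summarizing
# every column with count / unique / top / freq.
summary = df.describe()
print(summary)
```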
Feature Description
Adding five statistics dedicated to string analysis for each relevant column:
- avg length
- max length
- min length
- shortest: one of the strings with the minimum length
- longest: one of the strings with the maximum length
Alternative Solutions
Writing something like the following, but that means more work (sorry for the formatting):
import pandas as pd

# Sample DataFrame for illustration
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'city': ['New York', 'Los Angeles', 'San Francisco', 'Chicago'],
    'country': ['USA', 'USA', 'USA', 'USA']
}
df = pd.DataFrame(data)

# Function to get string statistics
def string_column_statistics(df):
    stats = {}

# Call the function
string_stats_df = string_column_statistics(df)
print(string_stats_df)
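The body of `string_column_statistics` is cut off after `stats = {}` in the post above. A minimal sketch of how such a helper might compute the five requested statistics is shown below. The loop body is an assumption reconstructed from the Feature Description list, not the original code, and the row labels are chosen here only for illustration.

```python
import pandas as pd
from pandas.api.types import is_string_dtype

# Same sample data as in the post above.
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'city': ['New York', 'Los Angeles', 'San Francisco', 'Chicago'],
    'country': ['USA', 'USA', 'USA', 'USA']
}
df = pd.DataFrame(data)


def string_column_statistics(df):
    """Return length statistics for each string column (assumed reconstruction)."""
    stats = {}
    for col in df.columns:
        if not is_string_dtype(df[col]):
            continue  # only string columns are summarized
        values = df[col].dropna().astype(str)
        if values.empty:
            continue  # skip columns with no non-missing values
        lengths = values.str.len()
        stats[col] = {
            'avg length': lengths.mean(),
            'max length': lengths.max(),
            'min length': lengths.min(),
            # argmin/argmax return positions, so duplicate index labels are safe;
            # on ties, the first matching string is reported.
            'shortest': values.iloc[lengths.argmin()],
            'longest': values.iloc[lengths.argmax()],
        }
    # Outer keys become columns, statistic names become the row index.
    return pd.DataFrame(stats)


string_stats_df = string_column_statistics(df)
print(string_stats_df)
```

Returning a DataFrame with one column per input column mirrors the layout of describe, so the result could be read side by side with the existing summary.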
Additional Context
Best regards,
Simon