# #03 | Grouping & Pivot Tables

## Learn how to summarise the information using the pandas DataFrame in order to become an efficient Data Analyst ## Possibilities

Look at the following example as an aspiration you can achieve if you fully understand and replicate this whole tutorial with your data.

Let's load a dataset that contains information from transactions in tables (rows) at a restaurant considering socio-demographic and economic variables (columns).

``````import seaborn as sns

df_tips
`````` Grouping data to summarise the information helps you identify conclusions. For example, the summary below shows that Dinners on Sundays come to the best customers because they:

1. Spend more on average (\$21.41)

2. Give more tips on average (\$3.25)

3. Come more people at the same table on average (\$2.84)

``````df_tips.groupby(by=['day', 'time'])\
.mean()\
.fillna(0)\
`````` ``````df_tips.groupby(by=['day', 'time'])\
.mean()\
.fillna(0)\
.style.format('{:.2f}').bar(axis=0, width=50, align='zero')
`````` Let's dig into the details of the `.groupby()` function from the basics in the following sections.

## Grouping by 1 Column

We use the `.groupby()` function to generate an object that contains as many `DataFrames` as categories are in the column.

``````df_tips.groupby('sex')
``````

As we have two groups in sex (Female and Male), the length of the `DataFrameGroupBy` object returned by the `groupby()` function is 2:

``````len(df_tips.groupby('sex'))
``````

How can we work with the object `DataFrameGroupBy`?

### Calculate the Average for All Columns

We use the `.mean()` function to get the average of the numerical columns for the two groups:

``````df_tips.groupby('sex').mean()
`````` A pretty and simple syntax to summarise the information, right?

• But what's going on inside the `DataFrameGroupBy` object?
``````df_tips.groupby('sex')
``````
``````df_grouped = df_tips.groupby('sex')
``````

The `DataFrameGroupBy` object contains 2 `DataFrames`. To see one of them `DataFrame` you need to use the function `.get_group()` and pass the group whose `DataFrame` you'd like to return:

``````df_grouped.get_group('Male')
`````` ``````df_grouped.get_group('Female')
`````` Learn how to become an independent Data Analyist programmer who knows how to extract meaningful insights from Data Visualizations.

As the `DataFrameGroupBy` distinguish the categories, at the moment we apply an aggregation function (click here to see a list of them), we will get the mathematical operations for those groups separately:

``````df_grouped.mean()
`````` We could apply the function to each `DataFrame` separately. Although it is not the point of the `.groupby()` function.

``````df_grouped.get_group('Male').mean(numeric_only=True)
``````
``````df_grouped.get_group('Female').mean(numeric_only=True)
``````

### Compute Functions to 1 Column

To get the results for just 1 column of interest, we access the column:

``````df_grouped.total_bill
``````

And use the aggregation function we wish, `.sum()` in this case:

``````df_grouped.total_bill.sum()
``````

We get the result for just 1 column (total_bill) because the `DataFrames` generated at `.groupby()` are accessed as if they were simple `DataFrames`:

``````df_grouped.get_group('Female')
`````` ``````df_grouped.get_group('Female').total_bill
``````
``````df_grouped.get_group('Female').total_bill.sum()
``````
``````df_grouped.get_group('Male').total_bill.sum()
``````
``````df_grouped.total_bill.sum()
``````

## Grouping by 2 Columns

So far, we have summarised the data based on the categories of just one column. But, what if we'd like to summarise the data based on the combinations of the categories within different categorical columns?

### Compute 1 Function

``````df_tips.groupby(by=['day', 'smoker']).sum()
`````` ### Pivot Tables

We could have also used another function `.pivot_table()` to get the same numbers:

``````df_tips.pivot_table(index='day', columns='smoker', aggfunc='sum')
`````` Which one is best?

• I leave it up to your choice; I'd prefer to use the `.pivot_table()` because the syntax makes it more accessible.

### Compute More than 1 Function

The thing doesn't stop here; we can even compute different aggregation functions at the same time:

#### Groupby

``````df_tips.groupby(by=['day', 'smoker'])\
.total_bill\
.agg(func=['sum', 'mean'])
`````` #### Pivot Table

``````df_tips.pivot_table(index='day', columns='smoker',
values='total_bill', aggfunc=['sum', 'mean'])
`````` ``````dfres = df_tips.pivot_table(index='day', columns='smoker',
values='total_bill', aggfunc=['sum', 'mean'])
``````

You could even style the output `DataFrame`:

``````dfres.style.background_gradient()
`````` For me, it's nicer than styling the `.groupby()` returned DataFrame.

As we say in Spain:

Pa' gustos los colores!

``````df_tips.groupby(by=['day', 'smoker']).total_bill.agg(func=['sum', 'mean'])
`````` ``````dfres = df_tips.groupby(by=['day', 'smoker']).total_bill.agg(func=['sum', 'mean'])
``````
``````dfres.style.background_gradient()
`````` ## Pivot Tables in Depth

We can compute more than one mathematical operation:

``````df_tips.pivot_table(index='sex', columns='time',
aggfunc=['sum', 'mean'], values='total_bill')
`````` And use more than one column in each of the parameters:

``````df_tips.pivot_table(index='sex', columns='time',
aggfunc=['sum', 'mean'], values=['total_bill', 'tip'])
`````` ``````df_tips.pivot_table(index=['day', 'smoker'], columns='time',
aggfunc=['sum', 'mean'], values=['total_bill', 'tip'])
`````` ``````df_tips.pivot_table(index=['day', 'smoker'], columns=['time', 'sex'],
aggfunc=['sum', 'mean'], values=['total_bill', 'tip'])
`````` ## The `.size()` Function

### `.groupby()`

#### 1 Variable to Group By

The `.size()` is a function used to count the number of rows (observations) in each of the `DataFrames` generated by `.groupby()`.

``````df_grouped.size()
``````

#### 2 Variables to Group By

``````df_tips.groupby(by=['sex', 'time']).size()
``````

### `.pivot_table()`

We can use `.pivot_table()` to represent the data clearer:

``````df_tips.pivot_table(index='sex', columns='time', aggfunc='size')
`````` #### Other Example 1

``````df_tips.pivot_table(index='smoker', columns=['day', 'sex'],aggfunc='size')
`````` ``````dfres = df_tips.pivot_table(index='smoker', columns=['day', 'sex'], aggfunc='size')
``````
``````dfres.style.background_gradient()
`````` #### Other Example 2

``````df_tips.pivot_table(index=['day', 'time'], columns=['smoker', 'sex'], aggfunc='size')
`````` ``````dfres = df_tips.pivot_table(index=['day', 'time'], columns=['smoker', 'sex'], aggfunc='size')
``````
``````dfres.style.background_gradient()
`````` We can even choose the way we'd like to gradient colour the cells:

• `axis=1`: the upper value between the columns of the same row

• `axis=2`: the upper value between the rows of the same column

``````dfres.style.background_gradient(axis=1)
``````  