#06 | Locating & Filtering the pandas.DataFrame

#06 | Locating & Filtering the pandas.DataFrame

Learn everything about filtering specific parts of the DataFrame based on conditions.

Possibilities

Sometimes, we want to select specific parts of the DataFrame to highlight some data points.

In this case, we refer to the topic as locating & filtering.

For example, let's load the dataset of cars:

import seaborn as sns

df_mpg = sns.load_dataset('mpg', index_col='name').drop(columns=['cylinders', 'model_year', 'origin'])
df_mpg

df1.jpeg

To filter the best cars in each statistics/column.

First, we calculate the maximum values in each column:

df_mpg.max()

mpg 46.6 displacement 455.0 horsepower 230.0 weight 5140.0 acceleration 24.8 dtype: float64

Then, we create a mask (array with True/False) to capture the rows where we have the cars with maximum values:

mask_max = (df_mpg == df_mpg.max()).sum(axis=1) > 0
mask_max

name chevrolet chevelle malibu False buick skylark 320 False ...
ford ranger False chevy s-10 False Length: 398, dtype: bool

Select the rows where the mask is True:

df_mpg_max = df_mpg[mask_max].copy()
df_mpg_max

df2.jpeg

And add some styling:

df_mpg_max.style.format('{:.0f}').background_gradient()

df3.jpeg

To understand the reasoning behind the previous example, read the rest of the article, where we explain the logic from the most basic example to locating data based on the index.

Any Object

By now, we should know the difference between the brackets [] and the parenthesis ().

We use brackets to select parts of an object. For example, let's create a list of days:

list_days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

And select the second element:

list_days[1]

'Tuesday'

Or the last element:

list_days[-1]

'Sunday'

Until the third element (included):

list_days[:3]

['Monday', 'Tuesday', 'Wednesday']

Nevertheless, the list is a simple element of Python. To get more functionalities, we use the Series object from pandas library.

Series

Let's create a Series to store the Apple Stock Return on Investment (ROI) by quarters:

import pandas as pd

sr_apple = pd.Series(
    data=[59.02, 63.57, 66.93, 69.05],
    index=['1Q', '2Q', '3Q', '4Q']
)

sr_apple

1Q 59.02 2Q 63.57 3Q 66.93 4Q 69.05 dtype: float64

iloc (integer-location) property

We use .iloc[] to select parts of the object based on the integer position of the element.

For example, let's select the first quarter ROI:

sr_apple.iloc[0]

59.02

Now, let's select the first and third quarters:

To select more than one object, we need to use double brackets [[]]:

sr_apple.iloc[[0,2,3]]

1Q 59.02 3Q 66.93 4Q 69.05 dtype: float64

Could we have accessed with the name 1Q?

sr_apple.iloc['Q1']

---------------------------------------------------------------------------

TypeError Traceback (most recent call last)

Input In [99], in <cell line: 1>() ----> 1 sr_apple.iloc['Q1']

File ~/miniforge3/lib/python3.9/site-packages/pandas/core/indexing.py:967, in _LocationIndexer.getitem(self, key) 964 axis = self.axis or 0 966 maybe_callable = com.apply_if_callable(key, self.obj) --> 967 return self._getitem_axis(maybe_callable, axis=axis)

File ~/miniforge3/lib/python3.9/site-packages/pandas/core/indexing.py:1517, in _iLocIndexer._getitem_axis(self, key, axis) 1515 key = item_from_zerodim(key) 1516 if not is_integer(key): -> 1517 raise TypeError("Cannot index by location index with a non-integer key") 1519 # validate the location 1520 self._validate_integer(key, axis)

TypeError: Cannot index by location index with a non-integer key

The iloc property only works in integers (the position of the subelements we want).

To select the elements by their label/name, we need to use the loc property:

loc (location) property

We select parts of an object with the .loc[] instance based on the label/name of the index:

sr_apple.loc['1Q']

59.02

sr_apple.loc[['1Q', '3Q', '4Q']]

1Q 59.02 3Q 66.93 4Q 69.05 dtype: float64

If we would like to access by the position, we'd get an error:

sr_apple.loc[0]

---------------------------------------------------------------------------

KeyError Traceback (most recent call last)

File ~/miniforge3/lib/python3.9/site-packages/pandas/core/indexes/base.py:3621, in Index.get_loc(self, key, method, tolerance) 3620 try: -> 3621 return self._engine.get_loc(casted_key) 3622 except KeyError as err:

File ~/miniforge3/lib/python3.9/site-packages/pandas/_libs/index.pyx:136, in pandas._libs.index.IndexEngine.get_loc()

File ~/miniforge3/lib/python3.9/site-packages/pandas/_libs/index.pyx:163, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:5198, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:5206, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 0

The above exception was the direct cause of the following exception:

KeyError Traceback (most recent call last)

Input In [102], in <cell line: 1>() ----> 1 sr_apple.loc[0]

File ~/miniforge3/lib/python3.9/site-packages/pandas/core/indexing.py:967, in _LocationIndexer.getitem(self, key) 964 axis = self.axis or 0 966 maybe_callable = com.apply_if_callable(key, self.obj) --> 967 return self._getitem_axis(maybe_callable, axis=axis)

File ~/miniforge3/lib/python3.9/site-packages/pandas/core/indexing.py:1202, in _LocIndexer._getitem_axis(self, key, axis) 1200 # fall thru to straight lookup 1201 self._validate_key(key, axis) -> 1202 return self._get_label(key, axis=axis)

File ~/miniforge3/lib/python3.9/site-packages/pandas/core/indexing.py:1153, in _LocIndexer._get_label(self, label, axis) 1151 def _get_label(self, label, axis: int): 1152 # GH#5667 this will fail if the label is not present in the axis. -> 1153 return self.obj.xs(label, axis=axis)

File ~/miniforge3/lib/python3.9/site-packages/pandas/core/generic.py:3864, in NDFrame.xs(self, key, axis, level, drop_level) 3862 new_index = index[loc] 3863 else: -> 3864 loc = index.get_loc(key) 3866 if isinstance(loc, np.ndarray): 3867 if loc.dtype == np.bool_:

File ~/miniforge3/lib/python3.9/site-packages/pandas/core/indexes/base.py:3623, in Index.get_loc(self, key, method, tolerance) 3621 return self._engine.get_loc(casted_key) 3622 except KeyError as err: -> 3623 raise KeyError(key) from err 3624 except TypeError: 3625 # If we have a listlike key, _check_indexing_error will raise 3626 # InvalidIndexError. Otherwise we fall through and re-raise 3627 # the TypeError. 3628 self._check_indexing_error(key)

KeyError: 0

It results in KeyError because we don't have any Key in the index to be 0:

sr_apple

1Q 59.02 2Q 63.57 3Q 66.93 4Q 69.05 dtype: float64

We have:

sr_apple.keys()

Index(['1Q', '2Q', '3Q', '4Q'], dtype='object')

The loc property only works with the labels, not the position.

Masking with boolean objects

Now we'd like to select parts based on a condition. For example, let's show the quarters we had a Return on Investment (ROI) above 60.

First, we create a boolean object based on the stated condition:

sr_apple

1Q 59.02 2Q 63.57 3Q 66.93 4Q 69.05 dtype: float64

sr_apple > 60

1Q False 2Q True 3Q True 4Q True dtype: bool

mask_60 = sr_apple > 60

Now we pass the previous object to the .loc property:

sr_apple.loc[mask_60]

2Q 63.57 3Q 66.93 4Q 69.05 dtype: float64

And here, we have the data for which the ROI is higher than 60.

Just the brackets []

sr_apple

1Q 59.02 2Q 63.57 3Q 66.93 4Q 69.05 dtype: float64

We could also access the data by only using the brackets, without the ~.iloc~ property:

sr_apple['1Q']

59.02

And also, the position:

sr_apple[0]

59.02

And the mask:

sr_apple[mask_60]

2Q 63.57 3Q 66.93 4Q 69.05 dtype: float64

So far, we have played with 1-Dimensional objects. Now it's time to level up and play with 2-Dimensional objects, like the DataFrame.

DataFrame

Let's play with a dataset of cars:

import seaborn as sns

df_mpg = sns.load_dataset(name='mpg', index_col='name')
df_mpg

df4.jpeg

iloc (integer-location) property

We can select the second row:

df_mpg.iloc[2]

mpg 18.0 cylinders 8 displacement 318.0 horsepower 150.0 weight 3436 acceleration 11.0 model_year 70 origin usa Name: plymouth satellite, dtype: object

And keep the DataFrame style if we use double brackets [[]]:

df_mpg.iloc[[2]]

df5.jpeg

We can also slice (a term used for filtering as well) consecutive elements of the DataFrame with the colon :.

For example, let's select the first 4 rows:

df_mpg.iloc[:4]

df6.jpeg

Instead of:

df_mpg.iloc[[0,1,2,3]]

df7.jpeg

We can also select the columns we want.

For example, let's select the first 3 columns:

df_mpg.iloc[:4, :3]

df8.jpeg

Learn how to become an independent Data Analyist programmer who knows how to extract meaningful insights from Data Visualizations.

Or the rest of the columns from the 3rd position (not included):

df_mpg.iloc[:4, 3:]

df9.jpeg

Or the last 3 columns by using the -:

df_mpg.iloc[:4, -3:]

df10.jpeg

loc (location) property

We can also select parts of the DataFrame based on the index and column labels (2-Dimensions):

df_mpg.loc[['ford torino', 'fiat 124 sport coupe'], ['origin', 'model_year', 'cylinders']]

df11.jpeg

df_mpg.loc[:'fiat 124 sport coupe', :'cylinders']

df12.jpeg

Masking with boolean objects

Single Condition

Out of all the cars:

df_mpg.index

Index(['chevrolet chevelle malibu', 'buick skylark 320', 'plymouth satellite', 'amc rebel sst', 'ford torino', 'ford galaxie 500', 'chevrolet impala', 'plymouth fury iii', 'pontiac catalina', 'amc ambassador dpl', ... 'chrysler lebaron medallion', 'ford granada l', 'toyota celica gt', 'dodge charger 2.2', 'chevrolet camaro', 'ford mustang gl', 'vw pickup', 'dodge rampage', 'ford ranger', 'chevy s-10'], dtype='object', name='name', length=398)

We could select all the fiat cars if we had a boolean array based on this condition:

mask_fiat = df_mpg.index.str.contains('fiat')
mask_fiat

array([False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, True, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, True, False, False, True, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, True, True, False, False, True, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, True, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, True, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False])

We can observe a few Trues where we find some Fiats.

Let's filter them and show all the columns with the ::

df_mpg.loc[mask_fiat, :]

df13.jpeg

Although we could have omitted the ::

df_mpg.loc[mask_fiat]

df14.jpeg

Multiple Conditions

Both Conditions &

Just the fiats whose horsepower is above 80:

mask_hp = df_mpg.horsepower > 80
mask_hp

name chevrolet chevelle malibu True buick skylark 320 True ...
ford ranger False chevy s-10 True Name: horsepower, Length: 398, dtype: bool

df_mpg.loc[mask_hp & mask_fiat, :]

df15.jpeg

Any Condition |

We could also select all fiats OR cars whose horsepower is above 80:

df_mpg.loc[mask_hp | mask_fiat, :]

df16.jpeg

Just the brackets []

We can select the columns by their labels:

df_mpg['acceleration']

name chevrolet chevelle malibu 12.0 buick skylark 320 11.5 ... ford ranger 18.6 chevy s-10 19.4 Name: acceleration, Length: 398, dtype: float64

df_mpg[['acceleration', 'origin', 'model_year']]

df17.jpeg

But we can't select the rows by the index labels:

df_mpg['amc rebel sst']

---------------------------------------------------------------------------

KeyError Traceback (most recent call last)

File ~/miniforge3/lib/python3.9/site-packages/pandas/core/indexes/base.py:3621, in Index.get_loc(self, key, method, tolerance) 3620 try: -> 3621 return self._engine.get_loc(casted_key) 3622 except KeyError as err:

...

File ~/miniforge3/lib/python3.9/site-packages/pandas/core/indexes/base.py:3623, in Index.get_loc(self, key, method, tolerance) 3621 return self._engine.get_loc(casted_key) 3622 except KeyError as err: -> 3623 raise KeyError(key) from err 3624 except TypeError: 3625 # If we have a listlike key, _check_indexing_error will raise 3626 # InvalidIndexError. Otherwise we fall through and re-raise 3627 # the TypeError. 3628 self._check_indexing_error(key)

KeyError: 'amc rebel sst'

Unless we use the colon ::

df_mpg[:'amc rebel sst']

df18.jpeg

df_mpg['buick skylark 320':'amc rebel sst']

df19.jpeg

We can also select the rows by position:

df_mpg[:4]

df20.jpeg

But we can't select both rows and columns (2-Dimensions):

df_mpg[:4,:3]

---------------------------------------------------------------------------

TypeError Traceback (most recent call last)

File ~/miniforge3/lib/python3.9/site-packages/pandas/core/indexes/base.py:3621, in Index.get_loc(self, key, method, tolerance) 3620 try: -> 3621 return self._engine.get_loc(casted_key) 3622 except KeyError as err:

...

File ~/miniforge3/lib/python3.9/site-packages/pandas/core/indexes/base.py:5637, in Index._check_indexing_error(self, key) 5633 def _check_indexing_error(self, key): 5634 if not is_scalar(key): 5635 # if key is not a scalar, directly raise an error (the code below 5636 # would convert to numpy arrays and raise later any way) - GH29926 -> 5637 raise InvalidIndexError(key)

InvalidIndexError: (slice(None, 4, None), slice(None, 3, None))

Unless we specify the columns we want in extra brackets:

df_mpg[:4]['acceleration']

name chevrolet chevelle malibu 12.0 buick skylark 320 11.5 plymouth satellite 11.0 amc rebel sst 12.0 Name: acceleration, dtype: float64

df_mpg[:4][['acceleration']]

df21.jpeg

df_mpg[:4][['acceleration', 'origin']]

df22.jpeg

We can also select the rows given boolean-arrays (a.k.a. masks):

df_mpg[mask_fiat]

df23.jpeg

df_mpg[mask_fiat | mask_hp]

df24.jpeg

df_mpg[mask_fiat & mask_hp]

df25.jpeg

It doesn't mean that I cannot later select the columns that we want (programming is the art of everything, we just need to find a way):

df_mpg[mask_fiat & mask_hp]['mpg']

name fiat 124 sport coupe 26.0 fiat 131 28.0 Name: mpg, dtype: float64

df_mpg[mask_fiat & mask_hp][['mpg', 'origin', 'model_year']]

df26.jpeg

Everything may be a bit confusing, but we hope you get the main idea behind locating and masking:

  1. Select the parts of an object with brackets []

  2. We can access it through

    1. The label/name loc

    2. The integer position iloc

    3. Masks: boolean arrays based on conditions

    4. Just the brackets []*

  3. If the object has:

    1. 1-Dimension object[:]

    2. 2-Dimension object[:,:]

*Carefully because it has many variations of use case, as we observed above

DataFrame MultiIndex

Let's load a dataset with various categorical columns since we summarise data based on categories, not numbers.

df_tips = sns.load_dataset(name='tips')
df_tips

df27.jpeg

Let's make a pivot table to summarise the information to obtain a Hierarchical* DataFrame.

dfres = df_tips.pivot_table(index=['smoker', 'time'], columns='sex', aggfunc='size')
dfres

df28.jpeg

*A Hierarchical DataFrame (MultiIndex) contains two "columns" as an index. As we may observe below:

dfres.index

MultiIndex([('Yes', 'Lunch'), ('Yes', 'Dinner'), ( 'No', 'Lunch'), ( 'No', 'Dinner')], names=['smoker', 'time'])

First Index

Let's locate some parts of the Hierarchical DataFrame:

dfres

df29.jpeg

By using the .loc property:

dfres.loc['Yes', :]

df30.jpeg

dfres.loc['No', :]

df31.jpeg

Second Index

As we have multiple indexes [index1, index2, columns], we can select a part of the second index:

dfres.loc[:, 'Lunch', :]

df32.jpeg

dfres.loc[:, 'Dinner', :]

df33.jpeg

DataFrame MultiIndex & MultiColumns

Let's now play with a DataFrame that is both MultiIndex and MultiColumns:

dfres = df_tips.pivot_table(index=['smoker', 'time'], columns=['sex', 'day'], aggfunc='size')
dfres

df34.jpeg

We may observe two levels in the columns above.

loc (location) property

First Index

We apply the same reasoning we used in the previous sections, [index1, index2, column1, column2].

dfres.loc['No', :, :, :]

df35.jpeg

Although, we can make it shorter.

dfres.loc['No', :]

df36.jpeg

Second Index

The same applies to the second index:

dfres.loc[:,'Dinner', :, :]

df37.jpeg

dfres.loc[:,'Dinner', :]

df38.jpeg

Second Index & Second Column

Let's try to get Dinners on Sundays:

dfres.loc[:, 'Dinner', :, 'Sun']

---------------------------------------------------------------------------

IndexError Traceback (most recent call last)

Input In [158], in <cell line: 1>() ----> 1 dfres.loc[:, 'Dinner', :, 'Sun']

File ~/miniforge3/lib/python3.9/site-packages/pandas/core/indexing.py:961, in _LocationIndexer.getitem(self, key) 959 if self._is_scalar_access(key): 960 return self.obj._get_value(*key, takeable=self._takeable) --> 961 return self._getitem_tuple(key) 962 else: 963 # we by definition only have the 0th axis 964 axis = self.axis or 0

...

File ~/miniforge3/lib/python3.9/site-packages/pandas/core/indexes/frozen.py:70, in FrozenList.getitem(self, n) 68 if isinstance(n, slice): 69 return type(self)(super().getitem(n)) ---> 70 return super().getitem(n)

IndexError: list index out of range

To make it work, this time we need to create an intermediate object to separate rows and columns:

idx = pd.IndexSlice

dfres.loc[idx[:, 'Dinner'], idx[:, 'Sun']]

df39.jpeg

Second Index & First Column

dfres.loc[idx[:, 'Dinner'], idx['Male', :]]

df40.jpeg

Using the Slice

We can also use the slice() property:

dfres.loc[('Yes', slice(None)), (slice(None), 'Sun')]

df41.jpeg

dfres.loc['Yes', ('Female', slice(None))]

df42.jpeg

dfres.loc[(slice(None), 'Lunch'), 'Female']

df43.jpeg

dfres.loc[(slice(None), 'Lunch'), ('Female', slice(None))]

df44.jpeg

dfres.loc[idx[:, 'Dinner'], idx['Female', :]]

df45.jpeg

iloc (integer-location) property

dfres

df46.jpeg

As always, we can select by the position of the values with the iloc property:

dfres.iloc[:2, :2]

df47.jpeg

dfres.iloc[:2, 2:]

df48.jpeg

DataFrame with DateTimeIndex

Now, we will use a DataFrame that has a DateTimeIndex:

df_tsla = pd.read_excel('tsla_stock.xlsx', index_col=0)
df_tsla

df49.jpeg

loc (location) property

We can select parts of the DataFrame based on just one part of the DateTimeIndex. For example, we can select everything from the year 2020 and move forward:

df_tsla.loc['2020':]

df50.jpeg

Until the last day of 2020:

df_tsla.loc[:'2020']

df51.jpeg

Between two years:

df_tsla.loc['2019':'2020']

df52.jpeg

One complete year:

df_tsla.loc['2019']

df53.jpeg

We can even select a specific year-month:

df_tsla.loc['2019-06']

df54.jpeg

iloc (integer-location) property

Of course, we can also select parts of the DataFrame based on the position of the values with iloc:

df_tsla.iloc[:4, :3]

df55.jpeg

df_tsla.iloc[-4:, :3]

df56.jpeg

Did you find this article valuable?

Support Resolving Python by becoming a sponsor. Any amount is appreciated!