When it comes to selecting data from a DataFrame, Pandas loc and iloc are two top favorites. They are fast, easy to read, and sometimes interchangeable.
In this article we will look at the differences between loc and iloc, look at their similarities, and see how to perform data selection with them. We will cover the following topics:
- Differences between loc and iloc
- Select via a single value
- Select via a list of values
- Select a range of data via slicing
- Select via conditions and callables
- loc and iloc are interchangeable when labels are 0-based integers
Please check the notebook for the source code.
1. Differences between loc and iloc
The main difference between loc and iloc is:
loc is label-based, which means you must specify rows and columns by their row and column labels.
iloc is integer position-based, which means you must specify rows and columns by their integer position values (0-based integer position).
The sections below walk through the differences and similarities between loc and iloc.
For demonstration, we create a DataFrame and load it with the Day column as the index.
import pandas as pd
df = pd.read_csv('data/data.csv', index_col=['Day'])
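If the CSV file is not at hand, an equivalent DataFrame can be built inline. The sketch below is a stand-in, not the article's file: the Temperature column, Friday's row, and Tuesday's weather come from the outputs shown in this article, while every other value is a made-up placeholder.

```python
import pandas as pd

# Stand-in for data/data.csv. Values marked "placeholder" are NOT from the
# original file; they only make the frame complete enough to run.
df = pd.DataFrame(
    {
        # placeholder except Tue ('Sunny') and Fri ('Shower')
        "Weather": ["Sunny", "Sunny", "Cloudy", "Cloudy", "Shower", "Shower", "Sunny"],
        # taken from the article's printed output
        "Temperature": [12.79, 19.67, 17.51, 14.44, 10.51, 11.07, 17.50],
        # placeholder except Fri (26)
        "Wind": [13, 28, 16, 11, 26, 27, 20],
        # placeholder except Fri (79)
        "Humidity": [30, 96, 20, 22, 79, 62, 10],
    },
    index=pd.Index(["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"], name="Day"),
)

print(df.shape)  # (7, 4)
```

With this frame, label 'Fri' sits at row position 4 and 'Temperature' at column position 1, which matches the positions used in the examples that follow.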
2. Select via a single value
Both loc and iloc accept a single value for the row and for the column, written as loc[row_label, column_label] and iloc[row_position, column_position].
For example, let’s say we’d like to retrieve Friday’s temperature value.
With loc, we can pass the row label 'Fri' and the column label 'Temperature'.
# To get Friday's temperature
>>> df.loc['Fri', 'Temperature']
10.51
The equivalent iloc statement must take the row number 4 and the column number 1.
# The equivalent `iloc` statement
>>> df.iloc[4, 1]
10.51
We can also use : to return all data along an axis. For example, to get all rows:
# To get all rows
>>> df.loc[:, 'Temperature']
Day
Mon    12.79
Tue    19.67
Wed    17.51
Thu    14.44
Fri    10.51
Sat    11.07
Sun    17.50
Name: Temperature, dtype: float64

# The equivalent `iloc` statement
>>> df.iloc[:, 1]
And to get all columns:
# To get all columns
>>> df.loc['Fri', :]
Weather        Shower
Temperature     10.51
Wind               26
Humidity           79
Name: Fri, dtype: object

# The equivalent `iloc` statement
>>> df.iloc[4, :]
Note that the above 2 outputs are Series. loc and iloc will return a Series when the result is 1-dimensional data.
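One consequence worth seeing in code: wrapping the single label or position in a list keeps the result 2-dimensional, so a DataFrame comes back instead of a Series. A minimal sketch, using a small stand-in frame (the Wind values other than Friday's are placeholders, not the article's data):

```python
import pandas as pd

# Small stand-in frame; Wind values except Fri's 26 are placeholders.
df = pd.DataFrame(
    {"Temperature": [14.44, 10.51, 11.07], "Wind": [11, 26, 27]},
    index=pd.Index(["Thu", "Fri", "Sat"], name="Day"),
)

row_as_series = df.loc["Fri", :]    # scalar label -> 1-dimensional Series
row_as_frame = df.loc[["Fri"], :]   # one-element list -> 1-row DataFrame

print(type(row_as_series).__name__)  # Series
print(type(row_as_frame).__name__)   # DataFrame
print(row_as_frame.shape)            # (1, 2)

# The same idea with positions: a one-element list keeps the column as a frame
col_as_frame = df.iloc[:, [0]]
print(col_as_frame.shape)            # (3, 1)
```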
3. Select via a list of values
We can pass a list of labels to loc to select multiple rows or columns:
# Multiple rows
>>> df.loc[['Thu', 'Fri'], 'Temperature']
Day
Thu    14.44
Fri    10.51
Name: Temperature, dtype: float64

# Multiple columns
>>> df.loc['Fri', ['Temperature', 'Wind']]
Temperature    10.51
Wind              26
Name: Fri, dtype: object
Similarly, a list of integer values can be passed to iloc to select multiple rows or columns. Here are the equivalent statements using iloc:
>>> df.iloc[[3, 4], 1]
Day
Thu    14.44
Fri    10.51
Name: Temperature, dtype: float64

>>> df.iloc[4, [1, 2]]
Temperature    10.51
Wind              26
Name: Fri, dtype: object
All of the above outputs are Series because their results are 1-dimensional data.
The output will be a DataFrame when the result is 2-dimensional data, for example, when accessing multiple rows and columns:
# Multiple rows and columns
rows = ['Thu', 'Fri']
cols = ['Temperature', 'Wind']
df.loc[rows, cols]
The equivalent iloc statement is:
rows = [3, 4]
cols = [1, 2]
df.iloc[rows, cols]
4. Select a range of data via slicing
Slicing (written as start:stop:step) is a powerful technique that makes it possible to select a range of data. It is very useful when we want to select everything between two items.
loc with slicing
With loc, we can use the syntax A:B to select data from label A to label B (both A and B are included):
# Slicing column labels
rows = ['Thu', 'Fri']
df.loc[rows, 'Temperature':'Humidity']
# Slicing row labels
cols = ['Temperature', 'Wind']
df.loc['Mon':'Thu', cols]
We can use the syntax A:B:S to select data from label A to label B with step size S (both A and B are included):
# Slicing with step
df.loc['Mon':'Fri':2, :]
iloc with slicing
With iloc, we can also use the syntax n:m to select data from position n (included) to position m (excluded). The main difference from loc is that the endpoint m is excluded from the result.
For example, select columns from position 0 to 3 (excluded):
df.iloc[[1, 2], 0:3]
Similarly, we can use the syntax n:m:s to select data from position n (included) to position m (excluded) with step size s. Note that the endpoint m is excluded.
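The step form and the endpoint rule can be checked side by side. The sketch below uses only the Temperature column shown earlier; the variable names are illustrative.

```python
import pandas as pd

days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
temps = pd.DataFrame(
    {"Temperature": [12.79, 19.67, 17.51, 14.44, 10.51, 11.07, 17.50]},
    index=pd.Index(days, name="Day"),
)

# iloc: positions 0 and 2 (position 4 is the endpoint and is excluded)
by_position = temps.iloc[0:4:2, :]
print(list(by_position.index))  # ['Mon', 'Wed']

# loc: labels Mon, Wed, Fri (the endpoint 'Fri' is included)
by_label = temps.loc["Mon":"Fri":2, :]
print(list(by_label.index))     # ['Mon', 'Wed', 'Fri']
```

Although 'Fri' sits at position 4, the iloc slice ending at 4 stops before it, while the loc slice ending at 'Fri' includes it.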
5. Select via conditions and callables
loc with conditions
Often we want to filter the data based on conditions. For example, we may need to find the rows where humidity is greater than 50.
With loc, we just need to pass the condition to the loc statement.
# One condition
df.loc[df.Humidity > 50, :]
Sometimes we may need to use multiple conditions to filter our data. For example, find all the rows where humidity is more than 50 and the weather is Shower:
# Multiple conditions
df.loc[
    (df.Humidity > 50) & (df.Weather == 'Shower'),
    ['Temperature', 'Wind'],
]
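Each condition needs its own parentheses because & binds more tightly than comparison operators such as > and ==. The same pattern extends to | (or) and ~ (not). A minimal sketch with a stand-in frame; only Friday's row comes from the article, the other rows are placeholders:

```python
import pandas as pd

# Stand-in frame; the Thu and Sat rows are placeholders, not the article's data.
df = pd.DataFrame(
    {"Weather": ["Cloudy", "Shower", "Shower"], "Humidity": [22, 79, 62]},
    index=pd.Index(["Thu", "Fri", "Sat"], name="Day"),
)

# AND: both conditions must hold
both = df.loc[(df["Humidity"] > 50) & (df["Weather"] == "Shower")]
print(list(both.index))     # ['Fri', 'Sat']

# OR: at least one condition must hold
either = df.loc[(df["Humidity"] > 70) | (df["Weather"] == "Cloudy")]
print(list(either.index))   # ['Thu', 'Fri']

# NOT: invert a condition
negated = df.loc[~(df["Weather"] == "Shower")]
print(list(negated.index))  # ['Thu']
```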
iloc with conditions
For iloc, we will get a ValueError if we pass the condition straight into the statement:
# Getting ValueError
df.iloc[df.Humidity > 50, :]
We get the error because iloc cannot accept a boolean Series; it only accepts a boolean list. We can use the list() function to convert a Series into a boolean list.
# Single condition
df.iloc[list(df.Humidity > 50)]
Similarly, we can use list() to convert the output of multiple conditions into a boolean list:
# Multiple conditions
df.iloc[
    list((df.Humidity > 50) & (df.Weather == 'Shower')),
    :,
]
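As a hedged aside, list() is not the only conversion that works: iloc also accepts a boolean NumPy array, so Series.to_numpy() is an alternative. The sketch below uses a small stand-in frame whose Humidity values, apart from Friday's 79, are placeholders rather than the article's data.

```python
import pandas as pd

# Stand-in frame; only Fri's Humidity (79) comes from the article.
df = pd.DataFrame(
    {"Temperature": [14.44, 10.51, 11.07], "Humidity": [22, 79, 62]},
    index=pd.Index(["Thu", "Fri", "Sat"], name="Day"),
)

mask = df["Humidity"] > 50   # boolean Series, aligned on the Day index

# The article reports that iloc rejects the Series itself;
# record whether that happens in the installed pandas version.
try:
    df.iloc[mask]
    rejected = False
except (ValueError, NotImplementedError):
    rejected = True
print(rejected)

via_list = df.iloc[list(mask)]          # the article's approach
via_array = df.iloc[mask.to_numpy()]    # alternative: boolean NumPy array

print(via_list.equals(via_array))  # True
print(list(via_array.index))       # ['Fri', 'Sat']
```

Both conversions drop the index from the mask, so the selection is purely positional, which is what iloc expects.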
loc with callable
loc accepts a callable as an indexer. The callable must be a function with one argument that returns valid output for indexing.
For example, to select columns:
# Selecting columns
df.loc[:, lambda df: ['Humidity', 'Wind']]
And to filter data with a callable:
# With condition
df.loc[lambda df: df.Humidity > 50, :]
iloc with callable
iloc can also take a callable as an indexer.
df.iloc[lambda df: [0, 1], :]
To filter data with a callable, iloc requires converting the output of the conditions into a boolean list:
df.iloc[lambda df: list(df.Humidity > 50), :]
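Callables are most useful in method chains, where the intermediate DataFrame has no variable name to refer to. A minimal sketch, assuming a small stand-in frame (the Humidity values apart from Friday's are placeholders) and a hypothetical derived column named TempF:

```python
import pandas as pd

# Stand-in frame; only Fri's Humidity (79) comes from the article.
df = pd.DataFrame(
    {"Temperature": [14.44, 10.51, 11.07], "Humidity": [22, 79, 62]},
    index=pd.Index(["Thu", "Fri", "Sat"], name="Day"),
)

result = (
    # TempF is a hypothetical column added only for this illustration
    df.assign(TempF=lambda d: d["Temperature"] * 9 / 5 + 32)
      # d is the frame returned by assign, which has no name of its own
      .loc[lambda d: d["TempF"] > 51, ["TempF", "Humidity"]]
)

print(list(result.index))    # ['Thu', 'Sat']
print(list(result.columns))  # ['TempF', 'Humidity']
```

Without the callable, the chain would have to be broken up so the intermediate frame could be stored in a variable and referenced in the condition.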
6. loc and iloc are interchangeable when labels are 0-based integers
For demonstration, let’s create a DataFrame with 0-based integers as headers and index labels.
df = pd.read_csv(
    'data/data.csv',
    header=None,
    skiprows=[0],  # skip the header row
)
With header=None, Pandas will generate 0-based integer values as column headers. With skiprows=[0], the header row we used before (Weather, Temperature, etc.) will be skipped.
Now loc, a label-based data selector, can accept a single integer and a list of integer values. For example:
>>> df.loc[1, 2]
19.67
>>> df.loc[1, [1, 2]]
1    Sunny
2    19.67
Name: 1, dtype: object
The reason they work is that those integer values (1 and 2) are interpreted as labels of the index. This usage is not an integer position along the index and is a little confusing.
In this case, loc and iloc are interchangeable when you select via a single value or a list of values.
>>> df.loc[1, 2] == df.iloc[1, 2]
True
>>> df.loc[1, [1, 2]] == df.iloc[1, [1, 2]]
1    True
2    True
Name: 1, dtype: bool
Note that loc and iloc still behave differently when selecting via slices and conditions. They are essentially different because:
- Slice: the endpoint is excluded from the iloc result, but included in the loc result
- Conditions: loc accepts a boolean Series, but iloc can only accept a boolean list.
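These two differences are easy to confirm on a frame whose row labels are 0-based integers. The sketch below relies on the default integer index that Pandas assigns and reuses the Temperature values shown earlier; the variable names are illustrative.

```python
import pandas as pd

# Default index: row labels are 0..6, identical to the row positions.
temps = pd.DataFrame(
    {"Temperature": [12.79, 19.67, 17.51, 14.44, 10.51, 11.07, 17.50]}
)

# Slice: the same expression selects a different number of rows
print(list(temps.loc[1:3].index))   # [1, 2, 3]  (endpoint label 3 included)
print(list(temps.iloc[1:3].index))  # [1, 2]     (endpoint position 3 excluded)

# Conditions: loc takes the boolean Series, iloc needs it converted first
mask = temps["Temperature"] > 15
print(list(temps.loc[mask].index))         # [1, 2, 6]
print(list(temps.iloc[list(mask)].index))  # [1, 2, 6]
```

For conditions, the selected rows end up the same once the mask is converted; the difference lies in which input each indexer accepts.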
Finally, here’s a summary:
loc is label-based and the allowed inputs are:
- A single label, such as 'Fri' or 1 (note that 1 is interpreted as a label of the index, not as a position)
- A list of labels, such as ['A', 'B', 'C'] or [1, 2, 3] (note that 1, 2, 3 are interpreted as labels of the index)
- A slice with labels, such as 'Mon':'Thu' (both endpoints are included)
- Conditions: a boolean Series or a boolean array
- A callable function with one argument
iloc is integer position-based and the allowed inputs are:
- An integer, such as 4
- A list or array of integers, such as [1, 2, 3]
- A slice with integers, such as 0:3 (the endpoint 3 is excluded)
- Conditions, but only as a boolean array or list
- A callable function with one argument
loc and iloc are interchangeable when the labels of the Pandas DataFrame are 0-based integers.
I hope this article helps you save time learning Pandas data selection. I recommend checking the documentation to learn about other things you can do.