Label encoding is a technique used in machine learning to convert categorical text data into numerical form so that algorithms can process it. Each distinct category is represented by an integer. In Python, label encoding is often performed using the LabelEncoder class from the sklearn.preprocessing module.
Installing scikit-learn
If you haven’t installed scikit-learn yet, you can do so using pip:
pip install scikit-learn
Basic Example of Label Encoding
Let’s start with a simple example where we have a list of categorical values (e.g., different types of fruit) and we want to encode them into numerical labels.
Example: Label Encoding
from sklearn.preprocessing import LabelEncoder
# Sample data
fruits = ['apple', 'orange', 'banana', 'apple', 'orange', 'banana']
# Create a LabelEncoder object
label_encoder = LabelEncoder()
# Fit the label encoder and transform the labels into numerical form
encoded_labels = label_encoder.fit_transform(fruits)
print("Original labels:", fruits)
print("Encoded labels:", encoded_labels)
Output:
Original labels: ['apple', 'orange', 'banana', 'apple', 'orange', 'banana']
Encoded labels: [0 2 1 0 2 1]
In this example:
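- The LabelEncoder object is fitted to the list of categorical values and transforms them in a single call to fit_transform().
- The distinct categories are sorted alphabetically (‘apple’, ‘banana’, ‘orange’) and assigned the integers 0, 1, and 2 in that order, regardless of the order in which they appear in the data.
- The output therefore shows that ‘apple’ is encoded as 0, ‘banana’ as 1, and ‘orange’ as 2.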
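To check which integer was assigned to which category, the fitted encoder can be inspected directly. The following is a minimal sketch that relies on the classes_ attribute of a fitted LabelEncoder; the variable names classes and mapping are chosen here for illustration only.

```python
from sklearn.preprocessing import LabelEncoder

fruits = ['apple', 'orange', 'banana', 'apple', 'orange', 'banana']

label_encoder = LabelEncoder()
label_encoder.fit(fruits)

# classes_ holds the distinct labels in sorted order;
# the position of a label in classes_ is its integer code.
classes = [str(label) for label in label_encoder.classes_]

# Build a plain dictionary from label to code for easy inspection
mapping = {label: code for code, label in enumerate(classes)}
print(mapping)  # {'apple': 0, 'banana': 1, 'orange': 2}
```

Printing this mapping next to the encoded data is a quick way to confirm that the codes mean what you expect.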
Inverse Transformation
After encoding, you can decode the numerical labels back to their original categorical values using the inverse_transform() method.
Example: Inverse Transformation
# Decode the numerical labels back to the original labels
decoded_labels = label_encoder.inverse_transform(encoded_labels)
print("Decoded labels:", decoded_labels)
Output:
Decoded labels: ['apple' 'orange' 'banana' 'apple' 'orange' 'banana']
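Decoding is not limited to the array that was originally encoded. The sketch below, which uses made-up integer codes standing in for model predictions (predicted_codes is a hypothetical name), decodes them with the same fitted encoder. It also shows that a label that was not present during fitting cannot be encoded: transform() raises a ValueError in that case.

```python
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
label_encoder.fit(['apple', 'orange', 'banana'])

# Hypothetical integer codes, e.g. the predictions of a classifier
predicted_codes = [2, 0, 1]
decoded = [str(label) for label in label_encoder.inverse_transform(predicted_codes)]
print(decoded)  # ['orange', 'apple', 'banana']

# A label that was not seen during fit cannot be encoded
try:
    label_encoder.transform(['kiwi'])
    unseen_failed = False
except ValueError as error:
    unseen_failed = True
    print("Cannot encode unseen label:", error)
```

For this reason, the encoder should be fitted on data that contains every category you expect to encounter later.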
Working with DataFrames
Label encoding is commonly used with Pandas DataFrames, where you need to encode categorical columns in a dataset.
Example: Label Encoding with a DataFrame
import pandas as pd
from sklearn.preprocessing import LabelEncoder
# Sample DataFrame
data = {'Fruit': ['apple', 'orange', 'banana', 'apple', 'orange', 'banana'],
'Color': ['red', 'orange', 'yellow', 'green', 'orange', 'yellow']}
df = pd.DataFrame(data)
# Create one LabelEncoder per column; refitting a single encoder on 'Color'
# would overwrite the mapping it learned for 'Fruit'
fruit_encoder = LabelEncoder()
color_encoder = LabelEncoder()
# Apply label encoding to the 'Fruit' column
df['Fruit_Encoded'] = fruit_encoder.fit_transform(df['Fruit'])
# Apply label encoding to the 'Color' column
df['Color_Encoded'] = color_encoder.fit_transform(df['Color'])
print(df)
Output:
    Fruit   Color  Fruit_Encoded  Color_Encoded
0   apple     red              0              2
1  orange  orange              2              1
2  banana  yellow              1              3
3   apple   green              0              0
4  orange  orange              2              1
5  banana  yellow              1              3
In this example:
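- The Fruit and Color columns in the DataFrame are label encoded into numerical values.
- The encoded columns (Fruit_Encoded and Color_Encoded) represent the original categorical data in numerical form.
- Each column is encoded independently, so the same integer can stand for different categories in different columns.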
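When several columns are encoded, it helps to keep one fitted encoder per column so that each column can be decoded later. The following is one possible sketch of that pattern; the encoders dictionary and the restored_fruit variable are illustrative names, and the DataFrame is a shortened version of the sample data.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    'Fruit': ['apple', 'orange', 'banana', 'apple'],
    'Color': ['red', 'orange', 'yellow', 'green'],
})

# Keep a separate fitted encoder for every encoded column
encoders = {}
for column in ['Fruit', 'Color']:
    encoder = LabelEncoder()
    df[column + '_Encoded'] = encoder.fit_transform(df[column])
    encoders[column] = encoder

# Each column can be decoded with its own encoder
restored_fruit = [
    str(label)
    for label in encoders['Fruit'].inverse_transform(df['Fruit_Encoded'])
]
print(restored_fruit)  # ['apple', 'orange', 'banana', 'apple']
```

Storing the encoders also makes it possible to apply exactly the same mapping to new data with transform().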
When to Use Label Encoding
Label encoding is appropriate when:
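- You have categorical data with an implicit order (e.g., “low”, “medium”, “high”) and the integer codes preserve that order. Note that LabelEncoder assigns codes alphabetically (“high” = 0, “low” = 1, “medium” = 2), so the intended order has to be specified explicitly.
- You want to convert categorical data, in particular the target labels of a classification problem, into a format that machine learning models can use. The scikit-learn documentation describes LabelEncoder as intended for target values rather than input features.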
- Your categorical data has a relatively small number of distinct values.
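For ordered categories, the difference between alphabetical codes and codes that follow the intended order can be seen in a short comparison. The sketch below swaps in scikit-learn's OrdinalEncoder with an explicit categories list, which is a different class from LabelEncoder and is used here only as one way to state the order; the sample values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

levels = ['low', 'high', 'medium', 'low']

# LabelEncoder sorts alphabetically: high=0, low=1, medium=2
alphabetical = LabelEncoder().fit_transform(levels)
print(alphabetical.tolist())  # [1, 0, 2, 1]

# OrdinalEncoder accepts the intended order explicitly: low=0, medium=1, high=2
# It expects 2D input (rows x features), hence the reshape.
ordinal_encoder = OrdinalEncoder(categories=[['low', 'medium', 'high']])
ordered = ordinal_encoder.fit_transform(np.array(levels).reshape(-1, 1)).ravel()
print(ordered.tolist())  # [0.0, 2.0, 1.0, 0.0]
```

Only the second encoding satisfies low < medium < high, which is what a model that relies on the order of the codes would need.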
When Not to Use Label Encoding
Label encoding is not suitable when:
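- The categorical values do not have an ordinal relationship. Models that treat inputs as magnitudes, such as linear models or distance-based methods, may then read an order into the codes that does not exist (for example, that ‘orange’ is greater than ‘banana’).
- The categorical data has a large number of distinct values. The arbitrary integer codes then span a wide range, which amplifies the problem for models that interpret numerical data as having a specific order.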
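A common alternative for unordered categories is one-hot encoding, which creates one indicator column per category instead of a single integer column. The following minimal sketch uses pandas.get_dummies for this; it is shown as an illustration of the alternative rather than as part of label encoding itself, and the prefix and dtype arguments are choices made for this example.

```python
import pandas as pd

colors = pd.Series(['red', 'orange', 'yellow', 'green'], name='Color')

# One indicator column per category; no order is implied between categories
one_hot = pd.get_dummies(colors, prefix='Color', dtype=int)
print(one_hot)
```

Each row contains exactly one 1, so no category is numerically larger than another. The cost is one additional column per distinct value.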
Conclusion
Label encoding is a simple and effective way to convert categorical data into numerical form in Python. By using the LabelEncoder class from the sklearn.preprocessing module, you can easily encode categorical variables for use in machine learning models. However, it’s important to consider whether label encoding is appropriate for your specific use case, especially when dealing with non-ordinal categorical data.