October 13, 2024

Label Encoding in Python

Label encoding is a technique used in machine learning to convert categorical text data into numerical form so that algorithms can process it: each category is represented by an integer. In Python, label encoding is often performed using the LabelEncoder class from the sklearn.preprocessing module.

Installing scikit-learn

If you haven’t installed scikit-learn yet, you can do so using pip:

pip install scikit-learn
    

Basic Example of Label Encoding

Let’s start with a simple example where we have a list of categorical values (e.g., different types of fruit) and we want to encode them into numerical labels.

Example: Label Encoding

from sklearn.preprocessing import LabelEncoder

# Sample data
fruits = ['apple', 'orange', 'banana', 'apple', 'orange', 'banana']

# Create a LabelEncoder object
label_encoder = LabelEncoder()

# Fit the label encoder and transform the labels into numerical form
encoded_labels = label_encoder.fit_transform(fruits)

print("Original labels:", fruits)
print("Encoded labels:", encoded_labels)
    

Output:

Original labels: ['apple', 'orange', 'banana', 'apple', 'orange', 'banana']
Encoded labels: [0 2 1 0 2 1]
    

In this example:

  • The LabelEncoder object is used to fit and transform the list of categorical values.
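  • Each distinct category (‘apple’, ‘banana’, ‘orange’) is assigned an integer from 0 to 2. LabelEncoder sorts the categories first, so the codes follow alphabetical order rather than order of appearance.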
  • The output shows that ‘apple’ is encoded as 0, ‘banana’ as 1, and ‘orange’ as 2.
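
To check which integer each category received, the fitted encoder’s classes_ attribute can be inspected. The following is a minimal sketch; the mapping variable is only an illustrative name, not part of the library.

```python
from sklearn.preprocessing import LabelEncoder

fruits = ['apple', 'orange', 'banana', 'apple', 'orange', 'banana']

label_encoder = LabelEncoder()
label_encoder.fit(fruits)

# classes_ holds the sorted unique labels; each label's index is its code
mapping = {str(label): code for code, label in enumerate(label_encoder.classes_)}

print(mapping)
```

Building the mapping once and printing it is a quick way to confirm the codes before passing them to a model.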

Inverse Transformation

After encoding, you can also decode the numerical labels back to their original categorical values using the inverse_transform() method.

Example: Inverse Transformation

# Decode the numerical labels back to the original labels
decoded_labels = label_encoder.inverse_transform(encoded_labels)

print("Decoded labels:", decoded_labels)
    

Output:

Decoded labels: ['apple' 'orange' 'banana' 'apple' 'orange' 'banana']
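
Encoding and decoding only work for labels the encoder saw during fitting. The sketch below, which uses ‘kiwi’ as a made-up unseen label, shows a round trip and the ValueError raised when transform() receives an unknown category.

```python
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
encoded = label_encoder.fit_transform(['apple', 'orange', 'banana'])

# Round trip: decoding the codes recovers the original labels
round_trip = label_encoder.inverse_transform(encoded).tolist()

# A label that was not present during fit cannot be encoded
try:
    label_encoder.transform(['kiwi'])
    unseen_rejected = False
except ValueError:
    unseen_rejected = True

print(round_trip, unseen_rejected)
```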
    

Working with DataFrames

Label encoding is commonly used with pandas DataFrames, where one or more categorical columns in a dataset need to be encoded.

Example: Label Encoding with a DataFrame

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Sample DataFrame
data = {'Fruit': ['apple', 'orange', 'banana', 'apple', 'orange', 'banana'],
        'Color': ['red', 'orange', 'yellow', 'green', 'orange', 'yellow']}

df = pd.DataFrame(data)

# Use one LabelEncoder per column so each fitted mapping is kept for decoding
fruit_encoder = LabelEncoder()
color_encoder = LabelEncoder()

# Apply label encoding to the 'Fruit' column
df['Fruit_Encoded'] = fruit_encoder.fit_transform(df['Fruit'])

# Apply label encoding to the 'Color' column
df['Color_Encoded'] = color_encoder.fit_transform(df['Color'])

print(df)
    

Output:

    Fruit   Color  Fruit_Encoded  Color_Encoded
0   apple     red              0              2
1  orange  orange              2              1
2  banana  yellow              1              3
3   apple   green              0              0
4  orange  orange              2              1
5  banana  yellow              1              3
    

In this example:

  • The Fruit and Color columns in the DataFrame are label encoded into numerical values.
  • The encoded columns (Fruit_Encoded and Color_Encoded) represent the original categorical data in numerical form.
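
When several columns are encoded, it helps to keep one fitted encoder per column so that each column can be decoded later. The following is a sketch of one way to do that; the encoders dictionary is an illustrative name, not a scikit-learn feature.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    'Fruit': ['apple', 'orange', 'banana'],
    'Color': ['red', 'orange', 'yellow'],
})

# Keep a separate fitted encoder for every encoded column
encoders = {}
for column in ['Fruit', 'Color']:
    encoders[column] = LabelEncoder()
    df[column + '_Encoded'] = encoders[column].fit_transform(df[column])

# Each column can be decoded with its own encoder
restored_fruit = list(encoders['Fruit'].inverse_transform(df['Fruit_Encoded']))
restored_color = list(encoders['Color'].inverse_transform(df['Color_Encoded']))

print(restored_fruit)
print(restored_color)
```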

When to Use Label Encoding

Label encoding is appropriate when:

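  • You have categorical data with an implicit order (e.g., “low”, “medium”, “high”) and the integer codes preserve that order. LabelEncoder assigns codes in sorted (alphabetical) order, so check the resulting mapping or define it explicitly.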
  • You want to convert categorical data into a format that machine learning models can use.
  • Your categorical data has a relatively small number of distinct values.
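
One caveat for ordered categories: LabelEncoder assigns codes alphabetically, which may not match the intended order. The sketch below contrasts that with scikit-learn’s OrdinalEncoder, which accepts an explicit category order; using OrdinalEncoder here is a suggested alternative rather than part of the original example.

```python
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

levels = ['low', 'high', 'medium', 'low']

# LabelEncoder sorts alphabetically: high=0, low=1, medium=2
alphabetical = LabelEncoder().fit_transform(levels)

# OrdinalEncoder takes the intended order: low=0, medium=1, high=2
ordinal_encoder = OrdinalEncoder(categories=[['low', 'medium', 'high']])
ordered = ordinal_encoder.fit_transform([[level] for level in levels]).ravel()

print(alphabetical)
print(ordered)
```

OrdinalEncoder expects 2D input and returns floats by default, which is why the values are wrapped in single-element lists and flattened afterwards.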

When Not to Use Label Encoding

Label encoding is not suitable when:

  • The categorical values do not have an ordinal relationship. Integer codes can lead a model to infer an order or distance that does not exist, for example that ‘orange’ (2) is greater than ‘apple’ (0).
  • The categorical data has a large number of distinct values and the model treats the codes as magnitudes (as linear and distance-based models do), because the wide range of arbitrary integers then carries no real meaning.
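
For non-ordinal categories, one common alternative is one-hot encoding, which creates a separate 0/1 column per category instead of a single integer column. The following is a minimal sketch using pandas.get_dummies; the choice of one-hot encoding here is an illustration, not a recommendation from the original text.

```python
import pandas as pd

df = pd.DataFrame({'Color': ['red', 'green', 'yellow', 'green']})

# One indicator column per category, so no order is implied between colors
one_hot = pd.get_dummies(df['Color'], prefix='Color', dtype=int)

print(one_hot)
```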

Conclusion

Label encoding is a simple and effective way to convert categorical data into numerical form in Python. By using the LabelEncoder class from the sklearn.preprocessing module, you can easily encode categorical variables for use in machine learning models. However, it’s important to consider whether label encoding is appropriate for your specific use case, especially when dealing with non-ordinal categorical data.