When working with structured data in machine learning, TensorFlow's feature columns provide an indispensable tool for preprocessing and transforming your dataset into a format that a neural network can process. Feature columns encapsulate these transformations, making them easy to replicate consistently across training, evaluation, and inference.
What are Feature Columns?
Feature columns act as an intermediary between raw input data and the model. They bridge the gap by defining how the data should be transformed or grouped. This concept is essential when dealing with different data types, such as numeric, categorical, or mixed-data inputs.
Understanding the Role of Feature Columns
Here are a few types of feature columns that you might use:
- NumericColumn: Represents continuous, real-valued features; an optional normalizer_fn can scale or transform the values before they reach the model.
- CategoricalColumn: Processes categorical data by mapping string or integer labels into a suitable format. For large sets of categories, you can use vocabulary- or hashing-based variants such as categorical_column_with_vocabulary_list or categorical_column_with_hash_bucket.
- IndicatorColumn: Wraps a categorical column and converts it into a one-hot (or multi-hot) vector representation.
- BucketizedColumn: Takes numeric data and splits it into different bucket ranges.
- CrossedColumn: Useful for feature crosses, which are synthetic features representing interactions between different features (see the sketch after this list).
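The walkthrough below covers numeric and indicator columns, so here is a minimal sketch of the bucketized and crossed varieties. The bucket boundaries, vocabulary, and hash_bucket_size are illustrative assumptions, not values from a real dataset:
import tensorflow as tf
# Bucketize a continuous feature into ranges: <30, 30-49, >=50
age = tf.feature_column.numeric_column('age')
age_buckets = tf.feature_column.bucketized_column(age, boundaries=[30, 50])
# Cross the age buckets with a categorical feature to capture interactions
occupation = tf.feature_column.categorical_column_with_vocabulary_list(
    'occupation', ['doctor', 'artist', 'engineer', 'other'])
age_x_occupation = tf.feature_column.crossed_column(
    [age_buckets, occupation], hash_bucket_size=100)
# Crossed columns are categorical, so wrap them in an indicator column
# before feeding them to a dense neural network
crossed_feature = tf.feature_column.indicator_column(age_x_occupation)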
Implementing Feature Columns in TensorFlow
To demonstrate how to use TensorFlow feature columns, let's consider a sample problem involving a dataset with age, income, and occupation as input features. Our task is to predict whether someone earns above 50k USD annually.
import tensorflow as tf
# Suppose we have the following feature dictionary for our dataset
feature_dict = {
    'age': tf.feature_column.numeric_column('age'),
    'income': tf.feature_column.numeric_column('income'),
    'occupation': tf.feature_column.categorical_column_with_vocabulary_list(
        'occupation', ['doctor', 'artist', 'engineer', 'other'])
}
# Convert the categorical 'occupation' feature to an indicator column
occupation_one_hot = tf.feature_column.indicator_column(feature_dict['occupation'])
# Define the list of features to be used by the neural network
feature_columns = [
    feature_dict['age'],
    feature_dict['income'],
    occupation_one_hot
]
In this snippet, we create numeric columns for 'age' and 'income', and transform the 'occupation' category into an indicator column. These feature columns can now be fed into an input layer.
Building the Input Layer
The input layer is a preparatory layer that specifies how the raw features enter the model:
# Input layer based on the defined feature columns
feature_layer = tf.keras.layers.DenseFeatures(feature_columns)
# Example model utilizing the feature layer
model = tf.keras.Sequential([
    feature_layer,
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
# Compile and set optimizer, loss function, etc.
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
Here, the sequential model starts with the dense feature layer, followed by a hidden dense layer and an output layer with sigmoid activation, which is well suited to binary classification problems.
Training the Model using Pandas Data
Imagine we have our data in a pandas DataFrame; we can seamlessly integrate it with TensorFlow’s feature columns approach:
import pandas as pd
# Sample data in a pandas DataFrame
data = pd.DataFrame({
    'age': [25, 32, 47],
    'income': [60000, 80000, 120000],
    'occupation': ['doctor', 'artist', 'engineer'],
    'label': [1, 0, 1]
})
# Split features and label
target = data.pop('label')
# Build a tf.data.Dataset from the DataFrame
dataset = tf.data.Dataset.from_tensor_slices((dict(data), target))
dataset = dataset.batch(2)
model.fit(dataset, epochs=10) # Train the model with our dataset
Here, tf.data.Dataset.from_tensor_slices() transforms the DataFrame into a stream of (features, label) pairs in a format TensorFlow accepts directly.
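Because the preprocessing lives in the feature columns, the trained model can score new raw rows without a separate transformation step. Here is a minimal sketch of inference; the sample values are illustrative:
# New, unseen rows; the values here are made up for illustration
new_data = {
    'age': [29],
    'income': [70000],
    'occupation': ['artist']
}
pred_ds = tf.data.Dataset.from_tensor_slices(new_data).batch(1)
# Each prediction is the model's probability of earning above 50k
predictions = model.predict(pred_ds)
print(predictions)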
Conclusion
By leveraging feature columns, TensorFlow streamlines the handling of structured data, ensuring a predictable, replicable path from raw input processing to the eventual feeding of data into a machine learning model. Understanding and utilizing these feature transformations is key to exploiting the full potential of TensorFlow in structured data tasks.