What is a feature store and why do we use it?
Why do we need a feature store in our model pipeline at all? Why can’t we just keep everything in object storage and load it at runtime?
Example problem
Let’s say that we’re doing a simple classification task on the classic Titanic dataset from Kaggle. Part of that task involves generating features for the model. This is pretty simple for a one-off training run, but once you start to productionize the model, a feature store is one of the first tools you’ll find yourself wanting.
What is a feature store?
A feature store is a centralized data system that manages, stores, and serves the input features used in machine learning models, ensuring consistency and reusability across both training and production environments.
By separating feature engineering and management from the rest of the ML pipeline, it allows data scientists and engineers to easily access validated, up-to-date features without duplicating work or risking inconsistencies.
Some examples of common feature store solutions include Google Vertex AI Feature Store, Feast, AWS SageMaker Feature Store, and Databricks Feature Store. These tools provide built-in capabilities for feature versioning, sharing, and serving, helping organizations quickly operationalize ML projects and maintain robust, scalable pipelines.
What would our pipeline look like without a feature store?
Let’s say you didn’t want to use a feature store but wanted to start doing ML engineering in the cloud anyway. What would that look like? One possible way is something like this:
Files stored in Google Cloud Storage (GCS) are the primary interface between steps, and the logic for processing data is often duplicated between the training phase and the inference phase.
[Naive approach] System Components
- Storage: Google Cloud Storage (GCS) buckets (stores CSV/Parquet files).
- Compute: Vertex AI Training Jobs (or local Python scripts).
- Model Registry: Vertex AI Model Registry (stores the trained .bst or .pkl file).
- Serving Endpoint: Vertex AI Endpoint (hosts the model).
- Client/UI: Streamlit App.
[Naive approach] The Workflow (Pipeline Design)
Data Ingestion & Processing
You run a script (preprocess.py) that downloads raw data, cleans it (fills missing ages, encodes titles), and saves a file processed_train.csv to GCS. Artifact: A static file in a bucket.
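As a rough illustration, the cleaning step in preprocess.py might look like the sketch below. The column names follow the Kaggle Titanic CSV. The `TITLE_CODES` mapping and the `preprocess` function are assumptions made for this example (only “Mr” mapping to 1 comes from the text), and the rows are toy data, so treat it as a sketch rather than the actual script.

```python
import pandas as pd

# Hypothetical title encoding; only "Mr" -> 1 is taken from the article.
TITLE_CODES = {"Mr": 1, "Mrs": 2, "Miss": 3, "Master": 4}


def preprocess(raw: pd.DataFrame) -> pd.DataFrame:
    """Fill missing ages and encode titles as integers."""
    out = raw.copy()
    # Fill missing ages with the median of the ages in this dataset.
    out["Age"] = out["Age"].fillna(out["Age"].median())
    # Names look like "Braund, Mr. Owen Harris": the title sits between ", " and ".".
    title = out["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
    # Titles missing from the mapping fall back to 0.
    out["Title"] = title.map(TITLE_CODES).fillna(0).astype(int)
    return out


# A few toy rows to exercise the function.
raw = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina", "Smith, Dr. John"],
    "Age": [22.0, None, 38.0],
})
processed = preprocess(raw)
print(processed[["Age", "Title"]])
# In the pipeline described here, the result would then be saved, e.g.:
# processed.to_csv("processed_train.csv", index=False)
```

Note that the median is recomputed from whatever data is passed in, so the serving side would need the same fill value saved and reused. That is one more piece of state to keep in sync.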
Model Training
- The training script (train.py) downloads processed_train.csv from GCS.
- It trains the XGBoost model and saves the model artifact to the Model Registry.
Inference (Serving) - The “Danger Zone”
- A user on the Streamlit app enters raw data (e.g., “Age: 22, Title: Mr”).
- CRITICAL: The Streamlit app (or the serving container) must contain a copy of the logic from preprocess.py to convert “Mr” to the exact integer 1 that the model expects, and fill missing values exactly as the training set did.
- The app sends this processed vector to the Endpoint for a prediction.
Where does this naive approach go wrong?
Training-Serving Skew (The #1 Silent Killer)
You calculate Family_Size = SibSp + Parch + 1 in your training script. When you build your Streamlit UI, you have to re-write that same logic in Python inside the app. If you update one and forget the other, your model fails silently because the inputs don’t mean the same thing.
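Here is a toy sketch of how this drift happens. Both functions are hypothetical stand-ins for the training script and the Streamlit app. Nothing raises an error; the two code paths simply disagree.

```python
def family_size_training(sibsp: int, parch: int) -> int:
    """Definition used by the training script."""
    return sibsp + parch + 1


def family_size_serving(sibsp: int, parch: int) -> int:
    """Hand-copied definition in the app; the '+ 1' was lost along the way."""
    return sibsp + parch


def find_skew(rows):
    """Return the (sibsp, parch) pairs where the two code paths disagree."""
    return [
        (sibsp, parch)
        for sibsp, parch in rows
        if family_size_training(sibsp, parch) != family_size_serving(sibsp, parch)
    ]


passengers = [(1, 0), (0, 0), (3, 2)]
# Every row disagrees, yet no exception is raised anywhere.
print(find_skew(passengers))
```

A check like `find_skew` only helps if someone remembers to run it, which is why a single shared definition is the more robust fix.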
Point-in-Time Correctness (Time Travel)
Your CSV in the bucket just has the latest data. If you want to retrain a model to see how it would have performed last month, you can’t easily “rewind” the data in a CSV to what it looked like on Nov 1st.
Online Serving Latency
To serve predictions, your app would need to load the CSV or query a slow database to find passenger traits.
What benefits would another approach ideally have?
An alternative approach would ideally resolve the problems of the naive approach:
Solving Problem 1: Training-Serving Skew (The #1 Silent Killer)
Ideally, you compute the feature once, store it, and then both the training job and the Streamlit app ask the Feature Store for “Passenger 123’s Family_Size”. That way, they’re guaranteed to get the exact same definition.
Solving Problem 2: Point-in-Time Correctness (Time Travel)
Ideally, the store keeps a history of values, so you can ask: “Give me the training data as it existed on Nov 1st.” This prevents data leakage (training on future information).
Solving Problem 3: Online Serving Latency
Ideally, it has an Online Store (low-latency, like Redis/Bigtable) specifically for serving. When your Streamlit app asks for features, it gets them in milliseconds.
How does a feature store work?
At a high level, a feature store sits between raw data sources and model consumers. Teams register feature definitions in a central registry (names, owners, data types, freshness SLAs, and transformation code).
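To make the registry idea concrete, here is a minimal in-memory sketch. The `FeatureDefinition` and `FeatureRegistry` classes, their fields, and the owner name are assumptions for illustration; real feature stores each have their own definition format.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Any, Callable, Dict


@dataclass(frozen=True)
class FeatureDefinition:
    """Metadata plus transformation code for one feature."""
    name: str
    owner: str
    dtype: type
    freshness_sla: timedelta
    transform: Callable[[Dict[str, Any]], Any]


class FeatureRegistry:
    """Central catalog: one definition per feature name."""

    def __init__(self) -> None:
        self._features: Dict[str, FeatureDefinition] = {}

    def register(self, feature: FeatureDefinition) -> None:
        # Refuse duplicates so a name always maps to exactly one definition.
        if feature.name in self._features:
            raise ValueError(f"feature {feature.name!r} is already registered")
        self._features[feature.name] = feature

    def compute(self, name: str, raw: Dict[str, Any]) -> Any:
        # Apply the registered transformation and coerce to the declared type.
        feature = self._features[name]
        return feature.dtype(feature.transform(raw))


registry = FeatureRegistry()
registry.register(FeatureDefinition(
    name="family_size",
    owner="ml-platform",  # hypothetical team name
    dtype=int,
    freshness_sla=timedelta(days=1),
    transform=lambda raw: raw["SibSp"] + raw["Parch"] + 1,
))
print(registry.compute("family_size", {"SibSp": 1, "Parch": 2}))  # 4
```

Because every consumer goes through `compute`, the training job and the serving app cannot end up with two different formulas for the same feature name.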
How do we add data to a feature store?
Ingestion jobs (batch and/or streaming) compute features from raw events and operational tables. The platform writes immutable, time-stamped feature values to an offline store for training. It also maintains lineage and event times to guarantee point-in-time correctness (“data was correct up to this date”) and so that any training set can be traced back to the source data and transformation code used to generate it.
The feature store also manages copying the latest values from the offline store to an online store with very low-latency lookups (e.g., Redis).
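The offline-to-online copy can be pictured as a “keep only the newest row per entity” pass. In the sketch below a plain dict stands in for the online store, and `materialize_latest` and the row layout are assumptions for illustration:

```python
from datetime import datetime
from typing import Any, Dict, Iterable


def materialize_latest(offline_rows: Iterable[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Build the online view: for each entity, keep the newest feature values."""
    online: Dict[str, Dict[str, Any]] = {}
    newest: Dict[str, datetime] = {}
    for row in offline_rows:
        key = f"passenger:{row['entity_id']}:latest"
        ts = row["feature_timestamp"]
        # Only overwrite when this row is newer than what we already hold,
        # so out-of-order rows cannot roll the online value back.
        if key not in newest or ts > newest[key]:
            newest[key] = ts
            online[key] = {
                k: v for k, v in row.items()
                if k not in ("entity_id", "feature_timestamp")
            }
    return online


offline_rows = [
    {"entity_id": "892", "feature_timestamp": datetime(2024, 11, 21), "fare": 7.25},
    {"entity_id": "892", "feature_timestamp": datetime(2024, 11, 1), "fare": 7.00},
    {"entity_id": "893", "feature_timestamp": datetime(2024, 11, 5), "fare": 8.05},
]
online_view = materialize_latest(offline_rows)
print(online_view["passenger:892:latest"])
```

The offline store keeps all three rows; the online view holds one value set per passenger.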
How do we serve the data?
Offline serving
When models are trained, they can request “as-of” datasets (e.g., “all the data up to 2025-10-05”), and the feature store assembles that dataset from the offline store by reconstructing the state of the data as of that point.
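An “as-of” request boils down to this: ignore everything after the cutoff, then take the newest remaining row per entity. Here is a small pandas sketch of that idea; the `as_of_snapshot` helper, the column names, and the toy values are assumptions for illustration:

```python
import pandas as pd


def as_of_snapshot(history: pd.DataFrame, as_of: str) -> pd.DataFrame:
    """Return one row per entity: the newest values known at or before `as_of`."""
    cutoff = pd.Timestamp(as_of)
    # Drop anything that was not yet known at the cutoff (prevents leakage).
    visible = history[history["feature_timestamp"] <= cutoff]
    return (
        visible.sort_values(["entity_id", "feature_timestamp"])
        .drop_duplicates("entity_id", keep="last")  # newest row per entity
        .reset_index(drop=True)
    )


history = pd.DataFrame({
    "entity_id": ["1", "1", "2"],
    "feature_timestamp": pd.to_datetime(["2025-09-01", "2025-10-10", "2025-10-01"]),
    "fare": [7.25, 8.00, 71.28],
})
snapshot = as_of_snapshot(history, "2025-10-05")
print(snapshot)
```

Passenger 1's update from 2025-10-10 is excluded from the 2025-10-05 snapshot, so a model trained on it never sees values from its own future.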
Online serving
For inference, we need data immediately. For this, the feature store serves the data from the online store.
Benefits of a feature store
The primary benefit is consistency: compute a feature once, use it everywhere. This removes the main cause of training–serving skew, accelerates iteration by enabling reuse across teams, and makes features discoverable through a catalog with ownership and documentation. Reproducibility improves because every value is versioned and time-stamped, so you can rebuild yesterday’s training set exactly and audit how a model was produced.
Operationally, a feature store reduces cost and complexity by centralizing pipelines, handling backfills and schema evolution, and separating batch analytics from low‑latency serving through dedicated offline/online stores.
Performance also improves via pre-materialization and caching, while governance improves through RBAC, data contracts, and PII policies applied consistently across training and serving. The result is faster model delivery with fewer incidents and clearer accountability.
What exactly is a feature store under the hood? How would you create one yourself?
A feature store is essentially two specialized databases working together, plus some orchestration logic. It’s not magic; it’s just databases optimized for specific ML use cases.
The Two-Database Architecture
1. Offline Store (Training Database)
- Technology: Usually a data warehouse like BigQuery, Snowflake, or Hive
- Purpose: Stores historical feature data for training
- Optimized for: High-throughput batch reads, time-travel queries
- Storage format: Columnar (Parquet, ORC) for analytical queries
2. Online Store (Serving Database)
- Technology: Key-value stores like Redis, DynamoDB, Cassandra, or Bigtable
- Purpose: Serves features for real-time inference
- Optimized for: Low-latency lookups by entity ID (typically single-digit milliseconds)
- Storage format: Row-based for fast single-record retrieval
The “Secret Sauce” Components
Beyond just databases, feature stores add:
- Sync Logic: Automatically copies data from offline to online databases
- Time-Travel Engine: Reconstructs “what the data looked like on date X”
- Schema Registry: Tracks feature definitions and versions
- Transformation Engine: Applies consistent feature engineering logic
Creating a feature store from scratch
A simple implementation of a feature store would look something like this:
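# Simplified Feature Store Architecture
import redis
from sqlalchemy import create_engine

class DIYFeatureStore:
    def __init__(self, postgres_url, redis_host="localhost"):
        # Offline Store: PostgreSQL with time-series tables
        # (postgres_url looks like "postgresql+psycopg2://user:pass@host/dbname")
        self.offline_db = create_engine(postgres_url)
        # Online Store: Redis for fast lookups (decode_responses returns str, not bytes)
        self.online_cache = redis.Redis(host=redis_host, decode_responses=True)
        # Metadata Store: Track feature definitions (a plain dict in this sketch)
        self.registry = {}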
For the offline store, it would look something like this:
-- Historical features table (PostgreSQL)
CREATE TABLE passenger_features (
    entity_id VARCHAR(50),          -- PassengerId
    feature_timestamp TIMESTAMP,    -- When this data was valid
    ingestion_timestamp TIMESTAMP,  -- When we learned about it
    age DOUBLE PRECISION,
    fare DOUBLE PRECISION,
    family_size INTEGER,
    -- ... other features
    PRIMARY KEY (entity_id, feature_timestamp)
);

-- Index for time-travel queries
CREATE INDEX idx_time_travel ON passenger_features (feature_timestamp, entity_id);
For the online Redis store, it would look something like this:
# Redis key structure for fast lookups
# Key: "passenger:892:latest"
# Value: JSON with all features
{
    "age": 22.0,
    "fare": 7.25,
    "family_size": 2,
    "last_updated": "2024-11-21T10:30:00Z"
}
For the core feature store logic, it would look something like this, with functionalities for (1) ingesting features, (2) getting training data for offline queries (e.g., ML training), and (3) getting online features, for inference.
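import json
import pandas as pd

class DIYFeatureStore:
    # __init__ as shown above

    def ingest_features(self, df, entity_col, timestamp_col):
        """Ingest features into both stores"""
        # 1. Store in offline store (PostgreSQL); index=False keeps the
        #    DataFrame index from being written as an extra column
        df.to_sql('passenger_features', self.offline_db,
                  if_exists='append', index=False)
        # 2. Update online store (Redis) with the newest row per entity
        latest = df.sort_values(timestamp_col).drop_duplicates(entity_col, keep='last')
        for record in latest.to_dict('records'):
            entity_id = record.pop(entity_col)
            record.pop(timestamp_col)
            redis_key = f"passenger:{entity_id}:latest"
            # default=str covers values json cannot encode natively (e.g. timestamps)
            self.online_cache.set(redis_key, json.dumps(record, default=str))

    def get_training_data(self, entity_ids, as_of_time):
        """Time-travel query for training"""
        # DISTINCT ON keeps the newest row per entity at or before as_of_time
        query = """
            SELECT DISTINCT ON (entity_id) *
            FROM passenger_features
            WHERE entity_id = ANY(%s)
              AND feature_timestamp <= %s
            ORDER BY entity_id, feature_timestamp DESC
        """
        # entity_id is VARCHAR, so pass the IDs as strings; the tuple is
        # one set of positional parameters for the driver
        return pd.read_sql(query, self.offline_db,
                           params=([str(e) for e in entity_ids], as_of_time))

    def get_online_features(self, entity_ids, feature_names):
        """Fast lookup for serving"""
        results = []
        for entity_id in entity_ids:
            redis_key = f"passenger:{entity_id}:latest"
            # A missing key yields an empty dict, so every feature comes back as None
            features = json.loads(self.online_cache.get(redis_key) or '{}')
            results.append({k: features.get(k) for k in feature_names})
        return results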
Revised architecture after including a feature store
We can redesign our architecture to now use a feature store. This new design introduces a single central “source of truth” for all feature data. The feature store handles serving data to both the model trainer and the inference app, making sure that both are consistent.
[Revised approach] System Components
- Storage: Google Cloud Storage (GCS) buckets (stores CSV/Parquet files).
- Feature Store: Vertex AI Feature Store.
  - Offline Store: High-capacity storage (BigQuery-backed) for historical training data.
  - Online Store: Low-latency storage (Bigtable/Redis-backed) for real-time serving.
- Compute: Vertex AI Training Jobs (or local Python scripts).
- Model Registry: Vertex AI Model Registry (stores the trained .bst or .pkl file).
- Serving Endpoint: Vertex AI Endpoint (hosts the model).
- Client/UI: Streamlit App.
[Revised approach] The Workflow (Pipeline Design)
[Revised approach] Feature Engineering & Ingestion
- You run preprocess.py to calculate features (Age, Family_Size).
- Instead of just saving a CSV, you ingest these values into the Feature Store, tagged with a timestamp and an Entity ID (PassengerId).
- The Feature Store automatically syncs this data to both the Offline and Online storage layers.
[Revised approach] Model Training (Batch Fetch)
- The training script (train.py) does not read a CSV. It sends a query to the Offline Feature Store: “Give me the features for these 800 passengers as they looked on Jan 1st.”
- The Feature Store reconstructs the dataset for that point in time.
- The model is trained and saved.
[Revised approach] Inference (Online Lookup)
- Scenario A (Known Passenger): The Streamlit app only needs the PassengerId. It asks the Online Feature Store: “Give me the latest features for Passenger 892.” It gets the pre-computed, trusted values in milliseconds and sends them to the model.
- Scenario B (New/Hypothetical Passenger): You can still manually pass values, but typically in mature systems, you would ingest the new data into the Feature Store first, or use the Feature Store’s transformation layer (if available) to ensure the logic is identical.
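For Scenario B, one way to keep the logic identical is to put the feature code in a single function that both the ingestion job and the request path import. This is a minimal sketch: `compute_features`, `ingest_batch`, and `features_for_request` are hypothetical names, and it assumes Age is already filled in.

```python
from typing import Any, Dict, List


def compute_features(raw: Dict[str, Any]) -> Dict[str, Any]:
    """The single definition of each feature, shared by every code path."""
    return {
        "age": float(raw["Age"]),
        "family_size": int(raw["SibSp"]) + int(raw["Parch"]) + 1,
        "has_cabin": bool(raw.get("Cabin")),
    }


def ingest_batch(raw_rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Batch path: rows that would be written to the feature store."""
    return [
        {"entity_id": str(row["PassengerId"]), **compute_features(row)}
        for row in raw_rows
    ]


def features_for_request(form_input: Dict[str, Any]) -> Dict[str, Any]:
    """Online path for a new or hypothetical passenger typed into the UI."""
    return compute_features(form_input)


raw = {"PassengerId": 892, "Age": 22, "SibSp": 1, "Parch": 0, "Cabin": None}
batch_row = ingest_batch([raw])[0]
request_row = features_for_request(raw)
print(batch_row)
print(request_row)
```

The two paths differ only in where the raw input comes from, so a change to `compute_features` reaches training and serving at the same time.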
What are the benefits of this revised approach?
Zero Skew
The exact value used to train the model (e.g., Family_Size=4) is the exact value returned by the Online Store for serving.
Latency
The Online Store is optimized for millisecond retrieval, unlike reading from GCS or querying a standard SQL database.
Time Travel
You can re-train the model on the state of data from 6 months ago without needing to keep old CSV files around; the Offline Store manages the timeline for you.
What would this look like in GCP?
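To set up our feature store in GCP, we would do something like this, using the Featurestore classes in the Vertex AI SDK for Python (the API that Google’s documentation now labels Vertex AI Feature Store (Legacy)):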
Setting up and creating the feature store
from google.cloud import aiplatform

# Initialize SDK
aiplatform.init(project="titanic-ml-project", location="us-central1")

# 1. Create the Feature Store
# This is the container for all your data
titanic_fs = aiplatform.Featurestore.create(
    featurestore_id="titanic_featurestore",
    online_store_fixed_node_count=1,  # Low cost for demo
    project="titanic-ml-project",
    location="us-central1",
    sync=True  # Wait for creation (can take ~10 mins)
)

# 2. Create an Entity Type
# "Passenger" is the entity we are tracking
passenger_entity = titanic_fs.create_entity_type(
    entity_type_id="passenger",
    description="Titanic passenger features",
    sync=True
)

# 3. Register Features
# Define the schema of what you are storing
passenger_entity.batch_create_features(
    feature_configs={
        "age": {"value_type": "DOUBLE"},
        "fare": {"value_type": "DOUBLE"},
        "family_size": {"value_type": "INT64"},
        "pclass": {"value_type": "INT64"},
        "has_cabin": {"value_type": "BOOL"},
    },
    sync=True
)
Ingestion (Moving data from GCS to Feature Store)
You need a .csv in GCS with columns matching your feature IDs, plus an entity ID column (PassengerId here) and a timestamp column (timestamp here) recording when each value was known.
# 4. Ingest Data from GCS
# Your CSV must have a column for the entity ID and a timestamp column
passenger_entity.ingest_from_gcs(
    feature_ids=["age", "fare", "family_size", "pclass", "has_cabin"],
    feature_time="timestamp",  # Name of the timestamp column in your CSV
    entity_id_field="PassengerId",  # Name of the ID column in your CSV
    gcs_source_uris=["gs://titanic-ml-data-YOUR_PROJECT_ID/data/processed/features.csv"],
    gcs_source_type="csv",  # Required: format of the source files ("csv" or "avro")
    sync=True
)
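The CSV consumed by the ingestion call above could be prepared along these lines. This is a sketch under assumptions: `build_ingestion_frame` is a hypothetical helper, the input columns follow the Kaggle Titanic CSV, the rows are toy data, and the timestamp is written as an RFC 3339 string, which is the format I would expect the service to want (worth confirming against the ingestion docs).

```python
import pandas as pd


def build_ingestion_frame(raw: pd.DataFrame, known_at: str) -> pd.DataFrame:
    """Shape raw Titanic rows into the columns the ingestion call expects."""
    return pd.DataFrame({
        "PassengerId": raw["PassengerId"].astype(str),  # entity ID column
        "timestamp": known_at,                          # when the values were known
        "age": raw["Age"].astype(float),
        "fare": raw["Fare"].astype(float),
        "family_size": (raw["SibSp"] + raw["Parch"] + 1).astype(int),
        "pclass": raw["Pclass"].astype(int),
        "has_cabin": raw["Cabin"].notna(),
    })


# Toy input rows for illustration.
raw = pd.DataFrame({
    "PassengerId": [892, 893],
    "Age": [34.5, 47.0],
    "Fare": [7.83, 7.00],
    "SibSp": [0, 1],
    "Parch": [0, 0],
    "Pclass": [3, 3],
    "Cabin": [None, "C85"],
})
features = build_ingestion_frame(raw, known_at="2024-11-21T10:30:00Z")
print(features)
# Then write it out and upload it to the bucket used in the ingestion call, e.g.:
# features.to_csv("features.csv", index=False)
```

The feature column names are lowercase on purpose so they match the feature IDs registered earlier.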
Serving
For serving, at runtime (e.g., in a Streamlit UI), we’d do something like this:
# 5. Online Serving (Low Latency)
# Retrieve features for a specific passenger (e.g., ID 892)
feature_values = passenger_entity.read(
    entity_ids=["892"],
    feature_ids=["age", "fare", "family_size", "pclass", "has_cabin"]
)
print(feature_values)
# Output: Pandas DataFrame with the latest values for Passenger 892
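Before calling the endpoint, the looked-up row has to become the vector the model expects. The sketch below assumes the returned DataFrame has one row per entity, with an `entity_id` column plus one column per feature ID. That layout, the `FEATURE_ORDER` list, and the `to_model_input` helper are assumptions for illustration, so check them against what `read` actually returns in your SDK version.

```python
import pandas as pd

# Assumed to match the column order the model was trained with.
FEATURE_ORDER = ["age", "fare", "family_size", "pclass", "has_cabin"]


def to_model_input(feature_values: pd.DataFrame, entity_id: str) -> list:
    """Turn one looked-up row into an ordered list of floats."""
    rows = feature_values.loc[feature_values["entity_id"] == entity_id]
    if rows.empty:
        raise KeyError(f"no features returned for entity {entity_id!r}")
    record = rows.iloc[0]
    # Fail loudly on missing values instead of sending NaN to the model.
    missing = [name for name in FEATURE_ORDER if pd.isna(record[name])]
    if missing:
        raise ValueError(f"missing feature values: {missing}")
    return [float(record[name]) for name in FEATURE_ORDER]


# Hand-built stand-in for the DataFrame returned by the online lookup.
lookup = pd.DataFrame([{
    "entity_id": "892", "age": 34.5, "fare": 7.83,
    "family_size": 1, "pclass": 3, "has_cabin": False,
}])
vector = to_model_input(lookup, "892")
print(vector)
```

Fixing the order in one place avoids a subtle failure where the model receives the right values in the wrong positions.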
Summary
A feature store is not a single database. Instead, it’s a set of services that standardize how features are defined, computed, versioned, and served to both training and production. It may not be apparent at first why you need one, but production systems run into training–serving skew, reproducibility gaps, and serving latency, which are exactly the problems feature stores are designed to solve.