

# Using the SageMaker Data Agent for data analytics tasks


## Data exploration and discovery


The agent helps you explore available data sources by querying AWS Glue Data Catalog metadata through MCP connections. When you request data exploration, the agent retrieves table information, schemas, and relationships from your catalog to understand what data is available.

**Example prompt for data analyst**  
*"Show me what customer tables are available and their schemas"*  
The agent queries your AWS Glue Data Catalog and generates code to display table metadata.

**Example prompt**  
*"Calculate monthly revenue by product category from the sales\$1transactions table"*  
The agent accesses your AWS Glue Data Catalog to verify the sales\$1transactions table structure and generates:  

```
SELECT 
    DATE_TRUNC('month', transaction_date) as month,
    product_category,
    SUM(transaction_amount) as total_revenue,
    COUNT(DISTINCT customer_id) as unique_customers,
    AVG(transaction_amount) as avg_transaction_value
FROM sales_db.sales_transactions
WHERE transaction_date >= DATE_ADD(CURRENT_DATE(), -180)
GROUP BY DATE_TRUNC('month', transaction_date), product_category
ORDER BY month DESC, total_revenue DESC
```

## Building multi-step analytical workflows


For complex analysis, the agent proposes step-by-step plans that break down your objective into discrete operations, requesting approval before executing each step.

**Example prompt for data scientist**  
*"Analyze the churn table and visualize retention rate by region"*  
The agent proposes a multi-step plan:  
Now I'll create a plan to analyze the churn table and visualize retention rate by region. Since the table has a 'state' column representing regions and a 'churn' boolean column, I can calculate the retention rate (percentage of customers who didn't churn) for each state.  

1. **Step 1: Load Churn Data from Catalog** - Use PySpark to read the churn table from the AWS Glue catalog (sagemaker\$1sample\$1db.churn) and preview the data structure.

1. **Step 2: Calculate Retention Rate by Region** - Aggregate the data by state to calculate the retention rate (percentage of customers where churn = False) for each state/region.

1. **Step 3: Visualize Retention Rates** - Create a bar chart showing the retention rate by region/state, sorted for better readability.
Each step generates code in a new cell, maintaining context from previous cells and building upon earlier results.

## Cell level inline code modification


The agent provides inline assistance to modify existing code based on your requirements. You can use the agent icon on every cell to trigger a prompt input box and type your request instead of code.

**Example workflow**  
Initial cell content:  

```
# Load sales data
sales_df = spark.table("sales_db.transactions")
sales_df.show(10)
```
Inline prompt: *"Add filtering for electronics category and last quarter only"*  
Updated cell content:  

```
# Load sales data with filters
from datetime import datetime, timedelta

quarter_start = datetime.now() - timedelta(days=90)
sales_df = spark.table("sales_db.transactions") \
    .filter(col("category") == "electronics") \
    .filter(col("transaction_date") >= quarter_start)
sales_df.show(10)
```

## Referencing previous cell results


The agent understands your notebook context and can generate code that references data frames and variables from previous cells.

**Example workflow**  
Cell 1:  

```
# Customer purchase analysis
customer_purchases = spark.sql("""
    SELECT customer_id, product_category, 
           SUM(amount) as total_spent
    FROM sales_db.transactions
    GROUP BY customer_id, product_category
""")
customer_purchases.createOrReplaceTempView("customer_summary")
```
Agent prompt: *"Create a visualization showing the distribution of customer spending from the results above"*  
Cell 2 (generated by agent):  

```
# Visualize customer spending distribution
import matplotlib.pyplot as plt
import pandas as pd

# Convert to pandas for visualization
spending_data = customer_purchases.toPandas()

plt.figure(figsize=(12, 6))
plt.hist(spending_data['total_spent'], bins=50, edgecolor='black')
plt.xlabel('Total Spending')
plt.ylabel('Number of Customers')
plt.title('Customer Spending Distribution')
plt.grid(True, alpha=0.3)
plt.show()

# Show summary statistics
print(spending_data['total_spent'].describe())
```

## Error handling and fix with the SageMaker Data Agent


When code execution fails, the SageMaker Data Agent can analyze error messages and suggest corrections. Use the fix with agent capability to automatically generate corrected code.