What You Will Build
By the end of this tutorial you will have three working integrations for NVIDIA Nemotron 3 Super on Cloudflare Workers AI:
- A Worker using the native `env.AI.run()` binding
- A REST API call using `fetch()` from any environment
- An OpenAI-compatible client using the `openai` SDK
Nemotron 3 Super runs at the edge with no GPU setup, no Docker, and no infrastructure to manage. Cloudflare handles the compute.
Prerequisites
- A Cloudflare account (free tier works)
- Node.js 18+ installed
- Wrangler CLI installed: `npm install -g wrangler`
- Basic familiarity with Cloudflare Workers
Why Nemotron 3 Super on Workers AI
Nemotron 3 Super has 120B total parameters but only 12B are active per token — which means inference is fast and cost-efficient despite the large parameter count. On Cloudflare Workers AI it is available at no additional cost on the free tier (within usage limits), making it one of the most capable free options available for edge inference.
Key specs relevant to this tutorial:
| Spec | Value |
|---|---|
| Total parameters | 120B |
| Active parameters per token | 12B |
| Context window | 1M tokens |
| SWE-Bench Verified | 60.47% |
| Workers AI model ID | @cf/nvidia/nemotron-3-super |
| Pricing on Workers AI | Free tier included |
Option 1 — Native Binding with env.AI.run()
This is the recommended approach if you are already using Cloudflare Workers.
Step 1: Create a new Worker
```sh
npm create cloudflare@latest nemotron-worker
cd nemotron-worker
```
Select Hello World Worker when prompted. TypeScript is recommended.
Step 2: Update wrangler.toml
```toml
name = "nemotron-worker"
main = "src/index.ts"
compatibility_date = "2026-01-01"

[ai]
binding = "AI"
```
The [ai] binding gives your Worker access to env.AI — no API keys required.
Step 3: Write the Worker
```typescript
interface Env {
  AI: Ai;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const { prompt } = await request.json() as { prompt: string };

    const response = await env.AI.run(
      "@cf/nvidia/nemotron-3-super",
      {
        messages: [
          {
            role: "system",
            content: "You are a helpful assistant specialized in developer tooling and AI.",
          },
          {
            role: "user",
            content: prompt,
          },
        ],
        max_tokens: 1024,
      }
    );

    return Response.json(response);
  },
};
```
Step 4: Run locally
```sh
npx wrangler dev
```
Test it:
```sh
curl -X POST http://localhost:8787 \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Explain what a Cloudflare Worker is in two sentences."}'
```
Expected response shape:
```json
{
  "response": "A Cloudflare Worker is a serverless function that runs at the edge..."
}
```
Step 5: Deploy
```sh
npx wrangler deploy
```
Your Worker is now live at https://nemotron-worker.<your-subdomain>.workers.dev.
Option 2 — REST API with fetch()
Use this if you want to call Nemotron 3 Super from a Next.js API route, a Node.js backend, or any non-Worker environment.
Get your API token
- Go to dash.cloudflare.com
- My Profile → API Tokens → Create Token
- Use the Workers AI template
- Copy the token — you will not see it again
Call the REST API
```typescript
const ACCOUNT_ID = "your_account_id"; // Found in the right sidebar of your Cloudflare dashboard
const API_TOKEN = process.env.CLOUDFLARE_API_TOKEN;

async function runNemotron(prompt: string): Promise<string> {
  const response = await fetch(
    `https://api.cloudflare.com/client/v4/accounts/${ACCOUNT_ID}/ai/run/@cf/nvidia/nemotron-3-super`,
    {
      method: "POST",
      headers: {
        "Authorization": `Bearer ${API_TOKEN}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        messages: [
          { role: "system", content: "You are a helpful assistant." },
          { role: "user", content: prompt },
        ],
        max_tokens: 1024,
      }),
    }
  );

  const data = await response.json();
  return data.result.response;
}

// Usage
const answer = await runNemotron("What is SWE-Bench Verified?");
console.log(answer);
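The Cloudflare REST API wraps results in its standard `{ success, errors, result }` envelope, so reading `data.result.response` directly will throw on a failed request (bad token, rate limit). A hedged validation sketch; `extractResponse` is a hypothetical helper, not part of any SDK:

```typescript
// Validate the standard Cloudflare API envelope before reading the reply.
interface CfEnvelope {
  success: boolean;
  errors: Array<{ message: string }>;
  result?: { response?: string };
}

function extractResponse(data: CfEnvelope): string {
  if (!data.success) {
    // Surface the API's own error messages instead of a TypeError
    const detail = data.errors.map((e) => e.message).join("; ");
    throw new Error(`Workers AI request failed: ${detail || "unknown error"}`);
  }
  if (typeof data.result?.response !== "string") {
    throw new Error("Workers AI response missing result.response");
  }
  return data.result.response;
}
```

Inside `runNemotron`, `return extractResponse(data)` then replaces the bare property access.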
Using it in a Next.js API route
```typescript
// app/api/chat/route.ts
import { NextRequest, NextResponse } from "next/server";

export async function POST(request: NextRequest) {
  const { prompt } = await request.json();

  const response = await fetch(
    `https://api.cloudflare.com/client/v4/accounts/${process.env.CF_ACCOUNT_ID}/ai/run/@cf/nvidia/nemotron-3-super`,
    {
      method: "POST",
      headers: {
        "Authorization": `Bearer ${process.env.CF_API_TOKEN}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        messages: [{ role: "user", content: prompt }],
        max_tokens: 1024,
      }),
    }
  );

  const data = await response.json();
  return NextResponse.json({ response: data.result.response });
}
```
Add to .env.local:
```sh
CF_ACCOUNT_ID=your_account_id
CF_API_TOKEN=your_api_token
```
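If the route grows, the upstream request assembly can be pulled into a pure helper that is easy to unit-test. `buildNemotronRequest` is a hypothetical helper, not from any SDK; it assumes the account ID and token from the environment variables above:

```typescript
// Assemble the upstream Workers AI request (URL + fetch init) without
// performing any I/O, so the shape can be tested in isolation.
const MODEL_ID = "@cf/nvidia/nemotron-3-super";

function buildNemotronRequest(
  accountId: string,
  apiToken: string,
  prompt: string,
  maxTokens = 1024
): { url: string; init: { method: string; headers: Record<string, string>; body: string } } {
  return {
    url: `https://api.cloudflare.com/client/v4/accounts/${accountId}/ai/run/${MODEL_ID}`,
    init: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiToken}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        messages: [{ role: "user", content: prompt }],
        max_tokens: maxTokens,
      }),
    },
  };
}
```

The route body then reduces to `const { url, init } = buildNemotronRequest(...); const response = await fetch(url, init);`.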
Option 3 — OpenAI-Compatible Endpoint
If you are already using the openai SDK, this is the fastest way to swap in Nemotron 3 Super with minimal code changes.
npm install openai
```typescript
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.CLOUDFLARE_API_TOKEN,
  baseURL: `https://api.cloudflare.com/client/v4/accounts/${process.env.CF_ACCOUNT_ID}/ai/v1`,
});

async function chat(userMessage: string): Promise<string> {
  const completion = await client.chat.completions.create({
    model: "@cf/nvidia/nemotron-3-super",
    messages: [
      { role: "system", content: "You are a helpful assistant." },
      { role: "user", content: userMessage },
    ],
    max_tokens: 1024,
  });

  return completion.choices[0].message.content ?? "";
}

// Usage
const response = await chat("Write a Python function that parses a CSV file.");
console.log(response);
```
This approach is useful when you want to switch between models (GPT-5.4, Claude, Nemotron) by changing only the model string and baseURL.
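That swap can be captured in one small function. A sketch only: the Cloudflare values match this tutorial, while the `"openai"` entry uses a placeholder base URL and the model name mentioned above purely for illustration:

```typescript
// Return the (baseURL, model) pair for a provider so callers construct
// the OpenAI client from one place. Non-Cloudflare values are placeholders.
type Provider = "cloudflare" | "openai";

function clientConfig(provider: Provider, accountId?: string): { baseURL: string; model: string } {
  switch (provider) {
    case "cloudflare":
      return {
        baseURL: `https://api.cloudflare.com/client/v4/accounts/${accountId}/ai/v1`,
        model: "@cf/nvidia/nemotron-3-super",
      };
    case "openai":
      return {
        baseURL: "https://api.openai.com/v1",
        model: "gpt-5.4", // placeholder; substitute whatever model you actually use
      };
  }
}
```

Usage: `const { baseURL, model } = clientConfig("cloudflare", accountId);` then pass `baseURL` to the `OpenAI` constructor and `model` to each completion call.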
Streaming Responses
For chat interfaces where you want to stream tokens as they arrive:
```typescript
const stream = await client.chat.completions.create({
  model: "@cf/nvidia/nemotron-3-super",
  messages: [{ role: "user", content: "Explain transformer attention in detail." }],
  max_tokens: 2048,
  stream: true,
});

for await (const chunk of stream) {
  const delta = chunk.choices[0]?.delta?.content;
  if (delta) process.stdout.write(delta);
}
```
Rate Limits and Pricing
| Plan | Neurons/day | Notes |
|---|---|---|
| Free | 10,000 | Shared across all AI models |
| Workers Paid ($5/mo) | 100,000 | Sufficient for most dev projects |
| Beyond included | $0.011 per 1,000 neurons | Check dashboard for current rates |
Nemotron 3 Super consumes more neurons per request than smaller models due to its parameter count. For production workloads with high volume, compare total cost against DeepInfra ($0.10/$0.50 per 1M tokens) where you pay per token rather than per neuron.
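A quick capacity check against the table above can be sketched in a few lines. Note that neurons consumed per request is not a published constant; measure it in your own dashboard and pass it in as a parameter:

```typescript
// Back-of-envelope throughput and overage cost against a plan's neuron allowance.
function requestsPerDay(dailyNeurons: number, neuronsPerRequest: number): number {
  return Math.floor(dailyNeurons / neuronsPerRequest);
}

function overageCostUsd(extraNeurons: number): number {
  // $0.011 per 1,000 neurons beyond the included allowance (see table above)
  return (extraNeurons / 1000) * 0.011;
}
```

For example, at a measured (illustrative) 40 neurons per request, `requestsPerDay(10_000, 40)` puts the free tier at 250 requests per day.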
FAQ
Does Nemotron 3 Super support tool calling on Workers AI?
Tool calling support on Workers AI depends on the model binding version. As of March 2026, check the Cloudflare Workers AI model catalog for the current feature list. The OpenAI-compatible endpoint is more likely to support function calling syntax.
Can I use the 1M token context window on Workers AI?
Workers AI imposes its own context limits independent of the model's native context window. Check the current limits in the Cloudflare documentation — large context requests may need to go through the direct API rather than the Workers binding.
How does this compare to running Nemotron locally with Ollama?
Running locally via Ollama gives you full context window access and no usage limits, but requires a GPU with sufficient VRAM (Nemotron 3 Super quantized needs ~24GB+). Workers AI is better for production deployments, serverless functions, and teams without GPU infrastructure.
What to Try Next
- Compare output quality against Claude Haiku 4.5 on your specific use case — both are in the same cost tier
- Test the 1M context window via the REST API with a large codebase
- Combine with Cloudflare D1 (SQLite at the edge) to build a full RAG pipeline without leaving Cloudflare's ecosystem
Next read: Nemotron 3 Super vs Claude Sonnet 4.6 vs GPT-5.4: Which for Coding Agents?