What You Will Build

By the end of this tutorial you will have three working integrations for NVIDIA Nemotron 3 Super on Cloudflare Workers AI:

  • A Worker using the native env.AI.run() binding
  • A REST API call using fetch() from any environment
  • An OpenAI-compatible client using the openai SDK

Nemotron 3 Super runs at the edge with no GPU setup, no Docker, and no infrastructure to manage. Cloudflare handles the compute.


Prerequisites

  • A Cloudflare account (free tier works)
  • Node.js 18+ installed
  • Wrangler CLI installed: npm install -g wrangler
  • Basic familiarity with Cloudflare Workers

Why Nemotron 3 Super on Workers AI

Nemotron 3 Super has 120B total parameters but only 12B are active per token — which means inference is fast and cost-efficient despite the large parameter count. On Cloudflare Workers AI it is available at no additional cost on the free tier (within usage limits), making it one of the most capable free options available for edge inference.

Key specs relevant to this tutorial:

Spec                        Value
----                        -----
Total parameters            120B
Active parameters per token 12B
Context window              1M tokens
SWE-Bench Verified          60.47%
Workers AI model ID         @cf/nvidia/nemotron-3-super
Pricing on Workers AI       Free tier included

Option 1 — Native Binding with env.AI.run()

This is the recommended approach if you are already using Cloudflare Workers.

Step 1: Create a new Worker

npm create cloudflare@latest nemotron-worker
cd nemotron-worker

Select Hello World Worker when prompted. TypeScript is recommended.

Step 2: Update wrangler.toml

name = "nemotron-worker"
main = "src/index.ts"
compatibility_date = "2026-01-01"

[ai]
binding = \"AI\"

The [ai] binding gives your Worker access to env.AI — no API keys required.

Step 3: Write the Worker

interface Env {
  AI: Ai;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const { prompt } = await request.json() as { prompt: string };

    const response = await env.AI.run(
      "@cf/nvidia/nemotron-3-super",
      {
        messages: [
          {
            role: "system",
            content: "You are a helpful assistant specialized in developer tooling and AI."
          },
          {
            role: "user",
            content: prompt
          }
        ],
        max_tokens: 1024,
      }
    );

    return Response.json(response);
  },
};

Step 4: Run locally

npx wrangler dev

Test it:

curl -X POST http://localhost:8787 \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Explain what a Cloudflare Worker is in two sentences."}'

Expected response shape:

{
  "response": "A Cloudflare Worker is a serverless function that runs at the edge..."
}
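The Worker above assumes every request carries a valid JSON body with a `prompt` string; a malformed request would throw before the model is ever called. A small guard can reject bad input with a 400 instead. This is a sketch, and the `parsePrompt` helper is a hypothetical addition, not part of the tutorial code:

```typescript
// Hypothetical helper: validates an already-parsed JSON body and extracts
// `prompt`. Returns the prompt string, or null if the body is not the
// expected shape, so the Worker can return a 400 before calling env.AI.run().
function parsePrompt(body: unknown): string | null {
  if (typeof body !== "object" || body === null) return null;
  const prompt = (body as Record<string, unknown>)["prompt"];
  if (typeof prompt !== "string" || prompt.trim() === "") return null;
  return prompt;
}
```

In the Worker's fetch handler, a null result would map to `new Response("Missing prompt", { status: 400 })` before any neurons are spent.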

Step 5: Deploy

npx wrangler deploy

Your Worker is now live at https://nemotron-worker.<your-subdomain>.workers.dev.


Option 2 — REST API with fetch()

Use this if you want to call Nemotron 3 Super from a Next.js API route, a Node.js backend, or any non-Worker environment.

Get your API token

  1. Go to dash.cloudflare.com
  2. My Profile → API Tokens → Create Token
  3. Use the Workers AI template
  4. Copy the token — you will not see it again

Call the REST API

const ACCOUNT_ID = "your_account_id"; // Found in the right sidebar of your Cloudflare dashboard
const API_TOKEN = process.env.CLOUDFLARE_API_TOKEN;

async function runNemotron(prompt: string): Promise<string> {
  const response = await fetch(
    `https://api.cloudflare.com/client/v4/accounts/${ACCOUNT_ID}/ai/run/@cf/nvidia/nemotron-3-super`,
    {
      method: "POST",
      headers: {
        "Authorization": `Bearer ${API_TOKEN}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        messages: [
          { role: "system", content: "You are a helpful assistant." },
          { role: "user", content: prompt }
        ],
        max_tokens: 1024,
      }),
    }
  );

  const data = await response.json();
  return data.result.response;
}

// Usage
const answer = await runNemotron("What is SWE-Bench Verified?");
console.log(answer);
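Note that runNemotron reads data.result.response without checking the envelope. Cloudflare's REST API wraps results in a success/errors/result envelope, so a defensive extractor is worth the few extra lines. This is a sketch, and the CloudflareEnvelope interface below is an assumption about the payload shape; verify it against a live response:

```typescript
// Assumed shape of the Workers AI REST envelope (verify against the live API).
interface CloudflareEnvelope {
  success: boolean;
  errors?: { code: number; message: string }[];
  result?: { response?: string };
}

// Extracts the model's text from a parsed envelope, throwing a readable
// error instead of a TypeError when the request failed.
function extractResponse(data: CloudflareEnvelope): string {
  if (!data.success) {
    const detail = data.errors?.map((e) => `${e.code}: ${e.message}`).join("; ");
    throw new Error(`Workers AI request failed (${detail ?? "no error detail"})`);
  }
  if (typeof data.result?.response !== "string") {
    throw new Error("Workers AI returned no text response");
  }
  return data.result.response;
}
```

Swapping `return data.result.response;` for `return extractResponse(data);` turns silent undefined-property crashes into actionable error messages.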

Using it in a Next.js API route

// app/api/chat/route.ts
import { NextRequest, NextResponse } from "next/server";

export async function POST(request: NextRequest) {
  const { prompt } = await request.json();

  const response = await fetch(
    `https://api.cloudflare.com/client/v4/accounts/${process.env.CF_ACCOUNT_ID}/ai/run/@cf/nvidia/nemotron-3-super`,
    {
      method: "POST",
      headers: {
        "Authorization": `Bearer ${process.env.CF_API_TOKEN}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        messages: [{ role: "user", content: prompt }],
        max_tokens: 1024,
      }),
    }
  );

  const data = await response.json();
  return NextResponse.json({ response: data.result.response });
}

Add to .env.local:

CF_ACCOUNT_ID=your_account_id
CF_API_TOKEN=your_api_token

Option 3 — OpenAI-Compatible Endpoint

If you are already using the openai SDK, this is the fastest way to swap in Nemotron 3 Super with minimal code changes.

npm install openai

import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.CLOUDFLARE_API_TOKEN,
  baseURL: `https://api.cloudflare.com/client/v4/accounts/${process.env.CF_ACCOUNT_ID}/ai/v1`,
});

async function chat(userMessage: string): Promise<string> {
  const completion = await client.chat.completions.create({
    model: "@cf/nvidia/nemotron-3-super",
    messages: [
      { role: "system", content: "You are a helpful assistant." },
      { role: "user", content: userMessage }
    ],
    max_tokens: 1024,
  });

  return completion.choices[0].message.content ?? "";
}

// Usage
const response = await chat("Write a Python function that parses a CSV file.");
console.log(response);

This approach is useful when you want to switch between models (GPT-5.4, Claude, Nemotron) by changing only the model string and baseURL.
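One way to make that swap explicit is a tiny routing helper that picks the baseURL from the model string. The Cloudflare values below come from this tutorial; for any other model the helper simply falls through to the openai SDK's default endpoint, and the non-Cloudflare model names used in testing are placeholders, not verified IDs:

```typescript
// Sketch: derive OpenAI-client options from the model string.
// "@cf/..." models go to Cloudflare's OpenAI-compatible endpoint;
// everything else uses the SDK's default baseURL (undefined here).
interface ClientConfig {
  baseURL?: string; // undefined = let the openai SDK use its default
  model: string;
}

function configFor(model: string, cfAccountId: string): ClientConfig {
  if (model.startsWith("@cf/")) {
    return {
      baseURL: `https://api.cloudflare.com/client/v4/accounts/${cfAccountId}/ai/v1`,
      model,
    };
  }
  return { model };
}
```

The calling code then builds its client from `configFor(model, accountId)` and never hard-codes an endpoint next to a model name.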


Streaming Responses

For chat interfaces where you want to stream tokens as they arrive:

const stream = await client.chat.completions.create({
  model: "@cf/nvidia/nemotron-3-super",
  messages: [{ role: "user", content: "Explain transformer attention in detail." }],
  max_tokens: 2048,
  stream: true,
});

for await (const chunk of stream) {
  const delta = chunk.choices[0]?.delta?.content;
  if (delta) process.stdout.write(delta);
}
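In a chat UI you often want the full text (for history or logging) as well as the live stream. A small accumulator handles both; this is a sketch, and the StreamChunk interface is a minimal stand-in for the delta objects the openai SDK yields:

```typescript
// Minimal chunk shape mirroring the streaming deltas shown above.
interface StreamChunk {
  choices: { delta?: { content?: string } }[];
}

// Forwards each delta as it arrives (via onDelta) and returns the full text.
async function collectDeltas(
  stream: AsyncIterable<StreamChunk>,
  onDelta: (s: string) => void = () => {}
): Promise<string> {
  let full = "";
  for await (const chunk of stream) {
    const delta = chunk.choices[0]?.delta?.content;
    if (delta) {
      onDelta(delta);
      full += delta;
    }
  }
  return full;
}
```

Passing `process.stdout.write.bind(process.stdout)` as onDelta reproduces the loop above while also returning the complete response.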

Rate Limits and Pricing

Plan                   Neurons/day                Notes
----                   -----------                -----
Free                   10,000                     Shared across all AI models
Workers Paid ($5/mo)   100,000                    Sufficient for most dev projects
Beyond included        $0.011 per 1,000 neurons   Check dashboard for current rates

Nemotron 3 Super consumes more neurons per request than smaller models due to its parameter count. For production workloads with high volume, compare total cost against DeepInfra ($0.10/$0.50 per 1M tokens) where you pay per token rather than per neuron.
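The overage math from the table can be sketched directly. The $0.011 per 1,000 neurons rate is the figure quoted above (check the dashboard for current pricing), and the neuron counts in the comment are purely illustrative:

```typescript
// Estimates the daily overage bill in USD, given neurons consumed and the
// plan's included daily allotment, at $0.011 per 1,000 neurons beyond it.
function estimateOverageUSD(neuronsUsed: number, includedPerDay: number): number {
  const overage = Math.max(0, neuronsUsed - includedPerDay);
  return (overage / 1000) * 0.011;
}
```

For example, 150,000 neurons in a day on the Workers Paid plan (100,000 included) leaves a 50,000-neuron overage, roughly $0.55 on top of the $5/mo subscription.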


FAQ

Does Nemotron 3 Super support tool calling on Workers AI?

Tool calling support on Workers AI depends on the model binding version. As of March 2026, check the Cloudflare Workers AI model catalog for the current feature list. The OpenAI-compatible endpoint is more likely to support function calling syntax.

Can I use the 1M token context window on Workers AI?

Workers AI imposes its own context limits independent of the model's native context window. Check the current limits in the Cloudflare documentation — large context requests may need to go through the direct API rather than the Workers binding.

How does this compare to running Nemotron locally with Ollama?

Running locally via Ollama gives you full context window access and no usage limits, but requires a GPU with sufficient VRAM (Nemotron 3 Super quantized needs ~24GB+). Workers AI is better for production deployments, serverless functions, and teams without GPU infrastructure.


What to Try Next

  • Compare output quality against Claude Haiku 4.5 on your specific use case — both are in the same cost tier
  • Test the 1M context window via the REST API with a large codebase
  • Combine with Cloudflare D1 (SQLite at the edge) to build a full RAG pipeline without leaving Cloudflare's ecosystem

Next read: Nemotron 3 Super vs Claude Sonnet 4.6 vs GPT-5.4: Which for Coding Agents?