Serving & Deployment

npcpy ships with a built-in Flask server that exposes your NPC teams as REST API endpoints. You can register NPCs and teams at startup, then interact with them over HTTP from any language or framework. The server handles streaming responses, conversation history, jinx execution, model management, and an OpenAI-compatible completions endpoint.

Basic Server Setup

The simplest way to start the server is to call start_flask_server with your NPCs and teams.

from npcpy.serve import start_flask_server
from npcpy.npc_compiler import NPC, Team

# Create NPCs
researcher = NPC(
    name="researcher",
    primary_directive="You are a research assistant. Find and summarize information.",
    model="gemma3:4b",
    provider="ollama",
)

writer = NPC(
    name="writer",
    primary_directive="You are a technical writer. Produce clear documentation.",
    model="gemma3:4b",
    provider="ollama",
)

# Build a team
team = Team(
    npcs=[researcher, writer],
    team_name="docs_team",
)

# Register and start
start_flask_server(
    port=5337,
    cors_origins=["http://localhost:3000"],
    debug=True,
    teams={"docs_team": team},
    npcs={"researcher": researcher, "writer": writer},
)

The server binds to 0.0.0.0 on the given port and accepts requests immediately.

start_flask_server Parameters

Parameter	Default	Description
`port`	`5337`	HTTP port to listen on
`cors_origins`	`None`	List of allowed CORS origins
`debug`	`False`	Enable Flask debug mode
`teams`	`None`	Dict mapping team names to `Team` objects
`npcs`	`None`	Dict mapping NPC names to `NPC` objects
`db_path`	`""`	Path to SQLite database for conversation history
`user_npc_directory`	`None`	Directory containing `.npc` and `.jinx` files

NPC Loading Priority

When a request specifies an NPC name, the server resolves it in this order:

Registered team NPCs -- checks all teams passed via the teams parameter for an NPC with a matching name.
Globally registered NPCs -- checks the npcs dict passed at startup.
Database / file fallback -- loads the NPC from disk using the npc_source field ("global" checks ~/.npcsh/npc_team/, "project" checks the request's currentPath).

This means programmatically registered NPCs always take precedence over file-based definitions.

REST API Endpoints

Core Endpoints

Method	Path	Description
`POST`	`/api/stream`	Stream a response from an NPC (SSE)
`POST`	`/api/jinx/execute`	Execute a jinx with arguments
`GET`	`/api/models`	List available models
`GET`	`/api/npc_team_global`	List globally registered NPCs
`GET`	`/api/jinxes/global`	List globally available jinxes
`GET`	`/api/conversations`	List conversations
`GET`	`/api/conversation/<id>/messages`	Get messages for a conversation
`GET`	`/api/health`	Health check (returns `{"status": "ok"}`)
`POST`	`/v1/chat/completions`	OpenAI-compatible completions endpoint

ML and Fine-Tuning Endpoints

Method	Path	Description
`POST`	`/api/ml/train`	Train a sklearn model
`POST`	`/api/ml/predict`	Run prediction on a trained model
`POST`	`/api/finetune_instruction`	Start an SFT fine-tuning job
`POST`	`/api/finetune_diffusers`	Start a diffusion fine-tuning job
`GET`	`/api/finetuned_models`	List fine-tuned models
`POST`	`/api/genetic/create_population`	Create a genetic evolution population
`POST`	`/api/genetic/evolve`	Run one generation of evolution

Model Management Endpoints

Method	Path	Description
`GET`	`/api/ollama/models`	List Ollama models
`POST`	`/api/ollama/pull`	Pull an Ollama model
`POST`	`/api/ollama/delete`	Delete an Ollama model
`GET`	`/api/models/hf/search`	Search HuggingFace models
`POST`	`/api/models/hf/download`	Download a HuggingFace model

curl Examples

Health Check

curl http://localhost:5337/api/health

{"status": "ok", "error": null}

Stream a Response

curl -X POST http://localhost:5337/api/stream \
  -H "Content-Type: application/json" \
  -d '{
    "commandstr": "Explain quantum computing in two sentences.",
    "conversationId": "conv-001",
    "model": "gemma3:4b",
    "provider": "ollama"
  }'

Stream with a Specific NPC

curl -X POST http://localhost:5337/api/stream \
  -H "Content-Type: application/json" \
  -d '{
    "commandstr": "Write a summary of the latest research on fusion energy.",
    "conversationId": "conv-002",
    "npc": "researcher",
    "model": "gemma3:4b",
    "provider": "ollama"
  }'

Execute a Jinx

curl -X POST http://localhost:5337/api/jinx/execute \
  -H "Content-Type: application/json" \
  -d '{
    "jinxName": "summarize",
    "jinxArgs": [{"name": "topic", "value": "machine learning"}],
    "conversationId": "conv-003",
    "model": "gemma3:4b",
    "provider": "ollama"
  }'

List Available Models

curl http://localhost:5337/api/models?currentPath=/path/to/project

List Global NPCs

curl http://localhost:5337/api/npc_team_global

List Global Jinxes

curl http://localhost:5337/api/jinxes/global

OpenAI-Compatible Completions

The server exposes an OpenAI-compatible endpoint so you can use standard OpenAI clients:

curl -X POST http://localhost:5337/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma3:4b",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "stream": false
  }'

JavaScript / React Integration

Use the standard fetch API to connect a frontend to the NPC server. The stream endpoint returns server-sent events.

// Stream a response from an NPC
async function streamNPCResponse(prompt, npcName, conversationId) {
  const response = await fetch("http://localhost:5337/api/stream", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      commandstr: prompt,
      conversationId: conversationId,
      npc: npcName,
      model: "gemma3:4b",
      provider: "ollama",
    }),
  });

  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  let fullText = "";

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;

    const chunk = decoder.decode(value, { stream: true });
    fullText += chunk;
    // Update your UI with the streamed text
    console.log("Chunk:", chunk);
  }

  return fullText;
}

// List available models
async function getModels() {
  const response = await fetch("http://localhost:5337/api/models");
  const data = await response.json();
  return data.models;
}

// List NPCs
async function getNPCs() {
  const response = await fetch("http://localhost:5337/api/npc_team_global");
  const data = await response.json();
  return data.npcs;
}

React Hook Example

import { useState, useCallback } from "react";

function useNPCChat(serverUrl = "http://localhost:5337") {
  const [messages, setMessages] = useState([]);
  const [isStreaming, setIsStreaming] = useState(false);

  const sendMessage = useCallback(async (prompt, npcName, conversationId) => {
    setIsStreaming(true);
    setMessages((prev) => [...prev, { role: "user", content: prompt }]);

    try {
      const response = await fetch(`${serverUrl}/api/stream`, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({
          commandstr: prompt,
          conversationId,
          npc: npcName,
        }),
      });

      const reader = response.body.getReader();
      const decoder = new TextDecoder();
      let assistantMessage = "";

      while (true) {
        const { done, value } = await reader.read();
        if (done) break;
        assistantMessage += decoder.decode(value, { stream: true });
        setMessages((prev) => {
          const updated = [...prev];
          const lastIdx = updated.length - 1;
          if (updated[lastIdx]?.role === "assistant") {
            updated[lastIdx] = { role: "assistant", content: assistantMessage };
          } else {
            updated.push({ role: "assistant", content: assistantMessage });
          }
          return updated;
        });
      }
    } finally {
      setIsStreaming(false);
    }
  }, [serverUrl]);

  return { messages, sendMessage, isStreaming };
}

Python Client Example

import requests

SERVER = "http://localhost:5337"

# Health check
health = requests.get(f"{SERVER}/api/health").json()
print(health)  # {"status": "ok", "error": null}

# List models
models = requests.get(f"{SERVER}/api/models").json()
for m in models["models"]:
    print(f"  {m['display_name']}")

# Send a message and get a streamed response
response = requests.post(
    f"{SERVER}/api/stream",
    json={
        "commandstr": "Explain gradient descent in one paragraph.",
        "conversationId": "py-conv-001",
        "model": "gemma3:4b",
        "provider": "ollama",
    },
    stream=True,
)

for chunk in response.iter_content(chunk_size=None, decode_unicode=True):
    print(chunk, end="", flush=True)
print()

# Execute a jinx
result = requests.post(
    f"{SERVER}/api/jinx/execute",
    json={
        "jinxName": "analyze_code",
        "jinxArgs": [{"name": "language", "value": "python"}],
        "conversationId": "py-conv-002",
    },
).json()
print(result)

# Use the OpenAI-compatible endpoint
completion = requests.post(
    f"{SERVER}/v1/chat/completions",
    json={
        "model": "gemma3:4b",
        "messages": [
            {"role": "user", "content": "What is 2+2?"}
        ],
    },
).json()
print(completion["choices"][0]["message"]["content"])

Team with Tools Example

A common deployment pattern is a multimedia team where each NPC has specialized tools. The server exposes all team capabilities through the same streaming endpoint.

import os
from npcpy.npc_compiler import NPC, Team
from npcpy.serve import start_flask_server
from npcpy.data.web import search_web
from npcpy.data.load import load_file_contents
from npcpy.data.image import capture_screenshot

def web_search(query: str) -> str:
    """Search the web for information."""
    results = search_web(query)
    return str(results)

def read_file(filepath: str) -> str:
    """Read a local file and return its contents."""
    return load_file_contents(filepath)

def screenshot() -> str:
    """Capture a screenshot of the current desktop."""
    path = capture_screenshot()
    return f"Screenshot saved to {path}"

# Research NPC with web access
researcher = NPC(
    name="researcher",
    primary_directive="Search the web and summarize findings.",
    model="gemma3:4b",
    provider="ollama",
    tools=[web_search],
)

# File analyst with local file access
analyst = NPC(
    name="analyst",
    primary_directive="Analyze files and data.",
    model="gemma3:4b",
    provider="ollama",
    tools=[read_file],
)

# Desktop automation NPC
automator = NPC(
    name="automator",
    primary_directive="Automate desktop tasks and capture screenshots.",
    model="gemma3:4b",
    provider="ollama",
    tools=[screenshot],
)

team = Team(
    npcs=[researcher, analyst, automator],
    team_name="multimedia_team",
)

start_flask_server(
    port=5337,
    cors_origins=["http://localhost:3000"],
    teams={"multimedia_team": team},
    npcs={
        "researcher": researcher,
        "analyst": analyst,
        "automator": automator,
    },
)

When you send a request to /api/stream with "npc": "researcher", the researcher NPC can invoke its web_search tool during the response. Tool calls and results are tracked and stored alongside the conversation.

Complex Workflows with Jinxes

Jinxes are multi-step prompt pipelines defined in YAML. When served through the API, they can chain LLM calls, tool invocations, and template rendering.

First, define a jinx file at ~/.npcsh/npc_team/jinxes/research_report.jinx:

jinx_name: research_report
description: Research a topic and write a structured report.
inputs:
  - name: topic
    description: The topic to research
    default: "artificial intelligence"
  - name: depth
    description: How detailed the report should be
    default: "medium"
steps:
  - name: research
    prompt: |
      Research the following topic thoroughly: {{ topic }}
      Depth level: {{ depth }}
      Provide key findings, statistics, and recent developments.

  - name: outline
    prompt: |
      Based on this research: {{ research }}
      Create a structured outline for a report on {{ topic }}.

  - name: report
    prompt: |
      Using this outline: {{ outline }}
      Write a complete report. Include an introduction, body sections,
      and conclusion.

Then execute it through the API:

curl -X POST http://localhost:5337/api/jinx/execute \
  -H "Content-Type: application/json" \
  -d '{
    "jinxName": "research_report",
    "jinxArgs": [
      {"name": "topic", "value": "quantum computing"},
      {"name": "depth", "value": "detailed"}
    ],
    "conversationId": "jinx-conv-001",
    "model": "gemma3:4b",
    "provider": "ollama"
  }'

Each step in the jinx executes sequentially. Output variables from earlier steps (like {{ research }} and {{ outline }}) are automatically available in later step prompts via Jinja2 templating. The final response includes the output of the last step along with all intermediate results.