Image, Audio & Video Generation
npcpy provides unified functions for generating images, audio, and video across multiple providers. All generation functions live under npcpy.llm_funcs (high-level wrappers) and npcpy.gen.* (provider-specific implementations).
Image Generation
gen_image
The top-level function for image generation and editing.
Signature:
def gen_image(
prompt: str,
model: str = None,
provider: str = None,
npc: Any = None,
height: int = 1024,
width: int = 1024,
n_images: int = 1,
input_images: List[Union[str, bytes, PIL.Image.Image]] = None,
save: bool = False,
filename: str = '',
api_key: str = None,
) -> List[PIL.Image.Image]
Returns: A list of PIL Image objects.
Ollama
Generate images locally with Ollama image generation models.
from npcpy.llm_funcs import gen_image
images = gen_image(
"A serene mountain lake at sunset",
model="x/z-image-turbo",
provider="ollama",
height=512,
width=512,
)
images[0].show()
Supported Ollama image models:
- x/z-image-turbo (default for Ollama)
- x/flux2-klein
Make sure the model is pulled first: ollama pull x/z-image-turbo
Diffusers
Run Stable Diffusion and other HuggingFace diffusion models locally.
images = gen_image(
"A cyberpunk cityscape at night",
model="runwayml/stable-diffusion-v1-5",
provider="diffusers",
height=512,
width=512,
)
images[0].save("cityscape.png")
The diffusers backend enables memory optimizations automatically (attention slicing, CPU offload, xformers when available). For fine-tuned models, pass the local directory path as the model name:
OpenAI
Generate images with DALL-E 2, DALL-E 3, or GPT Image 1.
images = gen_image(
"An oil painting of a cat wearing a top hat",
model="dall-e-2",
provider="openai",
height=1024,
width=1024,
n_images=2,
)
Gemini
Generate images using Google's Gemini models.
images = gen_image(
"A watercolor painting of cherry blossoms",
model="gemini-2.5-flash-image",
provider="gemini",
)
For Imagen models:
images = gen_image(
"A photorealistic landscape",
model="imagen-3.0-generate-002",
provider="gemini",
n_images=2,
)
Image Editing
Edit existing images by passing attachments. The attachments parameter accepts file paths, raw bytes, or PIL Image objects.
# OpenAI image editing with gpt-image-1
images = gen_image(
"Add a rainbow in the sky",
model="gpt-image-1",
provider="openai",
input_images=["photo.png"],
)
# Gemini image editing
images = gen_image(
"Make the sky more dramatic",
model="gemini-2.5-flash-image",
provider="gemini",
input_images=["landscape.jpg"],
)
There is also a convenience function for editing:
from npcpy.gen.image_gen import edit_image
edited = edit_image(
prompt="Add a rainbow in the sky",
image_path="photo.png",
provider="openai",
model="gpt-image-1",
save_path="edited.png",
)
Saving Images
Use the save and filename parameters:
images = gen_image(
"A fractal pattern",
model="x/z-image-turbo",
provider="ollama",
save=True,
filename="fractal",
)
# Saves fractal_0.png, fractal_1.png, etc.
Or save manually:
Video Generation
gen_video
Generate video from text prompts.
Signature:
def gen_video(
prompt: str,
model: str = None,
provider: str = None,
npc: Any = None,
device: str = "cpu",
output_path: str = "",
num_inference_steps: int = 10,
num_frames: int = 25,
height: int = 256,
width: int = 256,
negative_prompt: str = "",
messages: list = None,
) -> dict
Diffusers Backend
Uses the damo-vilab/text-to-video-ms-1.7b model by default.
result = gen_video(
"A timelapse of flowers blooming",
provider="diffusers",
device="cpu",
num_frames=25,
height=256,
width=256,
)
print(result["output"]) # path to the generated .mp4
Gemini / Veo 3
Generate high-fidelity video with synchronized audio using Google's Veo 3 API.
result = gen_video(
"A drone shot over a tropical island with waves crashing",
model="veo-3.0-generate-preview",
provider="gemini",
negative_prompt="blurry, low quality",
)
print(result["output"]) # path to the generated .mp4
Videos are saved to ~/.npcsh/videos/ by default.
Audio Generation (Text-to-Speech)
npcpy provides a unified TTS interface supporting multiple engines.
Unified Interface
Returns raw audio bytes. The format depends on the engine (WAV for Kokoro/Qwen3, MP3 for ElevenLabs/gTTS, PCM16 for OpenAI/Gemini).
Kokoro (Local Neural TTS)
Default engine. Runs locally with no API key needed.
from npcpy.gen.audio_gen import text_to_speech
audio_bytes = text_to_speech(
"Hello, welcome to npcpy!",
engine="kokoro",
voice="af_heart", # female American voice
speed=1.0,
)
# Save to file
with open("greeting.wav", "wb") as f:
f.write(audio_bytes)
Available Kokoro voices: af_heart, af_bella, af_sarah, af_nicole, af_sky (female American), am_adam, am_michael (male American), bf_emma, bf_isabella (female British), bm_george, bm_lewis (male British).
Install: pip install kokoro soundfile
Qwen3-TTS (Local High-Quality Multilingual)
High-quality local TTS with voice cloning and voice design capabilities.
audio_bytes = text_to_speech(
"Welcome to the future of AI.",
engine="qwen3",
voice="ryan",
language="english",
model_size="1.7B",
)
Available voices: aiden, dylan, eric, ryan (male), serena, vivian, sohee, ono_anna (female), uncle_fu (male).
Voice cloning from a reference audio file:
from npcpy.gen.audio_gen import tts_qwen3
audio_bytes = tts_qwen3(
"Clone this voice.",
ref_audio="reference.wav",
ref_text="The transcript of the reference audio.",
)
ElevenLabs (Cloud)
High-quality cloud TTS. Requires ELEVENLABS_API_KEY environment variable.
audio_bytes = text_to_speech(
"This is ElevenLabs quality speech.",
engine="elevenlabs",
)
# Or use the direct function
from npcpy.gen.audio_gen import tts_elevenlabs
audio_bytes = tts_elevenlabs(
"Hello world!",
voice_id="JBFqnCBsd6RMkjVDRZzb",
model_id="eleven_multilingual_v2",
)
For lowest-latency streaming via WebSocket:
import asyncio
from npcpy.gen.audio_gen import tts_elevenlabs_stream
audio_bytes = asyncio.run(tts_elevenlabs_stream(
"Stream this text.",
model_id="eleven_turbo_v2_5",
))
OpenAI Realtime Voice
Uses the OpenAI Realtime API. Requires OPENAI_API_KEY.
audio_bytes = text_to_speech(
"Hello from OpenAI!",
engine="openai",
voice="alloy", # alloy, echo, shimmer, ash, ballad, coral, sage, verse
)
Gemini Live
Uses the Gemini Live API. Requires GEMINI_API_KEY or GOOGLE_API_KEY.
audio_bytes = text_to_speech(
"Hello from Gemini!",
engine="gemini",
voice="Puck", # Puck, Charon, Kore, Fenrir, Aoede
)
gTTS (Free Fallback)
Google Text-to-Speech. No API key needed, but lower quality.
Discovering Available Engines and Voices
from npcpy.gen.audio_gen import get_available_engines, get_available_voices
# Check which engines are installed and configured
engines = get_available_engines()
for name, info in engines.items():
status = "available" if info["available"] else "not available"
print(f"{info['name']}: {status} ({info['description']})")
# List voices for a specific engine
voices = get_available_voices("kokoro")
for v in voices:
print(f" {v['id']}: {v['name']} ({v['gender']})")
Audio Utilities
Convert between audio formats: