Action Prompting – Cantina Help Center

The Action Prompt is the cinematic engine of every video scene you build in Cantina. It drives motion, camera moves, cuts, and transformations: everything that happens on screen, with or without dialogue. This article walks through how to write one, what to layer in, and how to tweak until the scene lands.

What's in your toolkit

The Action Prompt sits in the video editor, just below the Dialogue field. It's where you describe what your bot does and how the camera moves.

A few things you can layer into an Action Prompt:

Action verbs — struts, glares, transforms, leans, pauses, turns, laughs, walks, looks, reaches, falls
World context — "This is a high fashion show," "It's pouring rain," "The crowd cheers," "Smoke fills the air"
Cuts — "Cut to a close up of her face," "Cut to a wide shot," "Cut to her hands"
Focus shifts — "Focus on her legs," "Focus on the mug," "Focus on the door"
Camera moves — "Pull back," "The camera tracks forward," "The camera orbits her," "Push in," "Tilt up"
Transformations — "transforms into a cloud of bats," "morphs into a shadow," "shifts into a wolf"
Sound cues — "soft rain patters on the window," "vendors calling out in the background," "a distant clock ticks," "thunder rolls overhead"

You don't need all seven in every prompt. Pick the ones that serve the scene.

How to write an Action Prompt

Build your prompt in this order before you generate. The more complete it is up front, the closer the first result will land.

Step 1. Pick your action. Two or three verbs that describe what your bot does — struts, glares, transforms.

Step 2. Ground the world. One sentence of context that anchors the scene — "This is a high fashion show," "It's pouring rain outside."

Step 3. Direct the camera. Add the cuts, focus shifts, and camera moves you want — "Cut to a close up of her face," "Pull back," "The camera tracks forward."

Step 4. Layer in transformations or sound if they serve the scene.

Step 5. Save and generate.
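
If you draft prompts outside the editor, the step order above can be sketched as a small text helper. This is a minimal sketch, not a Cantina feature: the function name build_action_prompt and its parameters are hypothetical, and all it does is join the layers locally so you can paste the result into the Action Prompt field.

```python
# Hypothetical drafting helper: not part of Cantina, just local string assembly.
def build_action_prompt(action, world="", camera=(), extras=()):
    """Join the layers in step order: action, world, camera, then extras."""
    layers = [action, world, *camera, *extras]
    # Skip empty layers so optional steps can be left out.
    return " ".join(layer.strip() for layer in layers if layer.strip())


draft = build_action_prompt(
    action="The pale alien woman struts, glares, and transforms into a cloud of bats.",
    world="This is a high fashion show.",
    camera=["Cut to a close up of her face.", "Pull back.", "The camera tracks forward."],
    extras=["Thunder rolls overhead."],
)
print(draft)
```

Keeping each layer as its own argument makes it easy to drop or swap one layer without rewriting the whole prompt.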

Watch a prompt grow

Same scene, three levels of layering. Each level builds on the last with more of the toolkit: first world context, then cuts, a focus shift, and camera moves. See how the picture sharpens as the prompt fills out.

The scene: a fashion-show runway moment where the model dissolves into a swarm of bats flying at the camera.

Level 1 — core action

The pale alien woman struts, glares, and transforms into a cloud of bats that fly towards camera.

The verbs do most of the work. "Struts" sets the motion, "glares" layers in attitude, "transforms" signals the change, and "fly towards camera" gives the bats direction. One sentence, one beat.

Level 2 — add world context

The pale alien woman struts, glares, and transforms into a cloud of bats that fly towards camera. This is a high fashion show.

The second sentence grounds the world — a runway, an audience, theatrical lighting. The result inherits the visual language of "fashion show" without you having to spell it all out.

Level 3 — add cuts, focus, and camera moves

The pale alien woman struts, glares, and transforms into a cloud of bats that fly towards camera. This is a high fashion show. Cut to a close up of her face with fangs showing. Cut to a wide shot. Focus on her legs. Pull back. The camera tracks forward, focusing on her.

Now it's a cinematic mini-sequence. Multiple shots inside one scene, a focus shift down to the legs, two camera moves (pull back, track forward). 267 / 500 characters. Still room to layer more.

And the result: [video]
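
To keep an eye on the character budget while drafting, a rough local count can help. This is a sketch under assumptions: the 500 limit is taken from the "/ 500" counter quoted for Level 3, and the editor may count characters differently from Python's len, so treat the editor's own counter as the one that matters. The names LIMIT and budget are made up for illustration.

```python
LIMIT = 500  # assumed from the "/ 500" counter shown for Level 3


def budget(prompt, limit=LIMIT):
    """Return (characters used, characters left) for a draft prompt."""
    used = len(prompt)  # counts Unicode code points; the editor may differ
    return used, limit - used


level_3 = (
    "The pale alien woman struts, glares, and transforms into a cloud of bats "
    "that fly towards camera. This is a high fashion show. "
    "Cut to a close up of her face with fangs showing. Cut to a wide shot. "
    "Focus on her legs. Pull back. The camera tracks forward, focusing on her."
)
used, remaining = budget(level_3)
print(f"{used} / {LIMIT} characters, {remaining} left")
```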

Pair with empty dialogue for cinematic-only moments

Action Prompts can carry an entire scene without dialogue. Clear the Dialogue field, write a rich Action Prompt, and you get a cinematic moment with no voiceover. The visuals and camera tell the story. See No Dialogue Imagine Videos for the full pattern.

Add a narrator

You can write a narrator's voiceover into the Action Prompt itself, alongside what the camera sees and what your bot says. Structure it like a mini screenplay: three blocks separated by blank lines. The first block describes the action and carries no label. The other two are each labeled for the voice that carries them.

A grumpy blue water bottle sits inside a fridge, staring through the glass shelf with his usual annoyed expression.

NARRATOR: "Meet irritated hydration, the grumpiest bottle in the fridge."

DIALOGUE (irritated hydration lip sync): "Can somebody close the fridge already."

What each block does:

The action block describes what's on screen — no voice tag, no dialogue.
NARRATOR is the voiceover. No character on screen speaks this line. It plays over the visuals.
DIALOGUE (character lip sync) is your bot's spoken line. The lip sync tag tells the model the character's mouth should move to it.
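
The three-block layout can be sketched as a small formatter for drafting outside the editor. This is only an illustration: screenplay_prompt and its parameters are hypothetical names, and the output simply mirrors the labels and blank-line separation shown in the example above.

```python
# Hypothetical formatter: mirrors the three-block layout from the example above.
def screenplay_prompt(action, narrator_line, character, dialogue_line):
    """Build the action, NARRATOR, and DIALOGUE blocks, separated by blank lines."""
    blocks = [
        action.strip(),  # action block: no voice tag
        f'NARRATOR: "{narrator_line.strip()}"',
        f'DIALOGUE ({character.strip()} lip sync): "{dialogue_line.strip()}"',
    ]
    return "\n\n".join(blocks)


fridge_scene = screenplay_prompt(
    action="A grumpy blue water bottle sits inside a fridge, staring through "
           "the glass shelf with his usual annoyed expression.",
    narrator_line="Meet irritated hydration, the grumpiest bottle in the fridge.",
    character="irritated hydration",
    dialogue_line="Can somebody close the fridge already.",
)
print(fridge_scene)
```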

Use a narrator when you want to introduce a character, set up a joke, or land a closing line that the character themselves wouldn't say. It also lets you keep the bot's own dialogue short and reactive: the narrator carries the exposition, and the bot carries the attitude and action.

If the result isn't quite right

Before you generate again, look at your prompt and name the missing layer. Most off results trace back to one of these:

Action looks flat? Stack one or two more verbs.
World feels generic? Add a sentence of context.
Camera stays on one angle too long? Add a cut or a focus shift.
Shot lacks character? Add a transformation, an attitude verb, or a sound cue.

Name the missing layer, add it, then generate.
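
Naming the missing layer can be roughed out as a keyword check on your draft. This is a heuristic sketch, not how Cantina reads prompts: LAYER_CUES and missing_layers are hypothetical names, the cue phrases are borrowed from the toolkit examples in this article, and a prompt that words a layer differently will be reported as missing it.

```python
# Hypothetical cue phrases, taken from the toolkit examples in this article.
LAYER_CUES = {
    "cut": ("cut to",),
    "focus shift": ("focus on",),
    "camera move": ("pull back", "push in", "tilt up", "the camera"),
    "transformation": ("transforms into", "morphs into", "shifts into"),
}


def missing_layers(prompt):
    """List the layers whose cue phrases do not appear in the draft."""
    text = prompt.lower()
    return [
        layer
        for layer, cues in LAYER_CUES.items()
        if not any(cue in text for cue in cues)
    ]


level_1 = (
    "The pale alien woman struts, glares, and transforms into a cloud of bats "
    "that fly towards camera."
)
print(missing_layers(level_1))
```

An empty list only means each cue was found somewhere in the text. Whether the scene lands is still a judgment call you make after generating.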

Keep going

Structure Your Prompt — the 8-layer recipe behind a great prompt.
No Dialogue Imagine Videos — how empty dialogue plus a rich Action Prompt makes a cinematic-only scene.
Selfies as Scene Anchors — multi-scene videos built from chat selfies.
Starter Prompt Library — copy-and-remix example prompts across every Cantina surface.
Prompting 101 — the basics.

Got an Action Prompt to share?

Found a pattern that consistently lands? Share it with the community in The Bot Place. We add new patterns over time.