Structure Your Prompt – Cantina Help Center

A great prompt has layers. Once you see the pattern, you can write prompts for almost anything — a selfie, an Imagine video, even your bot's identity — using the same structure.

This isn't a rigid formula. It's a stack of layers you add one at a time until the prompt is as detailed as you want it.

🎥 Want to watch first? The Prompting 101 tutorial walks through these ideas in under a minute.

The layered recipe

Most rich Cantina prompts include these layers, roughly in this order:

Subject — who's in the frame. (You? Your bot? An object?)
Action / pose — what they're doing or how they're standing.
Setting — where this happens. Be specific.
Wardrobe — what they're wearing (for selfies and videos).
Camera — shot type, angle, lens character.
Lighting — natural light, time of day, artificial sources.
Color / atmosphere — color grade, mood, vibe.
Finish — quality and style cues (cinematic, 4K, film grain, editorial).

You don't need all eight. A casual selfie can land with three. A polished editorial shot might use all of them. The more you stack, the closer the result gets to what you're picturing.
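If it helps to see the stacking idea as code, here is a minimal Python sketch. It is an illustration only, not a Cantina feature: `LAYER_ORDER` and `build_prompt` are hypothetical names, and the result is just text you would paste into chat.

```python
# Hypothetical helper: joins whichever layers you supply, in recipe order.
LAYER_ORDER = [
    "subject", "action", "setting", "wardrobe",
    "camera", "lighting", "color", "finish",
]


def build_prompt(**layers: str) -> str:
    """Return the supplied layers joined in recipe order, skipping blank ones."""
    unknown = set(layers) - set(LAYER_ORDER)
    if unknown:
        raise ValueError(f"unknown layers: {sorted(unknown)}")
    parts = [
        layers[name].strip()
        for name in LAYER_ORDER
        if layers.get(name, "").strip()
    ]
    return ", ".join(parts)


# Three layers are enough for a casual selfie; the order you pass them in
# does not matter, because the helper sorts them into recipe order.
print(build_prompt(
    lighting="warm interior light",
    subject="Selfie at a coffee shop",
    wardrobe="wearing a hoodie",
))
# Selfie at a coffee shop, wearing a hoodie, warm interior light
```

Adding a layer is one more keyword argument, which mirrors how the recipe is meant to be used: start small and stack.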

Watch a prompt grow

Same idea, four levels of detail. Here's a real progression from a chat with Carl, built up from a one-line prompt to a fully layered one.

Level 1 — just the subject

Selfie.

Level 1 leans on your bot's profile — the defaults you set when you created it. Carl shows up in his usual look (the "C" t-shirt and an indoor setting) because that's how he was built. A rich identity prompt means a richer Level 1.

Level 2 — add a setting

Selfie at a coffee shop.

Now there's a scene to work with: a coffee shop window in the background, a latte in hand. It's closer to what you wanted, but the lighting, weather, and feel are still pulling from your bot's defaults.

Level 3 — add wardrobe, time of day, mood

Selfie at a coffee shop on a rainy afternoon wearing a hoodie, holding a coffee, soft smile, warm interior light.

You can already feel the shot. You've added wardrobe (hoodie), weather (rainy), and warmth. The takeaway cup, the blurred coffee shop lights, and the rain-flecked window all arrived because the prompt asked for them.

Level 4 — add camera, color, finish

Selfie at a coffee shop near a window on a rainy day wearing a hoodie, holding a coffee, soft smile, warm interior light. Eye level medium close-up, soft window light, warm cream and gold color grade, lifestyle film grain, intimate quiet vibe, 4k.

Camera position, lighting, color grade, and finish tighten everything. The shot pulls in close, the colors warm, the mood quiets. This is what a fully layered prompt looks like: every layer is doing work.

💡 Heads up: the jump from Level 1 to Level 2 is often visually subtle, because your bot's character profile already carries weight. The progression becomes more obvious once you start adding wardrobe and weather (Level 3) and again once you stack camera, color, and finish (Level 4).

How the recipe works for selfies, videos, bots, and voices

All four lean on the same recipe — just in different proportions. The opener changes too: what you put first depends on what you're making.

Selfies and image prompts

Lead with "selfie," then the action. "Selfie eating a taco at the beach" or "Selfie walking through a busy street at golden hour." The word selfie tells your bot what type of image to make.

All eight layers help. Wardrobe, lighting, and color carry the most weight. Casual shots can skip them; for a deliberate look, layer them in.

Structure to start from:

Selfie [doing X] in [setting] at [time of day], wearing [wardrobe], [pose detail]. [Camera type] at [angle], [lighting]. [Color grade], [finish], [vibe], [4K].

Imagine videos

Lead with the action. "Walking through a busy street at golden hour" or "Sitting at a coffee shop on a rainy afternoon."

Add action verbs and camera movement. Imagine videos unfold over a few seconds, so spell out what happens over time by stacking action verbs: "walks in, looks up, smiles, turns away."

Structure to start from:

[Action verb 1] in [setting], [action verb 2], [action verb 3], [camera movement]. [Lighting], [color grade], [finish].

Keep it to one continuous moment per prompt for short clips. The cleanest videos come from a single, focused action.
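The video structure above can be sketched the same way. This Python snippet is a rough illustration under stated assumptions: `build_video_prompt` and its parameters are hypothetical names, and the setting is passed with its own preposition ("through a busy street") so the first action reads naturally.

```python
# Hypothetical helper: fills in the video structure for one continuous moment.
def build_video_prompt(actions, setting, camera_movement, look=()):
    """Stack action verbs, then camera movement, then optional look cues."""
    if not actions:
        raise ValueError("give at least one action")
    first, *rest = actions
    # The first action carries the setting; later actions follow in order.
    motion = ", ".join([f"{first} {setting}", *rest, camera_movement])
    style = ", ".join(look)
    if style:
        style = style[0].upper() + style[1:]
        return f"{motion}. {style}."
    return f"{motion}."


print(build_video_prompt(
    actions=["Walks", "looks up", "smiles", "turns away"],
    setting="through a busy street at golden hour",
    camera_movement="slow push-in",
    look=["backlit", "warm amber color grade", "cinematic film grain"],
))
# Walks through a busy street at golden hour, looks up, smiles, turns away,
# slow push-in. Backlit, warm amber color grade, cinematic film grain.
```

The printed prompt is a single line; it is wrapped in the comment above only for readability.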

Bot identity, personality, backstory

Lead with role or essence. "A retired chef who…" / "A 22-year-old skateboarder who…"

Drop camera, lighting, and finish. Lean into role, behavior, and contradiction. These aren't visual prompts — they shape how your bot talks and behaves, not how they look. Specificity wins — a clear sentence beats a vague paragraph.

Structure to start from:

[A role / age / hook]. [What they do now]. [Defining behavior or habit]. [Contradiction or hidden trait]. [Optional formative event].

Bot voice

Lead with the descriptors. Age, gender, accent — "A 47-year-old woman, Standard American accent…"

The most structured of all. Stick close to the recipe. Voice prompts reward density: keep them to 300 characters or fewer. End with "High fidelity speech quality" for the cleanest audio (swap it out if you want a different sound).

Structure to start from:

[Age + gender], [accent], [pitch], [texture], [pace]. [Behavior cue]. High fidelity speech quality.
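A quick way to stay inside the character budget is to assemble the voice prompt and count before you paste it. The Python below is a minimal sketch: `build_voice_prompt` is a hypothetical name, the pitch, texture, pace, and behavior values are made-up examples, and treating the 300-character guideline as a hard limit is an assumption.

```python
MAX_VOICE_CHARS = 300  # guideline from this article, treated here as a hard cap
DEFAULT_ENDING = "High fidelity speech quality"


def build_voice_prompt(descriptors, behavior, ending=DEFAULT_ENDING):
    """Join descriptors, a behavior cue, and the ending; reject overlong prompts."""
    prompt = f"{', '.join(descriptors)}. {behavior}. {ending}."
    if len(prompt) > MAX_VOICE_CHARS:
        raise ValueError(
            f"voice prompt is {len(prompt)} characters; keep it to {MAX_VOICE_CHARS}"
        )
    return prompt


print(build_voice_prompt(
    descriptors=[
        "A 47-year-old woman", "Standard American accent",
        "low pitch", "slightly raspy", "unhurried pace",
    ],
    behavior="Pauses before punchlines",
))
# A 47-year-old woman, Standard American accent, low pitch, slightly raspy,
# unhurried pace. Pauses before punchlines. High fidelity speech quality.
```

The printed prompt is a single line; it is wrapped in the comment above only for readability. Pass a different `ending` if you want a different sound.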

Selfies as scene builders

One workflow worth knowing about: prompt several selfies first, then build a multi-scene Imagine video from them.

The selfie locks in the look — wardrobe, setting, camera, lighting, color — before motion comes in. Each selfie becomes a scene anchor.

The recipe carries straight over. If your selfie prompt is layered (action, setting, wardrobe, camera, lighting, color grade, finish), the scene built from it inherits every layer.

The practical move: if you're planning a multi-scene video, write your selfie prompts like you mean it. Each one is a foundation for a scene. A vague selfie gets a vague scene; a richly layered selfie gets a richly layered scene.

See Selfies as Scene Anchors for the full workflow, an end-to-end example, and tips on keeping multi-scene videos feeling like one piece.

Vocabulary to steal

Layering gets easier when you have words for each piece.

Camera

Wide-angle · fisheye · telephoto · macro · low-angle · eye-level · high-angle · overhead · close-up · medium shot · full-body · wide shot · handheld · dolly · static · orbiting · push-in · pan · shallow depth of field · deep focus.

Lighting

Golden hour · blue hour · harsh midday · overcast · direct flash · soft window light · candlelight · neon glow · backlit · side-lit · top-lit · ambient · volumetric haze · lens flare · atmospheric glow.

Mood / vibe

Warm · moody · dreamy · gritty · playful · intimate · surreal · cinematic · editorial · candid · documentary · saturated · muted · glossy · faded · vivid.

Color grade

Warm amber + dusty blue · cool teal + orange · lavender + rose · emerald + gold · crimson + black · saturated everyday · muted earth tones · soft pastels.

Finish cues

Cinematic film grain · ultra detailed 4K · photorealistic · editorial photography · lifestyle · documentary realism · dreamy · hyperreal · saturated · glossy · ultra HD.

Patterns worth stealing

Lead with the action. For selfies in chat, open with "selfie" then what's happening: "Selfie eating a taco at the beach." For Imagine videos, open directly with the action.
Set the scene before the style. "At a campfire under the stars" before "moody, low light."
One continuous action per Imagine prompt. A single focused moment renders cleanest.
Concrete beats vague. "A red enamel mug on a windowsill" lands better than "a nice mug somewhere."
Mood words are cheap and effective. "Warm," "moody," "playful," "dreamy," "gritty" all do a lot of work in a single word.
Stack finish at the end. Camera + lighting + color + finish at the end of the prompt acts like a settings panel for the whole image.

When the result isn't quite right

The recipe doesn't land perfectly on the first try every time. When the output's off:

Find the gap. Compare what you wrote to what you got. What did you not say?
Add the missing layer. If the time of day is wrong, add a time of day. If the mood feels off, add a mood word. If the lighting looks flat, add a lighting line.
Run it again. Don't rewrite from scratch — change one thing.

Most prompts get to "right" in two or three small edits, not a full rewrite.
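The "find the gap" step is easy to picture as a checklist. This Python sketch is illustrative only: `LAYERS` and `missing_layers` are hypothetical names, and you still decide for yourself which gaps matter for the shot.

```python
# Hypothetical checklist of the eight layers from the recipe.
LAYERS = [
    "subject", "action", "setting", "wardrobe",
    "camera", "lighting", "color", "finish",
]


def missing_layers(layers: dict) -> list:
    """Return the recipe layers that are absent or blank, in recipe order."""
    return [name for name in LAYERS if not layers.get(name, "").strip()]


draft = {"subject": "selfie", "action": "reading a book"}
print(missing_layers(draft))
# ['setting', 'wardrobe', 'camera', 'lighting', 'color', 'finish']
```

Fill in one missing layer, run the prompt again, and compare before adding the next.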

An example for you to try

Here's a rough selfie prompt:

selfie reading a book

It has the right opener and an action, but it's missing setting, camera, lighting, color, mood, and finish. Let's layer.

Add setting and tighten the action:

Selfie reading a leather-bound book at a long wooden table in an old library. Tall shelves disappear into shadow on either side.

Add camera and lighting:

Selfie reading a leather-bound book at a long wooden table in an old library, tall shelves disappearing into shadow on either side. Medium close-up at table level, single warm desk lamp lighting the face, the rest of the library in deep shadow.

Add a pose detail, color, mood, and finish:

Selfie reading a leather-bound book at a long wooden table in an old library, one hand resting on the page, tall shelves disappearing into shadow on either side. Medium close-up at table level, single warm desk lamp lighting the face, the rest of the library in deep shadow. Warm amber and deep brown color grade, hushed scholarly mood, painterly film grain, ultra detailed 4K.

Same idea. Way more arrived. Each layer is one small decision — what time of day, what camera, what feel.

Keep going

Prompting 101 — the basics: what a prompt is and where you'll use them.
Fast Videos vs Imagine Videos — which video tool to use for which job.
How to Create a Bot — where you'll use identity, personality, and voice prompts.
Cantina Glossary — quick reference for product terms.
