A great prompt has layers. Once you see the pattern, you can write prompts for almost anything — a selfie, an Imagine video, even your bot's identity — using the same structure.
This isn't a rigid formula. It's a stack of layers you add one at a time until the prompt is as detailed as you want it.
🎥 Want to watch first? The Prompting 101 tutorial walks through these ideas in under a minute.
The layered recipe
Most rich Cantina prompts include these layers, roughly in this order:
- Subject — who's in the frame. (You? Your bot? An object?)
- Action / pose — what they're doing or how they're standing.
- Setting — where this happens. Be specific.
- Wardrobe — what they're wearing (for selfies and videos).
- Camera — shot type, angle, lens character.
- Lighting — natural light, time of day, artificial sources.
- Color / atmosphere — color grade, mood, vibe.
- Finish — quality and style cues (cinematic, 4K, film grain, editorial).
You don't need all eight. A casual selfie can land with three. A polished editorial shot might use all of them. The more you stack, the closer the result gets to what you're picturing.
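One way to picture the stacking is as a small helper that joins whichever layers you filled in, in the order listed above. This is only a sketch for drafting your own prompts, not anything Cantina provides. The names `SCENE_LAYERS`, `STYLE_LAYERS`, and `build_prompt` are hypothetical, and splitting the eight layers into a "scene" sentence and a "style" sentence is an assumption made for readability.

```python
# Hypothetical drafting helper; not a Cantina API.
SCENE_LAYERS = ["subject", "action", "setting", "wardrobe"]
STYLE_LAYERS = ["camera", "lighting", "color", "finish"]


def build_prompt(layers: dict[str, str]) -> str:
    """Join the layers that were provided, in recipe order, skipping the rest."""
    scene = " ".join(
        layers[name].strip() for name in SCENE_LAYERS if layers.get(name, "").strip()
    )
    style = ", ".join(
        layers[name].strip() for name in STYLE_LAYERS if layers.get(name, "").strip()
    )
    # Capitalize each sentence and end it with a period; drop empty sentences.
    sentences = [s[0].upper() + s[1:] for s in (scene, style) if s]
    return " ".join(f"{s}." for s in sentences)


# Three layers are enough for a casual selfie.
casual = build_prompt({
    "subject": "Selfie",
    "setting": "at a coffee shop",
    "lighting": "warm interior light",
})
print(casual)  # Selfie at a coffee shop. Warm interior light.
```

Leaving a layer out simply leaves it out of the prompt, which mirrors the point above: you add layers only until the prompt is as detailed as you want.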
Watch a prompt grow
Same idea, four levels of detail. Here's a real progression from a chat with Carl, built up from a one-line prompt to a fully layered one.
Level 1 — just the subject
Selfie.
Level 1 leans on your bot's profile — the defaults you set when you created it. Carl shows up in his usual look (the "C" t-shirt and an indoor setting) because that's how he was built. A rich identity prompt means a richer Level 1.
Level 2 — add a setting
Selfie at a coffee shop.
Now there's a scene to work with: a coffee shop window in the background, a latte in hand. It's closer to what you wanted, but the lighting, weather, and feel are still coming from your bot's defaults.
Level 3 — add wardrobe, time of day, mood
Selfie at a coffee shop on a rainy afternoon wearing a hoodie, holding a coffee, soft smile, warm interior light.
You can already feel the shot. You've added wardrobe (hoodie), weather (rainy), and warmth. The takeaway cup, the blurred coffee shop lights, and the rain-flecked window all arrived because the prompt asked for them.
Level 4 — add camera, color, finish
Selfie at a coffee shop near a window on a rainy day wearing a hoodie, holding a coffee, soft smile, warm interior light. Eye level medium close-up, soft window light, warm cream and gold color grade, lifestyle film grain, intimate quiet vibe, 4k.
Camera position, lighting, color grade, and finish tighten everything. The shot pulls in close, the colors warm, the mood quiets. This is what a fully layered prompt looks like: every layer is doing work.
💡 Heads up: the jump from Level 1 to Level 2 is often visually subtle, because your bot's character profile already carries weight. The progression becomes more obvious once you start adding wardrobe and weather (Level 3) and again once you stack camera, color, and finish (Level 4).
How the recipe works for selfies, videos, bots, and voices
All four lean on the same recipe — just in different proportions. The opener changes too: what you put first depends on what you're making.
Selfies and image prompts
Lead with "selfie," then the action. "Selfie eating a taco at the beach" or "Selfie walking through a busy street at golden hour." The word selfie tells your bot what type of image to make.
All eight layers help. Wardrobe, lighting, and color carry the most weight. Casual shots can skip them; for a deliberate look, layer them in.
Structure to start from:
Selfie [doing X] in [setting] at [time of day], wearing [wardrobe], [pose detail]. [Camera type] at [angle], [lighting]. [Color grade], [finish], [vibe], [4K].
Imagine videos
Lead with the action. "Walking through a busy street at golden hour" or "Sitting at a coffee shop on a rainy afternoon."
Add action verbs and camera movement, because video is time-based. Imagine videos unfold over a few seconds, so spell out what happens and in what order. Stack action verbs: "walks in, looks up, smiles, turns away."
Structure to start from:
[Action verb 1] in [setting], [action verb 2], [action verb 3], [camera movement]. [Lighting], [color grade], [finish].
Keep it to one continuous moment per prompt for short clips. The cleanest videos come from a single, focused action.
Bot identity, personality, backstory
Lead with role or essence. "A retired chef who…" / "A 22-year-old skateboarder who…"
Drop camera, lighting, and finish. Lean into role, behavior, and contradiction. These aren't visual prompts — they shape how your bot talks and behaves, not how they look. Specificity wins — a clear sentence beats a vague paragraph.
Structure to start from:
[A role / age / hook]. [What they do now]. [Defining behavior or habit]. [Contradiction or hidden trait]. [Optional formative event].
Bot voice
Lead with the descriptors. Age, gender, accent — "A 47-year-old woman, Standard American accent…"
The most structured of all. Stick close to the recipe. Voice prompts reward density: pack the detail into 300 characters or fewer. End with "High fidelity speech quality." for the cleanest audio (swap it out if you want a different sound).
Structure to start from:
[Age + gender], [accent], [pitch], [texture], [pace]. [Behavior cue]. High fidelity speech quality.
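Because voice prompts are the most structured, they are easy to check before you paste them in. The sketch below fills the template and counts characters. It treats the 300-character guidance as a hard limit, which is an assumption, and `build_voice_prompt`, `VOICE_CHAR_LIMIT`, and `DEFAULT_ENDING` are hypothetical names rather than anything Cantina exposes.

```python
# Hypothetical drafting helper; not a Cantina API.
VOICE_CHAR_LIMIT = 300  # from the "300 characters or fewer" guidance
DEFAULT_ENDING = "High fidelity speech quality."


def build_voice_prompt(
    descriptors: list[str],
    behavior_cue: str,
    ending: str = DEFAULT_ENDING,
) -> str:
    """Fill the voice template and reject prompts over the character limit."""
    prompt = f"{', '.join(descriptors)}. {behavior_cue.rstrip('.')}. {ending}"
    if len(prompt) > VOICE_CHAR_LIMIT:
        raise ValueError(
            f"Voice prompt is {len(prompt)} characters; trim to {VOICE_CHAR_LIMIT} or fewer"
        )
    return prompt


voice = build_voice_prompt(
    descriptors=[
        "A 47-year-old woman",
        "Standard American accent",
        "low pitch",
        "slightly raspy texture",
        "unhurried pace",
    ],
    behavior_cue="Pauses before punchlines",
)
print(voice)
print(len(voice), "characters")
```

Pass a different `ending` if you want a different sound than the default closing line.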
Selfies as scene builders
One workflow worth knowing about: prompt several selfies first, then build a multi-scene Imagine video from them.
The selfie locks in the look — wardrobe, setting, camera, lighting, color — before motion comes in. Each selfie becomes a scene anchor.
The recipe carries straight over. If your selfie prompt is layered (action, setting, wardrobe, camera, lighting, color grade, finish), the scene built from it inherits every layer.
The practical move: if you're planning a multi-scene video, write your selfie prompts like you mean it. Each one is a foundation for a scene. A vague selfie gets a vague scene; a richly layered selfie gets a richly layered scene.
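As a mental model only, you can think of each scene as a copy of its selfie's layers with motion added on top. The sketch below is not how Cantina implements scenes. `scene_from_anchor` and the layer names are hypothetical, and the wardrobe value is made-up sample data.

```python
# Mental model only; Cantina does not expose these names.
def scene_from_anchor(
    anchor: dict[str, str], action: str, camera_movement: str
) -> dict[str, str]:
    """Copy the selfie's look, then add the motion-specific layers."""
    scene = dict(anchor)  # inherits setting, wardrobe, lighting, color, and so on
    scene["action"] = action
    scene["camera_movement"] = camera_movement
    return scene


library_selfie = {
    "setting": "an old library",
    "wardrobe": "wool cardigan",  # sample value for illustration
    "lighting": "single warm desk lamp",
    "color": "warm amber and deep brown color grade",
}

scene_1 = scene_from_anchor(library_selfie, "turns a page", "static")
scene_2 = scene_from_anchor(library_selfie, "looks up and smiles", "slow push-in")

# Both scenes share the look because both started from the same anchor.
print(scene_1["lighting"] == scene_2["lighting"])  # True
```

A selfie with only one or two layers gives its scenes only one or two layers to inherit, which is the "vague selfie, vague scene" point in code form.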
See Selfies as Scene Anchors for the full workflow, an end-to-end example, and tips on keeping multi-scene videos feeling like one piece.
Vocabulary to steal
Layering gets easier when you have words for each piece.
Camera
Wide-angle · fisheye · telephoto · macro · low-angle · eye-level · high-angle · overhead · close-up · medium shot · full-body · wide shot · handheld · dolly · static · orbiting · push-in · pan · shallow depth of field · deep focus.
Lighting
Golden hour · blue hour · harsh midday · overcast · direct flash · soft window light · candlelight · neon glow · backlit · side-lit · top-lit · ambient · volumetric haze · lens flare · atmospheric glow.
Mood / vibe
Warm · moody · dreamy · gritty · playful · intimate · surreal · cinematic · editorial · candid · documentary · saturated · muted · glossy · faded · vivid.
Color grade
Warm amber + dusty blue · cool teal + orange · lavender + rose · emerald + gold · crimson + black · saturated everyday · muted earth tones · soft pastels.
Finish cues
Cinematic film grain · ultra detailed 4K · photorealistic · editorial photography · lifestyle · documentary realism · dreamy · hyperreal · saturated · glossy · ultra HD.
Patterns worth stealing
- Lead with the action. For selfies in chat, open with "selfie" then what's happening: "Selfie eating a taco at the beach." For Imagine videos, open directly with the action.
- Set the scene before the style. "At a campfire under the stars" before "moody, low light."
- One continuous action per Imagine prompt. A single focused moment renders cleanest.
- Concrete beats vague. "A red enamel mug on a windowsill" lands better than "a nice mug somewhere."
- Mood words are cheap and effective. "Warm," "moody," "playful," "dreamy," "gritty" each do a lot of work in a single word.
- Stack finish at the end. Camera + lighting + color + finish at the end of the prompt acts like a settings panel for the whole image.
When the result isn't quite right
The recipe doesn't land perfectly on the first try every time. When the output's off:
- Find the gap. Compare what you wrote to what you got. What did you not say?
- Add the missing layer. If the time of day is wrong, add a time of day. If the mood feels off, add a mood word. If the lighting looks flat, add a lighting line.
- Run it again. Don't rewrite from scratch — change one thing.
Most prompts get to "right" in two or three small edits, not a full rewrite.
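The "find the gap" step can be treated as a checklist: list the layers you actually wrote and see which ones are absent. This is a minimal sketch for your own notes, assuming you track a draft as a dictionary of layers. `LAYER_ORDER` and `missing_layers` are hypothetical names, not a Cantina feature.

```python
# Hypothetical checklist helper; not a Cantina API.
LAYER_ORDER = [
    "subject", "action", "setting", "wardrobe",
    "camera", "lighting", "color", "finish",
]


def missing_layers(layers: dict[str, str]) -> list[str]:
    """Return the recipe layers the draft does not fill in, in recipe order."""
    return [name for name in LAYER_ORDER if not layers.get(name, "").strip()]


draft = {"subject": "Selfie", "action": "reading a book"}
print(missing_layers(draft))
# ['setting', 'wardrobe', 'camera', 'lighting', 'color', 'finish']

# Change one thing, then check again.
draft["lighting"] = "single warm desk lamp lighting the face"
print(missing_layers(draft))
# ['setting', 'wardrobe', 'camera', 'color', 'finish']
```

A missing layer is not automatically a problem. The list only tells you what the prompt left to your bot's defaults, so you can pick the one layer that matches what looked off.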
An example for you to try
Here's a rough selfie prompt:
selfie reading a book
It's got the right opener and an action — but it's missing setting, lighting, camera, mood, and finish. Let's layer.
Add setting and tighten the action:
Selfie reading a leather-bound book at a long wooden table in an old library. Tall shelves disappear into shadow on either side.
Add camera and lighting:
Selfie reading a leather-bound book at a long wooden table in an old library, tall shelves disappearing into shadow on either side. Medium close-up at table level, single warm desk lamp lighting the face, the rest of the library in deep shadow.
Add a pose detail, then color, mood, and finish:
Selfie reading a leather-bound book at a long wooden table in an old library, one hand resting on the page, tall shelves disappearing into shadow on either side. Medium close-up at table level, single warm desk lamp lighting the face, the rest of the library in deep shadow. Warm amber and deep brown color grade, hushed scholarly mood, painterly film grain, ultra detailed 4K.
Same idea. Way more arrived. Each layer is one small decision — what time of day, what camera, what feel.
Keep going
- Prompting 101 — the basics: what a prompt is and where you'll use them.
- Fast Videos vs Imagine Videos — which video tool to use for which job.
- How to Create a Bot — where you'll use identity, personality, and voice prompts.
- Cantina Glossary — quick reference for product terms.