One of the most useful workflows on Cantina isn't obvious from the surface: instead of jumping straight into an Imagine video, prompt several selfies first — then use those selfies as the foundation for each scene in a multi-scene video.
The selfie locks in the look (wardrobe, setting, lighting, vibe, camera feel). Once you're happy with it, the video builds motion on top of a look you've already chosen. Each selfie becomes a scene anchor. The video built from those anchors inherits everything you put into them.
This article walks through the workflow, why it works, and patterns for keeping multi-scene videos feeling like one piece.
Why this works
Think of it as storyboarding your video before you build it. You prompt a selfie, see how it looks, tweak the prompt, see the new selfie, and iterate until each starting image is exactly right. Then you build the video from those starting points.
This shifts where you do the visual decision-making. Instead of describing wardrobe, setting, lighting, color, and motion all at once in a single video prompt, you nail the look one scene at a time in still images. Then you bring those images to life in the video.
- Generic selfie → generic scene.
- Layered selfie → layered scene.
If your selfie prompt is rich (subject + action + setting + wardrobe + camera + lighting + color + finish), the scene built from it inherits every layer. See Structure Your Prompt for the recipe.
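If it helps to see the layers as a checklist, here is a minimal sketch in Python. It is purely illustrative: Cantina takes plain text typed in chat, and `SelfieLayers`, `build_selfie_prompt`, and `missing_layers` are hypothetical names made up for this example, not part of any Cantina API. The eight field names come from the layer list above, and the example values are adapted from the morning scene later in this article.

```python
from dataclasses import dataclass, fields


@dataclass
class SelfieLayers:
    """Hypothetical container: one field per layer named in the article."""
    subject: str = ""
    action: str = ""
    setting: str = ""
    wardrobe: str = ""
    camera: str = ""
    lighting: str = ""
    color: str = ""
    finish: str = ""


def build_selfie_prompt(layers: SelfieLayers) -> str:
    """Join the filled-in layers, in field order, into one prompt string."""
    parts = [getattr(layers, f.name).strip() for f in fields(layers)]
    return ", ".join(p for p in parts if p)


def missing_layers(layers: SelfieLayers) -> list[str]:
    """List the layers left blank, i.e. the ones that will come out generic."""
    return [f.name for f in fields(layers) if not getattr(layers, f.name).strip()]


# Example values adapted from the morning scene in this article.
morning = SelfieLayers(
    subject="focused expression",
    setting="at your desk with one cup of fresh coffee",
    lighting="soft morning light",
)

print(build_selfie_prompt(morning))
print(missing_layers(morning))
```

The list of missing layers is the useful part: each blank layer is one the selfie will fill in for you, and the scene inherits that choice.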
The workflow
Step 1. Plan your scenes. Sketch a quick arc — morning → noon → night, before → during → after, scene 1 → scene 2 → scene 3. Two to four scenes is a comfortable range.
Step 2. Prompt rich selfies in chat with your bot. Write a layered selfie prompt for each scene — setting, wardrobe, time of day, camera, lighting, color, vibe. (See Starter Prompt Library for examples and Structure Your Prompt for the recipe.)
Step 3. Tweak each selfie prompt until the result looks right. This is where the visual decision-making happens — fast, in still images, before committing to a video.
Step 4. Open Imagine, tap Video, and pick your bot.
Step 5. Open the Add Image picker and tap a selfie. Your recent chat selfies show up in a grid with timestamps ("1m ago", "3m ago", "4m ago"). Tap one to use as the visual anchor for a scene.
Step 6. Imagine generates a dialogue + action prompt for that scene. A "Generating script…" overlay plays while it writes.
Step 7. Land in the Video editor with the scene loaded. Each scene has a Dialogue field (what your bot says) and an Action Prompt field (what your bot does). Edit either or both. Repeat for each scene in your video.
Step 8. Tap Save / Generate to render the final video.
You don't have to do all scenes in one sitting. Build your selfie library over days in chat, then pull from it when you're ready to make a video.
Want a scene or a whole video without dialogue?
Every scene comes with auto-generated dialogue, but you don't have to keep it. Clear the Dialogue field for any scene to drop the dialogue. The Action Prompt still drives motion.
For the full workflow, mix-and-match patterns, and when no-dialogue scenes work best, see No Dialogue Imagine Videos.
A full example — a bot's day
Here's a three-scene arc and the selfie prompts behind it. The screenshots in this section come from a real walkthrough with Carl.
Scene 1 — morning at the desk
At your desk with one cup of fresh coffee, soft morning light, focused expression.
Sets the day. Wardrobe is implied (Carl's standard look), light is soft and morning-warm, vibe is private and focused.
Scene 2 — midday on the street
Walking through a busy city street, midday sun, more energy, warm smile.
Shifts the energy. Light is harsh and bright, the world is full of people, the mood is social and active.
Scene 3 — golden hour at the bar
Winding down at a bar, window seat. Golden hour glow. One mojito on the table.
Lands the day. Light returns to warm and intimate, the mojito gives the scene a prop, and the vibe is quietly celebratory.
Use these three selfies as scene anchors for the Imagine video. The arc carries the viewer through the day even without an explicit story — the time of day, location, and mood do the work.
The final video
Here's what the three selfies become once stitched together as Imagine scenes:
Notice how the arc carries: morning quiet → midday energy → evening wind-down. The wardrobe, light, and mood across the three selfies do most of the storytelling — Imagine builds the motion and dialogue on top.
Keeping your scenes feeling like one video
Multi-scene videos work best when the scenes feel connected. A few ways to do this:
- Repeat one consistent element. Same jewelry, same hairstyle, same color in the wardrobe, same color grade family. One repeated detail makes three scenes feel like one video.
- Use a color arc. Move through related grades instead of jumping wildly. Warm cream → saturated noon → amber gold all live in the same warm family. By contrast, cutting to a cool blue noir scene mid-arc will make it feel like a different video.
- Match finish cues. If selfie 1 is cinematic film grain and selfie 3 is digital sharp, the scenes will feel like they came from different cameras. Pick one finish and run it through every scene.
- Anchor the camera language too. Three handheld shots feel like one video. One handheld + one static wide + one fisheye feels like three different videos.
Picking your scenes — a few arc patterns
If you're stuck on what scenes to use, here are arcs that work well:
- Time of day. Morning → midday → golden hour → night. Easy to plan, easy to vary lighting.
- Before / during / after. Getting ready → at the event → the morning after.
- Location-to-location. Apartment → walking → destination.
- Day in the life. Wake up → work → dinner → wind down.
- Emotional arc. Calm → chaos → calm. Or anxious → confident.
- Wardrobe change. Same place, different outfit per scene — lets the wardrobe carry the story.
When to use this workflow vs. a single Imagine prompt
Use scene anchors when:
- You want a multi-scene video with a clear arc.
- You want each scene's look to be deliberate (specific wardrobe, lighting, color per scene).
- You're telling a story or building a vignette.
- You want to share the result externally and care about polish.
Skip scene anchors and prompt Imagine directly when:
- You want one short clip, not a multi-scene video.
- You're experimenting and don't have a specific look in mind.
- You're going for casual / chat-side energy.
- You're in a hurry. Direct Imagine is faster.
See Fast Videos vs Imagine Videos for more on when each video tool is the right fit.
When scenes don't feel connected
If your finished video feels like three different videos stitched together, the fix is usually in the selfie prompts — not the video build.
- Compare the three selfie prompts side by side. Find the layers that disagree (one scene is cinematic film grain, another is digital sharp; one is warm tones, another is cool).
- Rewrite the outlier to match the family. Pick one of the three as the anchor and pull the others toward it.
- Re-generate the selfies and rebuild the video.
Small edits to the selfie prompts ripple through to the scenes — you don't have to rebuild from scratch.
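The side-by-side comparison can be sketched as a small Python check, assuming you jot down each scene's layers by hand as short tags. Nothing here is a Cantina feature: `find_outliers` and the scene and tag names are hypothetical, and exact string matching is a crude stand-in for judging whether two looks belong to the same family.

```python
from collections import Counter


def find_outliers(scenes, layers=("finish", "color", "camera")):
    """For each layer, list the scenes whose tag differs from the most common tag.

    `scenes` maps a scene name to a dict of layer tags written by hand.
    Layers on which every scene agrees are left out of the result.
    On a tie, the tag seen first is treated as the anchor.
    """
    outliers = {}
    for layer in layers:
        values = {name: tags.get(layer, "") for name, tags in scenes.items()}
        majority, count = Counter(values.values()).most_common(1)[0]
        if count == len(values):
            continue  # every scene agrees on this layer
        outliers[layer] = [name for name, v in values.items() if v != majority]
    return outliers


# Hand-written tags for a three-scene arc (illustrative values only).
scenes = {
    "morning": {"finish": "cinematic film grain", "color": "warm", "camera": "handheld"},
    "midday": {"finish": "cinematic film grain", "color": "warm", "camera": "handheld"},
    "bar": {"finish": "digital sharp", "color": "warm", "camera": "handheld"},
}

print(find_outliers(scenes))
```

In this example only the bar scene's finish disagrees with the other two, so that is the prompt to rewrite before re-generating the selfie.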
Keep going
- Prompting 101 — the basics: what a prompt is and where you'll use them.
- Structure Your Prompt — the eight-layer recipe behind a great prompt.
- Starter Prompt Library — copy-and-remix example prompts across every Cantina surface.
- Fast Videos vs Imagine Videos — which video tool to use for which job.
- Using Imagine to Create Videos — the step-by-step for generating an Imagine video.