How I automate my video creation
Last week, I was thinking about starting a YouTube channel. I wanted to share my knowledge about Python and Flutter. I’m not an expert, but I think I have something to share.
The thing is, I’m very lazy. I’m willing to make videos, but I really don’t want to spend hours writing a script, recording my voice, editing the video, etc… I don’t really know how to do all that stuff, and I don’t want to learn four new software programs either. I made my decision: I’m going to let the machine do most of the job for me.
I started to think about how to automate the process. What do I want to do myself? What should I let the machine do? How much money do I want to spend? I don’t want to spend a single penny. Maybe $1 at most for a few videos…
The script
I want to write the script myself. I think it’s the fun part of the job. I’ll use LLMs like ChatGPT or Gemini to help me with the writing and to check my spelling, because English is not my mother tongue, and a Frenchman being fluent is not very common. They are also useful for asking for information and explaining stories or concepts; they often surface little details that I can dig into later.
I’m talking about writing the script myself, but how do I do it? I want to use local software, so bye-bye Google Docs. Word is too big and complex for my use case. It wouldn’t be easy to parse the script file afterward. My final choice is Obsidian. It’s free software used to write notes. Everything is markdown, it’s lightweight, and the folder structure is easy.
Now, I know how to write my script. What do I do with it? It’s time to use Python and write some smart code!
Video generation
MoviePy
It was my first solution for generating videos. It’s a Python package that uses ffmpeg under the hood to build a video, and it’s very simple!
You create a bunch of clips. A clip can be an image, a text, or a video. You define each clip’s duration, concatenate them into one single clip, add your audio file, and compile the video! Writing the automation is quick, but the generation can be a bit slow: a basic music video with lyrics, where the displayed text changes every few seconds, takes about a minute to render.
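In code, that workflow looks roughly like this. It’s a minimal sketch using the MoviePy 1.x API; the file names are placeholders, and TextClip needs ImageMagick installed on the system.

```python
from moviepy.editor import (AudioFileClip, ImageClip, TextClip,
                            concatenate_videoclips)

# One clip per "slide": a picture, then a line of text.
clips = [
    ImageClip("cat.png").set_duration(4),
    TextClip("This cat is a symbol of laziness",
             fontsize=48, color="white", size=(1280, 720),
             bg_color="black").set_duration(3),
]

# Chain the clips, lay the voice recording on top, and render.
video = concatenate_videoclips(clips, method="compose")
video = video.set_audio(AudioFileClip("voice.wav"))
video.write_videofile("output.mp4", fps=24)
```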
When you start to use video clips to build your own videos, the generation time gets much longer! Another con is that it’s not trivial to guess how long a clip should last. You cannot generate a preview: you have to wait X minutes for the generation, watch your video, spot the mistakes, adapt your code, and generate again…
I wouldn’t recommend MoviePy if you want to create complex videos. It’s slow, has no live editing, and the documentation is outdated. It’s an old package, and the last official release was in 2020. You need to install a lot of third-party software to use all of its features, and you can run into strange issues: in my case, I couldn’t generate text clips at all… But a bunch of enthusiasts are trying to update MoviePy and give it a new lease of life.
I only use MoviePy to generate karaoke videos. You could use it to generate compilations of music, Twitch clips, TikTok shorts, etc… I think that’s the best use case for this package.
So what do I use to generate complex videos?
Shotcut
I decided that the best approach was to pilot video editing software. I didn’t do a lot of research on which software to use. My choice is Shotcut, a free and open-source editor available on Linux, and it works well!
I tried to find some mystical and obscure Python package to pilot the creation, but unfortunately found nothing… I didn’t give up!
I decided to reverse-engineer the file generated by a Shotcut project. After a few hours of adding pictures, audio, and videos in the software, I grasped the logic of the Shotcut project format.
It’s an XML file following a special structure. We have a bunch of elements called playlist. A playlist is a kind of big array containing a bunch of media, called producer by Shotcut.
We have some special playlists:
- main_bin -> that’s the software panel containing all the media that you can drag and drop into your video tracks
- background -> it’s a video track. It’s always a black background. It’s used to define what is shown when media doesn’t fit the project dimensions.
Then we can create playlist0, which will contain our main video track; playlist1, which will contain the background music; and playlist2, which will contain all my audio recordings.
I will not explain the XML schema further here. It would deserve a full post of its own, but you can still contact me if you wish!
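Still, to give a rough idea of its shape, here is a heavily stripped-down sketch of how such a file could be assembled with Python’s standard library. The element and property names reflect my reading of the MLT format that Shotcut builds on; a real project file carries much more metadata, so treat this as illustrative only.

```python
import xml.etree.ElementTree as ET

# Root of an MLT/Shotcut project (heavily simplified).
mlt = ET.Element("mlt", {"producer": "main_bin"})

# One producer per piece of media; "in"/"out" are the frames to use.
producer = ET.SubElement(mlt, "producer", {"id": "producer0", "in": "0", "out": "100"})
ET.SubElement(producer, "property", {"name": "resource"}).text = "cat.jpg"

# main_bin: the panel listing every media file available in the project.
main_bin = ET.SubElement(mlt, "playlist", {"id": "main_bin"})
ET.SubElement(main_bin, "entry", {"producer": "producer0", "in": "0", "out": "100"})

# playlist0: the main video track, showing that same producer.
playlist0 = ET.SubElement(mlt, "playlist", {"id": "playlist0"})
ET.SubElement(playlist0, "entry", {"producer": "producer0", "in": "0", "out": "100"})

# The tractor stitches the tracks together into the final timeline.
tractor = ET.SubElement(mlt, "tractor", {"id": "tractor0"})
ET.SubElement(tractor, "track", {"producer": "playlist0"})

ET.ElementTree(mlt).write("project.mlt", xml_declaration=True, encoding="utf-8")
```

A file this bare probably won’t open cleanly in Shotcut, but the nesting is the point: producers describe the media, playlists order them into tracks, and a tractor ties the tracks into the timeline.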
But now that’s great: I have a way to pilot video editing software that’s way faster than MoviePy and lets me preview my videos and add FX if I want to.
Now, I need to link my script and this piloting code.
Piloting my editing software
I’ve got a video script. How do I generate my Shotcut project file?
I didn’t find a better solution than defining some conventions in my script. Each header line will be used to define a new “slide.” What I call a slide is media that will illustrate what my voice is saying.
Let’s say I want to show a cat picture and say something about it. It will be a Picture slide. I can do the same with videos, texts, or code.
These are my current slide types:
- Picture -> use the picture from the markdown
- Picture local -> download the picture using the link in the markdown
- Video -> use the video from the markdown and only use a specific part of it
- Text -> translate the markdown into a boring PowerPoint-like slide
- Code -> use carbon-now to generate pretty code snippets
- Stable Diffusion -> make a call to an online Stable Diffusion API
My markdown will look like this:
# Picture local
(some image path)
This cat is a symbol of laziness. I would like to be a cat and sleep like it does.
# Video
(some YouTube URL)
- 02:00 -> 02:23 (only use the video between these two timestamps)
We can observe two cats walking freely in the garden. They are very beautiful, both of them.
My Python script parses the markdown, gets each slide, extracts the information, and generates the Shotcut project with all the images and videos in the correct order.
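A minimal version of that parser could look like the sketch below. The Slide class, the splitting rules, and the file name are simplified placeholders compared to my real script; it only shows the idea of turning each header into a slide.

```python
from dataclasses import dataclass, field

@dataclass
class Slide:
    kind: str                                   # "Picture", "Video", "Text", ...
    lines: list = field(default_factory=list)   # everything under the header

def parse_script(markdown: str) -> list:
    """Split an Obsidian note into slides: one slide per '# ' header."""
    slides = []
    for line in markdown.splitlines():
        if line.startswith("# "):
            slides.append(Slide(kind=line[2:].strip()))
        elif slides and line.strip():
            slides[-1].lines.append(line.strip())
    return slides

with open("my_script.md", encoding="utf-8") as f:
    for slide in parse_script(f.read()):
        print(slide.kind, "->", slide.lines)
```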
Recording my voice
At the beginning, I was thinking about using Audacity to record my voice and apply some filters. Problem: I couldn’t find a way to automate the process. Not such a problem! I can do it with Python! I use a bunch of libraries, and this is my workflow (a rough sketch in code follows the list):
- arecord -> it’s a command-line app that can record in .wav. My Python script calls it with the correct parameters.
- noise reduction -> uses deep-learning or statistical noise reduction, depending on what feels best
- audio effects -> I apply some filters: equalizer, normalization, compression
- remove the silences at the beginning and the end
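Here is roughly what that pipeline could look like. I’m showing noisereduce for the statistical noise reduction and pydub for the normalization, compression, and silence trimming; my actual script uses its own mix of libraries, so the package choices here are assumptions.

```python
import subprocess
import noisereduce as nr
import soundfile as sf
from pydub import AudioSegment
from pydub.effects import compress_dynamic_range, normalize
from pydub.silence import detect_leading_silence

RAW, CLEAN = "slide_01_raw.wav", "slide_01.wav"

# 1. Record in mono with arecord until the user hits ctrl+c.
try:
    subprocess.run(["arecord", "-f", "S16_LE", "-r", "44100", "-c", "1", RAW])
except KeyboardInterrupt:
    pass  # ctrl+c stops arecord; whatever was recorded is kept

# 2. Statistical noise reduction.
samples, rate = sf.read(RAW)
sf.write(CLEAN, nr.reduce_noise(y=samples, sr=rate), rate)

# 3. Audio effects: normalization and compression.
audio = compress_dynamic_range(normalize(AudioSegment.from_wav(CLEAN)))

# 4. Trim the silences at the beginning and the end (values in ms).
start = detect_leading_silence(audio)
end = detect_leading_silence(audio.reverse())
audio[start:len(audio) - end].export(CLEAN, format="wav")
```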
Now, my Python script is a bit different. The console will show me the text to record, I’ll press enter to start recording, and when I’m done, I’ll press ctrl+c. Moreover, I synchronize the duration of a slide with the total duration of its audio recording.
And that’s it! My video is ready to be generated!
How to Improve the Process?
Image Generation
In my workflow, I already use an API to generate my illustrations. I would like to run a local version of Stable Diffusion and use Python to generate exactly what I want, but my PC is not powerful enough: it needs at least 10 minutes to generate a 512x512 image with good quality.
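If my hardware ever allows it, the local version would probably look something like this with the Hugging Face diffusers library; the model name, prompt, and output file are only an example, not what I actually call through the online API.

```python
from diffusers import StableDiffusionPipeline

# Load a Stable Diffusion checkpoint; without a GPU this falls back to the CPU,
# which is exactly why it is so slow on my machine.
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

image = pipe("a lazy cat sleeping in a sunny garden, digital painting").images[0]
image.save("slide_illustration.png")
```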
Zoom in on Faces
I’ve started to work on a feature that automatically detects faces in a picture and applies a zoom-in effect on the main face. It would add more life to my videos.
There’s a neat Python package that detects a face and saves it into a database with the name of that person! Of course, it needs some manual work to fill that database.
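As a starting point, finding the main face could be done with the face_recognition package (a package choice I’m assuming here, not necessarily the one I’m experimenting with); the zoom would then be centered on the largest bounding box.

```python
import face_recognition

# Find every face bounding box (top, right, bottom, left) in the picture.
image = face_recognition.load_image_file("speaker.jpg")
faces = face_recognition.face_locations(image)

if faces:
    # Assume the "main" face is simply the biggest one.
    top, right, bottom, left = max(
        faces, key=lambda box: (box[2] - box[0]) * (box[1] - box[3])
    )
    print("Zoom target:", (left, top, right, bottom))
```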
Conclusion
Thanks to this automation, I can generate many videos in a very short time. I’m pretty sure I’m saving hours of work. There’s no need to manually synchronize audio with images, improve audio quality, or download pictures and videos with third-party apps.