From the point of view of someone who occasionally watches videos on YouTube, I am trying to figure out a nice way to say... I hate it. Or more specifically, I hate that it generates the voice, and in doing so enables video content spam.
What we don't need more of are cheap, automatically generated videos that are basically spam and/or clickbait, trying to get views. The problem with auto-generated voices in videos like this is that as a viewer I can't distinguish between work someone put deliberate production time into and something churned out by a content farm. The demo video even tricked me at first: I didn't realize the voice was generated until a couple of sentences in, at which point I had a visceral negative reaction, the same as when I accidentally click on a content farm-generated video.
It seems a major feature is automatically syncing the narration to the slides. Perhaps a way to enhance this while avoiding spam generation would be to use the generated voice only for internal timing, and generate a karaoke-like display for a human narrator to read? As a paid service, you could even offer professional voice-overs as an add-on.
If machine-voiced vs. human-voiced is the only discernible difference in the end, this seems like a non-argument.
As someone who is building a tool in roughly the same space (machine-voiced video generation), I can say that the use cases go far beyond "content farm". It also enables a lot of useful content, such as internal training videos; or, paired with browser automation, narrated, always up-to-date video manuals of your product. In the education space, it enables a more iterative way to produce material where you previously couldn't afford to tweak parts of a video, as you would have to narrate it again.
And I also don't think that it will significantly amplify the existence of such videos. There are already YouTube channels that do just that, and people don't seem to mind. E.g. there is a channel that uploads "car news" content, which basically just has narration on top of a series of pictures of a car, and the view counts and ratings on those videos are pretty good. In the end it's just a few factual bullet points stretched into an overly long video using the same old worn-out phrases (just like regular "car news"), and I don't see why a human would need to waste their time voicing that.
> If machine-voiced vs. human-voiced is the only discernible difference in the end, this seems like a non-argument.
The problem is getting to the end -- I don't want to spend several minutes trying to decide whether it's spam or useful. It's simply easier and safer for me to use "contains auto-generated voice" as a filter to avoid watching garbage. Specifically, I'm talking about videos like the ones discussed in this video.
Though I'd generally agree that good quality content is good quality content, I personally think there's something lost by using a machine-generated voice. Good human narrators add nuance and emphasis and energy, and it's much more interesting when someone is passionate or excited about the topic they're talking about and you can hear that come through.
Some humans are bad narrators, of course, and the machine-generated voice may not be worse by comparison. The problem is I'd just rather not listen to an emotionless voice -- whether it's machine-generated or human -- read a script; I'd rather just read it myself.
Maybe I'm wrong and the generated voices are much better than the ones I've heard (any examples?), but part of the problem remains: unless I'm forced to watch (e.g., internal training) or get a recommendation from someone I trust, it's still safer to filter out videos with machine-generated voices as "probably spam/garbage".
> it enables a more iterative way to produce material where you previously couldn't afford to tweak parts of a video, as you would have to narrate it again
I think this is a very compelling feature, but as a potential consumer of these videos (either accidentally on YouTube or forced via internal training) I wish someone would come up with a way to enable this without having to resort to the emotionless robot voice.
This again could just be my personal preference: I think an emotionless robot voice is pretty much always going to mean somewhere between a low- and mediocre-quality video, and I also think a low-quality video is significantly worse than an easily updatable HTML/PDF/whatever document with pictures/screenshots/diagrams as appropriate.
And... some humans are tasked with making videos for others and they're just really bad at it. Again: internal/training videos, etc., done by people without much passion for, or even knowledge of, the task they're training you on. I prefer a machine-generated voice in those cases, or perhaps even some sort of subtitling that could be piped to the TTS engine of my choice.
> Some humans are bad narrators, of course, and the machine-generated voice may not be worse by comparison. The problem is I'd just rather not listen to an emotionless voice -- whether it's machine-generated or human -- read a script; I'd rather just read it myself.
If, in the end, it's really just a script that's read off, I'd rather have it auto-generated. You are right, there are some people who are "bad narrators" on a technical level (looking at you, people in the CircleCI YouTube ads, which prompted me to start this project), but even "good narrators", e.g. the guy from the Kurzgesagt videos, often don't convey any more emotion, and could be replaced with an auto-generated voice.
> Maybe I'm wrong and the generated voices are much better than I've heard (any examples?)
The best ones you can get off the shelf, in my opinion, are the Google WaveNet voices. They are the least "tinny" ones, with good pronunciation. Of the open-source options, Mozilla TTS has some very good results, but like all the other open-source ones it's very hard to get running, and even then it has a much more limited feature set (languages, pronunciation, etc.). Happy to hear suggestions here!
I think we've already crossed the point where the quality is good enough for a lot of applications (i.e. it doesn't distract from the script through constant mispronunciation), and the future of the field is looking pretty good.
You seem to be making this an "either/or"; it's simply another tool.
Manual editing and manual narration tends to act as a forcing function to review the information and approve its accuracy before publishing.
Auto-generated videos can often be published without a final review or fact check. As we see auto-generated video for things like product demos, and company training, it will open a new problem domain of catching “bugs” in those auto-generated videos.
There's a big difference between good content that is automated into a video, and spam. The key use case for this was helping me focus more on the content, rather than on fiddling with synchronisation and resizing assets. I'm not a native English speaker, and although I speak at quite a few conferences per year, listening to my broken English accent (which sounds like a Bond villain) in a video is quite distracting, even for me. Even with my best efforts to record my own voice professionally, generated voice sounds a lot better than what I can do.
I mean, YT now has restrictions on how much engagement you need before you can start monetizing. One could use those videos to bump up the numbers, and then monetize their real videos.
Slightly different aim compared to Video Puppet (the source being plain text is not the goal, which means you will likely have to edit and re-record a script multiple times), but still interesting, especially if you'd rather avoid an auto-generated voice.
Seems you could do something along these lines to avoid the video generation part.
edit: I'm going down a rabbit hole looking through your site. Digging the twisted early internet aesthetic.
The existing tools for doing this sort of thing seem to either require quite a bit of programming/video skills (e.g. Media Lovin' Toolkit, FFmpeg, SoX, jimp, ImageMagick, etc.) or be templated/opinionated tools like https://www.magisto.com/
What I love about Video Puppet is that it provides a simple, easy-to-use set of tools and an API that, through GitHub Actions, lets you put version control and early/often feedback loops at the heart of your projects.
I'm using it to document the development story and back story of an Indie Video Game I'm working on. Previously I was doing it as a Google doc which I was sharing with my collaborators.
With Video Puppet it requires little extra overhead - I was writing this stuff already - but when I see and hear the results played back, I can immediately tell whether the story makes sense or not. I can see if I am jumping into talking about something I haven't set up properly, or if I am trying to say too much.
One thing that would help me is getting feedback on failures in the markdown script more quickly, before even pushing to GitHub. For code, including things like Terraform, I'd use a linter; CircleCI, for example, has a validator tool you can run locally.
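A tiny local pre-push check along those lines is easy to hack together. A purely illustrative sketch: the function name and the two rules below are my own assumptions (guessing that stage directions use parentheses and images use markdown syntax), not the real Video Puppet validator.

```python
import re

def lint_script(text):
    """Return a list of (line_number, message) pairs for common
    slip-ups in a markdown narration script. Illustrative rules only."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        # Unbalanced parentheses usually mean a broken stage direction,
        # e.g. "segment: 00:02 - 00:04)" with the opening paren missing.
        if line.count("(") != line.count(")"):
            problems.append((number, "unbalanced parentheses"))
        # A markdown image with an empty target renders nothing.
        if re.search(r"!\[[^\]]*\]\(\s*\)", line):
            problems.append((number, "image with empty path"))
    return problems
```

Running something like `lint_script(open("script.md").read())` in a pre-commit hook would surface these before the push-and-rebuild round trip.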
The other place I'm going to start using it is for describing defects in a product I am coaching a team on. Previously I would do a screen cap and then upload that to frame.io. Now I can do the screen cap, describe the problem and stick the whole lot into version control with a bunch of github actions to point the team to the resulting video.
I will be following this product closely and actively using it.
Great work Gojko!
It's a shame it doesn't also capture the code's output and, ideally, the state of the interpreter. For example: at 4:45 in the demo video, he tries to run his code and it fails with an error. It's important for both coding tutorials and DX analysis to capture the text of the output/error.
What would be even better would be capturing the error _and_ the detailed stack trace, ideally with the state of each stack frame. My employer produces SDKs for different languages, so it'd be invaluable for debugging.
I can imagine a couple of different ways of doing this which might not be horrifically complicated to add to the Paircast recorder, though I suspect you're already going down this road. If you'd like to chat more, yell!
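For what it's worth, on the Python side most of the raw material is already exposed by the standard library. A sketch of capturing a failure together with the local variables of each stack frame (the function name and the report shape are mine, not anything from Paircast):

```python
import sys
import traceback

def capture_failure(fn, *args):
    """Run fn(*args); on an exception, return the formatted traceback
    plus each frame's name and local variables -- roughly the 'state'
    a tutorial recorder would want to display. Returns None on success."""
    try:
        fn(*args)
        return None
    except Exception:
        text = traceback.format_exc()
        tb = sys.exc_info()[2]
        frames = []
        while tb is not None:
            frame = tb.tb_frame
            # Snapshot the locals so later mutation doesn't change the report.
            frames.append((frame.f_code.co_name, dict(frame.f_locals)))
            tb = tb.tb_next
        return {"traceback": text, "frames": frames}
```

For the 4:45 moment in the demo, this kind of report would show not just the error text but that, say, the failing frame had `b == 0` at the time.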
In the meantime, could you write a bit what different pieces of technology/services you're using to build all this?
Under the hood, the conversion system uses headless Chrome to generate slides, render markdown and provide syntax highlighting. Most of the video and audio processing is done with FFmpeg and SoX.
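For anyone curious about the FFmpeg half of such a pipeline, here is a rough sketch of one standard way to turn a single rendered slide plus a narration track into a scene clip. This is my guess at the general approach, not Video Puppet's actual code; the helper just builds the argument list so the command is easy to inspect.

```python
def scene_command(image, audio, output):
    """Build an ffmpeg invocation that loops a still image for the
    duration of the narration audio and muxes them into one clip."""
    return [
        "ffmpeg",
        "-loop", "1",           # repeat the single frame as a video stream
        "-i", image,
        "-i", audio,
        "-c:v", "libx264",
        "-tune", "stillimage",  # encoder preset suited to static frames
        "-c:a", "aac",
        "-pix_fmt", "yuv420p",  # widest player compatibility
        "-shortest",            # stop when the audio track ends
        output,
    ]
```

Running it is then just `subprocess.run(scene_command("slide1.png", "narration.mp3", "scene1.mp4"), check=True)`, and the per-scene clips can be concatenated afterwards.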
- Make the sample script response header "Content-Type: text/plain" so that it renders in the browser instead of downloading a file.
- Make the sample video demonstrate the three features it says it has, like image captions.
Is there any way I can add my own voice and then still write the words that I want my voice to say?
You could create a custom brand voice with Amazon, and we can then integrate it into Video Puppet.
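For context, the Amazon service behind that is Polly. A hedged sketch of the call involved -- the helper name and the optional client injection are mine; a custom brand voice would simply be a different `VoiceId`:

```python
def narrate(text, voice_id="Joanna", client=None):
    """Return MP3 bytes for `text`, synthesised by Amazon Polly."""
    if client is None:
        # Deferred import so the function stays importable without AWS set up.
        import boto3
        client = boto3.client("polly")
    response = client.synthesize_speech(
        Text=text, OutputFormat="mp3", VoiceId=voice_id)
    return response["AudioStream"].read()
```

Write the returned bytes to a file and they can feed straight into a video pipeline as the narration track.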
It is an application/shared library for Linux, released as free software. It has a GUI program for live narration and another, "Vox", for creating video from PDFs or still images using speech synthesis (Festival).
The Kinetophone shared library could be used as a plug-in for presentation software. Kinetophone's file format is XML. I haven't updated it for years, and it does require occasional patches to support the latest FFmpeg. It was originally a commercial application for OS X called Ishmael, back in about '07, which I ported to Linux after my company went out of business.
I'm imagining a daily routine of airplaying the video to your TV with an annotated dashboard of quantified self metrics, weather forecast, plotted local Covid-19 cases, health advisories, etc.
Only the narration is a useful form of presentation, and you don't need this tech for that.
So video only limits you, nothing else.
I can imagine this being used like PDF in specific contexts - if you need a 100% guarantee that local devices/viewers/etc. won't change any detail of the output.
Building a full video is fairly quick compared to traditional editing tools, so I haven't built any faster preview yet. I usually just build the whole thing and look at it, then tweak the script and build it again.
You can easily upload just the script file into an existing project and re-build the video as many times as you like, then download the version you are happy with at the end.
I'll give it a try; perhaps it's so fast that previews aren't really necessary.
I also think this could be an amazing tool for personalizing video marketing.
Something like this that would support simple fades, transitions, and maybe animation. The kind of stuff you can do fairly easily in a video editor, but with lots of fiddly clicks and zooming in and out of timelines.
I'd like a script that lets me specify when different source media start, when to apply effects, etc. All written as a basic text file.
Anything obvious out there I've missed?
You can set transitions globally in the document header, or on individual scenes. For example, just add
(transition: crossfade 0.2)
Video segments (different source media starts) are also supported. You can do something like:
(segment: 00:02 - 00:04)
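Putting those together, a hypothetical script might look something like this. The front-matter-style header is my guess at what "document header" means here, and the placement of the directives is also a guess; check the examples repo for the exact format.

```markdown
---
transition: crossfade 0.2
---

![](intro.jpg)

Welcome to the demo.

(segment: 00:02 - 00:04)
![](walkthrough.mp4)

This scene uses only two seconds of the source clip.
```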
I can't stand hearing the sound of my own voice, but I produce a lot of tutorial content in Markdown for learning-material guides.
This would allow me to re-use all of the existing material I have, which already includes detailed step-by-step screenshots and text instructions, to make voice-over videos with slides and publish them to YouTube. Amazing!
I created a single video from text using python (https://www.youtube.com/watch?v=7CIakJ8PMZs // https://github.com/sidpalas/devops-directive-hello-world), but this is next level!
I'm excited to try it out.
It seems it would make a lot more sense to just design the language from scratch, rather than try to bend Markdown to do something it was not at all meant for.
For instance, why would you WANT to have an example like this:
Welcome to London
Welcome to Berlin
The perfect is the enemy of the good.
You could easily borrow some common things from Markdown to make things easier, but this seems to try to force following the Markdown syntax as much as possible, even when that syntax makes no sense in context.
It is much better to invent new things for the cases that are completely new, than try to force a square peg into a round hole.
Also, if you don't like !(), you can just use stage directions with brackets. The equivalent script will be:
Readable, but with some intelligence and decoration that is not distracting. See it enough and the syntax will become invisible over time, like punctuation.
> This application is currently in beta version. While in beta, the application is free, and allows anyone to upload assets up to 25 MB. We will announce commercial pricing later, when the full version becomes ready. For now, experiment as much as you like!
Easier bulk upload / upload with curl / python requests is needed IMO.
There's a full example here: https://github.com/videopuppet/examples/tree/master/slides