Breakthrough from Nvidia
Just a few months ago, text-to-video AI was widely treated as a joke, with the infamous “Will Smith eating spaghetti” clip as the standard example. Nvidia’s VideoLDM model, developed in collaboration with researchers at Cornell University, is a tool that may make you forget those earlier attempts. In simple terms, the model can generate videos from a text prompt at resolutions of up to 2048 x 1280 pixels, at 24 frames per second, and up to 4.7 seconds in length.
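As a quick back-of-the-envelope check of what those figures imply, the sketch below (plain Python, using only the numbers quoted above) works out roughly how many frames and pixels a maximum-length clip contains.

```python
# Rough arithmetic for a maximum-length VideoLDM clip,
# using the figures quoted above (2048 x 1280, 24 fps, 4.7 s).
width, height = 2048, 1280
fps = 24
duration_s = 4.7

frames = round(fps * duration_s)          # ~113 frames per clip
pixels_per_frame = width * height         # ~2.6 million pixels
total_pixels = frames * pixels_per_frame  # ~296 million pixels per clip

print(f"{frames} frames, {pixels_per_frame:,} px/frame, {total_pixels:,} px total")
```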
Nvidia’s model has 4.1 billion parameters in total, of which only 2.7 billion were trained on video. That may sound like a lot, but it is a modest figure by today’s AI standards. To generate video, Nvidia builds on a trained Latent Diffusion Model (LDM): the model treats time as an additional dimension and learns to predict how each region of an image changes over a given period. It first generates a series of keyframes spanning the sequence, then uses a second LDM to interpolate the frames between those keyframes.
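To make that two-stage structure concrete, here is a minimal, hypothetical sketch of the pipeline described above. This is not Nvidia’s released code: the function names, latent shape, and the linear-interpolation stand-in are all assumptions, used only to illustrate the keyframe-then-interpolation flow.

```python
# Hypothetical sketch of the two-stage pipeline described above:
# a keyframe LDM produces sparse latent frames, and a second
# (interpolation) LDM fills in the frames between them. Both
# diffusion models are replaced by simple stand-ins so the script runs.
import numpy as np

LATENT_SHAPE = (4, 40, 64)  # channels x height x width of one latent frame (illustrative)

def keyframe_ldm(prompt: str, num_keyframes: int) -> np.ndarray:
    """Stand-in for the keyframe LDM: returns `num_keyframes` latent frames."""
    rng = np.random.default_rng(abs(hash(prompt)) % 2**32)
    return rng.standard_normal((num_keyframes, *LATENT_SHAPE))

def interpolation_ldm(frame_a: np.ndarray, frame_b: np.ndarray, steps: int) -> np.ndarray:
    """Stand-in for the interpolation LDM: here just a linear blend in latent space."""
    ts = np.linspace(0.0, 1.0, steps + 2)[1:-1]  # points strictly between the two keyframes
    return np.stack([(1 - t) * frame_a + t * frame_b for t in ts])

def generate_latent_video(prompt: str, num_keyframes: int = 5, inbetween: int = 3) -> np.ndarray:
    keyframes = keyframe_ldm(prompt, num_keyframes)
    clips = []
    for a, b in zip(keyframes[:-1], keyframes[1:]):
        clips.append(a[None])                              # the keyframe itself
        clips.append(interpolation_ldm(a, b, inbetween))    # frames filled in between
    clips.append(keyframes[-1][None])
    return np.concatenate(clips)                            # (total_frames, C, H, W) latents

latents = generate_latent_video("a corgi surfing a wave at sunset")
print(latents.shape)  # a decoder would then turn each latent frame into pixels
```

In the real system, both stages are diffusion models operating in a compressed latent space, and a decoder upscales the result to full resolution; the linear blend here simply marks where that interpolation model would sit.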
Of course, in its current form VideoLDM cannot produce videos convincing enough to fool anyone. Compared with the examples we saw just a month or two ago, however, the pace of progress is remarkable. For now, the text-to-video AI Nvidia has shown is best suited to producing short, GIF-like clips, so we expect it will not be long before the company introduces more advanced technology for generating longer video clips from text. The technology will be presented at the Computer Vision and Pattern Recognition Conference (CVPR), held in Vancouver on June 18-22.