Previous text-to-video was total shit and topped out at 1–2 seconds. This is really good, and up to a minute per generation. Imagine the storytelling possibilities, or the ads.
That’s actually completely inaccurate, and they’re giving it way too much credit. It cannot consistently portray physics realistically because it isn’t running a physics simulation at all. It’s just 2D images: no consistent world simulation, no 3D modeling, no world rendering.
What it’s actually doing is creating a visual portrayal of the prompt based on the relative scale of similar depictions in its training set, with persistence from frame to frame for up to one minute.
Because it’s based solely on visual relationships of scale, it can’t consistently and realistically depict physical phenomena across multiple generations. Sand, breaking glass, someone drinking or eating, or fluid poured between differently sized containers are all visuals it would struggle to recreate with any physical accuracy.
So while they’re telling the truth that its visual representation of physics is “intuitive” and “implicit,” that’s only because physics is baked into the training data. (Their training dataset also used video rendered in the Unreal game engine, with its built-in physics simulation.)
Physics is innate to real video footage, so the model’s replication of physical phenomena is basically just an artifact of image generation with frame-to-frame consistency across a one-minute video.
The model has no understanding of these phenomena and, strictly speaking, no ability to simulate them.
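The distinction between simulating physics and merely continuing visual patterns can be sketched with a toy example. This is purely illustrative (real video models are diffusion transformers, not frame extrapolators, and every number here is made up): a “visual” model that only repeats the motion it just saw has no concept of acceleration, so its guesses drift further and further from a trajectory governed by an actual equation of motion.

```python
# Toy contrast: an explicit physics step vs. a purely "visual" next-frame
# guess (repeating the last observed motion). Illustrative only.

G = 9.8  # gravitational acceleration, m/s^2

def physics_step(y, v, dt):
    """Advance a falling object using its governing equation (semi-implicit Euler)."""
    v = v - G * dt
    y = y + v * dt
    return y, v

def visual_step(y_prev, y_curr):
    """A 'model' with no physics state: just repeat the last frame-to-frame change."""
    return y_curr + (y_curr - y_prev)

dt = 0.1
# Ground-truth trajectory of a dropped object
ys, v = [100.0], 0.0
for _ in range(20):
    y, v = physics_step(ys[-1], v, dt)
    ys.append(y)

# "Visual" model conditioned on the first two true frames, then rolled out
guess = [ys[0], ys[1]]
for _ in range(19):
    guess.append(visual_step(guess[-2], guess[-1]))

# Constant-velocity guesses diverge from the accelerating truth
print(round(ys[-1], 2), round(guess[-1], 2))  # → 79.42 98.04
```

The extrapolator looks plausible for a frame or two, which is exactly why short clips can fool the eye, but over a longer rollout the missing dynamics become obvious.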