Powerful technology in making art

Type “teddy bears working on new AI research on the moon in the 1980s” into any of the recently released text-to-image artificial intelligence generators, and within a few seconds the software will produce an eerily fitting image.

This image was generated from the text prompt “Teddy bears working on new AI research on the moon in the 1980s.”

This latest trend in synthetic media, seemingly limited only by your imagination, has excited and inspired many people and frightened others. Google, OpenAI, and the AI company Stability AI have each built text-to-image generators powerful enough that some observers wonder whether people will be able to trust photographs in the future.

As a computer scientist specializing in image forensics, I’ve been thinking a lot about this technology: what it is capable of, how each of the tools has been made available to the public, and what lessons can be drawn as the technology continues to advance.

An adversarial strategy

Although their digital precursor dates back to 1997, the first synthetic images appeared on the scene just five years ago. In their original form, generative adversarial networks (GANs) were the most widely used technique for synthesizing images of people, pets, landscapes, and anything else.

A GAN consists of two main parts: the generator and the discriminator. Each is a large neural network, a set of interconnected processors roughly analogous to neurons.

Asked to synthesize an image of a person's face, for example, the generator starts with a random assortment of pixels and passes this image to the discriminator, which checks whether it can distinguish the generated image from real faces. If it can, the discriminator sends feedback to the generator, which modifies some pixels and tries again. The two systems are pitted against each other in an adversarial loop. Eventually, the discriminator can no longer tell generated images from real ones.
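The adversarial loop can be sketched in a few lines of Python. This is a toy under stated assumptions, not how any production system is built: the "images" are single numbers, the generator is a hypothetical two-parameter function G(z) = a * z + b, the discriminator is a one-input logistic classifier, and feedback takes the form of gradient steps. All names (`train_toy_gan`, `real_mean`, and so on) are made up for illustration.

```python
import math
import random


def sigmoid(v):
    """Numerically stable logistic function."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)


def train_toy_gan(steps=200, batch=64, lr=0.05,
                  real_mean=4.0, real_std=0.5, seed=0):
    """Train a 1-D toy GAN; returns generator (a, b) and discriminator (w, c)."""
    rng = random.Random(seed)
    a, b = 1.0, 0.0  # generator: G(z) = a * z + b
    w, c = 0.0, 0.0  # discriminator: D(x) = sigmoid(w * x + c)

    for _ in range(steps):
        real = [rng.gauss(real_mean, real_std) for _ in range(batch)]
        noise = [rng.gauss(0.0, 1.0) for _ in range(batch)]
        fake = [a * z + b for z in noise]

        # Discriminator step: gradient ascent on
        # log D(real) + log(1 - D(fake)).
        grad_w = grad_c = 0.0
        for x in real:
            p = sigmoid(w * x + c)
            grad_w += (1.0 - p) * x
            grad_c += 1.0 - p
        for x in fake:
            p = sigmoid(w * x + c)
            grad_w -= p * x
            grad_c -= p
        w += lr * grad_w / batch
        c += lr * grad_c / batch

        # Generator step: use the discriminator's feedback to make
        # fakes look more real (gradient ascent on log D(fake)).
        grad_a = grad_b = 0.0
        for z in noise:
            p = sigmoid(w * (a * z + b) + c)
            grad_a += (1.0 - p) * w * z
            grad_b += (1.0 - p) * w
        a += lr * grad_a / batch
        b += lr * grad_b / batch

    return (a, b), (w, c)


(gen_a, gen_b), (disc_w, disc_c) = train_toy_gan(steps=50)
```

In this toy, the generator's offset `b` starts at 0 and is pushed toward the real data as long as the discriminator can still separate the two. Simple GANs like this one often oscillate rather than settle cleanly, which is one reason real systems need far more careful training.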

From text to image

Just as people were grappling with the consequences of GAN-generated deepfakes, including videos that show people doing or saying things they never did, a new player appeared on the scene: text-to-image deepfakes.

In this latest incarnation, a model is trained on a massive set of images, each captioned with a short text description. The model progressively corrupts each image until only visual noise remains, and a neural network is then trained to reverse this corruption. By repeating this process hundreds of millions of times, the model learns how to turn pure noise into a coherent image from any caption.
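The corruption step can be made concrete with a short sketch. It assumes one common formulation, a linear noise schedule with Gaussian noise; the article does not say which schedule any particular product uses. The function names, the schedule values, and the four-pixel "image" are all hypothetical. The neural network and the caption conditioning are left out: the sketch only shows how a training example is built and what the network would be asked to predict.

```python
import math
import random


def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Fraction of the original signal left after each noising step."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(x0, t, alpha_bars, rng):
    """Jump straight to step t of the corruption.

    Returns the noisy image and the noise that was mixed in.
    """
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    keep = math.sqrt(alpha_bars[t])
    spread = math.sqrt(1.0 - alpha_bars[t])
    x_t = [keep * p + spread * e for p, e in zip(x0, eps)]
    return x_t, eps


def estimate_x0(x_t, eps_hat, t, alpha_bars):
    """Undo the corruption given an estimate of the noise.

    A trained network would supply eps_hat; here the caller does.
    """
    keep = math.sqrt(alpha_bars[t])
    spread = math.sqrt(1.0 - alpha_bars[t])
    return [(p - spread * e) / keep for p, e in zip(x_t, eps_hat)]


def training_loss(eps_hat, eps):
    """Mean squared error between predicted and true noise."""
    return sum((a - b) ** 2 for a, b in zip(eps_hat, eps)) / len(eps)


rng = random.Random(0)
alpha_bars = make_alpha_bars()
image = [0.1, 0.5, 0.9, 0.3]  # a hypothetical 4-pixel "image"
last = len(alpha_bars) - 1
noisy, eps = add_noise(image, last, alpha_bars, rng)
restored = estimate_x0(noisy, eps, last, alpha_bars)
```

At the last step almost none of the original signal survives, so `noisy` is essentially pure noise. Feeding the true noise back into `estimate_x0` recovers the image, which is why training the network to predict that noise is enough to learn the reverse direction.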

This photolike image was generated using Stable Diffusion with the prompt “cat wearing VR goggles.”

While a GAN can only create images of a single general category, such as faces, text-to-image synthesis engines are more powerful. They can create nearly any image, including images that show people and objects interacting in specific and complex ways, such as “the president of the United States burning confidential papers while sitting around a campfire on the beach at sunset.”

DALL-E, OpenAI’s text-to-image generator, took the internet by storm when it was unveiled on January 5, 2021. On July 20, 2022, a beta version of the tool was made available to 1 million users. People around the world have found seemingly endless ways to prompt DALL-E, producing fascinating, weird, and imaginative images.

But a wide range of experts, from computer scientists to law professors and government officials, have considered how the technology could be misused. Deepfakes have already been used to create nonconsensual pornography, commit small- and large-scale fraud, and spread disinformation. These even more powerful image generators could add fuel to the fire.
