The creative spark is elusive to define and difficult to analyze, yet it is one of the most crucial elements that make humans human. For centuries, philosophers have pondered where creativity originates: perhaps it is a gift from some celestial being, perhaps creative ideas are living organisms akin to viruses that propagate across minds, or perhaps creativity is just a chaotic amalgamation of electrical signals inside our heads.
The inception of human creativity, and perhaps what separated Homo sapiens from their long-lost cousins, is rooted in oral storytelling. It’s not a perfect form of communication. There are real costs, as the storyteller has to spend time and energy repeating their story for each additional listener. It’s also a lossy transmission: after a few retellings, bits and pieces get lost or changed, like a game of telephone. The advent of written language helped solve these problems.
Technology provides leverage to minimize these costs for the creator. In his piece, The AI Unbundling, Ben Thompson details the propagation of ideas and how technology provides leverage. Writing provided leverage over the consumption of information. The printing press provided leverage over the duplication of information. The internet provided leverage over the distribution of information. Digital tools (e.g. Adobe’s Creative Suite) provided leverage over the implementation of information. And now, machine learning is providing leverage over the creation of information.
Minimizing these costs has always resulted in a net benefit to civilization. It’d be hard to conceive of a convincing argument against writing, the printing press, or the internet if you believe that greater access to information is fundamentally a good thing. The total number of creators (good and bad) and the amount of available information both increase. The highly differentiated creators gain more leverage and thus become increasingly valuable, while the undifferentiated disappear.
There are many different mediums used to express creativity, but they’re all stories in some sense—whether the form is text, image, voice, video, or games. The relationship between human and machine for the past century has been for humans to use machines as tools to implement and distribute these stories. That hasn’t changed. But what has changed is the work required to create—it’s dramatically decreased. Human creativity will be tremendously augmented through the nascent ability to generate novel content that’s both zero-cost and hyper-personalized.
Synthetic Text (Text-to-Text)
LLMs like OpenAI’s GPT-3 are capable of generating human-quality synthetic text. GPT-3 estimates the conditional probability of the next token given the ones that came before it, and generates text by repeatedly sampling from that distribution.
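A minimal sketch of that autoregressive loop is below, using the small open GPT-2 model from the Hugging Face transformers library as a stand-in, since GPT-3 itself is only available through OpenAI’s API.

```python
# Minimal sketch of autoregressive text generation, with GPT-2 standing in for GPT-3.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

tokens = tokenizer("Once upon a time", return_tensors="pt").input_ids
for _ in range(50):
    logits = model(tokens).logits[:, -1, :]               # scores for the next token
    probs = torch.softmax(logits, dim=-1)                  # conditional distribution P(next | past)
    next_token = torch.multinomial(probs, num_samples=1)   # sample one token from it
    tokens = torch.cat([tokens, next_token], dim=-1)       # append and repeat

print(tokenizer.decode(tokens[0]))
```

The entire generation process is just this loop: score the next token, sample it, append it, and repeat.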
OpenAI published GPT-3 back in the summer of 2020. Prior work had demonstrated substantial gains on many NLP tasks by pre-training on a large amount of text and then fine-tuning on specific tasks. This approach requires task-specific fine-tuning datasets with thousands or tens of thousands of examples. Humans, however, can perform a new language task from only a few examples.
Inspired by this difference, OpenAI built GPT-3 and showed that scaling up language models significantly improves task-agnostic, few-shot performance, sometimes reaching competitiveness with prior state-of-the-art fine-tuning approaches. They trained GPT-3, an autoregressive language model with 175 billion parameters, and tested its performance in the few-shot setting. Without any fine-tuning, GPT-3 achieved strong performance on many NLP tasks.
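Concretely, few-shot prompting just means placing a handful of examples directly in the input and letting the model continue the pattern. The sketch below assumes the (now legacy) openai Python client and its Completion endpoint.

```python
# A few-shot prompt: the "training data" is a handful of examples embedded
# directly in the input text, with no gradient updates to the model's weights.
prompt = """Translate English to French:

sea otter => loutre de mer
peppermint => menthe poivrée
cheese =>"""

# Sent to GPT-3 through OpenAI's completion API, the model simply continues
# the pattern. (Sketch only; requires an API key and the legacy client.)
import openai
completion = openai.Completion.create(model="text-davinci-002", prompt=prompt, max_tokens=5)
print(completion.choices[0].text)  # expected continuation: " fromage"
```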
Synthetic text is comparatively more advanced than other types of synthetic content due to the massive libraries of text information readily accessible to train on and the lower compute costs of text data. Synthetic text can fuel long-form stories, accelerate copywriting for sales and marketing, and serve as a copilot for a multitude of language-based tasks.
LLMs can create long-form stories, but they’re not yet capable of generating ones that are sufficiently cohesive and powerful. They’ll only replace the mediocre writers, as technology provides leverage, and leverage makes the extremes more extreme. LLMs today can serve as a valuable tool for brainstorming and reducing writer’s block, as well as for automating some short-form sales and marketing content. That type of language is usually contrived, corporate jargon with some nice adjectives thrown in for good measure. Synthetic text generators can also serve as copilots for both horizontal and vertical use cases: the horizontal, automatable tasks include summarization, translation, and transcription, while the verticalized ones address specific workflows and industry styles.
The next, as-yet-unreleased version, GPT-4, will likely be significantly more effective for business applications. This can be accomplished through more optimized use of compute during training (the Chinchilla scaling laws), increasing the total amount of compute, or using reinforcement learning from human feedback.
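To make the Chinchilla point concrete, the commonly cited rule of thumb is roughly 20 training tokens per parameter, with training compute of roughly 6 × parameters × tokens. The back-of-envelope calculation below illustrates that heuristic; it is not OpenAI’s actual training recipe.

```python
# Back-of-envelope Chinchilla-style sizing: compute-optimal token budget is
# roughly 20x the parameter count, and training FLOPs are roughly 6 * N * D.
def chinchilla_estimate(params: float) -> dict:
    tokens = 20 * params            # compute-optimal tokens (rule of thumb)
    flops = 6 * params * tokens     # approximate training compute
    return {"params": params, "tokens": tokens, "train_flops": flops}

print(chinchilla_estimate(175e9))   # a GPT-3-sized model: ~3.5 trillion tokens
# GPT-3 was actually trained on roughly 300 billion tokens, far below this optimum,
# which is why better-allocated compute alone can yield a stronger model.
```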
Synthetic Image (Text-to-Image)
Text-to-image models enable the creation of synthetic images. They encode text with transformer models and generate images with diffusion models. Just as transformers have displaced LSTM (Long Short-Term Memory) models, diffusion models have displaced GANs (Generative Adversarial Networks) for this application.
A diffusion model is a generative model that learns to reverse an image-corruption process. This corruption process iteratively adds small amounts of random Gaussian noise, progressively destroying the image’s information until it becomes pure noise. The diffusion model is then trained to reverse this process, learning to iteratively denoise and reconstruct the image. Once training is complete, the diffusion model can be used to generate an image from pure noise.
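A schematic of that training objective, in the style of DDPM, is sketched below. The noise schedule alpha_bar and the model(noisy_image, t) signature are stand-ins rather than any particular library’s API.

```python
# Schematic DDPM-style training step (illustrative sketch, not a full implementation).
import torch

def corrupt(x0, t, alpha_bar):
    """Forward process: blend the clean image x0 with Gaussian noise at step t.
    alpha_bar is a precomputed noise schedule (cumulative product of per-step alphas)."""
    noise = torch.randn_like(x0)
    xt = alpha_bar[t].sqrt() * x0 + (1 - alpha_bar[t]).sqrt() * noise
    return xt, noise

def training_step(model, x0, alpha_bar):
    """The network is trained to predict the noise that was added at a random step."""
    t = torch.randint(0, len(alpha_bar), (1,))
    xt, noise = corrupt(x0, t, alpha_bar)
    predicted_noise = model(xt, t)   # stand-in signature: noisy image + timestep in, predicted noise out
    return torch.nn.functional.mse_loss(predicted_noise, noise)

# Sampling runs the process in reverse: start from pure noise and repeatedly
# subtract a little of the model's predicted noise until an image emerges.
```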
There are four prominent text-to-image models: OpenAI’s DALL·E 2, Google Brain’s Imagen, Midjourney, and Stable Diffusion.
OpenAI introduced DALL·E 2 at the start of 2022. DALL·E 2 is a text-to-image model that can create original, realistic images and art from a text description, combining concepts, attributes, and styles. Specifically, DALL·E 2 can expand images beyond the original canvas, can make realistic edits to existing images, and can create variations of an image inspired by the original. All of these capabilities take into account the image’s unique visual elements including shadows, reflections, and textures.
DALL·E 2 is based on CLIP (Contrastive Language–Image Pre-training) and diffusion models. CLIP consists of a text encoder and an image encoder trained on image-text pairs to learn the correct matchings between them. DALL·E 2 generates images by first producing a CLIP image embedding from the caption (via a prior model) and then using a diffusion decoder to generate the image itself from that embedding. This process captures the salient characteristics of the image that are meaningful to people and allows for language-guided manipulation of images.
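A schematic of that two-stage pipeline is below. The three components are dummy stand-ins for the CLIP text encoder, prior, and diffusion decoder described in the DALL·E 2 paper, not a real, public API.

```python
# Schematic of DALL·E 2's two-stage generation (dummy stand-in components).
import torch

def clip_text_encoder(caption: str) -> torch.Tensor:
    return torch.randn(512)              # stand-in: a CLIP text embedding

def prior(text_embedding: torch.Tensor) -> torch.Tensor:
    return torch.randn(512)              # stand-in: predicts a plausible CLIP image embedding

def diffusion_decoder(image_embedding: torch.Tensor) -> torch.Tensor:
    return torch.randn(3, 256, 256)      # stand-in: renders pixels conditioned on that embedding

def generate_image(caption: str) -> torch.Tensor:
    text_embedding = clip_text_encoder(caption)   # caption -> shared text-image embedding space
    image_embedding = prior(text_embedding)       # text embedding -> image embedding
    return diffusion_decoder(image_embedding)     # image embedding -> pixels

image = generate_image("a corgi playing a trumpet")
```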
Google Brain published their Imagen model in mid-2022. It is similar to DALL·E 2 in several ways, such as its use of transformers to understand text and diffusion models for high-fidelity image generation. Text-to-image models are usually trained on large datasets of text-image pairs scraped from the web. Imagen’s key discovery is that LLMs pretrained on text alone are highly effective at encoding text for image generation: increasing the size of the language model improves performance significantly more than increasing the size of the diffusion model.
As the friction of generating art decreases, everyone becomes an artist and can explore different themes and styles without formal education. Artists can gain more control over these text-to-image models by using multimodal inputs that generate images from a combination of text and sketches. Advances like these will reduce the effort required for prompt engineering.
Synthetic image applications include building design assets and advertising content. Digital tools are hungry for design assets to fuel the creative enterprise, and synthetic content allows for infinite asset libraries. Stock images can be generated for little cost. As these text-to-image models become more prolific and the cost to generate images drops, images will make up a growing share of total data, as there’s no longer a reason for businesses not to include images in their copy. The same will happen for voice, video, and games.
Synthetic Voice (Text-to-Voice)
Voice was the original expression of language and predates text by tens of thousands of years. Writing is better for forging thoughts and expressing them with clarity, but voice is unique in its power to persuade and command.
Synthetic voice has been around for decades and has a storied history dating back to Bell Labs, but many commercial applications require human-parity quality to be viable. Cloning a human voice is challenging because each voice is highly distinctive and audio is unstructured. There are many different ways to speak a sentence, as well as non-word sounds to emulate. Mastering this level of phoneme synthesis is hard, but GANs and other techniques are proving effective at rapidly improving voice quality.
Quality of voice involves many different components such as tone, tempo, and pitch. It is the most important factor in assessing a commercially-viable synthetic voice, but not the only one. Scalability, or time to train, is necessary for synthetic voice to be used to generate voices dynamically for personalization. Companies can not only build a voice per brand but a voice per customer. The ability for companies to build high-fidelity voices programmatically with a single line of code enables hyper-personalization.
Prominent voice applications include general virtual assistants, conversational AI for customer support, and voice editing for video. Real-time translation of speech (without captions) also requires a computer to generate a cloned synthetic voice.
Synthetic Video (Text-to-Video)
The natural evolution of text-to-image models is adding motion and creating the ability to generate synthetic video. The earliest synthetic video application is deepfakes, which have not yet crossed the uncanny valley. Deepfakes came first not because they address the greatest market need, but because generating human facial expressions is technically feasible while more general synthetic video generation is substantially more challenging. Still, there are early inklings of this emerging technology.
Using text as an interface to edit video is being productized today and is an example of The Humanization of Computer Interfaces. There are a few methods being explored to create short synthetic video clips, such as animation techniques and frame morphing, which give the appearance of motion.
Meta released their text-to-video model, Make-A-Video, in the fall of 2022, building on the advances in text-to-image technology. Their system uses labeled datasets of images with descriptions to learn what the world looks like and how it’s described. It then uses unlabeled videos to learn how the world moves.
These technologies are in their early stages and can only produce video clips a few seconds long. The results resemble GIFs more than video, but it’s the rate of progress that’s most notable.
Video is more immersive and data-rich than text and image and, as a result, has been more expensive to create. This means the ability to generate synthetic video will have a comparatively greater impact on the world by democratizing a resource-heavy production process. Potential challenges in creating text-to-video models include the high computational cost of training on video and the lack of text-scene pair datasets needed to semantically understand scenes.
Across consumer platforms, there’s been a clear directional trend toward more immersive forms of media, from text (Facebook) to image (Instagram) to video (TikTok). The continuation of this shift is from video to VR. There’s another directional trend of continually increasing the supply of available content, starting from friends and family in the newsfeed (Facebook), to friends and creators (Instagram), to mostly creators (TikTok). The supply of video content from all creators globally is quasi-infinite UGC, and the next shift will be infinite UGC, or AI-generated video. Once there’s an infinite supply of content, highly optimized algorithms trained on all human-usage data will provide unseen levels of personalization and harvest more and more consumer attention.
Synthetic Gaming (Text-to-Metaverse)
Diffusion models have enabled game creators to progress from using ML merely for advanced graphics tooling to creating high-resolution assets. This reduces production costs while accelerating timelines. More importantly, it gives creators a newfound ability to build hyper-personalized games. Freemium games that sell in-game character cosmetics for revenue can tailor those cosmetics to a particular user. Synthetic text and synthetic voice models can also allow for the creation of more compelling, interactive in-game characters.
User world-building has been, perhaps, the largest innovation in gaming over the past decade. Roblox and Minecraft allow user-generated worlds to be built, but they’re graphically simple. Advances in ML can create high-fidelity worlds with real-world physics simulation as well as advanced rendering and lighting. Nvidia’s Omniverse is a prime example: it has a powerful synthetic data-generation engine that produces simulated physical data, supports full ray tracing in VR in real time, and consolidates all the visual effects for a production into a single pipeline that allows for both real-time collaboration and distributed rendering. The final frontier is text-to-metaverse, the power to shape artificial worlds through a natural language interface.
Language Engineering
I wrote in The Humanization of Computer Interfaces about how natural language is becoming an increasingly dominant form of computer interface due to rapid advances in NLP. An early illustration of this is text-to-image models, where users learned to engineer their language because small changes to the text input can result in vast differences in output.
In Text is the Universal Interface, Roon discusses how a significant amount of work is required to get these models to conjure what you want. He explains that this is a result of how self-supervised language models act like a collection of much smaller models, where every input leads to a different model subpath and a different conditional probability distribution. Text engineering is necessary today, but it would be parochial to believe that this will be an enduring problem and that language interfaces won’t become significantly more aligned with their users.
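One way to see this sensitivity is to compare the next-token distributions the same model produces for two nearly identical prompts; the sketch below uses the small open GPT-2 model as a stand-in.

```python
# Two prompts that differ in a few words condition the model on very different
# distributions for the next token (GPT-2 as a small, open stand-in).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def top_next_tokens(prompt, k=5):
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    probs = torch.softmax(model(ids).logits[0, -1], dim=-1)   # next-token distribution
    top = torch.topk(probs, k)
    return [(tokenizer.decode(int(i)), round(p.item(), 3)) for p, i in zip(top.values, top.indices)]

print(top_next_tokens("The scientific paper concluded that"))
print(top_next_tokens("The tabloid headline screamed that"))
```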
Decentralization
Text-to-image models have various levels of decentralization, which suggests a similar range may emerge for the larger space. Google’s Imagen is not accessible to the public. OpenAI’s DALL·E 2 is accessible but through a controlled API. Midjourney is both accessible and free to all users through their Discord bot. Stable Diffusion is open source and can be run locally, meaning there are no artificial limitations on what the model can generate.
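As an illustration of what running locally means in practice, the sketch below uses Hugging Face’s diffusers library to load publicly released Stable Diffusion weights; the specific checkpoint name and the availability of a CUDA GPU are assumptions.

```python
# Running Stable Diffusion locally with Hugging Face's diffusers library
# (sketch; assumes a CUDA GPU and that the model weights have been downloaded).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

image = pipe("an astronaut riding a horse in a watercolor style").images[0]
image.save("astronaut.png")
```

Because the weights run on the user’s own hardware, there is no API gatekeeper deciding which prompts are allowed.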
There’s been an increasingly decentralized media paradigm as a result of the internet providing leverage over distribution and digital tools providing leverage over implementation. The shift in the supply of content from quasi-infinite UGC to infinite AI-generated content creates further decentralization as the total amount of data and creators both increase.
The Labor Revelation
Determining where progress will be made in AI is unintuitive. It seemed reasonable to assume that the first area where AI could substitute for human labor would be automating unskilled, physical labor. This turns out to be an amazingly difficult thing to do in the unstructured and changing environment of the real world. What happened instead is the automation of unskilled mental labor, where AI can help summarize and transcribe, as well as the augmentation of skilled mental labor, where AI can help with programming and the creation of new content.
Part of this surprise probably stems from the biological perspective: unskilled physical labor is common to most species, while any level of mental intelligence is rare on Earth and skilled mental labor is singular to humans. We also understand physical labor well, i.e. the physics of a human body, while remaining mystified by how our brains work. However, the evolutionary perspective may be the more sensible one to take. The genomic jump to navigating unstructured environments with high visual-spatial awareness likely took significantly more time and evolutionary work than cognitive language tasks did.
Knowledge Creation
David Deutsch explained how humans create new knowledge through creative conjecture and criticism. This parallels nature’s evolutionary process of random mutation and selection, which is how nature creates knowledge and embeds it in the genome.
An ML analog to this would be GANs, which contain a generator model that creates new examples and a discriminator model that classifies those examples as real or fake. These models aren’t creating new knowledge, though. They don’t understand what they’re doing or how the world works in the way humans do, which is why they require so much data. But they can augment humans in creative conjecture.
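A minimal sketch of that adversarial setup is below, in PyTorch with toy one-dimensional data rather than images.

```python
# Minimal GAN sketch: a generator proposes samples ("conjecture") and a
# discriminator judges them against real data ("criticism"). Toy 1-D data.
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))   # noise -> sample
D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))   # sample -> real/fake logit
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(1000):
    real = torch.randn(64, 1) * 0.5 + 2.0            # "real" data drawn from N(2, 0.5)
    fake = G(torch.randn(64, 8))                      # generator's proposals

    # Discriminator: label real samples 1 and generated samples 0
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator: try to make the discriminator call its samples real
    g_loss = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
```

The two networks improve only by opposing each other, which is the mechanical echo of conjecture and criticism the passage describes.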
There does seem to be something strongly universal about this process—conjecture and criticism or random mutation and selection. These binary opposing forces seem to be the source of knowledge creation. And they may continue to inspire future machine learning architectures that are increasingly capable.