Welcome to Wonderland: How synthetic media is going to liberate creativity
AI techniques are enabling a new era of machine-generated content. It will change the cost-structure of content production, giving people and businesses access to production quality that would previously have been out of reach. And it will change the nature of content itself.
This is the third in a series of long-reads, aimed at a general business audience, about the role of ‘AI’ in creative fields from art, design and music to journalism and comedy: how humans and machines can work together to make creativity easier or quicker, and how industry, society and culture play into that.
For those who don’t know me, I spent 15 years of my career making digital content for brands as head of digital at MTV, and in big creative agencies BBH and Iris. I am now a consultant, advising AI-powered businesses — and businesses who want to use the power of AI.
In 1991, Moby was one of the first people to make a hit record in a bedroom. His track, ‘Go’, used the chords from the soundtrack to Twin Peaks. He made the record in an afternoon in his friend’s apartment, while his friend was at work. It went on to sell more than two million copies and launched Moby’s multi-decade career.
It was an era when a stack of new tech freed musicians from the expense of formal production. Low-priced synthesisers, drum machines, samplers and sequencers meant music could be made outside of the world of big recording studios and major record labels. It was a punk, do-it-yourself ethos.
Suddenly anyone could have a top ten hit from their bedroom.
Similarly, we are now in an era where AI-powered technology promises to put a film or photographic studio in every home (or at least every small business).
Synthesia takes a script and outputs it as video, spoken by an almost completely realistic ‘presenter’. For a small fee, companies can create their own bespoke presenters — your CEO perhaps.
Rosebud.ai puts AI-generated models into scenarios or even specific outfits. Need to change that model for a different one? Make them smile. Maybe even a different race, for a different market? Just push a button and it’s done.
Descript takes spoken word and turns it into text so that users can edit the text to edit the audio. It also learns your voice so the system can synthesise new audio for you. Write some new text and it comes out in your ‘voice’.
Phrasee allows businesses to create short bits of text — typically email headers or social posts — at the touch of a button. No Don Draper-style copywriter required.
They say that the best startups super-power their users. Airbnb allowed anyone with a spare bedroom to compete with Marriott. Uber allowed anyone with a smartphone to have a chauffeur.
Similarly, all these new businesses will allow even small businesses to compete with the BBC, major fashion chains or big agencies.
Come down the rabbit hole to meet the founders and innovators behind these businesses…
The Innovators I — Synthesia
Synthesia is a service that gives its users the ability to have a computer generated talking head speak any script that it is given, pretty much as fluently as a real human — and in 34 different languages.
I’m on a video call with Victor Riparbelli, co-founder of Synthesia. He’s in his hometown of Copenhagen, visiting family, taking advantage of a window in the Covid pandemic when travel opened up again.
Riparbelli is only 28 and the CEO of a start-up that has already raised $4m. He is confident and knows his business. There are few ‘maybes’ or ‘perhapses’ in his speech.
He founded Synthesia aged just 25, along with Matthias Niessner, whom he met while studying at Stanford University back in 2015. Niessner is a professor of computer science, specialising in computer graphics and computer vision.
“At the time he was the ‘deepfakes guy’. A lot of the conversation was around ‘is this the end of truth and democracy’, and so on. But what I saw was the future of content creation,” Riparbelli says.
“Thing is, it’s a video-first world, but there’s a massive disconnect in the cost in time and money that it takes to produce video. It’s such a complex workflow.”
And he’s not wrong.
“It’s a video-first world, but there’s a massive disconnect in the cost in time and money that it takes to produce video. It’s such a complex workflow.” — Victor Riparbelli
Say you wanted to film your chief executive, in their office, announcing some new changes in the company’s approach. Something really simple, but an agency wouldn’t charge less than £4–5,000 to produce it. To be sure, the final output will be nicely edited, lit and recorded. But still, not something you’re going to want to do every day.
And the costs spiral from there.
- A celebrity being interviewed driving in a car, shot with GoPro cameras — £15,000.
- One-day UK shoot of some talent at home, minimal crew. Cost (not including talent) — £30,000.
- Shooting some sponsorship ‘bumpers’ (the 5–15 second bit in and out of a sponsored TV show), say in a two-day UK shoot, full crew, with a cast of five — £135,000.
Video production is EXPENSIVE.
Synthesia puts that kind of video content within the reach of any business.
“We want to turn all text into video content,” says Riparbelli. “Every PowerPoint, every bit of training content, internal comms can now be video content.”
And the ambition is to go way beyond that kind of corporate content. “Today we can make talking heads at a 1,000th of the price. In 10 years’ time, creators will be able to make a Hollywood film on a laptop.”
Their technology works by combining AI-powered ‘generative networks’ along with more traditional digital visual effects.
“We map between the mouth shapes that people make when they’re talking and the ‘phonemes’ that they say.” (Phonemes are the discrete units of sound in any language. English has 44 phonemes, Mandarin has up to 64.)
By mapping the connection between the shapes of people’s mouths and the sounds they make, Synthesia are able to ‘reverse engineer’ mouth shapes back from word sounds.
“And then we add some visual effects on the top to smooth it all out”, he adds.
So, when you put text into their system, it converts the text to speech (using a similar system to, say, Siri or Alexa) and from there it is able to manipulate a pre-captured talking head and make the mouth move in a natural way.
In some ways it’s like 21st century puppeteering.
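The mapping Riparbelli describes can be sketched in miniature. The phoneme symbols and ‘viseme’ (mouth-shape) labels below are invented stand-ins for illustration; Synthesia’s real system is a trained generative network, not a lookup table:

```python
# Toy sketch: map a phoneme sequence to the mouth shapes ("visemes") a
# renderer would animate. The phoneme symbols and viseme labels here are
# illustrative, not Synthesia's actual model.

PHONEME_TO_VISEME = {
    "AA": "open",       # as in "father"
    "B":  "closed",     # lips pressed together
    "P":  "closed",
    "M":  "closed",
    "F":  "teeth-lip",  # lower lip against upper teeth
    "V":  "teeth-lip",
    "OW": "rounded",    # as in "go"
    "S":  "narrow",
}

def phonemes_to_visemes(phonemes):
    """Return the mouth-shape sequence for a phoneme sequence,
    collapsing consecutive repeats (the mouth doesn't re-form)."""
    visemes = []
    for p in phonemes:
        v = PHONEME_TO_VISEME.get(p, "neutral")
        if not visemes or visemes[-1] != v:
            visemes.append(v)
    return visemes

# "Bob" ~ B AA B
print(phonemes_to_visemes(["B", "AA", "B"]))  # ['closed', 'open', 'closed']
```

Reverse the direction of that mapping, as Synthesia does, and a stream of word sounds tells you which mouth shapes the on-screen head needs to make.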
Yes, there is some effort required in initially capturing the voice and the head in 3D, but you only have to do it once and from there on, videos take 10–15 minutes to output. And now your inputs can just be text.
“Today we can make talking heads at a 1,000th of the price. In 10 years’ time, creators will be able to make a Hollywood film on a laptop.” — Victor Riparbelli
You can get your synthetic spokesperson to speak almost anything — automated weather reports, computer-generated news reports, sales pitches.
There’s a whole part of YouTube which consists of computer voices speaking posts from Reddit. Why not have a talking head speak them instead?
Or allow YouTubers themselves to go global in multiple languages?
Or what about that minority sport which doesn’t get coverage on the main TV channels but already has radio commentary? Maybe that could now have its own video channel?
“There are 5–10 video startups in a similar space to us. But it’s taken us two years with two professors and 10 PhDs in the building to get where we are.”
“I’m always thinking, what is the one thing that will allow me to eat my competitors,” says Riparbelli.
The Innovators II — Rosebud AI
Rosebud has a system that allows its users to seamlessly manipulate a model in an image without having to reshoot or even open Photoshop.
Need that same model, but smiling or blond? Rosebud. Want to use a stock photo but make it bespoke to your website? Rosebud. Need the same shot for 30 different markets but don’t have money for lots of different models? Rosebud.
The name ‘Rosebud’ doesn’t come from the film Citizen Kane but from a famous cheat code in the life-simulation video game The Sims. It’s a symbolic way of signalling that something is being created from nothing. Their software gives its users access to capabilities they wouldn’t previously have been able to afford.
And, according to a recent article in Vogue, the fashion world, notoriously conservative in its use of tech, is starting to wake up to the possibilities.
As its founder, Lisha Li, says, “our vision is to make visuals easy for storytelling”.
Li is one of those perpetually restless high achievers.
We speak by Zoom — it’s early Sunday morning in San Francisco where she lives.
Li came to Canada aged 6. Her parents emigrated from Chengdu in China (“the part of China with the pandas and spicy food”, as she puts it).
She got her maths degree with “highest distinction” from the University of Toronto, then a PhD in machine learning from Berkeley, and then moved on to a career in the world of venture capital as an AI expert.
“The VC world was such a black box to me, I was fascinated to learn about it. It’s a very different paced world to academia. I love the cadence of it.”
Two years ago, she took the leap to starting an actual business. “Investing is fun, but I could see GANs [the AI networks that can generate content from scratch] getting compelling. There was the opportunity to build an actual product.”
“We want to make it so easy to create — this is storytelling amplified. That’s what gets me up in the morning.” — Lisha Li
“Our system finds the face in an image, identifies the background, the foreground and so on, and then suggests different faces that work with the underlying photo.”
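The pipeline Li describes can be caricatured as detect, segment, then rank candidate faces by how well they fit the photo. Every function and the scoring heuristic below are hypothetical stand-ins for what is really a set of trained vision models:

```python
# Illustrative detect-then-rank pipeline. All data structures, functions
# and the fit heuristic are invented for this sketch, not Rosebud's system.

def find_face(image):
    """Stand-in face detector: return the detected face region
    (here just its lighting and camera-angle attributes)."""
    return image["face_region"]

def candidate_score(region, candidate):
    """Toy fit heuristic: prefer candidates whose lighting and
    camera angle match the photo's face region."""
    score = 0
    if candidate["lighting"] == region["lighting"]:
        score += 1
    if candidate["angle"] == region["angle"]:
        score += 1
    return score

def suggest_faces(image, candidates, top_k=2):
    """Rank candidate replacement faces by fit and return the best few."""
    region = find_face(image)
    ranked = sorted(candidates, key=lambda c: candidate_score(region, c),
                    reverse=True)
    return [c["name"] for c in ranked[:top_k]]

photo = {"face_region": {"lighting": "soft", "angle": "front"}}
models = [
    {"name": "model_a", "lighting": "soft",  "angle": "front"},
    {"name": "model_b", "lighting": "harsh", "angle": "front"},
    {"name": "model_c", "lighting": "harsh", "angle": "side"},
]
print(suggest_faces(photo, models))  # ['model_a', 'model_b']
```

The point of the real system is that the generative step then blends the chosen face into the underlying photo, which is the part that takes the trained networks.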
Even now Li isn’t 100% sure of the direction Rosebud will go.
“We have quite a scrappy user interface at the moment. We wanted to learn fast, get something up to quickly validate our thinking. That’s more important than focusing on the sophistication of the interface.”
Li is busy. Busy making the system more powerful, so it can also adapt the poses of the models. Busy building a partnership with the stock-photo provider Shutterstock. Busy trialling the creation of virtual beings with backstories ‘dreamt up’ by the AI text generator GPT-3.
“We’re taking a lot to the fashion world, but there is also potential interest in using the service to create personalised avatars for communication.”
“We can’t anticipate what people might want to do with it. But we want to make it so easy to create — this is storytelling amplified. That’s what gets me up in the morning.”
The Innovators III — Phrasee
So far, we’ve been talking about visual content, but Phrasee is all about text.
Their CEO, Parry Malm, contends that the internet is still all about written language.
“Every time you buy something you go through a funnel driven by the written word. You read what people actually say, the product information, the reviews.
“Language is the currency of ecommerce and Phrasee will be the Royal Mint,” he says, lightly chuckling at his slightly overblown analogy.
Phrasee’s system automatically generates short pieces of text, ideal for things like email headers or social posts.
“Imagine you want to send an email to your customers telling them that ‘shoes are now at half-off’.
“Every time you buy something, you go through a funnel driven by the written word. You read what people actually say, the product information, the reviews. Language is the currency of ecommerce.” — Parry Malm, Phrasee
“You brief a copywriter, a Don Draper, who thinks about it, drinks a little whisky and eventually sends back some copy. Then it gets tweaked, a co-worker puts in their 2-cents-worth. Maybe the janitor has an opinion, the CEO gets involved, and so on, and so on. The version that gets sent isn’t necessarily the best. It’s the one that the most persuasive person lobbied for.
“If you’re doing that 500 times a year, that’s a huge cost. And at the heart of it is the presumption that the best is what a human thinks is the best.”
Malm is a larger-than-life Canadian from Vancouver who was able to settle in London 15 years ago thanks to having a British grandmother. “I came here with £800 in my pocket. Basically, I’m just another filthy job-stealing immigrant.” Since his arrival, he has built a sizeable business in Phrasee, and even given evidence to Parliament on the future of AI in the UK.
He is smart, dynamic and voluble. A born hustler, with catchy phrases falling out of his mouth with virtually every sentence. It’s ironic that someone with such creative command of the language is behind a system to generate it automatically.
Malm says he met thousands of marketers in his career in direct marketing. “They all wanted to know what makes a good subject line. The best anyone could suggest was to try out a bunch of stuff and hope that some of it might work. I started to ask: what if there was a product to solve this problem?”
The answer was Phrasee, which works by pitting two AI-driven systems against each other.
One creates thousands of variants of wordings (“thousands of Don Drapers,” as Malm puts it) and the other, the ‘ranker’, is a deep-learning platform that makes sure whatever gets output is on-brand and effective. The ranker is trained from scratch for every client.
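The generate-then-rank pattern Malm describes can be sketched in a few lines. The templates and the scoring rule below are invented for illustration; Phrasee’s real ranker is a deep-learning model trained per client, not a hand-written heuristic:

```python
# Sketch of the generate-then-rank pattern: one component produces many
# wording variants, a second scores them and keeps the best. Templates
# and scoring rule are invented for illustration.
import itertools

OPENERS = ["Hurry:", "Just in:", "Don't miss:"]
OFFERS = ["shoes now half price", "50% off all shoes"]
CLOSERS = ["", " - today only", " - ends midnight"]

def generate_variants():
    """The 'thousands of Don Drapers': every combination of the templates."""
    return [f"{o} {m}{c}"
            for o, m, c in itertools.product(OPENERS, OFFERS, CLOSERS)]

def rank(variant):
    """Toy stand-in for the ranker: reward urgency words, penalise length."""
    urgency = sum(w in variant.lower() for w in ["hurry", "today", "midnight"])
    return urgency * 10 - len(variant)

def best_subject_line():
    return max(generate_variants(), key=rank)

print(best_subject_line())  # "Hurry: 50% off all shoes"
```

Swap the toy `rank` function for a model trained on a brand’s own open and click data and you have the shape of the product: the generator proposes, the ranker disposes.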
And it’s working. eBay use Phrasee in the UK, the US and Germany. In the US, eBay has seen a 15.8% increase in email open rates and a 31.2% uplift in users clicking through to a link in the email.
As Gareth Jones, former Chief Marketing Officer of eBay UK puts it, “Phrasee makes you money, so you’re more likely to get your bonus.”
But Malm admits, “working with eBay was not all a bed of roses. They benchmarked us a number of times against a bunch of other suppliers, but we were always on top. They’re not an easy customer, which is why I like them so much.”
As for the future, Malm thinks it’s still early days.
“We’re doing things differently and in a way we created a market from nothing. But it’s still a tiny market and I can see it expanding massively with new tech and new entrants.
“It’s an exciting time.”
The Innovators IV — Descript
Descript is a new podcast-creation platform. What differentiates it from its competitors is that editing is done by editing text. Not only can you edit the audio by editing the text, you can also add to the recording by writing in new text. The system learns your voice and will generate fresh audio as if you were speaking it.
Descript’s founder, Andrew Mason, is a serial entrepreneur — he was previously the founder and CEO of Groupon. He has been starting businesses since he was 15, when he started a Saturday morning bagel delivery service called Bagel Express.
“We’re in a world where there’s really no difference anymore between the consumer and the creator and the prosumer. Everyone is a content creator.”–Andrew Mason
“Thing about audio is, it’s the easiest kind of content to create — you just open your mouth — but the hardest to edit,” he says.
“The way Descript came along is that we were making these long-form podcasts, and saw what a tedious workflow it was for the audio producers,” he said, speaking to Software Engineering Daily.
“This was at about the same time that speech-to-text was reaching an inflection point, where it was actually accurate enough to be usable. We thought, wouldn’t it be cool if you could just automatically transcribe audio and then build a podcast editing tool that would allow people to edit audio by editing text, the same way they would in a word processor?”
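The mechanism Mason describes rests on word-level timestamps from speech-to-text: once every word is aligned to a span of audio, a text edit translates directly into audio cuts. A toy illustration, with an invented transcript format rather than Descript’s real data model:

```python
# Sketch of edit-audio-by-editing-text. Each transcript word carries the
# start/end time (in seconds) a speech-to-text aligner would produce.
# The data format is illustrative, not Descript's.

transcript = [
    {"word": "welcome", "start": 0.0, "end": 0.5},
    {"word": "um",      "start": 0.5, "end": 0.8},
    {"word": "to",      "start": 0.8, "end": 1.0},
    {"word": "the",     "start": 1.0, "end": 1.2},
    {"word": "show",    "start": 1.2, "end": 1.6},
]

def cuts_for_edit(transcript, edited_text):
    """Return the (start, end) audio spans to remove so the audio
    matches the edited transcript text."""
    kept = edited_text.split()
    cuts, i = [], 0
    for w in transcript:
        if i < len(kept) and w["word"] == kept[i]:
            i += 1  # word survives the edit: keep its audio
        else:
            cuts.append((w["start"], w["end"]))  # word deleted: cut its span
    return cuts

# Delete the "um" from the text; the system cuts 0.5s-0.8s from the audio.
print(cuts_for_edit(transcript, "welcome to the show"))  # [(0.5, 0.8)]
```

Inserting new text works in the other direction: the voice model synthesises audio for the added words and the editor splices it into the timeline.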
Like our other founders, Mason believes that his technology will enable creators who wouldn’t previously have had access to this kind of capability.
“We’re in a world where there’s really no difference anymore between the consumer and the creator and the prosumer. Everyone is a content creator.”
“The tools that are out there are difficult to use. If you believe that you can build something that can be a solution for the creators out there, then it’s a process of combining existing markets, creating ones that don’t exist into something new. That’s happening all the time with startups that are creating new spaces.”
Descript is still a relatively small company at 20-plus people, but at the back end of 2019 it received a $15m round of investment and acquired Lyrebird AI, which Mason describes as “just incredibly smart PhD researchers, doing the best work I’ve seen of anyone out there in making an easy-to-reproduce voice model of yourself.
“The stuff that we know how to do is build product and product engineering, build a business. We didn’t know anything really about the deep learning stuff that those guys were doing.”
The future for Descript?
“One thing we’re rolling out now is ‘disfluency detection’ in the app. If you say ‘um’ or ‘uh’, we have a one-click button you can use to zap all of those. We’re doing something similar that will catch contextual filler words like ‘like’ or ‘you know’. It’ll also catch stutters and false starts, so that you can easily clean up your audio.
“We’re looking to offer more tools around that: mixing and levelling your audio, noise reduction and room-tone detection. There’s really an endless list of things we could be adding that are a fit for tools enabled by machine learning.”
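The behaviour Mason describes can be approximated with simple rules, though Descript’s actual detector is a trained model working on the audio as well as the text. A hand-written sketch that zaps unconditional fillers and immediate word repeats (a crude stand-in for stutter detection):

```python
# Rule-based sketch of disfluency removal on a transcript word list.
# Descript's real detector is a trained model; this is an illustrative
# approximation of the behaviour described.

ALWAYS_FILLER = {"um", "uh", "erm"}

def clean(words):
    """Drop unconditional fillers and immediately repeated words."""
    out = []
    for w in words:
        lw = w.lower().strip(",.")
        if lw in ALWAYS_FILLER:
            continue  # unconditional filler: zap it
        if out and lw == out[-1].lower():
            continue  # immediate repeat: likely a stutter
        out.append(w)
    return out

print(clean("so um I I think uh we should ship".split()))
# ['so', 'I', 'think', 'we', 'should', 'ship']
```

Pair this with the word-level timestamps from the transcript and each zapped word becomes an audio cut, which is all a “one-click button” needs.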
The dark side of synthetic media
One concern is that it will allow bad actors to create ‘deepfakes’, potentially flooding the internet with compelling, realistic photos, audio or videos which spread propaganda or disinformation.
Both Synthesia and Descript claim that creating ‘deepfake’ content with their systems isn’t as easy as it might appear.
“It’s very hard to fake without being synthetic from the beginning,” says Synthesia’s Victor Riparbelli. Meaning unless you can persuade your chosen celebrity/politician to come into a face and voice capture studio, it will be very difficult to subsequently fake them using the kind of system that Synthesia sell.
Descript say, similarly, that their system will only generate audio from the user’s own voice.
As Andrew Mason of Descript explains, “you give people a script that they have to read, and then you validate the transcript of what they read against the original script. Unless you can fool someone into reading a 10-minute script, you can’t really fake it.”
“You don’t need deepfakes to create disinformation. There are hundreds of conspiracy theories out there already.” — Victor Riparbelli
Riparbelli thinks the answer to the whole issue is more about attribution. “It’s more important to know where a video has come from. Is it a reputable source, has it been tampered with along the way? Maybe there will be some kind of blockchain system put in place to make sure that a piece of video hasn’t been tampered with.”
Rosebud has a particular issue around people misusing its system: putting other people’s faces on porn actors’ bodies, for instance, or deliberately ‘black-facing’ their own social profiles. Which is why they have limited its use.
Lisha Li says, “we are starting out as a B2B tool to allow us to control who uses the system. For instance, we only make skin-tone changes available to our enterprise-level clients.”
“When Photoshop first came out, people said it would put the cutting-room guys out of a job. In fact it just democratised access, and design departments are bigger than ever.” — Parry Malm
And their system also automatically weeds out anything it believes is pornographic and replaces it with a picture of a teddy bear.
The other issue is the impact these technologies will have on people’s jobs.
Will Rosebud put non-white models out of work, for instance?
“In general the system will replace models, but it will allow more diversity in representation too. Part of our proprietary data includes models from multiple ethnicities,” says Li.
Phrasee’s Parry Malm is more gung-ho.
“When Photoshop first came out, people said it would put the cutting-room guys out of a job. In fact it just democratised access, and design departments are bigger than ever.”
“At the end of the day, no one says ‘my favourite part of my job is writing email subject lines’. We’re taking away the parts that don’t require creativity — the boring pattern matching — and allowing people to focus on the things that do.”
It’s true that all of these companies take the governance of what they do very seriously, and all have put ethics policies in place. However, the reality is that there will inevitably be competitors, perhaps located in less-regulated territories, who won’t be as scrupulous.
The reality is this is something we will have to deal with as a society. And tampering with media is hardly new either — it was rampant during the Soviet era under Stalin, for example.
As Synthesia’s Riparbelli points out, “you don’t need deepfakes to create disinformation anyway. There are hundreds of conspiracy theories out there already.”
If this change is inevitable, we should all take a leaf out of Rosebud’s ethics policy, “We strive to educate the public about the capabilities of synthetic media generation.”
As Alice says in ‘Alice’s Adventures in Wonderland’, “It’s no use going back to yesterday, because I was a different person then.”
Please clap or share this article if you have enjoyed it as it helps other people find it.
If you want to be told whenever I publish these articles, sign up here (100% no spam guarantee — this list will ONLY be used to let you know when another article is live).