We all use tools as part of the writing process. Other books and internet resources for research, Scrivener for writing the first draft, and a computer for typing or dictating into, as well as editing tools like ProWritingAid. But what if you could use AI tools to help inspire the writing process?
In this episode, science fiction author Yudhanjaya Wijeratne talks about how he used artificial intelligence to co-write his novel, The Salvage Crew.
In the intro, I talk about how I've been playing with Inferkit using my own books to train the Natural Language Generation model. More on AI writing tools here. Google announced a model 6x bigger than GPT-3, and Eleuther.ai wants to create an open-source version.
In publishing news, Amazon and the Big 5 publishers have been accused of colluding to fix ebook prices [The Guardian], and Amazon is being investigated for potentially anti-competitive behavior in its sale of ebooks [Wall St Journal].
In useful stuff, Mark Dawson's Ads for Authors course is open now and very useful if you want to get to grips with paid ads this year, one of the ways in which big tech definitely impacts authors! Plus, you can get 50% off my online courses during lockdown: www.TheCreativePenn.com/learn Use coupon: LOCKDOWN. Valid until the end of this UK lockdown!
Today's show is sponsored by my patrons at Patreon.com/thecreativepenn. They pay for my time so I can think and research the future of creativity and then share it with you. If you find the show useful, please consider supporting for just a few dollars a month and get an extra monthly patron-only Q&A audio. Thank you!
Yudhanjaya Wijeratne is the award-nominated author of science fiction novels including Numbercaste and The Inhuman Race, as well as a Senior Researcher on Data, Algorithms, and Policy for an Asian think tank based in Sri Lanka. His latest novel, The Salvage Crew, features humans working alongside an AI overseer and was written with the help of AI tools.
You can listen above or on your favorite podcast app or read the notes and links below. Here are the highlights and full transcript below.
- How code is less like math and more like art
- Co-writing a book with AI tools
- What does it mean for art if art can be automated and humans can't perceive the difference?
- Why the fear of machines taking over may be unfounded
- How working with AI made the writing process joyful and exciting
- How copyright might impact machine learning and ways around that. More in my solo episode on Copyright Law and Blockchain in an Age of AI
- How a financial model of authors sharing data might work
- Would traditional publishers potentially use the data they have in the copyrighted work they own and control to train specialist genre models?
- Why technology is too important to be left to the technologists and why we need to get involved in the conversation and design of the future
Transcript of Interview with Yudhanjaya Wijeratne
Joanna: Yudhanjaya Wijeratne is the award-nominated author of science fiction novels including Numbercaste and The Inhuman Race, as well as a Senior Researcher on Data, Algorithms and Policy for an Asian think tank based in Sri Lanka. His latest novel, The Salvage Crew, features humans working alongside an AI overseer and was written with the help of AI tools. Welcome to the show, Yudha.
Yudhanjaya: Thank you for having me on.
Joanna: It's so exciting to talk to you. Let's just start by telling us a bit more about how you got into writing sci-fi novels alongside being a tech journalist, a data scientist, doing all these things.
How does your artistic life weave into your technical side?
Yudhanjaya: The writing came first, interestingly. I was always a writer first and foremost. I started out wanting to write video games. And just after school, in between working stints at retail, I was trying to build and write RPGs as a way of teaching myself the skills of both programming and also just sharpening my writing skills in general.
So, the two went hand in hand because I never really saw a difference. If you consider Russell's whole line of reasoning that language is a way of denoting concepts and the relationships between them, there never seemed to be a difference between a language that we would speak or write and a programming language. It was just a matter of having the concept map in your head. So, I just built these two things up mutually.
Joanna: I really like that you say programming is just another language, because I often feel that people who use words in the language they speak don't really understand that coding can be incredibly creative. My husband is a programmer and I've worked with a lot of programmers, so I understand it.
For people who only see languages as writing in sentences, what do you think is the comparison in terms of beautiful code and beautiful language, versus functional things?
Yudhanjaya: Well, firstly, there's as much variation, if not more, within programming languages as you would get in what we would consider to be beautiful and functional languages.
For example, if you want something that's written like a haiku, and that's clean, and then perfect, and also at the same time a bit difficult for a beginner to understand, there's Ruby. If you want the English of programming language, then there's Python, which is general-purpose. It's designed to do everything, but not really optimized towards any particular thing.
If we're talking about extremely precise definitions and tight, formal logic chains, then you have other languages. So, you have these different variations. And within that, you have all these styles, almost these dialects, and languages spin off each other all the time. And a lot of it does correspond to what we would think of as sentences.
You have the line breaks. You have the clauses that tell the compiler that, ‘This is a self-contained unit of instructions. Let's move on to the next one.' When I was growing up, because of our education system instruction, for example, to get into university to do a computing degree, you had to have taken maths. And maths is this bunch of subjects, pure maths, applied maths, physics, chemistry.
There's this whole bucket of subjects that you have to take that's considered necessary to build up the skills required to be in computing. I did maths. And then later, as I got into computing, I realized, well, actually, I would have been better served with an arts degree. I'd have been better served with a creative writing degree, because it's far more akin to writing an essay to the machine than it is to write rigorously-defined equations that perfectly terminate and balance each other across the equation marks.
At least, that's how I view it. I understand that others would totally be different about this.
Joanna: I totally get that. And I think it's really important to understand this idea of the art of code, even if we don't need to code ourselves.
As we get into our discussion, I want to get into this book you've written, The Salvage Crew, which is a fantastic science fiction novel in its own right, but fascinating also because you co-created it with AI tools.
Can you explain the process of creation in writing The Salvage Crew with AI tools?
In reading your notes you said it was extraordinarily freeing to use these various tools. Take us through that.
Yudhanjaya: What I did for The Salvage Crew was I've been sort of bashing my head against the idea of AI writing fiction. I've explored this in short stories, notably in a couple of anthologies where I've been looking at what happens to humans when you have Shakespeare 2.0 or whatever. And I've been approaching this problem from various angles and trying to look at it from a technical perspective of this is a pattern recognition problem, and how do we make this work.
A lot of computer science seemed to be to feed a neural network a collection of books, and then laugh as it managed to perfectly get the rules of, say, pronunciation and spelling and grammar, but managed to hilariously mangle up concepts of time, and thought and so on and so forth. And it's because these are extremely complex relations.
When we tell stories, there are lots of extremely complex layers that we are looking at. And these layers have patterns that are patterns of punctuation, that are patterns of plot elements, that are patterns of character arcs, and so on.
So, I started dialing down my ambition, and instead, looking particularly to the video games industry to have aspects of the world-building handled for me. It starts off as a combination of space opera and a colony survival situation, I would say. So, it takes place on a planet. The planet is generated by a very simple code structure that we call a Markov chain.
The continents, the weather, all of that is generated. There's something else that handles weather sheets that tells me what the weather is like on each chapter and does so in a realistic way, so that it doesn't go from rain to thunderstorm or rain to thunderstorm to sunny to thunderstorm again, but rather rain, slightly more rain the next day, and then thunderstorms, and then slowly clearing up and so on.
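A gradual weather sequence like the one described can be sketched with a Markov chain, the same structure mentioned above for the planet. The interview does not say how the weather tool actually works, so this is only an illustration: the state names, the transition weights, and the `generate_weather` function are all assumptions.

```python
import random

# Hypothetical states and transition weights, chosen for illustration only.
# Each state can only move to a "neighbouring" kind of weather, so the
# sequence drifts gradually instead of jumping from sunny to thunderstorm.
TRANSITIONS = {
    "sunny":        {"sunny": 0.6, "cloudy": 0.4},
    "cloudy":       {"sunny": 0.3, "cloudy": 0.3, "rain": 0.4},
    "rain":         {"cloudy": 0.3, "rain": 0.3, "heavy rain": 0.4},
    "heavy rain":   {"rain": 0.4, "heavy rain": 0.2, "thunderstorm": 0.4},
    "thunderstorm": {"heavy rain": 0.6, "thunderstorm": 0.4},
}


def generate_weather(chapters, start="sunny", seed=None):
    """Return one weather state per chapter by walking the Markov chain."""
    if chapters <= 0:
        return []
    rng = random.Random(seed)
    state = start
    forecast = [state]
    for _ in range(chapters - 1):
        options = TRANSITIONS[state]
        # Pick the next state using the weights for the current state.
        state = rng.choices(list(options), weights=list(options.values()))[0]
        forecast.append(state)
    return forecast


if __name__ == "__main__":
    for chapter, weather in enumerate(generate_weather(10, seed=42), start=1):
        print(f"Chapter {chapter}: {weather}")
```

Because each state only links to similar weather, a storm has to build up through rain and heavy rain first, and has to clear the same way.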
The characters are also generated this way. Some of the interactions in the plot, and some of the events of the plot itself, are basically… There's a bunch of programs that pop up and say, ‘Right. This happens. This happens.'
Me, as an author, I look at those data points and go, ‘Right.' The thing that's responsible for the weather is telling me that it's snow. The thing that's responsible for generating random events is telling me that they're about to be attacked.
And the character generator that's been triggered by the events has decided to give the attackers, at least one of the attackers, adaptive camouflage. So, adaptive camouflage in snow, they're going to be absolutely terrified, because they can't see where these things are coming from.
I was able to spin that story, because it was not so much an AI really co-writing with me as someone constantly standing there, holding up a never-ending stream of ideas. And every time it ran out of ideas, it would just bring up another card and say, ‘Have you tried this? Cool. Have you tried this? How about we try this?'
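The idea of several small generators each contributing one data point to a "card" can be sketched as follows. This is a minimal illustration, not the author's tooling: the tables, the `draw_card` and `describe` functions, and the rule that an attack triggers the character generator are assumptions based only on the example given in the conversation.

```python
import random

# All tables below are hypothetical placeholders, not the author's data.
WEATHER = ["snow", "rain", "fog", "clear skies"]
EVENTS = ["attack", "equipment failure", "strange signal"]
ATTACKER_TRAITS = ["adaptive camouflage", "heavy armour", "pack tactics"]


def draw_card(rng):
    """Draw one writing prompt by combining independent generators."""
    card = {"weather": rng.choice(WEATHER), "event": rng.choice(EVENTS)}
    # The character generator only fires when the event calls for it.
    if card["event"] == "attack":
        card["attacker_trait"] = rng.choice(ATTACKER_TRAITS)
    return card


def describe(card):
    """Turn a card into a one-line prompt for the human author."""
    text = f"Weather: {card['weather']}. Event: {card['event']}."
    if "attacker_trait" in card:
        text += f" Attackers have {card['attacker_trait']}."
    return text


if __name__ == "__main__":
    rng = random.Random()
    for _ in range(3):
        print(describe(draw_card(rng)))
```

The program only supplies the data points. Deciding that camouflage in snow means a terrified crew is still the author's job.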
The combinations of these things just keep that process of generating stories and subplots throughout the whole book going. The book is about a machine poet. I used a retrained version of OpenAI's GPT-2, which made headlines, as I remember, when they released a couple of articles that they said were written by GPT-2.
It's a very large model that's trained on a lot of data. It looked incredibly realistic. It had written an article about unicorns, in a very sober fashion. It was almost as if the BBC was reporting on the discovery of unicorns, incredibly sober reporting. Fake scientists were quoted in it, with small quotes and air quotes in there. And I basically took that and modified it to make it generate poetry, because the main character is a machine poet, so I thought it would be nice to have a real machine poet actually powering the fake machine poet.
Joanna: I love that. And I think what's so interesting is having read your process and you talking about it now, you actually took all these different tools, some that you designed, some that other people designed, and used them, as you say, for the character and the plot and the different things that come up, the planet and all of these different things.
It's like you did loads of work beforehand to create these various tools that you then use to do the writing. And when I hear you speak now, it doesn't actually sound like the machine as such did much of the writing, more that you just took all these as inputs and then created something.
But of course, you mentioned GPT-2. We're now, what, eighteen months on, and GPT-3 has been released. And presumably, GPT-4 will come next year or the year after, whatever. Because you've got a three book deal on this.
Will you do things differently next time? And how are the tools more powerful now?
Yudhanjaya: You're definitely right. At some point, I was looking at that effort to automation curve and wondering, ‘Are there not easier ways of just writing this?' But I did this because I was curious to see if it could be done.
It's almost this centaur chess format, this model pushed by Garry Kasparov, and I wanted to see if it could be applied to writing. So, for me, it was just the thrill of finding out whether it could be done. As you said, GPT-3 is out. GPT-4 is on its way. I don't think I'll be pursuing that particular thing because technically, those transformer architectures look like a dead end, because they are incredibly sophisticated, but as we understand the papers, as we look at the training costs, GPT-3, for example, is not something that can be easily trained or retrained without having millions of dollars or the hardware lying around.
It seems increasingly an inefficient brute force method to take a transformer architecture, throw so much data at it that eventually it starts doing very sophisticated pattern recognition and spitting out things that look realistic.
Where I'm going to go instead is actually, I've written a galaxy generator. And I have it generating planets, I generate stars. It does a distribution quite similar to what the Milky Way has, and then starts generating planets and assigns these planets to stars. And then it's self-generating civilizational artifacts and assigning these things to the space between them.
It's visualized as a social network. So, I don't necessarily need to care about the actual distances between the stars. That's not strictly relevant for a story. But the links between, and the path that a random node might take from one end of the galaxy to the other, and the things they might see on the way, that is something that I'll be digging deeper into.
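Treating the galaxy as a network, where links matter and distances do not, can be sketched as a plain graph. This is a minimal illustration under stated assumptions: `build_galaxy` and `shortest_route` are hypothetical names, the planet counts are random placeholders, and nothing here models the Milky Way-like star distribution or the civilizational artifacts described above.

```python
import random
from collections import deque


def build_galaxy(n_stars, extra_links, seed=None):
    """Return {star: {"planets": int, "links": set}} as a connected graph."""
    rng = random.Random(seed)
    galaxy = {
        star: {"planets": rng.randint(0, 8), "links": set()}
        for star in range(n_stars)
    }
    # Link each new star to a random earlier one so the graph is connected.
    for star in range(1, n_stars):
        other = rng.randrange(star)
        galaxy[star]["links"].add(other)
        galaxy[other]["links"].add(star)
    # Add a few extra random links so there is more than one possible route.
    for _ in range(extra_links):
        a, b = rng.sample(range(n_stars), 2)
        galaxy[a]["links"].add(b)
        galaxy[b]["links"].add(a)
    return galaxy


def shortest_route(galaxy, start, goal):
    """Breadth-first search: the route with the fewest hops, or None."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        star = queue.popleft()
        if star == goal:
            route = []
            while star is not None:
                route.append(star)
                star = previous[star]
            return route[::-1]
        for neighbour in sorted(galaxy[star]["links"]):
            if neighbour not in previous:
                previous[neighbour] = star
                queue.append(neighbour)
    return None


if __name__ == "__main__":
    galaxy = build_galaxy(30, extra_links=10, seed=7)
    for star in shortest_route(galaxy, 0, 29):
        print(f"Star {star}: {galaxy[star]['planets']} planets")
```

Each stop on the route is a place where something can happen in the story, which is the part a writer would then fill in.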
Joanna: That sounds fascinating. I love that. It's very interesting you say that the transformer architecture might be more of a dead end, but, of course, more of these different tools are arising all the time.
I don't want to say it's an ethical issue, but there's certainly an issue, and you say in your notes,
What does it mean for art if art can be automated and humans can't perceive the difference?
What do people say about that? Because I know some people are like, ‘Well, you didn't make all that stuff up out of your own mind, so, therefore, it can't be real,' for example. Or, ‘What is the value that you can assign to art created by a machine versus a human?' I know these are huge questions. What are some of the things that you've considered in this area?
Yudhanjaya: That's a big one. I initially created this machine poet that I made out of retraining GPT-2. I initially put it up as an Instagram bot, and I let it generate Instagram poetry. I brought in a basic Python script that would have it follow hashtags, liking and commenting on people's stuff and following them, and if they didn't follow back, it would unfollow within three days.
So, that bot was, I would say, generating stuff as good as, or superior to, most of what you see on #instapoetry, and it started building up a following, a very small following, but a following nonetheless, of people who kept commenting, liking and saying, ‘Oh, my God, this is so meaningful.'
Now, something like this could easily replicate your garden variety Insta poet. Could it generate another Tennyson's Ulysses? I don't think so. Could it do a T.S. Eliot? I, again, don't think so.
I have in my head this loose graph, and there's a line there that is constantly moving. And on one side is stuff that can be easily replicated with very little effort, that can be automated. And people can't tell the difference. And those seem to be low energy, low effort activities. They are almost the small talk category of art, if you will.
And then beyond that moving boundary is work that people have put serious thought into. And there is now this sharp divide of can it be automated and will people be able to tell? Most of the poetry for example in The Salvage Crew, people had no idea that it was generated this way, because it makes sense within the format and so on.
Joanna: Poetry is a really good example, because we generally impose meaning on words, especially if they're less prescriptive, but generating plot and the arc of a story is something much more involved. And I guess it's also interesting, like you mentioned, the Garry Kasparov and the centaur chess idea.
To me, these are tools, obviously, to you, these are tools, and thus it is not like… Let's use the word ‘cheating.' Some people would say it's like cheating to use some of these tools. And yet, of course, we use computers to write on and that could be considered cheating as such.
Now, you said there that people didn't know that AI poetry was written by a machine.
But what about your publisher? Because you've got a three-book deal for the novel and the sequels, and Nathan Fillion, of ‘Firefly' fame, who I love, performed the audiobook. So, clearly, this is not a problem with publishers.
Does your publisher know that you co-created with AI? Do you think it's acceptable in other areas outside of sci-fi?
Yudhanjaya: Ooh, good question. In response to the cheating thing, I would say right now it's a lot more effort than it is just writing the book. Right now, I'm doing it more or less because of my own curiosity. I would say that it's not just cheating, but rather inevitable, because in this space of humans and machine learning and AI, to use a term, it seems that we have tilted ourselves headfirst into narratives that pit one against the other.
Whether it's Rossum's Universal Robots, or urban legends of Golems running wild or Frankenstein, and the hideous progeny, to Arnold Schwarzenegger being the Terminator, it's always this ‘machines will come to end the humanity' thing.
However, what it's been like in reality is that we adapt. We tend to merge. Hybridity, in fact, is a far better thesis of working, people and the machine working together. And I think this is going to be essential. I think it's going to be part and parcel of our future in the same sense, as you mentioned, that we use computers now.
That could be considered cheating to a medieval scholar, who has to produce his own parchment, dye it by hand, possibly survive a Viking raid to find the books that they need. And the ability to access all of humanity's knowledge in a few keystrokes would be incredible-feeling, compared to someone who had to go to the Delphic Oracle to get an answer. So, we tend to adapt. We tend to integrate this stuff. And I think this will happen.
As for whether it will be acceptable outside science fiction, that I don't know. So, the line of thinking that this comes from is, I was looking at what happened in areas where AI has actually defeated the human, the best-performing human at the top of the field, by whatever criteria. I came across chess master Garry Kasparov in the latter part of the '90s getting beaten by IBM's Deep Blue.
The headlines of those times, which I obviously don't recall, because I was a kid, but the headlines of those times were, ‘Man Defeated by Machine.' This is the end. Because here is the greatest chess master of all time who's just falling. And what Kasparov did next was rather interesting. He came back years later with a field called Advanced Chess, where he said, ‘Okay.
We've done human versus human. We've done machine versus machine. Let's try human plus machine versus human plus machine.'
And you found that it played to the natural strengths of both of these systems, really. Humans are really good at general-purpose thinking. We are good at wild plays. We are good at connecting cross-domain expertise. We're not necessarily good at memorizing large tables or figures.
Whereas a chess engine is basically designed to have that depth and have all of these historic positions saved, and try to make reasonable inferences as to how the battle is going to go on. So, centaur chess saw the young players, who would otherwise not be operating anywhere near grandmaster level, very amateur players, suddenly posting scores and plays that were equal to, if not higher than, the best human or chess engine players. Grandmaster-level plays were being done by very young kids whose talent lay more in being able to talk to the chess engine than actually setting up a fantastic end game.
I thought that that was actually quite beautiful, because in doing and performing this kind of hybridity, we let more people into the game. We let more people perform better. We can potentially have, in this, the case of writing this book, I find that it usually takes me about a year, a year and a half, to think about a book and outline and plan it and so on, so forth.
This, once the programmed elements were in place, must have taken me three months. And it was three joyous months. I never had writer's block. It was always I would sit down, I would go, ‘Right? What are we doing today?' There's a bunch of things being poked at me and my mind could easily stitch a story out of it. I think there's potential there.
Joanna: I know everyone's perking up at ‘three joyous months.' That sounds amazing.
Yudhanjaya: Yes. Instead of hitting that wall at 50,000 or 60,000 words, where you don't know if this is good, you're trying to make the ends meet. You're trying to figure out whether the storyline will come together. This was just like me sitting down and going, ‘Right. We're going to have fun today.'
Joanna: And that I think is the attitude. You've mentioned curiosity, you've mentioned fun, joyous. These are words that I want people to think of. I feel that many authors are scared. As you say, the media has been sort of the Terminator or whatever, but I want us to reframe this as joyous and fun and co-creation with potentially these tools that will help us use our minds in really interesting ways.
Were there things that came up that surprised you, and took you into a different realm than you would have done on your own?
Yudhanjaya: Oh, yeah. Quite a lot of the plot elements of how they started essentially falling back and starting to keep a farm growing and how at some point, there are these giant…without spoiling, there are these giant mega beasts on the horizon and they're reverting to, like, medieval wood construction in an effort to keep themselves protected, because there's just not enough wood to go around.
Those things were completely unexpected. Almost all the encounters, almost all the fights that they had were completely unexpected. I had in mind the main character, and I had in mind the character of the alien that they do eventually make first contact with. And I had that theory of mind set in place, and everything else was basically winging it.
Joanna: It sounds really fun. What I'm seeing at the moment is that there are starting to be tools, assuming that most people listening are not programmers. I'm not a programmer. When do you think there are going to be more tools available for authors to use that build on top of many of the existing things that programmers are using and the things that are in beta?
When do you think that the tools will be ready for non-programmers? Or are there any that are even ready now?
Yudhanjaya: I think in terms of world-building, there's quite a few tools being built around niches. There's, for example, a fantasy town generator. There's going to be universe sandbox generators. And all of this stuff already exists, because procedural generation, the art of taking math and turning it into these incredibly large structures that we can then appropriate for world-building, that's been going on for a long time in game development.
All of these dungeon masters who run D&D games will probably be very familiar with a lot of the generators out there. As for the most sophisticated stuff, such as OpenAI, the reason I have to be cautious around this is that any of these AI tools, anything involving machine learning, requires a lot of data to be trained on, so that it can then start producing things like that.
The problem comes with copyright. So, for example, I would love to have something that has been fed, let's say, a couple of hundred science fiction novels, just to be able to give it a sentence like Foucault's Pendulum, where they have this computer that…
Joanna: Oh, I love that book.
Yudhanjaya: Yeah. Where they have this computer that constantly keeps cranking out conspiracy theories and just tying everything together into a plot. I would love to be able to do that.
But the problem is that data, if it's fiction, is somebody else's copyright.
Joanna: I also see that as a problem. I really think that there needs to be a licensing model for licensing works in copyright to be used as training data for some of these things.
Yudhanjaya: Yes. Absolutely.
Joanna: How open are people to that kind of thing? Because copyright is amazing in many ways, but equally, the 70 years after the death of the author, that is holding up development of the things that you're talking about. And my fear is that governments, or that things will change around copyright in order to facilitate some things, and that we have to have a balance, so, some kind of licensing around data training models would help. What are your thoughts on copyright?
Yudhanjaya: In policy, this is often referred to as the secondary use of data problem. Can you use data of any sort for purposes other than it was given to you for?
If, for example, I go and buy a bunch of Scalzi books from the bookshop, having read and enjoyed them as I'm supposed to do, and that's implied in the social contract of buying books, can I then digitize it and feed his stuff into something that might eventually start to sound like him?
GDPR, for example, requires that for any secondary use, the data subject or the data provider be notified, and explicit permission be obtained. So, that actually favors creators more so than the researchers.
I don't necessarily have a much better solution in mind other than to agree with you that the 70 years after death is too restrictive. Even as an author that just feels a bit too much. It's not like I can sit here and claim that every sentence I've written is 100% original.
I am also a product of societies and what I've read, and therefore these words are also going to rely on constructs that I have observed in the world around me and reacted to. So it's not like I'm pulling language purely out of the ether.
That is the state of things right now. On the other hand, I've seen a few AI co-writing tools that are based explicitly on GPT-3. And my question to each of these people, who sometimes just reach out for testing, is always the same. I keep asking, ‘What's your data set? Is it in the public domain?
If you say you've taken this many screenplays or this many novels, whose novels have you taken, and what permissions do you have to do that?' Because there's the flip side of saying, ‘Okay. Fine. In the name of research, let's do this.'
Eventually, what is simple and harmless and fun research gets commercialized. Because of the nature of these technologies, I can retrain OpenAI GPT-2 on my home machine. It's a 6-core, 12-thread CPU, with a very powerful GPU and lots of RAM. That's fine.
However, for me to retrain GPT-3, it would take me about $4 million worth of equipment. I read the GPT paper. And towards the end, they basically say, ‘There are these parts, these parts, and these parts where you're not really sure what happened, but the cost of retraining this is too high.'
That's OpenAI saying the cost of retraining this to find out, to make our research a bit more rigorous, is too high. So, this is overwhelmingly going to privilege large corporations with lots of money in the bank, lots of hardware, and lots of highly paid researchers who can then do this kind of work to create a product that can be sold on a SaaS platform.
So, the problem is, I feel like I'm one of those two-handed economists, like, on the other hand, on the other hand, but the problem of flipping the gate the other way is, you'll initially have a wave of early experimenters like me who are having fun with it, and then immediately you have [inaudible 00:32:11]
Joanna: I see this happening right now. As you know, Microsoft has now licensed OpenAI's tools.
Yudhanjaya: Oh, yeah.
Joanna: So, obviously, Microsoft are going to turn this into their products. I think Azure is their AI software as a service. So, these things are going to be commercialized. And of course, as we know, the architecture transformer stuff could change to be something else, and could be cheaper in the future.
I feel like we have to figure out copyright for an AI age before all this starts happening, because the thing is you've done this, you've produced a book that you've been paid for, and because you're ethical and you understand the copyright side of things, you've done it in an ethical way.
Yudhanjaya: Yes. I'd like to note that I use people who've been dead since the fifth century.
Joanna: Exactly. But equally, to me, there's a big problem of bias. So, if you only use works out of copyright, they are generally, you know, white, dead, Christian male…
Yudhanjaya: Oh, yes, yes.
Joanna: And you're Sri Lankan, for a start.
How many Sri Lankan published authors are in your dataset? Probably just you?
Yudhanjaya: I've been through the Project Gutenberg corpus of poetry. And there is a very easy downloadable corpus there. What I ended up with, the initial generator that I built, was very heavy on Christian imagery. Themes of God kept randomly popping up in the middle.
Joanna: You probably had a lot of Bible in there.
Yudhanjaya: And there was William Blake, for example, and a lot of those. The corpus is overwhelmingly, as you say, biased towards the Anglosphere, towards white male authors in the Anglosphere, and even then towards the more religious angle of it, whatever was considered socially acceptable in those times.
So, yes. We have tremendous problems in class balance in these kinds of datasets. And we absolutely do need to figure this out before crap hits the fan.
Joanna: You and I are both authors. I'm an independent author. I own all my rights. I would love to license my corpus of work to whatever models, but I would also like to be recompensed for that.
In my head, I have this idea that using possibly some kind of blockchain technology that would feed in my data would be tagged in some way, and then whatever is output from the other end, let's say someone produces books out of that corpus, that I would receive a micropayment for whatever percentage my work was part of that training data. Is that completely far-fetched?
Yudhanjaya: Oh, no. Actually, that's very interesting, because that should be possible within the GDPR, which, I'm not a huge fan of GDPR. There are certain data colonialism problems that it's kicking off in the way it's being pushed out.
However, in the current structure, there are these intermediaries, the data processors. For example, if you and I could license our books out to these data processors and they acquire the rights of many authors to put together a large corpus of data that then researchers or maybe even other authors can one click and download, and there's a subscription fee, a portion of which, according to our contribution in the corpus, goes to each author who pitches in. That is a perfect model.
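The arithmetic behind such a model, where a portion of a subscription fee goes to each author according to their contribution to the corpus, can be sketched as a pro rata split. This is only an illustration: measuring contribution by word count, working in whole cents, and the `split_pool` function itself are assumptions, since the conversation does not specify how contribution would be measured.

```python
def split_pool(pool_cents, words_by_author):
    """Split a subscription pool pro rata by word count, in whole cents."""
    total_words = sum(words_by_author.values())
    if total_words == 0:
        return {author: 0 for author in words_by_author}
    # Integer division keeps every share in whole cents.
    shares = {
        author: pool_cents * words // total_words
        for author, words in words_by_author.items()
    }
    # Hand out the cents lost to rounding, largest contributors first.
    leftover = pool_cents - sum(shares.values())
    by_size = sorted(words_by_author, key=words_by_author.get, reverse=True)
    for author in by_size[:leftover]:
        shares[author] += 1
    return shares


if __name__ == "__main__":
    # Hypothetical figures: a 100.00 pool and two authors' word counts.
    corpus = {"Joanna": 600_000, "Yudha": 400_000}
    for author, cents in split_pool(10_000, corpus).items():
        print(f"{author}: {cents / 100:.2f}")
```

Working in integer cents avoids floating-point drift, and the leftover step ensures the shares always add up to the full pool.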
Joanna: That's the model I want to happen, because that, I see, as a way to enable this kind of creation, but also to pay the original creators. And then I was reading about synthetic data. So, let's say you and I put together a corpus, ‘Yudha-Jo corpus,' and we can then actually create synthetic data from that that then could also be licensed on.
Thus it would give a new form of income to creators, but still benefit people who want to train and use the models to create new things.
Yudhanjaya: True. But synthetic data is a bit difficult with unstructured data. So, the think tank that I work at, we actually do a lot of synthetic data work. We use it generally on phone call records across millions of people, to reconstruct patterns of movement, economic activity, so that whatever funds are coming in for development can actually be channeled to where those things are needed, and where there is a need, for example, for better routes and better public transport.
On the language side, because I work with flexible languages and publish large corpora in these languages, synthetic data is incredibly… GPT-2 and 3 are essentially synthetic data generators. They're incredibly difficult with unstructured data, unless you happen to have an insane amount of unstructured data, which has been OpenAI's thing so far.
They just scraped all of Reddit for that first conversation. They basically scraped all of Reddit for the top-performing articles, followed those links through, and took that text. And I'm sitting there going, ‘Wow. Okay. That's a lot of copyright violations.' I don't even dare touch that corpus.
Joanna: And that's the thing, but my concern really is that, again, you're an ethical person who knows this stuff. Most people playing around with a lot of these tools don't necessarily even understand copyright, let alone care about it. So, I think we definitely need to be engaged in this.
I love what you're doing. I do have one last question because we're almost out of time.
When we talk about ownership of data and ownership of books and copyright — the publishing industry actually owns the most data.
For example, Penguin Random House, maybe with Simon & Schuster, you've got some really big corpuses there. As a thought experiment, do you think one of these mega publishers would use that data? Because a lot of the time, the contracts that authors have signed hand over data for the life of copyright. Could that be used in the future, or do you think the publishing industry is just not that sophisticated?
Yudhanjaya: I've got a fair bit of traditionally published work out. I've got a fair bit of indie work out. And everybody says, ‘Oh, traditional publishers respond to this or this or this,' until they start seeing significant money, and then you start seeing eBook adoption. Then you start seeing print runs being reduced, higher royalties appearing on eBooks, and things going eBook and audio first, large amounts of money being pumped into this.
I think at some point, when they realize that there is enough money, this level of analysis will be done. In fact, I'd be very surprised if there wasn't already. This is speculation on my part, but I can think of several dozen use cases right off the bat. If you wanted to know what the structure of a best seller is, you could potentially do topic modeling on… Say if you take the science fiction.
Say you're paying Penguin Random House and you have science fiction and fantasy. Topic model your bestsellers, and then find books that match those. So, you're not just looking at the blurb, you're not just looking at how closely the title matches, but at whether the actual structure of words and themes represented, and how they're put together inside the document itself, match. There's so much other stuff that can be done with this.
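To make the idea of matching books on their actual text a little more concrete, here is a minimal Python sketch. It is not topic modeling proper: it swaps in a much simpler stand-in, cosine similarity over word counts, and it is not a method described in the interview. The stopword list, the function names, and the sample texts are all hypothetical.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "and", "of", "in", "to"}


def word_counts(text):
    """Count lowercase words in a text, ignoring common stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)


def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors (0.0 to 1.0)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def rank_by_similarity(reference_text, candidates):
    """Rank candidate texts by how closely their wording matches
    the reference text, most similar first."""
    ref = word_counts(reference_text)
    scores = {title: cosine_similarity(ref, word_counts(text))
              for title, text in candidates.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical one-line "books" standing in for full manuscripts.
bestseller = "The starship crew salvaged the wreck"
submissions = {
    "Submission A": "A crew salvaged a starship wreck",
    "Submission B": "The dragon guarded the castle",
}
for title, score in rank_by_similarity(bestseller, submissions):
    print(f"{title}: {score:.2f}")
```

Word counts only capture shared vocabulary. A genuine topic model would group words into themes first and compare books on those themes, which is closer to the structural matching described above.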
Joanna: I think so. And, in fact, I wonder whether this will come out of China first. Whether they're ahead of the U.S. or not can be debated by different countries, but you and I are in the middle of those two countries. So, one could say that this could come out of China first. With AI translation, I don't even know if we're going to know what is created by an AI.
Yudhanjaya: That's the thing. I'm looking at both regulatory environments, because the U.S. and China are very much alike, for all the narrative that they have against each other. They're both, to a certain extent, incredibly unregulated data economies.
Now, China is passing through a Data Protection Act, which I've read the draft of. It's pretty interesting. Actually, a lot more liberal than I imagined. But in the U.S., for example, you have Clearview, which, basically, scraped YouTube for faces, and used that to build a facial recognition database for police. And they're claiming that it is within their first and second amendment rights to take data thus from YouTube. YouTube, of course, is kicking up a huge fuss and these people are saying, ‘No, it's in our rights.' And that is a completely unregulated environment.
I have the feeling that this stuff will be coming out of the U.S. first. It's probably not going to come from the science fiction and fantasy authors if you're looking at books that are co-written and so on. It's not going to come out of the classical science fiction and fantasy authors that we know and follow.
It's going to be someone with a programming background, probably in Silicon Valley, going, ‘Hey, I'm going to write a book.' So, a lot of the social contract around the process, the subtle unwritten rules and norms of being a part of a community of writers, are just not going to apply to them because they'll be looking at it completely from the outside and going, ‘How do I hack this process?'
I've just done something like that. I've realized I'm just describing myself as a sociopath!
Joanna: But you haven't. And circling right back to the beginning, you're an artist and a programmer and a technologist. And I think how I want to end this discussion, really, is to say that this is too important a thing to leave to only the technologists, and the artists and writers need to get involved in this, right?
Yudhanjaya: Yes. Absolutely, yes.
Joanna: We have to get involved or else we're going to wake up and we'll be out of the conversation.
Yudhanjaya: Or worse. We're going to wake up and it's going to be crap. A lot of my frustration came from reading computer science papers that were published in these machine learning fora where they would feed increasingly sophisticated neural networks, say, the first seven Harry Potter books. And then you would get some output and they would discuss it and say, ‘Oh, yeah, the model loses attention after two paragraphs.' And the author part of me is screaming, ‘That's not how we work.'
We think of characters, we think of world-building, we think of plot, we think of all of these things, the emotional arcs that the reader has to go through. We think of all of these layers. We don't process this as one giant chunk.
I think it's critical for us to actually be involved in this because, otherwise, we will end up with stuff like GPT-2. Actually, GPT-2 and 3 are very good examples of what happens. We're bordering on the unethical. It's even sometimes unfeasible.
It's clearly a technological breakthrough as well, so it is sitting in this awkward space of who really gets to use this. And what exactly is the greater good scenario here?
Joanna: Absolutely. So, people who want to get involved should definitely check out your books and some of the things you've written.
Tell people where they can find you and everything you do online.
Yudhanjaya: My Twitter is going to be language, futurism, policy, cats. And my Facebook is going to lean a lot more towards cats. It's going to be cats, and whatever other interesting things I have to say come second to cat photos.
Joanna: Thank you so much. That was so great to talk to you.
Yudhanjaya: Thank you so much for having me. This was so much fun.