Hey everyone, welcome back to this interview series. I'm delighted to have with us Quoc Le, who is a researcher at Google Brain. I had the pleasure of working with him at Stanford University many years ago, and Quoc and I were chatting just now. Fun fact: Quoc was the very first intern I recruited into Google Brain when I was leading it back in the early days. Zhiqian Yang was the second intern I ever recruited into Google Brain, and I think Geoff Hinton was the third intern I ever recruited, so I think Quoc was in good company. Since then, Quoc has done some of the most influential work in NLP, and I'm delighted to have him here today to tell us about some of his experiences. Thanks for joining us, Quoc.

Yeah, thank you, Andrew, for having me in the interview series. It's a pleasure to be here talking to you.

So today, Quoc, you're known widely as one of the most influential NLP and deep learning researchers, and your journey has been a complicated one. I think you started off going to school in Vietnam, then went to Australia, then Germany, and now the United States. So tell us more about your journey getting into AI.

Yeah, so when I was a high school student, I was fascinated by AI, and I ended up reading a lot of books and even programmed some simple AI programs. And then I got a scholarship from Australia for my undergrad at the Australian National University. In the second year of my undergrad, I got a little bit bored, and I said, you know, maybe I should do some research in AI, because the faculty there, they have some very amazing faculty. So I contacted Alex Smola, who took me as an intern for a summer, and I worked on kernel methods with him. And even though I had done some AI projects before that, that was the first time that I actually learned about machine learning. And I became super fascinated by machine learning, and I realized the potential of doing machine learning for AI.

Wait, back up a bit. I don't think I knew this story. What was the AI project you were coding up in high school that got you your scholarship to the Australian National University?

I didn't get the scholarship to Australia because of the AI program, but I programmed it myself to learn more about AI and so on. I was building a chatbot, you know, a rule-based system to actually talk to myself, right? And trying to randomize the answers and so on, to see how I could fool my friends, because I had read about the Turing test and was fascinated by it. But I didn't win the scholarship because of that program; it was too simple for that. But, you know, because I programmed it, I know how hard it is to write a computer program that is actually intelligent.

And I think it's really cool that you started out that way, because if there's someone listening to this interview series now that is coding up, you know, some slightly broken chatbot that kind of works, kind of doesn't work, and finds it hard, they may be taking their first steps to an illustrious career in AI, much like you were once coding up a social chatbot. We all start from programs that don't quite work as well as we wish, but that can be a wonderful start.

Actually, much later, many years later, I had a conversation with you, and it turns out you also did something like this in the past. But I was always fascinated by the idea: if you ever build an AI program, can I talk to it, right?
Can I talk to the program and hear what it thinks? But anyway, going back to the journey: I did my summer project with Alex Smola on kernel methods and machine learning, and I realized the potential of using machine learning to build AI. I kept on doing research with him for a while, and then after I graduated, I went to Germany, where his friend Bernhard Schölkopf hosted me. When I was in Germany, I was also doing machine learning research, and near the machine learning group there was a neuroscience center that was trying to understand the brain and so on. So I became very fascinated by that as well. And then I listened to one of your talks at NeurIPS that year, where you were talking about using machine learning for AI. By the way, at that point in time, most people thought about machine learning as pattern recognition, doing machine learning for machine learning's sake, like trying to classify things. But when I heard about using machine learning to do AI, it really resonated with me. So I applied to the Stanford PhD program, and I ended up going to Stanford and doing the PhD with you. That's around 2007. And then around 2010, 2011, you told me that you were founding a project at Google to scale up deep learning research and make deep learning, you know, a hundred or a thousand times bigger. When I heard about it, I said, oh, that must be the future. Because with the neural networks I had trained, no matter what I did, the only thing that made a big difference was giving them more data and more compute. So it really resonated with me.

I'm glad you thought it was the future. A lot of my friends at the time were telling me it was a terrible idea to do this crazy Google Brain thing, so I'm glad you thought it was reasonable. But actually, I think you and I had the privilege at Stanford of seeing some of the early results that Adam Coates was generating. Adam Coates had shown, even with small-scale resources, that scaling up the neural networks really worked. So I think that made it easier.

I totally agree. That figure that Adam Coates made basically changed a lot of what I did, and I think probably a lot of what the field did too, and it really influenced me to join the project. So then I went to Google Brain as an intern. Like you said, I was the first intern on the project; that's around 2011, I think April 2011. And I've stayed with the project since. Brain has grown from the four or five people you had at the beginning to much larger teams now, and I still continue to do research in deep learning and AI.

Yeah. Google Brain was incredibly lucky to have you join. And I think one of the first home runs was that Google cat project, which you were the lead engineer and lead author on.

Yeah, I remember that. Back when we were at Stanford, I was playing around with this method called the autoencoder: basically, you give it an input image and it tries to reconstruct the image. And if you apply some kind of sparsity, then at the low levels of the neural network you start seeing some kind of edge filters. And edge filters are very useful for computer vision.
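For readers who want to see this idea in code, here is a minimal sketch of an autoencoder with a sparsity penalty. It assumes PyTorch, and the layer sizes, the L1 penalty weight, and all names are illustrative, not the original system:

```python
# A toy sparse autoencoder: reconstruct the input, and penalize the
# hidden activations so only a few units fire for any given input.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, n_in=784, n_hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_in, n_hidden), nn.Sigmoid())
        self.decoder = nn.Linear(n_hidden, n_in)

    def forward(self, x):
        h = self.encoder(x)          # sparse code
        return self.decoder(h), h    # reconstruction and code

model = SparseAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.rand(32, 784)  # stand-in for a batch of flattened image patches
recon, h = model(x)
# Reconstruction error plus an L1 term that encourages sparse codes.
loss = nn.functional.mse_loss(recon, x) + 1e-3 * h.abs().mean()
opt.zero_grad()
loss.backward()
opt.step()

# After training on real image patches, each row of the first layer's
# weight matrix can be reshaped to an image; many tend to look like edges.
```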
And then there was this influential result by Honglak Lee, who was our lab mate, and in that project, I think, he worked with you: if you do this in a deeper neural network, you start getting more sophisticated features. And for me at that point in time, learning feature representations was very, very interesting. And the thought exercise, and this is not just me, you got a lot of credit in this, so I have to say it, the thought exercise was: can you make this a hundred or a thousand times bigger and learn even more sophisticated features? We didn't know what these networks were going to end up learning. And then we set out doing this project, and we scaled it up from a single machine to 16,000 cores. I remember that. And we trained on YouTube, because my question was, what is the biggest image dataset that we have access to in the world at that moment? And it turns out it's YouTube images. So we trained the neural network, a sparse autoencoder, for about a week. And with a lot of searching, we found some neurons in the network that are very sensitive to faces and cats, the typical things that people expect on the internet. And one of the most intriguing results was that, surprisingly, the network found cats on the internet.

I still remember, I think we were both in the office that day. The team was so much smaller, and you grabbed me and said, hey, Andrew, come to my computer and take a look. And I walked over, and there you'd gotten this cat to appear on your monitor from unsupervised learning. And I think one piece of early deep learning history is that almost all of us maybe overestimated the short-term impact of unsupervised learning, and didn't realize that supervised learning would be where a lot of the action, at least in the early days, would be. Although I find it interesting that one place where unsupervised learning has really taken off and is showing wonderful commercial results is NLP.

Yeah, yeah.

And that iconic, slightly blurry Google cat wound up being one of the iconic images of unsupervised learning for that time. And I think the team couldn't have done that without you. Shortly after, one of the other huge hits that you worked on was the sequence-to-sequence model, and I think that was another piece of work that changed the trajectory of NLP. Tell us a bit more about that.

The first version of sequence-to-sequence, actually, and a lot of people don't know this, was this idea called word-to-word: translating words from one language to another. And the reason why we did this is the following. Back around 2012, 2013, suddenly there was this wave of new word vector methods, pioneered by a guy called Tomas Mikolov, who was developing word2vec, right? And it trains word vectors very quickly. And it shows this amazing ability that if you take, you know, king minus queen, then the vector is similar to man minus woman, for example, right?
Basically solving analogies. So the next question we asked is: is there any similar structure between two languages? For example, can you train word vectors in Spanish, and also train word vectors in English, and then try to align them a little bit? Can you learn that certain words in one language align with certain words in the other? And we did this thing, and it turns out it works. So you only need, let's say, five words, or a hundred popular words in English, where you know the corresponding translations in Spanish, and then you learn a rotation matrix. And then you can actually start translating words from one language to the other without knowing too much about the languages. So that word-to-word translation gave me the feeling that, oh, maybe we should not do this word-to-word. Maybe we should do sentence-to-sentence. So then we asked, okay, how do we do sentence-to-sentence? And it turns out it wasn't easy conceptually at the beginning for us. We were thinking of all sorts of ways: for example, let's read in the input sequence, the input sentence, and then try to predict what the first word looks like, and then try to predict what the second word looks like without knowing the first word, and so on. That's like predicting the first word, second word, third word, et cetera, without knowing their relation. And then I said, oh, maybe that's a little bit limited, because when you predict the second word, you should know the first word, and when you predict the third word, you should know the second word, et cetera, right? So there has to be some sort of dependency. So we thought about it.

In an English sentence, it doesn't really make sense to independently predict the first and the second and the third and the fourth words. Predict the first word, and then based on that, figure out what the second word is, and only after that figure out what the third word is. The sentences you output just make a lot more sense when you do it one word at a time, rather than trying to spew out all ten words at the same time, which may not end up relating to each other.

Yes, yes. That's a good way to explain it. Now, it turns out that from the point we realized this idea to actually making it work, it took about a year, and it was actually very difficult. One bottleneck in this whole thing was training the sequence model. Back at the time, I trained it with a conventional recurrent neural network. But at that point in time, Ilya Sutskever and Oriol Vinyals also had very similar ideas about doing this, and they knew how to train this model, how to get it to work, with the LSTM. So we met, and then we decided, let's try the LSTM. It turns out it gave us really good improvements. But the first translations still looked awful. It could output certain words that looked like the input, but mostly wrong. So the question was, should we continue? And, you know, I was into trying all sorts of ideas and so on. But credit to Ilya, he said: maybe we should just train this model better. We should really train this model better, train a bigger model, train it longer. And credit to him, that approach actually turned out to be successful.
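To make the one-word-at-a-time idea concrete, here is a minimal sketch of greedy autoregressive decoding with a toy LSTM encoder-decoder, assuming PyTorch; the vocabulary size, dimensions, and special token ids are all illustrative, and the untrained model will emit arbitrary tokens:

```python
# Encode the source sentence into the LSTM state, then predict each target
# word conditioned on the words generated so far (via `tok` and `state`).
import torch
import torch.nn as nn

VOCAB, EMB, HID, BOS, EOS = 1000, 64, 128, 1, 2  # toy sizes and special ids

class Seq2Seq(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, EMB)
        self.encoder = nn.LSTM(EMB, HID, batch_first=True)
        self.decoder = nn.LSTM(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, VOCAB)

    def greedy_decode(self, src_ids, max_len=20):
        _, state = self.encoder(self.embed(src_ids))  # read the whole input
        tok, result = torch.tensor([[BOS]]), []
        for _ in range(max_len):
            # Word t depends on word t-1 (tok) and everything before (state).
            dec_out, state = self.decoder(self.embed(tok), state)
            tok = self.out(dec_out[:, -1]).argmax(dim=-1, keepdim=True)
            if tok.item() == EOS:
                break
            result.append(tok.item())
        return result

model = Seq2Seq()
print(model.greedy_decode(torch.randint(3, VOCAB, (1, 7))))
```

This is exactly the dependency structure described above: each output word is fed back in before the next one is predicted, rather than all the words being guessed independently.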
And then we trained it, and we lowered the perplexity slowly over time. And after a couple of months, we started seeing better and better translations. Now it was not just one or two words, but, like, five words, certainly more correct. And we realized we were onto something.

So, just to get the history right: you tried RNNs, then got better results with LSTMs. And then the last mile was just scaling: more data, a bigger LSTM network. And that wound up being the last several steps to the seminal breakthrough that we all then saw.

Yeah. I would say that it seems trivial from the perspective of a researcher, but it made a huge difference for the project. Had we invested in fancy ideas, all sorts of crazy ideas, and not tried scale, not tried to train the model longer and better, we would never have gotten the model to work, and it would have delayed the project much longer. There are a lot of lessons in that: even though the last mile seems trivial from the research point of view, it made a huge difference for the project.

Yeah. It doesn't seem trivial to me at all. And I find that when I'm building systems, when I'm in basic research mode, sometimes I hold the dataset fixed and change the algorithm. But when I'm in production mode, just build a commercial system that works, sometimes I hold the algorithm fixed, pick an architecture that I think is good enough, and then spend all my time changing the data. So there are very, very different ways of operating.

Yes, yes.

So the work that you did with Ilya and Oriol on sequence-to-sequence models was a seminal piece of work, with really huge impact on the whole field of deep learning. And then, almost coming full circle, you built chatbots in high school, and building on the ideas of sequence-to-sequence, you also built this much more advanced chatbot, than your high school version I imagine, called Meena. And I remember reading that paper and thinking, wow, some of these outputs actually look pretty cool. Talk a bit about the Meena chatbot project.

Sure. One of my dreams in high school was that a chatbot could tell you a joke, you know, a new, original joke.

You know, one of the things when we're writing The Batch every Tuesday, because we send it out on Wednesday, is that my editor-in-chief and I often end up sitting around brainstorming for five minutes to see what joke we want to tell. If a chatbot could automate joke telling, it would make my life easier.

Yeah, that's for sure. Same with me: sometimes I want to give a talk, and I always begin the talk with some joke, and I sometimes find it hard to make the joke as well. But when I was in high school, I was thinking, can we come up with a way to create an original joke? And it turns out, if you think through that exercise, it's actually more difficult than you can program. So after I worked on the sequence-to-sequence project, I realized that if you can take a sequence as input and then produce a sequence as output, maybe one of the biggest applications is to train a model that can talk to you.
Now, actually, the person who thought of this idea at Google first, it's not me, it's actually Oriol Vinyals, and he executed it better. He trained a chatbot that actually worked really well. And we realized maybe we should collaborate and build a cool chatbot. We did it around 2016 or something like that, and there's a paper on arXiv. And we showed that there's potential in this.

I remember you did a lot of work within Google to get permission to use the IT support data for the chatbot.

Yeah. So back at the time, we were researchers, and we asked ourselves, where's the dataset? So one thing that we did was go to the technical support team internally at Google and get a dataset from them, and then train a model. And then we asked questions like: can you debug this? I lost my password, what should I do? And it starts answering those kinds of questions. And although it still feels a little bit off, you start seeing there's something there, right? There's something there that's really promising. And then I spent...

I just had this very clear recollection of you, me, and I think Adam Coates, a bunch of us, having dinner in a restaurant in Palo Alto, and you had just gotten permission to use the data, and you were very excited over dinner that day.

Exactly, yeah. I remember that dinner. That was when we had sushi at the Palo Alto restaurant. I remember we talked about that, and how excited you were that someone was working on chatbots, because you had also worked on chatbots back in the day.

And I think I told you about my high school chatbot attempt, which was actually more of a prank, right? I wrote a prank chatbot where, no matter what I typed, the chatbot would print out "what is your name". So I would type, you know, Q-U-O-C, space, L-E, and as I typed that, the chatbot would print out W-H-A-T, space, I-S, and so on. And the effect was that when I hit enter, it would then print out what I had actually typed. So to someone watching, they think I typed "what is your name", and then it prints "Quoc Le". So to my friends I was pranking, it looked magical, because I could ask the question and it would know the answer. But what was actually happening was that I was typing the answer; it was just displaying a pre-canned question.

So that's impressive: no AI, just a high school prank.

Your Meena was much more interesting. Tell us that story.

I spent one year trying to improve the bot myself, and I didn't get very far. It turns out the bot is very hard to improve. If you train multi-turn conversation, after one or two turns it becomes very difficult. For example, it can answer you on the first turn, but on the second turn it starts to forget information. It doesn't give you a good answer and so on. It's pretty much broken. But then, by the end of 2016 and 2017, I met this guy at Google, an engineer named Daniel, and he came and told me: I also have a dream to build a chatbot, so let's work together to build one. And when he came in, at that moment, something magical had happened, which is that people had developed this transformer architecture. And we said, okay, maybe the transformer is more promising, because it can understand long-range dependency better, right? So it can see multiple turns better than the LSTM.
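As a rough illustration of that long-range-dependency point, here is a minimal sketch of causal self-attention, assuming PyTorch; the shapes are illustrative, and a real transformer adds learned query/key/value projections, multiple heads, and more. Unlike an LSTM, which squeezes the whole history through one recurrent state, every position here attends directly to every earlier position:

```python
import math
import torch

def causal_self_attention(x):
    # x: (seq_len, dim). For brevity, queries, keys, and values are all x
    # itself; a real transformer uses separate learned projections.
    scores = x @ x.T / math.sqrt(x.shape[-1])          # pairwise similarities
    mask = torch.triu(torch.ones_like(scores), 1).bool()
    scores = scores.masked_fill(mask, float("-inf"))   # no peeking at the future
    return torch.softmax(scores, dim=-1) @ x           # mix of all past tokens

tokens = torch.randn(12, 16)  # e.g., a 12-token multi-turn dialogue history
print(causal_self_attention(tokens).shape)
# The last token can draw on the first in a single step, at any distance.
```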
And so we started working on using the transformer to deal with this multi-turn conversation. And, you know, we trained a bigger model again, right? Longer, and using more resources at Google. And slowly we started to see some improvement. And one magical moment, in my opinion, was when I looked through the logs. We allowed some people at Google to chat with it, right? And there was one magical moment when we looked through the logs and I saw a joke the bot made. It's actually in the paper: the joke was that if cows go to Harvard, then horses go to Hayvard. It's a funny dad joke, but it's actually a multi-turn conversation; it's trying to lead the audience into it. So I saw it, and I tried to look into the training data to see whether it actually contained anything like Hayvard, or horses go to Hayvard. And we couldn't find it. So it's truly... it actually understood the concept of jokes, it understood the concept of puns, and it actually created a new joke. And I think that was a very magical moment, in my opinion. I was pretty excited about it.

Yeah, I remember reading that joke when I read the paper. But how do you know if it was true understanding of the concept of jokes and puns and Harvard versus Hayvard, versus, you know, enough monkeys at typewriters, enough randomness, meaning eventually there'll be a funny joke?

Yeah. So we looked into the training data very carefully, looking for the word Hayvard, right, to see whether that word was mentioned anywhere. It's mentioned once in the entire training corpus, but it's not next to a joke; it's mentioned in a different context. Our conclusion is that it's actually trying to understand this concept of a joke. The training dataset has a lot of jokes, and a lot of puns as well, but that particular example is novel. We could not find it anywhere in the training data.

So you've been at the forefront of NLP research for many years, and we continue to see breakthroughs in NLP on a regular basis. So I'm curious, looking to the future, what are you most excited about in terms of things yet to come in NLP?

Oh yeah, in NLP I'm most excited about generative models. I think currently a lot of NLP is basically traditional NLP, where you classify the sentiment of a sentence, or you do named entity recognition on a sentence. But I'm more excited about the capability of generation. Like, can you generate a new book that can be consumed by humans and, you know, teach people a new concept? Can we get to that point? Or maybe, can it help a director or a screenwriter come up with a better movie plot, right? There's so much potential in using this technology for generation, in my opinion.

Since the breakthrough of the transformer model and the several flavors of transformers, one huge vector of progress for generative models has been scale, right? Scale of data, and especially compute. Other than scale, what do you think are the important vectors of progress for generative models?

Okay. I think true understanding. Right now, if we look at, for example, text generation, there are still things that are a little bit off, right? It makes up stuff. Like, there are some facts that it makes up.
So the question is, can we get the bot to really have common-sense understanding and generate more factually correct output? I think that would be a big vector of progress.

Yeah. I want to dig a little bit further into that, because a lot of researchers, including me, sometimes have talked about computers understanding images or understanding language, and common sense is a concept that philosophers have debated for, I think, over 2,000 years now. So when you talk about NLP understanding or common sense, are these scientific concepts that are measurable, or are these philosophical concepts where you kind of feel like it understands? How do you approach these questions of understanding and common sense?

I think it's measurable. In some sense, you can take, for example, GPT or any language model, and you can give it a prompt about, let's say, GDP in the world, right? Let's talk about GDP in the world. And can you look at the GDP figures it generates, compare them with Wikipedia, and say whether they match? That's measurable. Or you can start talking about movies in 2020, right? And can it understand that these are the movies? I think it is measurable.

Do you have a favorite set of benchmarks? Like, CommonsenseQA is one dataset. Do you have a favorite benchmark for measuring understanding or common sense?

I think common sense is a bold name, you know, to say this is our measure of common sense. It's a bold name for a dataset. Maybe instead of talking about common sense, we can just talk about factuality a little bit. Just factual knowledge, right? Can it generate statements that are factually correct? For example, how old is Barack Obama, and so on; can it generate that correctly? From that perspective, I think it's easy to create a dataset for that. We don't have that dataset yet, but I think it will be created. I think people in other parts of AI have created common-sense datasets to measure the performance of generative models, right? So I'm not an expert in this area, but my feeling is that when I read a lot of these outputs, I still see things that are factually not correct, things that are a little bit off. And I think addressing that issue can be quite important, yeah.

So a test set measuring factual correctness would be a good way to help people. In your time at Google, you've mentored a lot of younger engineers, and even in your time at Stanford, you mentored a lot of younger students. So today, there are a lot of people wanting to break into AI, or wanting to advance their careers in AI. What advice do you have for someone wanting to build their career in AI?

Yeah, okay. So first of all, I have to say that I don't have true meta-advice. I don't like the concept of meta-advice, because, you know, food for me can be poison for other people; we're all living in different worlds. But looking through my journey, I'd say that I started out from something very, very different, right? I came from a different culture, a different background. So maybe one piece of advice is that a career as a whole takes a long time to make really significant progress. I could never have imagined that I would make a big contribution, a good contribution, to translation, for example.
So my thinking is that maybe you start quite low, but with hard work and dedication, over time you can do good work and make an impact on the field. So be patient, really. For me, that journey has taken almost 15 years already, from the point that I started until, you know, I became a more senior researcher at Google. So be patient, because it takes time to do impactful work. That's number one. And number two is that I think naivety can be good. A lot of people think that they have to read a lot and understand everything. I think naivety can sometimes be in your favor. For example, around 2013, when I was doing sequence-to-sequence, I tried to re-implement a phrase-based machine translation system. And I was just a bad programmer; I didn't know anything about phrase-based machine translation. I was bad. I spent two months implementing it, and I couldn't do it. And because of that frustration, I said, okay, maybe I should do something different here. I should do end-to-end learning instead. But imagine if I had been a good programmer and knew a lot about phrase-based machine translation: I would have programmed it successfully, and then I would not have developed sequence-to-sequence and would not have gone down a different path like that. So maybe naivety can be good.

Yeah. I think what you said about patience is great advice. Sometimes it does take time, years, to become really good at AI. It can be a long journey, but an extremely rewarding one. So I'm curious: what advice do you have for someone that is learning about AI, learning about machine learning, growing their career? And in the course of that process, how can someone know if they're making progress, and if they're doing the right things to be advancing at the right speed?

You need to identify a dataset, identify a task, where you can see progress quickly. And if sometimes you don't get success on that, try to understand why, and try to talk to contacts, more senior researchers and senior engineers, and figure out what went wrong. If I were to start a career in AI today, as a researcher, let's say, then maybe the starting point would be to try to replicate some of my favorite AI research papers. Read the paper, try to understand it, reproduce the result, compare against their GitHub, and get into the habit of implementing quickly, so that my code can reproduce the result quickly. And then, if I make progress in that easy phase, I can start thinking about how I could contribute, and how I could get better at developing my own ideas. In other words, I would start with something easier first. It's like learning to swim: you don't want to go to the ocean to learn to swim. You want to learn to swim in a pool first, doing something quite easy that you're happy doing, and then slowly move yourself towards more and more difficult tasks. For example, creating your own new idea and writing your own paper, that's difficult. But reproducing an existing paper and getting a result on MNIST or CIFAR, that seems easy enough to do. So maybe start with that first. Start with something simple.

Yeah. Great advice.
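As one concrete version of this "learn in the pool first" exercise, here is a minimal sketch of the kind of MNIST baseline Quoc is suggesting you reproduce, assuming PyTorch and torchvision are installed (the dataset downloads on first run); a small MLP like this typically lands around 97 to 98 percent test accuracy, a number that's easy to check against published results:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

tfm = transforms.ToTensor()
train = datasets.MNIST("data", train=True, download=True, transform=tfm)
test = datasets.MNIST("data", train=False, download=True, transform=tfm)

# A deliberately simple baseline: flatten the image, one hidden layer.
model = nn.Sequential(nn.Flatten(), nn.Linear(784, 256), nn.ReLU(),
                      nn.Linear(256, 10))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(3):
    for x, y in DataLoader(train, batch_size=128, shuffle=True):
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()

with torch.no_grad():
    correct = sum((model(x).argmax(1) == y).sum().item()
                  for x, y in DataLoader(test, batch_size=512))
print(f"test accuracy: {correct / len(test):.4f}")
```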
Kind of like how we use curriculum learning in training neural networks as well. Thanks, Quoc, this has been great. Before we wrap up, do you have any final words or any final thoughts for the viewers watching this?

Oh, I can say some final words. I accepted the interview because I wanted to see Andrew. And I want to say thank you; Andrew is a spectacular educator, I have to say. I remember in 2007, I was already a researcher, I had written papers and so on, and I came to Stanford to start my PhD. I came to a lecture by you, the CS229 lecture, and I literally had goosebumps, because your lecture was so good. It was like another level. I learned a lot from your class, and from your original thinking as well. Thank you for your contribution to educating the community about deep learning. I think that's going to be very, very impactful for the field as well, because we need more people to move the field forward.

Well, thanks, Quoc. I wasn't expecting you to say that. That was very, very kind of you; it means a lot. And thanks for being with me on this interview series. It's wonderful to see all the tremendous contributions you're making to deep learning. This is great chatting, as always.

Yeah. Thank you.

For more interviews with NLP thought leaders, check out the deeplearning.ai YouTube channel or enroll in the NLP specialization on Coursera.