Welcome to this interview series. To kick it off, I'm delighted to have with us today Chris Manning. Chris is, I believe, the most highly cited NLP researcher in the world. He is a professor of computer science and linguistics at Stanford University. He's also the director of the Stanford AI Lab, which is a role that I had previously held as well. Chris is well known globally as a leader in applying deep learning to natural language processing, and has done well-known research on tree recursive neural networks, sentiment analysis, neural network dependency parsing, the GloVe algorithm, and a lot more. He's taught at Carnegie Mellon, the University of Sydney, and at Stanford University. Welcome, Chris.

Thank you, Andrew. It's great to have a chance to chat.

Even though you and I, Chris, have had the opportunity to work together on many occasions, one thing I've actually never asked you before is: how did you get started working in AI? I know that one of the unusual aspects of your background is that you majored in linguistics and then wound up being a computer science professor doing NLP. But tell us about your arc, getting started in AI.

Sure. So my background, in some sense, isn't as an AI person. That's not really where I began. As an undergrad, I did a major in computer science and math, but I also got very interested in linguistics and did an honors major in linguistics. Coming off of that, to a fair extent, my starting-off point was much more a cognitive science viewpoint: human language seemed fascinating. These teeny little human beings somehow manage to learn it at a time when, in general, their cognitive abilities don't seem that great. So how could human language learning take place? In linguistics, by far the dominant belief in the second half of the 20th century was the thinking of Noam Chomsky — Chomsky was just the eminent person in linguistics, in the same way that, I guess, for the first half of the 20th century, maybe R.A. Fisher was the dominant person in statistics. And Chomsky has had this very strong position that humans cannot possibly just be learning languages from data alone, and that there must be innate machinery in people's brains that allows them to learn languages. That's a big topic. Even back then, this seemed to me kind of just unbelievable, given the evolutionarily extremely recent development of human language. So I was interested in the idea of how you could go about learning languages. And that led me to start to look at machine learning, starting off first at the end of my undergrad, which was the late 1980s. And these days, machine learning is such a big and dominant field. It's so big and dominant that the terms artificial intelligence and machine learning are kind of two-thirds the same thing, because the vast majority of what you see in AI is machine learning. But at that time, it just wasn't like that at all. Machine learning was this sort of very scruffy, on-the-side offshoot of AI that almost no one worked in. So, you know, there was this series of two or three books edited by Jaime Carbonell and Tom Mitchell from CMU, who had started to put together some papers on machine learning. And there were the early decision tree algorithms in AI, like the ID3 algorithm — this was while I was still in Australia — but beyond decision tree learning, you know, machine learning barely existed.
But I was sort of interested in these ideas of how you could go about getting computers to learn. And that was sort of the entree that led me down the path that's turned me into an AI researcher.

And it was a big deal, right? It was maybe not intuitive at the time that you should use data to learn a language, rather than code out by hand a, you know, context-free grammar or something to truly understand the language, which is what people were trying to do back then.

Right. Yeah. So the dominant way in which people did natural language processing was by humans writing rules by hand. I mean, of course, that wasn't only a fact about NLP. That kind of corresponded to the dominant thinking of AI at that time as well. This was the era of knowledge-based systems, where what was seen as needed was to get subject matter experts, whose knowledge the knowledge engineers would encode in knowledge representation systems, and all of this hand engineering would lead us to intelligence.

So even back then, you were an early believer in machine learning for NLP.

Absolutely. Right at the moment, transformer-based architectures have really become the dominant thing in neural networks. And we are sort of going out of order here, so maybe we should get back to them later. But I mean, the interesting thing about transformer architectures is that they're built around an idea of attention. And you can think of attention as giving you sort of a soft tree structure, where you can point from one word to another, and that allows you to build a tree structure. So we've done some really interesting work, especially my PhD student John Hewitt, looking at what transformer models learn when trained on billions of words of a human language. And you can actually show that these models learn all kinds of things about the structure of a language. You know, one of the things they learn is some of these co-reference facts, so that they'll learn that "she" refers back to "Susan" and that "it" is referring to "the bottle" in sentences. But they also actually do learn this hierarchical, context-free-grammar-like structure of languages, just from sequences of words in the text alone, which is actually a really neat result. Because language can be reasonably described as this nested, context-free-grammar-like, tree-like structure, and a sufficiently large neural network, a transformer network, discovers aspects of that just from the data.

And in fact, in the lead-up to the modern views on transformers, your group at Stanford did some of the very influential early work. So I know that for a long time in the pre-deep-learning days, you did a lot of work on statistical machine translation. And then, as deep learning started to make inroads into NLP, you and your PhD student Thang Luong actually published one of the earliest papers on neural machine translation, and with the bilinear attention matrix helped lay some of the foundations of the modern transformer model. Do you want to tell us a bit about that?

Sure. Yeah, so really before starting work in neural networks again — I actually did a teeny bit of neural network work back in the 90s, in the days when Dave Rumelhart was at Stanford. But, you know, really, I barely got into that.
And so in the 2000s decade, certainly everything that I was doing was using probabilistic modeling techniques — putting probabilities over symbolic structures to describe human languages — which was by far the dominant approach in that decade. And one part of that work that I worked on for about a decade was building machine translation models. The dominant models were referred to as statistical phrase-based machine translation in those days. And a lot of techniques were worked out pretty well in that time. So there was a fairly well-worked-out architecture where you have these factorized machine translation models, where part of it was that you had phrase tables, which gave you probabilities of translating a phrase in one language to a phrase in another language. So they did the local parts of translation. And then you were combining that with what's called, in natural language processing, a language model. So "language model" is kind of a term of art that's very dominant in natural language processing. A language model means something that gives you probability distributions over sequences of words in the language. And that's just been a really dominant, powerful idea in NLP, because it lets you tell, whenever you want to put words next to words, what other words are likely or unlikely. So this basic idea of a language model is used in context-sensitive spelling correction — when you have something like, you know, Google correcting your spelling in context beautifully, there's a language model. Speech recognition systems use language models, and these machine translation systems also use language models. So we had architectures and they worked reasonably well. When Google first came out with machine-learning-based, learned-from-data machine translation systems, they were using these statistical phrase-based machine translation systems. You know, if I back up for just a teeny little bit of story: when Google first launched machine translation, they licensed a very traditional, old rule-based machine translation system. That was the system that was originally developed by this company SysTran, whose roots go back really to the 1950s, to the earliest explorations of machine translation. But they had that for a couple of years, and then, having seen all the advances that were being made in probabilistic models for machine translation, they switched over to that, and then things got much better.

And, you know — and I remember Franz Och was the — wait, am I saying his name right? Yeah, Franz Och, yeah. And I remember Franz Och was really a thought leader in helping Google scale to training the traditional models on tons more data, and this dramatically improved the performance of Google Translate.

Yeah, absolutely. So at that point, Franz Och was one of the, maybe even the, leading person doing statistical phrase-based MT models. And he went to Google and led a team, so that Google then had the leading large-scale implementation of statistical phrase-based MT models. And, you know, they actually worked pretty reasonably. They already achieved the goal that you could just feed in any web page in dozens of languages and get something that was two-thirds comprehensible. You could work out basically what it was saying, about what topic. And that was reasonable. But, you know, that was sort of great for 2007 to 2010.
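For reference, the "language model" idea described just above is standardly written as a probability over a word sequence, factored word by word with the chain rule:

```latex
P(w_1, \ldots, w_n) \;=\; \prod_{i=1}^{n} P\!\left(w_i \mid w_1, \ldots, w_{i-1}\right)
```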
But then in 2010 to 2014 — which was sort of the same period when Andrew and I were doing those tree recursive neural networks we were just talking about — statistical phrase-based machine translation sort of stalled. There weren't really very good ideas for making further progress. A little bit of progress was being made by just throwing more data in. More data helps; that's still true in modern machine learning. But the models didn't have enough capacity for it to help a lot. An idea that people, including me, were working on a huge amount in those years was saying, well, surely the solution is to make more use of the grammatical structure of human languages in our machine translation systems. So really the dominant research area was trying to do syntax-based machine translation systems. That seemed a good idea, but, you know, it barely ever worked. Basically the result was that for some language pairs it didn't work at all — it just wasn't better than the statistical phrase-based machine translation systems — whereas for some other language pairs that had more different grammatical structures — English–Chinese machine translation was a good example — it definitely did help a bit. You could show some real gains.

But the solution, ironically, turned out to be to pay less attention to the syntax and more attention to the data.

Correct. Yeah. So that work kind of got blown out of the water when people started exploring using neural methods for machine translation. And this was really — I was going to say it was the first big success of neural methods in NLP. Whether that's true or not depends on whether you count speech as part of NLP, because really speech recognition was the first huge success of neural network methods applied to human language problems. But for text-based work, really, the first thing that just sort of knocked you out was building neural machine translation systems. And it was a very successful domain because it's a domain where there's lots of data available — there's a lot of text in pairs of languages on which you could start to train big neural network models. And it was first done by Ilya Sutskever and a couple of colleagues at Google. For modeling a sequence of anything — like a sequence of words or a sequence of DNA — the dominant model was recurrent neural networks, which are just models that work simply on sequences and sort of remember what they've seen before in a limited way. It's kind of the continuous neural version of a hidden Markov model. And so essentially what they showed — and this kind of undermined all that work on syntax-based models — is that you could make no use of the structure of human languages at all and just build very large recurrent neural networks that were then deep. Right — until that point, most of the neural network modeling in NLP was such that, you know, if we had two layers, we called it deep, and if we had three or four layers, we were really pushing it, whereas they were immediately pushing it out to eight-layer-deep recurrent neural networks. And this is where you start getting into systems issues of needing to run it on a machine with eight GPUs, which is a trend that's continued and that we can maybe talk more about. So they showed that a much larger neural network, just training big sequence models — two sequence models, one of which is an encoder that encodes the source language.
Then the other is a generator that generates a sequence of words in the target language — and that could already give you a pretty good machine translation system. Not quite as good as the state of the art at that point, but, you know, close enough to the state of the art to seem tantalizing, since all they were doing was plugging two neural networks together — the kind of thing where, if you're not counting the library code for the recurrent neural net units (in particular, LSTM, long short-term memory, units were very influential in allowing this work to work), you only needed your 500 lines of Python code around the neural network library, and you could have an almost state-of-the-art machine translation system. That seemed just super intriguing. But there was also something crude and missing there. And so then very quickly after that, Kyunghyun Cho, who was working with Yoshua Bengio in Montreal — well, actually, not only Kyunghyun Cho; also Dzmitry Bahdanau, I should mention, who is the first author on the paper and was then a more junior student in Montreal. So it's really actually, I think, Dima who developed the idea that you could build an attention-based model. And the idea of an attention-based model is that at any point in the sequence, you can calculate a connection to other words in perhaps the same, perhaps a different sequence. And then, using that attention, you can calculate a new vector to influence what happens next. So in particular, in this machine translation context, you'll have started your translation, and your translation is starting to say "the pilot", and then you'd calculate attention back into the source sentence in the other language and work out essentially what words in the source you want to be translating next, based on what you've translated so far. And so rather than having to remember the entirety of the source sentence in your recurrent neural network state, you could do what a human translator actually does and dynamically look back at the source sentence and work out what to translate next. And so this idea of attention has just been transformative. It's also increasingly being used in vision systems, systems for knowledge graphs, and other areas of work on neural networks.

And then shortly after the Cho and Bahdanau paper, you and Thang wrote a paper on bilinear attention. How did that come about?

Right. So in earlier work, really starting with another student, we had looked at this idea of neural tensor networks, where what we wanted to do was be able to combine vectors together and have them influence each other and produce another vector. And we'd done that by putting a tensor — which is the multidimensional generalization of matrices — in between them. So, you know, that idea was sort of going on in other pieces of work in my group at the time. But here we just wanted to get an attention score. And so in Bahdanau and Cho's work, what they had done is say, OK, we want an attention score between these two vectors; let's feed them through a little neural net, a little multilayer perceptron, and calculate an attention score. Whereas it seemed to me that, well, wait, no, I could just do this simpler thing of bilinear attention, where here are these two vectors, and if I put a matrix between them and I multiply vector times matrix times vector, I just get a number out. And so that is bilinear attention, now sometimes referred to as multiplicative attention.
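As an illustration of the two scoring functions just described, here is a minimal NumPy sketch of additive (MLP-based) attention versus bilinear (multiplicative) attention. The dimensions and variable names are made up for the example, not taken from the interview.

```python
import numpy as np

d = 4                      # hidden size (illustrative)
rng = np.random.default_rng(0)

h_t = rng.normal(size=d)   # decoder state at the current target position
h_s = rng.normal(size=d)   # encoder state for one source position

# Additive (Bahdanau-style) attention: feed both vectors through a small MLP.
W1 = rng.normal(size=(d, d))
W2 = rng.normal(size=(d, d))
v = rng.normal(size=d)
additive_score = v @ np.tanh(W1 @ h_t + W2 @ h_s)

# Bilinear / multiplicative (Luong-style) attention: vector-matrix-vector product.
W_a = rng.normal(size=(d, d))
bilinear_score = h_t @ W_a @ h_s

# In a full model these scores are computed for every source position
# and passed through a softmax to give attention weights.
```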
And it's a simpler and more directly interpretable idea of attention, because in some sense the simplest idea of attention is to say, well, you have two vectors — just dot product them together and you get a score for similarity. But that's too rigid, because you sort of want to say, well, maybe I only want to pay attention to parts of the vector, and maybe I want to know whether the top part of one vector is similar to the bottom part of the other vector. And so by sticking a matrix into the middle, you can then modulate the similarity calculation. It's a natural measure of similarity and very easily learnable by neural networks. And in some sense — well, how ideas develop and spread is always complex; there are a lot of ideas in the air at any one time — but in some sense, what's become dominant in the modern work with transformer-based models essentially does build off that notion, but with an extra idea added on top of it.

It's a similar idea, where instead of having a giant matrix in the middle, which requires a lot of parameters, you have a low-rank approximation to the matrix you were using, and then that gets very close to the modern transformer model.

Right, exactly. So in our initial work, we just had a full-rank matrix in the middle. But the flaw of that is that a full-rank matrix has a lot of parameters. And the obvious way to have fewer parameters is to say, well, I can regard that matrix as being the product of two low-rank matrices. And then, once you have that idea, rather than multiplying them together, you can say, well, I can apply those two low-rank matrices to the vectors on each side, which is computationally more efficient. And that's exactly what modern transformer models do: you take your two vectors, you multiply each by a low-rank matrix, and then you take the dot product between those — which is exactly equivalent to saying I form the matrix by multiplying two low-rank matrices and then do my vector-matrix-vector product, but done more efficiently.

In fact, you know, not to jump around too much, I feel like this is not the first time in your career that you really advanced the field by making observations about matrix multiplications. If I look at the work that your team did on GloVe, the word embeddings, I feel like the heart of that idea was also simplifying what was a relatively complex, you know, set of neural-network-like stuff into just a set of matrix multiplications. Do you want to talk about that? I still find GloVe to this day a very elegant paper, because it really simplified what was previously a very complicated set of ideas for learning word embeddings into a simpler, you know, take-the-inner-product-between-the-word-embeddings kind of formulation.

Thank you. Yeah, I do think that, in some sense, one of the GloVe paper's main contributions was, you know, giving better understanding. I mean, it's not that it worked better than the other methods that had recently been developed, but it was interesting to try and think about what's going on here and how these methods relate to other things that had been explored. So our general topic here is word vectors — coming up with a vector, a real-valued vector representation, that gives you the meaning of words — which now, in your NLP deep learning courses, is essentially often the first idea you really see, because it's a very useful notion and a fairly simple one.
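To make the low-rank refactoring described a moment ago concrete, here is a small numerical check (a sketch, with made-up shapes): projecting the two vectors with small matrices and dotting the results gives exactly the same score as the full bilinear product with the matrix formed from those projections — essentially the query/key projections used in transformers.

```python
import numpy as np

d, k = 8, 2                  # model dimension and low rank (illustrative)
rng = np.random.default_rng(1)

x = rng.normal(size=d)       # "query-side" vector
y = rng.normal(size=d)       # "key-side" vector
U = rng.normal(size=(d, k))  # low-rank projection for x
V = rng.normal(size=(d, k))  # low-rank projection for y

# Full bilinear score with the rank-k matrix W = U V^T ...
W = U @ V.T
score_full = x @ W @ y

# ... equals the dot product of the two projected vectors,
# which is cheaper: two (d x k) projections instead of a (d x d) matrix.
score_factored = (x @ U) @ (y @ V)

assert np.allclose(score_full, score_factored)
```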
And coming up with word vectors had been done quite successfully by a couple of pieces of work in the few years around 2010 to 2013. But it had been done in sort of very mechanistic neural net ways: here's the architecture and the algorithm; run it — and run it, in those days, often for weeks, since we didn't yet have very fast parallel computers — and at the end out will pop great word vectors. And we were interested — so this was work with postdoc Jeffrey Pennington — we were interested in actually trying to understand better what was happening in terms of the math of these models. And so, you know, one thing that we were intrigued by was — I mean, actually there was an older tradition, the LSA or latent semantic analysis tradition, of having vector representations of word meaning, which exploited classical linear algebra. So the latent semantic analysis models, in linear algebra terms, were neither more nor less than the singular value decomposition — using the singular value decomposition on word-word co-occurrence count matrices, and then reducing rank by getting rid of small singular values.

That's fascinating. And one trend that overlays all of this work that you've observed and participated in for the last many, many years is the scaling of NLP models. It seems like the state-of-the-art NLP models are getting larger and larger, you know, like GPT-3, but really many models in the sequence. For how long do you think this will continue? And do you think someday, you know, if ever, we'll go back to building smaller models?

In 2018, BERT came out, and it showed that you could do fantastically well on a bunch of tasks — including question answering, natural language inference, text classification, named entity recognition, parsing — making use of this pre-trained large language model, where you are training a large transformer just on a few billion words of text to predict gapped words in sentences. But just doing that simple task gave you these really good language representations, which could then be used in a very simple manner, with something like a softmax classifier, to provide great solutions for downstream NLP tasks. So, you know, that was an amazing success, right? And it bore out the idea of representation learning that many of us had been talking about for a decade — we were saying that, you know, neural networks, it's about representation learning, learning these useful intermediate representations — and this was really showing that this was really working for building representations of language that could then be very easily applied to higher-level natural language understanding downstream tasks. But BERT already used a huge amount of data and compute. It was trained on billions of words of language, and it was trained on, you know, large numbers of computers for quite a long time. And since BERT, there's been a lot of further progress, but it's essentially just come from scaling up the compute more and more and more. I mean, as always, I'm leaving a few things out — there have also been a couple of new ideas, like relative attention, which have improved things. But to a first approximation, what's really been driving the gains is just throwing more and more compute at the problem, running even bigger models on even more data. Yeah. So, Andrew, you often use the slogan, AI is the new electricity.
In some of the talks I've given recently discussing this trend of bigger and bigger models, I've been turning it around and saying that electricity is the new AI. Because if you look at what's been happening, people are now training models that are not just 10 times or 100 times bigger than the BERT model — they're training models that are 10,000, 100,000 times bigger than the BERT model. And the computational and energy demands to train those models are going up accordingly. So, you know, that's certainly pushed progress. I think that trend can't possibly continue much further. I mean, you know, part of it is just that we're running out of text and we're running out of computers to train ever bigger models on.

Although going from BERT to GPT-3, if I remember, there was massive scaling of compute and, you know, modest scaling of data. So maybe we have enough text data; we just need, I don't know, another 100 or 1,000 or 10,000 times faster computers. I don't know.

So you're right that the scaling of compute was vastly bigger than the scaling of data. But nevertheless, the amount of data that's now being used is actually, you know, quite substantial. I mean, there's certainly more data, but, you know, they actually are using a substantial quantity of quality data. But, yeah, I buy the point that to the extent that clever systems people can give us three orders of magnitude more power from our GPUs, even without more text data, these models are going to be better and probably will be made better in the coming years. But, you know, I don't ultimately believe that's the path to artificial intelligence, or the interesting path at this point for further improving natural language processing.

I mean, there's no doubt. So let me ask maybe a slightly controversial question. You know, sometime several weeks ago, one of our mutual friends made a comment, you know, about this direction that said, oh, maybe scale is the path to AGI. You know that conversation where one of our mutual friends made that comment. What are your thoughts on that?

So I think there's a place that that idea is coming from. The really cool thing about the GPT-3 model that was recently released by OpenAI — the discovery that then really motivated their subsequent work — was that these really humongous language models can actually achieve a generality, where they can be used for all kinds of tasks without actually having to train them to do any task. So, you know, conventionally, if we wanted to do different tasks with a neural network, we'd just take data for each task and train a model for that task. With these large pre-trained language models like BERT, we sort of got to a halfway house, where we had as a baseline very good pre-trained language representations, but still we then fine-tuned the model for every different task. So we'd fine-tune it as a question-answering model, and then we'd start again and fine-tune it for something else, like summarization. And what they discovered with GPT-3 is that actually you didn't need to do that anymore — instead, you could just hint to the model what you'd like it to do by giving it a couple of examples. So you'd say, okay, I'm interested in, you know, translation. So: here's a sentence, and what I'd like you to produce is this sentence, which is a translation of it into Spanish.
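A few-shot prompt of the kind being described might look something like the following; the format and example sentences are purely illustrative, not taken from the interview.

```python
# A hypothetical few-shot prompt: a couple of demonstrations, then a new input.
prompt = """English: The weather is nice today.
Spanish: Hoy hace buen tiempo.

English: Where is the train station?
Spanish: ¿Dónde está la estación de tren?

English: I would like a cup of coffee.
Spanish:"""
# A large language model is then asked to continue the text; from the
# pattern alone it produces the missing translation.
```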
And if you gave it a couple of examples of what you wanted it to do, the model would get the idea, and then you could give it more sentences and it would translate them. But it didn't only do translation. You could give it some questions and tell it you wanted answers, and it would give answers to other questions. And the sort of mind-blowing thing about GPT-3 is that, to a first approximation, you can give it all kinds of tasks — even, you know, sort of weird ones that you wouldn't really expect it to know how to do — and it will just do them, right? So sometimes the kind of things that linguists are interested in are sort of weird sentence manipulations. You know, if I give you a sentence, can you turn it into a question? Or can you turn it into a relative clause that modifies the person? And, you know, any of those things you can just give as a little task to GPT-3, and it will do it. So that's the sense in which it actually offered a vision of general intelligence, rather than what's been the mainstay of AI over the last decade, which was these very narrow AIs where you just trained a system for one particular thing: here's a movie recommender, here's an object recognizer in images, here's a natural language question answerer. And so it's intriguing because of its general intelligence, but I don't actually believe it represents a path to the goals of artificial general intelligence, which is to have the kind of flexible cognitive agent that human beings are, because it is just this humongous pre-trained model, trained on an amount of data that is massively more than a little human being sees before they're competent at human language. And, you know, to have something that's like a human being, you have to have something that can flexibly learn different tasks as it's exposed to them. The big GPT-3 model is no longer learning at all. You know, it's been exposed to so much data at training time that it can do all sorts of different tasks by effectively pattern matching to something it saw somewhere along the line. You know, there's actually an area that there's starting to be a lot of work on right at the moment in deep learning — the idea of meta-learning: how do you build systems that are good at learning how to learn new tasks? And I think that's actually closer to the kind of intelligence we have to be seeking if we're aiming at artificial general intelligence.

Chris, in your career, you've mentored tons of very successful, you know, undergrad students, master's students, Ph.D. students, and today there are many people applying to be a student in your lab. So I'm curious: in your view, what makes a good researcher, you know, in your lab at Stanford? What characteristics do you like to see in people applying to be a member of your lab?

So there are definitely people that come from different backgrounds. You know, I certainly do like some of my students to know something about, and really be interested in, human languages and their structure. So I certainly like to have some people who have backgrounds in human languages and linguistics, but that's certainly not all of them. I've had other people who are AI, machine learning, or other quantitative students, who have no background at all apart from being, like all human beings, native speakers of some language or another. I think, you know, the dominant thing is that you have to have creativity and scientific thinking, right?
I mean, there are lots of people around who can read papers and do whatever's in them. The secret of being successful is that you can think a bit differently and say, wait a minute, these people are doing things like this — in fact, almost everyone's doing things like this — but maybe there's another way it could be approached that is better. And really, I think that's the core of scientific training: aiming to break things, in the sense of finding why these ideas aren't actually a great way to do things, or thinking of different approaches which might work better.

And how does a student become creative?

That's a good question. Yeah. I believe that you can make progress on this. And I think it's a matter of taking a mindset of being critical when you read things and starting to explore around and do different things. So rather than just reading papers and thinking, oh, this is a good way to do things, I think you really want to concentrate on being awake as you read and trying to think, well, what are they assuming? Why are they doing it this way rather than some other way? And, well, a lot of times your experiments will fail, but, you know, I think when they fail, you're already learning more than if you're simply implementing the algorithm that's in the paper, because you don't really learn very much by doing that. And, you know, occasionally you might see something that half works, and then that might give you an idea of, well, maybe there's something here, and if I modified it or pushed it off in a somewhat different direction from what people do now, then it might work interestingly better. And I think the practice of exploring out and thinking in those directions is the way that you build these skills of coming up with something new.

Cool. Yeah. And I find a lot of the most creative people read incredibly widely and wind up making strange connections that someone else wouldn't have made — between linguistics and computer science, or between some other weird thing and deep learning.

Yeah, I think that's also a really good strategy.

Thanks. This has been great. And before we wrap up, I think a lot of the learners watching this video are looking to build a career in AI, looking to build a career in NLP. Do you have any final words of advice for someone watching this who's looking to advance their career in AI?

Sure. So, I mean, it's a great time for you to do this, right? There are just huge opportunities all over industry, and also in academia, for people with skills in AI, machine learning, and natural language processing. You'll be greatly in demand, so this is a great thing to do. But I think, nevertheless, you want to think thoughtfully about how to develop your career. And this is maybe connected with what, Andrew, you were just saying about reading widely, because the reality is that the world, and the skills that are useful, develop quickly. Right? As we've talked about earlier in this hour, you know, really, when I started off, I was being taught rule-based NLP. And then there started to be machine learning models and probabilistic models. There were other things that went by in the meantime, like support vector machines and large-margin models. And then there was the return of neural networks to dominance — but neural networks are actually an idea that's been around since the 1950s. So at the moment, deep learning and neural networks seem so dominant.
It seems like nothing else can possibly be relevant to know — the only thing I should learn is how to be state of the art at building neural network models. But, you know, that's not how the world has been in the past. You know, around 2008, that's what we thought of probabilistic AI models: they were obviously right and dominant, and they were the thing to know. But, you know, really, progress was made by taking old ideas and rediscovering them, putting them together with some stuff that had been learned in the meantime, and further raising the bar on where we can get to. So it's just sure to be the case that lots of different ideas are going to be useful in various ways in the future. And so, to be successful long term, it both helps to have a richness of background — it definitely helps to have some breadth across areas of computer science, areas of math, statistics, linguistics, different ideas — but you also have to accept that this is an area where you have to adapt and move on to new ideas. I mean, you know, I think one of the ways in which I've been very successful is not by, you know, inventing a whole new field single-handedly, but by being able to see where promising ideas were emerging, starting to think about those, and moving to do work on them fairly quickly. And so that kind of adapting — keeping your antennae up for interesting ideas that are starting to appear, and being adaptable and willing to explore and make use of new ideas — that's the way that you keep your thinking vibrant.

Yeah. And AI, machine learning, and NLP evolve so fast — I think all of us just have to keep learning.

Absolutely.

Thanks, Chris. This has been great. And it's really interesting to hear the story of how you wound up doing all this work over these many years. And I find it inspiring to think that maybe there's someone watching this who will, you know, follow in some of your footsteps and themselves maybe end up a professor somewhere or doing great research, much as you have. So thank you.

Thank you, Andrew. It's been fun chatting. A final fact I can tell people listening is that, you know, for a whole bunch of years, Andrew and I were actually office mates — we had offices next door to each other on the corridor. And so once upon a time, I often had the opportunity to see Andrew in the corridor and pass a few thoughts. I don't get those chances as often anymore, so it's fun getting a chance to chat.

Thanks, Chris. Those were good days. I remember we shared a wall, and I'd hit my wall in my office and you were on the other side. Those were great times. Thanks, Chris. OK, see you.

For more interviews with NLP thought leaders, check out the deeplearning.ai YouTube channel or enroll in the NLP specialization on Coursera.