Hi, everyone. I'm delighted to have with us here today Oren Etzioni, who is one of the best-known figures in NLP. He has been the CEO of the Allen Institute for Artificial Intelligence since its inception in 2014. He's also a professor in the University of Washington's Computer Science Department and a venture partner at Madrona Venture Group. Oren has received multiple awards, including Seattle's Geek of the Year, which I thought was cool, and has been a founder or co-founder of several companies. Really glad to have you with us, Oren.

Thank you, Andrew. It's a pleasure.

So I've known you for many years, and even when I was a student at Carnegie Mellon, I remember hearing about some of your work on explanation-based learning from Tom Mitchell. So something I've never really asked you before is, today you're a well-known researcher, but how did you get started in AI? Tell us about your personal story.

Well, I really became fascinated with the field in high school when I read the book Gödel, Escher, Bach by Douglas Hofstadter, which many of us read at some point. More than anything, what the book gave me an appreciation for is that asking what the nature of intelligence is, and how we build an intelligent machine, one that has human-like capabilities, is really one of the most fundamental questions in all of science, like what is the origin of the universe, or what is the basis of matter. And so I became fascinated with it at the age of 18 or so.

And so what happened? You're an 18-year-old reading Gödel, Escher, Bach. I remember my father trying to push that book on me as well when I was a teenager, but I think unlike you, I did not read it as a teenager. But then what?

Well, I did two things. One is the summer before college, I started studying Lisp, the ancient list-processing programming language, which of course lent some ideas to Java and Python. And I found it just endlessly fun to hack Lisp code. 
And then when I went to college, I went to Harvard, I was intent on studying computer science, because that seemed like the path towards AI.

I think it's inspiring to think that maybe today there's a high school student somewhere, or maybe the parents of a teenager watching this, thinking that if their son or daughter is reading Gödel, Escher, Bach or picking up an interest in AI as a teenager, maybe they could someday have a career a bit like yours. It's a tough act to follow, but I think it's something to think about.

Well, Andrew, you're very generous. The other thing that's happened, which of course is drawing a lot of people into the field and which I had no clue about, is the fact that our explorations of these algorithms and these methods have led to a very powerful set of technologies, which of course you've been very intimately involved in. So deep learning, which is now revolutionizing the field in so many ways, was not something we anticipated back then. And we didn't understand that asking this fundamental intellectual question could lead to so much commercial success as well.

Yeah, I've found that a lot of great scientists were driven by fundamental questions. It sounds like yours was, what is the nature of intelligence? And that type of question is enough to drive someone for an entire career. I remember also, back in the early days, you were one of the pioneers in what was called open information extraction from the web. Tell me more about what that was like, working on information extraction back then.

Sure. So information extraction basically means the mapping from a sentence to a more structured piece of information. Say we have the sentence "Google acquired YouTube"; we can map it to a database tuple that says acquisition, Google, YouTube. Now, when I got into the field, information extraction was very narrowly focused. 
They would look at M&A events, or they would look at terrorist events, and they would try to extract the particular semantics around specific events. And it occurred to me that maybe we could do it in a much more open-ended way. One of our mottos was: no sentence left behind. The idea was that maybe we could extract information from any sentence on the web, and thereby map the huge amount of information that's available in the web corpus, literally billions of sentences, into a very powerful and comprehensive knowledge base.

So we set off on doing that, and the first thing we had to do was vastly generalize the techniques, because information extraction back then was already a machine learning technique, but it required training examples specific to a particular relation or predicate, like acquisition or seminar location. And there are so many; in fact, that's one of the things we studied. There are potentially hundreds of thousands, if not more, different predicates being expressed in a natural language like English or Chinese, and so that would require millions and millions of labeled training examples. We attempted to solve that problem by creating a new technique that had much more of an unsupervised flavor.

Cool. So rather than hand-coding that the word "acquired" is one of the words or predicates that indicates an acquisition, recognizing that there are lots of ways to say X acquired Y, and learning that from data, was a huge breakthrough.

Well, thanks, Andrew. Thanks for remembering; the field moves forward so quickly. One of the things that I found particularly inspiring at the time was the fact that we realized that there are certain linguistic invariants, certain regularities, where whatever the sentence was and whatever the topic was, there were certain regular ways in which people would express information, activities, and so on. For example, the simplest is verbs, whether it's acquired or graduated or married. 
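To make the sentence-to-tuple mapping concrete, here is a toy sketch in Python. It only illustrates the idea: the real open IE systems learned their extractors from data over parsed text, rather than using a hand-written verb list like this one.

```python
import re

# Toy open information extraction: map "Arg1 <verb> Arg2" sentences
# to (predicate, arg1, arg2) tuples. The hand-written verb list here
# is a stand-in for what the actual systems learned from data.
PATTERN = re.compile(
    r"^(?P<arg1>[A-Z][\w ]*?)\s+"
    r"(?P<pred>acquired|married|graduated from)\s+"
    r"(?P<arg2>[A-Z][\w ]*?)\.?$"
)

def extract(sentence):
    """Return a (predicate, arg1, arg2) tuple, or None if no match."""
    m = PATTERN.match(sentence.strip())
    if m is None:
        return None
    return (m.group("pred"), m.group("arg1"), m.group("arg2"))

print(extract("Google acquired YouTube."))  # ('acquired', 'Google', 'YouTube')
print(extract("Joe married Betty."))        # ('married', 'Joe', 'Betty')
```

The hard part, which the conversation turns to next, is that a fixed list of predicates never covers the hundreds of thousands of relations people actually express.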
Often the verbs were a very strong indication of the predicate involved and the arguments of the verb. So, you know, "Joe married Betty": now we know a lot about what's going on here. So basically what we were able to do, and I consider this a pretty fundamental observation about natural language, is realize that sentences express relationships in certain stereotyped ways. And since we did that work, the result has actually been replicated in lots of different languages: Spanish and Arabic and Korean. In many languages, they found that these kinds of regular ways of expressing relationships are available.

So if something happened, the number of ways to refer to an acquisition or graduation or marriage is quite narrow, so there's a very strong signal for the learning algorithm.

Exactly.

And I guess both you personally and the field of NLP have come a long way since then, and today I regularly read about exciting projects that the Allen Institute for AI is doing. One of the projects that I've heard you speak about in other contexts that I thought was really cool was the Semantic Scholar project. Can you say a bit about that?

I would love to. So our mission as an institute, a non-profit fully funded by the late Paul Allen, is AI for the common good. And we asked ourselves, how do we use AI broadly, and NLP in particular, to make the world a better place? And one of the things that came up on our radar was, can we use it to help scientists, and more generally the informed public, get access to scientific papers? There's a kind of Moore's law of scientific publication: the number of papers seems to be doubling every few years, growing very rapidly. And even diligent folks like yourself can't have read everything that they want to read. In fact, I think there's a kind of limit on the number of papers we're going to read in our lifetime. 
So I thought to myself, okay, AI to the rescue.

I remember when the ICML conference, or NeurIPS under its previous name at the time, was small enough that I would bring home the paper proceedings and read every single paper's title and abstract, and a good fraction of the papers. And clearly we're well past that phase of AI today. So it's great to have tools like Semantic Scholar.

Exactly. And Semantic Scholar's motto is: cut through the clutter. Our idea is to use AI in lots of different ways to find the papers that you want to read. For example, we automatically generate what we call extreme summaries, or simply put, TLDRs. So instead of having to read all those abstracts, maybe we can give you a one-line summary, and if you like that, you might delve further into the paper. Or we use computer vision techniques to automatically extract the figures. That may seem straightforward to a person, but remember, these are PDF files, which were set up to display information, not necessarily to tell you where the figures and tables are. So we automatically extract information that, again, will hopefully tell you at a glance, maybe even on a mobile device like this one, is this a paper I want to read? If we can save you the time of scouring through papers to find the ones you want to read, that's a savings. And even when you're reading them, we're continuing to look into ways to make that process more efficient.

So for someone watching this video who is feeling the glut, really the wonderful glut, of deep learning or NLP papers, would you recommend this to them?

Very much so. It's a free service at semanticscholar.org, so I don't profit from it, and we'd love people to use it. We think it has a lot of powerful features that would help both expert researchers and people just getting started to get a sense of the field, to find the key authors, the key papers, et cetera. 
One interesting story that I remember seeing is that with the rise of COVID-19, a society-wide tragedy, many researchers started publishing more and more papers on this horrible virus. And Semantic Scholar was involved in helping sort out this, fortunately or unfortunately, rapidly growing literature. What was the story behind that?

It's actually quite a dramatic story. Early in March 2020, when awareness was still growing about the virus, the White House, through a colleague, reached out to us because they knew we had tools for rapidly processing collections of papers. And they asked us to put together the collection of all the relevant papers, both published and preprints at the time, and make it available to AI systems, make it available in a machine-readable form so that NLP systems, information retrieval, search engines, and others could make use of it. With some help, the White House's help, we were quickly able to form a coalition that included Chan Zuckerberg and Microsoft and colleagues from Georgetown, publishers, and more, and to create a collection that's now more than 200,000 papers, updated daily, that attempts to capture exactly, as you said, this rapidly growing literature about the virus, and then be able to answer questions about it much more rapidly than ever before. Kaggle was involved, creating a set of competitions that became their most popular, and a lot of clinical questions and discoveries have been addressed using the literature, accessed through our open dataset, which was called CORD-19, for COVID-19 Open Research Dataset. So again, thank you for bringing that up. It's a great example of how Semantic Scholar, and AI in general, is trying to help make the world a better place.

In The Batch, we covered some of this work several times, so I was actually really excited to see your team doing this. And really, thank you for helping fight COVID-19 for all of us.

Of course. 
So Oren, you've been a successful serial entrepreneur, and you are at the cutting edge of NLP technologies. What advice or recommendations would you give to someone who's looking to work on, or to launch, a startup in NLP?

Well, one thing that I think about, based on my background launching AI-based companies, is: where is your data going to come from? What I like to call the dirty little secret of big data isn't just that you need lots of data, but often that you need lots of labels. You need some way of taking a data item, let's say a credit card transaction, and saying this transaction is fraudulent, this transaction is valid. And often people have great ideas for companies or for products, but they haven't thought through where their data is going to come from and how it's going to get labeled. So there's a series of questions I always like to ask people, but one relatively unusual one is: tell me about your data set. Where's your data going to come from? Where are your labels going to come from?

That intuition came from probably my most successful company, Farecast, which predicted airfare prices and their fluctuations over time, and was ultimately sold to Microsoft. And there, what was really cool is that at the peak, we had a trillion labeled data points. So you might ask, and this is a company that we formed in 2003 and that was acquired in 2008, so back then a trillion data points was really quite a lot: how do you get a trillion labels? Mechanical Turk didn't even exist back then, and that's a lot of labels. Well, it turned out that because it's temporal, sequential data about prices, if you predict, say, on December 1st, 2003, that the price of a particular flight will go up in a week, all you have to do is wait a week and see whether your prediction came true. So just the passage of time automatically labels your data, and that observation turned out to be incredibly powerful. 
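The waiting-for-the-label idea can be sketched in a few lines of Python. This is only an illustration of the principle; Farecast's actual features and pipeline are not public, so the details here are assumptions.

```python
def self_label(prices, horizon):
    # Once `horizon` days have passed, each earlier day t gets a free
    # label: did the price end up higher at day t + horizon?
    return [
        (t, prices[t], prices[t + horizon] > prices[t])
        for t in range(len(prices) - horizon)
    ]

# Daily fares for one flight; with horizon=2, days 0..2 label themselves.
fares = [310, 295, 330, 340, 325]
print(self_label(fares, horizon=2))
# [(0, 310, True), (1, 295, True), (2, 330, False)]
```

No human annotator is involved: the label for every prediction simply arrives with the next batch of observed prices.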
It allowed us to label a trillion data points, and with a trillion data points, we were able to generate some very strong predictions.

That's very cool. And I guess the number of labels you automatically generate grows quadratically in time, right? Because at each moment, you can predict a lot of future moments. So a trillion is a big number, but especially with time series, it makes sense that you could collect this giant data set automatically.

That's exactly right. Another thing that I really find fascinating is that there's actually a connection between this and some of the success that we're seeing in NLP, because even now that Mechanical Turk exists, the number of words or sentences we can label is nowhere near enough compared to the appetite of models like ELMo or BERT or RoBERTa, and most recently GPT-3, the succession of language models. But again, what they use is the inherent sequential nature of language. To grossly oversimplify, they're effectively saying: I'm going to predict the probability that the next word is "forest", or the next word is "is". Well, how do you tell whether that prediction was correct? All you have to do is look at the next word. And of course, it doesn't always have to be literally the next word: basically, you take a sentence, you mask out some words in the sentence, you predict what those words will be, and you check your prediction. So the wild thing about a natural language corpus is that in a very similar sense, it's also self-labeling. These models that we're now using with great success, and have become so enamored with, have that property where the data can label itself.

Some of the most exciting material in the NLP specialization is Younes and Łukasz talking about techniques like word embeddings, language models, and transformer models that take advantage of this effect. It's been a huge boon to the whole field of NLP. 
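The masking idea Oren describes can be sketched as follows, grossly simplified relative to what BERT-style models actually do (subword tokens, random masking rates, special handling of masked positions, and so on):

```python
def mask_at(tokens, i, mask_token="[MASK]"):
    # Self-supervision: hide the word at position i. The corpus itself
    # supplies the training label; no human annotation is needed.
    masked = tokens[:i] + [mask_token] + tokens[i + 1:]
    return " ".join(masked), tokens[i]

sentence = "the hikers walked into the forest".split()
inp, label = mask_at(sentence, 5)
print(inp)    # the hikers walked into the [MASK]
print(label)  # forest
```

Every sentence in a corpus yields many such (input, label) pairs, one per maskable position, which is why raw text can feed these models at a scale hand-labeling never could.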
And on the theme of building these giant data sets, you had a trillion examples, which was really huge back in the day; even today, a trillion is really large. And we do see this sequence of success stories in NLP, where researchers are building bigger and bigger models, transformer models these days, or flavors of them, and feeding them bigger and bigger data sets. What's your prediction on the future of this trend? Ever bigger and bigger, and that's the story of NLP? Or a trend towards smaller models at some point as well, or plateauing, or something else?

Well, let me first start by acknowledging, which I think is important, how wrong I've been. I predicted that the growth in model size and number of parameters, and the commensurate amount of data, would already have plateaued, and I've definitely been wrong about that; we see that with the continued increase in performance. So take predictions from anybody, especially me, with a grain of salt.

That said, I do think that models will continue to grow, because there is that hunger for performance, and we haven't by any means exhausted the power of the machines or the size of the corpora. So I think they will continue to grow until we see a very significant plateau. At the same time, I think it's also very natural for any computer science field to first build the largest possible model, to kind of brute-force it, and then go back and optimize it in various ways, both in terms of data efficiency and data selection strategies, and computationally. So I think we'll see both. A great example of that, and this is an analogy, but I think it gives the right intuition: we started with chess, and only specialized chips and supercomputers could do it. Now we have stronger chess-playing programs on a laptop. It's gotten cheaper and simpler, not just because of Moore's law, but also because of better algorithms. 
And at the same time, we've also scaled up to larger games, like Go.

I feel like today there are definitely researchers, aspiring researchers, wondering: boy, if I don't have a million dollars to buy GPUs, how could I possibly do research in this field? And I feel like, on one hand, there are hopefully lots of opportunities to do exciting work on smaller data sets, lots of groundbreaking research to be done there. And also, the history of computing has shown us that yesterday's supercomputer mainframe is today's smartphone or smartwatch. We'll see if this trend continues to hold up, but I would love it if some of these most amazing, giant, millions-of-dollars types of models that you read about in the news could someday run on our smartwatches. That would be an exciting future if we can get there.

Yes, and we've actually done some work at the Allen Institute for AI, or AI2, as we call ourselves, on a topic we call green AI, where we say, exactly because of the point you made, Andrew, that these massive models result in a large number of people being shut out of the creation of these models. We ask people to also publish results taking cost into account, taking efficiency into account. So, if I can spend $4 million, $12 million, and build the largest model, that's one kind of research. But what's the best model I can build for $1,000? What's the best model I can build if I only have, say, 1,000 training examples? There are a lot of questions like that, where if you factor in efficiency, I might say: look, my model isn't as good as Big Brother's trillion-parameter model, but it only cost $1,000 to train, or only $100, and it's trainable on a laptop. In fact, we've been talking about something we call NLP in a box: what are the best NLP capabilities you can derive from simply a laptop, or simply a phone? 
And there are many situations where that's really the question. Let's just take a phone as an example; I keep flashing this as a prop. For privacy reasons, I may want to keep the data resident on the phone, or I may have intermittent internet connectivity, so I can't just always upload things to the cloud. So now I suddenly need to think about models, whether they're natural language models or vision models, that are optimized to run on a limited device without exhausting the battery. That's a really key point. And that leads to an array of research that many more people can participate in.

Yeah. And today, many NLP teams train a model and then have to re-engineer it or compress it to make it run on a mobile phone. And people often don't think of even just the download size: if a mobile app is a very large download, users, depending on the country and the cost of bandwidth there, are actually less likely to download and install it. And of course, with the rise of edge computing, there's a lot of exciting work on getting these things to work on edge devices as well.

You have, for many years, had a foot in both the academic world and the industrial world, as a professor, and also as the CEO of a nonprofit and a venture partner. Today, a lot of people are asking, how do you choose between the academic and the industry pathway in AI? What advice do you have for someone trying to build a career and looking at academia and industry?

Well, let me answer it in geeky terminology that will hopefully be very familiar to the folks taking this specialization. To me, it's a question of what you're trying to optimize. If you're trying to optimize compensation, how much money you make, or even adrenaline, the kind of excitement you get from a car race or a poker game, then the world of startups, the private sector, naturally beckons. 
If, on the other hand, you're trying to maximize freedom, the ability to ask your own questions, the ability to sit back and contemplate and think really deeply, uninterrupted, about fundamental intellectual questions, the questions that you want to ask, not somebody else's, well, then there's no substitute for academia. I'm old enough that at different points in my career, which has spanned some decades, I've focused on different things. One of the biggest academic highlights for me was graduate school at CMU, where I worked with Tom Mitchell and could spend months just delving into one particular question. I'd finished my coursework, and I could just go as deep as I could in answering a question. And then at other times, when I did a startup, there was that feeling of putting a team together and working so hard to succeed, and that rollercoaster ride of this thing that we built with sweat, blood, and tears, that we own, and that we're going to make a success. That was also an incredible feeling, but very different. So I sometimes liken academia to playing bridge, and startups, the commercial sector, to playing poker. Both are fun, but they tickle different neurons, at least in my brain.

From that description, I gather you are both a bridge player and a poker player.

At various points in my career. Now, actually, I play a lot of bughouse, the partner variant of chess.

When I was in high school, I played chess; I was actually captain of my high school's chess club. And then after Garry Kasparov lost to a computer, I gave up playing chess. But this sounds like fun; I should check it out.

Well, we should play sometime.

One of the things I've seen you do is engage with regulators and help think through what are appropriate AI regulations. What are your thoughts about regulating AI? 
Well, to take an example closer to NLP: we've seen that, really unfortunately, language models and other NLP systems trained on natural language, trained on text on the internet, learn to exhibit some of the very undesirable biases that are exhibited by text on the internet. As AI technologists, we'll do our best to diminish and squash that in our systems. But what do you think is the role of regulators in this frankly really knotty NLP problem, where we have wonderfully performing systems, but very problematic aspects of bias because of the data they learn from? How do you think regulators should think about that?

Wow. I do think that's the toughest question you've asked me, Andrew. So what should the role of regulation be for natural language processing? I would be very careful to avoid legislating our values into the technology, and I would really allow a thousand flowers to bloom. What I would look at a lot more closely is specific applications. NLP is a broad technology, and that technology, as you pointed out, is prone to bias. But it's how this bias manifests in particular applications that needs to be regulated. Let me give a concrete example. If we build a resume-scanning application, and it exhibits bias in favor of men over women, obviously that's highly problematic. We don't want that bias; it's illegal. So in that context, we should block it. We should audit those sorts of applications and disallow bias there. But what I would really not want to do is regulate basic research into NLP on these grounds. So I think that's really the most important point: regulate the applications, not the research.

Yeah, and I would love for regulators and technologists to together try to figure out what is a fair standard to hold these systems to, and then rigorously audit the systems, such as a resume screener, against a well-articulated standard. 
And then hold the AI teams accountable for reaching that standard. If we do that, hopefully we can also avoid gotchas where an AI team goes along and then, many years later, there's a surprising, maybe even fair, but surprising criterion, and if only we had all realized we should judge the AI system on that criterion, we could have avoided the problem in the first place. So it's tricky. I'm glad the community is working on this.

Yeah, I want to highlight at least one more point there, and again, there's so much to say on this topic. I think the word audit that you mentioned is really a key one. For example, in the European Union, they've started thinking about the right to an explanation. They say: hey, if I have a model, then the model has to be able to tell me why it came up with its conclusion. And the problem with that is, you and I know very well that deep learning models, based on a very large number of variables, a lot of parameters, a lot of data, may really struggle to provide an explanation that anyone can understand. So if that regulation is created, what we will end up with is explanations that are either incomprehensible, and thus useless, or inaccurate: they're clear, but they're actually not correct; they're not high-fidelity explanations. Then they're not useless, but misleading, which is perhaps even worse.

Another option, which is encapsulated in what you said, is to say: no, I'm not going to insist on explanations, but I am going to insist on the right to audit. So if you create a model, a regulatory agency, or a third party like the ACLU, or an academic, should have access to it to audit its behavior and to check whether it's exhibiting bias. And now we can rely on the marketplace of ideas and the interaction between different bodies with different incentives, like journalists, non-profits, and so on, to check on each other. 
And I think that sort of situation is a much more robust and interesting one.

Great. The vision of transparency and auditing, I think, will hopefully shift society toward more fairness. One final thing I want to ask you, Oren: you've mentored a lot of students and a lot of engineers early in their careers. You've helped a lot of people become really good at NLP over your career. What advice do you have for someone watching this video today who wants to break into NLP or grow their career in NLP?

Well, I would say at the early stages, make sure you've got the fundamentals right. We're talking about statistics, computer science, an understanding of machine learning. I think that's essential, because the flavor of the month, or the flavor of the year, which right now is transformers, as you said, changes very rapidly. So you want to make sure you've got the fundamentals right. Then, as a second step, I do think that online courses have proven extremely successful, extremely cost-efficient, and widely accessible. So that's the next place I would go. And of course, our conversation is part of an NLP specialization. I haven't studied it in depth, but knowing you, Andrew, I'm sure it's very high quality.

Thank you. And after someone's done studying online, what's after that?

There is no substitute for doing it yourself. You only understand something to a limited level if you've only done it at the course level. You've got to take a real problem, take a data set, and do it yourself, and find out how it works or doesn't work; you could be surprised. You could find that the problem you're excited about is easier than you expected, and you can do really well. Or you'll find: ah, maybe I didn't understand that concept so well, or maybe this problem is harder than it seems. And that might lead you to a new invention or a new idea. So there's also no substitute for practice.

That's great, Oren. Thank you. 
And I hope that many of our learners will follow your advice and end up someday becoming great NLP researchers or scientists or engineers and building amazing systems. This was inspiring, so thank you very much, Oren. It was really great having you, and thank you again for joining this interview series.

Well, it's a real pleasure, Andrew, and thank you for all that you're doing to be a champion of the field in general and to bring this information and these ideas to so many more people. We need that if we're going to make the kind of progress that we should be making to use AI to make the world a better place.

Thank you, Oren.