After you have deployed your system, you'll move into the evaluation phase of your project. In this phase, you'll attempt to measure how successful your project was, or continues to be, and communicate the results to all the relevant stakeholders. Measuring the impact of your project can be tricky, but if you've been following this framework, then all the way back in the explore phase you should have defined what success looks like, at least qualitatively, in a detailed and specific problem statement. In the case of the project I've been walking through, our problem statement was: healthcare providers need to communicate directly with mothers in the community via surveys to monitor their health and the health of their children. To do this, we need to be able to quickly process a large volume of incoming text messages in multiple languages, including survey responses and other unrelated messages from the community. Like I've said before, for any AI project, the AI is just part of the solution: one component of a much larger and potentially complicated technology or product. And so the performance of the AI model by itself may not be the most important aspect of a successful outcome, and it is certainly not the only one. In our case, we found that our model actually did perform well and kept improving with the newly labeled annotations provided by the clinic staff, so that the staff became more efficient in their work: they were processing more patient communications with a lower typical response time. However, as the project proceeded, the clinic staff ultimately found it to be a bad user experience, where they felt like they were now spending more of their time annotating data rather than caring for their patients.
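To make the problem statement above concrete, here is a minimal, purely illustrative sketch of routing incoming messages by detected language. This is not the actual clinic system: `detect_language`, the marker words, and the queue names are hypothetical stand-ins for a real language-identification model and the clinic's actual workflows.

```python
# Illustrative sketch only: route incoming SMS messages to a
# language-specific review queue. In a real system, detect_language
# would be a trained language-identification model; here it is a toy
# stand-in based on a few common Hausa words.

ROUTES = {
    "ha": "hausa_review_queue",    # hypothetical queue names
    "en": "english_review_queue",
}

def detect_language(text: str) -> str:
    """Toy stand-in for a language-ID model (an assumption, not real logic)."""
    hausa_markers = {"sannu", "yaya", "lafiya"}  # a few common Hausa words
    words = set(text.lower().split())
    return "ha" if words & hausa_markers else "en"

def route_message(text: str) -> str:
    """Send a message to a language-specific queue, or to manual triage
    when the detected language is not one handled automatically."""
    lang = detect_language(text)
    return ROUTES.get(lang, "manual_triage_queue")

print(route_message("Sannu, lafiya lau"))    # routed to the Hausa queue
print(route_message("My baby has a fever"))  # routed to the English queue
```

Even in this toy form, the design choice matters: messages the router can't confidently place fall through to a manual triage queue, which is exactly where the human workload discussed below accumulates.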
Even as the system improved at automatically categorizing the messages, the clinic staff still spent the same amount of time reviewing edge cases, model errors, and low-confidence predictions. And they weren't able to directly experience some of the automated tasks, like the automatic rerouting of messages based on language. As a result, they didn't have much visibility into how the model was improving, or into the fact that their individual productivity was improving on average, because it didn't feel that way in their manual day-to-day tasks. Now, this kind of system, where humans are involved in manually labeling and annotating data, is not uncommon in AI applications. I would say that maybe even the majority of the AI systems I have worked on are systems where we have been deliberate about continually getting human labels in order to update the models. And most of those labels have come from experts, in this case the people at the clinic. So this particular architecture was not at all uncommon, and yet it was a type of human-computer interaction that was not successful in this particular case. This is why we have emphasized that ensuring a good user experience for annotators, whether they are outsourced annotators or expert annotators, is as important and as complicated as designing the pure machine learning parts of any AI solution. To give a specific example of how complicated it is to provide a good user experience: one way people have tried to improve human-AI collaboration, where annotators might become frustrated or fatigued by their work, is gamification of the system, for example, having annotators watch a tally of how their contributions are increasing the model's performance and therefore the overall accuracy. However, many approaches like this, which sound good initially, tend not to work so well in practice.
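The review workflow described above, where confident predictions are handled automatically while staff review edge cases and low-confidence predictions, can be sketched roughly as follows. This is a hypothetical illustration rather than the project's actual code; the threshold value, field names, and sample messages are all assumptions.

```python
# Hypothetical sketch of confidence-based triage: predictions the model
# is confident about are handled automatically, the rest are queued for
# human reviewers, whose confirmed or corrected labels would then be
# fed back into the next round of model training.

CONFIDENCE_THRESHOLD = 0.85  # assumed value; tuned per application

def triage(message: str, label: str, confidence: float) -> dict:
    """Decide whether a model prediction is auto-processed or reviewed."""
    route = "auto" if confidence >= CONFIDENCE_THRESHOLD else "human_review"
    return {"message": message, "label": label, "route": route}

# Simulated model output: (message, predicted label, confidence)
predictions = [
    ("Yes, the baby was vaccinated", "survey_response", 0.97),
    ("Greetings to everyone at the clinic", "other", 0.55),
]

triaged = [triage(*p) for p in predictions]
review_queue = [t for t in triaged if t["route"] == "human_review"]
# Everything in review_queue represents ongoing manual work for staff,
# which is the user experience cost discussed in this section.
```

Note that under this design the human workload only shrinks if model confidence rises across the whole message distribution; if edge cases keep arriving at the same rate, reviewers see no change in their day-to-day work even as aggregate accuracy improves.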
There's a lot of evidence that if you try to gamify a system, the gamification becomes a source of fatigue in itself, and the only people who end up with a better experience in their work are the people at the very top of whatever leaderboard or scoring system you've created. So it's very hard to design the right kind of interface for combining human and machine intelligence. In our case, while it appeared that some of the issues outlined in our problem statement had been addressed successfully by the system we implemented, it ultimately turned out to be a poor user experience for some of the key stakeholders, the healthcare providers themselves. Without ongoing manual annotation, the system could not be particularly useful, and so ultimately we discontinued the project, because the alternative would have been to ask for even more time and effort from the clinic staff when the experience was already bad for them. Or we could have looked to external parties to annotate the data, but we weren't prepared to sacrifice privacy in order to have more people annotating the data in this particular case. Obviously we were disappointed that our project was ultimately not successful, not for our own sake, but for the community we were trying to support. However, we did learn valuable lessons along the way. For example, we learned how these kinds of systems can improve efficiency in certain healthcare applications, but we also highlighted important user experience challenges that need to be addressed before this kind of system can be rolled out more broadly. In the eight years since, there have actually been a few other attempts to integrate AI to semi-automate how people manage healthcare through the UReport system.
There was another attempt by people I respect greatly at Translators Without Borders, whose open-source solution you can look at, and they ran into some of the same problems: they couldn't adequately communicate to the healthcare workers the improvements the system was making. And just recently, eight years after our failed attempt, I'm finally seeing some early research papers from natural language processing researchers partnering with UNICEF and UReport today, where they seem to have solved the user experience problem. But I don't think even those researchers would yet declare the problem finally solved. Ultimately, I think this is a really good case study for anyone thinking about AI for good projects: you can do everything right and still, at the end of the day, not have a successful impact. Like I've said before, the majority of social good projects fail, the majority of AI projects fail, and there is no magic in combining them; it makes things harder. Hopefully, this is also a really good example of why the do no harm principle is so important. If we had compromised on that principle, say on privacy, then at the end of the day all we would have done is compromise people's privacy for no long-term benefit. If we had shipped a solution where, at the start, the annotators were less accurate, we would have compromised the annotators' work, the clinicians' work, and the patients' health for a more positive later outcome that never happened. So keep in mind that even with good planning and very strong signals that your solution is working, right up until the end you cannot compromise on the health and safety of anyone who is relying on the system you're building.
I'd like to show some other examples of areas, like healthcare and public health, where it seems obvious that AI can help, but where, when we try to build systems, it's not clear that we can roll them out without causing harm, or that they will ultimately have a positive impact on the stakeholders. I think there's a really good example from COVID-19. Preceding COVID-19, there was a fair amount of research showing that AI could be useful in identifying certain diseases in CT scans, X-rays, and other kinds of medical imaging. So during the COVID-19 pandemic, it was natural that some research groups started looking at biomedical imaging, chest X-rays, CT scans, et cetera, to diagnose COVID-19. A number of these groups published papers and declared success with their projects, based on results suggesting it was possible to provide accurate diagnoses of COVID from images of the chest. However, relatively cheap and effective nasal swab tests for COVID have been, for the most part, widely available since early in the pandemic, and so there was little, if any, practical need for diagnosing COVID with much more expensive and time-consuming chest imaging. Also, it was never clear that the kind of testing done offline on a small number of X-rays and CTs would scale and remain accurate once taken out into the world, with a much bigger diversity of X-ray machines, populations, and patient positions in the images. And so, while the people building these COVID-19 image detection systems were well-meaning, none of them ultimately made a difference in COVID diagnoses. Now, I realize that I've led with a project that failed, which is probably something you've never seen in any AI course you've taken, but I think there's an important lesson here: any AI project you embark on is more likely to fail than to succeed. And there is a positive flip side to this.
If you are focused on AI for good, you might as well aim for something where the benefits are incredibly high, because in the cases where you are successful, the greatest number of people can be positively impacted. Like I said, while our maternal health project didn't work, it has since informed what now look like more successful systems for supporting maternal health via messaging. The same is true of some of my other past projects. I worked on large-scale detection of disease outbreaks around 2011. We iterated on those technologies for a long time. At the time, we weren't able to get ahead of an outbreak, but the same company I worked with was among the first to identify COVID-19. So this was, again, eight years after we first tried something and it didn't work out. As far as we're aware, we didn't harm anyone at the time, and we highlighted a lot of ways that people could be harmed by surveillance-like pandemic monitoring. This informed what later became a very important early detection of COVID, which helped inform policy at a global scale much later. So please stay positive and realistic when thinking about AI for good projects. The payoff is likely to take longer than you might initially expect, but that's true of most AI projects, and when you're aiming for something bigger, the payoff of that final delivery is much bigger too. When it comes to determining the next steps for your own projects, it may be that in evaluating your results, you realize you want to go back to the implement phase to tweak your model or improve the user experience. Or you might discover problems in your design, and you'll head back to the design phase to work on a new version of your system.
Or it's possible that you'll find you can't provide a meaningful impact, and so you'll return all the way to the explore phase and investigate other details or perspectives of the problem you initially set out to work on, or perhaps even work on a new problem altogether. In the case of the project we're highlighting here, providing better support for maternal health in Nigeria, the next steps we looked at were things like how we might improve the user experience so that the clinic staff could see the benefits of the system more directly. We evaluated the potential trade-offs of asking more of the clinic staff in terms of time and commitment to the project, and determined that while there might be a path forward with that approach, the risk of creating an even worse user experience, or worse responses and response times for patients, was a distinct possibility, and so we didn't continue. However, a decade later, the UReport system has been deployed in dozens of countries worldwide and has made real improvements by building on the learnings from projects like ours and others since then, and it now includes various AI tools that look like they are starting to support both the people working in the clinics and, in turn, the communities they serve. And that wraps up this lesson on the AI for good project framework, where we walked through the four steps of exploring, designing, implementing, and then evaluating your project. You'll be applying this framework to the case studies presented in these courses, and with each new case study, you'll gain new perspectives on the nuances and challenges of a wide range of AI for good projects. I'd like to wrap up this week of materials with another project spotlight, this time from Iva Gumnishka, the founder and CEO of Humans in the Loop, a company focused on empowering crisis-impacted communities to do important work in AI projects around the world.
Ivo will highlight the things you need to keep in mind whenever you're involving collaborators from crisis-impacted communities in your work.