Now it's time to put your governance foundation into action. In this lesson, you'll work with an HR dataset containing information about employees. You'll create views from the tables, create your service principal identity, and with those permissions in place you'll access your views and, finally, create your agent's tools and register them in Unity Catalog. All right, let's start coding.

Before we get started on the Lab 1 notebook, you're going to see this Welcome to Databricks screen. We are in the Databricks Free Edition, and you can sign up for your own free Databricks account; the link for that is in the reading note, along with the link to the GitHub repo. You can fork the GitHub repo and then bring that code directly into your Databricks Free Edition account. I'm going to copy my link to the data for this agent governance lab. You should see a data folder, Lab 1, Lab 3, agent.py, and a README file. Now, in Databricks, I'm going to go to the upper right-hand side. Here I'll see the email I signed up with, so
[email protected]. Then I'm going to go into my settings. You are an admin for this account, so you will see the workspace admin settings, and you as a user have your user settings as well. I'm going to go into the Developer tab; this is also where you can switch things like dark mode. If I want to link my GitHub, I can link accounts, so I'll add my Git credential. I have my GitHub username, and I can name the credential whatever I like, so I'll just say this is amber_agents, link the GitHub, and authorize Databricks. All right, now we have successfully linked our GitHub.

Now if I go into my Workspace, this is where all your notebook capabilities are, and I can create a Git folder. If you've downloaded the notebooks directly, you can create new notebooks or recreate the same folder structure, but we're just going to pull everything directly from GitHub. So the link I copied, I'm just going to paste it there, and it already pulls in the Git folder information and the provider. Perfect. Now we see all the files we just had in our GitHub repo.

Before we get started walking through Lab 1, which is our Lab 1 governance notebook, we're going to create the hr_data_analyst service principal, create the group Devs, and then add hr_data_analyst as a member of that group. So again, I'm going to go back into my admin view and look at Identity and access, which is already the default when I click into my workspace account. I'm going to go to Service principals, click Add service principal, and type hr_data_analyst. Add that service principal. Within the service principal you should see its entitlements, so Databricks SQL access and Workspace access, and you can add additional permissions. Here you'll see a warning about permission control, that is, who can manage or use the service principal: the Service Principal Manager role does not automatically grant the ability to use the service principal, so users or groups that need to use it must have explicit permissions. So we're going to grant access: we type in hr_data_analyst and give it the ability to manage and use. And since this is my account, I'm going to add the ability for me to use the service principal as well. So these are all the permissions set for you to be able to build an agent and eventually deploy it on behalf of a service principal.

All right, back to Identity and access. Now that we've created the service principal and can manage it, we're going to create the group. Let's say we had a group for all developers; we'll just name that Devs. You can add as many members as you want to Devs, and they will inherit its permissions. If you just create a group, don't give it permissions, and don't give it a parent group, it will simply exist with no default permissions. So I'm going to grant access to the service principal, and it is now a part of this group. If I don't want to give it manage access, all I have to do is edit and delete those capabilities, and that will revoke those permissions. Member access is all we really need for it to inherit the permissions of the group. All right, so now hr_data_analyst is added to that group without the ability to change controls within the group. Perfect. This will start making more sense once we implement governance in our notebook.
So now we have the group Devs, which again represents all the developers working together to build agents. We don't want to have to add each of them to Unity Catalog individually and grant permissions one by one. Luckily, we can grant permissions to the group, so everyone within it, including our service principal, inherits those permissions.

Going back to our workspace, we see our agent governance folder. Click into that folder, then into Lab 1. So, starting back at the notebook, Lab 1: Building the Governance Foundation. We've now created the hr_data_analyst service principal, created the Devs group, and added hr_data_analyst as a member. Now I have to make sure we're connected to serverless compute. All right, we're connected. The next steps are creating the HR tables in Unity Catalog and then applying data classifications. We're going to create an analyst view, give permissions to our Devs group so they can access that view, create masks, and build functions that act as tools for our agent. Let's break that down step by step.

Let's start with the initial setup and data verification. I'm going to run this first cell, which sets up our working environment. We are a company called clientcare, and we have an HR team's database, which is our schema hr_data; a variety of data from our HR team is stored there. So our first step is creating the catalog, and then we create the schema. After we make the DataFrames for our data, I'm able to run the next cell. As long as you have your data in the local directory under data, we're able to use all the CSV files: compensation data, employee records, HR cases, internal procedures, performance reviews, and public policies. We're going to load each CSV file one at a time. So for each table name and file name in the CSV files, we read the CSV and convert it to a Spark DataFrame. We read it into a pandas DataFrame first, using the local file path under data, and then use Spark to create the DataFrame; Spark requires an absolute file path rather than the local relative path, which is why we're using pandas. Then we write all those tables into our catalog, into hr_data, and the tables get these names (the whole loop is sketched below).

So, let's see if we've successfully created these tables. Tables successfully created. The next step is just verifying that the tables actually exist. All right, they do. You can also go into your Catalog and see them directly: we see our catalog clientcare, we see our hr_data schema, and then we see all the tables within. So this is your file structure, but I can also view all the tables we've just created right in the UI.
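As a quick recap of those cells, here's a minimal sketch of the loading loop, assuming the CSVs live in a local data/ directory. The variable names and exact file names are illustrative, and `spark` is the session that Databricks notebooks provide automatically.

```python
import os
import pandas as pd

# Working environment from the lab: the clientcare catalog and hr_data schema.
spark.sql("CREATE CATALOG IF NOT EXISTS clientcare")
spark.sql("CREATE SCHEMA IF NOT EXISTS clientcare.hr_data")

# Map target table names to CSV files in the local data/ directory
# (illustrative names matching the datasets described above).
csv_files = {
    "compensation_data": "compensation_data.csv",
    "employee_records": "employee_records.csv",
    "hr_cases": "hr_cases.csv",
    "internal_procedures": "internal_procedures.csv",
    "performance_reviews": "performance_reviews.csv",
    "public_policies": "public_policies.csv",
}

for table_name, file_name in csv_files.items():
    # pandas happily reads the relative local path; spark.read.csv would
    # want an absolute path, which is why we go through pandas first.
    pdf = pd.read_csv(os.path.join("data", file_name))
    sdf = spark.createDataFrame(pdf)
    # Write each DataFrame as a managed table in Unity Catalog.
    sdf.write.mode("overwrite").saveAsTable(f"clientcare.hr_data.{table_name}")

print("Tables successfully created.")
```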
Going back from our workspace into our Lab1_Governance notebook: our next step is displaying the data, just to get a good look at what's there, because we know we have things like social security numbers, phone numbers, emails, employee names, salaries, and start dates. We can even combine a few of the tables, employee_records and compensation_data, to see all the data we've currently loaded into Unity Catalog and its sensitivity level. All right, so we can see employee ID, first name, last name, department, full social security numbers, phone numbers, emails, hiring dates, and salaries. Now, if we're creating an HR data analyst, they're going to need key information like base salary, bonus, and stock options, but we don't want them to be able to directly identify employees from those numbers. And we definitely don't want them to have full access to emails, phone numbers, and social security numbers.

Next up is applying data classification tags. We're going to define the classification tags for each table and whether or not it contains PII. For example, our employee records are confidential, and they do contain PII. Before getting further into this, I just want to talk about how applying tags to our tables essentially assigns a sensitivity level. Table properties are metadata tags that help classify data sensitivity without actually changing the data structure: metadata key-value pairs attached to the tables. They're used by governance tools to understand data sensitivity, but again, they won't affect the table structure or the actual data, and they can be queried programmatically for compliance. In our data classification scheme, as you can see, we have Public, Internal, Confidential, and Restricted: Public, anyone can access; Internal, our employees only; Confidential is limited access; and Restricted is highly sensitive information. Adding classification tags like confidential, restricted, or public in Unity Catalog is essentially adding metadata, because in Databricks, tags are metadata that can make your data access more manageable, secure, and compliant with both internal policies and external regulations.

So here is the actual step of adding the tags: classification and PII. These are our table properties, set in Spark SQL (a sketch follows below). We add the classifications, and we can see they were all applied. If you want to see the tags themselves, again, go into your Catalog, into hr_data, and look at employee_records, for example: just scroll down and you'll see Confidential and contains PII. If I go to Recents, I can also pull up the lab directly instead of going to my workspace, then the governance notebook, then the lab.
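Here's roughly what that tagging cell does. This is a sketch: the property keys (classification, contains_pii) and the levels assigned to tables other than employee_records are assumptions for illustration.

```python
# Classification level and PII flag per table. employee_records is
# confidential with PII per the lab; the other levels are illustrative.
classifications = {
    "employee_records": ("confidential", "true"),
    "compensation_data": ("restricted", "true"),
    "performance_reviews": ("confidential", "true"),
    "hr_cases": ("restricted", "true"),
    "internal_procedures": ("internal", "false"),
    "public_policies": ("public", "false"),
}

for table, (level, has_pii) in classifications.items():
    # TBLPROPERTIES are metadata key-value pairs; the data itself is untouched.
    spark.sql(f"""
        ALTER TABLE clientcare.hr_data.{table}
        SET TBLPROPERTIES ('classification' = '{level}',
                           'contains_pii'  = '{has_pii}')
    """)
```

You can read the properties back with `SHOW TBLPROPERTIES clientcare.hr_data.employee_records`, which is the programmatic equivalent of checking the table in the Catalog UI.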
All right, scrolling down, the next thing we have is Agent Permissions and Agent Access Requirements. Now that we've classified our data, let's design the right level of access for our HR analytics agent based on purpose-built views. For example: what would I want my HR admin to see, what would I want a manager to see, what would I want a data analyst to see? Mapping this out shows that we don't want names or social security numbers, but we do want salary, we need to know the department, and we want performance, because we can look at analysis related to overall performance and operations. Your agent should be able to answer questions like "what is the salary distribution", "is there pay equity", "what are the correlations", without answering "what's John Smith's salary", "who are the highest paid employees", or "show me social security numbers".

So our next step is creating a classification-aware view: the data_analyst_view for statistical analysis. Now, why do we use views for access control? Well, they're very straightforward: instead of complex masking rules for every possible user, these are purpose-built views. Each view has a clear business purpose and user type, and they're easy to maintain. The views are optimized SQL, so there's no runtime overhead, and they're an added security layer: in Databricks, views act as our security layer since we grant permissions on them directly to our service principals. So we have our data analyst view for analysis. The key features here: we anonymize the employee ID; we keep the department; we reduce the hiring date to just a year, since it's easy to identify a specific employee from an exact hiring date if you know when they started; and we keep the full base salary, making sure that isn't taken away. We also keep comp_year, review_quarter, and review_year, along with base salary, bonus, rating, and stock options. Then we combine the information on employee IDs, joining compensation data and performance reviews together. And we exclude the legal department: we don't need it in our analysis, and we're told not to use it for compliance reasons.

All right, that view has been created. Next, let's confirm the view is in our schema. We're just making sure this view is in clientcare, in our hr_data schema, and we do see a view living there. Next up, we test the view: for our analyst view, we check row counts and verify the departments. All right, we have salaries, we have anonymized IDs like we wanted, and we don't see the legal department anywhere.

The next step is configuring group permissions. We've already created the group Devs, created the service principal, and added it to Devs, so the service principal will inherit certain permissions. Now we just need to make sure Devs has the right permissions to access the data and build agents. First, we verify that we actually created that group and didn't skip that step. All right, we see our admins, our Devs, and our users, so we did create it. Next is granting Devs the specific permissions it needs, again, to view the data and to register models in Unity Catalog with MLflow. Now, if you wanted to grant broad permissions, you could use a single GRANT ALL on clientcare.hr_data, but we want to grant minimum permissions and give specific access to the schema. So first we grant on the catalog and schema: we grant Devs the ability to use the catalog and the schema. These are container permissions, prerequisites that have to be granted first. Then we have object permissions, which are inherited: once you grant access on the hr_data schema, the ability to use any table, any model, or any function within it is automatically inherited. If you want more detail on that, I recommend going back and re-listening to Lesson 2. So, running through these quickly: we grant USE CATALOG and USE SCHEMA, plus the ability to CREATE TABLE on the schema, because if you have a logged model, you may want to log all its inferences to a Unity Catalog table. We grant EXECUTE for running functions, for example, when your agent calls its tools. We grant CREATE MODEL, since we'll want to create the model and its versions when we register our agent to Unity Catalog. And then we grant SELECT ON VIEW so Devs can use the view (the view and grants are sketched together below).
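Putting the view and the grants together, here's a minimal sketch. Column names like employee_id, hire_date, and base_salary, the hashed anonymization, the literal 'Legal' department label, and the `devs` principal spelling are assumptions; the lab's actual SQL may differ in the details.

```python
# Classification-aware view: anonymized ID, year-level hire date, full
# compensation fields, joined across tables, with Legal excluded.
spark.sql("""
    CREATE OR REPLACE VIEW clientcare.hr_data.data_analyst_view AS
    SELECT
        sha2(CAST(e.employee_id AS STRING), 256) AS anon_employee_id,
        e.department,
        year(e.hire_date) AS hire_year,          -- year only, not the exact date
        c.base_salary,
        c.bonus,
        c.stock_options,
        c.comp_year,
        p.review_quarter,
        p.review_year,
        p.rating
    FROM clientcare.hr_data.employee_records e
    JOIN clientcare.hr_data.compensation_data c   ON e.employee_id = c.employee_id
    JOIN clientcare.hr_data.performance_reviews p ON e.employee_id = p.employee_id
    WHERE e.department <> 'Legal'                 -- excluded for compliance reasons
""")

# Minimum permissions for the Devs group: container privileges first, then
# the object privileges the group actually needs. The principal in backticks
# must match the group name exactly.
for stmt in [
    "GRANT USE CATALOG ON CATALOG clientcare TO `devs`",
    "GRANT USE SCHEMA ON SCHEMA clientcare.hr_data TO `devs`",
    "GRANT CREATE TABLE ON SCHEMA clientcare.hr_data TO `devs`",
    "GRANT EXECUTE ON SCHEMA clientcare.hr_data TO `devs`",
    "GRANT CREATE MODEL ON SCHEMA clientcare.hr_data TO `devs`",
    "GRANT SELECT ON VIEW clientcare.hr_data.data_analyst_view TO `devs`",
]:
    spark.sql(stmt)
```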
Next up, we're just going to make sure this works. We check schema-level permissions and view-level permissions: we look at all the grants on the schema, and then the table grants as well, which should show us, for the view, who has access to it. All right. At the schema level, Devs can use the schema, create models, create tables, create versions of a model, and execute; and on data_analyst_view, Devs have the ability to SELECT.

Next up, we have column masking. You might be wondering why we need column masking when we've already implemented views. Well, views can be bypassed if users gain direct table access. Column masking at the table level cannot be bypassed, because it's enforced by Unity Catalog on every query, regardless of how the data is accessed. So with column masking or row masking, nothing gets through. We're going to create the masking for social security numbers. We're going to say: no one actually needs the full social security number unless you're in payroll. So we're going to concatenate five masked digits, which are just stars, so you can't see the first five digits of the social security number, and then keep the true value of the last four digits. But Devs don't need to see anything, so they're just going to get ANALYTICS_MASKED, whereas elevated users like managers and admins would be able to see the last four digits. So we're applying this function to the social security number column; you can apply masks to rows as well. These masks are functions, the same way we're going to create functions that query the data and use them as tools for our agent. So we alter the table, our clientcare.hr_data.employee_records table that has the social security number column, and set that mask.

Now, on to validating the column masking: we run the validation and test it out. We see we're currently not using the view, so this is still the original employee ID, but the social security number has its first five digits masked, and even though I'm an admin, this is still what I see unless I go and remove the mask (a sketch of the mask follows below).
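Here's a sketch of that mask as a Unity Catalog function, plus the ALTER TABLE that attaches it. The function name mask_ssn, the column name social_security_number, and the use of the built-in is_account_group_member to pick out the elevated group are assumptions.

```python
# Masking function: elevated users see only the last four digits; everyone
# else (including Devs) sees a fixed ANALYTICS_MASKED placeholder.
spark.sql("""
    CREATE OR REPLACE FUNCTION clientcare.hr_data.mask_ssn(ssn STRING)
    RETURNS STRING
    RETURN CASE
        WHEN is_account_group_member('admins')
            THEN CONCAT('***-**-', RIGHT(ssn, 4))  -- first five digits starred
        ELSE 'ANALYTICS_MASKED'
    END
""")

# Attach the mask to the column; Unity Catalog now enforces it on every
# query, no matter how the table is accessed.
spark.sql("""
    ALTER TABLE clientcare.hr_data.employee_records
    ALTER COLUMN social_security_number
    SET MASK clientcare.hr_data.mask_ssn
""")
```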
So next up, we have building the secure tools for our agents. You can call them tools, functions, or skills; essentially, you're building a function for the agent to query the data, and that's the only way the agent is going to have access to it. The functions we make now will act as the interface between the agent and our governed data. So why do we want to use Unity Catalog functions for agent tools? Well, they have governed access: the same way everything else inherits permissions, our functions also have permissions set. They act as governed query templates; they provide an audit trail, since all function calls are logged for compliance; and they're optimized for performance, with proper indexing and caching. You can even have a tool that queries a vector database. Our tool strategy is to create two general-purpose functions that work exclusively with the anonymized data_analyst_view. So when we do deploy the agent and run it as the service principal, it's going to use that view to see the data: it will still get anonymized employee IDs, it won't get key identifiers like social security numbers, and it won't see the legal department. We'll have all of that locked down, so whenever the agent queries anything, no sensitive information can come out.

Tool one is performance and retention analytics. Here we create or replace the function, so you can run this notebook as many times as you'd like; it will just rewrite what has already been built. This is our function analyze_performance: we're going to get information about our departments, our ratings, and our employee numbers, like average tenure. We can also add comments, which go directly into the function the agent uses: HR analytics, basic performance metrics by department. So we're looking at average, min, and max ratings, employee count, and average tenure by department. And again, it only works with anonymized data, so here's the data_analyst_view we're making sure it uses.

Next up is tool number two, our department and compensation analytics. This again creates or replaces a function, analyze_operations, which returns department, employee count, average salary, average bonus, average total comp, and stock options. The comment here is simply that it provides department compensation and operational metrics. So it selects department, count, average salary, bonuses, and base salary, again using that data_analyst_view to make sure it only works with anonymized data. Most of the data here will be available, because we made sure things like salary, bonus, and stock options were viewable through the data_analyst_view. All right, so we've created the analyze_operations function, which returns compensation metrics and headcount by department, with no sensitive data exposed.

So let's actually test both of these. Test one: we can see the performance analysis by department, so our departments and our average ratings; again, you're not going to see legal, and you're not going to see any identifiable information. Test two: compensation analytics by department, again showing department, employee count, salaries, and bonuses.

And now we've accomplished the governance foundation. You've done data classification with tagging; you've built a classification-aware view, the data_analyst_view; you've configured group permissions, setting up the Devs group with permissions on your views and models; you've implemented social security number masking; and you've built two secure tools for analyzing performance and operations. All right, this concludes building the governance foundation. Let's get started on the next lesson.
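One last recap before you go: here's a minimal sketch of what tool one can look like as a Unity Catalog SQL table function over the anonymized view. The output column names, the tenure calculation, and the exact metrics are illustrative assumptions based on the walkthrough above.

```python
# Tool 1: performance and retention analytics, reading only from the
# anonymized data_analyst_view so nothing sensitive can leak out.
spark.sql("""
    CREATE OR REPLACE FUNCTION clientcare.hr_data.analyze_performance()
    RETURNS TABLE (
        department        STRING,
        avg_rating        DOUBLE,
        min_rating        DOUBLE,
        max_rating        DOUBLE,
        employee_count    BIGINT,
        avg_tenure_years  DOUBLE
    )
    COMMENT 'HR analytics: basic performance metrics by department. Works only with anonymized data.'
    RETURN
        SELECT
            department,
            AVG(rating)                           AS avg_rating,
            MIN(rating)                           AS min_rating,
            MAX(rating)                           AS max_rating,
            COUNT(DISTINCT anon_employee_id)      AS employee_count,
            AVG(year(current_date()) - hire_year) AS avg_tenure_years
        FROM clientcare.hr_data.data_analyst_view
        GROUP BY department
""")

# Test it the same way the notebook does:
display(spark.sql("SELECT * FROM clientcare.hr_data.analyze_performance()"))
```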