Welcome to the project page of GuardAgent! In this project, we aim to provide guardrails for LLM-powered agents (dubbed "target agents" below) by checking whether their inputs/outputs satisfy a set of guard requests (e.g., safety rules or privacy policies) defined by the users. This is fundamentally different from guardrails for LLMs (e.g., Llama Guard), since the output of an LLM agent can be actions, code, control signals, etc.

GuardAgent is designed with two major steps: 1) creating a task plan by analyzing the provided guard requests, and 2) generating guardrail code based on the task plan and executing the code by calling APIs or using external engines. In both steps, an LLM serves as the core reasoning component, supplemented by in-context demonstrations retrieved from a memory module. Such knowledge-enabled reasoning allows GuardAgent to understand diverse textual guard requests and accurately "translate" them into executable code that provides reliable guardrails.

In addition to GuardAgent, we contribute two novel benchmarks: EICU-AC, which assesses privacy-related access control for healthcare agents, and Mind2Web-SC, which evaluates the safety of web agents. GuardAgent achieves 98.7% and 90.0% guarding accuracy in moderating invalid inputs and outputs for these two types of agents, respectively. Finally, a GuardAgent API that provides real-time guardrails based on user guard requests is coming soon.
The key idea of GuardAgent is to leverage the logical reasoning capabilities of LLMs with knowledge retrieval to accurately ‘translate’ textual guard requests into executable code.
Inputs to GuardAgent: 1) a set of user-defined guard requests (e.g., for privacy control), 2) a specification of the target agent (which grounds the guard requests), 3) the inputs to the target agent, and 4) the output logs of the target agent.
Outputs of GuardAgent: 1) a decision on whether the outputs of the target agent (actions, responses, etc.) are denied, and 2) the reasons for denial if they are.
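For concreteness, this interface could be organized as in the following minimal sketch; the dataclass and field names are our illustrative assumptions, not the released code.

```python
from dataclasses import dataclass, field


@dataclass
class GuardInput:
    guard_requests: str       # user-defined safety rules or privacy policies
    target_agent_spec: str    # specification of the target agent
    target_agent_input: str   # the query sent to the target agent
    target_agent_output: str  # the target agent's output log (actions, code, etc.)


@dataclass
class GuardOutput:
    denied: bool                                        # whether the target agent's output is denied
    reasons: list[str] = field(default_factory=list)    # reasons for denial, if any
```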
Pipeline of GuardAgent:
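As a minimal sketch of this two-step pipeline (reusing the interface dataclasses above, with a naive keyword-based retriever, hypothetical prompt wording, and the LLM backend passed in as a callable; this is an illustration under our assumptions, not the released implementation):

```python
from typing import Callable


def retrieve_demonstrations(memory: list[dict], guard_requests: str, k: int) -> list[dict]:
    """Rank stored demonstrations by naive keyword overlap with the guard requests."""
    req_tokens = set(guard_requests.lower().split())

    def overlap(demo: dict) -> int:
        return len(req_tokens & set(demo["request"].lower().split()))

    return sorted(memory, key=overlap, reverse=True)[:k]


def guard_agent(inp: GuardInput, llm: Callable[[str], str],
                memory: list[dict], k: int = 1) -> GuardOutput:
    demos = retrieve_demonstrations(memory, inp.guard_requests, k)
    demo_text = "\n\n".join(d["example"] for d in demos)

    # Step 1: task planning -- the core LLM analyzes the guard requests.
    plan = llm(
        f"{demo_text}\n\nGuard requests:\n{inp.guard_requests}\n"
        f"Target agent:\n{inp.target_agent_spec}\n"
        "Produce a step-by-step plan for checking the agent's inputs/outputs."
    )

    # Step 2: guardrail code generation, then execution of the generated code.
    code = llm(
        f"{demo_text}\n\nPlan:\n{plan}\n"
        "Write Python code that sets `denied` (bool) and `reasons` (list of str)."
    )
    scope = {"agent_input": inp.target_agent_input,
             "agent_output": inp.target_agent_output}
    exec(code, scope)  # in practice, executed by calling APIs or external engines
    return GuardOutput(denied=scope.get("denied", False),
                       reasons=scope.get("reasons", []))
```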
We propose two novel benchmarks for different safety requests: 1) EICU-AC, which assesses access control for healthcare agents like EHRAgent, and 2) Mind2Web-SC, which evaluates safety control for web agents like SeeAct.
An example from EICU-AC (left) and an example from Mind2Web-SC (right)
EICU-AC originates from an adapted version of the EICU dataset, which contains questions regarding the clinical care of ICU patients and 10 relevant databases with patient information needed for answering the questions.
The designated task on the EICU-AC benchmark is access control, with three roles defined for potential users of a target healthcare agent: "physician", "nursing", and "general administration".
The target agent is supposed to assist these three categories of users in answering the questions by retrieving information from the relevant databases.
However, each user role has access to only a subset of the databases and a subset of information categories in each accessible database (marked in green below).
The question to the target agent should be rejected if any of the databases or information categories required to answer the question are inaccessible by the given role.
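The rejection logic itself is straightforward, as in the sketch below; the role-permission entries are a small illustrative excerpt we made up for the example, not the actual EICU-AC permission configuration.

```python
# Illustrative excerpt only -- not the actual EICU-AC permission configuration.
ROLE_ACCESS = {
    "general administration": {
        "patient": {"patientunitstayid", "hospitaladmitsource", "hospitaldischargestatus"},
    },
    "physician": {
        "patient": {"patientunitstayid", "hospitaldischargestatus"},
        "diagnosis": {"patientunitstayid", "diagnosisname", "diagnosistime"},
    },
}


def check_access(role: str, required: dict[str, set[str]]) -> tuple[bool, list[str]]:
    """Return (denied, reasons) given the databases/columns required to answer a question."""
    accessible = ROLE_ACCESS.get(role, {})
    reasons = []
    for db, cols in required.items():
        if db not in accessible:
            reasons.append(f"database '{db}' is inaccessible to role '{role}'")
            continue
        for col in sorted(cols - accessible[db]):
            reasons.append(f"column '{col}' in '{db}' is inaccessible to role '{role}'")
    return bool(reasons), reasons


# Example: a question that needs 'diagnosisname' is denied for "general administration".
denied, reasons = check_access("general administration",
                               {"diagnosis": {"patientunitstayid", "diagnosisname"}})
```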
Each example in EICU-AC contains the following key information: 1) a healthcare-related question and the correct answer (from EICU), 2) the databases and the information categories required to answer the question (correctly inferred by EHRAgent), 3) a user role (assigned by us), 4) a binary label, '0' if all required databases and information categories are accessible to the given role and '1' otherwise, and 5) the required databases and information categories inaccessible to the given role if the label is '1'.
In summary, EICU-AC contains 52, 57, and 45 examples labeled '0' for "physician", "nursing", and "general administration", respectively, and 46, 55, and 61 examples labeled '1' for the three roles, respectively.
Mind2Web-SC is derived from Mind2Web, which contains over 2,000 complex web tasks spanning 137 websites across 31 domains (e.g., car rental, shopping, and entertainment).
The target web agent here is designed to solve each task by conducting a sequence of actions grounded on a provided webpage (e.g., clicking a certain button).
Mind2Web-SC additionally considers a safety control request with a set of rules that prohibit certain users from engaging in specific web activities (see the figure below).
Each example in Mind2Web-SC includes 1) a task to be conducted, 2) an action step towards the completion of the task (correctly inferred by SeeAct), 3) a user profile created by us, containing 'age' as an integer and 'domestic', 'dr_license', 'vaccine', and 'membership' as booleans, 4) a binary label, '1' if the action should be denied due to rule violations and '0' otherwise, and 5) the violated rule if the label is '1'.
In summary, Mind2Web-SC includes 100 examples per label.
Safety rules of Mind2Web-SC and the number of examples (with label '1') for each rule violation.
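Given such a user profile and rule set, the safety check reduces to matching profile fields against the rule that applies to the task at hand. The sketch below illustrates this with a few made-up task categories and rules of the same flavor as the benchmark's; it is not the exact Mind2Web-SC rule set.

```python
def check_rules(profile: dict, task_category: str) -> tuple[bool, list[str]]:
    """Return (denied, violations) for a hypothetical profile and task category."""
    violations = []
    if task_category == "rent_car" and not profile.get("dr_license", False):
        violations.append("user without a driver's license cannot rent a car")
    if task_category == "book_flight" and not profile.get("vaccine", False):
        violations.append("unvaccinated user cannot book a flight")
    if task_category == "shopping" and not profile.get("membership", False):
        violations.append("user must be a member to shop")
    if task_category == "book_hotel" and profile.get("age", 0) < 18:
        violations.append("user under 18 cannot book a hotel")
    return bool(violations), violations


# Example: a 16-year-old user trying to book a hotel is denied.
denied, reasons = check_rules({"age": 16, "membership": False}, "book_hotel")
```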
Setup: We test GuardAgent on EICU-AC and Mind2Web-SC, with EHRAgent and SeeAct as the respective target agents.
We use GPT-4 version 2024-02-01 with temperature zero as the core LLM of GuardAgent.
For EICU-AC and Mind2Web-SC, we use 1 and 3 demonstrations, respectively.
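A hedged sketch of this core LLM call, assuming the OpenAI Python SDK (>=1.0); the model identifier is a placeholder and the exact deployment/API configuration in our experiments may differ.

```python
from openai import OpenAI

client = OpenAI()  # assumes an API key is configured in the environment


def core_llm(prompt: str) -> str:
    """Query the core LLM of GuardAgent with deterministic decoding."""
    response = client.chat.completions.create(
        model="gpt-4",    # placeholder model identifier
        temperature=0,    # temperature zero, as in our setup
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```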
The guard requests for the two benchmarks are shown below.
Guard requests for EICU-AC and Mind2Web-SC in our experiments. GuardAgent is designed to serve diverse guard requests for different target agents.
Performance of GuardAgent on EICU-AC and Mind2Web-SC compared with the model-guard-agent baselines.
Breakdown of GuardAgent results over the three roles in EICU-AC and the six rules in Mind2Web-SC.
Performance of GuardAgent with different numbers of demonstrations on EICU-AC and Mind2Web-SC.
@misc{xiang2024guardagent,
  title={GuardAgent: Safeguard LLM Agents by a Guard Agent via Knowledge-Enabled Reasoning},
  author={Zhen Xiang and Linzhi Zheng and Yanjie Li and Junyuan Hong and Qinbin Li and Han Xie and Jiawei Zhang and Zidi Xiong and Chulin Xie and Carl Yang and Dawn Song and Bo Li},
  year={2024},
  eprint={2406.09187},
  archivePrefix={arXiv}
}