GuardAgent: Safeguard LLM Agents by a Guard Agent via Knowledge-Enabled Reasoning

1UIUC, 2Tsinghua University, 3Hong Kong Polytechnic University, 4UT Austin, 5UC Berkeley, 6Emory University, 7University of Chicago

Overview

Welcome to the project page of GuardAgent! In this project, we aim to provide guardrails for LLM-powered agents (dubbed "target agents" below) by checking whether their inputs and outputs satisfy a set of guard requests (e.g., safety rules or privacy policies) defined by the users. This is fundamentally different from guardrails for LLMs (e.g., Llama Guard), since the output of an LLM agent can be actions, code, control signals, etc. GuardAgent operates in two major steps: 1) creating a task plan by analyzing the provided guard requests, and 2) generating guardrail code based on the task plan and executing the code by calling APIs or using external engines. In both steps, an LLM serves as the core reasoning component, supplemented by in-context demonstrations retrieved from a memory module. Such knowledge-enabled reasoning allows GuardAgent to understand diverse textual guard requests and accurately "translate" them into executable code that provides reliable guardrails.

In addition to GuardAgent, we contribute two novel benchmarks: EICU-AC, for assessing privacy-related access control for healthcare agents, and Mind2Web-SC, for evaluating the safety of web agents. GuardAgent achieves 98.7% and 90.0% guarding accuracy in moderating invalid inputs and outputs on these two benchmarks, respectively. Finally, a GuardAgent API that provides real-time guardrails based on user guard requests is coming soon.

Method

[Figure 1: overview of GuardAgent]



The key idea of GuardAgent is to leverage the logical reasoning capabilities of LLMs, together with knowledge retrieval, to accurately "translate" textual guard requests into executable code.

Inputs to GuardAgent: 1) a set of user-defined guard requests (e.g., for privacy control), 2) a specification of the target agent (providing the context needed to define the guard requests), 3) inputs to the target agent, and 4) outputs (logs) of the target agent.

Outputs of GuardAgent: 1) a decision on whether the outputs of the target agent (actions, responses, etc.) should be denied, and 2) the reasons if they are denied.

Pipeline of GuardAgent:

  1. Task Planning: Generate a step-by-step action plan from the inputs. The prompt to the core LLM contains: 1) planning instructions (fixed for all use cases), 2) demonstrations for task planning retrieved from memory, and 3) the inputs to GuardAgent.

  2. Guardrail Code Generation and Execution: Generate guardrail code based on the task plan and execute it. The prompt to the core LLM contains: 1) code generation instructions, including all callable functions and APIs, 2) demonstrations for code generation retrieved from memory, and 3) the generated action plan. A code sketch of the two-step pipeline follows.
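
To make the two steps concrete, below is a minimal Python sketch of the pipeline. Every name in it (the call_llm interface, the stage-keyed demonstration memory, and the convention that the generated code sets a 'deny' flag and a list of 'reasons') is an illustrative assumption rather than the released implementation.

    PLANNING_INSTRUCTIONS = "..."  # fixed planning instructions, shared by all use cases
    CODEGEN_INSTRUCTIONS = "..."   # includes the list of callable functions and APIs

    def retrieve_demonstrations(memory, stage, query, k=1):
        # Hypothetical retrieval: return the k stored demonstrations for this
        # stage that best match the query (similarity scoring omitted).
        return memory.get(stage, [])[:k]

    def guardagent(call_llm, memory, guard_requests, agent_spec, agent_inputs, agent_logs):
        # Step 1: task planning with retrieved in-context demonstrations.
        plan = call_llm(PLANNING_INSTRUCTIONS,
                        retrieve_demonstrations(memory, "planning", agent_inputs),
                        (guard_requests, agent_spec, agent_inputs, agent_logs))
        # Step 2: generate guardrail code from the plan, then execute it.
        code = call_llm(CODEGEN_INSTRUCTIONS,
                        retrieve_demonstrations(memory, "codegen", plan),
                        plan)
        scope = {}
        exec(code, scope)  # assumed: the generated code sets scope["deny"] / scope["reasons"]
        return scope["deny"], scope.get("reasons", [])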


Key features of GuardAgent: 1) generalizable -- the memory and tools of GuardAgent can be easily extended to address new target agents with new guard requests, 2) reliable -- outputs of GuardAgent are obtained by successful code execution, and 3) training-free -- GuardAgent is based on in-context learning and does not require any LLM training.

Benchmark

We propose two novel benchmarks for different types of guard requests: 1) EICU-AC, which assesses access control for healthcare agents like EHRAgent, and 2) Mind2Web-SC, which evaluates safety control for web agents like SeeAct.

An example from EICU-AC (left) and an example from Mind2Web-SC (right)

EICU-AC is adapted from the EICU dataset, which contains questions regarding the clinical care of ICU patients and 10 relevant databases with the patient information needed to answer the questions. The designated task on the EICU-AC benchmark is access control, with three roles defined for potential users of a target healthcare agent: "physician", "nursing", and "general administration". The target agent is supposed to assist these three categories of users in answering the questions by retrieving information from the relevant databases. However, each role has access to only a subset of the databases and a subset of the information categories in each accessible database (marked in green below). A question to the target agent should be rejected if any of the databases or information categories required to answer it are inaccessible to the given role.

[Figure: databases and information categories accessible to each role, marked in green]



Each example in EICU-AC contains the following key information: 1) a healthcare-related question and its correct answer (from EICU), 2) the databases and information categories required to answer the question (correctly inferred by EHRAgent), 3) a user role (assigned by us), 4) a binary label: '0' if all required databases and information categories are accessible to the given role, and '1' otherwise, and 5) the required databases and information categories inaccessible to the role, if the label is '1'. In total, EICU-AC contains 52, 57, and 45 examples labeled '0' for "physician", "nursing", and "general administration", respectively, and 46, 55, and 61 examples labeled '1' for the three roles, respectively.
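
To illustrate the labeling logic, here is a small sketch of the access-control check: a question is rejected whenever any required database or information category falls outside the role's permissions. The permission table and field names below are hypothetical examples, not the benchmark's actual access configuration.

    # Hypothetical permission table: role -> {database: accessible information categories}.
    ACCESS = {
        "general administration": {"patient": {"hospitalid", "unitstaytype"}},
        # ... entries for "physician" and "nursing" omitted
    }

    def check_access(role, required):
        # required: {database: information categories needed to answer the question}.
        # Returns (label, inaccessible); label '1' means the question is rejected.
        allowed = ACCESS.get(role, {})
        inaccessible = {}
        for db, cats in required.items():
            missing = cats - allowed.get(db, set())
            if missing:
                inaccessible[db] = missing
        return ("1" if inaccessible else "0"), inaccessible

    # Example: a question also needing the patient's age is rejected for this role.
    print(check_access("general administration", {"patient": {"age", "hospitalid"}}))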

Mind2Web-SC is built on Mind2Web, which contains over 2,000 complex web tasks spanning 137 websites across 31 domains (e.g., car rental, shopping, and entertainment). The target web agent is designed to solve each task by conducting a sequence of actions grounded on a provided webpage (e.g., clicking a certain button). Mind2Web-SC additionally considers a safety control request with a set of rules that prohibit certain users from engaging in specific web activities (see the figure below). Each example in Mind2Web-SC includes: 1) a task to be conducted, 2) an action step towards the completion of the task (correctly inferred by SeeAct), 3) a user profile created by us, containing an integer 'age' and boolean fields 'domestic', 'dr_license', 'vaccine', and 'membership', 4) a binary label: '1' if the action should be denied due to rule violations, and '0' otherwise, and 5) the violated rule, if the label is '1'. In total, Mind2Web-SC includes 100 examples per label.
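
The rule check can be sketched in the same spirit: each rule is a predicate over the task and the user profile, and an action is denied if any predicate fires. The three rules below are hypothetical examples consistent with the profile fields; the benchmark's actual six rules are listed in the figure below.

    # Hypothetical rules: (description, predicate over task text and user profile).
    RULES = [
        ("no car rental without a driver license",
         lambda task, p: "rent a car" in task and not p["dr_license"]),
        ("no flight booking without vaccination",
         lambda task, p: "book a flight" in task and not p["vaccine"]),
        ("users under 18 cannot rent a car",
         lambda task, p: "rent a car" in task and p["age"] < 18),
    ]

    def check_action(task, profile):
        # Returns (label, violated); label '1' means the action is denied.
        violated = [desc for desc, pred in RULES if pred(task, profile)]
        return ("1" if violated else "0"), violated

    # Example: a 17-year-old with a driver license still violates the age rule.
    print(check_action("rent a car in Boston",
                       {"age": 17, "domestic": True, "dr_license": True,
                        "vaccine": True, "membership": False}))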

Safety rules of Mind2Web-SC and the number of examples (with label '1') for each rule violation.



Experiment

Setup: We test GuardAgent on EICU-AC and Mind2Web-SC, with EHRAgent and SeeAct as the target agents, respectively. We use GPT-4 (version 2024-02-01) with temperature zero as the core LLM of GuardAgent, with one demonstration for EICU-AC and three for Mind2Web-SC. The guard requests for the two benchmarks are shown below.

Guard requests for EICU-AC and Mind2Web-SC in our experiments. GuardAgent is designed to serve diverse guard requests for different target agents.



Evaluation metrics: 1) label prediction accuracy (LPA) -- the percentage of correct label predictions (i.e., rejecting inputs for examples labeled '1' and permitting outputs for examples labeled '0') over all examples in each dataset, 2) label prediction precision (LPP), 3) label prediction recall (LPR), and 4) comprehensive control accuracy (CCA) -- the percentage of examples with ground-truth label '1' that are correctly predicted AND with correct reasoning (i.e., all inaccessible databases and information categories for EICU-AC, or all violated rules for Mind2Web-SC, successfully detected).
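
A sketch of how these four metrics could be computed, assuming each example record carries the ground-truth label y, the predicted label y_hat, and, for the reasoning check, the ground-truth and detected violation sets (gt_set and pred_set are hypothetical field names):

    def metrics(examples):
        # Label statistics ('1' = reject/deny is the positive class).
        tp = sum(1 for e in examples if e["y"] == "1" and e["y_hat"] == "1")
        fp = sum(1 for e in examples if e["y"] == "0" and e["y_hat"] == "1")
        fn = sum(1 for e in examples if e["y"] == "1" and e["y_hat"] == "0")
        correct = sum(1 for e in examples if e["y"] == e["y_hat"])
        # CCA: label-'1' examples that are correctly predicted AND whose
        # detected violations exactly match the ground truth.
        pos = [e for e in examples if e["y"] == "1"]
        cca_hits = sum(1 for e in pos
                       if e["y_hat"] == "1" and e["pred_set"] == e["gt_set"])
        return {
            "LPA": correct / len(examples),
            "LPP": tp / (tp + fp) if tp + fp else 0.0,
            "LPR": tp / (tp + fn) if tp + fn else 0.0,
            "CCA": cca_hits / len(pos) if pos else 0.0,
        }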

Baselines: Since GuardAgent is the first LLM agent designed to safeguard other agents, we compare it with baselines that use models to safeguard agents. Model-guards-model approaches, such as Llama Guard, which detects predefined unsafe content categories, are not considered here due to their completely different objective. We consider GPT-4 (version 2024-02-01) and Llama3-70B as the guardrail models, with comprehensive prompts containing high-level guardrail task instructions and the same number of demonstrations as for GuardAgent, but without guardrail code generation or long-term memory.

Performance of GuardAgent on EICU-AC and Mind2Web-SC compared with the model-guard-agent baselines.



Breakdown of GuardAgent results over the three roles in EICU-AC and the six rules in Mind2Web-SC.



Performance of GuardAgent with different numbers of demonstrations on EICU-AC and Mind2Web-SC.



Results:
  1. GuardAgent outperforms the model-guard-agent baselines, likely because its code generation avoids ambiguities such as those in the database names.
  2. GuardAgent performs uniformly well across the three roles in EICU-AC and the six rules in Mind2Web-SC (except for rule 5, related to movies, music, and videos), demonstrating relatively strong capability in handling complex and diverse guard requests.
  3. GuardAgent achieves decent guardrail performance with very few demonstrations.

BibTeX


        @misc{xiang2024guardagent,
            title={GuardAgent: Safeguard LLM Agents by a Guard Agent via Knowledge-Enabled Reasoning},
            author={Zhen Xiang and Linzhi Zheng and Yanjie Li and Junyuan Hong and Qinbin Li and Han Xie and Jiawei Zhang and Zidi Xiong and Chulin Xie and Carl Yang and Dawn Song and Bo Li},
            year={2024},
            eprint={2406.09187},
            archivePrefix={arXiv}
        }