PersonalWAB

Leaderboard

The leaderboard shows the performance of different methods on the PersonalWAB benchmark.

Single-turn track: Each method gets exactly one attempt to execute the user's instruction, with no additional actions.

Multi-turn track: Each method may make multiple attempts, interacting with a user simulator in real time over several turns.

F. Acc: Function accuracy. Measures how accurately the agent uses the correct functions.

R. Acc: Result accuracy. Measures the correctness of the results those functions produce.

Avg. Steps: The average number of steps the agent takes to complete each instruction (reported for the multi-turn track only).
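As a rough illustration of how the three metrics relate to per-instruction outcomes, here is a minimal sketch. It is not the benchmark's evaluation code. The `Episode` record, the function labels, and the assumption that each result is scored in [0, 1] are all hypothetical; see the paper for the actual scoring rules.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """One evaluated instruction (hypothetical record layout)."""
    expected_function: str  # function the instruction calls for
    called_function: str    # function the agent actually used
    result_score: float     # correctness of the result, assumed to lie in [0, 1]
    steps: int              # number of steps the agent took


def summarize(episodes):
    """Aggregate per-instruction records into F. Acc, R. Acc, and Avg. Steps."""
    n = len(episodes)
    if n == 0:
        raise ValueError("no episodes to summarize")
    # Assumption: F. Acc is the share of instructions where the right function was used.
    f_acc = sum(e.called_function == e.expected_function for e in episodes) / n
    # Assumption: R. Acc is the mean per-instruction result score.
    r_acc = sum(e.result_score for e in episodes) / n
    avg_steps = sum(e.steps for e in episodes) / n
    return {
        "F. Acc": round(f_acc, 3),
        "R. Acc": round(r_acc, 3),
        "Avg. Steps": round(avg_steps, 3),
    }


# Toy data for illustration only; these are not benchmark results.
episodes = [
    Episode("search", "search", 0.8, 3),
    Episode("recommend", "search", 0.0, 5),
    Episode("review", "review", 0.5, 2),
    Episode("recommend", "recommend", 0.1, 4),
]
print(summarize(episodes))
```

Note that the Overall columns in the tables do not always equal the plain mean of the three per-task scores, so the leaderboard's overall figures are presumably aggregated differently, for example over all instructions.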

In each track, methods are ranked based on their overall F. Acc and R. Acc scores.
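The text does not say how the two scores are combined into a rank. The sketch below assumes methods are sorted by overall F. Acc first, with overall R. Acc as a tie-breaker. That sort key is an assumption. The scores are copied from the Overall columns of the first results table (no Avg. Steps columns, so presumably the single-turn track).

```python
# Overall (F. Acc, R. Acc) pairs copied from the first results table,
# listed here in their published rank order.
overall = {
    "PUMA (LLaMA-7B)": (0.994, 0.406),
    "PUMA (gpt-4o)": (0.979, 0.373),
    "ReAct": (0.815, 0.350),
    "Relevant Memory": (0.800, 0.356),
    "Last Memory": (0.782, 0.357),
    "Random Memory": (0.745, 0.357),
    "RecMind": (0.721, 0.359),
    "No Memory": (0.684, 0.355),
}


def rank(scores):
    """Return method names ordered best-first.

    Assumption: overall F. Acc is the primary key and overall R. Acc
    the tie-breaker, both descending.
    """
    ordered = sorted(scores.items(), key=lambda kv: (kv[1][0], kv[1][1]), reverse=True)
    return [name for name, _ in ordered]


for position, name in enumerate(rank(overall), start=1):
    print(position, name)
```

Under this assumption the sort reproduces the published order for that table. No two methods share the same overall F. Acc there, so the R. Acc tie-breaker is never exercised and cannot be confirmed from the data.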

Notice: To submit your results for inclusion on the leaderboard, please fill out this form.

Single-turn track

Rank	Method	Search F. Acc	Search R. Acc	Recommendation F. Acc	Recommendation R. Acc	Review F. Acc	Review R. Acc	Overall F. Acc	Overall R. Acc
1	PUMA (LLaMA-7B)	0.996	0.652	0.987	0.054	1.000	0.538	0.994	0.406
2	PUMA (gpt-4o)	1.000	0.649	0.939	0.048	1.000	0.449	0.979	0.373
3	ReAct	0.903	0.605	0.560	0.027	0.996	0.444	0.815	0.350
4	Relevant Memory	0.928	0.622	0.492	0.030	1.000	0.443	0.800	0.356
5	Last Memory	0.937	0.626	0.432	0.028	1.000	0.442	0.782	0.357
6	Random Memory	0.974	0.640	0.296	0.018	0.996	0.442	0.745	0.357
7	RecMind	0.981	0.645	0.226	0.017	0.990	0.442	0.721	0.359
8	No Memory	1.000	0.647	0.092	0.000	1.000	0.444	0.684	0.355

Multi-turn track

Rank	Method	Search F. Acc	Search R. Acc	Search Avg. Steps	Recommendation F. Acc	Recommendation R. Acc	Recommendation Avg. Steps	Review F. Acc	Review R. Acc	Review Avg. Steps	Overall F. Acc	Overall R. Acc	Overall Avg. Steps
1	PUMA (gpt-4o)	0.999	0.720	5.082	0.984	0.052	3.791	1.000	0.453	2.002	0.994	0.399	3.608
2	Relevant Memory	0.996	0.686	4.233	0.715	0.042	4.564	0.999	0.448	2.008	0.899	0.383	3.609
3	Last Memory	0.996	0.676	4.229	0.708	0.045	4.252	1.000	0.449	2.007	0.897	0.381	3.498
4	Random Memory	0.999	0.680	4.193	0.703	0.042	4.474	1.000	0.448	2.007	0.896	0.380	3.564
5	InteRecAgent	0.999	0.642	3.110	0.618	0.022	3.008	1.000	0.447	2.001	0.867	0.362	2.706
6	RecMind	0.997	0.642	6.728	0.347	0.026	6.003	0.997	0.451	2.107	0.771	0.364	4.938
7	Reflection	1.000	0.686	5.406	0.281	0.014	6.145	0.976	0.449	2.145	0.741	0.373	4.579
8	ReAct	0.996	0.674	4.657	0.218	0.013	5.468	0.974	0.448	2.129	0.718	0.369	4.098
9	No Memory	0.996	0.656	2.398	0.096	0.000	2.420	1.000	0.446	2.019	0.685	0.358	2.280

Method	Backbone	Author	Created	Paper	Code	Note
PUMA	LLaMA2-7B & gpt-4o-mini	PersonalWAB	2024-10	paper	code	See implementation details in our paper
ReAct	gpt-4o-mini	Baseline	2024-10	paper	code	See implementation details in our paper
Reflection	gpt-4o-mini	Baseline	2024-10	paper	code	See implementation details in our paper
RecMind	gpt-4o-mini	Baseline	2024-10	paper	code	See implementation details in our paper
InteRecAgent	gpt-4o-mini	Baseline	2024-10	paper	code	See implementation details in our paper
No Memory	gpt-4o-mini	Baseline	2024-10	N.A.	N.A.	See implementation details in our paper
Random Memory	gpt-4o-mini	Baseline	2024-10	N.A.	N.A.	See implementation details in our paper
Last Memory	gpt-4o-mini	Baseline	2024-10	N.A.	N.A.	See implementation details in our paper
Relevant Memory	gpt-4o-mini	Baseline	2024-10	N.A.	N.A.	See implementation details in our paper