Leaderboard
The leaderboard shows the performance of different methods on the PersonalWAB benchmark.
Within each track, methods are ranked by their overall function accuracy (F. Acc) and result accuracy (R. Acc).
Notice: To submit your results for inclusion on the leaderboard, please fill out this form.
Single-turn Track
Rank | Method | Search F. Acc | Search R. Acc | Recommendation F. Acc | Recommendation R. Acc | Review F. Acc | Review R. Acc | Overall F. Acc | Overall R. Acc |
---|---|---|---|---|---|---|---|---|---|
1 | PUMA (LLaMA-7B) | 0.996 | 0.652 | 0.987 | 0.054 | 1.000 | 0.538 | 0.994 | 0.406 |
2 | PUMA (gpt-4o) | 1.000 | 0.649 | 0.939 | 0.048 | 1.000 | 0.449 | 0.979 | 0.373 |
3 | ReAct | 0.903 | 0.605 | 0.560 | 0.027 | 0.996 | 0.444 | 0.815 | 0.350 |
4 | Relevant Memory | 0.928 | 0.622 | 0.492 | 0.030 | 1.000 | 0.443 | 0.800 | 0.356 |
5 | Last Memory | 0.937 | 0.626 | 0.432 | 0.028 | 1.000 | 0.442 | 0.782 | 0.357 |
6 | Random Memory | 0.974 | 0.640 | 0.296 | 0.018 | 0.996 | 0.442 | 0.745 | 0.357 |
7 | RecMind | 0.981 | 0.645 | 0.226 | 0.017 | 0.990 | 0.442 | 0.721 | 0.359 |
8 | No Memory | 1.000 | 0.647 | 0.092 | 0.000 | 1.000 | 0.444 | 0.684 | 0.355 |
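The order of the Single-turn Track table above is consistent with sorting by overall F. Acc in descending order. The following is a minimal sketch of that ranking, not the benchmark's official scoring script: the `rank_methods` helper and the use of overall R. Acc as a tie-breaker are assumptions made for illustration, and the scores are copied from the Overall columns of the table.

```python
# Overall scores copied from the Single-turn Track table:
# (method, overall F. Acc, overall R. Acc). Deliberately listed out of order.
single_turn_overall = [
    ("No Memory", 0.684, 0.355),
    ("ReAct", 0.815, 0.350),
    ("PUMA (gpt-4o)", 0.979, 0.373),
    ("RecMind", 0.721, 0.359),
    ("Last Memory", 0.782, 0.357),
    ("PUMA (LLaMA-7B)", 0.994, 0.406),
    ("Random Memory", 0.745, 0.357),
    ("Relevant Memory", 0.800, 0.356),
]


def rank_methods(rows):
    """Sort by overall F. Acc (descending), then by overall R. Acc (descending).

    Hypothetical helper: the R. Acc tie-break is an assumption of this sketch,
    not a rule stated by the leaderboard.
    """
    return sorted(rows, key=lambda row: (-row[1], -row[2]))


ranked = rank_methods(single_turn_overall)
for position, (method, f_acc, r_acc) in enumerate(ranked, start=1):
    print(f"{position} | {method} | {f_acc:.3f} | {r_acc:.3f}")
```

Because no two methods in the table share the same overall F. Acc, the tie-break never fires on this data; it only matters if a future submission ties with an existing entry.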
Multi-turn Track
Rank | Method | Search F. Acc | Search R. Acc | Search Avg. Steps | Recommendation F. Acc | Recommendation R. Acc | Recommendation Avg. Steps | Review F. Acc | Review R. Acc | Review Avg. Steps | Overall F. Acc | Overall R. Acc | Overall Avg. Steps |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | PUMA (gpt-4o) | 0.999 | 0.720 | 5.082 | 0.984 | 0.052 | 3.791 | 1.000 | 0.453 | 2.002 | 0.994 | 0.399 | 3.608 |
2 | Relevant Memory | 0.996 | 0.686 | 4.233 | 0.715 | 0.042 | 4.564 | 0.999 | 0.448 | 2.008 | 0.899 | 0.383 | 3.609 |
3 | Last Memory | 0.996 | 0.676 | 4.229 | 0.708 | 0.045 | 4.252 | 1.000 | 0.449 | 2.007 | 0.897 | 0.381 | 3.498 |
4 | Random Memory | 0.999 | 0.680 | 4.193 | 0.703 | 0.042 | 4.474 | 1.000 | 0.448 | 2.007 | 0.896 | 0.380 | 3.564 |
5 | InteRecAgent | 0.999 | 0.642 | 3.110 | 0.618 | 0.022 | 3.008 | 1.000 | 0.447 | 2.001 | 0.867 | 0.362 | 2.706 |
6 | RecMind | 0.997 | 0.642 | 6.728 | 0.347 | 0.026 | 6.003 | 0.997 | 0.451 | 2.107 | 0.771 | 0.364 | 4.938 |
7 | Reflection | 1.000 | 0.686 | 5.406 | 0.281 | 0.014 | 6.145 | 0.976 | 0.449 | 2.145 | 0.741 | 0.373 | 4.579 |
8 | ReAct | 0.996 | 0.674 | 4.657 | 0.218 | 0.013 | 5.468 | 0.974 | 0.448 | 2.129 | 0.718 | 0.369 | 4.098 |
9 | No Memory | 0.996 | 0.656 | 2.398 | 0.096 | 0.000 | 2.420 | 1.000 | 0.446 | 2.019 | 0.685 | 0.358 | 2.280 |
Method Details
Method | Backbone | Author | Created | Paper | Code | Note |
---|---|---|---|---|---|---|
PUMA | LLaMA2-7B & gpt-4o-mini | PersonalWAB | 2024-10 | paper | code | See implementation details in our paper |
ReAct | gpt-4o-mini | Baseline | 2024-10 | paper | code | See implementation details in our paper |
Reflection | gpt-4o-mini | Baseline | 2024-10 | paper | code | See implementation details in our paper |
RecMind | gpt-4o-mini | Baseline | 2024-10 | paper | code | See implementation details in our paper |
InteRecAgent | gpt-4o-mini | Baseline | 2024-10 | paper | code | See implementation details in our paper |
No Memory | gpt-4o-mini | Baseline | 2024-10 | N.A. | N.A. | See implementation details in our paper |
Random Memory | gpt-4o-mini | Baseline | 2024-10 | N.A. | N.A. | See implementation details in our paper |
Last Memory | gpt-4o-mini | Baseline | 2024-10 | N.A. | N.A. | See implementation details in our paper |
Relevant Memory | gpt-4o-mini | Baseline | 2024-10 | N.A. | N.A. | See implementation details in our paper |