Leaderboard

The leaderboard shows the performance of different methods on the PersonalWAB benchmark.

  • Single-turn track: each method gets a single attempt to execute the user's instruction, with no follow-up actions.
  • Multi-turn track: methods may take multiple turns, interacting with a user simulator in real-time multi-turn conversations.
  • F. Acc: function accuracy, the fraction of instructions for which the agent calls the correct function.
  • R. Acc: result accuracy, the fraction of instructions for which the function call produces a correct result.
  • Avg. Steps: the average number of steps the agent takes to complete each instruction.
  • Within each track, methods are ranked by their overall F. Acc and R. Acc scores.
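The metric definitions above can be sketched in a few lines of code. This is an illustrative computation only, not the benchmark's actual evaluation harness: the per-instruction record layout (`correct_function`, `correct_result`, `steps`) is an assumption made for the example.

```python
# Hypothetical per-instruction records; the field names are assumptions
# for illustration, not the benchmark's real data format.
def compute_metrics(records):
    """Each record is a dict with:
      - 'correct_function': bool, agent called the right function (F. Acc)
      - 'correct_result':   bool, the call produced a correct result (R. Acc)
      - 'steps':            int, actions taken (multi-turn track only)
    """
    n = len(records)
    f_acc = sum(r["correct_function"] for r in records) / n
    r_acc = sum(r["correct_result"] for r in records) / n
    avg_steps = sum(r["steps"] for r in records) / n
    return {"F. Acc": f_acc, "R. Acc": r_acc, "Avg. Steps": avg_steps}

records = [
    {"correct_function": True,  "correct_result": True,  "steps": 3},
    {"correct_function": True,  "correct_result": False, "steps": 5},
    {"correct_function": False, "correct_result": False, "steps": 6},
    {"correct_function": True,  "correct_result": True,  "steps": 2},
]
print(compute_metrics(records))
# {'F. Acc': 0.75, 'R. Acc': 0.5, 'Avg. Steps': 4.0}
```

Note that a correct result requires a correct function call first, which is why R. Acc never exceeds F. Acc in the tables below.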

    Notice: To submit your results for inclusion on the leaderboard, please fill out this form.

    Single-turn Track

| Rank | Method | Search F. Acc | Search R. Acc | Recommendation F. Acc | Recommendation R. Acc | Review F. Acc | Review R. Acc | Overall F. Acc | Overall R. Acc |
|---|---|---|---|---|---|---|---|---|---|
| 1 | PUMA (LLaMA-7B) | 0.996 | 0.652 | 0.987 | 0.054 | 1.000 | 0.538 | 0.994 | 0.406 |
| 2 | PUMA (gpt-4o) | 1.000 | 0.649 | 0.939 | 0.048 | 1.000 | 0.449 | 0.979 | 0.373 |
| 3 | ReAct | 0.903 | 0.605 | 0.560 | 0.027 | 0.996 | 0.444 | 0.815 | 0.350 |
| 4 | Relevant Memory | 0.928 | 0.622 | 0.492 | 0.030 | 1.000 | 0.443 | 0.800 | 0.356 |
| 5 | Last Memory | 0.937 | 0.626 | 0.432 | 0.028 | 1.000 | 0.442 | 0.782 | 0.357 |
| 6 | Random Memory | 0.974 | 0.640 | 0.296 | 0.018 | 0.996 | 0.442 | 0.745 | 0.357 |
| 7 | RecMind | 0.981 | 0.645 | 0.226 | 0.017 | 0.990 | 0.442 | 0.721 | 0.359 |
| 8 | No Memory | 1.000 | 0.647 | 0.092 | 0.000 | 1.000 | 0.444 | 0.684 | 0.355 |

    Multi-turn Track

| Rank | Method | Search F. Acc | Search R. Acc | Search Avg. Steps | Recommendation F. Acc | Recommendation R. Acc | Recommendation Avg. Steps | Review F. Acc | Review R. Acc | Review Avg. Steps | Overall F. Acc | Overall R. Acc | Overall Avg. Steps |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | PUMA (gpt-4o) | 0.999 | 0.720 | 5.082 | 0.984 | 0.052 | 3.791 | 1.000 | 0.453 | 2.002 | 0.994 | 0.399 | 3.608 |
| 2 | Relevant Memory | 0.996 | 0.686 | 4.233 | 0.715 | 0.042 | 4.564 | 0.999 | 0.448 | 2.008 | 0.899 | 0.383 | 3.609 |
| 3 | Last Memory | 0.996 | 0.676 | 4.229 | 0.708 | 0.045 | 4.252 | 1.000 | 0.449 | 2.007 | 0.897 | 0.381 | 3.498 |
| 4 | Random Memory | 0.999 | 0.680 | 4.193 | 0.703 | 0.042 | 4.474 | 1.000 | 0.448 | 2.007 | 0.896 | 0.380 | 3.564 |
| 5 | InteRecAgent | 0.999 | 0.642 | 3.110 | 0.618 | 0.022 | 3.008 | 1.000 | 0.447 | 2.001 | 0.867 | 0.362 | 2.706 |
| 6 | RecMind | 0.997 | 0.642 | 6.728 | 0.347 | 0.026 | 6.003 | 0.997 | 0.451 | 2.107 | 0.771 | 0.364 | 4.938 |
| 7 | Reflection | 1.000 | 0.686 | 5.406 | 0.281 | 0.014 | 6.145 | 0.976 | 0.449 | 2.145 | 0.741 | 0.373 | 4.579 |
| 8 | ReAct | 0.996 | 0.674 | 4.657 | 0.218 | 0.013 | 5.468 | 0.974 | 0.448 | 2.129 | 0.718 | 0.369 | 4.098 |
| 9 | No Memory | 0.996 | 0.656 | 2.398 | 0.096 | 0.000 | 2.420 | 1.000 | 0.446 | 2.019 | 0.685 | 0.358 | 2.280 |

    Method Details

| Method | Backbone | Author | Created | Paper | Code | Note |
|---|---|---|---|---|---|---|
| PUMA | LLaMA2-7B & gpt-4o-mini | PersonalWAB | 2024-10 | paper | code | See implementation details in our paper |
| ReAct | gpt-4o-mini | Baseline | 2024-10 | paper | code | See implementation details in our paper |
| Reflection | gpt-4o-mini | Baseline | 2024-10 | paper | code | See implementation details in our paper |
| RecMind | gpt-4o-mini | Baseline | 2024-10 | paper | code | See implementation details in our paper |
| InteRecAgent | gpt-4o-mini | Baseline | 2024-10 | paper | code | See implementation details in our paper |
| No Memory | gpt-4o-mini | Baseline | 2024-10 | N.A. | N.A. | See implementation details in our paper |
| Random Memory | gpt-4o-mini | Baseline | 2024-10 | N.A. | N.A. | See implementation details in our paper |
| Last Memory | gpt-4o-mini | Baseline | 2024-10 | N.A. | N.A. | See implementation details in our paper |
| Relevant Memory | gpt-4o-mini | Baseline | 2024-10 | N.A. | N.A. | See implementation details in our paper |