Abstract: Understanding the impact of mutations on protein-protein binding affinity is a key objective for a wide range of biotechnological applications and for shedding light on disease-causing mutations, which are often located at protein-protein interfaces. Over the past decade, many computational methods using physics-based and/or machine learning approaches have been developed to predict how protein binding affinity changes upon mutation. They all claim to achieve astonishing accuracy on both training and test sets, with performances on standard benchmarks such as SKEMPI 2.0 that seem overly optimistic. Here we benchmarked six well-known and widely used predictors and identified their biases and dataset dependencies, using not only SKEMPI 2.0 as a test set but also deep mutagenesis data on the SARS-CoV-2 spike protein in complex with the human angiotensin-converting enzyme 2. We showed that, even though most tested methods reach a significant degree of robustness and accuracy, they suffer from limited generalizability and struggle to predict unseen mutations. Undesirable prediction biases towards specific mutation properties, the most marked being towards destabilizing mutations, are also observed and should be carefully considered by method developers. We conclude from our analyses that there is room for im