Performance evaluation is key to building, training, validating, testing, comparing, and publishing classifier models for machine-learning-based classification problems. Performance instruments fall into two broad categories: confusion-matrix-derived metrics, such as accuracy, true positive rate, and F1, and graphical metrics, such as the area under the receiver operating characteristic curve (AUC). Probabilistic performance instruments, originally used in regression and time-series forecasting, are also applied to some binary-class and multi-class classifiers, such as artificial neural networks. Beyond widely known probabilistic instruments such as Mean Squared Error (MSE), Root Mean Square Error (RMSE), and LogLoss, many other instruments exist; however, it has not been established which of them are suitable specifically for binary-classification performance evaluation. This study proposes BenchMetrics Prob, a qualitative and quantitative benchmarking method that systematically evaluates probabilistic instruments via five criteria and fourteen simulation cases based on hypothetical classifiers on synthetic datasets. These criteria and cases provide deeper insight for selecting a proper instrument in binary-classification performance evaluation. The method was tested on more than 31 instruments and instrument variants, and the results identified three instruments as the most robust for binary-classification performance evaluation: Sum Squared Error (SSE), MSE with its RMSE variant, and Mean Absolute Error (MAE). The results also showed that instrument variants with summarization functions other than the mean (e.g., median and geometric mean), as well as instrument subtypes later proposed to improve performance evaluation in regression, such as relative, percentage, and symmetric-percentage error instruments, are not robust. Researchers should therefore be cautious when selecting these instruments or reporting performance with them in binary classification problems.
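As a brief illustration of the robust instruments named above, the following sketch computes SSE, MSE, RMSE, and MAE over a hypothetical binary classifier's predicted class-1 probabilities; the labels and probabilities are made-up example values, not data from the study.

```python
import math

def probabilistic_errors(y_true, y_prob):
    """Compute SSE, MSE, RMSE, and MAE between binary labels (0/1)
    and predicted class-1 probabilities."""
    errors = [yt - yp for yt, yp in zip(y_true, y_prob)]
    n = len(errors)
    sse = sum(e * e for e in errors)      # Sum Squared Error
    mse = sse / n                         # Mean Squared Error
    rmse = math.sqrt(mse)                 # RMSE variant of MSE
    mae = sum(abs(e) for e in errors) / n # Mean Absolute Error
    return {"SSE": sse, "MSE": mse, "RMSE": rmse, "MAE": mae}

# Hypothetical classifier output on five instances
labels = [1, 0, 1, 1, 0]
probs = [0.9, 0.2, 0.6, 0.8, 0.1]
metrics = probabilistic_errors(labels, probs)
```

Lower values indicate better probabilistic calibration; a perfect classifier assigning probability 1.0 to every positive and 0.0 to every negative would score zero on all four instruments.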