How to select between models when AUC scores are similar?


I use two machine learning algorithms for binary classification, and I get these results:

Algo 1:

 AUC (train): 0.75, AUC (test): 0.65 (higher train score / more overfitting)

Algo 2:

 AUC (train): 0.72, AUC (test): 0.65 (lower train score / less overfitting)

Which one is better?

  • I would like to point out that if you are not optimizing probability rank, you should not use AUC. A good deal of research recommends against using AUC to choose models or parameters. Even when the focus is on rank/ordering, I have found inconsistencies in special cases that create confusion. To start with, see the paper "AUC: a misleading measure of the performance of predictive distribution models".
    – ldmtwo
    7 hours ago

machine-learning data-mining metric






edited Mar 15 at 15:03 by Esmailian

asked Mar 15 at 12:18 by Nirmine

3 Answers
Based on the AUC score, the two models are the same. It does not really matter whether a model is overfitting or not; what matters is how well it performs on new data (the test score).

Overfitting is only an indication that there might be room for improvement by making your model more general. But until the test score increases, the model has not improved, even if it is overfitting less.

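For reference, here is a minimal sketch of what the AUC score measures: the probability that a randomly chosen positive example is scored above a randomly chosen negative one. The function name `auc_from_scores` and the toy labels and scores are hypothetical, made up for illustration; they are not the asker's data. The second set of scores is a monotone transform of the first, which shows that AUC depends only on the ranking.

```python
def auc_from_scores(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy test set (illustration only).
labels = [1, 0, 1, 1, 0, 0]
scores_a = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
# Cubing is strictly increasing on positive numbers, so the ranking is unchanged.
scores_b = [s ** 3 for s in scores_a]

auc_a = auc_from_scores(labels, scores_a)
auc_b = auc_from_scores(labels, scores_b)
print(auc_a, auc_b)
```

Both calls return the same value, because the order of the scores is all that AUC looks at. The pairwise loop is quadratic in the number of examples, so for real data a library routine is the more practical choice.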












  • Thanks Simon. So if I understand correctly, I should always take the model with the highest test score as the best one, without giving any importance to the training score?
    – Nirmine
    Mar 15 at 13:17

  • Yes, that is correct.
    – Simon Larsson
    Mar 15 at 13:23

  • Thanks for your help.
    – Nirmine
    Mar 15 at 13:25

  • No problem! Don't forget to mark my answer as correct if you got what you asked for.
    – Simon Larsson
    Mar 15 at 13:27

Algo 2

Between models with equal test scores, choose the one with the smaller difference between training and test scores (Algo 2), since the one with the better training score (Algo 1) is more overfitted. We tolerate a more overfitted model only if it has a subjectively better test score.

For a better justification, think of how we train a neural network. When the validation score stops improving, we stop the training process, even though the training score would keep improving. If we let the training continue, the model starts making extra assumptions based on the training set that are not scrutinized by the critic (the validation set), which makes the model more prone to building false assumptions about the data.

By the same token, a model (Algo 1) that performs the same according to the critic (the test set) but performs better on the training set is prone to have made untested assumptions about the data.

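The early-stopping analogy in this answer can be sketched as follows. This is only an illustration under stated assumptions: the function `early_stopping_epoch`, the `patience` parameter, and the per-epoch AUC curves are all hypothetical and were chosen so that the numbers echo the question; they are not taken from any real training run.

```python
def early_stopping_epoch(val_scores, patience=2):
    """Return the index of the epoch with the best validation score,
    stopping once `patience` epochs pass without an improvement."""
    best_epoch = 0
    best_score = float("-inf")
    for epoch, score in enumerate(val_scores):
        if score > best_score:
            best_score = score
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            break  # validation score has stalled; stop training
    return best_epoch


# Hypothetical per-epoch AUC curves (illustration only).
train_auc = [0.60, 0.66, 0.70, 0.72, 0.74, 0.75, 0.77]
val_auc = [0.58, 0.62, 0.64, 0.65, 0.65, 0.64, 0.63]

stop = early_stopping_epoch(val_auc, patience=2)
print(stop, train_auc[stop], val_auc[stop])
```

In this made-up run, training past the selected epoch raises the training AUC from 0.72 to 0.75 and beyond while the validation AUC never exceeds 0.65, which is the situation the answer describes for Algo 1 versus Algo 2.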












  • How can you make these assumptions? The test score tells you the generalization ability of the algorithm regardless of the bias/variance. I feel like you can say nothing about which one will perform better on another test set.
    – Simon Larsson
    Mar 15 at 13:48

  • Genuinely curious btw, in case you know something I have missed. :)
    – Simon Larsson
    Mar 15 at 13:50

  • @SimonLarsson cool! I made some updates.
    – Esmailian
    Mar 15 at 14:04

  • Thank you for replying! But what I would like to know is how you can assume that one will generalize better than the other on other data when the test score is the same. Just because you know that one model has learned some junk from the training set does not mean that the other model has learned something useful in its place.
    – Simon Larsson
    Mar 15 at 14:15

  • @SimonLarsson I think fundamentally it's an Occam's Razor thing, with an assumption that the more-overfit model is "more complicated." In specific situations it's easier; e.g., if the data is time-dependent and the test set is out-of-time, then the train/test score discrepancy might indicate degradation over time, so that future performance may degrade faster in the more-overfit model.
    – Ben Reiniger
    Mar 15 at 14:39

Based on this metric alone, you cannot tell which one is better, because AUC does not differentiate between these two results. You should use other metrics, such as Kappa, or some benchmarks.

Disclaimer:

If you are using Python, I suggest the PyCM module, which takes your confusion matrix as input and calculates about 100 overall and class-based metrics.

To use this module, first prepare your confusion matrix and see its recommended parameters with the following code:

>>> from pycm import ConfusionMatrix
>>> cm = ConfusionMatrix(matrix={"0": {"0": 1, "1": 0, "2": 0}, "1": {"0": 0, "1": 1, "2": 2}, "2": {"0": 0, "1": 1, "2": 0}})
>>> print(cm.recommended_list)
["Kappa", "SOA1(Landis & Koch)", "SOA2(Fleiss)", "SOA3(Altman)", "SOA4(Cicchetti)", "CEN", "MCEN", "MCC", "J", "Overall J", "Overall MCC", "Overall CEN", "Overall MCEN", "AUC", "AUCI", "G", "DP", "DPI", "GI"]

and then look at the values of the metrics, focusing on the recommended ones, with the following code:

>>> print(cm)
Predict          0    1    2
Actual
0                1    0    0
1                0    1    2
2                0    1    0

Overall Statistics :

95% CI                                    (-0.02941,0.82941)
Bennett_S                                 0.1
Chi-Squared                               6.66667
Chi-Squared DF                            4
Conditional Entropy                       0.55098
Cramer_V                                  0.8165
Cross Entropy                             1.52193
Gwet_AC1                                  0.13043
Joint Entropy                             1.92193
KL Divergence                             0.15098
Kappa                                     0.0625
Kappa 95% CI                              (-0.60846,0.73346)
Kappa No Prevalence                       -0.2
Kappa Standard Error                      0.34233
Kappa Unbiased                            0.03226
Lambda A                                  0.5
Lambda B                                  0.66667
Mutual Information                        0.97095
Overall_ACC                               0.4
Overall_RACC                              0.36
Overall_RACCU                             0.38
PPV_Macro                                 0.5
PPV_Micro                                 0.4
Phi-Squared                               1.33333
Reference Entropy                         1.37095
Response Entropy                          1.52193
Scott_PI                                  0.03226
Standard Error                            0.21909
Strength_Of_Agreement(Altman)             Poor
Strength_Of_Agreement(Cicchetti)          Poor
Strength_Of_Agreement(Fleiss)             Poor
Strength_Of_Agreement(Landis and Koch)    Slight
TPR_Macro                                 0.44444
TPR_Micro                                 0.4

Class Statistics :

Classes                                                     0         1         2
ACC(Accuracy)                                               1.0       0.4       0.4
BM(Informedness or bookmaker informedness)                  1.0       -0.16667  -0.5
DOR(Diagnostic odds ratio)                                  None      0.5       0.0
ERR(Error rate)                                             0.0       0.6       0.6
F0.5(F0.5 score)                                            1.0       0.45455   0.0
F1(F1 score - harmonic mean of precision and sensitivity)   1.0       0.4       0.0
F2(F2 score)                                                1.0       0.35714   0.0
FDR(False discovery rate)                                   0.0       0.5       1.0
FN(False negative/miss/type 2 error)                        0         2         1
FNR(Miss rate or false negative rate)                       0.0       0.66667   1.0
FOR(False omission rate)                                    0.0       0.66667   0.33333
FP(False positive/type 1 error/false alarm)                 0         1         2
FPR(Fall-out or false positive rate)                        0.0       0.5       0.5
G(G-measure geometric mean of precision and sensitivity)    1.0       0.40825   0.0
LR+(Positive likelihood ratio)                              None      0.66667   0.0
LR-(Negative likelihood ratio)                              0.0       1.33333   2.0
MCC(Matthews correlation coefficient)                       1.0       -0.16667  -0.40825
MK(Markedness)                                              1.0       -0.16667  -0.33333
N(Condition negative)                                       4         2         4
NPV(Negative predictive value)                              1.0       0.33333   0.66667
P(Condition positive)                                       1         3         1
POP(Population)                                             5         5         5
PPV(Precision or positive predictive value)                 1.0       0.5       0.0
PRE(Prevalence)                                             0.2       0.6       0.2
RACC(Random accuracy)                                       0.04      0.24      0.08
RACCU(Random accuracy unbiased)                             0.04      0.25      0.09
TN(True negative/correct rejection)                         4         1         2
TNR(Specificity or true negative rate)                      1.0       0.5       0.5
TON(Test outcome negative)                                  4         3         3
TOP(Test outcome positive)                                  1         2         2
TP(True positive/hit)                                       1         1         0
TPR(Sensitivity, recall, hit rate, or true positive rate)   1.0       0.33333   0.0

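To make the Kappa figure in that output less of a black box, here is a small sketch that recomputes Cohen's kappa by hand from the same 3x3 confusion matrix. The helper `cohen_kappa` is a hypothetical name written for this illustration in plain Python; it is not part of PyCM's API.

```python
def cohen_kappa(matrix):
    """Cohen's kappa for a square confusion matrix.

    matrix[i][j] is the number of samples with actual class i
    and predicted class j.
    """
    k = len(matrix)
    n = sum(sum(row) for row in matrix)
    # Observed agreement: fraction of samples on the diagonal.
    observed = sum(matrix[i][i] for i in range(k)) / n
    # Chance agreement: product of actual and predicted marginals per class.
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(k)) for j in range(k)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Same confusion matrix as in the answer above (rows: actual, columns: predicted).
matrix = [
    [1, 0, 0],
    [0, 1, 2],
    [0, 1, 0],
]

kappa = cohen_kappa(matrix)
print(round(kappa, 5))
```

Here the observed agreement is 2/5 = 0.4 and the chance agreement is 9/25 = 0.36, which correspond to the Overall_ACC and Overall_RACC rows, giving (0.4 - 0.36) / (1 - 0.36) = 0.0625, the Kappa value reported in the output.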








  • You should mention that you are an author of the package. (datascience.stackexchange.com/help/behavior)
    – Ben Reiniger
    Mar 15 at 16:23

  • Thanks for the reminder. I just edited my answer.
    – Alireza Zolanvari
    Mar 15 at 16:25

  • @alirezazolanvari In my opinion, changing the measure does not solve the underlying problem. First, the choice of measure depends on the task too; we cannot pick and choose independently. More importantly, this problem can happen with any other measure (e.g. Kappa) too, so the solution is not simply to change the measure.
    – Esmailian
    Mar 15 at 17:53

  • @Esmailian Obviously the evaluation metric is directly related to the task, but research on finding proper metrics for evaluating a learning algorithm has focused on clarifying the difference in performance between algorithms in cases where simple metrics such as AUC cannot say which one is better. Overall, many other things should be considered to answer this question. This answer is not a golden key to the problem, but it can be helpful in solving it.
    – Alireza Zolanvari
    Mar 15 at 18:53

Answer "Based on the AUC score they are the same": answered Mar 15 at 12:36 by Simon Larsson, edited Mar 15 at 12:42

Answer "Algo 2": answered Mar 15 at 13:31 by Esmailian, edited Mar 15 at 15:52

$begingroup$
Genuinely curious btw, incase you know something I have missed. :)
$endgroup$
– Simon Larsson
Mar 15 at 13:50












$begingroup$
@SimonLarsson cool! I made some updates.
$endgroup$
– Esmailian
Mar 15 at 14:04




$begingroup$
@SimonLarsson cool! I made some updates.
$endgroup$
– Esmailian
Mar 15 at 14:04












$begingroup$
Thank you for replying! But what I would like to know is how you can assume that one will generalize better than the other on other data when the test score is the same? Just because you know that one model has learned some junk from the training set it does not say that the other model will have learned something useful in its' place.
$endgroup$
– Simon Larsson
Mar 15 at 14:15




$begingroup$
Thank you for replying! But what I would like to know is how you can assume that one will generalize better than the other on other data when the test score is the same? Just because you know that one model has learned some junk from the training set it does not say that the other model will have learned something useful in its' place.
$endgroup$
– Simon Larsson
Mar 15 at 14:15




2




2




$begingroup$
@SimonLarsson I think fundamentally it's an Occam's Razor thing, with an assumption that the more-overfit model is "more complicated." In specific situations it's easier; e.g., if the data is time-dependent and the test set is out-of-time, then the train/test score discrepancy might indicate degradation over time, so that future performance may degrade faster in the more-overfit model.
$endgroup$
– Ben Reiniger
Mar 15 at 14:39




$begingroup$
@SimonLarsson I think fundamentally it's an Occam's Razor thing, with an assumption that the more-overfit model is "more complicated." In specific situations it's easier; e.g., if the data is time-dependent and the test set is out-of-time, then the train/test score discrepancy might indicate degradation over time, so that future performance may degrade faster in the more-overfit model.
$endgroup$
– Ben Reiniger
Mar 15 at 14:39











1












$begingroup$

Based on this metric alone you cannot tell which model is better, because AUC does not differentiate between the two results. You should use other metrics, such as Kappa, or some benchmarks.



Disclaimer: I am one of the authors of the PyCM package recommended below.



If you are using Python, I suggest the PyCM module, which takes your confusion matrix as input and calculates about 100 overall and class-based metrics.



To use this module, first prepare your confusion matrix and look at its recommended parameters with the following code:



>>> from pycm import ConfusionMatrix
>>> cm = ConfusionMatrix(matrix={"0": {"0": 1, "1": 0, "2": 0}, "1": {"0": 0, "1": 1, "2": 2}, "2": {"0": 0, "1": 1, "2": 0}})
>>> print(cm.recommended_list)
["Kappa", "SOA1(Landis & Koch)", "SOA2(Fleiss)", "SOA3(Altman)", "SOA4(Cicchetti)", "CEN", "MCEN", "MCC", "J", "Overall J", "Overall MCC", "Overall CEN", "Overall MCEN", "AUC", "AUCI", "G", "DP", "DPI", "GI"]


Then print the full report and focus on the values of the recommended metrics:



>>> print(cm)
Predict    0    1    2
Actual
0          1    0    0
1          0    1    2
2          0    1    0




Overall Statistics :

95% CI (-0.02941,0.82941)
Bennett_S 0.1
Chi-Squared 6.66667
Chi-Squared DF 4
Conditional Entropy 0.55098
Cramer_V 0.8165
Cross Entropy 1.52193
Gwet_AC1 0.13043
Joint Entropy 1.92193
KL Divergence 0.15098
Kappa 0.0625
Kappa 95% CI (-0.60846,0.73346)
Kappa No Prevalence -0.2
Kappa Standard Error 0.34233
Kappa Unbiased 0.03226
Lambda A 0.5
Lambda B 0.66667
Mutual Information 0.97095
Overall_ACC 0.4
Overall_RACC 0.36
Overall_RACCU 0.38
PPV_Macro 0.5
PPV_Micro 0.4
Phi-Squared 1.33333
Reference Entropy 1.37095
Response Entropy 1.52193
Scott_PI 0.03226
Standard Error 0.21909
Strength_Of_Agreement(Altman) Poor
Strength_Of_Agreement(Cicchetti) Poor
Strength_Of_Agreement(Fleiss) Poor
Strength_Of_Agreement(Landis and Koch) Slight
TPR_Macro 0.44444
TPR_Micro 0.4

Class Statistics :

Classes 0 1 2
ACC(Accuracy) 1.0 0.4 0.4
BM(Informedness or bookmaker informedness) 1.0 -0.16667 -0.5
DOR(Diagnostic odds ratio) None 0.5 0.0
ERR(Error rate) 0.0 0.6 0.6
F0.5(F0.5 score) 1.0 0.45455 0.0
F1(F1 score - harmonic mean of precision and sensitivity) 1.0 0.4 0.0
F2(F2 score) 1.0 0.35714 0.0
FDR(False discovery rate) 0.0 0.5 1.0
FN(False negative/miss/type 2 error) 0 2 1
FNR(Miss rate or false negative rate) 0.0 0.66667 1.0
FOR(False omission rate) 0.0 0.66667 0.33333
FP(False positive/type 1 error/false alarm) 0 1 2
FPR(Fall-out or false positive rate) 0.0 0.5 0.5
G(G-measure geometric mean of precision and sensitivity) 1.0 0.40825 0.0
LR+(Positive likelihood ratio) None 0.66667 0.0
LR-(Negative likelihood ratio) 0.0 1.33333 2.0
MCC(Matthews correlation coefficient) 1.0 -0.16667 -0.40825
MK(Markedness) 1.0 -0.16667 -0.33333
N(Condition negative) 4 2 4
NPV(Negative predictive value) 1.0 0.33333 0.66667
P(Condition positive) 1 3 1
POP(Population) 5 5 5
PPV(Precision or positive predictive value) 1.0 0.5 0.0
PRE(Prevalence) 0.2 0.6 0.2
RACC(Random accuracy) 0.04 0.24 0.08
RACCU(Random accuracy unbiased) 0.04 0.25 0.09
TN(True negative/correct rejection) 4 1 2
TNR(Specificity or true negative rate) 1.0 0.5 0.5
TON(Test outcome negative) 4 3 3
TOP(Test outcome positive) 1 2 2
TP(True positive/hit) 1 1 0
TPR(Sensitivity, recall, hit rate, or true positive rate) 1.0 0.33333 0.0
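As a cross-check on the report above, Cohen's kappa can be recomputed directly from the confusion matrix. This is a minimal sketch in plain Python, independent of PyCM; the helper name `cohen_kappa` and the nested-dict layout (actual class first, then predicted class) are assumptions made for illustration.

```python
def cohen_kappa(matrix):
    """Cohen's kappa for a confusion matrix given as {actual: {predicted: count}}."""
    classes = list(matrix)
    total = sum(matrix[a][p] for a in classes for p in classes)
    # Observed agreement: fraction of samples on the diagonal.
    observed = sum(matrix[c][c] for c in classes) / total
    # Chance agreement: actual marginal times predicted marginal, summed over classes.
    expected = sum(
        sum(matrix[c].values()) * sum(matrix[a][c] for a in classes)
        for c in classes
    ) / total ** 2
    return (observed - expected) / (1 - expected)


# Same confusion matrix as in the PyCM example above.
matrix = {
    "0": {"0": 1, "1": 0, "2": 0},
    "1": {"0": 0, "1": 1, "2": 2},
    "2": {"0": 0, "1": 1, "2": 0},
}
kappa = cohen_kappa(matrix)
print(round(kappa, 5))  # 0.0625
```

The observed agreement (0.4) and chance agreement (0.36) match the Overall_ACC and Overall_RACC rows of the report, and the resulting kappa matches its Kappa row.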





$endgroup$








  • $begingroup$
    You should mention that you are an author of the package. (datascience.stackexchange.com/help/behavior)
    $endgroup$
    – Ben Reiniger
    Mar 15 at 16:23










  • $begingroup$
    Thanks for the reminder. I just edited my answer.
    $endgroup$
    – Alireza Zolanvari
    Mar 15 at 16:25










  • $begingroup$
    @alirezazolanvari In my opinion, changing the measure does not solve the underlying problem. First, the choice of measure depends on the task too; we cannot pick and choose independently. More importantly, this problem can happen with any other measure (e.g. Kappa) too; the solution is not simply to change the measure.
    $endgroup$
    – Esmailian
    Mar 15 at 17:53










  • $begingroup$
    @Esmailian Obviously the evaluation metric is directly related to the task, but research on finding proper metrics for evaluating a learning algorithm has focused on clarifying the difference in performance between algorithms in cases where simple metrics such as AUC cannot say which one is better. Overall, many other things should be considered to answer this question. This answer is not a golden key to the problem, but it can help solve it.
    $endgroup$
    – Alireza Zolanvari
    Mar 15 at 18:53















edited Mar 15 at 16:24

























answered Mar 15 at 14:11









Alireza Zolanvari
















