Talk:Coefficient of determination

Latest comment: 1 year ago by DavidMCEddy in topic Adjusted R^2

Problematic article?

The 2021 PeerJ article cited in the introduction has a number of issues, to my eye:

  1. The main author is an editor of that journal (maybe in good faith, but still a red flag).
  2. Their definition of R^2 is not the one typically implemented in software and it doesn't match the one in this Wikipedia article. There are 10+ definitions of R^2 and they are not generally equivalent. See Kvålseth (1985) https://www.jstor.org/stable/2683704
  3. The paper cites forum discussions and blog posts. These may be of good quality, but citing them does not raise confidence in the material.
  4. The writing seems sloppy. This is a minor issue, but it doesn't help the overall impression.

About the statement in the Wiki article: I find "truthful" to be very misleading. Without getting philosophical, the matter at hand is assessing regression. One main point of the paper is that R^2 is more intuitive since it can be seen as a percentage, while the other measures have arbitrary values. I rephrased accordingly, though I am not sure how to feel about the cited paper. SpunkyLepton (talk) 20:25, 11 November 2021 (UTC)

It says that only in the case of a linear regression, the coefficient of determination is equal to the square of the correlation coefficient. Should this not be: only in the case of a linear regression with a linear model? (A linear regression can also be performed with e.g. a quadratic model, in which case the coefficient of determination is not equal to the square of the correlation coefficient.) Job. 193.191.138.240 (talk) 09:53, 9 April 2008 (UTC)

"With a linear model" would be ambiguous at best, considering that in a regression context a "linear model" refers to how the parameters relate to the predicted values, not the structure of the model with respect to changes in the explanatory variables. Anyway, the present version is almost clear and nearly correct: "correlation coefficient between the original and modelled data values" is meant to mean the correlation between the observed and predicted values, not the correlation between the observed values and individual explanatory variables. Unfortunately, previous changes have left some terminology a little vague. I will try to improve things. Melcombe (talk) 11:24, 9 April 2008 (UTC)
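The distinction drawn in this reply can be checked numerically. The following is a minimal sketch, assuming made-up data and an ordinary least-squares fit of a quadratic (linear in its parameters, intercept included). The variable names are illustrative only. Under those assumptions, R2 should match the squared correlation between observed and fitted values, but not the squared correlation between y and x.

```python
import numpy as np

# Illustrative data (made up): a curved relationship with noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 4.0, 40)
y = 1.0 + 0.5 * x + 2.0 * x**2 + rng.normal(scale=1.0, size=x.size)

# Least-squares fit of a quadratic: linear in the parameters, with intercept.
coeffs = np.polyfit(x, y, deg=2)
y_hat = np.polyval(coeffs, x)

# R^2 from the residual and total sums of squares.
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

# Squared correlation between observed and fitted values.
r_obs_pred_sq = np.corrcoef(y, y_hat)[0, 1] ** 2
# Squared correlation between observed values and the single variable x.
r_obs_x_sq = np.corrcoef(y, x)[0, 1] ** 2

print(r2, r_obs_pred_sq, r_obs_x_sq)
```

The first two printed values should agree up to rounding, while the third is generally smaller.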

Definitions/Variables are not consistent here

This page is confusing because the variables are not consistent with the pages for Residual sum of squares and Explained sum of squares. On those pages,   is defined as   and the same for  . Also, isn't  ? Don't want to fix this myself because I'm just learning stats, maybe someone more experienced can? —Preceding unsigned comment added by 128.6.30.208 (talk) 03:22, 23 October 2007 (UTC)

But these pages may have problems of their own. For example, Residual sum of squares defines   for general values of the regression coefficients, not necessarily for the fitted coefficients, whereas Explained sum of squares assumes that fitted coefficients are used, as would the usage here. There is also the question of whether the pages are general enough to be interpretable for more than just a single explanatory variable. Melcombe 14:41, 23 October 2007 (UTC)

The basic problem is that there is no such thing as a consistent definition. It would simply not make sense to "fix" something, since both naming conventions (Residual/Explained vs. Regression/Error SS) are widely used. Actually, this is the first time that I have come across a problem like this in math/stat, since usually definitions are precise and unique. But that's life. --Scherben 01:13, 26 October 2007 (UTC)

Yes, these notations are confusing! In my opinion, there are two kinds of notation we can use: notation with or without subscripts. That is:  , where   and   is "Total Sum of Squares".  , where   and   is "Sum of Squares for Regression" or "Explained Sum of Squares".  , where   and   is "Error Sum of Squares", "Residual Sum of Squares" or "Unexplained Sum of Squares". Then we can define R Square as:  . I think the main problem in the original page is  . Maybe some textbooks name it "Residual sum of squares" while others use "Sum of squares for regression". Actually, they are different. So it would be better to use   and   to distinguish these two concepts. Otherwise, the   in   may mean many things. Lanyijie 12/28/07

Why is   noted in the text just below the formulae? Should it be instead of   in the formula for  ? 193.11.155.115 (talk) 14:26, 11 September 2008 (UTC)

The formulas for R2 involving   should only be used in those cases where  , so there would be no difference. Melcombe (talk) 15:30, 11 September 2008 (UTC)

Adjusted R Square

There is a somewhat better explanation of this at: http://www.csus.edu/indiv/j/jensena/mgmt105/adjustr2.htm I think we can add to the definition: 1) the motivation for "Adjusted R Square", and 2) a note that it can be viewed as an index when comparing regression models (like the standard error).

Tal.

—The preceding unsigned comment was added by Talgalili (talk • contribs) 16:04, 21 February 2007 (UTC).


Currently says "Adjusted R2 does not have the same interpretation as R2. As such, care must be taken in interpreting and reporting this statistic." That is not actionable advice. What care exactly? dfrankow (talk) 18:45, 4 March 2011 (UTC)

Causality

I thought that since R is based on the general linear model you could infer causality from the model?? You are really just doing an ANOVA with a continuous factor (X) as opposed to a categorical one


>> No. R^2 has nothing at all to do with causality. Causality can only be implied by imposition of specific assumptions on the process being modeled. -- Guest.

>> Causality is a design issue, not a statistical one. You need to measure the exposure before you see the outcome. If you're doing a cross-sectional regression, causality can never be inferred. Only if it is a regression with the exposure measured at one time point and the outcome at a later time can you suggest*** (suggest is the operative word) that there is a causal relationship. - SM

Need some help: does anyone know why the R2 reported by Excel differs from the definition given here?

Range of R-squared

Who says R-squared should be greater than zero? For example, if measured y-values are between 9 and 10, and the model prediction is always zero, then R-squared is heavily negative. Kokkokanta 07:50, 28 January 2007 (UTC)

>> Go back and look at the definition. For one thing, all the sums are of squared differences. Moreover, SSE<=SST by construction. So R^2 is certainly non-negative. Adjusted R^2 *can* be negative, however. -- Guest.


>> No. R-squared can be negative. This page does not necessarily relate to linear regression, or if it is meant to do so, it does not say this. You only have the conclusion SSE<=SST if "prediction=mean" is a special case of the model being fitted, and only for certain ways of fitting the model ... for example, you could choose to always fit the model by setting all the parameters to 99. You can still evaluate a value of R-squared in such cases. Less outlandish cases arise where the model fitted doesn't include an intercept term in the usual terminology. The "no intercept" case might warrant a specific mention on the page. Melcombe 14:41, 20 June 2007 (UTC)
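The two cases raised in this thread (a prediction that is always zero, and a fit without an intercept) can be tried directly. This is only a sketch under the assumption that R-squared is computed as 1 - SSE/SST. The helper `r_squared` and the data values are made up for illustration.

```python
import numpy as np

def r_squared(y, f):
    """R^2 computed as 1 - SSE/SST for observed y and predicted f."""
    y = np.asarray(y, dtype=float)
    f = np.asarray(f, dtype=float)
    ss_res = np.sum((y - f) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Observations between 9 and 10, as in the example above.
y = np.array([9.0, 9.25, 9.5, 9.75, 10.0])

# Case 1: the "model" always predicts zero.
r2_zero = r_squared(y, np.zeros_like(y))

# Case 2: predicting the mean gives exactly zero, the usual lower bound.
r2_mean = r_squared(y, np.full_like(y, y.mean()))

# Case 3: least squares through the origin (no intercept), made-up x values.
x = np.array([5.0, 4.0, 3.0, 2.0, 1.0])
slope = np.sum(x * y) / np.sum(x ** 2)
r2_no_intercept = r_squared(y, slope * x)

print(r2_zero, r2_mean, r2_no_intercept)
```

With these numbers the zero prediction gives R-squared = -722, and the no-intercept fit is also negative, because "prediction = mean" is not a special case of either model.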

Possible expansions

  • Consider mention of the Nagelkerke criterion, an analogue that you can use with generalized linear models, which are not fitted by ordinary least squares.
  • We can't assume that R^2 is applicable with every kind of least-squares regression. For example, it doesn't make sense with regression through the origin. There has been a discussion of limitations in American Statistician.
  • Adjusted R^2 can be negative.

Dfarrar 14:04, 8 March 2007 (UTC)

Nagelkerke's pseudo-R^2 really doesn't belong in this article IMHO. It deserves a separate page, perhaps along with other pseudo-R^2 measures. The point is well-made about regression through the origin, but redefinition of R^2 is trivial in this context. Perhaps that should be mentioned.

---Guest

R squared will be negative if you remove the intercept from the equation.


Nagelkerke's pseudo-R^2 is a scaled version of Cox and Snell's R^2 that can be obtained from a generalized linear model when dealing with binary responses. When using binary responses, a better coefficient of determination has been suggested in genetic profile analyses (see below).

Lee, S.H, Goddard, M.E., Wray, N.R., Visscher, P.M. (2012) A better coefficient of determination for genetic profile analysis. Genetic Epidemiology 2012; 36(3): 214-224.

Including McKelvey-Zavoina's R^2 should also be considered (see below).

McKelvey RD, Zavoina W. 1975. A statistical model for the analysis of ordinal level dependent variables. J Math Sociol 4:103-120. Bi 12:08, 13 September 2012 (UTC)
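To make the "scaled version" remark concrete, here is a minimal sketch of the usual textbook forms of the Cox and Snell and Nagelkerke statistics, computed from log-likelihoods. The function names and the example log-likelihood values are hypothetical and do not come from any fitted model.

```python
import math

def cox_snell_r2(ll_null, ll_model, n):
    """Cox and Snell R^2 from null and fitted log-likelihoods, n observations."""
    return 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)

def nagelkerke_r2(ll_null, ll_model, n):
    """Cox and Snell R^2 divided by its maximum attainable value."""
    max_cox_snell = 1.0 - math.exp(2.0 * ll_null / n)
    return cox_snell_r2(ll_null, ll_model, n) / max_cox_snell

# Hypothetical log-likelihoods for a binary-response model with n = 100.
n = 100
ll_null = -69.3    # intercept-only model (assumed value)
ll_model = -50.0   # fitted model (assumed value)

cs = cox_snell_r2(ll_null, ll_model, n)
nk = nagelkerke_r2(ll_null, ll_model, n)
print(cs, nk)
```

The rescaling matters because, for discrete responses, the Cox and Snell value cannot reach 1 even for a perfect fit.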

Causality

R^2 is only one measure of association. The causality issue applies to all of them. The issue has been addressed generically. See links inserted.

Dfarrar 14:29, 8 March 2007 (UTC)

Inflation of R-square

This has been a good day for additions to my watched pages. Regarding this new material, I think some terms could be explained to make the article more widely accessible, without doing much harm, e.g. "weakly smaller." Repeating a previous point, I suggest inclusion of material on analogous statistics applicable with models other than Gaussian, e.g., with generalized linear models. Dfarrar 22:25, 20 March 2007 (UTC)

R squared formula

I changed the formula to what I believe to be the correct one, but it has been reverted. My source is Essentials of Econometrics by Damodar Gujarati. Can whoever changed it please cite their source for this? Cheers.

  • I can see why you're confused. In your book though, "E" likely stands for "Explained" and "R" likely stands for "Residuals." In the equation on this page, "R" stands for "Regression" (or "Explained") and "E" stands for "Error" (or "Residuals"). Gujarati's Basic Econometrics uses "Explained" and "Residuals" as well, so the lettering is exactly the opposite. VivekVish 03:58, 18 April 2007 (UTC)
  • Ah I see, thanks for clearing that up.
  • An edit by someone yesterday (with a history of bad edits on other pages) screwed up this section again. It is now fixed again. The text on alternative meanings of E and R is very helpful, and hopefully will prevent these problems in the future. 152.3.58.200 16:44, 7 June 2007 (UTC)

Isn't there a mistake in the definition of  ? I think   should be there.

Changed to this form, but there is equivalence since, under the conditions in which this form of R2 is used, the means are the same. Melcombe (talk) 16:53, 11 February 2008 (UTC)

Adj R2

This bit seems wrong to me: "adjusted R2 will be more useful only if the R2 is calculated based on a sample, not the entire population. For example, if our unit of analysis is a state, and we have data for all counties, then adjusted R2 will not yield any more useful information than R2."

It is not clear why this would be. Even if you had the population, you would still be concerned about exhausting degrees of freedom. You would thus want to penalize any calculation of the R2 for the number of regressors. If you have the population of U.S. states (N=50) and you have a model with k=50, you will perfectly predict and get an R2 of one. But this is misleading. The adjustment is meant to account for degrees of freedom, not estimation error.

Still A Student 03:06, 9 September 2007 (UTC)

I agree with this comment. The para should be removed. It might be relevant to add something along the lines of "If there is an unlimited number of linearly independent candidate regressors, both R2 and adjusted R2 become unreliable as the number of regressors increases: R2 tends towards one, while adjusted R2 becomes more variable". Also, perhaps there need to be some pointers to related statistics such as Mallows Cp. Melcombe 09:02, 18 September 2007 (UTC)
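The degrees-of-freedom point made in this thread can be illustrated with a small simulation. The sketch below assumes N = 50 cases, a response that is pure noise, and regressors that are also pure noise, and uses the common adjustment 1 - (1 - R2)(n - 1)/(n - p - 1). All names and data are invented for the illustration.

```python
import numpy as np

def r2_and_adjusted(y, X):
    """Return (R^2, adjusted R^2) for an OLS fit of y on X plus an intercept.

    X holds the p regressors without the intercept column, which is added here.
    """
    n, p = X.shape
    design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    ss_res = resid @ resid
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    adjusted = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
    return r2, adjusted

rng = np.random.default_rng(1)
n = 50                                   # e.g. the 50 U.S. states
y = rng.normal(size=n)                   # response unrelated to any regressor
noise = rng.normal(size=(n, 45))         # candidate regressors, all pure noise

# Nested models using the first p noise columns.
results = {p: r2_and_adjusted(y, noise[:, :p]) for p in (1, 10, 25, 45)}
for p, (r2, adjusted) in results.items():
    print(p, round(r2, 3), round(adjusted, 3))
```

R2 climbs towards one as regressors are added even though none of them is related to y, whether or not the 50 cases are regarded as the whole population. The adjusted value is always below R2 here.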

R-squared bigger than 1?

A simple question but I can't figure it out:

Why is R squared for y=(1 3 5) and y_est=(2 7 3) bigger than 1? It must be between 0 and 1. SSR=17 SST=8 —Preceding unsigned comment added by 85.107.12.120 (talk) 13:21, 20 September 2007 (UTC)

This happened because the expression used assumed that the fitted values would be obtained by regression on the observed values, and your values don't have the features that would occur if the y_est had been obtained by regression. I have revised the main text. Melcombe 14:23, 10 October 2007 (UTC)


I think it could be better explained in the text. I'm analysing my regression results by looking at the R2 on the test set samples. If my regression has a big SQE, the R2 can be bigger than 1. It took me some hours to realize that the range 0<R2<1 applies only to the training set.

Attempted improvement. Melcombe (talk) 10:06, 6 November 2008 (UTC)
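The numbers quoted at the start of this thread can be reproduced as follows. This sketch uses only the y and y_est values given above. The two formula names are labels chosen here, not the article's terminology.

```python
import numpy as np

y = np.array([1.0, 3.0, 5.0])
y_est = np.array([2.0, 7.0, 3.0])   # not obtained by regression on y

ss_tot = np.sum((y - y.mean()) ** 2)       # 8
ss_reg = np.sum((y_est - y.mean()) ** 2)   # 17
ss_res = np.sum((y - y_est) ** 2)          # 21

# "Explained" form: exceeds 1 because y_est did not come from a regression.
r2_explained = ss_reg / ss_tot             # 2.125
# "Residual" form: never above 1, but can be negative.
r2_residual = 1.0 - ss_res / ss_tot        # -1.625

print(r2_explained, r2_residual)
```

The two forms agree only when the sums of squares add up (SST = SSR + SSE), which is not the case for these values: 17 + 21 is not 8.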

What is ?

What is  ? Is it the mean? Can someone put that in the text? --Play

I have included a first version. Melcombe (talk) 12:12, 6 December 2007 (UTC)

Added it again, as it was removed by someone. Melcombe (talk) 16:36, 11 February 2008 (UTC)

Causality Again (was at top)

I believe that R-squared is a measure of variability aligned rather than variability accounted for. With respect to "correlation is not causation", consider R-squared as variability "aligned" rather than "accounted for." For example, if the number of churches in cities is correlated with the number of bars in cities, say .9, then R-squared is .81. Rather than the number of bars accounting for the number of churches, consider that the variability (81%) of their related numbers is aligned. (Their variability alignment is most likely "accounted for" by population.) Respectfully submitted, Gary Greer greerg@uhd.edu January 26, 2008. —Preceding unsigned comment added by 75.16.159.122 (talk) 01:34, 27 January 2008 (UTC)

"Accounted for" is standard terminology. R-squared is used in connection with a model of the user's choice, where the user chooses which variables to use in constructing the model's predicted values. There is no implication of causality ... the idea is to find the best predictor of the dependent variable that can be constructed from the chosen predictors. From one point of view the idea is to explain as much of the variation in the dependent variable (variation of the value from case to case) as possible using the selected variables, and hence the task can be phrased as attempting to account for as much variation as possible. Similarly, adding an additional independent variable can be thought of as seeking to account for more variation. Melcombe (talk) 16:53, 11 February 2008 (UTC)

So, as a student of statistics, I ask: To explain the correlation between observed number_of_churches and observed number_of_bars, whereby neither is truly explanatory in this scenario while both tend to be responses to the number_of_residents_at_large (which may itself be unobserved), how should these variables be invoked? Also, how should their respective parameters be invoked? Which should be the "dependent" variable and which should be the "independent" variable? We are presented with an inherent problem of connotation with these terms for statistical usage, especially in English, so some serious effort ought to be made to help clarify the denotations as well as connotations for future generations, no? Those with more experience more easily overlook the difficulties of trying to learn new concepts using traditionally ambiguous terminology that often appears misleading, especially given the connotation of the word origins. Perhaps this bars-vs-churches case gives us the ideal opportunity to start cleaning up our language, eh? —Preceding unsigned comment added by 69.243.78.10 (talk) 10:11, 11 April 2009 (UTC)

Seems to me this discussion points to a shortcoming in the article. It uses terms such as "explain" and "account for" without prominently mentioning that these are terms of jargon. Civilians often find news stories saying that "statistics show" that a particular variable, such as personal income or running speed, is explained by 10% or 45% or whatever, by race or height or some such, and assume that the explanation for the other 90% or 55%, or whatever, might also be found. Somewhere near the top, the article should warn that these various "accounts for" numbers imply neither causality nor that they ought to add up to 100%. Jim.henderson (talk) 15:50, 1 May 2012 (UTC)
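The churches-and-bars example discussed in this thread can be mimicked with simulated data. The sketch below is purely illustrative: the city populations, the per-resident rates, and the noise levels are all invented, and the helper `residuals_after` is a hypothetical name. It assumes both counts depend on population and on nothing else.

```python
import numpy as np

rng = np.random.default_rng(42)
n_cities = 2000

# Invented data: both counts are driven by population plus independent noise.
population = rng.uniform(1e4, 1e6, size=n_cities)
churches = 0.001 * population + rng.normal(scale=50.0, size=n_cities)
bars = 0.002 * population + rng.normal(scale=100.0, size=n_cities)

def residuals_after(target, predictor):
    """Residuals of a simple linear regression of target on predictor."""
    slope, intercept = np.polyfit(predictor, target, deg=1)
    return target - (slope * predictor + intercept)

# R^2 of churches on bars (squared correlation for simple linear regression).
r2_raw = np.corrcoef(churches, bars)[0, 1] ** 2

# The same after removing the linear effect of population from both.
r2_given_population = np.corrcoef(
    residuals_after(churches, population),
    residuals_after(bars, population),
)[0, 1] ** 2

print(r2_raw, r2_given_population)
```

In this setup bars "account for" most of the variation in churches in the jargon sense, yet almost nothing remains once population is taken into account. This is consistent with the point that the wording implies no causality.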

Undid change from R2 to r2

I undid a set of changes that tried to change the notation from R2 to r2 ... because -

  • I think R2 is the most commonly used notation
  • The notation was not changed everywhere, specifically in displayed maths and section titles and possibly elsewhere, so that the result as left was very poor.

Melcombe (talk) 10:09, 6 March 2008 (UTC)

Weighting the origin as an absolute known point

Forcing the linear curve through the origin is the same as adding the point (0,0) to your data and giving it an infinite number of replicates. If we believe that (0,0) is a legitimate and absolute point, then by using this point to fit our curve, we can argue that the accuracy of our curve is improved merely because we know that it goes through at least one point that we believe to be correct. By weighting the origin this way, we could argue that the coefficient of determination should use the following equation, when it is forced through the origin.

R² = (SUM xy)² / (SUM(x²) * SUM(y²))

A derivation of this formula can be provided if you want to see one. —Preceding unsigned comment added by JNLII (talk • contribs) 22:36, 11 November 2008 (UTC)
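The formula proposed above can be compared numerically with an "uncentered" R² for a least-squares line forced through the origin. This is a sketch on made-up data. It assumes the total sum of squares is taken about zero (the sum of y²) rather than about the mean, which is the assumption under which the two expressions coincide.

```python
import numpy as np

# Made-up data lying roughly on a line through the origin.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Least-squares slope for the no-intercept model y = b * x.
slope = np.sum(x * y) / np.sum(x ** 2)
ss_res = np.sum((y - slope * x) ** 2)

# R^2 with the total sum of squares taken about zero instead of the mean.
r2_uncentered = 1.0 - ss_res / np.sum(y ** 2)

# The formula from the comment: (SUM xy)^2 / (SUM(x^2) * SUM(y^2)).
r2_formula = np.sum(x * y) ** 2 / (np.sum(x ** 2) * np.sum(y ** 2))

print(r2_uncentered, r2_formula)
```

The agreement follows because the residual sum of squares of the no-intercept fit equals SUM(y²) - (SUM xy)² / SUM(x²).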

Article needs to have summary which is understandable to common people

Per Wikipedia's guidelines, at least the summary of this article should give nearly any person a good understanding of what this is. As it stands now, the summary only states "this isn't defined but we're going to give you a 3 page admittedly non-defining definition which only people who have studied statistics will understand".

(I'd like to second this point; I came here to find computing formulas for the coefficient of determination for polynomial regression, and found three pages of incomprehensible and useless (to me) blather.) — Preceding unsigned comment added by Alan8 (talk • contribs) 02:02, 10 June 2014 (UTC)

Proposed reorganization

I think this article should begin with a definition of the population R2, along the following lines:

For a regression relationship with additive errors Y=f(X)+e, the population R2 is 1-var(e)/var(Y). More generally, the population R2 is

 

I think the vast majority of the time, this is what is being estimated when people speak of R2 (there are some alternative definitions in survival analysis, I believe, but they are not in common use).

These expressions are always between zero and one.

The bulk of the current article can then be placed under a header "Estimating R2 in linear least squares modeling" In this section it can be pointed out that statistical estimators do not always obey constraints that their target values obey, and that this is not contradictory or particularly problematic in most cases. Additional sections can cover estimating R2 in settings besides linear least squares.

Doing this would allow some streamlining of the introduction to address the previous comment.

I will leave this comment here for a few weeks before doing anything. Skbkekas (talk) 18:57, 8 March 2009 (UTC)
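For the additive-error case mentioned in this proposal, the quantity 1 - var(e)/var(Y) can be approximated by simulation. The sketch below assumes an arbitrary choice f(X) = sin(X), X uniform on [-3, 3], and independent normal errors. None of these choices comes from the discussion. It covers only the additive-error expression, not the more general one.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200_000

# Assumed setup: Y = f(X) + e with f = sin, independent of the errors e.
x = rng.uniform(-3.0, 3.0, size=n)
e = rng.normal(scale=0.5, size=n)
y = np.sin(x) + e

# Large-sample approximation to the population value 1 - var(e) / var(Y).
r2_population = 1.0 - np.var(e) / np.var(y)

# Sample R^2 when the true f (not a fitted model) supplies the predictions.
r2_true_f = 1.0 - np.sum((y - np.sin(x)) ** 2) / np.sum((y - y.mean()) ** 2)

print(r2_population, r2_true_f)
```

In this setup the value is about 0.68 and lies between zero and one. The sample R² computed with the true f is close to it.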

You are taking a very narrow view and trying to constrain what is said to cover only what a statistician starting from a theoretical regression-based point of view would think a coefficient of determination means. The article needs to be suitable for everyone else as well. To most people a coefficient of determination is something that is calculated in a particular way and which can be interpreted and compared across different predictors in a sensible way. They are not thinking that it estimates anything in relation to a fictitious population. It would be ridiculous to start from a definition that makes the coefficient lie between zero and one when it is common for calculated values to lie outside this range, and where the corresponding theoretical values also lie outside this range in a perfectly meaningful and consistent way. You are making the mistake of thinking that regression models under assumed-true conditions should decide how a non-theoretical quantity should be defined. Melcombe (talk) 09:47, 9 March 2009 (UTC)

I will give others a chance to weigh in before doing anything, but I want to make a few points in response here:

  • I was proposing (i) a reorganization to separate the material that is specific to linear least-squares from the material that is not, and (ii) the addition of a small amount of material (3-4 sentences) pointing out that the R2 statistic has a population version. I did not propose to remove any material. Therefore, I don't think I am taking a "narrow point of view" or trying to "constrain what is said."
  • Nearly every wikipedia article about a statistical topic mentions the existence of a population analogue to a statistic constructed from a sample. So I don't see why the R2 should, exceptionally, be decreed a "non theoretical quantity" with almost no mention of the population analogue in the article (there is only the link to FVU).
  • I don't think the last sentence of the introduction makes a lot of sense, and in any case I don't think the situations where the R2 is outside (0,1) are "important." It is a mildly inconvenient fact of life that some natural statistical estimators don't obey the constraints followed by the thing they are trying to estimate (it comes up in variance components models, robust covariance estimation, ...). I don't question the need to discuss this in the article as it is surely confusing when it comes up unexpectedly. However I don't see the need to discuss this in the summary (it's part of what makes the summary confusing to read, as pointed out in the last comment).
  • The population R2 that I wish to discuss does not depend on any "assumed-true" conditions (except the existence of conditional means and variances), and I am not attempting to change the way that the sample R2 is defined.

Skbkekas (talk) 04:33, 10 March 2009 (UTC)

You did say you were proposing to make the major change of "this article should begin with a definition of the population R2" ... this would radically alter the article. And specifically you said you wanted a definition "More generally, the population R2 is  ." Here is where the assumed-true bit comes in: the assumption that what a user is trying to do has anything at all to do with  , where an "assumed-true" model is needed to define  , and where it is assumed that   makes a sensible predictor. While there may be certain equivalences around when working with theoretical populations and particular notions of how the predicted values being considered are determined, all these assumptions are not generally applicable. The predictors being studied, and for which a coefficient of determination is to be calculated, are not necessarily specified to be "optimal" in any sense. You are starting from a position too far embedded in regression analysis if you think that Y and X are relevant here, as in many applications there are no obvious X around. The only thing available for a generally applicable version of a population quantity is the joint distribution of observed and predicted values, where it is assumed that there is some consistent rule for determining the predicted values across a population, and where the rule itself is not fitted to sample data. That is, one can postulate a joint distribution for observed and predicted values. Obviously this can be done by using the mean square error of prediction.
You also say, "I don't think the situations where the R2 is outside (0,1) are important", etc. This just illustrates your narrow viewpoint. And "natural statistical estimators don't obey the constraints followed by the thing they are trying to estimate" is irrelevant here because, in this context, negative estimates happen because the corresponding population values, when defined in a sensible way, are negative.
There may have been a time when the definition of "coefficient of determination" given by statisticians in the 1930s in a regression context was all that was important. But the usage of the term has expanded vastly since then and escaped into many fields of application. A "coefficient of determination" is output by many modelling packages, and it is often used in contexts where the predicted or modelled values are coming from models that are not directly fitted to the data being used for comparison. Remember that Wikipedia is not a statistics textbook.
Melcombe (talk) 10:41, 10 March 2009 (UTC)

I agree with most of what Skbkekas proposed and disagree with most of Melcombe's counter points. I think the proposed changes would improve the article a lot and would not make it narrower, if done with care. How are others seeing this? Is Skbkekas still around and willing to implement the proposed changes? I would be willing to help. If there is interest, I can argue in more detail. Jkarch (talk) 17:46, 6 April 2021 (UTC)

What definitions are used in standard statistical textbooks today?
A major problem is the lack of agreement on how R2 / Coefficient of Determination should be defined in non-normal situations. See, e.g., Logistic regression#Pseudo-R-squared.
The current text in the "Definitions" sections includes:
"The most general definition of the coefficient of determination is
 
It would be great if someone could review the definitions used in standard statistics textbooks at a major university.
I like the claim that, "the population R2 is  ".
HOWEVER, this is NOT a standard definition I've seen before.
@Melcombe: Can you please provide a list of references showing how the term "coefficient of determination" is used in "many fields of application" and how it "is output by many modelling packages and it is often used in contexts where the predicted or modelled values are coming from models that are not directly fitted to the data being used for comparison"? We need a definition that supports most of these uses. If there are uses of the term that differs from what is described here, then the current article should probably include a section on "Alternative definitions".
Then we can use that list of examples to discuss how the article might best be revised, if it should be revised.
Thanks, DavidMCEddy (talk) 20:34, 6 April 2021 (UTC)


I don't know of a textbook that uses the proposed definition, but I have seen it multiple times in papers, for example, here: https://www.tandfonline.com/doi/abs/10.1080/01621459.2012.710509 on page 1240.
There it is also shown that
 
This also seems to be the most direct translation of the text "the proportion of the variance in the dependent variable that is predictable from the independent variable" in the beginning of the text.
  is I guess what Melcombe did not like, as it requires a model for  . However, when one replaces   with   where   is just a prediction function obtained from any technique, this should include most usages of   (at least the ones that I know). Under this definition, the normal   sample estimate can also be recognized as the naive estimator of this population value.
Note that in usage of   in regression modeling, adjusted   is interpreted as an estimate of   using   (in the population) as   whereas if we estimate   using some   that we get from a data set, this is known as predicted  . Jkarch (talk) 08:52, 7 April 2021 (UTC)

f^bar is mean of values or function?

Is

 

or is

 

?

Often, f is a continuous model and a mean can be defined according to the second equation. This seems to make more sense. However, the text seems to imply that the first formula needs to be considered, even if f(x) is a continuous function.

I hope the question is clear. Tomeasy T C 08:55, 29 June 2009 (UTC)

I have read the relevant parts of the article several times and cannot understand the point being made in the above comment. However, it may or may not be relevant to point out that a sample is always finite and discrete, no matter what the nature of the population. JamesBWatson (talk) 20:09, 29 September 2009 (UTC)
Note that f does not refer to the sample but to the model (or regression) fitted to the sample. Tomeasy T C 18:28, 1 October 2009 (UTC)
I have now thought about this again. I guess that the question is probably intended to mean "is   the mean from the sample or from the population?" If so then the answer is that   is the mean of the modelled values using the model parameters calculated from the sample. JamesBWatson (talk) 09:57, 1 October 2009 (UTC)
Again, f is neither the sample nor the population, f is the model. So, this was still not what I intended to ask. As the model, f can be a continuous function whose mean can be evaluated according to the second equation above. Of course, one could evaluate f at the discrete points where data exists and compute the mean according to the first equation. Since the results are not necessarily equal, I am asking what is meant. Tomeasy T C 18:28, 1 October 2009 (UTC)
I think you should always be using the definition given immediately after "The most general definition of the coefficient of determination is", which does not involve  . It would be possible to replace the sums by integrals there, but the key is that the comparison of y with f can only take place for points for which the observed values are available. So you could use integrals only if y is observed continuously, assuming that if f is computed discretely you would include a rule for interpolating, so producing a continuous version of f. Melcombe (talk) 09:10, 2 October 2009 (UTC)
I have continued to wrack my brains in a further attempt to understand this question, and I think I have got it: at any rate I hope so. I suppose the intention is that "model" is intended to mean the function estimated from the sample. In this case it is true that f is commonly a function on an interval, rather than on a discrete set of points. Nevertheless, the sum of squares   is calculated from values at a discrete set of values of  , and it is this discrete sum which is partitioned into "regression" and "residual" components. Therefore all the sums involved take place on the same discrete set of values of  . It is also worth mentioning that f is not necessarily a function on an interval: it may be only a discrete set of values. JamesBWatson (talk) 19:15, 4 October 2009 (UTC)
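The difference between the two candidate means debated in this thread can be seen in a toy case. The sketch assumes a model f(x) = x² on the interval [0, 1] and four unevenly spaced observation points. Both the function and the points are invented for the illustration.

```python
import numpy as np

def f(x):
    """Hypothetical fitted model, defined on the whole interval [0, 1]."""
    return x ** 2

# Mean of the modelled values at the points where data were observed.
x_obs = np.array([0.0, 0.1, 0.2, 1.0])
mean_at_observations = f(x_obs).mean()        # 0.2625

# Mean of f over the interval, i.e. the integral of x^2 on [0, 1] (= 1/3),
# approximated by averaging f over a dense, evenly spaced grid.
grid = np.linspace(0.0, 1.0, 100_001)
mean_over_interval = f(grid).mean()

print(mean_at_observations, mean_over_interval)
```

The two means differ (0.2625 against roughly 0.3333). Only the first is built from the same discrete set of points as the sums of squares.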

Someone changed the definitions (substituting y_bar for f_bar), so that the whole thing is not needed anymore. Was that change correct? Tomeasy T C 17:45, 13 April 2010 (UTC)

Usually (as explained further down the article) the two are the same, so it makes no difference. However, if they are not the same then it seems to me that it should be   so that the "total" sum of squares really is the total of the other two. JamesBWatson (talk) 11:01, 14 April 2010 (UTC)
That's actually another thing I would contest. Why should the following hold: SSE_tot = SSE_reg + SSE_err? Obviously, it would be true if the terms were not squared, but they are. Even using the knowledge about y_bar being the mean of all y_i does not make the addition of two sums equal to SSE_tot. It may become true if you apply special constraints on the model values f_i. However, the notations we discuss are given within a general context. Tomeasy T C 18:45, 14 April 2010 (UTC)
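The "special constraints" mentioned in the last comment can be made explicit. Expanding the squares gives SS_tot = SS_reg + SS_err + 2·Σ(f_i - y_bar)(y_i - f_i). The cross term vanishes for an ordinary least-squares fit with an intercept, since the residuals then sum to zero and are orthogonal to the fitted values. The sketch below checks this on made-up data. The helper name and the "arbitrary" line are illustrative choices.

```python
import numpy as np

def sums_of_squares(y, f):
    """Return (SS_tot, SS_reg, SS_err) for observations y and model values f."""
    y_bar = y.mean()
    ss_tot = np.sum((y - y_bar) ** 2)
    ss_reg = np.sum((f - y_bar) ** 2)
    ss_err = np.sum((y - f) ** 2)
    return ss_tot, ss_reg, ss_err

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 2.5, 2.0, 4.5, 5.0])

# Ordinary least-squares line with intercept (np.polyfit returns slope first).
slope, intercept = np.polyfit(x, y, deg=1)
f_ols = slope * x + intercept

# An arbitrary line that was not fitted to the data.
f_other = 1.2 * x + 0.3

tot_ols, reg_ols, err_ols = sums_of_squares(y, f_ols)
tot_other, reg_other, err_other = sums_of_squares(y, f_other)

gap_ols = tot_ols - (reg_ols + err_ols)          # zero up to rounding
gap_other = tot_other - (reg_other + err_other)  # equals the cross term, -5.7

print(gap_ols, gap_other)
```

For the fitted line the decomposition is 11.5 = 10 + 1.5. For the arbitrary line the sums do not add up, so the identity does not hold in a fully general context.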

distribution of R2 under the null?

I think the distribution of R2 under the null hypothesis of independent, normal dependent and independent variables should be included. It's some kind of transformed F statistic, I believe. Shabbychef (talk) 17:00, 11 September 2009 (UTC)Reply

Correlation coefficient

Correlation coefficient currently redirects to Correlation; should it redirect here instead? (note: I've cross-listed this post at Talk:Correlation as well.) rʨanaɢ talk/contribs 03:31, 22 September 2009 (UTC)Reply

It would probably be better redirected to Pearson product-moment correlation coefficient than here. The redirect to Correlation is probably OK also. However it might be possible to set up a sensible/helpful disambiguation page. Melcombe (talk) 08:45, 22 September 2009 (UTC)Reply
A disambiguation page sounds like the best idea, since there are a few possible candidates for redirection. rʨanaɢ talk/contribs 08:51, 22 September 2009 (UTC)Reply
No, it shouldn't. Correlation is the main article on correlation, and defines the correlation coefficient. The article Coefficient of determination mentions the correlation coefficient, but does not define it; in fact it rather presupposes a knowledge of the correlation coefficient. What is more this is as it should be, both because correlation coefficient is a much more widely known concept than coefficient of determination, and because it makes more sense to redirect upwards to a more general topic than to redirect sideways to a different concept at the same level.
Redirecting to Pearson product-moment correlation coefficient makes more sense than to Coefficient of determination, but it seems to me that the present redirect is better than that too. It is clearly more useful for the general reader wanting to know the basics, and for anyone who wants Pearson product-moment correlation coefficient there is a link to it at the top of the section of Correlation which defines the correlation coefficient, so it is not difficult to find.
I don't really see a disambig page as a good idea, as the expression "correlation coefficient" without qualification always means Pearson's coefficient, and the other "correlation coefficients" are also mentioned in the Correlation article, so that all meanings are covered there: in fact you could say this article subsumes the function of a disambig page. On the other hand the section of Correlation dealing with rank correlation coefficients is woefully inadequate, and I shall try to find time soon to rewrite it and extend it, if nobody beats me to it. JamesBWatson (talk) 13:18, 27 September 2009 (UTC)Reply

Deletion of addition

I removed the addition of "A practical minimum for R2 is 0.80 when the correlation is used for prediction." because

  • it is not appropriate in the lead;
  • it is not generally true and any criterion of usefulness will obviously depend on the actual context;
  • there was no citation.

Melcombe (talk) 10:53, 28 June 2010 (UTC)Reply

Good call. Talgalili (talk) 07:27, 29 June 2010 (UTC)Reply

Terminology: errors vs. residuals

I believe the literature standard is to use "errors" to refer to the errors of the true model (with true but unknown coefficients) and "residuals" to refer to the regression residuals. As far as I can see this article follows that standard in the use of words but not the use of subscripts: specifically, SSerr is always used for the sum of squared residuals, leading for example to this odd definitional passage in the "Definitions" section: "SSerr=... the sum of squared residuals". And in the adjusted R2 section, VARerr is used for the variance of the residuals. This notation is likely to confuse the reader as to the distinction between errors and residuals and as to when one or the other is being referred to.

Would anyone object if I go through and change all the "err" subscripts to "res" or "resid"? Also, how about if I change SSreg to SSregr everywhere, since "reg" brings to my mind "regular" instead of "regression"?

Also, in the Regression analysis article, I want to change "SSE", defined there as the sum of squared residuals, to "SSR". Comments? Duoduoduo (talk) 18:27, 19 November 2010 (UTC)Reply

Adjustment for symbolic regression

This article talks about adjusting for the number of degrees of freedom in a fit with a certain number of variables like regression coefficients. How does one adjust for trying multiple models, each with potentially different numbers of variables, such as in symbolic regression? Ginger Conspiracy (talk) 04:21, 24 December 2010 (UTC)Reply

One would start with the most obvious thing, which I think in this context would be to set up a simulation experiment that replicates all the different steps in a "typical analysis" of the kind concerned, including any use of multiple models with different numbers of parameters. There is no real point in looking for an "adjusted number of degrees of freedom" since that is unimportant and mostly a fictional concept. Rather one would concentrate on a measure of fit, such as R2 (or it may be easier to use just a sum of squares of errors), and look at the distribution of this test statistic (after all the model selection steps), as derived from multiple simulations. Melcombe (talk) 09:32, 24 December 2010 (UTC)Reply
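
A simulation of this kind could be sketched as follows. This is a deliberately simplified stand-in for symbolic regression, not a method from the article: the "model selection step" is just picking the best of several single-predictor line fits, the data are pure noise, and all names and settings (`simulate`, `n_candidates`, the seed) are assumptions made for illustration.

```python
import random


def r_squared_simple(x, y):
    """R^2 of the least-squares line of y on x (the squared correlation)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


def simulate(n_obs=20, n_candidates=10, n_sims=500, seed=1):
    """Under the null (y unrelated to every predictor), record R^2 for
    (a) one pre-specified predictor and (b) the best of n_candidates."""
    rng = random.Random(seed)
    fixed, selected = [], []
    for _ in range(n_sims):
        y = [rng.gauss(0, 1) for _ in range(n_obs)]
        r2s = []
        for _ in range(n_candidates):
            x = [rng.gauss(0, 1) for _ in range(n_obs)]
            r2s.append(r_squared_simple(x, y))
        fixed.append(r2s[0])       # no selection step
        selected.append(max(r2s))  # selection step: keep the best fit
    return fixed, selected


fixed, selected = simulate()
mean_fixed = sum(fixed) / len(fixed)
mean_selected = sum(selected) / len(selected)
# Empirical 95th percentile of the post-selection R^2 under the null.
q95_selected = sorted(selected)[int(0.95 * len(selected)) - 1]

print(f"mean R^2, single pre-specified model: {mean_fixed:.3f}")
print(f"mean R^2, best of 10 candidates:      {mean_selected:.3f}")
print(f"95th percentile after selection:      {q95_selected:.3f}")
```

An observed R2 from a real analysis would then be compared with the simulated post-selection distribution rather than with the distribution for a single pre-specified model.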

Request for clarification

An IP left this request for clarification on the article page; I'm moving it to here. Duoduoduo (talk) 17:35, 27 May 2011 (UTC)Reply

The article says:

$R^2 = 1 - \left( \frac{L(0)}{L(\hat{\theta})} \right)^{2/n}$
where L(0) is the likelihood of the model with only the intercept, $L(\hat{\theta})$ is the likelihood of the estimated model and n is the sample size.

In the sentence above, the term "estimated model" has not been defined. How does this term relate to the data set or "observed values" spoken of in the introduction of R² at the top of this article?
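
One way to read "estimated model" is the model whose parameters have been fitted by maximum likelihood to the observed values, so that both likelihoods in $R^2 = 1 - (L(0)/L(\hat{\theta}))^{2/n}$ are evaluated on the same data set. The sketch below, with made-up data and a hypothetical helper `gaussian_max_loglik`, assumes a linear model with normal errors and checks that the likelihood-based value then coincides with the usual 1 - SS_res/SS_tot.

```python
import math


def gaussian_max_loglik(rss, n):
    """Maximised Gaussian log-likelihood when the error variance is
    estimated by maximum likelihood as rss / n."""
    sigma2 = rss / n
    return -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)


# Hypothetical observed values, chosen only for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.2, 3.8, 5.3, 5.9]
n = len(y)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Estimated model: least-squares line with intercept, fitted to (x, y).
slope = (sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
         / sum((a - x_bar) ** 2 for a in x))
intercept = y_bar - slope * x_bar
rss_model = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))

# Intercept-only model: every fitted value is the sample mean.
rss_null = sum((b - y_bar) ** 2 for b in y)

loglik_model = gaussian_max_loglik(rss_model, n)
loglik_null = gaussian_max_loglik(rss_null, n)

# R^2 = 1 - (L(0) / L(theta_hat))^(2/n), computed on the log scale.
r2_likelihood = 1 - math.exp((2 / n) * (loglik_null - loglik_model))
r2_classic = 1 - rss_model / rss_null

print(f"likelihood-based R^2: {r2_likelihood:.6f}")
print(f"1 - SS_res/SS_tot:    {r2_classic:.6f}")
```

The agreement is specific to the Gaussian case assumed here; for other likelihoods the two quantities generally differ.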

Some high- and low-correlation charts would help readers grasp the concept at a glance

I'm in the 99th percentile in math, but rusty, and in the limited time available I came away with little grasp of what the coefficient of determination means or how it's used.

Those of us who are visual more than mathematical can get the story at a glance if shown two charts, one with high correlation coefficient, one low.

Even the math-able will be introduced to the concept more quickly and easily if the charts are provided in examples at the beginning as an introductory over "view." — Preceding unsigned comment added by 66.167.61.181 (talk) 14:40, 4 September 2011 (UTC)Reply

It would also help to have several examples of how R2 is used.

Here's hoping for an article more useful to all. — Preceding unsigned comment added by 66.167.61.181 (talk) 14:42, 4 September 2011 (UTC)Reply

Norm of Residuals?

Maybe it is already present elsewhere on Wikipedia, but I couldn't find the phrase "norm of residuals" on Wikipedia in a Google search. I think it would be helpful to have a clear equation for norm of residuals (as is commonly used in Matlab). Maybe this would just be for me mostly, but just seeing the equation   (please correct if this is wrong) or something similar would help I think. I like the thought of putting that on this page because of the visuals and equations already present. The page for explained sum of squares could also be a candidate (though it has no figures). Is there a good place to put the norm of residuals equation? Jacobkhed (talk) 23:57, 8 July 2013 (UTC)Reply

Added brief section on norm of residuals. Jacobkhed (talk) 21:47, 13 August 2013 (UTC)Reply

Why is this section even in this article? This doesn't seem to apply to anything outside of someone interested in the fact that MATLAB uses the square root of the residuals somewhere (???). This adds no benefit to the article and it is entirely unclear why it merits inclusion. I propose that it be removed. — Preceding unsigned comment added by 192.55.54.42 (talk) 17:27, 17 June 2014 (UTC)Reply

I think this section is useful, because I searched for the "norm of residuals" myself, which is used in MATLAB, and so found its connection to R-squared. Because the MATLAB documentation states that "resnorm" (as the variable is called there) means " sum((fun(x,xdata)-ydata).^2) ", and I also checked it with the data given in this example, I would suggest correcting the formula   to   and R2 = 0.997 to R2 = 0.998. Would that be correct? Maybe someone could check this. --WikiRob89 (talk) 14:57, 21 July 2014 (UTC)Reply

@192.55.54.42 - Maybe this article is not the best place to put this section, but I think it is helpful and belongs somewhere on Wikipedia. Originally this section was to compare R^2 with various different goodness of fit indicators but I stopped at just one indicator. Where do you suggest this goes? Or would you rather propose this section be expanded to include other indicators? @WikiRob89 - Agreed that R2 = 0.998 (rounding error previously). The resnorm variable in MATLAB is $\sum_i (f(x_i) - y_i)^2$, but if I understand the documentation correctly[1], resnorm is for the squared norm of residuals. The norm of residuals included in fitting data in a figure uses $\sqrt{\sum_i (f(x_i) - y_i)^2}$. Jacobkhed (talk) 02:42, 12 February 2015 (UTC)Reply

User:EyeTruth Hi EyeTruth. I noticed you recently added a dispute tag to a section on this Coefficient of determination page. I would like to try and correct any incorrect and/or outdated information. I linked to another wiki article as well as to an external website. Please let me know if this clears things up and any additional corrections that are necessary. Thanks. Jacobkhed (talk) 21:50, 9 February 2016 (UTC)Reply

Sorry for the late response, User:Jacobkhed. MATLAB's resnorm (norm of residuals) is the same as RSS. Therefore, that statement is incorrect, and also the cited source doesn't mention MATLAB. EyeTruth (talk) 06:08, 25 February 2016 (UTC)Reply
Indeed, MATLAB's documentation is inconsistent: first it says "Norm of the residual" then "squared 2-norm of the residual"; only the latter is correct. I've notified support@mathworks.com fgnievinski (talk) 20:53, 25 February 2016 (UTC)Reply
"Norm of residuals" is false. "Residual standard error" would be correct.--JonskiC (talk) 12:06, 20 January 2018 (UTC)Reply
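
The quantities being distinguished in this thread can be laid side by side. The sketch below uses made-up observed and fitted values and calls no MATLAB function. It only follows the reading given above that the squared 2-norm of the residuals is the residual sum of squares (RSS) and that the norm of residuals is its square root.

```python
import math

# Hypothetical observed values and fitted values of some model.
observed = [2.0, 4.1, 5.9, 8.2, 9.8]
fitted = [2.0, 4.0, 6.0, 8.0, 10.0]

residuals = [yi - fi for yi, fi in zip(observed, fitted)]
rss = sum(r ** 2 for r in residuals)  # squared 2-norm of the residuals
norm_of_residuals = math.sqrt(rss)    # 2-norm of the residuals

y_bar = sum(observed) / len(observed)
ss_tot = sum((yi - y_bar) ** 2 for yi in observed)
r_squared = 1 - rss / ss_tot          # uses RSS, not its square root

print(f"RSS (squared norm): {rss:.4f}")
print(f"norm of residuals:  {norm_of_residuals:.4f}")
print(f"R^2:                {r_squared:.4f}")
```

Unlike R2, both RSS and its square root carry the units of the data, so they are not comparable across data sets measured on different scales.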

Slope-Dependency

Given that the R2 value is a comparison with the null hypothesis that the sample set is best represented by the constant function at the value equal to the arithmetic mean of the sample set, it's really an important caveat that the power of the R2 value to explain variability is directly slope-dependent. In other words, if a sample set function is constant or nearly constant, as opposed to strongly increasing or decreasing, the R2 value for the data set will be low, because the test cannot distinguish the trend (stable) from the null hypothesis (a constant of the arithmetic mean). I'm not a statistician, so certainly someone else might be better off writing up something to describe this caveat to a lay audience, but what I'm describing is important information that should be presented plainly and not in mathematician-ese. — Preceding unsigned comment added by 12.130.161.8 (talk) 01:24, 8 February 2014 (UTC)Reply
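
The caveat can be illustrated with a small numerical sketch. The data are made up, and the helper name `r_squared_of_line_fit` is hypothetical. The same scatter is added to a nearly flat trend and to a steep trend, and a straight line is fitted to each.

```python
def r_squared_of_line_fit(x, y):
    """R^2 of the least-squares line of y on x (with intercept)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


x = list(range(10))
# One fixed set of deviations, reused for both trends (illustrative values).
noise = [0.5, -0.3, 0.8, -0.6, 0.1, -0.9, 0.4, 0.7, -0.2, -0.5]

y_flat = [0.05 * xi + e for xi, e in zip(x, noise)]   # nearly constant trend
y_steep = [5.0 * xi + e for xi, e in zip(x, noise)]   # strongly increasing trend

r2_flat = r_squared_of_line_fit(x, y_flat)
r2_steep = r_squared_of_line_fit(x, y_steep)

print(f"R^2, nearly flat trend: {r2_flat:.4f}")
print(f"R^2, steep trend:       {r2_steep:.4f}")
```

The scatter is identical in both cases, yet R2 is small for the flat trend because the fitted line barely improves on the constant mean, which is the baseline R2 compares against.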

p -- number of variables or number of parameters

There are two common uses of the notation p in multiple regression: (i) $p$ as the number of predictors (as here), and (ii) $p$ as the number of parameters in the linear predictor. The attribute LINEAR in linear regression concerns not the X's but the betas. Thus the dimension of the model is the dimension of the beta space (the second dimension of the design matrix). Many authoritative books in the field use $k$ to denote the number of predictors $(X_1, ..., X_k)$, and $p=k+1$ as the number of parameters $(\beta_0, \beta_1,..., \beta_k)$.

Of course this is only a convention, but confusions are possible (if you check for the definition of Mallows' $C_p$ there is a mess on the web with $p$ standing for both).

Maybe a sentence pointing out this dichotomy would be prudent.

Greetings from Marietta GA — Preceding unsigned comment added by 71.204.20.163 (talk) 16:28, 15 February 2014 (UTC)Reply

Interpretation

I'm wondering about this one:

"Seventy percent of the variation in the response variable can be explained by the explanatory variables. The remaining thirty percent can be attributed to unknown, lurking variables or inherent variability."

Shouldn't it be

"Seventy percent of the variance in the response variable can be explained by the explanatory variables. The remaining thirty percent can be attributed to unknown, lurking variables or inherent variability."?

I guess variance is a measure of variation but when I read "variation" I would not expect variance...

Bgst (talk) 10:43, 11 September 2014 (UTC)Reply

I've asked about this at talkstats.com. They seem to agree: http://www.talkstats.com/showthread.php/57558-Coefficient-of-determination-interpretation

Anyone mind if I change "variation" to "variance"? Bgst (talk) 21:54, 17 September 2014 (UTC)Reply

Go for it. Dger (talk) 00:29, 18 September 2014 (UTC)Reply
Done! Bgst (talk) 07:57, 19 September 2014 (UTC)Reply

Worked example when yi = ybar for all i and i>1

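Consider the observed data set {10,10,10}. For this data set we see that:

$n = 3$
$y_1 = 10$
$y_2 = 10$
$y_3 = 10$
$\bar{y} = 10$
$y_i = \bar{y}$ for all i

And the total sum of squares is zero, i.e.

$SS_\text{tot} = \sum_i (y_i - \bar{y})^2 = 0$

And since the formula is

$R^2 = 1 - \frac{SS_\text{res}}{SS_\text{tot}}$

That means we have a division by zero thus:

$R^2 = 1 - \frac{SS_\text{res}}{0}$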

...and the formula is not defined. Regards,Anameofmyveryown (talk) 11:18, 22 February 2015 (UTC)Reply

Worked example when i is equal to 1

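Consider the observed data set {10}. For this data set we see that:

$n = 1$
$y_1 = 10$
$\bar{y} = 10$
$y_i = \bar{y}$ for all i

And the total sum of squares is zero, i.e.

$SS_\text{tot} = \sum_i (y_i - \bar{y})^2 = 0$

And since the formula is

$R^2 = 1 - \frac{SS_\text{res}}{SS_\text{tot}}$

That means we have a division by zero thus:

$R^2 = 1 - \frac{SS_\text{res}}{0}$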

...and the formula is not defined. Regards,Anameofmyveryown (talk) 11:18, 22 February 2015 (UTC)Reply
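
Both worked examples come down to a total sum of squares of zero. A minimal sketch of how an implementation might guard against that case is given below. Returning NaN is an assumption made here for illustration, not something the article prescribes, and the function name `r_squared` is hypothetical.

```python
import math


def r_squared(y, f):
    """Return 1 - SS_res/SS_tot, or NaN when SS_tot is zero."""
    y_bar = sum(y) / len(y)
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, f))
    if ss_tot == 0:
        # All observations equal the mean: the ratio is undefined.
        return math.nan
    return 1 - ss_res / ss_tot


constant_case = r_squared([10, 10, 10], [10, 10, 10])  # data set {10,10,10}
single_case = r_squared([10], [10])                    # data set {10}
ordinary_case = r_squared([1, 2, 3], [1.1, 1.9, 3.0])  # SS_tot > 0

print(constant_case, single_case, ordinary_case)
```

Software packages differ in how they report this case, so the NaN convention above should be read as one option among several.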

Add discussion of marginal and conditional R2 for multivariate models?

This is a bit outside my comfort area, but I have lately been evaluating linear mixed models using a formulation for deriving a marginal R2 (that is, based on fixed effect(s) only) or a conditional R2 (that is, conditional on all fixed and random effects in the model). This is based on two papers by Nakagawa et al.[2][3] that frankly have gotten the shit cited out of them since their publication. In practical application, this was apparently very well received because it allows estimating model fit while still honoring the random structure of the data. However, it doesn't look as if this approach is covered here yet. A sentence or two should fit well under Coefficient_of_determination#In_a_multivariate_linear_model. There's an R implementation in package MuMIn (::r.squaredGLMM()) that might be worth mentioning - for those of us who scan these articles in the hopes of tackling a practical problem.

I could whip up a bit of text but would have to keep it very general because maths be hard. So if someone else is interested in doing that, it might be preferable? --Elmidae (talk · contribs) 22:21, 7 October 2020 (UTC)Reply

References

  1. ^ http://www.mathworks.com/help/optim/ug/lsqcurvefit.html
  2. ^ Nakagawa, S.; Schielzeth, H. (2013). "A general and simple method for obtaining R2 from generalized linear mixed-effects models". Methods in Ecology and Evolution. 4 (2): 133–142. doi:10.1111/j.2041-210x.2012.00261.x.
  3. ^ Nakagawa, S.; Johnson, P. C.; Schielzeth, H. (2017). "The coefficient of determination R2 and intra-class correlation coefficient from generalized linear mixed-effects models revisited and expanded". Journal of the Royal Society Interface. 14 (134): 20170213. doi:10.1098/rsif.2017.0213.

Olkin-Pratt

@Keith D: Thanks for adding a mention of the Olkin-Pratt estimator.

What do you think about adding a bit more about this? For example, expression (2.3) in the Olkin-Pratt paper you cited gives a simple expression in terms of the well-known Gaussian hypergeometric function. And the MBESS::Expected.R2 contributed function for R (programming language) has it programmed, so anyone who wants it can easily add it to any relevant analysis in R. DavidMCEddy (talk) 16:49, 23 April 2021 (UTC)Reply

Hello, it was not me that did the addition, all I was doing was correcting a cite date error. I have no knowledge of the topic. Sorry. Keith D (talk) 16:54, 23 April 2021 (UTC)Reply

New passage removed

A further interpretation of R^2 was added by an anonymous editor who cited a recently self-published note on the internet [I'm not saying that the editor and the author of the note are the same person. Just pointing out the note was not formally published in the academic literature]. The text referred to "quotients of surfaces", which makes no sense. Richard Gill (talk) 13:07, 11 February 2023 (UTC)Reply

PS this is the text I removed: R-Square can also be interpreted as the quotient of the surfaces between measurement data and the worst case regression (one constant parameter, the mean) and the surface between measurement data and the actual regression models' estimates [1]

Richard Gill (talk) 13:11, 11 February 2023 (UTC)Reply

Hello Richard. Interpreting the sums as numerical integrals is a matter of interpretive freedom. This interpretation leads straightforwardly to the drawbacks of R-Square. It is a pity that you did not take the time to get into the methodology before removing it.

Kind Regards Wolfgang — Preceding unsigned comment added by 188.107.7.68 (talk) 13:53, 11 February 2023 (UTC)Reply

Dear Wolfgang, I did take the time to read your article. If you, the Wikipedia editor, are the same person as the one who wrote that article, then your editing is an attempt to promote "original research". Unfortunately, this is "not done". Get your ideas published in standard textbooks first and then other editors will incorporate them into Wikipedia articles! Richard Gill (talk) 14:42, 11 February 2023 (UTC)Reply

PS of course a sum can be interpreted as an integral. But what is the quotient of two surfaces? Richard Gill (talk) 14:44, 11 February 2023 (UTC)Reply

References

  1. ^ Wolfgang Rückert, "Critic on R-Square"

Adjusted R^2

@JuliPsy: Between the formula "by Ezekiel" and the one in question, the current text explains:

"where dfres is the degrees of freedom of the estimate of the population variance around the model, and dftot is the degrees of freedom of the estimate of the population variance around the mean. dfres is given in terms of the sample size n and the number of variables p in the model, dfres = n − p. dftot is given in the same way, but with p being unity for the mean, i.e. dftot = n − 1."

Inserting the degrees of freedom and using the definition of R2, it can be rewritten as:

$\bar{R}^2 = 1 - (1 - R^2) \frac{n - 1}{n - p}$

Thus, if we replace "n-p" in this formula by "n-p-1", we must also revise the intervening text.

I could NOT check either of these references for the formula "by Ezekiel", because they were behind paywalls.

Regarding the more general question of the "correct" definition, a discussion on StackExchange identified four different formulae for adjusted R^2 by Wherry, McNemar, Lord, and Stein, NONE by Ezekiel. The one here is the first one, "by Wherry". "Lord" recommends dividing by "n-p-1". That post on Stack Exchange also notes that R (programming language) does not use any of these in general. It uses

1 - (1 - ans$r.squared) * ((n - df.int)/rdf)

where df.int = 0 for a non-constant model, and rdf = degrees of freedom for residuals. In the most common case here, this gives the formula given here except for the rare case of a non-constant model.

Comments? DavidMCEddy (talk) 12:54, 30 June 2023 (UTC)Reply

@DavidMCEddy Thank you for taking up my suggestion. I made the change because I came up with different results in R. The formula implemented in R has $n - p - 1$ in the denominator. For models with intercept (df.int = 1), the formula I adapted then results:
$\bar{R}^2 = 1 - (1 - R^2) \frac{n - 1}{n - p - 1}$
This formula can also be found on page 211 (Ezekiel, 1930).
Ezekiel, M. (1930). Methods of Correlation Analysis. New York: John Wiley and Sons. URL: https://www.dbraulibrary.org.in/RareBooks/Methods%20of%20Correlations%20Analysis.pdf
I would advocate adjusting the degrees of freedom of the residuals as well, $df_\text{res} = n - p - 1$, since, as you mentioned, models with intercept are the normal case (cf. StackExchange).
Here is R code that illustrates my point
m <- lm(dist ~ speed + I(speed^2), cars)

# R calculations
summary(m)$r.squared      # R^2
summary(m)$adj.r.squared  # R^2_adj
summary(m)$df[2]          # DFres

# "by hand"
y <- cars$dist      # y
y_bar <- mean(y)    # mean y-values   
y_hat <- predict(m) # predicted values
SS_res <- sum((y - y_hat)^2)     # residual SS
SS_reg <- sum((y_hat - y_bar)^2) # explained SS
SS_tot <- sum((y - y_bar)^2)     # total SS
n <- length(y) # number of observations
p <- 2         # number of explanatory variables

# R^2:
(R2 <- SS_reg/SS_tot)
all.equal(R2, summary(m)$r.squared)

# R^2_adj
(df_tot <- n - 1)
(df_res_WIKI <- n - p)     # only true for model w/o intercept
(df_res_R    <- n - p - 1) 

# WIKI: n - p
1 - (SS_res / df_res_WIKI) / (SS_tot / df_tot)
(adj_R2_WIKI <- (1 - (1 - R2) * ((n - 1) / (n - p))))
all.equal(adj_R2_WIKI, summary(m)$adj.r.squared)

# R: n - p - 1
1 - (SS_res / df_res_R) / (SS_tot / df_tot)
(adj_R2_R <- (1 - (1 - R2) * ((n - 1) / (n - p - 1))))
all.equal(adj_R2_R, summary(m)$adj.r.squared)

JuliPsy (talk) 15:48, 30 June 2023 (UTC)Reply

@JuliPsy: Thanks. I found in the Internet Archive the book by Ezekiel that you cited. I created a Wikidata item for it, added a citation to it to this article, and made the change you recommend. Thanks, DavidMCEddy (talk) 17:30, 30 June 2023 (UTC)Reply

Removed puzzling statement

I removed the following statement because it is utter nonsense. R^2 cannot be negative. Full stop.

> Models that have worse predictions than this baseline will have a negative R2. — Preceding unsigned comment added by 130.255.146.189 (talk)
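
Whether R2 can be negative depends on which definition is meant. A squared correlation cannot be negative, but under the definition 1 - SS_res/SS_tot the removed statement can be checked numerically. The sketch below uses made-up data and two hypothetical sets of predictions: the mean baseline, and predictions that are worse than the baseline.

```python
def r_squared(y, f):
    """R^2 under the definition 1 - SS_res / SS_tot (assumes SS_tot > 0)."""
    y_bar = sum(y) / len(y)
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, f))
    return 1 - ss_res / ss_tot


y = [1.0, 2.0, 3.0, 4.0, 5.0]

# Baseline: always predict the mean of the observed values.
baseline = [sum(y) / len(y)] * len(y)
# Hypothetical predictions that run against the trend in the data.
worse_than_baseline = [5.0, 4.0, 3.0, 2.0, 1.0]

r2_baseline = r_squared(y, baseline)
r2_worse = r_squared(y, worse_than_baseline)

print(f"R^2, mean baseline:        {r2_baseline}")
print(f"R^2, worse than baseline:  {r2_worse}")
```

Here SS_tot is 10 and the second set of predictions has SS_res of 40, giving 1 - 40/10 = -3. For a least-squares fit with an intercept, evaluated on the data it was fitted to, this cannot happen, which may be the case the comment above has in mind.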