
Week 11: July 23
Workouts
This Week

Did you notice how big your rear end is getting? I think we will work on your glutes this week. No flabby cans! Do not be a sissy! Work, work, work!

Workout 1:
Strength Training

Measures of Association

Correlation

Remember the Saturday morning back in high school you spent hunched over the Scholastic Aptitude Test? (Maybe it was the American College Test.) The SAT, like other tests of mental ability, collects only a small sample of a person's knowledge, but that sample is useful to colleges and universities because scores on the SAT (or the ACT) are related to grades earned later in college. Students with higher scores usually go on to do better in college than those with lower scores. The relationship is not, however, perfect. Sometimes a person who scored low on the test gets surprisingly good grades in college; sometimes a person who scored high gets low grades.

How can we best describe the degree of relationship between two variables, such as test scores and college grades? This is an example of a question about correlation. Closely related is the problem of prediction. If there is a correlation between two variables, then from a person's score on one variable, we can do better than chance in predicting that person's score on the other variable. Many college admissions committees, for example, use SAT or ACT test scores to predict their applicants' freshman-year grade point averages (GPAs). If an applicant's predicted GPA falls below a certain standard, a committee may reject the applicant. As already indicated, the predictive value of the SAT is not perfect. In general, though, the greater the association between two variables, the more accurately we can predict the standing in one from the standing in the other.

Determining the degree of correlation and prediction is important in many areas of psychology and education. To establish the reliability of a test (i.e., the consistency in scores over repeated administrations), for example, we would want to know to what extent initial performance correlates with performance on the same test at a subsequent time. The test would be useless if it yielded scores that fluctuate widely over time. Correlation and prediction are also important in establishing test validity (i.e., the agreement between a test score and the quality it is supposed to measure). For example, to what extent are scores on a mechanical aptitude test predictive of on-the-job performance as a machinist? Knowing the degree of association here would help a vocational counselor do her job. Other examples of problems in association: Is there a relation between reading speed and reading comprehension? If so, how much? Is brain weight related to IQ? Is the number of cigarettes smoked related to the incidence of lung cancer?

Graphing Bivariate Distributions: The Scatter Diagram

The raw data summarized in Table 5.1 consists of pairs of scores. For each case under observation, we must know the score on a first variable, which we will call X, and the score on a second variable, Y. The subjects, designated by letters, are 10 male college students who filled out two questionnaires. One 10-item questionnaire measured the stress currently afflicting the student. Another 14-item questionnaire assessed difficulties in eating. We have used X for the total score on the stress questionnaire (higher scores indicate greater perceived stress) and Y for the total score on the eating difficulties questionnaire (higher scores indicate greater perceived difficulty).

Table 5.1 Bivariate Data: Scores for 10 Male College Students

Student   Stress (X)   Eating Difficulties (Y)
A             17                  9
B              8                 13
C              8                  7
D             20                 18
E             14                 11
F              7                  2
G             21                  5
H             22                 15
I             19                 26
J             30                 28

Figure 5.1 Scatter diagram of data from Table 5.1.


To display bivariate data, researchers typically use a scatter diagram or scatterplot like Figure 5.1. Each point in a scatter diagram represents a single case and shows the scores on the two variables for that case. For example, in Figure 5.1, which plots the X and Y scores from Table 5.1, the dot in the upper right corner represents student J, who scored 30 on the stress test (variable X) and 28 on the eating difficulties test (variable Y). The steps in constructing a scatter diagram from a set of paired scores are summarized as follows:

  • Step 1: Designate one variable X and the other Y. It is customary to use X for the "predictor" variable.
  • Step 2: Draw axes of about equal length for your graph.
  • Step 3: Place the high values of X to the right on the horizontal axis and the high values of Y toward the top on the vertical axis. Label convenient points along each axis.
  • Step 4: For each pair of scores, find the point of intersection for the X and Y values and indicate it with a dot.
  • Step 5: Name each axis and give the entire graph a name.
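
To see these steps in action, here is a minimal sketch that reproduces a plot like Figure 5.1 in Python (assuming the matplotlib library is installed; the variable names are our own):

import matplotlib.pyplot as plt

# Paired scores from Table 5.1: stress (X) and eating difficulties (Y)
stress = [17, 8, 8, 20, 14, 7, 21, 22, 19, 30]
eating = [9, 13, 7, 18, 11, 2, 5, 15, 26, 28]

# Step 4: place one dot at the intersection of each pair of X and Y values
plt.scatter(stress, eating)

# Step 5: name each axis and give the entire graph a name
plt.xlabel("Stress score (X)")
plt.ylabel("Eating difficulties score (Y)")
plt.title("Stress and eating difficulties for 10 college students")
plt.show()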

The scatter diagram allows us to easily see the nature of the relationship, if any exists, between the two variables. Pearson's mathematical procedures assume an underlying linear relationship, that is, a relationship that can best be represented by a straight line. Figure 5.1 shows such a relationship. As a general rule, those students who scored high on the stress questionnaire also had high scores on the eating difficulties questionnaire. The relationship is not perfect. Student G, for example, had a total score of 21 (the third highest) on the stress questionnaire but a total score of only 5 (the second lowest) on the eating difficulties questionnaire. Nevertheless, a quick inspection of the graph shows us that Pearson's assumption of linearity is appropriate here.

In some cases, the relationship between two variables may not be linear. A failure to find evidence of a linear relationship may be due to one of two possibilities: (1) the variables are, in fact, unrelated, or (2) the variables are related in a nonlinear fashion. In the latter case, a graph of the relationship on ordinary graph paper would depict a curve. Figure 5.2, for example, displays the hypothetical relationship between age and strength of grip. Because the relationship here is curvilinear (the points on a scatter diagram cluster about a curved line), Pearson's procedures are not appropriate (and, in fact, will give misleading results). There are techniques of correlation and prediction appropriate when the underlying relationship between two variables is not linear, but these will not be covered in this lesson.

Figure 5.2 Scatter diagram of two variables that are related in a curvilinear fashion (hypothetical data).

Correlation: A Matter of Direction

 

The scatter diagram not only shows whether there is a linear relationship between two variables, but it also shows at a glance the direction of the relationship. If the higher scores on X are generally paired with the higher scores on Y, and the lower scores on X are generally paired with the lower ones on Y, then the direction of the correlation between the two variables is positive. In a scatter diagram, a positive correlation appears as a cluster of data points that slopes from lower left to upper right. Thus, the correlation shown in Figure 5.1 is positive. It would not matter if we had designated the stress scores as Y instead of X (and plotted them along the vertical axis instead of the horizontal) and the eating difficulties scores as X instead of Y (and plotted them along the horizontal axis): the dots in the scatter diagram would still slope from lower left to upper right. Try it. Figure 5.3 offers more examples of positive correlations.

Figure 5.3 Scatter diagrams illustrating various degrees of positive correlation.

If the higher scores on X are generally paired with the lower scores on Y and the lower scores on X are generally paired with the higher ones on Y, then the direction of the correlation is negative. For example, one study found a negative correlation between hours of TV watched daily by children and their scores on standardized reading and mathematics tests. The results are shown in Table 5.2. Although these are summary figures (means rather than scores for individuals), you can see that as the number of hours spent watching television increased, test scores decreased.

Table 5.2 Negative Correlation Between Time Spent Watching TV and Academic Test Scores

Mean Hours of TV        Mean Test Scores
Watched Daily          Reading      Math
0.0-0.5                   75         69
0.5-1.0                   74         65
1.0-2.0                   73         65
2.0-3.0                   73         65
3.0-4.0                   72         63
4.0-5.0                   71         63
5.0-6.0                   70         62
6.0+                      66         58

In a scatter diagram, a negative correlation appears as a cluster of data points that slopes from upper left to lower right, as illustrated in Figure 5.4a and 5.4b. Again, although it is customary to place the independent variable on the X axis, it does not matter which variable is plotted along the horizontal axis and which along the vertical axis; the direction of the slope will be the same. You would see a negative correlation if you plotted horsepower versus miles per gallon for automobiles.

Figure 5.4 Scatter diagrams illustrating negative correlation (a and b) and an essentially zero correlation (c).

In cases where there is no correlation between two variables (i.e., both high and low values of X are equally paired with both high and low values of Y), there is no direction in the pattern of the dots. They are scattered about the diagram in an irregular fashion, as illustrated in Figure 5.4c.

Correlation: A Matter of Degree

Look closely again at Figure 5.3. All three scatter diagrams show a positive correlation between X and Y. You can tell because the dots tend to go from lower left to upper right. Then how do these correlations differ? They differ in degree. The scatter diagram in Figure 5.3a shows a perfect linear correlation, one where all the data points fall on a straight line. The other correlations are less than perfect, but in each case, one can still construct through the cluster of dots a straight line that summarizes the relationship between the two variables. With less than perfect correlations, the dots will show some scatter about the straight line. The more scatter, the weaker the correlation.

Figure 5.5 Scatter diagrams b and c from Figure 5.3 with line of best fit.

The degree of association shared by two variables is indicated by the coefficient of correlation, invented by Pearson in 1896. It is calculated from the pairs of scores in a way we will show later. When a perfect correlation exists, its value is plus or minus 1.0. When no relationship exists, its value is 0. Thus, intermediate degrees of correlation are represented by values between 0 and ±1.0. The sign (plus or minus) indicates only the direction of the correlation (+ for a positive correlation and - for a negative correlation), not its degree. Students sometimes mistakenly believe that a correlation coefficient with a positive value indicates a stronger degree of relationship than does a coefficient with a negative value, but this is not so. Two correlation coefficients having the same absolute value but differing in sign indicate the same degree of relationship. Only the direction of the relationship differs. Thus, a correlation coefficient of -.50 indicates just as strong a relationship between two variables as a coefficient of +.50.

To summarize the relationship between a scatter diagram and the correlation coefficient: the correlation coefficient is a number that indicates how well the data points in a scatter diagram "hug" the straight line of best fit. With perfect correlations, all the data points fall exactly on a straight line, as in Figures 5.3a and 5.4a, and the value of the coefficient is ±1.0. When the association between two variables is less than perfect, the data points show some scatter about the straight line that summarizes the relationship, and the absolute value of the correlation coefficient is less than 1.0. The weaker the relationship, the more scatter and the lower the absolute value of the correlation coefficient.
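
You can check this behavior numerically. The short Python sketch below (assuming NumPy is installed; the scores are invented for illustration) shows that reversing one variable flips the sign of the coefficient but leaves its absolute value, and hence the degree of relationship, unchanged:

import numpy as np

# Invented scores with a strong positive linear trend
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

r_pos = np.corrcoef(x, y)[0, 1]    # positive: dots slope lower left to upper right
r_neg = np.corrcoef(x, -y)[0, 1]   # reversing Y flips only the sign of r

print(round(r_pos, 3), round(r_neg, 3))  # same absolute value, opposite signs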

In the real world, perfect correlations occur only in trivial instances; for example, the correlation will be -1.00 between the number of correct answers on a test and the number of errors plus omissions. Table 5.3 lists typical values of r for some correlations.

Table 5.3 Typical Values of r

Variables                                                                    r
IQ from one form of the Wechsler Adult Intelligence Scale (WAIS)
  and IQ from an alternate form                                           +.90
Childhood IQ and adult IQ                                          +.70 to +.85
Age of man and age of woman among married American parents                +.85
Age of man and age of woman among unmarried American parents              +.70
Father's years of education and grown child's years of education          +.60
Verbal score on the Scholastic Aptitude Test (SAT) and
  mathematics score on the SAT                                            +.60
IQ of husband and IQ of wife                                              +.50
Total score on the SAT and freshman-year grade point average (GPA)        +.35
Total score on the Graduate Record Exam (GRE) and undergraduate GPA
  among applicants to a highly selective graduate program in psychology   +.20
Height of man and height of woman among American parents,
  married or unmarried                                                    +.20
Weight of man and weight of woman among American parents,
  married or unmarried                                                    +.10
Attitudes about school and cutting school among junior high and
  high school students                                                    -.29
Authoritarianism and aestheticism among high school seniors               -.42
Latency of visual evoked response and conceptual age at time of birth     -.61

In everyday use, such as above, a correlation of .80 is considered high, .50 moderate, and .30 low. In examining a correlation coefficient, it is also extremely important to look at the range of the scores. Other things being equal, the greater the range, the higher the correlation will be, so be careful to look for restriction of range. For example, a study of high school GPA and college GPA would produce a high correlation; but if the study were of high school GPA and Air Force Academy cadet GPA, the correlation would look low. Does this mean that high school GPA is no good as a predictor of good bets for the Air Force Academy? No; this is just an example of restriction of range, since everybody admitted to the Air Force Academy has high grades. The same restriction of range exists if one looks at GRE scores and completion of a master's degree. Once again the correlation would be low because all of the students have high GRE scores to begin with; and if such students do not finish their master's program, it is usually not because they can't do the course work but because of personal problems. Another example is GPA versus IQ; the correlation gets smaller as the range is restricted:

  • .60 in elementary school (largest and most heterogeneous)
  • .50 in high school
  • .40 in college
  • .30 in graduate school
  • Even less in the Air Force Academy

Here is what happens: when we gather data from such a group, we are in effect chopping off the scores at one end of the scatter plot. With fewer data points and a reduced variance, the correlation will look artificially low. It is therefore important always to report the sample size when you report a correlation.
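
The effect is easy to demonstrate with a small simulation. This Python sketch (assuming NumPy is installed; the data are randomly generated, so the exact values are only illustrative) correlates two related variables over their full range and again after the range of X has been restricted to the top quarter:

import numpy as np

rng = np.random.default_rng(0)

# Simulate 1,000 pairs whose true correlation is about .60
x = rng.normal(size=1000)
y = 0.6 * x + 0.8 * rng.normal(size=1000)

r_full = np.corrcoef(x, y)[0, 1]

# Restriction of range: keep only the cases in the top quarter of X
keep = x > np.quantile(x, 0.75)
r_restricted = np.corrcoef(x[keep], y[keep])[0, 1]

print(f"Full range:       r = {r_full:.2f}")        # near .60
print(f"Restricted range: r = {r_restricted:.2f}")  # noticeably lower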

Exercise:



Workout 2: Cardiovascular
Exercise

Correlation Does Not Establish Causation

If variation in X causes variation in Y, that causal connection will appear in some degree of correlation between X and Y (at least when we remove any obscuring complications). However, we cannot reason backwards from a correlation to a causal relationship. The fact that X and Y vary together is a necessary but not a sufficient condition for us to conclude that there is a cause-and-effect connection between them.

An example is the high positive correlation between the number of churches in a community and the amount of beer consumed there. But that does not prove, or even imply, that going to church causes people to drink beer, or that drinking beer causes people to go to church. Instead, both highly correlated variables are caused by a third variable, in this case, population. Another example: the correlation between the number of new college students and the number of entries into mental institutions is .80. That is very high, but we cannot draw cause-and-effect conclusions from it. However, even if you do not know the cause-and-effect relationship, two related variables can still be used to make predictions. In our example of the relationship between the number of churches in a community and the amount of beer consumed, if we know the number of churches in the community, we can predict how much beer should be ordered.

Just as a positive correlation cannot be said to represent causation, so a zero correlation does not necessarily prove the absence of a causal relationship. For example, some studies with college students have found no correlation between hours of study for an examination and test performance. Does this mean that the amount of study had no effect on test scores? Of course not. Some bright students study little and still achieve average scores, whereas some of their less gifted classmates study diligently but achieve only an average performance. A controlled experimental study would almost certainly show some causal relationship.

The sign of a correlation can even disagree with the direction of the underlying causal relationship: a negative correlation does not rule out a positive, direct causative relationship, nor does a positive correlation rule out a negative one. For example, suppose the weights of 1,000 persons are obtained and each is asked the question: Of your last ten soft drinks, how many were diet drinks? It is quite likely that the correlation is positive. But does choosing diet rather than high-calorie soft drinks increase weight? Probably not: choosing diet drinks presumably decreases the person's calorie intake. In this case, the correlation is positive, but the causative relationship is negative.

From the preceding discussion, it should be clear that one must be very careful not to infer causation from correlation coefficients. Likewise, one cannot prove that there is no causal relationship between X and Y from zero or negative correlation coefficients. Nonzero correlation coefficients do show that Y can be predicted better if X is known than if it is unknown, or, equivalently, knowledge of Y improves the predictability of X. Prediction does not necessarily require any information or assumptions about causation.

Computing the Pearson Coefficient of Correlation

There are other indices of association suited to special situations. If we wanted to correlate rankings (for example, the midterm rankings of class members with their rankings on the final), we would use Spearman's rho (ρ), also known as the rank-order correlation coefficient. However, Pearson's r is by far the most common correlation coefficient. In fact, when researchers speak of a correlation coefficient without being specific about which one they mean, you may safely assume they are referring to Pearson's r. Technically, it is a product-moment coefficient.
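
Both coefficients are available in standard statistical libraries. Here is a minimal sketch in Python (assuming SciPy is installed), reusing the Table 5.1 scores:

from scipy.stats import pearsonr, spearmanr

# Stress (X) and eating difficulties (Y) scores from Table 5.1
x = [17, 8, 8, 20, 14, 7, 21, 22, 19, 30]
y = [9, 13, 7, 18, 11, 2, 5, 15, 26, 28]

r, _ = pearsonr(x, y)       # Pearson's product-moment coefficient
rho, _ = spearmanr(x, y)    # Spearman's rank-order coefficient

print(f"Pearson r    = {r:.3f}")   # about +.675, as computed in Example 5-1 below
print(f"Spearman rho = {rho:.3f}")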

Like the standard deviation, the correlation coefficient can be computed via a deviation-score formula:

Formula 5.1

r = Σxy / √((Σx²)(Σy²))

where:

  • r = the correlation coefficient, read as the correlation of Y on X;
  • x = the deviation of each X value from the mean of X;
  • y = the deviation of each Y value from the mean of Y;
  • Σx² = the sum of squares for the deviations of X;
  • Σy² = the sum of squares for the deviations of Y.

This formula, however simple in appearance, can become quite cumbersome to work with, as the following example will show:


Example 5-1. Compute the Pearson product moment correlation coefficient between stress and eating difficulties from Table 5.1.

Student     X     Y      X²      Y²     XY      x      y       x²       y²        xy
A          17     9     289      81    153   +0.4   -4.4     0.16    19.36     -1.76
B           8    13      64     169    104   -8.6   -0.4    73.96     0.16     +3.44
C           8     7      64      49     56   -8.6   -6.4    73.96    40.96    +55.04
D          20    18     400     324    360   +3.4   +4.6    11.56    21.16    +15.64
E          14    11     196     121    154   -2.6   -2.4     6.76     5.76     +6.24
F           7     2      49       4     14   -9.6  -11.4    92.16   129.96   +109.44
G          21     5     441      25    105   +4.4   -8.4    19.36    70.56    -36.96
H          22    15     484     225    330   +5.4   +1.6    29.16     2.56     +8.64
I          19    26     361     676    494   +2.4  +12.6     5.76   158.76    +30.24
J          30    28     900     784    840  +13.4  +14.6   179.56   213.16   +195.64
n=10      166   134    3248    2458   2610      0      0   492.40   662.40   +385.60

Example 5-1 demonstrates the computation of r by the deviation-score formula for the data presented in Table 5.1 on stress and eating difficulties for 10 male college students. The steps in calculating r by Formula 5.1 are summarized as follows:

  • Step 1: List the pairs of scores in two columns. The order in which you list the pairs makes no difference in the value of r. However, if you shift one raw score, you must shift the raw score that is paired with it. To do otherwise would affect the value of the numerator.
  • Step 2: Find the mean for X and the mean for Y.
  • Step 3: Convert each raw score to a deviation score. Note that the deviation of the X values from the mean of X is designated by lowercase x. Similarly, the deviation of the Y values from the mean of Y is designated by lowercase y.
  • Step 4: Square the deviation scores and calculate the sums of squares, denoted by Σx² and Σy².
  • Step 5: Next, multiply each deviation from the mean of X, x, by the paired deviation from the mean of Y, y, to get the cross products, xy.
  • Step 6: To calculate the numerator for Formula 5.1, simply enter Σxy, the sum of the cross products from Step 5.
  • Step 7: To obtain the denominator for Formula 5.1, take the square root of the sum of squares for X (Σx²) multiplied by the sum of squares for Y (Σy²).
  • Step 8: Complete the mathematics to obtain r, which in this example proves to be +.675.

The positive correlation we have obtained tells us that it was at least generally true that those with higher perceived stress also had greater perceived eating difficulties. As a check on your work, always look at the scatter diagram after you have calculated r. Remember that with a positive correlation the cluster of dots should go from the lower left to the upper right. If they do not, you have made a mistake in your calculations. In this example, the value +.675 indicates that the relationship between the two variables was a fairly close one, but certainly not perfect. Low perceived stress did not necessarily guarantee a lack of eating difficulties, and high stress did not doom a student to eating difficulties (e.g., Student G).
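
As a further check on the arithmetic, Formula 5.1 can be carried out step by step in Python (a minimal sketch using only the standard library; the variable names are our own):

import math

# Paired scores from Table 5.1
X = [17, 8, 8, 20, 14, 7, 21, 22, 19, 30]
Y = [9, 13, 7, 18, 11, 2, 5, 15, 26, 28]

# Steps 2 and 3: means and deviation scores
mean_x = sum(X) / len(X)                       # 16.6
mean_y = sum(Y) / len(Y)                       # 13.4
x = [xi - mean_x for xi in X]
y = [yi - mean_y for yi in Y]

# Steps 4 and 5: sums of squares and sum of cross products
sum_x2 = sum(xi ** 2 for xi in x)              # 492.40
sum_y2 = sum(yi ** 2 for yi in y)              # 662.40
sum_xy = sum(xi * yi for xi, yi in zip(x, y))  # 385.60

# Steps 6 through 8: Formula 5.1
r = sum_xy / math.sqrt(sum_x2 * sum_y2)
print(round(r, 3))                             # 0.675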

As I am sure you will agree, Formula 5.1 is cumbersome and lends itself to mathematical errors. An easier computational formula, one that does not require computing the deviations from the means, is:

Formula 5.2

r = (ΣXY − nX̄Ȳ) / √((ΣX² − nX̄²)(ΣY² − nȲ²))

where:

  • r = the correlation coefficient, read as the correlation of X with Y;
  • ΣXY = the sum of the cross products of X and Y;
  • n = the number of pairs of scores;
  • X̄Ȳ = the cross product of the mean of X and the mean of Y;
  • ΣX² = the sum of the squared X values;
  • ΣY² = the sum of the squared Y values;
  • X̄² = the square of the mean of X;
  • Ȳ² = the square of the mean of Y.

Example 5.2 Compute the Pearson product moment correlation coefficient between stress and eating difficulties using the simplified formula.
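
Substituting the column totals from Example 5-1 (ΣXY = 2610, ΣX² = 3248, ΣY² = 2458, n = 10, X̄ = 16.6, Ȳ = 13.4) into Formula 5.2:

r = (2610 − 10(16.6)(13.4)) / √((3248 − 10(16.6)²)(2458 − 10(13.4)²))
  = 385.6 / √((492.4)(662.4))
  = 385.6 / 571.11
  = +.675

which agrees with the value obtained from the deviation-score formula.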

Coefficient of Determination

The coefficient of determination is a measure that tells us the proportion of the variability in the dependent variable (the Y variable) that can be explained by the regression model through the independent variable (the X variable). The coefficient of determination is denoted by the symbol r², and is obtained by squaring the value of the correlation coefficient. Thus, unlike the correlation coefficient, the coefficient of determination takes no negative values, and its range is 0 ≤ r² ≤ 1. An r² value close to 1 implies that the model explains most of the variation in the dependent variable and may be a very useful model. Conversely, an r² value close to 0 implies that the model explains little of the variation in the dependent variable and may not be a useful model.

In Example 5.2, recall that the correlation coefficient was r = .675. Thus, the coefficient of determination is r² = .456, or 45.6 percent. That is, the regression model can explain about 46 percent of the variation in the Y values, the eating difficulties scores. Note that this does not mean that 46 percent of eating difficulties are caused by stress; as discussed above, correlation does not establish causation. The modest coefficient of determination makes this model questionable to use for prediction.

Exercise:

  • Read Chapter 12 in your text
  • Complete the Correlation Fitness Quiz for this week to see how well you did on your workouts!


Workout 3:
Personal Trainer's Choice:

Linear Regression and Prediction

We learned in the last lesson that SAT scores are positively correlated with freshman-year grade point averages. If someone scores well above average on the SATs, there is a good chance that his or her GPA is also above average. In general, if two variables are correlated, it is possible to make a prediction, with better than chance accuracy, of the standing in one of the variables from knowledge of the standing in the other. The closer the relationship is between the two variables, the higher the correlation coefficient, and the better the prediction. Still, the value of the coefficient by itself does not tell us how to proceed. How, then, do we make predictions from correlated data?

This lesson considers predictions only for cases in which the relationship between bivariate data can best be described by a straight line. The Pearson correlation coefficient was developed for linear relationships. Here we consider the equation for the straight line of best fit and learn how to make predictions. If the correlation between two variables is less than perfect, as it always is in real world cases, then our prediction will also be less than perfect. This means that we will make mistakes when making predictions. Thus, we must also consider the error of prediction and how to measure it.

The Problem of Prediction

Given the correlation between the SAT and freshman-year GPA, college admissions committees looking at applicants' SAT scores can predict with some success how well those students would do if admitted. Two possible ways to make that prediction are demonstrated with the scatter diagram in Figure 6.1. In the cluster of dots, we can see the positive association between SAT scores and freshman-year GPA for a sample of the students admitted to a hypothetical college in the past. Consider a new applicant with an SAT total of 1100. To take the simplest possible approach to predicting his or her GPA, we could look only at the students who had that particular score. Their data points appear in the column erected over the value 1100 on the abscissa. Suppose the mean GPA for those students was 2.3. Then 2.3 will be our prediction for this applicant.

This method of prediction has a major shortcoming; it ignores cases that have SAT scores other than 1100. The prediction is based on a relatively small sample and is therefore somewhat unstable; the prediction from another sample of students with scores of 1100 may differ markedly from this one. However, recall that larger samples from a given population vary less among themselves than do smaller samples. Thus, if we can find a way to use the full sample of observations, we can generate predictions that are more resistant to sampling variation.

Figure 6.1  Prediction of Y for persons with an SAT total score of 1100 from column mean and from line of best fit

If it is reasonable to assume that X and Y are linearly related (our scatter diagram should tell us), we can improve our prediction of Y from X by finding the straight line of best fit to the Y values. This will be a line determined by all the scores in the sample on hand. Statisticians call the line of best fit a regression line; the equation that locates the line is the regression equation. The regression line is also shown in Figure 6.1, and as shown, we can use it to make predictions for new cases. Just start with the SAT score on the abscissa, go up to the line, and then go over to the ordinate. By this method, the best prediction for a student with an SAT total of 1100 is a GPA of about 2.12.

Predictions made by this technique are better in their resistance to sampling variation, but two limitations remain. First, a regression line that has been fitted to a sample of points is probably not the same as the line that would best fit the entire population. (Other things being equal, however, the larger the sample, the closer the approximation.) Second, the technique depends on the assumption that a straight line is a reasonable description of the interrelationship between X and Y. Fortunately, the assumption of linearity is often satisfactory. Let us now consider how to define "best fit."


THE REGRESSION EQUATION: RAW SCORE FORMULA

A regression equation is a mathematical equation that can be used to predict the values of one dependent variable from known values of one or more independent variables. The term is derived from the heredity studies performed by Sir Francis Galton in which he compared the heights of sons to the heights of their fathers. Galton showed that the heights of the sons of tall fathers regressed towards the mean height of the population through several successive generations. In other words, sons of unusually tall fathers tend to be shorter than their fathers, and sons of unusually short fathers tend to be taller than their fathers.

Simple linear regression analysis allows us to determine the line of best fit for a given relationship between two variables. You might recall from algebra that the equation of a straight line is usually given by y = mx + b, where m is the slope of the line and b is the y-intercept. In statistics, the equation of the regression line is usually written as Ŷ = a + bX, where a is the Y-intercept, b is the slope of the line, and Ŷ, read as "Y hat," gives the predicted Y value for a given X value. You may also find it written as Y′ and read as "Y predicted."

There are a number of formulas for computing the regression line and thus being able to predict the dependent variable (Y) given the value of the independent variable (X). The basic formula, much like that for computing variance and standard deviation, uses the sum of the squares of both the deviation of the independent variable from its mean and the dependent variable from its mean.

Formula 6.1 Calculating b

b = Σxy / Σx²

where:

  • x and y (small letters) are the deviations of X and Y from their means;
  • xy is the product of the paired deviations of X and Y from their means;
  • Σx² and Σy² are the sums of the squares of the deviations of X and Y from their means.

Formula 6.2 Calculating a

a = Ȳ − bX̄

Once we have computed a and b, it is a simple matter to determine the formula for our regression line and to predict a Y value given the corresponding X value.

Formula 6.3 Calculating the regression line

Ŷ = a + bX
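
The three formulas translate directly into code. Here is a minimal sketch in Python (standard library only; the function and variable names are our own), using the data from Example 6.1 below:

def regression_line(X, Y):
    """Return (a, b) for the least-squares line Y' = a + bX (Formulas 6.1 to 6.3)."""
    n = len(X)
    mean_x = sum(X) / n
    mean_y = sum(Y) / n
    # Formula 6.1: b = (sum of xy) / (sum of x squared), with x and y as deviations
    sum_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(X, Y))
    sum_x2 = sum((xi - mean_x) ** 2 for xi in X)
    b = sum_xy / sum_x2
    # Formula 6.2: a = mean of Y minus b times mean of X
    a = mean_y - b * mean_x
    return a, b

# Usage with the SAT and calculus scores from Example 6.1
sat = [1100, 1300, 1000, 1100, 1200, 1200, 1400, 1300, 1000, 1400]
calculus = [89, 92, 86, 92, 90, 93, 98, 95, 88, 95]
a, b = regression_line(sat, calculus)
print(a, b)          # about 65.4 and 0.022
print(a + b * 1000)  # Formula 6.3: about 87.4 predicted for an SAT score of 1000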



Example 6.1: SAT Scores and Calculus Midterm Scores of 10 students

Student    SAT    Calculus
1         1100       89
2         1300       92
3         1000       86
4         1100       92
5         1200       90
6         1200       93
7         1400       98
8         1300       95
9         1000       88
10        1400       95

To compute the regression line, we must first solve for Σx², Σxy, and the means of X and Y.

    X     Y       x       x²      y       y²      xy
 1100    89    -100    10000   -2.8     7.84     280
 1300    92     100    10000    0.2     0.04      20
 1000    86    -200    40000   -5.8    33.64    1160
 1100    92    -100    10000    0.2     0.04     -20
 1200    90       0        0   -1.8     3.24       0
 1200    93       0        0    1.2     1.44       0
 1400    98     200    40000    6.2    38.44    1240
 1300    95     100    10000    3.2    10.24     320
 1000    88    -200    40000   -3.8    14.44     760
 1400    95     200    40000    3.2    10.24     640
Σ  12,000   918       0   200000      0   119.60    4400

Thus, if we wanted to predict what another student's calculus score would be if he or she had an SAT score of 1000:
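
Substituting the totals from the table (Σxy = 4400, Σx² = 200,000) and the means (X̄ = 12,000/10 = 1200, Ȳ = 918/10 = 91.8):

b = Σxy / Σx² = 4400 / 200000 = 0.022
a = Ȳ − bX̄ = 91.8 − (0.022)(1200) = 65.4
Ŷ = a + bX = 65.4 + (0.022)(1000) = 87.4

So the predicted calculus midterm score for a student with an SAT score of 1000 is about 87.4.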

Figure 6.1  Scatterplot of SAT Scores and Calculus Scores

THE CRITERION OF BEST FIT

It is all very well to speak of finding the straight line of best fit, but how is one to know when the "best fit" has been achieved? Best fit could be defined in more than one (reasonable) way. Karl Pearson's solution to this problem was to apply the least-squares criterion. Figure 6.2 illustrates his thinking for predicting Y from X.

Figure 6.2  Discrepancies between seven Y values and the line of regression of Y on X

The figure shows a bivariate distribution and a possible regression line. How good are the predictions from this line? For the seven cases shown in the scatter diagram, the errors of prediction appear as vertical lines, each connecting the actual value of Y to the predicted value, which we call Y', given by the regression line. The longer the vertical line, the greater the error in prediction.

Let (Y − Y′) stand for the discrepancy between the actual value of Y and the predicted value. Pearson's least-squares criterion for the regression line is this: the line of best fit is the one that minimizes the sum of the squares of the discrepancies. Thus, Σ(Y − Y′)² is to be as small as possible.

Why not just minimize the sum of the absolute magnitudes of the discrepancies rather than the sum of the squares? The answer has two parts:

  1. It is difficult to deal mathematically with the absolute discrepancies, whereas the squared discrepancies permit mathematical developments of practical value.
  2. The location of the regression line and the value of the correlation coefficient will fluctuate less under the influence of random sampling than would happen if another criterion were used.

This is not the first time we have encountered a sum of squared discrepancies. Recall that the sum of the squared deviations from the mean, Σ(X − X̄)², is minimal. Is there some connection between Pearson's regression line and the mean? Yes.

First, just as the mean is a least-squares solution to the problem of finding a measure of central tendency, so the regression line is a least-squares solution to the problem of finding the best-fitting straight line. Both minimize the sum of squared discrepancies, and they thus have analogous properties, including resistance to sampling variation.

Second, the regression line is actually a kind of mean. It is a running mean, a series of means. For each value of X, the regression line tells us the mean, or expected, value of Y. In other words, whereas Ȳ is the mean of all Y values in a set of scores, Y′ is an estimate of the mean of Y given the condition that X has a particular value.
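
To see the least-squares criterion at work numerically, this Python sketch (assuming NumPy is installed; data reused from Example 6.1) compares the sum of squared discrepancies, Σ(Y − Y′)², for the fitted regression line against a line with a slightly different slope:

import numpy as np

# SAT (X) and calculus (Y) scores from Example 6.1
sat = np.array([1100, 1300, 1000, 1100, 1200, 1200, 1400, 1300, 1000, 1400])
calc = np.array([89, 92, 86, 92, 90, 93, 98, 95, 88, 95])

def sum_sq_error(a, b):
    """Sum of squared discrepancies between the actual Y and the predicted Y'."""
    return float(np.sum((calc - (a + b * sat)) ** 2))

print(sum_sq_error(65.4, 0.022))  # the least-squares line: about 22.8, the minimum
print(sum_sq_error(66.6, 0.021))  # any other straight line does worse: about 23.0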

EXERCISE:




© by Carolyn Pearson and William Moomaw 2003. All rights reserved. Updated on May 14, 2009