Calculate PMCC from raw data

Questions that provide raw bivariate data in a table and require calculating the product moment correlation coefficient directly from the individual data values.

6 questions

OCR S1 2008 January Q9
9 It is thought that the pH value of sand (a measure of the sand's acidity) may affect the extent to which a particular species of plant will grow in that sand. A botanist wished to determine whether there was any correlation between the pH value of the sand on certain sand dunes, and the amount of each of two plant species growing there. She chose random sections of equal area on each of eight sand dunes and measured the pH values. She then measured the area within each section that was covered by each of the two species. The results were as follows.
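\begin{tabular}{|c|c|c|c|c|c|c|c|c|c|}
\cline{2-10}
\multicolumn{1}{c|}{} & Dune & \(A\) & \(B\) & \(C\) & \(D\) & \(E\) & \(F\) & \(G\) & \(H\) \\
\cline{2-10}
\multicolumn{1}{c|}{} & pH value, \(x\) & 8.5 & 8.5 & 9.5 & 8.5 & 6.5 & 7.5 & 8.5 & 9.0 \\
\hline
\multirow{2}{*}{Area, \(y \mathrm{~cm}^{2}\), covered} & Species \(P\) & 150 & 150 & 575 & 330 & 45 & 15 & 340 & 330 \\
\cline{2-10}
 & Species \(Q\) & 170 & 15 & 80 & 230 & 75 & 25 & 0 & 0 \\
\hline
\end{tabular}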
The results for species \(P\) can be summarised by $$n = 8 , \quad \Sigma x = 66.5 , \quad \Sigma x ^ { 2 } = 558.75 , \quad \Sigma y = 1935 , \quad \Sigma y ^ { 2 } = 711275 , \quad \Sigma x y = 17082.5 .$$
  1. Give a reason why it might be appropriate to calculate the equation of the regression line of \(y\) on \(x\) rather than \(x\) on \(y\) in this situation.
  2. Calculate the equation of the regression line of \(y\) on \(x\) for species \(P\), in the form \(y = a + b x\), giving the values of \(a\) and \(b\) correct to 3 significant figures.
  3. Estimate the value of \(y\) for species \(P\) on sand where the pH value is 7.0.
The values of the product moment correlation coefficient between \(x\) and \(y\) for species \(P\) and \(Q\) are \(r _ { P } = 0.828\) and \(r _ { Q } = 0.0302\).
  4. Describe the relationship between the area covered by species \(Q\) and the pH value.
  5. State, with a reason, whether the regression line of \(y\) on \(x\) for species \(P\) will provide a reliable estimate of the value of \(y\) when the pH value is
    (a) 8,
    (b) 4 .
  6. Assume that the equation of the regression line of \(y\) on \(x\) for species \(Q\) is also known. State, with a reason, whether this line will provide a reliable estimate of the value of \(y\) when the pH value is 8 .
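As a rough check on the regression line for species \(P\) and on the quoted value of \(r_P\), the summary statistics can be put through the standard formulas \(S_{xy} = \Sigma xy - \Sigma x \Sigma y / n\), \(b = S_{xy} / S_{xx}\), \(a = \bar{y} - b \bar{x}\) and \(r = S_{xy} / \sqrt{S_{xx} S_{yy}}\). The sketch below is illustrative only; the function name `line_and_r_from_sums` is hypothetical and not part of the question.

```python
from math import sqrt

def line_and_r_from_sums(n, sum_x, sum_x2, sum_y, sum_y2, sum_xy):
    """Return (a, b, r) for the line y = a + b x, computed from raw sums.

    Hypothetical helper for illustration; it assumes S_xx and S_yy are non-zero.
    """
    s_xx = sum_x2 - sum_x ** 2 / n
    s_yy = sum_y2 - sum_y ** 2 / n
    s_xy = sum_xy - sum_x * sum_y / n
    b = s_xy / s_xx                  # gradient of y on x
    a = sum_y / n - b * sum_x / n    # intercept: the line passes through the mean point
    r = s_xy / sqrt(s_xx * s_yy)     # product moment correlation coefficient
    return a, b, r

# Summary statistics for species P, as given in the question
a, b, r = line_and_r_from_sums(8, 66.5, 558.75, 1935, 711275, 17082.5)
print(f"a = {a:.1f}, b = {b:.1f}, r = {r:.3f}")
print(f"estimate at pH 7.0: {a + b * 7.0:.1f}")
```

Under these assumptions the computed \(r\) agrees with the value \(r_P = 0.828\) quoted in the question, which is a useful sanity check before trusting \(a\) and \(b\).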
Edexcel S1 2024 January Q4
4 A French test and a Spanish test were sat by 11 students. The table below shows their marks.
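\begin{tabular}{|l|c|c|c|c|c|c|c|c|c|c|c|}
\hline
Student & A & B & C & D & E & F & G & H & I & J & K \\
\hline
French mark (\(f\)) & 24 & 30 & 32 & 32 & 36 & 36 & 40 & 44 & 50 & 60 & 68 \\
\hline
Spanish mark (\(s\)) & 16 & 90 & 24 & 28 & 32 & 36 & 38 & 44 & 48 & 48 & 68 \\
\hline
\end{tabular}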
Greg says that if these points were plotted on a scatter diagram, then the point \(( 30,90 )\) would be an outlier because 90 is an outlier for the Spanish marks. An outlier is defined as a value that is $$\text { greater than } Q _ { 3 } + 1.5 \times \left( Q _ { 3 } - Q _ { 1 } \right) \text { or smaller than } Q _ { 1 } - 1.5 \times \left( Q _ { 3 } - Q _ { 1 } \right)$$
  1. Show that 90 is an outlier for the Spanish marks.
Ignoring the point \(( 30,90 )\), Greg calculated the following summary statistics. $$\sum f = 422 \quad \sum s = 382 \quad S _ { f f } = 1667.6 \quad S _ { f s } = 1735.6$$
  2. Use these summary statistics to show that the equation of the least squares regression line of \(s\) on \(f\) for the remaining 10 students is $$s = - 5.72 + 1.04 f$$ where the values of the intercept and gradient are given to 3 significant figures. You must show your working.
  3. Give an interpretation of the gradient of the regression line.
Two further students sat the French test but missed the Spanish test.
  4. Using the equation given in part (b), estimate
    1. a Spanish mark for the student who scored 55 marks in their French test,
    2. a Spanish mark for the student who scored 18 marks in their French test.
  5. State, giving a reason, which of the two estimates found in part (d) would be the more reliable estimate.
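The outlier test and the regression coefficients in this question can be sketched as follows. This is a minimal illustration, not a model answer: it assumes the quartile convention of Python's `statistics.quantiles` (default "exclusive" method), which may differ from the convention expected in the exam, and the variable names are chosen here for illustration.

```python
from statistics import quantiles

# Spanish marks for all 11 students, taken from the table
spanish = [16, 90, 24, 28, 32, 36, 38, 44, 48, 48, 68]

# Quartiles under the assumed convention (quantiles() sorts the data itself)
q1, _, q3 = quantiles(spanish, n=4)
upper_fence = q3 + 1.5 * (q3 - q1)
is_outlier = 90 > upper_fence
print(f"Q1 = {q1}, Q3 = {q3}, upper fence = {upper_fence}, outlier: {is_outlier}")

# Regression of s on f for the remaining 10 students, from the given summary statistics
n = 10
sum_f, sum_s = 422, 382
s_ff, s_fs = 1667.6, 1735.6
b = s_fs / s_ff                    # gradient
a = sum_s / n - b * sum_f / n      # intercept
print(f"s = {a:.2f} + {b:.2f} f")

# French marks of the remaining 10 students range from 24 to 68
f_min, f_max = 24, 68
for f in (55, 18):
    kind = "interpolation" if f_min <= f <= f_max else "extrapolation"
    print(f"f = {f}: estimated s = {a + b * f:.1f} ({kind})")
```

Flagging each estimate as interpolation or extrapolation mirrors the reliability comparison asked for in the final part.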
OCR Further Statistics AS 2022 June Q1
1 A geography student chose a certain point in a stream and took measurements of the speed of flow, \(v \mathrm {~ms} ^ { - 1 }\), of water at various depths, \(d \mathrm {~m}\), below the surface at that point. The results are shown in the table.
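\begin{tabular}{|c|c|c|c|c|c|c|c|c|c|}
\hline
\(d\) & 0.1 & 0.15 & 0.2 & 0.25 & 0.3 & 0.35 & 0.4 & 0.45 & 0.5 \\
\hline
\(v\) & 0.8 & 0.5 & 0.7 & 1.2 & 1.1 & 1.3 & 1.6 & 1.4 & 0.4 \\
\hline
\end{tabular}
\(n = 9 \quad \sum d = 2.7 \quad \sum v = 9.0 \quad \sum d ^ { 2 } = 0.96 \quad \sum v ^ { 2 } = 10.4 \quad \sum d v = 2.85\)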
    1. Explain why \(d\) is an example of an independent, controlled variable.
    2. Use two relevant terms to describe the variable \(v\) in a similar way.
A statistician believes that the point \(( 0.5,0.4 )\) may be an anomaly.
  1. Calculate the equation of the least squares regression line of \(v\) on \(d\) for all the points in the table apart from ( \(0.5,0.4\) ).
  2. Use the equation of the line found in part (b) to estimate the value of \(v\) when \(d = 0.5\).
  3. Use your answer to part (c) to comment on the statistician’s belief.
  4. Use the diagram in the Printed Answer Booklet (which does not illustrate the data in this question) to explain what is meant by "least squares regression line".
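A least squares line of \(v\) on \(d\) with the suspected anomaly removed can be computed directly from the raw table values. The sketch below is one way to do this under the usual formulas \(b = S_{dv} / S_{dd}\) and \(a = \bar{v} - b \bar{d}\); the variable names are illustrative and not taken from the question.

```python
# Raw data from the table: depth d (m) and speed of flow v (m/s)
d = [0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5]
v = [0.8, 0.5, 0.7, 1.2, 1.1, 1.3, 1.6, 1.4, 0.4]

# Drop the suspected anomaly (0.5, 0.4) before fitting
pairs = [(di, vi) for di, vi in zip(d, v) if (di, vi) != (0.5, 0.4)]
n = len(pairs)

mean_d = sum(di for di, _ in pairs) / n
mean_v = sum(vi for _, vi in pairs) / n

# Centred sums of squares and products
s_dd = sum((di - mean_d) ** 2 for di, _ in pairs)
s_dv = sum((di - mean_d) * (vi - mean_v) for di, vi in pairs)

b = s_dv / s_dd            # gradient of v on d
a = mean_v - b * mean_d    # intercept
v_hat = a + b * 0.5        # estimate at the depth of the removed point

print(f"v = {a:.3f} + {b:.3f} d")
print(f"estimate at d = 0.5: {v_hat:.2f} (observed value was 0.4)")
```

Comparing the estimate at \(d = 0.5\) with the observed 0.4 gives a numerical basis for commenting on whether the point looks anomalous.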
OCR Further Statistics AS 2020 November Q3
3 An investor obtains data about the profits of 8 randomly chosen investment accounts over two one-year periods. The profit in the first year for each account is \(p \%\) and the profit in the second year for each account is \(q \%\). The results are shown in the table and in the scatter diagram.
\begin{tabular}{|c|c|c|c|c|c|c|c|c|}
\hline
Account & A & B & C & D & E & F & G & H \\
\hline
\(p\) & 1.6 & 2.1 & 2.4 & 2.7 & 2.8 & 3.3 & 5.2 & 8.4 \\
\hline
\(q\) & 1.6 & 2.3 & 2.2 & 2.2 & 3.1 & 2.9 & 7.6 & 4.8 \\
\hline
\end{tabular}
\(n = 8 \quad \sum p = 28.5 \quad \sum q = 26.7 \quad \sum p ^ { 2 } = 136.35 \quad \sum q ^ { 2 } = 116.35 \quad \sum p q = 116.70\)
\includegraphics[max width=\textwidth, alt={}, center]{bf1468d1-e02e-47d2-bf41-5bc8f5b4d7c4-3_782_1280_998_242}
  1. State which, if either, of the variables \(p\) and \(q\) is independent.
  2. Calculate the equation of the regression line of \(q\) on \(p\).
    1. Use the regression line to estimate the value of \(q\) for an investment account for which \(p = 2.5\).
    2. Give two reasons why this estimate could be considered reliable.
  3. Comment on the reliability of using the regression line to predict the value of \(q\) when \(p = 7.0\).
OCR FS1 AS 2021 June Q1
1 Five observations of bivariate data \(( x , y )\) are given in the table.
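\begin{tabular}{|c|c|c|c|c|c|}
\hline
\(x\) & 7 & 8 & 12 & 6 & 4 \\
\hline
\(y\) & 20 & 16 & 7 & 17 & 23 \\
\hline
\end{tabular}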
  1. Find the value of Pearson's product-moment correlation coefficient.
  2. State what your answer to part (a) tells you about a scatter diagram representing the data.
  3. A new variable \(a\) is defined by \(a = 3 x + 4\). Dee says "The value of Pearson's product-moment correlation coefficient between \(a\) and \(y\) will not be the same as the answer to part (a)." State with a reason whether you agree with Dee.
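For the five \((x, y)\) observations in OCR FS1 AS 2021 June Q1 (\(x\): 7, 8, 12, 6, 4 and \(y\): 20, 16, 7, 17, 23), the product-moment correlation coefficient can be computed from the raw values, and the effect of the linear coding \(a = 3x + 4\) can be checked numerically. This is a minimal sketch; the helper name `pmcc` is hypothetical.

```python
from math import sqrt

def pmcc(xs, ys):
    """Pearson's product-moment correlation coefficient from raw paired data.

    Hypothetical helper; assumes equal-length lists with non-zero spread.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

# The five observations from the question
x = [7, 8, 12, 6, 4]
y = [20, 16, 7, 17, 23]

r_xy = pmcc(x, y)

# Linear coding a = 3x + 4 (positive scale factor)
a = [3 * xi + 4 for xi in x]
r_ay = pmcc(a, y)

print(f"r(x, y) = {r_xy:.3f}")
print(f"r(a, y) = {r_ay:.3f}")
```

The two printed values coincide because adding a constant shifts the mean by the same amount and a positive scale factor cancels between numerator and denominator.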
OCR FS1 AS 2017 Specimen Q8
53 marks
8 The following table gives the mean per capita consumption of mozzarella cheese per annum, \(x\) pounds, and the number of civil engineering doctorates awarded, \(y\), in the United States in each of 10 years.