Analyze large data set correlations

A question is of this type if and only if it specifically uses the large data set to investigate correlations between variables such as temperature, rainfall and pressure.

6 questions

Edexcel AS Paper 2 2020 June Q2
  1. Jerry is studying visibility for Camborne using the large data set June 1987.
The table below contains two extracts from the large data set.
It shows the daily maximum relative humidity and the daily mean visibility.

| Date | Daily Maximum Relative Humidity | Daily Mean Visibility |
|---|---|---|
| Units | \(\%\) | |
| \(10 / 06 / 1987\) | 90 | 5300 |
| \(28 / 06 / 1987\) | 100 | 0 |

(The units for Daily Mean Visibility are deliberately omitted.)
Given that daily mean visibility is given to the nearest 100,
  1. write down the range of distances in metres that corresponds to the recorded value 0 for the daily mean visibility.

    Jerry drew the following scatter diagram, Figure 2, and calculated some statistics using the June 1987 data for Camborne from the large data set.

    \begin{figure}[h]
    \includegraphics[alt={},max width=\textwidth]{d62e5a00-cd23-417f-b244-8b3e24da4aa2-04_823_1764_1281_137} \captionsetup{labelformat=empty} \caption{Figure 2}
    \end{figure}

    Jerry defines an outlier as a value that is more than 1.5 times the interquartile range above \(Q _ { 3 }\) or more than 1.5 times the interquartile range below \(Q _ { 1 }\).
  2. Show that the point circled on the scatter diagram is an outlier for visibility.
  3. Interpret the correlation between the daily mean visibility and the daily maximum relative humidity.

    Jerry drew the following scatter diagram, Figure 3, using the June 1987 data for Camborne from the large data set, but forgot to label the \(x\)-axis.

    \begin{figure}[h]
    \includegraphics[alt={},max width=\textwidth]{d62e5a00-cd23-417f-b244-8b3e24da4aa2-05_730_1056_342_386} \captionsetup{labelformat=empty} \caption{Figure 3}
    \end{figure}
  4. Using your knowledge of the large data set, suggest which variable the \(x\)-axis on this scatter diagram represents.
Edexcel AS Paper 2 2023 June Q2
  1. Fred and Nadine are investigating whether there is a linear relationship between Daily Mean Pressure, \(p \mathrm { hPa }\), and Daily Mean Air Temperature, \(t ^ { \circ } \mathrm { C }\), in Beijing using the 2015 data from the large data set.
Fred randomly selects one month from the data set and draws the scatter diagram in Figure 1 using the data from that month. The scale has been left off the horizontal axis.

\begin{figure}[h]
\includegraphics[alt={},max width=\textwidth]{854568d2-b32d-44de-8a9c-26372e509c20-04_794_1539_589_264} \captionsetup{labelformat=empty} \caption{Figure 1}
\end{figure}
  1. Describe the correlation shown in Figure 1.

    Nadine chooses to use all of the data for Beijing from 2015 and draws the scatter diagram in Figure 2. She uses the same scales as Fred.

    \begin{figure}[h]
    \includegraphics[alt={},max width=\textwidth]{854568d2-b32d-44de-8a9c-26372e509c20-04_777_1509_1841_278} \captionsetup{labelformat=empty} \caption{Figure 2}
    \end{figure}
  2. Explain, in context, what Nadine can infer about the relationship between \(p\) and \(t\) using the information shown in Figure 2.
  3. Using your knowledge of the large data set, state a value of \(p\) for which interpolation can be used with Figure 2 to predict a value of \(t\).
  4. Using your knowledge of the large data set, explain why it is not meaningful to look for a linear relationship between Daily Mean Wind Speed (Beaufort Conversion) and Daily Mean Air Temperature in Beijing in 2015.
Edexcel AS Paper 2 Specimen Q4
  1. Sara was studying the relationship between rainfall, \(r \mathrm {~mm}\), and humidity, \(h \%\), in the UK. She takes a random sample of 11 days from May 1987 for Leuchars from the large data set.
She obtained the following results.

| \(h\) | 93 | 86 | 95 | 97 | 86 | 94 | 97 | 97 | 87 | 97 | 86 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| \(r\) | 1.1 | 0.3 | 3.7 | 20.6 | 0 | 0 | 2.4 | 1.1 | 0.1 | 0.9 | 0.1 |

Sara examined the rainfall figures and found
$$Q _ { 1 } = 0.1 \quad Q _ { 2 } = 0.9 \quad Q _ { 3 } = 2.4$$
A value that is more than 1.5 times the interquartile range (IQR) above \(Q _ { 3 }\) is called an outlier.
  1. Show that \(r = 20.6\) is an outlier.
  2. Give a reason why Sara might:
    1. include
    2. exclude
      this day's reading.

    Sara decided to exclude this day's reading and drew the following scatter diagram for the remaining 10 days' values of \(r\) and \(h\).
      \includegraphics[max width=\textwidth, alt={}, center]{8f3dbcb4-3260-4493-a230-12577b4ed691-08_988_1081_1555_420}
  3. Give an interpretation of the correlation between rainfall and humidity.

    The equation of the regression line of \(r\) on \(h\) for these 10 days is \(r = - 12.8 + 0.15 h\).
  4. Give an interpretation of the gradient of this regression line.
    1. Comment on the suitability of Sara's sampling method for this study.
    2. Suggest how Sara could make better use of the large data set for her study.
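
The outlier rule quoted in this question can be checked numerically. The following is a minimal Python sketch, not part of the original paper: it uses the quartiles and the 11 rainfall readings given in the question, and the helper name `upper_fence` is a hypothetical name chosen for illustration.

```python
# Quartiles quoted in the question for Sara's rainfall data (mm).
Q1, Q3 = 0.1, 2.4

# The 11 rainfall readings from the table, in the order given.
rainfall = [1.1, 0.3, 3.7, 20.6, 0, 0, 2.4, 1.1, 0.1, 0.9, 0.1]


def upper_fence(q1, q3, k=1.5):
    """Return Q3 + k * IQR, the value above which a reading is an outlier.

    `upper_fence` is an illustrative helper, not a name used in the paper.
    """
    return q3 + k * (q3 - q1)


fence = upper_fence(Q1, Q3)
outliers = [r for r in rainfall if r > fence]

print(f"upper fence = {fence:.2f} mm")
print(f"readings above the fence: {outliers}")
```

Only the upper fence is computed because the question defines an outlier solely as a value above \(Q _ { 3 }\) by more than 1.5 times the IQR.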
Edexcel AS Paper 2 Specimen Q3
  1. Pete is investigating the relationship between daily rainfall, \(w \mathrm {~mm}\), and daily mean pressure, \(p\) hPa, in Perth during 2015. He used the large data set to take a sample of size 12.
He obtained the following results.
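
| \(p\) | 1007 | 1012 | 1013 | 1009 | 1019 | 1010 | 1010 | 1010 | 1013 | 1011 | 1014 | 1022 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| \(w\) | 102.0 | 63.0 | 63.0 | 38.4 | 38.0 | 35.0 | 34.2 | 32.0 | 30.4 | 28.0 | 28.0 | 15 |
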
Pete drew the following scatter diagram for the values of \(w\) and \(p\) and calculated the quartiles.

| | \(Q _ { 1 }\) | \(Q _ { 2 }\) | \(Q _ { 3 }\) |
|---|---|---|---|
| \(p\) | 1010 | 1011.5 | 1013.5 |
| \(w\) | 29.2 | 34.6 | 50.7 |

\includegraphics[max width=\textwidth, alt={}]{b29b0411-8401-420b-9227-befe25c245d8-04_818_1081_989_477}
An outlier is a value which is more than 1.5 times the interquartile range above Q3 or more than 1.5 times the interquartile range below Q1.
  1. Show that the 3 points circled on the scatter diagram above are outliers.
    (2)
  2. Describe the effect of removing the 3 outliers on the correlation between daily rainfall and daily mean pressure in this sample.
    (1)

    John has also been studying the large data set and believes that the sample Pete has taken is not random.
  3. From your knowledge of the large data set, explain why Pete's sample is unlikely to be a random sample.

    John finds that the equation of the regression line of \(w\) on \(p\), using all the data in the large data set, is $$w = 1023 - 0.223 p$$
  4. Give an interpretation of the figure \(-0.223\) in this regression line.

    John decided to use the regression line to estimate the daily rainfall for a day in December when the daily mean pressure is 1011 hPa.
  5. Using your knowledge of the large data set, comment on the reliability of John's estimate.
    (Total for Question 3 is 6 marks)
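
The two-sided outlier rule in this question can be applied to both variables at once. The sketch below is illustrative only and makes two assumptions: that the \(p\) and \(w\) values in the table pair up column by column, and that a point counts as an outlier if either coordinate lies outside its fences. The helper names `fences` and `outside` are hypothetical.

```python
# Sample values from the question, assumed to pair up column by column.
p = [1007, 1012, 1013, 1009, 1019, 1010, 1010, 1010, 1013, 1011, 1014, 1022]
w = [102.0, 63.0, 63.0, 38.4, 38.0, 35.0, 34.2, 32.0, 30.4, 28.0, 28.0, 15]


def fences(q1, q3, k=1.5):
    """Return (lower, upper) outlier fences: Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def outside(x, lower, upper):
    """True if x lies strictly below the lower or above the upper fence."""
    return x < lower or x > upper


# Quartiles as quoted in the question's table.
p_low, p_high = fences(1010, 1013.5)
w_low, w_high = fences(29.2, 50.7)

# Assumption: a point is flagged if either coordinate is an outlier.
flagged = [
    (pi, wi)
    for pi, wi in zip(p, w)
    if outside(pi, p_low, p_high) or outside(wi, w_low, w_high)
]

print(f"pressure fences: {p_low} to {p_high} hPa")
print(f"rainfall fences: {w_low:.2f} to {w_high:.2f} mm")
print(f"flagged points (p, w): {flagged}")
```

Under these assumptions exactly three points are flagged, which matches the number of circled points mentioned in the question.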
Edexcel Paper 3 Specimen Q2
  1. A meteorologist believes that there is a relationship between the daily mean windspeed, \(w \mathrm { kn }\), and the daily mean temperature, \(t ^ { \circ } \mathrm { C }\). A random sample of 9 consecutive days is taken from past records from a town in the UK in July and the relevant data is given in the table below.
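
| \(\boldsymbol { t }\) | 13.3 | 16.2 | 15.7 | 16.6 | 16.3 | 16.4 | 19.3 | 17.1 | 13.2 |
|---|---|---|---|---|---|---|---|---|---|
| \(\boldsymbol { w }\) | 7 | 11 | 8 | 11 | 13 | 8 | 15 | 10 | 11 |
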
The meteorologist calculated the product moment correlation coefficient for the 9 days and obtained \(r = 0.609\)
  1. Explain why a linear regression model based on these data is unreliable on a day when the mean temperature is \(24 ^ { \circ } \mathrm { C }\)
  2. State what is measured by the product moment correlation coefficient.
  3. Stating your hypotheses clearly, test, at the \(5 \%\) significance level, whether or not the product moment correlation coefficient for the population is greater than zero.

    Using the same 9 days, a location from the large data set gave \(\bar { t } = 27.2\) and \(\bar { w } = 3.5\).
  4. Using your knowledge of the large data set, suggest, giving your reason, the location that gave rise to these statistics.
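
The product moment correlation coefficient quoted in this question can be reproduced from the table. The following Python sketch is an illustration, not the examiner's method: it computes the coefficient from the usual sums of squares, and the function name `pmcc` is a hypothetical name. It also prints the range of \(t\) in the sample, which is relevant to whether a prediction at \(24 ^ { \circ } \mathrm { C }\) would be extrapolation.

```python
from math import sqrt

# Daily mean temperature (deg C) and windspeed (kn) for the 9 days in the table.
t = [13.3, 16.2, 15.7, 16.6, 16.3, 16.4, 19.3, 17.1, 13.2]
w = [7, 11, 8, 11, 13, 8, 15, 10, 11]


def pmcc(x, y):
    """Product moment correlation coefficient: S_xy / sqrt(S_xx * S_yy)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / sqrt(s_xx * s_yy)


r = pmcc(t, w)
print(f"r = {r:.3f}")
print(f"sampled temperatures run from {min(t)} to {max(t)} deg C")
```

The computed value should agree with the quoted \(r = 0.609\) to within rounding.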
Edexcel Paper 3 Specimen Q2
2. A researcher believes that there is a linear relationship between daily mean temperature and daily total rainfall. The 7 places in the northern hemisphere from the large data set are used. The mean of the daily mean temperatures, \(t ^ { \circ } \mathrm { C }\), and the mean of the daily total rainfall, \(s \mathrm {~mm}\), for the month of July in 2015 are shown on the scatter diagram below.
\includegraphics[max width=\textwidth, alt={}, center]{565bfa73-8095-4242-80b6-cd47aaff6a31-03_844_1339_497_372}
  1. With reference to the scatter diagram, explain why a linear regression model may not be suitable for the relationship between \(t\) and \(s\).
    (1)

    The researcher calculated the product moment correlation coefficient for the 7 places and obtained \(r = 0.658\).
  2. Stating your hypotheses clearly, test at the \(10 \%\) level of significance, whether or not the product moment correlation coefficient for the population is greater than zero.
    (3)
  3. Using your knowledge of the large data set, suggest the names of the 2 places labelled \(G\) and \(H\).
    (1)
  4. Using your knowledge from the large data set, and with reference to the locations of the two places labelled \(G\) and \(H\), give a reason why these places have the highest temperatures in July.
    (2)
  5. Suggest how you could make better use of the large data set to investigate the relationship between daily mean temperature and daily total rainfall.
    (1)
    (Total 7 marks)