2.02j Clean data: missing data, errors

22 questions

OCR MEI S1 Q4
19 marks Moderate -0.3
4 The incomes of a sample of 918 households on an island are given in the table below.

| Income (\(x\) thousand pounds) | \(0 \leqslant x \leqslant 20\) | \(20 < x \leqslant 40\) | \(40 < x \leqslant 60\) | \(60 < x \leqslant 100\) | \(100 < x \leqslant 200\) |
|---|---|---|---|---|---|
| Frequency | 238 | 365 | 142 | 128 | 45 |

  1. Draw a histogram to illustrate the data.
  2. Calculate an estimate of the mean income.
  3. Calculate an estimate of the standard deviation of the incomes.
  4. Use your answers to parts (ii) and (iii) to show there are almost certainly some outliers in the sample. Explain whether or not it would be appropriate to exclude the outliers from the calculation of the mean and the standard deviation.
  5. The incomes were converted into another currency using the formula \(y = 1.15 x\). Calculate estimates of the mean and variance of the incomes in the new currency.
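
The grouped estimates asked for in parts 2, 3 and 5 can be sketched in Python as a rough check. This is only an illustration: it assumes class midpoints stand in for every income in a class, uses the divisor \(n - 1\) for the standard deviation, and takes "more than 2 standard deviations from the mean" as the outlier guide. All variable names are illustrative.

```python
from math import sqrt

# Class midpoints (thousand pounds) and frequencies read from the table.
midpoints = [10, 30, 50, 80, 150]
freqs = [238, 365, 142, 128, 45]

n = sum(freqs)
sum_fx = sum(f * x for f, x in zip(freqs, midpoints))
sum_fx2 = sum(f * x * x for f, x in zip(freqs, midpoints))

mean = sum_fx / n
# Assumption: sample standard deviation with divisor n - 1.
s_xx = sum_fx2 - sum_fx ** 2 / n
sd = sqrt(s_xx / (n - 1))

# Assumed outlier guide: values beyond mean + 2 * sd.
upper_limit = mean + 2 * sd

# Linear coding y = 1.15x scales the mean by 1.15 and the variance by 1.15 ** 2.
mean_y = 1.15 * mean
var_y = 1.15 ** 2 * sd ** 2

print(round(mean, 2), round(sd, 2), round(upper_limit, 1))
print(round(mean_y, 2), round(var_y, 1))
```

Because `upper_limit` falls well inside the \(100 < x \leqslant 200\) class, at least some of the 45 households in that class are very likely to lie beyond it.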
OCR MEI S1 Q1
17 marks Easy -1.2
1 The temperature of a supermarket fridge is regularly checked to ensure that it is working correctly. Over a period of three months the temperature (measured in degrees Celsius) is checked 600 times. These temperatures are displayed in the cumulative frequency diagram below. \includegraphics[max width=\textwidth, alt={}, center]{c7cb0f6b-7b6b-4c52-8287-7efc6bd70247-1_1052_1647_549_289}
  1. Use the diagram to estimate the median and interquartile range of the data.
  2. Use your answers to part (i) to show that there are very few, if any, outliers in the sample.
  3. Suppose that an outlier is identified in these data. Discuss whether it should be excluded from any further analysis.
  4. Copy and complete the frequency table below for these data.
    Temperature
    \(( t\) degrees Celsius \()\)
    \(3.0 \leqslant t \leqslant 3.4\)\(3.4 < t \leqslant 3.8\)\(3.8 < t \leqslant 4.2\)\(4.2 < t \leqslant 4.6\)\(4.6 < t \leqslant 5.0\)
    Frequency243157
  5. Use your table to calculate an estimate of the mean.
  6. The standard deviation of the temperatures in degrees Celsius is 0.379 . The temperatures are converted from degrees Celsius into degrees Fahrenheit using the formula \(F = 1.8 C + 32\). Hence estimate the mean and find the standard deviation of the temperatures in degrees Fahrenheit.
OCR MEI S1 Q1
18 marks Easy -1.2
1 The maximum temperatures \(x\) degrees Celsius recorded during each month of 2005 in Cambridge are given in the table below.
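
| Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 9.2 | 7.1 | 10.7 | 14.2 | 16.6 | 21.8 | 22.0 | 22.6 | 21.1 | 17.4 | 10.1 | 7.8 |
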
These data are summarised by \(n = 12 , \Sigma x = 180.6 , \Sigma x ^ { 2 } = 3107.56\).
  1. Calculate the mean and standard deviation of the data.
  2. Determine whether there are any outliers.
  3. The formula \(y = 1.8 x + 32\) is used to convert degrees Celsius to degrees Fahrenheit. Find the mean and standard deviation of the 2005 maximum temperatures in degrees Fahrenheit.
  4. In New York, the monthly maximum temperatures are recorded in degrees Fahrenheit. In 2005 the mean was 63.7 and the standard deviation was 16.0 . Briefly compare the maximum monthly temperatures in Cambridge and New York in 2005.

    The total numbers of hours of sunshine recorded in Cambridge during the month of January for each of the last 48 years are summarised below.

    | Hours \(h\) | \(70 \leqslant h < 100\) | \(100 \leqslant h < 110\) | \(110 \leqslant h < 120\) | \(120 \leqslant h < 150\) | \(150 \leqslant h < 170\) | \(170 \leqslant h < 190\) |
    |---|---|---|---|---|---|---|
    | Number of years | 6 | 8 | 10 | 11 | 10 | 3 |

  5. Draw a cumulative frequency graph for these data.
  6. Use your graph to estimate the 90th percentile.
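
For the Cambridge temperature parts (1 to 3), a minimal Python sketch of the summary-statistic calculations follows. It assumes the divisor \(n - 1\) for the standard deviation and the "2 standard deviations from the mean" guide for outliers; both are assumptions about the expected convention, and the variable names are illustrative.

```python
from math import sqrt

temps = [9.2, 7.1, 10.7, 14.2, 16.6, 21.8, 22.0, 22.6, 21.1, 17.4, 10.1, 7.8]
n, sum_x, sum_x2 = 12, 180.6, 3107.56  # summary values given in the question

mean = sum_x / n
# Assumption: sample standard deviation with divisor n - 1.
sd = sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))

# Assumed outlier guide: more than 2 standard deviations from the mean.
low, high = mean - 2 * sd, mean + 2 * sd
outliers = [t for t in temps if t < low or t > high]

# y = 1.8x + 32 shifts and scales the mean; only the scale factor affects the sd.
mean_f = 1.8 * mean + 32
sd_f = 1.8 * sd

print(round(mean, 2), round(sd, 2), outliers)
print(round(mean_f, 2), round(sd_f, 2))
```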
OCR MEI S1 Q1
6 marks Moderate -0.8
1 The amounts of electricity, \(x \mathrm { kWh }\) (kilowatt hours), used by 40 households in a three-month period are summarised as follows. $$n = 40 \quad \sum x = 59972 \quad \sum x ^ { 2 } = 96767028$$
  1. Calculate the mean and standard deviation of \(x\).
  2. The formula \(y = 0.163 x + 14.5\) gives the cost in pounds of the electricity used by each household. Use your answers to part (i) to deduce the mean and standard deviation of the costs of the electricity used by these 40 households.
Edexcel AS Paper 2 2019 June Q4
8 marks Moderate -0.8
  1. Joshua is investigating the daily total rainfall in Hurn for May to October 2015
Using the information from the large data set, Joshua wishes to calculate the mean of the daily total rainfall in Hurn for May to October 2015
  1. Using your knowledge of the large data set, explain why Joshua needs to clean the data before calculating the mean.

    Using the information from the large data set, he produces the grouped frequency table below.

    | Daily total rainfall ( \(r \mathrm {~mm}\) ) | Frequency | Midpoint ( \(\boldsymbol { x } \mathbf { m m }\) ) |
    |---|---|---|
    | \(0 \leqslant r < 0.5\) | 121 | 0.25 |
    | \(0.5 \leqslant r < 1.0\) | 10 | 0.75 |
    | \(1.0 \leqslant r < 5.0\) | 24 | 3.0 |
    | \(5.0 \leqslant r < 10.0\) | 12 | 7.5 |
    | \(10.0 \leqslant r < 30.0\) | 17 | 20.0 |

    $$\text { You may use } \sum \mathrm { f } x = 539.75 \text { and } \sum \mathrm { f } x ^ { 2 } = 7704.1875$$
  2. Use linear interpolation to calculate an estimate for the upper quartile of the daily total rainfall.
  3. Calculate an estimate for the standard deviation of the daily total rainfall in Hurn for May to October 2015
    1. State the assumption involved with using class midpoints to calculate an estimate of a mean from a grouped frequency table.
    2. Using your knowledge of the large data set, explain why this assumption does not hold in this case.
    3. State, giving a reason, whether you would expect the actual mean daily total rainfall in Hurn for May to October 2015 to be larger than, smaller than or the same as an estimate based on the grouped frequency table.
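
Linear interpolation on a grouped frequency table, as needed for the upper quartile, can be sketched as below. The helper `interpolate` is hypothetical, the quartile position is taken as \(0.75n\), and the standard deviation uses the divisor \(n\); these conventions are assumptions, not taken from a mark scheme.

```python
from math import sqrt

# (lower bound, upper bound, frequency) for each class in the table.
classes = [(0.0, 0.5, 121), (0.5, 1.0, 10), (1.0, 5.0, 24),
           (5.0, 10.0, 12), (10.0, 30.0, 17)]


def interpolate(classes, position):
    """Estimate the value at a cumulative position by linear interpolation."""
    cumulative = 0
    for lower, upper, freq in classes:
        if cumulative + freq >= position:
            # Assume values are spread evenly across the class.
            return lower + (position - cumulative) / freq * (upper - lower)
        cumulative += freq
    raise ValueError("position exceeds total frequency")


n = sum(freq for _, _, freq in classes)
upper_quartile = interpolate(classes, 0.75 * n)

# Summary values quoted in the question.
sum_fx, sum_fx2 = 539.75, 7704.1875
mean = sum_fx / n
sd = sqrt(sum_fx2 / n - mean ** 2)  # assumption: divisor n

print(round(upper_quartile, 2), round(mean, 2), round(sd, 2))
```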
Edexcel Paper 3 2023 June Q3
7 marks Moderate -0.8
  1. Ben is studying the Daily Total Rainfall, \(x \mathrm {~mm}\), in Leeming for 1987
He used all the data from the large data set and summarised the information in the following table.
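
| \(x\) | 0 | \(0.1 - 0.5\) | \(0.6 - 1.0\) | \(1.1 - 1.9\) | \(2.0 - 4.0\) | \(4.1 - 6.9\) | \(7.0 - 12.0\) | \(12.1 - 20.9\) | \(21.0 - 32.0\) | \(\operatorname { tr }\) |
|---|---|---|---|---|---|---|---|---|---|---|
| Frequency | 55 | 18 | 18 | 21 | 17 | 9 | 9 | 6 | 2 | 29 |
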
  1. Explain how the data will need to be cleaned before Ben can start to calculate statistics such as the mean and standard deviation. Using all 184 of these values, Ben estimates \(\sum x = 390\) and \(\sum x ^ { 2 } = 4336\)
  2. Calculate estimates for
    1. the mean Daily Total Rainfall,
    2. the standard deviation of the Daily Total Rainfall. Ben suggests using the statistic calculated in part (b)(i) to estimate the annual mean Daily Total Rainfall in Leeming for 1987
  3. Using your knowledge of the large data set,
    1. give a reason why these data would not be suitable,
    2. state, giving a reason, how you would expect the estimate in part (b)(i) to differ from the actual annual mean Daily Total Rainfall in Leeming for 1987
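
One way to picture the cleaning step is the short Python sketch below. The function `clean_rainfall`, the choice to replace `tr` (trace) with 0, the `n/a` marker and the sample readings are all illustrative assumptions; replacing trace with another small value would also be defensible. The summary calculation then uses the totals quoted in the question, with the divisor \(n\) assumed for the standard deviation.

```python
from math import sqrt


def clean_rainfall(raw_values, trace_value=0.0):
    """Turn raw rainfall entries into numbers.

    Assumptions: 'tr' (trace) becomes trace_value; blank or 'n/a'
    entries are dropped; everything else is parsed as a number.
    """
    cleaned = []
    for value in raw_values:
        text = str(value).strip().lower()
        if text == "tr":
            cleaned.append(trace_value)
        elif text in ("", "n/a"):
            continue
        else:
            cleaned.append(float(text))
    return cleaned


# Hypothetical raw readings, not taken from the large data set.
sample = ["0", "tr", "1.4", "n/a", "12.6", "tr"]
cleaned = clean_rainfall(sample)

# Estimates from the totals quoted in the question.
n, sum_x, sum_x2 = 184, 390, 4336
mean = sum_x / n
sd = sqrt(sum_x2 / n - mean ** 2)  # assumption: divisor n

print(cleaned, round(mean, 2), round(sd, 2))
```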
OCR MEI AS Paper 2 2022 June Q11
9 marks Moderate -0.8
11 The pre-release material contains information about the Median Income of Taxpayers and the Percentage of Pupils Achieving at Least 5 A*- C grades, including English and Maths, at the end of KS4 in different areas of London. Alex is investigating whether there is a relationship between median income and the percentage of pupils achieving at least 5 A* - C grades, including English and Maths, at the end of KS4. Alex decides to use the first 12 rows of data for 2014-5 from the pre-release data as a sample. The sample is shown in Fig. 11.1. \begin{table}[h]

| Area | Median Income of Taxpayers | Percentage of Pupils Achieving at Least 5 A*- C grades including English and Maths |
|---|---|---|
| City of London | 61100 | \#N/A |
| Barking and Dagenham | 21800 | 54.0 |
| Barnet | 27100 | 70.1 |
| Bexley | 24400 | 55.0 |
| Brent | 22700 | 60.0 |
| Bromley | 28100 | 68.0 |
| Camden | 33100 | 56.4 |
| Croydon | 25100 | 59.6 |
| Ealing | 24600 | 62.1 |
| Enfield | 25300 | 54.5 |
| Greenwich | 24600 | 57.7 |
| Hackney | 26000 | 60.4 |

\captionsetup{labelformat=empty} \caption{Fig. 11.1}
\end{table}
  1. Explain whether the data in Fig. 11.1 is a simple random sample of the data for 2014-5.
  2. The City of London is included in Alex's sample. Explain why Alex is not able to use the data for the City of London in this investigation. \begin{figure}[h]
    \captionsetup{labelformat=empty} \caption{Fig. 11.2 shows a scatter diagram showing Percentage of Pupils against Median Income for all of the areas of London for which data is available.} \includegraphics[alt={},max width=\textwidth]{e0b502a8-c742-4d78-993c-8c0c7329ec9c-09_716_1378_356_244}
    \end{figure} Fig. 11.2 Alex identifies some outliers.
  3. On the copy of Fig. 11.2 in the Printed Answer Booklet, ring three of these outliers. Alex then discards all the outliers and uses the LINEST function on a spreadsheet to obtain the following model. \(\mathrm { P } = 0.0009049 \mathrm { M } + 37.38\),
    where \(P =\) percentage of pupils and \(M =\) median income.
  4. Show that the model is a good fit for the data for Hackney.
  5. Use the model to find an estimate of the value of \(P\) for City of London.
  6. Give two reasons why this estimate may not be reliable. Alex states that more than 50\% of the pupils in London achieved at least a grade C at the end of KS4 in English and Maths in 2014-5.
  7. Use the information in Fig. 11.2 together with your knowledge of the pre-release material to explain whether there is evidence to support this statement.
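
A hedged sketch of the cleaning and model-checking steps in Python: rows with `#N/A` are dropped before analysis, and the quoted LINEST model is evaluated for Hackney and for the City of London. Only a few rows of Fig. 11.1 are reproduced, and the names `rows`, `usable` and `predict` are illustrative.

```python
# A few rows from Fig. 11.1: (area, median income, percentage as text).
rows = [
    ("City of London", 61100, "#N/A"),
    ("Barking and Dagenham", 21800, "54.0"),
    ("Barnet", 27100, "70.1"),
    ("Hackney", 26000, "60.4"),
]

# Cleaning: keep only rows where the percentage is available.
usable = [(area, income, float(pct))
          for area, income, pct in rows if pct != "#N/A"]


def predict(median_income):
    """Model quoted in the question: P = 0.0009049 M + 37.38."""
    return 0.0009049 * median_income + 37.38


hackney_estimate = predict(26000)   # compare with the recorded 60.4
city_estimate = predict(61100)      # extrapolation far beyond the other incomes

print(len(usable), round(hackney_estimate, 1), round(city_estimate, 1))
```

The City of London income lies far outside the range of the other areas, which is one reason an estimate there should be treated with caution.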
OCR MEI AS Paper 2 2023 June Q8
4 marks Moderate -0.5
8 The pre-release material contains information on Pulse Rate and Body Mass Index (BMI). A student is investigating whether there is a relationship between pulse rate and BMI. A section of the available data is shown in the table.
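
| Sex | Age | BMI | Pulse |
|---|---|---|---|
| Male | 62 | 29.54 | 60 |
| Female | 20 | 23.68 | \#N/A |
| Male | 17 | 26.97 | 72 |
| Male | 35 | 24.7 | 64 |
| Male | 17 | 20.09 | 54 |
| Male | 85 | 23.86 | 54 |
| Female | 81 | 24.04 | \#N/A |
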
The student decides to draw a scatter diagram.
  1. With reference to the table, explain which data should be cleaned before any analysis takes place. The student cleans the data for BMI and Pulse Rate in the pre-release material and draws a scatter diagram. \begin{figure}[h]
    \captionsetup{labelformat=empty} \caption{Scatter diagram of Pulse Rate against BMI} \includegraphics[alt={},max width=\textwidth]{82438df0-6550-4ffd-92d8-3c67bec59a6b-06_869_1575_1585_246}
    \end{figure} The student identifies one outlier.
  2. On the copy of the scatter diagram in the Printed Answer Booklet, circle this outlier. The student decides to remove this outlier from the data. They then use the LINEST function in the spreadsheet to obtain the following formula for the line of best fit. \(\mathrm { P } = 0.29 \mathrm { Q } + 64.2\),
    where \(P =\) PulseRate and \(Q = \mathrm { BMI }\). They use this to estimate the Pulse Rate of a person with BMI 23.68.
    They obtain a value of 71 correct to the nearest whole number.
  3. With reference to the scatter diagram, explain whether it is appropriate to use the formula for the line of best fit. It is suggested that all pairs of values where the pulse rate is above 100 should also be cleaned from the data, as they must be incorrect.
  4. Use your knowledge of the pre-release material to explain whether or not all pairs of values with a pulse rate of more than 100 should be cleaned from the data.
OCR MEI AS Paper 2 2021 November Q7
7 marks Easy -1.2
7 The pre-release material contains information about health expenditure. Fig. 7.1 shows an extract from the data. \begin{table}[h]

| Country | Health expenditure (\% of GDP) |
|---|---|
| Algeria | 7.2 |
| Egypt | 5.6 |
| Libya | 5 |
| Morocco | 5.9 |
| Sudan | 8.4 |
| Tunisia | 7 |
| Western Sahara | \#N/A |
| Angola | 3.3 |
| Benin | 4.6 |
| Botswana | 5.4 |
| Burkina Faso | 5 |

\captionsetup{labelformat=empty} \caption{Fig. 7.1}
\end{table}
  1. Explain how the data should be cleaned before any analysis takes place. Kareem uses all the available data to conduct an investigation into health expenditure as a percentage of GDP in different countries. He calculates the mean to be 6.79 and the standard deviation to be 2.78 . Fig. 7.2 shows the smallest values and the largest values of health expenditure as a percentage of GDP. \begin{table}[h]

    | Smallest values of Health expenditure (\% of GDP) | Largest values of Health expenditure (\% of GDP) |
    |---|---|
    | 1.5 | 11.7 |
    | 1.9 | 11.9 |
    | 2.1 | 13.7 |
    | | 13.7 |
    | | 16.5 |
    | | 17.1 |
    | | 17.1 |

    \captionsetup{labelformat=empty} \caption{Fig. 7.2}
    \end{table}
  2. Determine which of these values are outliers. Kareem removes the outliers from the data and finds that there are 187 values left. He decides to collect a sample of size 30 . He uses the following sampling procedure.
    Assign each value a number from 1 to 187. Generate a random number, \(n\), between 1 and 13 . Starting with the \(n\)th value, choose every 6th value after that until 30 values have been chosen.
  3. Explain whether Kareem is using simple random sampling.
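
The outlier check and the sampling procedure can be sketched in Python. The outlier guide used here (more than 2 standard deviations from the mean) is an assumption about the intended definition, and `systematic_sample` is a hypothetical helper that mirrors the procedure described in the question.

```python
import random

mean, sd = 6.79, 2.78
# Smallest and largest values listed in Fig. 7.2.
extremes = [1.5, 1.9, 2.1, 11.7, 11.9, 13.7, 13.7, 16.5, 17.1, 17.1]

# Assumed outlier guide: beyond mean +/- 2 standard deviations.
low, high = mean - 2 * sd, mean + 2 * sd
outliers = [v for v in extremes if v < low or v > high]


def systematic_sample(step=6, size=30, max_start=13):
    """Pick a random start from 1..max_start, then every step-th value."""
    start = random.randint(1, max_start)  # inclusive at both ends
    return [start + step * k for k in range(size)]


sample = systematic_sample()
print(outliers)
print(sample[0], sample[-1])
```

Once the start is chosen, every later pick is fixed, so most combinations of 30 values can never be selected. That is the feature to weigh against the definition of simple random sampling.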
OCR MEI Paper 2 2023 June Q14
8 marks Moderate -0.8
14 The pre-release material contains information concerning the median income of taxpayers in \(\pounds\) and the percentage of all pupils at the end of KS4 achieving 5 or more GCSEs at grade A*-C, including English and Maths, for different areas of London. Some of the data for 2014/15 is shown in Fig. 14.1. \begin{table}[h]
\captionsetup{labelformat=empty} \caption{Fig. 14.1}

| | Median Income of Taxpayers in £ | Percentage of Pupils Achieving 5 or more A*-C, including English and Maths |
|---|---|---|
| City of London | 61100 | \#N/A |
| Barking and Dagenham | 21800 | 54.0 |
| Barnet | 27100 | 70.1 |
| Bexley | 24400 | 55.0 |
| Brent | 22700 | 60.0 |
| Bromley | 28100 | 68.0 |

\end{table} A student investigated whether there is any relationship between median income of taxpayers and percentage of pupils achieving 5 or more GCSEs at grade A*-C, including English and Maths.
  1. With reference to Fig. 14.1, explain how the data should be cleaned before any analysis can take place. After the data was cleaned, the student used software to draw the scatter diagram shown in Fig. 14.2. Scatter diagram to show percentage of pupils achieving 5 A*-C grades against median income of taxpayers \begin{figure}[h]
    \captionsetup{labelformat=empty} \caption{Fig. 14.2} \includegraphics[alt={},max width=\textwidth]{11788aaf-98fb-4a78-8a40-a40743b1fe15-10_574_1481_1900_241}
    \end{figure} The student calculated that the product moment correlation coefficient for these data is 0.3743 .
  2. Give two reasons why it may not be appropriate to use a linear model for the relationship between median income of taxpayers in \(\pounds\) and the percentage of all pupils at the end of KS4 achieving 5 or more GCSEs at grade A*-C. The student carried out some further analysis. The results are shown in Fig. 14.3. \begin{table}[h]
    \captionsetup{labelformat=empty} \caption{Fig. 14.3}

    | | median income of taxpayers in \(\pounds\) | percentage of pupils achieving \(5 + \mathrm { A } ^ { * } - \mathrm { C }\) |
    |---|---|---|
    | mean | 27216 | 61.0 |
    | standard deviation | 4177.5 | 5.32 |

    \end{table} The student identified three outliers in total.
    The student decided to remove these outliers and recalculate the product moment correlation coefficient.
  3. Explain whether the new value of the product moment correlation coefficient would be between 0.3743 and 1 or between 0 and 0.3743 .
OCR MEI Paper 2 2021 November Q12
5 marks Moderate -0.5
12 Fig. 12.1 shows an excerpt from the pre-release material. \begin{table}[h]
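
| | A | B | C | D | E | F | G | H |
|---|---|---|---|---|---|---|---|---|
| 1 | Sex | Age | Marital | Weight | Height | BMI | Waist | Pulse |
| 2 | Female | 34 | Married | 60.3 | 173.4 | 20.05 | 82.5 | 74 |
| 3 | Female | 85 | Widowed | 64.7 | 161.2 | 24.9 | \#N/A | \#N/A |
| 4 | Female | 48 | Divorced | 100.6 | 171.4 | 34.24 | 105.6 | 92 |
| 5 | Male | 61 | Married | 70.9 | 169.5 | 24.68 | 92.2 | 70 |
| 6 | Male | 68 | Divorced | 96.8 | 181.6 | 29.35 | 112.9 | 68 |
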
\captionsetup{labelformat=empty} \caption{Fig. 12.1}
\end{table} There was no data available for cell H3.
  1. Explain why \#N/A is used when no data is available. Fig. 12.2 shows a scatter diagram of pulse rate against BMI (Body Mass Index) for females. All the available data was used. Pulse rate against BMI for females \begin{figure}[h]
    \includegraphics[alt={},max width=\textwidth]{c9d14a4d-a1c8-42ad-9c0b-42cef6b3612f-08_659_1552_1363_233} \captionsetup{labelformat=empty} \caption{Fig. 12.2}
    \end{figure} There are two outliers on the diagram.
  2. On the copy of Fig. 12.2 in the Printed Answer Booklet, ring these outliers.
  3. Use your knowledge of the pre-release material to explain whether either of these outliers should be removed.
  4. State whether the diagram suggests there is any correlation between pulse rate and BMI. The product moment correlation coefficient between waist measurement, \(w\), in cm and BMI, \(b\), for females was found to be 0.912 . All the available data was used.
  5. Explain why a model of the form \(\mathrm { w } = \mathrm { mb } + \mathrm { c }\) for the relationship between waist measurement and BMI is likely to be appropriate. The LINEST function on a spreadsheet gives \(m = 2.16\) and \(c = 33.0\).
  6. Calculate an estimate of the value for cell G3 in Fig. 12.1.
Edexcel S1 2016 June Q4
12 marks Moderate -0.8
4. A researcher recorded the time, \(t\) minutes, spent using a mobile phone during a particular afternoon, for each child in a club. The researcher coded the data using \(v = \frac { t - 5 } { 10 }\) and the results are summarised in the table below.

| Coded Time (v) | Frequency ( \(\boldsymbol { f }\) ) | Coded Time Midpoint (m) |
|---|---|---|
| \(0 \leqslant v < 5\) | 20 | 2.5 |
| \(5 \leqslant v < 10\) | 24 | \(a\) |
| \(10 \leqslant v < 15\) | 16 | 12.5 |
| \(15 \leqslant v < 20\) | 14 | 17.5 |
| \(20 \leqslant v < 30\) | 6 | \(b\) |

$$\text { (You may use } \sum f m = 825 \text { and } \sum f m ^ { 2 } = 12012.5 \text { ) }$$
  1. Write down the value of \(a\) and the value of \(b\).
  2. Calculate an estimate of the mean of \(v\).
  3. Calculate an estimate of the standard deviation of \(v\).
  4. Use linear interpolation to estimate the median of \(v\).
  5. Hence describe the skewness of the distribution. Give a reason for your answer.
  6. Calculate estimates of the mean and the standard deviation of the time spent using a mobile phone during the afternoon by the children in this club.
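
Undoing the coding \(v = \frac { t - 5 } { 10 }\) amounts to \(t = 10 v + 5\), so the mean is scaled and shifted while the standard deviation is only scaled. A minimal Python sketch, assuming the divisor \(n\) for the standard deviation and using illustrative variable names:

```python
from math import sqrt

freqs = [20, 24, 16, 14, 6]
n = sum(freqs)
sum_fm, sum_fm2 = 825, 12012.5  # totals quoted in the question

mean_v = sum_fm / n
sd_v = sqrt(sum_fm2 / n - mean_v ** 2)  # assumption: divisor n

# Decode with t = 10v + 5: the shift of 5 does not affect the spread.
mean_t = 10 * mean_v + 5
sd_t = 10 * sd_v

print(round(mean_v, 2), round(sd_v, 2), round(mean_t, 1), round(sd_t, 1))
```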
Edexcel S1 2021 June Q3
14 marks Moderate -0.8
  1. A random sample of 100 carrots is taken from a farm and their lengths, \(L \mathrm {~cm}\), recorded. The data are summarised in the following table.

| Length, \(L\) cm | Frequency, f | Class mid point, \(\boldsymbol { x } \mathbf { c m }\) |
|---|---|---|
| \(5 \leqslant L < 8\) | 5 | 6.5 |
| \(8 \leqslant L < 10\) | 13 | 9 |
| \(10 \leqslant L < 12\) | 16 | 11 |
| \(12 \leqslant L < 15\) | 25 | 13.5 |
| \(15 \leqslant L < 20\) | 30 | 17.5 |
| \(20 \leqslant L < 28\) | 11 | 24 |

A histogram is drawn to represent these data.
The bar representing the class \(5 \leqslant L < 8\) is 1.5 cm wide and 1 cm high.
  1. Find the width and height of the bar representing the class \(15 \leqslant L < 20\)
  2. Use linear interpolation to estimate the median length of these carrots.
  3. Estimate
    1. the mean length of these carrots,
    2. the standard deviation of the lengths of these carrots. A supermarket will only buy carrots with length between 9 cm and 22 cm .
  4. Estimate the proportion of carrots from the farm that the supermarket will buy. Any carrots that the supermarket does not buy are sold as animal feed. The farm makes a profit of 2.2 pence on each carrot sold to the supermarket, a profit of 0.8 pence on each carrot longer than 22 cm and a loss of 1.2 pence on each carrot shorter than 9 cm .
  5. Find an estimate of the mean profit per carrot made by the farm.
Edexcel S1 2022 June Q3
14 marks Moderate -0.3
  1. Gill buys a bag of logs to use in her stove. The lengths, \(l \mathrm {~cm}\), of the 88 logs in the bag are summarised in the table below.

| Length \(( \boldsymbol { l } )\) | Frequency \(( \boldsymbol { f } )\) |
|---|---|
| \(15 < l \leqslant 20\) | 19 |
| \(20 < l \leqslant 25\) | 35 |
| \(25 < l \leqslant 27\) | 16 |
| \(27 < l \leqslant 30\) | 15 |
| \(30 < l \leqslant 40\) | 3 |

A histogram is drawn to represent these data.
The bar representing logs with length \(27 < l \leqslant 30\) has a width of 1.5 cm and a height of 4 cm .
  1. Calculate the width and height of the bar representing log lengths of \(20 < l \leqslant 25\)
  2. Use linear interpolation to estimate the median of \(l\) The maximum length of log Gill can use in her stove is 26 cm .
    Gill estimates, using linear interpolation, that \(x\) logs from the bag will fit into her stove.
  3. Show that \(x = 62\) Gill randomly selects 4 logs from the bag.
  4. Using \(x = 62\), find the probability that all 4 logs will fit into her stove. The weights, \(W\) grams, of the logs in the bag are coded using \(y = 0.5 w - 255\) and summarised by $$n = 88 \quad \sum y = 924 \quad \sum y ^ { 2 } = 12862$$
  5. Calculate
    1. the mean of \(W\)
    2. the variance of \(W\)
Edexcel S1 2024 June Q3
14 marks Moderate -0.8
  1. The lengths, \(x \mathrm {~mm}\), of 50 pebbles are summarised in the table below.

| Length | Frequency |
|---|---|
| \(20 \leqslant x < 30\) | 2 |
| \(30 \leqslant x < 32\) | 16 |
| \(32 \leqslant x < 36\) | 20 |
| \(36 \leqslant x < 40\) | 8 |
| \(40 \leqslant x < 45\) | 3 |
| \(45 \leqslant x < 50\) | 1 |

A histogram is drawn to represent these data.
The bar representing the class \(32 \leqslant x < 36\) is 2.5 cm wide and 7.5 cm tall.
  1. Calculate the width and the height of the bar representing the class \(30 \leqslant x < 32\)
  2. Using linear interpolation, estimate the median of \(x\) The weight, \(w\) grams, of each of the 50 pebbles is coded using \(10 y = w - 20\) These coded data are summarised by $$\sum y = 104 \quad \sum y ^ { 2 } = 233.54$$
  3. Show that the mean of \(w\) is 40.8
  4. Calculate the standard deviation of \(w\) The weight of a pebble recorded as 40.8 grams is added to the sample.
  5. Without carrying out any further calculations, state, giving a reason, what effect this would have on the value of
    1. the mean of \(w\)
    2. the standard deviation of \(w\)
Edexcel S1 2002 June Q2
4 marks Easy -1.2
2. Statistical models can be used to describe real world problems. Explain the process involved in the formulation of a statistical model.
Edexcel S1 2013 June Q2
11 marks Easy -1.3
  1. The marks of a group of female students in a statistics test are summarised in Figure 1
\begin{figure}[h]
\includegraphics[alt={},max width=\textwidth]{6faf2dd2-a114-40b7-88ae-4a75dbfb4706-04_629_1102_342_429} \captionsetup{labelformat=empty} \caption{Figure 1}
\end{figure}
  1. Write down the mark which is exceeded by \(75 \%\) of the female students. The marks of a group of male students in the same statistics test are summarised by the stem and leaf diagram below.
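
    | Mark | (2 \| 6 means 26) | Totals |
    |---|---|---|
    | 1 | 4 | (1) |
    | 2 | 6 | (1) |
    | 3 | 4 4 7 | (3) |
    | 4 | 0 6 6 7 7 8 | (6) |
    | 5 | 0 0 1 1 1 3 6 7 7 | (9) |
    | 6 | 2 2 3 3 3 8 | (6) |
    | 7 | 0 0 8 | (3) |
    | 8 | 5 | (1) |
    | 9 | 0 | (1) |
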
  2. Find the median and interquartile range of the marks of the male students. An outlier is a mark that is
    either more than \(1.5 \times\) interquartile range above the upper quartile or more than \(1.5 \times\) interquartile range below the lower quartile.
  3. In the space provided on Figure 1 draw a box plot to represent the marks of the male students, indicating clearly any outliers.
  4. Compare and contrast the marks of the male and the female students.
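
The quartile and outlier calculations for the male students can be checked with a short Python sketch. The helper `quantile` is hypothetical and encodes one common convention (round a non-integer position up; average two values when the position is whole); other conventions can give slightly different quartiles.

```python
from math import ceil

# Marks read from the stem and leaf diagram.
marks = [14, 26, 34, 34, 37, 40, 46, 46, 47, 47, 48,
         50, 50, 51, 51, 51, 53, 56, 57, 57,
         62, 62, 63, 63, 63, 68, 70, 70, 78, 85, 90]


def quantile(values, fraction):
    """Assumed convention: round a non-integer position up;
    if the position is whole, average that value and the next."""
    ordered = sorted(values)
    position = fraction * len(ordered)
    if position == int(position):
        k = int(position)
        return (ordered[k - 1] + ordered[k]) / 2
    return ordered[ceil(position) - 1]


q1, median, q3 = (quantile(marks, f) for f in (0.25, 0.5, 0.75))
iqr = q3 - q1

# Outlier rule stated in the question: beyond 1.5 * IQR from the quartiles.
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [m for m in marks if m < low or m > high]

print(q1, median, q3, iqr, outliers)
```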
Edexcel S1 2013 June Q4
14 marks Moderate -0.8
4. The following table summarises the times, \(t\) minutes to the nearest minute, recorded for a group of students to complete an exam.

| Time (minutes) \(t\) | \(11 - 20\) | \(21 - 25\) | \(26 - 30\) | \(31 - 35\) | \(36 - 45\) | \(46 - 60\) |
|---|---|---|---|---|---|---|
| Number of students f | 62 | 88 | 16 | 13 | 11 | 10 |

$$\text { [You may use } \sum \mathrm { f } t ^ { 2 } = 134281.25 \text { ] }$$
  1. Estimate the mean and standard deviation of these data.
  2. Use linear interpolation to estimate the value of the median.
  3. Show that the estimated value of the lower quartile is 18.6 to 3 significant figures.
  4. Estimate the interquartile range of this distribution.
  5. Give a reason why the mean and standard deviation are not the most appropriate summary statistics to use with these data. The person timing the exam made an error and each student actually took 5 minutes less than the times recorded above. The table below summarises the actual times.

    | Time (minutes) \(t\) | \(6 - 15\) | \(16 - 20\) | \(21 - 25\) | \(26 - 30\) | \(31 - 40\) | \(41 - 55\) |
    |---|---|---|---|---|---|---|
    | Number of students f | 62 | 88 | 16 | 13 | 11 | 10 |

  6. Without further calculations, explain the effect this would have on each of the estimates found in parts (a), (b), (c) and (d).
Edexcel S1 2014 June Q1
9 marks Moderate -0.8
  1. A random sample of 35 homeowners was taken from each of the villages Greenslax and Penville and their ages were recorded. The results are summarised in the back-to-back stem and leaf diagram below.
TotalsGreenslaxPenvilleTotals
(2)8725567889(7)
(3)98731112344569(11)
(4)4440401247(5)
(5)66522500555(5)
(7)865421162566(4)
(8)8664311705(2)
(5)984328(0)
(1)499(1)
Key: 7 | 3 | 1 means 37 years for Greenslax and 31 years for Penville
Some of the quartiles for these two distributions are given in the table below.

| | Greenslax | Penville |
|---|---|---|
| Lower quartile, \(Q _ { 1 }\) | \(a\) | 31 |
| Median, \(Q _ { 2 }\) | 64 | 39 |
| Upper quartile, \(Q _ { 3 }\) | \(b\) | 55 |

  1. Find the value of \(a\) and the value of \(b\). An outlier is a value that falls either $$\begin{aligned} & \text { more than } 1.5 \times \left( Q _ { 3 } - Q _ { 1 } \right) \text { above } Q _ { 3 } \\ & \text { or more than } 1.5 \times \left( Q _ { 3 } - Q _ { 1 } \right) \text { below } Q _ { 1 } \end{aligned}$$
  2. On the graph paper opposite draw a box plot to represent the data from Penville. Show clearly any outliers.
  3. State the skewness of each distribution. Justify your answers. \includegraphics[max width=\textwidth, alt={}, center]{8270bcae-494c-4248-8229-a72e9e84eab0-03_930_1237_1800_367}
Edexcel S1 2015 June Q1
14 marks Easy -1.2
  1. Each of 60 students was asked to draw a \(20 ^ { \circ }\) angle without using a protractor. The size of each angle drawn was measured. The results are summarised in the box plot below. \includegraphics[max width=\textwidth, alt={}, center]{9626e3ce-35d6-41b5-a0bd-1185f38b9e36-02_371_1040_340_461}
    1. Find the range for these data.
    2. Find the interquartile range for these data.
    The students were then asked to draw a \(70 ^ { \circ }\) angle.
    The results are summarised in the table below.

    | Angle, \(\boldsymbol { a }\), (degrees) | Number of students |
    |---|---|
    | \(55 \leqslant a < 60\) | 6 |
    | \(60 \leqslant a < 65\) | 15 |
    | \(65 \leqslant a < 70\) | 13 |
    | \(70 \leqslant a < 75\) | 11 |
    | \(75 \leqslant a < 80\) | 8 |
    | \(80 \leqslant a < 85\) | 7 |

  2. Use linear interpolation to estimate the size of the median angle drawn. Give your answer to 1 decimal place.
  3. Show that the lower quartile is \(63 ^ { \circ }\) For these data, the upper quartile is \(75 ^ { \circ }\), the minimum is \(55 ^ { \circ }\) and the maximum is \(84 ^ { \circ }\) An outlier is an observation that falls either more than \(1.5 \times\) (interquartile range) above the upper quartile or more than \(1.5 \times\) (interquartile range) below the lower quartile.
    1. Show that there are no outliers for these data.
    2. Draw a box plot for these data on the grid on page 3.
  4. State which angle the students were more accurate at drawing. Give reasons for your answer.
    (3) \includegraphics[max width=\textwidth, alt={}, center]{9626e3ce-35d6-41b5-a0bd-1185f38b9e36-03_378_1059_2067_447}
AQA AS Paper 2 2020 June Q12
1 marks Easy -2.0
A student plots the scatter diagram below showing the mass in kilograms against the CO₂ emissions in grams per kilogram for a sample of cars in the Large Data Set. \includegraphics{figure_12} Their teacher tells them to remove an error to clean the data. Identify the data point which should be removed. Circle your answer below. [1 mark] \(A\) \quad \(B\) \quad \(C\) \quad \(D\)
AQA Paper 3 2021 June Q13
6 marks Moderate -0.8
The table below is an extract from the Large Data Set.
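
| Propulsion Type | Region | Engine Size | Mass | CO₂ | Particulate Emissions |
|---|---|---|---|---|---|
| 2 | London | 1896 | 1533 | 154 | 0.04 |
| 2 | North West | 1896 | 1423 | 146 | 0.029 |
| 2 | North West | 1896 | 1353 | 138 | 0.025 |
| 2 | South West | 1998 | 1547 | 159 | 0.026 |
| 2 | London | 1896 | 1388 | 138 | 0.025 |
| 2 | South West | 1896 | 1214 | 130 | 0.011 |
| 2 | South West | 1896 | 1480 | 146 | 0.029 |
| 2 | South West | 1896 | 1413 | 146 | 0.024 |
| 2 | South West | 2496 | 1695 | 192 | 0.034 |
| 2 | South West | 1422 | 1251 | 122 | 0.025 |
| 2 | South West | 1995 | 2075 | 175 | 0.034 |
| 2 | London | 1896 | 1285 | 140 | 0.036 |
| 2 | North West | 1896 | 0 | 146 | |
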
    1. Calculate the mean and standard deviation of CO₂ emissions in the table. [2 marks]
    2. Any value more than 2 standard deviations from the mean can be identified as an outlier. Determine, using this definition of an outlier, if there are any outliers in this sample of CO₂ emissions. Fully justify your answer. [2 marks]
  1. Maria claims that the last line in the table must contain two errors. Use your knowledge of the Large Data Set to comment on Maria's claim. [2 marks]
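
The CO₂ calculations can be sketched with Python's standard `statistics` module. Using the population standard deviation (`pstdev`, divisor \(n\)) is an assumption about the intended convention; the 13 CO₂ values are read from the table above.

```python
from statistics import mean, pstdev

# CO2 values from the 13 rows of the extract.
co2 = [154, 146, 138, 159, 138, 130, 146, 146, 192, 122, 175, 140, 146]

m = mean(co2)
s = pstdev(co2)  # assumption: population standard deviation (divisor n)

# Definition from the question: more than 2 standard deviations from the mean.
low, high = m - 2 * s, m + 2 * s
outliers = [value for value in co2 if value < low or value > high]

print(round(m, 1), round(s, 1), outliers)
```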