Clean or interpret large data set structure

Questions that ask students to explain data cleaning needs, identify variable types, state units, or describe structural features of the large data set without performing calculations.

7 questions · Easy -1.8

Edexcel AS Paper 2 2019 June Q4
8 marks Moderate -0.8
  1. Joshua is investigating the daily total rainfall in Hurn for May to October 2015.
Using the information from the large data set, Joshua wishes to calculate the mean of the daily total rainfall in Hurn for May to October 2015.
  1. Using your knowledge of the large data set, explain why Joshua needs to clean the data before calculating the mean.

    Using the information from the large data set, he produces the grouped frequency table below.

    | Daily total rainfall (\(r\) mm) | Frequency | Midpoint (\(x\) mm) |
    | --- | --- | --- |
    | \(0 \leqslant r < 0.5\) | 121 | 0.25 |
    | \(0.5 \leqslant r < 1.0\) | 10 | 0.75 |
    | \(1.0 \leqslant r < 5.0\) | 24 | 3.0 |
    | \(5.0 \leqslant r < 10.0\) | 12 | 7.5 |
    | \(10.0 \leqslant r < 30.0\) | 17 | 20.0 |

    You may use \(\sum fx = 539.75\) and \(\sum fx^{2} = 7704.1875\)
  2. Use linear interpolation to calculate an estimate for the upper quartile of the daily total rainfall.
  3. Calculate an estimate for the standard deviation of the daily total rainfall in Hurn for May to October 2015.
    1. State the assumption involved with using class midpoints to calculate an estimate of a mean from a grouped frequency table.
    2. Using your knowledge of the large data set, explain why this assumption does not hold in this case.
    3. State, giving a reason, whether you would expect the actual mean daily total rainfall in Hurn for May to October 2015 to be larger than, smaller than or the same as an estimate based on the grouped frequency table.
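The arithmetic behind the interpolation and standard deviation parts can be sketched in Python. This is only an illustration, not a mark scheme: the names `classes` and `interpolated_quantile` are hypothetical, and the sketch assumes the upper quartile sits at position \(3n/4\) in the cumulative frequencies, which is one common convention for grouped data.

```python
from math import sqrt

# Grouped frequency table from the question: (lower bound, upper bound, frequency)
classes = [
    (0.0, 0.5, 121),
    (0.5, 1.0, 10),
    (1.0, 5.0, 24),
    (5.0, 10.0, 12),
    (10.0, 30.0, 17),
]


def interpolated_quantile(classes, fraction):
    """Estimate a quantile by linear interpolation inside the class containing it."""
    n = sum(freq for _, _, freq in classes)
    target = fraction * n  # assumed position of the quantile, e.g. 0.75 * 184 = 138
    cumulative = 0
    for lower, upper, freq in classes:
        if cumulative + freq >= target:
            # Share of the class width proportional to how far the target lies into the class
            return lower + (target - cumulative) / freq * (upper - lower)
        cumulative += freq
    raise ValueError("fraction must lie between 0 and 1")


n = sum(freq for _, _, freq in classes)
# Class midpoints stand in for every value in the class
sum_fx = sum(freq * (lo + hi) / 2 for lo, hi, freq in classes)
sum_fx2 = sum(freq * ((lo + hi) / 2) ** 2 for lo, hi, freq in classes)

upper_quartile = interpolated_quantile(classes, 0.75)
std_dev = sqrt(sum_fx2 / n - (sum_fx / n) ** 2)

print(f"n = {n}, sum fx = {sum_fx}, sum fx^2 = {sum_fx2}")
print(f"upper quartile estimate: {upper_quartile:.2f} mm")
print(f"standard deviation estimate: {std_dev:.2f} mm")
```

The computed sums reproduce the values quoted in the question (539.75 and 7704.1875), and the total frequency of 184 matches the number of days from May to October.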
OCR H240/02 2022 June Q10
10 marks Easy -1.8
The table shows the age structure of usual residents of 18 Local Authorities (LAs) in the North West region of the UK in 2011.

Percentage of residents

| Local Authority | Age 0 to 17 | Age 18 to 24 | Age 25 to 64 | Age 65 and over |
| --- | --- | --- | --- | --- |
| A | 26.20% | 9.06% | 51.81% | 12.92% |
| B | 23.32% | 8.99% | 52.32% | 15.37% |
| C | 22.24% | 8.96% | 52.56% | 16.23% |
| D | 22.67% | 8.10% | 53.27% | 15.96% |
| E | 20.70% | 7.77% | 54.77% | 16.76% |
| F | 18.14% | 6.51% | 51.13% | 24.21% |
| G | 18.96% | 14.20% | 48.51% | 18.33% |
| H | 19.06% | 14.79% | 52.12% | 14.04% |
| I | 25.15% | 9.04% | 51.16% | 14.65% |
| J | 22.93% | 8.81% | 52.22% | 16.04% |
| K | 21.48% | 13.98% | 50.82% | 13.73% |
| L | 23.98% | 9.20% | 52.26% | 14.56% |
| M | 21.67% | 11.19% | 52.94% | 14.19% |
| N | 17.82% | 6.01% | 51.93% | 24.23% |
| O | 22.83% | 7.30% | 53.86% | 16.01% |
| P | 21.76% | 8.28% | 54.03% | 15.93% |
| Q | 21.42% | 8.43% | 53.90% | 16.25% |
| R | 18.61% | 7.33% | 49.35% | 24.71% |

  1. Without reference to any other columns, explain how you would use only the columns for the age ranges 0 to 17 and 18 to 24 to decide whether an LA might be one of the following.
    1. An LA that includes a university
    2. An LA that attracts young couples to live
    3. An LA that attracts retired people to live
  2. Using your answers to part (a), identify the following.
    1. Four LAs that might include a university
    2. Three LAs that might be attractive to retired people
  3. Explain why your answer to part (b)(ii), based only on the columns for the age ranges 0 to 17 and 18 to 24, may not be reliable.
  4. The lower quartile, median and upper quartile of the percentages in the column "Age 65 and over" are \(14.56\%\), \(15.99\%\) and \(16.76\%\) respectively. Use this information to comment on your answers to part (b)(ii) and part (c).

In a magazine article, a councillor plans to describe a typical LA in the North West region. He wants to quote the average percentage of residents aged 65 or over.

  5. The mean of the percentages in the column "Age 65 and over" is \(16.90 \%\). Use this information, and the information given in part (d), to explain whether the median or the mean better represents the data in the column "Age 65 and over".
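The summary statistics quoted in parts (d) and (e) can be checked against the "Age 65 and over" column with a short Python sketch. The variable names are hypothetical. Two assumptions are made that the question does not state: the quartiles are taken as the medians of the lower and upper halves of the sorted data, and unusually high values are flagged with the \(1.5 \times \text{IQR}\) rule.

```python
from statistics import mean, median

# "Age 65 and over" percentages for LAs A to R, copied from the table
age_65_plus = {
    "A": 12.92, "B": 15.37, "C": 16.23, "D": 15.96, "E": 16.76, "F": 24.21,
    "G": 18.33, "H": 14.04, "I": 14.65, "J": 16.04, "K": 13.73, "L": 14.56,
    "M": 14.19, "N": 24.23, "O": 16.01, "P": 15.93, "Q": 16.25, "R": 24.71,
}

values = sorted(age_65_plus.values())
half = len(values) // 2

# Assumed quartile method: medians of the lower and upper halves
lower_quartile = median(values[:half])
upper_quartile = median(values[half:])
middle = median(values)
average = mean(values)

# Assumed outlier rule: more than 1.5 * IQR above the upper quartile
upper_fence = upper_quartile + 1.5 * (upper_quartile - lower_quartile)
high_outliers = sorted(la for la, pct in age_65_plus.items() if pct > upper_fence)

print(f"LQ = {lower_quartile}, median = {middle:.3f}, UQ = {upper_quartile}")
print(f"mean = {average:.2f}")
print(f"LAs above the upper fence: {high_outliers}")
```

Under these assumptions the quartiles and mean agree with the figures quoted in the question, and three LAs lie well above the rest, which pulls the mean above the median.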
AQA AS Paper 2 2020 June Q12
1 mark Easy -2.0
A student plots the scatter diagram below showing the mass in kilograms against the CO₂ emissions in grams per kilometre for a sample of cars in the Large Data Set.
\includegraphics{figure_12}
Their teacher tells them to remove an error to clean the data. Identify the data point which should be removed. Circle your answer below. [1 mark]
\(A\) \(\qquad\) \(B\) \(\qquad\) \(C\) \(\qquad\) \(D\)
AQA AS Paper 2 2023 June Q12
1 mark Easy -2.5
The mass of a bag of nuts produced by a company is known to have a mean of 40 grams and a standard deviation of 3 grams. The company produces five different flavours of nuts. The bags of nuts are packed in large boxes. Given the information above, identify the continuous variable from the options below. Tick (\(\checkmark\)) one box. [1 mark]
The flavours of the bags of nuts
The known standard deviation of the mass of a bag of nuts
The mass of an individual bag of nuts
The number of bags of nuts in a large box
AQA AS Paper 2 2024 June Q12
1 mark Easy -2.5
Which one of the following is not a measure of spread? Circle your answer. [1 mark] median \(\qquad\) range \(\qquad\) standard deviation \(\qquad\) variance
AQA AS Paper 2 Specimen Q14
1 mark Easy -2.5
In the Large Data Set, the emissions of carbon dioxide are measured in what units? Circle your answer. [1 mark] mg/litre \(\qquad\) g/litre \(\qquad\) g/km \(\qquad\) mg/km
OCR H240/02 2018 December Q10
6 marks Moderate -0.8
Using the 2001 UK census results and some software, Javid intended to calculate the mean number of people who travelled to work by underground, metro, light rail or tram (UMLT) for all 348 Local Authorities. However, Javid noticed that for one LA the entry in the UMLT column is a dash, rather than a 0. See the extract below.
Data extract for one LA in 2001
Work mainly at or from home | UMLT | Train | Bus, minibus or coach
29544
Javid felt that it was not clear how this LA was to be treated so he decided to omit it from his calculation.
  1. Explain how the omission of this LA affects Javid's calculation of the mean. [1]
The value of the mean that Javid obtained was 2046.3.
  1. Calculate the value of the mean when this LA is not removed. [2]
Javid finds that the corresponding mean for all Local Authorities for 2011 is 2860.8. In order to compare the means for the two years, Javid also finds the total number of employees in each of these years. His results are given below.

| Year | 2001 | 2011 |
| --- | --- | --- |
| Total number of employees | 23 627 753 | 26 526 336 |

  1. Show that a higher proportion of employees used the metro to travel to work in 2011 than in 2001. [2]
  2. Suggest a reason for this increase. [1]
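The calculations in this question can be sketched in Python. This is a hedged illustration rather than a model answer: it assumes the dash stands for a count of 0, that Javid's mean of 2046.3 was taken over the remaining 347 LAs, and that the 2011 mean of 2860.8 is over all 348 LAs. All variable names are hypothetical.

```python
N_LAS = 348  # number of Local Authorities stated in the question

# 2001: Javid's mean omits one LA, so it is a mean over 347 LAs
mean_2001_omitting_one = 2046.3
total_umlt_2001 = mean_2001_omitting_one * (N_LAS - 1)

# Assumption: the dash means 0, so the total is unchanged and only the divisor grows
mean_2001_all = total_umlt_2001 / N_LAS

# 2011: assumed to be a mean over all 348 LAs
mean_2011 = 2860.8
total_umlt_2011 = mean_2011 * N_LAS

employees = {2001: 23_627_753, 2011: 26_526_336}
proportion_2001 = total_umlt_2001 / employees[2001]
proportion_2011 = total_umlt_2011 / employees[2011]

print(f"2001 mean with all LAs included: {mean_2001_all:.1f}")
print(f"UMLT proportion in 2001: {proportion_2001:.4f}")
print(f"UMLT proportion in 2011: {proportion_2011:.4f}")
```

Including the omitted LA as a zero lowers the mean slightly, because the total is shared across one more LA. Comparing the totals with the employee counts, rather than comparing the means directly, allows for the growth in the workforce between the two censuses.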