# Fall 2022 Final Exam

Instructor(s): Justin Eldridge

This exam was administered in-person. The exam was closed-notes, except students were allowed to bring 5 two-sided cheat sheets. No calculators were allowed. Students had 180 minutes to take this exam.

Welcome to the Final Exam for DSC 80.

This exam has several coding questions. These questions will be manually graded, and your answers are not expected to be syntactically-perfect! Be careful not to spend too much time perfecting your code.

## Problem 1

In the next five questions, assume you have access to a DataFrame named pts, shown below: ### Problem 1.1

What is the type of the result of the following line of code?

pts.groupby('group')['x'].max()

• int

• float

• str

• pandas.Series

• pandas.DataFrame

Calling pts.groupby('group')['x'].max() will return a Series with the the index consisting of each of the unique groups and a column containing the max 'x' value within each group.

##### Difficulty: ⭐️⭐️⭐️

The average score on this problem was 68%.

### Problem 1.2

Suppose z is a pandas Series containing the data shown below:

>>> z
5    20
0     1
2     3
Name: z, dtype: int64

Notice that the index of this Series does not match the index of pts.

Which of the following will be the result of running:

>>> pts['z'] = z
>>> pts

Below are some of the options:

• Option 1: • Option 2: • Option 3: • Option 1

• Option 2

• Option 3

• An exception will be raised because z is missing some of the rows that are in pts.

The code above will simply create a new column 'z', and match the corresponding value of 'z' with the corresponding index of 'z' that exists in pts. For the indeces in pts that aren’t present in 'z', there will be a NaN value in the appropriate spot in the 'z' column.

##### Difficulty: ⭐️

The average score on this problem was 97%.

### Problem 1.3

What is the result of the following code?

>>> pivot = pts.pivot_table(
values='x',
index='group',
columns='color',
aggfunc='count')
>>> pivot.loc['A', 'red']

Your answer should be in the form of a number (or possibly NaN). Your answer does NOT need to be exactly what is displayed by Python.

The pivot table simply counts the number of rows in pts that have 'group' value ‘A’ AND 'color' value red, which we could see that there are only two rows that satisfy those conditions.

##### Difficulty: ⭐️⭐️⭐️

The average score on this problem was 70%.

### Problem 1.4

Suppose the costs DataFrame contains the following data: Suppose we run:

>>> res = pts.merge(costs, how='left')

How many rows will res have?

Note that in a left merge, we retain all the rows in the left DataFrame, regardless of whether or not those rows are shared between the left and right DataFrames. Thus we also realize that it must be the case that the resulting merged DataFrame has the same number of rows as the left DataFrame when performing a left merge. Since 'pts' is the left DataFrame, we’ll be left with 6 rows after the merge since 'pts' has 6 rows.

##### Difficulty: ⭐️

The average score on this problem was 95%.

### Problem 1.5

Suppose we have defined:

def foo(ser):
return (ser - ser.min()).max()

What will be the result of:

>>> pts.groupby('group')[['x', 'y']].aggregate(foo).loc['A', 'x']

The foo function essentially takes in a Series, and returns the difference between the maximum value in the Series and the minimum value in that same Series.

Thus when we call foo as our aggregate function, we’ll perform this foo function on the values of each column within each group. Also, .loc['A', 'x'] means that we just want get the resulting foo value computed on column ‘x’ within group ‘A’. Focusing our attention on the values of column ‘x’ that are of group ‘A’, we see that the largest value in ‘x’ of group ‘A’ is just 5, and the smallest value is just 1. Therefore our answer is just 5 - 1 or 4.

##### Difficulty: ⭐️

The average score on this problem was 90%.

## Problem 2

The Earth Impact Database, curated by the University of New Brunswick’s Planetary and Space Science Centre, contains information on almost 200 impact craters caused by meteorites that have crashed into the Earth. In this question, assume you have access to a dataframe named impacts, shown below: A short description of each column follows:

• 'crater_name': the name of the impact crater.
• 'state': if the crater is in the United States, the state containing the crater is listed here; otherwise, it is missing.
• 'country': the country containing the crater.
• 'target_rock': the type of rock that the crater is in.
• 'diameter_km': the diameter of the crater in kilometers.
• 'drilled': whether or not the crater has been drilled to analyze its contents.

### Problem 2.1

There are many missing values in the state column. Upon inspection, you find that a crater is missing a state if and only if the crater is not located in the United States. Note: while we’re aware that upon inspection of the actual DataFrame, that this statement is clearly not true (such as Canada and Australia), we’ll assume that the statement is true for the sake of the problem.

What is the most likely type of the missingness in the statecolumn?

• Not Missing at Random (NMAR)

• Missing at Random (MAR)

• Missing Completely at Random (MCAR)

• Missing by Design (MD)

The answer is Missing by Design becuase we can easily predict whether or not the state column will be missing based on the country column. Namely, if the country column is not ‘United States’, then we know that the state column is missing.

##### Difficulty: ⭐️⭐️⭐️

The average score on this problem was 72%.

### Problem 2.2

Suppose we want to test the following hypotheses:

• Null Hypothesis: the crater diameter of impacts in sedimentary rock comes from the same distribution as the crater diameter of impacts in mixed rock.
• Alternative Hypothesis: the crater diameter of impacts in sedimentary rock is significantly larger on average.

When the distributions of crater diameter are plotted, we see the following: Which one of the following is the best test statistic in this case?

• Total Variation Distance (TVD) between the distributions

• Kolmogorov-Smirnov (K-S) distance between the distributions

• the signed difference between the mean crater diameter of impacts in sedimentary rock, minus the mean crater diameter of impacts in mixed rock

• the unsigned (absolute) difference between the mean crater diameter of impacts in sedimentary rock, minus the mean crater diameter of impacts in mixed rock

K-S Statistic doesn’t work well on discrete quantitative variables so we could rule that out. TVD is mainly used with categorical data so we could rule that out (and it’s the absolute value so it wouldn’t tell us whetehr or not one group is larger than the other group). We used the signed difference between mean crater diameter because we want to see whether or not one group is larger than the other, and unsigned difference between mean crater diameter wouldn’t tell us anything about that.

##### Difficulty: ⭐️⭐️⭐️

The average score on this problem was 56%.

### Problem 2.3

Suppose it is observed that some values in the column are missing. To determine if there is an association between this missingness and the values in the country column, a permutation test will be performed with the null hypothesis that the distribution of countries when the diameter is missing is the same as the distribution of countries when the diameter is not missing.

Which of the following test statistics should be used?

• the Total Variation Distance (TVD) between the distribution of countries when the diameter is missing and the distribution of countries when the diameter is not missing

• the Kolmogorov-Smirnov statistic between the distribution of countries when the diameter is missing and the distribution of countries when the diameter is not missing

• the signed difference between the mean crater diameter of impacts where the country is missing, minus the mean crater diameter of impacts where the country is not missing

• the unsigned (absolute) difference between the mean crater diameter of impacts where the country is missing, minus the mean crater diameter of impacts where the country is not missing

Since ‘countries’ is a categorical variable, TVD would work the best here.

##### Difficulty: ⭐️⭐️⭐️

The average score on this problem was 70%.

### Problem 2.4

Suppose the permutation test described in the previous problem fails to reject the null hypothesis. Assuming that NMAR and MD have been ruled out already, what can be said about the missingness in diameter_km?

• it is MCAR

• it is MAR

• We cannot say for sure without first testing for an association between the missingness and the other columns besides country.

In order to test whether or not a column is MCAR or MAR, we have to test the missingness of that column against every other column in order to be conclusive about the missingness mechanism. Thus the answer is Option C.

##### Difficulty: ⭐️⭐️⭐️

The average score on this problem was 73%.

### Problem 2.5

Suppose we fill in the missing values in the diameter_km column by random sampling. That is, for each missing diameter, we randomly sample from the the set of observed diameters. You may assume that these samples are drawn from the uniform distribution on observed diameters, and that they are independent.

Assume that it is known that the missingness in the diameter_km column is MAR. Which of the following is true about the overall mean of the diameter_km column after imputation?

• It is likely to be an unbiased estimate of the true mean.

• It is likely to be a biased estimate of the true mean.

Since the missigness mechanism for diameter_km is MAR, we know that the missigness depends on some other bias from another column, implying the the observed values are inherently biased. Since we’re drawing from a biased sample space, we conclude that we’re likely to generate a biased estimate of the true mean.

##### Difficulty: ⭐️⭐️

The average score on this problem was 81%.

## Problem 3

The questions below reference the following HTML.

<html>
<title>ZOMBO</title>

<body>
<h1>Welcome to Zombo.com</h1>
<div id="greeting">
<ul>
<li>This is Zombo.com, welcome!</li>
<li>This is Zombo.com</li>
<li>Welcome to Zombo.com</li>
<li>You can do anything at Zombo.com -- anything at all!</li>
<li>The only limit is yourself.</li>
</ul>
</div>
<h3>Footnotes</h3>

<ol id="footnotes">
<li>Made in California with <a href="https://reactj.org">React.</a><li>
</ol>
</div>
</body>
</html>

### Problem 3.1

Consider the node representing the body tag in the Document Object Model (DOM) tree of the above HTML. How many children does this node have?

We could count the number of children of the body tag by looking at the indentations to see that the body tag has three children: 1 <h1> element and 2 <div> elements.

##### Difficulty: ⭐️⭐️

The average score on this problem was 89%.

### Problem 3.2

The page shown above contains five greetings, each one a list item in an unordered list. The first greeting is This is Zombo.com, welcome!, and the last is The only limit is yourself..

Suppose we have parsed the HTML into a BeautifulSoup object stored in the variable named soup.

Which of the following pieces of code will produce a list of BeautifulSoup objects, each one representing a single greeting list item? Mark all which apply.

• soup.find('div').find_all('li')

• soup.find_all('li', id='greeting')

• soup.find('div', id='greeting').find_all('li')

• soup.find_all('ul/li')

Answer: Option A and Option C

Option A works because find('div') will navigate to the first div element, and find_all('li') will get all the li elements in the form of a list, which is what we wanted. Looking at Option C, we could see that it does basically the exact same thing so that option is correct as well. Option B wouldn’t work because the li elements don’t have an attribute ‘greetings’, and Option D doesn’t work because there are no ul/li elements.

##### Difficulty: ⭐️⭐️

The average score on this problem was 85%.

### Problem 3.3

Suppose you perform an HTTP request to a web API using the requests module. The response succeeds, and you get the following (truncated) content back:

>>> resp = requests.get("https://pokeapi.co/api/v2/pokemon/squirtle")
>>> resp.content
b'{"abilities":[{"ability":{"name":"torrent","url":"https://pokeapi.co"...

What type of data has been returned?

• HTML

• JSON

• XML

• PNG

From the format of the contents, we could pretty clearly see that it is not an HTML file, nor is it a PNG file (which is type of image file). Now an XML file also doesn’t look like that, rather it looks more like an HTML file with different kinds of tags. By process of elimination, the answer is JSON.

Alternatively, we could recall that JSON files are in the format of dictionaries of attrivute/value pairs, which is what we see in the contents.

##### Difficulty: ⭐️

The average score on this problem was 94%.

## Problem 4

You are scraping a web page using requests. Your code has been working fine and returns the desired result, but suddenly you find that your code takes much longer to finish (if it finishes at all). It does not raise an exception.

What is the most likely cause of the issue?

• The page has a very large GIF that hasn’t stopped playing.

• You have made too many requests to the server in a short amount of time

• The page contains a Unicode character that requests cannot parse

• The page has suddenly changed and has caused requests to enter an infinite loop.

• Option A: We can pretty confidentely (I hope) rule this option out since whether or not a GIF has stopped playing or not shouldn’t affect our web scraping.
• Option B: This answer is right because a server will time you out (and potentially block you) if you make too many requests to the server.
• Option C: This shouldn’t cause your code to never finish, rather, it’s more likely that the request module just doesn’t process said Unicode character correctly or it throws an error.
• Option D: Again, this shouldn’t cause your code to never finish, rather, the request module will just parse the older version of the website at the time you called it.

##### Difficulty: ⭐️

The average score on this problem was 100%.

## Problem 5

In this question, you will be asked to determine which strings are matched by various regular expression patterns when re.search is used. For these questions, remember that re.search(pattern, s) matches s if the pattern can be found anywhere in s (not necessarily at the beginning). For example, re.search("name", "my name is justin") matches, while re.search("foo", "my name is justin") does not.

### Problem 5.1

Which of the below strings are matched by re.search using the pattern 'a+'? Select all that apply.

• "aa bb cc"

• "aaa bbb ccc"

• "abaaba"

• "abacaba"

**Answer: Option A, B, C and D

The regex pattern 'a+' searches for any string that contains at least one substring consisting of the character a one or more times. Clearly all of these strings contain a substring og the character a one or more times.

##### Difficulty: ⭐️

The average score on this problem was 98%.

### Problem 5.2

Which of the below strings are matched by re.search using the pattern 'a+ b+'? Select all that apply.

• "aa bb cc"

• "aaa bbb ccc"

• "abaaba"

• "abacaba"

The regex pattern 'a+ b+' searches for any string that contains at least one substring consisting of the character a one or more times followed by a space followed by the character b one or more times. The only strings that have this pattern are Options A and B.

##### Difficulty: ⭐️

The average score on this problem was 95%.

### Problem 5.3

Which of the below strings are matched by re.search using the pattern '\baa\b'

Recall that the r at the front of the pattern string above makes it a “raw” string; this is used so that \b is not interpreted by Python as a special backspace character.

Select all that apply.

• "aa bb cc"

• "aaa bbb ccc"

• "abaaba"

• "abacaba"

The regex pattern '\b' matches for the boundary of a word (so like the start and end of a word which could seperated by spaces). Thus the regex pattern searches for the substring 'aa' that is its own standalone word. The only string that has this pattern is Option A.

##### Difficulty: ⭐️⭐️

The average score on this problem was 75%.

### Problem 5.4

Which of the below strings are matched by re.search using the pattern '(aba){2,}'? Select all that apply.

• "aa bb cc"

• "aaa bbb ccc"

• "abaaba"

• "abacaba"

The regex pattern '(aba){2,}' searches the substring consisting of aba 2 or more times. The only string that has this pattern is Option C.

##### Difficulty: ⭐️

The average score on this problem was 97%.

### Problem 5.5

Which of the below strings are matched by re.search using the pattern 'a..a'? Select all that apply.

• "aa bb cc"

• "aaa bbb ccc"

• "abaaba"

• "abacaba"

The regex pattern 'a..a' searches the substring consisting of a followed by any two characters followed by a. The only string that has this pattern is Option C.

##### Difficulty: ⭐️

The average score on this problem was 99%.

### Problem 5.6

Which of the below strings are matched by re.search using the pattern '.*'? Select all that apply.

• "aa bb cc"

• "aaa bbb ccc"

• "abaaba"

• "abacaba"

**Answer: Option A, B, C and D

The regex pattern '.*' searches the substring consisting of any character 0 or more times. Clearly all of the strings contain that pattern.

##### Difficulty: ⭐️

The average score on this problem was 100%.

## Problem 6

### Problem 6.1

Consider the following four sentences:

• “this is one”
• “this is two”
• “this is the third”
• “and this is the fourth”

Suppose these sentences are encoded into a “bag of words” feature representation. The result is a dataframe with four rows (one for each sentence). How many columns are in this dataframe? Your answer should be in the form of a number.

Recall that bag of word creates a new column for each unique word. Thus the problem boils down to “how many unique words are there in the following four sentences”, which we count 8: “this”, “is”, “one”, “two”, “the”, “third”, “and”, “fourth”.

##### Difficulty: ⭐️

The average score on this problem was 98%.

### Problem 6.2

Again consider the same four sentences shown above.

What is the TF-IDF score for the word “this” in the first sentence? Use base-2 logarithm.

Recall that TF is calculated as the number of terms that appear in that sentence divided by the total number of terms in the sentence. In this case the TF value of “this” in the first sentence is \frac{1}{3}.

IDF is calculated as the log of the number of sentences divided by the number of sentences in which that term appears in. In this case, the IDF value of “this” in the first sentence is \log_{2}(\frac{4}{4}) = 0.

Thus the TF-IDF is just \frac{1}{3} * 0 = 0

##### Difficulty: ⭐️

The average score on this problem was 97%.

### Problem 6.3

Again consider the same four sentences shown above.

What is the TF-IDF score for the word “and” in the last sentence? Use base-2 logarithm.

Recall that TF is calculated as the number of terms that appear in that sentence divided by the total number of terms in the sentence. In this case the TF value of “and” in the last sentence is \frac{1}{5}.

IDF is calculated as the log of the number of sentences divided by the number of sentences in which that term appears in. In this case, the IDF value of “and” in the last sentence is \log_{2}(\frac{4}{1}) = 2.

Thus the TF-IDF is just \frac{1}{5} * 2 = 0.4

##### Difficulty: ⭐️⭐️

The average score on this problem was 89%.

## Problem 7

### Problem 7.1

Consider the dataframe shown below: Suppose you wish to use this data in a linear regression model. To do so, the color column must be encoded numerically.

True or False: a meaningful way to numerically encode the color column is to replace each string by its index in the alphabetic ordering of the colors. That is, to replace “blue” by 1, “green” by 2, and “red” by 3.

• True

• False

Note that color is a nominal categorical varaible, meaning that there is no inherent ordering to the categories. Thus encoding the color variables by 1, 2, and 3 doesn’t make any meaningful sense. I.e. it doesn’t make sense to think of the color “red” as being “greater” than the color “blue” in any meaningful way.

##### Difficulty: ⭐️⭐️

The average score on this problem was 88%.

### Problem 7.2

Suppose you perform a one-hot encoding of a Series containing the following strings:

["red", "blue", "red", "green", "green", "purple", "orange", "blue"]}

Assume that the encoding is created using the OneHotEncoder(drop='first') from sklearn. Note the drop='first' keyword argument: skearn’s documentation says that this will “drop the first category in each feature.”

How many columns will the resulting one-hot encoding table have?

Recall that One-Hot Encoding will create a new column for each unique term in the group of strings. Thus “normal” One-Hot Encoding will create 5 columns, since there are 5 distinct words in the list of strings: “red”, “blue”, “green”, “purple” and “orange”. However, setting drop='first' will drop one of the columns, leaving us with 4 columns.

##### Difficulty: ⭐️⭐️

The average score on this problem was 88%.

## Problem 8

The next four questions concern the following example.

A Silicon Valley startup candy company buys a factory from a lightbulb company and repurposes some of the existing equipment to make chocolate bars with the goal of disrupting the candy industry. Their IPO is delayed, however, when they discover that some of their chocolate bars contain broken glass.

The company’s engineers quickly build an AI which looks at the chocolate bars coming off of the conveyor belt and predicts which bars contain broken glass (“yes”) and which don’t (“no”). The results are shown in the following confusion matrix: ### Problem 8.1

What is the accuracy of their model as a percentage (between 0% and 100%)?

The accuracy of a model is given by the number of correct predictions divided by the total number of predictions. To tell if a prediction is correct, we simply see if the predicted value matches the actual value. In this case, our model’s accuracy is just \frac{87+4}{87+6+3+4} = \frac{91}{100} which is just 91%.

##### Difficulty: ⭐️

The average score on this problem was 95%.

### Problem 8.2

What is the recall of their model as a percentage (between 0% and 100%)?

The recall of a model is given by the number of True Positives divided by the sum of True Positives and False Negatives. A True Positive is when the model correctly predicts a Positive value (in this case, a Positive value is “Yes”), and a False Negative is when a model incorrectly predicts a Negative value. Thus the answer is just \frac{4}{6+4} = \frac{4}{10} which is just 40%.

##### Difficulty: ⭐️

The average score on this problem was 92%.

### Problem 8.3

From a safety perspective, which metric should be maximized in this situation?

• Precision

• Recall

From a safetly perspective, it makes sense that we should maximize catching broken glass in the chocolate bars. This means that we should minimize False Negative values, which means maximizing Recall.

##### Difficulty: ⭐️

The average score on this problem was 100%.

### Problem 8.4

Suppose the company’s investors wish to improve the model’s precision. Which should they do (besides hire better data scientists)?

• Lower their model’s threshold for predicting that a bar contains glass, thus throwing out more candy.

• Raise their model’s threshold for predicting that a bar contains glass, thus throwing out less candy.

The precision of a model is given by the number of True Positives divided by the sum of True Positives and False Positives. Thus to maximumize the precision of a model, we should minimize the number of False Positives. To do this, simply raising the model’s threshold for predicting Positive values would result in less False Positives. Thus the answer is Option B.

##### Difficulty: ⭐️

The average score on this problem was 97%.

## Problem 9

### Problem 9.1

Suppose you split a data set into a training set and a test set. You train a classifier on the training set and test it on the test set.

True or False: the training accuracy must be higher than the test accuracy.

• True

• False

There is no direct correlation between the training accuracy of a model and the test accuracy of the model, since the way you decide to split your data set is largely random. Thus there might be a possibility that your test accuracy ends up higher than your training accuracy. To illustrate this, suppose your training data consists of 100 data points and your test data consists of 1 data point. Now suppose your model achieves a training accuracy of 90%, and when you proceed to test it on the test set, your model predicts that singular data point correctly. Clearly your test accuracy is now higher than your training accuracy.

##### Difficulty: ⭐️

The average score on this problem was 97%.

### Problem 9.2

Suppose you are using sklearn to train a decision tree model to predict whether a data science student is enrolled in DSC 80 or not based on several pieces of information, including their hours spent coding per week and whether or not they have heard of the Kolmogorov-Smirnov test statistic.

Suppose you train your model, but achieve much lower training and test accuracies than you expect. When you look at the data and make predictions yourself, you are easily able to achieve higher train and test accuracies.

What should be done to improve the performance of the model?

• Decrease the max_depth hyperparameter; the model is “overfitting”.

• Increase the max_depth hyperparameter; the model is “underfitting”.

The fact that your model has a low training accuracy means that your model is not complex enough to capture the trends that are present in the data and thus likely “underfitting”. This means that you should increase the max_depth parameter.