Fall 2022 Final Exam

This exam was administered in person. The exam was closed-notes, except students were allowed to bring 5 two-sided cheat sheets. No calculators were allowed. Students had 180 minutes to take this exam.

This exam has several coding questions. These questions will be manually graded, and your answers are not expected to be syntactically perfect! Be careful not to spend too much time perfecting your code.

Problem 1

In the next five questions, assume you have access to a DataFrame named pts, shown below:

Problem 1.1

Answer: Option D

Calling pts.groupby('group')['x'].max() returns a Series whose index consists of the unique groups and whose values are the maximum 'x' value within each group.
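As a minimal sketch, this can be checked on a small stand-in for pts. The values below are hypothetical (the exam's actual table is not reproduced here) and were chosen only to be consistent with the solutions in this problem.

```python
import pandas as pd

# Hypothetical stand-in for pts; not the exam's actual values.
pts = pd.DataFrame({
    'group': ['A', 'A', 'A', 'B', 'B', 'C'],
    'color': ['red', 'red', 'blue', 'red', 'blue', 'blue'],
    'x': [1, 5, 3, 2, 7, 4],
})

# Selecting a single column before aggregating yields a Series,
# indexed by the unique group labels.
max_x = pts.groupby('group')['x'].max()
print(type(max_x).__name__)   # Series
print(list(max_x.index))      # ['A', 'B', 'C']
print(max_x.loc['A'])         # 5
```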

Difficulty: ⭐️⭐️⭐️

The average score on this problem was 68%.

Problem 1.2

Answer: Option 3

The code above creates a new column 'z' by aligning the values of 'z' with the index of pts: each value is placed in the row of pts that has the same index label. For the indices in pts that aren’t present in 'z', there will be a NaN value in the appropriate spot in the 'z' column.
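The alignment behavior can be sketched as follows. Both the DataFrame and the Series here are hypothetical stand-ins, since the exam's actual code and data are not shown on this page.

```python
import pandas as pd

# Hypothetical data; neither pts nor z is reproduced from the exam.
pts = pd.DataFrame({'x': [1, 5, 3, 2, 7, 4]})    # default index 0..5
z = pd.Series([10, 20, 30], index=[0, 2, 7])     # label 7 does not exist in pts

# Column assignment aligns on index labels, not on position.
pts['z'] = z
print(pts['z'].tolist())   # [10.0, nan, 20.0, nan, nan, nan]
```

The value at label 7 is silently dropped because pts has no row with that label, and the column becomes floating point so that it can hold NaN.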

Difficulty: ⭐️

The average score on this problem was 97%.

Problem 1.3

>>> pivot = pts.pivot_table(
    values='x',
    index='group',
    columns='color',
    aggfunc='count')
>>> pivot.loc['A', 'red']

Your answer should be in the form of a number (or possibly NaN). Your answer does NOT need to be exactly what is displayed by Python.

Answer: 2

The pivot table counts the number of rows in pts that have 'group' value 'A' and 'color' value 'red'. Only two rows satisfy both conditions.
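A minimal sketch of the same call on hypothetical data (the rows below are not the exam's, but they also contain exactly two rows in group 'A' with color 'red'):

```python
import pandas as pd

# Hypothetical stand-in for pts; not the exam's actual values.
pts = pd.DataFrame({
    'group': ['A', 'A', 'A', 'B', 'B', 'C'],
    'color': ['red', 'red', 'blue', 'red', 'blue', 'blue'],
    'x': [1, 5, 3, 2, 7, 4],
})

pivot = pts.pivot_table(
    values='x',
    index='group',
    columns='color',
    aggfunc='count')

print(pivot.loc['A', 'red'])   # 2.0
print(pivot.loc['C', 'red'])   # nan
```

Combinations that never occur (here, group 'C' with color 'red') show up as NaN, which is why the question allows NaN as a possible answer.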

Difficulty: ⭐️⭐️⭐️

The average score on this problem was 70%.

Problem 1.4

Answer: 6

Note that a left merge retains every row of the left DataFrame, regardless of whether its key also appears in the right DataFrame. As long as no row of the left DataFrame matches more than one row of the right DataFrame, the merged DataFrame therefore has exactly as many rows as the left DataFrame. Since pts is the left DataFrame and has 6 rows, we’re left with 6 rows after the merge.
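The row-count reasoning can be sketched on hypothetical tables. The right-hand DataFrames and the 'label' column below are invented for illustration; the exam's right DataFrame is not shown here.

```python
import pandas as pd

# Hypothetical stand-in for pts (6 rows).
pts = pd.DataFrame({
    'group': ['A', 'A', 'A', 'B', 'B', 'C'],
    'x': [1, 5, 3, 2, 7, 4],
})

# Hypothetical right table with unique keys; 'C' has no match, 'D' is unused.
right = pd.DataFrame({'group': ['A', 'B', 'D'],
                      'label': ['first', 'second', 'unused']})

merged = pts.merge(right, on='group', how='left')
print(len(merged))                    # 6
print(merged['label'].isna().sum())   # 1

# Caveat: if a key is duplicated on the right, matching left rows are repeated.
dup = pd.DataFrame({'group': ['A', 'A'], 'label': ['p', 'q']})
print(len(pts.merge(dup, on='group', how='left')))   # 9
```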

Difficulty: ⭐️

The average score on this problem was 95%.

Problem 1.5

Answer: 4

The foo function essentially takes in a Series, and returns the difference between the maximum value in the Series and the minimum value in that same Series.

Thus when we use foo as our aggregation function, foo is applied to the values of each column within each group. Also, .loc['A', 'x'] means that we just want to get the resulting foo value computed on column 'x' within group 'A'. Focusing our attention on the values of column 'x' that belong to group 'A', we see that the largest value is 5 and the smallest value is 1. Therefore our answer is 5 - 1, or 4.
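A sketch of this computation, assuming a foo that matches the description above (the exam's exact definition is not reproduced here) and hypothetical data in which group 'A' has 'x' values 1, 5 and 3. Only numeric columns are selected so that subtraction is defined for every aggregated column.

```python
import pandas as pd

# Hypothetical stand-in for pts; the 'y' values are invented.
pts = pd.DataFrame({
    'group': ['A', 'A', 'A', 'B', 'B', 'C'],
    'x': [1, 5, 3, 2, 7, 4],
    'y': [2, 4, 6, 8, 10, 12],
})

def foo(s):
    # Range of a Series: largest value minus smallest value.
    return s.max() - s.min()

ranges = pts.groupby('group')[['x', 'y']].agg(foo)
print(ranges.loc['A', 'x'])   # 4
```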

Difficulty: ⭐️

The average score on this problem was 90%.

Problem 2

The Earth Impact Database, curated by the University of New Brunswick’s Planetary and Space Science Centre, contains information on almost 200 impact craters caused by meteorites that have crashed into the Earth. In this question, assume you have access to a dataframe named impacts, shown below:

Problem 2.1

There are many missing values in the state column. Upon inspection, you find that a crater is missing a state if and only if the crater is not located in the United States. Note: we’re aware that this statement is clearly not true of the actual DataFrame (consider Canada and Australia), but we’ll assume that it is true for the sake of the problem.

Answer: Option 4

The answer is Missing by Design because we can determine exactly whether or not the state column will be missing based on the country column. Namely, if the country column is not ‘United States’, then we know that the state column is missing.

Difficulty: ⭐️⭐️⭐️

The average score on this problem was 72%.

Problem 2.2

Answer: Option C

The K-S statistic doesn’t work well on discrete quantitative variables, so we can rule that out. The TVD is mainly used with categorical data, and because it is built from absolute values it wouldn’t tell us whether or not one group is larger than the other, so we can rule that out too. We use the signed difference between mean crater diameters because we want to see whether or not one group is larger than the other; the unsigned difference between mean crater diameters wouldn’t tell us anything about that.

Difficulty: ⭐️⭐️⭐️

The average score on this problem was 56%.

Problem 2.3

Suppose it is observed that some values in the diameter_km column are missing. To determine if there is an association between this missingness and the values in the country column, a permutation test will be performed with the null hypothesis that the distribution of countries when the diameter is missing is the same as the distribution of countries when the diameter is not missing.

Answer: Option A

Since country is a categorical variable, the TVD is the most appropriate test statistic here.
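A minimal sketch of such a permutation test, under stated assumptions: the rows below are hypothetical (not the real impacts data), and the helper name tvd, the seed and the number of repetitions are choices made for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the real impacts DataFrame is not reproduced here.
impacts = pd.DataFrame({
    'country': ['Canada', 'Canada', 'Australia', 'United States',
                'United States', 'Canada', 'Australia', 'United States'],
    'diameter_km': [1.0, np.nan, 2.5, np.nan, 3.0, 0.5, np.nan, 4.0],
})

def tvd(countries, is_missing):
    """TVD between the country distributions of missing and non-missing rows."""
    p = countries[is_missing].value_counts(normalize=True)
    q = countries[~is_missing].value_counts(normalize=True)
    # fill_value=0 handles countries that appear in only one of the two groups.
    return p.sub(q, fill_value=0).abs().sum() / 2

is_missing = impacts['diameter_km'].isna().to_numpy()
observed = tvd(impacts['country'], is_missing)

# Under the null, missingness is unrelated to country, so shuffle the labels.
rng = np.random.default_rng(0)
simulated = np.array([
    tvd(impacts['country'], rng.permutation(is_missing))
    for _ in range(500)
])
p_value = (simulated >= observed).mean()
print(observed, p_value)
```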

Difficulty: ⭐️⭐️⭐️

The average score on this problem was 70%.

Problem 2.4

Suppose the permutation test described in the previous problem fails to reject the null hypothesis. Assuming that NMAR and MD have been ruled out already, what can be said about the missingness in diameter_km?

Answer: Option C

In order to test whether or not a column is MCAR or MAR, we have to test the missingness of that column against every other column in order to be conclusive about the missingness mechanism. Thus the answer is Option C.

Difficulty: ⭐️⭐️⭐️

The average score on this problem was 73%.

Problem 2.5

Suppose we fill in the missing values in the diameter_km column by random sampling. That is, for each missing diameter, we randomly sample from the set of observed diameters. You may assume that these samples are drawn from the uniform distribution on observed diameters, and that they are independent.

Assume that it is known that the missingness in the diameter_km column is MAR. Which of the following is true about the overall mean of the diameter_km column after imputation?

Answer: Option B

Since the missingness mechanism for diameter_km is MAR, we know that the missingness depends on the values of some other column, implying that the observed values are not a representative sample of all diameters. Since we’re drawing from a biased sample, we conclude that we’re likely to generate a biased estimate of the true mean.
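The imputation procedure itself can be sketched as below. The diameters and the seed are hypothetical. The point of the sketch is that every filled-in value is copied from the observed values, so any bias in the observed values carries over to the imputed column.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical diameters; NaN marks a missing value.
diameters = pd.Series([1.0, np.nan, 2.5, np.nan, 3.0, 0.5, np.nan, 4.0])

observed = diameters.dropna().to_numpy()
missing = diameters.isna()

# Draw one observed value, uniformly and independently, per missing entry.
imputed = diameters.copy()
imputed[missing] = rng.choice(observed, size=missing.sum(), replace=True)

print(imputed.isna().sum())            # 0
print(observed.mean(), imputed.mean())
```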

Difficulty: ⭐️⭐️

The average score on this problem was 81%.

Problem 3

<html>
    <head>
        <title>ZOMBO</title>
    </head>

    <body>
        <h1>Welcome to Zombo.com</h1>
        <div id="greeting">
            <ul>
                <li>This is Zombo.com, welcome!</li>
                <li>This is Zombo.com</li>
                <li>Welcome to Zombo.com</li>
                <li>You can do anything at Zombo.com -- anything at all!</li>
                <li>The only limit is yourself.</li>
            </ul>
        </div>
        <div id="footnotes" class="faded">
            <h3>Footnotes</h3>

            <ol id="footnotes">
                <li>Please consider <a href="paypal.html">donating!</a></li>
                <li>Made in California with <a href="https://reactj.org">React.</a><li>
            </ol>
        </div>
    </body>
</html>

Problem 3.1

Consider the node representing the body tag in the Document Object Model (DOM) tree of the above HTML. How many children does this node have?

Answer: 3

We could count the number of children of the body tag by looking at the indentations to see that the body tag has three children: 1 <h1> element and 2 <div> elements.

Difficulty: ⭐️⭐️

The average score on this problem was 89%.

Problem 3.2

The page shown above contains five greetings, each one a list item in an unordered list. The first greeting is This is Zombo.com, welcome!, and the last is The only limit is yourself..

Suppose we have parsed the HTML into a BeautifulSoup object stored in the variable named soup.

Which of the following pieces of code will produce a list of BeautifulSoup objects, each one representing a single greeting list item? Mark all which apply.

Answer: Option A and Option C

Option A works because find('div') will navigate to the first div element, and find_all('li') will get all the li elements in the form of a list, which is what we wanted. Looking at Option C, we could see that it does basically the exact same thing so that option is correct as well. Option B wouldn’t work because the li elements don’t have an attribute ‘greetings’, and Option D doesn’t work because there are no ul/li elements.
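A sketch of the find / find_all chain described for Option A, run on a trimmed copy of the page (only the two div elements are kept, and the footnote links are omitted). It assumes the bs4 package is installed and uses Python's built-in 'html.parser'.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the page above; footnote links are omitted for brevity.
html = """
<body>
  <h1>Welcome to Zombo.com</h1>
  <div id="greeting">
    <ul>
      <li>This is Zombo.com, welcome!</li>
      <li>This is Zombo.com</li>
      <li>Welcome to Zombo.com</li>
      <li>You can do anything at Zombo.com -- anything at all!</li>
      <li>The only limit is yourself.</li>
    </ul>
  </div>
  <div id="footnotes" class="faded">
    <ol>
      <li>Please consider donating!</li>
      <li>Made in California with React.</li>
    </ol>
  </div>
</body>
"""

soup = BeautifulSoup(html, 'html.parser')

# find('div') returns the first div; find_all('li') searches only inside it.
greetings = soup.find('div').find_all('li')
print(len(greetings))            # 5
print(greetings[0].get_text())   # This is Zombo.com, welcome!

# Searching the whole document would also pick up the footnote items.
print(len(soup.find_all('li')))  # 7
```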

Difficulty: ⭐️⭐️

The average score on this problem was 85%.

Problem 3.3

Suppose you perform an HTTP request to a web API using the requests module. The response succeeds, and you get the following (truncated) content back:

>>> resp = requests.get("https://pokeapi.co/api/v2/pokemon/squirtle")
>>> resp.content
b'{"abilities":[{"ability":{"name":"torrent","url":"https://pokeapi.co"...

Answer: JSON

From the format of the contents, we can see pretty clearly that it is not an HTML file, nor is it a PNG file (which is a type of image file). An XML file also doesn’t look like that; rather, it looks more like an HTML file with different kinds of tags. By process of elimination, the answer is JSON.

Alternatively, we could recall that JSON files are in the format of dictionaries of attribute/value pairs, which is what we see in the contents.
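To illustrate the dictionary-like structure, here is a minimal sketch that parses JSON bytes with Python's json module. The payload is a short hypothetical one written in the same style; it is not the API's actual (truncated) response.

```python
import json

# Hypothetical, shortened payload; not the real response body.
content = b'{"name": "squirtle", "abilities": [{"ability": {"name": "torrent"}}]}'

# json.loads accepts bytes as well as str and returns nested dicts and lists.
data = json.loads(content)
print(type(data).__name__)                       # dict
print(data['abilities'][0]['ability']['name'])   # torrent
```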

Difficulty: ⭐️

The average score on this problem was 94%.

Problem 4

You are scraping a web page using requests. Your code has been working fine and returns the desired result, but suddenly you find that your code takes much longer to finish (if it finishes at all). It does not raise an exception.

Answer: Option B

Option A: We can confidently rule this option out, since whether or not a GIF has stopped playing shouldn’t affect our web scraping.
Option B: This answer is right because a server will time you out (and potentially block you) if you make too many requests to the server.
Option C: This shouldn’t cause your code to never finish; rather, it’s more likely that the requests module just doesn’t process said Unicode character correctly or that it throws an error.
Option D: Again, this shouldn’t cause your code to never finish; rather, the requests module will just retrieve the version of the website that existed at the time you called it.

Difficulty: ⭐️

The average score on this problem was 100%.

Problem 5

In this question, you will be asked to determine which strings are matched by various regular expression patterns when re.search is used. For these questions, remember that re.search(pattern, s) matches s if the pattern can be found anywhere in s (not necessarily at the beginning). For example, re.search("name", "my name is justin") matches, while re.search("foo", "my name is justin") does not.

Problem 5.1

Which of the below strings are matched by re.search using the pattern 'a+'? Select all that apply.

Answer: Option A, B, C and D

The regex pattern 'a+' searches for any string that contains at least one substring consisting of the character a one or more times. Clearly all of these strings contain a substring of the character a one or more times.

Difficulty: ⭐️

The average score on this problem was 98%.

Problem 5.2

Which of the below strings are matched by re.search using the pattern 'a+ b+'? Select all that apply.

Answer: Option A and B

The regex pattern 'a+ b+' searches for any string that contains at least one substring consisting of the character a one or more times followed by a space followed by the character b one or more times. The only strings that have this pattern are Options A and B.
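A quick sketch of this pattern on a few hypothetical strings (these are not the exam's options):

```python
import re

pattern = 'a+ b+'

# Hypothetical test strings, not the exam's options.
print(bool(re.search(pattern, 'a b')))         # True
print(bool(re.search(pattern, 'xaaa bbbx')))   # True
print(bool(re.search(pattern, 'ab')))          # False: no space between a and b
print(bool(re.search(pattern, 'a  b')))        # False: two spaces, pattern has one
print(bool(re.search(pattern, 'b a')))         # False: wrong order
```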

Difficulty: ⭐️

The average score on this problem was 95%.

Problem 5.3

Which of the below strings are matched by re.search using the pattern r'\baa\b'?

Recall that the r at the front of the pattern string above makes it a “raw” string; this is used so that \b is not interpreted by Python as a special backspace character.

Answer: Option A

The regex pattern r'\b' matches the boundary of a word (that is, the start or end of a word, where words may be separated by spaces). Thus the regex pattern searches for the substring 'aa' that is its own standalone word. The only string that has this pattern is Option A.
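The word-boundary behavior, and the effect of the raw-string prefix, can be checked with a short sketch. The test strings are hypothetical, not the exam's options.

```python
import re

pattern = r'\baa\b'

# Hypothetical test strings, not the exam's options.
print(bool(re.search(pattern, 'aa')))       # True
print(bool(re.search(pattern, 'b aa b')))   # True
print(bool(re.search(pattern, 'aaa')))      # False: 'aa' is part of a longer word
print(bool(re.search(pattern, 'baab')))     # False: no boundary around 'aa'

# Without the r prefix, Python turns \b into a single backspace character.
print('\b' == '\x08')   # True
print(len(r'\b'))       # 2
```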

Difficulty: ⭐️⭐️

The average score on this problem was 75%.

Problem 5.4

Which of the below strings are matched by re.search using the pattern '(aba){2,}'? Select all that apply.

Answer: Option C

The regex pattern '(aba){2,}' searches for a substring consisting of aba repeated 2 or more times in a row. The only string that has this pattern is Option C.
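A short sketch on hypothetical strings (not the exam's options) shows that the repetitions must be back to back and cannot overlap:

```python
import re

pattern = '(aba){2,}'

# Hypothetical test strings, not the exam's options.
print(bool(re.search(pattern, 'abaaba')))      # True: two copies in a row
print(bool(re.search(pattern, 'abaabaaba')))   # True: three copies in a row
print(bool(re.search(pattern, 'ababa')))       # False: the copies would overlap
print(bool(re.search(pattern, 'aba aba')))     # False: separated by a space
```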

Difficulty: ⭐️

The average score on this problem was 97%.

Problem 5.5

Which of the below strings are matched by re.search using the pattern 'a..a'? Select all that apply.

Answer: Option C

The regex pattern 'a..a' searches for a substring consisting of a, followed by any two characters, followed by a. The only string that has this pattern is Option C.

Difficulty: ⭐️

The average score on this problem was 99%.

Problem 5.6

Which of the below strings are matched by re.search using the pattern '.*'? Select all that apply.

Answer: Option A, B, C and D

The regex pattern '.*' searches for a substring consisting of any character 0 or more times. Clearly all of the strings contain that pattern.
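Because zero repetitions are allowed, this pattern matches every string, including the empty one. A minimal sketch on hypothetical strings:

```python
import re

pattern = '.*'

# Hypothetical test strings, not the exam's options.
print(re.search(pattern, 'abc').group())   # abc
print(re.search(pattern, '').span())       # (0, 0): an empty match still counts
print(re.search(pattern, '!?').group())    # !?
```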

Difficulty: ⭐️

The average score on this problem was 100%.

Problem 6

Problem 6.1

Suppose these sentences are encoded into a “bag of words” feature representation. The result is a dataframe with four rows (one for each sentence). How many columns are in this dataframe? Your answer should be in the form of a number.

Answer: 8

Recall that bag of words creates a new column for each unique word. Thus the problem boils down to “how many unique words are there in the four sentences”, of which we count 8: “this”, “is”, “one”, “two”, “the”, “third”, “and”, “fourth”.
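The counting can be sketched in plain Python. The four sentences below are an assumption: they are not shown on this page and were reconstructed only to be consistent with the eight words listed above and with the term frequencies used in the next two solutions.

```python
from collections import Counter

# Assumed sentences (a reconstruction, not quoted from the exam).
sentences = [
    'this is one',
    'this is two',
    'this is the third',
    'and this is the fourth',
]

# One column per unique word, one row per sentence.
vocab = sorted({word for s in sentences for word in s.split()})
bag = [[Counter(s.split())[word] for word in vocab] for s in sentences]

print(len(bag), len(vocab))   # 4 8
print(vocab)
```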

Difficulty: ⭐️

The average score on this problem was 98%.

Problem 6.2

What is the TF-IDF score for the word “this” in the first sentence? Use base-2 logarithm.

Answer: 0

Recall that the TF of a term in a sentence is the number of times the term appears in that sentence divided by the total number of terms in the sentence. In this case, the TF of “this” in the first sentence is \frac{1}{3}.

The IDF of a term is the log of the total number of sentences divided by the number of sentences in which the term appears. In this case, the IDF of “this” is \log_{2}(\frac{4}{4}) = 0.

Thus the TF-IDF is just \frac{1}{3} \cdot 0 = 0.

Difficulty: ⭐️

The average score on this problem was 97%.

Problem 6.3

What is the TF-IDF score for the word “and” in the last sentence? Use base-2 logarithm.

Answer: 0.4

Recall that the TF of a term in a sentence is the number of times the term appears in that sentence divided by the total number of terms in the sentence. In this case, the TF of “and” in the last sentence is \frac{1}{5}.

The IDF of a term is the log of the total number of sentences divided by the number of sentences in which the term appears. In this case, the IDF of “and” is \log_{2}(\frac{4}{1}) = 2.

Thus the TF-IDF is just \frac{1}{5} \cdot 2 = 0.4.
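Both TF-IDF calculations in this problem can be reproduced with a short sketch. The sentences are assumed (reconstructed to agree with the counts used in the solutions, not quoted from the exam), and tf_idf is a hypothetical helper name.

```python
import math

# Assumed sentences (a reconstruction, not quoted from the exam).
sentences = [
    'this is one',
    'this is two',
    'this is the third',
    'and this is the fourth',
]

def tf_idf(term, sentence, corpus):
    words = sentence.split()
    # TF: occurrences of the term divided by the number of terms in the sentence.
    tf = words.count(term) / len(words)
    # IDF: log2 of (number of sentences / number of sentences containing the term).
    n_containing = sum(term in s.split() for s in corpus)
    idf = math.log2(len(corpus) / n_containing)
    return tf * idf

print(tf_idf('this', sentences[0], sentences))   # 0.0
print(tf_idf('and', sentences[-1], sentences))   # 0.4
```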

Difficulty: ⭐️⭐️

The average score on this problem was 89%.

Problem 7

Problem 7.1

Suppose you wish to use this data in a linear regression model. To do so, the color column must be encoded numerically.

True or False: a meaningful way to numerically encode the color column is to replace each string by its index in the alphabetic ordering of the colors. That is, to replace “blue” by 1, “green” by 2, and “red” by 3.

Answer: False

Note that color is a nominal categorical variable, meaning that there is no inherent ordering to the categories. Thus encoding the colors as 1, 2, and 3 doesn’t make any meaningful sense; that is, it doesn’t make sense to think of the color “red” as being “greater” than the color “blue”.

Difficulty: ⭐️⭐️

The average score on this problem was 88%.

Problem 7.2

Suppose you perform a one-hot encoding of a Series containing the following strings:

Assume that the encoding is created using the OneHotEncoder(drop='first') from sklearn. Note the drop='first' keyword argument: sklearn’s documentation says that this will “drop the first category in each feature.”

Answer: 4

Recall that One-Hot Encoding will create a new column for each unique term in the group of strings. Thus “normal” One-Hot Encoding will create 5 columns, since there are 5 distinct words in the list of strings: “red”, “blue”, “green”, “purple” and “orange”. However, setting drop='first' will drop one of the columns, leaving us with 4 columns.
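As a sketch, the column counts can be checked directly with sklearn. The list of strings below is hypothetical (the exam's Series is not shown on this page); only its five distinct values come from the solution above.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical strings with the five distinct colors named in the solution.
colors = np.array(
    ['red', 'blue', 'green', 'red', 'purple', 'orange', 'blue']
).reshape(-1, 1)   # OneHotEncoder expects a 2D input

full = OneHotEncoder().fit_transform(colors)
dropped = OneHotEncoder(drop='first').fit_transform(colors)

print(full.shape[1])      # 5
print(dropped.shape[1])   # 4
```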

Difficulty: ⭐️⭐️

The average score on this problem was 88%.

Problem 8

A Silicon Valley startup candy company buys a factory from a lightbulb company and repurposes some of the existing equipment to make chocolate bars with the goal of disrupting the candy industry. Their IPO is delayed, however, when they discover that some of their chocolate bars contain broken glass.

The company’s engineers quickly build an AI which looks at the chocolate bars coming off of the conveyor belt and predicts which bars contain broken glass (“yes”) and which don’t (“no”). The results are shown in the following confusion matrix:

Problem 8.1

Answer: 91%

The accuracy of a model is given by the number of correct predictions divided by the total number of predictions. To tell if a prediction is correct, we simply see if the predicted value matches the actual value. In this case, our model’s accuracy is just \frac{87+4}{87+6+3+4} = \frac{91}{100} which is just 91%.

Difficulty: ⭐️

The average score on this problem was 95%.

Problem 8.2

Answer: 40%

The recall of a model is given by the number of True Positives divided by the sum of True Positives and False Negatives. A True Positive is when the model correctly predicts a Positive value (in this case, a Positive value is “Yes”), and a False Negative is when a model incorrectly predicts a Negative value. Thus the answer is just \frac{4}{6+4} = \frac{4}{10} which is just 40%.
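The accuracy and recall calculations can be reproduced from the four cell counts used in the solutions above (87 true negatives, 3 false positives, 6 false negatives, 4 true positives). The variable names are chosen for illustration.

```python
# Cell counts taken from the calculations in the solutions above.
tn, fp, fn, tp = 87, 3, 6, 4

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)
precision = tp / (tp + fp)

print(accuracy)   # 0.91
print(recall)     # 0.4
print(precision)  # about 0.571 (4 out of 7)
```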

Difficulty: ⭐️

The average score on this problem was 92%.

Problem 8.3

Answer: Recall

From a safety perspective, it makes sense that we should maximize catching broken glass in the chocolate bars. This means that we should minimize False Negatives, which means maximizing Recall.

Difficulty: ⭐️

The average score on this problem was 100%.

Problem 8.4

Suppose the company’s investors wish to improve the model’s precision. Which should they do (besides hire better data scientists)?

Answer: Option B

The precision of a model is given by the number of True Positives divided by the sum of True Positives and False Positives. Thus to maximize the precision of a model, we should minimize the number of False Positives. To do this, simply raising the model’s threshold for predicting Positive values would result in fewer False Positives. Thus the answer is Option B.
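The effect of the threshold can be sketched with made-up numbers. The labels and predicted scores below are entirely hypothetical; on this particular data, raising the threshold removes the false positives (precision goes up) at the cost of missing more true positives (recall goes down).

```python
# Hypothetical true labels (1 = contains glass) and predicted scores.
labels = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.85, 0.6, 0.7, 0.55, 0.2, 0.3, 0.1]

def precision_recall(threshold):
    # Predict "yes" whenever the score reaches the threshold.
    preds = [s >= threshold for s in scores]
    tp = sum(p and y == 1 for p, y in zip(preds, labels))
    fp = sum(p and y == 0 for p, y in zip(preds, labels))
    fn = sum((not p) and y == 1 for p, y in zip(preds, labels))
    return tp / (tp + fp), tp / (tp + fn)

print(precision_recall(0.5))   # (0.6, 0.75)
print(precision_recall(0.8))   # (1.0, 0.5)
```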

Difficulty: ⭐️

The average score on this problem was 97%.

Problem 9

Problem 9.1

Suppose you split a data set into a training set and a test set. You train a classifier on the training set and test it on the test set.

Answer: False

There is no direct correlation between the training accuracy of a model and the test accuracy of the model, since the way you decide to split your data set is largely random. Thus there might be a possibility that your test accuracy ends up higher than your training accuracy. To illustrate this, suppose your training data consists of 100 data points and your test data consists of 1 data point. Now suppose your model achieves a training accuracy of 90%, and when you proceed to test it on the test set, your model predicts that singular data point correctly. Clearly your test accuracy is now higher than your training accuracy.

Difficulty: ⭐️

The average score on this problem was 97%.

Problem 9.2

Suppose you are using sklearn to train a decision tree model to predict whether a data science student is enrolled in DSC 80 or not based on several pieces of information, including their hours spent coding per week and whether or not they have heard of the Kolmogorov-Smirnov test statistic.

Suppose you train your model, but achieve much lower training and test accuracies than you expect. When you look at the data and make predictions yourself, you are easily able to achieve higher train and test accuracies.

Answer: Option B

The fact that your model has a low training accuracy means that your model is not complex enough to capture the trends that are present in the data and thus likely “underfitting”. This means that you should increase the max_depth parameter.
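A minimal sketch of underfitting with a depth-limited tree, using a tiny hypothetical dataset (one feature, with the positive class in the middle of the range). A single split cannot separate the classes, while a deeper tree can.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: the positive class sits in the middle of the range,
# so one threshold is not enough to separate the classes.
X = np.arange(10).reshape(-1, 1)
y = np.array([0, 0, 0, 1, 1, 1, 1, 0, 0, 0])

shallow = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)
deeper = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

shallow_acc = shallow.score(X, y)   # training accuracy of the depth-1 tree
deeper_acc = deeper.score(X, y)     # training accuracy of the depth-3 tree
print(shallow_acc, deeper_acc)
```

Here the low training accuracy of the shallow tree is the symptom of underfitting; on real data, max_depth would normally be chosen with a validation set rather than by training accuracy alone.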

Difficulty: ⭐️⭐️⭐️

The average score on this problem was 73%.
