
**Instructor(s):** Justin Eldridge

This exam was administered online. The exam was open-notes,
open-book. Students had **180 minutes** to take this
exam.

Welcome to the Final Exam for DSC 80. This is an open-book, open-note exam, but you are not permitted to discuss the exam with anyone until after grades are released.

This exam has several coding questions. These questions will be
manually graded, and your answers are *not* expected to be
syntactically-perfect! Be careful not to spend too much time perfecting
your code.

This table below contains energy usage information for every building owned and managed by the New York City Department of Citywide Administrative Services (DCAS). DCAS is the arm of the New York City municipal government which handles ownership and management of the city’s office facilities and real estate inventory. The organization voluntarily publicly discloses self-measured information about the energy use of its buildings.

Assume that the table has been read into the variable named `df`. The result of `df.info()` is shown below:

Write a piece of code that computes the name of the borough whose median building energy usage is the highest.

Note: your code will be graded manually, and it is not expected to be perfect. Be careful to not spend too much time trying to make your code perfect!

**Answer: **
`df.groupby('Borough')['Energy'].median().idxmax()`

The idea is to perform a groupby and get the median within each group and then somehow get the group with the largest median.
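As a sanity check, the same pattern can be run on a tiny made-up table (the borough names and energy values below are invented for illustration):

```python
import pandas as pd

# Toy stand-in for df; values are made up
toy = pd.DataFrame({
    'Borough': ['Bronx', 'Bronx', 'Queens', 'Queens'],
    'Energy':  [10.0, 30.0, 50.0, 70.0],
})

# Median energy within each borough
medians = toy.groupby('Borough')['Energy'].median()

# idxmax() returns the index label (here, the borough name) of the largest median
top = medians.idxmax()
```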

The average score on this problem was 91%.

Write a line of code to normalize the building addresses. To normalize an address, replace all spaces with underscores, change “Street” to “St”, change “Avenue” to “Av”, and convert to lower case. Your code should evaluate to a series containing the normalized addresses.

**Answer: **
`df['Building Address'].apply(lambda x: x.replace(' ', '_').replace('Avenue', 'Av').replace('Street', 'St').lower())`
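The chained `replace` calls can be sanity-checked on a bare string, independent of pandas (the sample addresses are invented):

```python
def normalize(addr):
    # Same chain as in the answer: spaces -> underscores,
    # "Avenue" -> "Av", "Street" -> "St", then lowercase
    return addr.replace(' ', '_').replace('Avenue', 'Av').replace('Street', 'St').lower()

normalized = normalize('100 Centre Street')
```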

The average score on this problem was 90%.

From the table above, it looks like it is common for city buildings to have addresses that start with the number ‘1’; e.g., “100 Centre Street”. Write a piece of code that plots a histogram showing the number of times each number appears as the first number in an address.

**Answer: **
`df['Building Address'].apply(lambda x: int(x[0])).plot(kind='hist')`

The apply part of the code gets the first number of each address, and then plots the resulting Series as a histogram.

The average score on this problem was 90%.

Suppose the following code is run:

```
from sklearn.preprocessing import OneHotEncoder
oh = OneHotEncoder()
oh.fit_transform(df[['Borough']]).mean(axis=0)
```

What will be the result?

An array of size 5 containing the proportion of buildings in each of the five boroughs.

An array with as many entries as there are buildings containing 0.2 in each spot.

An array of size 5 containing all zeros.

An array of size 5 containing one 1 and the rest zeros.

**Answer: ** Option A

One-hot encoding will create 5 columns, one for each ‘Borough’. For each building, a 1 is filled into the column of the ‘Borough’ to which the building belongs, and a 0 into the other 4 columns. Thus taking the mean of each of the 5 ‘Borough’ columns produces the proportion of buildings in each of the 5 boroughs, yielding Option A.
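Why the column means are proportions: each one-hot column contains a 1 exactly when a building belongs to that borough, so its mean is (count in borough) / (total buildings). A stdlib sketch with invented data:

```python
from collections import Counter

# Invented sample of borough labels
boroughs = ['Bronx', 'Queens', 'Bronx', 'Manhattan', 'Queens']

# The one-hot column for category c is [1 if b == c else 0 for b in boroughs];
# its mean is therefore the category's proportion:
counts = Counter(boroughs)
proportions = {c: counts[c] / len(boroughs) for c in sorted(counts)}
```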

The average score on this problem was 90%.

Write a piece of code (perhaps more than one line) that computes the proportion of addresses in each borough that end with either `St` or `Street`.

**Answer: **

```
def match(x):
return (x.endswith('St')) | (x.endswith('Street'))
'Building Address'] = df['Building Address'].apply(lambda x: match(x))
df['Borough')['Building Address'].mean() df.groupby(
```
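The boolean-then-mean trick can be checked without pandas: the mean of a boolean column is the proportion of `True` values. A small pure-Python demo (addresses are invented):

```python
# Invented addresses, grouped by borough
by_borough = {
    'Manhattan': ['100 Centre Street', '1 Police Plaza', '30 Main St'],
    'Bronx':     ['851 Grand Concourse', '198 E 161st St'],
}

def match(x):
    return x.endswith('St') or x.endswith('Street')

# Mean of a boolean column == proportion of True values
props = {b: sum(match(a) for a in addrs) / len(addrs)
         for b, addrs in by_borough.items()}
```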

The average score on this problem was 79%.

For this problem, consider the HTML document shown below:

```
<html>
<head>
<title>Data Science Courses</title>
</head>
<body>
<h1>Welcome to the World of Data Science!</h1>
<h2>Current Courses</h2>
<div class="course_list">
<img alt="Course Banner", src="courses.png">
<p>
Here are the courses available to take:</p>
<ul>
<li>Machine Learning</li>
<li>Design of Experiments</li>
<li>Driving Business Value with DS</li>
</ul>
<p>
For last quarter's classes, see <a href="./2021-sp.html">here</a>.
</p>
</div>
<h2>News</h2>
<div class="news">
<p class="news">
New course on <b>Visualization</b> is launched.
See <a href="https://http.cat/301.png" target="_blank">here</a>
</p>
</div>
</body>
</html>
```

How many children does the `div` node with class `course_list` contain in the Document Object Model (DOM)?

**Answer: ** 4 children

Looking at the code, we can see that the `div` with class `course_list` has 4 children, namely: an `img` node, a `p` node, a `ul` node, and a `p` node.

The average score on this problem was 78%.

Suppose the HTML document has been parsed using `doc = bs4.BeautifulSoup(html)`. Write a line of code to get the `h1` header text in the form of a string.

**Answer: ** `doc.find('h1').text`

Since there’s only one `h1` element in the HTML code, we can simply do `doc.find('h1')` to get the `h1` element. Then adding `.text` will get the text of the `h1` element in the form of a string.

The average score on this problem was 93%.

Suppose the HTML document has been parsed using `doc = bs4.BeautifulSoup(html)`. Write a piece of code that scrapes the course names from the HTML. The value returned by your code should be a list of strings.

**Answer: **
`[x.text for x in doc.find_all('li')]`

Doing `doc.find_all('li')` will find all `li` elements and return them in the form of a list. Basic list comprehension combined with `.text` to get the text of each `li` element will yield the desired result.
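If `bs4` isn't at hand, the same extraction can be mimicked with the stdlib `html.parser` to see what `find_all('li')` is doing conceptually (the class below is a simplified stand-in, not BeautifulSoup itself):

```python
from html.parser import HTMLParser

class LiCollector(HTMLParser):
    """Collects the text inside every <li> element."""
    def __init__(self):
        super().__init__()
        self.in_li = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == 'li':
            self.in_li = True

    def handle_endtag(self, tag):
        if tag == 'li':
            self.in_li = False

    def handle_data(self, data):
        # Only keep text that appears between <li> and </li>
        if self.in_li:
            self.items.append(data)

parser = LiCollector()
parser.feed("<ul><li>Machine Learning</li><li>Design of Experiments</li>"
            "<li>Driving Business Value with DS</li></ul>")
```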

The average score on this problem was 93%.

There are two links in the document. Which of the following will return the URL of the link contained in the `div` with class `news`? Mark all that apply.

`doc.find_all('a')[1].attrs['href']`

`doc.find('a')[1].attrs['href']`

`doc.find(a, class='news').attrs['href']`

`doc.find('div', attrs={'class': 'news'}).find('a').attrs['href']`

`doc.find('href', attrs={'class': 'news'})`

**Answer: ** Option A and Option D

- Option A: This option works because `doc.find_all('a')` will return a list of all the `a` elements in the order in which they appear in the HTML document. Since the `a` inside the `div` with class `news` is the second `a` element in the document, we do `[1]` to select it (as we would in any other list). Finally, we return the URL of the `a` element by getting the `'href'` attribute using `.attrs['href']`.
- Option B: This does not work because `.find` will only find the first instance of `a`, which is not the one we’re looking for.
- Option C: This does not work because there are no quotations around the `a`.
- Option D: This option works because `doc.find('div', attrs={'class': 'news'})` will first find the `div` element with `class='news'`, then find the `a` element within that element and get its `href` attribute, which is what we want.
- Option E: This does not work because there is no `href` element in the HTML document.

The average score on this problem was 90%.

What is the purpose of the `alt` attribute in the `img` tag?

It provides an alternative image that will be shown to some users at random

It creates a link with text present in `alt`

It provides the text to be shown below the image as a caption

It provides text that should be shown in the case that the image cannot be displayed

**Answer: ** Option D

This is essentially the definition of the `alt` attribute: it supplies the text shown when the image cannot be displayed.

The average score on this problem was 98%.

You are scraping a web page using the `requests` module. Your code works fine and returns the desired result, but suddenly you find that when you run your code it starts but never finishes – it does not raise an error or return anything. What is the most likely cause of the issue?

The page has a very large GIF that hasn’t stopped playing

You have made too many requests to the server in too short of a time, and you are being “timed out”

The page contains a Unicode character that `requests` cannot parse

The page has suddenly changed and has caused `requests` to enter an infinite loop

**Answer: ** Option B

- Option A: We can pretty confidently (I hope) rule this option out, since whether or not a GIF has stopped playing shouldn’t affect our web scraping.
- Option B: This answer is right because a server will time you out (and potentially block you) if you make too many requests to it in too short a time.
- Option C: This shouldn’t cause your code to never finish; it’s more likely that the `requests` module would just process said Unicode character incorrectly, or throw an error.
- Option D: Again, this shouldn’t cause your code to never finish; the `requests` module will simply return whatever version of the page is served at the time you call it.

The average score on this problem was 78%.

You are creating a new programming language called IDK. In this language, all variable names must satisfy the following constraints:

- They must start with `Um`, `Umm`, `Ummm`, etc. That is, a capital `U` followed by one or more `m`’s.
- Following this may be any string of one or more letters. The first letter must be capitalized.
- The variable name must end with a single question mark.
- Numbers or other special characters are not allowed.

Examples of valid variable names: `UmmmX?`, `UmTest?`, `UmmmmPendingQueue?`

Examples of invalid variable names: `ummX?`, `Um?`, `Ummhello?`, `UmTest`

Write a regular expression pattern string `pat` to validate variable names in this new language. Your pattern should work when `re.match(pat, s)` is called.

**Answer: ** `'U(m+)[A-Z]([A-z]*)(\?)'`

Starting our regular expression, it’s not too difficult to see that we need a `'U'` followed by `'(m+)'`, which will match a single capital `'U'` followed by at least one lowercase `'m'`. Next, we must follow that up with any string of letters where the first letter is capitalized. We do this with `'[A-Z]([A-z]*)'`, where `'[A-Z]'` matches any capital letter and `'([A-z]*)'` matches uppercase and lowercase letters 0 or more times. (Strictly speaking, `[A-z]` also matches a few punctuation characters that fall between `Z` and `a` in ASCII; `[A-Za-z]` is the more precise class.) Finally, we end the regex with `'\?'`, which matches a literal question mark.

The average score on this problem was 87%.

Which of the following strings will be matched by the regex pattern `\$\s*\d{1,3}(\.\d{2})?`? Note that it should match the complete string given. Mark all that apply.

$ 100

$10.12

$1340.89

$1456.8

$478.23

$.99

**Answer: ** Option A, Option B and Option E

Let’s dissect what the regex in the question actually means. First, `'\$'` matches a literal dollar sign. Next, `'\s*'` matches whitespace characters 0 or more times, and `'\d{1,3}'` matches any digit 1 to 3 times, inclusive. Finally, `'(\.\d{2})?'` matches an expression consisting of a period followed by exactly two digits, 0 or 1 times (due to the `'?'`). With those rules in mind, it’s not too difficult to check that Options A, B and E work.

Options C, D and F don’t work: C and D have more than 3 digits before the period (and D has only one digit after it), while F has no digits before the period.
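`re.fullmatch` makes the complete-string requirement explicit; checking each option from the problem:

```python
import re

pat = r'\$\s*\d{1,3}(\.\d{2})?'

# fullmatch succeeds only if the pattern consumes the entire string
matches = {s: re.fullmatch(pat, s) is not None
           for s in ['$ 100', '$10.12', '$1340.89', '$1456.8', '$478.23', '$.99']}
```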

The average score on this problem was 83%.

“borough”, “brough”, and “burgh” are all common suffixes among English towns. Write a single regular expression pattern string `pat` that will match any town name ending with one of these suffixes. Your pattern should work when `re.match(pat, s)` is called.

**Answer: **
`'(.*)((borough$)|(brough$)|(burgh$))'`

We will assume that the question wants us to create a regex that matches any string ending with the suffixes described in the problem. First, `'(.*)'` will match anything 0 or more times at the start of the expression. Then `'((borough$)|(brough$)|(burgh$))'` will match any of the described suffixes, since `'$'` indicates the end of the string and `'|'` is just `or` in regex.
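A quick check of the pattern against a few real town names (and one non-match):

```python
import re

pat = r'(.*)((borough$)|(brough$)|(burgh$))'

# Town names ending in each of the three suffixes
towns = ['Scarborough', 'Middlesbrough', 'Edinburgh']
matched = all(re.match(pat, t) for t in towns)

# A name with none of the suffixes should not match
no_match = re.match(pat, 'Boston') is None
```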

The average score on this problem was 82%.

For this problem, consider the following data set of product reviews.

What is the Inverse Document Frequency (IDF) of the word “are” in the `review` column? Use base-10 log in your computations.

**Answer: ** 0

The IDF is the logarithm of the total number of documents divided by the number of documents containing the term. Since there are 6 documents and all 6 contain the word “are”, the IDF is just \log_{10}{(\frac{6}{6})}, which is 0.
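The computation itself, in a couple of lines:

```python
import math

n_docs = 6        # total documents
n_with_term = 6   # documents containing "are"

idf = math.log10(n_docs / n_with_term)  # log10 of 1 is 0
```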

The average score on this problem was 97%.

Calculate the TF-IDF score of “keyboard” in the last review. Use the base-10 logarithm in your calculation.

**Answer: ** 0.119

TF is the number of times a term appears in a document divided by the number of terms in the document. Thus the TF score of “keyboard” in the last review is \frac{1}{4}.

The IDF score is \log_{10}{(\frac{6}{2})}, since there are 2 documents with “keyboard” in them and 6 documents total.

Thus multiplying \frac{1}{4} \times \log_{10}{(\frac{6}{2})} yields 0.119.
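The arithmetic, step by step:

```python
import math

tf = 1 / 4               # "keyboard" appears once among the 4 terms of the review
idf = math.log10(6 / 2)  # 6 documents total, 2 contain "keyboard"
tf_idf = tf * idf
```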

The average score on this problem was 81%.

The music store wants to predict rating using other features in the dataset. Before that, we have to deal with the categorical variables – `music_level` and `product_name`. How should you encode the `music_level` variable for the modeling?

One-hot encoding

Bag of words

Ordinal encoding

Tf-Idf

**Answer: ** Option C

In ordinal encoding, each unique category is assigned an integer value. Thus it makes sense to encode `music_level` with ordinal encoding, since there is an inherent order to `music_level`; i.e., ‘Grade IV’ would be “higher” than ‘Grade I’ in `music_level`.

Bag of words doesn’t work since we’re not trying to encode multi-word strings, and TF-IDF simply doesn’t make sense for encoding here. One-hot encoding isn’t the right choice either, since `music_level` is an ordinal categorical variable.

The average score on this problem was 100%.

How should you encode the `product_name` variable for the modeling?

One-hot encoding

Bag of words

Ordinal encoding

Tf-Idf

**Answer: ** Option A

Since `product_name` is a nominal categorical variable, it makes sense to encode it using one-hot encoding. Ordinal encoding wouldn’t work this time since `product_name` is not an ordinal categorical variable. We can eliminate the rest of the options by reasoning similar to the question above.

The average score on this problem was 98%.

True or False: the TF-IDF score of a term depends on the document containing it.

True

False

**Answer: ** True

TF is calculated by the number of times a term appears in a document divided by the number of terms in the document, which depends on the document itself. So if TF depends on the document itself, so does the TF-IDF score.

The average score on this problem was 97%.

Suppose you perform a one-hot encoding of a Series containing the following music genres:

```
["hip hop",
"country",
"polka",
"country",
"pop",
"pop",
"hip hop"
]
```

Assume that the encoding is created using the `OneHotEncoder(drop='first')` transformer from `sklearn.preprocessing` (note the `drop='first'` keyword).

How many columns will the one-hot encoding table have?

**Answer: ** 3

Normal One-hot encoding will yield 4 columns, one for each of the
following categories: “hip-hop”, “country”, “polka”, and “pop”. However,
`drop='first'`

will drop the first column after One-hot
encoding which will yield 4 - 1 = 3 columns.
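Counting the columns directly (sklearn sorts the unique categories before dropping the first):

```python
genres = ["hip hop", "country", "polka", "country", "pop", "pop", "hip hop"]

# OneHotEncoder sorts the unique categories; drop='first' removes the first column
categories = sorted(set(genres))
n_columns = len(categories) - 1
```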

The average score on this problem was 80%.

A credit card company wants to develop a fraud detector model which can flag transactions as fraudulent or not using a set of user and platform features. Note that fraud is a rare event (about 1 in 1000 transactions are fraudulent).

Suppose the model provided the following confusion matrix on the test set. Here, 1 (a positive result) indicates fraud and 0 (a negative result) indicates otherwise. Let’s evaluate it to understand the model performance better.

Use this confusion matrix to answer the following questions.

What is the accuracy of the model shown above? Your answer should be a number between 0 and 1.

**Answer: ** 0.809

Accuracy is calculated as the number of correct predictions divided by the total number of predictions. From the table, we can see that there are 34 + 989 = 1023 correct predictions out of 34 + 153 + 87 + 989 = 1263 total, giving us an accuracy of 1023/1263, or 0.809.

The average score on this problem was 94%.

How many false negatives were produced by the model shown above?

**Answer: ** 153

A false negative is when the model predicts a negative result when the actual value is positive. So in the case of this model, a false negative would be when the model predicts 0 when the actual value should be 1. Simply looking at the table shows that there are 153 false negatives.

The average score on this problem was 91%.

What is the precision of the model shown above?

**Answer: ** 0.281

The precision of the model is given by the number of True Positives (TP), divided by the sum of the number of True Positives (TP) and False Positives (FP) or \frac{TP}{TP + FP}. Plugging the appropriate values in the formula gives us 34/(34 + 87) = 34/121 or 0.281.

The average score on this problem was 85%.

What is the recall of the model shown above?

**Answer: ** 0.18

The recall of the model is given by the number of True Positives (TP) divided by the sum of the number of True Positives (TP) and False Negatives (FN), or \frac{TP}{TP + FN}. Plugging the appropriate values into the formula gives us 34/(34 + 153) = 34/187, or 0.18.

The average score on this problem was 82%.

What is the accuracy of a “naive” model that always predicts that the transaction is not fraudulent? Your answer should be a number between 0 and 1.

**Answer: ** 0.852

If our model always predicts not fraudulent or 0, then we will get an accuracy of \frac{87+989}{34+153+87+989} or 0.852

The average score on this problem was 78%.
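All of the quantities above come from the same four confusion-matrix counts; a stdlib recap, with the counts taken from the problem:

```python
tp, fn = 34, 153   # actual fraud (1): predicted 1 / predicted 0
fp, tn = 87, 989   # actual non-fraud (0): predicted 1 / predicted 0
total = tp + fn + fp + tn

accuracy = (tp + tn) / total       # correct predictions over all predictions
precision = tp / (tp + fp)         # of the flagged transactions, how many were fraud
recall = tp / (tp + fn)            # of the actual fraud, how much was caught
naive_accuracy = (tn + fp) / total # always predict 0: right whenever actual is 0
```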

Describe a simple model that will achieve 100% recall despite it being a poor classifier.

**Answer: ** A model that always guesses fraudulent.

Recall that the formula for recall (no pun intended) is \frac{TP}{TP + FN}. Thus, if we want to achieve 100% recall, we simply need no False Negatives (FN = 0), since that gives us \frac{TP}{TP + 0}, which is just 1. If we never guess 0 (non-fraudulent), we’ll never get any False Negatives (because getting a FN requires the model to predict negative for some values). Hence a model that always guesses fraudulent will achieve 100% recall, despite obviously being a bad model.

The average score on this problem was 86%.

Suppose you consider false negatives to be more costly than false positives, where a “positive” prediction is one for fraud. Assuming that precision and recall are both high, which is better for your purposes:

A model with higher recall than precision.

A model with higher precision than recall.

**Answer: ** Option A

Since we consider false negatives to be more costly, it makes sense to favor a model that puts more emphasis on recall because, again, the formula for recall is \frac{TP}{TP + FN}. On the other hand, precision doesn’t involve false negatives, so the answer here is Option A.

The average score on this problem was 85%.

Suppose you split a data set into a training set and a test set. You train your model on the training set and test it on the test set.

True or False: the training accuracy must be higher than the test accuracy.

True

False

**Answer: ** False

There is no guarantee that training accuracy exceeds test accuracy, since the way you split your data set is largely random, so your test accuracy may end up higher than your training accuracy. To illustrate this, suppose your training data consists of 100 data points and your test data consists of 1 data point. Now suppose your model achieves a training accuracy of 90%, and when you test it on the test set, your model predicts that single data point correctly. Clearly, your test accuracy (100%) is now higher than your training accuracy.

The average score on this problem was 75%.

Suppose you create a 70%/30% train/test split and train a decision tree classifier. You find that the training accuracy is much higher than the test accuracy (90% vs 60%). Which of the following is likely to help significantly improve the test accuracy? Select all that apply. You may assume that the classes are balanced.

Reduce the number of features

Increase the number of features

Decrease the max depth parameter of the decision tree

Increase the max depth parameter of the decision tree

**Answer: ** Option A and Option C

When your training accuracy is significantly higher than your test accuracy, it tells you that your model is probably overfitting the data, and thus we wish to reduce the complexity of our model. From the choices given, we can reduce the complexity by reducing the number of features or decreasing the max depth parameter of the decision tree. (Recall that the greater the depth of a decision tree, the more complex the model.)

The average score on this problem was 65%.

Suppose you are training a decision tree classifier as part of a pipeline with PCA. You will need to choose three parameters: the number of components to use in PCA, the maximum depth of the decision tree, and the minimum number of points needed for a leaf node. You’ll do this using sklearn’s GridSearchCV which performs a grid search with k-fold cross validation.

Suppose you’ll try 3 possibilities for the number of PCA parameters, 5 possibilities for the max depth of the tree, 10 possibilities for the number of points needed for a leaf node, and use k=5 folds for cross-validation.

How many times will the model be trained by GridSearchCV?

**Answer: ** 750

Note that the grid search will test every combination of hyperparameters for our model, which gives us a total of 3 * 5 * 10 = 150 combinations. However, we also perform k-fold cross-validation with 5 folds, meaning that the model is trained 5 times for each combination of hyperparameters, giving us a grand total of 5 * 150 = 750.
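The count is just a product of the grid sizes and the number of folds:

```python
n_pca_options = 3     # candidate numbers of PCA components
n_depth_options = 5   # candidate max depths
n_leaf_options = 10   # candidate minimum leaf sizes
k_folds = 5           # folds in cross-validation

n_combinations = n_pca_options * n_depth_options * n_leaf_options
n_fits = n_combinations * k_folds
```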

The average score on this problem was 84%.

The plot below shows the distribution of reported gas mileage for two models of car.

What test statistic is the best choice for testing whether the two empirical distributions came from different underlying distributions?

the TVD

the absolute difference in means

the signed difference in means

the Kolmogorov-Smirnov Statistic

**Answer: ** Option D

TVD simply doesn’t work here since we’re not working with categorical data. Any sort of difference in means here wouldn’t really tell us much either since both distributions seem to have basically the same mean. Thus to tell whether the two distributions are different, it would be best if we used K-S statistic.

The average score on this problem was 88%.

Suppose 1000 people are surveyed. One of the questions asks for the person’s age. Upon reviewing the results of the survey, it is discovered that some of the ages are missing – these people did not respond with their age. What is the most likely type of this missingness?

Missing At Random

Missing Completely At Random

Not Missing At Random

Missing By Design

**Answer: ** Option C

The most likely mechanism for this is NMAR because there is a good reason why the missingness depends on the values themselves: very young respondents may simply not be aware of their age and hence be less likely to answer, while on the other end of the spectrum, older people might refrain from reporting their age.

MAR doesn’t work here because there aren’t really any other columns that would tell us anything about the likelihood of the missingness of the age column. MD clearly doesn’t work since we have no way of determining the missing values from the other columns. Although MCAR could be a possibility, this problem more appropriately illustrates NMAR than MCAR.

The average score on this problem was 76%.

Consider a data set consisting of the height of 1000 people. The data set contains two columns: height, and whether or not the person is an adult.

Suppose that some of the heights are missing. Among those whose heights are observed there are 400 adults and 400 children; among those whose height is missing, 50 are adults and 150 are children.

If the mean height is computed using only the observed data, which of the following will be true?

the mean will be biased low

the mean will be biased high

the mean will be unbiased

**Answer: ** Option B

Since there are more missing children heights than missing adult heights, (we will assume that adults are taller than children), the observed mean will be higher than the actual total mean, or the observed mean will be biased high.

The average score on this problem was 94%.

We have built two models with the following accuracies: Model 1: train score 93%, test score 67%. Model 2: train score 84%, test score 80%. Which model will you choose to make future predictions on unseen data? You may assume that the class labels are balanced.

Model 1

Model 2

**Answer: ** Model 2

When choosing between models, we care about performance on unseen data, which is estimated by the test score, so we’ll typically choose the model with the higher test score: Model 2.

The average score on this problem was 99%.

Suppose we retrain a decision tree model, each time increasing the `max_depth` parameter. As we do so, we plot the *test* error. What will we likely see?

The test error will first decrease, then increase.

The test error will decrease.

The test error will first increase, then decrease.

The test error will remain unchanged.

**Answer: ** Option A

As we increase the `max_depth` parameter of our model, the test error will decrease. It will keep decreasing until we reach the optimal `max_depth`, after which the test error will start to increase (due to overfitting). Thus the correct answer is Option A.

The average score on this problem was 87%.