
**Instructor(s):** Suraj Rampure

Welcome to the Final Exam for DSC 80 in Spring 2022! This exam was administered in-person and was closed-notes, except students were allowed to bring 2 two-sided cheat sheets. No calculators were allowed. Students had **180 minutes** to take this exam.

For each day in May 2022, the DataFrame `streams` contains the number of streams for each of the "Top 200" songs on Spotify that day — that is, the number of streams for the 200 songs with the most streams on Spotify that day. The columns in `streams` are as follows:

- `"date"`: the date the song was streamed
- `"artist_names"`: name(s) of the artists who created the song
- `"track_name"`: name of the song
- `"streams"`: the number of times the song was streamed on Spotify that day

The first few rows of `streams` are shown below. Since there were 31 days in May and 200 songs per day, `streams` has 6200 rows in total.

Note that:

- `streams` is already sorted in a very particular way — it is sorted by `"date"` in reverse chronological (decreasing) order, and, within each `"date"`, by `"streams"` in increasing order.
- Many songs will appear multiple times in `streams`, because many songs were in the Top 200 on more than one day.

Complete the implementation of the function `song_by_day`, which takes in an integer `day` between 1 and 31 corresponding to a day in May, and an integer `n`, and returns the song that had the `n`-th most streams on `day`. For instance, `song_by_day(31, 199)` should evaluate to `"pepas"`, because `"pepas"` was the 199th most streamed song on May 31st.

**Note:** You are not allowed to sort within `song_by_day` — remember, `streams` is already sorted.

```
def song_by_day(day, n):
    day_str = f"2022-05-{str(day).zfill(2)}"
    day_only = streams[__(a)__].iloc[__(b)__]
    return __(c)__
```

What goes in each of the blanks?

**Answer:** a) `streams['date'] == day_str`, b) `(200 - n)`, c) `day_only['track_name']`

The first line in the function hints that later in the function we're going to filter for all the rows that match the given date. Indeed, in blank (a), we filter for all the rows in which the `'date'` column matches `day_str`. In blank (b), we can directly access the row with the `n`-th most streams using `iloc`. (Remember, the image above shows us that within each day, the rows are sorted by streams in ascending order, so to find the `n`-th most popular song of a day, we simply do `200 - n`.) Finally, to return the track name, we simply do `day_only['track_name']`.
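Putting the three blanks together gives the completed function below. The tiny `streams` DataFrame built here is a made-up stand-in for the real one (an assumption for illustration): 200 songs on a single day, sorted by `"streams"` in increasing order as described above.

```python
import pandas as pd

# Made-up stand-in for the real streams DataFrame: 200 songs on May 31st,
# sorted by "streams" in increasing order within the day.
streams = pd.DataFrame({
    "date": ["2022-05-31"] * 200,
    "track_name": [f"song_{k}" for k in range(200)],
    "streams": range(1, 201),
})

def song_by_day(day, n):
    day_str = f"2022-05-{str(day).zfill(2)}"
    day_only = streams[streams["date"] == day_str].iloc[200 - n]
    return day_only["track_name"]
```

No sorting happens inside the function: `song_by_day(31, 1)` reads row `200 - 1 = 199` of the filtered DataFrame, which is the day's most-streamed song.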

The average score on this problem was 63%.

Below, we define a DataFrame `pivoted`.

```
pivoted = streams.pivot_table(index="track_name", columns="date",
                              values="streams", aggfunc=np.max)
```

After defining `pivoted`, we define a Series `mystery` below.

```
mystery = 31 - pivoted.apply(lambda s: s.isna().sum(), axis=1)
```

`mystery.loc["pepas"]` evaluates to 23. In one sentence, describe the relationship between the number 23 and the song `"pepas"` in the context of the `streams` dataset. For instance, a correctly formatted but incorrect answer is "I listened to the song `"pepas"` 23 times today."

**Answer:** See below.

`pivoted.apply(lambda s: s.isna().sum(), axis=1)` computes the number of days that a song was not in the Top 200, so `31 - pivoted.apply(lambda s: s.isna().sum(), axis=1)` computes the number of days the song was in the Top 200. As such, the correct interpretation is that **"pepas" was in the Top 200 for 23 days in May**.

The average score on this problem was 68%.

In defining `pivoted`, we set the keyword argument `aggfunc` to `np.max`. Which of the following functions could we have used instead of `np.max` without changing the values in `pivoted`? **Select all that apply.**

- `np.mean`
- `np.median`
- `len`
- `lambda df: df.iloc[0]`
- None of the above

**Answer:** Options A, B, and D

For each combination of `"track_name"` and `"date"`, there is just a single value — the number of streams that song received on that date. As such, the `aggfunc` needs to take in a Series containing a single number and return that same number.

The mean and median of a Series containing a single number is equal to that number, so the first two options are correct.

The length of a Series containing a single number is 1, no matter what that number is, so the third option is not correct.

`lambda df: df.iloc[0]` takes in a Series and returns the first element in the Series, which is the only element in the Series. This option is correct as well. (The parameter name `df` is irrelevant.)
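To see this concretely, here is a minimal sketch with a made-up three-row `streams` in which each `(track_name, date)` pair occurs exactly once, so `np.max`, `np.mean`, and `lambda s: s.iloc[0]` all produce the same pivot table:

```python
import numpy as np
import pandas as pd

# Made-up mini-corpus: each (track_name, date) pair appears exactly once.
streams = pd.DataFrame({
    "track_name": ["pepas", "pepas", "as it was"],
    "date": ["2022-05-30", "2022-05-31", "2022-05-31"],
    "streams": [100, 120, 300],
})

kwargs = dict(index="track_name", columns="date", values="streams")
with_max = streams.pivot_table(aggfunc=np.max, **kwargs)
with_mean = streams.pivot_table(aggfunc=np.mean, **kwargs)
with_first = streams.pivot_table(aggfunc=lambda s: s.iloc[0], **kwargs)

# All three tables are identical; aggfunc=len would instead put a 1 in
# every observed cell, which is why it changes the values.
```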

The average score on this problem was 80%.

Below, we define another DataFrame `another_mystery`.

```
another_mystery = (streams.groupby("date").last()
                          .groupby(["artist_names", "track_name"])
                          .count().reset_index())
```

`another_mystery` has 5 rows. In one sentence, describe the significance of the number 5 in the context of the `streams` dataset. For instance, a correctly formatted but incorrect answer is "There are 5 unique artists in `streams`." Your answer should not include the word "row".

**Answer:** See below.

Since `streams` is sorted by `"date"` in descending order and, within each `"date"`, by `"streams"` in ascending order, `streams.groupby("date").last()` is a DataFrame containing the song with the most `"streams"` on each day in May. In other words, we found the "top song" for each day. (The DataFrame we created has 31 rows.)

When we then execute `.groupby(["artist_names", "track_name"]).count()`, we create one row for every unique combination of song and artist, amongst the "top songs". (If no two artists have a song with the same name, this is the same as creating one row for every unique song.) Since there are 5 rows in this new DataFrame (resetting the index doesn't do anything here), it means that **there were only 5 unique songs that were ever the "top song" on a day in May**; this is the correct interpretation.
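As a sanity check, here is the same pipeline run on a tiny made-up `streams` (two days, two songs per day, sorted the same way as the real one); the same song tops both days, so the result has a single row:

```python
import pandas as pd

# Tiny stand-in, sorted by "date" descending and, within each date,
# by "streams" ascending, mirroring the real streams DataFrame.
streams = pd.DataFrame({
    "date": ["2022-05-02", "2022-05-02", "2022-05-01", "2022-05-01"],
    "artist_names": ["a1", "a2", "a1", "a2"],
    "track_name": ["s1", "s2", "s1", "s2"],
    "streams": [10, 20, 30, 40],
})

# .last() grabs the final (i.e. most-streamed) row of each date.
another_mystery = (streams.groupby("date").last()
                          .groupby(["artist_names", "track_name"])
                          .count().reset_index())
```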

The average score on this problem was 50%.

Suppose the DataFrame `today` consists of 15 rows — 3 rows for each of 5 different `"artist_names"`. For each artist, it contains the `"track_name"` for their three most-streamed songs today. For instance, there may be one row for `"olivia rodrigo"` and `"favorite crime"`, one row for `"olivia rodrigo"` and `"drivers license"`, and one row for `"olivia rodrigo"` and `"deja vu"`.

Another DataFrame, `genres`, is shown below in its entirety.

Suppose we perform an **inner** merge between `today` and `genres` on `"artist_names"`. If the five `"artist_names"` in `today` are the same as the five `"artist_names"` in `genres`, what fraction of the rows in the merged DataFrame will contain `"Pop"` in the `"genre"` column? Give your answer as a simplified fraction.

**Answer:** \frac{2}{5}

If the five `"artist_names"` in `today` and `genres` are the same, the DataFrame that results from an inner merge will have 15 rows, one for each row in `today`. This is because there are 3 matches for `"harry styles"`, 3 matches for `"olivia rodrigo"`, 3 matches for `"glass animals"`, and so on.

In the merged DataFrame's 15 rows, 6 of them will correspond to `"Pop"` artists — 3 to `"harry styles"` and 3 to `"olivia rodrigo"`. Thus, the fraction of rows that contain `"Pop"` in the `"genre"` column is \frac{6}{15} = \frac{2}{5} (which is the fraction of rows that contained `"Pop"` in `genres["genre"]`, too).

The average score on this problem was 97%.

Suppose we perform an **inner** merge between `today` and `genres` on `"artist_names"`. Furthermore, suppose that the only overlapping `"artist_names"` between `today` and `genres` are `"drake"` and `"olivia rodrigo"`. What fraction of the rows in the merged DataFrame will contain `"Pop"` in the `"genre"` column? Give your answer as a simplified fraction.

**Answer:** \frac{1}{2}

If we perform an inner merge, there will only be 6 rows in the merged DataFrame — 3 for `"olivia rodrigo"` and 3 for `"drake"`. 3 of those 6 rows will have `"Pop"` in the `"genre"` column, hence the answer is \frac{3}{6} = \frac{1}{2}.

The average score on this problem was 86%.

Suppose we perform an **outer** merge between `today` and `genres` on `"artist_names"`. Furthermore, suppose that the only overlapping `"artist_names"` between `today` and `genres` are `"drake"` and `"olivia rodrigo"`. What fraction of the rows in the merged DataFrame will contain `"Pop"` in the `"genre"` column? Give your answer as a simplified fraction.

**Answer:** \frac{2}{9}

Since we are performing an outer merge, we can decompose the rows in the merged DataFrame into three groups:

- Rows that are in `today` but not in `genres`. There are 9 of these (3 each for the 3 artists that are in `today` and not `genres`). `today` doesn't have a `"genre"` column, and so all of these `"genre"`s will be `NaN` upon merging.
- Rows that are in `genres` but not in `today`. There are 3 of these — one for `"harry styles"`, one for `"glass animals"`, and one for `"doja cat"`. 1 of these 3 has `"Pop"` in the `"genre"` column.
- Rows that are in both `today` and `genres`. There are 6 of these — 3 for `"olivia rodrigo"` and 3 for `"drake"` — and 3 of those rows contain `"Pop"` in the `"genre"` column.

Tallying things up, we see that there are 9 + 3 + 6 = 18 rows in the merged DataFrame overall, of which 0 + 1 + 3 = 4 have `"Pop"` in the `"genre"` column. Hence, the relevant fraction is \frac{4}{18} = \frac{2}{9}.
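The tally above can be reproduced with a small sketch. The `today` and `genres` DataFrames below are hypothetical, but they match the scenario: 5 artists with 3 songs each in `today`, and only `"drake"` and `"olivia rodrigo"` shared between the two DataFrames.

```python
import pandas as pd

# Hypothetical data matching the scenario in this part.
today = pd.DataFrame({
    "artist_names": [a for a in ["drake", "olivia rodrigo",
                                 "artist3", "artist4", "artist5"]
                     for _ in range(3)],
    "track_name": [f"song_{i}" for i in range(15)],
})
genres = pd.DataFrame({
    "artist_names": ["drake", "olivia rodrigo", "harry styles",
                     "glass animals", "doja cat"],
    "genre": ["Hip-Hop/Rap", "Pop", "Pop", "Alternative", "Hip-Hop/Rap"],
})

# Outer merge: 9 today-only rows (NaN genre) + 3 genres-only rows
# + 6 matched rows = 18 rows, of which 4 have "Pop".
merged = today.merge(genres, on="artist_names", how="outer")
frac_pop = (merged["genre"] == "Pop").mean()
```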

The average score on this problem was 29%.

Billy and Daisy each decide what songs to stream by rolling dice. Billy rolls a six-sided die 36 times and sees the same number of 1s, 2s, 3s, 4s, 5s, and 6s. Daisy rolls a six-sided die 72 times and sees 36 1s, 18 4s, and 18 6s.

What is the total variation distance (TVD) between their distributions of rolls? Give your answer as a number.

**Answer: ** \frac{1}{2}

First, we must normalize their distributions so they sum to 1. \text{billy} = \begin{bmatrix} \frac{1}{6}, \frac{1}{6}, \frac{1}{6}, \frac{1}{6}, \frac{1}{6}, \frac{1}{6} \end{bmatrix}, \: \: \: \text{daisy} = \begin{bmatrix} \frac{1}{2}, 0, 0, \frac{1}{4}, 0, \frac{1}{4} \end{bmatrix}

Then, recall that the TVD is the sum of the absolute differences in proportions, all divided by 2:

\begin{aligned} \text{TVD} &= \frac{1}{2} \Bigg( \Big| \frac{1}{2} - \frac{1}{6} \Big| + \Big| 0 - \frac{1}{6} \Big| + \Big| 0 - \frac{1}{6} \Big| + \Big| \frac{1}{4} - \frac{1}{6} \Big| + \Big| 0 - \frac{1}{6} \Big| + \Big| \frac{1}{4} - \frac{1}{6} \Big| \Bigg) \\ &= \frac{1}{2} \Bigg( \frac{2}{6} + \frac{1}{6} + \frac{1}{6} + \frac{1}{12} + \frac{1}{6} + \frac{1}{12} \Bigg) \\ &= \frac{1}{2} \cdot 1 \\ &= \frac{1}{2}\end{aligned}
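The same computation can be checked numerically with a quick `numpy` sketch:

```python
import numpy as np

# Normalize the observed counts into distributions.
billy = np.array([6, 6, 6, 6, 6, 6]) / 36
daisy = np.array([36, 0, 0, 18, 0, 18]) / 72

# TVD: sum of absolute differences in proportions, divided by 2.
tvd = np.abs(billy - daisy).sum() / 2   # 0.5
```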

The average score on this problem was 42%.

Consider two categorical distributions, each made up of the same n categories. Given no other information, what is the smallest possible TVD between the two distributions, and what is the largest possible TVD between the two distributions? Give both answers as numbers.

smallest possible TVD =

largest possible TVD =

**Answer: ** Smallest: 0, Largest: 1

There is an absolute value in the definition of TVD, so it must be at least 0. The TVD does not scale with the number of categories; its maximum possible value is 1. We will not provide a rigorous proof here, but intuitively, the most "different" two distributions can be is if they both have 100% of their values in different categories. In such a case, the TVD is 1.

The average score on this problem was 74%.

Consider the following pair of distributions.

Suppose we want to perform a permutation test to test whether the two distributions come from the same population distribution. Which test statistic is most likely to yield a significant result?

- Difference in means
- Absolute difference in means
- Kolmogorov-Smirnov (K-S) statistic
- Total variation distance

**Answer: ** Option C

These are two quantitative distributions, and the total variation distance only applies for categorical distributions.

These two distributions have similar means, so the difference in means and absolute difference in means won’t be able to tell them apart.

As such, the correct answer is the Kolmogorov-Smirnov statistic, which roughly measures the largest "gap" between two cumulative distribution functions.

The average score on this problem was 87%.

As a music enthusiast who checks Spotify's Top 200 daily, you have noticed that many more songs from the "Hip-Hop/Rap" genre were popular in 2021 than were popular in years prior. You decide to investigate whether this could have happened by chance, or if Hip-Hop/Rap is actually becoming more popular.

You acquire ten DataFrames, one for each year between 2012 and 2021 (inclusive), containing the Top 200 songs on each day of each year. Each DataFrame has four columns, `"year"`, `"artist_names"`, `"track_name"`, and `"genre"`, and 365 \cdot 200 = 73000 rows (or, in the case of leap years 2012 and 2016, 366 \cdot 200 = 73200 rows). You concatenate these DataFrames into a big DataFrame called `all_years`.

To conduct a hypothesis test, you assume that all of the songs that were ever in the Top 200 in a particular year are a sample of all popular songs from that year. As such, `all_years` contains a sample of all popular songs from 2012-2021. Your hypotheses are as follows:

**Null Hypothesis:** The number of unique Hip-Hop/Rap songs that were popular in 2021 is equal to the average number of unique Hip-Hop/Rap songs that were popular each year between 2012 and 2021 (inclusive).

**Alternative Hypothesis:** The number of unique Hip-Hop/Rap songs that were popular in 2021 is greater than the average number of unique Hip-Hop/Rap songs that were popular each year between 2012 and 2021 (inclusive).

To generate data under the null, you decide to treat the songs that
were in the Top 200 in 2021 as a random sample from the songs that were
in the Top 200 in all ten years. As your test statistic, you use the
**number of unique Hip-Hop/Rap songs** in your sample.

**Complete the implementation of the hypothesis test given on the next page.**

**Note:** Remember that songs can appear multiple times in `all_years` — many songs appear in the Top 200 multiple times in the same year, and some even appear in the Top 200 in different years. However, for the purposes of this question, you can assume that no two different `"artist_names"` have songs with the same `"track_name"`; that is, each `"track_name"` belongs to a unique `"artist_names"`.

```
all_years =      # DataFrame that contains all 10 years' charts

# Helper function to compute the number of
# unique Hip-Hop/Rap songs in a given DataFrame
def unique_rap(df):
    rap_only = df[df["genre"] == "Hip-Hop/Rap"]
    return rap_only.groupby(__(a)__).count()__(b)__[0]

count_2021 = unique_rap(all_years[all_years["year"] == 2021])
counts = np.array([])

for _ in range(10000):
    samp = all_years.sample(__(c)__, replace=True)
    counts = np.append(counts, unique_rap(samp))

p_val = (__(d)__).mean()
```

What goes in blank (a)?

- `"year"`
- `"artist_names"`
- `"track_name"`
- `"genre"`

**Answer:** Option C: `"track_name"`

The first thing to notice is that the first line in the function filters for all the rap songs in `df`. Next, since the helper function is supposed to return the number of unique Hip-Hop/Rap songs in a given DataFrame, we group by `"track_name"` so that each group/category is a unique song. Grouping by `"year"`, `"artist_names"`, or `"genre"` won't help us compute the number of unique songs.

The average score on this problem was 89%.

What goes in blank (b)?

- `.shape`
- `.loc`
- `.iloc`
- Nothing (i.e. add `[0]` immediately after `.count()`)

**Answer:** Option A: `.shape`

Note that after grouping by `"track_name"`, each individual row in the resulting DataFrame is a unique song within the rap genre. Thus, the number of rows in the resulting DataFrame is just the number of unique rap songs in the original DataFrame. To get the number of rows of a DataFrame, we simply do `.shape[0]`.

The average score on this problem was 89%.

What goes in blank (c)?

- `count_2021.shape[0]`
- `200`
- `365`
- `200 * 365`

**Answer:** Option D: `200 * 365`

Consider the following statement from the problem: "To generate data under the null, you decide to treat the songs that were in the Top 200 in 2021 as a random sample from the songs that were in the Top 200 in all ten years." Thus, it follows that we need to sample `200 * 365` songs, since there are `200 * 365` songs in the Top 200 in 2021.

Alternatively, it makes sense to sample `200 * 365` songs from `all_years` because we use the number of unique rap songs as a test statistic for each sample. Thus, our sample should be the same size as the number of songs in the Top 200 in 2021.

The average score on this problem was 52%.

What goes in blank (d)?

**Answer:** `(counts >= count_2021)`

Note that `counts` is an array that contains the number of unique Hip-Hop/Rap songs from each sample. Because our alternative hypothesis is that "the number of unique Hip-Hop/Rap songs that were popular in 2021 is *greater* than the average number of unique Hip-Hop/Rap songs that were popular each year between 2012 and 2021," we're interested in the proportion of simulated test statistics that were equal to our observed statistic or *greater*. `counts >= count_2021` is an array of Booleans, denoting whether or not each sample's number of unique rap songs was at least `count_2021`, and `(counts >= count_2021).mean()` is the proportion (p-value) we're looking for.
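With all four blanks filled in, the completed simulation looks as follows. The `all_years` DataFrame here is a tiny made-up stand-in (the real one has on the order of 730,000 rows), and the number of repetitions is reduced, but the structure matches the skeleton above.

```python
import numpy as np
import pandas as pd

# Tiny made-up stand-in for all_years (assumption for illustration).
all_years = pd.DataFrame({
    "year": [2020, 2020, 2020, 2021, 2021, 2021],
    "track_name": ["a", "b", "c", "c", "d", "d"],
    "genre": ["Hip-Hop/Rap", "Pop", "Hip-Hop/Rap",
              "Hip-Hop/Rap", "Hip-Hop/Rap", "Pop"],
})

def unique_rap(df):
    rap_only = df[df["genre"] == "Hip-Hop/Rap"]
    return rap_only.groupby("track_name").count().shape[0]   # blanks (a), (b)

count_2021 = unique_rap(all_years[all_years["year"] == 2021])

# In the real exam the sample size is 200 * 365, the number of Top 200
# entries in 2021; here we use the number of 2021 rows in the toy data.
sample_size = (all_years["year"] == 2021).sum()

counts = np.array([])
for _ in range(1000):
    samp = all_years.sample(sample_size, replace=True)       # blank (c)
    counts = np.append(counts, unique_rap(samp))

p_val = (counts >= count_2021).mean()                        # blank (d)
```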

The average score on this problem was 53%.

The DataFrame `random_10` contains the `"track_name"` and `"genre"` of 10 randomly-chosen songs in Spotify's Top 200 today, along with their `"genre_rank"`, which is their rank in the Top 200 **among songs in their `"genre"`**. For instance, "the real slim shady" is the 20th-ranked Hip-Hop/Rap song in the Top 200 today.

`random_10` is shown below in its entirety.

The `"genre_rank"` column of `random_10` contains missing values. Below, we provide four different imputed `"genre_rank"` columns, each of which was created using a different imputation technique. On the next page, match each of the four options to the imputation technique that was used in the option.

Note that each option (A, B, C, D) should be used exactly once between parts (a) and (d).

In which option was unconditional mean imputation used?

**Answer: ** Option B

Explanation given in part d) below

The average score on this problem was 99%.

In which option was mean imputation conditional on `"genre"` used?

**Answer: ** Option D

Explanation given in part d) below

The average score on this problem was 96%.

In which option was unconditional probabilistic imputation used?

**Answer: ** Option C

Explanation given in part d) below

The average score on this problem was 92%.

In which option was probabilistic imputation conditional on `"genre"` used?

**Answer: ** Option A

- First, note that in Option B, all three missing values are filled in with the same number, 7. The mean of the observed values in `random_10["genre rank"]` is 7, so we must have performed unconditional mean imputation in Option B. (Technically, it's possible for Option B to be the result of unconditional probabilistic imputation, but we stated that each option could only be used once, and there is another option that can only be unconditional probabilistic imputation.)
- Then note that in Option C, the very last missing value (in the `"Pop"` `"genre"`) is filled in with a 7, which is not the mean of the observed `"Pop"` values, but rather a value from the `"Alternative"` `"genre"`. This must mean that unconditional probabilistic imputation was used in Option C, since that's the only way a value from a different group can be used for imputation (if we are not performing some sort of mean imputation).
- This leaves Option A and Option D. The last two missing values (the two in the `"Pop"` `"genre"`) are both filled in with the same value, 2 in Option A and 5 in Option D. The mean of the observed values for the `"Pop"` `"genre"` is \frac{9+2+4}{3} = 5, so mean imputation conditional on `"genre"` must have been used in Option D, and thus probabilistic imputation conditional on `"genre"` must have been used in Option A.

The average score on this problem was 92%.

In parts (e) and (f), suppose we want to run a permutation test to determine whether the missingness of `"genre rank"` depends on `"genre"`.
.

Name a valid test statistic for this permutation test.

**Answer: ** Total Variation Distance (TVD)

We are comparing two distributions:

- The distribution of `"genre"` when `"genre rank"` is missing.
- The distribution of `"genre"` when `"genre rank"` is not missing.

Since the distribution of `"genre"` is categorical, the above two distributions are also categorical. The only test statistic we have for comparing categorical distributions is the total variation distance (TVD).

The average score on this problem was 63%.

Suppose we conclude that the missingness of `"genre rank"` likely **depends on** `"genre"`. Which imputation technique should we choose if we want to preserve the variance of the `"genre rank"` column?

- Unconditional mean imputation
- Mean imputation conditional on `"genre"`
- Unconditional probabilistic imputation
- Probabilistic imputation conditional on `"genre"`

**Answer:** Option D: Probabilistic imputation conditional on `"genre"`

Mean imputation does not preserve the variance of the imputed values — since it fills in all missing numbers with the same number (either overall, or within each group), the variance of the imputed dataset is less than that of the pre-imputed dataset. To preserve the variance of the imputed values, we must use probabilistic imputation of some sort. Since the missingness of `"genre rank"` was found to be dependent on `"genre"`, we perform probabilistic imputation **conditional** on `"genre"`, i.e. impute `"genre rank"`s randomly within each `"genre"`.
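A minimal pandas sketch of probabilistic imputation conditional on `"genre"` (the DataFrame and its values are made up for illustration): within each genre, each missing rank is filled with a random draw from that genre's observed ranks.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(80)

# Made-up stand-in for random_10.
df = pd.DataFrame({
    "genre": ["Pop", "Pop", "Pop", "Pop", "Alternative", "Alternative"],
    "genre_rank": [9.0, 2.0, 4.0, np.nan, 7.0, np.nan],
})

def probabilistic_impute(s):
    # Draw each fill value at random from the observed values in this group.
    fills = rng.choice(s.dropna(), size=s.isna().sum())
    out = s.copy()
    out[s.isna()] = fills
    return out

df["genre_rank"] = df.groupby("genre")["genre_rank"].transform(probabilistic_impute)
```

Because the draws happen within each group, every imputed value is a plausible rank for that genre, and the spread of the observed values is (approximately) preserved.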

The average score on this problem was 75%.

The DataFrame `trends` contains the number of streams yesterday (`"yest"`) and the number of streams the day before yesterday (`"day_before_yest"`) for songs in Spotify's Top 200. Remember, a song was in the Top 200 yesterday if it was one of the 200 most streamed songs yesterday.

The first few rows of `trends` are shown below.

The `"yest"` column contains missing values. What is the most likely missingness mechanism for `"yest"`?
?

- Missing by design
- Not missing at random
- Missing at random
- Missing completely at random

**Answer:** Option B: NMAR or Option C: MAR

We accepted two answers here — not missing at random and missing at random.

- MCAR is ruled out right away, since there is some "pattern" to the missingness, i.e. some sort of relationship between `"day_before_yest"` and the missingness of `"yest"`.
- One could argue not missing at random, because stream counts are more likely to be missing from `"yest"` if they are smaller. A song is missing from `"yest"` but present in `"day_before_yest"` if its number of streams was in the Top 200 the day before yesterday but not yesterday; if this is true, this must mean that its number of streams is less than any of the songs whose stream counts are actually in `"yest"`.
- One could also argue missing at random, because the missingness of `"yest"` does indeed depend on `"day_before_yest"`.
- Missing by design is **not** a valid answer here. While it is true that `"day_before_yest"` tells you something about the missingness of `"yest"`, it is **not** the case that you can predict, 100% of the time, whether `"yest"` will be missing just by looking at `"day_before_yest"`; this would need to be the case if `"yest"` were missing by design.

The average score on this problem was 64%.

Consider the following HTML document, which represents a webpage containing the top few songs with the most streams on Spotify today in Canada.

```
<head>
<title>3*Canada-2022-06-04</title>
</head>
<body>
<h1>Spotify Top 3 - Canada</h1>
<table>
<tr class='heading'>
<th>Rank</th>
<th>Artist(s)</th>
<th>Song</th>
</tr>
<tr class=1>
<td>1</td>
<td>Harry Styles</td>
<td>As It Was</td>
</tr>
<tr class=2>
<td>2</td>
<td>Jack Harlow</td>
<td>First Class</td>
</tr>
<tr class=3>
<td>3</td>
<td>Kendrick Lamar</td>
<td>N95</td>
</tr>
</table>
</body>
```

Suppose we define `soup` to be a `BeautifulSoup` object that is instantiated using the document above.

How many leaf nodes are there in the DOM tree of the previous document — that is, how many nodes have no children?

**Answer:** 14

There's 1 `<title>`, 1 `<h1>`, 3 `<th>`s, and 9 `<td>`s, adding up to 14.

The average score on this problem was 64%.

What does the following line of code evaluate to?

`len(soup.find_all("td"))`

**Answer:** 9

As mentioned in the solution to the part above, there are 9 `<td>` nodes, and `soup.find_all` finds them all.

The average score on this problem was 95%.

What does the following line of code evaluate to?

`soup.find("tr").get("class")`

**Answer:** `["heading"]` or `"heading"`

`soup.find("tr")` finds the first occurrence of a `<tr>` node, and `get("class")` accesses the value of its `"class"` attribute.

Note that technically the answer is `["heading"]`, but `"heading"` received full credit too.
The average score on this problem was 96%.

Complete the implementation of the function `top_nth`, which takes in a positive integer `n` and returns the **name of the n-th ranked song** in the HTML document. For instance, `top_nth(2)` should evaluate to `"First Class"` (`n=1` corresponds to the top song).

**Note:** Your implementation should work in the case that the page contains more than 3 songs.

```
def top_nth(n):
    return soup.find("tr", attrs=__(a)__).find_all("td")__(b)__
```

What goes in blank (a)?

What goes in blank (b)?

**Answer:** a) `{'class' : n}`, b) `[2].text` or `[-1].text`

The logic is to find the `<tr>` node with the correct class attribute (which we do by setting `attrs` to `{'class' : n}`), then access the text of the node's last `<td>` child (since that's where the song titles are stored).
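A runnable sketch of this logic, using the HTML document from earlier in the problem, is below. One caveat: `BeautifulSoup` stores the values of the `class` attribute as strings, so this version passes `str(n)` in the `attrs` dictionary.

```python
from bs4 import BeautifulSoup

html = """
<head><title>3*Canada-2022-06-04</title></head>
<body>
<h1>Spotify Top 3 - Canada</h1>
<table>
<tr class='heading'><th>Rank</th><th>Artist(s)</th><th>Song</th></tr>
<tr class='1'><td>1</td><td>Harry Styles</td><td>As It Was</td></tr>
<tr class='2'><td>2</td><td>Jack Harlow</td><td>First Class</td></tr>
<tr class='3'><td>3</td><td>Kendrick Lamar</td><td>N95</td></tr>
</table>
</body>
"""

soup = BeautifulSoup(html, "html.parser")

def top_nth(n):
    # Find the row whose class attribute is n, then take its last cell.
    return soup.find("tr", attrs={"class": str(n)}).find_all("td")[-1].text
```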

The average score on this problem was 66%.

Suppose we run the line of code `r = requests.get(url)`, where `url` is a string containing a URL to some online data source.

**True or False:** If `r.status_code` is `200`, then `r.text` must be a string containing the HTML source code of the site at `url`.

- True
- False

**Answer:** Option B: False

A status code of 200 means that the request succeeded. However, the response could be JSON, for example; it is not necessarily HTML.

The average score on this problem was 44%.

Each Spotify charts webpage is specific to a particular country (as different countries have different music tastes). Embedded in each charts page is a "datestring" that describes:

- the number of songs on the page,
- the country, and
- the date.

For instance, `"3*Canada-2022-06-04"` is a datestring stating that the page contains the top 3 songs in Canada on June 4th, 2022. A valid datestring contains a number, a country name, a year, a month, and a day, such that:

- the number, country name, and year are each separated by a single dash (`"-"`), asterisk (`"*"`), or space (`" "`).
- the year, month, and day are each separated by a single dash (`"-"`) only.

Below, assign `exp` to a regular expression that **extracts country names from valid datestrings**. If the datestring does not follow the above format, it should not extract anything. Example behavior is given below.

```
>>> re.findall(exp, "3*Canada-2022-06-04")
['Canada']
>>> re.findall(exp, "144 Brazil*1998-11-26")
['Brazil']
>>> re.findall(exp, "18_USA-2009-05-16")
[]
```

`exp = r"^________$"`

**Answer: ** One solution is given below.

Click this link to interact with the solution on regex101.

While grading, we were not particular about students’ treatment of uppercase characters in country names.
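One possible pattern (an illustrative solution, not necessarily the official one) uses a character class `[-* ]` for the first two separators and a capturing group for the country name:

```python
import re

# Number, separator, captured country name, separator, dash-separated date.
exp = r"^\d+[-* ]([A-Za-z ]+)[-* ]\d{4}-\d{2}-\d{2}$"

re.findall(exp, "3*Canada-2022-06-04")    # ['Canada']
re.findall(exp, "144 Brazil*1998-11-26")  # ['Brazil']
re.findall(exp, "18_USA-2009-05-16")      # [] -- "_" is not a valid separator
```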

The average score on this problem was 76%.

Consider the following regular expression.

`r"^\w{2,5}.\d*\/[^A-Z5]{1,}"`

**Select all** strings below that **contain any match** with the regular expression above.

- `"billy4/Za"`
- `"billy4/za"`
- `"DAI_s2154/pacific"`
- `"daisy/ZZZZZ"`
- `"bi_/_lly98"`
- `"!@__!14/atlantic"`

**Answer:** Option B, Option C, and Option E

Let's first dissect the regular expression into manageable groups:

- `"^"` matches the regex to its right at the start of a given string.
- `"\w{2,5}"` matches alphanumeric characters (a-Z, 0-9, and _) 2 to 5 times, inclusive. (Note that it does indeed match the underscore.)
- `"."` is a basic wildcard.
- `"\d*"` matches digits (0-9), at least 0 times.
- `"\/"` matches the `"/"` character.
- `"[^A-Z5]{1,}"` matches any character that isn't (A-Z or 5) at least once.

Using these rules, it's not hard to verify that Options B, C, and E are matches.
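These rules can be verified mechanically with `re.search`, which looks for a match anywhere in the string (though the `^` here pins each match to the start):

```python
import re

pattern = r"^\w{2,5}.\d*\/[^A-Z5]{1,}"

tests = {
    "billy4/Za": False,         # "Z" fails [^A-Z5]
    "billy4/za": True,
    "DAI_s2154/pacific": True,
    "daisy/ZZZZZ": False,       # only capital letters after the slash
    "bi_/_lly98": True,         # "." consumes "_", then "/" and "_" match
    "!@__!14/atlantic": False,  # "!" is not a \w character
}

results = {s: re.search(pattern, s) is not None for s in tests}
```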

The average score on this problem was 85%.

Consider the following string and regular expressions:

```
song_str = "doja cat you right"

exp_1 = r"\b\w+\b"  # \b stands for word boundary
exp_2 = r" \w+"
exp_3 = r" \w+ "
```

What does `len(re.findall(exp_1, song_str))` evaluate to?

What does `len(re.findall(exp_2, song_str))` evaluate to?

What does `len(re.findall(exp_3, song_str))` evaluate to?

**Answer:** See below.

- `"\b"` matches "word boundaries", which are any locations that separate words. As such, there are 4 matches — `["doja", "cat", "you", "right"]`. Thus, the answer is 4.
- The 3 matches are `[" cat", " you", " right"]`. Thus, the answer is 3.
- This was quite tricky! The key is remembering that `re.findall` only finds **non-overlapping matches** (if you look at the solutions to the above two parts, none of the matches overlapped). Reading from left to right, there is only a single non-overlapping match: `" cat "`. Sure, `" you "` also matches the pattern, but since the space after `"cat"` was already "found" by `re.findall`, it cannot be included in any future matches. Thus, the answer is 1.
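Running the three calls confirms the counts:

```python
import re

song_str = "doja cat you right"

m1 = re.findall(r"\b\w+\b", song_str)  # ['doja', 'cat', 'you', 'right']
m2 = re.findall(r" \w+", song_str)     # [' cat', ' you', ' right']
m3 = re.findall(r" \w+ ", song_str)    # [' cat '] -- ' you ' overlaps
```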

The average score on this problem was 60%.

The DataFrame below contains a corpus of four song titles, labeled from 0 to 3.

What is the TF-IDF of the word `"hate"` in Song 0's title? Use base 2 in your logarithm, and give your answer as a simplified fraction.

**Answer:** \frac{1}{6}

There are 12 words in Song 0's title, and 2 of them are `"hate"`, so the term frequency of `"hate"` in Song 0's title is \frac{2}{12} = \frac{1}{6}.

There are 4 documents total, and 2 of them contain `"hate"` (Song 0's title and Song 3's title), so the inverse document frequency of `"hate"` in the corpus is \log_2 \left( \frac{4}{2} \right) = \log_2 (2) = 1.

Then, the TF-IDF of `"hate"` in Song 0's title is

\text{TF-IDF} = \frac{1}{6} \cdot 1 = \frac{1}{6}
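The arithmetic checks out in code, using the counts stated above (2 occurrences out of 12 words, and 2 of 4 documents containing the word):

```python
from math import log2

tf = 2 / 12        # term frequency of "hate" in Song 0's title
idf = log2(4 / 2)  # inverse document frequency of "hate" in the corpus
tfidf = tf * idf   # 1/6
```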

The average score on this problem was 86%.

Which word in Song 0's title has the highest TF-IDF?

- `"i"`
- `"hate"`
- `"you"`
- `"love"`
- `"that"`
- Two or more words are tied for the highest TF-IDF in Song 0's title

**Answer:** Option A: `"i"`

It was not necessary to compute the TF-IDFs of all words in Song 0's title to determine the answer. \text{tfidf}(t, d) is high when t occurs often in d but rarely overall. That is the case with `"i"` — it is the most common word in Song 0's title (with 4 appearances), but it does not appear in any other document. As such, it must be the word with the highest TF-IDF in Song 0's title.

The average score on this problem was 84%.

Let \text{tfidf}(t, d) be the TF-IDF of term t in document d, and let \text{bow}(t, d) be the number of occurrences of term t in document d.

**Select all** correct answers below.

- If \text{tfidf}(t, d) = 0, then \text{bow}(t, d) = 0.
- If \text{bow}(t, d) = 0, then \text{tfidf}(t, d) = 0.
- Neither of the above statements are necessarily true.

**Answer: ** Option B

Recall that \text{tfidf}(t, d) = \text{tf}(t, d) \cdot \text{idf}(t), and note that \text{tf}(t, d) is \text{bow}(t, d) divided by the total number of terms in d. Thus, \text{tfidf}(t, d) is 0 if either \text{bow}(t, d) = 0 or \text{idf}(t) = 0.

So, if \text{bow}(t, d) = 0, then \text{tf}(t, d) = 0 and \text{tfidf}(t, d) = 0, so the second option is true. However, if \text{tfidf}(t, d) = 0, it could be the case that \text{bow}(t, d) > 0 and \text{idf}(t) = 0 (which happens when term t is in every document), so the first option is not necessarily true.
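A tiny sketch with a hypothetical corpus illustrates the second case: a term that appears in every document has \text{idf}(t) = 0, so its TF-IDF is 0 even though its bag-of-words count is positive.

```python
from math import log2

# hypothetical corpus: "the" appears in every document
docs = [["the", "cat"], ["the", "dog"], ["the", "bird"]]
t, d = "the", docs[0]

bow = d.count(t)                                       # 1, so bow > 0
tf = bow / len(d)
idf = log2(len(docs) / sum(t in doc for doc in docs))  # log2(3/3) = 0
tfidf = tf * idf                                       # 0.0
```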

The average score on this problem was 91%.

Below, we’ve encoded the corpus from the previous page using the bag-of-words model.

Note that in the above DataFrame, each row has been normalized to have a length of 1 (i.e. |\vec{v}| = 1 for all four row vectors).

Which song’s title has the highest cosine similarity with Song 0’s title?

Song 1

Song 2

Song 3

**Answer: ** Option B: Song 2

Recall, the cosine similarity between two vectors \vec{a}, \vec{b} is computed as

\cos \theta = \frac{\vec{a} \cdot \vec{b}}{| \vec{a} | | \vec{b}|}

We are told that each row vector is already normalized to have a length of 1, so to compute the similarity between two songs’ titles, we can compute dot products directly.

Song 0 and Song 1: 0.47 \cdot 0.76

Song 0 and Song 2: 0.47 \cdot 0.58 + 0.71 \cdot 0.58

Song 0 and Song 3: 0.47 \cdot 0.71

Without using a calculator (which students did not have access to during the exam), it is clear that the dot product between Song 0’s title and Song 2’s title is the highest, hence Song 2’s title is the most similar to Song 0’s.
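The general recipe can be sketched with numpy; the count vectors below are hypothetical, not the actual corpus. Once each row is normalized to unit length, cosine similarity reduces to a dot product:

```python
import numpy as np

# hypothetical bag-of-words count rows (not the actual corpus)
rows = np.array([[1.0, 2.0, 0.0],
                 [2.0, 0.0, 1.0]])

# normalize each row to unit length
unit = rows / np.linalg.norm(rows, axis=1, keepdims=True)

# dot product of unit vectors equals the full cosine similarity formula
sim = unit[0] @ unit[1]
cos = rows[0] @ rows[1] / (np.linalg.norm(rows[0]) * np.linalg.norm(rows[1]))
```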

The average score on this problem was 87%.

Recall from Project 4 that a trigram is an N-Gram with N=3. Below, we instantiate a trigram language model with a list of 12 tokens.

```
# \x02 means start of paragraph, \x03 means end of paragraph
tokens = ["\x02", "hi", "my", "name", "is", "what",
          "my", "name", "is", "who", "my", "\x03"]
lm = NGramLM(3, tokens)
```

What does `lm.probability(("name", "is", "what", "my"))` evaluate to? In other words, what is P(\text{name is what my})? **Show your work in the box below, and give your final answer as a simplified fraction in the box at the bottom of the page.**

*Hint: Do not perform any unnecessary computation — only compute the conditional probabilities that are needed to evaluate P(\text{name is what my}).*

**Answer: ** \frac{1}{12}

Since we are using a trigram model, to compute the conditional probability of a token, we must condition on the prior two tokens. For the first and second tokens, `"name"` and `"is"`, there aren’t two prior tokens to condition on, so we instead look at unigrams and bigrams.

First, we decompose P(\text{name is what my}):

\begin{aligned} P(\text{name is what my}) &= P(\text{name}) \cdot P(\text{is $|$ name}) \cdot P(\text{what $|$ name is}) \cdot P(\text{my $|$ is what}) \end{aligned}

- P(\text{name}) is \frac{2}{12}, since there are 12 total tokens and 2 of them are equal to `"name"`.
- P(\text{is $|$ name}) is \frac{2}{2} = 1, because of the 2 times `"name"` appears, `"is"` always follows it.
- P(\text{what $|$ name is}) is \frac{1}{2}, because of the 2 times `"name is"` appears, `"what"` follows it once (`"who"` follows it the other time).
- P(\text{my $|$ is what}) is \frac{1}{1} = 1, because `"is what"` only appeared once, and `"my"` appeared right after it.

Thus: \begin{aligned} P(\text{name is what my}) &= P(\text{name}) \cdot P(\text{is $|$ name}) \cdot P(\text{what $|$ name is}) \cdot P(\text{my $|$ is what}) \\ &= \frac{2}{12} \cdot 1 \cdot \frac{1}{2} \cdot 1 \\ &= \frac{1}{12} \end{aligned}
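The same answer falls out of raw n-gram counts; here is a minimal sketch that does not need the `NGramLM` class:

```python
from collections import Counter

# same token list as above
tokens = ["\x02", "hi", "my", "name", "is", "what",
          "my", "name", "is", "who", "my", "\x03"]

uni = Counter(tokens)
bi = Counter(zip(tokens, tokens[1:]))
tri = Counter(zip(tokens, tokens[1:], tokens[2:]))

# P(name) * P(is | name) * P(what | name is) * P(my | is what)
p = (uni["name"] / len(tokens)
     * bi[("name", "is")] / uni["name"]
     * tri[("name", "is", "what")] / bi[("name", "is")]
     * tri[("is", "what", "my")] / bi[("is", "what")])
# p == 1/12
```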

The average score on this problem was 57%.

The DataFrame `new_releases` contains the following information for songs that were recently released:

- `"genre"`: the genre of the song (one of the following 5 possibilities: `"Hip-Hop/Rap"`, `"Pop"`, `"Country"`, `"Alternative"`, or `"International"`)
- `"rec_label"`: the record label of the artist who released the song (one of the following 4 possibilities: `"EMI"`, `"SME"`, `"UMG"`, or `"WMG"`)
- `"danceability"`: how easy the song is to dance to, according to the Spotify API (between 0 and 1)
- `"speechiness"`: what proportion of the song is made up of spoken words, according to the Spotify API (between 0 and 1)
- `"first_month"`: the number of total streams the song had on Spotify in the first month it was released

The first few rows of `new_releases` are shown below (though `new_releases` has many more rows than are shown below).

We decide to build a linear regression model that predicts `"first_month"` given all other information. To start, we conduct a train-test split, splitting `new_releases` into `X_train`, `X_test`, `y_train`, and `y_test`.

We then fit two linear models (with intercept terms) to the training data:

- Model 1 (`lr_one`): Uses `"danceability"` only.
- Model 2 (`lr_two`): Uses `"danceability"` and `"speechiness"` only.

**True or False:** If `lr_one.score(X_train, y_train)` is much lower than `lr_one.score(X_test, y_test)`, it is likely that `lr_one` overfit to the training data.

True

False

**Answer: ** False

For regression models, the `score` method computes R^2. A higher R^2 indicates a better linear fit. If the training R^2 is much greater than the testing R^2, that is an indication of overfitting. However, that is not the case here — here, we were asked what happens if the training R^2 is much less than the testing R^2, which is not an indication of overfitting (it is an indication that your model, luckily, performs much better on unseen data than it does on observed data!).

The average score on this problem was 81%.

Consider the following outputs.

```
>>> X_train.shape[0]
50
>>> np.sum((y_train - lr_two.predict(X_train)) ** 2)
500000 # five hundred thousand
```

What is Model 2 (`lr_two`)’s training RMSE? Give your answer as an integer.

**Answer: ** 100

We are given that there are n=50 data points, and that the sum of squared errors \sum_{i = 1}^n (y_i - H(x_i))^2 is 500{,}000. Then:

\begin{aligned} \text{RMSE} &= \sqrt{\frac{1}{n} \sum_{i = 1}^n (y_i - H(x_i))^2} \\ &= \sqrt{\frac{1}{50} \cdot 500{,}000} \\ &= \sqrt{10{,}000} \\ &= 100\end{aligned}
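In code, using the two outputs shown above:

```python
import numpy as np

n = 50          # X_train.shape[0]
sse = 500_000   # np.sum((y_train - lr_two.predict(X_train)) ** 2)

rmse = np.sqrt(sse / n)  # sqrt(10000) = 100.0
```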

The average score on this problem was 87%.

Now, suppose we fit two more linear models (with intercept terms) to the training data:

- Model 3 (`lr_drop`): Uses `"danceability"` and `"speechiness"` as-is, and one-hot encodes `"genre"` and `"rec_label"`, using `OneHotEncoder(drop="first")`.
- Model 4 (`lr_no_drop`): Uses `"danceability"` and `"speechiness"` as-is, and one-hot encodes `"genre"` and `"rec_label"`, using `OneHotEncoder()`.

Note that the only difference between Model 3 and Model 4 is the fact that Model 3 uses `drop="first"`.

How many **one-hot encoded** columns are used in each model? In other words, how many **binary** columns are used in each model? Give both answers as integers.

*Hint: Make sure to look closely at the description of `new_releases` at the top of the previous page, and don’t include the already-quantitative features.*

number of one-hot encoded columns in Model 3 (`lr_drop`) =

number of one-hot encoded columns in Model 4 (`lr_no_drop`) =

**Answer: ** 7 and 9

There are 5 unique values of `"genre"` and 4 unique values of `"rec_label"`, so if we create a single one-hot encoded column for each one, there would be 5 + 4 = 9 one-hot encoded columns (which there are in `lr_no_drop`).

If we drop one one-hot encoded column per category, which is what `drop="first"` does, then we only have (5 - 1) + (4 - 1) = 7 one-hot encoded columns (which there are in `lr_drop`).

The average score on this problem was 75%.

Fill in the blank:

`lr_drop.score(X_test, y_test)` is _____ `lr_no_drop.score(X_test, y_test)`.

likely greater than

roughly equal to

likely less than

**Answer: ** Option B: roughly equal to

Multicollinearity does not impact a linear model’s ability to make predictions (even on unseen data); it only impacts the interpretability of its coefficients. As such, the test-set R^2 of Model 3 and Model 4 will be roughly the same.

The average score on this problem was 70%.

Recall, in Model 4 (`lr_no_drop`) we one-hot encoded `"genre"` and `"rec_label"`, and did not use `drop="first"` when instantiating our `OneHotEncoder`.

Suppose we are given the following coefficients in Model 4:

- The coefficient on `"genre_Pop"` is 2000.
- The coefficient on `"genre_Country"` is 1000.
- The coefficient on `"danceability"` is 10^6 = 1{,}000{,}000.

Daisy and Billy are two artists signed to the same `"rec_label"` who each just released a new song with the same `"speechiness"`. Daisy is a `"Pop"` artist while Billy is a `"Country"` artist.

Model 4 predicted that Daisy’s song and Billy’s song will have the same `"first_month"` streams. What is the **absolute difference** between Daisy’s song’s `"danceability"` and Billy’s song’s `"danceability"`? Give your answer as a simplified fraction.

**Answer: ** \frac{1}{1000}

“My favorite problem on the exam!” – Suraj

Model 4 is made up of 11 features, i.e. 11 columns.

- 4 of the columns correspond to the different values of `"rec_label"`. Since Daisy and Billy have the same `"rec_label"`, their values in these four columns are all the same.
- One of the columns corresponds to `"speechiness"`. Since Daisy’s song and Billy’s song have the same `"speechiness"`, their values in this column are the same.
- 5 of the columns correspond to the different values of `"genre"`. Daisy is a `"Pop"` artist, so she has a 1 in the `"genre_Pop"` column and a 0 in the other four `"genre_"` columns, and similarly Billy has a 1 in the `"genre_Country"` column and 0s in the others.
- One of the columns corresponds to `"danceability"`, and Daisy and Billy have different quantitative values in this column.


The key is in recognizing that all features in Daisy’s prediction and Billy’s prediction are the same, other than the values in the `"genre_Pop"`, `"genre_Country"`, and `"danceability"` columns. Let’s let d_1 be Daisy’s song’s `"danceability"`, and let d_2 be Billy’s song’s `"danceability"`. Then:

\begin{aligned} 2000 + 10^{6} \cdot d_1 &= 1000 + 10^{6} \cdot d_2 \\ 1000 &= 10^{6} (d_2 - d_1) \\ \frac{1}{1000} &= d_2 - d_1\end{aligned}

Thus, the absolute difference between their songs’ `"danceability"`s is \frac{1}{1000}.
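The algebra can be double-checked with exact arithmetic; the variable names below are just stand-ins for the coefficient values given in the problem:

```python
from fractions import Fraction

# coefficients given in the problem
genre_pop, genre_country = 2000, 1000
danceability_coef = 10**6

# equal predictions: genre_pop + danceability_coef * d1
#                 == genre_country + danceability_coef * d2
# solving gives |d2 - d1| = (genre_pop - genre_country) / danceability_coef
diff = Fraction(genre_pop - genre_country, danceability_coef)  # 1/1000
```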

The average score on this problem was 70%.

Suppose we build a binary classifier that uses a song’s `"track_name"` and `"artist_names"` to predict whether its genre is `"Hip-Hop/Rap"` (1) or not (0).

For our classifier, we decide to use a brand-new model built into `sklearn` called the `BillyClassifier`. A `BillyClassifier` instance has three hyperparameters that we’d like to tune. Below, we show a dictionary containing the values for each hyperparameter that we’d like to try:

```
hyp_grid = {
    "radius": [0.1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100],   # 12 total
    "inflection": [-5, -4, -3, -2, -1, 0, 1, 2, 3, 4],     # 10 total
    "color": ["red", "yellow", "green", "blue", "purple"]  # 5 total
}
```

To find the best combination of hyperparameters for our `BillyClassifier`, we first conduct a train-test split, which we use to create a training set with 800 rows. We then use `GridSearchCV` to conduct k-fold cross-validation for each combination of hyperparameters in `hyp_grid`, with k=4.

When we call `GridSearchCV`, how many times is a `BillyClassifier` instance trained in total? Give your answer as an integer.

**Answer: ** 2400

There are 12 \cdot 10 \cdot 5 = 600 combinations of hyperparameters. For each combination of hyperparameters, we will train a `BillyClassifier` with that combination of hyperparameters k = 4 times. So, the total number of `BillyClassifier` instances that will be trained is 600 \cdot 4 = 2400.
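The count can be computed directly from `hyp_grid` itself:

```python
from math import prod

hyp_grid = {
    "radius": [0.1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100],
    "inflection": [-5, -4, -3, -2, -1, 0, 1, 2, 3, 4],
    "color": ["red", "yellow", "green", "blue", "purple"],
}

k = 4  # number of folds
n_fits = prod(len(v) for v in hyp_grid.values()) * k  # 600 * 4 = 2400
```

(As a side note, `GridSearchCV`’s default `refit=True` also fits one final model on the full training set after the search, which this count does not include.)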

The average score on this problem was 73%.

In each of the 4 folds of the data, how large is the training set, and how large is the validation set? Give your answers as integers.

size of training set =

size of validation set =

**Answer: ** 600, 200

Since we performed k=4 cross-validation, we must divide the training set into four disjoint groups each of the same size. \frac{800}{4} = 200, so each group is of size 200. Each time we perform cross validation, one group is used for validation, and the other three are used for training, so the validation set size is 200 and the training set size is 200 \cdot 3 = 600.
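The fold sizes can be confirmed with `sklearn`’s `KFold` on a stand-in array of 800 rows:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(800).reshape(-1, 1)  # stand-in for the 800-row training set

# (training size, validation size) for each of the 4 folds
sizes = [(len(tr), len(va)) for tr, va in KFold(n_splits=4).split(X)]
# every fold: (600, 200)
```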

The average score on this problem was 77%.

Suppose that after fitting a `GridSearchCV` instance, its `best_params_` attribute is

`{"radius": 8, "inflection": 4, "color": "blue"}`

Select all true statements below.

The specific combination of hyperparameters in `best_params_` had the highest average training accuracy among all combinations of hyperparameters in `hyp_grid`.

The specific combination of hyperparameters in `best_params_` had the highest average validation accuracy among all combinations of hyperparameters in `hyp_grid`.

The specific combination of hyperparameters in `best_params_` had the highest training accuracy among all combinations of hyperparameters in `hyp_grid`, in each of the 4 folds of the training data.

The specific combination of hyperparameters in `best_params_` had the highest validation accuracy among all combinations of hyperparameters in `hyp_grid`, in each of the 4 folds of the training data.

A `BillyClassifier` that is fit using the specific combination of hyperparameters in `best_params_` is guaranteed to have the best accuracy on unseen testing data among all combinations of hyperparameters in `hyp_grid`.

**Answer: ** Option B

When performing cross-validation, we select the combination of hyperparameters that had the highest **average validation accuracy** across all four folds of the data. That is, by definition, how `best_params_` came to be. None of the other options are guaranteed to be true.

The average score on this problem was 82%.

After fitting our `BillyClassifier` from the previous question, we use it to make predictions on an unseen test set. Our results are summarized in the following confusion matrix.

What is the recall of our classifier? Give your answer as a fraction (it does not need to be simplified).

**Answer: ** \frac{35}{57}

There are 105 true positives and 66 false negatives. Hence, the recall is \frac{105}{105 + 66} = \frac{105}{171} = \frac{35}{57}.
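In code, with exact fractions:

```python
from fractions import Fraction

tp, fn = 105, 66  # counts read off the confusion matrix
recall = Fraction(tp, tp + fn)  # 105/171, which simplifies to 35/57
```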

The average score on this problem was 89%.

The accuracy of our classifier is \frac{69}{117}. How many **true
negatives** did our classifier have? Give your answer as an
integer.

**Answer: ** 33

Let x be the number of true negatives. The number of correctly classified data points is 105 + x, and the total number of data points is 105 + 30 + 66 + x = 201 + x. Hence, this boils down to solving for x in \frac{69}{117} = \frac{105 + x}{201 + x}.

It may be tempting to cross-multiply here, but that’s not necessary (in fact, we picked the numbers specifically so you would not have to)! Multiply \frac{69}{117} by \frac{2}{2} to yield \frac{138}{234}. Then, conveniently, setting x = 33 in \frac{105 + x}{201 + x} also yields \frac{138}{234}, so x = 33 and hence the number of true negatives our classifier has is 33.
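A brute-force search with exact fractions finds the same x:

```python
from fractions import Fraction

acc = Fraction(69, 117)

# find the number of true negatives x with (105 + x) / (201 + x) == accuracy
x = next(x for x in range(1_000) if Fraction(105 + x, 201 + x) == acc)
# x == 33
```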

The average score on this problem was 84%.

True or False: In order for a binary classifier’s precision and recall to be equal, the number of mistakes it makes must be an even number.

True

False

**Answer: ** True

Remember that \text{precision} = \frac{TP}{TP + FP} and \text{recall} = \frac{TP}{TP + FN}. In order for precision to be the same as recall, it must be the case that FP = FN, i.e. that our classifier makes the same number of false positives and false negatives. The only kinds of “errors" or “mistakes" a classifier can make are false positives and false negatives; thus, we must have

\text{mistakes} = FP + FN = FP + FP = 2 \cdot FP

2 times any integer must be an even integer, so the number of mistakes must be even.
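A small numeric illustration with hypothetical counts (not taken from any problem above):

```python
from fractions import Fraction

# hypothetical counts where precision equals recall, forcing FP == FN
tp, fp, fn = 10, 4, 4

precision = Fraction(tp, tp + fp)
recall = Fraction(tp, tp + fn)
assert precision == recall

mistakes = fp + fn  # 2 * FP, necessarily even
```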

The average score on this problem was 100%.

Suppose we are building a classifier that listens to an audio source (say, from your phone’s microphone) and predicts whether or not it is Soulja Boy’s 2008 classic “Kiss Me thru the Phone." Our classifier is pretty good at detecting when the input stream is “Kiss Me thru the Phone", but it often incorrectly predicts that similar-sounding songs are also “Kiss Me thru the Phone."

Complete the sentence: Our classifier has...

low precision and low recall.

low precision and high recall.

high precision and low recall.

high precision and high recall.

**Answer: ** Option B: low precision and high
recall.

Our classifier is good at identifying when the input stream is “Kiss Me thru the Phone", i.e. it correctly identifies most of the actual positives. This means it has high recall.

Since our classifier makes many false positive predictions — in other words, it often incorrectly predicts “Kiss Me thru the Phone" when that’s not what the input stream is — it has many false positives, so its precision is low.

Thus, our classifier has low precision and high recall.

The average score on this problem was 91%.