Instructor(s): Suraj Rampure
This exam was administered in-person. The exam was closed-notes, except students were allowed to bring a single two-sided cheat sheet. No calculators were allowed. Students had 50 minutes to take this exam.
Welcome to the Midterm Exam for DSC 80 in Spring 2022!
Throughout this exam, we will be using the following DataFrame students, which contains various information about high school students and the university/universities they applied to.
The columns are:
'Name' (str): the name of the student.
'High School' (str): the high school that the student attended.
'Email' (str): the email of the student.
'GPA' (float): the GPA of the student.
'APs' (int): the number of AP exams that the student took.
'University' (str): the name of the university that the student applied to.
'Admit' (str): the acceptance status of the student (where 'Y' denotes that they were accepted to the university and 'N' denotes that they were not).
The rows of students are arranged in no particular order. The first eight rows of students are shown above (though students has many more rows than pictured here).
What kind of variable is "APs"?
Quantitative discrete
Quantitative continuous
Qualitative ordinal
Qualitative nominal
Answer: Quantitative discrete
Since we can count the number of "APs" a student is enrolled in, it is clearly a quantitative variable. Also note that the number of "APs" a student is enrolled in has to be a whole number, hence it is a quantitative discrete variable.
The average score on this problem was 97%.
Because we can sort universities by admit rate, "University" is a qualitative ordinal variable.
True
False
Answer: False
In order for a categorical variable to be ordinal, there must be some inherent order to the categories. We can always sort a categorical column based on some other column, but that doesn’t make the categories themselves ordinal. For instance, we can sort colors based on how much King Triton likes them, but that doesn’t make colors ordinal!
The average score on this problem was 79%.
Write a single line of code that evaluates to the most common value in the "High School" column of students, as a string. Assume there are no ties.
Answer:
students["High School"].value_counts().idxmax()
or
students.groupby("High School")["Name"].count().sort_values().index[-1]
The average score on this problem was 89%.
Fill in the blank so that the result evaluates to a Series indexed by "Email" that contains a list of the universities that each student was admitted to. If a student wasn’t admitted to any universities, they should have an empty list.
students.groupby("Email").apply(_____)
What goes in the blank?
Answer:
lambda df: df.loc[df["Admit"] == "Y", "University"].tolist()
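Putting the answer into the provided skeleton gives the full call below (a sketch; the name admitted is just for illustration, and it assumes the students DataFrame described at the start of the exam):

# Series indexed by "Email"; students with no admits get an empty list,
# since .tolist() on an empty selection returns []
admitted = students.groupby("Email").apply(
    lambda df: df.loc[df["Admit"] == "Y", "University"].tolist()
)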
The average score on this problem was 53%.
Which of the following blocks of code correctly assigns max_AP to the maximum number of APs taken by a student who was rejected by UC San Diego?
Option 1:
= students["Admit"] == "N"
cond1 = students["University"] == "UC San Diego"
cond2 = students.loc[cond1 & cond2, "APs"].sort_values().iloc[-1] max_AP
Option 2:
= students["Admit"] == "N"
cond1 = students["University"] == "UC San Diego"
cond2 = students.groupby(["University", "Admit"]).max().reset_index()
d3 = d3.loc[cond1 & cond2, "APs"].iloc[0] max_AP
Option 3:
= students.pivot_table(index="Admit",
p ="University",
columns="APs",
values="max")
aggfunc= p.loc["N", "UC San Diego"] max_AP
Option 4:
# .last() returns the element at the end of a Series it is called on
groups = students.sort_values(["APs", "Admit"]).groupby("University")
max_AP = groups["APs"].last()["UC San Diego"]
Select all that apply. There is at least one correct option.
Option 1
Option 2
Option 3
Option 4
Answer: Option 1 and Option 3
Option 1 works correctly; it is probably the most straightforward way of answering the question. cond1 is True for all rows in which students were rejected, and cond2 is True for all rows in which students applied to UCSD. As such, students.loc[cond1 & cond2] contains only the rows where students were rejected from UCSD. Then, students.loc[cond1 & cond2, "APs"].sort_values() sorts by the number of "APs" taken in increasing order, and .iloc[-1] gets the largest number of "APs" taken.
Option 2 doesn’t work because the lengths of cond1 and cond2 are not the same as the length of d3, so this causes an error.
Option 3 works correctly. For each combination of "Admit" status ("Y", "N", "W") and "University" (including UC San Diego), it computes the max number of "APs". The usage of .loc["N", "UC San Diego"] is correct too.
Option 4 doesn’t work. It currently returns the maximum number of "APs" taken by someone who applied to UC San Diego; it does not factor in whether they were admitted, rejected, or waitlisted.
The average score on this problem was 85%.
Currently, students has a lot of repeated information — for instance, if a student applied to 10 universities, their GPA appears 10 times in students.
We want to generate a DataFrame that contains a single row for each student, indexed by "Email", that contains their "Name", "High School", "GPA", and "APs".
One attempt to create such a DataFrame is below.
"Email").aggregate({"Name": "max",
students.groupby("High School": "mean",
"GPA": "mean",
"APs": "max"})
There is exactly one issue with the line of code above. In one sentence, explain what needs to be changed about the line of code above so that the desired DataFrame is created.
Answer: The problem right now is that aggregating "High School" by mean doesn’t work, since you can’t aggregate a column with strings using "mean". Thus, changing it to something that works for strings, like "max" or "min", would fix the issue.
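For concreteness, a corrected version might look like the sketch below. Using "max" for "High School" is just one aggregation that works on strings; since every row for a given student has the same name, high school, GPA, and AP count, any valid aggregation returns that student's actual value.

students.groupby("Email").aggregate({"Name": "max",
                                     "High School": "max",
                                     "GPA": "mean",
                                     "APs": "max"})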
The average score on this problem was 79%.
Consider the following snippet of code.
pivoted = students.assign(Admit=students["Admit"] == "Y") \
                  .pivot_table(index="High School",
                               columns="University",
                               values="Admit",
                               aggfunc="sum")
Some of the rows and columns of pivoted are shown below.
No students from Warren High were admitted to Columbia or Stanford. However, pivoted.loc["Warren High", "Columbia"] and pivoted.loc["Warren High", "Stanford"] evaluate to different values. What is the reason for this difference?
Some students from Warren High applied to Stanford, and some others applied to Columbia, but none applied to both.
Some students from Warren High applied to Stanford but none applied to Columbia.
Some students from Warren High applied to Columbia but none applied to Stanford.
The students from Warren High that applied to both Columbia and Stanford were all rejected from Stanford, but at least one was admitted to Columbia.
When using pivot_table, pandas was not able to sum strings of the form "Y", "N", and "W", so the values in pivoted are unreliable.
Answer: Option 3
pivoted.loc["Warren High", "Stanford"]
is
NaN
because there were no rows in students
in
which the "High School"
was "Warren High"
and
the "University"
was "Stanford"
, because
nobody from Warren High applied to Stanford. However,
pivoted.loc["Warren High", "Columbia"]
is not
NaN
because there was at least one row in
students
in which the "High School"
was
"Warren High"
and the "University"
was
"Columbia"
. This means that at least one student from
Warren High applied to Columbia.
Option 3 is the only option consistent with this logic.
The average score on this problem was 93%.
Define small_students to be the DataFrame with 8 rows and 2 columns shown directly below, and define districts to be the DataFrame with 3 rows and 2 columns shown below small_students.
Consider the DataFrame merged, defined below.
merged = small_students.merge(districts,
                              left_on="High School",
                              right_on="school",
                              how="outer")
How many total NaN values does merged contain? Give your answer as an integer.
Answer: 4
merged is shown below.
The average score on this problem was 13%.
Consider the DataFrame concatted, defined below.
concatted = pd.concat([small_students, districts], axis=1)
How many total NaN values does concatted contain? Give your answer as an integer.
Hint: Draw out what concatted looks like. Also, remember that the default axis argument to pd.concat is axis=0.
Answer: 10
concatted is shown below.
The average score on this problem was 76%.
Let’s consider admissions at UC San Diego and UC Santa Barbara for two high schools in particular.
For instance, the above table tells us that 200 students from La Jolla Private applied to UC San Diego, and 50 were admitted.
What is the largest possible integer value of N such that:
UC Santa Barbara has a strictly higher admit rate for both La Jolla Private and Sun God Memorial High individually, but
UC San Diego has a strictly higher admit rate overall?
Answer: 124
Let’s consider the two conditions separately.
First, UC Santa Barbara needs to have a higher admit rate for both high schools. This is already true for La Jolla Private (\frac{100}{300} > \frac{50}{200}); for Sun God Memorial High, we just need to ensure that \frac{N}{150} > \frac{200}{300}. This means that N > 100.
Now, UC San Diego needs to have a higher admit rate overall. The UC San Diego admit rate is \frac{50+200}{200+300} = \frac{250}{500} = \frac{1}{2}, while the UC Santa Barbara admit rate is \frac{100 + N}{450}. This means that we must require that \frac{1}{2} = \frac{225}{450} > \frac{100+N}{450}. This means that 225 > 100 + N, i.e. that N < 125.
So there are two conditions on N: N > 100 and N < 125. The largest integer N that satisfies these conditions is N=124, which is the final answer.
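As a quick sanity check of this arithmetic, the short sketch below (using the counts quoted above) confirms that N = 124 satisfies both conditions while N = 125 does not:

def both_conditions_hold(N):
    # Condition 1: UC Santa Barbara's admit rate is strictly higher at both high schools
    per_school = (100 / 300 > 50 / 200) and (N / 150 > 200 / 300)
    # Condition 2: UC San Diego's admit rate is strictly higher overall
    overall = (50 + 200) / (200 + 300) > (100 + N) / (300 + 150)
    return per_school and overall

print(both_conditions_hold(124), both_conditions_hold(125))  # True False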
The average score on this problem was 72%.
Valentina has over 1000 students. When a student signs up for Valentina’s college counseling, they must provide a variety of information about themselves and their parents. Valentina keeps track of all of this information in a table, with one row per student. (Note that this is not the students DataFrame from earlier in the exam.)
Valentina asks each of her students for the university that their parents attended for undergrad. The "father’s university" column of Valentina’s table contains missing values. Valentina believes that values in this column are missing because not all students’ fathers attended university.
According to Valentina’s interpretation, what is the missingness mechanism of "father’s university"?
Missing by design
Not missing at random
Missing at random
Missing completely at random
Answer: Not missing at random
Per Valentina’s interpretation, the reason for the missingness in the "father’s university" column is that not all fathers attended university, and hence they opted not to fill out the survey. Here, the likelihood that values are missing depends on the values themselves, so the data are NMAR.
The average score on this problem was 81%.
The "mother’s phone number"
column of Valentina’s table
contains missing values. Valentina knows for a fact that all of her
students’ mothers have phone numbers. She looks at her dataset and draws
the following visualization, relating the missingness of
"mother’s phone number"
to "district"
(the
school district that the student’s family lives in):
Given just the above information, what is the missingness mechanism of "mother’s phone number"?
Missing by design
Not missing at random
Missing at random
Missing completely at random
Answer: Missing at random
Here, the distribution of "district" is different when "mother’s phone number" is missing and when "mother’s phone number" is present (the two distributions plotted look quite different). As such, we conclude that the missingness of "mother’s phone number" depends on "district", and hence the data are MAR.
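In code, the comparison in Valentina’s visualization could be reproduced with something like the sketch below. Here, counseling is a hypothetical name for Valentina’s table; the column names are taken from the problem statement.

# Distribution of "district" when "mother’s phone number" is missing
missing = counseling["mother’s phone number"].isna()
missing_dist = counseling.loc[missing, "district"].value_counts(normalize=True)

# Distribution of "district" when "mother’s phone number" is present
present_dist = counseling.loc[~missing, "district"].value_counts(normalize=True)

# If these two distributions look very different, the missingness of
# "mother’s phone number" depends on "district", i.e. the data are MAR.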
The average score on this problem was 88%.
UC Hicago, a new private campus of the UC, has an annual tuition of $80,000. UC Hicago states that if an admitted student’s parents’ combined income is under $80,000, they will provide that student a scholarship for the difference.
Valentina keeps track of each student’s parents’ incomes along with the scholarship that UC Hicago promises them in a table. The first few rows of her table are shown below.
Given just the above information, what is the missingness mechanism of "scholarship"?
Missing by design
Not missing at random
Missing at random
Missing completely at random
Answer: Missing by design
Here, the data are missing by design because you can 100% of the time predict whether a "scholarship" will be missing by looking at the "mother’s income" and "father’s income" columns. If the sum of "mother’s income" and "father’s income" is at least $80,000, "scholarship" will be missing; otherwise, it will not be missing.
The average score on this problem was 88%.
Consider the following pair of hypotheses.
Null hypothesis: The average GPA of UC San Diego admits from La Jolla Private is equal to the average GPA of UC San Diego admits from all schools.
Alternative hypothesis: The average GPA of UC San Diego admits from La Jolla Private is less than the average GPA of UC San Diego admits from all schools.
What type of test is this?
Hypothesis test
Permutation test
Answer: Hypothesis test
Here, we are asking if one sample is a random sample of a known population. While this may seem like a permutation test in which we compare two samples, there is really only one sample here — the GPAs of admits from La Jolla Private. To simulate new data, we sample from the distribution of all GPAs.
Note that this is similar to the bill lengths on Torgersen Island example from Lecture 6 (in Spring 2022, at least).
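A simulation for this test might look like the sketch below (all names here are hypothetical; admits is assumed to be a DataFrame containing only UC San Diego admits, with "High School" and "GPA" columns):

import numpy as np

observed = admits.loc[admits["High School"] == "La Jolla Private", "GPA"].mean()
n = (admits["High School"] == "La Jolla Private").sum()

# Repeatedly draw samples of size n from the GPAs of all admits
simulated = np.array([
    admits["GPA"].sample(n).mean() for _ in range(10_000)
])

# Small p-values favor the alternative (La Jolla Private admits have lower GPAs)
p_value = (simulated <= observed).mean()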
The average score on this problem was 75%.
Which of the following test statistics would be appropriate to use in this test? Select all valid options.
La Jolla Private mean GPA
Difference between La Jolla Private mean GPA and overall mean GPA
Absolute difference between La Jolla Private mean GPA and overall mean GPA
Total variation distance (TVD)
Kolmogorov-Smirnov (K-S) statistic
Answer: Options 1 and 2
In hypothesis tests where we test to see if a sample came from a larger distribution, we often use the sample mean as the test statistic (again, see the Torgersen Island bill lengths example from Lecture 12). Hence, the La Jolla Private mean GPA is a valid option.
Note that in this hypothesis test, we will simulate new data by generating random samples, each one being the same size as the number of applications from La Jolla Private. The overall mean GPA will not change on each simulation, as it is a constant. Hence, Option 2 reduces to Option 1 minus a constant, which purely shifts the distribution of the test statistic and the observed statistic horizontally but does not change their relative positions to one another. Hence, the difference between the La Jolla Private mean GPA and overall mean GPA is also a valid test statistic.
Option 3 is not valid because our alternative hypothesis has a direction (that the mean GPA of La Jolla Private admits is less than the mean GPA of all admits). The absolute difference would be appropriate for a directionless alternative hypothesis, e.g. that the mean GPA of La Jolla Private admits is different than the mean GPA of all admits.
Option 4 doesn’t work because we are not dealing with categorical distributions.
Option 5 doesn’t work because we are not running a permutation test to test whether two samples come from the same underlying population distribution; rather, here we are testing whether one sample comes from a larger population (and our hypotheses explicitly mention the mean).
The average score on this problem was 81%.
Consider the following pair of hypotheses.
Null hypothesis: The distribution of admitted, waitlisted, and rejected students at UC San Diego from Warren High is equal to the distribution of admitted, waitlisted, and rejected students at UC San Diego from La Jolla Private.
Alternative hypothesis: The distribution of admitted, waitlisted, and rejected students at UC San Diego from Warren High is different from the distribution of admitted, waitlisted, and rejected students at UC San Diego from La Jolla Private.
What type of test is this?
Hypothesis test
Permutation test
Answer: Permutation test
There are two relevant distributions at play here:
The distribution of admit/waitlist/reject proportions at Warren High.
The distribution of admit/waitlist/reject proportions at La Jolla Private.
To generate new data under the null, we need to shuffle the group labels, i.e. randomly assign students to groups.
The average score on this problem was 89%.
Which of the following test statistics would be appropriate to use in this test? Select all valid options.
Warren High admit rate
Difference between Warren High and La Jolla Private admit rates
Absolute difference between Warren High and La Jolla Private admit rates
Total variation distance (TVD)
Kolmogorov-Smirnov (K-S) statistic
Answer: Option 4
The two distributions described in 11(a) are categorical, and the TVD is the only option listed that measures the "distance" between two categorical distributions.
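Such a permutation test could be carried out with a sketch like the one below. Here, two_schools is a hypothetical DataFrame of UC San Diego applications from only Warren High and La Jolla Private, with "High School" and "Admit" columns.

import numpy as np

def tvd_between_schools(df):
    # Proportions of admitted/waitlisted/rejected within each school
    dists = (df.pivot_table(index="Admit", columns="High School",
                            aggfunc="size", fill_value=0)
               .apply(lambda counts: counts / counts.sum()))
    # TVD between the two schools' distributions
    return (dists.iloc[:, 0] - dists.iloc[:, 1]).abs().sum() / 2

observed = tvd_between_schools(two_schools)

# Shuffle the school labels to simulate the test statistic under the null
simulated = []
for _ in range(10_000):
    shuffled = two_schools.assign(
        **{"High School": np.random.permutation(two_schools["High School"])}
    )
    simulated.append(tvd_between_schools(shuffled))

p_value = (np.array(simulated) >= observed).mean()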
The average score on this problem was 79%.
After getting bored of working with her students, Valentina decides to experiment with different ways of simulating data for the following pair of hypotheses:
Null hypothesis: The coin is fair.
Alternative hypothesis: The coin is biased in favor of heads.
As her test statistic, Valentina uses the number of heads. She defines the 2-D array A as follows:
# .flatten() reshapes from 50 x 2 to 1 x 100
A = np.array([
    np.array([np.random.permutation([0, 1]) for _ in range(50)]).flatten()
    for _ in range(3000)
])
She also defines the 2-D array B as follows:
# .flatten() reshapes from 50 x 2 to 1 x 100
B = np.array([
    np.array([np.random.choice([0, 1], 2) for _ in range(50)]).flatten()
    for _ in range(3000)
])
Below, we see a histogram of the distribution of her test statistics.
Which one of the following arrays is visualized above?
A.sum(axis=1)
B.sum(axis=1)
Answer: Option 2
Note that arr.sum(axis=1) takes the sum of each row of arr.
The difference comes down to the behavior of np.random.permutation([0, 1]) and np.random.choice([0, 1], 2).
Each call to np.random.permutation([0, 1]) will either return array([0, 1]) or array([1, 0]) — one head and one tail. As a result, each row of A will consist of 50 1s and 50 0s, and so the sum of each row of A will be exactly 50. If we drew a histogram of this distribution, it would be a single spike at the number 50.
On the other hand, each call to np.random.choice([0, 1], 2) could return array([0, 0]), array([0, 1]), array([1, 0]), or array([1, 1]). Each of these is returned with equal probability. In effect, np.random.choice([0, 1], 2) flips a fair coin twice, so [np.random.choice([0, 1], 2) for _ in range(50)] flips a fair coin 100 times. When we take the sum of each row of B, we get the number of heads in 100 coin flips; the histogram drawn is consistent with this interpretation.
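Assuming A and B are defined as above, these claims can be checked directly:

import numpy as np

print(np.unique(A.sum(axis=1)))  # array([50]): every row of A sums to exactly 50

# Row sums of B vary around 50, like the number of heads in 100 fair coin flips
print(B.sum(axis=1).min(), B.sum(axis=1).max())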
The average score on this problem was 83%.