Python & Pandas Data Analysis Questions & Answers
Question 1
True or false: "=" and "==" perform the same function in Python.
Answer: False
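A minimal sketch of the difference, using illustrative variable names that are not part of the original quiz: "=" binds a value to a name, while "==" compares two values and evaluates to a boolean.

```python
# "=" is assignment: it binds the value 5 to the name x.
x = 5

# "==" is comparison: it evaluates to True or False and changes nothing.
is_five = (x == 5)
is_six = (x == 6)

print(is_five, is_six)
```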
Question 2
What is the output of the following lines of code:
Answer: 211
Question 3
Which of the following is not a valid variable type in Python?
Answer: DoubleString
Question 4
What will be the result of the following code?
Answer: 6.0
Question 5
What will be the output of the following Python expression?
Answer: "100100100"
Question 6
Which code will filter the dataframe df for rows where the column “col1” has values that are NOT equal to 5?
Answer: df[df["col1"] != 5]
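As a quick illustration of this filter, here is a small sketch on a made-up dataframe (the data and the extra column col2 are assumptions for demonstration only):

```python
import pandas as pd

# Hypothetical data; only the column name "col1" comes from the question.
df = pd.DataFrame({"col1": [5, 3, 5, 8], "col2": ["a", "b", "c", "d"]})

# Keep only the rows where col1 is not equal to 5.
not_five = df[df["col1"] != 5]
print(not_five)
```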
Question 7
Which of the following pandas methods will report summary statistics of numeric columns?
Answer: describe()
Question 8
Suppose we have a dataframe df, which includes a column animal. I would like to count the number of times each animal appears in the column, e.g. Elephant: 8, Zebra: 7, etc. Which of the following commands can I use to do that?
Answer: df["animal"].value_counts()
Question 9
Assuming dataframe df is properly defined, what line of code would output the first 10 rows of data?
Answer: df.head(10)
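The three methods from Questions 7 to 9 can be tried together on a small example. The data below, including the weight column, is invented for illustration:

```python
import pandas as pd

# Hypothetical data; only the column name "animal" comes from the quiz.
df = pd.DataFrame({
    "animal": ["Elephant", "Zebra", "Elephant", "Lion", "Zebra", "Elephant"],
    "weight": [5000, 350, 5200, 190, 380, 4800],
})

# Summary statistics (count, mean, std, min, quartiles, max) for numeric columns.
summary = df.describe()

# Number of times each animal appears.
counts = df["animal"].value_counts()

# First 3 rows (the quiz uses 10; this toy frame only has 6 rows).
first_rows = df.head(3)

print(summary)
print(counts)
print(first_rows)
```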
Question 10
Which option will help us select elements "a" and "b" in the list L = ["a", "b", "c"]? (Select all that apply.)
Answers:
- L[0:2]
- L[-3:-1]
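Both slices can be checked directly. A slice includes the start index and excludes the stop index, and negative indices count from the end of the list:

```python
L = ["a", "b", "c"]

# Indices 0 and 1 (the stop index 2 is excluded).
first_two = L[0:2]

# Index -3 is "a" and index -1 is "c"; the stop index is excluded again.
also_first_two = L[-3:-1]

print(first_two, also_first_two)
```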
Question 11
Which of these will help us select two columns, c1 and c2, from the dataframe df?
Answer: df[["c1", "c2"]]
Question 12a
What will be the output of List1[-2] if List1 is defined as below?
Answer: 'Professor'
Question 12b
A list can hold multiple types of data.
Answer: True
Question 13
What would the following code output?
Answer: 5
Question 14
How would you return the enrollment count in business from this dictionary?
Answer: Univ["business"]["enrollment"]
Question 15
Which of the following is appropriate syntax for a key-value pair in a dictionary?
Answer: "Location": "Urbana"
Question 16
What will be the output of this code?
Answer: {'Dan': 3.8, 'Jason': 4, 'Jasmine': 4, 'Lori': 3.85}
Question 18
In the below scatterplot, what is the relationship between the variables on the x-axis and y-axis?
Answer: A positive relationship
Question 19
The following boxplot shows the distribution of tip amounts by day of the week. Based on the figure, which of the following is true? (Select all that apply.)
Answers:
- The median tip on Sat is less than that on Sun.
- The first quartile for tip is similar across all four days.
- Friday and Sunday have no outliers.
Question 20
The following bar chart shows average rainfall by month. Based on the figure, which of the following is true?
Other options listed:
- There was no day in March when it rained for more than 50 mm.
- For each day of April 2014, at least 50 mm of rain was recorded.
- The day with the highest rain was in May 2014.
Answer: None of the other options.
Question 21
To filter a dataframe df such that a column, col1, has values only greater than 100, we run the following code:
What will be the outcome if we only run the code within the outer set of square brackets, i.e., df['col1'] > 100?
Answer: A series of True and False will be printed.
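The two steps can be separated to see the intermediate result. This is a sketch on invented data; only the column name col1 and the threshold 100 come from the question:

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"col1": [50, 150, 100, 250]})

# The inner expression alone produces a boolean Series, one value per row.
mask = df["col1"] > 100
print(mask)

# Passing that Series back into df[...] keeps only the rows marked True.
filtered = df[mask]
print(filtered)
```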
Question 22
A scatterplot is a useful chart to visualize the distribution of a categorical variable.
Answer: False
Question 23
Suppose you have a dataset of box office revenues for the top 1000 movies. The dataset includes the genre of each movie. Which chart could you use to show average revenues by genre?
Answer: Bar chart
Question 24
What is the value of the interquartile range (IQR)?
Answer: Q3 – Q1
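As a worked example, the IQR can be computed from the quartiles of a pandas Series. The values below are made up, and quantile() is used with its default linear interpolation:

```python
import pandas as pd

# Hypothetical values for illustration.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])

q1 = values.quantile(0.25)  # first quartile
q3 = values.quantile(0.75)  # third quartile
iqr = q3 - q1               # interquartile range

print(q1, q3, iqr)
```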
Question 25
What purpose does a cross-tabulation (pd.crosstab) serve in data analysis?
Answer: Examine the relationship between two categorical variables.
Question 26
What does the groupby function in pandas do?
Answer: Group rows of a dataframe based on one or more columns
Question 27
Which of the following measures cannot be inferred from a box plot?
Answers:
- Count
- Min
- Mean
Question 28
Which of the following plots can be created based on a single numeric variable?
Answers:
- Box plot
- Histogram
Question 29
Which of the following statements is TRUE?
Answer: Each column in a pandas dataframe is a Series.
Question 30
Consider the dataframe df with categorical variable catvar which takes n possible values. How many dummy variables will the following code create?
Answer: n dummy variables
Question 31
Consider the dataframe df with m numerical variables, and one categorical variable catvar which takes n possible values. How many variables will be in the dataframe created by the following code?
Answer: m+n-1 variables
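The code for Questions 30 and 31 is not reproduced here. The sketch below assumes it used pd.get_dummies, and that Question 31 passed drop_first=True, which would be consistent with the stated answers. The column names num1 and num2 and the data are hypothetical, with m = 2 numerical variables and n = 3 categories:

```python
import pandas as pd

# Hypothetical data: m = 2 numeric columns, catvar with n = 3 possible values.
df = pd.DataFrame({
    "num1": [1, 2, 3, 4],
    "num2": [0.5, 1.5, 2.5, 3.5],
    "catvar": ["red", "green", "blue", "red"],
})

# Assumed form of the Question 30 code: one dummy column per category (n = 3).
all_dummies = pd.get_dummies(df["catvar"])

# Assumed form of the Question 31 code: drop_first=True drops one category,
# leaving m + (n - 1) = 2 + 2 = 4 columns in total.
with_drop = pd.get_dummies(df, columns=["catvar"], drop_first=True)

print(all_dummies.shape[1], with_drop.shape[1])
```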
Question 32
Consider the dataframe df with categorical variable catvar and a numerical variable numvar. What is the correct code to find the mean of numvar for each value of catvar?
Answer: df.groupby("catvar")["numvar"].mean()
Question 33
Consider the dataframe df with categorical variable catvar and two numerical variables numvar1 and numvar2. What is the correct code to find the mean of numvar1 and numvar2 for each value of catvar?
Answer: df.groupby("catvar")[["numvar1", "numvar2"]].mean()
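Both group-wise means from Questions 32 and 33 can be sketched on a small invented dataframe. Selecting one column with single brackets returns a Series; selecting a list of columns with double brackets returns a DataFrame:

```python
import pandas as pd

# Hypothetical data; only the column names follow the questions.
df = pd.DataFrame({
    "catvar": ["a", "b", "a", "b"],
    "numvar1": [1.0, 2.0, 3.0, 4.0],
    "numvar2": [10.0, 20.0, 30.0, 40.0],
})

# Mean of one numeric column per category (a Series indexed by catvar).
one_col = df.groupby("catvar")["numvar1"].mean()

# Mean of two numeric columns per category (a DataFrame indexed by catvar).
two_cols = df.groupby("catvar")[["numvar1", "numvar2"]].mean()

print(one_col)
print(two_cols)
```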
Question 34
The dataframe df has 100 rows and 5 columns. After running df.duplicated().sum(), you find that the dataframe has 9 duplicate rows. How many rows will df have after running df.drop_duplicates()?
Answer: 100. The code only returns a copy of df with duplicates removed. It does not change the underlying dataframe.
How many rows will df have after running df.drop_duplicates(inplace = True)?
Answer: 91 (100 – 9)
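The same behavior can be seen on a smaller scale. This sketch uses an invented 5-row dataframe with 2 duplicate rows instead of the 100-row one in the question:

```python
import pandas as pd

# Hypothetical data: rows 1 and 3 repeat rows 0 and 2.
df = pd.DataFrame({"a": [1, 1, 2, 2, 3], "b": ["x", "x", "y", "y", "z"]})

n_dupes = df.duplicated().sum()   # counts rows that repeat an earlier row

# Without inplace=True, a new dataframe is returned and df is untouched.
deduped = df.drop_duplicates()
rows_before = len(df)

# With inplace=True, df itself is modified.
df.drop_duplicates(inplace=True)
rows_after = len(df)

print(n_dupes, rows_before, len(deduped), rows_after)
```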
Question 35
The dataframe df has 100 rows and 5 columns. After running df.isnull().sum(), you find that the dataframe has 5 missing values in Column1 and 5 missing values in Column2. How many rows will df have after running df.dropna()?
Answer: 100. The code only returns a copy of df with the rows containing missing values removed. It does not change the underlying dataframe.
How many rows will df have after running df.dropna(inplace = True)?
Answers:
- At least 90 observations (100 – 5 – 5, if the missing values are all in different rows)
- At most 95 observations (100 – 5, if the missing values in Column1 are in the same rows as the missing values in Column2)
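The two bounds can be reproduced on a small scale. In this sketch the dataframes are invented, with 4 rows and one missing value per column, so the bounds become 4 - 1 - 1 = 2 and 4 - 1 = 3:

```python
import numpy as np
import pandas as pd

# Missing values fall in different rows: two rows are dropped.
disjoint = pd.DataFrame({
    "Column1": [np.nan, 1.0, 2.0, 3.0],
    "Column2": [1.0, np.nan, 2.0, 3.0],
})

# Missing values fall in the same row: only one row is dropped.
overlapping = pd.DataFrame({
    "Column1": [np.nan, 1.0, 2.0, 3.0],
    "Column2": [np.nan, 1.0, 2.0, 3.0],
})

rows_disjoint = len(disjoint.dropna())
rows_overlapping = len(overlapping.dropna())

print(rows_disjoint, rows_overlapping)
```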
Question 36
For the next three questions, consider the following dataframe, called df.
Which of the following code snippets will return the following subset? (Select all that apply.)
Answers:
- df.loc[[1, 3, 5],:]
- df.loc[[1, 3, 5]]
- df.iloc[0:3,:]
- df.loc[1:5,:]
Question 37
Which of the following code snippets will return the following subset? (Select all that apply.)
Answers:
- df.loc[:,["Col1", "Col3", "Col5"]]
- df.iloc[:,0:3]
Question 39
Which of the following code snippets will return the following subset? (Select all that apply.)
Answers:
- df.loc[[1,3,5],["Col1", "Col3", "Col5"]]
- df.iloc[0:3,0:3]
- df.loc[1:5,"Col1":"Col5"]
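The dataframe for Questions 36 to 39 is not shown here. The sketch below assumes a layout consistent with the answers: row labels 1, 3, 5, ... and column labels Col1, Col3, Col5, ... in that order. The cell values are placeholders. It illustrates that loc selects by label (with inclusive slice endpoints) while iloc selects by position (with an exclusive stop):

```python
import pandas as pd

# Assumed layout; the real quiz dataframe may differ.
df = pd.DataFrame(
    [[r * 10 + c for c in range(4)] for r in range(5)],
    index=[1, 3, 5, 7, 9],
    columns=["Col1", "Col3", "Col5", "Col7"],
)

# Label-based selection with explicit lists of labels.
by_label = df.loc[[1, 3, 5], ["Col1", "Col3", "Col5"]]

# Position-based selection: first three rows, first three columns.
by_position = df.iloc[0:3, 0:3]

# Label-based slices include both endpoints.
by_label_slice = df.loc[1:5, "Col1":"Col5"]

print(by_label)
```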
Question 40
You have an inventory dictionary that tracks items at two different outlets:
inventory = {
    "OutletA": {"chairs": 15, "desks": 30},
    "OutletB": {"chairs": 12, "desks": 21}
}
Which expression correctly retrieves the number of desks at OutletB?
Answer: inventory["OutletB"]["desks"]
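Nested dictionaries are read one key at a time: the first key returns the inner dictionary, and the second key returns the value inside it. A short check using the inventory from the question:

```python
inventory = {
    "OutletA": {"chairs": 15, "desks": 30},
    "OutletB": {"chairs": 12, "desks": 21},
}

# First lookup returns the inner dictionary for OutletB.
outlet_b = inventory["OutletB"]

# Second lookup returns the number of desks in that outlet.
desks_b = inventory["OutletB"]["desks"]

print(outlet_b, desks_b)
```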
Question 41
Consider a pandas DataFrame df that has a numeric column called price. You want to filter for rows where price is at least 50 and less than 100. Which line of code accomplishes this?
Answer: df[(df["price"] >= 50) & (df["price"] < 100)]
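A small sketch of this filter on invented prices. The parentheses around each condition matter because & binds more tightly than the comparison operators:

```python
import pandas as pd

# Hypothetical prices; 50 is included and 100 is excluded by the conditions.
df = pd.DataFrame({"price": [25, 50, 75, 100, 125]})

# Combine two boolean Series element-wise with &.
in_range = df[(df["price"] >= 50) & (df["price"] < 100)]

print(in_range)
```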
Question 42
A study was conducted to analyze the relationship between study habits and whether students drink coffee. The data has been stored in a dataframe:
study_hours / drinks_coffee | False | True |
Low | 0.5 | 0.5 |
Medium | 0.4 | 0.6 |
High | 0.25 | 0.75 |
study_hours (possible values: Low, Medium, High)
drinks_coffee (possible values: True, False)
In the crosstab, what is the correct interpretation of the value 0.75 in the bottom right?
Answer: 75% of students who study a lot also drink coffee.
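Each row of the table sums to 1, which suggests a row-normalized crosstab. The sketch below assumes the table was built with pd.crosstab and normalize="index"; the underlying rows are invented so that the proportions match the table:

```python
import pandas as pd

# Hypothetical raw data chosen to reproduce the proportions in the table.
df = pd.DataFrame({
    "study_hours": ["Low"] * 2 + ["Medium"] * 5 + ["High"] * 4,
    "drinks_coffee": (
        [False, True]
        + [False, False, True, True, True]
        + [False, True, True, True]
    ),
})

# normalize="index" divides each row by its row total, so each row sums to 1.
table = pd.crosstab(df["study_hours"], df["drinks_coffee"], normalize="index")

# Columns are ordered False, True; the second value is the share of
# high-study students who drink coffee.
high_row = table.loc["High"].tolist()

print(table)
```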
Question 43
You want to calculate the sum and standard deviation of Exam_Score and Study_Hours, grouped by Major and Year. Which of the following is correct?
Answer: df.groupby(["Major", "Year"])[["Exam_Score", "Study_Hours"]].agg(["sum", "std"])
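A sketch of this aggregation on invented data. The result has one row per (Major, Year) pair and a two-level column index combining each variable with each statistic; a group with a single row gets NaN for the standard deviation:

```python
import pandas as pd

# Hypothetical data; only the column names come from the question.
df = pd.DataFrame({
    "Major": ["Math", "Math", "Math", "Bio", "Bio"],
    "Year": [1, 1, 2, 1, 1],
    "Exam_Score": [80, 90, 70, 60, 100],
    "Study_Hours": [10, 12, 8, 5, 15],
})

# Group by two keys, select two columns, and apply two aggregations to each.
result = df.groupby(["Major", "Year"])[["Exam_Score", "Study_Hours"]].agg(["sum", "std"])

print(result)
```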