In which of the following situations is it preferable to impute missing feature values with their median value over the mean value?
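As a rough illustration of how the two imputation choices differ, the sketch below fills missing entries in a small made-up column that contains one extreme value. The data and the helper name impute are assumptions for illustration only; they do not come from the question.

```python
from statistics import mean, median

# Hypothetical feature column: None marks a missing value, 1000.0 is an outlier.
column = [10.0, None, 12.0, 11.0, 13.0, None, 1000.0]

def impute(values, fill):
    """Return a copy of values with every None replaced by fill."""
    return [fill if v is None else v for v in values]

# Compute both statistics from the observed (non-missing) values only.
observed = [v for v in column if v is not None]
mean_fill = mean(observed)      # pulled far upward by the single outlier
median_fill = median(observed)  # stays near the bulk of the values

mean_filled = impute(column, mean_fill)
median_filled = impute(column, median_fill)

print("mean fill:", mean_fill)
print("median fill:", median_fill)
```

In this toy column the mean-based fill lands far from every typical value, while the median-based fill stays among them.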
A data scientist has a Spark DataFrame spark_df. They want to create a new Spark DataFrame that contains only the rows from spark_df where the value in column price is greater than 0. Which of the following code blocks will accomplish this task?
A machine learning engineer is trying to scale a machine learning pipeline named pipeline that contains multiple feature engineering stages and a modeling stage. As part of the cross-validation process, they are using the following code block: A colleague suggests that the code block can be changed to speed up the tuning process by passing the model object to the estimator parameter and then placing the updated cv object as the final stage of the pipeline in place of the original model. Which of the following is a negative consequence of the approach suggested by the colleague?
What is the name of the method that transforms categorical features into a series of binary indicator feature variables?
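A minimal sketch of such a transformation, commonly called one-hot encoding, is shown below in plain Python. The function name one_hot_encode and the sample data are hypothetical; libraries such as scikit-learn and Spark ML ship their own encoders with different interfaces.

```python
def one_hot_encode(values):
    """Map each categorical value to a list of 0/1 indicators.

    Returns the sorted category list and one indicator row per input value.
    """
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows

# Hypothetical categorical feature.
colors = ["red", "green", "blue", "green"]
categories, encoded = one_hot_encode(colors)

print(categories)
for color, row in zip(colors, encoded):
    print(color, row)
```

Each row contains exactly one 1, in the position of the category that the original value belongs to.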
A data scientist wants to parallelize the training of trees in a gradient boosted tree to speed up the training process. A colleague suggests that parallelizing a boosted tree algorithm can be difficult. Which of the following describes why?
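To make the structure of boosting concrete, here is a toy sketch of gradient boosting with squared-error loss and one-split regression stumps, written in plain Python. The helper names fit_stump and boost, the data, and the learning rate are all assumptions for illustration; this is not how any particular library implements it. Note that the residuals for each round are computed from the predictions of all earlier rounds.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression stump to the residuals by least squares."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue  # a valid split needs points on both sides
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, n_rounds=5, learning_rate=0.5):
    """Train stumps one after another; each is fit to the current residuals."""
    predictions = [0.0] * len(xs)
    stumps, losses = [], []
    for _ in range(n_rounds):
        # These residuals depend on every stump trained so far.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x)
                       for p, x in zip(predictions, xs)]
        losses.append(sum((y - p) ** 2 for y, p in zip(ys, predictions)))
    return stumps, losses

# Hypothetical one-dimensional training data.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.0, 2.0, 5.0, 6.0, 6.0]
stumps, losses = boost(xs, ys)
print(losses)
```

In this sketch, round k cannot start until round k - 1 has updated the predictions, so the loop over rounds is inherently sequential.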
A data scientist is attempting to tune a logistic regression model logistic using scikit-learn. They want to specify a search space for two hyperparameters and let the tuning process randomly select values for each evaluation. They attempt to run the following code block, but it does not accomplish the desired task: Which of the following changes can the data scientist make to accomplish the task?