A data scientist is using Spark ML to engineer features for an exploratory machine learning project. They decide they want to standardize their features using the following code block: Upon code review, a colleague expressed concern with the features being standardized prior to splitting the data into a training set and a test set. Which of the following changes can the data scientist make to address the concern?
A machine learning engineer is trying to scale a machine learning pipeline by distributing its feature engineering process. Which of the following feature engineering tasks will be the least efficient to distribute?
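The colleague's concern is about leakage: if the mean and standard deviation are computed on the full data set, information from the test rows influences the scaled training features. The following is a minimal plain-Python sketch of the usual remedy (split first, compute the statistics on the training split only, then apply them to both splits). It is not the code block from the question, and the data and the helper names `fit_standardizer` and `standardize` are hypothetical. In Spark ML the analogous pattern would be fitting a scaler on the training DataFrame and using the fitted model to transform both splits.

```python
import random
import statistics


def fit_standardizer(values):
    """Compute the mean and population standard deviation of `values` only."""
    return statistics.fmean(values), statistics.pstdev(values)


def standardize(values, mean, std):
    """Scale `values` with statistics that were computed elsewhere."""
    return [(v - mean) / std for v in values]


# Hypothetical feature values, generated with a fixed seed for repeatability.
rng = random.Random(42)
data = [rng.uniform(50, 500) for _ in range(100)]

# Split BEFORE computing any statistics.
rng.shuffle(data)
train, test = data[:80], data[80:]

# Fit on the training split only, then reuse the same statistics for both splits.
mean, std = fit_standardizer(train)
train_scaled = standardize(train, mean, std)
test_scaled = standardize(test, mean, std)
```

With this ordering the scaled training values have mean 0 and standard deviation 1, while the scaled test values generally do not, which is expected because the test rows played no part in the statistics.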
Which of the following is a benefit of using vectorized pandas UDFs instead of standard PySpark UDFs?
What is the name of the method that transforms categorical features into a series of binary indicator feature variables?
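The transformation described here is commonly called one-hot encoding. As a rough illustration of what it produces, the following plain-Python sketch maps each category to a row of binary indicators. The function `one_hot_encode` and the sample values are hypothetical, and library implementations can differ in details such as column order or whether one category is dropped.

```python
def one_hot_encode(values):
    """Return the sorted categories and one binary indicator row per value."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # exactly one indicator is set per row
        encoded.append(row)
    return categories, encoded


# Hypothetical categorical feature.
room_types = ["suite", "single", "double", "single"]
categories, encoded = one_hot_encode(room_types)
```

Each output row contains a single 1 in the column of its category, so a feature with three distinct values becomes three binary columns.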
A machine learning engineer would like to develop a linear regression model with Spark ML to predict the price of a hotel room. They are using the Spark DataFrame train_df to train the model. The Spark DataFrame train_df has the following schema: The machine learning engineer shares the following code block: Which of the following changes does the machine learning engineer need to make to complete the task?
A machine learning engineer wants to parallelize the training of group-specific models using the Pandas Function API. They have developed the train_model function, and they want to apply it to each group of DataFrame df. They have written the following incomplete code block: Which of the following pieces of code can be used to fill in the above blank to complete the task?