databricks CERTIFIED_MACHINE_LEARNING_ASSOCIATE

Exam contains 73 questions

Question 19

A data scientist is using Spark ML to engineer features for an exploratory machine learning project. They decide they want to standardize their features using the following code block:

Upon code review, a colleague expressed concern with the features being standardized prior to splitting the data into a training set and a test set. Which of the following changes can the data scientist make to address the concern?

A. Utilize the MinMaxScaler object to standardize the training data according to global minimum and maximum values

B. Utilize the MinMaxScaler object to standardize the test data according to global minimum and maximum values

C. Utilize a cross-validation process rather than a train-test split process to remove the need for standardizing data

D. Utilize the Pipeline API to standardize the training data according to the test data's summary statistics

E. Utilize the Pipeline API to standardize the test data according to the training data's summary statistics
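
The colleague's concern is data leakage: summary statistics computed on the full dataset also reflect the test rows. The pattern described in option E can be sketched in plain Python, under the assumption of a single numeric feature. The helper names (fit_standardizer, standardize) and the sample values are hypothetical and are not taken from the question's missing code block. In Spark ML the same idea is usually expressed by fitting a Pipeline that contains a scaler stage on the training DataFrame and then calling transform on the test DataFrame.

```python
from statistics import mean, pstdev


def fit_standardizer(train_values):
    """Compute summary statistics from the training split only."""
    mu = mean(train_values)
    sigma = pstdev(train_values)
    # Guard against a constant feature, which would cause division by zero.
    return mu, (sigma if sigma > 0 else 1.0)


def standardize(values, mu, sigma):
    """Apply previously fitted statistics to any split."""
    return [(v - mu) / sigma for v in values]


# Hypothetical feature values that have already been split.
train = [10.0, 20.0, 30.0]
test = [20.0, 40.0]

mu, sigma = fit_standardizer(train)           # statistics come from train only
train_scaled = standardize(train, mu, sigma)
test_scaled = standardize(test, mu, sigma)    # test reuses the training statistics
```

Because the test split never contributes to mu or sigma, the evaluation on test_scaled is not influenced by information from the held-out rows.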

Question 20

A machine learning engineer is trying to scale a machine learning pipeline by distributing its feature engineering process. Which of the following feature engineering tasks will be the least efficient to distribute?

A. One-hot encoding categorical features

B. Target encoding categorical features

C. Imputing missing feature values with the mean

D. Imputing missing feature values with the true median

E. Creating binary indicator features for missing values
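
One way to see why the true median (option D) is harder to distribute than the mean: a mean can be assembled from small per-partition summaries, while an exact median needs all values ordered together. The sketch below is a plain-Python illustration, not Spark code, and the partition contents are hypothetical.

```python
from statistics import median

# Hypothetical data spread across three partitions.
partitions = [[1.0, 2.0, 100.0], [3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]

# Mean: each partition reports only (sum, count); the driver combines them.
partials = [(sum(p), len(p)) for p in partitions]
total, count = map(sum, zip(*partials))
global_mean = total / count

# Median: combining per-partition medians does not give the exact answer.
median_of_medians = median(median(p) for p in partitions)

# The exact median requires ordering every value together,
# which in a distributed setting means moving and sorting the data.
true_median = median(v for p in partitions for v in p)
```

Here median_of_medians and true_median differ, which is why an exact median cannot be computed from independent per-partition results the way a mean can.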

Question 21

Which of the following is a benefit of using vectorized pandas UDFs instead of standard PySpark UDFs?

E. The vectorized pandas UDFs process data in memory rather than spilling to disk

A. The vectorized pandas UDFs allow for the use of type hints

B. The vectorized pandas UDFs process data in batches rather than one row at a time

C. The vectorized pandas UDFs allow for pandas API use inside of the function

D. The vectorized pandas UDFs work on distributed DataFrames
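
The batching idea in option B can be illustrated with a plain-Python analogy. This is not PySpark code: the functions plus_one_row, plus_one_batch and batches are hypothetical stand-ins, and the only point is to count how often the user function is invoked. In PySpark itself, a pandas UDF receives batches of values as pandas objects transferred via Apache Arrow, whereas a standard Python UDF is called once per row.

```python
def plus_one_row(value):
    """Row-at-a-time style: invoked once per value."""
    return value + 1


def plus_one_batch(values):
    """Batch style: invoked once per chunk of values."""
    return [v + 1 for v in values]


def batches(values, size):
    """Yield consecutive chunks of at most `size` values."""
    for start in range(0, len(values), size):
        yield values[start:start + size]


data = list(range(10))

# Row-at-a-time: one function call per element.
row_calls = 0
row_result = []
for v in data:
    row_result.append(plus_one_row(v))
    row_calls += 1

# Batched: one function call per chunk of four elements.
batch_calls = 0
batch_result = []
for chunk in batches(data, 4):
    batch_result.extend(plus_one_batch(chunk))
    batch_calls += 1
```

Both loops produce the same output, but the batched version crosses the function-call boundary far fewer times, which is the source of the overhead savings the question refers to.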

Question 22

What is the name of the method that transforms categorical features into a series of binary indicator feature variables?

A. Leave-one-out encoding

C. One-hot encoding

D. Categorical embeddings

E. String indexing

B. Target encoding
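
For reference, the transformation the question describes (option C, one-hot encoding) can be sketched in a few lines of plain Python. The one_hot helper and the sample colors are hypothetical; Spark ML's OneHotEncoder operates on indexed category columns and produces vector columns rather than Python lists.

```python
def one_hot(values):
    """Map each categorical value to a list of 0/1 indicators."""
    # Fix a stable category order so every row uses the same column layout.
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded


colors = ["red", "green", "red", "blue"]
categories, encoded = one_hot(colors)
# categories is ['blue', 'green', 'red'];
# each row of encoded has exactly one 1, in the column of its category.
```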

Question 23

A machine learning engineer would like to develop a linear regression model with Spark ML to predict the price of a hotel room. They are using the Spark DataFrame train_df to train the model.

The Spark DataFrame train_df has the following schema:

The machine learning engineer shares the following code block:

Which of the following changes does the machine learning engineer need to make to complete the task?

A. They need to call the transform method on train_df

B. They need to convert the features column to be a vector

C. They do not need to make any changes

D. They need to utilize a Pipeline to fit the model

E. They need to split the features column out into one column for each feature

Question 24

A machine learning engineer wants to parallelize the training of group-specific models using the Pandas Function API. They have developed the train_model function, and they want to apply it to each group of DataFrame df.

They have written the following incomplete code block:

Which of the following pieces of code can be used to fill in the above blank to complete the task?
