
Databricks Certified Machine Learning Associate (Databricks-Machine-Learning-Associate) Exam Practice Test

Databricks Certified Machine Learning Associate Exam Questions and Answers

Question 1

A data scientist has created a linear regression model that uses log(price) as a label variable. Using this model, they have performed inference and the predictions and actual label values are in Spark DataFrame preds_df.

They are using the following code block to evaluate the model:

regression_evaluator.setMetricName("rmse").evaluate(preds_df)

Which of the following changes should the data scientist make to evaluate the RMSE in a way that is comparable with price?

Options:

A.

They should exponentiate the computed RMSE value

B.

They should take the log of the predictions before computing the RMSE

C.

They should evaluate the MSE of the log predictions to compute the RMSE

D.

They should exponentiate the predictions before computing the RMSE
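
The scale issue behind this question can be illustrated with a minimal sketch in plain Python rather than Spark. The prices and predictions below are made-up values, and the rmse helper is a hypothetical stand-in for an evaluator, not the Spark ML RegressionEvaluator. The sketch only shows that an RMSE computed on log values is expressed in log units, while an RMSE computed after applying exp to both columns is expressed in the units of price.

```python
import math

def rmse(predictions, actuals):
    """Root mean squared error of two equal-length sequences."""
    squared_errors = [(p - a) ** 2 for p, a in zip(predictions, actuals)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up example data: actual prices and model outputs on the log scale.
actual_prices = [100.0, 250.0, 400.0]
log_actuals = [math.log(p) for p in actual_prices]
log_predictions = [math.log(110.0), math.log(240.0), math.log(420.0)]

# RMSE in log units: not directly comparable with price.
rmse_log_scale = rmse(log_predictions, log_actuals)

# Map both columns back to the price scale, then compute the RMSE.
price_predictions = [math.exp(v) for v in log_predictions]
price_actuals = [math.exp(v) for v in log_actuals]
rmse_price_scale = rmse(price_predictions, price_actuals)

print(f"RMSE on log scale:   {rmse_log_scale:.4f}")
print(f"RMSE on price scale: {rmse_price_scale:.4f}")
```

With these made-up numbers the two results differ by orders of magnitude, which is why the scale on which the errors are computed matters when the result is compared with price.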

Question 2

A data scientist has produced three new models for a single machine learning problem. In the past, the solution used just one model. All four models have nearly the same prediction latency, but a machine learning engineer suggests that the new solution will be less time efficient during inference.

In which situation will the machine learning engineer be correct?

Options:

A.

When the new solution requires if-else logic determining which model to use to compute each prediction

B.

When the new solution's models have an average latency that is larger than the size of the original model

C.

When the new solution requires the use of fewer feature variables than the original model

D.

When the new solution requires that each model computes a prediction for every record

E.

When the new solution's models have an average size that is larger than the size of the original model

Question 3

An organization is developing a feature repository and is electing to one-hot encode all categorical feature variables. A data scientist suggests that the categorical feature variables should not be one-hot encoded within the feature repository.

Which of the following explanations justifies this suggestion?

Options:

A.

One-hot encoding is a potentially problematic categorical variable strategy for some machine learning algorithms.

B.

One-hot encoding is dependent on the target variable’s values which differ for each application.

C.

One-hot encoding is computationally intensive and should only be performed on small samples of training sets for individual machine learning problems.

D.

One-hot encoding is not a common strategy for representing categorical feature variables numerically.

Question 4

A machine learning engineer has grown tired of needing to install the MLflow Python library on each of their clusters. They ask a senior machine learning engineer how their notebooks can load the MLflow library without installing it each time. The senior machine learning engineer suggests that they use Databricks Runtime for Machine Learning.

Which of the following approaches describes how the machine learning engineer can begin using Databricks Runtime for Machine Learning?

Options:

A.

They can add a line enabling Databricks Runtime ML in their init script when creating their clusters.

B.

They can check the Databricks Runtime ML box when creating their clusters.

C.

They can select a Databricks Runtime ML version from the Databricks Runtime Version dropdown when creating their clusters.

D.

They can set the runtime-version variable in their Spark session to “ml”.

Question 5

Which of the following is a benefit of using vectorized pandas UDFs instead of standard PySpark UDFs?

Options:

A.

The vectorized pandas UDFs allow for the use of type hints

B.

The vectorized pandas UDFs process data in batches rather than one row at a time

C.

The vectorized pandas UDFs allow for pandas API use inside of the function

D.

The vectorized pandas UDFs work on distributed DataFrames

E.

The vectorized pandas UDFs process data in memory rather than spilling to disk

Question 6

Which of the following tools can be used to parallelize the hyperparameter tuning process for single-node machine learning models using a Spark cluster?

Options:

A.

MLflow Experiment Tracking

B.

Spark ML

C.

Autoscaling clusters

D.

Autoscaling clusters

E.

Delta Lake

Question 7

A new data scientist has started working on an existing machine learning project. The project is a scheduled Job that retrains every day. The project currently exists in a Repo in Databricks. The data scientist has been tasked with improving the feature engineering of the pipeline’s preprocessing stage. The data scientist wants to make necessary updates to the code that can be easily adopted into the project without changing what is being run each day.

Which approach should the data scientist take to complete this task?

Options:

A.

They can create a new branch in Databricks, commit their changes, and push those changes to the Git provider.

B.

They can clone the notebooks in the repository into a Databricks Workspace folder and make the necessary changes.

C.

They can create a new Git repository, import it into Databricks, and copy and paste the existing code from the original repository before making changes.

D.

They can clone the notebooks in the repository into a new Databricks Repo and make the necessary changes.

Question 8

Which of the following approaches can be used to view the notebook that was run to create an MLflow run?

Options:

A.

Open the MLmodel artifact in the MLflow run page

B.

Click the "Models" link in the row corresponding to the run in the MLflow experiment page

C.

Click the "Source" link in the row corresponding to the run in the MLflow experiment page

D.

Click the "Start Time" link in the row corresponding to the run in the MLflow experiment page

Question 9

A machine learning engineer is trying to scale a machine learning pipeline pipeline that contains multiple feature engineering stages and a modeling stage. As part of the cross-validation process, they are using the following code block:

A colleague suggests that the code block can be changed to speed up the tuning process by passing the model object to the estimator parameter and then placing the updated cv object as the final stage of the pipeline in place of the original model.

Which of the following is a negative consequence of the approach suggested by the colleague?

Options:

A.

The model will take longer to train for each unique combination of hyperparameter values

B.

The feature engineering stages will be computed using validation data

C.

The cross-validation process will no longer be

D.

The cross-validation process will no longer be reproducible

E.

The model will be refit one more time per cross-validation fold

Question 10

A data scientist wants to tune a set of hyperparameters for a machine learning model. They have wrapped a Spark ML model in the objective function objective_function and they have defined the search space search_space.

As a result, they have the following code block:

Which of the following changes do they need to make to the above code block in order to accomplish the task?

Options:

A.

Change SparkTrials() to Trials()

B.

Reduce num_evals to be less than 10

C.

Change fmin() to fmax()

D.

Remove the trials=trials argument

E.

Remove the algo=tpe.suggest argument

Question 11

A machine learning engineer has been notified that a new Staging version of a model registered to the MLflow Model Registry has passed all tests. As a result, the machine learning engineer wants to put this model into production by transitioning it to the Production stage in the Model Registry.

From which of the following pages in Databricks Machine Learning can the machine learning engineer accomplish this task?

Options:

A.

The home page of the MLflow Model Registry

B.

The experiment page in the Experiments observatory

C.

The model version page in the MLflow Model Registry

D.

The model page in the MLflow Model Registry

Question 12

A data scientist is using MLflow to track their machine learning experiment. As a part of each of their MLflow runs, they are performing hyperparameter tuning. The data scientist would like to have one parent run for the tuning process with a child run for each unique combination of hyperparameter values. All parent and child runs are being manually started with mlflow.start_run.

Which of the following approaches can the data scientist use to accomplish this MLflow run organization?

Options:

A.

They can turn on Databricks Autologging

B.

They can specify nested=True when starting the child run for each unique combination of hyperparameter values

C.

They can start each child run inside the parent run's indented code block using mlflow.start_run()

D.

They can start each child run with the same experiment ID as the parent run

E.

They can specify nested=True when starting the parent run for the tuning process

Question 13

A data scientist is performing hyperparameter tuning using an iterative optimization algorithm. Each evaluation of unique hyperparameter values is being trained on a single compute node. They are performing eight total evaluations across eight total compute nodes. While the accuracy of the model does vary over the eight evaluations, they notice there is no trend of improvement in the accuracy. The data scientist believes this is due to the parallelization of the tuning process.

Which change could the data scientist make to improve their model accuracy over the course of their tuning process?

Options:

A.

Change the number of compute nodes to be half or less than half of the number of evaluations.

B.

Change the number of compute nodes and the number of evaluations to be much larger but equal.

C.

Change the iterative optimization algorithm used to facilitate the tuning process.

D.

Change the number of compute nodes to be double or more than double the number of evaluations.

Question 14

What is the name of the method that transforms categorical features into a series of binary indicator feature variables?

Options:

A.

Leave-one-out encoding

B.

Target encoding

C.

One-hot encoding

D.

Categorical

E.

String indexing
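
The transformation described in this question, turning one categorical column into a series of binary indicator columns, can be sketched in plain Python. This is a minimal illustration only: the to_binary_indicators helper, the is_ column prefix, and the sample colors are hypothetical, and a real pipeline would normally use a library encoder instead.

```python
def to_binary_indicators(values):
    """Return one 0/1 indicator column per distinct category."""
    categories = sorted(set(values))
    return {
        f"is_{category}": [1 if v == category else 0 for v in values]
        for category in categories
    }

# Made-up categorical feature with three distinct values.
colors = ["red", "green", "red", "blue"]
indicators = to_binary_indicators(colors)

for name, column in indicators.items():
    print(name, column)
```

Each row ends up with exactly one indicator set to 1, so the number of new columns grows with the number of distinct categories.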

Question 15

A machine learning engineer is trying to scale a machine learning pipeline by distributing its feature engineering process.

Which of the following feature engineering tasks will be the least efficient to distribute?

Options:

A.

One-hot encoding categorical features

B.

Target encoding categorical features

C.

Imputing missing feature values with the mean

D.

Imputing missing feature values with the true median

E.

Creating binary indicator features for missing values
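
One way to reason about which statistics distribute well is to check whether per-partition results can be combined without revisiting the raw data. The sketch below is a plain-Python illustration under that assumption; the partitions list is made-up data standing in for rows spread across workers, and no Spark API is involved.

```python
import statistics

# Hypothetical partitions standing in for data spread across workers.
partitions = [[1.0, 2.0, 3.0], [4.0, 100.0], [5.0, 6.0, 7.0, 8.0]]

# Mean: each partition reports only (sum, count); the results combine exactly.
partials = [(sum(part), len(part)) for part in partitions]
total, count = map(sum, zip(*partials))
distributed_mean = total / count

# Median: combining per-partition medians does not give the exact median.
median_of_medians = statistics.median(statistics.median(p) for p in partitions)

# The exact median needs the values from every partition brought together.
all_values = [v for part in partitions for v in part]
true_median = statistics.median(all_values)

print("mean from partials:", distributed_mean)
print("median of partition medians:", median_of_medians)
print("exact median:", true_median)
```

In this example the mean is recovered exactly from small per-partition summaries, while the exact median requires a global view of the sorted values.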

Question 16

Which of the following machine learning algorithms typically uses bagging?

Options:

A.

Gradient boosted trees

B.

K-means

C.

Random forest

D.

Linear regression

E.

Decision tree
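
As background for this question, bagging (bootstrap aggregating) trains several base models on bootstrap samples of the training data and aggregates their predictions. The sketch below is a deliberately simplified illustration in plain Python: each base "model" is just the mean of its bootstrap sample, and the function names, seed, and target values are hypothetical.

```python
import random
import statistics

def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]

def bagged_mean_predictor(data, n_models, seed=0):
    """Average the predictions of n_models base models fit on bootstrap samples."""
    rng = random.Random(seed)
    # Each base "model" here is simply the mean of its bootstrap sample.
    base_predictions = [
        statistics.mean(bootstrap_sample(data, rng)) for _ in range(n_models)
    ]
    # Aggregate the base models by averaging their predictions.
    return statistics.mean(base_predictions), base_predictions

# Made-up target values.
targets = [3.0, 5.0, 4.0, 6.0, 8.0, 7.0]
ensemble_prediction, base_predictions = bagged_mean_predictor(targets, n_models=25)

print("number of base models:", len(base_predictions))
print("ensemble prediction:", ensemble_prediction)
```

The individual base predictions vary from sample to sample, and the aggregation step averages that variation out.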

Question 17

A data scientist has developed a linear regression model using Spark ML and computed the predictions in a Spark DataFrame preds_df with the following schema:

prediction DOUBLE

actual DOUBLE

Which of the following code blocks can be used to compute the root mean-squared-error of the model according to the data in preds_df and assign it to the rmse variable?

A)

B)

C)

D)

Options:

A.

Option A

B.

Option B

C.

Option C

D.

Option D

Question 18

A machine learning engineer is trying to perform batch model inference. They want to get predictions using the linear regression model saved at the path model_uri for the DataFrame batch_df.

batch_df has the following schema:

customer_id STRING

The machine learning engineer runs the following code block to perform inference on batch_df using the linear regression model at model_uri:

In which situation will the machine learning engineer’s code block perform the desired inference?

Options:

A.

When the Feature Store feature set was logged with the model at model_uri

B.

When all of the features used by the model at model_uri are in a Spark DataFrame in the PySpark

C.

When the model at model_uri only uses customer_id as a feature

D.

This code block will not perform the desired inference in any situation.

E.

When all of the features used by the model at model_uri are in a single Feature Store table

Question 19

A data scientist has written a feature engineering notebook that utilizes the pandas library. As the size of the data processed by the notebook increases, the notebook's runtime increases drastically.

Which of the following tools can the data scientist use to spend the least amount of time refactoring their notebook to scale with big data?

Options:

A.

PySpark DataFrame API

B.

pandas API on Spark

C.

Spark SQL

D.

Feature Store

Question 20

Which of the following machine learning algorithms typically uses bagging?

Options:

A.

Gradient boosted trees

B.

K-means

C.

Random forest

D.

Decision tree

Question 21

A data scientist is using Spark SQL to import their data into a machine learning pipeline. Once the data is imported, the data scientist performs machine learning tasks using Spark ML.

Which of the following compute tools is best suited for this use case?

Options:

A.

Single Node cluster

B.

Standard cluster

C.

SQL Warehouse

D.

None of these compute tools support this task

Question 22

A data scientist wants to efficiently tune the hyperparameters of a scikit-learn model in parallel. They elect to use the Hyperopt library to facilitate this process.

Which of the following Hyperopt tools provides the ability to optimize hyperparameters in parallel?

Options:

A.

fmin

B.

SparkTrials

C.

quniform

D.

search_space

E.

objective_function