[Q37-Q61] 2024 Updated Databricks-Machine-Learning-Associate PDF for the Databricks-Machine-Learning-Associate Tests Free Updated Today!

NEW QUESTION 37
A data scientist uses 3-fold cross-validation and the following hyperparameter grid when optimizing model hyperparameters via grid search for a classification problem:
* Hyperparameter 1: [2, 5, 10]
* Hyperparameter 2: [50, 100]
Which of the following represents the number of machine learning models that can be trained in parallel during this process?

3

5

6

18
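The grid in this question can be enumerated directly. The following is a minimal sketch in plain Python (no Spark involved), with illustrative variable names that are not part of the question. It counts the unique hyperparameter combinations and the total number of model fits when each combination is trained once per fold.

```python
from itertools import product

hyperparameter_1 = [2, 5, 10]
hyperparameter_2 = [50, 100]
num_folds = 3

# Every pairing of one value from each list is a separate grid point.
combinations = list(product(hyperparameter_1, hyperparameter_2))
num_combinations = len(combinations)        # 3 * 2 grid points
total_fits = num_combinations * num_folds   # one fit per grid point per fold

print(num_combinations, total_fits)
```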

NEW QUESTION 38
A data scientist learned during their training to always use 5-fold cross-validation in their model development workflow. A colleague suggests that there are cases where a train-validation split could be preferred over k-fold cross-validation when k > 2.
Which of the following describes a potential benefit of using a train-validation split over k-fold cross-validation in this scenario?

A holdout set is not necessary when using a train-validation split

Fewer hyperparameter values need to be tested when using a train-validation split

Bias is avoidable when using a train-validation split

Reproducibility is achievable when using a train-validation split

Fewer models need to be trained when using a train-validation split

NEW QUESTION 39
A data scientist has a Spark DataFrame spark_df. They want to create a new Spark DataFrame that contains only the rows from spark_df where the value in column discount is less than or equal to 0.
Which of the following code blocks will accomplish this task?

spark_df.loc[:, spark_df["discount"] <= 0]

spark_df[spark_df["discount"] <= 0]

spark_df.filter(col("discount") <= 0)

spark_df.loc[spark_df["discount"] <= 0, :]

NEW QUESTION 40
A data scientist is utilizing MLflow Autologging to automatically track their machine learning experiments. After completing a series of runs for the experiment experiment_id, the data scientist wants to identify the run_id of the run with the best root-mean-square error (RMSE).
Which of the following lines of code can be used to identify the run_id of the run with the best RMSE in experiment_id?

NEW QUESTION 41
A data scientist is using MLflow to track their machine learning experiment. As a part of each of their MLflow runs, they are performing hyperparameter tuning. The data scientist would like to have one parent run for the tuning process with a child run for each unique combination of hyperparameter values. All parent and child runs are being manually started with mlflow.start_run.
Which of the following approaches can the data scientist use to accomplish this MLflow run organization?

They can turn on Databricks Autologging

They can specify nested=True when starting the child run for each unique combination of hyperparameter values

They can start each child run inside the parent run’s indented code block using mlflow.start_run()

They can start each child run with the same experiment ID as the parent run

They can specify nested=True when starting the parent run for the tuning process

NEW QUESTION 42
A data scientist is performing hyperparameter tuning using an iterative optimization algorithm. Each evaluation of unique hyperparameter values is being trained on a single compute node. They are performing eight total evaluations across eight total compute nodes. While the accuracy of the model does vary over the eight evaluations, they notice there is no trend of improvement in the accuracy. The data scientist believes this is due to the parallelization of the tuning process.
Which change could the data scientist make to improve their model accuracy over the course of their tuning process?

Change the number of compute nodes to be half or less than half of the number of evaluations.

Change the number of compute nodes and the number of evaluations to be much larger but equal.

Change the iterative optimization algorithm used to facilitate the tuning process.

Change the number of compute nodes to be double or more than double the number of evaluations.

NEW QUESTION 43
A machine learning engineer is using the following code block to scale the inference of a single-node model on a Spark DataFrame with one million records:

Assuming the default Spark configuration is in place, which of the following is a benefit of using an Iterator?

The data will be limited to a single executor preventing the model from being loaded multiple times

The model will be limited to a single executor preventing the data from being distributed

The model only needs to be loaded once per executor rather than once per batch during the inference process

The data will be distributed across multiple executors during the inference process
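The difference between loading a model per batch and loading it once for a whole stream of batches can be illustrated without Spark. The sketch below is plain Python that simulates a single worker process; load_model, predict_per_batch, and predict_iterator are hypothetical names, and the "model" is a toy function rather than a real single-node model.

```python
from typing import Iterator, List

load_calls = []  # one entry is appended per simulated model load


def load_model():
    """Hypothetical stand-in for an expensive model load."""
    load_calls.append(1)
    return lambda x: 2.0 * x  # toy "model" that doubles its input


def predict_per_batch(batch: List[float]) -> List[float]:
    model = load_model()  # reloaded on every call, i.e. once per batch
    return [model(x) for x in batch]


def predict_iterator(batches: Iterator[List[float]]) -> Iterator[List[float]]:
    model = load_model()  # loaded once, then reused for every batch
    for batch in batches:
        yield [model(x) for x in batch]


batches = [[1.0, 2.0], [3.0], [4.0, 5.0]]

load_calls.clear()
per_batch_out = [predict_per_batch(b) for b in batches]
loads_per_batch = len(load_calls)

load_calls.clear()
iterator_out = list(predict_iterator(iter(batches)))
loads_with_iterator = len(load_calls)

print(loads_per_batch, loads_with_iterator)
```

Both approaches produce the same predictions; only the number of model loads differs.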

NEW QUESTION 44
A data scientist wants to tune a set of hyperparameters for a machine learning model. They have wrapped a Spark ML model in the objective function objective_function and they have defined the search space search_space.
As a result, they have the following code block:

Which of the following changes do they need to make to the above code block in order to accomplish the task?

Change SparkTrials() to Trials()

Reduce num_evals to be less than 10

Change fmin() to fmax()

Remove the trials=trials argument

Remove the algo=tpe.suggest argument

NEW QUESTION 45
A machine learning engineer is trying to scale a machine learning pipeline by distributing its feature engineering process.
Which of the following feature engineering tasks will be the least efficient to distribute?

One-hot encoding categorical features

Target encoding categorical features

Imputing missing feature values with the mean

Imputing missing feature values with the true median

Creating binary indicator features for missing values
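One way to reason about this question is to ask whether a statistic can be assembled from small per-partition summaries. The sketch below is plain Python, with lists standing in for data partitions (an assumption for illustration, not Spark code). It shows that a mean can be rebuilt exactly from per-partition sums and counts, while combining per-partition medians does not generally give the true median.

```python
from statistics import median

# Each inner list plays the role of one data partition.
partitions = [[1.0, 2.0, 3.0], [4.0, 100.0], [5.0, 6.0, 7.0, 8.0]]
all_values = [v for p in partitions for v in p]

# Mean: each partition only has to report (sum, count).
partials = [(sum(p), len(p)) for p in partitions]
total_sum = sum(s for s, _ in partials)
total_count = sum(c for _, c in partials)
distributed_mean = total_sum / total_count
global_mean = sum(all_values) / len(all_values)

# Median: per-partition medians cannot simply be combined.
median_of_medians = median(median(p) for p in partitions)
true_median = median(all_values)

print(distributed_mean == global_mean, median_of_medians, true_median)
```

In this toy data the combined partition medians differ from the true median, which needs a view of the globally ordered values.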

NEW QUESTION 46
A data scientist has created a linear regression model that uses log(price) as a label variable. Using this model, they have performed inference and the predictions and actual label values are in Spark DataFrame preds_df.
They are using the following code block to evaluate the model:
regression_evaluator.setMetricName("rmse").evaluate(preds_df)
Which of the following changes should the data scientist make to evaluate the RMSE in a way that is comparable with price?

They should exponentiate the computed RMSE value

They should take the log of the predictions before computing the RMSE

They should evaluate the MSE of the log predictions to compute the RMSE

They should exponentiate the predictions before computing the RMSE
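The effect of the label's scale on RMSE can be checked numerically. The sketch below is plain Python with made-up prices and predictions (assumed values, not taken from the question). It computes the RMSE once on the log scale and once after exponentiating the log predictions back to the price scale.

```python
import math


def rmse(predictions, actuals):
    """Root-mean-squared error of two equally long sequences."""
    squared_errors = [(p - a) ** 2 for p, a in zip(predictions, actuals)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Illustrative values only.
actual_price = [100.0, 200.0, 400.0]
log_actual = [math.log(p) for p in actual_price]
log_pred = [math.log(110.0), math.log(180.0), math.log(440.0)]

# RMSE on the log scale: expressed in log(price) units.
rmse_log_scale = rmse(log_pred, log_actual)

# RMSE on the price scale: exponentiate the predictions first.
price_pred = [math.exp(lp) for lp in log_pred]
rmse_price_scale = rmse(price_pred, actual_price)

print(rmse_log_scale, rmse_price_scale, math.exp(rmse_log_scale))
```

In this sketch, exponentiating the log-scale RMSE does not reproduce the price-scale RMSE, because the errors are squared and averaged on different scales.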

NEW QUESTION 47
Which of the following machine learning algorithms typically uses bagging?

Gradient boosted trees

K-means

Random forest

Linear regression

Decision tree
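Bagging is short for bootstrap aggregating: several models are fit on bootstrap resamples of the training data and their predictions are combined. The sketch below is a toy illustration in plain Python, in which each "model" simply predicts the mean of its resample. The helper names and data are assumptions for illustration, and real bagged ensembles use far richer base learners.

```python
import random
from statistics import mean


def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]


def fit_mean_model(sample):
    """Toy base learner: always predicts the mean of its training sample."""
    sample_mean = mean(sample)
    return lambda: sample_mean


rng = random.Random(42)  # fixed seed so the resamples are repeatable
targets = [3.0, 5.0, 4.0, 6.0, 2.0, 7.0]

# Fit one base learner per bootstrap resample, then average the predictions.
models = [fit_mean_model(bootstrap_sample(targets, rng)) for _ in range(25)]
bagged_prediction = mean(model() for model in models)

print(bagged_prediction)
```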

NEW QUESTION 48
A data scientist has created two linear regression models. The first model uses price as a label variable and the second model uses log(price) as a label variable. When evaluating the RMSE of each model by comparing the label predictions to the actual price values, the data scientist notices that the RMSE for the second model is much larger than the RMSE of the first model.
Which of the following possible explanations for this difference is invalid?

The second model is much more accurate than the first model

The data scientist failed to exponentiate the predictions in the second model prior to computing the RMSE

The data scientist failed to take the log of the predictions in the first model prior to computing the RMSE

The first model is much more accurate than the second model

The RMSE is an invalid evaluation metric for regression problems

NEW QUESTION 49
A data scientist is using Spark ML to engineer features for an exploratory machine learning project.
They decide they want to standardize their features using the following code block:

Upon code review, a colleague expressed concern with the features being standardized prior to splitting the data into a training set and a test set.
Which of the following changes can the data scientist make to address the concern?

Utilize the MinMaxScaler object to standardize the training data according to global minimum and maximum values

Utilize the MinMaxScaler object to standardize the test data according to global minimum and maximum values

Utilize a cross-validation process rather than a train-test split process to remove the need for standardizing data

Utilize the Pipeline API to standardize the training data according to the test data’s summary statistics

Utilize the Pipeline API to standardize the test data according to the training data’s summary statistics
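The concern raised in the code review is about which data the summary statistics come from. The sketch below is plain Python with invented values and a hypothetical standardize helper; it is not Spark ML code. The mean and standard deviation are computed from the training data only and then reused to transform the test data.

```python
from statistics import mean, pstdev

# Illustrative values only.
train = [1.0, 2.0, 3.0, 4.0, 5.0]
test = [6.0, 7.0]

# Summary statistics are learned from the training data alone.
train_mean = mean(train)
train_std = pstdev(train)  # population standard deviation, chosen for simplicity


def standardize(values, center, scale):
    """Shift by center and divide by scale."""
    return [(v - center) / scale for v in values]


train_scaled = standardize(train, train_mean, train_std)
# The test data reuses the training statistics instead of its own.
test_scaled = standardize(test, train_mean, train_std)

print(train_scaled, test_scaled)
```

Fitting the scaler inside a pipeline on the training set and then applying the fitted pipeline to the test set follows the same pattern.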

NEW QUESTION 50
A machine learning engineering team has a Job with three successive tasks. Each task runs a single notebook. The team has been alerted that the Job has failed in its latest run.
Which of the following approaches can the team use to identify which task is the cause of the failure?

Run each notebook interactively

Review the matrix view in the Job’s runs

Migrate the Job to a Delta Live Tables pipeline

Change each Task’s setting to use a dedicated cluster

NEW QUESTION 51
A data scientist wants to explore summary statistics for Spark DataFrame spark_df. The data scientist wants to see the count, mean, standard deviation, minimum, maximum, and interquartile range (IQR) for each numerical feature.
Which of the following lines of code can the data scientist run to accomplish the task?

spark_df.summary()

spark_df.stats()

spark_df.describe().head()

spark_df.printSchema()

spark_df.toPandas()

NEW QUESTION 52
A data scientist uses 3-fold cross-validation when optimizing model hyperparameters for a regression problem. The following root-mean-squared-error values are calculated on each of the validation folds:
* 10.0
* 12.0
* 17.0
Which of the following values represents the overall cross-validation root-mean-squared error?

13.0

17.0

12.0

39.0

10.0
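Assuming the overall cross-validation metric is taken as the simple average of the per-fold values, the arithmetic can be sketched as follows (variable names are illustrative):

```python
fold_rmse = [10.0, 12.0, 17.0]

# Average the per-fold validation metrics.
cv_rmse = sum(fold_rmse) / len(fold_rmse)

print(cv_rmse)
```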

NEW QUESTION 53
Which of the following describes the relationship between native Spark DataFrames and pandas API on Spark DataFrames?

pandas API on Spark DataFrames are single-node versions of Spark DataFrames with additional metadata

pandas API on Spark DataFrames are more performant than Spark DataFrames

pandas API on Spark DataFrames are made up of Spark DataFrames and additional metadata

pandas API on Spark DataFrames are less mutable versions of Spark DataFrames

NEW QUESTION 54
A machine learning engineer is trying to scale a machine learning pipeline pipeline that contains multiple feature engineering stages and a modeling stage. As part of the cross-validation process, they are using the following code block:

A colleague suggests that the code block can be changed to speed up the tuning process by passing the model object to the estimator parameter and then placing the updated cv object as the final stage of the pipeline in place of the original model.
Which of the following is a negative consequence of the approach suggested by the colleague?

The model will take longer to train for each unique combination of hyperparameter values

The feature engineering stages will be computed using validation data

The cross-validation process will no longer be

The cross-validation process will no longer be reproducible

The model will be refit one more time per cross-validation fold

NEW QUESTION 55
A data scientist has defined a Pandas UDF function predict to parallelize the inference process for a single-node model:

They have written the following incomplete code block to use predict to score each record of Spark DataFrame spark_df:

Which of the following lines of code can be used to complete the code block to successfully complete the task?

predict(*spark_df.columns)

mapInPandas(predict)

predict(Iterator(spark_df))

mapInPandas(predict(spark_df.columns))

predict(spark_df.columns)

NEW QUESTION 56
The implementation of linear regression in Spark ML first attempts to solve the linear regression problem using matrix decomposition, but this method does not scale well to large datasets with a large number of variables.
Which of the following approaches does Spark ML use to distribute the training of a linear regression model for large data?

Spark ML cannot distribute linear regression training

Singular value decomposition

Least-squares method

Logistic regression

Iterative optimization

NEW QUESTION 57
Which of the following tools can be used to distribute large-scale feature engineering without the use of a UDF or pandas Function API for machine learning pipelines?

Keras

pandas

PyTorch

Spark ML

Scikit-learn

NEW QUESTION 58
A data scientist has produced three new models for a single machine learning problem. In the past, the solution used just one model. All four models have nearly the same prediction latency, but a machine learning engineer suggests that the new solution will be less time efficient during inference.
In which situation will the machine learning engineer be correct?

When the new solution requires if-else logic determining which model to use to compute each prediction

When the new solution’s models have an average latency that is larger than the size of the original model

When the new solution requires the use of fewer feature variables than the original model

When the new solution requires that each model computes a prediction for every record

When the new solution’s models have an average size that is larger than the size of the original model

NEW QUESTION 59
Which of the following tools can be used to parallelize the hyperparameter tuning process for single-node machine learning models using a Spark cluster?

MLflow Experiment Tracking

Spark ML

Autoscaling clusters

Delta Lake

Spark ML (part of Apache Spark’s MLlib) is designed to handle machine learning tasks across multiple nodes in a cluster, effectively parallelizing tasks like hyperparameter tuning. It supports various machine learning algorithms that can be optimized over a Spark cluster, making it suitable for parallelizing hyperparameter tuning for single-node machine learning models when they are adapted to run on Spark.
Reference
Apache Spark MLlib Guide: https://spark.apache.org/docs/latest/ml-guide.html
Spark ML is a library within Apache Spark designed for scalable machine learning. It provides tools to handle large-scale machine learning tasks, including parallelizing the hyperparameter tuning process for single-node machine learning models using a Spark cluster. Here’s a detailed explanation of how Spark ML can be used:
Hyperparameter Tuning with CrossValidator: Spark ML includes the CrossValidator and TrainValidationSplit classes, which are used for hyperparameter tuning. These classes can evaluate multiple sets of hyperparameters in parallel using a Spark cluster.
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Define the model
model = ...

# Create a parameter grid
paramGrid = ParamGridBuilder() \
    .addGrid(model.hyperparam1, [value1, value2]) \
    .addGrid(model.hyperparam2, [value3, value4]) \
    .build()

# Define the evaluator
evaluator = BinaryClassificationEvaluator()

# Define the CrossValidator
crossval = CrossValidator(estimator=model,
                          estimatorParamMaps=paramGrid,
                          evaluator=evaluator,
                          numFolds=3)
Parallel Execution: CrossValidator and TrainValidationSplit accept a parallelism parameter (default 1, meaning models are fit one after another). Setting it above 1 lets Spark train models for different hyperparameter combinations concurrently, while each individual fit is itself distributed across the cluster’s nodes.
Scalability: Spark ML leverages the distributed computing capabilities of Spark. This allows for efficient processing of large datasets and training of models across many nodes, which speeds up the hyperparameter tuning process significantly compared to single-node computations.

NEW QUESTION 60
Which of the following hyperparameter optimization methods automatically makes informed selections of hyperparameter values based on previous trials for each iterative model evaluation?

Random Search

Halving Random Search

Tree of Parzen Estimators

Grid Search

NEW QUESTION 61
A data scientist has developed a linear regression model using Spark ML and computed the predictions in a Spark DataFrame preds_df with the following schema:
prediction DOUBLE
actual DOUBLE
Which of the following code blocks can be used to compute the root-mean-squared error of the model according to the data in preds_df and assign it to the rmse variable?