Databricks Certified Professional Data Scientist Exam Practice Test

Databricks Certified Professional Data Scientist Exam Questions and Answers

Question 1

What is one modeling or descriptive statistical function in MADlib that is typically not provided in a standard relational database?

Options:

A.

Expected value

B.

Variance

C.

Linear regression

D.

Quantiles

Question 2

You are analyzing data in order to build a classifier model. You discover non-linear data and discontinuities that will affect the model. Which analytical method would you recommend?

Options:

A.

Logistic Regression

B.

Decision Trees

C.

Linear Regression

D.

ARIMA

Question 3

Which statements correctly apply to unsupervised learning?

Options:

A.

It does not have a target variable

B.

Instead of telling the machine "Predict Y for our data X", we are asking "What can you tell me about X?"

C.

We are telling the machine "Predict Y for our data X"

Question 4

You are doing advanced analytics for a medical application using regression. Two input variables, weight and height, are very important and cannot be ignored, but they are also highly correlated. What is the best solution?

Options:

A.

You will take cube root of height

B.

You will take square root of weight

C.

You will take square of the height.

D.

You would consider using BMI (Body Mass Index)
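
As a rough illustration of how two correlated inputs such as weight and height can be folded into a single derived feature, the sketch below computes BMI with the standard formula (weight in kilograms divided by the square of height in metres). The function name and the sample records are assumptions made for this example and are not part of the exam material.

```python
def body_mass_index(weight_kg: float, height_m: float) -> float:
    """Return BMI = weight (kg) / height (m) squared."""
    if height_m <= 0:
        raise ValueError("height must be positive")
    return weight_kg / (height_m ** 2)


# Hypothetical patient records: (weight in kg, height in m).
patients = [(70.0, 1.75), (85.0, 1.80), (54.0, 1.60)]

# One derived column replaces the two correlated inputs.
bmi_feature = [round(body_mass_index(w, h), 2) for w, h in patients]
print(bmi_feature)
```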

Question 5

Which classifier assumes independence among all its features?

Options:

A.

Neural networks

B.

Linear Regression

C.

Naive Bayes

D.

Random forests

Question 6

In which phase of the analytic lifecycle would you expect to spend most of the project time?

Options:

A.

Discovery

B.

Data preparation

C.

Communicate Results

D.

Operationalize

Question 7

Suppose there are three events. Which formula must always be equal to P(E1|E2,E3)?

Options:

A.

P(E1,E2,E3)P(E1)/P(E2,E3)

B.

P(E1,E2,E3)/P(E2,E3)

C.

P(E1,E2|E3)P(E2|E3)P(E3)

D.

P(E1,E2|E3)P(E3)

E.

P(E1,E2,E3)P(E2)P(E3)
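
For readers who want to experiment with expressions like these, the sketch below builds a small joint distribution over three binary events and evaluates P(E1|E2,E3) from the definition of conditional probability, P(A|B) = P(A,B)/P(B). The weights in the joint table are arbitrary values chosen for illustration, and the helper `prob` is a hypothetical name.

```python
from fractions import Fraction
from itertools import product

# Hypothetical joint distribution over three binary events (E1, E2, E3).
# The weights are arbitrary; dividing by their total makes them sum to 1.
weights = [1, 2, 3, 4, 5, 6, 7, 8]
total = sum(weights)
joint = {
    outcome: Fraction(w, total)
    for outcome, w in zip(product([0, 1], repeat=3), weights)
}


def prob(predicate):
    """Sum the probability of every outcome that satisfies the predicate."""
    return sum(p for outcome, p in joint.items() if predicate(outcome))


# P(E1, E2, E3) and P(E2, E3), read off the joint table.
p_e1_e2_e3 = prob(lambda o: o == (1, 1, 1))
p_e2_e3 = prob(lambda o: o[1] == 1 and o[2] == 1)

# Definition of conditional probability applied to E1 given (E2, E3).
p_e1_given_e2_e3 = p_e1_e2_e3 / p_e2_e3
print(p_e1_given_e2_e3)
```

Using `Fraction` keeps the arithmetic exact, so any other candidate expression can be compared against this value with `==`.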

Question 8

You are working on a clustering solution for a customer dataset. About 40 variables are available for each customer, and data is available for about 1.00,0000 customers. You want to reduce the number of variables used for clustering. What would you do?

Options:

A.

You will randomly reduce the number of variables

B.

You will find the correlation among the variables and discard the variables that are not correlated.

C.

You will find the correlation among the variables and, from each group of highly correlated variables, keep only one or two.

D.

You cannot discard any variable for creating clusters.

E.

You can combine several variables in one variable
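
A correlation-based filter of the kind mentioned in the options can be sketched in a few lines. This is a minimal illustration under stated assumptions: the 0.9 threshold, the function names, and the customer variables are all made up, and a real project would more likely use a library routine for the correlation matrix.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def drop_correlated(columns, threshold=0.9):
    """Keep a column only if it is not highly correlated with one already kept."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical customer variables (names and values are made up).
customers = {
    "monthly_spend": [10.0, 20.0, 30.0, 40.0, 50.0],
    "annual_spend": [121.0, 239.0, 362.0, 478.0, 601.0],  # roughly 12 x monthly
    "store_visits": [3.0, 1.0, 4.0, 1.0, 5.0],
}
print(drop_correlated(customers))
```

The filter is greedy: which member of a correlated group survives depends on column order, so the order should be chosen deliberately.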

Question 9

You are asked to create a model to predict the total number of monthly subscribers for a specific magazine. You are provided with 1 year's worth of subscription and payment data, user demographic data, and 10 years' worth of content of the magazine (articles and pictures). Which algorithm is the most appropriate for building a predictive model for subscribers?

Options:

A.

Linear regression

B.

Logistic regression

C.

Decision trees

D.

TF-IDF

Question 10

Spam filtering of emails is an example of:

Options:

A.

Supervised learning

B.

Unsupervised learning

C.

Clustering

D.

A and C are correct

E.

B and C are correct

Question 11

Which of the following are advantages of support vector machines (SVMs)?

Options:

A.

Effective in high dimensional spaces.

B.

They are memory efficient.

C.

It is possible to specify custom kernels.

D.

Effective in cases where number of dimensions is greater than the number of samples

E.

Even when the number of features is much greater than the number of samples, the method still gives good performance.

F.

SVMs directly provide probability estimates

Question 12

Which of the following is not a classification algorithm?

Options:

A.

Logistic Regression

B.

Support Vector Machine

C.

Neural Network

D.

Hidden Markov Models

E.

None of the above

Question 13

Under which circumstance do you need to implement N-fold cross-validation after creating a regression model?

Options:

A.

The data is unformatted.

B.

There is not enough data to create a test set.

C.

There are missing values in the data.

D.

There are categorical variables in the model.
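
To make the mechanics of N-fold cross-validation concrete, the sketch below splits sample indices into N folds so that every sample is held out for testing exactly once. It is an illustrative outline only: the function name is hypothetical, no shuffling is done, and model fitting and scoring are left out.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, test_indices) pairs for N-fold cross-validation."""
    if not 2 <= n_folds <= n_samples:
        raise ValueError("n_folds must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


for train, test in k_fold_indices(n_samples=10, n_folds=3):
    print("train:", train, "test:", test)
```

In practice the model would be refit on each `train` split and scored on the matching `test` split, with the N scores averaged.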

Question 14

In which of the following scenarios can you use a linear regression model?

Options:

A.

Predicting Home Price based on the location and house area

B.

Predicting demand of the goods and services based on the weather

C.

Predicting tumor size reduction based on the number of radiation treatments

D.

Predicting sales of a textbook based on the number of students in the state

Question 15

Select the choice where regression algorithms are not the best fit.

Options:

A.

When the dimensions of an object are given

B.

Weight of the person is given

C.

Temperature in the atmosphere

D.

Employee status

Question 16

What are the advantages of feature hashing?

Options:

A.

Requires less memory

B.

Requires fewer passes through the training data

C.

Easily reverse engineer vectors to determine which original feature mapped to a vector location

Question 17

In machine learning, feature hashing, also known as the hashing trick (by analogy to the kernel trick), is a fast and space-efficient way of vectorizing features (such as the words in a language), i.e., turning arbitrary features into indices in a vector or matrix. It works by applying a hash function to the features and using their hash values modulo the number of features as indices directly, rather than looking the indices up in an associative array. What is the primary reason for using the hashing trick when building classifiers?

Options:

A.

It creates smaller models

B.

It requires less memory to store the coefficients for the model

C.

It removes non-significant features, e.g. punctuation

D.

Noisy features are removed
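
The mechanism described in the question (hash each feature, take the value modulo the vector length, and use the result directly as an index) can be sketched as follows. This is a minimal illustration, not a library implementation: the function name, the bucket count of 16, and the choice of SHA-256 as the hash function are assumptions made for the example.

```python
import hashlib


def hash_features(tokens, n_buckets=16):
    """Map tokens to a fixed-length count vector using the hashing trick."""
    vector = [0] * n_buckets
    for token in tokens:
        # Use a stable hash; Python's built-in hash() of str is salted per process.
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        # Hash value modulo the vector length gives the index directly,
        # so no token-to-index dictionary has to be stored.
        index = int.from_bytes(digest[:8], "big") % n_buckets
        vector[index] += 1
    return vector


doc = "the quick brown fox jumps over the lazy dog".split()
vec = hash_features(doc, n_buckets=16)
print(vec)
```

The vector length is fixed up front regardless of vocabulary size. Distinct tokens can collide in the same bucket, and the mapping cannot be inverted to recover which token produced a given index.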

Question 18

Which of the following are true with regard to the K-Means clustering algorithm?

Options:

A.

Labels are not pre-assigned to each object in the cluster.

B.

Labels are pre-assigned to each object in the cluster.

C.

It classifies the data based on the labels.

D.

It discovers the center of each cluster.

E.

It finds which particular cluster each object falls in.
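
For reference, the two alternating steps of K-Means (assign each point to its nearest center, then move each center to the mean of its points) can be sketched on one-dimensional data. This is a simplified illustration of Lloyd's algorithm under assumptions: the function name, the fixed starting centers, and the sample points are made up, and real implementations add smarter initialization and a convergence test.

```python
def k_means_1d(points, centers, n_iterations=10):
    """Run Lloyd's algorithm on 1-D data from the given starting centers."""
    centers = list(centers)
    assignments = []
    for _ in range(n_iterations):
        # Assignment step: attach each point to its nearest center.
        assignments = [
            min(range(len(centers)), key=lambda c: abs(p - centers[c]))
            for p in points
        ]
        # Update step: move each center to the mean of its assigned points.
        for c in range(len(centers)):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # leave an empty cluster's center where it is
                centers[c] = sum(members) / len(members)
    return centers, assignments


# Hypothetical unlabeled data with two visible groups.
points = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
centers, assignments = k_means_1d(points, centers=[0.0, 5.0])
print(centers, assignments)
```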

Question 19

You are using a classification approach in which the agent is taught not by giving explicit categorizations but by using a reward system to indicate success: agents may be rewarded for certain actions and punished for others. What kind of learning is this?

Options:

A.

Supervised

B.

Unsupervised

C.

Regression

D.

None of the above

Question 20

You have collected hundreds of parameters for thousands of websites, e.g. daily hits, average time on the website, number of unique visitors, number of returning visitors, etc. Now you have to find the most important parameters that best describe a website. Which of the following techniques will you use?

Options:

A.

PCA (Principal component analysis)

B.

Linear Regression

C.

Logistic Regression

D.

Clustering
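
As background for the PCA option, the sketch below estimates the leading principal component of a small dataset by power iteration on the sample covariance matrix. It is a simplified illustration under assumptions: the function name and the website figures are made up, the data is not standardized (PCA is sensitive to scale), and a numerical library would normally be used instead.

```python
import math


def first_principal_component(rows, n_iterations=200):
    """Estimate the leading principal component by power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered data.
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    v = [1.0] * d
    for _ in range(n_iterations):
        # Multiply by the covariance matrix, then rescale to unit length.
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Hypothetical rows: (daily hits, unique visitors, avg. minutes on site).
sites = [
    [100.0, 60.0, 3.0],
    [200.0, 118.0, 2.0],
    [300.0, 185.0, 4.0],
    [400.0, 239.0, 3.0],
    [500.0, 301.0, 2.0],
]
loadings = first_principal_component(sites)
print([round(x, 3) for x in loadings])
```

The size of each loading indicates how strongly the corresponding parameter contributes to the direction of greatest variance in this toy data.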