Databricks Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 (Databricks Certified Associate Developer for Apache Spark 3.0 Exam) Practice Test

Databricks Certified Associate Developer for Apache Spark 3.0 Exam Questions and Answers

Question 1

Which of the following code blocks immediately removes the previously cached DataFrame transactionsDf from memory and disk?

Options:

A.

array_remove(transactionsDf, "*")

B.

transactionsDf.unpersist()

(Correct)

C.

del transactionsDf

D.

transactionsDf.clearCache()

E.

transactionsDf.persist()
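
For reference, a minimal caching sketch, assuming an existing DataFrame transactionsDf:

from pyspark import StorageLevel
transactionsDf.persist(StorageLevel.MEMORY_AND_DISK)  # mark the DataFrame for caching in memory and on disk
transactionsDf.count()                                # an action materializes the cache
transactionsDf.unpersist()                            # removes the cached data from memory and disk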

Question 2

In which order should the code blocks shown below be run in order to read a JSON file from location jsonPath into a DataFrame and return only the rows that do not have value 3 in column productId?

1. importedDf.createOrReplaceTempView("importedDf")

2. spark.sql("SELECT * FROM importedDf WHERE productId != 3")

3. spark.sql("FILTER * FROM importedDf WHERE productId != 3")

4. importedDf = spark.read.option("format", "json").path(jsonPath)

5. importedDf = spark.read.json(jsonPath)

Options:

A.

4, 1, 2

B.

5, 1, 3

C.

5, 2

D.

4, 1, 3

E.

5, 1, 2
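
For reference, one working version of the read-then-query pattern these steps describe, assuming a SparkSession named spark and a JSON file at jsonPath:

importedDf = spark.read.json(jsonPath)                                # read the JSON file into a DataFrame
importedDf.createOrReplaceTempView("importedDf")                      # register it as a temporary view
result = spark.sql("SELECT * FROM importedDf WHERE productId != 3")   # filter with standard SQL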

Question 3

Which of the following code blocks returns a new DataFrame with the same columns as DataFrame transactionsDf, except for columns predError and value which should be removed?

Options:

A.

transactionsDf.drop(["predError", "value"])

B.

transactionsDf.drop("predError", "value")

C.

transactionsDf.drop(col("predError"), col("value"))

D.

transactionsDf.drop(predError, value)

E.

transactionsDf.drop("predError & value")
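
For reference, a short sketch of removing columns by name, assuming transactionsDf exists:

slimDf = transactionsDf.drop("predError", "value")  # drop() takes each column name as a separate string argument
slimDf.printSchema()                                # the remaining columns are untouched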

Question 4

Which of the following is a viable way to improve Spark's performance when dealing with large amounts of data, given that there is only a single application running on the cluster?

Options:

A.

Increase values for the properties spark.default.parallelism and spark.sql.shuffle.partitions

B.

Decrease values for the properties spark.default.parallelism and spark.sql.partitions

C.

Increase values for the properties spark.sql.parallelism and spark.sql.partitions

D.

Increase values for the properties spark.sql.parallelism and spark.sql.shuffle.partitions

E.

Increase values for the properties spark.dynamicAllocation.maxExecutors, spark.default.parallelism, and spark.sql.shuffle.partitions
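
For reference, a sketch of raising the two partitioning-related properties, assuming a SparkSession named spark; the values are illustrative only:

spark.conf.set("spark.sql.shuffle.partitions", "400")   # partitions produced by shuffles in the DataFrame/SQL API
# spark.default.parallelism applies to RDD operations and is normally set when the application is launched,
# for example: spark-submit --conf spark.default.parallelism=400 ...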

Question 5

Which of the following are valid execution modes?

Options:

A.

Kubernetes, Local, Client

B.

Client, Cluster, Local

C.

Server, Standalone, Client

D.

Cluster, Server, Local

E.

Standalone, Client, Cluster

Question 6

The code block displayed below contains an error. The code block should create DataFrame itemsAttributesDf which has columns itemId and attribute and lists every attribute from the attributes column in DataFrame itemsDf next to the itemId of the respective row in itemsDf. Find the error.

A sample of DataFrame itemsDf is below.

Code block:

itemsAttributesDf = itemsDf.explode("attributes").alias("attribute").select("attribute", "itemId")

Options:

A.

Since itemId is the index, it does not need to be an argument to the select() method.

B.

The alias() method needs to be called after the select() method.

C.

The explode() method expects a Column object rather than a string.

D.

explode() is not a method of DataFrame. explode() should be used inside the select() method instead.

E.

The split() method should be used inside the select() method instead of the explode() method.
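
For reference, a sketch of using explode() as a column function inside select(), assuming itemsDf has columns itemId and attributes:

from pyspark.sql.functions import explode
itemsAttributesDf = itemsDf.select("itemId", explode("attributes").alias("attribute"))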

Question 7

Which of the following code blocks reads in the JSON file stored at filePath, enforcing the schema expressed in JSON format in variable json_schema, shown in the code block below?

Code block:

json_schema = """
{"type": "struct",
 "fields": [
  {
   "name": "itemId",
   "type": "integer",
   "nullable": true,
   "metadata": {}
  },
  {
   "name": "supplier",
   "type": "string",
   "nullable": true,
   "metadata": {}
  }
 ]
}
"""

Options:

A.

spark.read.json(filePath, schema=json_schema)

B.

spark.read.schema(json_schema).json(filePath)

C.

schema = StructType.fromJson(json.loads(json_schema))
spark.read.json(filePath, schema=schema)

D.

spark.read.json(filePath, schema=schema_of_json(json_schema))

E.

spark.read.json(filePath, schema=spark.read.json(json_schema))
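
For reference, a sketch of turning a JSON schema string into a StructType and enforcing it on read, assuming spark, json_schema, and filePath as above:

import json
from pyspark.sql.types import StructType
schema = StructType.fromJson(json.loads(json_schema))   # build a StructType from the parsed JSON definition
df = spark.read.json(filePath, schema=schema)           # the reader enforces the schema instead of inferring it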

Question 8

Which of the following statements about Spark's configuration properties is incorrect?

Options:

A.

The maximum number of tasks that an executor can process at the same time is controlled by the spark.task.cpus property.

B.

The maximum number of tasks that an executor can process at the same time is controlled by the spark.executor.cores property.

C.

The default value for spark.sql.autoBroadcastJoinThreshold is 10MB.

D.

The default number of partitions to use when shuffling data for joins or aggregations is 300.

E.

The default number of partitions returned from certain transformations can be controlled by the spark.default.parallelism property.

Question 9

In which order should the code blocks shown below be run in order to return the number of records that are not empty in column value in the DataFrame resulting from an inner join of DataFrames transactionsDf and itemsDf on columns productId and itemId, respectively?

1. .filter(~isnull(col('value')))

2. .count()

3. transactionsDf.join(itemsDf, col("transactionsDf.productId")==col("itemsDf.itemId"))

4. transactionsDf.join(itemsDf, transactionsDf.productId==itemsDf.itemId, how='inner')

5. .filter(col('value').isnotnull())

6. .sum(col('value'))

Options:

A.

4, 1, 2

B.

3, 1, 6

C.

3, 1, 2

D.

3, 5, 2

E.

4, 6
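
For reference, a sketch of the join-filter-count chain, assuming transactionsDf and itemsDf exist:

from pyspark.sql.functions import col, isnull
(transactionsDf
    .join(itemsDf, transactionsDf.productId == itemsDf.itemId, how="inner")
    .filter(~isnull(col("value")))
    .count())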

Question 10

The code block shown below should return only the average prediction error (column predError) of a random subset, without replacement, of approximately 15% of rows in DataFrame transactionsDf. Choose the answer that correctly fills the blanks in the code block to accomplish this.

transactionsDf.__1__(__2__, __3__).__4__(avg('predError'))

Options:

A.

1. sample

2. True

3. 0.15

4. filter

B.

1. sample

2. False

3. 0.15

4. select

C.

1. sample

2. 0.85

3. False

4. select

D.

1. fraction

2. 0.15

3. True

4. where

E.

1. fraction

2. False

3. 0.85

4. select
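
For reference, a sketch of sampling without replacement and averaging a column, assuming transactionsDf exists; the 15% fraction mirrors the question:

from pyspark.sql.functions import avg
transactionsDf.sample(False, 0.15).select(avg("predError"))
# sample(withReplacement, fraction); select() is used because avg() returns a Column, not a DataFrame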

Question 11

Which of the following code blocks returns a DataFrame that has all columns of DataFrame transactionsDf and an additional column predErrorSquared which is the squared value of column predError in DataFrame transactionsDf?

Options:

A.

transactionsDf.withColumn("predError", pow(col("predErrorSquared"), 2))

B.

transactionsDf.withColumnRenamed("predErrorSquared", pow(predError, 2))

C.

transactionsDf.withColumn("predErrorSquared", pow(col("predError"), lit(2)))

D.

transactionsDf.withColumn("predErrorSquared", pow(predError, lit(2)))

E.

transactionsDf.withColumn("predErrorSquared", "predError"**2)
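
For reference, a sketch of adding a derived column with withColumn() and pow(), assuming transactionsDf exists:

from pyspark.sql.functions import col, pow, lit
transactionsDf.withColumn("predErrorSquared", pow(col("predError"), lit(2)))
# pow() also accepts a plain Python number as the exponent, e.g. pow(col("predError"), 2)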

Question 12

Which of the following code blocks creates a new DataFrame with 3 columns, productId, highest, and lowest, that shows the biggest and smallest values of column value per value in column productId from DataFrame transactionsDf?

Sample of DataFrame transactionsDf:

+-------------+---------+-----+-------+---------+----+
|transactionId|predError|value|storeId|productId|   f|
+-------------+---------+-----+-------+---------+----+
|            1|        3|    4|     25|        1|null|
|            2|        6|    7|      2|        2|null|
|            3|        3| null|     25|        3|null|
|            4|     null| null|      3|        2|null|
|            5|     null| null|   null|        2|null|
|            6|        3|    2|     25|        2|null|
+-------------+---------+-----+-------+---------+----+

Options:

A.

transactionsDf.max('value').min('value')

B.

transactionsDf.agg(max('value').alias('highest'), min('value').alias('lowest'))

C.

transactionsDf.groupby(col(productId)).agg(max(col(value)).alias("highest"), min(col(value)).alias("lowest"))

D.

transactionsDf.groupby('productId').agg(max('value').alias('highest'), min('value').alias('lowest'))

E.

transactionsDf.groupby("productId").agg({"highest": max("value"), "lowest": min("value")})
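
For reference, a sketch of a grouped aggregation with aliased results, assuming transactionsDf as sampled above:

from pyspark.sql.functions import max, min   # note: shadows Python's built-in max/min in this scope
(transactionsDf
    .groupBy("productId")
    .agg(max("value").alias("highest"), min("value").alias("lowest")))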

Question 13

The code block shown below should return a column that indicates through boolean values whether rows in DataFrame transactionsDf have values greater than or equal to 20 and smaller than or equal to 30 in column storeId and have the value 2 in column productId. Choose the answer that correctly fills the blanks in the code block to accomplish this.

transactionsDf.__1__((__2__.__3__) __4__ (__5__))

Options:

A.

1. select

2. col("storeId")

3. between(20, 30)

4. and

5. col("productId")==2

B.

1. where

2. col("storeId")

3. geq(20).leq(30)

4. &

5. col("productId")==2

C.

1. select

2. "storeId"

3. between(20, 30)

4. &&

5. col("productId")==2

D.

1. select

2. col("storeId")

3. between(20, 30)

4. &&

5. col("productId")=2

E.

1. select

2. col("storeId")

3. between(20, 30)

4. &

5. col("productId")==2
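
For reference, a sketch of combining a between() range check with another condition, assuming transactionsDf exists:

from pyspark.sql.functions import col
transactionsDf.select((col("storeId").between(20, 30)) & (col("productId") == 2))
# & is the boolean AND for Column expressions; each condition needs its own parentheses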

Question 14

Which of the following code blocks adds a column predErrorSqrt to DataFrame transactionsDf that is the square root of column predError?

Options:

A.

transactionsDf.withColumn("predErrorSqrt", sqrt(predError))

B.

transactionsDf.select(sqrt(predError))

C.

transactionsDf.withColumn("predErrorSqrt", col("predError").sqrt())

D.

transactionsDf.withColumn("predErrorSqrt", sqrt(col("predError")))

E.

transactionsDf.select(sqrt("predError"))

Question 15

Which of the following code blocks returns a DataFrame with an added column to DataFrame transactionsDf that shows the unix epoch timestamps in column transactionDate as strings in the format month/day/year in column transactionDateFormatted?

Excerpt of DataFrame transactionsDf:

Options:

A.

transactionsDf.withColumn("transactionDateFormatted", from_unixtime("transactionDate", format="dd/MM/yyyy"))

B.

transactionsDf.withColumnRenamed("transactionDate", "transactionDateFormatted", from_unixtime("transactionDateFormatted", format="MM/dd/yyyy"))

C.

transactionsDf.apply(from_unixtime(format="MM/dd/yyyy")).asColumn("transactionDateFormatted")

D.

transactionsDf.withColumn("transactionDateFormatted", from_unixtime("transactionDate", format="MM/dd/yyyy"))

E.

transactionsDf.withColumn("transactionDateFormatted", from_unixtime("transactionDate"))
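
For reference, a sketch of formatting a unix epoch column as a string, assuming transactionsDf has a numeric column transactionDate:

from pyspark.sql.functions import from_unixtime
transactionsDf.withColumn("transactionDateFormatted",
                          from_unixtime("transactionDate", format="MM/dd/yyyy"))
# "MM/dd/yyyy" renders the timestamp as month/day/year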

Question 16

The code block displayed below contains an error. The code block should return a DataFrame in which column predErrorAdded contains the results of Python function add_2_if_geq_3 as applied to numeric and nullable column predError in DataFrame transactionsDf. Find the error.

Code block:

def add_2_if_geq_3(x):
    if x is None:
        return x
    elif x >= 3:
        return x+2
    return x

add_2_if_geq_3_udf = udf(add_2_if_geq_3)

transactionsDf.withColumnRenamed("predErrorAdded", add_2_if_geq_3_udf(col("predError")))

Options:

A.

The operator used to add the column does not add column predErrorAdded to the DataFrame.

B.

Instead of col("predError"), the actual DataFrame with the column needs to be passed, like so transactionsDf.predError.

C.

The udf() method does not declare a return type.

D.

UDFs are only available through the SQL API, but not in the Python API as shown in the code block.

E.

The Python function is unable to handle null values, resulting in the code block crashing on execution.
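
For reference, a sketch of the withColumn-based pattern for applying a Python UDF, reusing add_2_if_geq_3 from the code block above; the declared return type is an illustrative assumption:

from pyspark.sql.functions import udf, col
from pyspark.sql.types import IntegerType
add_2_if_geq_3_udf = udf(add_2_if_geq_3, IntegerType())   # an explicit return type is optional but often recommended
transactionsDf.withColumn("predErrorAdded", add_2_if_geq_3_udf(col("predError")))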

Question 17

Which of the following statements about broadcast variables is correct?

Options:

A.

Broadcast variables are serialized with every single task.

B.

Broadcast variables are commonly used for tables that do not fit into memory.

C.

Broadcast variables are immutable.

D.

Broadcast variables are occasionally dynamically updated on a per-task basis.

E.

Broadcast variables are local to the worker node and not shared across the cluster.
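
For reference, a minimal broadcast-variable sketch, assuming a SparkSession named spark; the lookup dictionary is made up for illustration:

lookup = spark.sparkContext.broadcast({1: "electronics", 2: "clothing"})  # shipped to each executor once
lookup.value[1]   # read-only access; broadcast variables cannot be modified after creation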

Question 18

The code block displayed below contains multiple errors. The code block should remove column transactionDate from DataFrame transactionsDf and add a column transactionTimestamp in which dates that are expressed as strings in column transactionDate of DataFrame transactionsDf are converted into unix timestamps. Find the errors.

Sample of DataFrame transactionsDf:

+-------------+---------+-----+-------+---------+----+----------------+
|transactionId|predError|value|storeId|productId|   f| transactionDate|
+-------------+---------+-----+-------+---------+----+----------------+
|            1|        3|    4|     25|        1|null|2020-04-26 15:35|
|            2|        6|    7|      2|        2|null|2020-04-13 22:01|
|            3|        3| null|     25|        3|null|2020-04-02 10:53|
+-------------+---------+-----+-------+---------+----+----------------+

Code block:

transactionsDf = transactionsDf.drop("transactionDate")
transactionsDf["transactionTimestamp"] = unix_timestamp("transactionDate", "yyyy-MM-dd")

Options:

A.

Column transactionDate should be dropped after transactionTimestamp has been written. The string indicating the date format should be adjusted. The withColumn operator should be used instead of the existing column assignment. Operator to_unixtime() should be used instead of unix_timestamp().

B.

Column transactionDate should be dropped after transactionTimestamp has been written. The withColumn operator should be used instead of the existing column assignment. Column transactionDate should be wrapped in a col() operator.

C.

Column transactionDate should be wrapped in a col() operator.

D.

The string indicating the date format should be adjusted. The withColumnReplaced operator should be used instead of the drop and assign pattern in the code block to replace column transactionDate with the new column transactionTimestamp.

E.

Column transactionDate should be dropped after transactionTimestamp has been written. The string indicating the date format should be adjusted. The withColumn operator should be used instead of the existing column assignment.
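
For reference, a sketch of a corrected order of operations, assuming transactionsDf as sampled above; the format string matches the sample data:

from pyspark.sql.functions import unix_timestamp
transactionsDf = (transactionsDf
    .withColumn("transactionTimestamp", unix_timestamp("transactionDate", "yyyy-MM-dd HH:mm"))
    .drop("transactionDate"))   # drop the source column only after the timestamp has been derived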

Question 19

The code block displayed below contains an error. The code block is intended to perform an outer join of DataFrames transactionsDf and itemsDf on columns productId and itemId, respectively. Find the error.

Code block:

transactionsDf.join(itemsDf, [itemsDf.itemId, transactionsDf.productId], "outer")

Options:

A.

The "outer" argument should be eliminated, since "outer" is the default join type.

B.

The join type needs to be appended to the join() operator, like join().outer() instead of listing it as the last argument inside the join() call.

C.

The term [itemsDf.itemId, transactionsDf.productId] should be replaced by itemsDf.itemId == transactionsDf.productId.

D.

The term [itemsDf.itemId, transactionsDf.productId] should be replaced by itemsDf.col("itemId") == transactionsDf.col("productId").

E.

The "outer" argument should be eliminated from the call and join should be replaced by joinOuter.

Question 20

Which of the following code blocks returns a single-column DataFrame showing the number of words in column supplier of DataFrame itemsDf?

Sample of DataFrame itemsDf:

+------+-----------------------------+-------------------+
|itemId|attributes                   |supplier           |
+------+-----------------------------+-------------------+
|1     |[blue, winter, cozy]         |Sports Company Inc.|
|2     |[red, summer, fresh, cooling]|YetiX              |
|3     |[green, summer, travel]      |Sports Company Inc.|
+------+-----------------------------+-------------------+

Options:

A.

itemsDf.split("supplier", " ").count()

B.

itemsDf.split("supplier", " ").size()

C.

itemsDf.select(word_count("supplier"))

D.

spark.select(size(split(col(supplier), " ")))

E.

itemsDf.select(size(split("supplier", " ")))
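
For reference, a sketch of counting words with split() and size(), assuming itemsDf as sampled above:

from pyspark.sql.functions import size, split
itemsDf.select(size(split("supplier", " ")))   # split() tokenizes on spaces, size() counts the array elements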

Question 21

The code block shown below should store DataFrame transactionsDf on two different executors, utilizing the executors' memory as much as possible, but not writing anything to disk. Choose the answer that correctly fills the blanks in the code block to accomplish this.

from pyspark import StorageLevel
transactionsDf.__1__(StorageLevel.__2__).__3__

Options:

A.

1. cache

2. MEMORY_ONLY_2

3. count()

B.

1. persist

2. DISK_ONLY_2

3. count()

C.

1. persist

2. MEMORY_ONLY_2

3. select()

D.

1. cache

2. DISK_ONLY_2

3. count()

E.

1. persist

2. MEMORY_ONLY_2

3. count()
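
For reference, a sketch of replicated in-memory caching triggered by an action, assuming transactionsDf exists:

from pyspark import StorageLevel
transactionsDf.persist(StorageLevel.MEMORY_ONLY_2).count()
# MEMORY_ONLY_2 keeps two in-memory replicas and never spills to disk; count() is an action that fills the cache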

Question 22

Which of the following code blocks reads in the JSON file stored at filePath as a DataFrame?

Options:

A.

spark.read.json(filePath)

B.

spark.read.path(filePath, source="json")

C.

spark.read().path(filePath)

D.

spark.read().json(filePath)

E.

spark.read.path(filePath)

Question 23

Which of the following statements about executors is correct, assuming that one can consider each of the JVMs working as executors as a pool of task execution slots?

Options:

A.

Slot is another name for executor.

B.

There must be fewer executors than tasks.

C.

An executor runs on a single core.

D.

There must be more slots than tasks.

E.

Tasks run in parallel via slots.

Question 24

Which of the following code blocks returns approximately 1000 rows, some of them potentially being duplicates, from the 2000-row DataFrame transactionsDf that only has unique rows?

Options:

A.

transactionsDf.sample(True, 0.5)

B.

transactionsDf.take(1000).distinct()

C.

transactionsDf.sample(False, 0.5)

D.

transactionsDf.take(1000)

E.

transactionsDf.sample(True, 0.5, force=True)

Question 25

The code block displayed below contains an error. The code block should save DataFrame transactionsDf at path path as a parquet file, appending to any existing parquet file. Find the error.

Code block:

transactionsDf.format("parquet").option("mode", "append").save(path)

Options:

A.

The code block is missing a reference to the DataFrameWriter.

B.

save() is evaluated lazily and needs to be followed by an action.

C.

The mode option should be omitted so that the command uses the default mode.

D.

The code block is missing a bucketBy command that takes care of partitions.

E.

Given that the DataFrame should be saved as parquet file, path is being passed to the wrong method.
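
For reference, a sketch of appending a DataFrame to an existing parquet location, assuming transactionsDf and path exist:

transactionsDf.write.format("parquet").mode("append").save(path)
# the DataFrameWriter is reached through .write; mode("append") adds to any existing parquet data at path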

Question 26

Which of the following describes a valid concern about partitioning?

Options:

A.

A shuffle operation returns 200 partitions if not explicitly set.

B.

Decreasing the number of partitions reduces the overall runtime of narrow transformations if there are more executors available than partitions.

C.

No data is exchanged between executors when coalesce() is run.

D.

Short partition processing times are indicative of low skew.

E.

The coalesce() method should be used to increase the number of partitions.
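
For reference, a sketch contrasting the two partition-count operations, assuming transactionsDf exists; the partition counts are illustrative:

transactionsDf.coalesce(4)       # only merges existing partitions, so no data is exchanged between executors
transactionsDf.repartition(400)  # performs a full shuffle and can increase the number of partitions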

Question 27

The code block shown below should return a one-column DataFrame where the column storeId is converted to string type. Choose the answer that correctly fills the blanks in the code block to accomplish this.

transactionsDf.__1__(__2__.__3__(__4__))

Options:

A.

1. select

2. col("storeId")

3. cast

4. StringType

B.

1. select

2. col("storeId")

3. as

4. StringType

C.

1. cast

2. "storeId"

3. as

4. StringType()

D.

1. select

2. col("storeId")

3. cast

4. StringType()

E.

1. select

2. storeId

3. cast

4. StringType()
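
For reference, a sketch of casting a column to string inside select(), assuming transactionsDf exists:

from pyspark.sql.functions import col
from pyspark.sql.types import StringType
transactionsDf.select(col("storeId").cast(StringType()))   # cast() takes a DataType instance or the string "string"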