Databricks Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 today updated questions

Databricks Certified Associate Developer for Apache Spark 3.0 Exam Questions and Answers

Question 1

Which of the following code blocks immediately removes the previously cached DataFrame transactionsDf from memory and disk?

Options:

array_remove(transactionsDf, "*")

transactionsDf.unpersist()

(Correct)

del transactionsDf

transactionsDf.clearCache()

transactionsDf.persist()

Question 2

In which order should the code blocks shown below be run in order to read a JSON file from location jsonPath into a DataFrame and return only the rows that do not have value 3 in column

productId?

1. importedDf.createOrReplaceTempView("importedDf")

2. spark.sql("SELECT * FROM importedDf WHERE productId != 3")

3. spark.sql("FILTER * FROM importedDf WHERE productId != 3")

4. importedDf = spark.read.option("format", "json").path(jsonPath)

5. importedDf = spark.read.json(jsonPath)

Options:

4, 1, 2

5, 1, 3

5, 2

4, 1, 3

5, 1, 2

Question 3

Which of the following code blocks returns a new DataFrame with the same columns as DataFrame transactionsDf, except for columns predError and value which should be removed?

Options:

transactionsDf.drop(["predError", "value"])

transactionsDf.drop("predError", "value")

transactionsDf.drop(col("predError"), col("value"))

transactionsDf.drop(predError, value)

transactionsDf.drop("predError & value")

Question 4

Which of the following is a viable way to improve Spark's performance when dealing with large amounts of data, given that there is only a single application running on the cluster?

Options:

Increase values for the properties spark.default.parallelism and spark.sql.shuffle.partitions

Decrease values for the properties spark.default.parallelism and spark.sql.partitions

Increase values for the properties spark.sql.parallelism and spark.sql.partitions

Increase values for the properties spark.sql.parallelism and spark.sql.shuffle.partitions

Increase values for the properties spark.dynamicAllocation.maxExecutors, spark.default.parallelism, and spark.sql.shuffle.partitions

Question 5

Which of the following are valid execution modes?

Options:

Kubernetes, Local, Client

Client, Cluster, Local

Server, Standalone, Client

Cluster, Server, Local

Standalone, Client, Cluster

Question 6

The code block displayed below contains an error. The code block should create DataFrame itemsAttributesDf which has columns itemId and attribute and lists every attribute from the attributes column in DataFrame itemsDf next to the itemId of the respective row in itemsDf. Find the error.

A sample of DataFrame itemsDf is below.

Code block:

itemsAttributesDf = itemsDf.explode("attributes").alias("attribute").select("attribute", "itemId")

Options:

Since itemId is the index, it does not need to be an argument to the select() method.

The alias() method needs to be called after the select() method.

The explode() method expects a Column object rather than a string.

explode() is not a method of DataFrame. explode() should be used inside the select() method instead.

The split() method should be used inside the select() method instead of the explode() method.

Question 7

Which of the following code blocks reads in the JSON file stored at filePath, enforcing the schema expressed in JSON format in variable json_schema, shown in the code block below?

Code block:

1.json_schema = """

2.{"type": "struct",

3. "fields": [

4. {

5. "name": "itemId",

6. "type": "integer",

7. "nullable": true,

8. "metadata": {}

9. },

10. {

11. "name": "supplier",

12. "type": "string",

13. "nullable": true,

14. "metadata": {}

15. }

16. ]

17.}

18."""

Options:

spark.read.json(filePath, schema=json_schema)

spark.read.schema(json_schema).json(filePath)

1.schema = StructType.fromJson(json.loads(json_schema))

2.spark.read.json(filePath, schema=schema)

spark.read.json(filePath, schema=schema_of_json(json_schema))

spark.read.json(filePath, schema=spark.read.json(json_schema))

Answer:

Explanation:

Explanation

Spark provides a way to digest JSON-formatted strings as schema. However, it is not trivial to use. Although slightly above exam difficulty, this QUESTION NO: is beneficial to your exam

preparation, since

it helps you to familiarize yourself with the concept of enforcing schemas on data you are reading in - a topic within the scope of the exam.

The first answer that jumps out here is the one that uses spark.read.schema instead of spark.read.json. Looking at the documentation of spark.read.schema (linked below), we notice that the

operator expects types pyspark.sql.types.StructType or str as its first argument. While variable json_schema is a string, the documentation states that the str should be "a DDL-formatted string (For

example col0 INT, col1 DOUBLE)". Variable json_schema does not contain a string in this type of format, so this answer option must be wrong.

With four potentially correct answers to go, we now look at the schema parameter of spark.read.json() (documentation linked below). Here, too, the schema parameter expects an input of type

pyspark.sql.types.StructType or "a DDL-formatted string (For example col0 INT, col1 DOUBLE)". We already know that json_schema does not follow this format, so we should focus on how we can

transform json_schema into pyspark.sql.types.StructType. Hereby, we also eliminate the option where schema=json_schema.

The option that includes schema=spark.read.json(json_schema) is also a wrong pick, since spark.read.json returns a DataFrame, and not a pyspark.sql.types.StructType type.

Ruling out the option which includes schema_of_json(json_schema) is rather difficult. The operator's documentation (linked below) states that it "[p]arses a JSON string and infers its schema in DDL

format". This use case is slightly different from the case at hand: json_schema already is a schema definition, it does not make sense to "infer" a schema from it. In the documentation you can see

an example use case which helps you understand the difference better. Here, you pass string '{a: 1}' to schema_of_json() and the method infers a DDL-format schema STRUCT<a: BIGINT> from it.

In our case, we may end up with the output schema of schema_of_json() describing the schema of the JSON schema, instead of using the schema itself. This is not the right answer option.

Now you may consider looking at the StructType.fromJson() method. It returns a variable of type StructType - exactly the type which the schema parameter of spark.read.json expects.

Although we could have looked at the correct answer option earlier, this explanation is kept as exhaustive as necessary to teach you how to systematically eliminate wrong answer options.

More info:

- pyspark.sql.DataFrameReader.schema — PySpark 3.1.2 documentation

- pyspark.sql.DataFrameReader.json — PySpark 3.1.2 documentation

- pyspark.sql.functions.schema_of_json — PySpark 3.1.2 documentation

Static notebook | Dynamic notebook: See test 3, QUESTION NO: 51 (Databricks import instructions)

Question 8

Which of the following statements about Spark's configuration properties is incorrect?

Options:

The maximum number of tasks that an executor can process at the same time is controlled by the spark.task.cpus property.

The maximum number of tasks that an executor can process at the same time is controlled by the spark.executor.cores property.

The default value for spark.sql.autoBroadcastJoinThreshold is 10MB.

The default number of partitions to use when shuffling data for joins or aggregations is 300.

The default number of partitions returned from certain transformations can be controlled by the spark.default.parallelism property.

Question 9

In which order should the code blocks shown below be run in order to return the number of records that are not empty in column value in the DataFrame resulting from an inner join of DataFrame

transactionsDf and itemsDf on columns productId and itemId, respectively?

1. .filter(~isnull(col('value')))

2. .count()

3. transactionsDf.join(itemsDf, col("transactionsDf.productId")==col("itemsDf.itemId"))

4. transactionsDf.join(itemsDf, transactionsDf.productId==itemsDf.itemId, how='inner')

5. .filter(col('value').isnotnull())

6. .sum(col('value'))

Options:

4, 1, 2

3, 1, 6

3, 1, 2

3, 5, 2

4, 6

Question 10

The code block shown below should return only the average prediction error (column predError) of a random subset, without replacement, of approximately 15% of rows in DataFrame

transactionsDf. Choose the answer that correctly fills the blanks in the code block to accomplish this.

transactionsDf.__1__(__2__, __3__).__4__(avg('predError'))

Options:

1. sample

2. True

3. 0.15

4. filter

1. sample

2. False

3. 0.15

4. select

1. sample

2. 0.85

3. False

4. select

1. fraction

2. 0.15

3. True

4. where

1. fraction

2. False

3. 0.85

4. select

Question 11

Which of the following code blocks returns a DataFrame that has all columns of DataFrame transactionsDf and an additional column predErrorSquared which is the squared value of column

predError in DataFrame transactionsDf?

Options:

transactionsDf.withColumn("predError", pow(col("predErrorSquared"), 2))

transactionsDf.withColumnRenamed("predErrorSquared", pow(predError, 2))

transactionsDf.withColumn("predErrorSquared", pow(col("predError"), lit(2)))

transactionsDf.withColumn("predErrorSquared", pow(predError, lit(2)))

transactionsDf.withColumn("predErrorSquared", "predError"**2)

Question 12

Which of the following code blocks creates a new DataFrame with 3 columns, productId, highest, and lowest, that shows the biggest and smallest values of column value per value in column

productId from DataFrame transactionsDf?

Sample of DataFrame transactionsDf:

1.+-------------+---------+-----+-------+---------+----+

3.+-------------+---------+-----+-------+---------+----+

4.| 1| 3| 4| 25| 1|null|

5.| 2| 6| 7| 2| 2|null|

6.| 3| 3| null| 25| 3|null|

7.| 4| null| null| 3| 2|null|

8.| 5| null| null| null| 2|null|

9.| 6| 3| 2| 25| 2|null|

10.+-------------+---------+-----+-------+---------+----+

Options:

transactionsDf.max('value').min('value')

transactionsDf.agg(max('value').alias('highest'), min('value').alias('lowest'))

transactionsDf.groupby(col(productId)).agg(max(col(value)).alias("highest"), min(col(value)).alias("lowest"))

transactionsDf.groupby('productId').agg(max('value').alias('highest'), min('value').alias('lowest'))

transactionsDf.groupby("productId").agg({"highest": max("value"), "lowest": min("value")})

Question 13

The code block shown below should return a column that indicates through boolean variables whether rows in DataFrame transactionsDf have values greater or equal to 20 and smaller or equal to

30 in column storeId and have the value 2 in column productId. Choose the answer that correctly fills the blanks in the code block to accomplish this.

transactionsDf.__1__((__2__.__3__) __4__ (__5__))

Options:

1. select

2. col("storeId")

3. between(20, 30)

4. and

5. col("productId")==2

1. where

2. col("storeId")

3. geq(20).leq(30)

4. &

5. col("productId")==2

1. select

2. "storeId"

3. between(20, 30)

4. &&

5. col("productId")==2

1. select

2. col("storeId")

3. between(20, 30)

4. &&

5. col("productId")=2

1. select

2. col("storeId")

3. between(20, 30)

4. &

5. col("productId")==2

Question 14

Which of the following code blocks adds a column predErrorSqrt to DataFrame transactionsDf that is the square root of column predError?

Options:

transactionsDf.withColumn("predErrorSqrt", sqrt(predError))

transactionsDf.select(sqrt(predError))

transactionsDf.withColumn("predErrorSqrt", col("predError").sqrt())

transactionsDf.withColumn("predErrorSqrt", sqrt(col("predError")))

transactionsDf.select(sqrt("predError"))

Answer:

Explanation:

Explanation

transactionsDf.withColumn("predErrorSqrt", sqrt(col("predError")))

Correct. The DataFrame.withColumn() operator is used to add a new column to a DataFrame. It takes two arguments: The name of the new column (here: predErrorSqrt) and a Column expression

as the new column. In PySpark, a Column expression means referring to a column using the col("predError") command or by other means, for example by transactionsDf.predError, or even just

using the column name as a string, "predError".

The QUESTION NO: asks for the square root. sqrt() is a function in pyspark.sql.functions and calculates the square root. It takes a value or a Column as an input. Here it is the predError column of

DataFrame transactionsDf expressed through col("predError").

transactionsDf.withColumn("predErrorSqrt", sqrt(predError))

Incorrect. In this expression, sqrt(predError) is incorrect syntax. You cannot refer to predError in this way – to Spark it looks as if you are trying to refer to the non-existent Python variable predError.

You could pass transactionsDf.predError, col("predError") (as in the correct solution), or even just "predError" instead.

transactionsDf.select(sqrt(predError))

Wrong. Here, the explanation just above this one about how to refer to predError applies.

transactionsDf.select(sqrt("predError"))

No. While this is correct syntax, it will return a single-column DataFrame only containing a column showing the square root of column predError. However, the QUESTION NO: asks for a column to

be added to the original DataFrame transactionsDf.

transactionsDf.withColumn("predErrorSqrt", col("predError").sqrt())

No. The issue with this statement is that column col("predError") has no sqrt() method. sqrt() is a member of pyspark.sql.functions, but not of pyspark.sql.Column.

More info: pyspark.sql.DataFrame.withColumn — PySpark 3.1.2 documentation and pyspark.sql.functions.sqrt — PySpark 3.1.2 documentation

Static notebook | Dynamic notebook: See test 2, QUESTION NO: 31 (Databricks import instructions)

Question 15

Which of the following code blocks returns a DataFrame with an added column to DataFrame transactionsDf that shows the unix epoch timestamps in column transactionDate as strings in the format

month/day/year in column transactionDateFormatted?

Excerpt of DataFrame transactionsDf:

Options:

transactionsDf.withColumn("transactionDateFormatted", from_unixtime("transactionDate", format="dd/MM/yyyy"))

transactionsDf.withColumnRenamed("transactionDate", "transactionDateFormatted", from_unixtime("transactionDateFormatted", format="MM/dd/yyyy"))

transactionsDf.apply(from_unixtime(format="MM/dd/yyyy")).asColumn("transactionDateFormatted")

transactionsDf.withColumn("transactionDateFormatted", from_unixtime("transactionDate", format="MM/dd/yyyy"))

transactionsDf.withColumn("transactionDateFormatted", from_unixtime("transactionDate"))

Question 16

The code block displayed below contains an error. The code block should return a DataFrame in which column predErrorAdded contains the results of Python function add_2_if_geq_3 as applied to

numeric and nullable column predError in DataFrame transactionsDf. Find the error.

Code block:

1.def add_2_if_geq_3(x):

2. if x is None:

3. return x

4. elif x >= 3:

5. return x+2

6. return x

8.add_2_if_geq_3_udf = udf(add_2_if_geq_3)

10.transactionsDf.withColumnRenamed("predErrorAdded", add_2_if_geq_3_udf(col("predError")))

Options:

The operator used to adding the column does not add column predErrorAdded to the DataFrame.

Instead of col("predError"), the actual DataFrame with the column needs to be passed, like so transactionsDf.predError.

The udf() method does not declare a return type.

UDFs are only available through the SQL API, but not in the Python API as shown in the code block.

The Python function is unable to handle null values, resulting in the code block crashing on execution.

Question 17

Which of the following statements about broadcast variables is correct?

Options:

Broadcast variables are serialized with every single task.

Broadcast variables are commonly used for tables that do not fit into memory.

Broadcast variables are immutable.

Broadcast variables are occasionally dynamically updated on a per-task basis.

Broadcast variables are local to the worker node and not shared across the cluster.

Question 18

The code block displayed below contains multiple errors. The code block should remove column transactionDate from DataFrame transactionsDf and add a column transactionTimestamp in which

dates that are expressed as strings in column transactionDate of DataFrame transactionsDf are converted into unix timestamps. Find the errors.

Sample of DataFrame transactionsDf:

1.+-------------+---------+-----+-------+---------+----+----------------+

3.+-------------+---------+-----+-------+---------+----+----------------+

4.| 1| 3| 4| 25| 1|null|2020-04-26 15:35|

5.| 2| 6| 7| 2| 2|null|2020-04-13 22:01|

6.| 3| 3| null| 25| 3|null|2020-04-02 10:53|

7.+-------------+---------+-----+-------+---------+----+----------------+

Code block:

1.transactionsDf = transactionsDf.drop("transactionDate")

2.transactionsDf["transactionTimestamp"] = unix_timestamp("transactionDate", "yyyy-MM-dd")

Options:

Column transactionDate should be dropped after transactionTimestamp has been written. The string indicating the date format should be adjusted. The withColumn operator should be used

instead of the existing column assignment. Operator to_unixtime() should be used instead of unix_timestamp().

Column transactionDate should be dropped after transactionTimestamp has been written. The withColumn operator should be used instead of the existing column assignment. Column

transactionDate should be wrapped in a col() operator.

Column transactionDate should be wrapped in a col() operator.

The string indicating the date format should be adjusted. The withColumnReplaced operator should be used instead of the drop and assign pattern in the code block to replace column

transactionDate with the new column transactionTimestamp.

Column transactionDate should be dropped after transactionTimestamp has been written. The string indicating the date format should be adjusted. The withColumn operator should be used

instead of the existing column assignment.

Question 19

The code block displayed below contains an error. The code block is intended to perform an outer join of DataFrames transactionsDf and itemsDf on columns productId and itemId, respectively.

Find the error.

Code block:

transactionsDf.join(itemsDf, [itemsDf.itemId, transactionsDf.productId], "outer")

Options:

The "outer" argument should be eliminated, since "outer" is the default join type.

The join type needs to be appended to the join() operator, like join().outer() instead of listing it as the last argument inside the join() call.

The term [itemsDf.itemId, transactionsDf.productId] should be replaced by itemsDf.itemId == transactionsDf.productId.

The term [itemsDf.itemId, transactionsDf.productId] should be replaced by itemsDf.col("itemId") == transactionsDf.col("productId").

The "outer" argument should be eliminated from the call and join should be replaced by joinOuter.

Question 20

Which of the following code blocks returns a single-column DataFrame showing the number of words in column supplier of DataFrame itemsDf?

Sample of DataFrame itemsDf:

1.+------+-----------------------------+-------------------+

2.|itemId|attributes |supplier |

3.+------+-----------------------------+-------------------+

4.|1 |[blue, winter, cozy] |Sports Company Inc.|

5.|2 |[red, summer, fresh, cooling]|YetiX |

6.|3 |[green, summer, travel] |Sports Company Inc.|

7.+------+-----------------------------+-------------------+

Options:

itemsDf.split("supplier", " ").count()

itemsDf.split("supplier", " ").size()

itemsDf.select(word_count("supplier"))

spark.select(size(split(col(supplier), " ")))

itemsDf.select(size(split("supplier", " ")))

Question 21

The code block shown below should store DataFrame transactionsDf on two different executors, utilizing the executors' memory as much as possible, but not writing anything to disk. Choose the

answer that correctly fills the blanks in the code block to accomplish this.

1.from pyspark import StorageLevel

2.transactionsDf.__1__(StorageLevel.__2__).__3__

Options:

1. cache

2. MEMORY_ONLY_2

3. count()

1. persist

2. DISK_ONLY_2

3. count()

1. persist

2. MEMORY_ONLY_2

3. select()

1. cache

2. DISK_ONLY_2

3. count()

1. persist

2. MEMORY_ONLY_2

3. count()

Question 22

Which of the following code blocks reads in the JSON file stored at filePath as a DataFrame?

Options:

spark.read.json(filePath)

spark.read.path(filePath, source="json")

spark.read().path(filePath)

spark.read().json(filePath)

spark.read.path(filePath)

Question 23

Which of the following statements about executors is correct, assuming that one can consider each of the JVMs working as executors as a pool of task execution slots?

Options:

Slot is another name for executor.

There must be less executors than tasks.

An executor runs on a single core.

There must be more slots than tasks.

Tasks run in parallel via slots.

Question 24

Which of the following code blocks returns approximately 1000 rows, some of them potentially being duplicates, from the 2000-row DataFrame transactionsDf that only has unique rows?

Options:

transactionsDf.sample(True, 0.5)

transactionsDf.take(1000).distinct()

transactionsDf.sample(False, 0.5)

transactionsDf.take(1000)

transactionsDf.sample(True, 0.5, force=True)

Question 25

The code block displayed below contains an error. The code block should save DataFrame transactionsDf at path path as a parquet file, appending to any existing parquet file. Find the error.

Code block:

Options:

transactionsDf.format("parquet").option("mode", "append").save(path)

The code block is missing a reference to the DataFrameWriter.

save() is evaluated lazily and needs to be followed by an action.

The mode option should be omitted so that the command uses the default mode.

The code block is missing a bucketBy command that takes care of partitions.

Given that the DataFrame should be saved as parquet file, path is being passed to the wrong method.

Question 26

Which of the following describes a valid concern about partitioning?

Options:

A shuffle operation returns 200 partitions if not explicitly set.

Decreasing the number of partitions reduces the overall runtime of narrow transformations if there are more executors available than partitions.

No data is exchanged between executors when coalesce() is run.

Short partition processing times are indicative of low skew.

The coalesce() method should be used to increase the number of partitions.

Answer:

Explanation:

Explanation

A shuffle operation returns 200 partitions if not explicitly set.

Correct. 200 is the default value for the Spark property spark.sql.shuffle.partitions. This property determines how many partitions Spark uses when shuffling data for joins or aggregations.

The coalesce() method should be used to increase the number of partitions.

Incorrect. The coalesce() method can only be used to decrease the number of partitions.

Decreasing the number of partitions reduces the overall runtime of narrow transformations if there are more executors available than partitions.

No. For narrow transformations, fewer partitions usually result in a longer overall runtime, if more executors are available than partitions.

A narrow transformation does not include a shuffle, so no data need to be exchanged between executors. Shuffles are expensive and can be a bottleneck for executing Spark workloads.

Narrow transformations, however, are executed on a per-partition basis, blocking one executor per partition. So, it matters how many executors are available to perform work in parallel relative to the

number of partitions. If the number of executors is greater than the number of partitions, then some executors are idle while other process the partitions. On the flip side, if the number of executors is

smaller than the number of partitions, the entire operation can only be finished after some executors have processed multiple partitions, one after the other. To minimize the overall runtime, one

would want to have the number of partitions equal to the number of executors (but not more).

So, for the scenario at hand, increasing the number of partitions reduces the overall runtime of narrow transformations if there are more executors available than partitions.

No data is exchanged between executors when coalesce() is run.

No. While coalesce() avoids a full shuffle, it may still cause a partial shuffle, resulting in data exchange between executors.

Short partition processing times are indicative of low skew.

Incorrect. Data skew means that data is distributed unevenly over the partitions of a dataset. Low skew therefore means that data is distributed evenly.

Partition processing time, the time that executors take to process partitions, can be indicative of skew if some executors take a long time to process a partition, but others do not. However, a short

processing time is not per se indicative a low skew: It may simply be short because the partition is small.

A situation indicative of low skew may be when all executors finish processing their partitions in the same timeframe. High skew may be indicated by some executors taking much longer to finish their

partitions than others. But the answer does not make any comparison – so by itself it does not provide enough information to make any assessment about skew.

More info: Spark Repartition & Coalesce - Explained and Performance Tuning - Spark 3.1.2 Documentation

Question 27

The code block shown below should return a one-column DataFrame where the column storeId is converted to string type. Choose the answer that correctly fills the blanks in the code block to

accomplish this.

transactionsDf.__1__(__2__.__3__(__4__))

Options:

1. select

2. col("storeId")

3. cast

4. StringType

1. select

2. col("storeId")

3. as

4. StringType

1. cast

2. "storeId"

3. as

4. StringType()

1. select

2. col("storeId")

3. cast

4. StringType()

1. select

2. storeId

3. cast

4. StringType()

Load More Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 Questions

Winter Special Flat 65% Limited Time Discount offer - Ends in 0d 00h 00m 00s - Coupon code: suredis

Databricks Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 Databricks Certified Associate Developer for Apache Spark 3.0 Exam Exam Practice Test

Databricks Certified Associate Developer for Apache Spark 3.0 Exam Questions and Answers

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer: