Create Next App

databricks CERTIFIED_DATA_ENGINEER_PROFESSIONAL

Exam contains 222 questions

Page 3 of 37

Question 13 🔥

An hourly batch job is configured to ingest data files from a cloud object storage container where each batch represent all records produced by the source system in a given hour. The batch job to process these records into the Lakehouse is sufficiently delayed to ensure no late-arriving data is missed. The user_id field represents a unique key for the data, which has the following schema: user_id BIGINT, username STRING, user_utc STRING, user_region STRING, last_login BIGINT, auto_pay BOOLEAN, last_updated BIGINTNew records are all ingested into a table named account_history which maintains a full record of all data in the same schema as the source. The next table in the system is named account_current and is implemented as a Type 1 table representing the most recent value for each unique user_id.Assuming there are millions of user accounts and tens of thousands of records processed hourly, which implementation can be used to efficiently update the described account_current table as part of each hourly batch job?

Which database solution meets these requirements?

A. Use Auto Loader to subscribe to new files in the account_history directory; configure a Structured Streaming trigger once job to batch update newly detected files into the account_current table.

Highly voted

B. Overwrite the account_current table with each batch using the results of a query against the account_history table grouping by user_id and filtering for the max value of last_updated.

Highly voted

C. Filter records in account_history using the last_updated field and the most recent hour processed, as well as the max last_iogin by user_id write a merge statement to update or insert the most recent value for each user_id.

Highly voted

D. Use Delta Lake version history to get the difference between the latest version of account_history and one version prior, then write these records to account_current.

Highly voted

E. Filter records in account_history using the last_updated field and the most recent hour processed, making sure to deduplicate on username; write a merge statement to update or insert the most recent value for each username.

Highly voted

Discussion of the question

Question 14 🔥

The data science team has created and logged a production model using MLflow. The following code correctly imports and applies the production model to output the predictions as a new DataFrame named preds with the schema "customer_id LONG, predictions DOUBLE, date DATE".The data science team would like predictions saved to a Delta Lake table with the ability to compare all predictions across time. Churn predictions will be made at most once per day.Which code block accomplishes this task while minimizing potential compute costs?

Which database solution meets these requirements?

A. preds.write.mode("append").saveAsTable("churn_preds")

Highly voted

B. preds.write.format("delta").save("/preds/churn_preds")

Highly voted

Discussion of the question

Question 15 🔥

A table is registered with the following code:Both users and orders are Delta Lake tables. Which statement describes the results of querying recent_orders?

Which database solution meets these requirements?

A. All logic will execute at query time and return the result of joining the valid versions of the source tables at the time the query finishes.

Highly voted

B. All logic will execute when the table is defined and store the result of joining tables to the DBFS; this stored data will be returned when the table is queried.

Highly voted

C. Results will be computed and cached when the table is defined; these cached results will incrementally update as new records are inserted into source tables.

Highly voted

D. All logic will execute at query time and return the result of joining the valid versions of the source tables at the time the query began.

Highly voted

E. The versions of each source table will be stored in the table transaction log; query results will be saved to DBFS with each query.

Highly voted

Discussion of the question

Question 16 🔥

A production workload incrementally applies updates from an external Change Data Capture feed to a Delta Lake table as an always-on Structured Stream job. When data was initially migrated for this table, OPTIMIZE was executed and most data files were resized to 1 GB. Auto Optimize and Auto Compaction were both turned on for the streaming production job. Recent review of data files shows that most data files are under 64 MB, although each partition in the table contains at least 1 GB of data and the total table size is over 10 TB.Which of the following likely explains these smaller file sizes?

Which database solution meets these requirements?

A. Databricks has autotuned to a smaller target file size to reduce duration of MERGE operations

Highly voted

B. Z-order indices calculated on the table are preventing file compaction

Highly voted

C. Bloom filter indices calculated on the table are preventing file compaction

Highly voted

D. Databricks has autotuned to a smaller target file size based on the overall size of data in the table

Highly voted

E. Databricks has autotuned to a smaller target file size based on the amount of data in each partition

Highly voted

Discussion of the question

Question 17 🔥

Which statement regarding stream-static joins and static Delta tables is correct?

Which database solution meets these requirements?

A. Each microbatch of a stream-static join will use the most recent version of the static Delta table as of each microbatch.

Highly voted

B. Each microbatch of a stream-static join will use the most recent version of the static Delta table as of the job's initialization.

Highly voted

C. The checkpoint directory will be used to track state information for the unique keys present in the join.

Highly voted

D. Stream-static joins cannot use static Delta tables because of consistency issues.

Highly voted

E. The checkpoint directory will be used to track updates to the static Delta table.

Highly voted

Discussion of the question

Question 18 🔥

A junior data engineer has been asked to develop a streaming data pipeline with a grouped aggregation using DataFrame df. The pipeline needs to calculate the average humidity and average temperature for each non-overlapping five-minute interval. Events are recorded once per minute per device.Streaming DataFrame df has the following schema:"device_id INT, event_time TIMESTAMP, temp FLOAT, humidity FLOAT"Code block:Choose the response that correctly fills in the blank within the code block to complete this task.

Which database solution meets these requirements?

A. to_interval("event_time", "5 minutes").alias("time")

Highly voted

B. window("event_time", "5 minutes").alias("time")

Highly voted

C. "event_time"

Highly voted

D. window("event_time", "10 minutes").alias("time")

Highly voted

E. lag("event_time", "10 minutes").alias("time")

Highly voted

Discussion of the question

Ready to Pass Your Certification Test

databricks CERTIFIED_DATA_ENGINEER_PROFESSIONAL

Exam contains 222 questions

Lorem ipsum dolor sit amet consectetur. Eget sed turpis aenean sit aenean. Integer at nam ullamcorper a.

Company

Product

Resources

Follow us