An organization currently runs a large Hadoop environment in their data center and is in the process of creating an alternative Hadoop environment on AWS, using Amazon EMR. They generate around 20 TB of data on a monthly basis. Also on a monthly basis, files need to be grouped and copied to Amazon S3 to be used by the Amazon EMR environment. They have multiple S3 buckets across AWS accounts to which the data needs to be copied. There is a 10 Gbps AWS Direct Connect connection between their data center and AWS, and the network team has agreed to allocate 50% of the AWS Direct Connect bandwidth to data transfer. The data transfer cannot take more than two days. What would be the MOST efficient approach to transfer data to AWS on a monthly basis?
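Before choosing a transfer approach, it helps to confirm that the allocated bandwidth can actually move the data inside the window. A minimal back-of-envelope check (assuming decimal TB and ideal link utilization, which real transfers will not achieve):

```shell
# Can 20 TB move over 50% of a 10 Gbps Direct Connect link within 2 days?
DATA_BITS=$((20 * 1000 * 1000 * 1000 * 1000 * 8))   # 20 TB expressed in bits
LINK_BPS=$((10 * 1000 * 1000 * 1000 / 2))           # 50% of 10 Gbps
SECONDS_NEEDED=$((DATA_BITS / LINK_BPS))
echo "Transfer time at full allocated bandwidth: ${SECONDS_NEEDED} s (~$((SECONDS_NEEDED / 3600)) h)"
# roughly 9 hours, comfortably inside the 48-hour limit even with protocol overhead
```

Because the window is generous relative to the link, network-based transfer over Direct Connect is viable; the remaining question is only how to parallelize and route the copies efficiently.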
An organization is developing a mobile social application and needs to collect logs from all devices on which it is installed. The organization is evaluating Amazon Kinesis Data Streams to push logs and Amazon EMR to process data. They want to store data on HDFS using the default replication factor to replicate data among the cluster, but they are concerned about the durability of the data. Currently, they are producing 300 GB of raw data daily, with additional spikes during special events. They will need to scale out the Amazon EMR cluster to match the increase in streamed data. Which solution prevents data loss and matches compute demand?
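When sizing the cluster, note that the default HDFS replication factor of 3 triples the raw storage footprint. A quick sketch of the baseline daily requirement (spike headroom not included):

```shell
# Daily HDFS capacity needed for 300 GB/day of raw data
# at the default HDFS replication factor of 3.
RAW_GB_PER_DAY=300
REPLICATION=3
HDFS_GB_PER_DAY=$((RAW_GB_PER_DAY * REPLICATION))
echo "HDFS capacity needed per day: ${HDFS_GB_PER_DAY} GB"
# 900 GB/day before accounting for event-driven spikes
```

Replication protects against the loss of individual nodes, but not against termination of the whole cluster, which is why durability beyond HDFS is a common concern in EMR designs.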
An administrator needs to manage a large catalog of items from various external sellers. The administrator needs to determine whether the items should be identified as minimally dangerous, dangerous, or highly dangerous based on their textual descriptions. The administrator already has some items with the danger attribute, but receives hundreds of new item descriptions every day without such classification. The administrator has a system that captures dangerous goods reports from the customer support team or from user feedback. What is a cost-effective architecture to solve this issue?
A company receives data sets coming from external providers on Amazon S3. Data sets from different providers are dependent on one another. Data sets will arrive at different times and in no particular order. A data architect needs to design a solution that enables the company to do the following:
✑ Rapidly perform cross-data-set analysis as soon as the data becomes available
✑ Manage dependencies between data sets that arrive at different times
Which architecture strategy offers a scalable and cost-effective solution that meets these requirements?
An advertising organization uses an application to process a stream of events that are received from clients in multiple unstructured formats. The application does the following:
✑ Transforms the events into a single structured format and streams them to Amazon Kinesis for real-time analysis.
✑ Stores the unstructured raw events from the log files on local hard drives that are rotated and uploaded to Amazon S3.
The organization wants to extract campaign performance reporting using an existing Amazon Redshift cluster. Which solution will provide the performance data with the LEAST number of operations?
An Amazon Redshift database is encrypted using AWS KMS. A data engineer needs to use the AWS CLI to create a KMS-encrypted snapshot of the database in another AWS Region. Which three steps should the data engineer take to accomplish this task? (Choose three.)
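For orientation, one plausible AWS CLI sequence for cross-Region copy of KMS-encrypted Redshift snapshots is sketched below. All identifiers (cluster name, grant name, key ARN, Region names) are hypothetical placeholders; the general pattern is that the destination Region needs a snapshot copy grant for its own KMS key before cross-Region copying can be enabled on the cluster.

```shell
# 1. In the destination Region, create a snapshot copy grant that lets
#    Redshift use a KMS key from that Region to encrypt copied snapshots.
aws redshift create-snapshot-copy-grant \
    --region us-west-2 \
    --snapshot-copy-grant-name my-copy-grant \
    --kms-key-id arn:aws:kms:us-west-2:123456789012:key/EXAMPLE-KEY-ID

# 2. In the source Region, enable cross-Region snapshot copy on the
#    cluster, referencing the destination Region and the grant.
aws redshift enable-snapshot-copy \
    --region us-east-1 \
    --cluster-identifier my-redshift-cluster \
    --destination-region us-west-2 \
    --snapshot-copy-grant-name my-copy-grant

# 3. Create a manual snapshot; with copy enabled, it is replicated to
#    the destination Region encrypted under the destination-Region key.
aws redshift create-cluster-snapshot \
    --region us-east-1 \
    --cluster-identifier my-redshift-cluster \
    --snapshot-identifier my-cluster-snapshot
```

These commands require valid AWS credentials and an existing cluster, so the block is a non-runnable configuration sketch rather than a tested script.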