Question # 1
Problem Scenario 28: You need to implement a near-real-time solution for collecting information as it is submitted in files, with the data below.
echo "IBM,100,20160104" >> /tmp/spooldir2/.bb.txt
echo "IBM,103,20160105" >> /tmp/spooldir2/.bb.txt
mv /tmp/spooldir2/.bb.txt /tmp/spooldir2/bb.txt
After a few minutes:
echo "IBM,100.2,20160104" >> /tmp/spooldir2/.dr.txt
echo "IBM,103.1,20160105" >> /tmp/spooldir2/.dr.txt
mv /tmp/spooldir2/.dr.txt /tmp/spooldir2/dr.txt
You have been given the directory location /tmp/spooldir2 (create it if it does not exist). As soon as a file is committed in this directory, it must become available in HDFS at /tmp/flume/primary as well as /tmp/flume/secondary. However, note that /tmp/flume/secondary is optional: if a transaction that writes to this directory fails, it need not be rolled back. Write a Flume configuration file named flume8.conf and use it to load the data into HDFS with the following additional properties:
1. Spool the /tmp/spooldir2 directory.
2. The file prefix in HDFS should be events.
3. The file suffix should be .log.
4. While a file is uncommitted and still in use, it should have _ as a prefix.
5. Data should be written to HDFS as text.
Answer: See the explanation for Step by Step Solution and configuration.
Explanation:
Solution:
Step 1: Create the spool directory.
mkdir /tmp/spooldir2
Step 2: Create the Flume configuration file with the source, sink and channel settings below, and save it as flume8.conf.
agent1.sources = source1
agent1.sinks = sink1a sink1b
agent1.channels = channel1a channel1b
agent1.sources.source1.channels = channel1a channel1b
agent1.sources.source1.selector.type = replicating
agent1.sources.source1.selector.optional = channel1b
agent1.sinks.sink1a.channel = channel1a
agent1.sinks.sink1b.channel = channel1b
agent1.sources.source1.type = spooldir
agent1.sources.source1.spoolDir = /tmp/spooldir2
agent1.sinks.sink1a.type = hdfs
agent1.sinks.sink1a.hdfs.path = /tmp/flume/primary
agent1.sinks.sink1a.hdfs.filePrefix = events
agent1.sinks.sink1a.hdfs.fileSuffix = .log
agent1.sinks.sink1a.hdfs.fileType = DataStream
agent1.sinks.sink1b.type = hdfs
agent1.sinks.sink1b.hdfs.path = /tmp/flume/secondary
agent1.sinks.sink1b.hdfs.filePrefix = events
agent1.sinks.sink1b.hdfs.fileSuffix = .log
agent1.sinks.sink1b.hdfs.fileType = DataStream
agent1.channels.channel1a.type = file
agent1.channels.channel1b.type = memory
Step 3: Start the Flume agent using this configuration file; it will append the data to hdfs.
flume-ng agent --conf /home/cloudera/flumeconf --conf-file /home/cloudera/flumeconf/flume8.conf --name agent1
Step 4: Open another terminal and create the files in /tmp/spooldir2/.
echo "IBM,100,20160104" >> /tmp/spooldir2/.bb.txt
echo "IBM,103,20160105" >> /tmp/spooldir2/.bb.txt
mv /tmp/spooldir2/.bb.txt /tmp/spooldir2/bb.txt
After a few minutes:
echo "IBM,100.2,20160104" >> /tmp/spooldir2/.dr.txt
echo "IBM,103.1,20160105" >> /tmp/spooldir2/.dr.txt
mv /tmp/spooldir2/.dr.txt /tmp/spooldir2/dr.txt
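Note that the configuration above does not explicitly set the in-use prefix asked for in point 4. Assuming the standard Flume HDFS sink property hdfs.inUsePrefix (an addition not present in the original solution), a minimal sketch would be to add:
agent1.sinks.sink1a.hdfs.inUsePrefix = _
agent1.sinks.sink1b.hdfs.inUsePrefix = _
Once the agent is running and the files have been moved into the spool directory, the result can be checked with hdfs dfs -ls /tmp/flume/primary and hdfs dfs -cat /tmp/flume/primary/events*.log.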
Question # 2
Problem Scenario 34: You have been given a file named spark6/user.csv. The data is given below:
user.csv
id,topic,hits
Rahul,scala,120
Nikita,spark,80
Mithun,spark,1
myself,cca175,180
Now write Spark code in Scala which removes the header row and creates an RDD of values as below for all remaining rows, and also filters out any row whose id is "myself".
Map(id -> Rahul, topic -> scala, hits -> 120)
Answer: See the explanation for Step by Step Solution and configuration.
Explanation:
Solution:
Step 1: Create the file in hdfs (we will do this using Hue). Alternatively, you can first create it in the local filesystem and then upload it to hdfs.
Step 2: Load the user.csv file from hdfs.
val csv = sc.textFile("spark6/user.csv")
Step 3: Split and clean the data.
val headerAndRows = csv.map(line => line.split(",").map(_.trim))
Step 4: Get the header row.
val header = headerAndRows.first
Step 5: Filter out the header (check whether the first value matches the first header name).
val data = headerAndRows.filter(_(0) != header(0))
Step 6: Zip each row with the header to build header/value maps.
val maps = data.map(splits => header.zip(splits).toMap)
Step 7: Filter out the user "myself".
val result = maps.filter(map => map("id") != "myself")
Step 8: Save the output as a text file.
result.saveAsTextFile("spark6/result.txt")
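For reference, the header can also be removed without comparing every row against header(0). The sketch below uses mapPartitionsWithIndex to drop the first line of the first partition; the variable name noHeader is illustrative and not part of the original solution.
val noHeader = csv.mapPartitionsWithIndex((idx, iter) => if (idx == 0) iter.drop(1) else iter)
This assumes the header sits in the first partition, which is the case for a file read with sc.textFile.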
Question # 3
Problem Scenario 77: You have been given a MySQL DB with the following details.
user=retail_dba
password=cloudera
database=retail_db
table=retail_db.orders
table=retail_db.order_items
jdbc URL = jdbc:mysql://quickstart:3306/retail_db
Columns of the orders table: (order_id, order_date, order_customer_id, order_status)
Columns of the order_items table: (order_item_id, order_item_order_id, order_item_product_id, order_item_quantity, order_item_subtotal, order_item_product_price)
Please accomplish the following activities.
1. Copy the "retail_db.orders" and "retail_db.order_items" tables to hdfs in the respective directories p92_orders and p92_order_items.
2. Join the data on order_id using Spark and Python.
3. Calculate the total revenue per day and per order.
4. Calculate the total and average revenue for each date, using
- combineByKey
- aggregateByKey
Answer: See the explanation for Step by Step Solution and configuration.
Explanation:
Solution:
Step 1: Import each table individually.
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table=orders --target-dir=p92_orders -m 1
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table=order_items --target-dir=p92_order_items -m 1
Note: make sure there is no space before or after the '=' sign. Sqoop uses the MapReduce framework to copy data from the RDBMS to hdfs.
Step 2: Read the data from one of the partitions created by the commands above.
hadoop fs -cat p92_orders/part-m-00000
hadoop fs -cat p92_order_items/part-m-00000
Step 3: Load the two directories above as RDDs using Spark and Python (open a pyspark terminal and do the following).
orders = sc.textFile("p92_orders")
orderItems = sc.textFile("p92_order_items")
Step 4: Convert each RDD into key/value pairs, with order_id as the key and the whole line as the value.
# The first column of orders is order_id
ordersKeyValue = orders.map(lambda line: (int(line.split(",")[0]), line))
# The second column of order_items is order_item_order_id
orderItemsKeyValue = orderItems.map(lambda line: (int(line.split(",")[1]), line))
Step 5: Join both RDDs on order_id.
joinedData = orderItemsKeyValue.join(ordersKeyValue)
# print the joined data
for line in joinedData.collect(): print(line)
The format of joinedData is (order_id, (all columns from orderItemsKeyValue, all columns from ordersKeyValue)).
Step 6: Now fetch the selected values: order date, order id and the amount collected for the order.
# Each returned element will be ((order_date, order_id), amount_collected)
revenuePerDayPerOrder = joinedData.map(lambda row: ((row[1][1].split(",")[1], row[0]), float(row[1][0].split(",")[4])))
# print the result
for line in revenuePerDayPerOrder.collect(): print(line)
Step 7: Now calculate the total revenue per day and per order.
A. Using reduceByKey
totalRevenuePerDayPerOrder = revenuePerDayPerOrder.reduceByKey(lambda runningSum, value: runningSum + value)
for line in totalRevenuePerDayPerOrder.sortByKey().collect(): print(line)
# Generate data as (date, amount_collected), ignoring order_id
dateAndRevenueTuple = totalRevenuePerDayPerOrder.map(lambda line: (line[0][0], line[1]))
for line in dateAndRevenueTuple.sortByKey().collect(): print(line)
Step 8: Calculate the total amount collected for each day, together with the number of orders per day, using combineByKey.
# Output is (date, (total revenue for the date, number of orders on that date))
# Lambda 1: creates the initial combiner (revenue, 1) for the first value of a key
# Lambda 2: adds another revenue to the running sum and increments the record counter
# Lambda 3: merges combiners coming from different partitions
totalRevenueAndTotalCount = dateAndRevenueTuple.combineByKey( \
    lambda revenue: (revenue, 1), \
    lambda revenueSumTuple, amount: (revenueSumTuple[0] + amount, revenueSumTuple[1] + 1), \
    lambda tuple1, tuple2: (round(tuple1[0] + tuple2[0], 2), tuple1[1] + tuple2[1]))
for line in totalRevenueAndTotalCount.collect(): print(line)
Step 9: Now calculate the average revenue for each date.
averageRevenuePerDate = totalRevenueAndTotalCount.map(lambda element: (element[0], element[1][0]/element[1][1]))
for line in averageRevenuePerDate.collect(): print(line)
Step 10: Using aggregateByKey.
# Line 1: the zero value (0, 0) initialises both the revenue sum and the record count
# Line 2: seqOp adds a revenue to the running (sum, count) tuple within a partition
# Line 3: combOp sums the (sum, count) tuples coming from different partitions
totalRevenueAndTotalCount = dateAndRevenueTuple.aggregateByKey( \
    (0, 0), \
    lambda runningRevenueSumTuple, revenue: (runningRevenueSumTuple[0] + revenue, runningRevenueSumTuple[1] + 1), \
    lambda tupleOneRevenueAndCount, tupleTwoRevenueAndCount: (tupleOneRevenueAndCount[0] + tupleTwoRevenueAndCount[0], tupleOneRevenueAndCount[1] + tupleTwoRevenueAndCount[1]))
for line in totalRevenueAndTotalCount.collect(): print(line)
Step 11: Calculate the average revenue per date.
averageRevenuePerDate = totalRevenueAndTotalCount.map(lambda element: (element[0], element[1][0]/element[1][1]))
for line in averageRevenuePerDate.collect(): print(line)
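If the three combineByKey functions are hard to follow, the toy example below shows how they cooperate to build a (total, count) pair per key. The input values are made up purely for illustration and are not taken from the retail_db data.
pairs = sc.parallelize([("2014-01-01", 100.0), ("2014-01-01", 50.0), ("2014-01-02", 200.0)])
totals = pairs.combineByKey( \
    lambda v: (v, 1), \
    lambda acc, v: (acc[0] + v, acc[1] + 1), \
    lambda a, b: (a[0] + b[0], a[1] + b[1]))
print(totals.collect())   # e.g. [('2014-01-01', (150.0, 2)), ('2014-01-02', (200.0, 1))], order may vary
avgPerKey = totals.map(lambda kv: (kv[0], kv[1][0] / kv[1][1]))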
Question # 4
Problem Scenario 33: You have been given the files below.
spark5/EmployeeName.csv (id,name)
spark5/EmployeeSalary.csv (id,salary)
Data is given below:
EmployeeName.csv
E01,Lokesh
E02,Bhupesh
E03,Amit
E04,Ratan
E05,Dinesh
E06,Pavan
E07,Tejas
E08,Sheela
E09,Kumar
E10,Venkat
EmployeeSalary.csv
E01,50000
E02,50000
E03,45000
E04,45000
E05,50000
E06,45000
E07,50000
E08,10000
E09,10000
E10,10000
Now write Spark code in Scala which loads these two files from hdfs, joins them, and produces (name, salary) values. Then save the data in multiple files grouped by salary (i.e. each file will contain the names of employees with the same salary). Make sure the file name includes the salary as well.
Answer: See the explanation for Step by Step Solution and configuration.
Explanation:
Solution:
Step 1: Create both files in hdfs (we will do this using Hue). Alternatively, you can first create them in the local filesystem and then upload them to hdfs.
Step 2: Load EmployeeName.csv from hdfs and create a PairRDD.
val name = sc.textFile("spark5/EmployeeName.csv")
val namePairRDD = name.map(x => (x.split(",")(0), x.split(",")(1)))
Step 3: Load EmployeeSalary.csv from hdfs and create a PairRDD.
val salary = sc.textFile("spark5/EmployeeSalary.csv")
val salaryPairRDD = salary.map(x => (x.split(",")(0), x.split(",")(1)))
Step 4: Join the two PairRDDs on id.
val joined = namePairRDD.join(salaryPairRDD)
Step 5: Drop the id key, keeping only the (name, salary) values.
val keyRemoved = joined.values
Step 6: Swap the pairs so that salary becomes the key.
val swapped = keyRemoved.map(item => item.swap)
Step 7: Group by key (this produces each salary together with the collection of matching names).
val grpByKey = swapped.groupByKey().collect()
Step 8: Create an RDD for each values collection.
val rddByKey = grpByKey.map{case (k, v) => k -> sc.makeRDD(v.toSeq)}
Step 9: Save each RDD as a text file whose name includes the salary.
rddByKey.foreach{ case (k, rdd) => rdd.saveAsTextFile("spark5/Employee" + k)}
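With the sample salaries above, this produces directories such as spark5/Employee50000, spark5/Employee45000 and spark5/Employee10000, each containing only the matching employee names. A quick sanity check from the spark shell (a sketch, assuming those output paths):
val check = sc.textFile("spark5/Employee50000")
check.collect().foreach(println)   // expected names: Lokesh, Bhupesh, Dinesh, Tejas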
Question # 5
Problem Scenario 68: You have been given a file as below.
spark75/file1.txt
The file contains some text, as given below:
Apache Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common and should be automatically handled by the framework. The core of Apache Hadoop consists of a storage part known as Hadoop Distributed File System (HDFS) and a processing part called MapReduce. Hadoop splits files into large blocks and distributes them across nodes in a cluster. To process data, Hadoop transfers packaged code for nodes to process in parallel based on the data that needs to be processed. This approach takes advantage of data locality (nodes manipulating the data they have access to) to allow the dataset to be processed faster and more efficiently than it would be in a more conventional supercomputer architecture that relies on a parallel file system where computation and data are distributed via high-speed networking.
For a slightly more complicated task, let's look into splitting up sentences from our documents into word bigrams. A bigram is a pair of successive tokens in some sequence. We will look at building bigrams from the sequences of words in each sentence, and then try to find the most frequently occurring ones.
The first problem is that values in each partition of our initial RDD describe lines from the file rather than sentences. Sentences may be split over multiple lines. The glom() RDD method is used to create a single entry for each document containing the list of all lines; we can then join the lines up, then resplit them into sentences using "." as the separator, using flatMap so that every object in our RDD is now a sentence.
Please build bigrams from the sequences of words in each sentence, and then try to find the most frequently occurring ones.
Answer: See the explanation for Step by Step Solution and configuration.
Explanation:
Solution:
Step 1: Create the file in hdfs (we will do this using Hue). Alternatively, you can first create it in the local filesystem and then upload it to hdfs.
Step 2: The first problem is that values in each partition of our initial RDD describe lines from the file rather than sentences. Sentences may be split over multiple lines. The glom() RDD method is used to create a single entry for each document containing the list of all lines; we can then join the lines up, then resplit them into sentences using "." as the separator, using flatMap so that every object in our RDD is now a sentence.
sentences = sc.textFile("spark75/file1.txt") \
    .glom() \
    .map(lambda x: " ".join(x)) \
    .flatMap(lambda x: x.split("."))
Step 3: Now that we have isolated each sentence, we can split it into a list of words and extract the word bigrams from it. Our new RDD contains tuples with the word bigram (itself a tuple containing the first and second word) as the first value and the number 1 as the second value.
bigrams = sentences.map(lambda x: x.split()) \
    .flatMap(lambda x: [((x[i], x[i+1]), 1) for i in range(0, len(x) - 1)])
Step 4: Finally we can apply the same reduceByKey and sort steps that we used in the wordcount example to count up the bigrams and sort them in order of descending frequency. In reduceByKey the key is not an individual word but a bigram.
freq_bigrams = bigrams.reduceByKey(lambda x, y: x + y) \
    .map(lambda x: (x[1], x[0])) \
    .sortByKey(False)
freq_bigrams.take(10)
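A slightly shorter variant of the final step (a sketch only, equivalent on the same data) skips the key swap and asks for the ten most frequent bigrams directly with takeOrdered:
top10 = bigrams.reduceByKey(lambda x, y: x + y) \
    .takeOrdered(10, key=lambda kv: -kv[1])   # list of ((word1, word2), count) pairs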
Question # 6
Problem Scenario 29: Please accomplish the following exercises using the HDFS command line options.
1. Create a directory in hdfs named hdfs_commands.
2. Create a file in hdfs named data.txt in hdfs_commands.
3. Now copy this data.txt file to the local filesystem; while copying, make sure the file properties (e.g. file permissions) are preserved.
4. Now create a file in a local directory named data_local.txt and move this file to hdfs into the hdfs_commands directory.
5. Create a file data_hdfs.txt in the hdfs_commands directory and copy it to the local filesystem.
6. Create a file in the local filesystem named file1.txt and put it into hdfs.
Answer: See the explanation for Step by Step Solution and configuration.
Explanation:
Solution:
Step 1: Create the directory.
hdfs dfs -mkdir hdfs_commands
Step 2: Create a file in hdfs named data.txt in hdfs_commands.
hdfs dfs -touchz hdfs_commands/data.txt
Step 3: Copy this data.txt file to the local filesystem, preserving file properties such as permissions.
hdfs dfs -copyToLocal -p hdfs_commands/data.txt /home/cloudera/Desktop/HadoopExam
Step 4: Create a file in a local directory named data_local.txt and move it to the hdfs_commands directory in hdfs.
touch /home/cloudera/Desktop/HadoopExam/data_local.txt
hdfs dfs -moveFromLocal /home/cloudera/Desktop/HadoopExam/data_local.txt hdfs_commands/
Step 5: Create a file data_hdfs.txt in the hdfs_commands directory and copy it to the local filesystem.
hdfs dfs -touchz hdfs_commands/data_hdfs.txt
hdfs dfs -get hdfs_commands/data_hdfs.txt /home/cloudera/Desktop/HadoopExam/
Step 6: Create a file in the local filesystem named file1.txt and put it into hdfs.
touch /home/cloudera/Desktop/HadoopExam/file1.txt
hdfs dfs -put /home/cloudera/Desktop/HadoopExam/file1.txt hdfs_commands/
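To confirm the results, a quick listing of both locations can be run (a sketch of the expected check, not part of the original steps):
hdfs dfs -ls hdfs_commands
ls -l /home/cloudera/Desktop/HadoopExam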
Question # 7
Problem Scenario 1: You have been given a MySQL DB with the following details.
user=retail_dba
password=cloudera
database=retail_db
table=retail_db.categories
jdbc URL = jdbc:mysql://quickstart:3306/retail_db
Please accomplish the following activities.
1. Connect to the MySQL DB and check the content of the tables.
2. Copy the "retail_db.categories" table to hdfs, without specifying a directory name.
3. Copy the "retail_db.categories" table to hdfs, in a directory named "categories_target".
4. Copy the "retail_db.categories" table to hdfs, in a warehouse directory named "categories_warehouse".
Answer: See the explanation for Step by Step Solution and configuration.
Explanation:
Solution:
Step 1: Connect to the existing MySQL database.
mysql --user=retail_dba --password=cloudera retail_db
Step 2: Show all the available tables.
show tables;
Step 3: View/count the data of a table in MySQL.
select count(1) from categories;
Step 4: Check the data currently available in the HDFS home directory.
hdfs dfs -ls
Step 5: Import a single table (without specifying a directory).
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table=categories
Note: make sure there is no space before or after the '=' sign. Sqoop uses the MapReduce framework to copy data from the RDBMS to hdfs.
Step 6: Read the data from one of the partitions created by the command above.
hdfs dfs -cat categories/part-m-00000
Step 7: Specify the target directory in the import command (we are using number of mappers = 1; you can change it accordingly).
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table=categories --target-dir=categories_target -m 1
Step 8: Check the content of one of the partition files.
hdfs dfs -cat categories_target/part-m-00000
Step 9: Specify a parent (warehouse) directory so that you can copy more than one table into a single specified directory.
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table=categories --warehouse-dir=categories_warehouse -m 1
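With --warehouse-dir, Sqoop creates a sub-directory per table under the warehouse directory, so the imported data should end up under categories_warehouse/categories (an assumed location based on that behaviour) and can be inspected with:
hdfs dfs -cat categories_warehouse/categories/part-m-00000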
Cloudera CCA175 Exam Dumps
Pass Your CCA Spark and Hadoop Developer Exam in First Attempt With CCA175 Exam Dumps. Real CCA Spark and Hadoop Developer Exam Questions As in the Actual Exam!
— 96 Questions With Valid Answers
— Updation Date : roOeth
— Free CCA175 Updates for 90 Days
— 98% CCA Spark and Hadoop Developer Exam Passing Rate
PDF Only Price 99.99$
19.99$
Buy PDF
- Number 1 Cloudera CCA Spark and Hadoop Developer study material online
- Regular CCA175 dumps updates for free.
- CCA Spark and Hadoop Developer Exam practice exam questions with their answers and explanations.
- Our commitment to your success continues through your exam with 24/7 support.
- Free CCA175 exam dumps updates for 90 days
- 97% more cost effective than traditional training
- CCA Spark and Hadoop Developer Exam Practice test to boost your knowledge
- 100% correct CCA Spark and Hadoop Developer questions answers compiled by senior IT professionals
Cloudera CCA175 Braindumps
Realbraindumps.com is providing CCA Spark and Hadoop Developer CCA175 braindumps which are accurate and of high quality, verified by a team of experts. The Cloudera CCA175 dumps comprise CCA Spark and Hadoop Developer Exam questions and answers, available in printable PDF files and online practice test formats. Our best-recommended and most economical package is the CCA Spark and Hadoop Developer PDF file + test engine discount package, along with 3 months of free updates of CCA175 exam questions. We have compiled the CCA Spark and Hadoop Developer exam dumps question answers PDF file for you so that you can easily prepare for your exam. Our Cloudera braindumps will help you in your exam. Obtaining valuable professional Cloudera CCA Spark and Hadoop Developer certifications with CCA175 exam questions answers will always be beneficial to IT professionals by enhancing their knowledge and boosting their career.
Yes, it really is not as tough as before. Websites like Realbraindumps.com play a significant role in this competitive world by making it possible to pass exams with the help of CCA Spark and Hadoop Developer CCA175 dumps questions. We are here to encourage your ambition and help you in every possible way. Our excellent and incomparable Cloudera CCA Spark and Hadoop Developer Exam questions and answers study material will help you get through your CCA175 certification exam on the first attempt.
Pass Your Exam With Cloudera CCA Spark and Hadoop Developer Dumps. We at Realbraindumps are committed to providing you with CCA Spark and Hadoop Developer Exam braindumps questions and answers online. We recommend you prepare from our study material and boost your knowledge. You can also get a discount on our Cloudera CCA175 dumps: just talk with our support representatives and ask for a special discount on CCA Spark and Hadoop Developer exam braindumps. We have the latest CCA175 exam dumps, with all Cloudera CCA Spark and Hadoop Developer Exam dumps questions written to the highest standards of technical accuracy; they can be instantly downloaded and accessed by candidates once purchased. Practicing the online CCA Spark and Hadoop Developer CCA175 braindumps will help you get fully prepared and familiar with real exam conditions. Free CCA Spark and Hadoop Developer exam braindumps demos are available for your satisfaction before you place a purchase order.
Send us an email if you want to check a Cloudera CCA175 CCA Spark and Hadoop Developer Exam demo before your purchase, and our support team will send it to you by email.
If you don't find your dumps here then you can request what you need and we shall provide it to you.
Bulk Packages
$60
- Get 3 Exams PDF
- Get $33 Discount
- Mention Exam Codes in Payment Description.
Buy 3 Exams PDF
$90
- Get 5 Exams PDF
- Get $65 Discount
- Mention Exam Codes in Payment Description.
Buy 5 Exams PDF
$110
- Get 5 Exams PDF + Test Engine
- Get $105 Discount
- Mention Exam Codes in Payment Description.
Buy 5 Exams PDF + Engine
We are providing Cloudera CCA175 braindumps with practice exam question answers. These will help you to prepare for your CCA Spark and Hadoop Developer Exam. Buy CCA Spark and Hadoop Developer CCA175 dumps and boost your knowledge.