August 11, 2022

Which day had the largest number of installed drives, and what was this number?

Hey there, all the requirements in here.

Answer these questions using Spark code. Submit your code (in a py file) and the answers to the questions (in a text file). The answers should use the full dataset, not the small dataset. Start with the code shown below. (Hint: for any tasks that say max/largest, don’t use sortByKey, because that’s much slower than a better option.)

Which day had the largest number of installed drives, and what was this number?
How many distinct drives (by model+serial) are installed (i.e., that exist in the data) in each year?
What’s the max drive capacity per year?

Full dataset: change the file path to: file:///ssd/data/backblaze.csv (146 million rows) – my solution took 17min

Run spark like this: spark-submit backblaze-spark.py –master=local[5]

Or to hide log messages: spark-submit backblaze-spark.py –master=local[5] 2> /dev/null

Starting code with some examples that you can remove:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Backblaze").getOrCreate()

schema = "day DATE, serial STRING, model STRING, capacity LONG, failure INTEGER"
d = spark.read.schema(schema).load("file:///home/jeckroth/cinf201/spark/assignment/small-backblaze.csv", format="csv", sep=",", header="true")
d = d.rdd

# print first 10 rows
print(d.take(10))

## How many failures occurred each year?

# make key (year) & value (failure 0/1)
d2 = d.map(lambda row: (row.day.year, row.failure))
# add up failures per year
failureCounts = d2.reduceByKey(lambda cnt, rowcnt: cnt + rowcnt)
print(failureCounts.collect())

## Which model (not serial number) has the most failures overall?

# grab model & failure from data, model is the key
d3 = d.map(lambda row: (row.model, row.failure))
# count failures for that model; result so far: [(modelX, 55), (modelY, 2100)]
d3 = d3.reduceByKey(lambda cnt, rowcnt: cnt + rowcnt)
# flip keys and values; result so far: [(55, modelX), (2100, modelY)]
d3 = d3.map(lambda pair: (pair[1], pair[0]))
# sort by value (second in the pair)
d3 = d3.sortByKey(ascending=False) ### NOT EFFICIENT TECHNIQUE
print(d3.collect())

Collepals.com Plagiarism Free Papers

Are you looking for custom essay writing service or even dissertation writing services? Just request for our write my paper service, and we'll match you with the best essay writer in your subject! With an exceptional team of professional academic experts in a wide range of subjects, we can guarantee you an unrivaled quality of custom-written papers.

Get ZERO PLAGIARISM, HUMAN WRITTEN ESSAYS

Why Hire Collepals.com writers to do your paper?

Quality- We are experienced and have access to ample research materials.

We write plagiarism Free Content

Confidential- We never share or sell your personal information to third parties.

Support-Chat with us today! We are always waiting to answer all your questions.

Which day had the largest number of installed drives, and what was this number?

Related Posts

Identify properties characteristic of alkenes, cycloalkenes, alkynes, and aromatics

Write a college-level learning narrative about two specific skills or concepts from this course that are useful to a current or future employer. Rese

You are working as an IT security professional for an organization (called Website 101) that has 300 employees and one large corporate office with th