At JW Player, we use Spark to explore new data features and run reports that help drive product decisions and improve algorithms. But doing data analysis at the terabyte level is time-consuming, especially when we have to set up AWS Elastic MapReduce (EMR) clusters manually. Our code often depends on custom libraries or Spark settings that require bootstrapping. Moreover, iterating on changes is cumbersome and adds extra steps to our workflow.
Open sourcing is part of JW’s culture and what makes our player great, so we want to share a workflow tool we have been using to launch Spark jobs on EMR. We call it Spark Steps (code).
Spark Steps allows you to configure your cluster and upload your script and its dependencies via AWS S3. All you need to do is define an S3 bucket.
$ sparksteps report_to_csv.py \
    --s3-bucket $AWS_S3_BUCKET \
    --aws-region us-east-1 \
    --release-label emr-4.7.0 \
    --submit-args="--packages com.databricks:spark-csv_2.10:1.4.0" \
The example above creates a one-node cluster with the default instance type, m4.large, uploads the PySpark script report_to_csv.py to the specified S3 bucket, and copies the file from S3 to the master node. Each operation is defined as an EMR “step” that you can monitor in the EMR Management Console. The final step runs the Spark application with submit args that include the spark-csv package and the app arg “--report-date”.
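To make the copy and run steps more concrete, here is a minimal sketch of how they could be expressed as EMR step definitions. It is not the actual Spark Steps implementation: the helper names (emr_step, build_steps), the "sparksteps" key prefix, the bucket name, and the report date value are all hypothetical, and the sketch assumes command-runner.jar (available on EMR 4.x releases) is used to run commands on the master node. It only builds the step dictionaries and does not call AWS.

```python
import shlex


def emr_step(name, args, action_on_failure="CANCEL_AND_WAIT"):
    """Wrap a command line in the dict shape EMR uses for a single step."""
    return {
        "Name": name,
        "ActionOnFailure": action_on_failure,
        "HadoopJarStep": {
            # Assumption: command-runner.jar executes the command on the master node.
            "Jar": "command-runner.jar",
            "Args": list(args),
        },
    }


def build_steps(script, bucket, submit_args="", app_args="",
                prefix="sparksteps", remote_dir="/home/hadoop"):
    """Return the S3 URI the script would be uploaded to, plus copy and run steps.

    The prefix and remote_dir defaults are illustrative, not Spark Steps' own.
    """
    name = script.rsplit("/", 1)[-1]
    s3_uri = "s3://{}/{}/{}".format(bucket, prefix, name)
    remote_path = "{}/{}".format(remote_dir, name)

    # Step 1: copy the uploaded script from S3 onto the master node.
    copy = emr_step("Copy " + name, ["aws", "s3", "cp", s3_uri, remote_path])

    # Step 2: run spark-submit with the submit args, the script, then the app args.
    run = emr_step(
        "Run " + name,
        ["spark-submit"] + shlex.split(submit_args)
        + [remote_path] + shlex.split(app_args),
    )
    return s3_uri, [copy, run]


s3_uri, steps = build_steps(
    "report_to_csv.py",
    bucket="my-bucket",  # hypothetical bucket name
    submit_args="--packages com.databricks:spark-csv_2.10:1.4.0",
    app_args="--report-date 2016-01-01",  # hypothetical date value
)
for step in steps:
    print(step["Name"], "->", " ".join(step["HadoopJarStep"]["Args"]))
```

A list shaped like this could be passed as the Steps argument to boto3's run_job_flow when creating a cluster, or to add_job_flow_steps for an existing one, which is why each operation shows up as a separate entry in the EMR Management Console.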
For more complicated examples such as uploading custom directories or using a virtual private cloud, check out the README.
There are plenty of improvements that can be made, such as dynamic spot pricing, bootstrapping, and monitoring. Our goal is to share Spark Steps as early as possible so that we can improve upon it together.