An Easier Way to Run Spark Jobs on AWS EMR

Blog 1 min read | Jun 16, 2016 | JW Player

At JW Player, we use Spark to explore new data features and run reports that help drive product decisions and improve algorithms. But data analysis at the terabyte level is time-consuming, especially when AWS Elastic MapReduce (EMR) clusters have to be set up manually. Our code often depends on custom libraries or Spark settings that require bootstrapping. Moreover, iterating on changes is cumbersome and adds extra steps to our workflow.

Open sourcing is part of JW’s culture and what makes our player great. So we want to share a workflow tool we have been using to launch Spark jobs on EMR. We call it Spark Steps (code).

Spark Steps allows you to configure your cluster and upload your script and its dependencies via AWS S3. All you need to do is define an S3 bucket.

Example

$ AWS_S3_BUCKET=

$ sparksteps report_to_csv.py \
    --s3-bucket $AWS_S3_BUCKET \
    --aws-region us-east-1 \
    --release-label emr-4.7.0 \
    --submit-args="--packages com.databricks:spark-csv_2.10:1.4.0" \
    --app-args="--report-date 2016-05-10"

The above example creates a one-node cluster with the default instance type, m4.large, uploads the PySpark script report_to_csv.py to the specified S3 bucket, and copies the file from S3 to the master node. Each operation is defined as an EMR "step" that you can monitor in the EMR Management Console. The final step runs the Spark application with submit args that include the spark-csv package and the app arg "--report-date".
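To make the idea of an EMR "step" concrete, here is a minimal sketch of what the copy and run steps for the example could look like as plain Python dictionaries, in the shape that boto3's run_job_flow accepts for its Steps parameter. This is not Spark Steps' actual implementation: the build_steps helper, the S3 key prefix, the bucket name my-bucket, and the /home/hadoop destination path are all assumptions made for illustration.

```python
import shlex


def build_steps(script, bucket, submit_args="", app_args="", prefix="sparksteps"):
    """Return two EMR step definitions: copy the script from S3, then run it.

    Hypothetical helper for illustration only; not taken from Spark Steps.
    """
    # Assumed S3 key layout and destination path on the master node.
    s3_uri = "s3://{}/{}/{}".format(bucket, prefix, script)
    local_path = "/home/hadoop/" + script

    def step(name, args):
        # command-runner.jar runs an arbitrary command on the master node
        # (available on EMR release 4.x and later).
        return {
            "Name": name,
            "ActionOnFailure": "CANCEL_AND_WAIT",
            "HadoopJarStep": {"Jar": "command-runner.jar", "Args": args},
        }

    copy = step("Copy " + script, ["aws", "s3", "cp", s3_uri, local_path])
    run = step(
        "Run " + script,
        ["spark-submit"]
        + shlex.split(submit_args)   # options for spark-submit itself
        + [local_path]
        + shlex.split(app_args),     # options passed through to the script
    )
    return [copy, run]


steps = build_steps(
    "report_to_csv.py",
    "my-bucket",  # placeholder bucket name
    submit_args="--packages com.databricks:spark-csv_2.10:1.4.0",
    app_args="--report-date 2016-05-10",
)
for s in steps:
    print(s["Name"], "->", " ".join(s["HadoopJarStep"]["Args"]))
```

Note how the submit args land before the script path and the app args after it, which mirrors the split between --submit-args and --app-args on the command line.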

For more complicated examples such as uploading custom directories or using a virtual private cloud, check out the README.

There are plenty of improvements that can be made, such as dynamic spot pricing, bootstrapping, and monitoring. Our goal is to share Spark Steps as early as possible so that we can improve upon it together.

Happy Sparking!
