Over half a billion videos are watched through the JW Player video player every day, producing about 7 billion events and approximately 1.5 to 2 terabytes of compressed data daily. We, the data team here at JW Player, have built various batch and real-time pipelines to crunch this data in order to provide analytics to our customers. For more details about our infrastructure, see JW at Scale and Fast String Matching. In this post, I am going to discuss how we got Hive with Tez running in our batch processing pipelines.
All of our pipelines run on AWS, and a significant portion of our daily batch pipeline code is written in Hive. These pipelines run for 1 to 10 hours every day to clean and then aggregate this data, and we have been looking for ways to optimize them.