Vectorizing Videos: Leveraging DeepWalk for Video Recommendations


One of the core services JW Player offers to video publishers is that of turn-key recommendations that can drive higher engagement, watch time, and retention among their viewers. For the thousands of publishers using this service, this translates directly to increased advertising dollars and is thus a major focus for algorithmic improvement on the part of our data science team.

As we don’t receive explicit feedback from viewers as to whether they enjoyed a piece of content or the degree to which they enjoyed it, our work relies on building implicit signals and using them to associate users and/or media to each other. Typically we infer associations between media when they have been co-watched – i.e. watched by the same viewer. A sensible next step is to use something like Association Rule Mining to translate such a signal into recommendations of the “people who liked X also liked Y” variety, or to use a Collaborative Filtering approach with latent representations for viewers and videos to generate personalized recommendations.

For a number of reasons, though, our team is shifting towards a deep learning framework for our recommendations. Doing so allows us to seamlessly integrate parameters beyond viewing behavior into our recommendation engines, such as video and user metadata, and it provides a flexible platform that can adapt to the wide range of publishers we serve. This shift also gives us an opportunity to overcome a drawback of many recommender algorithms: learning asymmetric associations between content (e.g. in the case of episodic videos), all while updating representations in an online fashion in near real-time.

To this end, we turned to DeepWalk, an algorithm developed by Bryan Perozzi et al. that achieves the above goals by learning a vector representation of each node in a graph. The nodes in our case are individual videos, with edges between them weighted by co-watch frequency and recency. There is no restriction that the graph be symmetric, however, and in fact we preserve the sequential nature of co-watch behavior by making edges directional. In other words, if a viewer watched media A followed by media B, then we will add a directional weight from node A to node B, but not the other way around.
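To make the graph construction concrete, here is a minimal sketch of building a directed co-watch graph as a sparse matrix with scipy. The session data and video indices are purely illustrative, not our production schema:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Each session is an ordered list of video indices watched by one viewer
# (toy data for illustration).
sessions = [[0, 1, 2], [0, 2], [1, 2, 0]]
n_videos = 3

rows, cols = [], []
for session in sessions:
    # A watch of video A followed by video B adds weight A -> B only,
    # preserving the sequential (asymmetric) nature of co-watches.
    for a, b in zip(session, session[1:]):
        rows.append(a)
        cols.append(b)

data = np.ones(len(rows))
# csr_matrix sums duplicate (row, col) entries, so repeated co-watches
# accumulate into larger edge weights.
graph = csr_matrix((data, (rows, cols)), shape=(n_videos, n_videos))
```

Recency could then be handled by multiplying `graph` by a decay factor before folding in each new batch of sessions, so older co-watches contribute less over time.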

DeepWalk learns these representations by performing random walks from each node with a predetermined number of steps. It then cleverly treats each walk as a “sentence” that can be fed into a word representation algorithm like Word2Vec, developed by Mikolov et al. If you’re unfamiliar with Word2Vec, check out this article by Chris Moody from Stitch Fix.

The DeepWalk authors have a well-written Python implementation out there, but unfortunately it’s a bit dated (2014) and not easy to use for online learning. To help, we developed a Cython-based implementation of DeepWalk with the following features:

  • The association graph is represented as a sparse matrix for memory efficiency. Using a sparse matrix has two advantages: i) we can continually update the weights based on some time decay and ii) we can use memory views to do random walks in Cython, which is orders of magnitude faster than in Python.
  • The model can be trained online, largely thanks to gensim’s recent release that allows online learning with Word2Vec.

We are open sourcing our implementation as jwalk on GitHub. Our initial goal was to replicate DeepWalk; eventually, we would like to add other hyperparameters that control how walks occur, in a similar vein to Node2Vec by Aditya Grover et al. Contributions are always welcome.