Probabilistic Programming Possibilities - From the Diaries of John Henry

By Nicholas Teague

At the heart of the Edward library is a foundation built from stones of TensorFlow Distributions. Put simply, the Distributions library is a tool for modeling distinct classes of distributions and generating samples based on their properties. Since this is a TensorFlow library it inherits the built-in support for things like GPU integration and distributed operations, but by itself it doesn't provide Edward capabilities such as inference or integration with neural networks. There are two key components in the library: 'Distributions' for defining distributions, and 'Bijectors' for applying transformations to distributions. I'll start this brief dialogue on the library with a quick illustration of the shape semantics associated with generated samples, as understanding the mechanics of sampling goes a long way toward intuiting how Distributions operates as a key mechanism behind Edward.
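To make that concrete, here is a minimal sketch of the basic Distributions workflow, define a distribution object and then draw samples from it, using TensorFlow Probability's tfp.distributions module. The particular distribution families and parameter values below are illustrative assumptions on my part, not taken from the Edward examples above.

```python
import tensorflow_probability as tfp

tfd = tfp.distributions

# One continuous and one discrete distribution class, each defined by its own parameters.
normal = tfd.Normal(loc=0., scale=1.)
bernoulli = tfd.Bernoulli(probs=0.3)

x = normal.sample(10)       # ten Monte Carlo draws from the Normal
log_p = normal.log_prob(x)  # log density evaluated at those draws
flips = bernoulli.sample(10)  # ten draws of 0s and 1s from the Bernoulli
```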

A sample consists of a series of Monte Carlo draws, where individual events are aggregated into batches and batches are aggregated into a sample. The shape parameters shown here of n / b / s are not necessarily scalars, they represent shapes after all, so each can also be a list of scalars with each entry giving the number of dimensions along some axis (similar to how we initialized the latent vectors in the VAE demonstration with the list [N,d] to give you an idea). The event_shape describes a single draw from one distribution, and the components within an event are allowed to be dependent on one another (think of a multivariate normal with correlated dimensions). In the example of the VAE demonstration, one event would be a drawn 28x28 pixel image, so I believe the event_shape for sampling from the Xn distribution would be [28*28], while a single latent code drawn from the Zn distribution would have event_shape [d]. The batch_shape describes a collection of distributions from the same family that are independent but not identically distributed, since each carries its own parameters; in the VAE I believe the N images of a minibatch each getting their own approximate posterior over z corresponds to a batch_shape of [N]. And then the sample_shape just speaks to how many identically distributed Monte Carlo draws we take from that entire batch of distributions. So when we say that sample_shape draws are identically distributed while batch_shape draws are not, the distinction is that sample_shape repeats draws from the very same distributions, whereas batch_shape indexes distributions with different parameters. Moving on.
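A small sketch may help make these three shapes tangible. This is not tied to the VAE example; the Normal and MultivariateNormalDiag parameters below are just illustrative assumptions to show how sample_shape, batch_shape, and event_shape compose in tfp.distributions.

```python
import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions

# batch_shape [3]: three independent, non-identically distributed Normals,
# each of which is a scalar distribution (event_shape []).
normals = tfd.Normal(loc=[0., 1., 2.], scale=[1., 1., 1.])
print(normals.batch_shape)  # (3,)
print(normals.event_shape)  # ()

# sample_shape [5]: five i.i.d. Monte Carlo draws from the whole batch,
# giving a tensor of shape [sample_shape, batch_shape, event_shape] = [5, 3].
draws = normals.sample(5)
print(draws.shape)  # (5, 3)

# event_shape [4]: a single draw is a 4-dimensional vector belonging to one
# event (whose components in general may be dependent), here in a batch of 2.
mvn = tfd.MultivariateNormalDiag(loc=tf.zeros([2, 4]))
print(mvn.batch_shape)      # (2,)
print(mvn.event_shape)      # (4,)
print(mvn.sample(5).shape)  # (5, 2, 4)
```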

As noted in the intro to this section, the TensorFlow Distributions library has two primary components: the Distributions classes for modeling families of distributions, and the Bijectors for applying transformations to distributions. For Distributions there is a library of distribution classes to choose from, each with its own parameters and characteristics (such as the difference between continuous and discrete distributions demonstrated earlier in this paper). The Bijectors library I find very interesting, even if it's not obvious to me exactly how it ties into the Edward library; that plumbing may all be under the hood. Since this might be slightly less intuitive, I'll offer a little clarification. Applying a transformation to a distribution is loosely analogous to applying a z-score normalization to a single variable in tabular data preprocessing. But where a z-score normalization only changes the scaling of returned values by way of an offset and a multiplier, a bijector can add more sophistication: because it is an invertible, differentiable transform, it carries both the sampled values and the underlying density through the change of variables, which can reshape the whole curve of probabilities rather than just shift and rescale it. Of course it is important to keep in mind that not every pair of distributions can be connected by such a transform. The bijector itself is invertible and so loses no information; a Normal with unbounded support can, for instance, be pushed through an exponential transform to yield a log-normal whose left tail is bounded at zero. A continuous distribution, however, cannot be transformed into a discrete distribution this way. Most importantly, and I expect an easy 'gotcha' here, transforming a distribution relies on the assumption that the properties of the source distribution are sufficiently understood, and I believe there are some distributions, such as fat tailed distributions, where estimating parameters can require orders of magnitude more data than others.
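Here is a minimal sketch of that idea using tfp.bijectors and TransformedDistribution. The choice of the Exp bijector and the particular shift and scale values are my own illustrative assumptions, meant only to contrast a density-reshaping transform with a simple z-score style affine one.

```python
import tensorflow_probability as tfp

tfd = tfp.distributions
tfb = tfp.bijectors

# Start from a standard Normal with unbounded support on the real line.
base = tfd.Normal(loc=0., scale=1.)

# Pushing the Normal through the (invertible, differentiable) Exp bijector
# yields a log-normal supported on the positive reals: the whole probability
# curve is reshaped, not merely shifted and rescaled.
log_normal = tfd.TransformedDistribution(distribution=base, bijector=tfb.Exp())

samples = log_normal.sample(4)           # draws are strictly positive
densities = log_normal.log_prob(samples)  # density via the change of variables formula

# The z-score style case is the simple end of the same mechanism: an affine
# bijector (scale by 2, then shift by 3) gives a Normal with mean 3 and scale 2.
shifted = tfd.TransformedDistribution(
    distribution=base,
    bijector=tfb.Chain([tfb.Shift(3.), tfb.Scale(2.)]))
```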

I'll briefly offer in conclusion that this exploration into the Edward and TensorFlow Distributions libraries was a very enjoyable and worthwhile experience, and I hope you the reader got some value out of this review. I know this was challenging subject matter; the goal was to provide a plain language treatment, at a high level of detail, of mechanics that took me several days of reviewing papers to understand. If you enjoyed or got some value out of this review, I hope you might consider checking out Automunge, an open source library built as a platform for feature engineering and automated data preparation of tabular data for machine learning. It's very useful, and I'm very hungry for feedback from users who might help me identify points that are well done and/or need clarification. Wishing the best to the impressive probabilistic programming facilitators of the Edward and TensorFlow libraries. I expect industry will continue to find new high-value use-cases for these tools as a timeless foundation for the intersections between neural networks and probability. Cheers.