Word embeddings are a widely used tool to analyze language. Exponential family embeddings generalize the technique to other types of data by modeling the conditional probability of a target observation (a word or an item) given the elements in its context (other words or items). One challenge in fitting embedding methods is sparse data, such as a document/term matrix that contains many zeros. We develop zero-inflated embeddings to address this issue. In a zero-inflated embedding (ZIE), a zero in the data can come from an interaction with other data (i.e., an embedding) or from a separate process by which many observations are equal to zero (i.e., a probability mass at zero). Fitting a ZIE naturally downweights the zeros and dampens their influence on the model. Another challenge is to choose the right context for a target observation. We argue that it is not optimal for an embedding model to include all context elements in the conditional probability. We improve the predictions and the quality of the embedding representations by modeling the probability of the target conditioned on a subset of the elements in the context. We develop a model that can account for this, and use amortized variational inference to automatically choose this subset.
Liping Liu (https://www.eecs.tufts.edu/~liulp/) is an assistant professor at Tufts University. He earned his doctorate at Oregon State University and is interested in probabilistic modeling, classification, and clustering within machine learning. He also applies these machine learning techniques to ecology studies. Liu previously held a position as a postdoctoral associate at Columbia University, working with David Blei on aspects of machine learning, and worked on commercial data analysis at IBM T.J. Watson Research. He is a reviewer for several machine learning conferences and journals.