# On the Compression of Unknown Sources

Date: 2018-09-11

Time: 1:00pm

Location: Holmes Hall 389

Speaker: Maryam Hosseini, PhD Candidate

Abstract:

Data is usually modeled as the output of a probabilistic source, and to gain insight from the data we need to know more about its underlying distribution. Although a precise characterization of the source is very helpful, such information is typically unavailable. We usually know, however, that the underlying distribution is not completely arbitrary and belongs to a general class of models, such as the class of i.i.d. or Markov distributions. While universal compression of finite-support distributions has been well studied, we look into more involved classes---in particular, distributions over countable supports, as well as the relation between compression and estimation in Markov setups without mixing assumptions.

In the first part of the dissertation, we investigate the "compressibility" of a class of distributions. The exact identity of each distribution in the class is unknown, so we aim to find a universal encoder that compresses all distributions in the class. Since the universal encoder does not match the underlying distribution exactly, the average number of bits it uses is higher than the entropy; the excess bits over the entropy are the redundancy. We study the redundancy of universal encodings of strings generated by independent and identically distributed (i.i.d.) sampling from a set of distributions over a countable support. We show that universal compression of length-n i.i.d. sequences is characterized by how well the tails of the distributions in the collection can be universally described, and we formalize the latter as the tail-redundancy of the collection. We also consider the redundancy of universally compressing strings generated by a binary Markov source without any bound on the memory, and we prove asymptotically matching (in order) upper and lower bounds on the redundancy. Apart from this abstract analysis of collections of unknown distributions, we design and implement an algorithm to compress data obtained from an unknown source.
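The notion of redundancy described above---the gap between the average code length under a mismatched encoder and the source entropy---can be illustrated with a minimal sketch. The distributions `p` and `q` below are hypothetical examples, not drawn from the dissertation; for an idealized code designed for `q`, the per-symbol redundancy against a true source `p` works out to the Kullback-Leibler divergence KL(p || q):

```python
import math

def entropy(p):
    """Shannon entropy of a distribution (dict symbol -> probability), in bits."""
    return -sum(pi * math.log2(pi) for pi in p.values() if pi > 0)

def expected_code_length(p, q):
    """Average bits per symbol when source p is coded with an idealized code
    designed for q, i.e. symbol x costs -log2 q[x] bits."""
    return -sum(p[x] * math.log2(q[x]) for x in p)

# Hypothetical example: true source p, mismatched coding distribution q.
p = {"a": 0.8, "b": 0.2}
q = {"a": 0.5, "b": 0.5}

redundancy = expected_code_length(p, q) - entropy(p)
# This equals the KL divergence KL(p || q), about 0.278 bits per symbol here.
```

In universal compression the encoder must be fixed before `p` is known, and the goal is to keep this excess small simultaneously over every distribution in the class.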
Compression can be lossless or lossy. For lossless compression, Lempel and Ziv proposed an implementable universal algorithm and proved that it achieves the theoretical bound asymptotically. However, many applications can tolerate some amount of distortion, which may allow for additional compression. We adapt codelet parsing, a lossy Lempel-Ziv-type algorithm: it sequentially parses a sequence into phrases, which we call sourcelets, and maps each sourcelet to a codelet in a dictionary. We develop the concept of a "strong match" and use the Cycle Lemma to show that the strong-match method does not remove the optimal leaves from the tree. We study various properties of this dictionary and track how its leaves evolve.
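For reference, the lossless Lempel-Ziv idea that codelet parsing builds on can be sketched as an LZ78-style incremental parse: each phrase is the longest previously seen phrase plus one new symbol, and the dictionary of phrases grows as the sequence is read. This is a standard illustration, not the author's lossy codelet-parsing algorithm:

```python
def lz78_parse(s):
    """LZ78-style parsing sketch: split s into phrases, each encoded as
    (index of its longest previously seen prefix phrase, next symbol)."""
    dictionary = {"": 0}        # phrase -> index; index 0 is the empty phrase
    phrases = []
    current = ""
    for ch in s:
        if current + ch in dictionary:
            current += ch       # extend the match with the next symbol
        else:
            phrases.append((dictionary[current], ch))
            dictionary[current + ch] = len(dictionary)
            current = ""
    if current:                 # leftover suffix already in the dictionary
        phrases.append((dictionary[current], ""))
    return phrases

# Example: "aababcabcd" parses into phrases a, ab, abc, abcd.
print(lz78_parse("aababcabcd"))  # → [(0, 'a'), (1, 'b'), (2, 'c'), (3, 'd')]
```

A lossy variant relaxes the matching step: a sourcelet need not equal a dictionary codelet exactly, only approximate it within the tolerated distortion, which is what permits compressing below the lossless bound.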
