Modern algorithms for parallel, streaming and query-based data processing

Курс

Партнёр:

Computer Science клуб при ПОМИ РАН

Предмет:

Computer Science

Лектор:

In this mini-course we will cover recent developments in approximate algorithms for processing very large data sets. We will focus on fundamental techniques that can be used in a variety of different computational settings, including MapReduce-like batch parallel computation, algorithms for streaming data and data that can be accessed through queries. The list of topics will include:

— Linear sketching: a powerful paradigm for compressing data that is based on carefully chosen random linear projections. Linear sketching allows to construct approximate arrays, estimate the size of cuts in graphs, compute approximate matchings, reduce dimension of vectors, solve linear regression and other linear-algebraic problems, etc.

— Core-sets: a baseline approach to problems involving geometric data core-sets allow to solve a variety of clustering and related problems. Core-sets are constructed by representing a dataset as a carefully selected subset of its points.

— Beyond linear sketches and core-sets: we will see how to design algorithms for massive data processing in situations when linear sketches and core-sets fall short of achieving optimal performance.

— Query-based sampling methods: we will see how queries can be used to design approximate decision-making algorithms for testing properties of high-dimensional and noisy data.