The Data Scientist’s Guide to Topological Data Analysis: Preamble

by Justin Skycak

Bridging the communication gap between academia and industry in the field of TDA.

This post is part of the series The Data Scientist's Guide to Topological Data Analysis.


Topological Data Analysis, abbreviated TDA, is a suite of data analytic methods inspired by the mathematical field of algebraic topology. TDA is attractive yet elusive for most data scientists, since its potential as a data exploration tool is often communicated through esoteric terminology unfamiliar to non-mathematicians. The purpose of this guide is to bridge the communication gap between academia and industry, so that non-mathematician data scientists may add current TDA methods to their analytic toolkits and anticipate new developments in the field of TDA.

The guide begins with an overview of Mapper, a TDA algorithm that has recently transitioned from academia to industry with commercial success. We explain the Mapper algorithm, demo open-source software, and present a handful of its commercial use-cases (some of which are original). Then, we switch to persistent homology, a TDA method that has not yet broken through to industry but is supported by a growing body of academic work. We explain the intuition behind homotopy, approximation, homology, and persistence, and demo open-source persistent homology software. We hope that data scientists reading this guide will be inspired to try Mapper in their own analytic work, and to watch for developments in persistent homology that push it from academia to industry.

Mapper

  1. Algorithm. The Mapper algorithm maps high-dimensional data into smaller networks that retain the main topological features of the data and are easy to visualize.
  2. Software. To run the Mapper algorithm on small to medium-size datasets, one can use the open-source R package TDAmapper.
  3. Use-Cases at Ayasdi. On a larger scale, Mapper has been used commercially by the company Ayasdi to forecast returns, detect fraud, aid in oil and gas exploration, plan ad campaigns, and discover biomarkers.
  4. Use-Cases at Aunalytics. At Aunalytics, Mapper (via R's TDAmapper) provided granular insights into a location-tracking dataset, and still revealed insights in a sparse call-center dataset even though the resulting network had little cohesion.
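
To make item 1 concrete, here is a minimal sketch of the Mapper idea in plain Python. It is an illustration under stated assumptions, not the TDAmapper implementation: the function names (`cover_intervals`, `cluster`, `mapper`), the parameter values, the use of the x-coordinate as the filter, and the simple distance-threshold clustering are all choices made for this example.

```python
import math
from itertools import combinations

def cover_intervals(lo, hi, n_intervals, overlap):
    """Cover [lo, hi] with equal-length intervals that overlap by the given fraction."""
    length = (hi - lo) / (n_intervals - (n_intervals - 1) * overlap)
    step = length * (1 - overlap)
    intervals = [(lo + i * step, lo + i * step + length) for i in range(n_intervals)]
    intervals[-1] = (intervals[-1][0], hi)  # guard against floating-point drift at the top end
    return intervals

def cluster(indices, points, eps):
    """Group points into connected components of the 'closer than eps' graph."""
    remaining = set(indices)
    clusters = []
    while remaining:
        seed = remaining.pop()
        component, frontier = {seed}, [seed]
        while frontier:
            i = frontier.pop()
            near = {j for j in remaining if math.dist(points[i], points[j]) < eps}
            remaining -= near
            component |= near
            frontier.extend(near)
        clusters.append(frozenset(component))
    return clusters

def mapper(points, filter_fn, n_intervals=4, overlap=0.3, eps=0.4):
    """Hypothetical helper: returns (nodes, edges) of a Mapper-style network."""
    values = [filter_fn(p) for p in points]
    nodes = []
    # Cluster the points whose filter value falls in each overlapping interval.
    for a, b in cover_intervals(min(values), max(values), n_intervals, overlap):
        members = [i for i, v in enumerate(values) if a <= v <= b]
        nodes.extend(cluster(members, points, eps))
    # Two clusters are linked whenever they share at least one data point.
    edges = {(u, v) for u, v in combinations(range(len(nodes)), 2) if nodes[u] & nodes[v]}
    return nodes, edges

# Toy data (an assumption for this sketch): 40 evenly spaced points on a unit circle.
circle = [(math.cos(2 * math.pi * k / 40), math.sin(2 * math.pi * k / 40)) for k in range(40)]
nodes, edges = mapper(circle, filter_fn=lambda p: p[0])
print(len(nodes), "nodes,", len(edges), "edges")
```

For this evenly sampled circle, the network should itself close up into a single loop, which is the sense in which Mapper retains the main topological features of the data. Different filters, covers, or clustering thresholds can give different networks, so these settings are worth varying in practice.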

Persistent Homology

  1. Homotopy. Algebraic topology aims to describe the connectivity of an arbitrary space. One way it does this is through homotopy, which classifies the "loops" in each dimension up to continuous deformation.
  2. Approximation. In computational topology, datasets can be interpreted as samples taken from an underlying topological space, and for any given margin of error a topological space can be constructed from the samples to approximate the underlying space.
  3. Homology. Homotopy groups are extremely difficult to compute in high dimensions. Homology is a related concept that can be easier to compute.
  4. Persistence. Persistence barcode plots show which topological features persist through many scales of the data, and can be used to calculate similarity between different spaces.
  5. Software. To compute persistent homology of small to medium-size datasets, one can use the open-source R package TDA.
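
As a small illustration of persistence (item 4), the sketch below computes a barcode for the simplest kind of feature, connected components, in plain Python. It is not the R TDA package's API: the function name `h0_barcode` and the toy dataset are assumptions for this example, and the scale at which two components merge is taken to be the distance between points (some conventions use half that distance).

```python
import math
from itertools import combinations

def h0_barcode(points):
    """Hypothetical helper: (birth, death) bars for connected components as the scale grows."""
    parent = list(range(len(points)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Process pairs of points from closest to farthest.
    edges = sorted((math.dist(points[i], points[j]), i, j)
                   for i, j in combinations(range(len(points)), 2))
    bars = []
    for dist, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            bars.append((0.0, dist))  # one component dies when two components merge
    bars.append((0.0, math.inf))      # the last remaining component never dies
    return bars

# Toy data (an assumption for this sketch): two tight groups of points, far apart.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 0.0), (11.0, 0.0), (10.0, 1.0)]
bars = h0_barcode(points)
for birth, death in bars:
    print(birth, death)
```

Short bars correspond to points merging within a group, while the one long finite bar reflects the gap between the two groups: a feature that persists across many scales. Loops and higher-dimensional features require more machinery than this sketch covers.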

