# Correlation Detective

Correlation Detective is a Java library for fast and scalable multivariate correlation analysis.

## What is CD? (2 min video)

#### Table of contents

- How to install
- What is a multivariate correlation?
- Why are they relevant?
- Why Correlation Detective?
- Demo

## How to install

### Option 1: Install using Maven

Add the following dependency to your pom.xml file:

`<dependency> <groupId>io.github.correlationdetective</groupId> <artifactId>CorrelationDetective</artifactId> <version>1.0</version> </dependency>`

### Option 2: Clone the Correlation Detective GitHub repository

Clone the Correlation Detective repository to your local machine:

`git clone https://github.com/CorrelationDetective/library.git`

Navigate to the project directory:

`cd library`

Build the project using Maven:

`mvn clean install`

## What is a multivariate correlation (MC)?

Strictly speaking, a multivariate correlation is any statistical relationship (whether causal or not) between three or more random variables (e.g. time-series/vectors). This concept is different from the more commonly used *bivariate/pairwise correlation* (Wiki) which only considers two variables.

Many multivariate correlation metrics measure such relationships, the most straightforward being *Multi-Pearson*. This metric measures the *Pearson correlation coefficient* (Wiki) between (element-wise) aggregations of two sets of vectors.

**In words**, this amounts to asking, for example, *“How dependent is the stock price of BMW on the average stock price of Apple and Microsoft?”* (multivariate), whereas bivariate correlations only consider the prices of single stocks. This example is visualized in the figure below. Note that while this multivariate correlation is high, the pairwise correlations between these stocks are low. This shows that multivariate correlations are not a trivial extension of bivariate correlations: they can express strong latent relationships that pairwise analysis does not reveal.

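**Mathematically**, if we want to derive the multivariate correlation coefficient \(\rho\) between vector sets \(\{A,B\}\) and \(\{C\}\) using averaging as an aggregation method, this boils down to computing the Pearson correlation between the element-wise average of \(A\) and \(B\) and the vector \(C\):

\[\rho\left(\frac{A+B}{2},\, C\right)\]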

*Visual representation of bivariate and multivariate correlations over stock price data*

Even though we use averaging as the aggregation method in these examples, Multi-Pearson supports effectively every element-wise aggregation method, such as MAX, MIN, SUM, XOR, etc. This is important because the relevance of an aggregation method depends on the application domain.
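As a minimal sketch of the definition above, Multi-Pearson between \(\{A,B\}\) and \(\{C\}\) can be computed by aggregating \(A\) and \(B\) element-wise and then taking the ordinary Pearson coefficient against \(C\). This is an illustration only and not the Correlation Detective API: the class and method names (`MultiPearsonSketch`, `pearson`, `aggregate`, `multiPearson`) are hypothetical. The aggregation is passed in as a function, so averaging, MAX, MIN, or SUM can be swapped without touching the rest.

```java
import java.util.function.DoubleBinaryOperator;

/** Illustrative sketch only; these names are NOT part of the Correlation Detective API. */
final class MultiPearsonSketch {

    /** Pearson correlation coefficient of two equal-length, non-empty vectors. */
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length == 0) {
            throw new IllegalArgumentException("vectors must be non-empty and of equal length");
        }
        int n = x.length;
        double meanX = 0.0, meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        double cov = 0.0, varX = 0.0, varY = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        // Yields NaN if either vector is constant (zero variance).
        return cov / Math.sqrt(varX * varY);
    }

    /** Element-wise aggregation of two vectors with a caller-supplied operator. */
    static double[] aggregate(double[] a, double[] b, DoubleBinaryOperator op) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must be of equal length");
        }
        double[] out = new double[a.length];
        for (int i = 0; i < a.length; i++) {
            out[i] = op.applyAsDouble(a[i], b[i]);
        }
        return out;
    }

    /** Multi-Pearson between {a, b} and {c}: Pearson of the aggregated (a, b) against c. */
    static double multiPearson(double[] a, double[] b, double[] c, DoubleBinaryOperator op) {
        return pearson(aggregate(a, b, op), c);
    }
}
```

For example, with `a = {7, -3, 8, -2}`, `b = {-3, 7, -2, 8}` and `c = {2, 2, 3, 3}`, calling `multiPearson(a, b, c, (x, y) -> (x + y) / 2)` gives 1.0, while `pearson(a, c)` and `pearson(b, c)` are each only about 0.1. Passing `Math::max` or `Double::sum` instead switches the aggregation method.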

Also note that these examples only consider the Multi-Pearson correlation metric, while there is also plenty of research on other multivariate correlation metrics such as **Tripoles** or **Multipoles** (Link).

## Why are they relevant?

Recent studies have repeatedly shown that multivariate correlations can capture patterns in data that could not have been found by only considering bivariate correlations. By considering such correlations one can **gain new insights from data**, which helps to better understand (natural) phenomena. That’s why multivariate correlations have become a popular research topic across a wide variety of scientific domains in recent years.

Some recent discoveries are:

- **Neuroscience**: Analysis of *fMRI* data led to the discovery that the brain’s left middle frontal region assimilates information from the right superior frontal and left inferior frontal regions when watching a video (Link).
- **Climatology**: Analysis of *air pressure* data led to the characterization of a new weather phenomenon and to improved climate models. Precisely, the air pressure over the West Siberian Plain is strongly negatively correlated with the aggregated pressure levels over Darwin, Australia and Tahiti (Link).
- **Genomics/Medicine**: Researchers found through analysis of *gene* data that the presence of multiple RASopathy genes contributed to an elevated risk of autism spectrum disorders (ASDs) due to a phenomenon called epistasis. This phenomenon involves the dependence of the effect of a gene mutation on the presence or absence of mutations in other genes. In other words, multiple genes interact with each other, which impacts the expression of a disease, while each gene individually has only a weak correlation with the disease trait (Link).
- **Finance**: While still part of ongoing research, multivariate correlation analysis of *stock price* data has found application in *portfolio diversification* (the act of creating a selection of stocks that minimizes risk) and *portfolio repair* (the act of finding one or more replacements for a stock in a portfolio such that it follows the old portfolio’s performance as closely as possible).

## Why Correlation Detective?

*“If MC analysis is so straightforward and valuable, why are we figuring it out just now?”*

Good question. Unfortunately, the problem with MC analysis is that it’s very **computationally expensive**, meaning it takes a long time to finish.

This has to do with the following:

- MC analysis involves finding all interesting correlations in a dataset of vectors/time-series.
- This means that one has to compute (or estimate) the correlations of all possible combinations of vectors in the dataset.
- The total number of vector combinations in a dataset increases almost *exponentially* with the set size of such a combination.
- Therefore, the computational effort of the analysis increases immensely if one wants to consider multivariate correlations besides the traditional bivariate correlations.

*Example:* a reasonable dataset of 1000 vectors includes around 500K unique combinations of 2 vectors. Computing the (Pearson) correlation for each combination would take around **1 second** on a strong laptop with a multithreaded algorithm. However, considering combinations of 3 vectors already involves iterating over 500 million combinations, which would take around **8 minutes**. Combinations of 4 vectors? Over one trillion combinations and an estimated computation time of over **28 hours**.
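The growth in the example can be checked with exact binomial coefficients. The sketch below is illustrative (the class name `CombinationCountSketch` and method `choose` are hypothetical): it computes \(\binom{n}{k}\), the number of unordered sets of \(k\) vectors out of \(n\). Note one assumption on our side: the figure of roughly 500 million for 3 vectors matches \(3 \cdot \binom{1000}{3}\), that is, each set of three counted once per choice of which vector is correlated against the aggregate of the other two.

```java
import java.math.BigInteger;

/** Illustrative sketch only; not part of the Correlation Detective API. */
final class CombinationCountSketch {

    /** Exact binomial coefficient C(n, k): the number of unordered k-subsets of n items. */
    static BigInteger choose(int n, int k) {
        if (k < 0 || k > n) {
            return BigInteger.ZERO;
        }
        BigInteger result = BigInteger.ONE;
        // After step i the running value equals C(n - k + i, i), so each division is exact.
        for (int i = 1; i <= k; i++) {
            result = result.multiply(BigInteger.valueOf(n - k + i))
                           .divide(BigInteger.valueOf(i));
        }
        return result;
    }
}
```

Here `choose(1000, 2)` is 499,500 (the "around 500K" pairs), and `choose(1000, 3)` is 166,167,000, which gives 498,501,000 once multiplied by the three possible one-versus-two splits.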

Other algorithms exist that bring these computation times down by at most one order of magnitude. However, they do not guarantee that all interesting combinations are found, and they usually also impose constraints on the results (e.g. only considering combinations of 3 vectors that have high internal pairwise correlations).

In contrast, Correlation Detective:

- Is two orders of magnitude **faster** than baseline algorithms (see the comparison below for reference).
- Is **generic**: it supports multiple query types, measures and *optional* constraints.
- Has extensions that support query **approximation** and **streaming** data.

Together, these properties mean that CD enables researchers and analysts to include MC analysis as a regular step in their workflow.

| Number of variables in correlation | Baseline | CD |
|---|---|---|
| 2 | 1.2 sec | 1.0 sec |
| 3 | 8 min | 18 sec |
| 4 | 28 h | 70 min |
| 5 | ~239 days | 15 h |

*Computation times of MC analysis for a dataset of 1000 stock prices*

## Demo

To provide an example of the output of CD, we run the streaming version of CD (named CDStream) on the NYSE Trade and Quote dataset (Link). This dataset contains intraday transactions data (trades and quotes) for all securities listed on the New York Stock Exchange (NYSE) and American Stock Exchange (AMEX), as well as Nasdaq National Market System (NMS) and SmallCap issues **with millisecond-level granularity**.

We simulate a stream of this data using the provided timestamps and feed this data stream to CDStream. As per our configuration, the algorithm handles the arriving price updates in batches of one second, and updates the result set accordingly. In this case, the result set consists of all combinations of 3 vectors \(a,b,c\), such that the multivariate correlation \(\rho_{a,AVG(b,c)} \geq 0.85\) over a sliding window of one hour (i.e. the prices throughout the latest hour). The figure below visualizes a subset of the output of CDStream through time.
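To make the query semantics concrete, the sketch below evaluates the same condition, \(\rho_{a,AVG(b,c)} \geq \tau\), exhaustively over a single window snapshot. It is a naive baseline for illustration, not CDStream's algorithm and not the library's API: the class `NaiveTripletQuerySketch` and its methods are hypothetical, and the window is assumed to be given as one row of recent values per series.

```java
import java.util.ArrayList;
import java.util.List;

/** Exhaustive baseline for illustration only; NOT how CDStream computes its result set. */
final class NaiveTripletQuerySketch {

    /** Pearson correlation coefficient of two equal-length vectors (NaN if one is constant). */
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0.0, meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;
        double cov = 0.0, varX = 0.0, varY = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        return cov / Math.sqrt(varX * varY);
    }

    /** Element-wise average of two equal-length vectors. */
    static double[] average(double[] b, double[] c) {
        double[] avg = new double[b.length];
        for (int i = 0; i < b.length; i++) {
            avg[i] = (b[i] + c[i]) / 2.0;
        }
        return avg;
    }

    /**
     * Returns every index triplet {a, b, c} with b < c and a distinct from both,
     * such that pearson(window[a], average(window[b], window[c])) >= threshold.
     * Each row of window holds the values of one series inside the current window.
     */
    static List<int[]> query(double[][] window, double threshold) {
        List<int[]> result = new ArrayList<>();
        int n = window.length;
        for (int b = 0; b < n; b++) {
            for (int c = b + 1; c < n; c++) {
                double[] avg = average(window[b], window[c]);
                for (int a = 0; a < n; a++) {
                    if (a == b || a == c) {
                        continue;
                    }
                    // A NaN correlation compares false and is therefore skipped.
                    if (pearson(window[a], avg) >= threshold) {
                        result.add(new int[] {a, b, c});
                    }
                }
            }
        }
        return result;
    }
}
```

Re-running `query(window, 0.85)` after every one-second batch would reproduce the result set described above, but at the cost of re-evaluating every triplet each time, which is the cost that makes the exhaustive approach impractical at scale.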

*Animation of the result set (i.e. highly correlated triplets) when running CDStream on Trade and Quote dataset*

As you can see, CDStream is able to monitor the correlations of combinations in the result set as well as identify new combinations that enter the set. This feature is essential if one wants to **analyze complex temporal relations in datasets** (i.e. correlations that exist only for some time) and/or the effect that sudden events have on correlations. Example use cases include:

- **Flash trading** (where early discovery of irregularities in the market can help traders spot investment opportunities)
- **Weather sensor networks** (where measurements must be monitored and analyzed for detection of anomalous events such as storms and floods)
- **Network monitoring systems** (where usage information must be tracked to identify weak spots and DoS attacks in a timely manner)

### Demo request

Want to see a demo on your own (numerical) dataset? Contact us via email.

Also take a look at our mission statement.