EXAMPLES

All examples below are free to use and share, but we would be very grateful if you would cite us if you use any of the code or ideas below in a paper or presentation.

Getting Started

Highlighting Patterns in Your Visuals

3D Visualisations

Interactive Visualisations

GETTING STARTED

Level: Beginner

Programming Language: Python & R

 

Getting started with coding can seem daunting at first, but it needn't be. Modern Integrated Development Environments (IDEs) are easy to install and make it simple to start writing code. An IDE is a user-friendly piece of software that allows you to develop, edit and test your code and view any figures you make.

 

If you plan to use R, you'll need to install R itself along with RStudio, an excellent IDE for R. Similarly, if you plan to use Python, you'll need to install Python itself along with an IDE called Jupyter. Fortunately, in the case of Python, it is possible to download a single installer that bundles Python together with Jupyter, which runs in a web browser.

 

Setting Up R

 

Install R, which you can download from https://cran.r-project.org

Install RStudio, which you can download from www.rstudio.com/products/rstudio/download/

 

Once these are installed, you can simply open RStudio and away you go. Code can be typed in the top left quadrant of RStudio and can be run by clicking Run.

 

Setting Up Python

 

Install Anaconda for Python 3.x (3.6 at the time of writing), which you can download from www.anaconda.com/download/

 

Once you have installed Anaconda (Python bundled together with Jupyter), you can open Jupyter in your web browser from the Terminal (on Mac) or the command prompt (on Windows). On Mac, the Terminal can be found in Applications/Utilities, while on Windows you can find the command prompt by searching for cmd. To start Jupyter, type the following command into your Terminal or command prompt:

 

jupyter notebook

Once you've typed this and pressed return, your default web browser should open and display Jupyter. To create a new notebook in which you can start writing code, click on New on the right-hand side of the page and choose Python 3. Once this notebook opens, you can type code into the first cell and press Shift+Enter to run it.
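As a quick check that everything is installed, you could paste something like the following into the first cell. This is only a minimal sketch: it assumes NumPy and matplotlib are available (both normally come bundled with Anaconda), and the measurements are made-up numbers.

```python
# Run this in the first cell of a new notebook with Shift+Enter.
import sys

import matplotlib
import numpy as np

# Print the versions of Python and of two libraries bundled with Anaconda.
print("Python:", sys.version.split()[0])
print("NumPy:", np.__version__)
print("matplotlib:", matplotlib.__version__)

# A first calculation: the mean of a handful of made-up measurements.
measurements = np.array([1.2, 3.4, 2.2, 5.0])
mean_value = measurements.mean()
print("Mean:", mean_value)
```

If the cell prints three version numbers and a mean without any error, your installation is ready to use.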

 
HIGHLIGHTING PATTERNS IN YOUR VISUALS

Level: Intermediate

Programming Language: Python

Download code here

Storytelling is among the most important aspects of science, and immunology is no exception; it's how you convince others of the importance of your findings. Text can be a useful tool for conveying your message, but visualisations are usually a far more effective way to demonstrate relationships in data. Whatever the message, you want to ensure it is as clear as possible, and you have several tools at your disposal to do this. Let's consider a hypothetical example with two continuous variables measured from three subpopulations. You can begin by visualising the two variables using a scatter plot.

No obvious relationship is visible in this plot, and while you might notice a potential subpopulation in the bottom left area of the figure, it is difficult to tell conclusively that there are three subpopulations present within the data. You can of course use different shapes or colours to highlight the existence of multiple populations.
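The two plots described so far can be sketched as follows. The data are simulated purely for illustration: the subpopulation names, means and covariances are assumptions, not values from the downloadable code. The left panel draws every point identically, while the right panel gives each subpopulation its own colour and marker shape.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the figure is reproducible

# Hypothetical subpopulations: name -> (mean, covariance) of a bivariate Gaussian.
params = {
    "A": ([2.0, 2.0], [[0.3, 0.1], [0.1, 0.2]]),
    "B": ([5.0, 6.0], [[1.0, 0.6], [0.6, 0.8]]),
    "C": ([7.0, 3.0], [[0.8, -0.4], [-0.4, 0.6]]),
}
groups = {name: rng.multivariate_normal(mean, cov, size=100)
          for name, (mean, cov) in params.items()}

fig, (ax_plain, ax_grouped) = plt.subplots(
    1, 2, figsize=(10, 4), sharex=True, sharey=True)

# Left: every point looks the same, so the subpopulations are hard to tell apart.
all_points = np.vstack(list(groups.values()))
ax_plain.scatter(all_points[:, 0], all_points[:, 1], color="grey", s=15)
ax_plain.set_title("No grouping shown")

# Right: one colour and one marker shape per subpopulation.
markers = {"A": "o", "B": "s", "C": "^"}
for name, points in groups.items():
    ax_grouped.scatter(points[:, 0], points[:, 1],
                       marker=markers[name], s=15, label=name)
ax_grouped.set_title("Coloured by subpopulation")
ax_grouped.legend(title="Subpopulation")

for ax in (ax_plain, ax_grouped):
    ax.set_xlabel("Variable 1")
ax_plain.set_ylabel("Variable 2")
plt.show()
```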

This is often enough to highlight visual trends in your data, but sometimes you might want to go one step further and make the relationship more obvious to the viewer. You can do this by drawing an ellipsoid around each subpopulation of interest. Whilst it might be tempting to draw such an ellipsoid manually, based on where you intuitively think it should go to represent the bulk of the data points from each subpopulation, this is a biased, subjective approach that should be avoided. Instead, you should rely on data-driven techniques to determine how and where to draw the ellipsoid. The data allow you to determine the two parameters that define the ellipsoid: its size and its orientation. There are some complex technical aspects to drawing an ellipsoid in a data-driven manner, but much of the underlying theory is skipped here. It's more important that you understand when to use such a technique and how to implement it (the code is provided at the link above).

Orientation: The orientation of the ellipsoid is found by performing eigendecomposition of the covariance matrix of each subpopulation (this is akin to performing PCA on each subpopulation in turn). Eigendecomposition provides you with eigenvectors and eigenvalues. The eigenvector with the largest eigenvalue points in the direction in which the data vary the most (it is the first principal component from PCA), so the major axis of the ellipsoid must be aligned with this eigenvector. The minor axis is then aligned with the second principal component (i.e. the eigenvector with the second largest eigenvalue), and is orthogonal to the major axis.

Size: Whereas eigenvectors determine the orientation of the ellipsoid, eigenvalues determine its size. You will usually want to display a 95% confidence ellipsoid, i.e. one expected to contain 95% of the data points if the subpopulation follows a bivariate Gaussian distribution. It turns out that the full length of the ellipsoid must be 2 x sqrt(5.991 x eigenvalue1) along the major axis and 2 x sqrt(5.991 x eigenvalue2) along the minor axis, where eigenvalue1 and eigenvalue2 are the largest and second largest eigenvalues respectively. The constant 5.991 is the 95th percentile of the chi-squared distribution with two degrees of freedom.
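The orientation and size rules above can be sketched in Python as follows. This is a minimal illustration rather than the code provided at the download link: the function name confidence_ellipse, the simulated subpopulations and the colours are all assumptions, and the ellipse is assumed to be centred on the mean of each subpopulation.

```python
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.patches import Ellipse

CHI2_95_2DOF = 5.991  # 95th percentile of chi-squared with 2 degrees of freedom


def confidence_ellipse(points, scale=CHI2_95_2DOF, **kwargs):
    """Return an Ellipse patch for an (n, 2) array (hypothetical helper)."""
    centre = points.mean(axis=0)        # assumed centre: the subpopulation mean
    cov = np.cov(points, rowvar=False)  # 2 x 2 covariance matrix

    # eigh is intended for symmetric matrices such as a covariance matrix.
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    order = np.argsort(eigenvalues)[::-1]  # largest eigenvalue first
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

    # Orientation: angle between the leading eigenvector and the x-axis.
    angle = np.degrees(np.arctan2(eigenvectors[1, 0], eigenvectors[0, 0]))

    # Size: full axis lengths, 2 * sqrt(5.991 * eigenvalue).
    width, height = 2 * np.sqrt(scale * eigenvalues)
    return Ellipse(centre, width, height, angle=angle, fill=False, **kwargs)


# Simulated data for illustration only (hypothetical means and covariances).
rng = np.random.default_rng(0)
params = {
    "A": ([2.0, 2.0], [[0.3, 0.1], [0.1, 0.2]]),
    "B": ([5.0, 6.0], [[1.0, 0.6], [0.6, 0.8]]),
    "C": ([7.0, 3.0], [[0.8, -0.4], [-0.4, 0.6]]),
}
colours = {"A": "tab:blue", "B": "tab:orange", "C": "tab:green"}

fig, ax = plt.subplots()
for name, (mean, cov) in params.items():
    points = rng.multivariate_normal(mean, cov, size=100)
    ax.scatter(points[:, 0], points[:, 1], s=15, color=colours[name], label=name)
    ax.add_patch(confidence_ellipse(points, edgecolor=colours[name], linewidth=2))
ax.set_aspect("equal")  # keeps the major and minor axes visibly orthogonal
ax.set_xlabel("Variable 1")
ax.set_ylabel("Variable 2")
ax.legend(title="Subpopulation")
plt.show()
```

Setting an equal aspect ratio is a design choice: without it, the two axes of the ellipse can look skewed even though they are orthogonal in data coordinates.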

For those interested in getting to grips with more of the theory, read about eigendecomposition of matrices here.

 

 

These ellipsoids highlight how the three subpopulations can be distinguished using the two variables used for plotting, but this isn’t the only option. Indeed, in situations where it is incorrect to assume that the data points can be modelled by a bivariate Gaussian distribution, you must seek other methods to highlight the segregation of the subpopulations. Perhaps a more intuitive method is to draw the convex hull of each subpopulation. The convex hull is the smallest convex polygon that envelops every data point in a subpopulation; its corners are the minimum set of data points required to enclose all the others. No prior assumptions are required to use this technique, so it may be deemed a safer choice than a data-driven ellipsoid. The convex hull of each subpopulation is shown below.
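One way to compute and draw convex hulls is with SciPy's ConvexHull class, as sketched below. The data are again simulated for illustration (the means, covariances and colours are assumptions), and the downloadable code may take a different approach. Note that the hull calculation itself makes no use of the fact that the simulated points happen to be Gaussian.

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy.spatial import ConvexHull

# Simulated data for illustration only (hypothetical means and covariances).
rng = np.random.default_rng(0)
params = {
    "A": ([2.0, 2.0], [[0.3, 0.1], [0.1, 0.2]]),
    "B": ([5.0, 6.0], [[1.0, 0.6], [0.6, 0.8]]),
    "C": ([7.0, 3.0], [[0.8, -0.4], [-0.4, 0.6]]),
}
colours = {"A": "tab:blue", "B": "tab:orange", "C": "tab:green"}

fig, ax = plt.subplots()
hulls = {}
for name, (mean, cov) in params.items():
    points = rng.multivariate_normal(mean, cov, size=100)
    hull = ConvexHull(points)
    hulls[name] = hull

    ax.scatter(points[:, 0], points[:, 1], s=15, color=colours[name], label=name)

    # hull.vertices holds the indices of the corner points; in 2D they are
    # listed in counter-clockwise order, so they trace the outline directly.
    outline = points[hull.vertices]
    ax.fill(outline[:, 0], outline[:, 1], color=colours[name], alpha=0.2)

ax.set_xlabel("Variable 1")
ax.set_ylabel("Variable 2")
ax.legend(title="Subpopulation")
plt.show()
```

Because the hull must enclose every point, a single outlier can enlarge it considerably, which is worth bearing in mind when choosing between a hull and an ellipsoid.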


As with most visuals, there is no strict right or wrong answer as to which technique to use to highlight patterns in your data. Instead it is up to you, the creator, to determine how you can best convey your message without deliberately misleading the viewer.

3D VISUALISATIONS

Level: Beginner

Programming Language: Python

Download code here

Dimension reduction techniques allow us to represent high-dimensional data in a lower number of dimensions, typically two for the purposes of a plot. However, if we want to show a particular relationship that exists across precisely three variables, we are left with a more difficult choice: use dimension reduction to visualise relationships in 2D, or show all three variables directly in the same visualisation. Such a situation may arise when visualising flow cytometry data.

Suppose we have gated a population of interest in FlowJo. We can then export the raw data in a file format that Python can easily read. You can learn how to export raw data from FlowJo here.

Once we've exported data as a csv or txt file, we can use Python's matplotlib library to visualise the relationship between three markers of interest.
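A 3D scatter plot of this kind might look like the sketch below. The file name in the comment and the marker names (CD3, CD4, CD8) are hypothetical, and randomly generated values stand in for a real export so that the example runs on its own; the column names in your own file will depend on your panel.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (needed on older matplotlib)

# With a real export you would load the file instead, for example:
#   events = pd.read_csv("exported_population.csv")  # hypothetical file name
# Here synthetic values stand in for it.
rng = np.random.default_rng(0)
markers = ["CD3", "CD4", "CD8"]  # hypothetical marker names
events = pd.DataFrame(
    rng.lognormal(mean=5.0, sigma=1.0, size=(500, 3)),
    columns=markers,
)

fig = plt.figure()
ax = fig.add_subplot(projection="3d")

# One point per cell, with one marker on each of the three axes.
ax.scatter(
    events["CD3"].to_numpy(),
    events["CD4"].to_numpy(),
    events["CD8"].to_numpy(),
    s=5,
)
ax.set_xlabel("CD3")
ax.set_ylabel("CD4")
ax.set_zlabel("CD8")
plt.show()
```

With an interactive matplotlib backend you can click and drag the figure to rotate it; ax.view_init(elev=..., azim=...) sets a fixed viewing angle for a static figure.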


INTERACTIVE VISUALISATIONS PART 1

Level: Beginner

Programming Language: Python

Download code here

Sometimes we want to create visualisations that are not to be included in a paper or a talk, but are instead intended to be used in a more informal setting, such as a meeting to discuss the data that we generate. In such circumstances, we may want to move away from static or 3D rotating plots, and instead include an element of user interaction.

For example, suppose we have collected data from a set of patients, and want to interrogate the dataset informally to identify any patterns that emerge. Interactive plots allow us to interrogate specific data points of interest in more detail. In the example below, hovering over a particular point of interest reveals further information about the patient. Specifically, we show the patient's gMDSC % and age on the x- and y-axes respectively, and when we hover over a data point we reveal the patient's disease classification in the context of Hepatitis B (i.e. whether they are immune tolerant, inactive carriers or immune active).
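A hover interaction of this kind can be sketched with matplotlib's built-in event handling, as below. This is not necessarily the approach taken in the downloadable code, which may use a different library. The patients are randomly generated for illustration and are not real measurements, and the column and function names are assumptions. Hovering only works with an interactive matplotlib backend; with Jupyter's default inline display the figure is a static image.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up patients purely for illustration; these are not real measurements.
rng = np.random.default_rng(0)
n_patients = 30
patients = pd.DataFrame({
    "gmdsc_pct": rng.uniform(0.0, 10.0, n_patients),
    "age": rng.integers(18, 80, n_patients),
    "classification": rng.choice(
        ["Immune tolerant", "Inactive carrier", "Immune active"], n_patients),
})

fig, ax = plt.subplots()
scatter = ax.scatter(patients["gmdsc_pct"].to_numpy(), patients["age"].to_numpy())
ax.set_xlabel("gMDSC %")
ax.set_ylabel("Age (years)")

# One hidden text box that is moved and refilled whenever a point is hovered.
label = ax.annotate(
    "", xy=(0, 0), xytext=(10, 10), textcoords="offset points",
    bbox={"boxstyle": "round", "facecolor": "white"},
)
label.set_visible(False)


def describe(index):
    """Return the text shown for the patient in row `index`."""
    return patients["classification"].iloc[index]


def on_move(event):
    """Show the classification of the point under the cursor, if any."""
    hit, details = scatter.contains(event) if event.inaxes is ax else (False, {})
    if hit:
        index = details["ind"][0]  # first point under the cursor
        label.xy = scatter.get_offsets()[index]
        label.set_text(describe(index))
        label.set_visible(True)
    else:
        label.set_visible(False)
    fig.canvas.draw_idle()


fig.canvas.mpl_connect("motion_notify_event", on_move)
plt.show()
```

Keeping the label text in a separate describe function makes it easy to show more fields later, such as a patient identifier, without touching the event handling.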

DATA SCIENCE
FOR
IMMUNOLOGISTS