NetSci: A Toolkit for High Performance Scientific Network Analysis Computation

Overview


NetSci is a specialized toolkit for advanced network analysis in the computational sciences. By leveraging modern GPUs, it computes demanding network analysis metrics, such as mutual information, with state-of-the-art performance.

Installation


NetSci is designed with a focus on ease of installation and long-term stability, ensuring compatibility with Linux systems featuring CUDA-capable GPUs (compute capability 3.5 and above). It leverages well-supported core C++ and Python libraries to maintain simplicity and reliability.

  1. Download Miniconda Installation Script:
    wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
  2. Execute the Installation Script:
    bash Miniconda3-latest-Linux-x86_64.sh
  3. Update Environment Settings:
    source ~/.bashrc
  4. Install Git with Conda:
    conda install -c conda-forge git
  5. Clone the NetSci Repository:
    git clone https://github.com/netscianalysis/netsci.git
  6. Navigate to the NetSci Root Directory:
    cd netsci
  7. Create NetSci Conda Environment:
    conda env create -f netsci.yml
  8. Activate NetSci Conda Environment:
    conda activate netsci
  9. Create CMake Build Directory:
    mkdir build
  10. Set NetSci Root Directory Variable:
    NETSCI_ROOT=$(pwd)
  11. Navigate to the CMake Build Directory:
    cd ${NETSCI_ROOT}/build
  12. Compile CUDA Architecture Script:
    nvcc ${NETSCI_ROOT}/build_scripts/cuda_architecture.cu -o cuda_architecture
  13. Set CUDA Architecture Variable:
    CUDA_ARCHITECTURE=$(./cuda_architecture)
  14. Configure the Build with CMake:
    cmake .. -DCONDA_DIR=$CONDA_PREFIX -DCUDA_ARCHITECTURE=${CUDA_ARCHITECTURE}
  15. Build NetSci:
    cmake --build . -j
  16. Build NetSci Python Interface:
    make python
  17. Test C++ and CUDA Backend:
    ctest
  18. Run Python Interface Tests:
    cd ${NETSCI_ROOT}
    pytest

Theory

Mutual information measures the statistical dependence between two random variables, capturing both linear and non-linear relationships. Suppose we have a set of data pairs \((x_i, y_i)\), where each pair is an independent realization of the random variables \((X, Y)\), which follow a joint distribution \(\mu(x, y)\). The Shannon entropy of \(X\), denoted \(H(X)\), is calculated as:

\[ H(X) = -\int\mu(x)\log\mu(x)dx \]

where the base of the logarithm determines the unit of information (bits for base 2, nats for base \(e\)); we use the natural logarithm throughout. Mutual information, \(I(X, Y)\), is defined as:

\[ I(X, Y) = H(X) + H(Y) - H(X, Y) \]

This value quantifies how strongly \(X\) and \(Y\) are coupled. If they are completely independent, then \(H(X, Y) = H(X) + H(Y)\), so \(I(X, Y)\) equals zero. In practice we rarely know \(\mu\) exactly and must estimate it from the data. Assuming \(\mu\) is approximately uniform around each sample, we approximate \(H(X)\) with:

\[ \widehat{H}(X) = -\frac{1}{N}\sum_{i=1}^N\widehat{\log(\mu(x_i))} \]

We use a k-nearest-neighbor estimator for this purpose: the distance from each data point to its k-th nearest neighbor provides a local estimate of the density, and measuring these neighbor distances in both the \(X\) and \(Y\) dimensions yields the probability estimates needed for the entropies, and hence for the mutual information.
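
To make this concrete, here is a minimal sketch of a k-nearest-neighbor mutual information estimate in the style of Kraskov, Stögbauer, and Grassberger, written in plain NumPy/SciPy for intuition. It is a single-threaded reference implementation, not NetSci's GPU code, and the function name ksg_mutual_information is ours.

import numpy as np
from scipy.special import digamma

def ksg_mutual_information(x, y, k=4):
    # KSG estimator: I = psi(k) + psi(N) - mean_i[psi(n_x(i) + 1) + psi(n_y(i) + 1)]
    n = len(x)
    acc = 0.0
    for i in range(n):
        dx = np.abs(x - x[i])    # distances to sample i in the X dimension
        dy = np.abs(y - y[i])    # distances to sample i in the Y dimension
        dz = np.maximum(dx, dy)  # max-norm distance in the joint (X, Y) space
        eps = np.sort(dz)[k]     # k-th neighbor distance (slot 0 is the point itself)
        n_x = np.count_nonzero(dx < eps) - 1  # neighbors within eps in X, minus self
        n_y = np.count_nonzero(dy < eps) - 1  # neighbors within eps in Y, minus self
        acc += digamma(n_x + 1) + digamma(n_y + 1)
    return digamma(k) + digamma(n) - acc / n

# Sanity check on correlated Gaussians, where I = -0.5 * log(1 - rho**2) exactly.
rng = np.random.default_rng(0)
rho = 0.8
x = rng.standard_normal(2000)
y = rho * x + np.sqrt(1.0 - rho**2) * rng.standard_normal(2000)
print(ksg_mutual_information(x, y), -0.5 * np.log(1.0 - rho**2))

NetSci's CUDA implementation, outlined in the Algorithms section below, parallelizes the expensive inner loop of this estimator, the nearest-neighbor search and the neighbor counting, across GPU blocks.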


Algorithms

Parallel Mutual Information

Variable    Description
Xa          Data points for the first random variable in the mutual information calculation.
Xb          Data points for the second random variable in the mutual information calculation.
k           Number of nearest neighbors to consider for each data point.
n           Total number of data points in each of Xa and Xb.
nXa         Output array: for each point, the count of Xa points within a radius of epsilon_Xa / 2.
nXb         Output array: for each point, the count of Xb points within a radius of epsilon_Xb / 2.
s_argMin    Shared-memory array holding the indices of candidate nearest neighbors during the search.
s_min       Shared-memory array holding the minimum distances found during the nearest-neighbor search.
s_epsXa     Shared-memory slot holding the distance (epsilon_Xa) to the k-th nearest neighbor in Xa.
s_epsXb     Shared-memory slot holding the distance (epsilon_Xb) to the k-th nearest neighbor in Xb.
s_nXa       Shared-memory array for per-block accumulation of neighbor counts for Xa.
s_nXb       Shared-memory array for per-block accumulation of neighbor counts for Xb.

Step 1: Initialization

  • Input: Arrays Xa, Xb, integers k, n, output arrays nXa, nXb
  • Output: Updated nXa, nXb with neighbor counts for each point in Xa, Xb
  • Initialize shared memory arrays: s_argMin[1024], s_min[1024], s_epsXa[1], s_epsXb[1], s_nXa[1024], s_nXb[1024]

Step 2: Parallel Processing

For each data point i processed in parallel CUDA blocks:

  • Load Xa[i] and Xb[i] into registers r_Xai, r_Xbi
  • Set local thread index localThreadIndex = threadIdx.x
  • Initialize s_nXa[localThreadIndex], s_nXb[localThreadIndex] to zero
  • If localThreadIndex == 0, set s_epsXa[0], s_epsXb[0] to zero
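
Together, Steps 1 and 2 correspond to the kernel prologue sketched below. This is a hedged skeleton, not NetSci's actual kernel: the names mirror the variable table above, and the pattern is one CUDA block per data point i with one shared-memory slot per thread.

__global__ void mutualInformationKernel(
        const float *Xa, const float *Xb,
        int k, int n,
        float *nXa, float *nXb) {
    // Step 1: per-block shared-memory work areas (blockDim.x is at most 1024).
    __shared__ int   s_argMin[1024];  // candidate nearest-neighbor indices
    __shared__ float s_min[1024];     // candidate nearest-neighbor distances
    __shared__ float s_epsXa[1];      // epsilon_Xa: k-th neighbor distance in Xa
    __shared__ float s_epsXb[1];      // epsilon_Xb: k-th neighbor distance in Xb
    __shared__ float s_nXa[1024];     // per-thread neighbor counts for Xa
    __shared__ float s_nXb[1024];     // per-thread neighbor counts for Xb

    // Step 2: each block owns one data point i; cache it in registers.
    int i = blockIdx.x;
    int localThreadIndex = threadIdx.x;
    float r_Xai = Xa[i];
    float r_Xbi = Xb[i];
    s_nXa[localThreadIndex] = 0.0f;
    s_nXb[localThreadIndex] = 0.0f;
    if (localThreadIndex == 0) {
        s_epsXa[0] = 0.0f;
        s_epsXb[0] = 0.0f;
    }
    __syncthreads();
    // Steps 3-6 (chunked loads, neighbor search, counting) follow from here.
}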

Step 3: Load Data in Chunks

  • Iterate over Xa, Xb in chunks, loading into r_Xa, r_Xb
  • Synchronize threads using __syncthreads()

Step 4: Find k Nearest Neighbors

  • Initialize localMin = RAND_MAX, localArgMin = 0
  • Iterate over chunks, updating localMin, localArgMin based on distance dX between r_Xai, r_Xbi and chunk data
  • Update shared memory s_min, s_argMin
  • Perform parallel reduction to find global minimum distance and corresponding index
  • Update s_epsXa, s_epsXb and mark processed points in r_Xa, r_Xb as needed
  • Synchronize threads using __syncthreads()
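
The core of Step 4 is a block-wide minimum reduction. The self-contained program below demonstrates that pattern in isolation: every thread scans a strided slice of a distance array, keeps a local (localMin, localArgMin) pair, and a shared-memory tree reduction produces the block-wide minimum and its index. It illustrates the idiom rather than reproducing NetSci's kernel; FLT_MAX plays the role of the RAND_MAX sentinel, and the strided scan stands in for the chunked loads of Step 3.

#include <cstdio>
#include <cfloat>
#include <cuda_runtime.h>

__global__ void argMinKernel(const float *dX, int n, int *argMinOut) {
    __shared__ float s_min[256];
    __shared__ int s_argMin[256];
    int t = threadIdx.x;

    // Each thread scans a strided slice and keeps its local minimum.
    float localMin = FLT_MAX;
    int localArgMin = 0;
    for (int j = t; j < n; j += blockDim.x) {
        if (dX[j] < localMin) { localMin = dX[j]; localArgMin = j; }
    }
    s_min[t] = localMin;
    s_argMin[t] = localArgMin;
    __syncthreads();

    // Tree reduction: halve the number of active threads each round.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (t < stride && s_min[t + stride] < s_min[t]) {
            s_min[t] = s_min[t + stride];
            s_argMin[t] = s_argMin[t + stride];
        }
        __syncthreads();
    }
    if (t == 0) *argMinOut = s_argMin[0];
}

int main() {
    const int n = 1000;
    float h_dX[n];
    for (int j = 0; j < n; ++j) h_dX[j] = (float)((j * 37) % 1000) + 1.0f;
    h_dX[123] = 0.5f;  // plant a known minimum

    float *d_dX; int *d_argMin;
    cudaMalloc(&d_dX, n * sizeof(float));
    cudaMalloc(&d_argMin, sizeof(int));
    cudaMemcpy(d_dX, h_dX, n * sizeof(float), cudaMemcpyHostToDevice);
    argMinKernel<<<1, 256>>>(d_dX, n, d_argMin);

    int h_argMin;
    cudaMemcpy(&h_argMin, d_argMin, sizeof(int), cudaMemcpyDeviceToHost);
    printf("argmin = %d (expected 123)\n", h_argMin);
    cudaFree(d_dX); cudaFree(d_argMin);
    return 0;
}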

Step 5: Increment Neighbor Counts

  • Iterate over chunks, incrementing s_nXa, s_nXb based on distance conditions to r_Xai, r_Xbi
  • Synchronize threads using __syncthreads()
  • Perform parallel reduction on s_nXa, s_nXb

Step 6: Update Global Counts

  • If localThreadIndex == 0, update nXa[i], nXb[i] with reduced counts from s_nXa[0], s_nXb[0]
  • Synchronize threads using __syncthreads()
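
Steps 5 and 6 reuse the same tree-reduction idiom, this time summing per-thread counts instead of minimizing distances. Below is a hedged continuation of the skeleton from Steps 1 and 2, assuming the epsilon values were written in Step 4; it is an illustration of the pattern, not NetSci's kernel.

__device__ void countAndReduce(
        const float *Xa, const float *Xb, float r_Xai, float r_Xbi,
        int n, int i, float *nXa, float *nXb,
        float *s_nXa, float *s_nXb,
        const float *s_epsXa, const float *s_epsXb) {
    int t = threadIdx.x;

    // Step 5: strided scan, counting points within a radius of epsilon / 2
    // of the block's point in each dimension.
    for (int j = t; j < n; j += blockDim.x) {
        if (fabsf(Xa[j] - r_Xai) < 0.5f * s_epsXa[0]) s_nXa[t] += 1.0f;
        if (fabsf(Xb[j] - r_Xbi) < 0.5f * s_epsXb[0]) s_nXb[t] += 1.0f;
    }
    __syncthreads();

    // Tree reduction sums the per-thread counts into slot 0.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (t < stride) {
            s_nXa[t] += s_nXa[t + stride];
            s_nXb[t] += s_nXb[t + stride];
        }
        __syncthreads();
    }

    // Step 6: one thread publishes the block's counts to global memory.
    if (t == 0) {
        nXa[i] = s_nXa[0];
        nXb[i] = s_nXb[0];
    }
    __syncthreads();
}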

Tutorials

Parallel Mutual Information

from pathlib import Path
import numpy as np
from netcalc import mutualInformation
from cuarray import FloatCuArray, IntCuArray
from netchem import Network, data_files
dcd, pdb = data_files("pyro")
trajectory_file = str(dcd)
topology_file = str(pdb)
first_frame = 0
last_frame = 999
stride = 1
network = Network()
network.init(
    trajectory_file,
    topology_file,
    first_frame,
    last_frame,
    stride,
)
num_nodes = network.numNodes()
num_frames = network.numFrames()
k = 4
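
The snippet above stops just before the calculation itself. The sketch below outlines how the call might be completed; the accessor nodeCoordinates(), the pair-list construction, and the mutualInformation argument order (X, I, ab, k, n, xd, d, platform) are assumptions patterned on NetCalc's calling convention, so check them against the netcalc API reference before use.

import netcalc  # assumed to expose the GPU_PLATFORM constant

# Node time series for the trajectory (assumed accessor name).
X = network.nodeCoordinates()

# Pair list: evaluate mutual information for every ordered node pair.
num_node_pairs = num_nodes * num_nodes
ab = IntCuArray()
ab.init(num_node_pairs, 2)
for p in range(num_node_pairs):
    ab[p][0] = p // num_nodes  # first node of the pair
    ab[p][1] = p % num_nodes   # second node of the pair

# Output array that will hold one mutual information value per pair.
I = FloatCuArray()

xd = 2  # dimensionality of the joint (X, Y) space
d = 3   # spatial dimension of each node coordinate

mutualInformation(X, I, ab, k, num_frames, xd, d, netcalc.GPU_PLATFORM)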