NetSci
NetSci is a specialized toolkit for advanced network analysis in the computational sciences. It exploits the capabilities of modern GPUs to compute demanding network analysis metrics with state-of-the-art performance.
NetSci is designed with a focus on ease of installation and long-term stability, ensuring compatibility with Linux systems featuring CUDA-capable GPUs (compute capability 3.5 and above). It leverages well-supported core C++ and Python libraries to maintain simplicity and reliability.
Mutual information measures how much two random variables depend on each other, capturing both linear and non-linear relationships. Imagine we have a set of data pairs \((x_i, y_i)\), where each pair is an independent realization of random variables \((X, Y)\). These variables follow a joint distribution \(\mu(x, y)\). Shannon entropy, denoted \(H(X)\), is calculated using:
\[ H(X) = -\int\mu(x)\log\mu(x)dx \]
where the logarithm's base determines the information's unit (bits, nats, etc.). We use the natural logarithm in our context. Mutual information, \(I(X, Y)\), is defined as:
\[ I(X, Y) = H(X) + H(Y) - H(X, Y) \]
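Equivalently, mutual information is the Kullback–Leibler divergence of the joint density from the product of its marginals:
\[ I(X, Y) = \iint \mu(x, y)\log\frac{\mu(x, y)}{\mu_X(x)\,\mu_Y(y)}\,dx\,dy \]
where \(\mu_X\) and \(\mu_Y\) are the marginal densities of \(X\) and \(Y\).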
This value indicates how strongly \(X\) and \(Y\) are related; it is zero exactly when they are independent, since then \(\mu(x, y) = \mu_X(x)\mu_Y(y)\). In practice we rarely know \(\mu\) and must estimate it from the samples. Treating the density as approximately constant within a small neighborhood of each sample point, we approximate \(H(X)\) with:
\[ \widehat{H}(X) = -\frac{1}{N}\sum_{i=1}^N\widehat{\log(\mu(x_i))} \]
We use a k-nearest-neighbor estimator for this purpose. To obtain the probability densities needed for these estimates, we measure the distance from each data point to its k-th nearest neighbor, and count how many points fall within that distance along each of the \(X\) and \(Y\) dimensions; these counts feed directly into the estimate below.
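A widely used estimator of this kind is the Kraskov–Stögbauer–Grassberger (KSG) estimator, whose quantities closely match the variables tabulated below. In its "algorithm 2" variant, the per-coordinate radii \(\epsilon_X(i)/2\) and \(\epsilon_Y(i)/2\) are the projections of the distance from point \(i\) to its \(k\)-th nearest neighbor in the joint space, \(n_X(i)\) and \(n_Y(i)\) are the numbers of points within those radii, and
\[ \widehat{I}(X, Y) = \psi(k) - \frac{1}{k} + \psi(N) - \frac{1}{N}\sum_{i=1}^{N}\left[\psi\big(n_X(i)\big) + \psi\big(n_Y(i)\big)\right] \]
where \(\psi\) is the digamma function. Conventions for ties and \(\pm 1\) offsets vary between implementations, so this should be read as the general form rather than a statement of NetSci's exact expression.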
| Variable | Description |
|---|---|
| `Xa` | Array containing the data points of the first random variable in the mutual information calculation. |
| `Xb` | Array containing the data points of the second random variable in the mutual information calculation. |
| `k` | Integer specifying the number of nearest neighbors to consider for each data point. |
| `n` | Integer giving the total number of data points in each of `Xa` and `Xb`. |
| `nXa` | Array storing, for each point, the count of data points in `Xa` within a radius of `epsilon_Xa / 2`. |
| `nXb` | Array storing, for each point, the count of data points in `Xb` within a radius of `epsilon_Xb / 2`. |
| `s_argMin` | Shared-memory array storing the indices of the nearest neighbors during the calculation. |
| `s_min` | Shared-memory array storing the minimum distances found in the nearest-neighbor search. |
| `s_epsXa` | Shared-memory array storing the maximum distance (epsilon) within `Xa` for the k nearest neighbors. |
| `s_epsXb` | Shared-memory array storing the maximum distance (epsilon) within `Xb` for the k nearest neighbors. |
| `s_nXa` | Shared-memory array temporarily storing neighbor counts for `Xa` within each CUDA block. |
| `s_nXb` | Shared-memory array temporarily storing neighbor counts for `Xb` within each CUDA block. |
**Input:** arrays `Xa`, `Xb`, integers `k`, `n`, output arrays `nXa`, `nXb`.

**Output:** `nXa`, `nXb` filled with neighbor counts for each point in `Xa`, `Xb`.

**Shared memory:** `s_argMin[1024]`, `s_min[1024]`, `s_epsXa[1]`, `s_epsXb[1]`, `s_nXa[1024]`, `s_nXb[1024]`.

For each data point `i`, processed in parallel CUDA blocks:

1. Load `Xa[i]` and `Xb[i]` into registers `r_Xai`, `r_Xbi`.
2. Set `localThreadIndex = threadIdx.x`.
3. Initialize `s_nXa[localThreadIndex]`, `s_nXb[localThreadIndex]` to zero.
4. If `localThreadIndex == 0`, set `s_epsXa[0]`, `s_epsXb[0]` to zero.
5. Process `Xa`, `Xb` in chunks, loading them into `r_Xa`, `r_Xb`.
6. `__syncthreads()`
7. Initialize `localMin = RAND_MAX`, `localArgMin = 0`.
8. Update `localMin`, `localArgMin` based on the distance `dX` between `r_Xai`, `r_Xbi` and the chunk data.
9. Perform a parallel reduction on `s_min`, `s_argMin`.
10. Update `s_epsXa`, `s_epsXb` and mark processed points in `r_Xa`, `r_Xb` as needed.
11. `__syncthreads()`
12. Update `s_nXa`, `s_nXb` based on the distance conditions to `r_Xai`, `r_Xbi`.
13. `__syncthreads()`
14. Perform a parallel reduction on `s_nXa`, `s_nXb`.
15. If `localThreadIndex == 0`, write the reduced counts from `s_nXa[0]`, `s_nXb[0]` to `nXa[i]`, `nXb[i]`.
16. `__syncthreads()`
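As a sanity check on the kernel's output, the same counts can be computed serially on the CPU. The sketch below (illustrative function names, not part of the NetSci API) assumes the KSG-style convention above: `epsilon_Xa(i) / 2` and `epsilon_Xb(i) / 2` are the per-coordinate projections of the distance to the k-th nearest neighbor under the max-norm; the actual kernel may differ in tie-breaking details.

```python
import numpy as np
from scipy.special import digamma

def neighbor_counts(Xa, Xb, k):
    """Serial reference for the counts nXa, nXb described above.

    Assumes epsilon_Xa(i) / 2 and epsilon_Xb(i) / 2 are the per-coordinate
    projections of the distance from point i to its k-th nearest neighbor
    in the joint (Xa, Xb) space under the max-norm (a KSG-style convention,
    assumed here rather than taken from NetSci internals).
    """
    n = len(Xa)
    nXa = np.zeros(n, dtype=np.int64)
    nXb = np.zeros(n, dtype=np.int64)
    for i in range(n):
        dXa = np.abs(Xa - Xa[i])          # per-coordinate distances to point i
        dXb = np.abs(Xb - Xb[i])
        dJoint = np.maximum(dXa, dXb)     # max-norm distance in the joint space
        dJoint[i] = np.inf                # exclude point i from its own search
        kth = np.argsort(dJoint)[k - 1]   # index of the k-th nearest neighbor
        # Count points falling within the projected radii; subtract 1 so that
        # point i itself is not counted.
        nXa[i] = np.count_nonzero(dXa <= dXa[kth]) - 1
        nXb[i] = np.count_nonzero(dXb <= dXb[kth]) - 1
    return nXa, nXb

def mutual_information(Xa, Xb, k=4):
    """KSG 'algorithm 2' combination of the counts (see the formula above)."""
    n = len(Xa)
    nXa, nXb = neighbor_counts(Xa, Xb, k)
    return digamma(k) - 1.0 / k + digamma(n) - np.mean(digamma(nXa) + digamma(nXb))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=2000)
    y = x + 0.5 * rng.normal(size=2000)                        # dependent pair
    print(mutual_information(x, y, k=4))                       # clearly positive
    print(mutual_information(x, rng.normal(size=2000), k=4))   # near zero
```

On small inputs, the GPU's `nXa`, `nXb` can be compared against this reference directly. The O(n²) scan mirrors the kernel's chunked brute-force search, which is what makes the problem such a good fit for GPU parallelism.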