Abstract
Background
Networkbased learning algorithms for automated function prediction (AFP) are negatively affected by the limited coverage of experimental data and limited a priori known functional annotations. As a consequence their application to model organisms is often restricted to well characterized biological processes and pathways, and their effectiveness with poorly annotated species is relatively limited. A possible solution to this problem might consist in the construction of big networks including multiple species, but this in turn poses challenging computational problems, due to the scalability limitations of existing algorithms and the main memory requirements induced by the construction of big networks. Distributed computation or the usage of big computers could in principle respond to these issues, but raises further algorithmic problems and require resources not satisfiable with simple offtheshelf computers.
Results
We propose a novel framework for scalable networkbased learning of multispecies protein functions based on both a local implementation of existing algorithms and the adoption of innovative technologies: we solve “locally” the AFP problem, by designing “vertexcentric” implementations of networkbased algorithms, but we do not give up thinking “globally” by exploiting the overall topology of the network. This is made possible by the adoption of secondary memorybased technologies that allow the efficient use of the large memory available on disks, thus overcoming the main memory limitations of modern offtheshelf computers. This approach has been applied to the analysis of a large multispecies network including more than 300 species of bacteria and to a network with more than 200,000 proteins belonging to 13 Eukaryotic species. To our knowledge this is the first work where secondarymemory based network analysis has been applied to multispecies function prediction using biological networks with hundreds of thousands of proteins.
Conclusions
The combination of these algorithmic and technological approaches makes feasible the analysis of large multispecies networks using ordinary computers with limited speed and primary memory, and in perspective could enable the analysis of huge networks (e.g. the whole proteomes available in SwissProt), using wellequipped standalone machines.
Keywords:
Biomolecular networks; Big data analysis; Networkbased learningBackground
In recent years many efforts have been devoted to build automated tools for large scale automated function prediction of proteins (AFP) exploiting the knowledge generated by high throughput biotechnologies [1,2]. As highlighted by a recent international challenge for the critical assessment of automated function prediction [3], scalability and heterogeneity of the available data represent two of the main challenges posed by AFP. Indeed on the one hand no single experimental method can fully characterize the multiplicity of the protein functions, and on the other hand the huge amount of data to be processed poses serious computational problems. The complexity of the problem is furthermore exacerbated by the different level of the functional annotation coverage in different organisms, thus making very difficult the effective transfer of the available functional knowledge from one organism to another.
Computational automated function prediction approaches can be useful for the integration of diverse types of data coming from multiple, often unrelated, proteomic and genomic pipelines. A recent example is represented by the Integrative multispecies prediction (IMP) web server [4] which integrates prior knowledge and data collections from multiple organisms for the generation of novel functional working hypotheses used in experimental followup. Despite its undoubted usefulness, IMP actually covers only seven model organisms, preventing its application to the prediction of the functions of proteins belonging to the proteomes of poorly annotated organisms.
Another popular approach for gene functional annotation transfer between species relies on the availability of a collection of orthology relationships across interspecies proteins, and on the usage of an evolutionary relationships network as a suitable medium for transferring functional annotations to the proteins of poorly annotated organisms [5]. Even if orthology is an evolutionary concept, rather than a functional one, it can be used to link functionally equivalent genes across genomes and enables the functional inference of unknown proteins using one or more functionally characterized orthologs in other species [6,7].
As noticed in [4], the accuracy of machinelearning algorithms for AFP tasks is negatively affected by the sparse coverage of experimental data and by the limited availability of prior functional knowledge. Consequently, these methods are often applied only to biological processes and pathways that are already well characterized for an organism. The construction of large scale multi species networks can be a solution to this problem. Following this approach, network based learning algorithms might benefit of the availability of a priori functional knowledge coming from well annotated species to effectively perform a functional transfer to the proteins of poorly annotated organisms.
Unfortunately this solution is only apparently simple, since the application of classical graphbased algorithms such as the ones based on random walks [8] or label propagation methods [9,10] are often unfeasible with large multispecies networks, especially when only single offtheshelf machines are available. These approaches, indeed, usually rely on an inmemory adjacency matrix representation of the graph network, scale poorly with the size of the graph [11], and may have time complexity that becomes quickly prohibitive. Performance optimization is usually realized by adopting an adjacencylist representation of the graph to take its sparsity into account, or by using parallel strategies for matrix multiplication [12]. However, when the size of the graph becomes so high that is not possible to maintain it entirely in primary memory, either approaches based on parallel distributed computation [1315], or secondary memorybased computation [1618] can be considered. With distributed computation techniques, the graph is spread on different machines and the results are finally collected. However, as outlined in [16], a key issue of these approaches is the need to identify a cut of the graph in order to minimize the communication overhead among machines and their synchronization activities. With secondary memorybased computation, the graph is stored on the disk of a single machine and only limited portions of the graph are loaded in primary memory for computation. In this way, it is possible to overcome the lack of enough primary memory. The use of smart strategies for caching the portions of graph needed for computation [19], the minimization of the number of accesses to secondary memory [20], and the usage of compressed data structures for maintaining the graph in primary memory [21] are the main challenges for making the management of large graph networks in offtheshelf machines comparable to distributed approaches.
In this work we propose a novel framework for scalable semisupervised networkbased learning of multispecies protein functions: on the one hand we adopt a “local learning strategy” to implement classical graphbased algorithms for protein function prediction, and on the other hand we apply secondary memorybased technologies to exploit the large disks available in ordinary offtheshelf computers. The combination of these algorithmic and technological approaches makes feasible the analysis of large multispecies networks in ordinary computers with limited speed and primary memory and in perspective could enable the analysis of huge networks (e.g. the whole proteomes available in SwissProt), using wellequipped standalone machines.
Only very recently a paper has been devoted to the application of graph database technologies in bioinformatics [22], and to our knowledge this is the first work where secondarymemory based network analysis has been applied to multispecies function prediction using big biological networks with hundreds of thousands of proteins.
This paper is organized as follows. In the next section we introduce our proposed approach based on the local implementation of networkbased algorithms and secondary memorybased computation for the multispecies AFP problem. In particular we discuss the characteristics of Neo4j, a database technology for graph querying and processing, and GraphChi, a diskbased system for graph processing. Then, we show their application to a multispecies network involving proteins of about 300 bacteria species, and to a network including 13 species of Eukaryotes with more than 200.000 proteins, using offtheshelf notebook and desktop computers.
Methods
Our approach to biggraph analysis for AFP leverages on both a novel computational model for network analysis and on novel technologies for fast and efficient secondary memorybased computation. More precisely we adopt at the same time two strategies for scalable networkbased learning of protein function:
1. Local Implementation of networkbased algorithms. To solve the overall AFP problem we adopt a local learning strategy, according to a “vertexcentric” computational model.
2. Secondary memorybased computation. We exploit novel technologies for fast and efficient secondarymemory access: the overall graph is maintained on disk and only small parts of it are loaded each time into primary memory.
It is worth noting that we do not propose novel algorithms, but simply their “local implementation”, according to a vertexcentric programming model, necessary for secondary memorybased computation [14]. Indeed the strength of the proposed approach consists precisely in coupling a “local” vertexcentric implementation of networkbased algorithms with technologies based on secondary memory, to make efficient the local access to graphs stored on disk, thus also allowing the processing of big biological networks when limited RAM memory is available.
Local implementation of networkbased algorithms
The most effective networkbased algorithms for AFP learn by exploiting the overall topology of the networks [2325], and their implementation usually requires to process in primary memory a large part or the overall underlying graph. The main drawback of this implementation is that big networks cannot be entirely loaded into primary memory using offtheshelf machines.
We aim at providing local implementations of “global” network algorithms by iteratively processing only one vertex and its incident edges at a time. In other words we do not reject to think “globally” by exploiting the overall topology of the network, but at the same time we solve “locally” by designing implementations of these algorithms through a vertexcentric programming model [14,26].
As an example, we consider the local implementation of the “vanilla” random walk (RW) algorithm [8], a popular networkbased method just successfully applied to AFP [24]. It is worth noting that the RW algorithm is “global”, in the sense that it may exploit the global topology of the graph, but it is also intrinsically local, since at each step each vertex can be processed considering only its direct neighbours. From this standpoint its local implementation is straightforward, since it is sufficient to iteratively process each vertex, its edges and its directly connected vertices to obtain a “vertexcentric” implementation of the algorithm. Other algorithms that can process the adjacency matrix of a graph row by row (e.g., label propagation algorithms [9]) can be easily implemented according to a vertexcentric programming model and can benefit from diskbased approaches. More in general the proposed approach can be extended to any other networkbased method for which a local implementation can be provided.
Basic notation
Having a graph G=<V,E>, representing a functional network, where the vertices V correspond to proteins, and edges E to functional relationships between proteins, we indicate proteins with integers, i.e. V={1,2,…,n}, where n=V is the number of vertices/proteins, and edges (i,j)∈E represent functional relationships between vertices i,j∈V. The weights associated with edges (i,j) represent the “strength” of their functional relationships and are elements of the symmetric weight matrix W. C⊂V indicates the proteins belonging to a functional class c (e.g., a specific Gene Ontology (GO) term [27]).
Local implementation of random walks
Random walk (RW) algorithms [8] explore and exploit the topology of the functional network, starting and walking around from a subset C⊂V of nodes belonging to a specific class c by using a transition probability matrix Q=D^{−1}W, where D is a diagonal matrix with diagonal elements . The elements q_{ij} of Q represent the probability of a random step from i to j.
The probability to start the walk can be set to p^{o}=1/C for the nodes i∈C and to p^{o}=0 for the proteins i∈V∖C. If p^{t} represents the probability vector of finding a “random walker” at step t in the nodes i∈V (that is, represents the probability for a random walk of reaching node i at step t), then the probability at step t+1 is:
and the update (1) is iterated until convergence or until a finite number of steps is reached.
From a “vertexcentric” standpoint the update rule (1) of the RW algorithm becomes:
where p_{i} is the probability of the i^{th} node, and Q_{i} represents the i^{th} column of the probability transition matrix Q. By recalling that W represents the original weighted adjacency matrix of the graph and W_{i} its i^{th} column, from (2) we obtain:
Equation (3) is the update rule of the random walk resolved at the i^{th} node of the graph, and can be viewed as a “local” version of (1): by updating all the nodes i of the graph, 1≤i≤n, we update the probability vector p^{t+1} exactly in the same way of (1). To compute (3) we need the following information:
1. (i.e., the inverse of the sum of weights of the edges coming from j)
2. w_{ji},1≤j≤n (i.e., the weights of the inedges of i)
3. (i.e., the probabilities of node j at the previous step).
We can observe the following facts:
a) If the graph is undirected (and this is the case for the AFP problem), the weights of incoming and outcoming edges are the same, that is ∀i,∀jw_{ij}=w_{ji}. This implies that only the list of edge weights outcoming from i: L(i)={w_{ij}w_{ij}>0} should be stored. This in turn implies that in sparse graphs the spatial (and temporal) complexity at each node is sublinear, and (3) can be expressed as:
where N(i)={jj∈V∧(i,j)∈E} are the neighborhood vertices of i.
b) We need to store and , 1≤j≤n, that is the probabilities at the current and previous step. Once a step is completed, the current probabilities () can be used as starting probabilities for the next iteration.
c) We can store , as a value associated to each node j. It could be computed at each node j as a preprocessing step: .
d) The algorithm iterates for a predefined number of steps or until convergence.
e) It is easy to see from (3) that the complexity of each iteration of the algorithm is , but with sparse graphs, i.e. when ∀i,{(j,i)w_{ji}>0}<<n, the complexity is .
Secondary memorybased computation
To be actually applicable to realworld big networks, the local implementations of the algorithm described in Section “Local implementation of networkbased algorithms” require specific technologies for an efficient access to the secondary memory: indeed we need to efficiently load small parts of a graph, update them in primary memory and finally store them back to disk.
To this end we experimented with two different secondary memorybased technologies. The first one is based on graph DB technologies [28], and the second one on efficient technologies for diskbased processing of graphs.
Neo4j: a DB technology for graph querying and processing
Neo4j[17] is a data management system written in Java based on the graph data model. Nodes, relationships and their properties are first class citizen in the model and their storage on disk is optimized by the adoption of specific data structures for graph networks. The Neo4j Kernel is a fast graph engine with the main characteristics expected by a DBMS, like recovery, management of transactions and indexing structures. Neo4j can be used both as an embedded database within a Java application and as a standalone server with an extensive REST interface for easy integration with Web applications. A declarative query language, named cypher, for the specification of SQLstyle queries is provided.
Internally, Neo4j stores graph data spread across a number of files. Each store file contains the data for a specific part of the graph (e.g. nodes, relationships, properties) and their overall organization, which entails the separation of graph structure from property data, allows the efficient traversal of the graph and the generation of query answers. Both nodes, relationships and properties have a fixed size representation (e.g. nodes have a fixed dimension of 9 bytes), and relationships are implemented using doubly linked lists on disk in order to render efficient their traversal. The fixedsize representation of nodes, relationships and properties has the advantage that identifiers should not be stored (corresponds to the file offset) and that their retrieval by means of their identifiers can be done in constant time.
Since this information is stored in secondary memory, its access is made efficient through the use of caching techniques. At file system level, each store file is divided in equally sized regions and these regions are cached. The cache holds a fixed number of regions for each file, and regions are replaced relying on a least frequently used (LFU)like policy. On top of this mechanism, a more specific node/relationship cache has been implemented that is optimized for traversal (for example, relationships of a node are organized relying on their type and their direction).
In Neo4j the functional network G used for AFP has been implemented as follows. Each node representing a protein i is associated with the properties name, d (i.e. ), p1 and p2 (i.e. the probability of the protein at the previous and current step). Moreover, between two proteins i and j a relationship of type SIM is specified with a property w_{ij} containing the strength of their functional relationship. The graph has been enhanced with nodes representing the functional classes (with name and count properties, i.e. the name of the class and the number of proteins belonging to the class) and relationships of type CLASS, that represent the classes to which a protein belongs to. Figure 1 reports a simple example of the graph with 10 bacteria proteins and two GO terms with their relationships. For the sake of simplicity, the values of p1 and p2 are not reported.
Figure 1. A sample Neo4j net. A graphical representation of a sample Neo4j net.
Even if the RW algorithm described in Section “Local implementation of networkbased algorithms” has been implemented in Java with the embedded version of Neo4j, it can be easily expressed through the cypher language. This declarative query language allows the expression of the core definition of the “vanilla” RW with a single simple statement (Figure 2). More precisely, starting from a generic protein i and a function class named c, the cypher implementation identifies the proteins j for which a relationship of type SIM exists with i and such that j is of class c. Then, the probability i.p2 (at time t+1) is obtained by using the value j.d and j.p1 (the probability computed at time t). Finally the statement returns the name of protein i, the name of the class c, and the computed probability i.p2 (Figure 2).
Figure 2. Neo4j Implementation of 1step RW algorithm in cypher. The notation (i)[e:rtype]>(j) is used to represent a relationship e of type rtype between nodes i and j. The dotnotation is used to access a single property of a node/edge.
GraphChi: a diskbased system for graph processing
GraphChi is a diskbased system for the analysis of big graphs on single offtheshelf computers [16]. Differently from Neo4j, GraphChi has not been conceived for querying large graphstructured databases, but for efficiently processing graphs stored in secondary memory. To this end it implements specialized data structures to efficiently break large graphs into small parts that can be quickly loaded into primary memory, and provides efficient disk I/O operations to reduce the number of non sequential accesses to disk. Moreover, it offers an asynchronous model of computation that directly supports the vertexcentric programming model.
GraphChi requires enough primary memory to contain the edges and their associated values of only a relatively small subset of vertices at a time, while the rest of the graph is efficiently stored on disk. More precisely, the vertices of the graph are split in K intervals, and each interval is associated to a shard which stores all the inedges for the vertices in the interval itself (Figure 3a). Note that the inedges are sorted by their source vertex. The dimensions of the intervals are chosen in such a way that the corresponding shards can be entirely loaded into primary memory: hence all the inedges are available for the vertices in the interval. Moreover, the outedges can be efficiently loaded requiring at most K non sequential diskreads, through the mechanism of the Parallel Sliding Windows (PSW): by exploiting the ordering of the edges with respect to the source vertices, when PSW moves from an interval to the next, it “slides” a window over each of the shards (Figure 3b).
Figure 3. Efficient disk access with GraphChi. (a) Shards: Int1,…IntK refer to the K intervals in which the vertices are split, while S1,…SK to the corresponding shards (b) Parallel Sliding Windows.
Schematically, the execution flow of GraphChi can be summarized in an iterative cycle, repeated across each interval of vertices:
1. Read: select an interval and load in primary memory its inedges stored in the associated shard (the “memory shard”). Through at most K non sequential reads load its outedges.
2. Execute: perform a parallel update of vertices and edges of the memory shard through multithread asynchronous computation in primary memory.
3. Write: The updated vertices and edges are written back to disk.
Note that the mechanism of Parallel Sliding Windows requires at most K^{2} non sequential reads/writes on disk for a full visit of the entire graph (K reads/writes for each interval), thus resulting in a very efficient management of primary and secondary memory [16].
The GraphChi implementation of the RW algorithm requires a data structure for representing a vertex containing the same properties specified for the Neo4J implementation (namely, d, p1 and p2 – Section “Neo4j: a DB technology for graph querying and processing”). Moreover, a weight is associated with each edge e (referred to as e.w_{ij}). Figure 4 reports the pseudocode of the 1step RW vertexcentric implementation, including the start and the update functions, that specify the actions to perform on a vertex i during the first and the succeeding update iterations. In the start function each vertex is initialized with the value of d and the initial probability p1. In the update function the probability of the 1step RW algorithm is determined by simply applying eq. 4. By means of the GraphChi execution engine, these functions are automatically applied to all the vertices of the graph, according to a multithread and asynchronous mode of computation. This implementation can be easily extended to an arbitrary number of steps by modifying the update function in order to read previous probabilities from p1 during the odd iterations and from p2 during the even iterations (and writing the current probability in the other variable).
Figure 4. Pseudocode of the GraphChi vertexcentric implementation of the 1step RW algorithm.
The C++ implementation of this algorithm in GraphChi entails to keep in main memory a global vector containing a copy of the data structures for each vertex v∈V. Indeed, during the execution of the update function, it is only possible to access the identifier of the neighbour vertex j contained in the data structure representing the edge e, but not its associated probability j.p1 and j.d values, necessary for the computation of the probability i.p2 (Figure 4). Therefore, the global vector in main memory is used just to access this information. We remark that this solution makes our implementation even faster and feasible in our AFP context, since the number of vertices is by far smaller than the number of edges, and thus there is no risk of running out of memory also with offtheshelf computers, even for very large biological networks involving hundreds of thousands or even millions of proteins.
Analyses
We applied our methods based on the local implementation of networkbased algorithms and secondary memorybased computation to the multispecies protein function prediction in bacteria and eukarya. In the remainder of the section we summarize the experimental setup and the characteristics of the data, and then we compare the empirical computational time required by secondary and primary memorybased implementations of network based algorithms for AFP.
Data description and experimental setup
We applied our methods to two multispecies networks of proteins: the first one (Bacterianet, Section “Bacterianet”) accounts 301 species of bacteria, and the second one (Eukaryanet, Section “Eukaryanet”) includes the proteomes of 13 Eukaryotic species.
Bacterianet
We constructed a multispecies bacteria network (Bacterianet), using the proteins proposed as part of a large scale experiment for a recent international challenge aimed at the evaluation of gene function prediction methods (CAFA2: [29]).
The CAFA2 bacteria proteins belong to 10 species (Table 1) and amount to 15,451. We added to this set other 2,187 bacteria proteins having at least one experimental GO annotation in the Uniprot knowledgebase/Swissprot (release: May 2013), but coming from organisms not considered in the CAFA2 challenge^{a}, for a total of 17,638 bacteria proteins belonging to 301 different species.
Table 1. CAFA2 bacteria species and their proteins available in Swissprot (May 2013)
Figure 5 sketches the main steps for the construction of the net of bacteria proteins. At first, we have collected data from the different databases reported in Table 2 to obtain different profiles for each protein. More precisely, each protein has been associated to a binary feature vector, representing a protein profile, whose elements are 1 when the protein is annotated for a specific feature (e.g. includes a specific domain, or a specific motif), or 0 otherwise (second phase in Figure 5). The protein profiles have then been used to construct a set of similarity networks (one for each data type) with edge scores based on the computation of the classical Jaccard similarity coefficient between each possible pair of protein profiles, thus obtaining 8 protein networks. Then we constructed two additional networks by computing the hierarchical Jaccard similarities between the Molecular Function (MF) and Cellular Component (CC) profiles associated to each protein and populated only with the experimentally supported GO annotations previously extracted from Swissprot (May 2013). The hierarchical Jaccard index is computed in the same way of the classical Jaccard, but the components of the vector (the GO terms) are weighted according to their distance from the leaves: GO terms corresponding to the leaves have weight w=1, those at distance d=1 weight w=1/2, and more in general nodes at distance d have weight . In this way we put more emphasis on the most specific annotations, and two proteins annotated with the same more specific terms receive a similarity score larger than that obtained by two proteins annotated with less specific GO terms.
Figure 5. Construction of bacteria net. Data flows from different sources of information, construction of the datatype specific networks and networks integration.
Table 2. Public databases exploited for the construction of protein profiles
The 10 protein networks constructed according to the previously described steps have been integrated in an unique “consensus” network using the Unweighted Average (UA) network integration scheme [37]: the weight of each edge is computed by simply averaging across the available n networks, and “missing data”, i.e. pairs of vertices i,j not present in a given network, result in a weight w_{ij}=0:
where is the weight of the integrated network and represents the weight associated to the edge (i,j) of the d^{th} network (final phase of Figure 5).
As class labels for the proteins included in our integrated network we used the Gene Ontology Biological process (BP) experimental annotations extracted from Swissprot (May 2013). To ensure the availability of a reasonable amount of vertices from which to start the exploration of the direct and indirect neighborhood in the integrated protein network, we discarded all the GO BP classes with less than 20 annotated proteins, and this led to a final set of 381 GO terms with an amount of positives varying from 21 to 2,000 (Table 3).
Table 3. Summary of the distribution of the number of positives across the 381 GO BP classes involved in the functional labelling of the 17638 proteins comprised in the bacterial multi species protein network
The performance of the considered methods have been quantified both in terms of area under the receiving operating curve (AUC) and precision at different recall levels in a standard 5 folds stratified cross validation scheme. We compared the execution times required for the completion of each ranking task in primary memory (on a machine with 16 Gb of RAM) and in secondary memory (on two machines with 16 and 4 Gb of RAM). The machine with 16 Gb of RAM has been equipped with an i7 Intel core processor with 8 cores at 2.80 GHz, while the 4 Gb machine with an Intel i7 core processor with 4 cores at 1.90 GHz (both have been equipped with offtheshelf ATA hard disk drives). Both the machines run an Ubuntu 12.04 Linux operating system.
Eukaryanet
In order to test the ability of the proposed local methods to scale to large multispecies networks, we constructed a second network (hereafter referred to as Eukaryanet). Instead of considering different types of data, as we did in the construction of Bacterianet, all the proteins interactions composing Eukaryanet were downloaded in precomputed form from the STRING proteinprotein interactions database. STRING [38] is a collection of networks composed by real and predicted proteinprotein interactions (based on genetic data, physical data and literature data) and aims at providing a global view of all the available interaction data, including lowerquality data and/or computational predictions for as many organisms as feasible [39]. Since version 7, STRING adopted a twolayered approach when accommodating fully sequenced genomes: important model organisms and those for which experimental data are available from the “core genomes”, while all the other genomes represent the “periphery” [40]. Starting from the STRING interaction data (version 9.05), we selected all the Eukaryotic species in the core region of STRING having 10,000 or more proteins. Our choice is motivated by the expected high quality of the interactions coming from the core region of STRING. The selected Eukaryotic species are listed in Table 4.
Table 4. Selected species from the core region of the STRING protein networks database
This network includes proteins coming from 2 invertebrates (a lancelet of the genus Branchiostoma and the fruit fly), 3 plants (Arabidopsis thaliana, the rice Oryza sativa and the moss Physcomitrella patens), and 8 vertebrates including a frog, the chicken and 6 mammals. The total number of proteins in Eukaryanet is 202,442. This basic version of Eukaryanet is obtained by extracting from STRING all the interactions occurring between proteins of the selected core species. This led to an initial collection of 25,132,538 interactions. A simple connected components analysis revealed that this network is composed by 666 connected components of which only 13 composed by more than 30 vertices (and corresponding to the biggest connected components of the networks associated to each species). This “big” network is thus a collection of the protein networks of the selected species. In order to find a way to “connect” the core components, we extracted all the clusters of orthologous genes from the STRING website according to the following steps:
• the cluster of orthologs ID obtained by the STRING team using the eggNOG database (we considered only clusters of type NOG: nonsupervised orthologous groups);
• the filtering of each NOG cluster in order to remove the proteins not coming from the selected core species. Note that some NOGs are composed by more than 2,000 proteins, but after our filtering procedure each selected NOG is composed by no more than 10 proteins.
After these steps, we selected all the NOGs in which the number of proteins equals the number of species (i.e. NOG composed by 10 proteins coming from 10 species, NOG composed by 9 proteins coming from 9 species, and so on). We finally constructed an enriched version of the basic Eukaryanet network simply by including in Eukaryanet all the possible edges linking the members of the selected set of NOGs. Following this strategy we obtained a network composed by 25,155,631 edges (network density: 0.000613). In order to verify the impact of the additional 23,093 NOGs based edges on the connectivity of Eukaryanet, we repeated the connected components analysis and we found that this “NOGs augmented” version of Eukaryanet is composed by 552 connected components of which two (190,755 nodes (94.22%) and 10,233 (5.05%)) account for more than 99% of the 202,442 proteins composing the network.
As class labels for the proteins included in Eukaryanet we used the GO annotations available in STRING (version 9.05). The STRING website provides flat text files containing a mapping from GO annotations to STRING proteins and a STRING internal confidence score for each GO annotation, ranging from 1 (low confidence) to 5 (high confidence). While extracting the GO labels we considered only the annotations with confidence score 5. We then filtered out all the GO terms associated with less than 20 and more than 100 proteins (473 GO terms). We finally randomly selected from this set 50 GO terms irrespective of their GO division (Molecular function, Biological process and Cellular component). We then repeated all the test performed on Bacterianet on the bigger Eukaryanet network.
Results and discussion
We compared the runtime required by main memory and secondary memorybased implementations (Section “Secondary memorybased computation”) of the RW algorithm described in Section “Local implementation of networkbased algorithms”. Moreover, even if our main aim consists in showing that the combination of local implementation and secondary memorybased computation allows the analysis of big biological networks on small computers, we performed also a comparison of the performance achieved with singlespecies and multispecies networks of bacteria proteins to experimentally assess the impact of a multispecies approach to the prediction of protein functions.
Results with bacterianet
Table 5 shows the average per term runtime required to complete a 5fold crossvalidation on the Bacterianet (17,638 nodes/proteins and more than 7 millions of edges). We considered 381 GO BP terms characterized by more than 20 annotations and involving 301 species of bacteria. (see Section “Bacterianet” for details). Results on the desktop computer (16 Gb RAM machine) show that the computational time required by the secondary memory based implementations, even if larger, is of the same order of magnitude of the time needed by the mainmemorybased implementation. In particular, quite surprisingly, the empirical time complexity of the GraphChi implementation is very close to that of the the mainmemory version. This fact can be partially explained by the very efficient secondary memory access of GraphChi, but above all by the characteristics of the mainmemory implementation of the RW algorithm. Even if the efficient BLASbased fortran subroutines for linear algebra are used for the classical stochastic matrix/probability vector product (eq. 1), the sparsity of the Bacterianet network is not adequately exploited.
Table 5. Empirical time complexity of the main and secondary memorybased implementations of network based algorithms for multispecies function prediction with the Bacterianet
The results of the mainmemory algorithm with the notebook (4 Gb RAM machine) are not reported since on this task the main memory implementation of the algorithm fails, due to disk trashing, by which processor time is mainly used to continuously swap from main memory and the virtual memory on disk. On the contrary, the GraphChi implementation results only in a small increment of the computational time, mainly due to the larger time required to construct the shards when less RAM memory is available (Section “GraphChi: a diskbased system for graph processing”) and to the lower speed of the processor in the notebook machine.
Note that with the smaller machine the empirical computational time required by Neo4j increases of about one order of magnitude, while the GraphChi implementation introduces only a small increment of the required execution time (Table 5). This is particularly relevant when we consider the overall computational time required to predict the 381 GO terms: with the “small” machine Neo4j moves from about 3 hours to about one day with the 1step RW, and from about 7 hours to almost 3 days with the 3steps RW.
Even if the main aim of this work consists in showing that secondarymemory based technologies allow us to analyse large multispecies networks also with “relatively small” standalone computers, we report also the average AUC, and precision at 20 and 40% recall across the considered 381 GO BP terms. Table 6 shows that RW algorithms achieve reasonable results (AUC is always significantly larger than 0.5). In particular 1step RW obtains the best results in terms of both AUC and P20R and P40R: on the average, the direct neighbours of each node seem to be the most informative.
Table 6. Bacterianet: average AUC, precision at 20% recall (P20R) and precision at 40% recall across 381 GO BP terms estimated through 5fold crossvalidation
Results with Eukaryanet
Table 7 summarizes the average perterm runtime required to complete a 5fold cross validation with the Eukaryanet involving more than 200,000 proteins of 13 multicellular eukarya organisms (Section “Eukaryanet”). The spatial requirements induced by Eukaryanet prevents the application of the main memory implementation also with the 16 Gb RAM machine, while secondary memorybased implementations make this task feasible also with this large protein network.
Table 7. Eukaryanet: Average perterm empirical time complexity betweenNeo4j andGraphChi implementations
It is worth noting that in this task involving a bigger net, the GrapChi implementation is significantly faster than the Neo4j implementation (Table 7). Moreover, the average computational time is in practice the same when the 4 Gb and the 16 Gb RAM machines run the GrapChi implementation of the RW algorithm, while we observe a relevant increment in computational time with Neo4j, as previously observed also with Bacterianet.
The performance in terms of the average precision at fixed recall levels obtained in this test are relatively low, especially when compared with the high average AUC obtained with the RW at 1, 2 and 3 steps (Table 8). The observed relatively low precision can be explained by taking into account that it is more negatively affected by class unbalance and, in the Eukaryanet network task, the positives are at most 100 while the number of vertices in the network is 202,442 (i.e. the positives are less than 0.05% of the vertices at best). Note that in this case the 2steps RW achieves the best AUC results: it is likely that these results could be due by the eggNOG orthology relationships added between the singlespecies disconnected components in Eukaryanet (Section “Eukaryanet”). Indeed in this way the annotations for a certain species can be propagated to other philogenetically related species by exploiting the orthology relationships.
Table 8. Eukaryanet: average AUC, precision at 20% recall (P20R) and precision at 40% recall across 50 GO terms estimated through 5fold crossvalidation
Experimental comparison between multispecies and singlespecies approaches
In this section we provide an experimental comparison between multispecies and singlespecies approaches to AFP. We repeated the same AFP task performed with Bacterianet but considering this time each species separately. More precisely, we constructed a separate net for each species of Bacteria, using exactly the same data we used for the multispecies net (Section “Bacterianet”), and then we predicted the probabilities for each of the 381 GO terms considered in the multispecies task (Section “Results with bacterianet”). Average perspecies results show that the multispecies approach, by exploiting the multispecies network of proteins Bacterianet, achieves better results in terms of both AUC, and precision at a fixed recall rate (Table 9), and the difference is statistically significant independently of the number of steps and the performance measure considered (Wilcoxon signed rank test, α=0.01).
Table 9. Comparison of the average AUC, precision at 20% recall (P20R) and precision at 40% recall between multispecies and singlespecies approaches with 301 species of bacteria
These results can be explained, considering two characteristics of multispecies networks: 1) the number of nodes and the number of available annotated proteins; 2) the overall topology of the network.
Indeed in singlespecies nets either the reduced number of available proteins or the reduced number of annotated nodes can negatively affect the generalization capabilities achieved with random walks or any other learning algorithm, while in multispecies networks, by construction, more nodes and more annotated proteins from other species can be available.
Moreover in singlespecies networks usually the number of available functional connections (edges) between proteins can be reduced (for instance, since no sufficient data are available) and in many cases we may have highly disconnected networks, making very difficult the application of algorithms based on the propagation of the information between nodes. On the contrary, in the multispecies setting learning algorithms can enjoy a richer network topology by exploring connections not available in singlespecies nets: the evolutionary relationships between species assure that proteins not connected with other proteins of the same species, can in principle be connected with other homologous proteins in other species, thus enhancing the propagation of the information across the multispecies network.
Summarizing, our results show the feasibility of the “vertexcentric” algorithmic approach coupled with secondary memorybased technologies to process large multispecies protein networks with single offtheshelf computers. Moreover, our preliminary experiments show that in perspective we can also improve performances by constructing large multispecies networks, and by integrating heterogeneous sources of biomolecular and evolutionary information.
Conclusions
Our approach based on local implementations of networkbased algorithms and on novel secondary memorybased technologies provides a solution to the large main memory requirements induced by large multispecies protein networks, thus making possible the analysis of big networks using offtheshelf machines. Our results show that both graph DB technologies (i.e. Neo4j) and secondary memory based systems for graph processing (i.e. GraphChi) can be successfully applied to the analysis of large multispecies networks, even if the latter seems to be less sensitive to the amount of available primary memory, and more efficient for the implementation of networkbased algorithms for AFP. The local implementation strategy can be applied to other networkbased learning algorithms, ranging e.g. from simple guiltbyassociation methods (that are inherently local) [41,42] to more complex label propagation methods [9,10], kernelized graph algorithms [25,43,44] and the recently proposed parametrized Hopfield networks [45], but in principle any algorithm, that can be expressed according to a “vertexcentric” programming model, can be adapted to this framework.
In perspective, by exploiting orthologous genes and multiple genomic sources, multispecies prediction can be applied to annotate poorly annotated species and discover new functions for uncharacterized genes in model organisms. Indeed our proposed approach allows computational biologists to experiment with large multispecies networks using their own notebooks, but in perspective applications to huge networks including e.g. the proteomes available in SwissProt/TrEmbl could be performed using wellequipped standalone machines.
Our framework could be also adapted and devised to other relevant computational biology scenarios characterized by the construction and processing of large networks, such as in the context of the “Network medicine” [46], or in drug discovery and repositioning problems [47].
Availability of supporting data
The files containing the Bacterianet and Eukaryanet along with the files containing the labels used in our experiments are available from GigaDB [48]http://dx.doi.org/10.5524/100090 webcite. The content and format of each file is described in readme files available at the aforementioned database.
Endnote
^{a} For experimental annotation we considered all the available associations having GO evidence codes not included in the following list: IEA, ND, IC, NAS, TAS, ISS, ISO, ISA, ISM, IGC, IBA, IBD, IKR, IRD and RCA. A complete list of the GO evidence codes and their meanings is available at http://www.geneontology.org/GO.evidence.shtml webcite.
Abbreviations
AFP: Automated function prediction; BP: Biological process; CC: Cellular component; GO: Gene ontology; MF: Molecular function; RW: Random walk.
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
MM analyzed the technological items behind secondary memory based computation and implemented the RW algorithm using Neo4j; MR curated the biological items of multispecies AFP and implemented the same algorithm with GraphChi; GV wrote the draft (with the contribution of the other authors) and implemented the mainmemory version of the algorithms. All authors read and approved the final manuscript.
Acknowledgements
The authors would like to thank the reviewers for their useful comments and suggestions, and acknowledge partial support from the PRIN project “Automi e linguaggi formali: aspetti matematici e applicativi”, funded by the italian Ministry of University.
References

Friedberg I: Automated protein function predictionthe genomic challenge.
Brief Bioinform 2006, 7:225242. PubMed Abstract  Publisher Full Text

Gillis J, Pavlidis P: Characterizing the state of the art in the computational assignment of gene function: lessons from the first critical assessment of functional annotation (CAFA).
BMC Bioinformatics 2013, 14(Suppl 3):S15. BioMed Central Full Text

Radivojac P, Clark WT, Oron TR, Schnoes AM, Wittkop T, Sokolov A, Graim K, Funk C, Verspoor K, BenHur A, Pandey G, Yunes JM, Talwalkar AS, Repo S, Souza ML, Piovesan D, Casadio R, Wang Z, Cheng J, Fang H, Gough J, Koskinen P, Törönen P, NoksoKoivisto J, Holm L, Cozzetto D, Buchan DWA, Bryson K, Jones DT, Limaye B, et al.: A largescale evaluation of computational protein function prediction.
Nat Methods 2013, 10(3):221227. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Wong AK, Park CY, Greene CS, Bongo LA, Guan Y, Troyanskaya OG: IMP: a multispecies functional genomics portal for integration, visualization and prediction of protein functions and networks.
Nucleic Acids Res 2012, 40(W1):W484—W490. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Kuzniar A, van Ham RC, Pongor S, Leunissen JA: The quest for orthologs: finding the corresponding gene across genomes.
Trends Genet 2008, 24(11):539551. PubMed Abstract  Publisher Full Text

Koonin EV: Orthologs, paralogs, and evolutionary genomics 1.
Annu Rev Genet 2005, 39:309338. PubMed Abstract  Publisher Full Text

Hamp T, Kassner R, Seemayer S, Vicedo E, Schaefer C, Achten D, Auer F, Boehm A, Braun T, Hecht M, Heron M, Hönigschmid P, Hopf TA, Kaufmann S, Kiening M, Krompass D, Landerer C, Mahlich Y, Roos M, Rost B: Homologybased inference sets the bar high for protein function prediction.
BMC Bioinformatics 2013, 14(Suppl 3):S7. BioMed Central Full Text

Zhou D, Bousquet O, Lal NT, Weston J, Schölkopf B: Learning with local and global consistency. In Advances in Neural Information Processing Systems 16. Cambridge: MIT Press; 2004:321328.

Bengio Y, Delalleau O, Le Roux N: Label propagation and quadratic Criterion. In SemiSupervised Learning. Edited by Zien A, Schölkopf B, Chapelle O. Cambridge: MIT Press; 2006:193216.

Liu W, Wang J, Chang SF: Robust and scalable graphbased Semisupervised learning.

Foster J: Designing and Building Parallel Programs. Boston: Addison Wesley; 1995.

Gonzalez JE, Low Y, Gu H, Bickson D, Guestrin C: PowerGraph: Distributed graphparallel computation on natural graphs. In OSDI’12 Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation. Hollywood, CA: USENIX Association Berkeley; 2012:1730.

Low Y, Gonzalez J, Kyrola A, Bickson D, Guestrin C, Hellerstein JM: GraphLab: a new parallel framework for machine learning. In Conference on Uncertainty in Artificial Intelligence (UAI). Catalina Island: AUAI Press; 2010.

Malewicz G, Austern MH, Bik AJC, Dehnert JC, Horn I, Leiser N, Czajkowski G: Pregel: a system for largescale graph processing. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2010. Indianapolis, Indiana, USA, New York: ACM Press; 2010:135146.

Kyrola A, Blelloch G, Guestrin C: GraphChi: largescale graph computation on just a PC. In Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation. CA, USA: Hollywood, CA, USA, OSDI’12: USENIX Association Berkeley; 2012:3146.

Webber J: A programmatic introduction to Neo4j. In Proceedings of the 3rd Annual Conference on Systems, Programming, and Applications: Software for Humanity. Tucson: ACM; 2012:217218.

Han WS, Lee S, Park K, Lee JH, Kim MS, Kim J, Yu H: TurboGraph: a fast parallel graph engine handling billionscale graphs in a single PC. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM; 2013:7785.

Karedla R, Love J, Wherry B: Caching strategies to improve disk system performance.

Boldi P, Vigna S: The WebGraph framework I: compression techniques. In In Proc. of the Thirteenth International World Wide Web Conference. New York: ACM Press; 2003:595601.

Have C: Jensen L: Are graph databases ready for bioinformatics?
Bioinformatics 2013, 29(24):3107. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Mostafavi S, Ray D, WardeFarley D, Grouios C, Morris Q: GeneMANIA: a realtime multiple association network integration algorithm for predicting gene function.

Kohler S, Bauer S, Horn D, Robinson P: Walking the Interactome for prioritization of candiyear disease genes.

Re M, Mesiti M, Valentini G: A fast ranking algorithm for predicting gene functions in biomolecular networks.

Malewicz G, Austern M, Bik AJ, Dehnert J, Horn I, Leiser N, Czajkowski G: Pregel: a system for largescale graph processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, SIGMOD ’10. Indianapolis, Indiana, USA. New York: ACM Press; 2010:135146.

Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, IsselTarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin JM, Sherlock G: Gene Ontology: tool for the unification of biology.
Nat Genet 2000, 25:25. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Friedberg I, Linial M, Mooney S, Radivojac P: Critical assessment of function annotation experiment.
2013.

Finn RD, Mistry J, SchusterBöckler B, GriffithsJones S, Hollich V, Lassmann T, Moxon S, Marshall M, Khanna A, Durbin R, Eddy SR, Sonnhammer ELL, Bateman A: Pfam: clans, web tools and services.
Nucleic Acids Res 2006, 34(suppl 1):D247—D251. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Gough J, Karplus K, Hughey R, Chothia C: Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure.
J Mol Biol 2001, 313(4):903919. PubMed Abstract  Publisher Full Text

Attwood TK, Bradley P, Flower DR, Gaulton A, Maudling N, Mitchell AL, Moulton G, Nordle A, Paine K, Taylor P, Uddin A, Zygouri C: PRINTS and its automatic supplement, prePRINTS.
Nucleic Acids Res 2003, 31:400402. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Hulo N, Bairoch A, Bulliard V, Cerutti L, De Castro E, LangendijkGenevaux PS, Pagni M, Sigrist CJ: The PROSITE database.
Nucleic Acids Res 2006, 34(suppl 1):D227—D230. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Mulder NJ, Apweiler R, Attwood TK, Bairoch A, Bateman A, Binns D, Bork P, Buillard V, Cerutti L, Copley R, Courcelle E, Das U, Daugherty L, Dibley M, Finn R, Fleischmann W, Gough J, Haft D, Hulo N, Hunter S, Kahn D, Kanapin A, Kejariwal A, Labarga A, LangendijkGenevaux PS, Lonsdale D, Lopez R, Letunic I, Madera M, Maslen J, et al.: New developments in the InterPro database.
Nucleic Acids Res 2007, 35(suppl 1):D224—D228. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Muller J, Szklarczyk D, Julien P, Letunic I, Roth A, Kuhn M, Powell S, Von Mering C, Doerks T, Jensen LJ, Bork P: eggNOG v2. 0: extending the evolutionary genealogy of genes with enhanced nonsupervised orthologous groups, species and functional annotations.
Nucleic Acids Res 2010, 38(suppl 1):D190—D195. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Letunic I, Copley RR, Pils B, Pinkert S, Schultz J, Bork P: SMART 5: domains in the context of genomes and networks.
Nucleic Acids Res 2006, 34(suppl 1):D257—D260. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Re M, Valentini G: Networkbased drug ranking and repositioning with respect to DrugBank therapeutic categories.

STRING database http://stringdb.org webcite

Franceschini A, Szklarczyk D, Frankild S, Kuhn M, Simonovic M, Roth A, Lin J, Minguez P, Bork P, Von Mering C, Jensen LJ: STRING v9. 1: proteinprotein interaction networks, with increased coverage and integration.
Nucleic Acids Res 2013, 41(D1):D808—D815. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Von Mering C, Jensen LJ, Kuhn M, Chaffron S, Doerks T, Krüger B, Snel B, Bork P: STRING 7 recent developments in the integration and prediction of protein interactions.
Nucleic Acids Res 2007, 35(suppl 1):D358—D362. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Marcotte E, Pellegrini M, Thompson M, Yeates T, Eisenberg D: A combined algorithm for genomewide prediction of protein function.
Nature 1999, 402:8386. PubMed Abstract  Publisher Full Text

Bumgarner McDermottRJand, Samudrala R: Functional annotation from predicted protein interaction networks.
Bioinformatics 2005, 21(15):32173226. PubMed Abstract  Publisher Full Text

Lippert G, Ghahramani Z, Borgwardt K: Gene function prediction form synthetic leathality networks via ranking on demand.
Bioinformatics 2010, 26(7):912918. PubMed Abstract  Publisher Full Text

Re M, Valentini G: Cancer module genes ranking using kernelized score functions.

Frasca M, Bertoni A, Re M, Valentini G: A neural network algorithm for semisupervised node label learning from unbalanced data.
Neural Netw 2013, 43:8498. PubMed Abstract  Publisher Full Text

Barabasi A, Gulbahce N, Loscalzo J: Network medicine: a networkbased approach to human disease.
Nat Rev Genet 2011, 12:5668. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Dudley J, Desphonde T, Butte A: Exploiting Drugdisease relationships for computational drug repositioning.
Brief Bioinform 2011, 12(4):303311. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Mesiti M, Re M, Valentini G: Supporting materials from ‘Think globally and solve locally: secondary memorybased network learning for automated multispecies function prediction’ GigaScience Database.
2014.