Think globally and solve locally: secondary memory-based network learning for automated multi-species function prediction
AnacletoLab - Department of Computer Science, University of Milano, Via Comelico 39/41, 20135 Milano, Italy
GigaScience 2014, 3:5. doi:10.1186/2047-217X-3-5. Published: 23 April 2014
Network-based learning algorithms for automated function prediction (AFP) are negatively affected by the limited coverage of experimental data and by the limited number of a priori known functional annotations. As a consequence, their application to model organisms is often restricted to well-characterized biological processes and pathways, and their effectiveness with poorly annotated species is relatively limited. A possible solution is to construct large networks that include multiple species, but this in turn poses challenging computational problems, owing to the scalability limitations of existing algorithms and the main memory requirements of such networks. Distributed computation or the use of large computers could in principle address these issues, but they raise further algorithmic problems and require resources that simple off-the-shelf computers cannot provide.
We propose a novel framework for scalable network-based learning of multi-species protein functions, based on both a local implementation of existing algorithms and the adoption of innovative technologies: we solve the AFP problem “locally”, by designing “vertex-centric” implementations of network-based algorithms, but we do not give up thinking “globally”, since we exploit the overall topology of the network. This is made possible by secondary memory-based technologies that make efficient use of the large memory available on disks, thus overcoming the main memory limitations of modern off-the-shelf computers. This approach has been applied to a large multi-species network including more than 300 species of bacteria and to a network with more than 200,000 proteins belonging to 13 eukaryotic species. To our knowledge, this is the first work in which secondary memory-based network analysis has been applied to multi-species function prediction using biological networks with hundreds of thousands of proteins.
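To make the “vertex-centric” and “secondary memory-based” ideas concrete, the following is a minimal sketch and not the implementation used in this work. It assumes a simple weighted neighbour-averaging rule as a stand-in for the network-based algorithm, and it keeps the per-protein scores in a memory-mapped file on disk rather than in an in-memory list. All names (`DiskScores`, `vertex_update`, `propagate`) are hypothetical and chosen only for illustration.

```python
import mmap
import os
import struct
import tempfile


class DiskScores:
    """Fixed-size array of float64 scores backed by a file on disk (assumed layout)."""

    def __init__(self, path, n):
        self._f = open(path, "w+b")
        self._f.write(b"\0" * (8 * n))  # zero-filled: every score starts at 0.0
        self._f.flush()
        self._mm = mmap.mmap(self._f.fileno(), 8 * n)

    def __getitem__(self, i):
        return struct.unpack_from("<d", self._mm, 8 * i)[0]

    def __setitem__(self, i, value):
        struct.pack_into("<d", self._mm, 8 * i, value)

    def close(self):
        self._mm.close()
        self._f.close()


def vertex_update(v, neighbours, scores, positives):
    """Local step: vertex v reads only the scores of its direct neighbours."""
    if v in positives:
        return 1.0  # proteins already annotated with the function keep score 1
    if not neighbours:
        return 0.0  # isolated vertex: no evidence to propagate
    total_weight = sum(w for _, w in neighbours)
    return sum(w * scores[u] for u, w in neighbours) / total_weight


def propagate(adjacency, positives, path, sweeps=10):
    """Repeat the local update over all vertices; global topology acts through the sweeps."""
    n = len(adjacency)
    scores = DiskScores(path, n)
    for v in positives:
        scores[v] = 1.0
    for _ in range(sweeps):
        for v in range(n):
            # In-place (asynchronous) update: later vertices see fresh values.
            scores[v] = vertex_update(v, adjacency[v], scores, positives)
    result = [scores[v] for v in range(n)]
    scores.close()
    return result


if __name__ == "__main__":
    # Toy network: chain 0 - 1 - 2 plus an isolated vertex 3; vertex 0 is annotated.
    toy = [[(1, 1.0)], [(0, 1.0), (2, 1.0)], [(1, 1.0)], []]
    with tempfile.TemporaryDirectory() as tmp:
        print(propagate(toy, {0}, os.path.join(tmp, "scores.bin"), sweeps=1))
```

Each call to `vertex_update` touches only one vertex and its neighbours, so the working set in main memory stays small, while the repeated sweeps let information from distant parts of the network reach every vertex. In this sketch only the score array lives on disk; a full secondary memory-based system would store the adjacency structure on disk as well.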
The combination of these algorithmic and technological approaches makes the analysis of large multi-species networks feasible on ordinary computers with limited speed and primary memory, and in the future could enable the analysis of huge networks (e.g. the whole proteomes available in SwissProt) on well-equipped stand-alone machines.