Contents
Cover
Related Titles
Title Page
Copyright
Preface
References
List of Contributors
Part One: Modeling, Simulation, and Meaning of Gene Networks
Chapter 1: Network Analysis to Interpret Complex Phenotypes
1.1 Introduction
1.2 Identification of Important Genes based on Network Topologies
1.3 Inferring Information from Known Networks
1.4 Conclusions
References
Chapter 2: Stochastic Modeling of Gene Regulatory Networks
2.1 Introduction
2.2 Discrete Stochastic Simulation Methods
2.3 Discrete Stochastic Modeling
2.4 Continuous Stochastic Modeling
2.5 Stochastic Models for Both Internal and External Noise
2.6 Conclusions
References
Chapter 3: Modeling Expression Quantitative Trait Loci in Multiple Populations
3.1 Introduction
3.2 IGM Method
3.3 CTWM
3.4 CTWM-GS Method
3.5 Discussion
References
Part Two: Inference of Gene Networks
Chapter 4: Transcriptional Network Inference Based on Information Theory
4.1 Introduction
4.2 Inference Based on Conditional Mutual Information
4.3 Inference Based on Pairwise Mutual Information
4.4 Arc Orientation
4.5 Conclusions
References
Chapter 5: Elucidation of General and Condition-Dependent Gene Pathways Using Mixture Models and Bayesian Networks
5.1 Introduction
5.2 Methodology
5.3 Applications
5.4 Conclusions
References
Chapter 6: Multiscale Network Reconstruction from Gene Expression Measurements: Correlations, Perturbations, and “A Priori Biological Knowledge”
6.1 Introduction
6.2 “Perturbation Method”
6.3 Network Reconstruction by the Correlation Method from Time-Series Gene Expression Data
6.4 Network Reconstruction from Gene Expression Data by A Priori Biological Knowledge
6.5 Examples and Methods of Correlation Network Analysis on Time-Series Data
6.6 Examples and Methods for Pathway Network Analysis
6.7 Discussion
6.8 Conclusions
References
Chapter 7: Gene Regulatory Networks Inference: Combining a Genetic Programming and H∞ Filtering Approach
7.1 Introduction
7.2 Background
7.3 Methodology for Identification and Algorithm Description
7.4 Simulation Evaluation
7.5 Conclusions
Appendix 7.A: Comparison between the Kalman Filter and H∞ Filter
Note
References
Chapter 8: Computational Reconstruction of Protein Interaction Networks
8.1 Introduction
8.2 Protein Interaction Networks
8.3 Characterization of Computed Networks
8.4 Conclusions
References
Part Three: Analysis of Gene Networks
Chapter 9: What if the Fit is Unfit? Criteria for Biological Systems Estimation Beyond Residual Errors
9.1 Introduction
9.2 Model Design
9.3 Concepts and Challenges of Parameter Estimation
9.4 Conclusions
Acknowledgments
References
Chapter 10: Machine Learning Methods for Identifying Essential Genes and Proteins in Networks
10.1 Introduction
10.2 Definitions and Constructions of the Network
10.3 Network Descriptors
10.4 Machine Learning
10.5 Some Examples of Applications
10.6 Conclusions
References
Chapter 11: Gene Coexpression Networks for the Analysis of DNA Microarray Data
11.1 Introduction
11.2 Background
11.3 Construction of GCNs
11.4 Integration of GCNs with Other Data
11.5 Analysis of GCNs
11.6 GCNs for the Study of Cancer
11.7 Conclusions
Acknowledgments
References
Chapter 12: Correlation Network Analysis and Knowledge Integration
12.1 Introduction
12.2 Systems Biology Data Quandaries
12.3 Semantic Web Approaches
12.4 Correlation Network Analysis
12.5 Knowledge Annotation for Networks
12.6 Future Developments
References
Chapter 13: Network Screening: A New Method to Identify Active Networks from an Ensemble of Known Networks
13.1 Introduction
13.2 Methods
13.3 Example Applications
13.4 Discussion
References
Chapter 14: Community Detection in Biological Networks
14.1 Introduction
14.2 Centrality Measures
14.3 Study of Complex Systems
14.4 Overview
14.5 Proposed Algorithm
14.6 Experiments
14.7 Further Improvements
14.8 Conclusions
Acknowledgments
References
Chapter 15: On Some Inverse Problems in Generating Probabilistic Boolean Networks
15.1 Introduction
15.2 Reviews on BNs and PBNs
15.3 Construction of PBNs from a Prescribed Stationary Distribution
15.4 Construction of PBNs from a Prescribed Transition Probability Matrix
15.5 Conclusions
Acknowledgments
References
Chapter 16: Boolean Analysis of Gene Expression Datasets
16.1 Introduction
16.2 Boolean Analysis
16.3 Main Organization
16.4 StepMiner
16.5 StepMiner Algorithm
16.6 BooleanNet
16.7 BooleanNet Algorithm
16.8 Conclusions
Acknowledgements
References
Part Four: Systems Approach to Diseases
Chapter 17: Representing Cancer Cell Trajectories in a Phase-Space Diagram: Switching Cellular States by Biological Phase Transitions
17.1 Introduction
17.2 Beyond Reductionism
17.3 Cell Shape as a Diagram of Forces
17.4 Morphologic Phenotypes and Phase Transitions
17.5 Cancer as an Anomalous Attractor
17.6 Shapes as System Descriptors
17.7 Fractals of Living Organisms
17.8 Fractals and Cancer
17.9 Modifications in Cell Shape Precede Tumor Metabolome Reversion
17.10 Conclusions
References
Chapter 18: Protein Network Analysis for Disease Gene Identification and Prioritization
18.1 Introduction
18.2 Protein Networks and Human Disease
18.3 ToppGene Suite of Applications
18.4 Conclusions
References
Chapter 19: Pathways and Networks as Functional Descriptors for Human Disease and Drug Response Endpoints
19.1 Introduction
19.2 Gene Content Classifiers and Functional Classifiers
19.3 Biological Pathways and Networks Have Different Properties as Functional Descriptors
19.4 Applications of Pathways as Functional Classifiers
19.5 Single Pathway Learning for Identifying Functional Descriptor Pathways
19.6 Multiple-Path Learning (MPL) Algorithm for Pathway Descriptors
19.7 Applications of MPL-Deduced Pathway Descriptors
19.8 Combining Advantages of Pathways and Networks
19.9 Key Upstream and Downstream Interactions of Genetically Altered Genes and “Universal Cancer Genes”
19.10 Conclusions
References
Index
Preface
For the field of systems biology to mature, novel statistical and computational analysis methods are needed to deal with the growing amount of high-throughput data from genomics and genetics experiments. This book presents such methods and applications to data from biological and biomedical problems. Nowadays, it is widely recognized that networks form a very fruitful representation for studying problems in systems biology [1, 2]. However, many traditional methods do not make explicit use of a network representation of the data. For this reason, the topics treated in this book explore statistical and computational data analysis aspects of networks in systems biology [3–6].
Biological phenotypes are mediated by very intricate networks of interactions among biological components. This book covers extensively what we view as two complementary but strongly interrelated challenges in network biology. The first lies in inferring networks from experimental observations of state variables of a system. Interactions among molecular components are traditionally characterized through equilibrium binding or kinetic experiments in vitro with dilute solutions of the purified components. However, such experiments are typically low throughput and unable to properly account for the conditions prevailing in vivo, where factors such as molecular crowding, spatial heterogeneity, and the presence of ligands might strongly modify the interactions of interest. The possibility of inferring network connectivity and even quantitative interaction parameters from observations of intact living systems is attracting considerable research interest as a way of escaping such shortcomings. The fact that biological networks are complex, that problems are often poorly constrained, and that data are often high dimensional and noisy makes this challenge daunting. The second and perhaps equally difficult challenge lies in deriving results that are both biologically relevant and reliable from incomplete and uncertain information about biological interaction networks. We hope that the contributions in the subsequent chapters will help the reader understand and meet these challenges.
This book is intended for researches and graduate and advanced undergraduate students in the interdisciplinary fields of computational biology, biostatistics, bioinformatics, and systems biology studying problems in biological and biomedical sciences. The book is organized in four main parts: Part One: Modeling, Simulation, and Meaning of Gene Networks; Part Two: Inference of Gene Networks; Part 3: Analysis of Gene Networks; and Part Four: Systems Approach to Diseases. Each part consists of chapters that emphasize the topic of the corresponding part, however, without being disconnected from the remainder of the book. Overall, to order the different parts we assumed an intuitive – problem-oriented – perspective moving from Modeling, Simulation, and Meaning of Gene Networks to Inference of Gene Networks and Analysis of Gene Networks. The last part presents biomedical applications of various methods in Systems Approach to Diseases.
Each chapter is comprehensively presented, accessible not only to researchers from this field but also to advanced undergraduate or graduate students. For this reason, each chapter not only presents technical results but also provides background knowledge necessary to understand the statistical method or the biological problem under consideration. This allows to use this book as a textbook for an interdisciplinary seminar for advanced students not only because of the comprehensiveness of the chapters but also because of its size allowing to fill a complete semester.
Many colleagues, whether consciously or unconsciously, have provided us with input, help, and support before and during the preparation of this book. In particular, we would like to thank Andreas Albrecht, Gökmen Altay, Subhash Basak, Danail Bonchev, Maria Duca, Dean Fennell, Galina Glazko, Martin Grabner, Beryl Graham, Peter Hamilton, Des Higgins, Puthen Jithesh, Patrick Johnston, Frank Kee, Terry Lappin, Kang Li, D. D. Lozovanu, Dennis McCance, James McCann, Alexander Mehler, Abbe Mowshowitz, Ken Mills, Arcady Mushegian, Katie Orr, Andrei Perjan, Bert Rima, Brigitte Senn-Kircher, Ricardo de Matos Simoes, Francesca Shearer, Fred Sobik, John Storey, Simon Tavaré, Shailesh Tripathi, Kurt Varmuza, Bruce Weir, Pat White, Kathleen Williamson, Shu-Dong Zhang, and Dongxiao Zhu and apologize to all who have not been named mistakenly. We would also like to thank our editors Andreas Sendtko and Gregor Cicchetti from Wiley-VCH who have been always available and helpful.
Finally, we hope that this book will help to spread out the enthusiasm and joy we have for this field and inspire people regarding their own practical or theoretical research problems.
March 2011
Belfast, Hall/Tyrol, and Coimbra
Matthias Dehmer,
Frank Emmert-Streib,
Armin Graber,
and Armindo Salvador
References
1. Barabasi, A.L. and Oltvai, Z.N. (2004) Network biology: understanding the cell's functional organization. Nat. Rev. Genet., 5, 101–113.
2. Emmert-Streib, F. and Glazko, G. (2011)Network biology: a direct approach to study biological function. WIREs Syst. Biol. Med., in press.
3. Alon, U. (2006) An Introduction to Systems Biology: Design Principles of Biological Circuits, Chapman & Hall/CRC.
4. Bertalanffy, L. von. (1950) An outline of general systems theory. Br. J. Philos. Sci., 1 (2)
5. Kitano, H. (ed.) (2001) Foundations of Systems Biology, MIT Press.
6. Palsson, B.O. (2006) Systems Biology: Properties of Reconstructed Networks, Cambridge University Press.
List of Contributors
Andreas Bernthaler
Vienna University of Technology
Institute of Computer Languages
Theory and Logics Group
Favoritenstrasse 9
1040 Vienna
Austria
Marina Bessarabova
Thomson Reuters
Healthcare & Life Sciences
169 Saxony Road
Encinitas, CA 92024
USA
Mariano Bizzarri
Sapienza University
Department of Experimental Medicine
Viale Regina Elena 324
00161 Rome
Italy
Gianluca Bontempi
Université Libre de Bruxelles
Computer Science Department
Machine Learning Group
Boulevard du Triomphe
1050 Brussels
Belgium
Gastone Castellani
Università di Bologna
Department of Physics
INFN Bologna Section and
Galvani Center for Biocomplexity
40127 Bologna
Italy
Jing Chen
University of Cincinnati
Department of Environmental Health
Cincinnati, OH 45229
USA
Xi Chen
The University of Hong Kong
Department of Mathematics
Pok Fu Lam Road
Hong Kong
China
Wai-Ki Ching
The University of Hong Kong
Department of Mathematics
Pok Fu Lam Road
Hong Kong
China
Zoltan Dezso
Thomson Reuters
Healthcare & Life Sciences
169 Saxony Road
Encinitas, CA 92024
USA
Cathy S. J. Fann
Academia Sinica
Institute of Biomedical Sciences
Academia Road, Nankang
115 Taipei
Taiwan
Raul Fechete
Emergentec Biodevelopment GmbH
Gersthofer Strasse 29-31
1180 Vienna
Austria
Rudolf Freund
Vienna University of Technology
Institute of Computer Languages
Theory and Logics Group
Favoritenstrasse 9
1040 Vienna
Austria
Alessandro Giuliani
Istituto Superiore di Sanità
Department of Environment
and Health
Viale Regina Elena 299
00161 Rome
Italy
Erich Gombocz
IO Informatics Inc.
2550 Ninth Street
Berkeley, CA 94710-2549
USA
Jing-Dong J. Han
Chinese Academy of Sciences
Institute of Genetics and
Developmental Biology
Center for Molecular Systems Biology
Key Laboratory of
Molecular Developmental Biology
Lincui East Road
100101 Beijing
China
and
Chinese Academy of Sciences–
Max Planck Partner Institute for
Computational Biology
Shanghai Institutes for
Biological Sciences
Chinese Academy of Sciences
320 Yue Yang Road
200031 Shanghai
China
Katsuhisa Horimoto
National Institute of Advanced
Industrial Science Technology
Computational Biology Research Center
2-4-7, Aomi, Koto-ku
135-0064 Tokyo
Japan
Ching-Lin Hsiao
Academia Sinica
Institute of Biomedical Sciences
Academia Road, Nankang
115 Taipei
Taiwan
Jialiang Huang
Chinese Academy of Sciences
Institute of Genetics and
Developmental Biology
Center for Molecular Systems Biology
Key Laboratory of
Molecular Developmental Biology
Lincui East Road
100101 Beijing
China
Anil G. Jegga
Cincinnati Children's Hospital
Medical Center
Division of Biomedical Informatics
Cincinnati, OH 45229
USA
and
University of Cincinnati
Department of Biomedical Engineering
Cincinnati, OH 45229
USA
and
University of Cincinnati
College of Medicine
Department of Pediatrics
Cincinnati, OH 45229
USA
Eugene Kirillov
Thomson Reuters
Healthcare & Life Sciences
169 Saxony Road
Encinitas, CA 92024
USA
Younhee Ko
University of Illinois at
Urbana-Champaign
Department of Animal Sciences
1207 W. Gregory Dr.
Urbana, IL 61801
USA
and
University of Illinois at
Urbana-Champaign
Institute for Genomic Biology
1205 W. Gregory Drive
Urbana, IL 61801
USA
Rainer König
University of Heidelberg
Institute of Pharmacy and Molecular
Biotechnology
Bioquant
Im Neuenheimer Feld 267
69120 Heidelberg
Germany
Xiangfang Li
Texas A&M University
Genomic Signal Processing Laboratory
TAMU 3128
College Station, TX 77843
USA
Arno Lukas
Emergentec Biodevelopment GmbH
Gersthofer Strasse 29-31
1180 Vienna
Austria
Bernd Mayer
Emergentec Biodevelopment GmbH
Gersthofer Strasse 29-31
1180 Vienna
Austria
Patrick E. Meyer
Université Libre de Bruxelles
Computer Science Department
Machine Learning Group
Boulevard du Triomphe
1050 Brussels
Belgium
Konrad Mönks
Vienna University of Technology
Institute of Computer Languages
Theory and Logics Group
Favoritenstrasse 9
1040 Vienna
Austria
and
Emergentec Biodevelopment GmbH
Gersthofer Strasse 29-31
1180 Vienna
Austria
Irmgard Mühlberger
Emergentec Biodevelopment GmbH
Gersthofer Strasse 29-31
1180 Vienna
Austria
Tatiana Nikolskaya
Thomson Reuters
Healthcare & Life Sciences
169 Saxony RD
Encinitas, CA 92024
USA
Yuri Nikolsky
Thomson Reuters
Healthcare & Life Sciences
169 Saxony Road
Encinitas, CA 92024
USA
Catharina Olsen
Université Libre de Bruxelles
Computer Science Department
Machine Learning Group
Boulevard du Triomphe
1050 Brussels
Belgium
Paul Perco
Emergentec Biodevelopment GmbH
Gersthofer Strasse 29-31
1180 Vienna
Austria
Kitiporn Plaimas
University of Heidelberg
Institute of Pharmacy and Molecular
Biotechnology
Bioquant
Im Neuenheimer Feld 267
69120 Heidelberg
Germany
Thomas N. Plasterer
Northeastern University
Department of Chemistry and
Chemical Biology
360 Huntington Ave.
Boston, MA 02115
USA
and
Pharmacogenetics Clinical Advisory
Board
2000 Commonwealth Avenue, Suite 200
Auburndale, MA 02466
USA
Lijun Qian
Texas A&M University System
Prairie View A&M University
Department of Electrical and
Computer Engineering
MS2520, POB 519
Prairie View, TX 77446
USA
Daniel Remondini
Università di Bologna
Department of Physics
INFN Bologna Section and
Galvani Center for Biocomplexity
40127 Bologna
Italy
Sandra Rodriguez-Zas
University of Illinois at
Urbana-Champaign
Department of Animal Sciences
1207 W. Gregory Drive
Urbana, IL 61801
USA
Debashis Sahoo
Instructor of Pathology and Siebel
Fellow at Institute of Stem Cell Biology
and Regenerative Medicine
Lorry I. Lokey Stem Cell Research
Building
265 Campus Drive, Rm G3101B
Stanford, CA 94305
USA
Shigeru Saito
Infocom Corp.
Chem & Bio Informatics Department
Sumitomo Fudosan Harajuku Building
2-34-17, Jingumae, Shibuya-ku
150-0001 Tokyo
Japan
Robert Stanley
IO Informatics Inc.
2550 Ninth Street
Berkeley, CA 94710-2549
USA
Gautam S. Thakur
University of Florida
Department of Computer and
Information Science and Engineering
Science
PO Box 116120
Gainsville, FL 32611-6120
USA
Tianhai Tian
University of Glasgow
Department of Mathematics
University Gardens
Glasgow G12 8QW
UK
Nam-Kiu Tsing
The University of Hong Kong
Department of Mathematics
Pok Fu Lam Road
Hong Kong
China
Eberhard O. Voit
Georgia Tech and Emory University
The Wallace H. Coulter Department
of Biomedical Engineering
313 Ferst Drive
Atlanta, GA 30332
USA
Haixin Wang
Fort Valley State University
Department of Mathematics and
Computer Science
CTM 101A
Fort Valley, GA 31030
USA
Matthew Weirauch
University of Toronto
Banting and Best Department
of Medical Research and
Donnelly Centre for
Cellular and Biomolecular Research
160 College Street
Toronto, ON, M5S 3E1
Canada
Hong Yu
Chinese Academy of Sciences
Institute of Genetics and
Developmental Biology
Center for Molecular Systems Biology
Key Laboratory of
Molecular Developmental Biology
Lincui East Road
100101 Beijing
China
Wei Zhang
Chinese Academy of Sciences
Institute of Genetics and
Developmental Biology
Center for Molecular Systems Biology
Key Laboratory of
Molecular Developmental Biology
Lincui East Road
100101 Beijing
China
Chapter 1
Network Analysis to Interpret Complex Phenotypes
Hong Yu, Jialiang Huang, Wei Zhang, and Jing-Dong J. Han
1.1 Introduction
Gene network analysis is an important part of systems biology studies. Compared with traditional genotype/phenotype studies that focused on establishing the relationships between single genes and interested traits, network analysis give us a global view of how all the genes work together properly, which in turn leads to the correct biological functions [1].
Unlike the Mendelian “one gene–one phenotype” relationship, C.H. Waddington in 1957 came up with the “epigenetic landscape” to visually illustrate the multigene or network effects of genes on shaping the landscapes (various states) of cellular metabolism. Given our current knowledge, “cellular metabolism” in Waddington's landscapes model can be extended to “molecular networks,” which turn steady states into network representations or snapshots. Such steady states and the transitions from one steady state to another have been computationally analyzed through simulated networks [2–4] and experimentally validated by checking gene expression profiles during proliferation/differentiation transitions, gene mutation perturbations, or environmental or physical stresses [5, 6]. The transition from one stable state to another is usually related to complex phenotypes, which could be both physiological and pathological, such as diabetes mellitus or cancerous proliferation (Figure 1.1) [7]. Gene function is not isolated, so we could not study their function separately. Not only the function of the individual gene products, but also their interaction with each other, which is increasingly more important to the success of higher organisms, determines the selective advantage of the genes and the networks they formed.
What can network analysis do? Here, we mainly talk about given a gene network, mostly validated by experiments, what information could be got from it? How could we understand the biological process with the help of a network? Basically, there are three aspects. The most traditional aspect is to identify the importance of each node in the network (e.g., which genes are more important or crucial, which genes are less important or dispensable). Another aspect is to identify which genes are more functionally related through the whole network view, not only by measuring the direct connections, but also by considering the connections through the whole network. In this way, we could establish functional relationships between all the genes by protein–protein interaction networks or other kinds of experimentally validated networks. More recent studies have focused on identifying the paths or flows through the networks with known input and output genes. These methods could identify the unknown mediated genes and also identify which genes are more important in these processes. All these different aspects could serve well in understanding human diseases at different level and views. We will start by discussing these three aspects in detail, including some methods related to them, but not limited in pure network analysis in later sections.
Before we begin to talk about network analysis, we first explain several definitions that are very basic, but will be frequently mentioned in the following parts.
A network N consists of a set V(N) of vertices (or nodes) together with a set E(N) of edges (or links) that connect various pairs of vertices. Usually, nodes represent genes or proteins and edges represent interactions.
A network N is a weighted network if each of its edges has a number associated with it indicating the strength of the edge. Usually, the edge weights represent the confidences of interactions in biological experiments.
A network N is called a directed network if all of its edges are directed and a network N is called an undirected network if none of its edges is directed. Usually, signaling networks and transcriptional regulatory networks could be directed networks whose directions indicate signal transduction or transcriptional regulation.
For any network N and any particular vertex v in V(N), the number of vertices v′ in V(N) that are directly linked to v is called the degree of v.
In particular, for any directed network N and any particular vertex v in V(N), the number of vertices v′ in V(N) that are directly linked to v by an inward-pointing edge to v is called the in-degree of v and the number of vertices v′ in V(N) that are directly linked to v by an edge pointing outward from v is called the out-degree of v.
The minimum number of edges that must be traversed to travel from a vertex v to another vertex v′ of a network N is called the shortest path length between v and v′. For any connected network N, the average shortest path length between any pair of vertices is called the network's “characteristic path length” (CPL).
1.2 Identification of Important Genes based on Network Topologies
Identification of important genes in biological processes is one of the most common and important aspects in all kinds of biology studies [8, 9]. The basic idea to achieve this goal in biological networks is to measure the influence or damage to the network by perturbing certain genes [10]. If removing a gene from a network leads to small changes or influences, this gene should be less important in maintaining the correct function of the biological network. In contrast, if it leads to the collapse or a large influence on the network, such as dividing the whole network into two subnetworks, this gene might play a crucial role in biological processes. This hypothesis has been increasingly supported by experimental data showing that genes with higher influences on the network were more lethal, more conserved through evolution, and basically more important in maintaining biological functions [11]. In order to evaluate genes' importance, several different measurements could be used due to different considerations.
1.2.1 Degree
The most intuitive consideration is that the more edges are removed, the more damage is taken by the network. Thus, the genes with high degrees, known as hubs in the network, should be more important. Evidence has shown that the perturbation of hubs leads to a more dramatic increase of CPL in a biological network than random perturbations [12]. Besides, other information could be further used, such as gene expression data, to find “date hubs” and “party hubs,” which indicate different biological functions [12].
1.2.2 Betweenness
The centrality or connectivity of a network can be measured by the CPL. In biological networks, the CPL indicates the speed of signal transduction or the quickness of biological response. Thus, another consideration of a gene's importance is the CPL changes when perturbing it. These changes could be measured directly by recalculating the CPL when removing each gene from the network or indirectly using the betweenness of each gene. The betweenness of a vertex v is calculated as the number of shortest paths that pass through it divided by the number of all shortest paths. Compared with betweenness, recalculating the CPL is more accurate, but more time consuming. In fact, a very high correlation exists between the CPL recalculating results and the betweenness measurements, so basically measuring the betweenness of a gene is sufficient to see its influence on the CPL. We could also easily see that a gene with high betweenness is not necessarily a hub or has a very high degree, but in view of the whole gene set, betweenness does correlate with degree.
1.2.3 Network Motifs
Compared with the former two measurements, which could be applied to any kinds of networks, network motifs are usually employed in directed networks, such as transcriptional regulatory networks or transcription factor target networks. Network motifs could be regarded as the basic blocks to form the whole network [13], and they were shown to be important in maintaining robustness, perturbation buffering, quick responses, and accurate signal transductions [13–15]. Thus, the genes that take part in multiple network motifs should be more important and counting network motifs becomes one measurement for evaluating the importance of genes. Here, we introduce several commonly used network motifs (Scheme 1.1).
- Single-input motifs (SIM): a group of nodes regulated by a single node without any other regulation.
- Multi-input motifs (MIM): a group of nodes regulate another group of nodes together.
- Feed-forward loops (FFL): a node regulates another and then these two nodes regulate a third one together.
- Feed-back loops (FBL; also known as a multicomponent loops (MCL): an upstream node is regulated by a downstream one.
In biological networks, genes in SIMs or MIMs usually determine the bottleneck of the network, which possibly indicates that the deletion or mutation of these genes is likely to cause lethal influences. FFLs and FBLs could enable precise control or quick response, which was precisely required in biological processes and responses. Network motifs are not limited to those mentioned above, but all the motifs that have been proved to have biological meanings. By searching for different kinds of network motifs, we could find important genes for certain functions that we are interested in.
1.2.4 Hierarchical Structure
In signal transduction networks or transcriptional regulatory networks, genes can be divided into several layers and the signals flow from top to bottom (with feedback allowed). This kind of structure is called a hierarchical structure. Apart from the degree and network motifs, genes on different layers or having different offspring nodes (regulated by this gene) could provide information on understanding biological processes [16].
These network topology-based analyses have been widely used in identifying important genes in multiple studies of different species. However, some other cautions should be announced in all of these measurements besides the fact that they are based on different considerations. First, it is hard to consider the combinatorial influence of the genes, such as when removing either one of two genes with very similar connections, the network will not be badly influenced because there is a backup gene, but when removing both of them, the whole network will collapse. Backup genes exist widely in real biological processes to ensure the robustness of organisms. Currently, it is possible to detect these combinatorial effects through applying newly developed IT methods, although calculations may be very time-consuming. Another problem is that the qualities of networks negatively influence the results, especially when the edges in the networks are biased. This does happen, especially in human studies. For instance, when using literature-supported protein–protein interactions (PPIs), the “hot” genes or interesting genes are much more intensively studied than the “cold” genes and they are more likely to be hubs, because most of their interactions are discovered, while for the “cold” genes, most of their interactions are unknown.
1.3 Inferring Information from Known Networks
1.3.1 Understanding Biological Functions based on Network Modularity
The existence of modular structures (clusters of tightly connected subnetworks) has been noticed in various biological networks. In biological networks, these modules often indicate particular biological functional processes [17, 18]. The modules can be identified by various algorithms, such as the Lin Log energy model (http://www.informatik.tu-cottbus.de/∼an/GD/linlog.html), the MCODE algorithm (http://baderlab.org/Software/MCODE), and the Markov Clustering algorithm (http://www.micans.org/mcl/). Then, by examining the modules' enriched Gene Ontology (GO) terms, KEGG (Kyoto Encyclopedia of Genes and Genomes) pathways, and other functional annotations, we can discover their biological functions.
1.3.2 Inferring Functional Relationships and Novel Functional Genes Through Networks
In the past few years, more and more studies have focused on identifying functional relationships between genes. These studies came from the collaborations of human association studies and gene function prediction studies. These methods aim to identify unknown disease-related genes with a candidate list derived from association studies. Usually these methods include not only PPIs, but also many other kinds of information, which could be summarized into different kinds of edges. The basic idea is that genes sharing similar functions are usually highly connected in PPI networks. Thus, in order to identify novel disease-related genes from a candidate list, we just need to find the known genes with similar phenotypes in PPI networks.
Several studies analyzed Online Mendelian Inheritance in Man (OMIM) data using PPI and description similarity between genes and phenotypes, which is the result based on human association studies over recent decades [19, 20]. With the development of new technologies, more and more association studies have been finished on large populations and specific phenotypes at high coverage and high resolution levels. These genome-wide association studies (GWAS) provided opportunities for the application of all these methods. As the integration of different kinds of networks could be seen as a whole weighted network with different weights on different edges, we would mainly introduce one method with wide applications and a good computational performance, which is based on the random walk algorithm [21].
The random walk on graphs is defined as an iterative walker's transition from its current node to all its neighbors through all weighted edges starting at given source nodes, s. Each source node could take a different weight and basically the sum value could be normalized to 1, so this value could also be considered as the probability of the information transition through the whole network. Here, compared to the traditional random walk, it added another restart process that in every step, the signal restarts at node s with a probability r. It indicated that in every step of transition, only (1 − r) of total information is continuously transitioned, with r of total restart. The goal of this method is to add a continuous input and when the stable status is achieved, all the other nodes have a stable proportion of information to be output, the sum of which is r.
Formally, the random walk with restart is defined as:
where W is a matrix that is based only on the network topology; basically, it is the column-normalized adjacency matrix, each none zero value represents the weight of one edge in the network. Pt is a vector in which each element holds the probability of information on a node at step t. In this application, the initial probability vector P0 was constructed as weighted probabilities where each probability represents the influence of a source gene on the disease we are interested in, with the sum of these probabilities equal to 1. When the difference between Pt and Pt + 1 is smaller than an arbitrarily given threshold, the steady-state PN was obtained and considered as the result. Candidate disease-related genes are then ranked according to the values in PN.
The performance of the random walk algorithm was shown to be better than the previous algorithms. Also, this algorithm is easily applied. One obvious benefit of this method is that PN is additive, which makes this algorithm very convenient. Take one simple example, consider the steady state PN of only one source node A or B to be PN(A) or PN(B). When we want to consider the combinatorial effect of A and B, we can apply the weighted probabilities of the two source nodes as a and (1 − a), and the steady state PN of using both A and B as source nodes could be simply calculated as PN(AB) = a ∗ PN(A) + (1 − a) ∗ PN(B). This formula could be extended to a set s of multiple source genes. Thus, basically, for a certain network, we do not have to recalculate PN for each set of source genes. Instead, we could calculate each source gene individually and sum the weighted results. In this algorithm, different r indicates different affinity. High r indicates more influence of input genes and less transition in the network, while low r leads to more transition steps. Empirically, the stable result could be obtained within 30–50 steps considering different r and thresholds used, and the algorithm is not very time-consuming. Thus, it is possible to calculate PN of each gene in a network.
As mentioned above in Section 1.2, all of these algorithms are negatively influenced by the quality of networks and those “hot” genes. We were very likely to be stuck in those “hot” genes if a biased network was used.
1.3.3 Unraveling Transcriptional Regulations from Expression Data through Transcriptional Networks
Transcription factors play a crucial regulatory role in various biological processes; however, they are unlikely to be detected from expression data due to their low, and often sparse, expression. To fill this gap, Reverter et al. proposed a regulatory impact factor (RIF) algorithm to identify critical transcription factors from gene expression data by integrating coexpression networks [22]. RIF analysis assigns a score to each transcription factor by considering both the correlation between the transcription factor and the differentially expressed genes and the expression level of the differentially expressed genes. In particular, for a given functional module, its potential regulators are scored by their absolute coexpression correlation averaged across all genes in the module [23].
1.3.4 Extracting the Pathway-Linked Regulators and Effectors based on Network Flows
Recently, high-throughput techniques have been widely used to detect the potential components of biological networks. So far, these high-throughput techniques cover two classes: (i) genetic screens including overexpression, deletion, or RNA interference library screens and (ii) mRNA profiling using microarray or RNA sequencing technology. By comparing the results of these two methods, Yeger-Lotem et al. found that genetic screens tend to identify regulators that are critical for the cell response, while the differentially expressed genes identified by mRNA profiling are likely their downstream effectors, whose changes indirectly reflect the genetic changes in the regulatory networks [24]. It is also true in diseases; using type II diabetes and hypertension as study cases [25], we found that the disease-causing genes, which have high probability to cause type II diabetes and hypertension phenotypes when perturbed, tend to be hubs in the interactome networks and enriched in signaling pathways, whereas the significantly differentially expressed genes identified by microarrays are mostly enriched in the metabolic pathways. The connection between these two gene sets is significantly tight.
To bridge the gap between the genetic screen data and the mRNA expression data using known molecular networks, Yeger-Lotem et al. developed an integrative approach called “Response Net” [24]. Briefly, Response Net is a flow optimization algorithm that redefines a crucial subnetwork that connects genetic hits (source) and differentially expressed genes (target) from a whole weight network, where each node or edge has been assigned a weight according to their biological importance or confidence. The cost of an edge is defined by the −log value of its weight. Thus, the goal of Response Net can be achieved by solving a linear programming optimization problem that minimizes the overall cost of the network when distributing the maximal flow from source to target. According to the solution, those edges with positive flow defined the predicted crucial subnetwork.
1.4 Conclusions
We have introduced basic methods and applications in network analysis to interpret complex phenotypes. Although these methods have many advantages, network biology still faces many challenges. Most of the methods rely on the quality of datasets, which determine the false-positives and limited coverage. Most edges in network maps are still lacking detailed attributes and directions. Post-transcriptional modifications are hardly monitored at a large scale. Tissue- and cell-type specificities are hard to consider. However, with the development of new technologies, such as high-throughput and single-cell dynamic measurement techniques, and with increasing accuracy and coverage of high-throughput technologies, the ever-accelerating data acquisition will raise further need for data integration and modeling at the network level. More and more methods have emerged, which provide important tools for network analysis. Mastering these methods is necessary, but far from sufficient for understanding biology. More important things to do are to ask the right questions, to choose proper network analysis tools, and to validate analysis results by solid experimentation. After all, network biology is biology and the fundamental goal is the same for network biology and molecular biology – to better understand basic biological processes and the mechanisms of human diseases.
References
1. Barabasi, A.L. and Oltvai, Z.N. (2004) Network biology: understanding the cell's functional organization. Nat. Rev. Genet., 5, 101–113.
2. Bergman, A. and Siegal, M.L. (2003) Evolutionary capacitance as a general feature of complex gene networks. Nature, 424, 549–552.
3. Kauffman, S.A. (1969) Metabolic stability and epigenesis in randomly constructed genetic nets. J. Theor. Biol., 22, 437–467.
4. Li, F., Long, T., Lu, Y., Ouyang, Q., and Tang, C. (2004) The yeast cell-cycle network is robustly designed. Proc. Natl. Acad. Sci. USA, 101, 4781–4786.
5. Chen, J.F., Mandel, E.M., Thomson, J.M., Wu, Q., Callis, T.E., Hammond, S.M., Conlon, F.L., and Wang, D.Z. (2006) The role of microRNA-1 and microRNA-133 in skeletal muscle proliferation and differentiation. Nat. Genet., 38, 228–233.
6. Huang, S., Eichler, G., Bar-Yam, Y., and Ingber, D.E. (2005) Cell fates as high-dimensional attractor states of a complex gene regulatory network. Phys. Rev. Lett., 94, 128701.
7. Han, J.D. (2008) Understanding biological functions through molecular networks. Cell Res., 18, 224–237.
8. Jeong, H., Mason, S.P., Barabasi, A.L., and Oltvai, Z.N. (2001) Lethality and centrality in protein networks. Nature, 411, 41–42.
9. Tew, K.L., Li, X.L., and Tan, S.H. (2007) Functional centrality: detecting lethality of proteins in protein interaction networks. Genome Inform., 19, 166–177.
10. Albert, R., Jeong, H., and Barabasi, A.L. (2000) Error and attack tolerance of complex networks. Nature, 406, 378–382.
11. He, X. and Zhang, J. (2006) Why do hubs tend to be essential in protein networks? PLoS Genet., 2, e88.
12. Han, J.D., Bertin, N., Hao, T., Goldberg, D.S., Berriz, G.F., Zhang, L.V., Dupuy, D., Walhout, A.J., Cusick, M.E., Roth, F.P. et al. (2004) Evidence for dynamically organized modularity in the yeast protein–protein interaction network. Nature, 430, 88–93.
13. Milo, R., Shen-Orr, S., Itzkovitz, S., Kashtan, N., Chklovskii, D., and Alon, U. (2002) Network motifs: simple building blocks of complex networks. Science, 298, 824–827.
14. Milo, R., Itzkovitz, S., Kashtan, N., Levitt, R., Shen-Orr, S., Ayzenshtat, I., Sheffer, M., and Alon, U. (2004) Superfamilies of evolved and designed networks. Science, 303, 1538–1542.
15. Wuchty, S., Oltvai, Z.N., and Barabasi, A.L. (2003) Evolutionary conservation of motif constituents in the yeast protein interaction network. Nat. Genet., 35, 176–179.
16. Yu, H. and Gerstein, M. (2006) Genomic analysis of the hierarchical structure of regulatory networks. Proc. Natl. Acad. Sci. USA, 103, 14724–14731.
17. Bader, G.D. and Hogue, C.W. (2003) An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics, 4, 2.
18. Eisen, M.B., Spellman, P.T., Brown, P.O., and Botstein, D. (1998) Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA, 95, 14863–14868.
19. Lage, K., Karlberg, E.O., Storling, Z.M., Olason, P.I., Pedersen, A.G., Rigina, O., Hinsby, A.M., Tumer, Z., Pociot, F., Tommerup, N. et al. (2007) A human phenome–interactome network of protein complexes implicated in genetic disorders. Nat. Biotechnol., 25, 309–316.
20. Wu, X., Jiang, R., Zhang, M.Q., and Li, S. (2008) Network-based global inference of human disease genes. Mol. Syst. Biol., 4, 189.
21. Kohler, S., Bauer, S., Horn, D., and Robinson, P.N. (2008) Walking the interactome for prioritization of candidate disease genes. Am. J. Hum. Genet., 82, 949–958.
22. Reverter, A., Hudson, N.J., Nagaraj, S.H., Perez-Enciso, M., and Dalrymple, B.P. (2010) Regulatory impact factors: unraveling the transcriptional regulation of complex traits from expression data. Bioinformatics, 26, 896–904.
23. Hudson, N.J., Reverter, A., Wang, Y., Greenwood, P.L., and Dalrymple, B.P. (2009) Inferring the transcriptional landscape of bovine skeletal muscle by integrating co-expression networks. PLoS ONE, 4, e7249.
24. Yeger-Lotem, E., Riva, L., Su, L.J., Gitler, A.D., Cashikar, A.G., King, O.D., Auluck, P.K., Geddie, M.L., Valastyan, J.S., Karger, D.R. et al. (2009) Bridging high-throughput genetic and transcriptional data reveals cellular responses to alpha-synuclein toxicity. Nat. Genet., 41, 316–323.
25. Yu, H., Huang, J., Qiao, N., Green, C.D., and Han, J.D. (2010) Evaluating diabetes and hypertension disease causality using mouse phenotypes. BMC Syst. Biol., 4, 97.