Agent-Based Modeling and Simulation of Collaborative Social Networks Research in Progress Greg Madey Yongqin Gao Computer Science & Engineering University of Notre Dame Vincent Freeh Computer Science North Carolina State University Renee Tynan Chris Hoffman Department of Management University of Notre Dame AMCIS2003 Tampa, FL August 2003 Supported in part by the National Science Foundation - Digital Society & Technology Program
Outline Definitions: Agents, models, simulations, collaborative social networks, computer experiments Phenomenon: Free/Open Source Software (F/OSS) Conceptual models ER model BA model BA model with constant fitness BA model with dynamic fitness Experiments and results Summary Some discussion questions
Agent-Based Modeling and Simulation Conceptual models of a phenomenon Simulations are computer implementations of the conceptual models Agents in models and simulations are distinct entities (instantiated objects) Tend to be simple, but with large numbers of them (thousands, or more) - i.e., swarm intelligence Contrasted with higher level intelligent agents Foundations in complexity theory Self-organization Emergence
Collaborative Social Networks Research-paper co-authorship, small world phenomenon, e.g., Erdos number (Barabasi 2001, Newman 2001) Movie actors, small world phenomenon, e.g., Kevin Bacon number (Watts 1999, 2003) Interlocking corporate directorships Open-source software developers (Madey et al., AMCIS 2002) Collaborators are nodes in a graph, and collaborative relationships are the edges of the graph
Classical Scientific Method
1. Observe the world
   a) Identify a puzzling phenomenon
2. Generate a falsifiable hypothesis (K. Popper)
3. Design and conduct an experiment with the goal of disproving the hypothesis
   a) If the experiment fails, then the hypothesis is accepted (until replaced)
   b) If the experiment succeeds, then reject the hypothesis; additional insight into the phenomenon may be obtained and steps 2-3 repeated
The Computer Experiment
Agent-Based Simulation as a Component of the Scientific Method Modeling (Hypothesis) Observation Agent-Based Simulation (Experiment)
Agent-Based Simulation as a Component of the Scientific Method Modeling (Hypothesis): Social Network Model of F/OSS Observation: Analysis of SourceForge Data Agent-Based Simulation (Experiment): Grow Artificial SourceForge
Open Source Software (OSS) Free: to view source, to modify, to share, of cost Examples: Apache, Perl, GNU, Linux, Sendmail, Python, KDE, GNOME, Mozilla, and thousands more (Logos: GNU, Savannah, Linux)
Free Open Source Software (F/OSS) Development Mostly volunteer Global teams Virtual teams Self-organized - often peer-based meritocracy Self-managed - but often a charismatic leader Often large numbers of developers, testers, support help, end user participation Rapid, frequent releases Mostly unpaid
F/OSS Developers Larry Wall: Perl; Linus Torvalds: Linux; Eric Raymond: The Cathedral and the Bazaar; Richard Stallman: GNU, GNU Manifesto
F/OSS: A Puzzling Phenomenon Contradicts traditional wisdom: Software engineering Coordination, large numbers Motivation of developers Quality Security Business strategy Almost everything is done electronically and available in digital form Opportunity for IS Research -- large amounts of online data available Research issues: Understanding motives Understanding processes Intellectual property Digital divide Self-organization Government policy Impact on innovation Ethics Economic models Cultural issues International factors
SourceForge VA Software Part of OSDN Started 12/1999 Collaboration tools 58,685 Projects 80,000 Developers 590,000 Registered Users
Savannah Uses SourceForge Software Free Software Foundation 1,508 Projects 15,265 Registered Users
F/OSS: Importance Major Component of e-technology Infrastructure with major presence in e-commerce e-science e-government e-learning Apache has over 65% market share of Internet Web servers Linux on over 7 million computers Most Internet e-mail runs on Sendmail Tens of thousands of quality products Part of product offerings of companies like IBM, Apple Apache in WebSphere, Linux on mainframe, FreeBSD in OSX Corporate employees participating on OSS projects
Free/Open Source Software Seems to challenge traditional economic assumptions Model for software engineering New business strategies Cooperation with competitors Beyond trade associations, shared industry research, and standards processes: shared product development! Virtual, self-organizing and self-managing teams Social issues, e.g., digital divide, international participation Government policy issues, e.g., US software industry, impact on innovation, security, intellectual property
Research Model [Diagram: components linked by Cross Validation. Box labels: Conceptual Explanatory Model of OSS: Agent-Based Modeling and Simulation; Combined Data Mining; Understanding the Social and Task Dynamics that Predict Developer Behaviors; Social Network Analysis: Longitudinal Study of Preferential Attachment and Dynamic Attachment. Arrow labels: Parameter Values, Structural Features.]
Observations Web mining Web crawler (scripts): Python, Perl, AWK, Sed Monthly since Jan 2001 Fields: ProjectID, DeveloperID Almost 2 million records Relational database
PROJ   DEVELOPER
8001   dev378
8001   dev8975
8001   dev9972
8002   dev27650
8005   dev31351
8006   dev12509
8007   dev19395
8007   dev4622
8007   dev35611
8008   dev8975
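The (project, developer) records can be turned into a developer collaboration graph by linking every pair of developers who share a project. The following is a minimal sketch using only the sample rows shown on the slide; the function name `build_developer_graph` and the dictionary-of-sets representation are assumptions for illustration, not the authors' implementation.

```python
from collections import defaultdict
from itertools import combinations

# Sample (project, developer) rows copied from the slide's table.
records = [
    (8001, "dev378"), (8001, "dev8975"), (8001, "dev9972"),
    (8002, "dev27650"), (8005, "dev31351"), (8006, "dev12509"),
    (8007, "dev19395"), (8007, "dev4622"), (8007, "dev35611"),
    (8008, "dev8975"),
]

def build_developer_graph(rows):
    """Return {developer: set of co-developers} from (project, developer) rows."""
    members = defaultdict(set)
    for project, developer in rows:
        members[project].add(developer)
    graph = {}
    for devs in members.values():
        for dev in devs:
            # Register every developer, even on single-developer projects.
            graph.setdefault(dev, set())
        for a, b in combinations(sorted(devs), 2):
            # Two developers on the same project share an edge.
            graph[a].add(b)
            graph[b].add(a)
    return graph

graph = build_developer_graph(records)
edge_count = sum(len(nbrs) for nbrs in graph.values()) // 2
```

In this sample, dev8975 belongs to two projects (8001 and 8008), which is the kind of multi-project membership that ties separate project teams into one network.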
Models of the F/OSS Social Network (Alternative Hypotheses) General model features Agents are nodes on a graph (developers or projects) Behaviors: Create, join, abandon and idle Edges are relationships (joint project participation) Growth of network: random or types of preferential attachment, formation of clusters Fitness Network attributes: diameter, average degree, degree distribution, clustering coefficient Four specific models ER (random graph) - (1960) BA (preferential attachment) - (1999) BA ( + constant fitness) - (2001) BA ( + dynamic fitness) - (2003)
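The network attributes named above (average degree, degree distribution, clustering coefficient, diameter) can be computed directly from an adjacency structure. This is a sketch under stated assumptions: the graph is an undirected dictionary of neighbor sets, the clustering coefficient is the average of local coefficients, and the diameter is taken as the longest shortest path within any connected component. The slides do not say which exact definitions were used.

```python
from collections import deque

def average_degree(graph):
    return sum(len(nbrs) for nbrs in graph.values()) / len(graph)

def degree_distribution(graph):
    """Return {degree: number of nodes with that degree}."""
    counts = {}
    for nbrs in graph.values():
        counts[len(nbrs)] = counts.get(len(nbrs), 0) + 1
    return counts

def clustering_coefficient(graph):
    """Average local clustering; nodes with fewer than 2 neighbors count as 0."""
    total = 0.0
    for nbrs in graph.values():
        k = len(nbrs)
        if k < 2:
            continue
        # Each edge among the neighbors is seen twice, once from each end.
        links = sum(1 for a in nbrs for b in graph[a] if b in nbrs) / 2
        total += 2 * links / (k * (k - 1))
    return total / len(graph)

def diameter(graph):
    """Longest shortest path over all connected pairs (BFS from every node)."""
    longest = 0
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nbr in graph[node]:
                if nbr not in dist:
                    dist[nbr] = dist[node] + 1
                    queue.append(nbr)
        longest = max(longest, max(dist.values()))
    return longest

# Toy graph: a triangle a-b-c with one extra node d attached to c.
toy = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
```

The all-pairs BFS is quadratic in the worst case, so a network the size of SourceForge would likely need sampling or a graph library; the sketch is meant only to pin down the definitions.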
F/OSS Developers - Collaboration Social Network Developers are nodes / Projects are links 24 Developers 5 Projects 2 Linchpin Developers 1 Cluster [Figure: network diagram of one cluster connecting the developers of Projects 6882, 7028, 7597, 9859 and 15850; individual dev[...] node and edge labels omitted]
Computer Experiments Agent-based simulations Java programs using Swarm class library Validation (docking) exercises using Java/Repast Grow artificial SourceForges (Epstein & Axtell, 1996) Parameterized with observed data, e.g., developer behaviors: join rates, new project additions, leaving projects Evaluation of four models (hypotheses) Verification/validation
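To make the idea of growing an artificial SourceForge concrete, here is a minimal sketch of an agent loop with the four behaviors named in the models (create, join, abandon, idle). It is written in Python rather than the Java/Swarm used in the study, and the fixed developer population, the uniform choice of project to join, and the rate values in the usage lines are all placeholder assumptions, not the observed parameters.

```python
import random

def grow_artificial_sourceforge(n_devs, steps, rates, seed=0):
    """Simulate developer agents; returns ({developer: set of projects}, project count).

    `rates` maps each action name to a relative weight. The values passed in
    are hypothetical; in the study they would come from the observed data.
    """
    rng = random.Random(seed)
    membership = {dev: set() for dev in range(n_devs)}
    n_projects = 0
    actions = list(rates)
    weights = [rates[a] for a in actions]
    for _ in range(steps):
        dev = rng.randrange(n_devs)                      # one agent acts per step
        action = rng.choices(actions, weights=weights)[0]
        if action == "create":
            membership[dev].add(n_projects)              # founder joins the new project
            n_projects += 1
        elif action == "join" and n_projects > 0:
            membership[dev].add(rng.randrange(n_projects))  # uniform, ER-like choice
        elif action == "abandon" and membership[dev]:
            membership[dev].discard(rng.choice(sorted(membership[dev])))
        # "idle": the agent does nothing this step
    return membership, n_projects

# Placeholder rates for illustration only.
rates = {"create": 0.1, "join": 0.5, "abandon": 0.1, "idle": 0.3}
membership, n_projects = grow_artificial_sourceforge(50, 500, rates, seed=1)
```

Swapping the uniform project choice in the "join" branch for a degree-weighted or fitness-weighted choice is what would distinguish the four models from one another in a sketch like this.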
Four Cycles of Modeling & Simulation Modeling (Hypothesis): Social Network Models ER => BA => BA+Fitness => BA+Dynamic Fitness Observation: Analysis of SourceForge Data (Degree Distribution, Average Degree, Diameter, Clustering Coefficient, Cluster Size Distribution) Agent-Based Simulation (Experiment): Grow Artificial SourceForge
ER model degree distribution Degree distribution is binomial, while it follows a power law in the empirical data Fit fails
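The binomial degree distribution follows from how a random graph is built: every pair of nodes is linked independently with the same probability, so each node's degree is a sum of independent coin flips. The sketch below uses the textbook G(n, p) construction as an assumption; the study's ER simulation of developers and projects may differ in detail, and the values of `n` and `p` are arbitrary.

```python
import random
from itertools import combinations

def er_graph(n, p, seed=0):
    """Erdos-Renyi G(n, p): link each pair of nodes independently with probability p."""
    rng = random.Random(seed)
    graph = {v: set() for v in range(n)}
    for a, b in combinations(range(n), 2):
        if rng.random() < p:
            graph[a].add(b)
            graph[b].add(a)
    return graph

n, p = 200, 0.05                      # arbitrary illustration values
g = er_graph(n, p, seed=1)
mean_degree = sum(len(nbrs) for nbrs in g.values()) / n
expected_mean = (n - 1) * p           # mean of Binomial(n - 1, p)
```

Because degrees cluster tightly around the mean, such a graph has no heavily connected hubs, which is why it cannot reproduce a power-law degree distribution.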
ER model - diameter and average degree Average degree is decreasing, while it is increasing in the empirical data Diameter is increasing, while it is decreasing in the empirical data Fit fails
ER model clustering coefficient Clustering coefficient is relatively low, around 0.4, while it is around 0.7 in the empirical data Clustering coefficient is decreasing, while it is increasing in the empirical data Fit fails
ER model cluster distribution Cluster distribution in the ER model also follows a power law, with R² of 0.6667 (0.9953 without the major cluster), while R² in the empirical data is 0.7457 (0.9797 without the major cluster) The actual distribution is different from the empirical data The later models (BA and further models) have similar behaviors Fit fails
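A cluster size distribution can be obtained by finding the connected components of the network and counting their sizes. The sketch below assumes that "cluster" means connected component and that the "major cluster" is the largest one; the function name and graph representation are illustrative.

```python
from collections import Counter, deque

def cluster_sizes(graph):
    """Return connected-component sizes, largest first."""
    seen = set()
    sizes = []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            node = queue.popleft()
            size += 1
            for nbr in graph[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    queue.append(nbr)
        sizes.append(size)
    return sorted(sizes, reverse=True)

# Toy graph with three clusters: a path of 3 nodes, a pair, and an isolated node.
toy = {1: {2}, 2: {1, 3}, 3: {2}, 4: {5}, 5: {4}, 6: set()}
sizes = cluster_sizes(toy)
distribution = Counter(sizes)             # {cluster size: number of clusters}
without_major = Counter(sizes[1:])        # same, with the largest cluster dropped
```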
BA model degree distribution Power laws in degree distribution, similar to empirical data (+ for simulated data and x for empirical data). For developer distribution: simulated data has R² of 0.9798 and empirical data has R² of 0.9712. Fit succeeds For project distribution: simulated data has R² of 0.6650 and empirical data has R² of 0.9815. Fit fails
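One common way to test for a power law and obtain an R² value is a least-squares line fit on log-log axes, since a power law appears as a straight line there. The slides do not state how R² was computed, so the method below is an assumption, and the synthetic counts exist only to exercise the function.

```python
import math

def power_law_fit(degree_counts):
    """Fit a line through (log k, log count); return (slope, r_squared)."""
    pts = [(math.log(k), math.log(c))
           for k, c in degree_counts.items() if k > 0 and c > 0]
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    syy = sum((y - mean_y) ** 2 for _, y in pts)
    slope = sxy / sxx
    r_squared = (sxy * sxy) / (sxx * syy)   # squared correlation of the log values
    return slope, r_squared

# Synthetic, exactly power-law counts (count = 1000 * k^-2), not real data.
synthetic = {k: 1000.0 * k ** -2 for k in range(1, 11)}
slope, r_squared = power_law_fit(synthetic)
```

On the synthetic input the fit recovers the exponent of -2 with R² of 1 up to rounding; noisy empirical data would give lower values such as those reported on the slides.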
BA model diameter and CC Small diameter and high clustering coefficient like empirical data Diameter and clustering coefficient are both decreasing like empirical data Fit succeeds
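Preferential attachment means that a newcomer is more likely to link to nodes that already have many links. The sketch below uses the textbook Barabasi-Albert construction as an assumption (each new node attaches to `m` existing nodes chosen in proportion to degree); the study's developer and project version of the model may differ, and `n` and `m` are arbitrary.

```python
import random

def ba_graph(n, m=2, seed=0):
    """Barabasi-Albert style growth: each new node links to m degree-weighted targets."""
    rng = random.Random(seed)
    # Start from a small complete core of m + 1 nodes.
    graph = {v: {u for u in range(m + 1) if u != v} for v in range(m + 1)}
    # Each node appears in `stubs` once per incident edge, so a uniform draw
    # from this list is a degree-proportional draw over nodes.
    stubs = [v for v, nbrs in graph.items() for _ in nbrs]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(stubs))
        graph[new] = set(targets)
        for t in sorted(targets):
            graph[t].add(new)
            stubs.extend([t, new])
    return graph

g = ba_graph(500, m=2, seed=1)
edge_count = sum(len(nbrs) for nbrs in g.values()) // 2
max_degree = max(len(nbrs) for nbrs in g.values())
```

Early nodes accumulate links faster than late ones, producing the hubs that give a heavy-tailed degree distribution.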
BA model with constant fitness Power laws in degree distribution, similar to empirical data (+ for simulated data and x for empirical data). For developer distribution: simulated data has R² of 0.9742 and empirical data has R² of 0.9712. Fit succeeds For project distribution: simulated data has R² of 0.7253 and empirical data has R² of 0.9815. Fit fails Diameter and CC are similar to simple BA model. Fit succeeds
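For reference, the attachment rules usually written for these two models in the network literature are given below. The symbols are not in the slides and are assumptions here: k_i is the degree of node i and \eta_i is its fixed fitness. Whether the simulation used exactly these forms is not stated.

```latex
\Pi_i^{\mathrm{BA}} = \frac{k_i}{\sum_j k_j},
\qquad
\Pi_i^{\mathrm{fitness}} = \frac{\eta_i \, k_i}{\sum_j \eta_j \, k_j}
```

With all \eta_i equal, the second rule reduces to the first, so constant fitness only changes the outcome when nodes differ in intrinsic attractiveness.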
Discovery: BA with dynamic fitness Problem with BA with constant fitness Intuition: Project fitness might change with time. Data mining observation: project life cycle property - fitness generally decreases with time New model not in the literature Hypothesis: BA with dynamic fitness of projects Computer experiment
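The dynamic-fitness idea can be sketched by letting each project's fitness shrink as the project ages before it is multiplied by the project's degree. The exponential decay, the `decay` rate, and the function names below are assumptions for illustration; the slides say only that fitness generally decreases with time.

```python
import math
import random

def attachment_weights(degrees, base_fitness, ages, decay=0.1):
    """Weight of project i = base fitness * exp(-decay * age) * degree.

    The exponential form is a hypothetical choice; decay=0 gives the
    constant-fitness model.
    """
    return [f * math.exp(-decay * a) * k
            for k, f, a in zip(degrees, base_fitness, ages)]

def choose_project(degrees, base_fitness, ages, rng, decay=0.1):
    """Pick the index of the project a developer joins, in proportion to weight."""
    weights = attachment_weights(degrees, base_fitness, ages, decay)
    return rng.choices(range(len(degrees)), weights=weights)[0]

# Two projects with equal size and base fitness; only their ages differ.
degrees, fitness, ages = [10, 10], [1.0, 1.0], [0, 50]
rng = random.Random(1)
picks = [choose_project(degrees, fitness, ages, rng) for _ in range(1000)]
```

With equal size and base fitness, the young project attracts almost all new developers, which is the life-cycle effect the data mining observation points to.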
BA model with dynamic fitness Power laws in degree distribution, similar to empirical data (+ for simulated data and x for empirical data). For developer distribution: simulated data has R² of 0.9695 and empirical data has R² of 0.9712. Fit succeeds (as before) For project distribution: simulated data has R² of 0.8051 and empirical data has R² of 0.9815. Fit is better, but more work is needed
Agent-Based Modeling and Simulation as Components of the Scientific Method Hypothesis Observation Experiment
Summary Why Agent-Based Modeling and Simulation? Can be used as components of the Scientific Method A research approach for studying socio-technical systems Case study: F/OSS - Collaboration Social Networks SourceForge conceptual models: ER, BA, BA with constant fitness and BA with dynamic fitness. Simulations Computer experiments that tested conceptual models Provided insight into the phenomenon under study and guided data mining of collected observations
Discussion "The social sciences are, in fact, the hard sciences," Herbert Simon (1987) Computational social science: agent-based modeling and simulation Kuhn's periods of Normal Science punctuated by Paradigm shifts Karl Popper's theory-testing through falsification Relevant literature on the role of simulation in the process of scientific discovery
Thank you