Disambiguation and Co-authorship Networks of the U.S. Patent Inventor Database. Lee Fleming. Many thanks to Julia Lane and SciSIP 199704!
Will the real Matt Marx please stand up? Disambiguation: Matt Marx; Plainview NY, Everett MA, Mt View CA; Class 704
Many years and a cast of thousands: Ron Lai, Vetle Torvik, Alex D'Amour, Edward Sun, Amy Yu, David Doolin, Guan-Cheng Li, Lee Fleming. Almost literally, if you count everyone who has helped with data and feedback (Thank you!)
Agenda: Overview of disambiguation process flow; Peek under the hood; Results; Implications for science policy; Coming attractions
Public databases: weekly USPTO patent data (1975-2010). Data preparation: load and validate; clean and format; generate datasets. Inventor disambiguation algorithm. Consolidated inventor dataset. Primary datasets. Disambiguated inventor dataset. Assignee, Inventors, Classes, Patents. Fung Institute servers and Dataverse Network platform.
In the beginning: compare various fields across patents; weight each field and tune to a curated dataset. Worked surprisingly well, but: cannot predict insidious model interdependencies (e.g., technology field is more important in a large firm); small hand-curated datasets are inherently biased. So let the machine learn a non-parametric model.
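The early weighted-field approach can be illustrated with a minimal sketch. The field names, weights, threshold, and example records below are all hypothetical; the slides do not give the actual values, and the original comparison may not have used exact equality.

```python
# Hypothetical field weights; the slides do not state the real ones.
WEIGHTS = {"last_name": 0.4, "first_name": 0.3, "city": 0.1,
           "assignee": 0.1, "tech_class": 0.1}
THRESHOLD = 0.7  # assumed cut-off, to be tuned against a curated dataset


def match_score(a, b, weights=WEIGHTS):
    """Sum the weights of all fields on which two records agree exactly."""
    return sum(w for field, w in weights.items()
               if a.get(field) is not None and a.get(field) == b.get(field))


def is_match(a, b, threshold=THRESHOLD):
    """Declare a match when the weighted agreement reaches the threshold."""
    return match_score(a, b) >= threshold


# Illustrative records only; assignee values are made up.
rec_a = {"last_name": "Marx", "first_name": "Matt", "city": "Everett MA",
         "assignee": "Firm A", "tech_class": "704"}
rec_b = {"last_name": "Marx", "first_name": "Matt", "city": "Mt View CA",
         "assignee": "Firm B", "tech_class": "704"}
rec_c = {"last_name": "Fleming", "first_name": "Lee", "city": "Mt View CA",
         "assignee": "Firm C", "tech_class": "704"}

print(is_match(rec_a, rec_b), is_match(rec_a, rec_c))
```

A fixed weight per field is exactly what cannot capture the interdependencies noted on the slide, such as technology class mattering more inside a large firm.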
How does a machine learn? 1) Start with curated data (assumes no bias). 2) Sample two sets of variables, name and patent: given one set, learn how well the other set predicts a match/non-match (assumes independent influences on match probability). Not clear which is better; we use 2). After learning, estimate matches in the remaining dataset. Learn patent attributes from name; learn name attributes from patent. This is a match: pairs with a perfect full-name match of a rare name; pairs that share >1 common co-authors and >1 tech classes. This is not a match: pairs of rare names whose full names differ; pairs of inventors from the same patent.
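As a rough sketch of option 2), the following labels pairs using names alone (identical rare full name = match, different rare full names = non-match) and then estimates how strongly one patent attribute, a shared technology class, predicts a match. The toy records, the rare-name list, the add-one smoothing, and the prior are all assumptions made for illustration; this is not the authors' implementation.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy records: (full_name, tech_class). Names are invented.
records = [
    ("Zebulon Quark", "704"), ("Zebulon Quark", "704"), ("Zebulon Quark", "704"),
    ("Ximena Voss", "435"), ("Ximena Voss", "435"), ("Ximena Voss", "704"),
]
RARE_NAMES = {"Zebulon Quark", "Ximena Voss"}  # assumed list of rare names


def likelihood_ratios(records, rare_names, smoothing=1.0):
    """Estimate P(x | match) / P(x | non-match) for x = 'same tech class'."""
    counts = {True: Counter(), False: Counter()}
    for (name_a, cls_a), (name_b, cls_b) in combinations(records, 2):
        if name_a not in rare_names or name_b not in rare_names:
            continue  # only rare names give trustworthy labels
        label = name_a == name_b          # label comes from the name only
        counts[label][cls_a == cls_b] += 1  # feature comes from the patent
    ratios = {}
    for value in (True, False):
        p_match = (counts[True][value] + smoothing) / (
            sum(counts[True].values()) + 2 * smoothing)
        p_non = (counts[False][value] + smoothing) / (
            sum(counts[False].values()) + 2 * smoothing)
        ratios[value] = p_match / p_non
    return ratios


def posterior(prior, ratio):
    """Bayes update of a prior match probability with a likelihood ratio."""
    return prior * ratio / (prior * ratio + 1.0 - prior)


ratios = likelihood_ratios(records, RARE_NAMES)
print(ratios, posterior(0.1, ratios[True]))
```

Under the independence assumption stated on the slide, ratios for several attributes would simply be multiplied before the Bayes update.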
Disambiguation is essentially a clustering challenge: (10.4M)*(10.4M - 1) is a big number. Block to reduce pairwise computation: truncate last and first names, e.g., all M. Marx's or L. Flem's. Lends itself to parallel processing. Relax and tighten blocking in a series of iterative improvements.
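Blocking on truncated names can be sketched as follows. The truncation lengths and the sample names are assumptions; the sketch counts unordered pairs, n(n-1)/2, whereas the slide quotes the ordered count n(n-1).

```python
from collections import defaultdict


def block_key(first, last, n_first=1, n_last=4):
    """Truncate first and last names to form a blocking key, e.g. ('L', 'FLEM')."""
    return (first[:n_first].upper(), last[:n_last].upper())


def build_blocks(records, n_first=1, n_last=4):
    """Group (first, last) records by blocking key."""
    blocks = defaultdict(list)
    for first, last in records:
        blocks[block_key(first, last, n_first, n_last)].append((first, last))
    return blocks


def pair_count(n):
    """Number of unordered pairs among n records."""
    return n * (n - 1) // 2


# Illustrative records; spelling variants are invented.
records = [("Matt", "Marx"), ("Matthew", "Marx"), ("Lee", "Fleming"),
           ("L.", "Flemming"), ("Ken", "Younge")]

blocks = build_blocks(records)
within_blocks = sum(pair_count(len(members)) for members in blocks.values())
all_pairs = pair_count(len(records))
print(within_blocks, "comparisons instead of", all_pairs)
```

Because blocks never need to see each other, each one can be handed to a separate worker. Shortening `n_last` relaxes the blocking and lengthening it tightens the blocking, which is one way to read the iterative relax-and-tighten step.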
Lumping vs. splitting. Splitting = #records not in correct cluster / total records. Lumping = #records in wrong cluster / total records. You may choose one of two poisons: upper bound (more likely to be split) or lower bound (more likely to be lumped). 2011 disambiguation results (based on updated Gu 2008 standard): 3.2%, 1.5% for lower bound; 3.6%, 1.5% for upper bound. Run both if design is sensitive to cut-points. Or design a better experiment.
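The two error rates can be computed against a gold standard roughly as below. The slide does not say how the "correct cluster" is chosen; this sketch assumes it is the predicted cluster holding most of a true inventor's records, and the function name and toy labels are hypothetical.

```python
from collections import Counter, defaultdict


def lumping_splitting(true_ids, pred_ids):
    """Return (splitting, lumping) rates for parallel lists of cluster labels."""
    total = len(true_ids)
    preds_by_true = defaultdict(list)    # true inventor -> predicted labels
    trues_by_pred = defaultdict(list)    # predicted cluster -> true inventors
    for t, p in zip(true_ids, pred_ids):
        preds_by_true[t].append(p)
        trues_by_pred[p].append(t)

    split = lumped = 0
    for t, preds in preds_by_true.items():
        # Assumption: the largest predicted cluster is the "correct" one.
        main_cluster, main_size = Counter(preds).most_common(1)[0]
        split += len(preds) - main_size  # records stranded in other clusters
        lumped += sum(1 for other in trues_by_pred[main_cluster] if other != t)
    return split / total, lumped / total


# Toy gold standard: inventor A has one record wrongly merged into B's cluster.
true_ids = ["A", "A", "A", "B", "B", "C"]
pred_ids = [1, 1, 2, 2, 2, 3]
splitting, lumping = lumping_splitting(true_ids, pred_ids)
print(f"splitting={splitting:.3f} lumping={lumping:.3f}")
```

Here the same stray record counts once as a split (missing from A's cluster) and once as a lump (intruding in B's cluster), which is why the two rates trade off when the match threshold moves.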
How to get the goods. Harvard DataVerse Network (DVN): 2011 disambiguation and network variables; 12,000+ downloads. Fung Institute @ Github (https://github.com/funginstitute/downloads): current disambiguations (Sept 4, 2012); working papers; code repository.
Demographics and ethnicity. Kim Jones, David Doolin
Regional Disadvantage? Non-competes and inventor mobility. Disambiguation enables diffs-in-diffs model. MARA: Michigan's gift to noncompete research. Marx, Strumsky, Fleming 2009: decreased intra-state mobility. Marx, Singh, Fleming: brain drain from states that enforce non-competes, of best inventors and ideas. The real Matt Marx.
Results (also hold with econometric models and CEM matching)

ALL INVENTORS
                 pre-MARA   post-MARA   relative risk
Michigan         0.24%      0.32%       1.353
non-Michigan     0.20%      0.13%       0.677
Michigan % increase over non-Michigan: 99.9%

CITATIONS PER PATENT, median and below
                 pre-MARA   post-MARA   relative risk
Michigan         0.20%      0.33%       1.625
non-Michigan     0.13%      0.14%       1.112
Michigan % increase over non-Michigan: 46.1%

CITATIONS PER PATENT, above median
                 pre-MARA   post-MARA   relative risk
Michigan         0.27%      0.31%       1.134
non-Michigan     0.26%      0.10%       0.395
Michigan % increase over non-Michigan: 186.8%

DEGREE, median and below
                 pre-MARA   post-MARA   odds ratio
Michigan         0.25%      0.22%       0.870
non-Michigan     0.17%      0.11%       0.635
Michigan % increase over non-Michigan: 37.0%

DEGREE, above median
                 pre-MARA   post-MARA   odds ratio
Michigan         0.21%      0.51%       2.388
non-Michigan     0.29%      0.20%       0.710
Michigan % increase over non-Michigan: 236.3%
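The "Michigan % increase over non-Michigan" figures appear to be the ratio of the two groups' relative risks (or odds ratios) minus one. That reading is inferred from the numbers, not stated on the slide. A small check under that assumption:

```python
def pct_increase(ratio_michigan, ratio_other):
    """Percent by which Michigan's ratio exceeds the comparison group's ratio."""
    return (ratio_michigan / ratio_other - 1.0) * 100.0


# (Michigan ratio, non-Michigan ratio) as reported on the slide.
reported = {
    "all inventors": (1.353, 0.677),
    "citations, median and below": (1.625, 1.112),
    "degree, median and below": (0.870, 0.635),
    "degree, above median": (2.388, 0.710),
}

for label, (mi, other) in reported.items():
    print(f"{label}: {pct_increase(mi, other):.1f}%")
```

The displayed rates are rounded, so recomputing a ratio from them (for example 0.32 / 0.24) will not reproduce the reported 1.353 exactly. For the same reason, the above-median citations panel recomputes to about 187% rather than the reported 186.8%.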
Inventor emigration from MI, pre- and post-MARA (1985). Figure panels: 1983, 1984, 1986, 1987. Guan-Cheng Li and Laurent El-Ghaoui
Best inventors piling up in states which do not enforce noncompetes M. Marx and L. Fleming, 2012. Noncompetes: Barriers to Exit and Entry? National Bureau of Economic Research Innovation Policy and the Economy, 12: 39-64. University of Chicago Press.
Noncompetes can be bad for firms too. Bump in acquisitions, Tobin's q, post MARA. NCs trap HK in firms. But firms go stale when they can't hire. Younge, Tong, Fleming; Younge and Marx. The real Ken Younge.
Implications for Science and Innovation Policy. NCs decrease diffusion of people - and ideas - within regions. Drive best people - and ideas - to regions that do not enforce! Managers at incumbent firms like them at first: provides a hiring shield but takes away your sword! Firms fall behind tech frontier because they cannot hire fresh blood. Active policy controversy: MA considering weakening noncompetes; GA just strengthened; China just weakened.
Coming attractions/discussion. Torvik research group: linked PubMed and USPTO disambiguations! API for programmatic access. Move beyond citations as measure of value. Community built and validated curation standards. How can we become a more cohesive and productive community?
http://abel.lis.illinois.edu/resources.html
News around patent #6,505,559: topics. Joint with Laurent El-Ghaoui, UC Berkeley. Figure: after hits / before hits. Panels: before patent filing; after Nobel announcement.
Verification Standards. Gold: personal validation. Silver: friend-of-a-friend. Bronze: scraped resumes. Educated guess. Synthetic: aka plastic. Need public contribution and supported wiki!
Towards a cohesive and productive community. Preserve precedence while providing ongoing and intermediate results. Build community assets: source code, all of it; data, all of it; results, all of them (after publication). High standards in code development: revision control; test coverage to validate implementation. And finally: collective effort and support!