Combining Large Datasets of Patents and Trademarks


Grid Thoma
Computer Science Division, School of Science & Technology, University of Camerino
14th Italian STATA User Annual Meeting, Florence, 16 Nov 2017

Motivations
- Where do innovators come from? Location, industry, cohort, size, listing, VC, ...
- How to correctly appraise IP counts at the level of the patentee's portfolio?
  - Patents, trademarks, and designs
  - EPO, WIPO, USPTO, ...
  - Families of priority links
  - Citations / self-citations
- The problem of harmonizing entity names

Different spellings/misspellings
MINNESOTA MINING AND MANUFACTURING COPANY
MINNESOTA MINING AND MANUFACTURING COPMANY
MINNESOTA MINING AND MANUFACTURING CORP
BSH BOSCH UND SIEMENS AKTIENGESELLSCHAFT
BSH BOSCH UND SIEMENS AKTINGESELLSCHAFT
BSH BOSCH UND SIEMENS HANSGERAETE GMBH
BSH BOSCH UND SIEMENS HAUS-GERAETE GMBH
BSH BOSCH UND SIEMENS HAUSERATE GMBH

Variations in naming conventions
MINNESOTA MINING & MFG CO
3M CORP
MINNESOTA & MINING MANUFACTURING...
INTERNATIONAL BUSINESS MACHINES
IBM
IBM CORP. (INTERNATIONAL BUSINESS MACHINES)
IBM CORPORATION (INTERNATIONAL BUSINESS MACHINES)

Assignment to aggregate entities (ownership issues)
Subsidiaries with parent MINNESOTA MINING & MFG CO:
ADHESIVE TECHNOLOGIES INC
AVI INC
D L AULD CPY
DORRAN PHOTONICS INCORPORATED
EOTEC CORPORATION
NATIONAL ADVERTISING CPY
RIKER LABORATORIES INC
TRIM LINE INC

Sources
- NBER Patent Data Project (harmonized entity names): sites.google.com/site/patentdataproject
- USPTO's data disclosure initiative (in STATA files): www.uspto.gov/economics
- Magerman et al. (2006). Data production methods for harmonized patent statistics: Patentee name standardization. KU Leuven FETEW MSI.
- Thoma et al. (2010). Harmonizing and combining large datasets: an application to firm-level patent and accounting data. NBER WP # 15851.

Agenda
- Background
- Dataset
- Software creation and results
- Quality checks


Dictionary-based approach
- Large collections of entity names, serving as examples for a specific entity class
- Exact matching of dictionary entries, OR fuzzifying the dictionary by (automatically) generating typical spelling variants for every entry
- The problem of recall rate (e.g. ANSI / UNICODE)

Articulation of a dictionary
- Every known variation of an entity name
- Harmonized to one agreed standard name

Existing dictionaries of patenting entity names
- USPTO / EPO standard patentee codes
- DERWENT patentee codes
- NBER Patent Data Project (file: patassg.dta): sites.google.com/site/patentdataproject
- Harmonization procedure to build a dictionary (Magerman et al. 2006)

Magerman et al. (2006)'s procedure
1. Character cleaning
2. Punctuation cleaning
3. Legal form indication treatment
4. Spelling variation harmonization
5. Umlaut harmonization
6. Common company name removal
7. Creation of a unified list of entity names
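A few of these cleaning steps can be illustrated with a minimal Python sketch. It is not the procedure of Magerman et al. (2006): the function name `harmonize` and the two lookup tables are hypothetical and deliberately tiny, and only character, punctuation, legal form, and umlaut handling are covered.

```python
import re
import unicodedata

# Hypothetical, deliberately short lookup tables (assumptions for illustration).
LEGAL_FORMS = {
    "INCORPORATED": "INC",
    "CORPORATION": "CORP",
    "COMPANY": "CO",
    "LIMITED": "LTD",
}
UMLAUTS = {"Ä": "AE", "Ö": "OE", "Ü": "UE"}


def harmonize(name: str) -> str:
    """Return a simplified standard form of an entity name."""
    text = name.upper()
    # Umlaut harmonization: map umlauts to their two-letter spellings.
    for char, replacement in UMLAUTS.items():
        text = text.replace(char, replacement)
    # Character cleaning: strip any remaining accents.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(c for c in text if not unicodedata.combining(c))
    # Punctuation cleaning: spell out "&", turn other punctuation into spaces.
    text = text.replace("&", " AND ")
    text = re.sub(r"[^A-Z0-9 ]", " ", text)
    # Legal form indication treatment: map long forms to one abbreviation.
    tokens = [LEGAL_FORMS.get(token, token) for token in text.split()]
    return " ".join(tokens)


# Creation of a unified list: variants collapse to one standard name.
variants = ["MINNESOTA MINING AND MANUFACTURING COMPANY",
            "Minnesota Mining & Manufacturing Co."]
unified = {harmonize(v) for v in variants}
```

Both variants above collapse to a single entry in `unified`, which is the purpose of building the dictionary.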

Rule-based approach
- Definition of rules to compare the similarity of names (Thoma et al. 2010)
- Initially, hand-crafted rules to describe the composition of named entities and their context
- Some core words and components of words are used to extract candidates for more complex names, OR vice versa

Approximate string matching algorithms (1)
- Edit distance: the minimum number of operations to switch from one word to another
- Typically used to account for spelling variations
- Similarity of two strings $x$ and $y$ of length $n_x$ and $n_y$ calculated as $1 - d/N$, where 1 is the maximum similarity, $d$ is the distance between $x$ and $y$, and $N = \max\{n_x, n_y\}$.
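As a minimal sketch of this similarity, the Python functions below compute the Levenshtein edit distance (insertions, deletions, substitutions) by dynamic programming and then 1 - d/N. The function names are hypothetical, and the choice of Levenshtein as the edit distance is an assumption, since the slides do not name a specific variant.

```python
def edit_distance(x: str, y: str) -> int:
    """Levenshtein distance between x and y (assumed variant)."""
    previous = list(range(len(y) + 1))
    for i, char_x in enumerate(x, start=1):
        current = [i]
        for j, char_y in enumerate(y, start=1):
            cost = 0 if char_x == char_y else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def edit_similarity(x: str, y: str) -> float:
    """Similarity 1 - d/N with N = max(len(x), len(y))."""
    longest = max(len(x), len(y))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1 - edit_distance(x, y) / longest
```

For instance, "AB ELECTRONIK GMBH" and "AB ELEKTRONIK GMBH" differ by one substitution over 18 characters, giving a similarity of 1 - 1/18.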

Edit distance: examples
1. HILLE & MUELLER GMBH & CO. / HILLE & MULLER GMBH & CO KG / HILLE & MÜLLER GMBH & CO KG
2. AB ELECTRONIK GMBH / AB ELEKTRONIK GMBH
3. BHLER AG / BAYER AG

Approximate string matching algorithms (2)
- Jaccard similarity measure: number of unique common tokens of two strings divided by the number of tokens in the union

  $$J = \frac{|T_1 \cap T_2|}{|T_1 \cup T_2|}$$

- Computationally easy J similarity measure:

  $$J = \frac{2\,|T_1 \cap T_2|}{|T_1| + |T_2|}$$
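Both token-based measures can be sketched in a few lines of Python. The function names are hypothetical, and splitting names on whitespace is an assumption about how tokens are formed.

```python
def tokens(name: str) -> set:
    """Unique whitespace-separated tokens of an upper-cased name."""
    return set(name.upper().split())


def jaccard(a: str, b: str) -> float:
    """Common tokens divided by tokens in the union."""
    t1, t2 = tokens(a), tokens(b)
    union = t1 | t2
    if not union:
        return 1.0  # two empty names are treated as identical
    return len(t1 & t2) / len(union)


def jaccard_easy(a: str, b: str) -> float:
    """Twice the common tokens divided by the sum of the token counts."""
    t1, t2 = tokens(a), tokens(b)
    total = len(t1) + len(t2)
    if total == 0:
        return 1.0
    return 2 * len(t1 & t2) / total
```

The second form avoids building the union, since it only needs the intersection and the two token counts. For "AAE HOLDING" and "AGRIPA HOLDING" the two functions return 1/3 and 1/2 respectively, so the measures rank pairs similarly but are not numerically equal.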

Jaccard similarity: examples
1. AAE HOLDING / AAE TECHNOLOGY INTERNATIONAL
2. JAPAN AS REPRESENTED BY THE PRESIDENT OF THE UNIVERSITY OF TOKYO / PRESIDENT OF TOKYO UNIVERSITY
3. AAE HOLDING / AGRIPA HOLDING
4. VBH DEUTSCHLAND GMBH / IBM DEUTSCHLAND GMBH

Approximate matching algorithms (3)
- Weighted Jaccard similarity measure
- Inversely weighted by the frequency $n_i$ of a given token $i$ across different entity names

  $$J_w(X, Y) = \frac{2 \sum_{k:\, x_k \in X \cap Y} w_k}{\sum_{i:\, x_i \in X} w_i + \sum_{j:\, y_j \in Y} w_j}, \qquad \text{where } w_i = \frac{1}{\log n_i + 1}$$
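A minimal Python sketch of the weighted measure follows. Several details are assumptions not stated on the slide: the natural logarithm, the weight read as 1 / (log(n) + 1), frequency counted as the number of names containing a token, and a default weight of 1 for tokens absent from the dictionary. The function names are hypothetical.

```python
import math
from collections import Counter


def token_weights(names):
    """Weight 1 / (log(n) + 1), n = number of names containing the token."""
    frequency = Counter(
        token for name in names for token in set(name.upper().split())
    )
    return {token: 1.0 / (math.log(n) + 1.0) for token, n in frequency.items()}


def weighted_jaccard(x: str, y: str, weights) -> float:
    """Weighted J: 2 * weight of shared tokens / (weight of x + weight of y)."""
    tokens_x, tokens_y = set(x.upper().split()), set(y.upper().split())

    def weight(token):
        # Assumption: unseen tokens behave like tokens seen once (weight 1).
        return weights.get(token, 1.0)

    denominator = (sum(weight(t) for t in tokens_x)
                   + sum(weight(t) for t in tokens_y))
    if denominator == 0:
        return 0.0
    return 2 * sum(weight(t) for t in tokens_x & tokens_y) / denominator


names = ["AAE HOLDING", "AGRIPA HOLDING", "AAE TECHNOLOGY INTERNATIONAL"]
weights = token_weights(names)
score = weighted_jaccard("AAE HOLDING", "AGRIPA HOLDING", weights)
```

Because HOLDING occurs in two names, it carries less weight than the rarer AGRIPA, so `score` falls below the unweighted value of 0.5 for the same pair.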


Patent and trademark datasets
- Patenting entity names at the USPTO
  - Reference dictionary (NBER Patent Data Project)
  - A unique ID code for a patentee (file: patassg.dta)
- Trademarking entity names at the USPTO
  - www.uspto.gov/economics (file: owner.dta)
- Time coverage: Patents: 1976-2006; Trademarks: 1977-2015
- Focus: US business organizations
  - 117,443 unique ID codes from the reference dictionary
  - 3,462,601 (unharmonized) trademarking entity names
  - Entity name matching executed within state level

Harmonization of address information
- Only state & city info in patent records
- Full address info for trademarks
  - 5-digit zip codes in 98.5% of the US addresses
- Harmonization of city names
  - Removing numbers & non-standard chars
  - Geocoding based on geonames.usgs.gov
  - Edit distance / Soundex for matching city names
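The city-name steps can be illustrated with a short Python sketch: a cleaning function that removes numbers and non-standard characters, and a standard American Soundex code for comparing spellings. The function names are hypothetical, and the geocoding step is not covered.

```python
import re

# Standard Soundex digit groups.
_CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        _CODES[letter] = digit


def clean_city(city: str) -> str:
    """Drop digits and non-letter characters, collapse spaces, upper-case."""
    text = re.sub(r"[^A-Za-z ]", " ", city).upper()
    return " ".join(text.split())


def soundex(word: str) -> str:
    """Four-character American Soundex code; empty string if no letters."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    previous = _CODES.get(first, "")
    for letter in letters[1:]:
        if letter in "HW":
            continue  # H and W do not separate letters with the same code
        code = _CODES.get(letter, "")
        if code and code != previous:
            digits.append(code)
        previous = code  # vowels reset the previous code
    return (first + "".join(digits) + "000")[:4]
```

With these definitions, `clean_city("St. Paul 55101")` yields "ST PAUL", and "PITTSBURGH" and "PITTSBURG" share one Soundex code, so the two spellings would be treated as candidates for the same city.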


STATA implementation (1)
- An augmented harmonization procedure to create a dictionary for the trademarking entity names (Thoma et al. 2010)
- J_w similarity measure for the matching of the patenting & trademarking entity name dictionaries
- Location information to reduce false positives and false negatives
- Manual inspection to improve accuracy and matching rate
- Improvement of dictionary use through priority links

STATA implementation (2)
1. Reshape entity names as tokens in long format
2. Remove non-standard chars & numbers
3. Drop single-char tokens
4. Pool tokens to create a dictionary of tokens
5. Inflate the dictionary with tokens from patent titles / wordmarks (improving statistical weights)
6. Drop stop words (frequent/non-discriminating)
7. Compute the defined statistical weight of a token

STATA implementation (3)
8. Merge files based on tokens and state-level codes of an entity name
9. Collapse the tokens' statistical weights to compute the J_w measure's numerator of a matched pair
10. Compute the J_w measure, including the denominator
11. Sort matched pairs based on the J_w measure, selecting the best match
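The flow of steps 1 to 11 can be sketched in Python as follows. This is not the Stata code behind the slides: the function names, the stop-word list, and the record layout `(id, name, state)` are hypothetical, the weight formula 1 / (log(n) + 1) with a natural logarithm is an assumption, and step 5 (inflating the dictionary with patent titles and wordmarks) is left out.

```python
import math
from collections import Counter, defaultdict

STOP_WORDS = {"INC", "CORP", "CO", "LLC"}  # hypothetical stop-word list


def tokenize(name):
    # Steps 1-3 and 6: split into tokens, keep letters only, drop
    # single-character tokens and stop words.
    cleaned = "".join(c if c.isalpha() else " " for c in name.upper())
    return {t for t in cleaned.split() if len(t) > 1 and t not in STOP_WORDS}


def best_matches(patentees, owners):
    """patentees, owners: lists of (id, name, state). Returns id -> (J_w, id)."""
    pat_tokens = {pid: tokenize(name) for pid, name, _ in patentees}
    own_tokens = {oid: tokenize(name) for oid, name, _ in owners}
    # Steps 4 and 7: pooled token dictionary and statistical weights.
    pooled = list(pat_tokens.values()) + list(own_tokens.values())
    frequency = Counter(t for toks in pooled for t in toks)
    weight = {t: 1.0 / (math.log(n) + 1.0) for t, n in frequency.items()}
    # Step 8: index owners by (state, token), i.e. merge within state.
    index = defaultdict(set)
    for oid, _, state in owners:
        for t in own_tokens[oid]:
            index[(state, t)].add(oid)

    def total(toks):
        return sum(weight[t] for t in toks)

    result = {}
    for pid, _, state in patentees:
        # Step 9: collapse shared-token weights into the numerator.
        shared = defaultdict(float)
        for t in pat_tokens[pid]:
            for oid in index.get((state, t), ()):
                shared[oid] += weight[t]
        # Steps 10-11: full measure, then keep the highest-scoring pair.
        scored = [(2 * num / (total(pat_tokens[pid]) + total(own_tokens[oid])), oid)
                  for oid, num in shared.items()]
        if scored:
            result[pid] = max(scored)
    return result


patentees = [("P1", "Riker Laboratories Inc", "MN")]
owners = [("T1", "RIKER LABORATORIES, INC.", "MN"),
          ("T2", "Riker Foods LLC", "MN"),
          ("T3", "Riker Laboratories", "CA")]
matches = best_matches(patentees, owners)
```

Indexing by state and token means only pairs that share at least one token within the same state are ever scored, which keeps the number of comparisons manageable for millions of names.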

Figure 1: Share of US business patentees matched with trademarks (Notes: States with 1000+ patentees; Source: USPTO). Bar chart; y-axis: 0% to 100%; x-axis: 2-digit state code (IL MA WI MO MN DE OH IN PA NC CT NY GA NJ CA TN KS VA WA OR MD UT CO TX FL MI AZ OK). Series: share of patentees; weighted by patents; weighted by marks. Kruskal-Wallis rank test accepted (p=0.998).


Selection of the best match
- Below a certain threshold of J_w, select the best match with the highest J_w
- Define a goodness index (matching score) of a matched pair using J_w & address information (state-city correspondence)
- Manual inspection in order to define the appropriate thresholds of the matching score
- Select the best match with the lowest matching score

Selection of the best match through the matching score
For each matched name, a mutually exclusive goodness score from 1 to 9 is given, where:

J_w similarity measure   Same location   Unknown location   Different location
J_w ≥ 67%                1               2                  3
57% ≤ J_w < 67%          4               5                  6
47% ≤ J_w < 57%          7               8                  9

Thresholds defined through manual scrutiny.
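The score grid can be written as a small lookup function. The Python sketch below assumes the thresholds are inclusive lower bounds (J_w at or above 67%, 57%, and 47%) and that the cell for the lowest band with the same location is 7, consistent with scores running from 1 to 9 without repetition. The function name and the location labels are hypothetical.

```python
def matching_score(jw: float, location: str):
    """Goodness score 1-9 (lower is better), or None below the 47% threshold.

    jw is the J_w similarity as a fraction; location is "same",
    "unknown", or "different".
    """
    columns = {"same": 0, "unknown": 1, "different": 2}
    if location not in columns:
        raise ValueError(f"unknown location label: {location!r}")
    if jw >= 0.67:
        row = 0
    elif jw >= 0.57:
        row = 1
    elif jw >= 0.47:
        row = 2
    else:
        return None  # pair is not scored
    return 3 * row + columns[location] + 1
```

Among several candidates for one patentee, the pair with the lowest returned score would be kept.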

Figure 2: Distribution of the matching score of the matched names: US business patentees matched to the trademarking entity names. Bar chart; x-axis: matching score values 0 to 9 (lower is better), with 0 labelled "With priority links and manually matched"; y-axis: 0% to 70%. Data labels: 56.1%, 16.7%, 14.7%, 0.0%, 4.7%, 0.0%, 7.5%, 0.0%, 0.0%, 0.1%.

Improvement of dictionary usage through priority links
- Priority links in patents and trademarks
- Potential limitations
  - Copatentees of a patent/trademark
  - Entity name changes (synonymies)
  - Subsidiaries
  - Distinct entity names
  - Entity address changes

Harmonization tasks of entity names through priority links
- Focus on the trademarking entity names
- Retrieve forward/backward priority links
- Consolidate links to create self-contained families of priorities
- Manual scrutiny in merging families with standard entity names
- In the overall dataset, propagate standard entity names using perfect name matching and the same zip code
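Consolidating forward and backward links into families amounts to finding connected components. A minimal Python sketch using a union-find structure is shown below; the function name and the representation of links as pairs of record IDs are assumptions for illustration, not the procedure used for the slides.

```python
def build_families(links):
    """Group record IDs into families connected by priority links.

    links: iterable of (id_a, id_b) pairs; direction is ignored.
    Returns a list of sets, one per family.
    """
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two families

    families = {}
    for node in list(parent):
        families.setdefault(find(node), set()).add(node)
    return list(families.values())


families = build_families([("A", "B"), ("B", "C"), ("D", "E")])
```

Here A, B, and C end up in one family and D and E in another; a standard entity name assigned to any member can then be propagated to the rest of its family.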

Diagnostics: resolving duplicate matching candidates (potential)
- The earliest patenting entity
- Technological-market affinity
- Name changes over time
- Ownership structure of companies

Figure 3. Time lag of the first trademark since year of the first patent (Notes: US business patentees active with patenting & trademarking during 1981-2003; Source: USPTO). Bar chart; x-axis: lag in years, from "-5 or more" to "5 or more"; y-axis: 0% to 25%. Series: overall dataset; small firms (less than 500 employees).