A Knowledge Discovery Framework for XML-Literature-Data


Lixue Zou*, Li Wang, Xiaoli Chen, Xiwen Liu
National Science Library, Chinese Academy of Sciences
zoulx@mail.las.ac.cn

Contents
Literature Review
Motivation
Data Processing Methodology
Literature Data Mining
  Concepts and Topic Clustering
  Substances and roles
  Link scientific research to industry
Conclusion
Future Work

Literature Review
XML-Literature-Data collection: Scientometric analysis, text information extraction, and text mining have recently been applied to knowledge discovery on literature data modeled in XML, including publication and patent data. Existing methods use either the paper data or the patent data from the XML data; a method that extracts both publication and patent data from the same XML data file has not been seen.
Different source data mining: Scientific research and development play important roles in enhancing national competitiveness, so knowledge discovery from literature data has become a strategic endeavor. Publication and patent data are usually retrieved from different databases that do not share the same indexing system, which prevents a comparative study at the same level.

Motivation
CAplus, a database of Chemical Abstracts Service and the world's largest repository of information on chemistry and related publications, provides XML data and covers both papers (journal articles, preprints, conference articles, dissertations, and books) and patents in one database. One of its merits is that all the data are indexed under the same system, including concepts, substances and roles, commercial or government entities, publication sources, and various other data entities. The indexing terms can therefore be used for deeper mining and for comparisons between papers and patents.

Motivation
Our study focuses on a knowledge discovery framework for XML-Literature-Data based on the CAplus database. It presents methods for text information extraction and text mining on XML data from CAplus. Its distinctive features are the integrated use of CAplus indexing data for both papers and patents, and the systematic, comparative exploration of distribution trends in topics, substance roles, and industrial mapping. The study is intended as a reference for scientists and developers, policy makers, industry, and business.

Data Processing Methodology
Data Collection
Data Exploration
Data Transformation: run the Java program to transform the data
Knowledge Discovery
Featured Data Mining
Data Cleaning
Data Validation and Verification

Data Exploration
Structure of the data: Documents, Indexing, Family, Substances, Nomenclature, Keymap
Attributes collected: Title, Author, Abstract, Date, Concept, Substance, Document

Data Transformation
Keymap analysis: analyze the document-centric key map between all documents and any associated indexing.
Reclassification: restructure the data so that each file contains one document and its associated indexing, and separate the papers from the patents.
Data extraction: extract each attribute for papers and for patents.
Output: comma-separated values (CSV).
Challenge: each substance needs to be matched to its function.
Inputs shown in the diagram: map files, XML path.
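The transformation steps above can be sketched in a few lines. This is only an illustration under stated assumptions: the slides describe a Java program, and Python is swapped in here for brevity. The element and attribute names (`document`, `type`, `concept`, `substance`, `role`) and the sample records are hypothetical, because the actual CAplus XML schema is not shown in the slides. The sketch separates papers from patents and keeps each substance on the same row as its role, which is one possible way to address the matching challenge.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical record layout; the real CAplus XML schema is not given in the slides.
SAMPLE_XML = """
<records>
  <document id="D1" type="paper">
    <title>Graphene oxide films</title>
    <concept>Electric conductivity</concept>
    <substance name="Graphene" role="Properties"/>
  </document>
  <document id="D2" type="patent">
    <title>Battery electrode</title>
    <concept>Secondary batteries</concept>
    <substance name="Graphene" role="Technical or engineered material use"/>
    <substance name="Lithium" role="Uses"/>
  </document>
</records>
"""

FIELDS = ["doc_id", "doc_type", "title", "concepts", "substance", "role"]


def extract_rows(xml_text):
    """Yield one flat row per (document, substance) pair."""
    root = ET.fromstring(xml_text.strip())
    for doc in root.iter("document"):
        title = doc.findtext("title", default="")
        concepts = "; ".join(c.text or "" for c in doc.findall("concept"))
        for sub in doc.findall("substance"):
            # Substance and role come from the same element, so they stay matched.
            yield {
                "doc_id": doc.get("id"),
                "doc_type": doc.get("type"),
                "title": title,
                "concepts": concepts,
                "substance": sub.get("name"),
                "role": sub.get("role"),
            }


def to_csv(rows):
    """Serialize rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


rows = list(extract_rows(SAMPLE_XML))
# Reclassification step: separate the papers from the patents.
papers = [r for r in rows if r["doc_type"] == "paper"]
patents = [r for r in rows if r["doc_type"] == "patent"]
print(to_csv(papers))
print(to_csv(patents))
```

Writing one row per document-substance pair is a design choice of this sketch; a document with no substances would produce no rows and would need separate handling.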

Data Cleaning and Statistics
Data cleaning: remove noise and merge different spellings of one entity.
A case study: Global Graphene Research
Data statistics:
  78756 papers and 23057 patents, covering all years
  9424 concepts in papers, 8471 concepts in patents
  19413 substances in papers, 27568 substances in patents
  Publication year range: 1985-2017 for papers, 1997-2017 for patents
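Merging different spellings of one entity might look like the sketch below. The normalization rules (lowercasing, treating hyphens and underscores as spaces, collapsing whitespace) and the sample terms are assumptions chosen for illustration; the slides do not state which cleaning rules were actually applied.

```python
import re
from collections import Counter


def normalize(term):
    """Lowercase, treat hyphens/underscores as spaces, collapse whitespace."""
    term = term.lower()
    term = re.sub(r"[-_]+", " ", term)
    return re.sub(r"\s+", " ", term).strip()


def merge_variants(terms):
    """Group spelling variants and count them under the most frequent spelling."""
    groups = {}
    for term in terms:
        key = normalize(term)
        if not key:
            continue  # drop empty entries as noise
        groups.setdefault(key, Counter())[term] += 1
    merged = {}
    for counter in groups.values():
        label = counter.most_common(1)[0][0]
        merged[label] = sum(counter.values())
    return merged


# Hypothetical raw index terms with spelling variants and one noise entry.
raw = [
    "Graphene oxide",
    "graphene-oxide",
    "Graphene  Oxide",
    "Graphene oxide",
    "Carbon nanotubes",
    " ",
]
merged = merge_variants(raw)
print(merged)
```

Rule-based normalization of this kind only catches surface variants; synonyms or abbreviations of the same entity would need a curated mapping.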

Concepts and Topic Clustering
Differences in topic clustering between papers and patents (figure panels: Papers, Patents)
The concept clustering shows that papers focus on fundamental research, such as material structure and electrical properties, while patents concentrate on hot topics in application fields, such as batteries and semiconductors.

Substances and roles
Different roles of substances in papers and patents
In papers, the roles of substances relate to properties, while in patents they relate more to technical or engineered material use, industrial manufacture, or additive use.

Link scientific research to industry
Data extraction and processing
  Paper or patent data: extract words from the title, abstract, and concepts; apply Term Frequency - Inverse Document Frequency (TF-IDF) processing to obtain the characteristic words of the analyzed data.
  Industrial classification: extract words from the classification notes of the International Standard Industrial Classification of All Economic Activities (ISIC); apply TF-IDF processing to obtain the characteristic words of ISIC.
Data mining
  Build a word library based on the vector space model.
  Compute the cosine similarity between the vector of the searched text and the industry vectors, and determine the threshold.
  Mapping: find the relationship between scientific research and industry, including industry coverage, economic activities, and industry focus.
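The TF-IDF and cosine similarity steps can be illustrated with a small, dependency-free sketch. Several things here are assumptions and not taken from the slides: the smoothed IDF variant, the simple letter-only tokenizer without stop-word removal, the threshold value of 0.2, and the short texts standing in for ISIC classification notes and literature records. The cosine similarity of two weight vectors a and b is their dot product divided by the product of their Euclidean norms.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphabetic tokens only (no stop-word removal)."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict of term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed IDF; the slides do not say which IDF variant was used.
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        total = len(toks) or 1
        vectors.append({t: (cnt / total) * idf[t] for t, cnt in tf.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def map_to_industry(literature, industry_notes, threshold):
    """Link each literature record to the industry classes above the threshold."""
    lit_ids, ind_ids = list(literature), list(industry_notes)
    texts = [literature[i] for i in lit_ids] + [industry_notes[i] for i in ind_ids]
    vecs = tfidf_vectors(texts)
    lit_vecs, ind_vecs = vecs[: len(lit_ids)], vecs[len(lit_ids):]
    return {
        lid: [iid for iid, iv in zip(ind_ids, ind_vecs) if cosine(lv, iv) >= threshold]
        for lid, lv in zip(lit_ids, lit_vecs)
    }


# Illustrative stand-ins, not real ISIC notes or CAplus records.
industry_notes = {
    "Manufacture of batteries and accumulators":
        "manufacture of batteries accumulators electrode cells",
    "Manufacture of textiles":
        "spinning weaving finishing of textiles fibres",
}
literature = {
    "P1": "graphene electrode for lithium batteries with high capacity cells",
    "P2": "electronic band structure in graphene monolayer",
}
THRESHOLD = 0.2  # hypothetical; the slides only say a threshold is determined

links = map_to_industry(literature, industry_notes, THRESHOLD)
print(links)
```

In this toy data the battery-related record is linked to the battery class, while the band-structure record shares no terms with either class and is left unlinked. The threshold controls that trade-off and would need to be tuned on real data.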

Link scientific research to industry
Industrial mapping of papers and patents
The papers are linked to industrial classes such as electrical equipment and consumer electronics; by contrast, few patents are linked to these areas.

Conclusion
We introduced literature research methods for knowledge discovery and proposed a knowledge discovery framework for XML-Literature-Data, tailored to the CAplus database.
We designed a customized tool for CAplus data transformation that maps the XML data files into an internal processing file format.
We presented data mining methods that indicate the differences between fundamental research and technology development, based on the same indexing system.

Future Work
Extend the customized tool to extract citation data.
Add literature data mining methods: citation network analysis, topic modeling for concepts, and deeper mining of substances and roles.
Study and understand the relationship between fundamental research and technology development.


Thank you!
zoulx@mail.las.ac.cn