A Cross-Database Comparison to Discover Potential Product Opportunities Using Text Mining and Cosine Similarity


Journal of Scientific & Industrial Research
Vol. 76, January 2017, pp. 11-16

Yung-Chi Shen¹, Grace T R Lin²*, Jan-Ruei Lin³ and Chun-Hung Wang⁴

¹Department of BioBusiness Management, National Chiayi University, 580, Sinmin Road, Chiayi 600, Taiwan
*²Institute of Technology Management, National Chiao Tung University, 1001, Ta-Hsueh Road, Hsinchu 300, Taiwan
³,⁴Institute of Technology Management, National Chiao Tung University, 1001, Ta-Hsueh Road, Hsinchu 300, Taiwan

*Author for Correspondence. E-mail: gtrl@faculty.nctu.edu.tw

Received 22 August 2015; revised 13 June 2016; accepted 19 October 2016

With the rise of the Internet in recent decades, commentary, news, and further information on new products have become available on various web sites. Thus, the Internet has become an abundant source of market intelligence. In addition, from the perspective of the technology life cycle, technological applications are first presented in patent databases and then disseminated to popular media. Therefore, the interaction between technological applications coded by patents and market intelligence reported on the Internet assists in discovering potential product opportunities. This study intends to identify the areas where technological applications exist but Internet reports do not, by exploring the corresponding relationships between patents and media reports. Such areas could be identified as potential product opportunities. Text mining and the arbitrarily oriented projected cluster generation (ORCLUS) algorithm are employed to classify important fields of patents and media reports. The cosine similarity of TF-IDF is then used to detect the relationship between patents and media reports. Remote health monitoring technology serves as the case in this study. The results show four product opportunities, namely: wireless sensor devices, telecommunication systems and technology, wearable devices and systems, and medical services and systems.

Keywords: Product Opportunity, Text Mining, ORCLUS, Cosine Similarity of tfidf, Remote Health Monitoring Technology

Introduction

To enhance organizational competitiveness, identifying potential product opportunities has been crucial for product developers [1-3]. To identify a potential product opportunity, it is important not only to synthesize and analyze voice-of-the-customer data but also to quickly turn these data into actionable knowledge [4]. With the rise of the Internet, commentary, news, and further information on new products are collected by web sites [5]. Therefore, the Internet has been adopted as a useful data source from which to extract abundant market intelligence and product development trends [6]. Nevertheless, technological applications presented in the news or on the Internet are at a late stage of the technology life cycle [7], which implies that media reports on new products are based on mature technology. From the perspective of the technology life cycle, technological development is presented in patent databases and then disseminated to newspapers or popular media [7]. By comparing different data sources, managers can not only understand the interaction between R&D and the market but also reduce uncertainty [8].

As shown in Table 1, the relationship between patents and Internet reports can be categorized into four types based on the existence of patents and of popular press coverage. A particular technology is in area A when patents coding the technology and Internet reports co-evolve. In this area, we expect patents related to the technology and Internet reports to have a reciprocal influence, while the intensity of integration varies depending on the extent to which the patents have been commercialized. In area B, although patents exist, there are hardly any media reports, whereas the opposite applies to area C. Neither engineers nor practitioners pay attention to area D. Area C can be regarded as a case of overly applied reports, or it may reflect a possible lack of patenting.

Table 1 Relationships between patents and Internet reports

                        Internet Reports
Patents                 Exist       Do Not Exist
Exist                   A           B
Do Not Exist            C           D

Conversely, topics in area B could represent opportunities for industry to commercialize in the near future. Therefore, investigating the correlation between patents and media enables the identification of potential product opportunities.

To estimate the relationship between patents and Internet reports, the semantic similarity of topics in both patents and Internet reports is suggested [8]. Before comparing topics in technology and Internet reports, the representative topics discussed in each layer have to be extracted. Text mining and citation mining are the two main streams of methods for categorizing documents and extracting topics [8]. However, citation-based methods are not appropriate in this study because media reports lack citations. Instead, the text-mining method is applied to detect emerging topics. Using the arbitrarily oriented projected cluster generation (ORCLUS) algorithm, patents and Internet reports are categorized in each layer [9, 10]. The relationship between patents and Internet reports is then detected by cosine similarity [11].

Previous studies comparing different databases tend to focus on exploring technological opportunities rather than on new opportunities at the product level [8, 10, 11]. This study intends to identify the areas where technological development exists but Internet reports do not, by exploring the corresponding relationships between patents and media reports. As shown in Table 1, such areas could be identified as potential product opportunities. Remote health monitoring technology is used to demonstrate the identification of product opportunities because this technology is emerging and is directly adopted by end users [12]. The process of this study could serve as a possible approach to detecting potential product opportunities.

Research Process

The research process of this study includes four stages: document collection, text mining, clustering for patents and Internet reports, and identification of product opportunities. Figure 1 illustrates the research process. The detailed process is described as follows.

Fig. 1 Research process

Stage 1 Document Collection

In stage 1, patent documents and Internet reports were collected. The patent database providing the most abundant patents was used. The database of the United States Patent and Trademark Office (USPTO) is one of the favored sources for patent searches because the U.S. is an important market for technology transfer and international trade, which, combined with the territoriality of patent protection, lures inventors to file patent applications in the U.S. [13]. Therefore, this study adopted the USPTO database as the source for patent collection. For the collection of Internet reports, this study selected the top 10 technology news websites in the United States, as ranked by Similar Web. These websites and their respective rankings are presented in Table 2.

After the selection of databases, the search strategy for retrieving documents related to remote health monitoring technology was to submit keywords to the patent and news databases. The keywords adopted to retrieve documents include remote health monitoring, remote health care, remote healthcare, wearable health monitoring, wearable health care, wearable healthcare, smart health monitoring, smart health care, smart healthcare, intelligent health monitoring, intelligent health care, and intelligent healthcare. This set of keywords was used to search the abstract and claims fields of patents. The same keywords were searched on the top 10 technology news websites shown in Table 2 to collect media reports on new products related to remote health monitoring technology. Internet technology news and patents issued before the end of 2015 were searched.

Table 2 Top 10 Technology News Websites in the United States

Rank    Domain Name
1       cnet.com
2       lifehacker.com
3       gizmodo.com
4       wired.com
5       engadget.com
6       pcworld.com
7       digitaltrends.com
8       techcrunch.com
9       arstechnica.com
10      techradar.com

Stage 2 Text Mining

The second stage conducts preprocessing and document representation by a vector space matrix. In preprocessing, this study stripped word suffixes to retrieve their radicals (stemming), removed stopwords, numbers, and punctuation, eliminated whitespace, and converted characters to lower case. Extracted terms that do not appear in most documents are called sparse terms. Due to their low usefulness, this study excluded them to reduce noise. The purpose of document representation by a vector space matrix is to organize a document-term matrix, where each cell records the term's importance in a given document. This study adopted the widely used term frequency-inverse document frequency (TF-IDF) weighting measure [14].

Stage 3 Clustering for Patents and Internet Reports

The document-term matrices extracted in Stage 2 were two high-dimensional spaces, and the extracted terms could be inter-correlated. ORCLUS, a generalized projected clustering algorithm, was therefore used to cluster patent documents and Internet reports. To implement ORCLUS, the following four parameters were pre-specified: the final number of clusters (k), the dimensionality of the subspaces where the final clusters are concentrated (l), the initial number of clusters (k₀), and the factor for the cluster number reduction in each iteration of the algorithm (a, with a < 1). The field of remote health monitoring technology consists of different underlying knowledge streams and technologies. Thus, when implementing ORCLUS, this study specified the final number of clusters k as ranging from 5 to 10 for each corpus in question. ORCLUS performs well when the specified dimensionalities are between 2 and 8, provided the synthetic data is generated from 5 clusters, and the optimal value of dimensionality l in that setting is 6. This study widened the range and attempted dimensionalities from 2 to 12. The chosen k₀ was large so that the iterations began with a larger number of seeds, thereby improving the likelihood that each cluster would be covered by at least one seed [9]. For sensitivity analysis, several values of k₀ near the largest were additionally specified. This study set a as 0.75, which slowly reduced the number of clusters in the merging iterations [10].

A broad range of parameter settings was evaluated separately for the patent and Internet report corpuses, and appropriate settings were determined by referring to the cluster sparsity coefficients that ORCLUS provides to assess the performance of clustering results. Based on the best clustering results, patents in a cluster represented a technological application field with a similar concept, and Internet reports in a cluster represented a product field with a similar concept. This study applied the tm, Snowball, and ORCLUS packages in the R language for preprocessing, document representation, and implementation of the ORCLUS algorithm.

Stage 4 Identification of Product Opportunities

In stage 4, the semantic similarity between patent fields and Internet report fields was detected by the cosine similarity of TF-IDF. In this study, the quanteda package in the R language was employed to compute the cosine similarity of TF-IDF. Based on the relationships between mined patent and media report clusters, fields with technological development but no media reports could be identified as product opportunities, and those with media reports but no patenting activities could be identified as patenting opportunities.
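The Stage 4 comparison can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the study's implementation (which used the quanteda package in R): the field texts, the labels T_a, T_b, and P_a, the helper names, and the 0.05 cut-off are all hypothetical, and the TF-IDF variant used here (raw term count times log(N/df)) is only one common choice among several.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Weight each tokenized document as {term: tf * log(N / df)}."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return [
        {term: count * math.log(n_docs / doc_freq[term])
         for term, count in Counter(doc).items()}
        for doc in docs
    ]


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def unlinked_fields(similarity, threshold):
    """Patent fields whose similarity to every report field is <= threshold."""
    return [
        field for field, row in similarity.items()
        if all(score <= threshold for score in row.values())
    ]


# Hypothetical, already preprocessed field texts (one merged text per cluster).
patent_fields = {
    "T_a": "wearable sensor heart rate monitor wearable".split(),
    "T_b": "antenna signal transmission base station".split(),
}
report_fields = {
    "P_a": "smart watch wearable heart rate tracker".split(),
}

# Weight all field texts together so that they share one vocabulary and IDF.
labels = list(patent_fields) + list(report_fields)
vectors = dict(zip(labels, tfidf_vectors(
    list(patent_fields.values()) + list(report_fields.values()))))

# Patent-field by report-field similarity table.
similarity = {
    t: {p: cosine(vectors[t], vectors[p]) for p in report_fields}
    for t in patent_fields
}

THRESHOLD = 0.05  # assumed cut-off for this toy data only
candidates = unlinked_fields(similarity, THRESHOLD)
print(candidates)
```

On this toy data only T_b is flagged, because it shares no terms with the report field. Patent fields flagged in this way correspond to what the study treats as potential product opportunities; swapping the roles of the two corpuses would flag the patenting opportunities instead.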
Results

Using the search strategy described in the previous section, this study obtained 917 patents and 264 Internet reports. After document collection, the obtained texts were preprocessed by unifying synonyms, removing stopwords, and stemming. After preprocessing, 3,031 and 6,217 words were obtained from the patent and media report corpuses, respectively. To reduce noise, terms with a high proportion of zero entries, i.e. high sparsity, in the derived document-term matrix were excluded. For the patent corpus, sparsity was set at 98%, which means that terms with more than 98% zero entries in the derived document-term matrix were excluded. A 98% sparsity-level setting retained 391 terms for the patent corpus. Internet report corpus sparsity was set at 85%. The two sparsity settings produced a document-term matrix of 917 × 391 for the patent corpus and a document-term matrix of 264 × 532 for the Internet report corpus. TF-IDF weighting was then applied to weigh the importance of each cell in the two document-term matrices.

Based on the derived document-term matrices, the ORCLUS algorithm was run separately on the patent and Internet report corpuses with the broad range of parameter settings described in the previous section. The cluster sparsity coefficients provided by ORCLUS were used to preliminarily screen out inappropriate parameter settings. In this study, the specified final numbers of clusters k range between 5 and 10, the specified subspace dimensionalities l are not less than (k 2), and the specified initial numbers of clusters k₀ are at, or approach, the greatest value that the computers can handle [10]. For clustering, setting a minimum threshold for sparsity coefficients and then determining the largest value of l for which the cluster sparsity coefficient stays below the threshold has been suggested. The threshold value for sparsity coefficients in this study was set as 0.003. In addition, a trade-off between the values of l and k₀ occurred frequently under the limitation of computer memory, which means that to increase one, the other must be reduced to derive a converged clustering solution. Therefore, the highest specified value of l was set as 12 to ensure that k₀ is maintained at a high level [10].

After determining the ranges of parameters, ORCLUS was run several times with different parameter settings to cluster the patent and Internet report corpuses. The runs were continued until k₀ approached the largest value that the computers could handle. As a result, the greatest value of k₀ was 39 for the patent corpus and 13 for the Internet report corpus. The corresponding cluster numbers k were 9 and 5 for the patent and Internet report corpuses, respectively. The value of subspace dimensionality l was 9 for the patent corpus and 6 for the Internet report corpus. The corresponding sparsity coefficients were 0.0007 and 0.0026 for patents and Internet reports, respectively. The ORCLUS clustering algorithm applied in this study thus generated 9 patent fields and 5 Internet report fields. After clustering, these patent fields and Internet report fields were named by reading document titles and abstracts. The clustering result is presented in Table 3.

Table 3 Patent and Internet Report Fields, Clustered by ORCLUS

Patents
Notation  Count  Field Naming
T1        306    Medical electronics devices and systems
T2        124    Medical monitoring systems
T3        102    Medical testing equipment and systems and technology
T4        18     Wireless sensor devices
T5        89     Secure protection of management systems and technology
T6        33     Telecommunication systems and technology
T7        62     Data management systems and technology
T8        114    Wearable devices and systems
T9        68     Medical services and systems

Internet Reports
Notation  Count  Field Naming
P1        57     Application of remote health care technology
P2        21     Microsoft's remote health care application
P3        21     Data processing of remote health care
P4        132    The networking application of medical industry
P5        32     Application of remote health monitoring technology

The cosine similarity of TF-IDF described in the previous section was then applied to detect the relationship between patent and Internet report fields. The analysis of cosine similarity is shown in Table 4. In this study, the third quartile (0.0566) was set as the threshold to eliminate the less similar relationships. The semantic similarity between patent and Internet report fields detected by cosine similarity is presented in Figure 2. As shown in Figure 2, the following patent and Internet report fields are semantically similar: (T1, P1), (T1, P2), (T1, P3), (T1, P4), (T1, P5), (T2, P4), (T3, P4), (T5, P1), (T5, P4), (T5, P5), and (T7, P4). It should be noted that T4, T6, T8, and T9 have no connections with any Internet report fields. Therefore, wireless sensor devices (T4), telecommunication systems and technology (T6), wearable devices and systems (T8), and medical services and systems (T9) tend to be future product opportunities, since these patent fields have low semantic similarity with Internet report fields.

Table 4 Cosine Similarity between Patent Fields and Internet Report Fields

      P1        P2        P3        P4        P5
T1    0.0874*   0.0615*   0.0614*   0.1067*   0.0890*
T2    0.0566    0.0490    0.0422    0.0660*   0.0538
T3    0.0480    0.0313    0.0364    0.0611*   0.0477
T4    0.0467    0.0306    0.0266    0.0483    0.0469
T5    0.0677*   0.0538    0.0529    0.0871*   0.0686*
T6    0.0291    0.0106    0.0142    0.0273    0.0197
T7    0.0480    0.0307    0.0357    0.0608*   0.0472
T8    0.0450    0.0261    0.0255    0.0470    0.0374
T9    0.0385    0.0212    0.0230    0.0424    0.0334

Note: * represents the relationship existing between patent and Internet report fields.

Fig. 2 Relationship between Patent Fields and Internet Report Fields

Conclusions and Suggestions

The perspective of the technology life cycle argues that applied research appears in patent databases and is then diffused to media reports [7]. Nowadays, the Internet has become an important data source because the commentary, news, and forums collected by web sites provide insight into product trends and information [5, 6]. Therefore, one might extract potential product opportunities by comparing patents and Internet reports. This study aims to identify potential product opportunities by exploring the areas where technological development exists but Internet reports do not. Such areas represent future opportunities for new product development. To compare the two different data sources, the ORCLUS algorithm is applied to cluster patents and Internet report documents.
Cosine similarity is then used to determine the semantic similarity between the technological development fields extracted from patents and the technological application fields extracted from Internet reports. The advantages of the process demonstrated in this study are objectivity and savings in time and cost, because it relies on patents and the Internet rather than on experts' judgments. Remote health monitoring technology is used to verify the feasibility of the process proposed in this study. According to the results, wireless sensor devices, telecommunication systems and technology, wearable devices and systems, and medical services and systems are the four fields where technological applications exist with few corresponding Internet reports. Manufacturers in industries related to remote health monitoring technology are advised to develop products based on these four technology fields. In future work, the approach to product opportunity identification proposed by this study should be applied to other technology fields to further verify its feasibility.

References

1 Wolff J A & Pett T L, Small-firm performance: modeling the role of production and process improvements, J Small Bus Manag, 44 (2006) 268-284.
2 Kowang T O, Rasli A & Long C S, A triangulated approach to the assessment of important constructs in new product development process, Life Sci J, 11 (2014) 330-338.
3 Seo W, Yoon J, Park H, Coh B, Lee J M & Kwon O J, Product opportunity identification based on internal capabilities using text mining and association rule mining, Technol Forecast Soc Chang, 105 (2016) 94-104.
4 Peng W, Sun T, Revankar S & Li T, Mining the voice of the customer for business prioritization, ACM Trans Intell Syst Technol, 3 (2012).
5 Thorleuchter D, Van den Poel D & Prinzie A, Extracting consumers needs for new products - a web mining approach, Third International Conference on Knowledge Discovery and Data Mining, (2010) 440-443.
6 Geum Y, Lee H, Lee Y & Park Y, Development of data-driven technology roadmap considering dependency: an ARM-based technology roadmapping, Technol Forecast Soc Chang, 91 (2015) 264-279.
7 Martino J P, A review of selected recent advances in technological forecasting, Technol Forecast Soc Chang, 70 (2003) 719-733.
8 Shibata N, Kajikawa Y & Sakata I, Extracting the commercialization gap between science and technology - case study of a solar cell, Technol Forecast Soc Chang, 77 (2010) 1147-1155.
9 Aggarwal C C & Yu P S, Finding generalized projected clusters in high dimensional spaces, Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, (2000) 70-81.
10 Wang M Y, Fang S C & Chang Y H, Exploring technological opportunities by mining the gaps between science and technology: microalgal biofuels, Technol Forecast Soc Chang, 92 (2015) 182-195.
11 Shibata N, Kajikawa Y & Sakata I, Detecting potential technological fronts by comparing scientific papers and patents, Foresight, 13 (2011) 51-60.
12 Baig M M & Gholamhosseini H, Smart health monitoring systems: an overview of design and modeling, Journal of Medical Systems, 37 (2013).
13 Lai K K & Wu S J, Using the patent co-citation approach to establish a new patent classification system, Inf Process Manage, 41 (2005) 313-330.
14 Feldman R & Sanger J, The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data, Cambridge University Press, Cambridge (2007).