Automating NSF HERD Reporting Using Machine Learning and Administrative Data

Rodolfo H. Torres
CIMA Session: The Use of Advance Analytics to Drive Decisions
2018 APLU Annual Meeting
New Orleans Marriott, New Orleans, LA
November 11, 2018

This research has been supported in part by the National Science Foundation under EAGER Awards 1547464 / 1547513. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Project in Collaboration with

Joshua Rosenbloom
Professor and Chair, Department of Economics, Iowa State University

Joseph St. Amand
Former Graduate Student, EECS, University of Kansas
Current position: Chief Technology Officer, Patients Voices

Adrienne Sadovsky
Principal Analyst Senior Office of Research, University of Kansas

Luke Huan
Initial co-PI; Former Professor, EECS / ITTC, University of Kansas
Current position: Head of Beijing Big Data Lab, Baidu Research

The HERD Survey

From https://www.nsf.gov/statistics/srvyherd/#sd

- The Higher Education Research and Development (HERD) Survey is the primary source of information on R&D expenditures at U.S. colleges and universities.
- The survey collects information on R&D expenditures by field of research and source of funds, and also gathers information on types of research and expenses.
- The survey is an annual census of institutions that expended at least $150,000 in separately budgeted R&D in the fiscal year.
- In FY 2016, 902 institutions reported a total of $72B in R&D expenditures, of which $39B came from federal sources.

Some Features of the HERD Survey

- R&D expenditures by source of funds (federal government, state and local government, business, nonprofit, institutional, and other)
- R&D expenditures passed through to sub-recipients or received as a sub-recipient
- Federally funded R&D expenditures by federal agency
- R&D expenditures by purpose of work (e.g., Basic Research, Applied Research, Development)
- Federally and non-federally funded R&D expenditures by field (e.g., Computer Sciences, Chemistry, Economics)

Sample of Tables in the HERD Report

Total and federally financed higher education R&D expenditures, by type of R&D: 2010–2016 (in thousands)

Fiscal year | Total: All R&D expenditures | Total: Basic research | Total: Applied research | Total: Development | Federal: All R&D expenditures | Federal: Basic research | Federal: Applied research | Federal: Development
2010 | 61,286,610 | 40,416,177 | 15,478,375 | 5,392,058 | 37,477,582 | 25,399,596 | 9,361,940 | 2,716,046
2011 | 65,274,393 | 42,809,196 | 16,733,579 | 5,731,618 | 40,768,251 | 27,331,458 | 10,498,586 | 2,938,207
2012 | 65,729,007 | 42,401,697 | 17,295,653 | 6,031,657 | 40,142,223 | 26,469,347 | 10,577,754 | 3,095,122
2013 | 67,013,138 | 43,305,409 | 17,390,865 | 6,316,864 | 39,445,931 | 26,071,617 | 10,327,219 | 3,047,095
2014 | 67,196,537 | 42,989,478 | 17,745,860 | 6,461,199 | 37,960,175 | 24,905,121 | 10,015,778 | 3,039,276
2015 | 68,566,890 | 43,865,982 | 18,022,569 | 6,678,339 | 37,848,552 | 24,945,232 | 9,969,994 | 2,933,326
2016 | 71,833,308 | 45,101,655 | 19,986,766 | 6,744,887 | 38,793,542 | 24,944,577 | 10,893,286 | 2,955,679

SOURCE: National Science Foundation, National Center for Science and Engineering Statistics, Higher Education Research and Development Survey
https://ncsesdata.nsf.gov/herd/2016/html/herd2016_dst_08.html

Expenditures by Field and Source, 2016 (in thousands)

Columns after "All R&D expenditures" give the source of funds. Indentation marks subfields.

Field | All R&D expenditures | Federal government | State and local government | Institution funds | Business | Nonprofit organizations | All other sources
All R&D fields | 71,833,308 | 38,793,542 | 4,025,280 | 17,974,962 | 4,210,563 | 4,614,800 | 2,214,161
  Science | 56,290,662 | 31,090,354 | 3,023,028 | 13,541,084 | 3,031,096 | 3,868,151 | 1,736,949
    Computer and information sciences | 2,077,884 | 1,442,771 | 49,502 | 399,965 | 90,288 | 59,588 | 35,770
    Geosciences, atmospheric sciences, and ocean sciences | 3,087,774 | 1,992,990 | 157,693 | 614,647 | 109,478 | 127,763 | 85,203
      Atmospheric science and meteorology | 626,518 | 513,275 | 18,416 | 68,923 | 6,319 | 7,660 | 11,925
      Geological and earth sciences | 999,351 | 605,706 | 47,334 | 226,541 | 51,237 | 32,557 | 35,976
      Ocean sciences and marine sciences | 1,097,864 | 665,121 | 59,874 | 241,440 | 32,896 | 69,841 | 28,692
      Geosciences, atmospheric sciences, and ocean sciences, nec | 364,041 | 208,888 | 32,069 | 77,743 | 19,026 | 17,705 | 8,610
    Life sciences | 40,887,850 | 21,798,334 | 2,437,745 | 9,700,749 | 2,569,302 | 3,038,475 | 1,343,245
      Agricultural sciences | 3,293,092 | 976,912 | 873,403 | 1,031,049 | 166,341 | 134,067 | 111,320
      Biological and biomedical sciences | 13,048,981 | 7,707,943 | 554,094 | 2,983,417 | 552,727 | 958,620 | 292,180
      Health sciences | 22,393,716 | 12,098,295 | 813,806 | 5,025,036 | 1,802,695 | 1,832,951 | 820,933
      Natural resources and conservation b | 689,725 | 315,559 | 115,681 | 193,967 | 14,949 | 30,632 | 18,937
      Life sciences, nec | 1,462,336 | 699,625 | 80,761 | 467,280 | 32,590 | 82,205 | 99,875
    Mathematics and statistics | 681,661 | 444,419 | 25,714 | 170,414 | 8,844 | 23,601 | 8,669
    Physical sciences | 4,893,565 | 3,286,816 | 93,518 | 1,044,829 | 139,153 | 200,852 | 128,397
      Astronomy and astrophysics | 622,008 | 418,147 | 1,839 | 122,375 | 4,578 | 34,443 | 40,626
      Chemistry | 1,775,071 | 1,097,719 | 48,331 | 421,143 | 82,673 | 82,956 | 42,249
      Materials science b | 172,086 | 111,802 | 4,579 | 38,435 | 9,518 | 5,465 | 2,287
      Physics | 2,124,098 | 1,523,751 | 33,703 | 417,189 | 37,851 | 71,221 | 40,383
      Physical sciences, nec | 200,302 | 135,397 | 5,066 | 45,687 | 4,533 | 6,767 | 2,852
    Psychology | 1,218,721 | 761,433 | 49,603 | 291,319 | 13,084 | 84,105 | 19,177
    Social sciences | 2,366,571 | 898,576 | 145,563 | 908,025 | 50,569 | 282,278 | 81,560
      Anthropology b | 96,505 | 39,440 | 2,501 | 42,190 | 1,982 | 7,860 | 2,532
      Economics | 396,393 | 112,338 | 37,543 | 166,032 | 8,910 | 54,860 | 16,710
      Political science and government | 385,245 | 103,681 | 15,042 | 177,119 | 3,991 | 61,439 | 23,973
      Sociology, demography, and population studies | 504,594 | 269,371 | 27,602 | 135,118 | 8,213 | 52,471 | 11,819
      Social sciences, nec | 983,834 | 373,746 | 62,875 | 387,566 | 27,473 | 105,648 | 26,526
    Sciences, nec | 1,076,636 | 465,015 | 63,690 | 411,136 | 50,378 | 51,489 | 34,928
  Engineering | 11,381,727 | 6,583,476 | 699,032 | 2,335,527 | 1,055,444 | 359,441 | 348,807
    Aerospace, aeronautical, and astronautical engineering | 883,260 | 623,571 | 24,846 | 115,771 | 80,432 | 31,049 | 7,591
    Bioengineering and biomedical engineering | 1,084,355 | 650,752 | 56,057 | 254,840 | 46,976 | 53,428 | 22,302
    Chemical engineering | 885,273 | 467,678 | 40,386 | 199,334 | 121,432 | 34,328 | 22,115
    Civil engineering | 1,331,155 | 591,637 | 221,119 | 348,873 | 84,724 | 46,354 | 38,448
    Electrical, electronic, and communications engineering | 2,517,147 | 1,742,632 | 51,270 | 416,262 | 167,403 | 55,818 | 83,762
    Industrial and manufacturing engineering b | 239,078 | 148,464 | 10,846 | 56,714 | 16,372 | 3,652 | 3,030
    Mechanical engineering | 1,435,828 | 860,745 | 55,454 | 279,079 | 169,124 | 33,587 | 37,839
    Metallurgical and materials engineering | 771,683 | 442,893 | 29,270 | 181,287 | 74,702 | 19,512 | 24,019
    Engineering, nec | 2,233,948 | 1,055,104 | 209,784 | 483,367 | 294,279 | 81,713 | 109,701
  Non-S&E | 4,160,919 | 1,119,712 | 303,220 | 2,098,351 | 124,023 | 387,208 | 128,405

Source: NSF https://ncsesdata.nsf.gov/herd/2015/html/herd2016_dst_05.html

A Proof of Concept Project

Categorizing each project by purpose and field of research requires considerable time and effort, as it is done manually at KU:
- Labor intensive (expensive)
- Subjective
- Questionable reliability and validity

Goals
- Apply machine learning and text analysis tools to automate project classification
- Ease administrative burden
- Generate more objective classifications

Methods

- We identified 1,700 historical awards that had been manually classified.
- We tried to classify them using the project Title, SOW/Abstract, PI Home Department, and additional metadata.
- We treated the purpose and the field classification as two different tasks.
- After eliminating awards for which electronic abstracts were not available, we were left with a set of roughly 1,500 awards that could be used as a training data set.
- We used the bag-of-words model to represent the data; each word is considered a separate feature.
- There were 17,046 separate data fields or features. However, using tools for feature weighting and selection, we reduced this number to a few hundred.
- This feature extraction pipeline is configurable, allowing us to experiment with different means of producing features for the classification models.
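To make the bag-of-words step concrete, here is a minimal sketch in plain Python. The award titles, the field labels attached to them, and the function names are hypothetical, and the document-frequency threshold is only a stand-in for the feature weighting and selection tools mentioned above, which are not specified here.

```python
import re
from collections import Counter

# Hypothetical award titles with field labels; not data from the project.
awards = [
    ("Soil moisture sensing for crop irrigation", "Agricultural Sciences"),
    ("Crop yield response to soil nutrients", "Agricultural Sciences"),
    ("Protein folding dynamics in yeast cells", "Biological and Biomedical Sciences"),
    ("Gene expression in yeast under stress", "Biological and Biomedical Sciences"),
]

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def build_vocabulary(texts, min_doc_freq=2):
    """Keep words that occur in at least min_doc_freq documents.

    Assumption: a simple document-frequency cutoff stands in for the
    unspecified feature weighting and selection step.
    """
    doc_freq = Counter()
    for text in texts:
        doc_freq.update(set(tokenize(text)))
    return sorted(w for w, n in doc_freq.items() if n >= min_doc_freq)

def vectorize(text, vocabulary):
    """Return the word counts for text, one entry per vocabulary word."""
    counts = Counter(tokenize(text))
    return [counts[w] for w in vocabulary]

titles = [title for title, _ in awards]
vocab = build_vocabulary(titles, min_doc_freq=2)
matrix = [vectorize(t, vocab) for t in titles]
```

Each row of `matrix` is one award and each column is one retained word, which is the representation a classifier would consume.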

Methods (cont.)

- We divided the awards into a testing set (about 30% of the data) and a training set (which was then divided into 5 parts for cross-validation).
- We explored the application of established machine-learning models:
  - Decision Tree
  - Support Vector Machine
  - Logistic Regression
  - Random Forest
  - Naïve Bayes
  - Neural Network
- We evaluated the quality of the models on a per-category basis in terms of an F1-score, by comparison with the manual human classification.
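The data split described above can be sketched as follows. This is an illustration under assumptions, not the project's actual procedure: the award identifiers, the random seed, the function names, and the choice of a plain (non-stratified) shuffle are all hypothetical.

```python
import random

def split_train_test(items, test_fraction=0.3, seed=0):
    """Shuffle a copy of items and hold out test_fraction of them for testing."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def k_folds(items, k=5):
    """Yield (fit, validation) pairs, using each of the k parts once for validation."""
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        fit = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield fit, validation

# Hypothetical identifiers standing in for roughly 1,500 usable awards.
award_ids = list(range(1500))
train, test = split_train_test(award_ids)
folds = list(k_folds(train, k=5))
```

The held-out testing set is never used while fitting or tuning; only the five folds of the training set are rotated through for validation.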

Methods (cont.)

Actual Outcome | Predicted: In Field | Predicted: Not in Field
In Field | TP | FN
Not in Field | FP | TN

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
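As a worked example of these definitions, the sketch below computes precision, recall, and F1 for one category at a time. The label lists are made up for illustration (the codes B2, B3, and E are borrowed from the field code list), and returning 0.0 when a denominator is zero is a convention assumed here, not something stated in the slides.

```python
def f1_for_category(actual, predicted, category):
    """Return (precision, recall, f1) treating `category` as the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == category and p == category for a, p in pairs)
    fp = sum(a != category and p == category for a, p in pairs)
    fn = sum(a == category and p != category for a, p in pairs)
    # Assumed convention: an undefined ratio is reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Hypothetical manual labels and model predictions for six awards.
actual = ["B2", "B2", "B2", "B3", "B3", "E"]
predicted = ["B2", "B2", "B3", "B3", "B2", "E"]

scores = {c: f1_for_category(actual, predicted, c) for c in sorted(set(actual))}
```

For category B2 in this toy data, TP = 2, FP = 1, and FN = 1, so precision, recall, and F1 all equal 2/3.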

Results

- Greater success with Field of Study than with Research Purpose
- Best models: Logistic Regression and Support Vector Machine
- Surprisingly, using the Title of the project alone, we do better than with the SOW/Abstract
- Potentially compromising factors:
  - SOW/Abstract not sufficiently clear
  - Models cannot understand complex relationships between the words
  - Words have different meanings in different contexts
  - Insufficient sample size
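To illustrate what a title-only classifier looks like, here is a minimal sketch of a multinomial Naïve Bayes model, one of the model families explored. It is chosen only because it fits in a few lines of plain Python; it is not one of the best-performing models reported above. The class name, the titles, and the labels are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

class TitleNaiveBayes:
    """Multinomial Naive Bayes over title words with add-one smoothing."""

    def fit(self, titles, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for title, label in zip(titles, labels):
            self.word_counts[label].update(tokenize(title))
        self.vocabulary = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, title):
        total = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + vocab_size
            score = math.log(n / total)  # log prior of the class
            for word in tokenize(title):
                if word in self.vocabulary:  # skip words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical training titles labeled with field codes B2 (Chemistry) and B3 (Physics).
train_titles = [
    "Catalytic synthesis of organic polymers",
    "Synthesis of novel catalysts for polymer chemistry",
    "Quantum optics of trapped ions",
    "Trapped ion quantum computing experiments",
]
train_labels = ["B2", "B2", "B3", "B3"]

model = TitleNaiveBayes().fit(train_titles, train_labels)
chem_guess = model.predict("Polymer synthesis with new catalysts")
phys_guess = model.predict("Quantum experiments with trapped ions")
```

Because titles are short, every word carries a lot of weight, which is one plausible reading of why titles can compete with longer and noisier abstracts; the toy data here cannot confirm that.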

Results (cont.)

[Figure: Training and Testing F1 Scores. F1 score (axis from 0.00 to 1.20) by field label, with series Training Score and Testing Score.]

Results (cont.): F1 scores vs. sample size

[Figure: Label Distribution. Count per field label (axis from 0 to 200).]
[Figure: Training and Testing F1 Scores. F1 score (axis from 0.00 to 1.20) by field label.]

Conclusions and Future Work

- It is feasible to classify the projects using machine learning if enough data is available
- Need to collect more data points
- Need to better understand in which areas the tools do not perform well, and why
- Recruit other universities:
  - Expand training data
  - Determine whether the tool is applicable across universities

More Information

Publication: Joshua L. Rosenbloom, Rodolfo H. Torres, Joseph St. Amand, and Adrienne Sadovsky. Enhancing and Automating University Reporting of R&D Expenditure Data Using Machine Learning Techniques. Merrill Advanced Studies Center Report, No. 121, 2017.
https://merrill.ku.edu/sites/merrill.ku.edu/files/docs/2017_whitepaper/University_Research_Planning_in_the_Data_Era_2017.pdf

Software and documentation: https://github.com/jstamand/kuherd

Questions?

Field Codes (* = new field as of Spring 2016)

Field | Code | New Field Spring 2016
Aerospace / Aeronautical / Astronautical Engineering | A1 |
Bioengineering and Biomedical Engineering | A2 |
Chemical Engineering | A3 |
Civil Engineering | A4 |
Electrical, Electronic, and Communications Engineering | A5 |
Mechanical Engineering | A6 |
Metallurgical & Materials Engineering | A7 |
Other Engineering | A8 |
Industrial and Manufacturing Engineering | A9 | *
Astronomy and Astrophysics | B1 |
Chemistry | B2 |
Physics | B3 |
Other Physical Sciences | B4 |
Materials Science | B5 | *
Atmospheric Sciences and Meteorology | C1 |
Geological and Earth Sciences | C2 |
Ocean Sciences and Marine Sciences | C3 |
Other Geosciences, Atmospheric, and Ocean Sciences | C4 |
Mathematics and Statistics | D |
Computer and Information Sciences | E |
Agricultural Sciences | F1 |
Biological and Biomedical Sciences | F2 |
Health Sciences | F3 |
Other Life Sciences | F4 |
Natural Resources and Conservation | F5 | *
Psychology | G |
Economics | H1 |
Political Science and Government | H2 |
Sociology, Demography, and Population Studies | H3 |
Other Social Sciences | H4 |
Anthropology | H5 | *
Other Sciences | I |
Education | K |
Law | L |
Humanities | M |
Visual and Performing Arts | N |
Business Management and Business Administration | O |
Communication and Communications Technologies | P |
Social Work | Q |
Other Non-S&E Fields | R |