Automating NSF HERD Reporting Using Machine Learning and Administrative Data

Rodolfo H. Torres

CIMA Session: The Use of Advanced Analytics to Drive Decisions
2018 APLU Annual Meeting
New Orleans Marriott, New Orleans, LA
November 11, 2018

This research has been supported in part by the National Science Foundation under EAGER Awards 1547464 / 1547513. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
Project in Collaboration with:

- Joshua Rosenbloom, Professor and Chair, Department of Economics, Iowa State University
- Joseph St. Amand, former graduate student, EECS, University of Kansas; current position: Chief Technology Officer, Patients Voices
- Adrienne Sadovsky, Principal Analyst Senior, Office of Research, University of Kansas
- Luke Huan, initial co-PI; former Professor, EECS / ITTC, University of Kansas; current position: Head of Beijing Big Data Lab, Baidu Research
The HERD Survey

From https://www.nsf.gov/statistics/srvyherd/#sd

The Higher Education Research and Development (HERD) Survey is the primary source of information on R&D expenditures at U.S. colleges and universities. The survey collects information on R&D expenditures by field of research and source of funds, and also gathers information on types of research and expenses. The survey is an annual census of institutions that expended at least $150,000 in separately budgeted R&D in the fiscal year. In FY 2016, 902 institutions reported data, for a total of $72B in R&D expenditures, of which $39B came from federal sources.
Some Features of the HERD Survey

- R&D expenditures by source of funds (federal government, state and local government, business, nonprofit, institutional, and other)
- R&D expenditures passed through to sub-recipients or received as a sub-recipient
- Federally funded R&D expenditures by federal agency
- R&D expenditures by purpose of work (e.g., basic research, applied research, development)
- Federally and non-federally funded R&D expenditures by field (e.g., Computer Sciences, Chemistry, Economics)
Sample of Tables in the HERD Report

Total and federally financed higher education R&D expenditures, by type of R&D: 2010–2016 (in thousands)

| Fiscal year | Total, all R&D | Total, basic research | Total, applied research | Total, development | Federal, all R&D | Federal, basic research | Federal, applied research | Federal, development |
| 2010 | 61,286,610 | 40,416,177 | 15,478,375 | 5,392,058 | 37,477,582 | 25,399,596 | 9,361,940 | 2,716,046 |
| 2011 | 65,274,393 | 42,809,196 | 16,733,579 | 5,731,618 | 40,768,251 | 27,331,458 | 10,498,586 | 2,938,207 |
| 2012 | 65,729,007 | 42,401,697 | 17,295,653 | 6,031,657 | 40,142,223 | 26,469,347 | 10,577,754 | 3,095,122 |
| 2013 | 67,013,138 | 43,305,409 | 17,390,865 | 6,316,864 | 39,445,931 | 26,071,617 | 10,327,219 | 3,047,095 |
| 2014 | 67,196,537 | 42,989,478 | 17,745,860 | 6,461,199 | 37,960,175 | 24,905,121 | 10,015,778 | 3,039,276 |
| 2015 | 68,566,890 | 43,865,982 | 18,022,569 | 6,678,339 | 37,848,552 | 24,945,232 | 9,969,994 | 2,933,326 |
| 2016 | 71,833,308 | 45,101,655 | 19,986,766 | 6,744,887 | 38,793,542 | 24,944,577 | 10,893,286 | 2,955,679 |

SOURCE: National Science Foundation, National Center for Science and Engineering Statistics, Higher Education Research and Development Survey. https://ncsesdata.nsf.gov/herd/2016/html/herd2016_dst_08.html
Expenditures by Field and Source, 2016 (in thousands)

| Field | All R&D expenditures | Federal government | State and local government | Institution funds | Business | Nonprofit organizations | All other sources |
| All R&D fields | 71,833,308 | 38,793,542 | 4,025,280 | 17,974,962 | 4,210,563 | 4,614,800 | 2,214,161 |
| Science | 56,290,662 | 31,090,354 | 3,023,028 | 13,541,084 | 3,031,096 | 3,868,151 | 1,736,949 |
| Computer and information sciences | 2,077,884 | 1,442,771 | 49,502 | 399,965 | 90,288 | 59,588 | 35,770 |
| Geosciences, atmospheric sciences, and ocean sciences | 3,087,774 | 1,992,990 | 157,693 | 614,647 | 109,478 | 127,763 | 85,203 |
| Atmospheric science and meteorology | 626,518 | 513,275 | 18,416 | 68,923 | 6,319 | 7,660 | 11,925 |
| Geological and earth sciences | 999,351 | 605,706 | 47,334 | 226,541 | 51,237 | 32,557 | 35,976 |
| Ocean sciences and marine sciences | 1,097,864 | 665,121 | 59,874 | 241,440 | 32,896 | 69,841 | 28,692 |
| Geosciences, atmospheric sciences, and ocean sciences, nec | 364,041 | 208,888 | 32,069 | 77,743 | 19,026 | 17,705 | 8,610 |
| Life sciences | 40,887,850 | 21,798,334 | 2,437,745 | 9,700,749 | 2,569,302 | 3,038,475 | 1,343,245 |
| Agricultural sciences | 3,293,092 | 976,912 | 873,403 | 1,031,049 | 166,341 | 134,067 | 111,320 |
| Biological and biomedical sciences | 13,048,981 | 7,707,943 | 554,094 | 2,983,417 | 552,727 | 958,620 | 292,180 |
| Health sciences | 22,393,716 | 12,098,295 | 813,806 | 5,025,036 | 1,802,695 | 1,832,951 | 820,933 |
| Natural resources and conservation | 689,725 | 315,559 | 115,681 | 193,967 | 14,949 | 30,632 | 18,937 |
| Life sciences, nec | 1,462,336 | 699,625 | 80,761 | 467,280 | 32,590 | 82,205 | 99,875 |
| Mathematics and statistics | 681,661 | 444,419 | 25,714 | 170,414 | 8,844 | 23,601 | 8,669 |
| Physical sciences | 4,893,565 | 3,286,816 | 93,518 | 1,044,829 | 139,153 | 200,852 | 128,397 |
| Astronomy and astrophysics | 622,008 | 418,147 | 1,839 | 122,375 | 4,578 | 34,443 | 40,626 |
| Chemistry | 1,775,071 | 1,097,719 | 48,331 | 421,143 | 82,673 | 82,956 | 42,249 |
| Materials science | 172,086 | 111,802 | 4,579 | 38,435 | 9,518 | 5,465 | 2,287 |
| Physics | 2,124,098 | 1,523,751 | 33,703 | 417,189 | 37,851 | 71,221 | 40,383 |
| Physical sciences, nec | 200,302 | 135,397 | 5,066 | 45,687 | 4,533 | 6,767 | 2,852 |
| Psychology | 1,218,721 | 761,433 | 49,603 | 291,319 | 13,084 | 84,105 | 19,177 |
| Social sciences | 2,366,571 | 898,576 | 145,563 | 908,025 | 50,569 | 282,278 | 81,560 |
| Anthropology | 96,505 | 39,440 | 2,501 | 42,190 | 1,982 | 7,860 | 2,532 |
| Economics | 396,393 | 112,338 | 37,543 | 166,032 | 8,910 | 54,860 | 16,710 |
| Political science and government | 385,245 | 103,681 | 15,042 | 177,119 | 3,991 | 61,439 | 23,973 |
| Sociology, demography, and population studies | 504,594 | 269,371 | 27,602 | 135,118 | 8,213 | 52,471 | 11,819 |
| Social sciences, nec | 983,834 | 373,746 | 62,875 | 387,566 | 27,473 | 105,648 | 26,526 |
| Sciences, nec | 1,076,636 | 465,015 | 63,690 | 411,136 | 50,378 | 51,489 | 34,928 |
| Engineering | 11,381,727 | 6,583,476 | 699,032 | 2,335,527 | 1,055,444 | 359,441 | 348,807 |
| Aerospace, aeronautical, and astronautical engineering | 883,260 | 623,571 | 24,846 | 115,771 | 80,432 | 31,049 | 7,591 |
| Bioengineering and biomedical engineering | 1,084,355 | 650,752 | 56,057 | 254,840 | 46,976 | 53,428 | 22,302 |
| Chemical engineering | 885,273 | 467,678 | 40,386 | 199,334 | 121,432 | 34,328 | 22,115 |
| Civil engineering | 1,331,155 | 591,637 | 221,119 | 348,873 | 84,724 | 46,354 | 38,448 |
| Electrical, electronic, and communications engineering | 2,517,147 | 1,742,632 | 51,270 | 416,262 | 167,403 | 55,818 | 83,762 |
| Industrial and manufacturing engineering | 239,078 | 148,464 | 10,846 | 56,714 | 16,372 | 3,652 | 3,030 |
| Mechanical engineering | 1,435,828 | 860,745 | 55,454 | 279,079 | 169,124 | 33,587 | 37,839 |
| Metallurgical and materials engineering | 771,683 | 442,893 | 29,270 | 181,287 | 74,702 | 19,512 | 24,019 |
| Engineering, nec | 2,233,948 | 1,055,104 | 209,784 | 483,367 | 294,279 | 81,713 | 109,701 |
| Non-S&E | 4,160,919 | 1,119,712 | 303,220 | 2,098,351 | 124,023 | 387,208 | 128,405 |

SOURCE: National Science Foundation, National Center for Science and Engineering Statistics, Higher Education Research and Development Survey. https://ncsesdata.nsf.gov/herd/2015/html/herd2016_dst_05.html
A Proof of Concept Project

Categorizing each project by purpose and field of research is done manually at KU and requires considerable time and effort:
- Labor intensive (expensive)
- Subjective
- Questionable reliability and validity

Goals:
- Apply machine learning and text analysis tools to automate project classification
- Ease administrative burden
- Generate more objective classifications
Methods

- We identified 1,700 historical awards that had been manually classified.
- We attempted to classify them using the project title, SOW/abstract, PI home department, and additional metadata.
- We treated the purpose classification and the field classification as two separate tasks.
- After eliminating awards for which electronic abstracts were not available, we were left with roughly 1,500 awards that could be used as a training data set.
- We used the bag-of-words model to represent the data: each word is treated as a separate feature. This yielded 17,046 separate data fields, or features; using tools for feature weighting and selection, we reduced this number to a few hundred.
- This feature extraction pipeline is configurable, allowing us to experiment with different means of producing features for the classification models.
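The bag-of-words representation with feature weighting and selection can be sketched with scikit-learn, one common way to implement this kind of pipeline (the actual kuherd code may differ; the award titles and labels below are invented placeholders, not KU data):

```python
# Sketch of a bag-of-words pipeline: TF-IDF weighting of word
# features, then chi-squared feature selection against the labels.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Placeholder award titles and HERD field codes (see appendix).
titles = [
    "Spectral methods for nonlinear partial differential equations",
    "Genomic analysis of drought tolerance in sorghum",
    "Deep learning architectures for image classification",
    "Harmonic analysis and singular integral operators",
]
labels = ["D", "F1", "E", "D"]

# Bag-of-words: every distinct word becomes a feature,
# weighted by TF-IDF rather than raw counts.
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
X = vectorizer.fit_transform(titles)

# Keep only the k most informative word features,
# scored by a chi-squared test against the labels.
selector = SelectKBest(chi2, k=min(10, X.shape[1]))
X_reduced = selector.fit_transform(X, labels)

print(X.shape[1], "features before selection,",
      X_reduced.shape[1], "after")
```

In the real pipeline the vectorizer would be fit on all ~1,500 award texts, producing the 17,046 raw features mentioned above before selection.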
Methods (cont.)

- We divide the awards into a testing set (about 30% of the data) and a training set (which is then divided into 5 parts for cross-validation).
- We explored the application of established machine-learning models:
  - Decision Tree
  - Support Vector Machine
  - Logistic Regression
  - Random Forest
  - Naïve Bayes
  - Neural Network
- We evaluate the quality of the models on a per-category basis in terms of an F1 score, comparing their predictions with the manual human classifications.
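The split-and-compare procedure above can be sketched as follows, again using scikit-learn for illustration; `X` and `y` are random stand-ins for the award feature matrix and field labels, and the neural network is omitted from the loop for brevity:

```python
# Sketch: ~30% held-out test set, then 5-fold cross-validated
# comparison of several classifiers on the training portion.
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
X = rng.random((200, 50))          # placeholder feature matrix
y = rng.integers(0, 3, size=200)   # placeholder field labels

# Hold out ~30% for final testing; cross-validate on the rest.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

models = {
    "decision tree": DecisionTreeClassifier(),
    "support vector machine": LinearSVC(),
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(),
    "naive Bayes": MultinomialNB(),
}
for name, model in models.items():
    # 5-fold CV on the training set, scored with macro-averaged F1
    # so every category contributes equally.
    scores = cross_val_score(model, X_train, y_train,
                             cv=5, scoring="f1_macro")
    print(f"{name}: mean CV F1 = {scores.mean():.3f}")
```

Since the labels here are random, the reported scores are near chance; with real award texts the loop would surface the best-performing model families, which the final test set then evaluates once.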
Methods (cont.)

|                      | Predicted: In Field | Predicted: Not in Field |
| Actual: In Field     | TP                  | FN                      |
| Actual: Not in Field | FP                  | TN                      |

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 Score = 2 (Precision × Recall) / (Precision + Recall)
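The three metrics above translate directly into code; a minimal sketch in plain Python, with an invented example count:

```python
# Precision, recall, and F1 from confusion-matrix counts,
# exactly as defined in the formulas above.
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical example: 8 awards correctly assigned to a field,
# 2 assigned there wrongly, 2 that belonged there but were missed.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.8 0.8 0.8
```

Note that TN does not enter any of the formulas: with many fields, almost every award is a true negative for any given field, so accuracy would look deceptively high, which is why a per-category F1 score is used instead.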
Results

- Greater success with field of study than with research purpose.
- Best models: Logistic Regression and Support Vector Machine.
- Surprisingly, using the project title alone we do better than with the SOW/abstract.
- Potentially compromising factors:
  - SOW/abstract not sufficiently clear
  - Models cannot capture complex relationships between words
  - Words have different meanings in different contexts
  - Insufficient sample size
F1 Score Results (cont.)

[Figure: training and testing F1 scores (0.00–1.20) by field label.]
F1 Score Results (cont.)

[Figure: label distribution (number of awards per field label, 0–200) shown alongside training and testing F1 scores by field label, illustrating F1 scores vs. sample size.]
Conclusions and Future Work

- It is feasible to classify the projects using machine learning if enough data is available.
- Need to collect more data points.
- Need to understand better in which areas the tools do not perform well, and why.
- Recruit other universities:
  - Expand the training data
  - Determine whether the tool is applicable across universities
More Information

Publication: Enhancing and Automating University Reporting of R&D Expenditure Data Using Machine Learning Techniques. Joshua L. Rosenbloom, Rodolfo H. Torres, Joseph St. Amand, and Adrienne Sadovsky. Merrill Advanced Studies Center Report, No. 121, 2017.
https://merrill.ku.edu/sites/merrill.ku.edu/files/docs/2017_whitepaper/University_Research_Planning_in_the_Data_Era_2017.pdf

Software and documentation: https://github.com/jstamand/kuherd
Questions?
Field Codes

| Field | Code |
| Aerospace / Aeronautical / Astronautical Engineering | A1 |
| Bioengineering and Biomedical Engineering | A2 |
| Chemical Engineering | A3 |
| Civil Engineering | A4 |
| Electrical, Electronic, and Communications Engineering | A5 |
| Mechanical Engineering | A6 |
| Metallurgical & Materials Engineering | A7 |
| Other Engineering | A8 |
| Industrial and Manufacturing Engineering | A9 * |
| Astronomy and Astrophysics | B1 |
| Chemistry | B2 |
| Physics | B3 |
| Other Physical Sciences | B4 |
| Materials Science | B5 * |
| Atmospheric Sciences and Meteorology | C1 |
| Geological and Earth Sciences | C2 |
| Ocean Sciences and Marine Sciences | C3 |
| Other Geosciences, Atmospheric, and Ocean Sciences | C4 |
| Mathematics and Statistics | D |
| Computer and Information Sciences | E |
| Agricultural Sciences | F1 |
| Biological and Biomedical Sciences | F2 |
| Health Sciences | F3 |
| Other Life Sciences | F4 |
| Natural Resources and Conservation | F5 * |
| Psychology | G |
| Economics | H1 |
| Political Science and Government | H2 |
| Sociology, Demography, and Population Studies | H3 |
| Other Social Sciences | H4 |
| Anthropology | H5 * |
| Other Sciences | I |
| Education | K |
| Law | L |
| Humanities | M |
| Visual and Performing Arts | N |
| Business Management and Business Administration | O |
| Communication and Communications Technologies | P |
| Social Work | Q |
| Other Non-S&E Fields | R |

* New field, Spring 2016