Lecture Notes in Computer Science 1481 Edited by G. Goos, J. Hartmanis and J. van Leeuwen
Berlin Heidelberg New York Barcelona Budapest Hong Kong London Milan Paris Singapore Tokyo
Ethan V. Munson, Charles Nicholas, Derick Wood (Eds.)
Principles of Digital Document Processing
4th International Workshop, PODDP 98
Saint Malo, France, March 29-30, 1998
Proceedings
Series Editors
Gerhard Goos, Karlsruhe University, Germany
Juris Hartmanis, Cornell University, NY, USA
Jan van Leeuwen, Utrecht University, The Netherlands

Volume Editors
Ethan V. Munson
University of Wisconsin-Milwaukee
Department of Electrical Engineering and Computer Science
Milwaukee, WI 53211, USA
E-mail: munson@cs.uwm.edu

Charles Nicholas
University of Maryland, Baltimore County
Department of Computer Science and Electrical Engineering
1000 Hilltop Circle, Baltimore, MD 21250, USA
E-mail: nicholas@cs.umbc.edu

Derick Wood
Hong Kong University of Science and Technology
Department of Computer Science
Clear Water Bay, Kowloon, Hong Kong SAR
E-mail: dwood@cs.ust.hk

Cataloging-in-Publication data applied for

Die Deutsche Bibliothek - CIP-Einheitsaufnahme
Principles of digital document processing : 4th international workshop ; proceedings / PODDP 98, Saint Malo, France, March 29-30, 1998. Ethan V. Munson ... (ed.). - Berlin ; Heidelberg ; New York ; Barcelona ; Budapest ; Hong Kong ; London ; Milan ; Paris ; Singapore ; Tokyo : Springer, 1998
(Lecture notes in computer science ; Vol. 1481)
ISBN 3-540-65086-5

CR Subject Classification (1991): I.7, H.5, I.3.7, I.4, H.2.8
ISSN 0302-9743
ISBN 3-540-65086-5 Springer-Verlag Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable for prosecution under the German Copyright Law.

© Springer-Verlag Berlin Heidelberg 1998
Printed in Germany
Typesetting: Camera-ready by author
SPIN 10638766 06/3142 5 4 3 2 1 0
Printed on acid-free paper
Preface

The Fourth International Workshop on Principles of Digital Document Processing took place in Saint Malo, France on March 29 and 30, 1998. PODDP 98 was the fourth in a series of international workshops that provide forums to discuss the modeling of document processing systems using theories and techniques from, for example, computer science, mathematics, and psychology. PODDP 98 took place in conjunction with EP 98 at the Palais des Congrès in Saint Malo. The charter of PODDP is deliberately ambitious and its scope broad. Indeed, we have added the adjective "Digital" to the series title to reflect the workshop's emphasis on multimedia documents. The current state of digital document systems can be characterized as a plethora of tools without a clear articulation of unifying principles and underlying concepts. The practical and commercial impact of these tools, which include editors and formatters for many media (text, graphics, video, audio, animation), composition systems, digital libraries, the World Wide Web, word processing systems, structured editors, and document management systems, is too pervasive and obvious to require further elaboration and emphasis. With the rapid development in hardware technology (processors, memory, and high bandwidth networks), however, the notion of a document and of document processing itself is undergoing a profound change. The growing use of multimedia has expanded our notions about content, scale, and dynamicity of documents. To address these changes, we hope to bring to bear theories and techniques developed by researchers in other areas of science, mathematics, engineering and the humanities (such as databases, formal specification languages and methodologies, optimization, workflow analysis, and user interface design). The PODDP workshops are intended to promote a happy marriage between documents and document processing, and theories and techniques.
PODDP provides an ideal opportunity for discussion and information exchange between researchers who are grappling with problems in any area of document processing. We invited researchers to submit papers with a good balance between theory and practice in document processing. Papers that addressed both on a somewhat equal basis were preferred. Each paper was subjected to rigorous peer review. Finally, we hope that the work presented in this volume will stimulate other researchers to join us in investigating the principles of digital document processing. To support the dissemination of this research, plans are being made for the next workshop in this series, PODDP 2000, to be held in conjunction with DDEP 2000, the Eighth International Conference on Digital Documents and Electronic Publishing.

July 1998
Ethan V. Munson
Charles Nicholas
Derick Wood
Organization

Steering Committee
Derick Wood (Hong Kong University of Science & Technology), Chair
Anne Brueggemann-Klein (Technische Universität München, Germany)
Richard Furuta (Texas A&M University, USA)
Ethan V. Munson (University of Wisconsin-Milwaukee, USA)
Makoto Murata (Fuji Xerox Information Systems Co. Ltd., Japan)
Charles Nicholas (University of Maryland, Baltimore County, USA)

Program Committee
Charles Nicholas (University of Maryland, Baltimore County, USA), Co-Chair
Derick Wood (Hong Kong University of Science & Technology), Co-Chair
Howard Blair (Syracuse University, USA)
Heather Brown (University of Kent-Canterbury, UK)
Anne Brueggemann-Klein (Technische Universität München, Germany)
Norbert Fuhr (Universität Dortmund, Germany)
Richard Furuta (Texas A&M University, USA)
Heikki Mannila (University of Helsinki, Finland)
Ethan V. Munson (University of Wisconsin-Milwaukee, USA)
Makoto Murata (Fuji Xerox Information Systems, Japan)
Cecile Roisin (INRIA Rhône-Alpes, France)
Table of Contents

Document Models and Structures

Context and Caterpillars and Structured Documents
    Anne Brüggemann-Klein, Stefan Hermann, and Derick Wood
A Conceptual Model for Tables
    Xinxin Wang and Derick Wood
Analysis of Document Structures for Element Type Classification
    Helena Ahonen, Barbara Heikkinen, Oskari Heinonen, Jani Jaakkola, and Mika Klemettinen
Using Document Relationships for Better Answers
    Mingfang Wu and Ross Wilkinson

Characterization of Documents and Corpora

Generating, Visualizing, and Evaluating High-Quality Clusters for Information Organization
    Javed Aslam, Katya Pelekhov, and Daniela Rus
On the Specification of the Display of Documents in Multi-lingual Computing
    Myatav Erdenechimeg, Richard Moore, and Yumbayar Namsrai
Spotting Topics with the Singular Value Decomposition
    Charles Nicholas and Randall Dahlberg
A Linear Algebra Approach to Language Identification
    Laura A. Mather

Accessing Collections of Documents

Indexed Tree Matching with Complete Answer Representations
    Holger Meuss
Combining the Power of Query Languages and Search Engines for Online Document and Information Retrieval: The QIRi@D Environment
    Laure Berti, Jean-Luc Damoiseaux, and Elisabeth Murisasco
Intensional HTML
    Bill Wadge, Gord Brown, m. c. schraefel, and Taner Yildirim
Data Model for Document Transformation and Assembly
    Makoto Murata