Lecture Notes in Computer Science Edited by G. Goos, J. Hartmanis and J. van Leeuwen


Lecture Notes in Computer Science 1481 Edited by G. Goos, J. Hartmanis and J. van Leeuwen

Berlin Heidelberg New York Barcelona Budapest Hong Kong London Milan Paris Singapore Tokyo

Ethan V. Munson, Charles Nicholas, Derick Wood (Eds.)

Principles of Digital Document Processing

4th International Workshop, PODDP '98
Saint Malo, France, March 29-30, 1998
Proceedings

Series Editors

Gerhard Goos, Karlsruhe University, Germany
Juris Hartmanis, Cornell University, NY, USA
Jan van Leeuwen, Utrecht University, The Netherlands

Volume Editors

Ethan V. Munson
University of Wisconsin-Milwaukee
Department of Electrical Engineering and Computer Science
Milwaukee, WI 53211, USA
E-mail: munson@cs.uwm.edu

Charles Nicholas
University of Maryland, Baltimore County
Department of Computer Science and Electrical Engineering
1000 Hilltop Circle, Baltimore, MD 21250, USA
E-mail: nicholas@cs.umbc.edu

Derick Wood
Hong Kong University of Science and Technology
Department of Computer Science
Clear Water Bay, Kowloon, Hong Kong SAR
E-mail: dwood@cs.ust.hk

Cataloging-in-Publication data applied for

Die Deutsche Bibliothek - CIP-Einheitsaufnahme

Principles of digital document processing : 4th international workshop ; proceedings / PODDP '98, Saint Malo, France, March 29-30, 1998. Ethan V. Munson ... (ed.). - Berlin ; Heidelberg ; New York ; Barcelona ; Budapest ; Hong Kong ; London ; Milan ; Paris ; Singapore ; Tokyo : Springer, 1998
(Lecture notes in computer science ; Vol. 1481)
ISBN 3-540-65086-5

CR Subject Classification (1991): I.7, H.5, I.3.7, I.4, H.2.8

ISSN 0302-9743
ISBN 3-540-65086-5 Springer-Verlag Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable for prosecution under the German Copyright Law.

© Springer-Verlag Berlin Heidelberg 1998
Printed in Germany

Typesetting: Camera-ready by author
SPIN 10638766 06/3142 5 4 3 2 1 0
Printed on acid-free paper

Preface

The Fourth International Workshop on Principles of Digital Document Processing took place in Saint Malo, France on March 29 and 30, 1998. PODDP '98 was the fourth in a series of international workshops that provide forums to discuss the modeling of document processing systems using theories and techniques from, for example, computer science, mathematics, and psychology. PODDP '98 took place in conjunction with EP '98 at the Palais des Congrès in Saint Malo.

The charter of PODDP is deliberately ambitious and its scope broad. Indeed, we have added the adjective "Digital" to the series title to reflect the workshop's emphasis on multimedia documents. The current state of digital document systems can be characterized as a plethora of tools without a clear articulation of unifying principles and underlying concepts. The practical and commercial impact of these tools, which include editors and formatters for many media (text, graphics, video, audio, animation), composition systems, digital libraries, the World Wide Web, word processing systems, structured editors, and document management systems, is too pervasive and obvious to require further elaboration and emphasis. With the rapid development in hardware technology (processors, memory, and high bandwidth networks), however, the notion of a document and of document processing itself is undergoing a profound change. The growing use of multimedia has expanded our notions about content, scale, and dynamicity of documents. To address these changes, we hope to bring to bear theories and techniques developed by researchers in other areas of science, mathematics, engineering and the humanities (such as databases, formal specification languages and methodologies, optimization, workflow analysis, and user interface design). The PODDP workshops are intended to promote a happy marriage between documents and document processing, and theories and techniques.

PODDP provides an ideal opportunity for discussion and information exchange between researchers who are grappling with problems in any area of document processing. We invited researchers to submit papers with a good balance between theory and practice in document processing. Papers that addressed both on a somewhat equal basis were preferred. Each paper was subjected to rigorous peer review.

Finally, we hope that the work presented in this volume will stimulate other researchers to join us in investigating the principles of digital document processing. To support the dissemination of this research, plans are being made for the next workshop in this series, PODDP 2000, to be held in conjunction with DDEP 2000, the Eighth International Conference on Digital Documents and Electronic Publishing.

July 1998
Ethan V. Munson
Charles Nicholas
Derick Wood

Organization

Steering Committee

Derick Wood (Hong Kong University of Science & Technology), Chair
Anne Brueggemann-Klein (Technische Universität München, Germany)
Richard Furuta (Texas A&M University, USA)
Ethan V. Munson (University of Wisconsin-Milwaukee, USA)
Makoto Murata (Fuji Xerox Informations Co. Ltd., Japan)
Charles Nicholas (University of Maryland, Baltimore County, USA)

Program Committee

Charles Nicholas (University of Maryland, Baltimore County, USA), Co-Chair
Derick Wood (Hong Kong University of Science & Technology), Co-Chair
Howard Blair (Syracuse University, USA)
Heather Brown (University of Kent-Canterbury, UK)
Anne Brueggemann-Klein (Technische Universität München, Germany)
Norbert Fuhr (Universität Dortmund, Germany)
Richard Furuta (Texas A&M University, USA)
Heikki Mannila (University of Helsinki, Finland)
Ethan V. Munson (University of Wisconsin-Milwaukee, USA)
Makoto Murata (Fuji Xerox Information Systems, Japan)
Cecile Roisin (INRIA Rhône-Alpes, France)

Table of Contents

Document Models and Structures

Context and Caterpillars and Structured Documents
Anne Brüggemann-Klein, Stefan Hermann, and Derick Wood

A Conceptual Model for Tables
Xinxin Wang and Derick Wood

Analysis of Document Structures for Element Type Classification
Helena Ahonen, Barbara Heikkinen, Oskari Heinonen, Jani Jaakkola, and Mika Klemettinen

Using Document Relationships for Better Answers
Mingfang Wu and Ross Wilkinson

Characterization of Documents and Corpora

Generating, Visualizing, and Evaluating High-Quality Clusters for Information Organization
Javed Aslam, Katya Pelekhov, and Daniela Rus

On the Specification of the Display of Documents in Multi-lingual Computing
Myatav Erdenechimeg, Richard Moore, and Yumbayar Namsrai

Spotting Topics with the Singular Value Decomposition
Charles Nicholas and Randall Dahlberg

A Linear Algebra Approach to Language Identification
Laura A. Mather

Accessing Collections of Documents

Indexed Tree Matching with Complete Answer Representations
Holger Meuss

Combining the Power of Query Languages and Search Engines for Online Document and Information Retrieval: The QIRi@D Environment
Laure Berti, Jean-Luc Damoiseaux, and Elisabeth Murisasco

Intensional HTML
Bill Wadge, Gord Brown, m. c. schraefel, and Taner Yildirim

Data Model for Document Transformation and Assembly
Makoto Murata