What Do You Need to Know About OCR?

Title: What Do You Need to Know About OCR?
Created: 2/2/2005
Scanner Models: All
Operating Systems: Windows 98 / ME / 2000 / XP

The right OCR product can save you time and money. Buying the wrong product will waste your time and money, and you'll ultimately be dissatisfied and quit using it. To buy the right OCR product, you need to know what OCR is and what it can do for you. This document provides some important information to help you evaluate your OCR options.

What is OCR?

OCR stands for Optical Character Recognition. Very simply stated, OCR means converting an electronic picture of text (such as a letter) into a form your text-based applications, such as word processors, DTP, spreadsheets, and databases, can use. Technically speaking, OCR products look at a picture of a character and convert it into an ASCII or ANSI character that applications programs can utilize. This conversion process is called recognition.

Who's Using OCR and Why

Over 65% of the time spent at the keyboard is spent retyping existing material that's been typed at least once already! With OCR, you save this valuable time. Busy professionals are using OCR in many ways. For example, exchanging contracts between clients and attorneys involves many rounds of fax, retype, fax, retype, etc. OCR can easily convert the faxed document into WordPerfect, for example. Columns of numbers from financial reports no longer need to be rekeyed into Lotus or Quattro. OCR can put them directly into useable spreadsheet format. There are lots of other examples, like storing articles and abstracts, revising manuals produced on obsolete or dedicated word processing equipment, and capturing data from forms.

Furthermore, you can easily do things you would never have attempted. For example, you can create a correspondence database, or file resumes on-line for easy applicant/position match-up. Or even analyze your phone bills by scanning them and putting the data into a spreadsheet, so you can easily total up the cost of calls to particular phone numbers or cities.

What Scanners Do

Scanners are nothing more than fancy cameras. They simply take a picture of a page, then pass it to the PC in electronic format as a bitmapped image file. Taking an electronic picture is called scanning. For graphics applications, an electronic picture is just fine. Using a program like PC Paintbrush, you can easily clip the picture or illustration that you want to put in a report, newsletter, or brochure, for example. Fax machines include a scanner; they take an electronic picture, then send it over telephone lines. So, if you are using OCR on a fax file, whoever sent the fax has done the scanning for you. What you receive on a fax board is a scanned image which can also be converted to text using OCR.

What Scanners Don't Do

Scanners do not give you useable text. That's what OCR is all about.

How Do You Buy OCR?

Generally, scanners and OCR are sold separately. Some specialized OCR products have both a scanner and an OCR engine built in. This means that you don't have to buy the components separately. Usually these types of systems are designed for production/high-volume applications and are more expensive than a system where the components are sold separately. Your volume and throughput requirements will determine the solution most appropriate for you.

Beyond OCR: Getting it all Right

There's more to OCR than recognizing the characters on the page. You've got to get it all and get it all right. Look at a few pages of text, and you'll begin to see some of the problems that simple OCR cannot address: multiple columns spanned by large headlines, a combination of type styles and faces, text wrapped around illustrations, indented lists, numbered lines, headers and footers, tables, and a number of other characteristics that would become a jumbled mess if the characters on the page were simply recognized and placed into lines of text. If you've ever had to use a word processing program to reformat a document that someone typed "typewriter style" (spaces in place of tabs and centering, hard returns at the end of every line, and so forth), you understand how important intelligent analysis and formatting can be.

The term document recognition means extracting everything that is important from a document. This ability separates highly competent OCR products from simple OCR. The end results of document recognition can only be evaluated in your word processor, spreadsheet, or DTP.

Reading the Document Right

Think about the types of documents you need to read. Are they all first-generation, clean and crisp, black on white pages? Or (more likely), are they photocopied pages, computer-printed pages, and pages that have been circulated, written on, and corrected? Do you need to read both typewritten and typeset documents? Some OCR systems are severely limited by the types of document imperfections they can handle. Others are limited by what they consider to be a recognizable character. You should look for products that will handle the types of pages you need to read.

Check for the ability to handle these kinds of pages:

- Typeset documents in a range of point sizes (footnotes can be as small as 6 points; big headlines sometimes range up to 28 points)
- Laser printed documents (whose characters tend to bleed together more than typeset documents)
- Typewritten documents
- Second, third, fourth... generation photocopies
- Dot-matrix documents including draft-quality typical of spreadsheet output (some vendors claim to recognize dot-matrix, but they generally refer to "near letter quality" printing)
- Line printer listings
- Fax images from PC fax boards
- Paper faxes from a fax machine

Keep the future in mind, and think about how others in your company can benefit from OCR. If you need to read printed faxes or dot-matrix spreadsheets, for example, check to see if the product has a setting for dot-matrix. Run a sample page of draft-quality dot-matrix text and check the results. Generally, a special setting for dot-matrix or mono-spaced text will significantly improve recognition on these pages.

Reading Columns Right

Think about documents with multiple columns and tables. OCR products that automatically decolumnize multiple-column documents will be much easier to use than those that don't.

Why Decolumnize?

If you scan a multi-column document to use in your word processing program and the OCR product does not decolumnize it, you'll see multiple columns on your screen. However, you'll soon realize that the columns are made by dividing each line with spaces or tabs. Imagine trying to edit the text in the first column. As you type, the characters in all the other columns move simultaneously, even though they are supposed to be in separate columns. And when they wrap around at the end of the line, the entire page becomes a mess.

When a document is decolumnized, each column is kept together as a separate entity. You can edit the text in one column without affecting the text in other columns. What you see on your screen and in the final output depends on the formatting abilities of your application. Most word processing programs enable you to format and print in multiple columns, so recolumnization is possible.

Reading Tables Right

On the other hand, you can have too much of a good thing. You don't want your recognition product to decolumnize tables, as they become nearly impossible to reconstruct. A good recognition product will decolumnize intelligently, distinguishing between body text and tables.

Check formatting of a multiple-column document and a document that has tables. Give preference to systems that can properly decolumnize documents with no operator intervention.

Reading Both Sides

Double-sided documents can pose a problem to some OCR products. Some scanners have a tray that holds a stack of pages (generally 20 to 50 pages). However, if you need to scan both sides of the page, you may have to do one sheet at a time to get both sides in the proper order.

Check for the ability to read double-sided documents by scanning all of one side, then all of the other side, ending up with the whole document in the right order.
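
Scanning all of one side and then all of the other implies a reordering step before the document can be read in sequence. The following is a minimal sketch of that step in Python, not a description of how any particular OCR product does it. It assumes that the stack is simply flipped over for the second pass, so the back sides arrive in reverse order; the function name `merge_two_pass_scan` and the `backs_reversed` flag are hypothetical.

```python
def merge_two_pass_scan(fronts, backs, backs_reversed=True):
    """Interleave front-side and back-side scans into reading order.

    fronts: front-side scans, in sheet order.
    backs: back-side scans, in the order they came off the scanner.
    backs_reversed: assumption that flipping the stack for the second
        pass reverses the order in which the back sides are scanned.
    """
    if len(fronts) != len(backs):
        raise ValueError("each sheet needs one front scan and one back scan")
    ordered_backs = list(reversed(backs)) if backs_reversed else list(backs)
    pages = []
    for front, back in zip(fronts, ordered_backs):
        pages.extend([front, back])
    return pages


# Three sheets: the first pass yields pages 1, 3, 5;
# the flipped stack then yields pages 6, 4, 2.
print(merge_two_pass_scan(["p1", "p3", "p5"], ["p6", "p4", "p2"]))
# -> ['p1', 'p2', 'p3', 'p4', 'p5', 'p6']
```

If the scanner's feeder keeps the back sides in sheet order instead, passing `backs_reversed=False` gives the same interleaving without the reversal.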

Reading at the Right Time

Scanning long documents normally requires someone feeding in pages and waiting for both scanning and recognition to complete. Most products scan a page, then recognize it, then scan another, and so forth. Scanning is usually a matter of 3 to 11 seconds per page, depending on the scanner. Recognition can take considerably more time: 50 to 60 seconds on a typical 2,000-character page.

Very few products offer the ability to "scan now, recognize later." This ability may be called deferred processing, delayed recognition, batch processing, or some similar term. This means you can do the fastest part of the job (the scanning) all at once, while the slower part (the recognition) can take place unattended, at another time. You can scan 100 pages, then go to lunch while recognition proceeds without further intervention.

Check for deferred processing and choose a reliable, high-speed scanner that can keep up with your workload.

Reading Parts of Pages Right

Often, you won't want to recognize the entire page. For example, if you are working with magazine or newspaper articles, forms, or invoices, you should look for a product that supports zoning or clipping. With such a product, you see a graphic representation of each page on your screen, and then can draw boxes around the text or numeric areas you want to recognize and around the graphic images you want to capture. Zoning increases throughput because the OCR system isn't spending time recognizing unwanted text. Clipping enables you to capture illustrations in such a way that you don't have to clean up the extraneous surrounding material later.

By combining deferred processing with clipping, you can really speed through magazine articles, where you want just part of the text on each page. Clipping lets you quickly specify the areas of each page you want to process. With deferred processing, you can quickly move from page to page without waiting for recognition to complete. Recognition can be done unattended after you've clipped what you want from each page.

Check for the ability to combine deferred processing with clipping different areas from each page.
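
The time saved by deferred processing can be estimated with simple arithmetic. The sketch below is illustrative only. It assumes the operator must stay at the scanner for the whole job when scanning and recognition are interleaved, and only for the scanning pass when recognition is deferred. The per-page times are picked from the ranges quoted above, and the function name `attended_seconds` is hypothetical.

```python
def attended_seconds(pages, scan_s, recognize_s, deferred):
    """Seconds an operator must stay at the scanner for a job.

    Interleaved: each page is scanned and then recognized before the next.
    Deferred: all pages are scanned first; recognition runs unattended.
    """
    if deferred:
        return pages * scan_s
    return pages * (scan_s + recognize_s)


PAGES = 100
SCAN_S = 10        # assumed; within the 3 to 11 second range
RECOGNIZE_S = 55   # assumed; within the 50 to 60 second range

interleaved = attended_seconds(PAGES, SCAN_S, RECOGNIZE_S, deferred=False)
deferred = attended_seconds(PAGES, SCAN_S, RECOGNIZE_S, deferred=True)
print(f"interleaved: {interleaved / 60:.1f} minutes attended")  # 108.3
print(f"deferred:    {deferred / 60:.1f} minutes attended")     # 16.7
```

Under these assumptions the total machine time is the same in both cases; what changes is how much of it needs a person present.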

Reading Forms Right

Another obvious use of zoning is forms. Some OCR products let you define zone templates for processing similar pages. A smaller subset of products let you specify an identification zone to automatically determine which template to apply to each form. You then can process a stack of different forms; the OCR system determines which template to apply to each one.

If you plan to process multiple forms, make sure the OCR product you choose has an identification zone feature.

Processing Multiple Jobs at Once

Some OCR products let you put a blank page between each separate document in a stack, then process the whole stack at once, saving each of the recognized documents in a separate file. This can be a handy feature if you routinely process resumes, job applications, reader response surveys, or just want to read several documents at a time.

If you choose an OCR product that allows page job separators, be sure to choose a scanner with a large-capacity, reliable document feeder.

Handling Formatting Right

You may need to reformat documents to match your company's style. If you will routinely be reformatting a particular type of document, such as in technical manual conversion, you'll need a "style sheet" feature. This is useful for removing formatting so that you can easily apply your own. With custom style sheets, you can specify your own margins, indents, line spacing, page length, font, and so forth, and automatically apply these to your finished document.

If reformatting is important to you, choose a product with style sheet features.

Reading the Most Type Right

The OCR systems available today provide vastly different font recognition capabilities. (See the Glossary at the end of this booklet for definitions of unfamiliar terms.) They can be broadly categorized as follows:

Polyfont Recognition

Polyfont recognition means the ability to read several fonts. In many cases, a polyfont product will only recognize specific fonts and cannot recognize others.

Polyfont systems are sufficient when you are only going to read a specific set of known documents that you can test before you buy the system. However, it's unlikely that you could anticipate all the fonts you need to read at the time you buy an OCR system, and once you've bought it, you're committed to this inherent limitation.

Trainable Recognition

Trainable products may seem an attractive alternative to polyfont products. You can train the system to recognize virtually any font that comes along. What you'll find when you begin to use a trainable system, however, is that each individual font and style requires a time-consuming training session. Even photocopying a document can change a font enough to require retraining. Trainable systems are most useful for reading certain unusual display fonts and foreign alphabets such as Cyrillic. Keep in mind, however, that the training time must be counted in the overall throughput of the system when making comparisons.

Omnifont Recognition

Omnifont products can recognize virtually any font that maintains fairly standard character shapes. True omnifont systems require no training or other adjustments to accommodate different fonts. If you want the most versatile and powerful recognition, an omnifont system is probably the only type that will satisfy your needs.

Check recognition on a variety of different pages you're likely to want to read. Look for a system that gives good results on different pages without constant operator intervention.

Handling Poor Quality Documents

Many OCR products offer adjustments that can improve things like page contrast for maximum accuracy. This can be helpful, especially if you have poor quality documents that have been photocopied several times.

Proofing the Document Right

Some OCR products provide built-in or optional editors for proofing a document after recognition. Only a very few use a text editor that can show you the actual image of a character or word just as the scanner saw it. By using image "pop-ups," you can dramatically cut your proofing time (often by up to 50%), because you won't need to constantly refer to the original paper document. You can see right away the correct character or word image, and if the text counterpart is misrecognized, correct it and move on.

Check for a built-in editor that includes pop-up images for easy verification.

There's No Substitute for Experience

There is more to recognition accuracy than statistics and numbers. When you read a printed page, you are drawing upon years and years of experience and stored memories. Once you master reading, you no longer see individual characters on a page; you see patterns that merge into words, sentences, and ideas.

Dictionaries Help

An effective OCR product must be able to overcome the limitations of seeing individual characters by evaluating groups of characters to see if they can make sense (by forming actual words, for example). This requires a recognition dictionary that checks for words as an integral part of the recognition process. The recognition device can do a much better job of "reading" a page if it has some "experience," in this case, a dictionary of proper word spellings. Products that check spelling after recognition cannot be as effective as products that end up with the correct recognition because they have already checked to see if the recognized character shapes form actual words.

Check for a dictionary or word list that is applied during recognition. Do not accept a spelling checker as a substitute for a recognition dictionary.

There's No Substitute for Accuracy

Recognition accuracy has to do with how many characters are misrecognized on a given page. For example, on a typical 2,000-character page, 98% accuracy means 40 errors; 99% accuracy means 20 errors. Think of this as a 100% difference in the number of errors (98% leaves twice as many as 99%), not a 1% difference in accuracy.

Beware of products that check and display their own recognition accuracy. The numbers displayed by these products are based on the number of failed attempts at recognition, not on the number of actual misrecognized characters.

There's No Substitute for Performance

When choosing a recognition product, speed is certainly important. However, many other factors affect overall recognition performance.
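
One of those factors is how many errors are left to correct, which follows directly from the stated accuracy. The short calculation below is a sketch of that arithmetic for a 2,000-character page; the function name `expected_errors` is illustrative and not taken from any OCR product.

```python
def expected_errors(characters, accuracy_percent):
    """Number of misrecognized characters implied by a stated accuracy."""
    # The error rate is whatever is left after the accuracy percentage.
    return round(characters * (100 - accuracy_percent) / 100)


PAGE_CHARACTERS = 2000
for accuracy in (98, 99):
    errors = expected_errors(PAGE_CHARACTERS, accuracy)
    print(f"{accuracy}% accuracy on {PAGE_CHARACTERS} characters -> {errors} errors")
# 98% accuracy on 2000 characters -> 40 errors
# 99% accuracy on 2000 characters -> 20 errors
```

A one-point change in accuracy halves the number of corrections on the page, which is why small differences in quoted accuracy matter more than they appear to.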
To understand the importance of performance, think about the amount of time you'll have to spend proofing and correcting your document if it contains lots of formatting errors, inappropriate decolumnization, extraneous text and images, misrecognized characters and words, and so forth.

Overall performance measurements must include:

- Scanning time
- Recognition time
- Verification time
- Revision time
- Reformatting time

Think about the amount of time it takes to go from printed page to a correctly formatted, correctly spelled on-line document. After all, the object of OCR is to save you time!

Is It Easy to Use?

Recognition should not be complex. After all, the computer should do the hard work, leaving you to make simple selections. How easily can you change settings or readjust for different kinds of jobs? Remember that an OCR product should help you get your work done faster.

What About Support and Upgrades?

As with any major purchase, you should look at companies that offer both the products you need today and the products you'll need tomorrow. Look for a company that specializes in OCR, not tape backup systems, spell checkers, or graphics tools. When you make the decision to put document recognition into your office, you'll find an ever-increasing range of opportunities to use it. So look for a broad product line with upgrade options to allow a smooth migration to higher-performance products.