Saturday, September 24, 2011

Old Print into Searchable Text

I will add a brief post here and leave the heavy lifting to John Raynor, who can detail a few small projects in progress by some chapter members.

  • A lot of information on archaeology was printed on paper in the past, typeset or typewritten, and in some cases later photocopied.
  • This information becomes more valuable and more widely usable once it moves into a searchable, character-based digital form.
Ways of converting it that we are experimenting with:
  • OCR scan
  • retyping
OCR Scan: Last year I bought a $50 four-in-one wireless printer whose software bundle included Optical Character Recognition (OCR) software. How it works is quite simple: put your printed page on the glass of the scanner, select the scan-to-text-file function, and press Scan Start. The printer sends the digital file to your wirelessly connected PC, opens a WordPad file, and inserts the text. It takes about three seconds. The caveat is that it is not one hundred per cent accurate. It can get confused by smudged photocopies, uneven typewriter letter blackness, etc. My testing, which is just preliminary, used an old piece of my own writing that was printed on either a daisy wheel printer or a dedicated word processor printer around 1985. In three pages of text it missed about five characters and in some cases turned one real word into another ("hail" mistaken for "hall"). A spell checker will not flag errors like that, so they need to be caught by proofreading.

Retyping: Someone reads the document and types it word for word into a new word processing file. There will be typos and other errors to catch in proofreading.
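Since both methods leave errors for a proofreader to catch, one possible aid is to compare an OCR version and a retyped version of the same page and look only at the places where they disagree. The sketch below is just an illustration of that idea, not something the chapter is currently doing. It assumes both versions exist as plain text, and the function name and the two sample sentences are made up for the example.

```python
import difflib


def word_differences(ocr_text, retyped_text):
    """Return (ocr_words, retyped_words) pairs where the two versions disagree."""
    ocr_words = ocr_text.split()
    retyped_words = retyped_text.split()
    # Compare the two texts word by word rather than character by character.
    matcher = difflib.SequenceMatcher(a=ocr_words, b=retyped_words, autojunk=False)
    differences = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            differences.append(
                (" ".join(ocr_words[i1:i2]), " ".join(retyped_words[j1:j2]))
            )
    return differences


# Made-up sample sentences: the OCR version has a look-alike word error,
# the retyped version has a typing error.
ocr_sample = "They gathered in the great hail of the village."
retyped_sample = "They gathered in the great hall of teh village."

for ocr_part, retyped_part in word_differences(ocr_sample, retyped_sample):
    print(f"OCR: {ocr_part!r}   retyped: {retyped_part!r}")
```

A comparison like this only shows where the two versions differ. A person still has to check the printed page to decide which one is right, and an error that appears in both versions will go unnoticed.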


There is a Sagard text that has been retyped. I will stop my post here and let John pick up the story, since he knows all the details of this effort.

1 comment:

John Raynor said...

Some 5 years ago, when I was doing research for my book, I read Sagard's "The Long Journey to the Country of the Hurons" as it was available on the Champlain Society website. As a follow-up I wanted to read Sagard's Histoire du Canada in English, but soon determined that this was not possible as it had not been translated. In a later random search I came upon an obscure reference to a manuscript, held at the Robarts Library at U of T, of the translation that I had been looking for. I sent a friend who was taking classes at U of T to investigate, and he confirmed that an 800-page manuscript existed but that it would be a tall task to copy it, as it would be a supervised white-glove process with a per-page cost for photocopying. I put the project on the back burner because I did not have the time or resources to continue. Some months later Conrad Heidenreich was giving a presentation at the Huronia Museum, and in a casual conversation afterwards he indicated that he in fact had a copy of this same manuscript and that, much to my delight, I would be welcome to borrow it. Jamie Hunter and I arranged a visit to the Heidenreichs and returned with Conrad's copy. I then arranged for copies to be made for myself and the museum, and once again the project went on the shelf. After reading through the transcript it became obvious to me that this resource could be best utilized if it was made available in a searchable format, but with my one-finger attempt at typing the project returned to the back burner. Fast forward to the spring of this year, when again through casual conversation Peter Davis of the Huronia Chapter not only expressed an interest in the project but followed through on that interest, and we now have a completed transcript of Sagard's Histoire in English. Once this searchable transcript is protected from wholesale copying, it will be made available on disk for sale via the Huronia Museum.
It is my hope that this resource will be ready for release in time for the OAS symposium next month in Ottawa.