Learn how software converts images or paper documents to searchable text

What is OCR?

OCR, or optical character recognition, is technology that recognizes letters, numbers, and other written characters in an image. This allows it to convert images or scanned paper documents into searchable electronic text.

Surprisingly, OCR was first used over 100 years ago in the Optophone, a reading device for the blind that translated letters into sounds. Today, OCR is much more advanced, using artificial intelligence to identify characters. This allows it to recognize letters even when they appear remarkably different. It can, for instance, identify the letters “a” and “g” despite dramatic differences in their structures across common fonts. In some applications (common in mail sorting, but not in ediscovery), OCR can even be used to decipher handwriting.

In ediscovery, OCR is used to convert image- or paper-based discovery into computer-based text. When discoverable materials are received as images such as TIFF files, software with OCR can identify any letters, numbers, or other text characters. It then converts those characters from pixel-based pictures into readable text.
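
To make the idea of turning pixels into text concrete, here is a deliberately tiny sketch. It is not how production OCR engines work (they rely on far more sophisticated models); it swaps in simple template matching, and the three 3x5 bitmaps and all function names are invented for illustration. Each glyph is compared against known templates, and the closest one wins, which is why a slightly smudged letter can still be recognized.

```python
# Toy illustration only: the templates and names below are hypothetical.
# Each glyph is a 3x5 bitmap written as rows of '#' (ink) and '.' (blank).
TEMPLATES = {
    "I": ["###", ".#.", ".#.", ".#.", "###"],
    "L": ["#..", "#..", "#..", "#..", "###"],
    "T": ["###", ".#.", ".#.", ".#.", ".#."],
}


def pixel_distance(a, b):
    """Count the pixels that differ between two same-sized bitmaps."""
    return sum(
        pa != pb
        for row_a, row_b in zip(a, b)
        for pa, pb in zip(row_a, row_b)
    )


def recognize(glyph):
    """Return the template character whose bitmap is closest to the glyph."""
    return min(TEMPLATES, key=lambda ch: pixel_distance(glyph, TEMPLATES[ch]))


def image_to_text(glyphs):
    """Convert a sequence of glyph bitmaps into a string of characters."""
    return "".join(recognize(g) for g in glyphs)


# An "L" with one stray pixel of noise in the middle row.
smudged_l = ["#..", "#..", "#.#", "#..", "###"]
print(recognize(smudged_l))
```

The nearest-template rule is the simplest possible stand-in for the recognition step; it also hints at why heavily damaged characters can be misread, since enough noise moves a glyph closer to the wrong template.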

Similarly, OCR can be used when discovery involves physical documents such as typed letters or printouts. Those physical pages can be scanned, processed through an OCR system, and rendered as computer files in a fraction of the time it would take a human to read them.

OCR resolves several problems with image- and paper-based discovery.

The primary advantage is that electronic text, unlike images or paper, is fully searchable. This eliminates the need for human review teams to laboriously flip through stacks of paper discovery. Instead, a litigant can scan documents, use OCR to generate text files, and then search those files for keywords, names, dates, and any other text-based content. This significantly reduces the amount of time required to process or review image- or paper-based discovery. As a result, OCR can streamline and simplify discovery, allowing for better early case assessment (ECA). It can also drastically lower costs, particularly within the review phase.
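
Once OCR has produced text, searching it is an ordinary text-processing task. The sketch below assumes the OCR output is already available as a mapping from document ID to extracted text; the IDs, sample text, and the `search_documents` helper are all hypothetical, and a real review platform would typically use an index rather than scanning every document.

```python
import re


def search_documents(documents, terms):
    """Map each search term to the sorted IDs of documents containing it.

    Matching is case-insensitive and on whole words only.
    """
    hits = {}
    for term in terms:
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        hits[term] = sorted(
            doc_id for doc_id, text in documents.items() if pattern.search(text)
        )
    return hits


# Hypothetical OCR output keyed by made-up document IDs.
ocr_text = {
    "DOC-001": "Invoice dated 12 March 2019 from Acme Corp.",
    "DOC-002": "Meeting notes: discuss the Acme contract renewal.",
    "DOC-003": "Shipping manifest, no counterparties named.",
}

results = search_documents(ocr_text, ["acme", "2019"])
for term, doc_ids in results.items():
    print(term, doc_ids)
```

The same loop works for names and dates as well as keywords, since all of them are just strings once the pages have been converted to text.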

Unlike paper-based documents, electronically stored information (ESI) can be edited, copied, and distributed almost instantaneously.

Digitizing paper documents also obviates the need for physical storage space or file organization. Boxes of paper discovery are susceptible to damage or destruction through fire, flooding, or natural disasters; with OCR, discoverable information can instead be extracted as text and stored electronically. This also reduces the risk of misplacing a specific piece of paper or wasting time locating a particular document.

Of course, OCR has its shortcomings. Despite considerable advances, it is still not 100 percent accurate, which can lead to misspelled words and missed search terms. It also may not work well on faded or damaged documents where portions of letters are smudged or obscured. Nonetheless, OCR may eventually render paper-based discovery entirely obsolete, since anything still on paper can be rapidly ingested and processed as ediscovery.
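
One common way to soften the impact of OCR misspellings on search is approximate (fuzzy) matching. The sketch below is only an illustration of that idea using Python's standard `difflib` module; the `fuzzy_hits` helper, the sample sentence, and the 0.8 similarity cutoff are assumptions chosen for the example, not settings taken from any particular ediscovery tool.

```python
import difflib
import re


def fuzzy_hits(text, term, cutoff=0.8):
    """Return the distinct words in text that resemble term, best match first."""
    words = sorted(set(re.findall(r"\w+", text.lower())))
    # get_close_matches requires n > 0, even when there are no words.
    return difflib.get_close_matches(
        term.lower(), words, n=len(words) or 1, cutoff=cutoff
    )


# Hypothetical OCR output in which the letter "o" was misread as a zero.
page = "The c0ntract was signed in March."

print("contract" in page.lower())    # an exact search misses the term
print(fuzzy_hits(page, "contract"))  # a fuzzy search can still surface it
```

A lower cutoff catches more OCR errors but also returns more false positives, so the threshold is a trade-off rather than a fixed value.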

Glossary definition

OCR, or optical character recognition, is a technology that can identify letters, numbers, and other characters, converting images or scanned paper documents into searchable electronic text.
