Published 2022. 5. 18. 17:16

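Getting Started with OCR

I had tried OCR with tesseract before, but I never really understood how it works.

I want to get a quick introduction through Coursera.

I am taking "tesseract and optical character recognition", the week 2 lecture of <Python Project: pillow, tesseract, and opencv>, which is part of the University of Michigan's Python specialization.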

Introduction to Optical Character Recognition

optical character recognition(OCR)

- conversion of text captured in images into text usable by a computer

- read text in images

Tesseract OCR

Open-Source Software

OSI

Apache License

GPL

The (Py)Tesseract Library

from PIL import Image
import pytesseract

image = Image.open('test.png')
dir(pytesseract)

['ALTONotSupported', 'Output', 'TSVNotSupported', 'TesseractError', 'TesseractNotFoundError', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__version__', 'get_languages', 'get_tesseract_version', 'image_to_alto_xml', 'image_to_boxes', 'image_to_data', 'image_to_osd', 'image_to_pdf_or_hocr', 'image_to_string', 'pytesseract', 'run_and_get_output']

help(pytesseract.image_to_string)

- The docs are fairly brief, so let's look a little more closely

import inspect
src = inspect.getsource(pytesseract.image_to_string)
print(src)

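Now we can look at the source code

pytesseract is a wrapper that calls Tesseract itself, which is written in C++ --> some Googling is needed

The run_and_get_output() function

- save_image : saves the image

- prepare_image : loads the input as a PIL image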
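As a rough illustration of what a wrapper like this ends up doing, the sketch below only assembles a command line for the `tesseract` executable; it does not run it. `build_tesseract_command` is a hypothetical helper written for this note, not part of pytesseract, and the real library handles temporary files and more options than shown here.

```python
from typing import List, Optional


def build_tesseract_command(
    image_path: str,
    output_base: str,
    lang: Optional[str] = None,
    psm: Optional[int] = None,
) -> List[str]:
    """Assemble (but do not run) a tesseract CLI invocation.

    Hypothetical helper for illustration; not pytesseract's own code.
    """
    # CLI shape: tesseract <image> <output base> [options...]
    command = ["tesseract", image_path, output_base]
    if lang is not None:
        command += ["-l", lang]         # recognition language, e.g. "eng"
    if psm is not None:
        command += ["--psm", str(psm)]  # page segmentation mode
    return command


command = build_tesseract_command("test.png", "out", lang="eng", psm=6)
print(" ".join(command))
```

If tesseract is installed, a list like this could be handed to `subprocess.run(command, check=True)`, which would write the recognized text to `out.txt`.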

text = pytesseract.image_to_string(image)
print(text)

More Tesseract

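What if the input image contains noise?

OCR results can be improved by combining pytesseract with PIL preprocessing

1. Resize the image (keeping the aspect ratio)

- pass the Image.ANTIALIAS option (an alias of Image.LANCZOS; it was removed in Pillow 10, so use Image.LANCZOS there)

2. Convert the image to grayscale

3. Binarization

- splitting the image into two clearly distinct parts (here, black and white)

- done by thresholding; a pixel whose value is below the threshold becomes black, and every other pixel becomes white

- this step can remove noise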
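For step 1, the only arithmetic involved is scaling both sides by the same factor. The sketch below is a minimal illustration in plain Python; `scaled_size` is a hypothetical helper name, and the resulting tuple is what one would pass as the size argument of PIL's `resize`.

```python
from typing import Tuple


def scaled_size(width: int, height: int, target_width: int) -> Tuple[int, int]:
    """Return (width, height) with the requested width and the original aspect ratio."""
    if width <= 0 or height <= 0 or target_width <= 0:
        raise ValueError("dimensions must be positive")
    # Scale the height by the same factor as the width, rounded to whole pixels.
    target_height = max(1, round(height * target_width / width))
    return target_width, target_height


print(scaled_size(1200, 800, 600))  # (600, 400)
```

With a PIL image this would look like `image.resize(scaled_size(image.width, image.height, 600), Image.LANCZOS)`.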

# note that the mode '1' is passed as a string
img = Image.open('test.png').convert('1')

This is a built-in PIL feature

def binarize(image_to_transform, threshold):
    # convert to grayscale first
    output_image = image_to_transform.convert('L')
    for x in range(output_image.width):
        for y in range(output_image.height):
            # below the threshold -> black, otherwise -> white
            if output_image.getpixel((x, y)) < threshold:
                output_image.putpixel((x, y), 0)
            else:
                output_image.putpixel((x, y), 255)
    return output_image

Try implementing the function above by hand

for thresh in range(0, 257, 64):
    print(thresh)
    print(pytesseract.image_to_string(binarize(Image.open('test.png'), thresh)))

This makes it possible to test which threshold value works best
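To see why the choice matters, here is a toy version of the same rule applied to a small grid of made-up grayscale values instead of a real image. `binarize_grid` is a hypothetical stand-in that uses the same rule as `binarize` (below the threshold becomes 0, everything else 255).

```python
from typing import List

Grid = List[List[int]]


def binarize_grid(pixels: Grid, threshold: int) -> Grid:
    """Map grayscale values (0-255) to 0 (black) below the threshold, else 255 (white)."""
    return [[0 if value < threshold else 255 for value in row] for row in pixels]


# Made-up values: dark text (40), grey noise (120), light background (220).
sample = [
    [220, 40, 220],
    [120, 40, 120],
]

for threshold in (64, 128, 192):
    print(threshold, binarize_grid(sample, threshold))
```

With these values, a threshold of 64 leaves only the text black, while 128 and 192 also turn the grey noise black, so it merges with the text.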

Tesseract and Photographs

What about a real-world photograph?

1. Crop only the part you need

# image is the photo opened with Image.open(); the box is (left, upper, right, lower)
bounding_box = (315, 170, 700, 270)
title_image = image.crop(bounding_box)
pytesseract.image_to_string(title_image)
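The crop box is a `(left, upper, right, lower)` tuple in pixel coordinates, so the size of the cropped region follows from simple subtraction. A small sketch, with `box_size` as a hypothetical helper for sanity-checking a box before cropping:

```python
from typing import Tuple

Box = Tuple[int, int, int, int]


def box_size(box: Box) -> Tuple[int, int]:
    """Width and height of a (left, upper, right, lower) crop box."""
    left, upper, right, lower = box
    if right <= left or lower <= upper:
        raise ValueError("box must have right > left and lower > upper")
    return right - left, lower - upper


print(box_size((315, 170, 700, 270)))  # (385, 100)
```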

2. Upscale the cropped image to a higher resolution

+ image resampling

little_sign = image.crop((900, 420, 940, 445))
new_size = (little_sign.width * 10, little_sign.height * 10)

# the various resampling options for resize
options = [Image.NEAREST, Image.BOX, Image.BILINEAR, Image.HAMMING, Image.BICUBIC, Image.LANCZOS]
for option in options:
    # resize returns a new image; display or save it to compare the filters
    little_sign.resize(new_size, option)

After upscaling, binarization can be applied as well

How can we find the best binarization threshold in this case?

A word dataset can be used

# English dictionary dataset
with open('words_alpha.txt', 'r') as f:
    data = f.read()
# a set gives fast lookups; split() also drops empty entries
eng_dict = set(data.split())

import string

# test thresholds
# bigger_sign is the upscaled crop, e.g. little_sign.resize(new_size, Image.BICUBIC)
for i in range(150, 170):
    text = pytesseract.image_to_string(binarize(bigger_sign, i))

    # drop everything that is not a letter
    text = text.lower()
    comparison = ''
    for character in text:
        if character in string.ascii_lowercase:
            comparison = comparison + character

    # print the result when it is in the English dictionary
    if comparison in eng_dict:
        print(comparison)

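Domain-specific terminology dictionaries and the like can be used too

It should have been obvious, but I did not realize preprocessing mattered this much. The poor results from my earlier tests were probably down to weak preprocessing as well ...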
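The same check can be sketched independently of the OCR call, so that any vocabulary can be swapped in. Everything below is illustrative: the helper names, the OCR outputs per threshold, and the two-word domain vocabulary are all made up.

```python
import string
from typing import Dict, Iterable, List


def letters_only(text: str) -> str:
    """Lowercase the text and keep only the characters a-z."""
    return "".join(ch for ch in text.lower() if ch in string.ascii_lowercase)


def thresholds_matching(results: Dict[int, str], vocabulary: Iterable[str]) -> List[int]:
    """Return the thresholds whose cleaned OCR text is a word in the vocabulary."""
    # Drop empty entries so an empty OCR result never counts as a match.
    words = {word for word in vocabulary if word}
    return [t for t, text in sorted(results.items()) if letters_only(text) in words]


# Made-up OCR outputs keyed by threshold, and a made-up domain vocabulary.
ocr_results = {150: "FOSSIL\n", 155: "F0SSlL", 160: "fossil.", 165: ""}
domain_words = ["fossil", "mineral"]
print(thresholds_matching(ocr_results, domain_words))
```

Filtering out empty strings matters because a word list split on newlines can contain an empty entry, which would otherwise make a blank OCR result look like a hit.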
