I’ve ditched gocr in favor of tesseract. I found a script that makes the image manipulation, OCR’ing and cleanup a snap. All I had to do was tune the parameters passed to ImageMagick’s convert program until it produced the cleanest possible image for OCR.
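The script itself isn’t reproduced here, so the following is only a minimal sketch of the two steps it wraps: convert writes a TIFF, then tesseract reads it. The function name, the file names and the specific convert settings (grayscale, 300% resize, 50% threshold) are assumptions for illustration, not the parameters I actually settled on.

```python
import shlex
from pathlib import Path


def build_ocr_commands(screenshot, workdir="."):
    """Return (convert_cmd, tesseract_cmd) as argument lists for one screenshot.

    Hypothetical helper: the convert settings below are placeholder values
    to be tuned per screenshot, not the ones used for the images in this post.
    """
    src = Path(screenshot)
    tiff = Path(workdir) / (src.stem + ".tif")
    out_base = Path(workdir) / src.stem  # tesseract appends ".txt" itself

    convert_cmd = [
        "convert", str(src),
        "-colorspace", "Gray",  # drop color information
        "-resize", "300%",      # assumed value: enlarge small UI text
        "-threshold", "50%",    # assumed value: force pure black and white
        str(tiff),
    ]
    tesseract_cmd = ["tesseract", str(tiff), str(out_base)]
    return convert_cmd, tesseract_cmd


if __name__ == "__main__":
    # Print the commands instead of running them, so the sketch works
    # even where ImageMagick and tesseract are not installed.
    for cmd in build_ocr_commands("ftp_screenshot.png"):
        print(shlex.join(cmd))
```

To actually run the steps, each list can be passed to `subprocess.run(cmd, check=True)` on a machine that has both tools installed.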

For example, this:
[Image: ftp screen shot]

turns into this:
[Image: ftp screen shot converted]

Which OCR’s to:

$6 + $0.50 Sit & Go (Turbo)
Game: Hold’em(Turbo)No Limit Status: Completed
Buy-In: $6 + $0.50 Started: May 16 09:33
Entrants: 9 Ended: May 16 10:15

To show why I ditched gocr, here is its output using the same ImageMagick command line switches, except that instead of writing tesseract’s required TIFF format I wrote the PBM format that gocr favors.

_6 + _O.5O Sit & Go (Turbo)
Game: HoId’em (Turbo) No Limit 5tatus: CompIeted
Bu_-In: _6 + _D.5D 5tarted: May 16 D9:33
Entrants: 9 Ended: May 16 lD:15

While gocr gets the spacing right, the recognition quality is vastly different: it reads $ as _, 0 as O or D, 1 as l, l as I, S as 5 and y as _.
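One way to put a rough number on that difference is to diff the two outputs character by character. This is a small sketch using Python’s standard difflib on the Buy-In line from each engine; treating the tesseract line as the reference is an assumption, and the function name is made up for illustration.

```python
from difflib import SequenceMatcher

# The same line of the screenshot as read by each engine (copied from above).
TESSERACT_LINE = "Buy-In: $6 + $0.50 Started: May 16 09:33"
GOCR_LINE = "Bu_-In: _6 + _D.5D 5tarted: May 16 D9:33"


def char_mismatches(reference, candidate):
    """Return (expected, got) pairs for every span where the strings differ."""
    matcher = SequenceMatcher(None, reference, candidate, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            diffs.append((reference[i1:i2], candidate[j1:j2]))
    return diffs


if __name__ == "__main__":
    for expected, got in char_mismatches(TESSERACT_LINE, GOCR_LINE):
        print(f"expected {expected!r}, got {got!r}")
    ratio = SequenceMatcher(None, TESSERACT_LINE, GOCR_LINE, autojunk=False).ratio()
    print(f"similarity: {ratio:.2f}")
```

The similarity ratio is only a rough indicator, but it makes it easy to compare convert settings against each other without eyeballing every line.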

Now while tesseract’s spacing isn’t PERFECT, the text, numbers and symbols ARE in this example. By contrast, in FTOPS #3, FTOPS comes out as F1~OPS because the F and T run together. Next step: update the tournaments that have valid screen shots with the OCR’ed data. As a bonus, the start and end times will be 100% correct.
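Pulling the fields out of the OCR’ed text could look something like the sketch below. It is only an illustration: the function name and the dictionary keys are hypothetical, the regular expressions are fitted to the single sample above, and the year has to be supplied by the caller because the screen shot text doesn’t include one.

```python
import re
from datetime import datetime
from decimal import Decimal

# The tesseract output from the example above.
OCR_TEXT = """\
$6 + $0.50 Sit & Go (Turbo)
Game: Hold’em(Turbo)No Limit Status: Completed
Buy-In: $6 + $0.50 Started: May 16 09:33
Entrants: 9 Ended: May 16 10:15
"""

MONEY = r"\$(\d+(?:\.\d+)?)"
STAMP = r"([A-Z][a-z]{2} \d{1,2} \d{2}:\d{2})"


def parse_tournament(text, year):
    """Extract buy-in, fee, entrants and start/end times from OCR text.

    `year` is supplied by the caller; the OCR'ed text carries no year.
    Month names are parsed with %b, which assumes an English locale.
    """
    buy_in = re.search(r"Buy-In:\s*" + MONEY + r"\s*\+\s*" + MONEY, text)
    entrants = re.search(r"Entrants:\s*(\d+)", text)
    started = re.search(r"Started:\s*" + STAMP, text)
    ended = re.search(r"Ended:\s*" + STAMP, text)
    if not (buy_in and entrants and started and ended):
        raise ValueError("OCR text is missing an expected field")

    def to_datetime(stamp):
        return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M")

    return {
        "buy_in": Decimal(buy_in.group(1)),
        "fee": Decimal(buy_in.group(2)),
        "entrants": int(entrants.group(1)),
        "started": to_datetime(started.group(1)),
        "ended": to_datetime(ended.group(1)),
    }
```

Raising an error on a missing field is deliberate: a screen shot that OCRs badly (like the F1~OPS case) should get flagged for a manual look rather than quietly updating a tournament with partial data.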


This work is published under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License.