Optical Character Recognition software is a cool technology that allows you to ‘digitise’ pages of text. We’ve interviewed Oliver Hellwig, a professor of Sanskrit and computer techie, about the OCR software he developed, which can read Hindi and Sanskrit characters. Read all about it below…
Tell us a bit about yourself.
I was born and still live in Berlin, the capital of Germany, together with my wife. Apart from writing OCR software, I work as a scientist and teacher of Sanskrit at several German universities.
Why did you get interested in OCR?
I needed OCR for my work, so I started to find out as much about it as I could.
What is OCR?
OCR stands for Optical Character Recognition, but put simply, these programs are “computer typists.” Imagine you have a book, document or letter and you want to use the text on your computer. You can either type it by hand or use an OCR program.
The advantage of a digital text is that you can work with it on the computer: you can change its appearance, font and layout, search for specific words, copy and paste it into other documents, or use other tools such as translation software. OCR is very useful when you have large amounts of text because it saves you a lot of time and never gets tired like a human being.
How does the computer know what and how to read?
Like every child, an OCR program has to learn a lot of things: What do Hindi letters look like? How are letters combined into words, and words into sentences? Which letters are never combined? For instance, you will never see four consonants at the beginning of a word. And, of course, it needs a list of Hindi vocabulary.
A part of this knowledge is fixed by rules, which are hard-coded in the program source code by the programmer. The other part has to be learned by repeated training: the program is given a page of text to read. The programmer will correct errors made by the OCR and give it some learning feedback. The more training the OCR has, the better it becomes – like every school child.
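For readers who like to code, here is a minimal sketch of those two kinds of knowledge: one hard-coded rule and a tiny learner that improves when its mistakes are corrected. It is only an illustration, not the method used in the actual software. The names `breaks_consonant_rule`, `Learner` and `distance` are made up for this example, Latin letters stand in for Hindi consonants, and letters are drawn as small pictures made of `#` and `.` characters.

```python
# Toy sketch: one fixed rule plus a learner that improves with feedback.
# All names are hypothetical; real OCR training is far more involved.

CONSONANTS = set("bcdfghjklmnpqrstvwxyz")  # Latin stand-in for Hindi consonants


def breaks_consonant_rule(word):
    """Hard-coded rule: a word may not start with four consonants."""
    start = word.lower()[:4]
    return len(start) == 4 and all(ch in CONSONANTS for ch in start)


def distance(a, b):
    """Count the pixels in which two equally sized letter pictures differ."""
    return sum(pa != pb for row_a, row_b in zip(a, b) for pa, pb in zip(row_a, row_b))


class Learner:
    def __init__(self):
        self.examples = []  # (picture, label) pairs learned so far

    def read(self, picture):
        """Return the label of the closest stored example, or None."""
        if not self.examples:
            return None
        return min(self.examples, key=lambda ex: distance(ex[0], picture))[1]

    def train(self, picture, correct_label):
        """Read a letter; if the guess is wrong, remember the correction."""
        guess = self.read(picture)
        if guess != correct_label:
            self.examples.append((picture, correct_label))
        return guess
```

Each call to `train` plays the role of the programmer correcting the program: a wrong guess adds one more example, so later calls to `read` have more to compare against.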
Can you explain the technology behind OCR to me?
Basically, an OCR imitates the process of human reading by using mathematics and geometry. This is done in several steps.
First, the OCR distinguishes between dark and light areas and finds out where the text is printed on a page.
Next, it detects lines in these areas, again comparing light and dark regions, and splits the lines into words and letters.
In the next step, the OCR produces a geometrical description of every possible letter it has found.
These descriptions are then compared with stored models of letters, and the most similar letter is selected.
The letters are put together into whole words again. Finally, these words are checked in a dictionary of Hindi.
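The steps above can be sketched in a few lines of Python. This is a toy version under strong assumptions, not the real program: the page is a small grid of grey values (0 is black, 255 is white), it holds a single word, Latin letters stand in for Hindi ones, and the "geometrical description" is simply the amount of ink in each row and column. All function names, the `MODELS` table and the threshold of 128 are invented for this illustration.

```python
# Toy walk-through of the OCR steps on a tiny "scanned page".
# Illustrative sketch only: models, threshold and dictionary are made up.

def binarize(gray, threshold=128):
    """Step 1: mark dark pixels (ink) as 1 and light pixels (paper) as 0."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def split_runs(flags):
    """Return (start, end) index pairs for each unbroken run of true flags."""
    runs, start = [], None
    for i, flag in enumerate(list(flags) + [False]):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            runs.append((start, i))
            start = None
    return runs


def find_letters(ink):
    """Step 2: find text lines (rows with ink), then letters (columns with ink)."""
    letters = []
    for top, bottom in split_runs(any(row) for row in ink):
        line = ink[top:bottom]
        col_has_ink = [any(row[x] for row in line) for x in range(len(line[0]))]
        for left, right in split_runs(col_has_ink):
            letters.append(tuple(tuple(row[left:right]) for row in line))
    return letters


def describe(letter):
    """Step 3: describe a letter by its ink count per row and per column."""
    return [sum(row) for row in letter] + [sum(col) for col in zip(*letter)]


def classify(letter, models):
    """Step 4: pick the stored model whose description is most similar."""
    desc = describe(letter)

    def difference(name):
        model = models[name]
        if len(model) != len(desc):  # different size: cannot be this letter
            return float("inf")
        return sum(abs(a - b) for a, b in zip(model, desc))

    return min(models, key=difference)


def read_page(gray, models, dictionary):
    """Step 5: join the letters into a word and look it up in the dictionary."""
    word = "".join(classify(l, models) for l in find_letters(binarize(gray)))
    return word, word in dictionary


# Stored models of four letters, built from hand-drawn 3-row shapes.
MODELS = {
    "I": describe(((1,), (1,), (1,))),
    "L": describe(((1, 0), (1, 0), (1, 1))),
    "J": describe(((0, 1), (0, 1), (1, 1))),
    "T": describe(((1, 1, 1), (0, 1, 0), (0, 1, 0))),
}
```

Real programs use much richer descriptions and have to cope with noise, skewed pages and touching letters, which this sketch ignores.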
Why did you decide to make a Hindi OCR software?
Although I was trained in classical Sanskrit, I do a lot of computer-based analysis of Sanskrit literature. If you want to know how often, in which texts and in which contexts a particular Ayurvedic plant is mentioned, you get the fastest and best results if you let the computer do the search. However, you need digital texts to do this. Because no OCR for Indian languages was available and I liked programming as a hobby, I decided to write one myself. Later I expanded the OCR to include Hindi.
Can this be used on live TV? For instance, if I see a sign in English, could my TV convert it into Hindi for me on the fly?
To use OCR in this way, you would have to combine it with translation software. You would need an OCR for English, which recognizes the text, and a program that translates the English text into Hindi.
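As a rough sketch of how the two programs would be chained, the example below passes the output of an English "OCR" straight into a "translator". Both are hypothetical stand-ins written for this illustration: `recognize_english` just returns text that is already attached to the frame, and `translate_to_hindi` looks words up in a two-entry dictionary.

```python
# Sketch of chaining two programs: an English OCR and a translator.
# Both functions are hypothetical stand-ins, not real OCR or translation.

TINY_DICTIONARY = {"exit": "निकास", "school": "विद्यालय"}


def recognize_english(frame):
    """Stand-in for an English OCR: the toy 'frame' already carries its text."""
    return frame["visible_text"]


def translate_to_hindi(text):
    """Stand-in translator: swap known words, leave unknown words unchanged."""
    return " ".join(TINY_DICTIONARY.get(word.lower(), word) for word in text.split())


def subtitle(frame):
    """First recognize the English text, then translate the result."""
    return translate_to_hindi(recognize_english(frame))
```

The point of the design is that the translator never sees the picture, only the text the OCR hands over, so each part can be improved or replaced separately.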
How long did it take you to make and what were the challenges?
I started writing the OCR in 2000 and the first working version was ready three years later. There were two big challenges. First, since it was one of the first OCR programs for Indian languages, I could not build on previous work. Second, I did not have access to any training material for Hindi, so the complete learning process had to be done from scratch.
What is the difference between Optical Character Recognition and On-line Character Recognition?
While OCR programs read printed, scanned or photographed texts, online character recognition is used to recognize your handwritten text in real time, for example on your tablet computer.
Where all is OCR used?
You can use OCR whenever you want to transfer printed text into the computer. There is a huge range of applications. Apart from typical areas such as libraries and administration, this also includes augmented reality, digital text collections and helpful tools for visually impaired people.
Are you still improving the software and why?
Of course, the OCR software has not reached its final state. The existing functions are constantly being improved, and new features are added to make the program as fast and error-free as possible.
What is next in OCR technology?
Today, many OCR programs recognize words letter by letter. The next step is recognizing full words, without splitting them into single letters. This would help avoid errors that occur when single letters are split in the wrong way. Apart from this, OCR technology will certainly be integrated in many more computer programs as a kind of “helper.” Think of translation software or something like Google Street View where every street sign could be read automatically.
What hobbies do you have?
I like to travel, especially to India, and I love Sanskrit literature.
Tell us a bit about your childhood and growing up.
I grew up in Berlin as the only child of my beloved mother. Here, I also got my complete education from grammar school to university. I always liked to read books and play in the garden. My mother used to take me to museums and the planetarium, which I also enjoyed. When I was fourteen, I found a Sanskrit grammar book by chance, in a public library. This is when my fascination for India was born.
What message would you give a kid today?
Learning is fun and opens up new worlds!
If you were an animal, what would you be and why?
I would like to be a bird, to be able to fly and have a good overview.
What’s your favourite kids’ character and why?
Tintin, an investigative journalist who travels the world and has adventures.
For more such science articles and videos, visit Science for Kids.