Image Recognition: Teaching Computers to See
Björn Ommer finds questions in many fields of research. His answers help many, including industry.
Björn Ommer hesitates. No, there is no time of day at which he has never sat at the computer in his office. He is honest, but he wants to avoid the impression that he might as well do his work as a computer scientist from home. After all, it is the contact with other researchers and industrial partners that provides fertile ground for Ommer’s creative progress.
Ommer has just returned from a conference on human vision and its equivalent in machines — computer vision, his area of expertise. He has talked with other computer scientists, with medical scientists and scholars from the humanities. “As usual, I haven’t brought back a single line of programme code”, he says, “but I do have lots of new ideas.”
Since mid-2009, when Ommer, then 28 years old, became a Junior Professor at the Heidelberg Collaboratory for Image Processing (HCI), he has worked with pharmacologists and biologists, with cultural scientists and art historians — always with two central questions in mind: How can we teach computers to see? And: How could this benefit other areas of science? While search engines like Google find images based on written key words, Ommer wants his algorithms to learn the semantics they need to recognise a beach in the image of sand next to an ocean. Machines can process larger quantities of data than humans. And yet even small children are more creative and intelligent than a computer. Ommer points at the stylised drawing of a polar bear and a giant panda. A three-year-old can easily find the eye on both animals — even though the panda's eye is a white dot inside a black oval, while the polar bear's eye consists only of a black dot. "This would be a real challenge for a computer", the scientist says. But why? Because our visual perception is not based on the image information alone, but to a large extent on what we have learned from seeing other images in the past. This prior knowledge is our reference.
Finding a known object in an image — Prince Charles’ ear, for instance, or a specific crown — is easy for today’s computers. The location in the image of a specific type of object — let’s say an airplane — is more difficult to determine. After all, airplanes are not always white, nor are their noses always round; they don’t always fly horizontally, and not all of them have a propeller. The search becomes even more complex if there are no clues as to what exactly we are looking for — and Ommer doesn’t want to manually feed the computer thousands of images. He wants computers to learn the required criteria on their own.
Until now, images were scanned in blocks with a kind of data window — the system looked for known objects and compared edge contours with appropriate libraries. But that is not exact enough for Ommer. “Even the triangle formed by a flock of birds in the sky is too complex for a computer. The form is the result of a combination of many individual parts. But it can’t be measured directly at any point.” Such emergent phenomena are considered to be the greatest challenge for computer vision. Individual parts contain too little, and often inconsistent, information about the big picture — a holistic approach, on the other hand, in which one looks directly at the whole object, does not yield reliable results, because such objects are too complex and change too frequently.
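The block-by-block scan described above can be sketched in a few lines. Everything here is illustrative rather than Ommer's actual method: the window size, stride, threshold, and the `score_fn` patch scorer are all hypothetical stand-ins for a real template or edge-contour matcher.

```python
import numpy as np

def sliding_window_scan(image, score_fn, window=32, stride=16, threshold=0.5):
    """Classic block-wise scan: slide a fixed-size window across the image
    and score every patch independently. score_fn is a hypothetical
    patch scorer standing in for an edge-contour or template matcher."""
    h, w = image.shape[:2]
    detections = []
    for y in range(0, h - window + 1, stride):
        for x in range(0, w - window + 1, stride):
            patch = image[y:y + window, x:x + window]
            score = score_fn(patch)
            if score > threshold:  # illustrative acceptance threshold
                detections.append((x, y, score))
    return detections

# Toy example: a bright 32x32 square in a dark image, scored by mean brightness.
img = np.zeros((64, 64))
img[16:48, 16:48] = 1.0
hits = sliding_window_scan(img, score_fn=lambda p: p.mean())
```

The key weakness the text points out is visible in the code itself: each patch is judged in isolation, so any structure larger than one window — like the triangle formed by a flock of birds — can never be measured at a single position.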
That is why an algorithm programmed by Ommer uses the “compositional regularity” of the world, in which all objects consist of a few universal and basic elements — lines, dots, circles, the vocabulary of image recognition. Then the algorithm determines the position of these elements with respect to each other and groups image parts that match the same object, while simultaneously searching for an “explanation” for the whole object. The computer checks: “If this image element is a mouth, where should the centre of the object be and which other elements should be grouped around it?” Every image element has a voice; the centre is determined democratically. Unlike in other voting procedures, the simultaneous grouping helps the individual parts to agree on a common explanation for the whole object. This means that objects can be identified more clearly and above all more quickly, an important factor if you are searching through large data volumes, e.g. videos instead of photos.
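The "every image element has a voice" idea can be sketched as Hough-style voting: each detected part predicts where the object centre should be, and the centre with the most agreement wins. This is a minimal illustration of the voting principle only, not Ommer's algorithm; the part labels, the offset table, and the bin size are invented for the example.

```python
from collections import Counter

# Hypothetical part-to-centre offsets, as if learned during training:
# a mouth expects the object centre 20 px above it, the eyes 10 px below.
OFFSETS = {"mouth": (0, -20), "left_eye": (5, 10), "right_eye": (-5, 10)}

def vote_for_centre(parts, bin_size=8):
    """Each detected part casts a vote for the object centre it implies;
    votes are binned and the most supported bin wins (Hough-style)."""
    votes = Counter()
    for label, (x, y) in parts:
        dx, dy = OFFSETS[label]
        votes[((x + dx) // bin_size, (y + dy) // bin_size)] += 1
    (bx, by), support = votes.most_common(1)[0]
    # Return the centre of the winning bin and how many parts agreed on it
    return (bx * bin_size + bin_size // 2, by * bin_size + bin_size // 2), support

# Three parts of one face, all consistent with a centre near (50, 50):
parts = [("mouth", (50, 70)), ("left_eye", (45, 40)), ("right_eye", (55, 40))]
centre, support = vote_for_centre(parts)
```

In this toy case all three parts vote for the same bin, so the centre is found with maximum support; a stray part that belongs to another object would vote elsewhere and simply be outvoted, which is what makes the democratic scheme robust.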
What previous research lacked was not more computing power, but a new and creative idea. Ommer breaks new ground because he doesn't cling to the familiar. Without hesitation, he exchanged windsurfing in Berkeley and the mountains near Zurich for sedate bicycle tours in the Odenwald. And he made sure he wasn't swallowed up by his excursions into physics (during his university studies), computational neuroscience (for his PhD thesis) and the psychology of learning (recent research). Today, he uses these experiences like additional synapses for creative research. He could never work in a place that doesn't allow him this kind of freedom. "If applications were all I wanted to think about, I'd be better off at Google", says Ommer, "but I want to be able to turn things on their head and inside out once in a while." Ommer's work takes him wherever answers are needed. In April, he invited colleagues from all over the world to a workshop on "unsolved problems of pattern recognition". He is fascinated by medical applications in the same way as by a football rolling across the street: "Can I help a car to recognise that ball and react accordingly?"
The questions are often more exciting than the answers in terms of specific applications. It took ten years from the time algorithms for face recognition were first published until cameras actually used them. "From a scientific point of view, things had nearly come to a standstill", says Ommer. With the help of the Heidelberg "Industry on Campus" programme, he wants to shorten this "dead" phase in which results exist but are not applied, while still maximising the practical benefits of his research. Industrial partners like Bosch are already using computer vision, e.g. for quality control in the production of tools.
Even as a child, Ommer was interested in the natural sciences. His father was a physics teacher, and Ommer wrote his first programme code as a nine-year-old. He skipped eleventh grade. Today, the young Junior Professor often works with researchers from other disciplines. He wants to develop new approaches in computer science to help colleagues advance their research while solving problems in his own discipline. "By now, many people have realised that together we can overcome disciplinary boundaries."
“With Björn, we’ve really struck gold”, says biologist Thomas Kuner. The two scientists worked together to analyse healthy and damaged nerve ends in mouse paws, charting the effect of pain and healing in 3D. Kuner had previously worked with mathematicians, and nearly despaired in the face of communication problems: “When they started talking about diffusion equations, they lost us, and when we talked about synapses, they stopped listening”, he says. “Björn was different; he really delved into our field of research. We brought him a biologically relevant question. He helped us come up with a unique analysis — that’s no standard solution.”
Unlike scientists in bionics, Björn Ommer doesn’t want to imitate complex natural processes. “But looking at how Nature works helps me find out if my solution is plausible and elegant.” This way of working is often based on nothing but intuition. Ommer finds the holistic image recognition process with scanning windows implausible: “If you were looking at an image you would never start at the upper left corner and then proceed block by block. What if there was a lion waiting on the lower right? You’d be eaten”, says Ommer and laughs. “An algorithm that doesn’t take that into account just doesn’t feel right.”
Prof. Dr. Björn Ommer
Björn Ommer studied computer science and physics at the University of Bonn and earned his PhD in computer science in 2007 from the Swiss Federal Institute of Technology (ETH). After a first position as a postdoctoral researcher in Zurich, he transferred to a similar position at the University of California, Berkeley. There he continued his work on computer vision. Teaching computers to see has remained the focus of his research after his move to Heidelberg in 2009, where he is now a Junior Professor of Computer Science at the Heidelberg Collaboratory for Image Processing (HCI) and heads a research group of seven scientists. His work centres on the question of how objects and actions can be automatically recognised in static and moving images — his findings are applied in interdisciplinary projects with Heidelberg cultural and biomedical scientists.