Computer vision encompasses several problems related, obviously, to making computers see. Object recognition focuses on making computers recognize the objects in a scene (car, person, building, flower, and so on). Image registration tries to match (or register) two images into a common frame of reference, allowing generation of panoramas, for example. Image segmentation tries to form neighboring groups of pixels (segments) that each represent one homogeneous object. Activity recognition tries to find what is happening in a video sequence.
What can be done with computer vision technology today? There are several things that we are now fairly good at.
1) Image or video registration and panorama generation: A bunch of pictures (or a bunch of video frames) of the same scene can be stitched together using image registration technology in a very reliable and robust manner. Thanks largely to SIFT and SIFT-like local features (see my previous post) along with robust estimation algorithms, there are several commercial products available based on this technology. Want to stitch two or more images very accurately to form a mosaic? Try Mayachitra AIPR. Want a quick and dirty panorama on your iPhone? Try 360 Panorama. Of course, there are many in-betweens, but you can be assured that this technology is maturing.
2) Image recognition via matching: Recognizing images by matching is a comparatively simple technique that leverages the basic ideas from above (image registration). Have you been awed by the recognition capabilities of Google Goggles, kooaba Visual Search, or oMoby? Loosely speaking, matching is what they do. It is easier to recognize a book cover or a landmark by matching (or registering) it against a picture of the same item in a database. This is how a book cover, the Golden Gate Bridge, a wine label, or a famous painting can be recognized: (again loosely speaking) there are pictures of the same item already in their database. Since matching works reliably using image registration-like technology, recognition is reliable and robust for items/objects that satisfy two conditions: (i) their pictures are available to pre-ingest, and (ii) their geometry is uniquely defined (e.g., different chairs will have different geometry, but the Tim Ferriss book will always look the same). Book/CD/DVD covers, wine labels, famous artwork, and major landmarks all fall in this category. And this is how the apps you love work.
3) Face detection and recognition in a constrained setup: Through the sheer amount of effort that has been put into face detection and recognition over the past couple of decades, we have made good progress on this inherently difficult problem. If you have used Picasa or iPhoto, you have seen how the computer can find faces in your photo collection and ask for your approval. By using key descriptions of your eyes, nose, mouth, and facial structure (geometry), the algorithms can recognize faces from a limited collection. The bottom line today is that finding faces in your own photo collection is reliable (e.g., 40-100 faces in a collection of tens of thousands of pictures), but finding and recognizing faces from ALL of Facebook is still very hard. We sure are making progress.
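To make the "match, then register" idea from items 1 and 2 a little more concrete, here is a toy sketch in plain Python. It is an illustration under simplifying assumptions, not how any of the products named above are actually implemented: the keypoints and their three-number descriptors are made up (real SIFT descriptors have 128 dimensions), the function names are hypothetical, and the geometric model is a pure translation, whereas real stitching typically estimates a richer transform such as a homography. The first function keeps only descriptor matches that are clearly better than the runner-up (a ratio test in the spirit of the one commonly used with SIFT), and the second uses a RANSAC-style loop as the robust estimation step to discard matches that disagree with the dominant shift.

```python
import math
import random

def match_keypoints(query, reference, ratio=0.8):
    """Pair each query keypoint with its nearest reference keypoint.

    Each keypoint is ((x, y), descriptor). A pair is kept only when the
    nearest descriptor is clearly closer than the second nearest.
    Returns a list of (reference_position, query_position) pairs.
    """
    pairs = []
    for q_pos, q_desc in query:
        ranked = sorted(reference, key=lambda kp: math.dist(q_desc, kp[1]))
        d1 = math.dist(q_desc, ranked[0][1])
        d2 = math.dist(q_desc, ranked[1][1])
        if d1 < ratio * d2:  # ambiguous matches are dropped
            pairs.append((ranked[0][0], q_pos))
    return pairs

def ransac_translation(pairs, iterations=100, tolerance=2.0, seed=0):
    """Estimate the shift from reference to query, ignoring bad matches."""
    rng = random.Random(seed)
    best = []
    for _ in range(iterations):
        # A translation has two unknowns, so one pair is a minimal sample.
        (x1, y1), (x2, y2) = rng.choice(pairs)
        dx, dy = x2 - x1, y2 - y1
        inliers = [(a, b) for a, b in pairs
                   if abs(b[0] - a[0] - dx) <= tolerance
                   and abs(b[1] - a[1] - dy) <= tolerance]
        if len(inliers) > len(best):
            best = inliers
    # Refit on the inliers only by averaging their offsets.
    n = len(best)
    dx = sum(b[0] - a[0] for a, b in best) / n
    dy = sum(b[1] - a[1] for a, b in best) / n
    return dx, dy, n

# Made-up "database" picture: five keypoints with distinctive descriptors.
reference = [
    ((10, 10), (1.0, 0.0, 0.0)),
    ((50, 20), (0.0, 1.0, 0.0)),
    ((30, 60), (0.0, 0.0, 1.0)),
    ((80, 80), (1.0, 1.0, 0.0)),
    ((20, 90), (0.0, 1.0, 1.0)),
]
# Made-up snapshot of the same item, shifted by (30, -5).
query = [
    ((40, 5), (0.95, 0.05, 0.0)),
    ((80, 15), (0.05, 0.95, 0.0)),
    ((60, 55), (0.0, 0.05, 0.95)),
    ((110, 75), (0.95, 1.0, 0.05)),
    ((200, 200), (0.0, 0.95, 1.0)),  # right descriptor, wrong place
    ((5, 5), (0.5, 0.5, 0.0)),       # ambiguous descriptor
]

pairs = match_keypoints(query, reference)
dx, dy, inlier_count = ransac_translation(pairs)
print(dx, dy, inlier_count)
```

In this toy setup the ambiguous keypoint is dropped by the ratio test and the misplaced one is rejected by the robust fit. The number of surviving inliers can then serve as a rough recognition score: many geometrically consistent matches suggest the snapshot shows the database item, while few suggest it does not.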
This concludes my trilogy of posts on the possibilities with computer vision technology today. Watch out for more technology-related posts and leave comments on what you’d like to hear about.
Mobile computer vision is set to touch our lives in a tangible way. To continue the parallel, the big bang has started and the universe is expanding fast enough for us to experience the magic. There are three primary factors, in my view, that have contributed to the recent advance.
1) Powerful local descriptors: The early 2000s marked an exciting development in the field of image recognition that has now touched every aspect of computer vision. The publication of SIFT descriptors will certainly go down in the history of computer vision in the same light as that of Turbo codes in digital communications. Using SIFT or a SIFT-like framework, engineers can now robustly and accurately describe and match local regions in an image or video. A big leap forward, thanks to David Lowe and his team!
2) Machine learning: Although artificial intelligence with its rule-based deduction did not deliver on its 1980s promise of solving all our problems, a related discipline, machine learning, has come to our rescue. Not depending on hand-coded rules, and instead letting machines learn by looking at many examples via solving large optimization problems, turns out to be the way to go. And we are figuring this out now!
3) Faster machines: Computer vision and image analysis problems are among the few that are always hungry for computing resources. Any computer vision researcher will tell you that having a powerful desktop is better than a laptop, having a cluster is better than a desktop, and having a cloud with thousands of computers is like having a vacation home on the moon. Needless to say, Moore’s law has helped.
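As a small illustration of the machine learning point above (learning from examples by solving an optimization problem instead of hand-coding rules), here is a minimal sketch of logistic regression trained by gradient descent, in plain Python. Everything in it is an assumption made for illustration: the two features per example, the labels, the function names, and the training settings are invented, and real vision systems use far larger models and datasets.

```python
import math

def train_logistic(examples, labels, learning_rate=0.5, epochs=2000):
    """Fit weights and a bias by gradient descent on the logistic loss."""
    n_features = len(examples[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(examples, labels):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of class 1
            error = p - y                   # gradient of the loss w.r.t. z
            for i, xi in enumerate(x):
                grad_w[i] += error * xi
            grad_b += error
        # Step against the average gradient over all examples.
        weights = [w - learning_rate * g / len(examples)
                   for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(examples)
    return weights, bias

def predict(weights, bias, x):
    """Return 1 if the example falls on the positive side of the boundary."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if z > 0 else 0

# Made-up training set: two hypothetical image features per example.
examples = [(0.1, 0.2), (0.2, 0.1), (0.3, 0.3), (0.2, 0.4),
            (0.8, 0.9), (0.9, 0.7), (0.7, 0.8), (0.6, 0.9)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

weights, bias = train_logistic(examples, labels)
print(predict(weights, bias, (0.15, 0.25)), predict(weights, bias, (0.85, 0.8)))
```

No rule such as "feature one must exceed 0.5" appears anywhere in the code. The decision boundary comes entirely from the examples, which is the shift in approach described in item 2.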
This is just the beginning. There are many unsolved problems and exciting challenges. This post gives the perspective, and I’ll defer the question of what can and can’t be done today to my next post.
Have you seen the latest episode of The Big Bang Theory? Despite its name (or due to its name), The Bus Pants Utilization was an amazingly entertaining episode. What caught my attention and had me writing this post was the fact that it was about smartphone applications! And more so because both the apps mentioned in the episode are –hold your breath– about mobile computer vision! Whoa! Interesting.
Leonard, Sheldon, Howard, and Rajesh are sitting in the cafe and Leonard comes up with an interesting app idea (that clicked with my nerdy mind, for sure): Create an app that can read in a mathematical equation by snapping a picture of its handwritten form. Then use handwriting recognition followed by symbolic mathematics tools to solve the equation, let the user plug in variables, and in general, play around with it. I wonder, what are the chances that some real geek is already working on this?
Then, later in the episode, Penny comes up with another idea that appeals to her (and likely to lots of girls): shopping for shoes by snapping a picture of someone wearing them. Here is another challenging computer vision problem that anyone in the field has surely heard of and thought about before.
This stirred mixed feelings in me, a long-time proponent and “well-wisher” of computer vision (CV) technology for mainstream consumers. At first I thought, “This is awesome. Mainstream media is making this field hip. People like what computer vision can do!” But the next moment I thought, “But… people expect these algorithms to work perfectly out of the box.” Other computer vision researchers and I may have to go hide ourselves somewhere, because the field is still evolving and things work only in a cleverly constrained setup. Problems such as finding shoes for Penny are still difficult…
We are seeing some very interesting CV-based applications that have come out in recent times (Google Goggles, kooaba, SnapTell, and now the Amazon.com app, thanks to A9). You would be amazed how these apps (with some differences among them) can effortlessly recognize the cover of a book at any angle you snap it, or a CD/DVD cover, billboard ads, print ads/pages, and in the case of Goggles, art, paintings, and so on.
If you are not in the CV field, you are probably wondering what works and what doesn’t, right? Why can kooaba recognize book covers, but not shoes? Why would like.com allow pattern and color search for shoes, but still not recognize the exact shoe you like (the one you saw someone wearing on the street)? Why can SnapTell recognize a shoe ad on a billboard, but not a shoe on the street? For the answers, you’ll have to wait for my upcoming posts, where I’ll cover what works and what doesn’t today in computer vision.
There is no denying that there are exciting times ahead for mobile computer vision! This is a new dawn for computer vision, a field that has remained out of the consumer limelight for two decades now. Watch out for apps that will take your breath away!