Computer vision covers several problems, all related to making computers see. Object recognition focuses on making computers recognize the objects in a scene (car, person, building, flower, and so on). Image registration tries to align (or register) two images in a common frame of reference, which makes it possible to generate panoramas, for example. Image segmentation tries to group neighboring pixels into segments that each represent one homogeneous object. Activity recognition tries to determine what is happening in a video sequence.

What can be done with computer vision technology today? There are several things that we are now fairly good at.

1) Image or video registration and panorama generation: A set of pictures (or video frames) of the same scene can be stitched together very reliably and robustly using image registration technology. Thanks largely to SIFT and SIFT-like local features (see my previous post), combined with robust estimation algorithms, there are several commercial products based on this technology. Want to stitch two or more images very accurately to form a mosaic? Try Mayachitra AIPR. Want a quick and dirty panorama on your iPhone? Try 360 Panorama. Of course, there are many options in between, but you can be assured that this technology is maturing.
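To make the "robust estimation" part of item 1 concrete, here is a minimal sketch of a RANSAC-style loop. It assumes feature matching has already produced point pairs (some of them wrong) and that the two images differ only by a shift. Real stitchers estimate a richer transform such as a homography, and the function name and parameters below are my own, not taken from any product mentioned above.

```python
import random


def ransac_translation(matches, threshold=2.0, iterations=200, seed=0):
    """Estimate a 2D shift (dx, dy) from point matches that include outliers.

    matches: non-empty list of ((x1, y1), (x2, y2)) pairs.
    Returns ((dx, dy), number_of_inliers).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    best_inliers = []
    for _ in range(iterations):
        # Hypothesize a shift from a single randomly chosen match.
        (x1, y1), (x2, y2) = rng.choice(matches)
        dx, dy = x2 - x1, y2 - y1
        # Keep every match that agrees with that shift within the threshold.
        inliers = [
            (a, b) for a, b in matches
            if abs(b[0] - a[0] - dx) <= threshold
            and abs(b[1] - a[1] - dy) <= threshold
        ]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    # Refit on the largest consensus set by averaging its shifts.
    n = len(best_inliers)
    dx = sum(b[0] - a[0] for a, b in best_inliers) / n
    dy = sum(b[1] - a[1] for a, b in best_inliers) / n
    return (dx, dy), n


# Six correct matches (shift of 100, 5) plus two bad ones.
points = [(0, 0), (10, 20), (30, 5), (50, 50), (70, 10), (15, 60)]
matches = [((x, y), (x + 100, y + 5)) for x, y in points]
matches += [((5, 5), (300, 200)), ((40, 40), (0, 0))]
shift, inlier_count = ransac_translation(matches)
```

The point of the loop is that the two bad matches never gather more than one vote each, so they cannot pull the final estimate away from the true shift.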

2) Image recognition via matching: Recognizing images by matching is a comparatively simple technique that builds on the basic ideas from image registration above. Have you been awed by the recognition capabilities of Google Goggles, kooaba Visual Search, or oMoby? Loosely speaking, matching is what they do. It is easier to recognize a book cover or a landmark by matching (or registering) it against a picture of the same item in a database. This is how a book cover, the Golden Gate Bridge, a wine label, or a famous painting can be recognized: (again loosely speaking) there are pictures of the same item already in their database. Since matching works reliably using image registration-like technology, recognition is reliable and robust for items that satisfy two conditions: (i) their pictures are available to ingest in advance, and (ii) their geometry is uniquely defined (e.g., different chairs will have different geometry, but the Tim Ferriss book will always look the same). Book/CD/DVD covers, wine labels, famous artwork, and major landmarks all fall into this category. And this is how the apps you love work.
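As a rough illustration of item 2, the sketch below recognizes a query image by letting each of its local descriptors vote for the database item it matches best. It uses a nearest-neighbor ratio test, a common companion to SIFT-style features, and it is only my assumption that the apps above do something comparable. The tiny 2D "descriptors", the labels, and the function name are made up for the example; real descriptors have many more dimensions.

```python
import math
from collections import Counter


def recognize(query, database, ratio=0.75, min_votes=2):
    """Return the database label that best matches the query, or None.

    query: list of descriptor tuples from the photo to recognize.
    database: dict mapping label -> list of descriptor tuples.
    """
    votes = Counter()
    for q in query:
        # Distance from this query descriptor to every stored descriptor.
        scored = sorted(
            (math.dist(q, d), label)
            for label, descriptors in database.items()
            for d in descriptors
        )
        if len(scored) < 2:
            continue
        (best, label), (second, _) = scored[0], scored[1]
        # Ratio test: accept only clearly unambiguous matches.
        if best < ratio * second:
            votes[label] += 1
    if not votes:
        return None
    label, count = votes.most_common(1)[0]
    return label if count >= min_votes else None


database = {
    "book_cover": [(0, 0), (10, 0), (0, 10)],
    "wine_label": [(50, 50), (60, 50), (50, 60)],
}
known = recognize([(0.5, 0.2), (9.8, 0.1), (49.5, 50.5)], database)
unknown = recognize([(25, 25), (30, 30)], database)
```

Here `known` comes back as `"book_cover"` and `unknown` as `None`, which mirrors condition (i) in the text: an item can only be recognized if its picture was ingested beforehand.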

3) Face detection and recognition in a constrained setup: Thanks to the sheer amount of effort that has been put into face detection and recognition over the past couple of decades, we have made good progress on this inherently difficult problem. If you have used Picasa or iPhoto, you have seen how the computer can find faces in your photo collection and ask for your approval. By using key descriptions of your eyes, nose, mouth, and facial structure (geometry), the algorithms can recognize faces from a limited collection. The bottom line today is that finding faces in your own photo collection is reliable (e.g., 40-100 faces in a collection of tens of thousands of pictures), but finding and recognizing faces across ALL of Facebook is still very hard. We sure are making progress.
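The "suggest a name and ask for approval" behavior in item 3 can be sketched as a nearest-neighbor lookup with a rejection threshold. This is a toy under stated assumptions: each face is already reduced to a short vector of geometric measurements, and the names, numbers, and `accept` threshold are hypothetical, not how Picasa or iPhoto actually work.

```python
import math


def suggest_identity(face, known_faces, accept=0.1):
    """Suggest a name for a face, or None if nobody known is close enough.

    face: tuple of geometric measurements for the new face.
    known_faces: dict mapping name -> list of measurement tuples.
    Returns (name_or_None, distance_to_closest_known_face).
    """
    best_name, best_dist = None, math.inf
    for name, examples in known_faces.items():
        for example in examples:
            d = math.dist(face, example)
            if d < best_dist:
                best_name, best_dist = name, d
    # Only suggest a name when the closest face is within the threshold;
    # the user would still be asked to confirm the suggestion.
    if best_dist <= accept:
        return best_name, best_dist
    return None, best_dist


known_faces = {
    "alice": [(0.30, 0.45, 0.62)],
    "bob": [(0.55, 0.40, 0.70)],
}
suggestion, _ = suggest_identity((0.31, 0.44, 0.63), known_faces)
stranger, _ = suggest_identity((0.90, 0.90, 0.10), known_faces)
```

With a few dozen people in `known_faces`, a threshold like this separates them comfortably. With millions of people, many faces crowd into the same neighborhood, which is one intuition for why the Facebook-scale problem is so much harder.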

This concludes my trilogy of posts on the possibilities with computer vision technology today. Watch out for more technology-related posts, and leave comments on what you’d like to hear about.

 
