Gazing At The Future
Welcome back to my humble attempt to rewrite the rule book on teleconferencing software, a journey that will see it dragged from its complacent little rectangular world. It’s true we've had 3D for years, but we've never been able to communicate accurately and directly in that space. Thanks to the Gesture Camera, we now have the first in what will be a long line of high-fidelity, super-accurate perceptual devices. It is a pleasure to develop for this ground-breaking device, and I hope my ramblings will light the way for future travels and travellers. So now, I will begin my crazy rant.
You may recall that last week I proposed to turn that green face blob into a proper head and transmit it across to another device. The good news is that my 3D face looks a lot better; the bad news is that getting it transmitted is going to have to wait. Taking the judges' advice, I dug out a modern webcam product and realised its value-adds were nothing more than novelties. The market has stagnated, and the march of Skype and Google Talk does nothing more than perpetuate a flat and utilitarian experience.
I did come to appreciate however that teleconferencing cannot be taken lightly. It’s a massive industry and serious users want a reliable, quality experience that helps them get on with their job. Low latency, ease of use, backwards compatibility and essential conferencing features are all required if a new tool is to supplant the old ones.
Voice over IP Technology
Long story short, this new API is right up to date, and my test across two Windows PCs, a Mac and an iPad in four-way audio conferencing mode worked a treat. Next week I shall be breaking down the sample provided to obtain the vital bits of code needed to implement audio and packet exchange between my users. As a bonus, I am going to write it in such a way that existing Linphone client apps can call into my software to join the conference call, so anyone with a regular webcam or even a mobile phone can join in. I will probably stick a large 3D handset in the chair in place of a 3D avatar, just for fun.
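To make the packet-exchange idea concrete, here is a rough sketch of the sort of framing a conference audio packet could use: a tiny header followed by the encoded audio bytes. The layout, field names and sizes are all invented for illustration; real Linphone clients will of course speak SIP/RTP rather than this home-brew format.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical wire framing for app-internal audio packets: a small
// header (sequence number, sender id, payload length) followed by the
// encoded audio bytes. Illustrative only, not a real Linphone or RTP
// structure.
struct AudioPacket {
    uint32_t sequence;            // monotonically increasing, detects loss/reordering
    uint16_t senderId;            // which conference participant sent this frame
    std::vector<uint8_t> payload; // encoded audio frame
};

// Flatten a packet into a byte buffer for the network layer.
std::vector<uint8_t> packPacket(const AudioPacket& p)
{
    std::vector<uint8_t> out(8 + p.payload.size());
    uint16_t len = static_cast<uint16_t>(p.payload.size());
    std::memcpy(&out[0], &p.sequence, 4);
    std::memcpy(&out[4], &p.senderId, 2);
    std::memcpy(&out[6], &len, 2);
    if (len > 0)
        std::memcpy(&out[8], p.payload.data(), len);
    return out;
}

// Rebuild a packet from a received buffer (no validation - sketch only).
AudioPacket unpackPacket(const std::vector<uint8_t>& buf)
{
    AudioPacket p;
    uint16_t len = 0;
    std::memcpy(&p.sequence, &buf[0], 4);
    std::memcpy(&p.senderId, &buf[4], 2);
    std::memcpy(&len, &buf[6], 2);
    p.payload.assign(buf.begin() + 8, buf.begin() + 8 + len);
    return p;
}
```

The fixed little-endian-ish memcpy layout keeps it trivially symmetric: whatever packPacket writes, unpackPacket reads straight back.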
On a related note, I have decided to postpone even thinking about voice recognition until the surrounding challenges have been conquered. It never pays to spin too many plates!
Gaze Solved? – Version One
In theory, this should be a relatively simple algorithm. Find the head, then find the eyes, then grab the RGB around the eyes only. Locate the pupil at each eye, take the average, and produce a look vector. Job’s a good one, right? Well, no. At first I decided to run away and find a sample I once saw at an early Beta preview of the Perceptual SDK, which created a vector from face rotation and was pretty neat. Unfortunately that sample was not included in Beta 3, and it soon became apparent why. On exploring the commands for getting ‘landmark’ data, I noticed my nose was missing. More strikingly, all the roll, pitch and yaw values were empty too. Finding this out from the sample saved me a bucket load of time I would have lost had I added the code to my main app first. Phew. I am sure it will be fixed in a future SDK (or I was doing something silly and it does work), but I can’t afford the time to write even one email to Intel support (who are great, by the way). I needed gaze now!
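For what it's worth, the 'pupil average' idea from the top of this section boils down to a few lines of vector maths. This is only a sketch of the arithmetic with made-up names; actually detecting the pupils in the RGB crop is the genuinely hard part.

```cpp
#include <cmath>

struct Vec2 { float x, y; };

// Given the pupil centre and the eye-socket centre for each eye (in image
// coordinates), average the two pupil offsets and normalise into a rough
// 2D look direction. Names and inputs are hypothetical; the detection of
// these points is not shown here.
Vec2 lookVector(Vec2 leftPupil, Vec2 leftEye, Vec2 rightPupil, Vec2 rightEye)
{
    // Offset of each pupil from the centre of its eye socket.
    Vec2 l{leftPupil.x - leftEye.x, leftPupil.y - leftEye.y};
    Vec2 r{rightPupil.x - rightEye.x, rightPupil.y - rightEye.y};
    // Average the two offsets so noise in one eye is damped by the other.
    Vec2 avg{(l.x + r.x) * 0.5f, (l.y + r.y) * 0.5f};
    float mag = std::sqrt(avg.x * avg.x + avg.y * avg.y);
    if (mag < 1e-6f) return {0.0f, 0.0f}; // pupils dead centre: looking straight ahead
    return {avg.x / mag, avg.y / mag};
}
```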
I plumped for option two: write everything myself using only the depth data as my source. I set to work and implemented my first version of the Gaze Algorithm. I have detailed the steps in case you like it and want to use it:
- Find the furthest depth point from the upper half of the camera depth data
- March left and right to find the points at which the ‘head’ depth data stops
- Now we know the width of the head, trace downwards to find the shoulder
- Once you have a shoulder coordinate, use that to align the Y vector of the head
- You now have a stable X and Y vector for head tracking (and Z of course)
- Scan all the depth between the ears of the face, down to the shoulder height
- Add all depth values together, weighting them as the coordinate moves left/right
- Do the same for top/bottom, weighting them with a vertical multiplier
- You are essentially using the nose and facial features to track the bulk of the head
- Happily, this bulk determines the general gaze direction of the face
- You have to enhance the depth around the nose to get better gaze tracking
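The weighting steps above amount to computing a depth-weighted centroid over the face region. Here is a stripped-down sketch of that idea; the region bounds, the depth convention (larger value = nearer, so the nose tip dominates) and the -1..1 output range are assumptions for illustration, not the exact code from my app.

```cpp
#include <vector>

struct Gaze { float x, y; }; // -1..1 in each axis; (0,0) = facing the camera

// Depth-weighted centroid over the face region (between the ears, down to
// shoulder height). Each sample is weighted by its offset from the region
// centre, so the bulk of the head - nose and facial features - pulls the
// result toward the side the head is turned.
Gaze estimateGaze(const std::vector<float>& depth, int width, int height)
{
    float sumW = 0.0f, sumX = 0.0f, sumY = 0.0f;
    float cx = (width - 1) * 0.5f, cy = (height - 1) * 0.5f;
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float d = depth[y * width + x]; // assumed: larger = nearer
            // Weight each sample by its horizontal/vertical offset from centre.
            sumX += d * ((x - cx) / cx);
            sumY += d * ((y - cy) / cy);
            sumW += d;
        }
    }
    if (sumW <= 0.0f) return {0.0f, 0.0f}; // no valid depth in the region
    return {sumX / sumW, sumY / sumW};
}
```

With a uniform depth patch this returns (0,0); push a strong 'nose' reading off to one side and the result leans that way, which is exactly the corner-of-screen resolution the current version achieves.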
I have included my entire source code to date for the two DBP commands you saw in the last blog so you can see how I access the depth and colour data, create 3D constructs and handle the interpretation of the depth information. This current implementation is only good enough to determine which corner of the screen you are looking at, but I feel with more work this can be refined to provide almost pinpoint accurate gazing.
Interacting with Documents
One thing I enjoyed when tinkering with the latest version was holding up a piece of paper, maybe with a sketch on it, and shouting ‘scan’ in a firm voice. Nothing happened of course, but I imagined what could happen. We still doodle on paper, or have some article or clipping to hand during a meeting. It would be awesome if you could hold it up, bark a command, and the computer would turn it into a virtual item in the conference. Other attendees could then pick it up (copy it, I guess), and once received could view it or print it during the call. It would be like fax, but faster! I even thought of tying your tablet into the conference call too, so when a document is shared, it instantly goes onto a tablet carousel and everyone who has a tablet can view the media. It could work in reverse too: you could find a website or application, then just wave the tablet in front of the camera; the camera would detect you waving your tablet and instantly copy the contents of its screen to the others in the meeting. It was around this time I switched part of my brain off so I could finish up and record the video for your viewing pleasure.
TIP 1 : Infra-red gesture cameras and a 6 AM sunrise do not mix very well. As I was gluing Saturday and Sunday together, the sun’s rays blasted through the window and disintegrated the virtual me. Fortunately a wall helped out a few hours later. For accurate usage of the gesture camera, ensure you are not bathed in direct sunlight!
TIP 2 : If you think you can smooth out and tame the edges of your depth data, think again. I gave this one about five hours of solid thought and tinkering, and I concluded that you can only get smoothing by substantially trimming the depth shape. As the edges of a shape leap from almost zero to a full depth reading, they are very difficult to filter or accommodate. In order to move on, I moved on, but I have a few more ideas and many more days to crack this one. The current fuzzy edges are not bad as such, but they are something you might associate with low quality, so I want to return to this. The fact is the depth data around the edges is very dodgy, and some serious edge-cleaning techniques will need to be employed to overcome this feature of the hardware.
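One of those 'few more ideas' is the trimming approach mentioned above: rather than filtering the wild edge values, erode the valid-depth mask so the noisy rim is discarded outright. A minimal sketch, assuming a value of 0 marks a missing depth reading:

```cpp
#include <vector>

// Trim the silhouette by one pixel per pass: a pixel survives only if all
// four of its neighbours also have a valid (non-zero) depth reading. This
// sacrifices shape to get rid of the dodgy edge samples.
std::vector<int> erodeDepth(const std::vector<int>& depth, int width, int height)
{
    std::vector<int> out(depth.size(), 0); // border pixels are always trimmed
    for (int y = 1; y < height - 1; ++y) {
        for (int x = 1; x < width - 1; ++x) {
            int i = y * width + x;
            // Keep the pixel only if every neighbour has valid depth.
            if (depth[i] && depth[i - 1] && depth[i + 1] &&
                depth[i - width] && depth[i + width])
                out[i] = depth[i];
        }
    }
    return out;
}
```

Run it once for a one-pixel trim, or repeatedly until the rim noise is gone; the trade-off is exactly the "substantially trimming the depth shape" problem described above.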
Last week you had some DBP code. This week, try some C++. Here is the code: it's pretty horrible and unoptimised, but it's all there and you might glean some cut-and-paste usage from something that's been proved to compile and work:
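The full listing is too long to walk through line by line, but the heart of the 3D construction step, turning each depth sample into a camera-space vertex, is a simple pinhole back-projection. A sketch of that step alone, with placeholder intrinsics rather than the camera's real calibration values:

```cpp
struct Vec3 { float x, y, z; };

// Back-project one depth pixel into a camera-space 3D point using a plain
// pinhole model. fx/fy (focal lengths in pixels) and cx/cy (principal
// point) are placeholders here; the real values come from the camera's
// calibration.
Vec3 depthToPoint(int px, int py, float depthMm,
                  float fx, float fy, float cx, float cy)
{
    float z = depthMm * 0.001f;   // millimetres to metres
    float x = (px - cx) * z / fx; // horizontal displacement from the optical axis
    float y = (cy - py) * z / fy; // flip Y so up is positive in 3D space
    return {x, y, z};
}
```

Feed every depth pixel through this and you have the point cloud that the green face mesh is built from.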
Now that I have the two main components running side by side, the 3D construction and the audio conferencing, next week should be a case of gluing them together in a tidy interface. One of the judges has thrown down the gauntlet that the app should support both the Gesture Camera AND the Ultrabook, so I am going to pretend the depth camera is ‘built into’ the Ultrabook and treat the two as one device. As I am writing the app from scratch, my interface design will make full use of touch when touch makes sense and intuitive use of perception for everything else.
P.S. The judges’ video blog was a great idea and fun to watch! Hope you all had a good time in Barcelona and managed to avoid getting run over by all those meals on wheels.