Gazing At The Future
Welcome back to my humble attempt to re-write the rule book
on teleconferencing software, a journey that will see it dragged from its
complacent little rectangular world. It’s true we've had 3D for years, but we've never been able to communicate accurately and directly in that space.
Thanks to the Gesture Camera, we now have the first in what will be a long line
of high fidelity super accurate perceptual devices. It is a pleasure to develop for this ground
breaking device, and I hope my ramblings will light the way for future travels
and travellers. So now, I will begin my crazy rant.
Latest Progress
You may recall that last week I proposed to turn that green
face blob into a proper head and transmit it across to another device. The good
news is that my 3D face looks a lot better, and the bad news is that getting it
transmitted is going to have to wait. Taking the advice of the judges, I dug out a modern webcam product and realised the value-adds were nothing more than novelties. The market has stagnated, and the march of Skype and Google Talk does nothing more than perpetuate a flat and utilitarian experience.
I did come to appreciate, however, that teleconferencing
cannot be taken lightly. It’s a massive industry and serious users want a
reliable, quality experience that helps them get on with their job. Low
latency, ease of use, backwards compatibility and essential conferencing
features are all required if a new tool is to supplant the old ones.
Voice over IP Technology
I was initially tempted to write my own audio streaming system to carry audio data to the various participants in the conference call, but after careful study of existing solutions and the highly specialised disciplines required, I decided to take the path of least resistance and use an existing open source solution. At first I decided to use the same technology Google Talk uses for audio exchange, but after a few hours of research and light development it turned out a vital API was no longer available for download, mainly because Google had bought the company in question and moved the technology onto HTML5 and JavaScript. As luck would have it, Google partnered with another company it did not buy, Linphone, which provides a great open source solution that is also cross-platform, covering all the major desktop and mobile platforms.
https://www.linphone.org/
Long story short, this new API is right up to date, and my test across two Windows PCs, a Mac and an iPad in four-way audio conferencing mode worked a treat. Next week I shall be breaking down the sample provided to obtain the vital bits of code needed to implement audio and packet exchange between my users. As a bonus, I am going to write it in such a way that existing Linphone client apps can call into my software to join the conference call, so anyone with a regular webcam or even a mobile phone can join in. I will probably stick a large 3D handset in the chair in place of a 3D avatar, just for fun.
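To give a flavour of what the liblinphone side looks like, here is a minimal sketch of dialling into a call using the classic liblinphone C API. This is not lifted from my project; the header path and setup details differ between liblinphone versions, and the SIP address is just a placeholder, so treat it as a starting point rather than working conference code.

// Minimal sketch of dialling into a call with the classic liblinphone C API
// (the 3.x era one). The header path, the setup details and the SIP address
// are all assumptions/placeholders and will differ between versions.
#include <linphone/linphonecore.h>
#include <chrono>
#include <cstdio>
#include <thread>

static void on_call_state(LinphoneCore*, LinphoneCall*,
                          LinphoneCallState state, const char* message)
{
    // Log every state change (ringing, connected, ended, error...)
    std::printf("Call state %d: %s\n", (int)state, message ? message : "");
}

int main()
{
    LinphoneCoreVTable vtable = {};              // unused callbacks stay null
    vtable.call_state_changed = on_call_state;

    // No config files, no user data - just an in-memory core for testing.
    LinphoneCore* lc = linphone_core_new(&vtable, nullptr, nullptr, nullptr);

    // Placeholder address: in practice this would be the conference bridge
    // that the regular Linphone clients and mobile phones also dial into.
    linphone_core_invite(lc, "sip:conference@sip.example.org");

    // liblinphone does not run its own thread; iterate() pumps SIP and audio.
    for (int i = 0; i < 600; ++i)                // roughly 30 seconds of call
    {
        linphone_core_iterate(lc);
        std::this_thread::sleep_for(std::chrono::milliseconds(50));
    }

    linphone_core_terminate_all_calls(lc);
    linphone_core_destroy(lc);
    return 0;
}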
On a related note, I have decided to postpone even thinking
about voice recognition until the surrounding challenges have been conquered.
It never pays to spin too many plates!
Gaze Solved? – Version One
In theory, this should be a relatively simple algorithm.
Find the head, then find the eyes, then grab the RGB around the eyes only.
Locate the pupil at each eye, take the average, and produce a look vector.
Job’s a good one, right? Well, no. At first I decided to run away and find a sample I once saw at an early Beta preview of the Perceptual SDK, which created a vector from face rotation and was pretty neat. Unfortunately that sample was not included in Beta 3, and it was soon apparent why. On exploring the commands for getting ‘landmark’ data, I noticed my nose was missing. More strikingly, all the roll, pitch and yaw values were empty too. Finding this out from the sample saved me the bucketload of time I would have lost had I added the code to my main app first. Phew. I am sure it will be fixed in a future SDK (or
I was doing something silly and it does work), but I can’t afford the time to
write even one email to Intel support (who are great by the way). I needed Gaze
now!
I plumped for option two: write everything myself using only the depth data as my source. I set to work and implemented my first version of the Gaze Algorithm. I have detailed the steps in case you like the idea and want to use it, and there is a rough sketch of the weighting pass after the list:
- Find the furthest depth point from the upper half of the camera depth data
- March left and right to find the points at which the ‘head’ depth data stops
- Now we know the width of the head, trace downwards to find the shoulder
- Once you have a shoulder coordinate, use that to align the Y vector of the head
- You now have a stable X and Y vector for head tracking (and Z of course)
- Scan all the depth between the ears of the face, down to the shoulder height
- Add all depth values together, weighting them as the coordinate moves left/right
- Do the same for top/bottom, weighting them with a vertical multiplier
- You are essentially using the nose and facial features to track the bulk of the head
- Happily, this bulk determines the general gaze direction of the face
- You have to enhance the depth around the nose to get better gaze tracking
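To make the weighting step concrete, here is a rough C++ sketch of that pass. All the names are mine rather than anything from the SDK; it assumes you already have the head bounds from the earlier steps and a raw 16-bit depth frame where smaller values mean closer, so flip the weighting if your camera reports the opposite.

// Rough sketch of the weighting pass from the list above. The head bounds
// (headLeft, headRight, headTop, shoulderY) are assumed to have come from the
// earlier steps, and the names are mine rather than anything from the SDK.
#include <cstdint>

struct GazeVector { float x; float y; };

GazeVector EstimateGaze(const uint16_t* depth, int frameWidth,
                        int headLeft, int headRight,
                        int headTop, int shoulderY)
{
    GazeVector gaze = { 0.0f, 0.0f };

    const float centreX = (headLeft + headRight) * 0.5f;
    const float centreY = (headTop + shoulderY) * 0.5f;
    const float halfW   = (headRight - headLeft) * 0.5f;
    const float halfH   = (shoulderY - headTop) * 0.5f;
    if (halfW <= 0.0f || halfH <= 0.0f) return gaze;   // bad head bounds

    double sumX = 0.0, sumY = 0.0, total = 0.0;

    for (int y = headTop; y < shoulderY; ++y)
    {
        for (int x = headLeft; x < headRight; ++x)
        {
            const uint16_t d = depth[y * frameWidth + x];
            if (d == 0) continue;                      // invalid reading

            // Closer pixels (nose, brow) should dominate, so invert the raw
            // depth into a 'bulk' weight. Assumes smaller value = closer.
            const float bulk = 1.0f / (float)d;

            // Weight each pixel by how far it sits from the head centre, so
            // bulk leaning left/right or up/down pulls the gaze that way.
            sumX  += bulk * ((x - centreX) / halfW);
            sumY  += bulk * ((y - centreY) / halfH);
            total += bulk;
        }
    }

    if (total > 0.0)
    {
        gaze.x = (float)(sumX / total);   // -1 looking left, +1 looking right
        gaze.y = (float)(sumY / total);   // -1 looking up,   +1 looking down
    }
    return gaze;
}

As described above, this only gets you to corner-of-the-screen accuracy; boosting the weight of the pixels around the nose is where the refinement in the last step comes in.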
I have included my entire source code to date for the two DBP commands you saw in the last blog, so you can see how I access the depth and colour data, create 3D constructs and handle the interpretation of the depth information. This current implementation is only good enough to determine which corner of the screen you are looking at, but I feel that with more work it can be refined to provide almost pinpoint-accurate gaze tracking.
Interacting with Documents
One thing I enjoyed when tinkering with the latest version was holding up a piece of paper, maybe with a sketch on it, and shouting ‘scan’ in a firm voice. Nothing happened of course, but I imagined what could happen. We still doodle on paper, or have some article or clipping to hand during a meeting. It would be awesome if you could hold it up, bark a command, and the computer would turn it into a virtual item in the conference. Other attendees could then pick it up (copy it, I guess), and once received they could view it or print it during the call. It would be like fax, but faster! I even thought of tying your tablet into the conference call too, so when a document is shared it instantly goes onto a tablet carousel and everyone who has a tablet can view the media. It could work in reverse too: you could find a website or application, then just wave the tablet in front of the camera; the camera would detect that you are waving your tablet and instantly copy the contents of its screen to the others in the meeting. It was around this time that I switched part of my
brain off so I could finish up and record the video for your viewing pleasure.
Developer Tips
TIP 1: Infra-red gesture cameras and a 6 AM sunrise do not mix very well. As I was gluing Saturday and Sunday together, the sun’s rays blasted
through the window and disintegrated my virtual me. Fortunately a wall helped a
few hours later. For accurate usage of
the gesture camera, ensure you are not bathed in direct sunlight!
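If you want the software to catch this rather than the user, one cheap check (my own idea, nothing from the SDK) is to count how much of the depth frame comes back invalid and warn when the ratio gets silly:

// Small sanity check, my own addition rather than anything from the SDK: if a
// large fraction of the depth frame comes back as zero/invalid, the camera is
// probably being washed out by sunlight (or something equally IR-bright), so
// warn the user instead of feeding garbage into the gaze code.
#include <cstdint>
#include <cstdio>

bool DepthFrameLooksWashedOut(const uint16_t* depth, int width, int height)
{
    int invalid = 0;
    const int total = width * height;
    for (int i = 0; i < total; ++i)
        if (depth[i] == 0) ++invalid;          // 0 = no reading on this camera

    const float ratio = (float)invalid / (float)total;
    if (ratio > 0.5f)                          // threshold is pure guesswork
    {
        std::printf("Depth frame %.0f%% invalid - check for direct sunlight\n",
                    ratio * 100.0f);
        return true;
    }
    return false;
}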
TIP 2: If you think you can smooth out and tame the edges of your depth data, think again. I gave this one about five hours of solid thought and tinkering, and I concluded that you can only get smoothing by substantially trimming the depth shape. As the edges of a shape leap from almost zero to a full depth reading, they are very difficult to filter or accommodate. In order to move on, I moved on, but I have a few more ideas and many more days to crack this one. The current fuzzy edges are not bad as
such, but it is something you might associate with low quality and so I want to
return to this. The fact is the depth data around the edges is very dodgy, and
some serious edge cleaning techniques will need to be employed to overcome this
feature of the hardware.
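Here is a rough C++ sketch of how a trimming pass like the one described could look. The 3x3 window and the jump threshold are my own guesses and would need tuning for the actual camera:

// Rough sketch of a trimming pass: rather than trying to smooth edge pixels
// whose values leap from near zero to a full reading, discard any pixel whose
// 3x3 neighbourhood contains a big jump or an invalid value. The threshold is
// a guess and would need tuning for the actual camera.
#include <cstdint>
#include <cstdlib>
#include <vector>

std::vector<uint16_t> TrimDepthEdges(const std::vector<uint16_t>& depth,
                                     int width, int height,
                                     int jumpThreshold = 200)
{
    std::vector<uint16_t> trimmed(depth.size(), 0);

    for (int y = 1; y < height - 1; ++y)
    {
        for (int x = 1; x < width - 1; ++x)
        {
            const uint16_t centre = depth[y * width + x];
            if (centre == 0) continue;                  // already invalid

            bool onNoisyEdge = false;
            for (int dy = -1; dy <= 1 && !onNoisyEdge; ++dy)
                for (int dx = -1; dx <= 1; ++dx)
                {
                    const uint16_t n = depth[(y + dy) * width + (x + dx)];
                    if (n == 0 || std::abs((int)n - (int)centre) > jumpThreshold)
                    {
                        onNoisyEdge = true;             // sitting on the silhouette
                        break;
                    }
                }

            if (!onNoisyEdge)
                trimmed[y * width + x] = centre;        // keep interior pixels only
        }
    }
    return trimmed;    // a slightly smaller but much cleaner shape
}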
The Code
Last week you had some DBP code. This week, try some C++. Here is the code; it is pretty horrible and unoptimised, but it's all there and you might glean some cut-and-paste value from something that has been proved to compile and work:
Next Time
Now I have the two main components running side by side, the 3D construction and the audio conferencing, next week should be a case of gluing them together in a tidy interface. One of the judges has thrown down the gauntlet that the app should support both Gesture Camera AND Ultrabook, so I am going to pretend the depth camera is ‘built’ into the Ultrabook and treat it as one device. As I am writing the app from scratch, my interface design will make full use of touch when touch makes sense and intuitive use of perception for everything else.
P.S. The judges’ video blog was a great idea and fun to watch! Hope you all had a good time in Barcelona and managed to avoid getting run over by all those meals on wheels.