Driving using nothing more than a single camera: that’s the promise of a system being developed at the University of Cambridge that mimics the way we see the world around us.
“When you drive a car, you’re able to identify hazards and estimate what's going to happen in the next few seconds – all from your visual senses. We’re trying to replicate or even improve upon that,” explains PhD student Alex Kendall, who works in the Machine Intelligence Laboratory at the University of Cambridge.
The system, called SegNet, operates in real time on images with resolutions of just 480 x 360 pixels, so the camera itself doesn’t need to be all that good. You can try it for yourself by uploading one of your own pictures to the web demo and watching SegNet identify its component parts.
So how does it work? SegNet recognises the contents of images using machine learning – in particular, a technique known as deep learning. This involves understanding complex concepts by starting with basic ones.
Between the input and output, the data progresses through a hierarchy of ‘layers’ as the system builds up its understanding. The first layer would ‘see’ only basic shapes; a later layer would recognise particular features, such as the parts of a car. Eventually, says Alex, the system would work its way up to comprehending the entire scene.
“First, it would pull out simple edges and corners in the image, and it might transpose those into a car tyre, a bonnet or a door. Then it might grab those features and, as it goes up the hierarchy, it might form an idea that there's a car there,” he explains.
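In rough code, that first rung of the hierarchy amounts to convolving the image with small filters. Here's a minimal NumPy sketch; the vertical-edge filter is hand-written for illustration, whereas a trained network like SegNet learns its filters from data:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution of a single-channel image with a small kernel."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

# A vertical-edge filter, similar in spirit to what a trained first layer learns.
edge_kernel = np.array([[1., 0., -1.],
                        [2., 0., -2.],
                        [1., 0., -1.]])

# Toy image: dark on the left, bright on the right -> one vertical edge.
image = np.zeros((8, 8))
image[:, 4:] = 1.0

response = conv2d(image, edge_kernel)
# The filter responds only in the columns where brightness changes,
# i.e. it has "pulled out" the edge.
```

Later layers do the same thing again, but on the responses of earlier layers rather than on raw pixels, which is how simple edges get combined into tyres, bonnets and doors.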
To get this smart, though, SegNet is just like any human – it needs training, says Alex. “A computer does that through supervised learning. We feed it a whole bunch of images and tell it what they should be. In SegNet's case we feed it one image, and then another one in which each pixel is labelled with the category of object. Then we train the system to reproduce this.”
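One hedged sketch of what that supervised step looks like, using a toy per-pixel classifier in NumPy. The features, labels and learning rate here are invented for illustration; SegNet itself is a much deeper network, but the principle of nudging weights to reproduce a pixel-labelled image is the same:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: a 4x4 "image" where each pixel has 3 features, plus a label
# mask assigning each pixel one of 3 hypothetical classes (road, car, sky).
features = rng.normal(size=(4, 4, 3))
labels = rng.integers(0, 3, size=(4, 4))

W = np.zeros((3, 3))  # per-pixel linear classifier: features -> class scores

def loss_and_grad(W):
    scores = features @ W.T                     # (4, 4, 3) class scores
    scores = scores - scores.max(axis=-1, keepdims=True)
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)  # softmax per pixel
    # Cross-entropy: penalise low probability on each pixel's true class.
    p_true = np.take_along_axis(probs, labels[..., None], axis=-1)
    loss = -np.log(p_true).mean()
    # Gradient of the mean cross-entropy with respect to W.
    onehot = np.eye(3)[labels]
    d_scores = (probs - onehot) / labels.size
    grad = np.einsum('hwc,hwf->cf', d_scores, features)
    return loss, grad

before, grad = loss_and_grad(W)
W -= 0.5 * grad                # one gradient-descent step
after, _ = loss_and_grad(W)    # loss drops: the label mask is being learned
```

Training the real system repeats this update millions of times over thousands of labelled images.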
The process is very labour intensive, however. It involved asking undergraduate students to manually label some 700 pictures pixel by pixel, beginning with images taken in Cambridge. “It's a one-off process that you need to do at the start of the project to generate this supervised training data, but once you have trained the neural network, it should work,” he says.
The team has now labelled around 5,000 road scenes from around the world. “We're trying to learn features that are more generic than just Cambridge. What makes a car a car; what makes a road a road? These are applicable around the world. Of course, if you get an extremely different scenario like a dusty road in the Outback of Australia then it may fail. That's when you need to go back to the labelling process, collect data from the new location, label it and retrain it,” says Alex.
For outdoor scenes, SegNet correctly labels more than 90 per cent of pixels in any given image. Alex is happy with this level of performance, and explains that the next step is for the system to have a greater understanding of uncertainty. “Let’s say that SegNet predicts that an object is a cyclist. An autonomous vehicle might predict that a cyclist is going to do certain things and respond accordingly. But let's say SegNet is wrong and the object is actually a pedestrian. The pedestrian can do different things and move in different ways. SegNet needs to say, ‘I think this is a cyclist - it may also be a pedestrian, but it's definitely not a car’.”
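That kind of ranked, hedged prediction falls naturally out of a softmax over class scores. A toy sketch of Alex's cyclist example; the scores below are made up for illustration, not real SegNet outputs:

```python
import numpy as np

classes = ['cyclist', 'pedestrian', 'car']

# Hypothetical class scores for one ambiguous object.
scores = np.array([2.0, 1.5, -3.0])

# Softmax turns scores into probabilities that sum to one.
probs = np.exp(scores - scores.max())
probs /= probs.sum()

best = classes[int(np.argmax(probs))]            # 'cyclist'
runner_up = classes[int(np.argsort(probs)[-2])]  # 'pedestrian'

# "I think this is a cyclist - it may also be a pedestrian,
#  but it's definitely not a car": probs[2] is near zero.
```

A downstream planner can then hedge its own behaviour across the plausible classes instead of betting everything on the top one.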
Understanding uncertainty is equally important when a car encounters bad weather, although Alex says computer vision copes better than technologies like LIDAR. “It will start to deteriorate in the most extreme cases but in the same sense as human vision would deteriorate. That's why it's really key to know if you're wrong. The uncertainty will help it say: ‘Hey, I don't know what's going on - maybe slow down’.”
Computer vision can also beat the location-determining accuracy of GPS. A second part of the project involves training the system to work out exactly where it is from a video feed.
Again, there’s a training process involved: “I used a technique called ‘structure from motion’ – it takes a video and determines the geometry of the scene by looking at how landmarks move. I used it to train my deep learning system. If you give it an image, it will predict where the camera is in 5 milliseconds,” says Alex.
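The geometric core of structure from motion is triangulation: a landmark seen from two known camera positions pins down its 3-D location. A minimal sketch with idealised cameras (identity intrinsics and noise-free pixel measurements, both assumptions made for illustration):

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one landmark seen in two views.
    P1, P2: 3x4 camera projection matrices; x1, x2: pixel coordinates."""
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]            # homogeneous -> Euclidean 3-D point

# Two cameras: one at the origin, one shifted 1 unit along x,
# both looking down the z axis.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.], [0.], [0.]])])

X_true = np.array([0.5, 0.2, 4.0])       # a landmark 4 units ahead
x1 = P1 @ np.append(X_true, 1.0); x1 = x1[:2] / x1[2]
x2 = P2 @ np.append(X_true, 1.0); x2 = x2[:2] / x2[2]

X_est = triangulate(P1, P2, x1, x2)      # recovers the 3-D landmark
```

Running this over every landmark in a video reconstructs the scene geometry, which is the supervision signal the deep localisation network is trained against.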
So when will autonomous vehicles be able to navigate using just one camera? “We've got the computing power and the software. To make it work everywhere, from the Australian outback to the centre of Manhattan, we probably require more data. Otherwise it performs really well, and as a system it could go on a car today.”