Paper Summary: "Deep learning in robotics: a review of recent research":
Abstract:
There's a lot of interest in applying deep learning to robotics
This paper = a review of that recent work in terms of applications, benefits, limitations and examples
Introduction:
Unique thing about robotics in machine learning is that there's a ton of data coming in from its sensors, a ton!
Neural networks are great for robotics compared to other ML models. They still have their limitations, but we'll review it all in this paper
Deep Learning:
Some history:
First we got regular linear regression. I mean it's not that bad, but it's boring
We slap on a non-linearity! Now it can fit non-linear functions!
What if we added some new layers? Now it's a multi-layer perceptron
We got backprop! Now we can actually train them
Oh turns out that neural networks are also universal function approximators!
Problem: It's slow. Solution: GPUs
Problem: Vanishing and exploding gradients. Solution: Hinton found a better way to do it
For robotics it's been applied since the 1980s. But with new advancements we're using them like crazy now! By 2011 we could give one a bunch of unlabeled images and it could model the state and dynamics
For intelligence: we thought chess was the pinnacle of human intelligence. They beat us. Then we thought image recognition was something uniquely human. They beat us. The next hurdle will be agility and dexterity --> robotics
Common structures:
A) Regular old feed in, spit out model (not the official name)
Train it with training pairs (x,y)
Loss for regression = sum square error, classification = cross entropy loss
Gradient descent to train
We just feed in x and we get out a predicted y out
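Here's a tiny sketch of what structure A looks like in code (my own toy example, assuming PyTorch; the layer sizes, data, and learning rate are invented for illustration):

```python
# Minimal sketch of Structure A (feedforward net), assuming PyTorch.
# Toy data and layer sizes are made up for illustration.
import torch
import torch.nn as nn

model = nn.Sequential(          # feed in x ...
    nn.Linear(4, 32),
    nn.ReLU(),                  # non-linearity between layers
    nn.Linear(32, 1),           # ... spit out a predicted y
)
loss_fn = nn.MSELoss(reduction="sum")   # sum-squared error for regression
                                        # (use nn.CrossEntropyLoss for classification)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)  # gradient descent

x = torch.randn(100, 4)         # training pairs (x, y)
y = torch.randn(100, 1)

for _ in range(200):            # training loop
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()             # backprop
    opt.step()
```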
B) Autoencoder
It's unsupervised since there's no need to provide labels.
It takes the input image and distills it down to its most important features, or internal latent encoding. Ex: We can take hundreds of thousands of pixels and condense them down to 30 values. We can use these values to reconstruct the whole image.
Good for hybrids between supervised and unsupervised learning. We can also provide input in the latent encoding layer
Generative models are closely linked to autoencoders too!
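A rough sketch of structure B (assuming PyTorch; the sizes and training details are my own illustration): an encoder squeezes the input down to a small latent code, a decoder reconstructs the input from it, and no labels are needed.

```python
# Sketch of Structure B (autoencoder), assuming PyTorch; sizes are illustrative.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 30))
decoder = nn.Sequential(nn.Linear(30, 128), nn.ReLU(), nn.Linear(128, 784))

opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()))
x = torch.rand(64, 784)                 # batch of flattened images (no labels needed)

for _ in range(100):
    opt.zero_grad()
    z = encoder(x)                      # distill the image down to a 30-value latent code
    recon = decoder(z)                  # reconstruct the whole image from z
    loss = nn.functional.mse_loss(recon, x)
    loss.backward()
    opt.step()
```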
C) Recurrent Neural Networks
Used for dynamic systems (things that change with respect to time)
Train by using backpropagation through time
LSTMs = improvements on regular RNNs
Basically you feed it an input, it takes it, spits out a latent encoding. This latent encoding is fed back into the model for the next time step, meanwhile we decode the latent encoding to achieve our useful result
The latent encoding = can be also adjusted by the user
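A rough sketch of structure C using an LSTM (my own example, assuming PyTorch; all shapes are invented): the latent state is carried forward between time steps while a decoder turns it into the useful output at each step.

```python
# Sketch of Structure C (recurrent net), assuming PyTorch; shapes are illustrative.
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=8, hidden_size=32, batch_first=True)
decoder = nn.Linear(32, 2)               # decode the latent encoding into the useful output

seq = torch.randn(16, 50, 8)             # batch of 50-step input sequences
hidden, _ = lstm(seq)                    # latent encoding at every time step,
                                         # fed back internally between steps
outputs = decoder(hidden)                # per-time-step predictions

# Training would use backpropagation through time on a sequence loss, e.g.:
# loss = nn.functional.mse_loss(outputs, targets); loss.backward()
```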
D) Reinforcement Learning
At least the specific kind covered here = Q-learning
Take in the state and action (or control vector) to get out some Q value. This Q value determines the next best action for a state
Now this is good for problems where we have an objective but no idea how to code in the labels
At first it moves randomly, but we reward it when it hits certain things, thus it will learn how to do those things
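A tiny tabular Q-learning sketch (a toy stand-in I wrote to show the idea, not the deep version the paper covers): the agent acts mostly at random at first, gets a reward when it hits the goal, and the Q-values gradually steer it toward the best action in each state.

```python
# Tiny tabular Q-learning sketch (toy example; environment is hypothetical).
import random

n_states, n_actions = 5, 2
Q = [[0.0] * n_actions for _ in range(n_states)]
alpha, gamma, eps = 0.1, 0.9, 0.2        # learning rate, discount, exploration rate

def step(s, a):
    """Hypothetical environment: action 1 moves right; reaching the last state pays 1."""
    s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    return s2, (1.0 if s2 == n_states - 1 else 0.0)

for _ in range(1000):
    s = 0
    for _ in range(20):
        # explore randomly sometimes, otherwise exploit the best known action
        if random.random() < eps:
            a = random.randrange(n_actions)
        else:
            a = max(range(n_actions), key=lambda i: Q[s][i])
        s2, r = step(s, a)
        # nudge the estimate toward reward + discounted best future value
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2
```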
CNNs
They're so important that they have their own section in the paper
Imagine an image that's 100,000 x 100,000. If you used a normal neural network: You'd be boned
Now we can use CNNs, which apply the same small set of weights across the whole image
They dominate anything to do with images and anything that has spatio-temporal proximity (aka the data points around the one I'm looking at are correlated)
It's great for robotics since we can feed them our whole swamp of data and they'll fuse it (ex: lidar + regular images) and do powerful stuff with it
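A quick sketch of why the shared weights matter (my own toy numbers, assuming PyTorch): the same small kernel slides across the whole image, so the parameter count doesn't grow with image size.

```python
# Sketch of a conv layer's weight sharing, assuming PyTorch; sizes illustrative.
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
img = torch.randn(1, 3, 256, 256)        # one RGB image
feat = conv(img)                          # the same 3x3 kernels applied at every position

# 16 * 3 * 3 * 3 weights + 16 biases = 448 parameters, regardless of image size;
# a fully connected layer over the raw pixels would need millions of weights.
print(sum(p.numel() for p in conv.parameters()))   # 448
```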
Trajectory of Deep Learning
We can take complex functions, and break them apart into other functions
Using different models we can learn each function and then merge them all together to create our mega model to help us model the complex system.
Many algorithms are starting to mirror how our brain is organized
A & C = the function of the cerebellum. B = the cerebral cortex. D = Basal Ganglia
We can start putting these models together such that it starts forming a fully cognitive system
Deep Learning in Robotics:
Goals of the robotics community for the next 5-20 years:
Human-like walking + running
Teaching by demonstration
Mobile navigation in pedestrian environments
Collaborative automation
Automated bin picking
Automated combat recovery
Automated aircraft inspection and maintenance
Robotic disaster mitigation and recovery
Some major challenges of robotics
Learning complex and new environments. The total state of the world isn't fully known. We have to deal with uncertainty. We have to interact with new tools, objects and surfaces. What if some parts fail? We still have to accomplish our mission. What if there are thousands of degrees of freedom? Lots of uncertainty, complex and high dimensional
Learning in super dynamic environments (similar to the last one). What happens when there's a ton of degrees of freedom (like multiple arms, or human hands, or swarm robots)? How do we make them reliable and safe when there's lots of uncertainty?
Advanced manipulation. We're still not good at using tools or interacting with complex objects, or actuating systems
Advanced object recognition. We're good at classifying and finding objects, but we still have to learn how to interact with deformable objects and how to do complete tasks. We also have to recognize additional properties of objects (wet? sharp? slippery?)
Interpreting and anticipating human actions. Robots will have to interact with humans, thus they have to understand us. One interesting avenue is teaching through demonstration. We also need to know when humans need help and when to step in
Sensor fusion & dimensionality reduction. We have a ton of data, but we have to turn it into useful representations
High Level task planning. AKA if we just say "get the milk" it has to plan and go through all the steps to getting the milk
We can actually see these challenges as the axioms which we have to solve to get all the cool functions we want.
A table of how DL is tackling these problems:
Structure A = the most popular because it's the most intuitive
But we need the other structures too in order to make progress
In the author's opinion, the empty cells = lack of research focus rather than inherent lack of compatibility
We also need some benchmarks for challenge 7 (high-level task planning) - that will accelerate the field
Structure A:
The role it plays:
Function Approximator (take in inputs, try to map to y)
Super General Purpose since there's a ton of functions that we might want to approximate.
Ex: Actions -> Change of state, Change of state -> Actions, Forces -> Motions
Even if there are equations, the environment = too complex for them to model correctly
Can be regression (continuous) or classification
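As a made-up concrete example of the "Actions -> Change of state" mapping (assuming PyTorch; every dimension here is invented): a feedforward net that takes the current state plus control vector and regresses the change of state.

```python
# Illustrative forward-dynamics model (state + action -> change of state), assuming PyTorch.
import torch
import torch.nn as nn

state_dim, action_dim = 12, 4            # made-up dimensions
dynamics = nn.Sequential(
    nn.Linear(state_dim + action_dim, 64),
    nn.ReLU(),
    nn.Linear(64, state_dim),            # predicted change of state (regression)
)

s = torch.randn(32, state_dim)           # batch of states
a = torch.randn(32, action_dim)          # batch of actions / control vectors
delta_s_pred = dynamics(torch.cat([s, a], dim=1))
```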
Examples of recent research:
Modeled the dynamics of an RC helicopter
Modeled time between driver's head movement + occurrence vs vehicle speed
Domains:
1) Detection + Perception:
NNs are super alluring because they can handle the high-dimensional input directly
Other models must have their inputs hand engineered by experts (dependence = bad).
We might have to train longer, but it offsets the need for expert input
Predicting the grasp point of garments
Recognize 48 different kitchen objects from YouTube cooking videos, or find doors
Predicting their pose, predicting haptic adjectives to help with function (avoiding slippery surfaces or adjusting grip)
2) Grasping + object manipulation:
Recognize different objects, find their orientation + grasp it
Find the highest success rate of grasp positions
They also did it in 1 pass and got 88% accuracy in under 100 ms
They also used an actor-critic approach to help train the robot
3) Scene understanding + Sensor fusion:
Extract meaning from videos and images
One group could recognize 20 Italian gestures
Another group created a unified system for pedestrian detection that handles feature extraction, articulation + motion, occlusion handling, and classification
Can predict the physical outcomes of dynamic scenes
Can also handle multiple types of inputs (vision + haptic) --> Like controlling a robotic hand
Recommendations for working with A:
Need to be familiar with ML to be able to take advantage of the flexibility of all the hyperparameters
Structure A = really good for object detection
Deeper models = better, but we need more regularization to avoid overfitting
Specialized layers = big deal. Conv + Max pooling + Batch norm + residual
Lots of toolkits to implement our algorithms (Tensorflow + Keras + Theano+ Torch + Caffe + GroundHog + Theanets + Kaldi + CURRENNT + a ton of other stuff)
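A sketch of those specialized layers combined in one small block (my own toy arrangement, written in PyTorch just to stay consistent with the other sketches): conv, batch norm, max pooling, and a residual skip connection.

```python
# Toy block combining the specialized layers mentioned above, assuming PyTorch.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels=16):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        h = torch.relu(self.bn1(self.conv1(x)))
        h = self.bn2(self.conv2(h))
        return torch.relu(h + x)          # residual (skip) connection

net = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1),       # conv
    nn.MaxPool2d(2),                      # max pooling
    ResidualBlock(16),                    # batch norm + residual inside
)
out = net(torch.randn(1, 3, 64, 64))
```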
Structure B:
The role it plays:
To understand what they see
It can take high dimensional data and turn it into lower dimensional data
Generative models = the opposite. They take lower dimensional data and blow it up to a higher dimension
Inference methods can predict the output directly from the encoding, instead of just producing an encoding
Examples of recent research:
Used an auto-encoder to extract meaning + achieve control --> Trained on how actions change the configuration of objects
Generative models for outcomes of physics simulations
Generative models for modeling + controlling nonlinear dynamics
Combined RGB Images, sound, joint angles and used Unsupervised + recurrent models
Modeled the inverse dynamics of the manipulator. It was able to adapt over time better than other types of models
Took high dimension data and created meaningful feature vectors --> Fed into other models
Recommendations for working with B:
They're really good for the sensor fusion part
Autoencoders + conv layers in the encoding layers (little advantage for decoders)
We could also get Decoders to only predict 1 pixel, then we could allow the user to select which pixel we want
There are also specialized regularized methods for autoencoders. (But some studies have shown that we can just throw out the entire encoder)
We train with the decoder until the internal representation converges
Nonlinear dimensionality reduction = good for pretraining
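A sketch of that pretraining idea (my own illustration; it assumes an `encoder` already trained for reconstruction, as in the autoencoder sketch above): keep only the encoder and feed its low-dimensional features into a downstream model.

```python
# Sketch of reusing a pretrained encoder's features downstream, assuming PyTorch.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 30))
# ... assume encoder was trained with a decoder on reconstruction, decoder then discarded ...

head = nn.Linear(30, 5)                  # small task-specific model on the latent features
x = torch.rand(8, 784)
with torch.no_grad():
    z = encoder(x)                       # meaningful low-dimensional feature vector
logits = head(z)                         # fed into the other model
```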
Structure C:
The role it plays:
Anticipate complex dynamics
There's memory built into it, thus it's good for environments that change over time
Examples of recent research:
Predict traffic maneuvers of cars -> For collision avoidance systems (inputs = driver video, dashcam video, state of the vehicle, GPS coordinates + street maps)
Simulate a knife cutting through a fruit. Can model friction, deformation and hysteresis. Trained through full observations
Can focus on a hand. The hand points to an object and it switches its attention to the object and picks it up
Recommendations for working with C:
Good for anything to do with time
Negative reputation for being difficult to train
Vanishing gradient (cause it's hella deep since it traverses through time). But we can just use LSTMs
There's also a trick where you present the data like you would in Structure A to pre-train the network, then transfer learn that over to the full recurrent net to train faster
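One way that pretraining trick could look (my own rough sketch, assuming PyTorch; the layer sizes and exact recipe are assumptions, not the paper's method): fit the input and output layers on single time steps first, structure-A style, then add the recurrent core and fine-tune everything with BPTT.

```python
# Sketch of pretraining a recurrent model with a feedforward stage first, assuming PyTorch.
import torch
import torch.nn as nn

feat = nn.Linear(8, 32)                  # input -> feature
readout = nn.Linear(32, 2)               # feature -> output

# Stage 1: pretrain feat/readout on single (x_t, y_t) pairs, as in Structure A.
# Stage 2: insert a recurrent core and fine-tune the whole thing with BPTT.
lstm = nn.LSTM(32, 32, batch_first=True)

seq = torch.randn(16, 50, 8)             # batch of sequences
h = torch.relu(feat(seq))                # per-step features from the pretrained layer
h, _ = lstm(h)                           # recurrence added on top
y = readout(h)                           # per-step outputs
```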
Structure D:
The role it plays:
Learning the control policy of machines
Similar to Structure A, but training process = very different
Instead of minimizing loss, we maximize future potential reward + we balance exploration vs exploitation
Good when you don't have 100% of information. Computationally efficient
Technically it can learn everything from its rewards alone, but it's really slow to train
Examples of recent research:
For autonomous aerial vehicles --> Less computation power + fewer sensors since we don't need all the information
We can just feed it video and it can handle controlling everything
Also it can use videos of humans doing tasks to create a cost function, then use it to train RL to do tasks
They've used this by feeding video directly into the actuator system (the thing that does the actions), and it works better than separating the two systems
Recommendations for working with D:
Biggest barrier = computation time --> GPUs
Train in simulations before real robots, even if the simulation is bad, we can transfer learn it
Also use modern architectures + modern training methods
Current Shortcomings of NN's for Robotics:
You need lots of training data, and it's hard to obtain this training data
But there are clever ways of mitigating this. You can use simulations to generate this data. Also you can manipulate your data to get more data (slow down or speed up videos; see the sketch after this list)
It also takes a long time to train. For millions of parameters we need to train for days
But we can reduce training time by distributing tasks across smaller NNs. But there are trade-offs to this technique (since we've also found that integrating into one = higher performance)
Unsupervised learning is not practical for robotic systems (since failure = catastrophic)
Also having deep learning on robots = hella costly
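The speed-up / slow-down augmentation trick mentioned above, as a tiny NumPy sketch (the frame array and speed factors are invented for illustration):

```python
# Tiny sketch of the speed-up / slow-down video augmentation idea, using NumPy.
import numpy as np

video = np.random.rand(100, 64, 64, 3)           # 100 frames of a (fake) clip

def resample(frames, speed):
    """Drop or repeat frames to simulate playing the clip at a different speed."""
    idx = np.round(np.arange(0, len(frames), speed)).astype(int)
    idx = np.clip(idx, 0, len(frames) - 1)
    return frames[idx]

fast = resample(video, 2.0)                       # ~50 frames: sped-up copy
slow = resample(video, 0.5)                       # ~200 frames: slowed-down copy
```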
Conclusion:
NNs = super powerful in terms of sensing, cognition, and action -> We can even combine them
They can be fed direct raw data (no feature engineering)
They can fuse multimodal data
They improve with experience, and can deal with complex environments
Barriers = big training data + big training time --> Expensive
Data = from crowdsourcing (it doesn't even have to come from robots, e.g. YouTube cooking videos)
Training time = better processing + algorithmic + technological improvements
Still many obstacles which prevent it from reaching its full potential - only time will tell how much it can contribute to robotics