
Paper Summary: "Deep learning in robotics: a review of recent research":

Dickson Wu
Jun 3, 2021 · 12 min read


Abstract:

  • There's a lot of interest in applying deep learning to robotics

  • This paper = a review of that research in terms of applications, benefits, limitations and examples



Introduction:

  • Unique thing about robotics in machine learning is that there's a ton of data coming in from its sensors, a ton!

  • Neural Networks are great for robotics compared to other ML models. They still have their limitations, but we'll review it all in this paper



Deep Learning:

  • Some history:

    • First we got regular linear regression. I mean it's not that bad, but it's boring

    • We slap on a non-linearity! Now it can fit non-linear functions!

    • What if we added some new layers? Now it's a multi-layer perceptron

    • We got backprop! Now we can actually train them

    • Oh turns out that neural networks are also universal function approximators!

    • Problem: It's slow. Solution: GPUs

    • Problem: Vanishing and exploding gradients. Solution: Hinton found better ways to train deep networks (layer-wise pretraining)

    • In robotics, neural networks have been applied since the 1980s. But with new advancements we're using them like crazy now! By 2011 we could give a network a bunch of unlabeled images and it could model the state and dynamics

    • On intelligence: we thought chess was the pinnacle of human intelligence. They beat us. Then we thought image recognition was something super unique. They beat us. The next hurdle will be agility and dexterity --> robotics

  • Common structures:


    • A) Regular old feed-in, spit-out model (not the official name)

      • Train it with training pairs (x,y)

      • Loss for regression = sum-squared error, classification = cross-entropy loss (see the loss sketch after this list)

      • Gradient descent to train

      • We just feed in x and get a predicted y out

    • B) Autoencoder

      • It's unsupervised since there's no need to provide labels.

      • It takes the input image and distills it down to its most important features, or internal latent encoding. Ex: We can take hundreds of thousands of pixels and condense them down to 30 values. We can use these values to reconstruct the whole image.

      • Good for hybrids between supervised and unsupervised learning. We can also provide input in the latent encoding layer

      • Generative models are closely linked to autoencoders too!

    • C) Recurrent Neural Networks

      • Used for dynamic systems (things that change with respect to time)

      • Train by using backpropagation through time

      • LSTMs = improvements on regular RNNs

      • Basically you feed it an input, it takes it, spits out a latent encoding. This latent encoding is fed back into the model for the next time step, meanwhile we decode the latent encoding to achieve our useful result

      • The latent encoding can also be adjusted by the user

    • D) Reinforcement Learning

      • The specific flavour covered here = Q-learning

      • Take in the state and action (or control vector) to get out a Q-value. The Q-value determines the next best action for a state

      • Good for problems where we have an objective but no idea how to hand-code the labels

      • At first it moves randomly, but we reward it when it hits certain things, thus it will learn how to do those things
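To pin down the losses mentioned under Structure A above, here's a minimal NumPy sketch. It's my own illustration, not code from the paper, and the numbers are invented: sum-squared error for regression, cross-entropy for classification.

```python
import numpy as np

def sum_squared_error(y_pred, y_true):
    # Regression loss: sum of squared differences between prediction and target
    return np.sum((y_pred - y_true) ** 2)

def cross_entropy(probs, labels):
    # Classification loss: negative log-likelihood of the true class, averaged over the batch.
    # probs are softmax outputs (rows sum to 1), labels are integer class indices.
    eps = 1e-12  # guard against log(0)
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + eps))

# Tiny usage example
print(sum_squared_error(np.array([1.2, 0.8]), np.array([1.0, 1.0])))                   # 0.08
print(cross_entropy(np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]), np.array([0, 1])))  # ~0.29
```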

  • CNNs

    • They're so important that they have their own section in the paper

    • Imagine an image that's 100,000 x 100,000. If you used a normal neural network: You'd be boned

    • Now we can use CNNs, which apply the same small set of weights across the whole image (see the sketch below)

    • They dominate anything to do with images and anything with spatio-temporal proximity (aka the data points around the one I'm looking at are correlated)

    • They're great for robotics since we can feed them our whole swamp of sensor data and they'll merge it (ex: Lidar + regular images) and do powerful stuff with it
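To make the weight-sharing point concrete, here is a minimal PyTorch sketch (my own illustration; the layer sizes are arbitrary): one small set of 3x3 filters is slid across the whole image, so the parameter count stays tiny no matter how large the image is.

```python
import torch
import torch.nn as nn

# One conv layer: 3 input channels (RGB), 16 filters, each 3x3.
# The same 16 small kernels slide over every position of the image.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)

small = torch.randn(1, 3, 64, 64)     # a 64x64 image
large = torch.randn(1, 3, 512, 512)   # a 512x512 image

print(conv(small).shape)   # torch.Size([1, 16, 64, 64])
print(conv(large).shape)   # torch.Size([1, 16, 512, 512])

# Parameter count is the same either way: 16 * (3*3*3) weights + 16 biases = 448,
# whereas a fully connected layer on the large image would need millions of weights.
print(sum(p.numel() for p in conv.parameters()))  # 448
```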

  • Trajectory of Deep Learning

    • We can take complex functions, and break them apart into other functions

    • Using different models we can learn each function and then merge them all together to create our mega model to help us model the complex system.

    • Many algorithms are starting to mirror the structure of the brain

    • A & C = the function of the cerebellum. B = the cerebral cortex. D = Basal Ganglia

    • We can start putting these models together such that it starts forming a fully cognitive system



Deep Learning in Robotics:

  • Goals of the robotics community for the next 5-20 years:

    • Human-like walking + running

    • Teaching by demonstration

    • Mobile navigation in pedestrian environments

    • Collaborative automation

    • Automated bin picking

    • Automated combat recovery

    • Automated aircraft inspection and maintenance

    • Robotic disaster mitigation and recovery

  • Some major challenges of robotics

    • Learning complex and new environments. The total state of the world isn't fully known; we have to deal with uncertainty; we have to interact with new tools, objects and surfaces; if some parts fail we still have to accomplish our mission; and there can be thousands of degrees of freedom. Lots of uncertainty, complex and high dimensional

    • Learning in super dynamic environments (similar to the last one). What happens when there are a ton of degrees of freedom (like multiple arms, or human hands, or swarm robots)? How do we make them reliable and safe when there's lots of uncertainty?

    • Advanced manipulation. We're still not good at using tools or interacting with complex objects, or actuating systems

    • Advanced object recognition. We're good at classifying and finding objects, but we still have to learn how to interact with deformable objects and how to do complete tasks. We also have to recognize additional properties of objects (wet? sharp? slippery?)

    • Interpreting and anticipating human actions. Robots will have to interact with humans, thus they have to understand us. One interesting avenue is teaching through demonstration. We also need to know when humans need help and when to step in

    • Sensor fusion & dimensionality reductions. We have a ton of data, but we have to turn them into useful representations

    • High Level task planning. AKA if we just say "get the milk" it has to plan and go through all the steps to getting the milk

  • We can actually see these challenges as the foundational problems we have to solve to get all the cool capabilities we want.

  • A table of how DL is tackling these problems:

  • (table from the paper mapping structures A-D to the challenges above; not reproduced here)

    • Structure A = the most popular because it's the most intuitive

    • But we need the other structures too in order to make progress

    • In the author's opinion, the empty cells = lack of research focus rather than inherent lack of compatibility

    • We also need some benchmarks for challenge 7 (high-level task planning) - that will accelerate the field

  • Structure A:

  • The role it plays:

    • Function Approximator (take in inputs, try to map to y)

    • Super General Purpose since there's a ton of functions that we might want to approximate.

      • Ex: Actions -> Change of state, Change of state -> Actions, Forces -> Motions

      • Even if there are equations, the environment = too complex to approximate correctly with them

    • Can be regression (continuous) or classification

  • Examples of recent research:

    • Modeled the dynamics of a RC helicopter

    • Modeled time between driver's head movement + occurrence vs vehicle speed

    • Domains:

    • 1) Detection + Perception:

      • NNs are super alluring because they can work directly with the high-dimensional input

      • Other models need their inputs hand-engineered by experts (that dependence = bad).

      • We might have to train longer but it offsets the need for expert input

      • Predicting the grasp point of garments

      • Recognize 48 different kitchen objects from YouTube cooking videos, or find doors

      • Predicting their pose, predicting haptic adjectives to help with function (avoiding slippery surfaces or adjusting grip)

    • 2) Grasping + object manipulation:

      • Recognize different objects, find their orientation + grasp it

      • Find the highest success rate of grasp positions

      • They also did it in 1 pass and got 88% under 100 ms

      • They also did an actor-critic thing to help train the robot

    • 3) Scene understanding + Sensor fusion:

      • Extract meaning from videos and images

      • One group could recognize 20 Italian gestures

      • Another group created a unified system that can extract features, articulation + motion handling, occlusion handling, classification for pedestrian detection

      • Can predict the physical outcomes of dynamic scenes

      • Can also handle multiple types of inputs (vision + haptic) --> like controlling a robotic hand

  • Recommendations for working with A:

    • Need to be familiar with ML to take advantage of the flexibility of all the hyperparameters

    • Structure A = really good for object detection

    • Deeper models = better, but we need more regularization to avoid overfitting

    • Specialized layers = big deal. Conv + Max pooling + Batch norm + residual

    • Lots of toolkits to implement our algorithms (TensorFlow + Keras + Theano + Torch + Caffe + GroundHog + Theanets + Kaldi + CURRENNT + a ton of other stuff)
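Here's a rough sketch of Structure A using PyTorch, one of the Torch-family toolkits above. The architecture, data and hyperparameters are invented for illustration, not taken from the paper: a small multi-layer perceptron trained by gradient descent on (x, y) pairs.

```python
import torch
import torch.nn as nn

# Toy (x, y) training pairs: map a 10-dim sensor vector to a 2-dim target
x = torch.randn(256, 10)
y = torch.randn(256, 2)

# Structure A: plain feed-forward network with non-linearities
model = nn.Sequential(
    nn.Linear(10, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 2),
)
loss_fn = nn.MSELoss()                      # regression loss
opt = torch.optim.SGD(model.parameters(), lr=0.01)

for step in range(100):                     # gradient descent via backprop
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()

print(model(x[:1]))                         # feed in x, get a predicted y out
```

Swapping nn.MSELoss for nn.CrossEntropyLoss (and making the last layer output class scores) turns the same skeleton into a classifier.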

  • Structure B:

  • The role it plays:

    • To understand what they see

    • It can take high-dimensional data and turn it into a lower-dimensional representation

    • Generative models = the opposite. They take low-dimensional data and blow it up to a higher dimension

    • Inference methods can predict an output from the encoding instead of just producing the encoding itself

  • Examples of recent research:

    • Used an auto-encoder to extract meaning + achieve control --> Trained on how actions change the configuration of objects

    • Generative models for outcomes of physics simulations

    • Generative models for modeling + controlling nonlinear dynamics

    • Combined RGB images, sound and joint angles, using unsupervised + recurrent models

    • Modeled the inverse dynamics of the manipulator. It was able to adapt over time better than other types of models

    • Took high dimension data and created meaningful feature vectors --> Fed into other models

  • Recommendations for working with B:

    • They're really good for the sensor fusion part

    • Autoencoders + conv layers in the encoding layers (little advantage for decoders)

    • We could also get Decoders to only predict 1 pixel, then we could allow the user to select which pixel we want

    • There are also specialized regularized methods for autoencoders. (But some studies have shown that we can just throw out the entire encoder)

    • We train until the internal representation converges with the decoder

    • Nonlinear dimensionality reduction = good for pretraining
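A minimal PyTorch sketch of Structure B (again my own illustration; the 784-to-30 sizes are arbitrary): the encoder squeezes a high-dimensional input into a small latent code, the decoder reconstructs the input from that code, and training only needs the unlabeled inputs themselves.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=30):
        super().__init__()
        # Encoder: high-dimensional input -> small latent code
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),
        )
        # Decoder: latent code -> reconstruction of the input
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

model = Autoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(64, 784)                     # unlabeled inputs only

for step in range(50):
    opt.zero_grad()
    recon, z = model(x)
    loss = nn.functional.mse_loss(recon, x)  # reconstruction error, no labels needed
    loss.backward()
    opt.step()
```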

  • Structure C:

  • The role it plays:

    • Anticipate complex dynamics

    • There's memory built into it, so it's good for environments that change over time

  • Examples of recent research:

    • Predict traffic maneuvers of cars -> for collision avoidance systems (inputs = driver video, dashcam video, state of vehicle, GPS coordinates + street maps)

    • Simulate a knife cutting through a fruit. Can model friction, deformation and hysteresis. Trained through full observations

    • Can focus on a hand. The hand points to an object, the model switches its attention to the object, and the robot picks it up

  • Recommendations for working with C:

    • Good for anything to do with time

    • Negative reputation for being difficult to train

    • Vanishing gradients (because it's hella deep once unrolled through time). But we can just use LSTMs

    • There's also a trick where you present the data like you would in Structure A to pre-train the network, then transfer learn that over to the full recurrent net to train faster
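A minimal PyTorch sketch of Structure C (dimensions invented for illustration): an LSTM carries its hidden state from one time step to the next, a linear decoder reads that state out at every step, and calling backward() on the loss performs backpropagation through time over the unrolled sequence.

```python
import torch
import torch.nn as nn

class SeqPredictor(nn.Module):
    def __init__(self, input_dim=8, hidden_dim=32, output_dim=4):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.decoder = nn.Linear(hidden_dim, output_dim)   # read the latent state out

    def forward(self, x):
        h, _ = self.lstm(x)          # hidden state is carried across time steps
        return self.decoder(h)       # a prediction at every time step

model = SeqPredictor()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(16, 50, 8)           # 16 sequences, 50 time steps, 8 features each
y = torch.randn(16, 50, 4)           # a target at each time step

for step in range(50):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()                   # backpropagation through time over the unrolled sequence
    opt.step()
```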

  • Structure D:

  • The role it plays:

    • Learning the control policy of machines

    • Similar to Structure A, but training process = very different

    • Instead of minimizing loss, we maximize future potential reward + we balance exploration vs exploitation

    • Good when you don't have 100% of information. Computationally efficient

    • Technically it can learn everything from just its rewards, but it's really slow to train

  • Examples of recent research:

    • For Autonomous aerial vehicles --> Less computation power + less sensors since we don't need all information

    • We can just feed it video and it's okay with controlling everything

    • Also it can use videos of humans doing tasks to create a cost function, then use it to train RL to do tasks

    • They used this by feeding video directly into the actuation system (the thing that does the actions), and it's better than separating the two systems

  • Recommendations for working with D:

    • Biggest barrier = computation time --> GPUs

    • Train in simulations before real robots, even if the simulation is bad, we can transfer learn it

    • Also use modern architectures + modern training methods
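To make the Q-learning idea behind Structure D concrete, here is a minimal tabular sketch on an invented toy environment (real robotics work would replace the table with a deep network and a far richer state space): the agent balances exploration (random actions) against exploitation (the best known action), and each update nudges Q(s, a) toward the reward plus the discounted best future value.

```python
import random

# Toy chain environment (invented for illustration): states 0..4, actions 0 (left) and 1 (right);
# reaching state 4 yields reward 1, everything else yields 0.
N_STATES, N_ACTIONS, GOAL = 5, 2, 4

def step(state, action):
    next_state = max(0, min(GOAL, state + (1 if action == 1 else -1)))
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL

Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
alpha, gamma, epsilon = 0.1, 0.9, 0.3   # learning rate, discount factor, exploration rate

for episode in range(500):
    state, done = 0, False
    while not done:
        # Exploration vs exploitation: sometimes act randomly, otherwise take the best known action
        if random.random() < epsilon:
            action = random.randrange(N_ACTIONS)
        else:
            action = max(range(N_ACTIONS), key=lambda a: Q[state][a])
        next_state, reward, done = step(state, action)
        # Q-learning update: move Q(s, a) toward reward + discounted best future value
        Q[state][action] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][action])
        state = next_state

print(Q)  # action 1 (right) ends up with the higher value in states 0-3
```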



Current Shortcomings of NNs for Robotics:

  • You need lots of training data, and it's hard to obtain this training data

  • But there are clever ways of mitigating this. You can use simulations to generate data. You can also manipulate your data to get more of it (slow down or speed up videos - see the sketch after this list)

  • It also takes a long time to train. For millions of parameters we need to train for days

  • But we can reduce training time by distributing tasks across smaller NNs. But there are trade-offs to this technique (since we've also found that integrating into one = higher performance)

  • Unsupervised learning is not practical for robotic systems (since failure = catastrophic)

  • Also having deep learning on robots = hella costly
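As a small illustration of the speed-up/slow-down augmentation mentioned above (my own sketch, not from the paper): resampling the frame indices of a clip yields extra training sequences in which the same action happens faster or slower.

```python
import numpy as np

def time_stretch(frames, factor):
    """Crude augmentation sketch: resample a video (frames x H x W x C array)
    to simulate speeding it up (factor > 1) or slowing it down (factor < 1)."""
    n = len(frames)
    idx = np.clip(np.round(np.arange(0, n, factor)).astype(int), 0, n - 1)
    return frames[idx]

video = np.random.rand(100, 64, 64, 3)      # 100 frames of a 64x64 RGB clip
fast = time_stretch(video, 2.0)             # ~50 frames: the same action, twice as fast
slow = time_stretch(video, 0.5)             # ~200 frames: the same action, slowed down
print(fast.shape, slow.shape)
```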



Conclusion:

  • NNs = super powerful in terms of sensing, cognition, and action -> We can even combine them

  • They can be fed direct raw data (no feature engineering)

  • They can fuse multimodal data

  • They improve with experience, and can deal with complex environments

  • Barriers = big training data + big training time --> Expensive

  • Data = from crowdsourcing (it doesn't even have to be from robots, e.g. YouTube cooking videos)

  • Training time = better processing + algorithmic + technological improvements

  • Still many obstacles which prevent it from reaching its full potential - only time will tell how much it can contribute to robotics


