

Dickson Wu

Sep 11, 2021 · 17 min read

Paper Summary: “Deep Learning in Spiking Neural Networks”



Abstract:

Artificial Neural Networks (ANNs) have dominated the ML world — they’re extremely powerful thanks to backpropagation, tons of data, and the simplicity of the neuron itself. But despite their name, ANNs aren’t much like our brains.

Enter: Spiking Neural Networks

ANNs have continuous-valued activations — but biological neurons emit discrete events (spikes). So why not try to adapt the spiking characteristic into neural networks? That’s what Spiking Neural Networks (SNNs) are all about.

They’re more biologically realistic (and thus can help us understand the brain better), and they’re more hardware-friendly and energy-efficient than ANNs (thus good for portable devices).

But we have a challenge → SNNs are non-differentiable, so we can’t do backpropagation directly. This paper is about techniques to train SNNs. SNNs are lagging behind ANNs → but the gap is closing. And perhaps one day SNNs will surpass their artificial kin.

Introduction:

ANNs’ base unit is the (non-spiking) neuron. The neurons are continuous, have weights, have nonlinear activations (which is where the power of stacking layers comes from), and are differentiable → meaning with lots of data and computing power, they can perform great feats.

But ANNs aren’t like our brains — the way they communicate is different. Brains communicate by sending spike trains downstream to other neurons. These spikes are sparse in time, so each spike carries a lot of information. Within a spike train we have different spike timings, latencies, and spike rates across populations. SNNs use an idealized version of spike trains.

Since SNNs are sparse, they can achieve super high energy efficiency. They’re also intrinsically tuned for temporal characteristics — since spike timing is highly important in brains. Additionally, they can emulate the brain’s ability to do local learning.

From a scientific motivation: brains are themselves deep spiking networks, and they’re super good at what they do. We can also use SNNs to investigate how the brain learns — testing general hypotheses and discovering new ones!

From an engineering standpoint, SNNs beat ANNs in terms of energy efficiency. Since an SNN’s output is a sparse stream of high-information spikes, it doesn’t use much energy (versus high-end graphics cards).

In terms of representation power — SNNs, in principle, have higher representational power than ANNs. SNNs are based on the brain, so we could build brain-like representations.

The communication strategy, spike trains, is not differentiable. So how are we going to use derivative-based optimization to train SNNs? This is the major challenge that SNNs face — especially multi-layer ones. This paper highlights methods to train SNNs.

Spiking Neural Network: A Biologically Inspired Approach To Information Processing

SNN Architecture:

In biological neurons, spikes are generated by summing the charge arriving from the previous neurons (known as presynaptic neurons) in the membrane potential. When the summed charge crosses a certain threshold, the neuron spikes. The rate of these spikes and the temporal pattern of the spike trains carry information. SNNs are similar.

Alright so let’s build the SNN! First, we have input data → But how are we going to feed that into our model? There are many different ways to encode input data (rate-based methods, temporal coding or population coding). 
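As a concrete sketch of the rate-based option: each input value becomes a binary spike train whose firing probability per time step tracks the value. This is my own toy illustration (the function name and parameters are mine, not from the paper):

```python
import numpy as np

def poisson_encode(values, n_steps, max_rate=0.5, seed=0):
    """Rate-based encoding: each input value in [0, 1] becomes a
    binary spike train whose per-step firing probability is
    proportional to the value."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Spike with probability value * max_rate at every time step.
    probs = np.clip(values * max_rate, 0.0, 1.0)
    return (rng.random((n_steps,) + values.shape) < probs).astype(np.uint8)

# Encode two pixel intensities over 100 time steps.
spikes = poisson_encode([0.1, 0.9], n_steps=100)
print(spikes.shape)         # (100, 2)
print(spikes.mean(axis=0))  # the brighter pixel fires more often
```

Temporal and population coding would instead put the information in *when* spikes occur, or spread it across a group of neurons.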

Biological neurons are still super complex: they have particular action-potential generation dynamics and network dynamics. But for SNNs we simplify these dynamics — in SNNs, the neurons all have pure threshold dynamics.

Just like in biological neurons, pre-synaptic neurons fire to modulate the membrane potential of the postsynaptic neuron. When the membrane potential crosses a certain threshold then the spike is generated! 

The first models to simulate this were by Hodgkin and Huxley, but they had lots of biological detail — meaning a very high computational cost. Other, simpler models have been proposed, like the spike response model (SRM), the Izhikevich neuron model, and the leaky integrate-and-fire neuron (LIF, a very popular model).
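Here’s a minimal sketch of an LIF neuron — my own toy parameters and simple Euler integration, since the paper doesn’t prescribe an implementation. The potential leaks toward rest, integrates input, and fires + resets at a threshold:

```python
import numpy as np

def lif_simulate(input_current, dt=1.0, tau=20.0, v_thresh=1.0, v_reset=0.0):
    """Leaky integrate-and-fire: the membrane potential v leaks toward 0,
    integrates the input current, and emits a spike (then resets)
    whenever it crosses the threshold."""
    v = 0.0
    spikes = []
    for i_t in input_current:
        # Euler step of dv/dt = (-v + i) / tau
        v += dt * (-v + i_t) / tau
        if v >= v_thresh:
            spikes.append(1)
            v = v_reset
        else:
            spikes.append(0)
    return np.array(spikes)

# Constant suprathreshold input produces regular spiking.
spikes = lif_simulate(np.full(200, 1.5))
print(spikes.sum())
```

A subthreshold input (e.g. a constant 0.5 with this threshold) settles below 1.0 and never spikes — the "leaky" part at work.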

In ANNs, synapses have weights that are multiplied by the outputs of the previous neurons. SNNs have the same thing! Adaptive synapses (weights) can either be excitatory, meaning they amplify the signal, or inhibitory, meaning they dampen the signal. These weights can be learned through training. But it is these learning algorithms that are the most challenging part of developing deep SNNs.

Learning Rules in SNNs:

For both non-spiking and spiking neural networks, learning is always done by adjusting the weights. But spiking allows for a particular bio-plausible kind of learning that cannot be used in non-spiking networks: spike-timing-dependent plasticity (STDP).

STDP adjusts the weight of the synapse connecting the pre- and postsynaptic neurons according to their relative spike times. This means that learning is local in space (at the synapse) and local in time.

Unsupervised Learning via STDP:

Unsupervised learning with STDP follows the biological version of STDP. If the presynaptic neuron fires before the postsynaptic neuron then the connection between the two is strengthened. But if the presynaptic neuron fires after the postsynaptic neuron then the connection is weakened. Strengthening = long-term potentiation (LTP). Weakening = long-term depression (LTD).

This is shown in the equation below:

∆w = A · e^((t_pre − t_post)/τ)   if t_pre − t_post ≤ 0 (A > 0)

∆w = B · e^(−(t_pre − t_post)/τ)   if t_pre − t_post > 0 (B < 0)
  • t_pre and t_post = the times at which the spikes fire

  • A and B are scalar learning rates

  • τ is the time constant for the temporal learning window

In other words, the top case is strengthening and the bottom case is weakening. When the connection is strengthened, the learning rate A is positive — so ∆w is positive and the weight grows. When it’s weakened, the learning rate B is negative — so ∆w is negative and the weight shrinks.

The strength of the effect is modulated by a decaying exponential — which is controlled by the time difference between the pre- and post-synaptic spikes. But SNNs rarely follow this rule, since people usually modify it to be simpler or satisfy a mathematical property. 
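Here’s the pairwise rule as a toy function. The constants are illustrative (A positive for LTP, B negative for LTD, matching the convention above):

```python
import numpy as np

def stdp_dw(t_pre, t_post, a_plus=0.1, a_minus=-0.12, tau=20.0):
    """Pairwise STDP: potentiate (A > 0) when the presynaptic spike
    precedes the postsynaptic one, depress (B < 0) otherwise. The
    magnitude decays exponentially with the time difference."""
    dt = t_pre - t_post
    if dt <= 0:                         # pre fired before post -> LTP
        return a_plus * np.exp(dt / tau)
    return a_minus * np.exp(-dt / tau)  # pre fired after post -> LTD

print(stdp_dw(10.0, 15.0))  # positive: potentiation
print(stdp_dw(15.0, 10.0))  # negative: depression
```

Note how a pair of spikes far apart in time (say 40 ms) barely moves the weight compared with a tight pairing — that’s the "temporal learning window" τ controls.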

STDP also has accumulated network-level effects. For example, certain spike trains can cause neurons to spike faster (i.e. with shorter latencies). STDP can shape the neuronal selectivity of the SNN — meaning certain patterns make certain neurons respond faster.

People have applied STDP to just the first layer (and it still works). STDP can solve difficult problems by creating spatio-temporal spike patterns. And we can develop more complex networks with output neurons too!

Probabilistic Characterization of Unsupervised STDP:

There’s evidence that the brain performs Bayesian analysis (where we combine prior knowledge with the likelihood of new observations to get a posterior probability). So researchers have implemented probabilistic computation with STDP.

One paper did just that! They start off by defining the STDP of the network like this:


  • LTP occurs when the post-synaptic neuron fires within Ɛ ms after the pre-synaptic neuron

  • LTD occurs in all other cases.

The paper uses Poisson spiking neurons + a stochastic winner-take-all (WTA) circuit to emulate an expectation-maximization (EM) algorithm. The Expectation step is done through the output neurons (which create a posterior distribution). The Maximization step is STDP. There are lots of papers which build on top of this and on each other.

The drawback of this STDP model is that the LTP weights are negative — but we can just slap on a constant parameter to make them positive.

{There are tons of terms thrown around here which I don’t completely understand. But the gist is that there’s been work in probabilistic SNNs.}

Supervised Learning:

Supervised learning = we compare the output with the labels, compute the loss, then use gradient descent to minimize it. Supervised learning in SNNs does the same thing.

So how are we going to use gradient descent on SNNs? Let’s first recall the core backpropagation formula:

δ^u_j = g′(a^u_j) · Σ_k (w_kj · δ^u_k)
  • δ = the partial derivative

  • g = the activation function

  • a = the input

  • w = weight

Now the network itself is made up of layers. j = the current layer, and k = the layer right after it (so the outputs of j are the inputs to k).

  • δ^u_j = the error signal (partial derivative) of our current neuron j

  • g′(a^u_j) = the neuron’s input pushed through the derivative of the activation function

  • Summation part = we sum over the neurons k that j feeds into, multiplying the weight between the two by that neuron’s error signal

Cool! This is how backprop works in ANNs → So let’s try and transfer it over to SNNs! We have 2 big problems in this formula:

  • g’() is the derivative of the activation function g()… For SNNs, g() is the spiking neuron’s output → represented by Dirac delta functions → which don’t have a usable derivative

  • The weight transport problem. Basically, the same weights used in the feedforward pass must be reused to propagate the gradients backward — which isn’t bio-plausible.

The first issue is addressed by using substitute or approximate derivatives. These solutions are not bio-plausible, but they still hold value from engineering and scientific standpoints.

The second issue is addressed by recent research showing that backpropagation still works even if you use random fixed feedback weights (a result known as feedback alignment).
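To make the surrogate-derivative idea concrete, here’s a common pattern (this particular sigmoid-derivative surrogate is my illustrative pick — the surveyed papers use various substitutes): keep the hard threshold in the forward pass, but pretend it was a steep sigmoid when computing gradients.

```python
import numpy as np

def spike_forward(v, v_thresh=1.0):
    """Forward pass: a hard threshold (Heaviside step), whose true
    derivative is a Dirac delta and useless for backprop."""
    return (v >= v_thresh).astype(float)

def spike_surrogate_grad(v, v_thresh=1.0, beta=5.0):
    """Backward pass: use the derivative of a steep sigmoid as a
    smooth surrogate for the delta function."""
    s = 1.0 / (1.0 + np.exp(-beta * (v - v_thresh)))
    return beta * s * (1.0 - s)

v = np.array([0.2, 0.99, 1.01, 2.5])
print(spike_forward(v))         # hard 0/1 spikes
print(spike_surrogate_grad(v))  # gradient is largest near threshold
```

The surrogate only kicks in during the backward pass, so the network still communicates with genuine all-or-nothing spikes.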

Some Supervised Learning Methods for SNNs:

SpikeProp was the first SNN trained with backpropagation. The task was to classify non-linearly separable temporal data. The cost function took the spike timings into account. The key to their success was that they used a spike response model — this avoided the derivative problem, since it yields continuous-valued functions.

But the 2 constraints of that paper were that each output neuron could only output 1 spike, and that encoding the data as spike-time delays made training slow.

Recent advances have expanded SpikeProp to MultiSpikeProp. They improved the spike codings and spike-time errors, which improves the spiking backpropagation.

Another type of approach is the remote supervised method (ReSuMe), along with Chronotron and the spike pattern association neuron (SPAN). They have 1 spiking neuron that gets inputs from tons of presynaptic neurons, and the spiking neuron should fire the correct spike train.

ReSuMe uses a variant of the Widrow-Hoff rule. The Widrow-Hoff rule looks like:

∆w = η · x · (y^d − y^o)

(η is a learning rate)
  • y^d = the ground truth (desired output)

  • y^o = the output (observed output)

  • x = pre-synaptic inputs

And the modified version looks like:

∆w = ∆w^STDP(S_pre, S_d) + ∆w^aSTDP(S_pre, S_o)

  • ∆w^STDP and ∆w^aSTDP are both functions

  • ∆w^STDP takes in the presynaptic spike train and the desired spike train

  • ∆w^aSTDP takes in the presynaptic spike train and the observed spike train

We call this remote supervised learning because for the ∆w^STDP, there is no physical connection between the presynaptic inputs and the desired outputs. 
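For intuition, here’s the plain (non-spiking) Widrow-Hoff delta rule on real-valued signals — ReSuMe’s spike-train version replaces these products with the STDP-style terms above. Everything here (names, learning rate, the toy target) is my own sketch:

```python
import numpy as np

def widrow_hoff_step(w, x, y_desired, lr=0.1):
    """One Widrow-Hoff (delta-rule) update: nudge the weights in the
    direction that reduces the error between the desired and the
    observed output."""
    y_observed = w @ x
    return w + lr * (y_desired - y_observed) * x

# Learn a target linear mapping from random inputs.
rng = np.random.default_rng(0)
w_true = np.array([0.5, -0.3, 0.8])
w = np.zeros(3)
for _ in range(500):
    x = rng.normal(size=3)
    w = widrow_hoff_step(w, x, w_true @ x)
print(np.round(w, 2))  # converges toward w_true
```

The appeal of this rule is that the update only needs locally available quantities — which is exactly why it pairs so naturally with STDP.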

There’s something called the Victor-Purpura (VP) distance metric — which represents the minimum cost of transforming one spike train into another. Chronotron implemented a piecewise-differentiable version of the VP distance so we can use backpropagation on it.
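The VP distance itself is easy to compute with dynamic programming. This is a standard textbook-style sketch (not code from the paper), assuming unit insert/delete cost and a shift cost of q·|∆t|:

```python
import numpy as np

def vp_distance(train_a, train_b, q=1.0):
    """Victor-Purpura spike-train distance: the minimum cost of turning
    one spike train into the other, where inserting or deleting a spike
    costs 1 and shifting a spike by dt costs q * |dt|."""
    n, m = len(train_a), len(train_b)
    d = np.zeros((n + 1, m + 1))
    d[:, 0] = np.arange(n + 1)  # delete every spike of a
    d[0, :] = np.arange(m + 1)  # insert every spike of b
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            shift = q * abs(train_a[i - 1] - train_b[j - 1])
            d[i, j] = min(d[i - 1, j] + 1,          # delete a spike
                          d[i, j - 1] + 1,          # insert a spike
                          d[i - 1, j - 1] + shift)  # shift a spike
    return d[n, m]

print(vp_distance([10, 20, 30], [10, 20, 30]))  # -> 0.0 (identical)
print(vp_distance([10, 20, 30], [10, 20, 31]))  # -> 1.0 (one small shift)
```

The `min(...)` over delete/insert/shift is what makes the raw metric only piecewise differentiable — hence Chronotron’s smoothed version.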

SPAN is like ReSuMe but in the reverse direction — instead of making the Widrow-Hoff rule compatible with STDP, we just make STDP compatible with the Widrow-Hoff rule. They do so in a way similar to SpikeProp, using digital-to-analog conversions of the spike trains. The change in weight looks like:

∆w = ∫ x̃(t) · (ỹ^d(t) − ỹ^o(t)) dt

All the ~ signs denote the analog versions of the spike trains.

There are a whole bunch of other ways to do it. One paper replaced the hard spike threshold with a narrow-support gate function → led to continuity, thus differentiability. Another paper sent backprop update rules through temporally local STDP rules — and achieved results comparable with ANNs.

Another paper modified the SRM model to use a stochastic threshold. Another paper had each output neuron represent a class of the data: if the desired neuron fires, we do STDP on its incoming connections; if a non-target neuron fires, we do anti-STDP on its incoming connections.

Deep Learning in SNNs

Let’s take it a step (or rather a few layers) further! Stacking more and more layers makes our models more powerful. Let’s do that with SNNs! 

Traditional Deep Neural Networks (DNNs) have many different flavours: Convolutional Neural Networks (CNNs), Deep Belief Networks (DBNs), and Recurrent Neural Networks (RNNs). We can have spiking versions of each of them!

Lots of recent research has been on SNNs and their variants due to SNNs’ awesome power efficiency → But they have yet to be as powerful as DNNs. That’s the next challenge which we have to solve!

Deep Fully Connected SNNs:

There are many ways of achieving Fully Connected SNNs! 

Recent studies have combined STDP with stochastic gradient descent. Others have used LIF neurons + synaptic plasticity rules. Using an unsupervised technique, they could get 95% accuracy!

Lots of methods for backpropagation have been developed. One paper used the pre- and post-synaptic spike trains to do backprop (97.93%). Another paper used the membrane potential as a nonlinear function, giving differentiable signals (98.88%) — plus it was 5 times cheaper to train than normal.



If you only want the energy savings at inference, you can transform a trained ANN into an SNN. You convert the floating-point activations to spike rates, and it works just fine. Converted SNNs are able to get away with far fewer operations.
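To see why rate conversion "just works", here’s a toy sketch (my own illustration, not the specific pipeline from the cited works): an integrate-and-fire neuron with reset-by-subtraction fires at a rate that approximates relu(input) for inputs in [0, 1].

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def if_neuron_rate(drive, n_steps=1000, v_thresh=1.0):
    """Integrate-and-fire neuron under a constant drive: its firing
    rate over many steps approximates relu(drive) for drive in [0, 1]."""
    v, spikes = 0.0, 0
    for _ in range(n_steps):
        v += drive
        if v >= v_thresh:
            spikes += 1
            v -= v_thresh  # reset by subtraction, common in conversion work
    return spikes / n_steps

for a in [-0.5, 0.0, 0.3, 0.7]:
    print(a, relu(a), if_neuron_rate(a))  # rate tracks relu(a)
```

Negative drives never reach threshold (rate 0, like ReLU), and positive drives fire proportionally — which is why the converted network’s spike counts stand in for the ANN’s activations.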



Spiking CNNs:

Traditional CNNs are really good at vision, thanks to their convolutions + pooling. The first layers of a CNN extract primary visual features (they resemble oriented-edge detectors). We can do pooling by taking the maximum or the average over a square neighborhood of neurons.

For SNNs, the CNN version will also have convolutions. These convolutions can either be learned or hand-crafted. Hand-crafted ones can borrow concepts from biology — like the Difference-of-Gaussians filter found in the mammalian primary visual cortex.



You can train spiking CNNs through STDP learning rules, so we can learn deep CNNs → which can obtain 99.05% on MNIST. We can also train them using backpropagation, e.g. using the neural membrane potential as a differentiable stand-in for the activation function.

Just like with deep fully connected SNNs, we can also convert CNNs into spiking CNNs. Spiking CNNs are able to do fewer operations while consuming less energy. We convert classical CNNs to deep SNNs just like we do ANN-to-SNN conversions. One work applied weight normalization and boosted the performance.



Spiking Deep Belief Networks:

Deep Belief Networks (DBNs) were developed by Hinton back in 2006. They use layerwise unsupervised learning to do tasks. They’re made up of binary units, whose states are updated stochastically.

They train using contrastive divergence (CD), which approximates maximum likelihood. This is different from backpropagation because there are no derivatives involved! We could still use backpropagation here if we have some labeled data.



People then expanded the original concept, making it sparser, using different energy functions, and slapping on some convolutions.

DBNs are made up of Restricted Boltzmann Machines (RBMs). So to make spiking DBNs, we have to make spiking RBMs. We just switch out the neurons, change up STDP to approximate CD, and we’re good! Now we’ve got spiking RBMs and DBNs! And we can convert regular DBNs to spiking ones too!

{They talk about variants of RBMs too: Hopfield networks and hybrid Boltzmann machines (HBMs), which are shown to also be computationally cheaper.}

Recurrent SNNs:

{They talk about how a lot of stuff is implicitly recurrent, like winner-take-all modules, softmax, and backpropagation.}

Gated SNNs:

Traditional RNNs are trained using backpropagation. They’re cyclical → but we unroll them in time and pretend that they’re feedforward networks. But we have a problem: the weights are literally the same in every single cell, so when we backpropagate, the derivatives just multiply with themselves over and over again.

Meaning the gradients will explode or vanish → leading to instability or stopped learning. How do we get around that? With gates! LSTMs and GRUs use them, and this prevents the death of gradients. Gates control the flow of information (and are parameterized).

We can take an LSTM and turn it into a spiking LSTM. We can represent positive and negative values through 2 channels of spike trains (biologically plausible). We can also rate-code everything inside the LSTM.

There have been other innovations in LSTMs as well which we can bring into SNNs. Phased LSTMs take in multiple inputs from different time scales and process them — we can use that for the inputs of SNNs, and to interpret the outputs of multiple SNNs.

There’s also something called a time gate within memory cells → It only opens (allows updates to happen) during certain oscillations, and closes during other times. 

Liquid State Machines and Reservoirs:

An interesting phenomenon: mammals are able to scale the neocortex’s surface area enormously → from about 1 cm² in a mouse to 2500 cm² in humans — all while keeping the thickness about the same! A hypothesis for this is that mammals have found a structural module that can just be clone-stamped over and over again.

We’ve been trying to model this module → and it’s developed into the liquid state machine (LSM). It’s a sparsely connected recurrent SNN with lots of excitatory and inhibitory neurons.

This reservoir model has 3 parts:

  1. We need time-varying input streams of spikes

  2. Recurrent SNN (the reservoir / liquid). Neurons have literal physical positions, and connections are probabilistic → with probabilities decreasing with distance. We have sparse connections + lots of inhibitory neurons so we don’t get chaotic dynamics

  3. Linear readout units which can recognize instantaneous patterns in the liquid. 

It doesn’t have to be 2D — it can be 3D too, like the NeuCube, which tries to mimic the human neocortex.
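Here’s a tiny rate-based, echo-state-style sketch of the reservoir idea — a real LSM uses spiking neurons; all names, parameters, and the delay task here are my own illustration. Only the linear readout is trained; the random recurrent part is left fixed:

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny reservoir: fixed sparse random recurrent weights.
n_res, n_in = 100, 1
w_in = rng.normal(scale=0.5, size=(n_res, n_in))
w_res = rng.normal(size=(n_res, n_res))
w_res *= rng.random((n_res, n_res)) < 0.1               # sparse connectivity
w_res *= 0.9 / np.max(np.abs(np.linalg.eigvals(w_res))) # keep dynamics stable

def run_reservoir(u):
    """Drive the reservoir with the input sequence u, collect its states."""
    x = np.zeros(n_res)
    states = []
    for u_t in u:
        x = np.tanh(w_res @ x + w_in @ np.atleast_1d(u_t))
        states.append(x.copy())
    return np.array(states)

# Task: reproduce the input delayed by 3 steps (requires memory).
u = rng.uniform(-1, 1, 500)
target = np.roll(u, 3)
states = run_reservoir(u)
# Train only the linear readout, by least squares.
w_out, *_ = np.linalg.lstsq(states[10:], target[10:], rcond=None)
pred = states[10:] @ w_out
print(np.corrcoef(pred, target[10:])[0, 1])  # correlation close to 1
```

The design choice mirrors the LSM: the "liquid" does untrained, high-dimensional temporal mixing, and only the cheap linear readout learns the task.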

There are also ways of taking a reservoir approach to LSTMs — called LSNNs (long short-term memory SNNs). There are 4 modules:

  • X — Inputs of multiple streams of input spikes

  • R — Reservoir of excitatory and inhibitory neurons

  • A — Module of excitatory neurons + adaptive thresholds

  • Y — Readout neurons

We can train this guy using backpropagation through time, using pseudo-derivatives of the membrane potential. It’s able to achieve similar results to LSTMs + can learn properties that are unique (or at least were unique) to LSTMs.

We can also apply some other LSTM innovations to LSMs. We can add subtractive gating modules (known as subLSTM) as lateral inhibitory circuits. We can also add rate-coded LIF neurons, which makes it more compatible with deep learning environments.

Performance Comparisons of Contemporary Models:



Offline learning has better performance than online learning, but online methods are able to train multi-layer SNNs.

Summary:

Deep learning is super powerful, but it’s computationally expensive, and we can’t implement it on portable devices as easily. SNNs, on the other hand, are power-efficient models + are much more bio-plausible.

SNNs work by sending sparse spike trains to each other. This means there’s no derivative — but we can still use other techniques to get around that.

Spiking Neural Networks open up a whole new range of possible models and learning techniques! This paper reviews the state of the art for Spiking Neural Networks. We can build fully connected SNNs, spiking CNNs, spiking DBNs, spiking RNNs, and LSMs!

SNNs truly have insane potential! They’re incredibly power efficient + one day they’ll outperform traditional deep learning!

---

If you want to find out more: Read the paper here!

---

Thanks for reading! I’m Dickson, an 18-year-old ML enthusiast who’s excited to use it to impact billions of people 🌎

If you want to follow along on my journey, you can join my monthly newsletter, check out my website, and connect on LinkedIn or Twitter 😃


