Yes, I just heard this too. I knew the speed of technological evolution was somewhat unprecedented here in the data science world, but are you kidding me? I just got to LSTM, and now you are telling me that the effect of LSTM can be achieved with a CNN? I don't know what to say.
I am going to assume we all know what an LSTM is and why we use it, at least in a general sense. If not, please read Understanding LSTM Networks on Colah's blog or The Unreasonable Effectiveness of Recurrent Neural Networks on Andrej Karpathy's blog.
In his blog post, The Fall of RNN/LSTM, Eugenio Culurciello argues that "it is time to drop them [RNN/LSTM]."
Here's the gist of Mr. Culurciello's blog post (written in April 2018). In 2014, RNNs (and especially LSTMs) rose as the almighty neural network structure that could be applied to all kinds of tasks: sequence translation (seq2seq), neural machine translation, images2text, text2images, video captioning, and so on. Come 2015, ResNet and Attention networks began to gain recognition. By the start of the second quarter of 2018, Mr. Culurciello claimed, many leading IT companies were increasingly replacing RNNs with the likes of Attention-based networks.
The biggest limitation of RNNs was the vanishing gradient. LSTM was a solution to that problem: it introduced information gates, four gates inside each LSTM unit to update, add, forget, and predict the next value. However, this four-gate structure is complicated, meaning it is not hardware friendly, it is difficult to manage, and "they can remember sequences of 100s, not 1000s or 10,000s or more."
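To make the four-gate structure concrete, here is a minimal NumPy sketch of a single LSTM time step. The function name, weight layout, and shapes are my own illustration (gates stacked into one parameter matrix), not taken from any particular paper or library, but the gate equations are the standard ones.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step. W, U, b hold the parameters of all four
    gates stacked along the first axis: input, forget, cell, output."""
    z = W @ x + U @ h_prev + b          # the four linear layers, fused
    H = h_prev.shape[0]
    i = sigmoid(z[0:H])                 # input gate: how much to add
    f = sigmoid(z[H:2*H])               # forget gate: how much to keep
    g = np.tanh(z[2*H:3*H])             # candidate cell update
    o = sigmoid(z[3*H:4*H])             # output gate: what to expose
    c = f * c_prev + i * g              # new cell state
    h = o * np.tanh(c)                  # new hidden state
    return h, c

# Tiny example: 3-dim input, 2-dim hidden state.
rng = np.random.default_rng(0)
H, X = 2, 3
W = rng.normal(size=(4 * H, X))
U = rng.normal(size=(4 * H, H))
b = np.zeros(4 * H)
h, c = lstm_step(rng.normal(size=X), np.zeros(H), np.zeros(H), W, U, b)
print(h.shape, c.shape)
```

Note that every step needs matrix multiplies for all four gates, and each step depends on the previous `h` and `c`: this is exactly the "4 linear layers per cell ... for each sequence time-step" that the quote below complains about.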
"In addition, RNN and LSTM are difficult to train because they require memory-bandwidth-bound computation, which is the worst nightmare for hardware designer and ultimately limits the applicability of neural networks solutions. In short, LSTM require 4 linear layer per cell to run at and for each sequence time-step. Linear layers require large amounts of memory bandwidth to be computed, in fact they cannot use many compute unit often because the system has not enough memory bandwidth to feed the computational units. And it is easy to add more computational units, but hard to add more memory bandwidth (note enough lines on a chip, long wires from processors to memory, etc). As a result, RNN/LSTM and variants are not a good match for hardware acceleration." - Eugenio Culurciello
Mr. Culurciello proposes two alternatives to RNN/LSTM networks: a 2D-convolution-based neural network with causal convolutions, and Attention-based models like the Transformer.
2D convolutional based neural network
This is a single 2D convolutional neural network applied across the input and output sequences. Each layer of the network re-codes source tokens on the basis of the output sequence produced so far, giving it attention-like properties. In Pervasive Attention: 2D Convolutional Neural Networks for Sequence-to-Sequence Prediction by Elbayad, Besacier, and Verbeek, this model yielded results competitive with state-of-the-art encoder-decoder systems, while being more efficient and conceptually simpler.
Attention-based model
In the paper Attention Is All You Need (NIPS 2017), researchers proposed a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.
RNNs, and LSTMs in particular, had been firmly established as state-of-the-art approaches in sequence modeling and transduction problems such as language modeling and machine translation. Their fundamental problem is inherently sequential computation, and while numerous studies have tried to alleviate this issue, none was able to remove it.
Attention mechanisms have become an integral part of compelling sequence modeling and transduction models, allowing modeling of dependencies without regard to their distance in the input or output sequences. An attention is simply a vector, often the output of a dense layer passed through a softmax. Before attention mechanisms, translation relied on reading a complete sentence and compressing all of its information into a fixed-length vector. As you can imagine, representing a sentence of hundreds of words with one small fixed-length vector will surely lead to information loss, inadequate translation, and so on. Attention partially fixes this problem. It allows the machine translator to look over all the information the original sentence holds, then generate the proper word according to the current word it is working on and the context. It can even allow the translator to zoom in or out (focus on local or global features).
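The "attention is a vector from a softmax" idea can be sketched in a few lines of NumPy. This is the scaled dot-product attention used by the Transformer, stripped of batching, masking, and multiple heads; the toy shapes are my own choices for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Each query attends over ALL keys at once, regardless of distance;
    the softmax rows are the attention vectors described above."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # similarity of each query to every key
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights         # weighted mix of the values

# 2 queries attending over 4 source tokens, 8-dim representations.
rng = np.random.default_rng(1)
Q = rng.normal(size=(2, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape, w.sum(axis=-1))
```

Because the query can mix all value vectors in one shot, there is no fixed-length bottleneck and no sequential dependence on the previous time step.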
It is reported that "[f]or translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers." The authors tested the model on two translation tasks, and on both, the model "achieved a new state of the art." On one of the tasks, it outperformed all previously reported ensemble models.
At a recent big data conference, I had the pleasure of talking with Dr. Chiu Man Ho, a renowned AI architect at Oppo. At the time, I was investigating the possibility of adopting LSTM for my team's video recommendation project. Dr. Ho suggested that there are better models (in terms of ease of use and of being manageable to operate properly) and referred me to a couple of papers to check out.
Faster RNN: Simple Recurrent Unit (SRU)
In the paper Simple Recurrent Units for Highly Parallelizable Recurrence, the Simple Recurrent Unit (SRU), a light recurrent unit that balances model capacity and scalability, comes to the rescue for the poor scalability of common recurrent architectures, which stems from the intrinsic difficulty of parallelizing their state computations. According to the paper, SRU achieves 5-9x faster training than LSTM on classification and question answering datasets, and delivers stronger results than LSTM and convolutional models. SRU achieves this high parallelization by simplifying the hidden-to-hidden dependency. "This simplification is likely to reduce the representational power of a single layer and hence should be balanced to avoid performance loss."
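A rough NumPy sketch of that simplification, under my reading of the paper: the gates depend only on the current input, never on the previous hidden state, so every matrix multiply can be done for all time steps at once, leaving only cheap element-wise work inside the sequential loop. Names, shapes, and the exact gating below are illustrative, not the paper's reference implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sru_layer(X, W, Wf, bf, Wr, br):
    """Simplified SRU sketch. The three matrix multiplies use only the
    inputs X, so they run for every time step in parallel; the loop that
    remains is purely element-wise."""
    XW = X @ W.T                  # candidate states, all steps at once
    F = sigmoid(X @ Wf.T + bf)    # forget gates, all steps at once
    R = sigmoid(X @ Wr.T + br)    # reset/highway gates, all steps at once
    T, H = XW.shape
    c = np.zeros(H)
    out = np.empty((T, H))
    for t in range(T):            # only element-wise ops are sequential
        c = F[t] * c + (1 - F[t]) * XW[t]
        out[t] = R[t] * np.tanh(c) + (1 - R[t]) * X[t]  # highway link
    return out

rng = np.random.default_rng(2)
T, D = 5, 4   # sequence length; hidden size = input size for the highway
X = rng.normal(size=(T, D))
W, Wf, Wr = (rng.normal(size=(D, D)) for _ in range(3))
out = sru_layer(X, W, Wf, np.zeros(D), Wr, np.zeros(D))
print(out.shape)
```

Contrast this with the LSTM step earlier, where `h_prev` feeds back into all four gates and forces the matrix multiplies themselves to run one step at a time.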
Temporal Convolutional Network (TCN)
In the paper An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling, researchers presented empirical evidence that, for certain tasks, a simple convolutional architecture (TCN: Temporal Convolutional Network) outperforms canonical recurrent networks such as LSTMs.
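The building block of a TCN is the causal, dilated 1-D convolution: the output at time t may only see inputs at t and earlier. A minimal sketch, with a hand-written convolution rather than any deep learning library:

```python
import numpy as np

def causal_conv1d(x, kernel, dilation=1):
    """Causal 1-D convolution: output at time t sees only x[t], x[t-d],
    x[t-2d], ...; left zero-padding keeps output length equal to input."""
    k = len(kernel)
    pad = (k - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])
    return np.array([
        sum(kernel[i] * xp[t + pad - i * dilation] for i in range(k))
        for t in range(len(x))
    ])

# With kernel [1, 1] and dilation 2, each output is x[t] + x[t-2].
# Stacking layers with dilations 1, 2, 4, ... grows the receptive field
# exponentially, which is how a TCN covers long sequences in parallel.
x = np.arange(8, dtype=float)
y = causal_conv1d(x, kernel=[1.0, 1.0], dilation=2)
print(y)  # [0. 1. 2. 4. 6. 8. 10. 12.]
```

Unlike a recurrent unit, every output here is computed independently, so the whole sequence can be processed in one parallel pass.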
Sequenced-replacement sampling (SRS)
This is a paper by Dr. Ho: Sequenced-replacement Sampling For Deep Learning. In it, Dr. Ho and fellow researchers suggest that a relatively smaller mini-batch can outperform a larger one in many applications, contrary to the common notion that a smaller mini-batch yields a noisier gradient approximation and hence a less accurate model. The paper argues that "[t]he reason behind this counter-intuitive phenomenon is that a relatively smaller mini-batch induces more stochasticity and hence exploration during the training process. It is true that a larger mini-batch may lead to faster convergence, but a relatively smaller mini-batch size could lead to a better accuracy." Using this approach, the team held (as of December 2018) the #1 accuracy on the CIFAR-100 dataset.
These studies are exciting, and I am a bit relieved that LSTM can be replaced by simpler, more manageable models. After all, if I want to productize a model, it cannot be too difficult to manage and optimize. In my team, we are using a feed-forward model to solve a classification problem. I'd like to see whether applying a CNN or TCN can improve our work.
Understanding LSTM Networks by Colah's blog
Pervasive Attention: 2D Convolutional Neural Networks for Sequence-to-Sequence Prediction by Elbayad, Besacier, and Verbeek
Basic encoder-decoder architecture by StackExchange
Encoder-Decoder Recurrent Neural Network Models for Neural Machine Translation by Machine Learning Mastery
Memory, attention, sequences by Eugenio Culurciello
A Brief Overview of Attention Mechanism by Synced