Natural Language Processing Final Part - NLP of today, tomorrow and beyond!
Introduction
Congratulations on getting this far! Give yourself a pat on the back.
In the last few posts, we discussed various concepts in NLP, starting from term frequency, document frequency, latent semantic analysis and word embeddings, and we also learnt about encoder-decoder neural network architectures. This post differs from them by giving you a bird’s eye view of what is happening in NLP research today, what might become available very soon based on current findings, and, lastly, a few open problems that some of you might be interested in solving.
If you haven’t already, do subscribe below so you don’t miss future posts that dive into the magic of statistics, AI, and beyond.
Some new research directions
Titans - Learning to Memorize at Test Time
This paper, from December 2024, introduces the concept of an external memory that language models can use to solve complex problems. The key technology behind current language models, the transformer, suffers from one bottleneck: it has quadratic memory complexity, since it stores all the context states and their pairwise similarities. So, if you give an input of 10,000 tokens, most language models need storage on the order of 10,000 squared, i.e., 100 million entries, a humongous number. To solve this problem, the paper proposes an external memory system built on 3 components:
A long-term memory module: This is an independent deep learning model that stores information about the context in its parameters. However, instead of the usual training procedure using gradient descent, it uses the following rule to update its parameters:
\(\begin{align*} M_t & = (1 - \alpha_t) M_{t-1} + S_t\\ S_t & = \eta_t S_{t-1} - \theta_t \nabla l(M_{t-1}; x_t) \end{align*}\)

Here, \(\alpha_t\) is a parameter that controls forgetfulness, and \(S_t\) is the “amount of surprise”, measured through the gradient of the loss on the new input \(x_t\). The more surprising a word is, the more it affects the memory and gets stored. This is in line with how our own memory works: you may not remember what you had for breakfast two days ago, but you can very easily remember the first time you asked your crush out on a date. (A toy sketch of this update rule appears after the three components.)
A persistent memory: While the long-term memory depends only on the context, we also require a “memory” that is more personalised to the task or to the user. Think of this: chatting with a friend at college and chatting with your professor do not follow the same approach; based on who you are interacting with, your tone of speech changes. To imitate that, the paper suggests “training” a set of instruction tokens that can be prepended to the actual input as a prompt. Therefore, instead of sending “x” to the model, you send “p1, p2, …, pn, x”, and these p1, p2, …, pn are adjusted according to the personality of the user and the current task at hand.
The core short-term memory: This consists of the usual transformer-based system, but the attention mechanism is restricted to the past K tokens, where K is a fixed number (say 100). Then, even if you give an input of 10,000 tokens, each token only needs to compute similarities with the previous 100 tokens, resulting in a memory requirement of about 1 million units. In particular, this memory requirement is only linear in the context size and does not grow quadratically.
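To make the surprise-based update rule above concrete, here is a minimal sketch in plain Python with NumPy. It is not the paper’s implementation; the toy quadratic loss, the vector-valued “memory” and all the constants are made up purely for illustration.

```python
# Toy sketch of the Titans-style long-term memory update:
#   S_t = eta * S_{t-1} - theta * grad(loss(M_{t-1}; x_t))
#   M_t = (1 - alpha) * M_{t-1} + S_t
# The "memory" here is just a vector and the loss is a toy reconstruction loss;
# in the paper, M is a full neural network and the gradient is w.r.t. its weights.
import numpy as np

alpha, eta, theta = 0.01, 0.9, 0.1   # forgetting, momentum and step-size knobs (illustrative)

def surprise_gradient(memory, token):
    # Gradient of the toy loss ||memory - token||^2 / 2 with respect to the memory.
    return memory - token

rng = np.random.default_rng(0)
memory = np.zeros(8)                 # M_0
surprise = np.zeros(8)               # S_0

for t in range(100):
    token = rng.normal(size=8)
    if t == 50:
        token += 20.0                # a very "surprising" input at t = 50
    surprise = eta * surprise - theta * surprise_gradient(memory, token)
    memory = (1 - alpha) * memory + surprise
    if t in (49, 50, 51):
        print(t, np.linalg.norm(surprise))   # the surprise spikes around t = 50
```

The surprising input at step 50 produces a large gradient, so it gets written into the memory much more strongly than the routine inputs around it, which is exactly the intuition described above.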
While some ideas from this research have been implemented in various large language models that we currently use, further research is being done to make language models more personalised and user-friendly.
World Models - Using NLP Techniques Beyond Text
There is a group of researchers who believe that an intelligent model should be “multimodal”, i.e., it should not only use text input and text output to communicate, but should exhibit intelligence across various types of inputs and outputs. For example, when we try to solve a task (say learning to drive), we follow these steps:
We use different sensory organs (e.g. eyes, ears) to learn about the world, or the environment in which we operate. Using all of these, we create a heuristic and simplified model of the “world”.
Based on the current model of the world, and also the history of how the world model changes over time, we try to predict what will happen in the near future. For example, given that you see a pedestrian on the side of the road trying to cross, you know that he/she will be in front of your car in the next two seconds.
Based on this predicted world model, we can then aim to find a solution to the specific problem that we want to solve. In the case of driving, we will use that model to learn to brake.
These concepts can be carried over into the domain of artificial intelligence using a range of techniques called World Models. The following picture from the website nicely summarises the idea.
At each time step, we use vision, audio, or even text models to encode the current observation into a vector (think of this as an analogue of the embedding step). Then, these embeddings are processed by an RNN-type or other sequential network (including transformers) to predict future states (analogous to predicting the next token). Finally, a small controller model uses these predicted representations to figure out what to do next.
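Here is a minimal sketch of that encoder → sequence model → controller pipeline, assuming PyTorch. All module names, layer sizes and the random inputs are hypothetical, chosen only to make the three-stage structure visible.

```python
# A minimal sketch of the world-model pipeline (encoder -> sequence model -> controller).
# All module sizes and names here are illustrative, not taken from any specific paper.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps a raw observation (e.g. a flattened image) to a compact embedding."""
    def __init__(self, obs_dim=64, latent_dim=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, latent_dim))
    def forward(self, obs):
        return self.net(obs)

class WorldModel(nn.Module):
    """Sequential model that predicts the next latent state from the history."""
    def __init__(self, latent_dim=16, hidden_dim=32):
        super().__init__()
        self.rnn = nn.GRU(latent_dim, hidden_dim, batch_first=True)
        self.predict = nn.Linear(hidden_dim, latent_dim)
    def forward(self, latents):
        h, _ = self.rnn(latents)        # (batch, time, hidden)
        return self.predict(h[:, -1])   # predicted next latent state

class Controller(nn.Module):
    """Small policy that maps the predicted latent state to an action."""
    def __init__(self, latent_dim=16, n_actions=3):
        super().__init__()
        self.policy = nn.Linear(latent_dim, n_actions)
    def forward(self, latent):
        return torch.argmax(self.policy(latent), dim=-1)

# One step of the loop: observe -> embed -> predict -> act.
encoder, world_model, controller = Encoder(), WorldModel(), Controller()
observations = torch.randn(1, 10, 64)        # 10 past observations
latents = encoder(observations)              # embed each observation
predicted_next = world_model(latents)        # predict the next latent state
action = controller(predicted_next)          # choose an action
print(action)
```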
Many researchers believe that there is immense potential for research in this direction. There are lots of open problems, and many ideas from the NLP literature translate directly into building such world models, which many consider the most promising route towards near-AGI systems.
Evaluating Language Models
All these language models are trained with the objective of minimising a loss function, or making their output sound more human-like. However, we have very few ways to evaluate their responses. With the proliferation of language models and associated tools, many teams are pouring resources into building chatbots and rapidly delivering AI-augmented products to their customers; however, very few have invested in automated testing mechanisms to verify the performance of such chatbots. It is one of the most under-discussed yet crucial areas of LLM research.
The most popular and the oldest techniques are:
Bilingual Evaluation Understudy (BLEU)
Recall-Oriented Understudy for Gisting Evaluation (ROUGE)
Both these metrics take a machine-generated response and a human-written reference response, and give a score proportional to the number of matching n-grams between the two.
A problem with both is that the language model may use synonyms, which will not match the human reference, resulting in a poor BLEU or ROUGE score. A way to solve this is to pass both the human response and the machine response through an independent encoder model, which provides a latent representation of the meaning of each sentence; the two representations are then compared using vector similarity. This is the basic idea behind the metric BLEURT (BLEU with Representations from Transformers).
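To make the contrast concrete, here is a small, self-contained sketch: a toy n-gram overlap score (the idea behind BLEU/ROUGE) next to the embedding-based idea behind BLEURT. The embed() call at the end is a placeholder for a pretrained encoder, not a real function.

```python
# A toy sketch of n-gram overlap scoring (the idea behind BLEU/ROUGE) and of
# embedding-based matching (the idea behind BLEURT). Not the official metrics.
from collections import Counter
import math

def ngram_overlap(candidate, reference, n=2):
    """Fraction of candidate n-grams that also appear in the reference."""
    cand_tokens, ref_tokens = candidate.split(), reference.split()
    cand_ngrams = Counter(zip(*[cand_tokens[i:] for i in range(n)]))
    ref_ngrams = Counter(zip(*[ref_tokens[i:] for i in range(n)]))
    matched = sum(min(count, ref_ngrams[g]) for g, count in cand_ngrams.items())
    total = max(sum(cand_ngrams.values()), 1)
    return matched / total

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

reference = "the cat sat on the mat"
candidate = "a cat was sitting on the mat"
print(ngram_overlap(candidate, reference))   # low score despite a very similar meaning

# With an encoder (a hypothetical embed() returning one vector per sentence),
# paraphrases land close together, so the similarity stays high:
# score = cosine(embed(candidate), embed(reference))
```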
In October 2023, a considerably large paper (162 pages, 50 authors) was published, introducing the Holistic Evaluation of Language Models (HELM). This framework (or tool) allows your language model to be tested across multiple benchmark scenarios (e.g., RAFT, IMDB, Natural Questions) along multiple metrics such as accuracy, robustness, fairness, bias, toxicity, etc. Let me describe how each of these metrics is measured.
Accuracy refers to exact-match accuracy for text classification, and to ROUGE scores for text summarisation and question answering.
For robustness, HELM checks whether the model’s answer stays the same even if the input prompt contains typos, synonyms, etc. Suppose you give an input prompt X and the LLM replies with Y. Now, if you change the prompt to X’ by swapping a few words for synonyms and adding some typos, the LLM should still output Y, but instead it outputs Y’. The robustness score is then the proportion of matched tokens between Y and Y’.
To measure fairness, you can apply the same technique, but look at how the output changes when you swap the gender of the pronouns (i.e., “she did this” instead of “he did this”) or change the proper nouns (e.g. a person’s name) to reflect a different racial background (i.e., “Matsumoto scored a home run” instead of “Michael scored a home run”).
You can also measure the bias of the model by asking it to complete sentences like “The doctor went in the office and …”. If you ask the model to complete this sentence 10 times, maybe 8 out of 10 times it will use the pronoun “he” to refer to the doctor, and the remaining 2 times it will use “she”. Hence, the gender-association bias of the model is (0.8 - 0.2) = 0.6. HELM tests this across multiple scenarios and reports an average score of this stereotypical bias (a toy version of these calculations is sketched after this list).
The toxicity (i.e., hateful and harmful speech) of the model output is measured by using another machine learning model to classify completions as toxic or non-toxic. HELM internally uses the Perspective API for this classification.
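As promised, here is a toy Python version of two of the measurements above: the token-overlap robustness score and the gender-association bias score. These are simplified sketches of the ideas, not HELM’s actual implementation.

```python
# Toy versions of two HELM-style measurements described above (simplified sketches).

def robustness_score(original_answer: str, perturbed_answer: str) -> float:
    """Proportion of tokens in the original answer that survive a perturbed prompt."""
    original = original_answer.split()
    perturbed = set(perturbed_answer.split())
    matched = sum(1 for token in original if token in perturbed)
    return matched / max(len(original), 1)

def gender_association_bias(completions: list[str]) -> float:
    """Difference between the rates of male and female pronouns in the completions."""
    male = sum(1 for c in completions if " he " in f" {c.lower()} ")
    female = sum(1 for c in completions if " she " in f" {c.lower()} ")
    return abs(male - female) / max(len(completions), 1)

# Example: 8 of 10 completions use "he", 2 use "she" -> bias of (0.8 - 0.2) = 0.6.
completions = ["then he wrote the prescription"] * 8 + ["then she wrote the prescription"] * 2
print(gender_association_bias(completions))   # 0.6

print(robustness_score("Paris is the capital of France",
                       "Paris is France's capital"))  # partial token overlap -> 0.5
```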
More examples are available in their GitHub repository. You can install it as a software package, and it also ships a nice web UI that you can use to test and evaluate any large language model.
Although researchers have come up with various metrics to evaluate the performance of a language model, there are still many limitations. This is currently an active research area with several open problems:
Metrics and evaluations on benchmark datasets are often a poor depiction of a model’s real-world performance.
Many language models are exhibiting “emergent abilities”, and we do not yet know how to test for them.
The available evals frameworks (including OpenAI’s evals) are not very scalable. Also, while LLMs are getting cheaper and more widely accessible, running evaluations requires thousands of LLM invocations, which drives up the cost of evaluation drastically.
Another body of work uses the conformal prediction literature to treat LLMs as black boxes and provide statistical guarantees about their outputs. These are not complete evals, but rather a way to know when an LLM’s output is not very trustworthy; more details here.
Measuring Sustainability and Energy Efficiency
An adult human needs about 2,500 kcal of energy per day, or about 105 kcal per hour. If you convert that into watt-hours, we get about 122 Wh. With just these 122 Wh of energy per hour, human bodies do so many things: they sustain the organs, we walk to different places, we work through mathematical problems, remember history lessons, gossip with friends and colleagues, and so on. For comparison, that is similar to charging a laptop for an hour, or charging two mobile phones with the latest 60 W fast chargers.
Compared to that, training a large language model uses thousands of megawatt-hours of energy. 1 MWh equals the hourly energy consumption of approximately 8,200 humans, and 1,000 MWh is, well, about 8 million human-hours. So it turns out that
Humans are extremely energy efficient themselves, but human-made machines are not so much!
Even when you are not training a model but only running inference, it still uses multiple GPUs, for example the Nvidia A100, which has a power rating of 400 W. Each such GPU has 80 GB of memory, so to serve a DeepSeek R1 model (with roughly 670B parameters) you would need about 18-20 such GPUs; see the calculation here. So, in total, when you ask DeepSeek a question, the hardware draws roughly as much power as (4 × 20) = 80 people, since each 400 W GPU uses about as much power as 3-4 humans. Here is another article that talks about this comparison in more detail.
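The arithmetic behind these comparisons is easy to check; here is a quick back-of-the-envelope sketch. All constants are rounded, and the 20-GPU figure is just the rough estimate quoted above.

```python
# Back-of-the-envelope check of the energy comparisons in this section.
KCAL_TO_WH = 1.163                       # 1 kcal = 1.163 watt-hours

human_daily_kcal = 2500
human_wh_per_hour = human_daily_kcal * KCAL_TO_WH / 24
print(f"Human power budget: {human_wh_per_hour:.0f} Wh per hour")    # ~121 Wh, i.e. the ~122 Wh above

# 1 MWh expressed in "human-hours" of energy
human_hours_per_mwh = 1_000_000 / human_wh_per_hour
print(f"1 MWh ~= {human_hours_per_mwh:,.0f} human-hours")            # ~8,200

# Serving a large model: ~20 GPUs at 400 W each vs. humans at ~121 W
gpus, gpu_watts = 20, 400
equivalent_humans = gpus * gpu_watts / human_wh_per_hour
print(f"Inference rig ~= {equivalent_humans:.0f} humans' power draw")  # ~66; rounding each GPU up
                                                                        # to 4 humans gives the ~80 above
```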
The more energy these models require, the larger their carbon footprint, because a significant chunk of the energy is still generated through traditional means (e.g. coal and oil). Here is an awesome package called CodeCarbon, which you can use to get a nice summary of the carbon footprint of the machine learning or deep learning model you are training.
Now, there are several directions of research on how to tackle this problem.
The most prominent way is distillation, where a smaller model is trained to replicate the outputs of a much larger language model. This means we can achieve almost the same level of performance while using only a fraction of the energy. However, it is quite limiting: a bigger model with the emergent abilities must exist first, and we need to run it to produce the training data. (A toy sketch of a distillation loss appears after these approaches.)
Beyond distillation, another technique is to design more energy-efficient neural network architectures. Instead of activating all neurons and layers for every input, architectures like Mixture of Experts (MoE), the Switch Transformer, sparse transformers, etc. activate only a subset.
A more mathematical way to tackle the problem is to understand the behaviour of these complex neural networks theoretically, so that one can predict beforehand how many training samples are needed, which hyperparameters should be used, how many neurons are needed in each layer, and so on. One such technique is tensor programs, developed by Greg Yang; another is the neural scaling laws of Kaplan et al.
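As referenced above, here is a minimal sketch of a distillation loss, assuming PyTorch. The temperature, the loss weighting and the toy random inputs are arbitrary illustrative choices, not values from any particular paper.

```python
# Minimal sketch of knowledge distillation: the student is trained to match the
# teacher's softened output distribution plus the usual hard labels.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    # Soft targets: KL divergence between softened teacher and student distributions.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd_loss = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature**2
    # Hard targets: cross-entropy on the true labels.
    ce_loss = F.cross_entropy(student_logits, labels)
    return alpha * kd_loss + (1 - alpha) * ce_loss

# Toy usage with random logits for a batch of 4 examples and a 10-class output.
student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
print(loss.item())
```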
In 1961, physicist Rolf Landauer proved that erasing a single bit must consume at least about 3×10⁻²¹ joules of energy at room temperature. In contrast, today’s GPUs consume about a billion times more per operation. So while we have a fundamental floor (there is no free computation, like no free lunch), we can still go a long way in improving efficiency, and it is a very meaningful direction of research to ensure the sustainability of this new age of artificial intelligence.
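For the curious, that floor is Landauer’s bound of \(k_B T \ln 2\) joules per erased bit; plugging in room temperature recovers the number above:

\(k_B T \ln 2 \approx (1.38 \times 10^{-23}\,\text{J/K}) \times (300\,\text{K}) \times 0.693 \approx 2.9 \times 10^{-21}\,\text{J}\)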
Conclusion
As we reach the final post of this NLP series, it’s worth stepping back and appreciating the journey we’ve taken — from simple word counts and vector spaces to the inner workings of modern language models, decoding strategies, evaluation frameworks, and sustainability concerns.
Natural Language Processing has evolved from being a subfield of computational linguistics to a cornerstone of modern AI, powering everything from chatbots to scientific discovery. But as we’ve seen, the field is far from saturated. Whether it’s building memory-efficient models, designing fairer and more robust evaluations, or simply trying to reduce the carbon footprint of every generated word, the frontiers of NLP are expanding faster than ever.
Yet, amidst this technological bloom, we must remember that language is fundamentally a human construct — messy, rich, ambiguous, and deeply contextual. Our models, however impressive, still have a long way to go in truly grasping this complexity. And perhaps that's a good thing — it means there’s room for all of us to contribute: as researchers, engineers, critics, ethicists, or simply curious learners.
If this series sparked even a bit of that curiosity in you — to ask deeper questions, to read a new paper, or to debug a training run — then it has served its purpose. Thank you for being part of this journey.
If you enjoyed this series, I’d love to hear your thoughts — feel free to leave a comment, share it with a friend, or let me know which topic you'd like me to explore next.
And if you haven’t already, do subscribe below so you don’t miss future posts that dive into the magic of statistics, AI, and beyond.
Until next time!