The previous year, 2023, was clearly a standout year for advances in AI. It has long been felt that getting the most out of AI requires a strong investment in infrastructure and support, and the advent of Generative AI made this clearer than ever. Most AI technology prior to Gen AI performed reasonably well on a handful of GPUs and modest RAM. All this changed after the release of GPT-3 by OpenAI and the subsequent release of a large number of open-source models. These Large Language Models (LLMs) were large in every sense: they needed massive computational resources in the form of high-performance GPUs and large amounts of memory. The financial services sector in particular is recognized as among the top beneficiaries of this technology. The resources this sector spends on analyzing and processing data, particularly textual data, can be reduced to a large extent using LLMs. In fact, it is open-source LLMs that have found the most utility in this sector. There are multiple reasons for this:
(a) Criticality of data and its security: Much of the data in the financial sector is sensitive. It must be secured and kept from public access, since a leak can cause serious harm to the business. This makes the case for open-source or internal solutions instead of proprietary ones, particularly for critical and sensitive use cases.
(b) Customization of LLMs: Most use cases in this sector require customizing the LLM with very specific datasets, which vary from company to company, in order to provide correct responses.
It is quite evident that the applicability of open-source LLMs in the financial sector is increasing, but at the same time there are many challenges in even a basic implementation of an LLM solution. The sheer amount of resources required, in terms of both computation and memory, is costly as well as difficult to support. Take the recent milestone of the BigScience project's unveiling of BLOOM, a model with 176 billion parameters that supports 46 natural languages and 13 programming languages. While the public availability of these 100B+ parameter models has facilitated their use, the associated challenges of high memory and computational costs persist. Notably, models like OPT-175B and BLOOM-176B demand over 350 GB of accelerator memory for inference, and even more for fine-tuning. Consequently, practical use of such LLMs often requires multiple high-end GPUs or multi-node clusters, whose high cost limits accessibility for many researchers and practitioners.
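To see where a figure like "over 350 GB" comes from, here is a rough back-of-the-envelope sketch. It counts only the memory needed to hold the weights and ignores activations and caches. The helper names and the 80 GB device size are my own illustrative assumptions, not part of any library.

```python
import math

# Standard storage sizes for common numeric formats, in bytes per parameter.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gb(num_params: int, dtype: str) -> float:
    """Memory needed to hold the weights alone, in decimal gigabytes."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def min_devices(num_params: int, dtype: str, device_memory_gb: float) -> int:
    """Smallest number of devices whose combined memory fits the weights."""
    return math.ceil(weight_memory_gb(num_params, dtype) / device_memory_gb)


BLOOM_PARAMS = 176_000_000_000  # 176 billion parameters

for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gb(BLOOM_PARAMS, dtype):.0f} GB")

# Hypothetical 80 GB accelerators, chosen only for illustration.
print("devices needed in float16:", min_devices(BLOOM_PARAMS, "float16", 80.0))
```

At 2 bytes per parameter, 176 billion parameters already take 352 GB before any working memory is counted, which is why a single GPU is not enough.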
These constraints call for a completely different outlook or, as they say, thinking outside the box.
Client – Server Approach
One possible solution is a distributed computing setup for LLMs. This makes sense, since we already rely on distributed computing systems such as cloud and edge computing. Such a setup lets multiple users collaborate on inference and fine-tuning of large language models over the Internet. Participants in the distributed network can assume the role of a server, a client, or both. A server hosts a subset of model layers, typically Transformer blocks, and handles requests from clients. Clients, in turn, form a chain of pipeline-parallel consecutive servers to run inference on the entire model. Beyond inference, one can fine-tune using parameter-efficient training methods like adapters, or by training entire layers. Trained submodules can be shared on a model hub, where others can use them for inference or further training. Existing 100B+ models can be run efficiently in this collaborative setting, aided by several optimizations such as dynamic quantization, prioritizing low-latency connections, and load balancing between servers. Let us discuss this in a bit more detail.
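To make the idea of a client forming a chain of servers concrete, here is a minimal sketch. It assumes each server advertises a contiguous range of blocks, and it picks servers greedily by how far they extend the chain. The `Server` class, `build_chain` function, server names, and block ranges are all hypothetical. A real system would also weigh latency and load, which this sketch ignores.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Server:
    name: str
    start: int  # first hosted block (inclusive)
    end: int    # end of hosted range (exclusive)


def build_chain(
    servers: List[Server], num_blocks: int
) -> Optional[List[Tuple[str, int, int]]]:
    """Return (server, first_block, end_block) hops covering every block in
    order, or None if some block is not hosted by any server."""
    chain = []
    position = 0
    while position < num_blocks:
        # Servers able to continue from the current block.
        candidates = [s for s in servers if s.start <= position < s.end]
        if not candidates:
            return None
        # Greedy choice: the server that reaches furthest, giving fewer hops.
        best = max(candidates, key=lambda s: s.end)
        stop = min(best.end, num_blocks)
        chain.append((best.name, position, stop))
        position = stop
    return chain


# Illustrative swarm for a 70-block model.
swarm = [
    Server("A", 0, 30),
    Server("B", 20, 50),
    Server("C", 45, 70),
    Server("D", 25, 48),
]
print(build_chain(swarm, 70))
```

If a server in the chain goes offline, the client can call the same routine again with the remaining servers to find a replacement path.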
Design and Technical Overview
Practical applications of large language models can be broadly categorized into two main scenarios: inference and parameter-efficient adaptation to downstream tasks. I will outline the design of the distributed network, explaining how it handles both scenarios and how trained adapters can be shared among users of the system.
- Inference of Billion-Scale Models: In the token generation process, a client locally stores the model's token embeddings, typically constituting a small fraction of the total parameter count and fitting comfortably in the RAM of most modern laptops,
servers, and workstations. The client relies on servers to execute Transformer blocks, with each server hosting several consecutive blocks, the number of which is determined by the available GPU memory. Before each inference session, the client establishes
a chain of servers that collectively encompass all model layers. During the active session, the client utilizes the local embedding layer to retrieve embedding vectors for prefix tokens, transmitting these vectors to servers and receiving updated representations.
After obtaining the outputs of the final block, the client calculates next token probabilities and iterates through this process. Servers retain attention keys and values from past client inputs for subsequent inference steps, and clients store past inputs
to each server to facilitate a quick replacement if a server fails or goes offline.
- Training for Downstream Tasks: While Large Language Models (LLMs) excel on many problems with simple prompt engineering, achieving optimal results often requires training. Traditional fine-tuning strategies, which involve updating all model parameters
for the downstream task, become impractical for very large models due to extensive hardware requirements. For instance, fine-tuning BLOOM-176B would demand nearly 3 TB of GPU memory to accommodate the model, gradients, and optimizer states. To address
this challenge, the NLP community has devised parameter-efficient fine-tuning methods that preserve most pretrained model parameters. Some approaches select a subset of existing parameters, while others augment the model with additional trainable weights.
Despite lower memory requirements, these parameter-efficient approaches often compete favorably with full model fine-tuning and can outperform it in low-data scenarios.
- Distributed Fine-tuning: The fundamental idea behind fine-tuning in a distributed network is that clients own the trained parameters, while servers host the original pretrained layers. Servers can run backpropagation through their layers and return gradients with respect to activations, but they do not update server-side parameters. This allows clients to run different training tasks concurrently on the same set of servers without interfering with one another.
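The split of responsibilities in distributed fine-tuning can be illustrated with a toy sketch. Here the server holds a frozen linear layer and only returns gradients with respect to its inputs, while the client owns a small trainable scaling vector (a stand-in for an adapter) and updates it locally. Everything in it, including `FrozenServer`, `client_train_step`, and the numbers, is an assumption made for illustration and not an actual protocol.

```python
class FrozenServer:
    """Hosts a pretrained layer h_out = W @ h_in and never updates W."""

    def __init__(self, weight):
        self.weight = [row[:] for row in weight]

    def forward(self, h_in):
        return [sum(w * x for w, x in zip(row, h_in)) for row in self.weight]

    def backward(self, grad_out):
        # Gradient with respect to input activations: W^T @ grad_out.
        n_in = len(self.weight[0])
        return [
            sum(row[j] * g for row, g in zip(self.weight, grad_out))
            for j in range(n_in)
        ]


def client_train_step(server, scale, x, target, lr):
    """One step of training the client-owned 'scale' parameters."""
    h_in = [s * xi for s, xi in zip(scale, x)]      # client-side adapter
    out = server.forward(h_in)                      # remote forward pass
    loss = sum((o - t) ** 2 for o, t in zip(out, target))
    grad_out = [2 * (o - t) for o, t in zip(out, target)]
    grad_in = server.backward(grad_out)             # remote backward pass
    grad_scale = [g * xi for g, xi in zip(grad_in, x)]
    new_scale = [s - lr * g for s, g in zip(scale, grad_scale)]
    return new_scale, loss


pretrained = [[1.0, 0.5], [0.0, 2.0]]
server = FrozenServer(pretrained)
scale = [1.0, 1.0]
losses = []
for _ in range(50):
    scale, loss = client_train_step(server, scale, [1.0, 2.0], [3.0, 2.0], lr=0.01)
    losses.append(loss)

print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
```

Because the server's weights never change, a second client could train its own `scale` vector against the same server at the same time without either run affecting the other.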
Internal Structure and Optimizations
Performance considerations are paramount for distributed inference and involve three key aspects: computation speed (a 5-year-old gaming GPU vs. a new data center GPU), communication delay due to node distance (intercontinental vs. local), and bandwidth-induced communication delay (10 Mbit/s vs. 10 Gbit/s). Even consumer-grade GPUs like the GeForce RTX 3070 have enough compute to execute a complete inference step of BLOOM-176B in less than a second; the challenge lies in GPU memory constraints, which call for efficient solutions. One way to address this is to use quantization for more compact parameter storage and dynamic server prioritization for faster communication.
- Using Consumer GPUs: Assuming each server has at least 16 GB of CPU RAM and 8 GB of GPU memory, the primary objective is to minimize the model's memory footprint so that each device can accommodate more Transformer blocks. BLOOM with 176B parameters requires 352 GB of GPU memory in 16-bit precision. We can reduce this by compressing hidden states through dynamic blockwise quantization and reducing the weights to 8-bit precision using mixed matrix decomposition. This substantially reduces the number of nodes required, effectively halving latency and lowering the likelihood of failure.
- Compressing Communication Buffers: We can use dynamic blockwise quantization on hidden states before pipeline-parallel communication, halving bandwidth requirements without compromising generation quality.
- Compressing Model Weights: Using 8-bit mixed matrix decomposition for matrix multiplication reduces the memory footprint by roughly half without sacrificing quality.
- Collaborating Over the Internet: To ensure reliable inference and training despite nodes joining, leaving, or failing, we can use the hivemind library for decentralized training together with custom fault-tolerant protocols for servers and clients.
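The blockwise quantization mentioned in the list above can be sketched roughly as follows. This is a simplified linear variant that I am assuming for illustration: each block of values gets its own scale (its largest absolute value), and every value is stored as an 8-bit code. The dynamic scheme referred to in the text may use a different, non-uniform mapping of codes. The function names and the tiny block size are hypothetical.

```python
def quantize_blockwise(values, block_size=4):
    """Quantize floats to int8-range codes with one absmax scale per block."""
    codes, scales = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        # Fall back to 1.0 for an all-zero block to avoid division by zero.
        absmax = max(abs(v) for v in block) or 1.0
        scales.append(absmax)
        codes.extend(round(v / absmax * 127) for v in block)
    return codes, scales


def dequantize_blockwise(codes, scales, block_size=4):
    """Map codes back to floats using the scale of the block they came from."""
    return [c / 127 * scales[i // block_size] for i, c in enumerate(codes)]


hidden = [0.1, -0.5, 0.25, 1.0, 100.0, -50.0, 25.0, 0.0]
codes, scales = quantize_blockwise(hidden)
restored = dequantize_blockwise(codes, scales)

for original, approx in zip(hidden, restored):
    print(f"{original:>8.3f} -> {approx:>8.3f}")
```

Using a separate scale per block keeps one large outlier (such as 100.0 here) from wiping out the precision of small values in other blocks. Sending one byte per value instead of two is where the halving of bandwidth comes from, ignoring the small overhead of the per-block scales.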
Democratization and Privacy Concerns
We can take inspiration from blockchain to address a potential imbalance between peers supplying GPU resources (servers) and those using these servers for inference or fine-tuning. A system of incentives could be implemented: peers running servers could earn special points, redeemable for high-priority inference and fine-tuning or other rewards. This approach aims to encourage active participation and maintain a balanced network. An acknowledged limitation of the current approach is a potential privacy concern: peers serving the initial layers of the model might use the inputs they receive to recover input tokens. One mitigation is for users handling sensitive data to limit their clients to trusted servers or to set up their own isolated swarm. We can also explore privacy-enhancing technologies such as secure multi-party computation or privacy-preserving hardware from NVIDIA.
Conclusion
My aim with this blog is to introduce my take on distributed computing for AI, explaining why it is needed and giving a brief technical overview of one possible approach to implementing it. I am open to discussing new ideas for implementing this. Considering that AI will be applied on a massive scale in the financial sector in the coming years, we have to start thinking about how we can make optimal use of current resources before creating new ones. Another aim is to democratize access to large language models, enabling a broader range of applications, studies, and research questions that were previously challenging or cost-prohibitive.