Understanding Mixture of Experts

Details on Mixture of Experts and how to run it.
sagemaker
transformers
NLP
MoE
Published

May 19, 2024

Understanding Mixture of Experts

Pre-requisites

Basic knowledge of Python.

Access to Amazon SageMaker JumpStart.

What is Mixture of Experts?

The Mixture of Experts (MoE) idea has been around for decades; around 2010 it was explored, for example, in SVMs and Gaussian Processes (ref. Learning Factored Representations in a Deep Mixture of Experts). More recently, sparsity was introduced, i.e. running only parts of the whole neural network for each input, first in LSTMs and later in Transformers (ref. Switch Transformers).

The general idea of MoE is to replicate certain model components many times while routing each input only to a small subset of those replicas (a.k.a. experts). MoEs achieve faster inference for the same model quality at the expense of significantly higher memory cost, as all replicated components (a.k.a. parameters) need to be loaded in memory.

A Mixture of Experts (MoE) consists of two main elements:

Sparse MoE layer

Instead of using a dense feed-forward network (FFN), an MoE uses sparse MoE layers made up of a number of “experts”. As shown in the picture below, each expert is itself a neural network (in practice, typically an FFN).

Router or gate network

In an MoE, the router determines which tokens are sent to which experts. The router is composed of learned parameters and is pre-trained at the same time as the rest of the network.

[Figure: a sparse MoE layer, with a router dispatching each token to a subset of expert networks]
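To make the router concrete, here is a minimal, hypothetical top-2 gating sketch in PyTorch. All dimensions and names are illustrative and are not Mixtral's actual implementation:

import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, num_experts, top_k = 16, 8, 2          # toy sizes for illustration

# The router is a learned linear layer, trained together with the rest of the network.
router = nn.Linear(d_model, num_experts, bias=False)

tokens = torch.randn(4, d_model)                # a batch of 4 token representations
logits = router(tokens)                         # (4, num_experts) routing scores
weights, expert_ids = torch.topk(F.softmax(logits, dim=-1), k=top_k, dim=-1)
weights = weights / weights.sum(dim=-1, keepdim=True)   # renormalize over the chosen experts

print(expert_ids)   # which 2 experts each token is routed to
print(weights)      # how much each chosen expert contributes to the token's output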

MoE Benefits:

The conditional computation in MoE, where only parts of the network are active on a per-token basis, allows us to scale the size of the model without increasing the computation per token.

MoE Challenges:

As described in the Hugging Face post Mixture of Experts Explained, MoE comes with some challenges:

  • Training: MoEs enable significantly more compute-efficient pretraining, but they’ve historically struggled to generalize during fine-tuning, leading to overfitting.

  • High memory requirement at inference: Although an MoE might have many parameters, only some of them are used during inference. This leads to faster inference compared to a dense model with the same number of parameters. However, ALL parameters need to be loaded in RAM, so memory requirements are high. This is a disadvantage for MoE on edge devices, where memory is restricted.
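As a rough, back-of-the-envelope illustration of the memory point (the total parameter count is Mixtral 8x7B's published figure; the bytes-per-parameter values are the usual ones for each precision):

# Rough GPU memory needed just to hold the weights (ignores activations and the KV cache).
total_params = 46.7e9   # Mixtral 8x7B total parameters

for name, bytes_per_param in [('fp16/bf16', 2.0), ('int8', 1.0), ('4-bit nf4', 0.5)]:
    print(f"{name}: ~{total_params * bytes_per_param / 1024**3:.0f} GiB")

# fp16/bf16: ~87 GiB  -> needs a multi-GPU instance such as ml.g5.48xlarge (192 GB)
# int8:      ~43 GiB
# 4-bit nf4: ~22 GiB  -> fits on ml.g5.12xlarge (4 x A10G, 96 GB), as used later in this post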

MoE - Future Work:

QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models proposed a new compression framework called QMoE, which uses quantization to compress trillion-parameter MoEs to less than 1 bit per parameter. Basically, quantization converts the parameters (a.k.a. model weights) to a lower numerical precision (e.g. going from 16 bits, half precision, to 4 bits per weight).
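As a generic illustration of what quantization does (plain symmetric round-to-nearest 4-bit quantization on a toy tensor, not the actual QMoE scheme):

import torch

w = torch.randn(4096)   # toy weight vector (think: one row of an expert's FFN)

# Symmetric 4-bit quantization: integers in [-7, 7] plus a single scale per tensor.
scale = w.abs().max() / 7
q = torch.clamp(torch.round(w / scale), -7, 7).to(torch.int8)   # would be packed into 4 bits on disk

w_hat = q.float() * scale   # dequantized weights used at inference time
print("max abs error:", (w - w_hat).abs().max().item())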

Mixture of Experts Myths

Myth 1: There are 8 experts in Mixtral 8x7B

Every transformer layer has its own set of 8 experts, and routing is done independently at each layer. So instead of 8 experts overall, what we have is 256 independent experts in total across the layers (32 x 8).

Myth 2: There are 56B parameters in Mixtral 8x7B

In reality there are not 56B (8x7B) parameters but 46.7B, as the gating and attention layers are shared among the experts. Thus each token sees 12.9B active parameters instead of 14B (2x7B).
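These figures can be sanity-checked with a back-of-the-envelope calculation from Mixtral's published configuration (hidden size 4096, FFN size 14336, 32 layers, 8 experts with 2 active per token, 32k vocabulary). The sketch below is approximate and ignores small terms such as layer norms:

d_model, d_ff, n_layers = 4096, 14336, 32
n_experts, top_k, vocab = 8, 2, 32000

expert = 3 * d_model * d_ff                               # SwiGLU FFN: gate, up and down projections
attention = 2 * d_model * d_model + 2 * d_model * 1024    # q/o projections + smaller k/v (grouped-query attention)
router = d_model * n_experts
shared = n_layers * (attention + router) + 2 * vocab * d_model   # attention, routers, embeddings, LM head

total = shared + n_layers * n_experts * expert    # every expert in every layer
active = shared + n_layers * top_k * expert       # only 2 experts per layer per token
print(f"total  ~{total / 1e9:.1f}B")              # ~46.7B, not 8 x 7B = 56B
print(f"active ~{active / 1e9:.1f}B")             # ~12.9B, not 2 x 7B = 14B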

Myth 3: Cost and amount of active parameters are proportional

Mixtral 8x7B has fewer active parameters than Llama 2 13B. But expert routing in MoE introduces a higher communication cost, as tokens need to be sent to different experts. Thus the cost and the amount of active parameters are NOT proportional in MoE.

Note that in an MoE you cannot control which token is sent to which expert. Thus, while MoE gains in performance/cost, the absolute cost is not proportional to the number of active parameters.

How to implement MoE in PyTorch

To be defined in more detail. In the meantime, the minimal sketch below illustrates the core idea.
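The sketch is an illustrative, naive loop-over-experts implementation of a sparse MoE layer with top-2 routing. All sizes and names are hypothetical, and this is not how Mixtral or any optimized library dispatches tokens:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Replaces a dense FFN: each token is processed by only top_k of the experts."""
    def __init__(self, d_model=16, d_ff=64, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                                   # x: (num_tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)
        weights, expert_ids = torch.topk(probs, self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)

        out = torch.zeros_like(x)
        # Naive dispatch: for each expert, gather the tokens routed to it.
        for e, expert in enumerate(self.experts):
            token_idx, slot = torch.where(expert_ids == e)
            if token_idx.numel() == 0:
                continue
            out[token_idx] += weights[token_idx, slot].unsqueeze(-1) * expert(x[token_idx])
        return out

layer = SparseMoELayer()
print(layer(torch.randn(10, 16)).shape)                     # torch.Size([10, 16])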

Deploy Mixtral 8x7B Instruct MoE using SageMaker JumpStart

This notebook is inspired by the Amazon SageMaker JumpStart notebooks, which use the SageMaker Python SDK to deploy the Mixtral 8x7B text generation model.

The natural choice would be ml.g5.48xlarge, which contains 8 x NVIDIA A10G GPUs with a total of 192 GB of GPU memory; due to quota restrictions we instead deploy on ml.g5.12xlarge (4 x NVIDIA A10G, 96 GB) and quantize the model weights to 4 bits so that they fit.

import sagemaker

print(sagemaker.__version__) # 2.214.3
sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml
2.221.0
#!pip install --upgrade sagemaker #2.221.0
from sagemaker.jumpstart.model import JumpStartModel
model_id = "huggingface-llm-mixtral-8x7b-instruct"  # alternatively: "huggingface-llm-mistral-7b-instruct"
accept_eula = True

Deploying the model

We deploy Mixtral 8x7B using Amazon SageMaker JumpStart. Amazon SageMaker JumpStart is a machine learning (ML) hub with foundation models (FMs), built-in algorithms, and pre-built ML solutions that you can deploy with just a few clicks.

For further information, see the AWS ML Blog post Mixtral-8x7B is now available in Amazon SageMaker JumpStart.

For a complete list of all pre-trained models available in Amazon SageMaker JumpStart, please check: https://sagemaker.readthedocs.io/en/stable/doc_utils/pretrainedmodels.html
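You can also browse the catalog programmatically. Recent versions of the SageMaker Python SDK expose a listing helper; the substring filtering below is just a convenience:

from sagemaker.jumpstart.notebook_utils import list_jumpstart_models

# List every JumpStart model id and keep the Mistral / Mixtral ones.
all_models = list_jumpstart_models()
print([m for m in all_models if "mistral" in m or "mixtral" in m])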

We make use of the Amazon SageMaker JumpStart JumpStartModel class to deploy the model. You can also use Amazon SageMaker JumpStart to fine-tune a foundation model by using the JumpStartEstimator class.

import json
number_of_gpu = 4

config = {
    'HF_API_TOKEN': "XX",                      # your Hugging Face access token (redacted here)
    'SM_NUM_GPUS': json.dumps(number_of_gpu),  # shard the model across the instance's 4 GPUs
    'HF_MODEL_QUANTIZE': "bitsandbytes-nf4",   # 4-bit NF4 quantization so the weights fit in GPU memory
}
model = JumpStartModel(model_id=model_id, env=config)

# By default SageMaker expects an ml.p4d.24xlarge instance (NVIDIA A100 - 8 GPUs and 320 GB memory).
# Due to quota restrictions we decided to use ml.g5.12xlarge (NVIDIA A10G - 4 GPUs and 96 GB memory).
predictor = model.deploy(
    accept_eula=accept_eula,
    instance_type='ml.g5.12xlarge',
    container_startup_health_check_timeout=2000)  # ~33 minutes for the container to load the model
Using model 'huggingface-llm-mixtral-8x7b-instruct' with wildcard version identifier '*'. You can pin to version '1.4.0' for more stable results. Note that models may have different input/output signatures after a major version upgrade.
----------!

Invoke the endpoint

With the endpoint deployed we can now run inference. We will use the predict method of the predictor to run inference against our endpoint. We can call the model with different parameters to influence the text generation. For the list of parameters available for the model, check the Philschmid blog.

mistralai/Mixtral-8x7B-Instruct-v0.1 is a conversational chat model, meaning we can chat with it using the following prompt format:

<s> [INST] User Instruction 1 [/INST] Model answer 1</s> [INST] User instruction 2 [/INST]
prompt = '<s> [INST] Simply put, the theory of relativity states that [/INST]'

payload = {
    'inputs': prompt,
    'parameters': {
        'max_new_tokens':64,
        'top_p':0.9, 
        'temperature': 0.6,
        'stop': ['</s>']
    }
}
predictor.predict(payload)
[{'generated_text': '<s> [INST] Simply put, the theory of relativity states that [/INST] The theory of relativity, developed by Albert Einstein, is actually composed of two parts: the special theory of relativity and the general theory of relativity.\n\nThe special theory of relativity, proposed in 1905, states that the laws of physics are the same for all observers moving at'}]
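For multi-turn conversations you can build the same [INST] ... [/INST] prompt format programmatically. The helper below is a small, hypothetical convenience function (not part of the SageMaker SDK or the model package):

def build_mixtral_prompt(turns):
    """turns: list of (user_instruction, model_answer) pairs; the last answer may be None."""
    prompt = '<s>'
    for user, answer in turns:
        prompt += f' [INST] {user} [/INST]'
        if answer is not None:
            prompt += f' {answer}</s>'
    return prompt

prompt = build_mixtral_prompt([
    ('Simply put, the theory of relativity states that', None),
])
payload = {
    'inputs': prompt,
    'parameters': {'max_new_tokens': 64, 'top_p': 0.9, 'temperature': 0.6, 'stop': ['</s>']}
}
predictor.predict(payload)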

Cleaning

After you are done running the notebook, make sure to delete all the resources that you created in the process to make sure your billing is stopped. Use the following commands:

predictor.delete_model()
predictor.delete_endpoint()

Conclusion / Remarks

To be defined.

References: