Nick Jetten | 31 May 2023

Beyond GPT: Implementing and finetuning open source LLMs

The recent Artificial Intelligence (AI) hype focuses on ChatGPT and the AI war between OpenAI and Google. Meanwhile, an interesting movement is emerging: open-source Large Language Models (LLMs) are picking up momentum. The obvious benefit is that these models can be used by anyone, including you, in contrast to proprietary ones like GPT-4. At the beginning of May 2023, a leaked internal Google memo even predicted that the open-source community would outcompete both Google and OpenAI. At Enjins, we are strong believers in the open-source community and expect that LLMs will become part of the standard data scientist “knowledge stack” in the near future.

In this blog post, we’ll describe how to get started with open-source Large Language Models. We’ll go over finding open-source models, implementing LLM pipelines in Python using LangChain, parameter tweaking, finetuning, determining infrastructure requirements, and evaluating LLM performance. Disclaimer: This blog will most probably be out of date within 3 months, but reflects the current state of the art (as someone in the AI field said recently: ‘new month, new model’).

Selecting an (open-source) model

The first step in implementing an LLM is to choose a pre-trained model. Besides the commercial GPT models, a variety of open-source LLMs are available, including T5, Llama, Alpaca, Bloom, Dolly, BERT, and RoBERTa. Each model has its own strengths and weaknesses, and the choice of model depends on the specific use case. For example, GPT-3 is excellent for natural language generation, while BERT is better suited for language understanding tasks such as sentiment analysis and question answering. The size of the model also matters: overall, larger models perform better and have more emergent capabilities, such as zero-shot classification, but they are also more expensive to run. It is good practice to start with a smaller model and scale up if it doesn’t perform satisfactorily. Most of these models can be loaded through the HuggingFace transformers package, but make sure that the model you choose for your organization has a license that allows commercial use. If you don’t know which model to choose, check out the HuggingFace LLM leaderboard: it contains scores for popular LLMs on four different benchmarks.
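
The “start small, scale up” advice can be sketched as a simple selection loop. This is only an illustration: the candidate list (three sizes of Flan-T5, a T5 variant), the approximate parameter counts, and the evaluate function are assumptions that are not part of this post, and you would plug in your own scoring on your own task.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical candidate ladder, ordered from small to large.
# Parameter counts are approximate; check each model card (and its license) yourself.
CANDIDATES = [
    ("google/flan-t5-small", 80_000_000),
    ("google/flan-t5-base", 250_000_000),
    ("google/flan-t5-large", 780_000_000),
]


def smallest_adequate_model(
    candidates: Sequence[Tuple[str, int]],
    evaluate: Callable[[str], float],
    threshold: float,
) -> Optional[str]:
    """Return the smallest model whose task score reaches the threshold."""
    for name, _n_params in sorted(candidates, key=lambda c: c[1]):
        if evaluate(name) >= threshold:
            return name
    # Nothing was good enough: consider larger models or finetuning.
    return None
```

Here evaluate stands for whatever scoring you trust for your use case, for example accuracy on a small labeled sample. The loop stops at the first model that is good enough, so you never pay for a larger model than you need.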

Creating modular inference pipelines with LangChain

Once you’ve chosen a pre-trained model, the next step is to load it into a pipeline for inference to test its performance. For this, we recommend LangChain. LangChain is an open-source Python package that turns most state-of-the-art LLMs into prompt-able pipelines by providing a modular setup, with puzzle pieces in the shape of prompt templates, databases, memory, and so on. LangChain supports a variety of pre-trained models such as GPT-3 and T5, and a range of tasks such as question answering, summarization, and language translation. To run inference with your chosen model, first load it into Python, for example a HuggingFace model through LangChain’s HuggingFacePipeline class. After that, it is usually easiest to start with the standard LLMChain for simple prompting and then try other chains or agents for specialized tasks such as summarization.
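
As a rough sketch of this setup, the snippet below wraps a small HuggingFace model in an LLMChain. It assumes a 2023-era LangChain release (0.0.x); later releases moved these imports to other packages, so treat the import paths as an assumption. The model id and the prompt text are illustrative choices of ours.

```python
# Assumes a 2023-era LangChain release (0.0.x). Newer releases moved these imports,
# for example to the langchain_community and langchain_huggingface packages.

PROMPT_TEXT = (
    "Answer the question as concisely as you can.\n\n"
    "Question: {question}\n"
    "Answer:"
)


def build_chain(model_id: str = "google/flan-t5-small"):
    """Wrap a local HuggingFace model in a simple prompt -> LLM chain."""
    # Imported lazily so the module can be loaded without LangChain installed.
    from langchain import LLMChain, PromptTemplate
    from langchain.llms import HuggingFacePipeline

    # Downloads the model from the HuggingFace Hub on first use.
    llm = HuggingFacePipeline.from_model_id(
        model_id=model_id,
        task="text2text-generation",  # T5-style models; decoder-only models use "text-generation"
    )
    prompt = PromptTemplate(template=PROMPT_TEXT, input_variables=["question"])
    return LLMChain(llm=llm, prompt=prompt)
```

Calling build_chain().run(question="...") would then send the filled-in prompt to the model and return its answer as a string. Swapping the prompt template or the model id changes one puzzle piece without touching the rest of the pipeline.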

Tweaking and Finetuning an LLM

To fine-tune an LLM, for example to teach it new knowledge or to improve performance on a specific NLP task, you’ll need a labeled dataset for that task. For example, if you want to use an LLM for sentiment analysis, you’ll need a dataset of text samples that are each labeled as either positive or negative. If you don’t have a dataset yet, you can pick one from HuggingFace datasets or Kaggle. Once you have the labeled dataset, you can fine-tune the pre-trained model on it with a training library such as the HuggingFace transformers Trainer. LangChain itself is geared towards building inference pipelines rather than training, so it comes back into play once the fine-tuned model is ready.
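
A minimal sketch of such a finetuning run for sentiment analysis is shown below, assuming the HuggingFace transformers Trainer and the datasets library are used for the training step. The model name, the IMDB dataset, and the hyperparameter values are illustrative assumptions, not recommendations from this post.

```python
# IMDB uses 0 for negative and 1 for positive reviews.
ID2LABEL = {0: "negative", 1: "positive"}
LABEL2ID = {label: idx for idx, label in ID2LABEL.items()}


def finetune_sentiment(model_name: str = "distilbert-base-uncased",
                       output_dir: str = "sentiment-model"):
    """Fine-tune a small pre-trained model on a labeled sentiment dataset."""
    # Imported lazily: these packages are large and only needed when training.
    from datasets import load_dataset
    from transformers import (
        AutoModelForSequenceClassification,
        AutoTokenizer,
        Trainer,
        TrainingArguments,
    )

    dataset = load_dataset("imdb")  # columns: "text" and "label"
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    def tokenize(batch):
        # Truncate long reviews so every sample fits the same fixed length.
        return tokenizer(batch["text"], truncation=True,
                         padding="max_length", max_length=256)

    tokenized = dataset.map(tokenize, batched=True)

    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=2, id2label=ID2LABEL, label2id=LABEL2ID
    )
    args = TrainingArguments(
        output_dir=output_dir,
        learning_rate=2e-5,              # example value
        num_train_epochs=2,              # example value
        per_device_train_batch_size=16,  # lower this if the GPU runs out of memory
    )
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=tokenized["train"],
        eval_dataset=tokenized["test"],
    )
    trainer.train()
    return trainer
```

Running finetune_sentiment() downloads both the dataset and the model, so it is best tried in an environment with a GPU and a decent internet connection.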

A difficulty with training LLMs is that natural language data is non-tabular, which many developers aren’t used to. For working with text documents in Python, take a look at LangChain’s Document class. Alternatively, it can be useful to load documents into a vector database such as Chroma or Pinecone. Vector databases are specifically designed to store and search vector embeddings: while you still communicate in natural language, behind the scenes your query is converted into a vector embedding, the database looks up the stored documents whose embeddings are most similar to it, and the original human-readable text of those documents is returned.
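
To make the idea behind a vector database concrete, here is a toy in-memory version. It is not the Chroma or Pinecone API: the word-count “embedding” and the ToyVectorStore class are hypothetical stand-ins for the neural embedding model and the real database, and only the flow (embed, compare, return the original text) carries over.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: word counts (real systems use a neural embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """Keeps (embedding, original text) pairs and returns the closest texts."""

    def __init__(self):
        self._items = []

    def add(self, text: str) -> None:
        self._items.append((embed(text), text))

    def query(self, question: str, k: int = 1) -> list:
        q = embed(question)  # the query is embedded the same way as the documents
        ranked = sorted(self._items, key=lambda item: cosine(q, item[0]), reverse=True)
        return [text for _vector, text in ranked[:k]]  # human-readable text comes back
```

A short usage example: after adding a few sentences with store.add(...), calling store.query("When is the office closed?") returns the stored sentence that shares the most words with the question. A real vector database does the same with embeddings that capture meaning rather than exact words.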

Parameter tweaking

When fine-tuning a pre-trained model, there are a few parameters you can tweak to achieve better results. The first is the learning rate, which determines how large the updates to the model’s weights are during training. A high learning rate can cause the model to overshoot the optimal weights, while a low learning rate can make it converge too slowly. The ideal learning rate depends on the specific use case and can be determined through trial and error, or by using a learning rate schedule that gradually lowers the learning rate during training.
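
One common form of such a learning rate schedule is a linear warmup followed by a linear decay. The function below is a small sketch of that idea; the function name, the peak learning rate, and the number of warmup steps are example values of ours.

```python
def linear_warmup_decay(step: int, total_steps: int,
                        peak_lr: float = 2e-5, warmup_steps: int = 100) -> float:
    """Learning rate at a given training step (steps are counted from 0)."""
    if step < warmup_steps:
        # Ramp up linearly so the first updates stay small.
        return peak_lr * (step + 1) / warmup_steps
    # Afterwards, decay linearly from the peak down to zero at the last step.
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / max(total_steps - warmup_steps, 1)
```

With 1,000 training steps and 100 warmup steps, the learning rate reaches its peak at the end of the warmup, is back at half the peak at step 550, and ends at zero. Most training libraries ship comparable schedules, so in practice you would usually select one by name rather than write your own.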

The second parameter is the number of epochs. An epoch refers to one pass through the entire training dataset. Training an LLM on a large dataset can take a long time, so it’s important to find the optimal number of epochs for your specific use case. Too few epochs can result in an underfit model, while too many epochs can result in an overfit model.
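
A simple way to find a reasonable number of epochs is to track the loss on a held-out validation set and stop once it no longer improves (early stopping). The helper below sketches that rule; the function name and the patience value are our own illustrative choices.

```python
from typing import Sequence, Tuple


def best_epoch(val_losses: Sequence[float], patience: int = 2) -> Tuple[int, int]:
    """Return (epoch with the lowest validation loss, epoch at which training stops).

    Epochs are counted from 1. Training stops once the validation loss has not
    improved for `patience` epochs in a row.
    """
    best_loss, best_idx = float("inf"), 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_idx = loss, epoch
        elif epoch - best_idx >= patience:
            return best_idx, epoch  # validation loss keeps rising: likely overfitting
    return best_idx, len(val_losses)
```

For validation losses of 0.90, 0.70, 0.62, 0.65, 0.71, and 0.80, the lowest loss is reached after epoch 3 and, with a patience of 2, training would stop after epoch 5. The rising loss after epoch 3 is the typical sign of a model that starts to overfit.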


Infrastructure Requirements

Implementing an LLM requires significant computational resources, especially when finetuning a pre-trained model. To achieve optimal performance, it’s essential to use a Graphics Processing Unit (GPU) or even a Tensor Processing Unit (TPU) for training the model. A GPU is specialized hardware that can perform parallel computations much faster than a traditional CPU. Cloud services such as Google Cloud Platform, Amazon Web Services, and Microsoft Azure provide access to GPUs for training and deploying LLMs. For PoC development in the “playground phase”, it is a good idea to start with Google Colab, as you can use a GPU on the free tier (depending on availability). If you are working on a more mature project, training the model from an AWS SageMaker Notebook in a GPU environment is a better option. For most models, you can use the HuggingFace Estimator, which ensures that only the training itself runs on the GPU, while all other code editing and running is done on the notebook’s own instance type (which can be a cheaper CPU).
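
A rough sketch of launching such a training job with the HuggingFace Estimator from the SageMaker Python SDK is shown below. The script name, source folder, instance type, framework version strings, and hyperparameters are all assumptions for illustration; check which version combinations your SDK supports before using them.

```python
def estimator_settings(role_arn: str, epochs: int = 2,
                       learning_rate: float = 2e-5) -> dict:
    """Collect the settings for a single-GPU HuggingFace training job."""
    return {
        "entry_point": "train.py",          # hypothetical training script
        "source_dir": "./scripts",          # hypothetical folder containing that script
        "instance_type": "ml.g4dn.xlarge",  # GPU instance, only used while the job runs
        "instance_count": 1,
        "role": role_arn,
        "transformers_version": "4.26",     # example versions, verify against your SDK
        "pytorch_version": "1.13",
        "py_version": "py39",
        # Passed to train.py as command-line arguments.
        "hyperparameters": {"epochs": epochs, "learning_rate": learning_rate},
    }


def launch_training(role_arn: str, train_s3_uri: str):
    """Start the training job; requires the sagemaker SDK and AWS credentials."""
    from sagemaker.huggingface import HuggingFace  # imported lazily

    estimator = HuggingFace(**estimator_settings(role_arn))
    estimator.fit({"train": train_s3_uri})  # training data is read from S3
    return estimator
```

The notebook that calls launch_training can run on a cheap CPU instance, because the GPU instance named in instance_type is only started for the duration of the job.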

Having access to a GPU does not solve everything: it is important to manage the memory usage of your script, as GPUs tend to run out of memory. This happens especially with large batch sizes or large samples, as is the case for document data. On SageMaker, it is recommended to choose an instance type for your Estimator that has a large amount of memory (look for “memory-optimized” instance types), and if you upgrade Google Colab to Pro, you can set your runtime to High RAM. However, it is even better to ensure that your script is set up efficiently. One way to do this is by reducing your batch size or by using iterable datasets, which are loaded into memory piece by piece instead of all at once. Another way is to simply use a smaller model, either by picking a version of your chosen model with fewer parameters (often, the HuggingFace Hub contains multiple versions of the same model) or by switching to a different model altogether.
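
The “piece by piece” idea can be illustrated with plain Python generators. This sketch assumes one document per line in a text file, which is our assumption and not a requirement; libraries such as HuggingFace datasets offer a comparable streaming mode.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def read_documents(path: str) -> Iterator[str]:
    """Yield documents one at a time instead of loading the whole file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip empty lines
                yield line


def batched(items: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Group a (possibly very large) stream into small batches."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

Looping over batched(read_documents("corpus.txt"), 8) keeps at most eight documents in memory at any time, regardless of how large the file is. Lowering the batch size trades some speed for a smaller memory footprint.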

Evaluating the Performance of LLMs

After you have your LLM set up in Python and fine-tuned to your data and task, you probably want to evaluate its performance. How best to evaluate LLMs is still a topic of debate: how do you determine whether a newly created text is “good”? It can be quite subjective. Many metrics try to solve this by comparing generated texts to reference texts (“true labels” written by humans or by ChatGPT). There are several metrics you can use to evaluate the performance of an LLM, depending on the specific use case. For example, if you’re using an LLM for sentiment analysis, you can use metrics such as accuracy, precision, or recall, but when your task is summarization, a better metric might be ROUGE, which measures the word overlap between a generated summary and a reference summary. To choose your metric, it is recommended to check out the HELM project of Stanford University, which compares many LLMs on many different tasks and metrics.
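
To show what these metrics compute, here are small plain-Python versions of accuracy, precision, recall, and a simplified ROUGE-1 (unigram overlap, whitespace tokenization, no stemming). They are sketches for understanding; for real evaluations an established metrics library is the safer choice, and its ROUGE scores may differ slightly because of different tokenization.

```python
from collections import Counter
from typing import Sequence, Tuple


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall(y_true: Sequence[str], y_pred: Sequence[str],
                     positive: str = "positive") -> Tuple[float, float]:
    """Precision and recall for the class named by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def rouge1_f1(reference: str, candidate: str) -> float:
    """Simplified ROUGE-1 F1: overlap of single words between two texts."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # words shared by both, counted with multiplicity
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

For the reference “the cat sat on the mat” and the candidate “the cat lay on the mat”, five of the six words overlap, so the simplified ROUGE-1 F1 is 5/6. The example also shows the limitation of overlap metrics: swapping one word can change the meaning completely while the score stays high.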


Now that you have a well-performing model, it’s time to deploy it. For this, you can use cloud platforms such as AWS or Azure (AWS SageMaker is a good option). Deployment has most of the same hardware requirements as training, such as a GPU and memory optimization. Once you’ve set it up, your newly created LLM will hopefully solve all your problems (unlike ChatGPT)!
