7.2. Fine-Tuning to Follow Instructions


tip

Learn and practice AWS hacking: HackTricks Training AWS Red Team Expert (ARTE)
Learn and practice GCP hacking: HackTricks Training GCP Red Team Expert (GRTE)
Learn and practice Azure hacking: HackTricks Training Azure Red Team Expert (AzRTE)

Support HackTricks

Check the subscription plans!
Join the 💬 Discord group or the Telegram group, or follow us on Twitter 🐦 @hacktricks_live.
Share hacking tricks by submitting PRs to the HackTricks and HackTricks Cloud GitHub repositories.

tip

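The goal of this section is to show how to fine-tune an already pre-trained model so that it follows instructions instead of just generating text, for example by responding to tasks as a chatbot.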

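Dataset

To fine-tune an LLM to follow instructions, you need a dataset that contains instructions and responses. There are different formats for training an LLM to follow instructions, for example:

Example of the Alpaca prompt style: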

csharp

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Calculate the area of a circle with a radius of 5 units.

### Response:
The area of a circle is calculated using the formula \( A = \pi r^2 \). Plugging in the radius of 5 units:

\( A = \pi (5)^2 = \pi \times 25 = 25\pi \) square units.

Example of the Phi-3 prompt style:

vbnet

<|User|>
Can you explain what gravity is in simple terms?

<|Assistant|>
Absolutely! Gravity is a force that pulls objects toward each other.

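Training an LLM on this kind of dataset, rather than on raw text alone, helps it understand that it must give specific answers to the questions it receives.

Therefore, one of the first things to do with a dataset of requests and answers is to convert that data into the desired prompt format. For example: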

python

# Code from https://github.com/rasbt/LLMs-from-scratch/blob/main/ch07/01_main-chapter-code/ch07.ipynb
def format_input(entry):
    # Alpaca-style header followed by the instruction itself
    instruction_text = (
        f"Below is an instruction that describes a task. "
        f"Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )

    # The "### Input:" section is added only when the entry has a non-empty input
    input_text = f"\n\n### Input:\n{entry['input']}" if entry["input"] else ""

    return instruction_text + input_text

# `data` is assumed to be the already loaded dataset: a list of dicts
# with "instruction", "input", and "output" keys
model_input = format_input(data[50])

desired_response = f"\n\n### Response:\n{data[50]['output']}"

print(model_input + desired_response)
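The same idea applies to the other prompt style shown earlier. The sketch below is a hypothetical helper (`format_input_phi3` is not part of the referenced repository) that renders an entry with the `<|User|>` and `<|Assistant|>` tags exactly as they appear in the example above. It assumes entries with "instruction", "input", and "output" keys, and the sample entry is invented for illustration.

```python
def format_input_phi3(entry):
    """Render one dataset entry in the Phi-3-like style shown earlier."""
    # Fold the optional "input" field into the user turn, if present
    user_text = entry["instruction"]
    if entry.get("input"):
        user_text += "\n" + entry["input"]
    return f"<|User|>\n{user_text}\n\n<|Assistant|>\n"


# Hypothetical entry, invented for illustration only
example = {
    "instruction": "Can you explain what gravity is in simple terms?",
    "input": "",
    "output": "Absolutely! Gravity is a force that pulls objects toward each other.",
}

prompt = format_input_phi3(example)
print(prompt + example["output"])
```

Whichever style is chosen, the same formatting function should be used for training and for inference, since the model learns to answer after the exact marker it saw during fine-tuning.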

Then, as always, the dataset needs to be split into training, validation, and test sets.
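A minimal sketch of such a split is shown below. The 85% / 10% split fractions, the seed, and the helper name `split_dataset` are assumptions chosen for illustration, and the toy `data` list stands in for the real instruction dataset.

```python
import random


def split_dataset(data, train_frac=0.85, test_frac=0.10, seed=123):
    """Shuffle a copy of `data` and split it into train, validation, and test lists."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    train_end = int(len(shuffled) * train_frac)
    test_end = train_end + int(len(shuffled) * test_frac)
    train = shuffled[:train_end]
    test = shuffled[train_end:test_end]
    val = shuffled[test_end:]  # whatever remains after train and test
    return train, val, test


# Toy stand-in for the real dataset (hypothetical entries)
data = [{"instruction": f"task {i}", "input": "", "output": f"answer {i}"}
        for i in range(100)]

train_data, val_data, test_data = split_dataset(data)
print(len(train_data), len(val_data), len(test_data))
```

Shuffling before splitting avoids ending up with a validation or test set that only contains the last kind of task in the file.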

Batching & Data Loaders

Next, all the inputs and expected outputs need to be batched for training. This requires the following steps:

Tokenize the texts
Pad all the samples to the same length (usually as long as the context length used to pre-train the LLM)
Create the expected tokens by shifting the input by 1 in a custom collate function
Replace some padding tokens with -100 to exclude them from the training loss: after the first endoftext token, replace all the other endoftext tokens with -100 (cross_entropy(..., ignore_index=-100) ignores targets equal to -100)
[Optional] Also mask with -100 all the tokens that belong to the question, so the LLM only learns how to generate the answer. In the Alpaca style this means masking everything up to ### Response:
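The padding, shifting, and masking steps above can be sketched as follows. This is a simplified, pure-Python illustration that works on lists of token IDs; a real collate function would return tensors and move them to the training device. The function name, the `prompt_lengths` parameter, and the use of 50256 (the GPT-2 endoftext ID) as the padding token are assumptions made for this example.

```python
PAD_TOKEN_ID = 50256   # GPT-2 <|endoftext|> id, assumed here to double as padding
IGNORE_INDEX = -100    # default ignore_index of PyTorch's cross_entropy


def collate_batch(batch, pad_token_id=PAD_TOKEN_ID, ignore_index=IGNORE_INDEX,
                  max_length=None, prompt_lengths=None):
    """Pad, shift, and mask a batch of token-id lists.

    Returns (inputs, targets) as lists of equal-length lists.
    """
    # +1: each sample gets at least one end-of-text token before shifting
    batch_max = max(len(item) for item in batch) + 1
    inputs, targets = [], []
    for idx, item in enumerate(batch):
        padded = item + [pad_token_id] * (batch_max - len(item))
        inp = padded[:-1]   # model input
        tgt = padded[1:]    # same tokens shifted left by one position
        # Keep the first end-of-text target, mask the remaining padding
        tgt = [tok if pos < len(item) else ignore_index
               for pos, tok in enumerate(tgt)]
        # Optional: also mask the targets that belong to the prompt
        if prompt_lengths is not None:
            n = min(max(prompt_lengths[idx] - 1, 0), len(tgt))
            tgt[:n] = [ignore_index] * n
        if max_length is not None:
            inp, tgt = inp[:max_length], tgt[:max_length]
        inputs.append(inp)
        targets.append(tgt)
    return inputs, targets


batch = [[0, 1, 2, 3, 4], [5, 6], [7, 8, 9]]
inputs, targets = collate_batch(batch)
masked_inputs, masked_targets = collate_batch(batch, prompt_lengths=[3, 1, 2])
```

The prompt mask covers `prompt_length - 1` targets because the targets are shifted by one: the target at position i is the token at position i + 1 of the original sequence.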

Once this is done, it is time to create the data loaders for each dataset (training, validation, and test).

Load pre-trained LLM & Fine tune & Loss Checking

A pre-trained LLM must be loaded in order to fine-tune it. This was already discussed in other pages. The training function used previously can then be reused to fine-tune the LLM.

During training, you can track how the training loss and the validation loss change across epochs to check whether the loss is decreasing and whether overfitting is occurring.
Remember that overfitting occurs when the training loss keeps decreasing while the validation loss stops decreasing or even increases. The simplest way to avoid it is to stop training at the epoch where this behaviour starts.
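The stopping rule described above can be sketched as a small helper that scans the recorded validation losses. The function name, the `patience` parameter, and the loss values are made up for illustration; they are not taken from the referenced training code.

```python
def find_stopping_epoch(val_losses, patience=2):
    """Return (best_epoch, stop_epoch) as 0-based indices.

    best_epoch is the epoch with the lowest validation loss seen so far.
    stop_epoch is the epoch at which the loss has failed to improve for
    `patience` consecutive epochs, or None if that never happens.
    """
    best_epoch, best_loss = 0, float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    return best_epoch, None


# Made-up validation losses: they improve for three epochs, then rise
val_losses = [2.2, 1.8, 1.5, 1.6, 1.7, 1.9]
best, stop = find_stopping_epoch(val_losses, patience=2)
print(f"best epoch: {best}, stop at epoch: {stop}")
```

In practice, the model weights from the best epoch are the ones worth keeping, so a checkpoint is usually saved whenever the validation loss improves.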

Response Quality

Unlike classification fine-tuning, where loss variations are a more trustworthy signal, here it is also important to check the quality of the responses on the test set. It is therefore recommended to gather the generated responses for the whole test set and review them manually for wrong answers. Note that the LLM can produce a response with the correct format and syntax that is nevertheless completely wrong, and the loss will not reflect this.
This review can also be performed by passing the generated responses and the expected responses to another LLM and asking it to evaluate them.
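A sketch of how such an LLM-based review could be prepared is shown below. It only builds the scoring prompt and parses the numeric reply; the call to the judge model is deliberately left out because it depends on the provider. The prompt wording, the 0 to 100 scale, and both function names are assumptions made for illustration.

```python
import re


def build_judge_prompt(instruction, expected, generated):
    """Build a scoring prompt for a judge LLM (the wording is an assumption)."""
    return (
        f"Given the input `{instruction}` and the correct output `{expected}`, "
        f"score the model response `{generated}` on a scale from 0 to 100, "
        f"where 100 is the best score. Respond with the integer number only."
    )


def parse_score(judge_reply):
    """Extract the first integer from the reply; return None if it is missing or out of range."""
    match = re.search(r"\d+", judge_reply)
    if match is None:
        return None
    score = int(match.group())
    return score if 0 <= score <= 100 else None


judge_prompt = build_judge_prompt(
    "Calculate the area of a circle with a radius of 5 units.",
    "25\u03c0 square units.",
    "The area is 10\u03c0 square units.",
)
print(judge_prompt)
print(parse_score("Score: 70"))
```

Replies that cannot be parsed are returned as `None` so they can be counted separately instead of silently skewing the average score.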

Other tests and benchmarks that can be used to verify the quality of the responses:

Measuring Massive Multitask Language Understanding (MMLU): MMLU evaluates a model's knowledge and problem-solving abilities across 57 subjects, including humanities, sciences, and more. It uses multiple-choice questions to assess understanding at various difficulty levels, from elementary to advanced professional.
LMSYS Chatbot Arena: This platform allows users to compare responses from different chatbots side by side. Users input a prompt, and multiple chatbots generate responses that can be directly compared.
AlpacaEval: AlpacaEval is an automated evaluation framework where an advanced LLM like GPT-4 assesses the responses of other models to various prompts.
General Language Understanding Evaluation (GLUE): GLUE is a collection of nine natural language understanding tasks, including sentiment analysis, textual entailment, and question answering.
SuperGLUE: Building upon GLUE, SuperGLUE includes more challenging tasks designed to be difficult for current models.
Beyond the Imitation Game Benchmark (BIG-bench): BIG-bench is a large-scale benchmark with over 200 tasks that test a model's abilities in areas like reasoning, translation, and question answering.
Holistic Evaluation of Language Models (HELM): HELM provides a comprehensive evaluation across various metrics like accuracy, robustness, and fairness.
OpenAI Evals: An open-source evaluation framework by OpenAI that allows for the testing of AI models on custom and standardized tasks.
HumanEval: A collection of programming problems used to evaluate code generation abilities of language models.
Stanford Question Answering Dataset (SQuAD): SQuAD consists of questions about Wikipedia articles, where models must comprehend the text to answer accurately.
TriviaQA: A large-scale dataset of trivia questions and answers, along with evidence documents.

Many other benchmarks exist.

Instruction-following fine-tuning code

You can find an example of the code that performs this fine-tuning at https://github.com/rasbt/LLMs-from-scratch/blob/main/ch07/01_main-chapter-code/gpt_instruction_finetuning.py

References

https://www.manning.com/books/build-a-large-language-model-from-scratch
