Phi-1: "Textbooks Are All You Need" Article

https://medium.com/@cbarkinozer/phi-1-t%C3%BCm-i%CC%87htiyac%C4%B1n%C4%B1z-olan-ders-kitaplar%C4%B1-makalesi-37cdba8c9937


This article shows, in the context of code LLMs, that even small models trained on little but high-quality data can challenge large models.

Abstract

The phi-1 code LLM is significantly smaller than competing models. It has 1.3B parameters and was trained for 4 days on 8 A100s using a mixture of "textbook" data: "textbook-quality" data filtered from the web (6B tokens) and synthetically generated textbooks and exercises produced with GPT-3.5 (1B tokens). Despite this small scale, phi-1 attains 50.6% pass@1 accuracy on HumanEval and 55.5% pass@1 accuracy on MBPP. It also displays surprising emergent properties compared to phi-1-base, the model before fine-tuning on a dataset of coding exercises, and to phi-1-small, a 350M-parameter model trained with the same pipeline that still achieves 45% pass@1 on HumanEval.

Summary

  • Phi-1, a 1.3B-parameter transformer-based code LLM, is introduced.
  • Pass@1 accuracy of 50.6% in HumanEval and 55.5% in MBPP was achieved by Phi-1.
  • Research on training large artificial neural networks is very limited.
  • Scaling laws show that performance increases as compute or model size increases.
  • Eldan and Li's work on TinyStories has shown that high-quality data can change performance and scaling rules.
  • With high-quality data, state-of-the-art large language models can be trained with less compute and smaller datasets.
  • The work focuses on code LLMs, specifically on writing simple Python functions from their docstrings.
  • Current scaling laws do not account for the quality of the training data.
  • Although phi-1 is much smaller than competing models, it achieves high accuracy.
  • The details of Phi-1's training process, how the data was selected, and the resulting features are discussed.
  • The model has been published for public use and evaluation.
  • The importance of textbook-quality training data for coding: large amounts of standard code data are not enough to teach the basics of coding.
  • Lack of self-containment, lack of meaningful computation, complexity or inadequate documentation, and uneven distribution of coding concepts are some of the problems in standard code datasets.
  • Language models will benefit from a training set that is clear, self-contained, didactic, and balanced.
  • There are three main datasets used for training: a filtered code-language dataset, synthetic textbooks, and synthetic exercises.
  • "CodeTextbook" is the combination of the filtered code-language and synthetic textbook datasets.
  • A small synthetic dataset of exercises is collected to fine-tune the model.
  • GPT-4 is used to evaluate the quality of a subset of code samples.
  • Model performance can be significantly improved using filtering methodology.
  • Diversity in examples is important to expose the model to a variety of coding ideas and increase generalization and robustness. Language models tend to create redundant and homogeneous datasets.
  • Adding randomness to the prompt can make a difference in the text generated.
  • Different topics and target audience constraints helped create synthetic textbooks.
  • The synthetic Python exercises dataset contains exercises and their solutions.
  • A decoder-only transformer model with the FlashAttention implementation of attention was used.
  • Architectural details of the models (phi-1 and phi-1-small) are given.
  • Training details such as sequence length, optimizer, and learning-rate schedule are provided.
  • Pre-training is done on the CodeTextbook dataset and fine-tuning on the CodeExercises dataset.
  • Fine-tuning improves execution even on tasks that are not in the fine-tuning dataset.
  • Qualitative analysis of the model's performance: the text compares the capabilities of the fine-tuned phi-1 model and the pre-trained baseline phi-1-base model.
  • The paper notes that phi-1 shows a better understanding of, and compliance with, instructions after fine-tuning.
  • Phi-1-base grasps the logical relationships in the prompt but fails to produce a correct answer, whereas phi-1 interprets the prompt correctly and produces the right answer.
  • Even the smaller phi-1-small model shows some understanding of the problems, but its solutions are often wrong.
  • Fine-tuning on CodeExercises improves the model's ability to use external libraries, even though those libraries do not appear in the exercises, the text says.
  • It provides an example of a PyGame program where phi-1 calls the PyGame functions correctly while phi-1-base and phi-1-small produce unrelated function calls.
  • Another example is a Tkinter app where phi-1 correctly updates a text field based on user input, while phi-1-base and phi-1-small struggle with the task logic. The code includes buttons and functions from the Tkinter GUI.
  • While phi-1 implements the GUI correctly, the other two models (phi-1-base and phi-1-small) do not use the Tkinter APIs correctly.
  • An example dialogue between a student and an assistant about increasing the resolution of and rotating a Python pyplot figure is provided.
  • Concerns about memorization and contamination of the dataset are addressed.
  • A new evaluation method was developed with GPT-4 to evaluate coding solutions.
  • The results show that phi-1 works better than StarCoder on unfamiliar questions.
  • Data pruning experiments are applied to evaluate the performance.
  • A standard contamination test between CodeExercises and HumanEval did not reveal problematic overlap.
  • The datasets are subjected to an n-gram overlap analysis, which finds minimal overlap with HumanEval (a minimal sketch of such a check appears after this list).
  • Example HumanEval problem: return the largest integer whose frequency in a list is greater than zero and greater than or equal to the value of the integer itself.
  • Example CodeExercises task: compute a frequency analysis over a list of integers.
  • Embedding and syntax-based distances are used to find similar code snippets between HumanEval and CodeExercises.
  • Embedding distance detects code pairs with similar semantics.
  • Syntax-based distance uses the edit distance between the abstract syntax trees (ASTs) of the code snippets.
  • CodeExercises is pruned using the embedding distance and thresholds on the AST-based match rate.
  • phi-1 retrained on the pruned dataset is compared against StarCoder-Prompted and the original phi-1.
  • Accuracy of models for subsets of HumanEval problems.
  • The importance of high-quality data in improving language model proficiency in coding tasks
  • Limitations of phi-1 compared to larger models for coding.
  • Potential for creating synthetic data with GPT-4.
  • Creating high-quality datasets involves scope, diversity, and redundancy of content.
  • Lack of methods to measure and evaluate redundancy and diversity in data.
  • Social and moral implications of training language models and curating data for future models
  • Appendix examples show prompts that were challenging for the model before fine-tuning and the improvement after fine-tuning: the first code snippet searches a numeric matrix D for indices that meet certain conditions.
  • The second snippet is a PyTorch function that performs a modified gradient update on a model's parameters.
  • The third code snippet is a Pyplot application that animates a line chart.
  • The fourth code snippet shows the different ways the phi-1-small and phi-1-base models complete a given prompt.
  • The fifth code snippet illustrates the model's limitations and weaknesses, including its sensitivity to small changes in the prompt.
  • The sixth code snippet asks for a three-layer neural network class in PyTorch that uses a new method known as "rr".
  • The seventh code snippet is a similar prompt with four layers instead of three: create a four-layer neural network class with PyTorch.
  • The model can produce nonsensical code for spatial-reasoning and counting tasks.
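
As referenced in the n-gram overlap bullet above, here is a minimal sketch of such an overlap check, assuming word-level n-grams and a simple overlap ratio; the exact n, tokenization, and thresholds used in the paper may differ, and the strings below are placeholders.

```python
def word_ngrams(text: str, n: int = 13) -> set:
    """Return the set of word-level n-grams in a piece of text."""
    words = text.split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def ngram_overlap(a: str, b: str, n: int = 13) -> float:
    """Fraction of a's n-grams that also appear in b (0.0 if a has none)."""
    grams_a, grams_b = word_ngrams(a, n), word_ngrams(b, n)
    if not grams_a:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a)

# Flag a CodeExercises snippet if it shares many n-grams with any
# HumanEval prompt (threshold chosen here purely for illustration).
humaneval_prompts = ["def search(lst): ..."]           # placeholder
exercise = "def frequency_analysis(numbers): ..."       # placeholder
contaminated = any(ngram_overlap(exercise, p) > 0.6 for p in humaneval_prompts)
```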

Figures and Tables

Table 1: We use self-reported scores whenever possible. phi-1 outperforms the competing models on HumanEval and MBPP except for GPT-4 (WizardCoder achieves a better HumanEval score but a worse MBPP score).
Figure 2.1 shows pass@1 accuracy (%) on HumanEval. The grouping of bars corresponds to the usual scaling dimensions: increasing the compute (here from 26B to 76B tokens seen) or increasing the number of model parameters (here from 350M to 1.3B). Each column within a group corresponds to a different training dataset. The orange columns show the performance of models trained on the standard dataset of deduplicated Python files from The Stack (plus StackOverflow for the 1.3B-parameter model), the light green columns show models trained on CodeTextbook, our new dataset composition, and the dark green columns show the corresponding second-column models fine-tuned on our new CodeExercises dataset. For the 1.3B models, phi-1 and phi-1-base are checkpoints trained on 51B tokens (770 GPU hours), while the Stack+ model was trained on 76B tokens and 1090 GPU hours. We highlight that phi-1-base, trained on the CodeTextbook dataset, already reaches 29% on HumanEval with only 1.3B parameters and without any fine-tuning. For comparison, the Replit-Finetuned model reaches close to 30% on HumanEval with 2.7B parameters trained on 100 times more training tokens than ours [Rep23]. On top of this, fine-tuning on our CodeExercises dataset to obtain phi-1 boosts HumanEval performance to 51% and also unlocks unexpected coding capabilities.
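
Since Figure 2.1 and the benchmarks report pass@1, a brief reminder of how pass@k is typically estimated may help. The sketch below implements the standard unbiased estimator (generate n samples per problem and count the c that pass the unit tests); it is illustrative rather than taken from the paper, and the example numbers are made up.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased per-problem estimate of pass@k, given n samples of which c pass."""
    if n - c < k:
        return 1.0  # cannot pick k samples that all fail
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers: 20 samples generated for a problem, 9 pass the tests.
print(pass_at_k(n=20, c=9, k=1))   # ≈ 0.45, the per-problem pass@1 estimate
```

The benchmark score is then the average of this estimate over all problems in the benchmark.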
Even without the synthetic datasets discussed below, our filtering methodology significantly improves model performance: the 350M-parameter model trained on unfiltered Stack (deduplicated Python) plus StackOverflow reaches only 12.19% on HumanEval even after 96k training steps (~200B tokens), whereas training on the filtered subset reaches 17.68% on HumanEval after 36k steps. We further improve this to 20.12% by training on a combination of the filtered dataset and the synthetic textbook dataset discussed below (reported in Figure 2.1).
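
A rough sketch of this filtering idea: annotate a small sample of files for educational value with a strong LLM, then train a cheap classifier over code embeddings to score the full corpus. The embedding function, labels, and threshold below are illustrative placeholders, not the paper's exact setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def embed(code: str) -> np.ndarray:
    """Placeholder embedding; in practice this would come from a pretrained code model."""
    rng = np.random.default_rng(abs(hash(code)) % (2**32))
    return rng.normal(size=256)

# Step 1: a small subset of files annotated (e.g. by GPT-4) as
# educational (1) or not (0). The labels here are illustrative.
labeled_files = [("def add(a, b):\n    return a + b", 1),
                 ("x=1;y=2;print(x+y)#junk", 0)]

X = np.stack([embed(code) for code, _ in labeled_files])
y = np.array([label for _, label in labeled_files])

# Step 2: train a cheap classifier on the embeddings.
clf = RandomForestClassifier(n_estimators=100).fit(X, y)

# Step 3: keep only corpus files with high predicted educational value.
corpus = ["def fib(n):\n    ..."]
keep = [code for code in corpus
        if clf.predict_proba(embed(code).reshape(1, -1))[0, 1] > 0.5]
```
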
Figure 3.1 shows the number of imports among the more than 880,000 exercises in CodeExercises, ignoring libraries imported fewer than ten times. To generate the plotting code, phi-1 is prompted with the following instruction: "I have a dictionary; first sort the dictionary by value, from largest to smallest. Then generate a pyplot bar plot. First set the font size to 7, then rotate the x-axis labels by 90 degrees. The x-axis is the keys and the y-axis is the values of the dictionary. Use a log scale for the y-axis. Moreover, set the y-axis label to 'Log Number of Times' and the x-axis label to 'Imports'. Set dpi to 1000."
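
For reference, a plausible piece of plotting code satisfying that prompt might look like the sketch below; this is not phi-1's actual output, and the import counts are made-up placeholders.

```python
import matplotlib.pyplot as plt

counts = {"numpy": 350_000, "math": 210_000, "typing": 180_000}  # illustrative data

# Sort the dictionary by value, largest to smallest.
items = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
keys = [k for k, _ in items]
values = [v for _, v in items]

plt.rcParams.update({"font.size": 7})   # font size 7
plt.bar(keys, values)
plt.xticks(rotation=90)                  # rotate x-axis labels by 90 degrees
plt.yscale("log")                        # log scale on the y-axis
plt.ylabel("Log Number of Times")
plt.xlabel("Imports")
plt.savefig("imports.png", dpi=1000)     # dpi 1000
```
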
Table 2: LLM-graded understanding scores on 50 new, unfamiliar coding problems.
Table 3 shows the percentage of HumanEval problems solved correctly by different models, where, for a given threshold τ, each HumanEval problem is checked for close matches within the CodeExercises dataset. The problem count is the number of HumanEval problems in each subset. Here τ is the threshold on the AST-based match rate between codes used for the similarity check.
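
One simple way to realize an AST-based match rate like the one used for Table 3 is to compare the sequences of AST node types of two snippets; the sketch below uses difflib's similarity ratio as the score. The paper's exact metric and thresholds τ may differ.

```python
import ast
from difflib import SequenceMatcher

def node_types(code: str) -> list:
    """Sequence of AST node type names, which ignores identifiers and literal values."""
    return [type(node).__name__ for node in ast.walk(ast.parse(code))]

def ast_match_rate(code_a: str, code_b: str) -> float:
    """Similarity in [0, 1] between the AST node-type sequences of two snippets."""
    return SequenceMatcher(None, node_types(code_a), node_types(code_b)).ratio()

# Prune an exercise if it matches a HumanEval problem above a threshold tau.
tau = 0.9
a = "def add(a, b):\n    return a + b"
b = "def plus(x, y):\n    return x + y"
print(ast_match_rate(a, b) >= tau)   # True here: identical structure, different names
```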

Conclusion

Like the comprehensive and well-crafted preparation of a textbook, our work demonstrates the remarkable impact of high-quality data on a language model's proficiency at code-generation tasks. By generating "textbook-quality" data, we were able to train a model that outperforms nearly all open-source models on coding benchmarks such as HumanEval and MBPP, despite being ten times smaller in model size and a hundred times smaller in dataset size. We believe that such high-quality data significantly increases the learning efficiency of code language models because it provides clear, self-contained, instructive, and balanced examples of coding concepts and skills. Compared to larger models for coding, our model faces some limitations. First, phi-1 is specialized in Python coding, which restricts its versatility compared to multilingual models.

Second, phi-1 lacks the domain-specific knowledge of larger models, such as programming with specific APIs or using less common packages. Finally, due to the structured nature of the datasets and the lack of diversity in terms of language and style, phi-1 is less robust to stylistic variations or errors in the prompt; in particular, grammatical mistakes in the prompt noticeably degrade its performance. We expand on these limitations and give examples of phi-1's failure modes in Appendix B. None of these limitations appears fundamental, and with further work our approach could be used to tackle each of them, although it is unclear what scaling might be necessary to overcome them (both in model size and in dataset size).

Additionally, since we observed that the GPT-3.5-generated data has a high error rate, we believe that using GPT-4 instead of GPT-3.5 to generate the synthetic data would be of great benefit. Interestingly, phi-1 achieves such strong coding ability despite these errors (a similar phenomenon was observed in [AZL23], where a language model trained on data with a 100% error rate could still produce correct answers at test time). Overall, our work shows that developing a good methodology for creating high-quality datasets is an important component of research aimed at advancing natural language processing and related fields (see [JWJ+23] for more information). However, creating high-quality datasets is a challenging task and comes with numerous issues that need to be addressed.

One challenge is ensuring that the dataset covers all the relevant content and concepts the model needs to learn, and does so in a balanced and representative way. Another is ensuring that the dataset is truly diverse and non-repetitive, so that the model does not simply overfit to the data or memorize specific patterns or solutions; this means finding ways to inject creativity and randomness into the data-generation process while maintaining the quality and coherence of the examples. Even after creating such datasets, we do not have a suitable method to measure and evaluate the diversity and redundancy in the data. For example, if a dataset contains coding exercises, it is difficult to determine how many different variations of each exercise exist and how they are distributed across the dataset.
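
There is no established metric here, but one naive proxy for redundancy, assuming a reasonable code-embedding model is available, is the fraction of examples whose nearest neighbor in embedding space exceeds a similarity threshold; the sketch below illustrates the idea with random vectors standing in for real embeddings.

```python
import numpy as np

def redundancy_rate(embeddings: np.ndarray, threshold: float = 0.95) -> float:
    """Fraction of examples whose most similar other example exceeds `threshold`
    in cosine similarity -- a crude proxy for near-duplicate content."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, -1.0)               # ignore self-similarity
    return float(np.mean(sims.max(axis=1) > threshold))

# Illustrative: 1000 random 256-d "embeddings" should show roughly zero redundancy.
rng = np.random.default_rng(0)
print(redundancy_rate(rng.normal(size=(1000, 256))))
```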

Finally, since language models themselves will be used to curate data for future language models, the ethical and social implications of training such models become even more important: questions of transparency, accountability, and bias apply to both the data and the models involved in this process.

References

[1] Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, et al. "Textbooks Are All You Need." arXiv:2306.11644, 2 Oct 2023.
