diff --git a/.github/workflows/ci-checks.yaml b/.github/workflows/ci-checks.yaml
new file mode 100644
index 0000000..168a39e
--- /dev/null
+++ b/.github/workflows/ci-checks.yaml
@@ -0,0 +1,33 @@
+name: CI Checks
+
+on:
+  push:
+    branches:
+      - main
+
+  pull_request:
+    branches:
+      - "*" # Run on all branches
+
+env:
+  MIN_PYTHON_VERSION: 3.11
+
+jobs:
+  job-pre-commit-check:
+    name: Pre-Commit Check
+    runs-on: ubuntu-latest
+
+    steps:
+      - name: Checkout code
+        uses: actions/checkout@v3
+
+      - name: Set up Python
+        uses: actions/setup-python@v3
+        with:
+          python-version: ${{ env.MIN_PYTHON_VERSION }}
+
+      - name: Install dependencies
+        run: pip install -r requirements-dev.txt
+
+      - name: Run pre-commit
+        run: pre-commit run --all-files
diff --git a/adi_function_app/README.md b/adi_function_app/README.md
index 1794059..722734f 100644
--- a/adi_function_app/README.md
+++ b/adi_function_app/README.md
@@ -42,7 +42,7 @@ Using the [Phi-3 Technical Report: A Highly Capable Language Model Locally on Yo
 ```json
 {
-    "content": "
<table>
<caption>Table 1: Comparison results on RepoQA benchmark.</caption>
<tr><th>Model</th><th>Ctx Size</th><th>Python</th><th>C++</th><th>Rust</th><th>Java</th><th>TypeScript</th><th>Average</th></tr>
<tr><td>gpt-4O-2024-05-13</td><td>128k</td><td>95</td><td>80</td><td>85</td><td>96</td><td>97</td><td>90.6</td></tr>
<tr><td>gemini-1.5-flash-latest</td><td>1000k</td><td>93</td><td>79</td><td>87</td><td>94</td><td>97</td><td>90</td></tr>
<tr><td>Phi-3.5-MoE</td><td>128k</td><td>89</td><td>74</td><td>81</td><td>88</td><td>95</td><td>85</td></tr>
<tr><td>Phi-3.5-Mini</td><td>128k</td><td>86</td><td>67</td><td>73</td><td>77</td><td>82</td><td>77</td></tr>
<tr><td>Llama-3.1-8B-Instruct</td><td>128k</td><td>80</td><td>65</td><td>73</td><td>76</td><td>63</td><td>71</td></tr>
<tr><td>Mixtral-8x7B-Instruct-v0.1</td><td>32k</td><td>66</td><td>65</td><td>64</td><td>71</td><td>74</td><td>68</td></tr>
<tr><td>Mixtral-8x22B-Instruct-v0.1</td><td>64k</td><td>60</td><td>67</td><td>74</td><td>83</td><td>55</td><td>67.8</td></tr>
</table>
\n\n\nsuch as Arabic, Chinese, Russian, Ukrainian, and Vietnamese, with average MMLU-multilingual scores\nof 55.4 and 47.3, respectively. Due to its larger model capacity, phi-3.5-MoE achieves a significantly\nhigher average score of 69.9, outperforming phi-3.5-mini.\n\nMMLU(5-shot) MultiLingual\n\nPhi-3-mini\n\nPhi-3.5-mini\n\nPhi-3.5-MoE\n\n\n\n\n\nWe evaluate the phi-3.5-mini and phi-3.5-MoE models on two long-context understanding tasks:\nRULER [HSK+24] and RepoQA [LTD+24]. As shown in Tables 1 and 2, both phi-3.5-MoE and phi-\n3.5-mini outperform other open-source models with larger sizes, such as Llama-3.1-8B, Mixtral-8x7B,\nand Mixtral-8x22B, on the RepoQA task, and achieve comparable performance to Llama-3.1-8B on\nthe RULER task. However, we observe a significant performance drop when testing the 128K context\nwindow on the RULER task. We suspect this is due to the lack of high-quality long-context data in\nmid-training, an issue we plan to address in the next version of the model release.\n\nIn the table 3, we present a detailed evaluation of the phi-3.5-mini and phi-3.5-MoE models\ncompared with recent SoTA pretrained language models, such as GPT-4o-mini, Gemini-1.5 Flash, and\nopen-source models like Llama-3.1-8B and the Mistral models. The results show that phi-3.5-mini\nachieves performance comparable to much larger models like Mistral-Nemo-12B and Llama-3.1-8B, while\nphi-3.5-MoE significantly outperforms other open-source models, offers performance comparable to\nGemini-1.5 Flash, and achieves above 90% of the average performance of GPT-4o-mini across various\nlanguage benchmarks.\n\n\n\n\n",
+    "content": "
<table>
<caption>Table 1: Comparison results on RepoQA benchmark.</caption>
<tr><th>Model</th><th>Ctx Size</th><th>Python</th><th>C++</th><th>Rust</th><th>Java</th><th>TypeScript</th><th>Average</th></tr>
<tr><td>gpt-4O-2024-05-13</td><td>128k</td><td>95</td><td>80</td><td>85</td><td>96</td><td>97</td><td>90.6</td></tr>
<tr><td>gemini-1.5-flash-latest</td><td>1000k</td><td>93</td><td>79</td><td>87</td><td>94</td><td>97</td><td>90</td></tr>
<tr><td>Phi-3.5-MoE</td><td>128k</td><td>89</td><td>74</td><td>81</td><td>88</td><td>95</td><td>85</td></tr>
<tr><td>Phi-3.5-Mini</td><td>128k</td><td>86</td><td>67</td><td>73</td><td>77</td><td>82</td><td>77</td></tr>
<tr><td>Llama-3.1-8B-Instruct</td><td>128k</td><td>80</td><td>65</td><td>73</td><td>76</td><td>63</td><td>71</td></tr>
<tr><td>Mixtral-8x7B-Instruct-v0.1</td><td>32k</td><td>66</td><td>65</td><td>64</td><td>71</td><td>74</td><td>68</td></tr>
<tr><td>Mixtral-8x22B-Instruct-v0.1</td><td>64k</td><td>60</td><td>67</td><td>74</td><td>83</td><td>55</td><td>67.8</td></tr>
</table>
\n\n\nsuch as Arabic, Chinese, Russian, Ukrainian, and Vietnamese, with average MMLU-multilingual scores\nof 55.4 and 47.3, respectively. Due to its larger model capacity, phi-3.5-MoE achieves a significantly\nhigher average score of 69.9, outperforming phi-3.5-mini.\n\nMMLU(5-shot) MultiLingual\n\nPhi-3-mini\n\nPhi-3.5-mini\n\nPhi-3.5-MoE\n\n\n\n\n\n We evaluate the phi-3.5-mini and phi-3.5-MoE models on two long-context understanding tasks:\nRULER [HSK+24] and RepoQA [LTD+24]. As shown in Tables 1 and 2, both phi-3.5-MoE and phi-\n3.5-mini outperform other open-source models with larger sizes, such as Llama-3.1-8B, Mixtral-8x7B,\nand Mixtral-8x22B, on the RepoQA task, and achieve comparable performance to Llama-3.1-8B on\nthe RULER task. However, we observe a significant performance drop when testing the 128K context\nwindow on the RULER task. We suspect this is due to the lack of high-quality long-context data in\nmid-training, an issue we plan to address in the next version of the model release.\n\n In the table 3, we present a detailed evaluation of the phi-3.5-mini and phi-3.5-MoE models\ncompared with recent SoTA pretrained language models, such as GPT-4o-mini, Gemini-1.5 Flash, and\nopen-source models like Llama-3.1-8B and the Mistral models. The results show that phi-3.5-mini\nachieves performance comparable to much larger models like Mistral-Nemo-12B and Llama-3.1-8B, while\nphi-3.5-MoE significantly outperforms other open-source models, offers performance comparable to\nGemini-1.5 Flash, and achieves above 90% of the average performance of GPT-4o-mini across various\nlanguage benchmarks.\n\n\n\n\n",
     "sections": [],
     "page_number": 7
 }
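For orientation, here is a minimal sketch of how a caller might stitch per-page results of this shape back into a single markdown document. It is illustrative only: the `pages.json` filename and the assumption that the output is a JSON list of page objects are hypothetical and not part of the function app's documented interface; only the `content`, `sections`, and `page_number` fields come from the example above.

```python
import json

# Hypothetical input: a JSON list of per-page objects shaped like the example
# above, each carrying "content", "sections" and "page_number".
with open("pages.json", encoding="utf-8") as f:
    pages = json.load(f)

# Reassemble the extracted markdown in page order.
document_markdown = "\n\n".join(
    page["content"]
    for page in sorted(pages, key=lambda p: p["page_number"])
)

print(f"Reassembled {len(pages)} pages into {len(document_markdown)} characters of markdown.")
```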