This folder hosts the notebooks and code (in Python) used in the different tutorials and hands-on sessions. The proposed setups and contents of the sessions are described below.

  • Week 1: Familiarization with BERT-like models, using the transformers package. Generation of embedding vectors and visualization; applications to word sense disambiguation and the exploration of semantic shifts.
  • Week 2: Topic Modeling: follow a step-by-step implementation of a (simplified) version of BERTopic relying on sentence_transformers model representations, and compare the output of different topic models. Experiments illustrated with a corpus of 19th-century American recipes and UN General Debate speeches.
  • Week 3: Supervised Learning: tutorial on fine-tuning BERT-like models, applied to book genre prediction (compared with document-representation-based baselines). Hands-on applied to "literary canon" prediction: design your own classifier and reflect on fairness issues in ML.
  • Week 4: Generative LLM interactions: tutorial on how to interact with LLMs (via diverse APIs), and hands-on session on devising a questionnaire to assess LLM behavior.

Setups

Feel free to use the notebooks either locally or via hosted services such as Binder and Google Colab.

Running on your machine

You can use the requirements.txt file provided at the root of this repository. In your virtual environment, cd to the repository root and run:

pip install -r requirements.txt

Binder

You can launch the projects on Binder.

Warning

It can take some time to build the image on Binder.

Binder is handy for getting the repository into a hosted JupyterLab environment. However, it does not provide extensive memory or computational resources, so it is not suited to manipulating large datasets or to using pre-trained language models with a large number of parameters.

Colab

The notebooks are also available on Google Colab, which offers a convenient way to run the experiments and provides free-tier computational resources that should be sufficient for the content of this course (including GPU and TPU runtime access).

Content

Week 1 — 29.10

  • Discover_BERT.ipynb: Familiarize yourself with BERT-like models. Overview of the architecture and visualisation of the attention mechanism.
  • Tutorial_1_WSD.ipynb: Familiarize yourself with BERT-like models. Generation of embedding vectors and visualisation, exemplified with a Word Sense Disambiguation application.
  • Hands-on_1_SS.ipynb: Reproduce and expand the tutorial's content. Explore semantic shifts through the lens of language models, based on historical newspaper data from the Living With Machines initiative.

Main libraries: transformers, bertviz, (scikit-learn, pandas, altair)
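As a sketch of the WSD mechanic covered in Tutorial_1_WSD.ipynb: once contextual embeddings have been extracted with a BERT-like model, a word occurrence can be assigned to the sense whose prototype vector it is closest to. The vectors and sense names below are toy stand-ins (real BERT hidden states are 768-dimensional), not the tutorial's data:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def disambiguate(target_vec, sense_prototypes):
    """Pick the sense whose prototype is closest to the target's contextual embedding."""
    return max(sense_prototypes, key=lambda s: cosine(target_vec, sense_prototypes[s]))

# Toy 3-d "embeddings" standing in for real contextual vectors.
senses = {
    "financial_institution": [0.9, 0.1, 0.0],
    "river_side":            [0.1, 0.9, 0.2],
}
context_vec = [0.8, 0.2, 0.1]   # e.g. "bank" in "deposit money at the bank"
print(disambiguate(context_vec, senses))  # → financial_institution
```

In practice the prototypes would be averages of contextual embeddings over sense-annotated occurrences, and the similarity computation stays exactly this.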

To go further

Week 2 — 05.11

  • Tutorial_2_MyBERTopic.ipynb: Implement your own (simplified) version of BERTopic and explore a corpus of 19th century recipes.
  • Hands_on_2_CompareTM.ipynb: Apply different topic modeling algorithms to a corpus of UN General Debate speech transcripts from 1946 until today. Explore variation across time and geography, and try to find the best methods to decipher what is discussed during these assemblies!

Main libraries: sentence_transformers, BERTopic, gensim, pyLDAvis, sklearn, umap, hdbscan
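The core of a simplified BERTopic is class-based TF-IDF: all documents in a cluster are merged into one pseudo-document before weighting words. A stdlib-only sketch of that step, assuming cluster assignments and tokenised documents are already available (e.g. from UMAP + HDBSCAN over sentence embeddings):

```python
import math
from collections import Counter

def c_tf_idf(class_docs):
    """class_docs: {topic_id: list of tokenised docs}. Returns {topic_id: {word: score}}.
    Class-based TF-IDF: tf(word, class) * log(1 + avg_words_per_class / total_freq(word))."""
    class_tf = {c: Counter(w for doc in docs for w in doc)
                for c, docs in class_docs.items()}
    total_freq = Counter()
    for tf in class_tf.values():
        total_freq.update(tf)
    avg_words = sum(total_freq.values()) / len(class_tf)  # mean words per class
    return {
        c: {w: tf[w] * math.log(1 + avg_words / total_freq[w]) for w in tf}
        for c, tf in class_tf.items()
    }

# Toy clusters standing in for recipe / speech documents.
clusters = {
    0: [["flour", "sugar", "bake"], ["bake", "oven"]],
    1: [["treaty", "peace"], ["peace", "assembly"]],
}
scores = c_tf_idf(clusters)
top_word_0 = max(scores[0], key=scores[0].get)
print(top_word_0)  # → bake
```

Words frequent within one cluster but rare overall get the highest scores, which is what makes the top-scored words usable as topic labels.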

To go further

Week 3 — 12.11

  • Tutorial_3_SFT.ipynb: Fine-tune a BERT-like model for literary genre classification based on 5-sentence-long book chunks. Compare the results with classification performed by standard classifiers trained on document representations (BoW, TF-IDF, SentenceTransformers' embeddings).
  • Hands_on_3_CanonChallenge.ipynb: Your turn to devise a classifier for "canonicity" prediction based on 5-sentence-long excerpts of French-language novels. Reflect on the data, the models, and the fairness implications of both. Implement your classifier and submit your predictions to the Performance & Fairness class shared task: https://tinyurl.com/canon-pf!

Main libraries: transformers, pytorch, sklearn
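The document-representation baseline that the fine-tuned model is compared against can be sketched with scikit-learn. The texts and genre labels below are toy stand-ins for the course's book chunks, not the actual data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins for 5-sentence book chunks and their genre labels.
texts = [
    "the detective examined the scene for clues",
    "the ship jumped through hyperspace toward the colony",
    "the inspector questioned every witness in the manor",
    "robots patrolled the station under the twin moons",
]
labels = ["crime", "scifi", "crime", "scifi"]

# TF-IDF document representation + linear classifier:
# the simple baseline a fine-tuned BERT-like model should beat.
baseline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
baseline.fit(texts, labels)
pred = baseline.predict(["the detective found a clue in the manor"])[0]
print(pred)  # → crime
```

Swapping TfidfVectorizer for CountVectorizer gives the BoW variant; replacing the vectorizer output with SentenceTransformers embeddings gives the third baseline mentioned above.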

To go further

Week 4 — 19.11

  • Tutorial_4_LLM_Interaction.ipynb: Learn how to use open-weight LLMs via the transformers library, run and query LLMs locally with ollama, or interact with diverse providers through APIs and requests.
  • Hands_on_4_EvalLLM.ipynb: Write a multiple choice questionnaire and apply it to LLMs.

Main libraries: transformers, ollama, openai, requests
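Interacting with a provider via requests amounts to building a JSON payload for an OpenAI-compatible chat endpoint. A sketch for one multiple-choice questionnaire item; the base URL, API key, and model name are hypothetical placeholders:

```python
import json

def build_chat_request(base_url, api_key, model, question, choices):
    """Build an OpenAI-compatible /chat/completions request for one
    multiple-choice question. Returns (url, headers, body)."""
    prompt = question + "\n" + "\n".join(
        f"{letter}. {text}" for letter, text in zip("ABCD", choices)
    ) + "\nAnswer with a single letter."
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,  # deterministic answers make scoring easier
    }
    headers = {"Authorization": f"Bearer {api_key}",
               "Content-Type": "application/json"}
    return f"{base_url}/chat/completions", headers, json.dumps(payload)

url, headers, body = build_chat_request(
    "https://api.example.com/v1", "sk-...", "some-model",
    "In which year did the UN General Assembly first meet?",
    ["1919", "1946", "1951", "1965"],
)
# The actual call would then be: requests.post(url, headers=headers, data=body)
```

Collecting the single-letter answers over all questions, model by model, is the basis of the Hands_on_4_EvalLLM.ipynb questionnaire evaluation.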

To go further

Revisit previous sessions with LLMs!

  • Week_1:

    • Re-annotate the data with an LLM and observe potential differences; measure agreement with human annotators using the κ index
    • Extract features from a generative LLM instead of BERT
    • Improve the OCRed text by prompting LLMs (and find methods to evaluate the improvement)
  • Week_2:

    • Add an LLM-based component to summarize topics or provide more meaningful topic labels
    • Replace the document embedder of BERTopic with features extracted from an LLM
  • Week_3:

    • Prompt LLMs to do zero-/few-shot classification (try diverse prompts, numbers of examples, etc.)
      • of book genres
      • or canonicity (upload your predictions on the Shared Task app!)
  • Glimpse at Data Curation for LLM training: Colab notebook to explore data curation process: language identification & quality filtering, by Rose E Wang.

  • Interrogating a National Narrative with GPT-2: Using Generated Texts to Interrogate the Brexit Narrative (Lesson), by Chantal Brousseau.

  • Text Classification using LLMs: Using the skorch library for zero-shot classification with LLMs.
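The κ agreement suggested in the Week 1 revisit above can be computed without extra dependencies; a minimal stdlib sketch of Cohen's kappa between a human annotator and an LLM (the label sequences are toy examples):

```python
from collections import Counter

def cohens_kappa(human, llm):
    """Cohen's kappa between two annotators' label sequences:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    assert len(human) == len(llm)
    n = len(human)
    observed = sum(h == m for h, m in zip(human, llm)) / n
    h_counts, m_counts = Counter(human), Counter(llm)
    labels = set(human) | set(llm)
    expected = sum(h_counts[l] * m_counts[l] for l in labels) / (n * n)
    return (observed - expected) / (1 - expected)

human = ["pos", "pos", "neg", "neg", "pos", "neg"]
llm   = ["pos", "neg", "neg", "neg", "pos", "pos"]
print(round(cohens_kappa(human, llm), 3))  # → 0.333
```

The same value is returned by scikit-learn's cohen_kappa_score, which is the more convenient choice inside the notebooks.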