This folder hosts the notebooks and code (in Python) used in the different tutorials and hands-on sessions. The proposed set-ups and the contents of the sessions are described below.
- Week 1: Familiarization with BERT-like models, using the `transformers` package. Generation of embedding vectors and visualization; applications to word sense disambiguation and semantic shift exploration.
- Week 2: Topic Modeling: follow a step-by-step implementation of a (simplified) version of BERTopic relying on `sentence_transformers` model representations, and compare the output of different topic models. Experiments illustrated with a corpus of 19th-century American recipes and UN General Debate speeches.
- Week 3: Supervised Learning: tutorial on BERT-like model fine-tuning applied to book genre prediction (compared with document representation-based baselines). Hands-on session applied to "literary canon" prediction: design your own classifier and reflect on fairness issues in ML.
- Week 4: Generative LLM interactions: tutorial on how to interact with LLMs (via diverse APIs), and hands-on session on devising a questionnaire to assess LLM behaviors.
Feel free to use the notebooks either locally or via hosted services such as Jupyter Binder and Google Colab.
You can use the `requirements.txt` file provided at the root of this repository. In your virtual environment, `cd` to the repo root and run `pip install -r requirements.txt`.

You can launch the projects on Binder:
> **Warning:** It can take some time to build the image on Binder.
Binder can be handy for working with the repository in a hosted jupyter-lab environment. However, it does not provide extensive memory or computational resources, so it is not suited to manipulating large data or to using pre-trained language models with a large number of parameters.
The notebooks are also provided on Google Colab, which offers a convenient way to run the experiments, with free-tier computational resources (including GPU and TPU runtime access) that should be sufficient for the content of this course.
- `Discover_BERT.ipynb`: Familiarize yourself with BERT-like models. Overview of the architecture and visualisation of the attention mechanism.
- `Tutorial_1_WSD.ipynb`: Familiarize yourself with BERT-like models. Generation of embedding vectors and visualisation, exemplified with a Word Sense Disambiguation application.
- `Hands-on_1_SS.ipynb`: Reproduce and expand the tutorial's content. Explore semantic shifts through the LM's lens, based on historical newspaper data from the Living with Machines initiative.
Main libraries: `transformers`, `bertviz` (plus `scikit-learn`, `pandas`, `altair`)
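As a taste of what the Week 1 notebooks do, here is a minimal sketch of extracting a contextual embedding for a single word with `transformers`. The checkpoint (`bert-base-uncased`) and the example sentence are illustrative choices, not necessarily those used in the notebooks.

```python
# Sketch: get the contextual embedding of one word from a BERT-like model.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentence = "The bank raised its interest rates."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Last-layer hidden states: one vector per (sub)token.
hidden = outputs.last_hidden_state[0]      # shape: (seq_len, 768)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
bank_vec = hidden[tokens.index("bank")]    # contextual embedding of "bank"
print(bank_vec.shape)                      # torch.Size([768])
```

The same word in a different sentence ("river bank") yields a different vector, which is what the word sense disambiguation exercise exploits.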
To go further
- Implement the attention mechanism from the cornerstone paper Attention Is All You Need (Vaswani et al., 2017) in a Colab notebook by Alexander "Sasha" Rush, or read his detailed walkthrough post: The Annotated Transformer.
- `Tutorial_2_MyBERTopic.ipynb`: Implement your own (simplified) version of BERTopic and explore a corpus of 19th-century recipes.
- `Hands_on_2_CompareTM.ipynb`: Apply different topic modeling algorithms to a corpus of UN General Debate speech transcripts from 1946 until today. Explore time and space, and try to find the best methods to decipher what is discussed during these assemblies!
Main libraries: `sentence_transformers`, `BERTopic`, `gensim`, `pyLDAvis`, `sklearn`, `umap`, `hdbscan`
To go further
- Tutorial - Topic Modeling with BERTopic: A tutorial and overview of the different functionalities of `BERTopic` (author unknown).
- Tutorial - LDA Topic Modeling with `sklearn` and visualization with `pyLDAvis`.
- Understanding and Using Common Similarity Measures for Text Analysis: A detailed tutorial on computing distances between text documents (using BoW-like representations) in Python, applied to data from the EarlyPrint initiative. © John R. Ladd (2020).
- `Tutorial_3_SFT.ipynb`: Fine-tune a BERT-like model for literary genre classification based on 5-sentence-long book chunks. Compare the results with classification performed by standard classifiers trained on document representations (BoW, TF-IDF, SentenceTransformers embeddings).
- `Hands_on_3_CanonChallenge.ipynb`: Your turn to devise a classifier for "canonicity" prediction based on 5-sentence-long excerpts from French-language novels. Reflect on the data, the models, and the fairness implications of both. Implement your classifier and submit your predictions to the Performance & Fairness class shared task: https://tinyurl.com/canon-pf!
Main libraries: `transformers`, `pytorch`, `sklearn`
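The document-representation baselines mentioned above can be sketched with a standard scikit-learn pipeline. The toy texts and genre labels below are illustrative, not the course data:

```python
# Sketch of a TF-IDF + logistic regression baseline for genre prediction.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "a detective examined the crime scene",
    "the starship left orbit at dawn",
    "the inspector found a hidden clue",
    "aliens landed on the frozen moon",
]
genres = ["crime", "scifi", "crime", "scifi"]

baseline = make_pipeline(TfidfVectorizer(), LogisticRegression())
baseline.fit(texts, genres)
print(baseline.predict(["the inspector examined a clue"]))
```

In the fine-tuning tutorial, the same interface (fit on texts, predict labels) is what the BERT-based classifier is compared against.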
To go further
- Tutorial: Fine-tuning: Fine-tuning a Code LLM on Custom Code on a single GPU, by Maria Khalusova.
- Tutorial: Interpreting BERT's classification decisions: Interpreting the Prediction of BERT Model for Text Classification, by Ruben Winastwan. (Blog post | Notebook)
- Fairness with the `dalex` Python package.
- `Tutorial_4_LLM_Interaction.ipynb`: Learn how to use open-weight LLMs via the `transformers` library, run and query LLMs locally with `ollama`, or interact with diverse providers through APIs and `requests`.
- `Hands_on_4_EvalLLM.ipynb`: Write a multiple-choice questionnaire and apply it to LLMs.
Main libraries: `transformers`, `ollama`, `openai`, `requests`
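The simplest of the interaction routes covered in the tutorial is loading an open-weight model through the `transformers` text-generation pipeline. The checkpoint below (`sshleifer/tiny-gpt2`) is a toy model chosen only so the sketch runs quickly; in practice you would load a real instruction-tuned model (or query one via `ollama` or a provider API).

```python
# Sketch: query a causal LM locally via the transformers pipeline.
from transformers import pipeline

generator = pipeline("text-generation", model="sshleifer/tiny-gpt2")
out = generator("The UN General Debate is", max_new_tokens=10)
print(out[0]["generated_text"])  # prompt + (here, nonsense) continuation
```

The `ollama` and API routes differ mainly in where the model runs; the prompt-in, text-out interface is the same idea.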
To go further
Revisit previous sessions with LLMs!

- Week 1:
  - Re-annotate the data with an LLM and observe potential differences; measure agreement with humans using the κ index.
  - Extract features from a generative LLM instead of BERT.
  - Improve the OCRed text via prompting LLMs (and find methods to evaluate the improvement).
- Week 2:
  - Add an LLM-based component to summarize topics or provide more meaningful topic labels.
  - Replace the document embedder of BERTopic with features extracted from an LLM.
- Week 3:
  - Prompt LLMs to do zero-/few-shot classification (try diverse prompts, numbers of examples, etc.):
    - of book genres,
    - or canonicity (upload your predictions to the Shared Task app!).
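For the agreement measurement suggested above, Cohen's κ is available in scikit-learn. The annotations below are made up for illustration:

```python
# Sketch: Cohen's kappa between human and (hypothetical) LLM annotations.
from sklearn.metrics import cohen_kappa_score

human = ["pos", "neg", "pos", "neg", "pos"]
llm   = ["pos", "neg", "neg", "neg", "pos"]

kappa = cohen_kappa_score(human, llm)
print(round(kappa, 2))  # -> 0.62 (raw agreement is 0.8, corrected for chance)
```

Unlike raw accuracy, κ discounts the agreement expected by chance from the label distributions, which matters when one label dominates.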
- Glimpse at Data Curation for LLM training: a Colab notebook to explore the data curation process (language identification & quality filtering), by Rose E. Wang.
- Interrogating a National Narrative with GPT-2: Using Generated Texts to Interrogate the Brexit Narrative (lesson), by Chantal Brousseau.
- Text Classification using LLMs: using the `skorch` library for zero-shot classification with LLMs.