NLP-coursework

Coursework material for an introduction to NLP

Syllabus

Objectives

This course aims to provide a concrete toolkit for data scientists and software engineers who want to perform natural language processing tasks. By the end of the course, students should be able to develop a full NLP pipeline:

Preprocess and analyze raw text data
Represent text data as vectors and manipulate them
Train a supervised model to perform standard NLP tasks (e.g. classification, Named Entity Recognition)
Serve the model for inference on new data

In addition, the students will learn to leverage pre-trained models (e.g. Word2Vec, BERT).

This course aims to be practical. We will therefore work with real-world raw datasets from open data sources such as data.gouv.fr. Students will also package their Python code into a library and share it in a GitHub repository.

Prerequisites

Python: some knowledge of the main ML libraries (pandas, NumPy, scikit-learn)
Basic linear algebra and statistics skills

Tools

Environment: Jupyter Notebook, Google Colab, GitHub
Python libraries: Keras / TensorFlow, spaCy, scikit-learn, ...

Organisation

During each session, students will start from a Jupyter Notebook template and write their own code. Each template gives guidelines and toy examples relevant to the topic of the session. Additional resources will be shared between sessions to introduce the theoretical concepts.

Content - Part A

Data

We will work on the "avis et conseils de la CADA" (CADA opinions and advice) dataset from data.gouv.fr. This dataset contains documents from French public authorities. Each document is associated with an opinion that falls into one of three categories: favorable, unfavorable, or neutral.

Objective

This is a supervised classification task. The objective is to classify these documents into one of the above categories.

Data exploration

Analyze the shape of the documents, the distribution of words, topics (topic modelling), etc.
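
As a starting point for this exploration, here is a minimal sketch that uses only the Python standard library. The three documents are hypothetical stand-ins for the CADA corpus, and the `simple_tokenize` helper is an assumption made for illustration, not part of the course templates.

```python
import re
import statistics
from collections import Counter

# Hypothetical toy documents standing in for the real corpus.
documents = [
    "La commission émet un avis favorable à la communication du document.",
    "La commission émet un avis défavorable.",
    "La demande est sans objet.",
]


def simple_tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


tokenized = [simple_tokenize(doc) for doc in documents]

# Shape of the documents: number of tokens per document.
lengths = [len(tokens) for tokens in tokenized]

# Distribution of words over the whole corpus.
word_counts = Counter(token for tokens in tokenized for token in tokens)

print("tokens per document:", lengths)
print("mean length:", statistics.mean(lengths))
print("most common words:", word_counts.most_common(3))
```

On the real dataset the same counts would more likely be computed with pandas, and frequent function words such as "la" would typically be filtered out before looking at the distribution.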

Word representation

Build a preprocessing pipeline:

Tokenization
Text normalisation (e.g. lemmatisation)
Vector representation
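
The three steps above can be sketched as follows. This is an illustrative, standard-library-only version. Normalisation here is lowercasing plus accent stripping, a crude stand-in for real lemmatisation (which would need a library such as spaCy). The vector representation is a plain bag-of-words count. All function names are hypothetical.

```python
import re
import unicodedata
from collections import Counter


def tokenize(text):
    """Split raw text into word tokens."""
    return re.findall(r"\w+", text)


def normalize(token):
    """Lowercase a token and strip accents (a stand-in for lemmatisation)."""
    decomposed = unicodedata.normalize("NFKD", token.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def preprocess(text):
    """Tokenize a text and normalise each token."""
    return [normalize(token) for token in tokenize(text)]


def build_vocabulary(token_lists):
    """Map each distinct token to a column index, in sorted order."""
    vocab = sorted({token for tokens in token_lists for token in tokens})
    return {token: index for index, token in enumerate(vocab)}


def vectorize(tokens, vocabulary):
    """Return a bag-of-words count vector; unknown tokens are ignored."""
    counts = Counter(tokens)
    return [counts.get(token, 0) for token in vocabulary]


# Hypothetical toy documents.
docs = ["Avis favorable", "Avis défavorable, avis réservé"]
token_lists = [preprocess(doc) for doc in docs]
vocabulary = build_vocabulary(token_lists)
vectors = [vectorize(tokens, vocabulary) for tokens in token_lists]

print(list(vocabulary))
print(vectors)
```

Each document becomes a vector with one column per vocabulary word, which is the input format most simple classifiers expect.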

Modelling

Baseline model

Let's build a first baseline model with a simple ML classifier and evaluate it on a holdout sample.
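
To make the idea of a baseline concrete, here is a small multinomial Naive Bayes classifier written from scratch and scored on a holdout split. This is only a sketch under stated assumptions. The toy data is hypothetical, the choice of Naive Bayes is ours and not necessarily the classifier used in the notebooks, and in practice a scikit-learn estimator would play this role.

```python
import math
import random
from collections import Counter, defaultdict


def train_naive_bayes(token_lists, labels):
    """Collect the counts needed by a multinomial Naive Bayes model."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(token_lists, labels):
        word_counts[label].update(tokens)
    vocabulary = {token for tokens in token_lists for token in tokens}
    return class_counts, word_counts, vocabulary


def predict(model, tokens):
    """Return the class with the highest log-posterior (add-one smoothing)."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, doc_count in class_counts.items():
        score = math.log(doc_count / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokens:
            if token in vocabulary:  # skip words never seen in training
                score += math.log(
                    (word_counts[label][token] + 1)
                    / (total_words + len(vocabulary))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical, already preprocessed toy data: (tokens, label) pairs.
data = [
    (["avis", "favorable", "communication", "document"], "favorable"),
    (["avis", "favorable", "transmission"], "favorable"),
    (["avis", "defavorable", "refus", "communication"], "unfavorable"),
    (["avis", "defavorable", "secret"], "unfavorable"),
    (["demande", "sans", "objet"], "neutral"),
    (["demande", "irrecevable", "sans", "objet"], "neutral"),
]

# Shuffle with a fixed seed, then keep the last third as the holdout sample.
shuffled = data[:]
random.Random(0).shuffle(shuffled)
split = len(shuffled) * 2 // 3
train, holdout = shuffled[:split], shuffled[split:]

model = train_naive_bayes([t for t, _ in train], [l for _, l in train])
correct = sum(predict(model, tokens) == label for tokens, label in holdout)
accuracy = correct / len(holdout)
print(f"holdout accuracy: {accuracy:.2f}")
```

With only six documents the accuracy figure is meaningless. The point is the workflow: fit on the training split only, then score on documents the model has never seen.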

Deep Learning

Build a deep learning classifier with various configurations.

Pre-trained embeddings

Can we perform better by leveraging the linguistic properties captured by a pre-trained model?
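
One common way to use pre-trained word embeddings in a classifier is to average the vectors of the words in a document and use the result as its feature vector. The sketch below illustrates this with a hypothetical, hand-written 3-dimensional embedding table. Real vectors (e.g. Word2Vec) have hundreds of dimensions and are loaded from a file. The function names are assumptions for illustration.

```python
import math

# Hypothetical "pre-trained" vectors, hand-written for illustration only.
embeddings = {
    "favorable": [1.0, 0.0, 0.0],
    "positif": [0.9, 0.1, 0.0],
    "defavorable": [0.0, 1.0, 0.0],
    "avis": [0.0, 0.0, 1.0],
}


def document_vector(tokens, embeddings, dim=3):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [embeddings[token] for token in tokens if token in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(values) / len(known) for values in zip(*known)]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


doc_vec = document_vector(["avis", "favorable"], embeddings)
print(doc_vec)
print(cosine_similarity(embeddings["favorable"], embeddings["positif"]))
print(cosine_similarity(embeddings["favorable"], embeddings["defavorable"]))
```

In this toy table, words with similar meanings have nearby vectors. That property lets a classifier generalise to words it saw rarely, or never, in the labelled training set.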

Content - Part B

In the next part we will perform a Named Entity Recognition task.

To be completed

Bibliography

TBD

