This repo contains the complete DQW streamlit app code, however, the streamlit apps have been split into 5 for maintenance purposes:
- Main Streamlit app 📊
- Tabular Data Section 🏗️
- Audio Data Section 🎶
- Text Data Section 📚
- Image Data Section 🖼️
This application was built in the ITEA IVVES project and is an accelerator of the Sogeti Quality AI Framework, providing methods and advice on the preprocessing steps to accelerate and ensure transparency of the data preprocessing pipeline.
The DQW can be applied to the following data structures:
- Structured data. The most common data format used in data science, be it in finance, health, biotechnology, cybersecurity, etc. Since structured data is difficult for humans to grasp, it's very important to make it understandable prior to preparing it for training.
- Unstructured data. Unstructured data is easily understandable to humans, but needs to be thoroughly processed to use as training data for ML: Images used in computer vision algorithms such as object detection and classification; Text used in NLP models, be it for classification or sentiment analysis; Audio used in audio signal processing algorithms such as music genre recognition and automatic speech recognition.
- Synthetic data. Synthetic data evaluation is a critical step of the synthetic data generation pipeline. Validating the synthetic data training set to be used in an ML algorithm ensures model performance will not be impacted negatively. To evaluate synthetic data, you also need a portion of real data to compare it to.
The packages used in the application are in the table below.
App section | Description | Visualisation | Selection | Package |
---|---|---|---|---|
Synthetic tabular | x | x | table-evaluator | |
Tabular | x | x | sweetviz | |
Tabular | x | x | pandas-profiling | |
Tabular, text | x | PyCaret | ||
Text | x | NLTK | ||
Text | x | SpaCy | ||
Text | x | x | TextBlob | |
Text | x | x | WordCloud | |
Text | x | x | TextStat | |
Image | x | x | Pillow | |
Audio | x | x | librosa | |
Audio | x | x | dtw | |
Audio | x | audiomentations | ||
Audio | x | x | AudioAnalyser | |
Report generation | x | Fpdf | ||
Report generation | x | wkhtmltopdf | ||
Report generation | x | pdfkit |
Key points addressed:
- Quantitative measures – number of rows and columns.
- Qualitative measures – column types.
- Descriptive statistics with NumPy for numeric columns, for example, count, mean, percentiles and standard deviation. For discrete columns, count, unique, top and frequency.
- Explore missing data.
- Examine outliers.
- Mitigate class imbalance.
- Compare datasets, like train, test and evaluate data.
- Evaluate synthetic datasets.
- Create a quality report.
To complete the key points, 4 subsections are created:
- One file EDA with pandas-profiling
- One file preporcessing with PyCaret
- Two file comparison with Sweetviz
- Synthetic data evaluation with table-evaluator
- In all the sections, there is an option to download a pdf/zip of the results
Key points addressed:
- Frequency - Count most common words with WordCloud package in Python. This is the quickest way of seeing what the handled data contents are, in addition, it provides visualisation in form of word clouds.
- Analyse sentiment with TextBlob in case of classification tasks. We can investigate the polarity of the text and represent it in form of bar graphs.
- Investigate readability of data with Textstat, typically used for determining readability, complexity, and grade level of a corpus.
- Topic analysis.
- Provide an automated preprocessing based on methods on the market.
To complete the key points, the following subsections are created:
- Preprocessing of the data with file download option
- Basic analysis methods like number of unique words, characters
- N-gram analysis with NLTK
- PoS tagging with NLTK
- NER with SpaCy
- Topic analysis with LDA, including optimal number of topics generation with u_mass coherence score
Key points addressed:
- Provide EDA of audio files
- Augment audio files to showcase methods of increasing robustness of the audio dataset
- Provide methods of comparison of two files
To complete the key points, the following subsections are created:
- One file EDA with librosa
- One file augmentation with audiomentations
- Two file comparison with DTW
- Two file comparsion with audio_compare method
Key points addressed:
- EDA of images
- Augmentation
Demo files have been provided in the /demo_data folder. Try out the app with them.
-
Installation process:
Create virtual environment and activate it - https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/
Clone or download files from this repo
Run pip install -r requirements.txt
Run streamlit run app.py to launch app
-
Software dependencies:
In requirements.txt
-
Latest releases
Use app.py