Scribe-Data is a convenient command-line interface (CLI) for extracting and formatting language data from Wikidata and Wikipedia. It lets users list, download, and manage language data directly from the terminal.
Note
The contributing section has information for those interested, and the articles and presentations in Featured By are also good resources for learning more about Scribe.
Scribe applications are available on iOS, Android (WIP) and Desktop (planned).
Check out Scribe's architecture diagrams for an overview of the organization including our applications, services and processes. It depicts the projects that Scribe is developing as well as the relationships between them and the external systems with which they interact. Also check out the Wikidata and Scribe Guide for an overview of Wikidata and getting language data from it.
Process ⇧
The CLI commands defined within scribe_data/cli and the notebooks within the various scribe_data directories are used to update all data for Scribe-iOS, with this functionality later being expanded to update Scribe-Android and Scribe-Desktop once they're active.
The main data update process triggers language-based SPARQL queries that retrieve language data from Wikidata using SPARQLWrapper. The autosuggestion process derives popular words from Wikipedia, as well as the words that normally follow them, as an effective baseline feature until natural language processing methods are employed. Functions to generate autosuggestions are run in gen_autosuggestions.ipynb. Emojis are further sourced from Unicode CLDR, with this process being run via the scribe-data get -lang LANGUAGE -dt emoji-keywords command.
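The autosuggestion idea described above (popular words plus the words that normally follow them) can be illustrated with a simple bigram count. This is a minimal sketch only, not the code in gen_autosuggestions.ipynb: the function name build_autosuggestions and its parameters are hypothetical, and the real process works on Wikipedia text rather than a short sample string.

```python
import re
from collections import Counter, defaultdict


def build_autosuggestions(text, num_words=3, num_suggestions=2):
    """Map the most frequent words in `text` to the words that most often follow them.

    Hypothetical helper for illustration; not part of the Scribe-Data API.
    """
    # Lowercase the text and keep runs of letters only (Unicode aware).
    tokens = re.findall(r"[^\W\d_]+", text.lower())

    # Count how often each word occurs and which words follow it.
    word_counts = Counter(tokens)
    followers = defaultdict(Counter)
    for current_word, next_word in zip(tokens, tokens[1:]):
        followers[current_word][next_word] += 1

    # For each popular word, keep its most common followers.
    return {
        word: [w for w, _ in followers[word].most_common(num_suggestions)]
        for word, _ in word_counts.most_common(num_words)
    }


sample_text = "the cat sat on the mat and the cat slept"
print(build_autosuggestions(sample_text, num_words=1, num_suggestions=2))
```

A frequency table like this needs no trained model, which is why it can serve as a baseline before natural language processing methods are introduced.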
CLI Usage ⇧
Scribe-Data provides a command-line interface (CLI) for efficient interaction with its language data functionality. Please see the usage guide or the official documentation for detailed instructions.
To utilize the Scribe-Data CLI, you can execute the following command in your terminal:
scribe-data [command] [options]
- list (l): Enumerate available languages, data types and their combinations.
- get (g): Retrieve data from Wikidata for specified languages and data types.
- total (t): Display the total available data for given languages and data types.
- convert (c): Transform data returned by Scribe-Data into different file formats.
# Commands used in the above GIF:
scribe-data list --language
scribe-data list --data-type
scribe-data get --language English --data-type verbs -od ./scribe-data
scribe-data total --language English
# Commands used in the above GIF:
scribe-data get -i
scribe-data total -i
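When the CLI needs to be driven from a script rather than typed by hand, the commands shown above can be assembled as argument lists. The following is a rough sketch: the helper names build_get_command and run_get are hypothetical, and only the get subcommand and the --language, --data-type and -od options that appear in the examples above are used.

```python
import shutil
import subprocess


def build_get_command(language, data_type, output_dir):
    """Return the argument list for a `scribe-data get` call (hypothetical helper)."""
    return [
        "scribe-data",
        "get",
        "--language",
        language,
        "--data-type",
        data_type,
        "-od",
        output_dir,
    ]


def run_get(language, data_type, output_dir):
    """Run the command if the scribe-data executable is on the PATH."""
    if shutil.which("scribe-data") is None:
        raise RuntimeError("scribe-data is not installed in this environment")
    # check=True raises CalledProcessError if the CLI exits with an error.
    subprocess.run(build_get_command(language, data_type, output_dir), check=True)


# Print the command that would be executed for English verbs.
print(" ".join(build_get_command("English", "verbs", "./scribe-data")))
```

Passing the arguments as a list avoids shell quoting problems for language names or paths that contain spaces.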
Contributing ⇧
Scribe uses Matrix for communications. You're more than welcome to join us in our public chat rooms to share ideas, ask questions or just say hi :)
Please see the contribution guidelines and Wikidata and Scribe Guide if you are interested in contributing to Scribe-Data. Work that is in progress or could be implemented is tracked in the issues and projects.
Note
Just because an issue is assigned on GitHub doesn't mean that the team isn't interested in your contribution! Feel free to write in the issues and we can potentially reassign it to you.
Those interested can further check the -next release- and -priority- labels in the issues for those that are most important, as well as those marked good first issue that are tailored for first-time contributors.
After your first few pull requests, organization members would be happy to discuss granting you further rights as a contributor, with a maintainer role then being possible after continued interest in the project. Scribe seeks to be an inclusive and supportive organization. We'd love to have you on the team!
Ways to Help ⇧
- Reporting bugs as they're found 🐞
- Working on new features ✨
- Documentation for onboarding and project cohesion 📝
- Adding language data to Scribe-Data via Wikidata! 🗃️
Road Map ⇧
The Scribe road map can be followed in the organization's project board where we list the most important issues along with their priority, status and an indication of which sub projects they're included in (if applicable).
Note
Consider joining our bi-weekly developer syncs!
Data Edits ⇧
Note
Please see the Wikidata and Scribe Guide for an overview of Wikidata and how Scribe uses it.
Scribe does not accept direct edits to the grammar JSON files as they are sourced from Wikidata. Edits can be discussed, and the queries themselves will be changed and run before an update. If there is a problem with one of the files, then the fix should be made on Wikidata and not on Scribe. Feel free to let us know that edits have been made by opening a data issue and we'll be happy to integrate them!
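Because the grammar files are regenerated from query results, a correction made on Wikidata reaches Scribe the next time the queries are run. As a rough illustration of that kind of query (not one of Scribe's actual queries), the sketch below builds a small SPARQL query for lexemes and sends it with SPARQLWrapper. The helper names, the query shape and the user agent string are assumptions; Q1860 (English) and Q1084 (noun) are Wikidata item IDs.

```python
QUERY_TEMPLATE = """
SELECT ?lexeme ?lemma WHERE {{
  ?lexeme dct:language wd:{language_qid} ;
          wikibase:lexicalCategory wd:{category_qid} ;
          wikibase:lemma ?lemma .
}}
LIMIT {limit}
"""


def build_lexeme_query(language_qid, category_qid, limit=10):
    """Build a SPARQL query for lexemes of one language and lexical category."""
    return QUERY_TEMPLATE.format(
        language_qid=language_qid, category_qid=category_qid, limit=int(limit)
    ).strip()


def fetch_lemmas(query):
    """Send the query to the Wikidata Query Service and return the lemmas.

    Requires network access and the third-party SPARQLWrapper package,
    which is imported here so the query builder works without it.
    """
    from SPARQLWrapper import JSON, SPARQLWrapper

    # The agent string is a placeholder chosen for this example.
    sparql = SPARQLWrapper(
        "https://query.wikidata.org/sparql", agent="scribe-data-readme-example/0.1"
    )
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    return [row["lemma"]["value"] for row in results["results"]["bindings"]]


query = build_lexeme_query("Q1860", "Q1084")  # English nouns
print(query)
# lemmas = fetch_lemmas(query)  # uncomment to run against Wikidata
```

If a lemma returned by such a query is wrong, editing the lexeme on Wikidata fixes it at the source for Scribe and for every other data reuser.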
Environment Setup ⇧
Important
Suggested IDE extensions
VS Code
The development environment for Scribe-Data can be installed via the following steps:
- Fork the Scribe-Data repo, clone your fork, and configure the remotes:
Note
Consider using SSH
As an alternative to the HTTPS URLs used in the instructions below, consider using SSH to interact with GitHub from the terminal. SSH allows you to connect without a user-pass authentication flow.
To run git commands with SSH, remember to substitute the HTTPS URL, https://github.com/..., with the SSH one, git@github.com:...
- e.g. Cloning now becomes git clone git@github.com:<your-username>/Scribe-Data.git
GitHub also has documentation on how to Generate a new SSH key 🔑
# Clone your fork of the repo into the current directory.
git clone https://github.com/<your-username>/Scribe-Data.git
# Navigate to the newly cloned directory.
cd Scribe-Data
# Assign the original repo to a remote called "upstream".
git remote add upstream https://github.com/scribe-org/Scribe-Data.git
- Now, if you run git remote -v you should see two remote repositories named:
  - origin (forked repository)
  - upstream (Scribe-Data repository)
- Use Python venv to create the local development environment within your Scribe-Data directory:
  - On Unix or macOS, run:
    python3 -m venv venv # make an environment named venv
    source venv/bin/activate # activate the environment
  - On Windows (using Command Prompt), run:
    python -m venv venv
    venv\Scripts\activate.bat
  - On Windows (using PowerShell), run:
    python -m venv venv
    venv\Scripts\activate.ps1
After activating the virtual environment, install the required dependencies and set up pre-commit by running:
pip install --upgrade pip # make sure that pip is at the latest version
pip install -r requirements.txt # install dependencies
pip install -e . # install the local version of Scribe-Data
pre-commit install # install pre-commit hooks
# pre-commit run --all-files # lint and fix common problems in the codebase
See the contribution guidelines for a more detailed explanation and troubleshooting.
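A common setup problem is installing dependencies outside the virtual environment. The short check below is a sketch for confirming that the interpreter is running inside a venv and that the local package is visible. The function names are hypothetical, and the distribution name scribe-data is an assumption based on the CLI name.

```python
import sys
from importlib import metadata


def in_virtual_environment():
    """Return True when the interpreter is running inside a venv."""
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


def installed_version(distribution_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution_name)
    except metadata.PackageNotFoundError:
        return None


if in_virtual_environment():
    print(f"Virtual environment active at {sys.prefix}")
else:
    print("No virtual environment detected; activate venv before installing.")

# "scribe-data" is the assumed distribution name for the local install.
print("scribe-data version:", installed_version("scribe-data"))
```

Running this after pip install -e . should report an active environment and a version string rather than None.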
Note
Feel free to contact the team in the Data room on Matrix if you're having problems getting your environment set up!
Supported Languages ⇧
Scribe's goal is functional, feature-rich keyboards and interfaces for all languages. Check the language_data_extraction directory for queries for currently supported languages and those that have substantial data on Wikidata.
The following table shows the supported languages and the amount of data available for each on Wikidata and via Unicode CLDR for emojis:
Languages | Nouns | Verbs | Translations* | Prepositions† | Emoji Keywords |
---|---|---|---|---|---|
French | 18,082 | 6,575 | 67,652 | - | 2,488 |
German | 194,762 | 3,637 | 67,652 | 215 | 2,898 |
Italian | 59,910 | 7,654 | 67,652 | - | 2,457 |
Portuguese | 5,281 | 539 | 67,652 | - | 2,327 |
Russian | 194,567 | 15 | 67,652 | 15 | 3,827 |
Spanish | 62,949 | 7,938 | 67,652 | - | 3,134 |
Swedish | 47,039 | 4,682 | 67,652 | - | 2,913 |
* Given the current beta status where words are machine translated.
† Only for languages for which preposition annotation is needed.
Featured By ⇧
Articles and Presentations on Scribe
2024
- October: Blog post on Medium discussing the Scribe-Data development process, community and features
- October: Blog post on Medium describing the main features of Scribe-Data
- September: Final Google Summer of Code report on the creation of the Scribe-Data CLI
- August: Final Google Summer of Code report on the creation of Scribe's cross-language translation functionality
- July: Blog post on Medium about the progress on creating the Scribe-Data CLI
- July: Blog post on Hashnode providing a midterm report on the localization and translation expansion for Scribe-iOS
- July: Blog post on Hashnode about the initial steps towards the localization of Scribe-iOS
- June: Blog post on Medium about the planned Scribe-Data CLI
- April: Blog post on Medium about Scribe-Data and its functionalities
- February: Presentation slides for Scribe's participation at the Wikimedia Tech Safari Program
2023
- August: Scribe-iOS final submission report for Google Summer of Code 2023
- June: Scribe-iOS development blog post on Nested UITableViews & Apple's built-in ViewControllers in app menu for GSoC '23
- March: Presentation slides for a talk at Berlin Hack and Tell (Hack of the month winner 🏆)
2022
- August: Presentation slides for a session at the 2022 Wikimania Hackathon
- July: Presentation slides for a talk at CocoaHeads Berlin
- July: Video on Scribe for Wikimedia Celtic Knot 2022
- June: Presentation slides for a talk with the LD4 Wikidata Affinity Group
- June: Scribe featured for new developers on MediaWiki
- May: Presentation slides for Wikimedia Hackathon 2022
- March: Blog post on Scribe-iOS for Wikimedia Tech News (DE / Tweet)
- March: Presentation slides for Wikidata Data Reuse Days 2022
Powered By ⇧
Many thanks to all the Scribe-Data contributors! 🚀