
Crowd-sourced lists of URLs to help Common Crawl crawl under-resourced languages. See https://github.com/commoncrawl/web-languages-code/ for the code.


commoncrawl/web-languages


Web Languages Project

Welcome! This is a crowd-sourced effort to improve crawling of low-resource languages. This dataset is public.

Common Crawl recognizes many languages, but we can see that we don't have enough content in widely spoken languages like Hindi (500 million speakers!), the languages of smaller countries like Hungarian, and regional languages like Catalan. We are interested in languages from all over the world. If you choose to help, you'll be helping create lists of websites related to languages that you read or speak.

How can I contribute?

If you look below you'll see a huge list of living languages. If you see one that looks interesting, click on it. You'll see a language-specific document, probably mostly blank, that you can fill out.

There are two ways to add to this document. If you aren't very familiar with GitHub, you can copy the entire document into an email, fill it out, and send it to web-languages ZAT commoncrawl ZOT org. We'll do the rest.

If you are familiar with GitHub and are logged in, click on the pen icon in the upper right corner to start editing the document. GitHub will ask you to fork the repo. Do that, edit the document, and finally create a pull request.

To see a partially completed example, look at the Welsh entry.

Sometimes asking a Large Language Model can be helpful: "What are some top websites written in the Welsh language?"

What kind of websites are you looking for?

If you look at the template, we have requested URLs in a few categories: News, Culture/History, Government, Political Parties, and Other. Remember that we're looking for websites in this particular language. If only part of a website is in the language, and that's visible in the URL as https://example.com/catalan/, then that's the URL you should add.
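If you are collecting many candidate links, a small script can trim a deep link back to the language-specific section described above. The following is only a sketch: the function name `language_section_url` and the idea of matching a single path segment such as `catalan` are assumptions made for illustration, not part of this project's tooling, and real sites may mark languages differently (for example with subdomains or query parameters).

```python
from urllib.parse import urlsplit, urlunsplit


def language_section_url(url: str, segment: str) -> str:
    """Trim a URL so that it ends at the path segment naming the language.

    Hypothetical helper: returns the URL unchanged when the segment does
    not appear in the path.
    """
    parts = urlsplit(url)
    pieces = parts.path.split("/")
    lowered = [p.lower() for p in pieces]
    if segment.lower() not in lowered:
        return url
    # Keep everything up to and including the first matching segment.
    idx = lowered.index(segment.lower())
    path = "/".join(pieces[: idx + 1]) + "/"
    # Drop the query string and fragment, which belong to the deep link.
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


if __name__ == "__main__":
    print(language_section_url(
        "https://example.com/catalan/news/2024/story.html", "catalan"))
```

Always check the trimmed URL in a browser before adding it, since not every site serves a page at the section root.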

For a language like Hindi, with 500 million speakers, there are a lot of websites to choose from. Please suggest websites that are important and influential, and please think about diversity. Are all geographic regions represented?

For a country-wide language like Hungarian, there is probably still a wide variety of websites to choose from. If a website is entirely in English, however, that's not what we're looking for.

For a regional language like Catalan, things are trickier. Catalan has multiple names -- it's called Valencian in some parts of Spain -- and use of the Catalan language is a part of a vigorous debate in Spanish national and regional politics. You might not be able to find Catalan-language content for every political party, and government websites might offer Catalan content one day and remove it after the next election. In that case, please do the best you can.

If your favorite language has its own Wikipedia -- check the list here -- please include this link under "Other".

What if my favorite language isn't in the list?

If you don't see your language, please open a GitHub issue, or send us an email at web-languages ZAT commoncrawl ZOT org. It could be that your language is here but has an unfamiliar name, or perhaps we need to add it. This list was started from the list in ISO 639-3, which is, like any world-wide standard, an imperfect list.

See also: Constructed, Extinct, Historical, Special

Languages with more than 50 million speakers

Languages

License

This work is marked with CC0 1.0

By editing this file, contributors are agreeing to release their contributions under the CC0 license.