By claims
- from a local file
cat entities.json | wikibase-dump-filter --claim P31:Q5 > humans.ndjson
cat entities.json | wikibase-dump-filter --claim P18 > entities_with_an_image.ndjson
cat entities.json | wikibase-dump-filter --claim P31:Q5,Q6256 > humans_and_countries.ndjson
the first command filters entities.json
into a subset where each line is the JSON of an entity having Q5 among its P31 claims
(ndjson stands for newline-delimited JSON)
- directly from a Wikidata dump
curl https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.bz2 | bzcat | wikibase-dump-filter --claim P31:Q5 > humans.ndjson
this can be quite convenient when you don't have enough space to keep the whole decompressed dump on your disk: here you only write the desired subset.
Of course, this probably only makes sense if the subset you are looking for is somewhere above 2,000,000 entities: below that level, it would probably be faster and more efficient to get the list of ids from your Wikibase Query Service (see Wikidata Query Service), then get the entities' data from the API (which can easily be done using wikibase-cli's wb data command).
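As a sketch of that alternative route (hedged: it assumes wikibase-cli is installed and configured, and ids.txt is a hypothetical file with one entity id per line, obtained from your Query Service):
# fetch the entities' data from the API instead of filtering the whole dump
cat ids.txt | xargs wb data > subset.ndjson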
and
# operator: &
cat entities.json | wikibase-dump-filter --claim 'P31:Q571&P50' > books_with_an_author.ndjson
or
# operator: |
cat entities.json | wikibase-dump-filter --claim 'P31:Q146|P31:Q144' > cats_and_dogs.ndjson
# which is equivalent to
cat entities.json | wikibase-dump-filter --claim 'P31:Q146,Q144' > cats_and_dogs.ndjson
# the 'or' operator takes precedence over the 'and' operator:
# this claim filter is equivalent to (P31:Q571 && (P50 || P110))
cat entities.json | wikibase-dump-filter --claim 'P31:Q571&P50|P110' > books_with_an_author_or_an_illustrator.ndjson
not
# operator: ~
cat entities.json | wikibase-dump-filter --claim 'P31:Q571&~P50' > books_without_author.ndjson
If your claim filter is too long and triggers an Argument list too long
error, you can pass a file instead:
echo 'P31:Q5,Q6256' > ./claim
cat entities.json | wikibase-dump-filter --claim ./claim > humans_and_countries.ndjson
By sitelinks
Keep only entities with a certain sitelink
# entities with a page on Wikimedia Commons
cat entities.json | wikibase-dump-filter --sitelink commonswiki > subset.ndjson
# entities with a Dutch Wikipedia article
cat entities.json | wikibase-dump-filter --sitelink nlwiki > subset.ndjson
# entities with a Russian Wikipedia or Wikiquote article
cat entities.json | wikibase-dump-filter --sitelink 'ruwiki|ruwikiquote' > subset.ndjson
You can even do finer filtering by combining conditions with & (AND) and | (OR).
# entities with Chinese and French Wikipedia articles
cat entities.json | wikibase-dump-filter --sitelink 'zhwiki&frwiki' > subset.ndjson
# entities with Chinese and French Wikipedia articles, or Chinese and Spanish articles
cat entities.json | wikibase-dump-filter --sitelink 'zhwiki&frwiki|eswiki' > subset.ndjson
NB: A&B|C is interpreted as A AND (B OR C)
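The claim and sitelink filters should also be combinable in a single run (an assumption here, not shown in the examples above); for instance, to keep only humans with a Dutch Wikipedia article:
# hedged sketch: a claim filter and a sitelink filter in the same pass
cat entities.json | wikibase-dump-filter --claim P31:Q5 --sitelink nlwiki > dutch_wikipedia_humans.ndjson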
By type
Default: item
cat entities.json | wikibase-dump-filter --type item
cat entities.json | wikibase-dump-filter --type property
cat entities.json | wikibase-dump-filter --type both
Need another kind of filter? Just ask for it in the issues, or make a pull request!
Wikidata entities have the following attributes: id, type, labels, descriptions, aliases, claims, sitelinks.
All in all, this takes a lot of space and might not be needed in your use case: for instance, if your goal is to do full text search on a subset of Wikidata, you just need to keep the labels, aliases and descriptions, and can omit the claims and sitelinks, which take a lot of space.
This can be done with either the --keep or the --omit option:
cat entities.json | wikibase-dump-filter --omit claims,sitelinks > humans.ndjson
# which is equivalent to
cat entities.json | wikibase-dump-filter --keep id,type,labels,descriptions,aliases > humans.ndjson
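For the full text search use case mentioned above, a sketch combining a claim filter with --omit (assuming filtering and attribute options can be used together in one invocation):
# keep only humans, and drop the space-hungry claims and sitelinks
cat entities.json | wikibase-dump-filter --claim P31:Q5 --omit claims,sitelinks > humans_for_search.ndjson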
Keep only the desired languages for labels, descriptions, aliases, and sitelinks.
cat entities.json | wikibase-dump-filter --languages en,fr,de,zh,eo > subset.ndjson
Uses the wikidata-sdk simplify.entity function to simplify the labels, descriptions, aliases, claims, and sitelinks.
# Default simplify options
cat entities.json | wikibase-dump-filter --simplify > simplified_dump.ndjson
# Custom options, see wdk.simplify.entity documentation https://github.com/maxlath/wikidata-sdk/blob/master/docs/simplify_entities_data.md
# and specifically for claims options, see https://github.com/maxlath/wikidata-sdk/blob/master/docs/simplify_claims.md#options
cat entities.json | wikibase-dump-filter --simplify '{"keepRichValues":"true","keepQualifiers":"true","keepReferences":"true"}' > simplified_dump.ndjson
# The options can also be passed in a lighter, urlencoded-like, key=value format
# that's simpler than typing all those JSON double quotes
cat entities.json | wikibase-dump-filter --simplify 'keepRichValues=true&keepQualifiers=true&keepReferences=true' > simplified_dump.ndjson
All the options (see the wbk.simplify.entity documentation for more details):
- claims simplification options:
  - entityPrefix and propertyPrefix (string)
  - keepRichValues (boolean)
  - keepQualifiers (boolean)
  - keepReferences (boolean)
  - keepIds (boolean)
  - keepHashes (boolean)
  - keepNonTruthy (boolean)
- sitelinks simplification options:
  - addUrl (boolean)
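For instance, assuming the flat key=value format shown above also accepts the sitelinks option directly (an assumption to confirm against the wbk.simplify.entity documentation), sitelink URLs could be added like this:
# hedged sketch: passing a sitelinks simplification option
cat entities.json | wikibase-dump-filter --simplify 'addUrl=true' > simplified_dump.ndjson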
-h, --help output usage information
-p, --progress enable the progress bar
-q, --quiet disable the progress bar
-V, --version output the version number
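For long runs on a full dump, enabling the progress bar can help keep track; a sketch reusing the dump pipeline from above:
# same pipeline as before, with the progress bar enabled
curl https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.bz2 | bzcat | wikibase-dump-filter --claim P31:Q5 --progress > humans.ndjson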