Wikidata words

What are Wikidata lexemes? What can we use them for? They let software reason about words! In this blog post, we use the lexeme data model to find synonyms.

30 January 2020

Wikidata is an ambitious database. It wants to hold all the knowledge in the world. This includes lexicographical data, which is knowledge about words. Words are described with the lexeme data model:

Lexeme: A word, including its variations. In English, belong, belongs, and belonged are forms of the same lexeme. Wikidata also includes short phrases and word parts, like prefixes, as lexemes.
Sense: A meaning of a word. The noun film has two senses: a motion picture and a thin membrane.
Lemma: One of the variations of a lexeme, chosen as the standard. A few lexemes, like color/colour, have more than one lemma because they have more than one standard spelling. But most have only one.
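
To make the three terms concrete, here is a minimal Python sketch of the data model. The class and field names are my own illustration, not Wikidata's actual schema, and the example values are taken from the definitions above.

```python
from dataclasses import dataclass, field


@dataclass
class Sense:
    gloss: str  # a short description of one meaning


@dataclass
class Lexeme:
    lemmas: list[str]  # standard spelling(s); usually just one
    category: str      # e.g. "noun" or "verb" (illustrative field)
    forms: list[str] = field(default_factory=list)
    senses: list[Sense] = field(default_factory=list)


# "film" as a noun: one lemma, two senses
film_noun = Lexeme(
    lemmas=["film"],
    category="noun",
    senses=[Sense("a motion picture"), Sense("a thin membrane")],
)

# "color/colour": one lexeme with two lemmas
color_noun = Lexeme(lemmas=["color", "colour"], category="noun")

# "belong": one lexeme, several forms
belong_verb = Lexeme(
    lemmas=["belong"],
    category="verb",
    forms=["belong", "belongs", "belonged"],
)
```

Note that the senses hang off the lexeme, not the lemma. That is why the queries later in this post always go from a lemma to its lexeme before asking about meanings.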

Like everything in Wikidata, lexemes, senses, and lemmas are stored as semantic triples. A triple is a statement with three parts (subject, predicate, and object). It is a convention for storing data, associated with RDF and Linked Open Data.
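
As a rough illustration of what a triple is, the Python sketch below stores a few statements as tuples and matches them against a pattern, which is loosely what one line of a SPARQL query does. The identifiers such as `lexeme:film-noun` are made up for readability; real Wikidata identifiers look different.

```python
# Each triple is (subject, predicate, object).
# All identifiers here are invented for illustration.
triples = [
    ("lexeme:film-noun", "wikibase:lemma", "film"),
    ("lexeme:film-verb", "wikibase:lemma", "film"),
    ("lexeme:film-noun", "ontolex:sense", "sense:film-movie"),
    ("lexeme:film-noun", "ontolex:sense", "sense:film-membrane"),
]


def match(pattern, store):
    """Return the triples that fit a pattern; None acts as a wildcard."""
    return [
        triple
        for triple in store
        if all(want is None or want == got for want, got in zip(pattern, triple))
    ]


# "Which subjects have the lemma 'film'?"
for subject, _, _ in match((None, "wikibase:lemma", "film"), triples):
    print(subject)
```

The `None` wildcards play the role that variables like `?lexeme` play in SPARQL.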

Let's demonstrate these parts of the data model with some queries in SPARQL, the query language of Wikidata, using the query.wikidata.org query interface.

A photograph of many books on a shelf that are similar in color — A bookshelf near my desk at Mann Library, Cornell University

Let's see some lexemes

This simple SPARQL query, which you can run at query.wikidata.org, will find all the lexemes in Wikidata:

# select all lexemes
SELECT ?lexeme ?lemma WHERE {
  ?lexeme wikibase:lemma ?lemma . # show a lemma for each lexeme
}

After running it, you'll see the output is a long table of random-seeming words in many languages. Each word is a lemma. Next to every word is a link. The link indicates the lexeme to which the lemma belongs.
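
If you'd rather run the query from a script than from the web page, something like the following Python sketch should work. It assumes the public endpoint at https://query.wikidata.org/sparql, which takes the query text in a `query` parameter and returns JSON when asked with `format=json`. The function names and the User-Agent string are my own placeholders, and I've added a LIMIT so the example stays small.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://query.wikidata.org/sparql"

QUERY = """
SELECT ?lexeme ?lemma WHERE {
  ?lexeme wikibase:lemma ?lemma .
}
LIMIT 10
"""


def build_url(query):
    """Encode the query into a GET URL that asks for JSON results."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    return f"{ENDPOINT}?{params}"


def run_query(query):
    """Send the query and return the list of result rows (needs network access)."""
    request = urllib.request.Request(
        build_url(query),
        headers={"User-Agent": "lexeme-blog-example/0.1 (placeholder)"},
    )
    with urllib.request.urlopen(request) as response:
        data = json.load(response)
    return data["results"]["bindings"]


# Usage (needs network access):
#   for row in run_query(QUERY):
#       print(row["lemma"]["value"], row["lexeme"]["value"])
```

Each row is a dictionary keyed by variable name, so `row["lemma"]["value"]` is the word and `row["lexeme"]["value"]` is the link to its lexeme.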

This table of lexemes has fewer than 300k words as of this writing. That is far fewer than the number of words in all spoken languages (every estimate I can find puts this at many millions), so Wikidata lexemes are clearly a work in progress. I wonder how many words it lists in English?

# select all lexemes in English
SELECT ?lexeme ?lemma WHERE {
  ?lexeme <http://purl.org/dc/terms/language> wd:Q1860 ; # the lexeme's language is English
          wikibase:lemma ?lemma .                        # show a lemma for each lexeme
}

The code wd:Q1860 is the Wikidata item identifier for the English language. We use it to restrict our query. At the time of this writing, we find 38,268 English lexemes.

Let's find a particular lexeme by its lemma

Instead of massive tables of words, how do we pluck out a specific word? Can we search Wikidata for a string?

# find the word film in English
SELECT ?lexeme ?lemma WHERE {
  VALUES ?lemma {'film'@en}       # Restrict search to English "film" lemmas
  ?lexeme wikibase:lemma ?lemma . # Find the corresponding lexemes
}

Here we specify the English language with the @en tag applied to the string 'film'. The result (as of this writing) is two lexemes: film the noun, and film the verb.
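
The string and its language tag are matched together, so 'film'@en and 'film'@fr are different values. To look up other words, you could generate the query text with a small helper like this Python sketch. The function name `lemma_query` is hypothetical, and the escaping only handles backslashes and single quotes.

```python
def lemma_query(word, language_tag):
    """Build a SPARQL query that finds lexemes by lemma and language tag."""
    # Escape backslashes first, then single quotes, so the word stays
    # inside the quoted SPARQL string.
    escaped = word.replace("\\", "\\\\").replace("'", "\\'")
    return (
        "SELECT ?lexeme ?lemma WHERE {\n"
        f"  VALUES ?lemma {{'{escaped}'@{language_tag}}}\n"
        "  ?lexeme wikibase:lemma ?lemma .\n"
        "}"
    )


print(lemma_query("film", "en"))
```

Calling `lemma_query("film", "en")` reproduces the query above, minus the comments.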

Let's look at a way we can apply this knowledge!

A photograph of many books on a shelf at odd angles — Another, rather less orderly bookshelf at Mann Library...toward the back somewhere

Find synonyms of a word

Below is a SPARQL query that finds synonyms for a word. It traverses from lemmas to their lexemes, as we've already seen. Then it grabs the senses of those lexemes, which are the meanings of the words. It uses the wdt:P5973 synonym property to look for words with similar meanings.

Once it has discovered some synonyms, it retraces its steps. It walks back from those synonyms' senses, to their lexemes, to their lemmas. In the end, it tells us that "movie" and "motion picture" are synonyms for "film".

# Find synonyms for "film":
SELECT * {
  VALUES ?lemma1 {'film'@en}
  ?lexeme1 wikibase:lemma ?lemma1 . # get lexemes that share the lemma "film" (resulting in the noun and verb form of the word)
  ?lexeme1 ontolex:sense ?sense1 .  # get senses for each lexeme (resulting in movie and membrane senses of the noun film)
  ?sense1 wdt:P5973 ?sense2 .       # get synonymous senses belonging to other lexemes (resulting in a sense of the lexeme motion picture)
  ?lexeme2 wikibase:lemma ?lemma2 . # then, from all lemmas belonging to any lexeme
  ?lexeme2 ontolex:sense ?sense2 .  # get those that have senses we've found to be synonymous with film
}

As we've already noted, there is more than one sense of the noun film. It can designate a movie, but also a thin membrane. As of now, only movie-related lexemes are returned by this SPARQL query.

This is because synonym data in Wikidata is still far from complete. The membrane sense of film is in the data, but it has no synonym assertions, so other membrane-related lexemes were not found. Maybe some day!
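
To see the walk the query takes without any live data, here is a toy version in Python over a handful of hand-written triples. The identifiers are invented, and the data is only a tiny imitation of what this post describes: the movie sense of film has synonym links, and the membrane sense has none.

```python
# Toy triples (subject, predicate, object); identifiers are invented.
TRIPLES = [
    ("L:film-noun", "lemma", "film"),
    ("L:film-verb", "lemma", "film"),
    ("L:movie", "lemma", "movie"),
    ("L:motion-picture", "lemma", "motion picture"),
    ("L:film-noun", "sense", "S:film-movie"),
    ("L:film-noun", "sense", "S:film-membrane"),
    ("L:movie", "sense", "S:movie"),
    ("L:motion-picture", "sense", "S:motion-picture"),
    ("S:film-movie", "synonym", "S:movie"),
    ("S:film-movie", "synonym", "S:motion-picture"),
    # No synonym triples for S:film-membrane in this toy data.
]


def objects(subject, predicate):
    """All objects of triples with the given subject and predicate."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def subjects(predicate, obj):
    """All subjects of triples with the given predicate and object."""
    return [s for s, p, o in TRIPLES if p == predicate and o == obj]


def synonyms(word):
    """Walk lemma -> lexeme -> sense -> synonym sense -> lexeme -> lemma."""
    found = []
    for lexeme1 in subjects("lemma", word):
        for sense1 in objects(lexeme1, "sense"):
            for sense2 in objects(sense1, "synonym"):
                for lexeme2 in subjects("sense", sense2):
                    found.extend(objects(lexeme2, "lemma"))
    return found


print(synonyms("film"))  # ['movie', 'motion picture']
```

Each nested loop corresponds to one line of the SPARQL query. The membrane sense is visited, but with no synonym triples attached to it the inner loops have nothing to walk, so it contributes no results.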

Thanks are in order

Stanislav Kralin, Huda Khan, and Hilary Thorsen were essential to this writing!

John Skiles Skinner's portfolio
