Monday, June 29, 2020

Annotating rhymes in texts (From rhymes to networks 3)

Having discussed some general aspects of rhyming in a couple of different languages in last month's blog post, I devote this third post in the series to the question of how rhyme can be annotated. Annotation plays a crucial role in almost all fields of linguistics. The main idea is to add value to a given resource (Milà-Garcia 2018). What value we add can differ widely, but as far as textual resources are concerned, the information that we add can usually not be extracted automatically from the resource.

In our case, the information we want to explicitly add to rhyme texts or rhyme corpora is the rhyme relations between words. Retrieving this information may be trivial, as in the case of Shakespeare's Sonnets, where we know the rhyme schema in advance, but it is considerably more complicated when working with other, less strict types of rhyming.

One usually distinguishes two basic types of annotation: inline and stand-off (Eckart 2012). For inline annotation, we add our information directly into our textual resource, while stand-off annotation creates an index over the resource, and then adds the information in a separate resource that refers to the index of the original text.
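
To make the distinction concrete, here is a minimal sketch in Python. The bracket notation for the inline tag and the (word index, label) layout of the stand-off records are assumptions chosen for illustration only; they are not prescribed by Eckart (2012).

```python
# Inline annotation: the rhyme label is written directly into the text.
inline = "Morning has [a]broken like the first morning"

# Stand-off annotation: the text itself stays untouched. We first create
# an index over it (here simply the word position), and a separate
# resource then refers to that index.
text = "Morning has broken like the first morning"
index = dict(enumerate(text.split()))
standoff = [(2, "a")]  # hypothetical layout: (word index, rhyme label)

for position, label in standoff:
    print(index[position], label)
```

In the stand-off version, the original text can never be damaged by an annotation error, but every record is only meaningful together with the index.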

Both methods have their pros and cons. Stand-off annotation often seems to provide a cleaner solution (as one never knows how much a manual annotation added into a text might modify the text involuntarily). However, inline annotation has, in my experience, the advantage of allowing for a much faster annotation process, at least as long as the annotation has to be done in text files directly, without interfaces that could assist in the annotation process.

Overview of existing annotation practice

If we look at how rhymes have been annotated in collections of poetry so far, we find quite a variety of techniques.

Wáng (1980), for example, uses an inline annotation style in his corpus of the rhymes in the Book of Odes, as illustrated in the following example taken from List et al. (2019). In this annotation, rhyme words are annotated indirectly, by providing reconstructed readings for the Chinese characters that are supposed to approximate the original pronunciation. Whenever two rhyme words share the same main vowel, the author would have judged them to have rhymed in the original text.

Annotation in Wáng (1980)

Baxter (1992) uses a stand-off annotation, which is shown (again taken from List et al. 2019) in the following table. An advantage of Baxter's annotation is that it allows him to provide multiple layers of information for each rhyme word. A disadvantage is that a clear index to the words in the poem is lacking. While this is not entirely problematic, since it is usually easy to identify which words are in rhyme position, it is not entirely "safe", from an annotation point-of-view, as it may still create ambiguities.

Annotation in Baxter (1992)

In a study of automated rhyme word detection, Haider and Kuhn (2018) use annotated rhyme datasets from a variety of German styles (Hip Hop, contemporary lyrics, and more ancient lyrics). To annotate the data, they use the standard format of the Text Encoding Initiative, which is based essentially on XML. Unfortunately, however, they do not provide tags for each word that rhymes, but instead only add an attribute to each stanza, indicating the rhyme schema, as can be seen in the example below:
<lg rhyme="aabccb" type="stanza">
  <l>Vor seinem Löwengarten,</l>
  <l>Das Kampfspiel zu erwarten,</l>
  <l>Saß König Franz,</l>
  <l>Und um ihn die Großen der Krone,</l>
  <l>Und rings auf hohem Balkone</l>
  <l>Die Damen in schönem Kranz.</l>
The drawback of this annotation style is that it places the annotation where it does not belong, assuming that a poem only rhymes the words that appear at the end of a line, and that there are no exceptions.
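
The assumption becomes visible as soon as one tries to recover word-level rhymes from this encoding. The following sketch uses only Python's standard library; the function name line_final_rhymes is hypothetical, and the closing tag of the stanza is added here so that the snippet parses.

```python
import xml.etree.ElementTree as ET

tei = """<lg rhyme="aabccb" type="stanza">
  <l>Vor seinem Löwengarten,</l>
  <l>Das Kampfspiel zu erwarten,</l>
  <l>Saß König Franz,</l>
  <l>Und um ihn die Großen der Krone,</l>
  <l>Und rings auf hohem Balkone</l>
  <l>Die Damen in schönem Kranz.</l>
</lg>"""


def line_final_rhymes(xml_string):
    """Pair each letter of the stanza-level schema with the last word of a line."""
    stanza = ET.fromstring(xml_string)
    schema = stanza.get("rhyme")
    pairs = []
    for label, node in zip(schema, stanza.findall("l")):
        # All we can do is guess that the rhyme sits on the final word.
        last_word = node.text.split()[-1].strip(".,;:!?")
        pairs.append((label, last_word))
    return pairs


pairs = line_final_rhymes(tei)
for label, word in pairs:
    print(label, word)
```

Any rhyme inside a line, or any rhyme that covers more than the last orthographic word, cannot be represented this way.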

For French, I found an interesting website called métrique en ligne, offering a large number of phonetically analyzed texts in French. They offer a rhyme analysis in an interactive fashion: one can have a look at a poem in raw form and then see which parts of the words appear in rhyme relation. A screenshot of the website (with the poem "Les Phares" from Charles Baudelaire) illustrates this annotation:

It is very nice that the project offers the rhyme annotation in such a clear form, annotating explicitly those parts of the words (albeit in orthography) that are supposed to be responsible for the rhyming. However, the annotation has a clear drawback, in that it provides rhyme annotation only on the level of the stanza, although we know well that quite a few poems have recurring rhymes that are reused across many stanzas, and we would like to acknowledge that in our annotation.

The most complete annotation of poetry I have found so far is "MCFlow: A Digital Corpus of Rap Transcriptions" (Condit-Schultz 2017). The goal was not primarily to annotate rhyme, but to provide a corpus that also takes the musical and rhythmic aspects of rap into account. As a result, it offers annotations along seven major aspects: rhythm, stress, tone, break, rhyme, pronunciation, and the lyrics themselves. The rhyme annotation itself is provided for each syllable (the texts themselves are all syllabified), with capital letters indicating stressed, and lower case letters indicating unstressed syllables. Rhyme units (usually, but not necessarily words) are marked by brackets. The following figure from Condit-Schultz (2017) illustrates this schema.

Annotation of rhymes by Condit-Schultz (2017)

What I do not entirely understand is the motivation for using the same letters, in lowercase, for unstressed syllables as for the stressed ones in a rhyme sequence. Given that the information about stress is already available from the annotation, it seems redundant to add it again, and it is not clear to me what purpose it serves, especially because unstressed syllables do not necessarily rhyme in rhyme sequences. Apart from this, I find the information that this annotation schema provides quite convincing, although I find the format difficult to parse computationally, and I imagine that it is quite difficult to annotate manually.

Initial reflections on rhyme annotation

When dealing with annotation schemas and trying to develop a framework for annotation, it is always useful to recall the Zen of Python, especially the first seven lines:
  • Beautiful is better than ugly.
  • Explicit is better than implicit.
  • Simple is better than complex.
  • Complex is better than complicated.
  • Flat is better than nested.
  • Sparse is better than dense.
  • Readability counts.
What I think we can extract from these seven lines are the following basic rules for an initial annotation schema for rhyme data.
  • First, ideally, we want an annotation schema that gives us the same look and feel that we know when reading a poem. This does not mean we need to store the full annotation in this schema, but for a quick editing of rhyme relations, such an annotation schema has many advantages.
  • Second, in order to maintain explicitness, all rhymes should be treated as rhyming globally inside a poem — we should never restrict annotation of rhymes to a single stanza, and we should also avoid brackets to mark rhyming sequences, as there are other ways to assign words to units.
  • Third, we should be explicit enough to show which parts of a word rhyme but, for now, I think it is not necessary to annotate all syllables at the same time. Since this would cost a lot of time, and specifically since syllabification differs from language to language, it seems better to add this information later on a language-specific basis, semi-automatically. Since many words repeat across poems, it is much easier to design a lookup table for syllabification once a corpus has been assembled than to add the information while preparing each poem.
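
The lookup-table idea from the third point could be sketched as follows. The table entries and the function names syllabify and missing are made up for illustration; a real table would have to be filled in per language.

```python
# Hypothetical lookup table, filled in once per language after the
# corpus has been assembled.
SYLLABLES = {
    "broken": ["bro", "ken"],
    "spoken": ["spo", "ken"],
    "sweetness": ["sweet", "ness"],
}


def syllabify(word, table=SYLLABLES):
    """Return the stored syllables, or the whole word if it is not in the table."""
    return table.get(word.lower(), [word])


def missing(words, table=SYLLABLES):
    """List the words that still need a manual entry in the table."""
    return sorted({w.lower() for w in words if w.lower() not in table})


print(syllabify("Broken"))
print(missing(["Broken", "bird", "Word"]))
```

Because each word only has to be entered once, the manual effort grows with the vocabulary of the corpus rather than with the number of poems.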

Towards a Standardized Annotation of Rhyme Data

Last year, we proposed a schema for rhyme annotation (List et al. 2019). Our basic idea was inspired by tabular formats. These are used in linguistic software packages dealing with problems in computational historical linguistics, such as LingPy. They are also used as the backbone of the Cross-Linguistic Data Formats Initiative (Forkel et al. 2018), which uses tabular formats in combination with metadata in order to render linguistic datasets (wordlists, information on structural features) cross-linguistically comparable. Essentially, the format can be seen as a stand-off annotation, where the original data are not modified directly. While our basic format is rather powerful with respect to what can be annotated, it is also very difficult to code data in this format, at least in the absence of a proper annotation tool.

At the same time, to ease the initial preparation of annotated rhyme data conforming to these standards, we proposed an intermediate format, in which a poem was provided just in text form, with minimal markup for metadata, and in which rhymes could be annotated inline. As an example, consider the first two stanzas of the poem "Morning has broken" by Eleanor Farjeon (1881-1965):
@CREATED: 2020-06-26 06:09:04
@TITLE: Morning has broken
@AUTHOR: Eleanor Farjeon
@BIODATE: 1881-1965
@YEAR: before 1965
@MODIFIED: 2020-06-26 06:09:46
@LANGUAGE: English

Morning has [a]broken like the first morning
Blackbird has [a]spoken like the first [b]bird
Praise for the [c]singing, praise for the morning
Praise for them [c]springing fresh from the [b]Word

Sweet the rain's [e]new_[f]fall, sunlit from heaven
Like the first [e]dew_[f]fall on the first [g]grass
Praise for the [d]sweet[h]ness of the wet garden
Sprung in com[d]plete[h]ness where His feet [g]pass
As you can see from this example, we start with some metadata (which is more or less free-form, following the pattern @key: value), and then render the stanzas line by line, separating stanzas by one blank line. Rhymes are annotated by enclosing rhyme labels in square brackets, placed before the part of the word responsible for the rhyme. If desired, one can annotate rhymes for each syllable, as done in the rhyme words [d]sweet[h]ness and com[d]plete[h]ness, but one can also annotate only the rhyme as a whole, as done in the rhyme words [a]broken and [a]spoken.

In order to assign words to rhyme units, an underscore can be used to indicate that two orthographic words are perceived as one unit in the rhyme, which is the case for [e]new_[f]fall rhyming with [e]dew_[f]fall. Furthermore, if a stanza reappears throughout a poem or song in the form of a refrain, this can be indicated by adding two spaces before all lines of the stanza.
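
The format is simple enough to be read with a few lines of Python. What follows is a minimal sketch under the rules just described (metadata lines, blank-line stanza breaks, bracketed labels, underscores); it is not the parser used for the actual database, and the names parse_poem and rhyme_units are hypothetical.

```python
import re

TAG = re.compile(r"\[([^\]]+)\]")


def parse_poem(text):
    """Split a poem in the inline format into metadata and stanzas."""
    meta, stanzas, current = {}, [], []
    for line in text.splitlines():
        if line.startswith("@"):
            # Split at the first colon only, so that time stamps survive.
            key, _, value = line[1:].partition(":")
            meta[key.strip()] = value.strip()
        elif line.strip():
            current.append(line)
        elif current:
            stanzas.append(current)
            current = []
    if current:
        stanzas.append(current)
    return meta, stanzas


def rhyme_units(line):
    """Return (labels, plain text) for every annotated unit in a line."""
    units = []
    for unit in line.split():
        labels = TAG.findall(unit)
        if labels:
            # The underscore joins orthographic words into one rhyme unit.
            plain = TAG.sub("", unit).replace("_", " ")
            units.append((labels, plain.strip(".,;:!?")))
    return units


poem = """@TITLE: Morning has broken
@AUTHOR: Eleanor Farjeon

Morning has [a]broken like the first morning
Blackbird has [a]spoken like the first [b]bird

Sweet the rain's [e]new_[f]fall, sunlit from heaven
Like the first [e]dew_[f]fall on the first [g]grass
"""

meta, stanzas = parse_poem(poem)
print(meta["TITLE"], len(stanzas))
print(rhyme_units(stanzas[1][0]))
```

Note that rhyme_units only returns the plain text of each unit as a whole; which letters belong to which label is deliberately left out of this sketch.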

Comments can be added by beginning a line with the hash symbol #, as shown in this small excerpt of Bob Dylan's "Sad-Eyed Lady of the Lowlands".
# [Verse 1]
With your mercury mouth in the missionary [c]times
And your eyes like smoke and your prayers like [c]rhymes
And your silver cross, and your voice like [c]chimes
Oh, who do they think could [i]bury_[j]you?
With your pockets well protected at [e]last
And your streetcar visions which ya' place on the [e]grass
And your flesh like silk, and your face like [e]glass
Who could they get to [i]carry_[j]you?

# [Chorus]
  Sad-eyed lady of the lowlands
  Where the sad-eyed prophet say that no man [a]comes
  My warehouse eyes, my Arabian [a]drums
  Should I put them by your [b]gate
  Or, sad-eyed lady, should I [b]wait?
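
Comments and refrains can be handled in the same spirit. The sketch below assumes, as described above, that comment lines start with a hash symbol and that a refrain is a stanza in which every line starts with two spaces; the function name classify_stanzas is hypothetical.

```python
def classify_stanzas(text):
    """Return a list of (kind, lines), where kind is 'refrain' or 'stanza'."""
    result, current = [], []
    # The extra empty string makes sure the last stanza is closed.
    for line in text.splitlines() + [""]:
        if line.startswith("#"):
            continue  # comment line, ignored
        if line.strip():
            current.append(line)
        elif current:
            is_refrain = all(item.startswith("  ") for item in current)
            kind = "refrain" if is_refrain else "stanza"
            result.append((kind, [item.strip() for item in current]))
            current = []
    return result


excerpt = "\n".join([
    "# [Verse 1]",
    "With your mercury mouth in the missionary [c]times",
    "And your eyes like smoke and your prayers like [c]rhymes",
    "",
    "# [Chorus]",
    "  Should I put them by your [b]gate",
    "  Or, sad-eyed lady, should I [b]wait?",
])

for kind, lines in classify_stanzas(excerpt):
    print(kind, len(lines))
```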
When testing this framework on many different kinds of poems from different languages and styles, I realized that the greedy rhyme annotation that I used (you place the rhyme tag before a word, and all letters that follow will be considered to belong to that very rhyme tag) has a disadvantage in those situations where syllables in multi-syllabic rhyme units essentially do not rhyme. As an example consider the following lines from Eminem's "Not Afraid":
I'ma be what I set out to be, 
without a doubt, undoubtedly
And all those who look down on me, 
I'm tearin' down your balcony
Here, the author plays with rhymes centering around the words out to be, undoubtedly, down on me, and balcony. Condit-Schultz has annotated the rhymes as follows (I use the rhyme schema inline for simplicity):
I'ma D|be what I set (C|out c|to D|be), 
wi(C|thout c|a) (C|doubt, c|un)(C|doub.c|ted.D|ly)
And all those who look (C|down c|on D|me), 
I'm tearin' C|down your (C|bal.c|co.D|ny)
In my opinion, however, the parts annotated with c by Condit-Schultz do not really rhyme in these lines; they are mere fillers for the rhythm, while the most important rhyme parts, which are also perceived as such, are the stressed syllables with the main vowel ou. To mark that a syllable is not really rhyming, but also in order to mark the border of a rhyme (and thus allow indication that only the first syllable of a word rhymes with another word), I therefore decided to introduce a specific "empty" rhyme symbol, which is now represented by a plus. My annotation of the lines thus looks as follows:
I'ma be what I set [h]out_[+]to_[e]be, 
wi[h]thout a [h]doubt, un[h]doub[+]ted[e]ly
And all those who look [h]down_[d]on_[e]me
I'm tearin' down your bal[d]co[e]ny
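
To illustrate what the empty symbol does when the data are processed, here is a small sketch that collects the tagged parts by label and skips everything marked with a plus. It follows the greedy reading described above (a tag covers all letters up to the next tag or the next space); the name rhyme_groups is hypothetical.

```python
import re
from collections import defaultdict

# A tag, followed by all letters up to the next tag or whitespace.
PART = re.compile(r"\[([^\]]+)\]([^\[\s]*)")


def rhyme_groups(lines, empty="+"):
    """Group the tagged parts by label, skipping the empty rhyme symbol."""
    groups = defaultdict(list)
    for line in lines:
        for label, part in PART.findall(line):
            if label != empty:
                groups[label].append(part.strip("_.,;:!?"))
    return dict(groups)


lines = [
    "I'ma be what I set [h]out_[+]to_[e]be,",
    "And all those who look [h]down_[d]on_[e]me",
]

groups = rhyme_groups(lines)
for label, parts in sorted(groups.items()):
    print(label, parts)
```

The filler "to" ends the rhyme part "out" without being assigned to any rhyme group itself.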

An Interactive Tool for Rhyme Annotation

While I now consider the inline annotation format rather complete (with all the limitations that result from inline annotation), I realized, when trying to annotate poems in this format, that it is not fun to edit text files in this way. I am not talking about small edits, like one stanza, or typing in some metadata. Annotating a whole rap song can become very tedious and even problematic, as one may easily forget which rhyme tags one has already used, overlook which words have been annotated as rhyming, or forget brackets and the like.

As a result, I decided to write an interactive rhyme annotation tool that supports the inline-annotation format and can be edited both in the text and interactively at the same time. This is a bit similar to the text processing programs in blogging software, which allow writing both in the HTML source and in a more convenient version that shows you what you will get.

The following screenshot from the database, for example, shows how the rhymes in Shakespeare's Sonnet Number 98 are visually rendered.

Visual display of Shakespeare's Sonnet 98

This tool is already available online. I call it RhyAnT, which is short for Rhyme Annotation Tool. I have been using it in combination with a small server to populate a first database with rhymes in different languages, which already contains more than 350 annotated poems. This database can be accessed and inspected by everybody interested, at AntRhyme; unfortunately, copyrighted texts from modern songs cannot be rendered yet (as I am not sure how many I would be allowed to share).

I do not want to claim that I am gifted as a designer (I am surely not), and it is possible that there are better ways to implement the whole interface. However, I find it important to note that the format itself, with the coloring of rhyme words, has dramatically increased my efficiency at annotating rhyme data, and also my accuracy in spotting similarities.

Annotating the same poem with RhyAnT, the interactive rhyme annotator

The above screenshot shows how I can edit the poem from my edit access to the database. Alternatively, one can just paste the text into the publicly accessible interface of the RhyAnT tool, edit the data there, and then copy-paste the result to store it. In this form, the interface can already be used by anybody who wants to annotate rhymes in their work.


The annotation framework that I have illustrated here is not all-powerful, specifically because it does not allow for multi-layered annotation (Bański and Witt 2019: 230f), which would allow us to add pronunciation, rhythm, and many aspects other than rhyming alone. However, I hope that many of these aspects can be added quickly later, by creating lookup tables and processing the annotated corpus automatically. Following the Zen of Python, this seems much simpler than investing a lot of time in the creation of a highly annotated dataset, an effort that would discourage working with the data from the beginning.

References

Bański, Piotr and Witt, Andreas (2019) Modeling and annotating complex data structures. In: Julia Flanders and Fotis Jannidis (eds) The Shape of Data in the Digital Humanities: Modeling Texts and Text-based Resources. Oxford and New York: Routledge, pp. 217-235.

Baxter, William H. (1992) A Handbook of Old Chinese Phonology. Berlin: de Gruyter.

Condit-Schultz, Nathaniel (2017) MCFlow: A Digital Corpus of Rap Transcriptions. Empirical Musicology Review 11.2: 124-147.

Eckart, Kerstin (2012) Resource annotations. In: Clarin-D, AP 5 (ed.) Berlin: DWDS, pp. 30-42.

Forkel, Robert and List, Johann-Mattis and Greenhill, Simon J. and Rzymski, Christoph and Bank, Sebastian and Cysouw, Michael and Hammarström, Harald and Haspelmath, Martin and Kaiping, Gereon A. and Gray, Russell D. (2018) Cross-Linguistic Data Formats, advancing data sharing and re-use in comparative linguistics. Scientific Data 5.180205: 1-10.

Haider, Thomas and Kuhn, Jonas (2018) Supervised rhyme detection with Siamese recurrent networks. In: Proceedings of Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, pp. 81-86.

List, Johann-Mattis and Hill, Nathan W. and Foster, Christopher J. (2019) Towards a standardized annotation of rhyme judgments in Chinese historical phonology (and beyond). Journal of Language Relationship 17.1: 26-43.

Milà-Garcia, Alba (2018) Pragmatic annotation for a multi-layered analysis of speech acts: a methodological proposal. Corpus Pragmatics 2.1: 265-287.

Wáng, Lì 王力 (2006) Hànyǔ shǐgǎo 漢語史稿 [History of the Chinese language]. Běijīng 北京: Zhōnghuá Shūjú 中华书局.
