Textual Russian Dolls:

On generating cascading plot summaries for second language learning

Bill Winder, U. of British Columbia

 

 

1. Exploratory talk

 

I would like to consider here a special kind of hypertext, which Christian Vandendorpe has called a stratified hypertext. I will develop this notion with an eye to its practical application in the area of language learning and reading. Both the theoretical framework and the applications that illustrate it are outlines and sketches that only serve to make the theory a bit more concrete. This is purely an exploratory talk. In fact, I hope that everything I describe has indeed been implemented somewhere and you can point me to a website where I can download it. In that case, I would throw out the prototypes I have developed, however laborious it has been to develop them.

 

My goal here is simply to explore the notion of a stratified hypertext, defined as a set of linked textual versions where each version is a summary or abstract of the original, but at different degrees of informational detail. I call these Russian doll texts or cascading summaries because each version in the stack of versions is a replica, but a more concise one, of the one that precedes it.

 

My way of talking about such texts is not systematic. I use fairly interchangeably the terms Russian doll texts, cascading summaries, stratified hypertext and laminar text. I talk about laminar readings and stratified readings. The terminology in the field of automatic summarization is fairly unsystematic as well. Often summaries, abstracts, excerpts, and extracts are all called summaries. I will use the word summary in a very loose fashion, especially since it may require considerable redefinition in the field of language pedagogy. I hope this lexical pluralism will not be too confusing.

1.1.Extraction: Autosummarize

 

Microsoft Word has a function called autosummarize (under the “Tools” menu) that offers a crude tool for constructing cascading summaries. It really should be called autoexcerpt, since it only picks out certain parts of a text as its summary. The user can choose to extract a given percentage of the most important sentences of a document, ranked by their thematic importance using various measures. The ranking is based on very simple text structures, like word frequency distribution, sentence position, positive and negative surface cues, and some salient discourse markers.
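The kind of ranking described above can be sketched in a few lines. The scoring details here are illustrative assumptions, not Microsoft's actual algorithm: sentences are scored by the summed corpus frequency of their words, normalized by length.

```python
# A minimal sketch of frequency-based sentence extraction, in the spirit
# of Autosummarize. The scoring heuristic is an assumption for illustration.
import re
from collections import Counter

def extract_summary(text, ratio=0.25):
    """Return the top `ratio` of sentences, scored by content-word
    frequency, in their original text order."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    freq = Counter(re.findall(r'\w+', text.lower()))
    # Score each sentence by the summed frequency of its words,
    # normalized by sentence length to avoid favouring long sentences.
    def score(s):
        toks = re.findall(r'\w+', s.lower())
        return sum(freq[t] for t in toks) / (len(toks) or 1)
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max(1, int(len(sentences) * ratio))])
    return [sentences[i] for i in keep]
```

Asking for 25%, 50%, and so on with this function yields exactly the cascade of increasingly faithful excerpts described above.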

 

One can ask Word to generate a series of summaries in 5% increments, and at each step the summary would be that much closer to the original text. Autosummarize is really only useful for long, explicitly structured documents, preferably of at least several hundred pages. For this talk I have asked it to summarize “Trishka’s Caftan”, a fable by Krylov. I chose this text because of a study of it by Yuri Shcheglov and Alexander Zholkovsky, which is particularly pertinent to the question of stratified hypertexts. Unfortunately I won’t have time to deal with the meaning-text theory or their Poetics of Expressiveness, but both have clear implications for cascading summaries. Here is the stratified hypertext (or at least its blueprint) for this fable: t_word.htm

 

“Trishka’s Caftan” is not, of course, a fair test of the autosummarize function. I only want to give some indication of its logic. On longer texts, such as novels, the results are better, but still very odd; I tried it on Thoreau’s Walden, and the only noticeably interesting thing it did was to pick out some of the famous quotations, such as “the mass of men lead lives of quiet desperation”. That seems interesting in itself and would be worth exploring further at some other time. My point here is simply to show how one might establish a sequence of different summaries, each of which is a more detailed version of the original than the preceding one in the sequence.

 

Such stratified hypertexts can be formally represented as a tree “planted” in the single most representative sentence:

 

100%   S1  S2  S3  S4  S5  S6  S7  …  Sn-6  Sn-5  Sn-4  Sn-3  Sn-2  Sn-1  Sn

 50%   Sa  Sb  Sc  Sd  Se  Sf  Sg  Sh

 25%   St  Su  Sv  Sw

  1%   Sq  Sr

 1 S   Sp

 

The top row is the set of all sentences of the text: all the other sentences in the tree are drawn from that common pool. The set of sentences on a given row is a stratum or level of the hypertext and makes up a complete summary.

 

There are many ways to understand how extracted sentences could be distributed in the tree. If the extraction were totally systematic, sentences would flow down the tree towards the root, as in a sports ladder. Summarization would then always be a question of choosing the better of two sentences. For example, Sa would summarize S1 and S2 and would be chosen between those two, and the same scenario holds for Sb, Sc and so on. Then St would be the winner between Sa and Sb; Su the winner between Sc and Sd, and so on.
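The sports-ladder scenario can be sketched directly. The comparison oracle `better` is a hypothetical placeholder (here a toy length heuristic); in practice it would be whatever measure decides which of two sentences is more representative.

```python
# A sketch of the "sports ladder" scenario: sentences flow down the tree,
# each lower stratum formed by pairwise playoffs between the winners above.
# `better` is a stand-in oracle; the default length heuristic is a toy.
def ladder_strata(sentences, better=lambda a, b: a if len(a) >= len(b) else b):
    """Return the list of strata, from the full text down to one sentence."""
    strata = [list(sentences)]
    while len(strata[-1]) > 1:
        level = strata[-1]
        nxt = []
        for i in range(0, len(level), 2):
            pair = level[i:i + 2]
            # An odd sentence out advances unopposed.
            nxt.append(pair[0] if len(pair) == 1 else better(pair[0], pair[1]))
        strata.append(nxt)
    return strata
```

Each halving corresponds to one level of the tree; the final stratum is the single "root" sentence.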

 

Another way would be to redo the competition at each level with all the original, topmost sentences; sentences would not flow down the tree, but at each level would compete on equal footing within a larger and larger pool of sentences. For example, Sa would summarize S1 and S2 and would be chosen between those two; St would summarize S1-4, and would be chosen from among those four, not as a competition between just the winners at the level below, Sa and Sb. In this scenario, extracting is a question of selecting the sentence that is most representative of a given subset of sentences. Any sentence could therefore surface at a lower level without having won at the intervening levels. Sp, for example, might appear at the root of the tree and nowhere else, except of course in the full text at the top.

 

In fact, what Autosummarize does is different from both scenarios: in Word, any sentence at a lower level will be found at all the levels above it. Sp will then be found in all the other summaries. This is because every sentence in Word is ranked against the total population of sentences, not with respect to a given chunk of the text. The ranking happens once, globally, and the different percentage levels are cut from that single ranking.
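This one-pass behaviour is the simplest of the three to sketch. Rank once, then slice the ranking at each percentage; by construction every lower stratum is a subset of the strata above it (the scoring function is again a placeholder, not Word's actual measure):

```python
# A sketch of Word's actual behaviour: rank every sentence once, globally,
# then keep the top p% for each level. Lower strata are strict subsets of
# higher ones, so a sentence kept at 1% appears in every other summary.
def global_strata(sentences, score, levels=(1.0, 0.5, 0.25, 0.01)):
    ranked = sorted(sentences, key=score, reverse=True)
    strata = {}
    for p in levels:
        n = max(1, int(len(sentences) * p))
        keep = set(ranked[:n])
        # Preserve original text order within each stratum.
        strata[p] = [s for s in sentences if s in keep]
    return strata
```

The subset property is exactly what produces the Sp-everywhere pattern in the figure that follows.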

 

100%   S1 S2 S3 S4 S5 S6 S7 …  Sn-6 Sn-5 Sn-4 Sn-3 Sn-2 Sn-1 Sn

 50%   Sa=p  Sb=q  Sc=r  Sd=t  Se …

 25%   St=p  Su=q  Sv=r  Sw …

  1%   Sq=p  Sr …

 1 S   Sp

 

There are obvious problems with all these scenarios, the main one being that selection should be conditioned by discourse topology. For example, it is clear in the Trishka story that the end is a kind of summary. None of these models would really deal with that discourse topology effectively.

2. Manual summarization

Even though Microsoft calls its feature summarization, it is clear that it is simply extracting convenient subtexts. The main criticism one would make of the system is that stories are not articulated into subcomponents, each of which would have proportional representation at the next level. In other words, each sentence of the text is considered with respect to all the other sentences, and there is no appreciation for the role and meaning of subtexts.

 

To take an extreme comparison, if one could imagine an autosummarize applied to things, such as a bicycle, rather than to texts, it would compare components, such as a bolt, a wheel, handlebars, or an inner tube, and select one as the most emblematic of the whole bicycle. For the kind of stratified hypertext I would like to consider, this kind of extraction is simply not good enough. We want to extract a miniature bicycle, not some collection of its parts; we would like to have a bit of the wheel, a bit of the handlebars, a bit of the seat, etc. It is true that texts, unlike bicycles, can contain their own summary as a component. In the Trishka text, the moral of the story given in the last lines summarizes Trishka’s story. Unfortunately, Word is oblivious to that. While the autoreferential component of texts is linguistically interesting, most texts are built like bicycles: they do not contain themselves.

2.1.Trishka

 

Stratified hypertexts, like Russian dolls, have a fractal structure: each stratum is a smaller version of the whole, not simply a subset of components. A hand-crafted cascading summary of Trishka might be the following: a Manual cascading summary of Trishka in table format and a Manual cascading summary of Trishka in outline format. (I make no claim that my reading is particularly accurate or complete; it is intended simply as a tractable example.)

 

These summaries are not based on extraction. Rather, each stratum is a new text, though there are clear links between strata. It is basically an outline, but one read across same-level subheadings (the columns), which is not how a typical outline is read. So each column is a whole text, not simply the components of a higher level. Most outline managers, such as Word’s, will not display a given level without displaying the higher levels as well. I have used yellow highlighting to indicate the new information that appears at a given level. The non-highlighted text is generally the same as the level above it and should be thought of as the tight links between the levels. How to represent those tight links more explicitly remains a question; for the moment, highlighting simply contrasts them with the new information.
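The highlighting decision itself is mechanical enough to sketch. A minimal version, assuming "new information" is approximated word-by-word (a real implementation would need lemmatization and alignment between strata):

```python
# A sketch of the highlighting logic: mark as "new" the words of a stratum
# that do not occur in the more concise stratum above it. The word-level
# comparison is a simplifying assumption.
def new_information(upper, lower):
    """Return the tokens of the more detailed text `lower` whose words are
    absent from the more concise text `upper`; these would be highlighted."""
    seen = set(w.lower().strip('.,;:!?') for w in upper.split())
    return [w for w in lower.split()
            if w.lower().strip('.,;:!?') not in seen]
```

Run between each adjacent pair of strata, this yields exactly the yellow-highlighted spans: the detail each level adds over the one above.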

 

This hypertext is constructed in an intuitive way, with no particular guiding principles. Many of the choices I made need to be justified. In fact, formulating explicit guidelines for generating summaries is a central theoretical issue for the field of automatic summarization. I won’t consider that dimension of the problem here. It seems more important to first sketch out how a stratified hypertext might be used, because some decisions about its ultimate structure will depend on its use. One particular use lies in the field of language instruction.

2.2.Re-Reading

 

My initial interest in summarization was as a simple tool for language learning. Motivating second language learners to read in their target language presents a dilemma: on the one hand, students are motivated to read in the original those authors whose works are already well known to them, perhaps as part of the literary canon in their native language. Shakespeare is an example for ESL; Hugo for FSL. Texts from such authors offer a rich and yet familiar conceptual framework for language learning and a natural transition to immersion in a foreign language and culture.

On the other hand, texts by well-known authors are often difficult to read in the original, even for native speakers. "Hamlet" is extremely well known to most ESL students, but its archaic language and syntax are such that it would make little sense to suggest it as introductory reading, however motivating and useful prior knowledge of the plot might be.

One solution to that dilemma is to rely on scaled-down versions of classic stories that are hand-crafted for language learners. Instead of the original "Hamlet", one could imagine any number of simplified Hamlets written at different levels of linguistic complexity. In fact, these derivative Hamlets could serve as useful stepping stones towards a mature reading of the original. The learner would start by first reading a simple but complete version of Hamlet at the appropriate level, with a simple syntax and lexicon, and then reread the entire story in a more complex version. By reading the same story several times at increasing levels of detail, the student should be able to gain in a systematic manner the linguistic expertise needed to read the original.

3. Prototype

 

The stratified hypertext would be the way to present these different levels to the reader in a convenient manner. Here is a prototype of a Stratified Browser.

 

There are a thousand pedagogical questions to confront, and particularly the whole question of re-reading and repetition. To really understand the pedagogical implications, I feel a stratified browser has to be developed first, and then one can see how readers respond to it.

 

The browser itself is really the simplest part of the package. The more important part is found in the algorithms that would allow one to take a known text and generate a stratified hypertext in a fairly systematic fashion, whether it is manually, semi-automatically or automatically.

 

True summarization is a hard problem in natural language processing, and probably will not be solved any time soon. That does not mean, however, that summarization could not be computer assisted.


4. Conclusion

 

My goal in this talk was simply to define an area of research that has interesting theoretical and practical implications. Stratified hypertext is an interesting tool for language learning and perhaps for information presentation in general. It also offers a convenient way to understand semantic webs, and to map how meaning is distributed in texts.

 

Classical hypertexts are not defined in terms of text generation. Links are syntactic references that are pasted on top of texts. The links of a stratified hypertext, on the other hand, are ideally generated, or at least systematically constrained, by the source text’s meaning. Stratified hypertexts have semantic links because such links are generated according to relevance. Obviously, summaries are by nature tightly bound to the source text’s meaning, and that is why stratified hypertext has a special place in theorizing semantic links.

 

I have only sketched here some of the interface issues. The most crucial question, both practically and theoretically, is finding ways to make text generation part of how we think about hypertexts. I have not dealt with that crucial problem here.

 

5. Works Consulted

 

 

Vandendorpe, Christian. “Variétés de l’hypertexte”. Astrolabe: Encyclopédie. <URL http://www.uottawa.ca/academic/arts/astrolabe/auteurs.htm > Ottawa: U. of Ottawa. 2000.

 

Shcheglov, Yu. and A. Zholkovsky. Poetics of Expressiveness: A Theory and Applications. Philadelphia: John Benjamins. 1987.