Inter-annotator agreement in spoken language annotation: Applying uα-family coefficients to discourse segmentation

As databases make corpus linguistics a common tool for most linguists, corpus annotation becomes an increasingly important process. Corpus users need not only raw data but also annotated data, submitted to tagging or parsing processes through annotation protocols. One problem with corpus annotation lies in its reliability, that is, in the probability that its results can be replicated by independent researchers. Inter-annotation agreement (IAA) is the process which evaluates the probability that, applying the same protocol, different annotators reach similar results. To measure agreement, different statistical metrics are used. This study applies IAA for the first time to the Valencia Español Coloquial (Val.Es.Co.) discourse segmentation model, designed for segmenting and labelling spoken language into discourse units. Whereas most IAA studies merely label a set of pre-defined units, this study applies IAA to the Val.Es.Co. protocol, which involves a more complex two-fold process: first, the speech continuum needs to be divided into units; second, the units have to be labelled. Krippendorff's uα-family statistical metrics (Krippendorff et al. 2016) allow measuring IAA in both segmentation and labelling tasks. Three expert annotators segmented a spontaneous conversation into subacts, the minimal discursive unit of the Val.Es.Co. model, and labelled the resulting units according to a set of 10 subact categories. Krippendorff's uα coefficients were applied in several rounds to elucidate whether the inclusion of a larger number of categories, and the distinctions among them, had an impact on the agreement results. The conclusions show high levels of IAA, especially in the annotation of procedural subact categories, where results reach coefficients over 0.8. This study validates the Val.Es.Co. model as an optimal method for exhaustively analyzing a conversation into pragmatically-based discourse units.

In most previous work, IAA is used to measure the fit of a set of labels onto a set of units: in Crible & Degand (2019a), a set of 423 tokens of discourse markers (henceforth, DM) is annotated independently by two expert annotators with thirty-four hierarchically distributed functional labels. Likewise, in Scholman et al. (2016), 40 non-expert annotators annotate discourse relations in 36 excerpts containing pre-delimited segments, taking as a basis the cognitive approach to coherence relations (Sanders et al. 1992, 1993). In this study, 12 hierarchically distributed categories are assigned to each pairing of segments. Common to both studies is the fact that the annotators operate with two closed sets: DMs or pairs of utterances, on the one hand, and discourse relationships, on the other. Valuable as these efforts might be, IAA acquires an extra layer of complexity when the process implies a prior identification of the units to be labelled. In this case, the annotation process involves two consecutive steps, segmentation and labelling: a) segmentation means identifying units by setting their boundaries in a given continuum (e.g. in a text or in a conversation); b) labelling is the assignment of a specific category to each unit. This two-fold procedure constitutes the main endeavour of discourse segmentation models (Pons Bordería 2014), which are theoretical proposals aimed at fully dividing speech into units and subunits, just as syntactic analyses do with sentences and phrases. The calculation of IAA is an important step to evaluate the fit of a given model and to compare it to other models on an objective basis. However, IAA has not previously been applied to both processes simultaneously, as this paper does.
To better illustrate this two-step annotation process, recall example (1), where two speakers (S1 and S2) discuss their preferences regarding two supermarket chains, Consum and Mercadona:

(1) S1: no me gustan las de Consum me gustan más las de Mercadona
S2: a mí también pero mi madre compró en Consum ayer
[S1: I don't like the ones from Consum I prefer the ones from Mercadona
S2: me too but my mother shopped in Consum yesterday]

Excerpt (1) can be analyzed by two different annotators, say A and B. Their analysis comprises two different tasks: the first consists of dividing the text into linguistic units, as shown in (1'); the second consists of labelling the units from a closed set of alternates {x, y, z, …, n}, as shown in (1''). With respect to the first task, differences in interpretation can produce different segmentations. In (1'), annotator A interprets a sequence abc as a single unit ([abc]), whereas annotator B analyzes the same sequence as two units ([ab][c]). Divergences may also arise in the second task, labelling, as annotators A and B can interpret the same sequence differently ([xabcx] vs. [yaby][zcz]), as shown by (1''). Examples (1') and (1'') illustrate the complexity of an annotation process involving segmentation and labelling. Most research on IAA consists of matching a set of labels (pragmatic functions, or discourse relationships) onto a pre-defined set of units (DMs, turns or punctuation-delimited sentences). In discourse segmentation, the units themselves have to be established independently by each annotator. Here, agreement is much harder to reach, for not only is a good match in the labels-onto-units projection needed, but this match also depends on a prior agreement on the segmentation of discourse units.
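The two tasks just described can be sketched in code. The following is a minimal, hypothetical representation (all names are ours, invented for illustration; this is not the Val.Es.Co. tooling): each annotator's analysis is a list of (segment, label) pairs, so that segmentation agreement and labelling agreement can be checked separately.

```python
# Hypothetical sketch of the two-step annotation process in example (1):
# each annotation is a list of (segment, label) pairs.

def boundaries(annotation):
    """Return the segmentation only (the token spans), ignoring labels."""
    return [segment for segment, _ in annotation]

# Annotator A treats "abc" as one unit labelled x;
# annotator B splits the same sequence into "ab" and "c", labelled y and z.
annotator_a = [("abc", "x")]
annotator_b = [("ab", "y"), ("c", "z")]

# Step 1 (segmentation): do the unit boundaries coincide?
same_segmentation = boundaries(annotator_a) == boundaries(annotator_b)

# Step 2 (labelling) is only comparable where segmentation agrees.
print(same_segmentation)  # False: label agreement cannot even be computed here
```

The point of the sketch is the dependency between the steps: when `same_segmentation` is false, the labels x vs. y/z cannot be compared unit by unit, which is exactly why segmentation-aware coefficients are needed.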
The analysis in this paper involves a complex approach to IAA, especially considering that i) the object of study is spontaneous conversation, where contextual cues must be taken into account to properly identify units; and ii) the segmentation draws on syntactic, prosodic, semantic and pragmatic information (see 2.2).
The annotation process described so far becomes even more complex when more than two annotators are involved, as the potential sources of divergence multiply and, therefore, good results are harder to achieve (see note 6).
To sum up, three parameters can be involved in an annotation process:
a) The number of annotators.
b) The segmentation (or not) of the linguistic units as part of the annotation process.
c) The number of labels to be applied.
The complexity of the process largely depends on the values assigned to these variables. For instance, two annotators labelling the same set of discourse markers with a set of five categories face a total of 2 × 1 × 5 = 10 variables. Two annotators labelling a set of eleven discourse relationships on the same pairs of sentences face a total of 2 × 1 × 11 = 22 variables. Alternatively, three annotators dividing a full conversation into units (units which can be coincident or not) and assigning a set of eight labels to each unit face a total of 3 × 2 × 8 = 48 variables. It is evident that the more parameters are included in the annotation, the greater the differences that might be expected.
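The products above can be expressed as a small helper, shown here as a sketch (the function name and the boolean segmentation factor are ours, following the paper's parameters a-c):

```python
# Complexity index sketch: annotators × segmentation factor × labels.
# The segmentation factor is 2 when units must also be segmented,
# 1 when they are pre-delimited (parameter b above).
def annotation_complexity(annotators: int, segmentation: bool, labels: int) -> int:
    return annotators * (2 if segmentation else 1) * labels

print(annotation_complexity(2, False, 5))   # 2 × 1 × 5  = 10 (DM labelling)
print(annotation_complexity(2, False, 11))  # 2 × 1 × 11 = 22 (discourse relations)
print(annotation_complexity(3, True, 8))    # 3 × 2 × 8  = 48 (full segmentation)
```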
The metrics selected in this paper are Krippendorff's uα-family coefficients (Krippendorff et al. 2016), and the units to be tested are subacts, the minimal segments in the Val.Es.Co. model (see 2.3). As subacts organize the distribution of conceptual and procedural information (see note 7) in speakers' turns, IAA evaluates one key feature of a discourse segmentation model, namely the extent to which both kinds of meaning can be robustly accounted for by a single, pragmatically-based analysis.
In what follows, section 2 presents previous literature on discourse segmentation (§ 2.1) and discusses the applicability of IAA to discourse segmentation models. More specifically, the Val.Es.Co. model (§ 2.2) and the statistical techniques for measuring IAA (§ 2.3) are presented in detail. Section 3 explains the methodology of this study. Section 4 shows the results obtained in the IAA measurement, and sections 5 and 6 sum up the results and the main findings of this study.

Current annotation proposals by discourse segmentation models
Since spoken discourse began to be a focus of interest for linguistic research, it became evident that traditional syntax was too narrow as a segmentation tool (Pons Bordería 2014: 1). Units such as the sentence or the clause proved inadequate for analyzing spoken language, where some "deviant" language uses ("unachieved" syntactic structures, multifunctional discourse markers or unusual word ordering, to mention just a few) are not the exception but the rule (Sornicola 1981, Blanche-Benveniste & Jeanjean 1987, Narbona 1986, 1992, 2012, Briz 1998). The need for a new syntax (Narbona 1992) to account for spoken language set the grounds for an emerging area of research on models for discourse segmentation. As Pons Bordería (2014: 1) explains, efforts to find new units for analyzing spoken discourse have been made in particular for Romance languages, where Latin grammar has traditionally been influential. This is evident in the proliferation of segmentation models (note 8) for French, Spanish or Italian, such as those of Geneva (Roulet et al. 1985, Roulet, Fillietaz & Grobet 2001), the Sorbonne (Morel & Danon-Boileau 1998), the Val.Es.Co. Research Group (Briz & Grupo Val.Es.Co. 2003, Grupo Val.Es.Co. 2014), Leuven (Degand & Simon 2009a) and Freiburg (Groupe de Fribourg 2012).

Note 6: Artstein and Poesio (2005) prove that, as regards tests such as Fleiss' κ and a generalized Cohen's κ, including more annotators is a good way to decrease the so-called annotator bias (the individual preferences of annotators). See also Artstein and Poesio (2008: 570-573).
Note 7: The conception of procedural meaning used in this paper is limited to non-propositional procedural meaning, which equates it with discourse markedness (Briz and Pons Bordería 2010). For a more comprehensive account of procedural meaning, see Wilson (2011) and Grisot (2017).
All these models, while offering different units and divergent criteria to identify them, have in common one aim: segmenting spoken language without leaving any segments unanalyzed.
Segmenting spoken language becomes especially challenging when it comes to smaller-scope units (Degand & Simon 2005, 2009a; Grupo Val.Es.Co. 2014: 12; Briz 2011). Contrary to wider-scope units such as the turn or the dialogue, identifying the smallest-scope units requires considering diverse parameters such as prosodic cues, syntactic boundaries or pragmatic information, which must be properly balanced to achieve a sound result. Evaluating such complex segmentation and labelling practices by means of IAA techniques provides a handle for assessing and improving any discourse segmentation proposal.
Despite this beneficial potential, discourse segmentation models have barely made use of IAA techniques. Since most of them are theoretical, studies showing the results of applying a segmentation model are the exception (Degand & Simon 2009b, 2011, Latorre 2017, Pascual 2015a, 2015b). To the authors' knowledge, no model has applied IAA to test protocols for segmenting discourse into units.
We believe that IAA contributes to providing a robust way of identifying discourse units, a goal at which segmentation models should aim. Testing the segmentation protocol is crucial for developing more robust theories and protocols. This study applies IAA to the Val.Es.Co. model, more specifically to the subact unit.

The Val.Es.Co. model (VAM) of discourse segmentation
The Val.Es.Co. model of discourse units (henceforth, VAM) (Briz & Grupo Val.Es.Co. 2003, Val.Es.Co. Group 2014) relies on different approaches: Conversation Analysis (Sacks et al. 1974), Discourse Analysis (Sinclair & Coulthard 1975), the Sorbonne Group (Morel & Danon-Boileau 1998) and the Geneva Group (Roulet 1985, Roulet 1991, Roulet et al. 2001). Since 2003, this framework has been applied to different problems, such as the polyfunctionality of discourse markers (Briz 1998, Briz & Pons 2010, Estellés 2011, Pons 2008), the study of intensification and hedging devices (Albelda 2007, Albelda & Gras 2011), or diachronic approaches to grammaticalization and constructionalization (Pons & Estellés 2009, Pons 2014, Salameh 2021). The VAM comprises eight hierarchical units (discourse, turn-taking, turn, dialogue, exchange, intervention, act and subact) located in three dimensions (social, structural and informative) and two levels (monologic and dialogic), as Table 1 illustrates. In this top-to-bottom model, wider-scope units have scope over smaller-scope units (e.g. interventions have scope over acts, exchanges have scope over interventions, and so forth). Speaking is conceived as an activity involving three dimensions: first, speaking is a social activity, where speaker and hearer interact; second, speaking is a structural activity, consisting of uttering language (including disfluency phenomena such as false starts or truncated segments); finally, speaking is an informative activity, whereby information is packed into units.
The act and subact units are monological, whereas the exchange, turn, turn-taking, discourse and dialogue are dialogical units. In turn, the intervention is both monological and dialogical, as it is, at the same time, the maximal projection of a speaker's production and the minimal unit aimed at interacting with other participants. Dimensions, levels and units are interrelated and allow for a complete segmentation of a conversation.
The IAA study in this paper focuses on the smallest unit in the VAM, the subact, conceived as the smallest piece of information delivered by a speaker. As such, it is perhaps the most difficult unit to identify, since the boundaries of informative units intertwine with syntactic ones (Briz & Grupo Val.Es.Co. 2014) (note 9).

Subact: definition and types
A subact is defined as the smallest monological and informative unit. Subacts are hierarchically subordinated to a wider-scope unit called the act; therefore, a subact or a group of subacts constitutes an act, defined as the host of an illocutionary force (Grupo Val.Es.Co. 2014: 54). Notation-wise, subacts are indicated by braces ({ }), whereas acts are indicated by the hash sign (#).
Subacts are classified into two main categories, depending on the type of information they convey: substantive subacts (SS) convey conceptual information, and adjacent subacts (AS) convey procedural information. SS are, in turn, subdivided into directive substantive subacts (DSS), subordinated substantive subacts (SSS) and topicalized subordinated substantive subacts (TopSSS); AS are subdivided into textual (TAS), modal (MAS) and interpersonal adjacent subacts (IAS). DSS carry the weight of the main content in the act; SSS host semantically secondary or dependent information; TopSSS are instances of prosodically or informatively detached constituents. In example (2), the TopSSS "and to the cinema→" is prosodically detached from the segment that conveys the main illocutionary force, "are you coming?". At the same time, the TopSSS is informatively dependent on the DSS (otherwise, the prototypical ordering of the utterance would be "and are you coming to the cinema?"). On the other hand, the SSS "because I should prepare for my exam" depends on the DSS "I cannot go" (as shown by the subordinating conjunction because) and contains the explanation derived from the negative assertion made by A (Salameh, Estellés & Pons, 2018: 115). This SSS could be removed without changing the illocutionary force of the intervention, a refusal; its subordinated nature lies in the fact that B would not be able to answer A's previous intervention with just the SSS, as shown in (2'). Together, these six labels (DSS, SSS, TopSSS, TAS, MAS and IAS) account for most of the distribution of information in a spontaneous conversation. However, in spontaneous conversations some constituents remain unachieved, reflecting processes of language planning (Ochs 1979, Sornicola 1981). These fragmentary units pose a problem for any discourse segmentation model, since, by the nature of their unachieved status, they cannot be classified as AS or SS. According to their degree of completion, the Val.Es.Co.
model classifies them as XSS (an incomplete constituent with conceptual content), XAS (an incomplete constituent with procedural content), XXS (an incomplete constituent whose conceptual or procedural nature cannot be established), and R (a sub-structural, residual element in the analysis) (note 10) (Pons Bordería 2016; Pascual 2018, 2020). Example (4) shows some of these fragmentary units.

Statistical tests: Krippendorff's uα-family coefficients

Krippendorff (1995, 2003, 2013) and Krippendorff et al. (2016) have developed a family of statistical coefficients to measure agreement not only in the labelling of units by different annotators, but also in the segmentation of units in a continuum that has not been pre-segmented, i.e. in cases where there is no total number of pre-established units for each annotator to label. This family comprises four coefficients: uα, |uα, cuα and (k)uα. In the case of IAA, the variables taken into account by these tests are the following:
a) The location of the units in the continuum: this variable measures whether two or more annotators have identified the same unit in the same time span.
b) The length of the units: this variable measures whether units have the same duration in milliseconds, even if they are not placed at exactly the same minute and second of the conversation.
c) The total number of annotated units in a given span of time.
d) The type or label of the annotated unit.
These variables are closely related to the goals of a two-fold annotation process like the one performed in this paper: on the one hand, the segmentation process involves (a) placing and (b, c) bounding subacts; on the other hand, the labelling process implies (d) categorizing the types of subacts previously identified in a conversation.
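Variables (a)-(d) can be made concrete with a minimal data sketch. The field names below are ours and purely illustrative (this is not the data format of Krippendorff et al.'s software): each annotated unit records its position on the timeline, its duration and its label.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Unit:
    start_ms: int   # (a) location in the continuum
    end_ms: int
    label: str      # (d) type or label of the unit

    @property
    def length_ms(self) -> int:  # (b) length of the unit
        return self.end_ms - self.start_ms

# Two hypothetical annotations of the same short span:
annotator_a = [Unit(0, 800, "DSS"), Unit(800, 1200, "TAS")]
annotator_b = [Unit(0, 800, "DSS"), Unit(850, 1200, "MAS")]

# (c) total number of annotated units per annotator in this span
print(len(annotator_a), len(annotator_b))  # 2 2

# A unit fully coincides only if location, length and label all match:
shared = set(annotator_a) & set(annotator_b)
print(shared)  # only the DSS unit is identical for both annotators
```

The second unit disagrees in all four respects at once (location, length and label), which is the kind of partial mismatch the uα family is designed to quantify.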
Adapting the example provided by Krippendorff et al. (2016: 2349), Figure 1 illustrates what happens when three different annotators (A, B and C) segment and annotate a conversation into subacts. Columns (1) to (5) show the different possibilities of the analysis and, therefore, the variables taken into account by the four uα-family coefficients.
In column (1), all three annotators agree in the segmentation and in the labelling of all the variables, since the units coincide in their location, length, number and type; in (2), the units show the same segmentation (location, length and number) but differ with respect to their labels (TAS, MAS and IAS); in (3), the units are not equally segmented (they are located in different time spans, albeit coinciding in length and number) but are equally labelled (DSS in all cases); in (4), the units are equally labelled but differ in their segmentation (they occur in the same time span, but differ in number and length); finally, in (5) there is no agreement either in segmentation or in labelling (annotator A identifies a TAS while annotators B and C do not identify a linguistic unit at all).
Thus, Krippendorff's uα-family coefficients provide indicators that allow measuring agreement in both the segmenting and the labelling procedures. This is why Krippendorff's metrics have been chosen for measuring IAA, in contrast with other statistical tests that measure only categorical agreement in labelling, such as Cohen's kappa, Fleiss' kappa or Scott's pi (note 11). The uα, |uα, cuα and (k)uα coefficients provide information about different aspects of the reliability of the annotation and vary in two essential points, namely in the way they compute agreement and in the type of data they take into account:
a) uα measures overall agreement in all data, meaning that the calculation includes both units and no-units (in our case, pauses, silences and gaps between subacts and turns); therefore, the final results include data irrelevant to the annotation;
b) |uα reduces the data to a binary metric (gap vs. no-gap) and does not distinguish between categories; this is useful to show agreement in the segmentation of a continuum into units, but it does not inform about the labelling performed by each annotator;
c) cuα shows agreement only on the units that have been assigned a value by all annotators (in our case, considering all types of subacts);
d) (k)uα goes a step further and specifies the agreement results for each individual label in the analysis, that is, for each subact type (DSS, SSS, MAS, TAS, etc.).

Note 11: According to Krippendorff et al. (2016: 2349), Guetzkow (1950) defined a coefficient to measure reliability in unitizing data (i.e. identifying units in continuous data). However, Krippendorff et al. (2016) point out that Guetzkow's test has several drawbacks: i) it is only applicable when exactly two annotators participate in the annotation procedure; ii) it measures disagreement in the number of units identified, but is unable to assess reliability on the agreed units; and iii) the result does not provide any information about whether the identified units overlap or whether they are related in any way (i.e. have the same or a different duration).
In conclusion, the Krippendorff coefficients can be understood as a set of tools leading to successive refinements of the IAA analysis: from units and no-units or gaps (uα), to the number of units and no-units per annotator, irrespective of their labelling (|uα); and from the labelling of all categories as a whole, excluding gaps (cuα), to a more fine-grained account of each category in particular ((k)uα).
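The successive data reductions behind this family can be illustrated with a toy computation. The sketch below is emphatically not the uα coefficients themselves (those correct for chance expectation and unit lengths on a continuum; see Krippendorff et al. 2016): it merely discretizes the timeline into bins and compares two annotators bin by bin, with None marking a gap (pause or silence), to show what each successive view keeps or discards.

```python
# Toy illustration of the successive data reductions behind the uα family.
# None marks a gap (pause/silence); strings are subact labels.
ann_a = ["DSS", "DSS", None, "TAS", "MAS", None]
ann_b = ["DSS", "DSS", None, "TAS", "IAS", None]

def raw_agreement(a, b):
    """Simple proportion of matching bins (no chance correction)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

# uα-like view: all data, gaps included in the comparison
print(raw_agreement(ann_a, ann_b))                  # 5/6 ≈ 0.83

# |uα-like view: binary gap vs. no-gap, labels ignored
binary = lambda seq: [x is not None for x in seq]
print(raw_agreement(binary(ann_a), binary(ann_b)))  # segmentation agrees fully: 1.0

# cuα-like view: only bins where both annotators coded a unit
coded = [(x, y) for x, y in zip(ann_a, ann_b) if x is not None and y is not None]
print(sum(x == y for x, y in coded) / len(coded))   # 3/4 = 0.75
```

Note how the three views can diverge: segmentation is perfect (1.0) while labelling of the coded units is not (0.75), mirroring the paper's point that |uα and cuα isolate different sources of (dis)agreement.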

Data and procedure
A 19-minute-long informal conversation (4,352 words) was segmented into subacts and labelled by the three annotators. The number of possible labels for any given constituent is 10. Taking into account that agreement was measured only for the units that did not overlap in time, and that the number of annotators was three, this means that, for any annotated constituent, the agreement possibilities were 1/(10 × 3 × 2).
Once the task was completed, the annotation results were transferred to an Excel sheet, overlapping units were removed from the data (note 13), and Krippendorff's uα-family coefficients were applied using the software provided by Krippendorff et al. (2016) in order to measure IAA. As the Krippendorff coefficients provide successive refinements, each test is informative of the fit of the analysis.
Successive rounds of IAA calculation were applied to different groupings of the same data, so as to elucidate to what extent working with a larger number of variables had an impact on the agreement results: first, the labels were reduced to the more general categories AS and SS, in order to measure the agreement related to the procedural vs. conceptual distinction; second, all the labels representing the 10 types of subacts (DSS, SSS, TAS, MAS, etc.) were taken into account; and third, the analysis focused specifically on the subtypes of procedural subacts (AS), with the aim of observing agreement in the identification of the textual, interpersonal and modal discourse functions. In each step, the analysis was performed twice in order to elucidate whether the presence or absence of the most residual subacts, namely undetermined subacts (XSS, XAS, XXS) and residuals (R), influences the IAA results.
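The groupings used in these rounds amount to collapsing the fine-grained labels before recomputing the coefficients. The following sketch shows one way this could be done; the mapping follows the categories named in the text, but the helper itself and its handling of XXS/R are our illustrative assumptions, not the authors' actual pipeline.

```python
# Sketch of the label groupings used in the successive IAA rounds.
# Round 1 collapses subact types into the coarse SS/AS distinction;
# each round is also re-run with the most residual categories excluded.
COARSE = {
    "DSS": "SS", "SSS": "SS", "TopSSS": "SS", "XSS": "SS",
    "TAS": "AS", "MAS": "AS", "IAS": "AS", "XAS": "AS",
    # XXS and R cannot be assigned to either macro-category:
    "XXS": "XXS", "R": "R",
}

def collapse(labels, drop_residuals=False):
    out = [COARSE[label] for label in labels]
    if drop_residuals:  # second pass of each round: exclude XXS and R
        out = [label for label in out if label not in ("XXS", "R")]
    return out

annotation = ["DSS", "TAS", "SSS", "XXS", "MAS", "R"]
print(collapse(annotation))                       # ['SS', 'AS', 'SS', 'XXS', 'AS', 'R']
print(collapse(annotation, drop_residuals=True))  # ['SS', 'AS', 'SS', 'AS']
```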

Inter-annotation agreement results
The following sections present the results of the study. Section 4.1 displays the raw data in the quantification of the subacts and provides an insight into the performance of the three annotators. Section 4.2 shows the results of Krippendorff's coefficients in the different rounds of analysis: starting with the labels representing procedural and conceptual subacts (4.2.1 and 4.2.2), continuing with the subacts conveying procedural information (4.2.3 and 4.2.4), and finishing with all the types of subacts (4.2.5 and 4.2.6). In all cases, the analysis is carried out twice, so that the effect of including and excluding the most residual subact categories (XSS, XAS, XXS and R) can be checked.

General results

Table 2 shows the number of units per annotator (named A, B and C). A first overview of the data shows that the total numbers of subacts identified by the three annotators are very similar (A n = 1331, B n = 1339, C n = 1325). This is a positive sign, especially taking into account the relatively high number of variables in the analysis.
Two additional columns indicate the number of subacts that could be computed using Krippendorff's coefficients: recall that Krippendorff's statistics cannot be applied to units overlapping in the same time span. Due to the nature of spontaneous conversations, overlapping affects 30.5 % of the annotated subacts in this analysis; these could not be computed and were removed. All in all, 2776 is a relatively large number of units for measuring IAA. In example (5), speakers 3 (S3) and 1 (S1) are repeatedly trying to take the floor. The restart ("it's the-") and the co-construction of the collaborative intervention ("[of the European UNION] that pays [best→ to instructors]") illustrate the competition for the floor. In turn, Table 3 shows that most of the excluded subacts (represented by the sign "Ø" in Table 3) belong to sub-structural categories such as XXS (47.19 %) or R (84.9 %), as these categories are frequent in overlapped speech and are often embedded within wider-scope units (a DSS, in the case of "it's the-") (Table 3).

Conceptual versus procedural labels (SS, AS)

Table 4 shows the results based on a first distinction between constituents with conceptual or procedural meaning (SS vs. AS). The second row in the table shows the results of including XSS and R in the analysis. The IAA results are high in all cases, showing that the conceptual-procedural distinction is clear-cut. In the case of uα (= 0.825 / 0.823) (note 14) and |uα (= 0.841 / 0.853), it must not be forgotten that inter- and intra-speaker pauses are treated as if they were labelled units. This means that the gaps between turns and pauses are also computed, even if they have not been labelled. Yet the results of cuα (0.843 / 0.813) and (k)uα (AS = 0.844 / 0.842, SS = 0.841 / 0.818) show that, once the gaps and pauses are excluded from the calculation, agreement in the segmentation is still high, as shown by example (6):
(6) S2: ee pasé dos días bailando / mira // [¡las secuelas!]
S1: [(RISAS)]
S3: [¿pero qué te ha pasao en] el ojo?
S1: pues que me caí / ((puees)) bebí un poquito de rusc→ /// de rusco (RISAS)
[S2: ee I spent two days dancing / look // [the consequences!]
S1: [(LAUGH)]
S3: [but what happened] to your eye?
S1: well que I fell / ((well)) I drank a little bit of rusc→ /// of rusco (LAUGH)]

Example (6) is segmented by annotator A into three SSs, whereas annotators B and C identify two SSs. All annotators agreed in considering the constituent "I spent two days dancing / look // [(at) the consequences!]" an SS, even if its boundaries remain less clear. Also, all three annotators identified the filler "ee" as procedural (AS) (Table 5). Neither the identification of boundaries nor the distinction between conceptual and procedural content is challenged by the inclusion of residual categories in the analysis, as proven by the prevailing high results across the different scores. Although the total number of XXS (n = 47) and R (n = 35) included in the calculation constitutes only 2.95 % of the total number of subacts (n = 2776), the (k)uα score is fairly good in the case of XXS (0.626), notwithstanding the controversial nature of residuals. Indeed, residuals are sub-structural elements whose status as pragmatic or semantic units remains unclear among scholars (Crible & Pascual 2019, Pascual 2020). Truncations such as y en- y en- ("and in- and in-") or es- es ("it's- it's-") are correctly identified by all three annotators as residual categories (see Table 6 below). In no case are residuals annotated as AS or SS, and disagreements remain limited to choosing between the two labels in this category, that is, between XXS and R. Disagreement in the conceptual vs. procedural distinction is limited to very specific instances of discourse markers, like que in pues que me caí ("well que I fell") in Table 7 or yy ("aand") in Table 8.
In these cases, the annotators hesitate between considering them pragmatic discourse markers (hence coded as autonomous AS) or grammatically integrated conjunctions (hence included in an SS). In conclusion, as regards the first, basic distinction between conceptual and procedural categories, the IAA results obtained here are particularly positive.

Procedural labels (TAS, MAS, IAS)
After this first distinction, the IAA analysis zooms in on the three types of AS in the Val.Es.Co. model: textual, modal and interpersonal (TAS, MAS and IAS). In a further step, the residual XAS label is added.
To better understand this process, consider example (7). Table 9 below shows that the performance of the annotators is very similar in identifying the boundaries and categories of AS. Apart from some marginal cases, the recognition of AS boundaries and, in most cases, their categorisation into subact types shows a high threshold of agreement.
One such marginal case is the adverbial particle encima (Engl. 'on top of it') in example (3), which is annotated as SS by annotator B, and as AS by annotators A and C. However, A and C diverge in the type of AS assigned to encima: modal (MAS) for annotator A, textual (TAS) for annotator C. A second case of disagreement is "uhum", considered a TAS functioning as a filler by annotator A, and an interpersonal marker (IAS) by annotators B and C. The agreement levels in this new round are again high (see Table 10 below): the uα (0.802), |uα (0.832) and cuα (0.846 / 0.846) metrics all exceed an IAA of 0.8. This means that not only the boundaries of the units are clear, but also their categorisation. Note also that the XAS category (with only two occurrences out of 2776 subacts) does not have a negative impact on the overall good agreement results, which remain similar in both cases. The fact that cuα is higher than uα and |uα might suggest, as Krippendorff et al. (2016: 2358) put it, that the agreement is due mostly to the labelling of units, not to the segmentation of units and the gaps between them (since gaps are excluded from the calculation, unlike in the uα and |uα computations).
As for the (k)uα test, although the only category with a lower level of agreement is IAS (0.738), this is still a highly positive result. The (k)uα value for TAS shows hardly any change when the residual XAS is included in the analysis (0.870 vs. 0.868). Overall, the model proves to be rather reliable in the segmentation of ASs.

All conceptual and procedural labels (DSS, SSS, TopSSS, TAS, MAS, IAS)
Finally, all the possible labels for conceptual (DSS, SSS, TopSSS) and procedural (TAS, MAS, IAS) categories are taken into account. The results (see Table 11) are positive (uα = 0.680 / 0.679, |uα = 0.807 / 0.853, cuα = 0.589 / 0.555), especially taking into account that a high number of labels (amounting to ten, with the inclusion of residual segments) is compared across three different annotations. In fact, distinguishing conceptual from procedural information does not pose great controversy among annotators, and neither does identifying types of procedural content (see § 4.2.1 and § 4.2.2). The |uα value being higher than uα, taken together with the lower result of cuα, shows that the agreement among annotators arises mainly from the identification of boundaries between units and gaps (units and pauses or silences). The segmentation of TopSSS and SSS shows lower results ((SSS)uα = 0.286 / 0.274; (TopSSS)uα = 0.184). In the case of TopSSS, the problem may lie in the theoretical definition of the category in the model; as for SSS, disagreements are probably due to the difficulty of determining how some constituents are informatively subordinated to others without making use of syntactic clues.
With respect to (k)uα, the results are still high. The high level of agreement on MAS (0.853) prevails, suggesting that this is the most reliable category across the three annotations.
To understand this last segmentation and labelling phase, consider example (9): [what a girl!] B: ((I mean)) plus it's- it's the first thing thaat that you learn dude that- those those things like that] Table 12 shows how two pieces of conceptual information in example (9) (colloquial Spanish and those things like that) are labelled differently: as TopSSS by annotators B and C, and as DSS by annotator A. Also, in the segment (where do you go? to the fucking nuns), annotators A and B identify a single DSS, whereas annotator C identifies an SSS and a DSS. The segmentation of IAS and MAS also proves to be complex, as shown by the status of "hein?" and "dude" as interpersonal cues or modalizers.

Discussion
The results obtained in the different rounds of IAA analysis show a high level of agreement. In most cases, the coefficient values exceed a threshold of 0.800; in the remaining cases, the rates are above 0.500, with the exception of the most residual subact units. Despite the lack of scientific consensus on what an "acceptable" level of IAA should be (Artstein & Poesio 2008; van Enschot et al. in press; Krippendorff et al. 2016), the application of the Val.Es.Co. annotation protocol for segmenting conversations into subacts yields a very positive outcome, especially considering that the annotation procedure is complex and involves two tasks: segmenting and labelling units in a conversational continuum.
The successive groupings of categories in the different rounds of analysis lead to differences in the IAA results: needless to say, the greater the number of labels in the calculation, the lower the IAA rates. The main results in each round of analysis can be summed up as follows:
-The comprehensive distinction between substantive and adjacent subacts (SS vs. AS) shows a notably high level of agreement among annotators, even when the most fragmentary units (XSS and R) are included in the model.
-Procedural labels (AS) offer a robust IAA result that reaches over 0.800 (see 4.2.2), even when the most residual AS unit, the XAS, is included. Including the XAS ((k)uα = 0.000) in the calculation does not entail a general decrease in agreement, nor does it significantly affect the overall IAA for AS. This shows that agreement on AS categories is prevalently high.
-As for the full set of subact labels, the overall IAA results are high. MAS is the category with the highest agreement rates, in line with the general trend shown by AS, whose IAA results are higher than those of conceptual subacts (SS). SSS and TopSSS are the labels showing the lowest IAA rates, which suggests that these categories call for a more thorough definition in the model. They highlight the difficulty of analyzing the hierarchical organisation of conceptual information in spoken language, a genre that stands out precisely for a non-prototypical distribution of information and a non-prototypical syntactic organisation, in comparison to more formal or written uses of language.
-Finally, the inclusion of the most residual units does not lead to an increase in the rate of disagreement. XSS and R are sub-structural constituents that bring to light the difficulties underlying the analysis of spontaneous speech. This study shows that the VAM is able to account for these residual segments by offering labels for their analysis, unlike other models of discourse segmentation (Pascual 2020).

Conclusions
IAA emerges as a method for testing the reliability and replicability of corpus annotation protocols. This paper tested the performance of three annotators following the VAM annotation protocol, which, in turn, makes it possible to assess the validity of the model. This is also the first time that Krippendorff's coefficients have been applied to the whole process of discourse segmentation, thus setting new standards for validation within the field.
The present study has followed a two-fold procedure for segmentation: first, the conversational continuum has been divided into discourse units; second, each unit has been classified as a type of subact. The complexity of this annotation procedure contrasts with most IAA studies, where the measurement of agreement relies only on the categorization of pre-defined units whose boundaries have been set in advance (see for example Crible & Degand 2019a, Scholman et al. 2016).
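The two-fold procedure above can be pictured with a minimal data-structure sketch: each annotator's segmentation is a list of labelled spans over the transcript, from which boundary positions can be extracted and compared. This is an illustrative simplification only, not the Val.Es.Co. tooling or the uα computation; the `Segment` type, offsets, and the raw boundary-overlap score are assumptions made for the example (the actual uα family also weights near-misses and gaps).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Segment:
    start: int   # offset into the transcript (e.g. in characters)
    end: int
    label: str   # subact label, e.g. "DSS", "SSS", "TAS"

def boundaries(segments):
    """Boundary positions implied by an annotator's segmentation."""
    points = set()
    for s in segments:
        points.update((s.start, s.end))
    return points

def boundary_agreement(a, b):
    """Share of boundary positions two annotators place identically
    (a crude proxy: unlike uα, it gives no credit to near-misses)."""
    pa, pb = boundaries(a), boundaries(b)
    return len(pa & pb) / len(pa | pb)

# Hypothetical segmentations of the same 40-character stretch of talk:
# both annotators agree on the first boundary but not on the second.
ann_a = [Segment(0, 12, "DSS"), Segment(12, 20, "TAS"), Segment(20, 40, "SSS")]
ann_b = [Segment(0, 12, "DSS"), Segment(12, 25, "TAS"), Segment(25, 40, "SSS")]
print(boundary_agreement(ann_a, ann_b))  # 3 shared of 5 distinct positions → 0.6
```

Separating the boundary question from the label question in this way mirrors the study's two tasks: only once the spans are fixed can label agreement (the previous sketch) be measured on top of them.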
Krippendorff's uα-family coefficients were applied to measure IAA in several rounds of analysis of the same data. As outlined in section 5, the results of the experiment are very positive, since high levels of IAA were obtained in most analyses. Agreement reaches positive results, yielding coefficients over 0.8, when it comes to distinguishing conceptual from procedural content (4.2.1) and the different procedural functions conveyed by AS (4.2.2). The few shortcomings of the protocol are explained by the difficulty of delimiting certain constituents (SSS and TopSSS), thereby calling for a better account of such units (4.2.3).
In conclusion, Krippendorff's coefficients, applied for the first time to test a model of discourse segmentation, validate the Val.Es.Co. model as an optimal method to fully analyze a conversation into pragmatically-based discourse units.