Special track on the Syntactic Analysis of Non-Canonical Language
ENDORSED BY SIGPARSE
The SANCL special track will be part of the joint SPMRL-SANCL 2014 workshop on Statistical Parsing of Morphologically Rich Languages and Syntactic Analysis of Non-Canonical Languages.
Co-located with COLING 2014, August 24 in Dublin, Ireland
SANCL Poster submissions
In addition to regular paper submissions, we ask for poster submissions addressing the syntactic analysis of frequent phenomena of non-canonical languages which are difficult to annotate and parse using conventional annotation schemes. Cases in point are the representation of verbless utterances in a dependency scheme, the pros and cons of different representations of disfluencies for statistical parsing, and the analysis of complex hashtags which incorporate and merge different syntactic arguments into one token.
Poster submissions should focus on one or more of the topics listed below. They can be submitted either as a short paper (up to 7 pages plus references, to be included in the proceedings and presented as a poster at the workshop) or as an abstract (max. 500 words excluding examples/references, to be presented as a poster at the workshop). Abstract submissions should sketch an analysis for a given problem, while short paper submissions should also present at least preliminary experimental results showing the feasibility of the approach.
Submissions will be accepted until June 13, 2014 (11:59 p.m. GMT) in PDF format via the START system (https://www.softconf.com/coling2014/WS-13) and must be formatted according to the COLING 2014 formatting instructions.
Important Dates
Submission deadline: June 13, 2014
Author notification: July 1, 2014
Workshop: August 24, 2014
Topics for poster submissions:
Unit of analysis
For canonical, written text, the relevant unit for syntactic analysis is defined by the sentence boundaries. In CMC (computer-mediated communication), on the other hand, sentence boundaries are not always marked in a systematic way, and for spoken language we cannot revert to sentence boundaries at all. Decisions concerning the relevant unit of analysis will influence corpus-linguistic research (e.g. measures like sentence length or syntactic complexity) as well as parsing results. On the token level, it is also not clear what should be used as the unit of analysis. In spoken language as well as in conceptually spoken registers like CMC, multiple tokens are often merged into one new token (2, 4-6), or long compound words are split into separate units (5). It is not yet clear whether it is preferable to address these issues during preprocessing, e.g. by tokenizing and normalising the text, or whether this would result in a "lossy translation" (as argued by Owoputi et al. 2013) that should be avoided.
We ask for contributions on the optimal unit of analysis for non-canonical languages which do not come already segmented into sentence-like units (e.g. spoken language (1), tweets (2), historical data (3)). We ask for contributions on best practices for tokenizing spoken language and CMC (2, 4-6).
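As a concrete illustration of the token-level question, the following Python sketch shows one possible way of tokenizing CMC data without discarding the original surface forms: each surface token is kept next to a normalised reading, so both layers remain available downstream and the "lossy translation" problem is at least mitigated. The small lexicon of merged forms is purely illustrative and not part of this call.

    # Minimal sketch: whitespace tokenization plus a normalisation layer.
    # The lexicon entries below are illustrative assumptions only.
    MERGED = {
        "gibts": ["gibt", "es"],          # German CMC: "gibts" = "gibt es"
        "haste": ["hast", "du"],          # "haste" = "hast du"
        "imo": ["in", "my", "opinion"],   # English abbreviation
    }

    def tokenize_with_normalisation(text):
        """Return surface tokens together with their normalised readings."""
        analysed = []
        for surface in text.split():
            normalised = MERGED.get(surface.lower(), [surface])
            analysed.append({"surface": surface, "normalised": normalised})
        return analysed

    if __name__ == "__main__":
        for tok in tokenize_with_normalisation("gibts das auch imo ?"):
            print(tok["surface"], "->", " ".join(tok["normalised"]))

Keeping both layers is of course only one of several options; a contribution might equally argue for normalising destructively or for parsing the raw surface forms directly.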
Elliptical structures and missing elements
Non-canonical languages often include sentences where syntactic arguments are not expressed at the surface level. This raises the question of how we can provide a meaningful analysis for these structures, especially in a dependency grammar framework. One way to deal with the problem is to insert missing predicates as dummy verbs into the tree in order to provide a dependency analysis for these structures (e.g. Seeker & Kuhn 2012; Dipper, Lüdeling & Reznicek 2013; see the NoSta-D annotation guidelines). The question remains whether this approach is feasible for automatic processing, especially for the highly underspecified and ambiguous input often provided by NCLs, or whether a constituency-based analysis offers more elegant means to analyse elliptical structures.
We ask for contributions discussing the optimal representation for elliptical structures such as the ones in (7)-(9).
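Purely as an illustration of the dummy-predicate strategy mentioned above (and not as a recommendation of any particular guideline), the following Python sketch inserts an artificial predicate node into a verbless utterance so that every overt token receives a dependency head. The example utterance, the label inventory and the attachment heuristic are invented for the sketch.

    from dataclasses import dataclass

    @dataclass
    class Token:
        idx: int        # 1-based position in the utterance
        form: str       # surface form; "_DUMMY_" marks the inserted predicate
        head: int       # index of the governing token (0 = artificial root)
        deprel: str     # dependency label (illustrative inventory)

    def analyse_verbless(forms):
        """Attach all overt tokens to an inserted dummy predicate."""
        dummy = Token(idx=len(forms) + 1, form="_DUMMY_", head=0, deprel="root")
        tree = []
        for i, form in enumerate(forms, start=1):
            # naive heuristic: first token = subject, everything else = dep
            rel = "subj" if i == 1 else "dep"
            tree.append(Token(idx=i, form=form, head=dummy.idx, deprel=rel))
        tree.append(dummy)
        return tree

    if __name__ == "__main__":
        # a verbless utterance: "Coffee, anyone?"
        for tok in analyse_verbless(["Coffee", ",", "anyone", "?"]):
            print(tok.idx, tok.form, tok.head, tok.deprel, sep="\t")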
Hashtags & friends
Newly emerging text types from social media have triggered new, creative means of communication which help users to overcome the limitations of expressing themselves in a written medium. Twitter hashtags are one case in point: they not only allow users to add a semantic tag to their tweet, but also to add comments, context information, irony and sarcasm, to express personal feelings, or to evaluate. Formally, they are not bound to one particular part of speech but can include whole phrases or sentences, which implies that the common practice of tagging them with the label HASHTAG does not do them justice. This is even more the case for hashtags encoding one or more arguments of the predicate, as in (10). Hashtags provide a rich source of information which has already been exploited in sentiment analysis and opinion mining (e.g. Mohammad et al. 2013, Kunneman et al. 2013; also see http://www.newyorker.com/online/blogs/susanorlean/2010/06/hash.html for an overview of the different functions of hashtags). We are interested in approaches towards a syntactic analysis of hashtags (and related phenomena such as complex inflective constructions in German CMC (Schlobinski 2001)) which allow us to make better use of the information encoded in hashtags. What are the new challenges for analysing these phenomena? What can be learned from research on similar phenomena, e.g. on multiword expressions (MWEs)?
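To indicate the kind of preprocessing such an analysis presupposes, the sketch below segments a multi-word hashtag into ordinary tokens via greedy longest-match against a word list, so that a parser could then treat the material as a phrase or clause. The word list and the greedy strategy are toy assumptions; a real system would use a large lexicon or a statistical segmentation model.

    # Toy word list; stands in for a real lexicon or language model.
    WORDS = {"i", "hate", "mondays", "love", "this", "so", "much"}

    def segment_hashtag(tag):
        """Greedy longest-match segmentation of a lower-cased hashtag body."""
        body = tag.lstrip("#").lower()
        tokens, start = [], 0
        while start < len(body):
            for end in range(len(body), start, -1):   # try longest match first
                if body[start:end] in WORDS:
                    tokens.append(body[start:end])
                    start = end
                    break
            else:                                     # no known word found
                tokens.append(body[start:])           # keep the rest as-is
                break
        return tokens

    if __name__ == "__main__":
        print(segment_hashtag("#ihatemondays"))       # ['i', 'hate', 'mondays']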
Disfluencies
Disfluencies (e.g. fillers, repairs) are a common phenomenon in spoken language (14) and also occur in written, but conceptually spoken language such as CMC (15).
There are different ways of representing disfluencies. In the Switchboard corpus, fillers are included in the tree, and for repairs, both the repair and the reparandum are attached to the same node. In the German Verbmobil treebank, fillers have been removed and so-called speech errors and repetitions are not integrated in the tree but instead are attached to the root node. The different representations are expected to have an impact on statistical parsing as well as on the usefulness of the resources for linguistic research.
We ask for contributions discussing the best way of representing disfluencies in the syntax tree.
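The following Python sketch contrasts the two strategies in a deliberately schematic way for the repair "to Boston, uh, to Denver": one analysis keeps filler and reparandum in the tree, with reparandum and repair attached to the same head; the other removes the filler and attaches the reparandum tokens to the root. Indices, labels and attachment decisions are illustrative assumptions, not the actual Switchboard or Verbmobil schemes.

    UTTERANCE = ["I", "flew", "to", "Boston", "uh", "to", "Denver"]
    REPARANDUM = {2, 3}   # "to Boston" (0-based token indices)
    FILLER = {4}          # "uh"

    def in_tree_representation():
        """Keep everything in the tree; reparandum and repair share a head."""
        # token index -> head index (None = root)
        return {0: 1, 1: None, 2: 1, 3: 2, 4: 1, 5: 1, 6: 5}

    def pruned_representation():
        """Drop fillers; attach reparandum tokens directly to the root."""
        fluent_heads = {0: 1, 1: None, 5: 1, 6: 5}
        heads = {}
        for i in range(len(UTTERANCE)):
            if i in FILLER:
                continue              # filler removed from the analysis
            heads[i] = None if i in REPARANDUM else fluent_heads[i]
        return heads

    if __name__ == "__main__":
        for name, heads in [("in-tree", in_tree_representation()),
                            ("pruned", pruned_representation())]:
            print(name)
            for i, head in sorted(heads.items()):
                governor = "ROOT" if head is None else UTTERANCE[head]
                print(" ", UTTERANCE[i], "<-", governor)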
Code mixing
In informal spoken language as well as in CMC, a considerable amount of the data includes code mixing. This poses a huge challenge for automatic processing, all the more so as there is no agreed-upon theoretical distinction between loanwords and foreign words. Should we annotate foreign language material using the same annotation scheme as for the target language, especially in cases where the grammatical differences between the languages involved do not easily allow us to do so, as in (18)?
We ask for contributions discussing best practices for the syntactic analysis of code mixing.
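As one hedged illustration of a common first step (not a proposal from this call), the sketch below assigns a coarse language tag to each token before any syntactic decision is taken; ambiguous forms that exist in both languages receive a combined tag. The toy word lists and the constructed German-English example stand in for real dictionaries or language identification models.

    GERMAN = {"das", "war", "so", "ich", "und"}
    ENGLISH = {"so", "awkward", "anyway", "nice"}

    def tag_languages(tokens):
        """Assign a coarse language tag to each token."""
        tagged = []
        for tok in tokens:
            low = tok.lower()
            if low in GERMAN and low in ENGLISH:
                lang = "de|en"       # ambiguous between the two languages
            elif low in GERMAN:
                lang = "de"
            elif low in ENGLISH:
                lang = "en"
            else:
                lang = "unk"         # unknown / other material
            tagged.append((tok, lang))
        return tagged

    if __name__ == "__main__":
        # constructed German-English example
        print(tag_languages("das war so awkward".split()))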
Resources & References
DCU Football Corpus: Jennifer Foster, Ozlem Cetinoglu, Joachim Wagner, Joseph Le Roux, Joakim Nivre, Deirdre Hogan and Josef van Genabith (2011): From News to Comment: Resources and Benchmarks for Parsing the Language of Web 2.0. In: Proceedings of IJCNLP, Chiang Mai, Thailand.
Falko (Error-annotated Learner Corpus): Reznicek, Marc; Lüdeling, Anke; Krummes, Cedric; Schwantuschke, Franziska; Walter, Maik; Schmidt, Karin; Hirschmann, Hagen; Andreas, Torsten (2012): Das Falko-Handbuch. Korpusaufbau und Annotationen, Version 2.01. https://www.linguistik.hu-berlin.de/institut/professuren/korpuslinguistik/forschung/falko
French Social Media Bank: Djamé Seddah, Benoit Sagot, Marie Candito, Virginie Mouilleron and Vanessa Combet (2012): The French Social Media Bank: a Treebank of Noisy User Generated Content. In: Proceedings of COLING 2012, Mumbai, India. http://aclweb.org/anthology//C/C12/C12-1149.pdf
KiDKo: Rehbein, Ines; Schalowski, Sören; Wiese, Heike (2014): The KiezDeutsch Korpus (KiDKo) Release 1.0. In: Proceedings of LREC 2014, Reykjavik, Iceland.
NoSta-D: Dipper, Stefanie; Lüdeling, Anke; Reznicek, Marc (to appear): NoSta-D: A Corpus of German Non-Standard Varieties. In: Zampieri, Marcos (ed.): Non-Standard Data Sources in Corpus-Based Research. Shaker Verlag. http://www.linguistik.hu-berlin.de/institut/professuren/korpuslinguistik/forschung/clarin-d
Syntactically Annotating Learner Language of English (SALLE): Ragheb, Marwa and Dickinson, Markus (2012): Defining Syntax for Learner Language Annotation. In: Proceedings of COLING 2012, Mumbai, India. http://cl.indiana.edu/~md7/papers/ragheb-dickinson12.html SALLE project: http://cl.indiana.edu/~salle/
Switchboard Corpus: Calhoun, S., Carletta, J., Brenier, J., Mayo, N., Jurafsky, D., Steedman, M. and Beaver, D. (2010): The NXT-format Switchboard Corpus: A Rich Resource for Investigating the Syntax, Semantics, Pragmatics and Prosody of Dialogue. Language Resources and Evaluation Journal 44(4): 387-419. http://groups.inf.ed.ac.uk/switchboard/
SANCL Special Track Organizers
- Ozlem Cetinoglu (IMS, Germany)
- Ines Rehbein (Potsdam University, Germany)
- Djamé Seddah (Université Paris Sorbonne & Inria's Alpage project)
- Joel Tetreault (Yahoo! Labs, US)