Linguistics of German Twitter

... a project by Tatjana Scheffler

Access Twitter data for linguistic research

Using Twitter Data for (Linguistic) Research
How-To: Corpus Construction
Links and Resources


Posters and Slides (by TS & collaborators)

Measuring Social Jetlag in Twitter Data. (with Christopher Kyba)
Proceedings of the Tenth International AAAI Conference on Web and Social Media (ICWSM 2016), AAAI, Köln, Germany. 2016.
Dialog Act Recognition for Twitter Conversations. (with Elina Zarisheva)
Proceedings of the Workshop on Normalisation and Analysis of Social Media Texts (NormSoMe), Portorož, Slovenia. 2016.
Dialog act annotation for Twitter conversations.
SigDial Conference, September 2, 2015, Prague, Czech Republic
Dialogakte in deutschen Twitterkonversationen. (German)
Langer Tag der Wissenschaft, May 9, 2015, Universität Potsdam
Conversations on German Twitter.
Social Media Workshop, October 24, 2014, FU Berlin
Introduction to Twitter data and its use for linguistic research. Contains an example tweet in JSON format. (German)
Guest lecture in the seminar “Soziale Bewegungen im Internet”, May 2014, FU Berlin
A German Twitter Snapshot. Corpus construction and analysis. (English)
9th Language Resources and Evaluation Conference (LREC), May 26-31, 2014, Reykjavik, Iceland
Analyse von Diskursen in Social Media. Presentation of the BMBF-subproject. (German)
Workshop “Grenzen überschreiten – Digitale Geisteswissenschaft heute und morgen”, February 28, 2014, Berlin
Basic statistics about German Twitter data. (German)
Tausend Fragen – Eine Stadt, June 8, 2013, Golm/Potsdam
Erstellung eines deutschen Twitterkorpus. (German)
DGfS-CL poster session, 35th Annual Meeting of the Deutsche Gesellschaft für Sprachwissenschaft, March 14, 2013, Potsdam

Using Twitter for (Linguistic) Research

General comments on using Twitter for linguistic research - coming soon!

Constructing a Twitter Corpus


The Twitter API doesn't allow the distribution of aggregated tweets (i.e., corpora), but researchers can collect their own data. The script described below allows the real-time recording of a representative portion of Twitter data in a specific language.

In particular, for languages other than English, it is possible to collect a near-complete snapshot of tweets over a given time period in real time (without hitting API rate limits).

Some programming experience is helpful, but running the script should be doable even without it, as long as you can install the necessary Python packages.
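As an illustration of what the recorded data looks like: corpora collected from the streaming API are typically stored with one tweet per line, each in Twitter's JSON format, and each tweet object carries a `lang` field with Twitter's automatic language tag. A minimal sketch in Python (the sample tweet and function name are my own, for illustration only):

```python
import json

# Hypothetical sample: one recorded tweet per line, in Twitter's JSON format.
sample_line = '{"id_str": "123", "text": "Guten Morgen!", "lang": "de"}'

def is_german(tweet_line):
    """Return True if Twitter's automatic language tag marks the tweet as German."""
    tweet = json.loads(tweet_line)
    return tweet.get("lang") == "de"
```

Note that Twitter's automatic language identification is not perfect, especially for short tweets, so an additional language filter on the collected data may be useful.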


In order to build your own custom Twitter corpus, in particular one of all tweets in a given language, follow the steps below:

  1. Install Python if it is not included in your operating system.
  2. Install the Python package Twython, which wraps access to the Twitter stream.
  3. Register as a Twitter user.
  4. Create a new Twitter application to receive a consumer key and secret.
  5. After the step above, you will be redirected to your app's page. Create an access token in the "Your access token" section.
  6. Download the Twitter-for-Linguists script (based on Twython) and insert the consumer key and secret, as well as the access token key and secret, into the appropriate lines.
  7. Create a keyword file with the words to be tracked on Twitter (up to 400 words, one per line) and save it as "twython-keywords.txt" in the same directory as the script. Alternatively, you can download my German stopword list to capture almost all German tweets (make sure to rename it, or adapt the corresponding line in the script).
  8. Run the script with "python" from a console.
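The keyword-file step can be sketched as follows: the streaming filter endpoint accepts a comma-separated list of up to 400 track terms, so the script reads the file, trims it to that limit, and joins the words. This is a sketch, not the actual script; the function names are my own:

```python
# Sketch of step 7: read the keyword file and build the `track` parameter
# expected by Twitter's streaming filter endpoint (a comma-separated list
# of at most 400 terms). Function names are illustrative.
def load_keywords(path, limit=400):
    with open(path, encoding="utf-8") as f:
        words = [line.strip() for line in f if line.strip()]
    return words[:limit]

def build_track(words):
    return ",".join(words)
```

The resulting string would then be passed to the streaming client's filter call (in Twython, a `TwythonStreamer` subclass and `stream.statuses.filter(track=...)`).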

Additional Notes

Log of Changes

Links and Resources

... please email me if you want tools included in this list.

Last modified: Mon Jul 24 10:47:20 CEST 2017