Journal article

Mining one percent of Twitter: collections, baselines, sampling

Information technology


The objective of this paper is to reflect on the affordances of different techniques for making Twitter collections and to suggest the use of a random sampling technique, made possible by Twitter’s Streaming API (Application Programming Interface), for baselining, scoping, and contextualising practices and issues. It discusses this technique by analysing a one per cent sample of all tweets posted during a 24-hour period and by introducing a number of analytical directions considered useful for qualifying some of the core elements of the platform, in particular hashtags. To situate the proposal, the paper first discusses how platforms propose particular affordances but leave considerable margins for the emergence of a wide variety of practices. This argument is then related to the question of how medium and sampling technique are intrinsically connected.


Social media platforms present numerous challenges to empirical research, making it different from researching cases in offline environments, but also different from studying the “open” Web. Because of the limited access possibilities and the sheer size of platforms like Facebook or Twitter, the question of delimitation, i.e. the selection of subsets to analyse, is particularly relevant. Whilst sampling techniques have been thoroughly discussed in the context of social science research, sampling procedures in the context of social media analysis are far from being fully understood. Even for Twitter, a platform that has received considerable attention from empirical researchers due to its relative openness to data collection, methodology is largely emergent. In particular, the question of how smaller collections relate to the entirety of activities on the platform remains unclear. Recent work comparing case-based studies to gain a broader picture and the development of graph-theoretical methods for sampling are certainly steps in the right direction, but it seems that truly large-scale Twitter studies are limited to computer science departments, where epistemic orientation can differ considerably from work done in the humanities and social sciences.


Publication Details