

Data Mining

Selection of concepts & issues

The method of data collection is basic text/data mining, or knowledge discovery from textual databases, based on a search for specific words contained in the sample of documents. The sample consists of 115 national newspapers from 41 countries, comprising approximately 69,000,000 articles from 369,000 issues of leading newspapers published between January 1990 and May 2010.

The selection of sustainability-related issues was based on a desk study of major international agreements and initiatives. The core documents were Agenda 21, the UN Millennium Development Goals, the UN Millennium Ecosystem Assessment and the UN Global Compact. The selection of key terms was restricted by practical constraints, such as terms that are ambiguous in a number of languages or that are difficult to translate into a number of languages. The desk study resulted in the selection of 20 sustainability-related issues (see table below). Whilst this set of 20 issues is not meant to be a conclusive list of sustainability-related issues, it nevertheless provides insight into the attention given to issues that are commonly associated with sustainability and sustainable development. In addition, media coverage of 10 concepts commonly associated with the area of sustainability (including the term sustainability itself) was analysed. At the level of sustainability-related concepts, the analysis was restricted to 62 English-language newspapers for reasons of comparability.
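The word-based search described above can be illustrated with a minimal sketch. It is not the project's actual text-mining tool: the function names, the whole-phrase matching rule and the sample articles are assumptions made purely for illustration.

```python
import re


def contains_term(text: str, term: str) -> bool:
    """Return True if `text` contains `term` as a whole phrase (case-insensitive).

    Assumption: words of the term may be separated by any whitespace, and the
    phrase must not be embedded in a longer word.
    """
    words = [re.escape(w) for w in term.split()]
    pattern = r"\b" + r"\s+".join(words) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def count_matching_articles(articles, term):
    """Count the articles that mention the search term at least once."""
    return sum(1 for article in articles if contains_term(article, term))


# Hypothetical sample articles, not taken from the newspaper sample.
articles = [
    "Ministers met to discuss climate change and energy policy.",
    "Soil erosion threatens farmland in the region.",
    "The climate of the negotiations changed overnight.",
]

print(count_matching_articles(articles, "climate change"))  # prints 1
```

Each article is counted once, however often the term appears in it, which matches a count of articles containing the search term rather than a count of occurrences.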

Issues & concepts selected for analysis

Selection of newspapers

The selection criteria for newspapers were circulation, area of circulation and, where possible, private ownership, with priority given to national broadsheets. In terms of area of circulation, newspapers that were not predominantly local or regional in scope were selected in an attempt to reflect the national public agenda as far as possible. Newspapers that were not accessible through the databases over the full period of analysis were included in the sample from the first full month in which they became available. The selection process aimed to create a sample that was as geographically diverse as possible, particularly with regard to countries in the global South. It was not possible to include any African newspapers other than those from South Africa.
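The rule of including a newspaper from its first full month of availability can be sketched as below. This is an illustrative reading of the rule, not the project's code; the function names are hypothetical, and the January 1990 lower bound is taken from the analysis period stated earlier.

```python
from datetime import date

STUDY_START = date(1990, 1, 1)  # start of the analysis period given in the text


def first_full_month(first_available: date) -> date:
    """Return the first day of the first month that is fully covered."""
    if first_available.day == 1:
        return first_available  # the month of first availability is complete
    if first_available.month == 12:
        return date(first_available.year + 1, 1, 1)
    return date(first_available.year, first_available.month + 1, 1)


def inclusion_start(first_available: date) -> date:
    """Month from which a newspaper enters the sample (assumed clamp to 1990)."""
    return max(first_full_month(first_available), STUDY_START)


print(inclusion_start(date(1995, 3, 15)))  # prints 1995-04-01
```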

A total of eight languages were included in the sample: English, French, German, Spanish, Portuguese, Dutch, Danish and Italian. However, in cases such as Russia, Japan, China (Hong Kong) or South Africa, the analysis was restricted to national newspapers published in English. Whilst the scope and content of these newspapers are likely to be affected by this restriction, they are still likely to reflect national deviations from the ‘global mainstream’. It could also be argued that in countries such as Russia or China, English-language publications may remain more independent and enjoy greater press freedom, thereby reflecting public opinion in these countries to a higher degree. In order to minimise bias towards countries in the global North, the number of newspapers from countries such as the United States, the UK or Germany was limited to a maximum of eight.

A selection of the newspapers included in the sample is available on a linked page.

The base unit of analysis is the monthly average number of articles containing the search term per newspaper issue. This allows different newspapers to be compared irrespective of whether they are published daily or weekly. For example, a frequency of 0.1 for climate change means that, in that month, a newspaper published on average one article mentioning the term for every ten issues. Provided that such articles rarely appear more than once in the same issue, this corresponds to a probability of roughly 10% that any issue bought in that month would contain at least one article mentioning climate change.
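The base unit can be made concrete with a short sketch: matching articles are summed per newspaper and month and divided by the number of issues in that month. The function name, the input layout and the two example newspapers are assumptions for illustration only.

```python
from collections import defaultdict


def monthly_frequency(issues):
    """Average number of matching articles per issue, by newspaper and month.

    `issues` is an iterable of (newspaper, 'YYYY-MM-DD' issue date,
    number of matching articles in that issue).
    """
    totals = defaultdict(lambda: [0, 0])  # [matching articles, issue count]
    for paper, issue_date, n_matching in issues:
        key = (paper, issue_date[:7])  # group by newspaper and 'YYYY-MM'
        totals[key][0] += n_matching
        totals[key][1] += 1
    return {key: arts / n_issues for key, (arts, n_issues) in totals.items()}


# Hypothetical daily paper: 20 issues, 2 of which carry one matching article.
daily = [
    ("Daily Example", f"2005-03-{day:02d}", 1 if day in (3, 17) else 0)
    for day in range(1, 21)
]
# Hypothetical weekly paper: 4 issues, 1 of which carries one matching article.
weekly = [
    ("Weekly Example", "2005-03-06", 1),
    ("Weekly Example", "2005-03-13", 0),
    ("Weekly Example", "2005-03-20", 0),
    ("Weekly Example", "2005-03-27", 0),
]

freq = monthly_frequency(daily + weekly)
print(freq[("Daily Example", "2005-03")])
print(freq[("Weekly Example", "2005-03")])
```

In this example the daily paper has a frequency of 2/20 = 0.1 and the weekly paper 1/4 = 0.25, so the two can be compared directly even though they publish at different intervals.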

There were minor issues regarding the accuracy of search terms, particularly with general terms such as poverty, which carry significantly broader meanings than terms such as soil erosion and may be used in a range of contexts that are only remotely related to sustainability. This in turn affects the overall level of word frequencies. Errors were significantly reduced through random sampling and subsequent iterative adjustment of the text-mining tool. In addition, the longitudinal analysis allows for the identification of long-term trends and deviations, which are not affected by these characteristics of the search. Irregularities such as the misspelling of terms or of a publication's title, or changes in date format, led to slight errors in data transformation. An error of 1% was tolerated; larger errors were corrected through iterative adjustment of the text-mining tool.
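One way to picture the random-sampling check is sketched below: a random sample of transformed records is inspected, the share of faulty records is computed, and the tool is flagged for adjustment if that share exceeds the 1% tolerance. The source does not describe how the check was implemented, so the function names, the record format and the date-format test are all hypothetical.

```python
import random
import re

TOLERANCE = 0.01  # the 1% error level stated in the text


def sample_error_rate(records, is_correct, sample_size, seed=0):
    """Share of faulty records in a random sample drawn without replacement."""
    if not records:
        raise ValueError("no records to sample from")
    rng = random.Random(seed)  # fixed seed so the check is repeatable
    sample = rng.sample(records, min(sample_size, len(records)))
    errors = sum(1 for record in sample if not is_correct(record))
    return errors / len(sample)


def needs_adjustment(error_rate, tolerance=TOLERANCE):
    """True if the observed error rate exceeds the tolerated level."""
    return error_rate > tolerance


def has_iso_date(record):
    """Hypothetical validity check: the record's date must be YYYY-MM-DD."""
    return re.fullmatch(r"\d{4}-\d{2}-\d{2}", record["date"]) is not None


# Hypothetical transformed records: 194 well-formed dates, 6 malformed ones.
records = [{"date": "2005-03-01"}] * 194 + [{"date": "01/03/2005"}] * 6

rate = sample_error_rate(records, has_iso_date, sample_size=200)
print(rate, needs_adjustment(rate))
```

Because the sample here covers all 200 records, the observed rate is 6/200 = 3%, which is above the tolerance and would trigger a further round of adjustment in this sketch.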

Finally, it is important to note that this analysis focuses on quantifying general levels of media attention towards specific issue areas, rather than on the tone taken within each article or the identification of positions on the issue in question.