‘...automated journalism is expected to substantially increase the amount of available news. [...] To cope with the resulting information overload, the importance of search engines and personalized news aggregators, such as Google News, are likely to increase further. Search engine providers claim to analyze individual user data (e.g., location and historical search behavior). [As a result] different news consumers might receive different results for the same keyword searches. [...] personalization will lead individuals to consume more and more of the same information, as algorithms provide only content that users like to read or agree with. Consequently, people would be less likely to encounter information that challenges their views or contradicts their interests, which could carry risks for the formation of public opinion in a democratic society.’

'Guide to Automated Journalism', Andreas Graefe, 2016

The text-generating tool GPT-2 is used in various automated journalism services. The user of such a service experiences only the end result: a piece of text. As with other technologies whose inner workings are unknown or unavailable to the user, that text is frequently perceived as impartial and objective (since it is made by a 'machine'). This project takes a closer look at the very much human-created contents of the dataset used to train GPT-2 and attempts to expose their subjective character.
Categories, each counted by times scraped:
Memes
Fake News
Conservative Media
Liberal Media
Entertainment Media
Other Media
Science
Shopping
Gaming
Finance
Search Engines
Fiction

*This project provides a glimpse into the content of the training dataset for GPT-2. All presented calculations were made manually in order to avoid any automatic analysis. The project therefore does not represent a full overview of the dataset's contents; it is intended to give an idea of what types of information can be included when developing text-generating tools.