Skip to content

Data Pipeline#

In this section, we describe the data pipeline used to generate the dataset.

Data Source#

We collected comments from different sources, such as Twitter, YouTube, and related datasets.

For each social media (Twitter and YouTube), we defined a set of public profiles that we considered relevant to the topic.

Additionally, we used Brazilian texts from other datasets, such as:

Architecture#

The following diagram shows the architecture of the data pipeline.

Architecture - Image by author.

Filtering#

We want to filter out comments that are not relevant to the scope of the dataset.

  • Comments must be in Portuguese.
  • Comments that have one or more related keywords.

Privacy#

We will apply some privacy policies to the comments collected from each source directly in the ingestion pipeline.

  • User mentions were replaced with the word "@USER".
  • URLs were replaced with the word "URL".

Last update: March 1, 2023