Data Pipeline

In this section, we describe the data pipeline used to generate the dataset.

Data Source

We collected comments from different sources, such as Twitter, YouTube, and related datasets.

For each social media platform (Twitter and YouTube), we defined a set of public profiles that we considered relevant to the topic.

Additionally, we used Brazilian texts from other datasets, such as:

The following diagram shows the architecture of the data pipeline.

We want to filter out comments that are not relevant to the scope of the dataset.
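As a rough illustration of such a filtering step, the sketch below keeps only comments that pass a simple scope check. The `Comment` type, the `is_in_scope` rule, and the `min_words` threshold are all hypothetical; the actual relevance criteria of the pipeline are not specified here.

```python
from dataclasses import dataclass


@dataclass
class Comment:
    """Hypothetical record for a collected comment."""
    source: str  # e.g. "twitter" or "youtube"
    text: str


def is_in_scope(comment: Comment, min_words: int = 3) -> bool:
    """Assumed relevance rule: keep comments with at least `min_words` words."""
    return len(comment.text.split()) >= min_words


def filter_comments(comments: list[Comment], min_words: int = 3) -> list[Comment]:
    """Return only the comments that pass the scope check, preserving order."""
    return [c for c in comments if is_in_scope(c, min_words)]


comments = [
    Comment("twitter", "ok"),
    Comment("youtube", "esse video ficou muito bom"),
    Comment("twitter", "   "),
]
kept = filter_comments(comments)
```

In this sketch only the second comment is kept. A real scope check would likely replace the word-count rule with language, topic, or duplicate checks.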

We will apply some privacy policies to the comments collected from each source directly in the ingestion pipeline.
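One possible shape for such a privacy step is sketched below: user mentions and links are replaced with placeholders before a comment is stored. The placeholder strings, the regular expressions, and the `anonymize` function are assumptions for illustration, not the policies actually applied by the pipeline.

```python
import re

# Assumed patterns: "@handle" style mentions and http(s) links.
MENTION_RE = re.compile(r"@\w+")
URL_RE = re.compile(r"https?://\S+")


def anonymize(text: str) -> str:
    """Replace links and user mentions with fixed placeholders.

    Links are replaced first so that handles embedded in a URL are
    removed together with the URL. Email addresses are not handled.
    """
    text = URL_RE.sub("[URL]", text)
    text = MENTION_RE.sub("@USER", text)
    return text


cleaned = anonymize("@joao veja https://t.co/abc agora")
```

Applying the replacement at ingestion time means raw handles and links never reach the stored dataset, at the cost of not being able to recover them later.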

Last update: March 1, 2023