Dataset statistics
Number of variables | 17 |
Number of observations | 1995 |
Missing cells | 1121 |
Missing cells (%) | 3.3% |
Duplicate rows | 0 |
Duplicate rows (%) | 0.0% |
Total size in memory | 1.5 MiB |
Average record size in memory | 786.0 B |
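The missing-cell percentage above can be checked from the other figures in this report. The following is a minimal sketch, assuming the percentage is missing cells divided by variables times observations; the variable names are illustrative and are not part of the profiling tool.

```python
# Figures copied from this report.
n_variables = 17
n_observations = 1995
missing_cells = 1121

# Per-column missing counts listed in the report's "Missing" alerts.
missing_per_column = {"targeted_type": 812, "toxic_spans": 309}

# Assumption: the percentage is taken over all cells (variables x observations).
total_cells = n_variables * n_observations
missing_pct = 100 * missing_cells / total_cells

print(f"total cells: {total_cells}")
print(f"missing cells: {missing_pct:.1f}%")
print(f"sum of per-column counts: {sum(missing_per_column.values())}")
```

The two columns flagged as having missing values account for all 1121 missing cells.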
Variable types
Categorical | 5 |
Unsupported | 1 |
Boolean | 11 |
religious_intolerance has constant value "False" | Constant |
id has a high cardinality: 1995 distinct values | High cardinality |
text has a high cardinality: 1995 distinct values | High cardinality |
lgbtqphobia is highly correlated with religious_intolerance | High correlation |
religious_intolerance is highly correlated with lgbtqphobia and 12 other fields | High correlation |
other_lifestyle is highly correlated with religious_intolerance | High correlation |
physical_aspects is highly correlated with religious_intolerance | High correlation |
racism is highly correlated with religious_intolerance | High correlation |
targeted_type is highly correlated with religious_intolerance and 2 other fields | High correlation |
health is highly correlated with religious_intolerance | High correlation |
sexism is highly correlated with religious_intolerance | High correlation |
xenophobia is highly correlated with religious_intolerance | High correlation |
is_offensive is highly correlated with religious_intolerance and 3 other fields | High correlation |
profanity_obscene is highly correlated with religious_intolerance | High correlation |
insult is highly correlated with religious_intolerance and 2 other fields | High correlation |
is_targeted is highly correlated with religious_intolerance and 3 other fields | High correlation |
ideology is highly correlated with religious_intolerance | High correlation |
is_offensive is highly correlated with is_targeted and 1 other fields | High correlation |
is_targeted is highly correlated with is_offensive and 1 other fields | High correlation |
health is highly correlated with physical_aspects | High correlation |
insult is highly correlated with is_offensive and 1 other fields | High correlation |
physical_aspects is highly correlated with health | High correlation |
targeted_type has 812 (40.7%) missing values | Missing |
toxic_spans has 309 (15.5%) missing values | Missing |
id is uniformly distributed | Uniform |
text is uniformly distributed | Uniform |
id has unique values | Unique |
text has unique values | Unique |
toxic_spans is an unsupported type, check if it needs cleaning or further analysis | Unsupported |
Analysis started | 2022-10-06 01:26:24.675422 |
Analysis finished | 2022-10-06 01:26:42.600230 |
Duration | 17.92 seconds |
Software version | pandas-profiling v3.2.0 |
Distinct | 1995 |
Distinct (%) | 100.0% |
Missing | 0 |
Missing (%) | 0.0% |
Memory size | 173.5 KiB |
ccc932bedff24efd91ade52c236d554b | 1 |
7b775f36cfc9439292544d721b808b21 | 1 |
17efc32abf784dc79de14f0c5bf0284f | 1 |
cf1d3e6a413046a9b614af81aea77605 | 1 |
5f74b203e7d949a382f354e6db5b076b | 1 |
Other values (1990) |
Max length | 32 |
Median length | 32 |
Mean length | 32 |
Min length | 32 |
Characters and Unicode
Total characters | 63840 |
Distinct characters | 16 |
Distinct categories | 2 |
Distinct scripts | 2 |
Distinct blocks | 1 |
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.
Unique | 1995 |
Unique (%) | 100.0% |
1st row | ccc932bedff24efd91ade52c236d554b |
2nd row | 1e1fa5c2011f4b79ab8fa17d311abc8d |
3rd row | 847d23e970d84e3ca687911ec2791af7 |
4th row | 8f4e011d21ea4ed6ab1702a2acf43c0b |
5th row | 08f67adbd9c54cd1a308b77ec733af77 |
Common Values
Value | Count | Frequency (%) |
ccc932bedff24efd91ade52c236d554b | 1 | 0.1% |
7b775f36cfc9439292544d721b808b21 | 1 | 0.1% |
17efc32abf784dc79de14f0c5bf0284f | 1 | 0.1% |
cf1d3e6a413046a9b614af81aea77605 | 1 | 0.1% |
5f74b203e7d949a382f354e6db5b076b | 1 | 0.1% |
408f246bc57e43f89f45da052d9ec1b5 | 1 | 0.1% |
df41d8deeccb4457ac650c1eca60bb53 | 1 | 0.1% |
c6b3708ae6084b12b90c1ba11230a7c4 | 1 | 0.1% |
b161e4c4660f41118b4f116cdeb29545 | 1 | 0.1% |
6aa5a5b42fe5435790e131a266320122 | 1 | 0.1% |
Other values (1985) | 1985 |
Histogram of lengths of the category
Value | Count | Frequency (%) |
ccc932bedff24efd91ade52c236d554b | 1 | 0.1% |
a6f9b242df2647d6a29be7820c7caad8 | 1 | 0.1% |
8f4e011d21ea4ed6ab1702a2acf43c0b | 1 | 0.1% |
08f67adbd9c54cd1a308b77ec733af77 | 1 | 0.1% |
e9e09ac235d2427e84c28f2106d0d2a6 | 1 | 0.1% |
327d36f100b44c8f84f2844a11336d84 | 1 | 0.1% |
e67d642951dd42ed9da034a8765ba6c2 | 1 | 0.1% |
f3726ce30fbb4260b20a02d85f0a42c3 | 1 | 0.1% |
38b85624d15241dfa8f9eaab8ab5a556 | 1 | 0.1% |
86e3552f8ca8453f9bd8a97df9aa420d | 1 | 0.1% |
Other values (1985) | 1985 |
Most occurring characters
Value | Count | Frequency (%) |
4 | 5756 | 9.0% |
8 | 4359 | 6.8% |
b | 4231 | 6.6% |
9 | 4225 | 6.6% |
a | 4129 | 6.5% |
0 | 3814 | 6.0% |
6 | 3797 | 5.9% |
d | 3790 | 5.9% |
c | 3757 | 5.9% |
2 | 3756 | 5.9% |
Other values (6) | 22226 |
Most occurring categories
Value | Count | Frequency (%) |
Decimal Number | 40503 | |
Lowercase Letter | 23337 |
Most frequent character per category
Decimal Number
Value | Count | Frequency (%) |
4 | 5756 | |
8 | 4359 | |
9 | 4225 | |
0 | 3814 | |
6 | 3797 | |
2 | 3756 | |
7 | 3724 | |
5 | 3716 | |
1 | 3680 | |
3 | 3676 |
Lowercase Letter
Value | Count | Frequency (%) |
b | 4231 | |
a | 4129 | |
d | 3790 | |
c | 3757 | |
e | 3719 | |
f | 3711 |
Most occurring scripts
Value | Count | Frequency (%) |
Common | 40503 | |
Latin | 23337 |
Most frequent character per script
Value | Count | Frequency (%) |
4 | 5756 | |
8 | 4359 | |
9 | 4225 | |
0 | 3814 | |
6 | 3797 | |
2 | 3756 | |
7 | 3724 | |
5 | 3716 | |
1 | 3680 | |
3 | 3676 |
Value | Count | Frequency (%) |
b | 4231 | |
a | 4129 | |
d | 3790 | |
c | 3757 | |
e | 3719 | |
f | 3711 |
Most occurring blocks
Value | Count | Frequency (%) |
ASCII | 63840 |
Most frequent character per block
Value | Count | Frequency (%) |
4 | 5756 | 9.0% |
8 | 4359 | 6.8% |
b | 4231 | 6.6% |
9 | 4225 | 6.6% |
a | 4129 | 6.5% |
0 | 3814 | 6.0% |
6 | 3797 | 5.9% |
d | 3790 | 5.9% |
c | 3757 | 5.9% |
2 | 3756 | 5.9% |
Other values (6) | 22226 |
Distinct | 1995 |
Distinct (%) | 100.0% |
Missing | 0 |
Missing (%) | 0.0% |
Memory size | 706.0 KiB |
USER USER USER pelo contrario, eu sou USER bolsonaro e USER USER, pelo seu xingamento já da pra perceber que não tem argumento, USER só é mais um ser humano que gosta de baba-ovo de outro ser-humano que é bolsonaro. | 1 |
puta merda cara! cara deve ter sido horrivel pra USER assistir o filme todo pra poder explica-lo detalhadamente, pra quem odeia o filme, saber todo o conteúdo do inicio ao fim é meio estranho! até pq USER comprou até o livro... que irônico!, ou USER ta querendo ser sarcástico??? | 1 |
Cara que gente esquisita, tinha q ser virgem q webnamoram e fzm gf, e pqp minha filha o cara dava todos os sinais q era um babaca. Se ele insisitisse em continuar era só fzr um ghosting e fds URL | 1 |
USER n sabia q vc era uma bichona lan | 1 |
Somos, ainda, um país muito atrasado e aí está a prova: todas essas manifestações homofóbicas. O discurso é sempre o mesmo: "Não sou preconceituoso, mas...". Enquanto o mundo inteiro se move no sentido de entender que ninguém tem o direito de julgar a quem o outro pode amar, no Brasil, continuamos um povo burro, preguiçoso e cheio de uma moralidade torpe. | 1 |
Other values (1990) |
Max length | 1084 |
Median length | 397 |
Mean length | 129.3077694 |
Min length | 8 |
Characters and Unicode
Total characters | 257969 |
Distinct characters | 267 |
Distinct categories | 18 |
Distinct scripts | 4 |
Distinct blocks | 11 |
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.
Unique | 1995 |
Unique (%) | 100.0% |
1st row | USER USER USER pelo contrario, eu sou USER bolsonaro e USER USER, pelo seu xingamento já da pra perceber que não tem argumento, USER só é mais um ser humano que gosta de baba-ovo de outro ser-humano que é bolsonaro. |
2nd row | USER e a favor do Nazismo cabe safado |
3rd row | USER USER USER USER USER A transposição foi inaugurada pelo Lula, Dilma, Temer e Bolsonaro. Mas pra vcs a obra é só do Genocida. Quero saber, no q o país melhorou pro gado tentar reeleger esse lixo? |
4th row | USER tens de continuar, eu pratico ginásio à uns 3 anos e contino-o o mesmo pau de virar tripas xddd |
Common Values
Value | Count | Frequency (%) |
USER USER USER pelo contrario, eu sou USER bolsonaro e USER USER, pelo seu xingamento já da pra perceber que não tem argumento, USER só é mais um ser humano que gosta de baba-ovo de outro ser-humano que é bolsonaro. | 1 | 0.1% |
puta merda cara! cara deve ter sido horrivel pra USER assistir o filme todo pra poder explica-lo detalhadamente, pra quem odeia o filme, saber todo o conteúdo do inicio ao fim é meio estranho! até pq USER comprou até o livro... que irônico!, ou USER ta querendo ser sarcástico??? | 1 | 0.1% |
Cara que gente esquisita, tinha q ser virgem q webnamoram e fzm gf, e pqp minha filha o cara dava todos os sinais q era um babaca. Se ele insisitisse em continuar era só fzr um ghosting e fds URL | 1 | 0.1% |
USER n sabia q vc era uma bichona lan | 1 | 0.1% |
Somos, ainda, um país muito atrasado e aí está a prova: todas essas manifestações homofóbicas. O discurso é sempre o mesmo: "Não sou preconceituoso, mas...". Enquanto o mundo inteiro se move no sentido de entender que ninguém tem o direito de julgar a quem o outro pode amar, no Brasil, continuamos um povo burro, preguiçoso e cheio de uma moralidade torpe. | 1 | 0.1% |
eu te odeio paralisia do sono eu te odeio | 1 | 0.1% |
USER mimado USER USER | 1 | 0.1% |
Adorei o video.Tu e USER veio... | 1 | 0.1% |
Engraçado é que o convidado tbm falou bobagem! Mas USER teve a hombridade de assumir o seu erro! Tá pagando de vítima agora! | 1 | 0.1% |
eu acho que nos temos que colocar nossas macara i ir pra rua mostra pra esse tirano que ele não ta com essa bola toda não porque se ficarmos só sentados reclamando não vai adianta não na cabeça doente dele ele acha que o USER ta ali e nos temos que mostra pra ele que ele não nos representa | 1 | 0.1% |
Other values (1985) | 1985 |
Histogram of lengths of the category
Value | Count | Frequency (%) |
user | 2659 | 5.7% |
que | 1415 | 3.0% |
de | 1260 | 2.7% |
e | 1212 | 2.6% |
o | 1201 | 2.6% |
a | 951 | 2.0% |
é | 935 | 2.0% |
não | 719 | 1.5% |
um | 489 | 1.0% |
se | 465 | 1.0% |
Other values (7993) | 35499 |
Most occurring characters
Value | Count | Frequency (%) |
(space) | 44810 | |
a | 23063 | 8.9% |
e | 21692 | 8.4% |
o | 19210 | 7.4% |
s | 13515 | 5.2% |
r | 12087 | 4.7% |
i | 10638 | 4.1% |
m | 8889 | 3.4% |
d | 8843 | 3.4% |
n | 8670 | 3.4% |
Other values (257) | 86552 |
Most occurring categories
Value | Count | Frequency (%) |
Lowercase Letter | 182050 | |
Space Separator | 44810 | 17.4% |
Uppercase Letter | 22340 | 8.7% |
Other Punctuation | 7121 | 2.8% |
Other Symbol | 634 | 0.2% |
Decimal Number | 543 | 0.2% |
Dash Punctuation | 128 | < 0.1% |
Math Symbol | 81 | < 0.1% |
Close Punctuation | 69 | < 0.1% |
Open Punctuation | 64 | < 0.1% |
Other values (8) | 129 | 0.1% |
Most frequent character per category
Other Symbol
Value | Count | Frequency (%) |
😂 | 91 | 14.4% |
🤣 | 64 | 10.1% |
👏 | 24 | 3.8% |
🤮 | 23 | 3.6% |
😡 | 19 | 3.0% |
🤦 | 17 | 2.7% |
🤡 | 16 | 2.5% |
😍 | 14 | 2.2% |
😭 | 14 | 2.2% |
💩 | 14 | 2.2% |
Other values (114) | 338 |
Lowercase Letter
Value | Count | Frequency (%) |
a | 23063 | |
e | 21692 | |
o | 19210 | |
s | 13515 | 7.4% |
r | 12087 | 6.6% |
i | 10638 | 5.8% |
m | 8889 | 4.9% |
d | 8843 | 4.9% |
n | 8670 | 4.8% |
u | 8100 | 4.4% |
Other values (31) | 47343 |
Uppercase Letter
Value | Count | Frequency (%) |
E | 3980 | |
S | 3712 | |
R | 3588 | |
U | 3501 | |
A | 1223 | 5.5% |
O | 874 | 3.9% |
T | 494 | 2.2% |
M | 479 | 2.1% |
N | 437 | 2.0% |
D | 412 | 1.8% |
Other values (30) | 3640 |
Other Punctuation
Value | Count | Frequency (%) |
. | 2787 | |
, | 2179 | |
! | 1085 | 15.2% |
? | 463 | 6.5% |
" | 234 | 3.3% |
: | 126 | 1.8% |
' | 73 | 1.0% |
* | 60 | 0.8% |
… | 41 | 0.6% |
/ | 26 | 0.4% |
Other values (5) | 47 | 0.7% |
Decimal Number
Value | Count | Frequency (%) |
0 | 114 | |
1 | 102 | |
2 | 91 | |
3 | 67 | |
6 | 34 | 6.3% |
8 | 32 | 5.9% |
4 | 31 | 5.7% |
5 | 29 | 5.3% |
9 | 24 | 4.4% |
7 | 19 | 3.5% |
Other Letter
Value | Count | Frequency (%) |
贸 | 2 | |
茅 | 2 | |
º | 2 | |
馃 | 2 | |
檮 | 2 | |
么 | 1 | |
茫 | 1 | |
铆 | 1 | |
谩 | 1 |
Math Symbol
Value | Count | Frequency (%) |
> | 50 | |
= | 16 | 19.8% |
+ | 7 | 8.6% |
< | 4 | 4.9% |
| | 2 | 2.5% |
¬ | 2 | 2.5% |
Modifier Symbol
Value | Count | Frequency (%) |
🏽 | 15 | |
🏻 | 4 | 13.8% |
🏿 | 3 | 10.3% |
🏾 | 3 | 10.3% |
` | 2 | 6.9% |
🏼 | 2 | 6.9% |
Close Punctuation
Value | Count | Frequency (%) |
) | 66 | |
] | 2 | 2.9% |
》 | 1 | 1.4% |
Dash Punctuation
Value | Count | Frequency (%) |
- | 124 | |
— | 4 | 3.1% |
Open Punctuation
Value | Count | Frequency (%) |
( | 62 | |
[ | 2 | 3.1% |
Initial Punctuation
Value | Count | Frequency (%) |
“ | 14 | |
‘ | 2 | 12.5% |
Final Punctuation
Value | Count | Frequency (%) |
” | 9 | |
’ | 2 | 18.2% |
Space Separator
Value | Count | Frequency (%) |
(space) | 44810 |
Nonspacing Mark
Value | Count | Frequency (%) |
️ | 24 |
Value | Count | Frequency (%) |
| 20 |
Connector Punctuation
Value | Count | Frequency (%) |
_ | 14 |
Currency Symbol
Value | Count | Frequency (%) |
$ | 1 |
Most occurring scripts
Value | Count | Frequency (%) |
Latin | 204392 | |
Common | 53521 | 20.7% |
Inherited | 44 | < 0.1% |
Han | 12 | < 0.1% |
Most frequent character per script
Value | Count | Frequency (%) |
(space) | 44810 | |
. | 2787 | 5.2% |
, | 2179 | 4.1% |
! | 1085 | 2.0% |
? | 463 | 0.9% |
" | 234 | 0.4% |
: | 126 | 0.2% |
- | 124 | 0.2% |
0 | 114 | 0.2% |
1 | 102 | 0.2% |
Other values (165) | 1497 | 2.8% |
Value | Count | Frequency (%) |
a | 23063 | 11.3% |
e | 21692 | 10.6% |
o | 19210 | 9.4% |
s | 13515 | 6.6% |
r | 12087 | 5.9% |
i | 10638 | 5.2% |
m | 8889 | 4.3% |
d | 8843 | 4.3% |
n | 8670 | 4.2% |
u | 8100 | 4.0% |
Other values (72) | 69685 |
Value | Count | Frequency (%) |
贸 | 2 | |
茅 | 2 | |
馃 | 2 | |
檮 | 2 | |
么 | 1 | |
茫 | 1 | |
铆 | 1 | |
谩 | 1 |
Value | Count | Frequency (%) |
️ | 24 | |
| 20 |
Most occurring blocks
Value | Count | Frequency (%) |
ASCII | 251252 | |
None | 6267 | 2.4% |
Emoticons | 249 | 0.1% |
Punctuation | 93 | < 0.1% |
VS | 24 | < 0.1% |
Misc Symbols | 21 | < 0.1% |
Enclosed Alphanum Sup | 20 | < 0.1% |
Geometric Shapes Ext | 16 | < 0.1% |
Dingbats | 14 | < 0.1% |
CJK | 12 | < 0.1% |
Most frequent character per block
Value | Count | Frequency (%) |
(space) | 44810 | |
a | 23063 | 9.2% |
e | 21692 | 8.6% |
o | 19210 | 7.6% |
s | 13515 | 5.4% |
r | 12087 | 4.8% |
i | 10638 | 4.2% |
m | 8889 | 3.5% |
d | 8843 | 3.5% |
n | 8670 | 3.5% |
Other values (79) | 79835 |
Value | Count | Frequency (%) |
ã | 1592 | |
é | 1290 | |
á | 693 | |
ç | 636 | 10.1% |
í | 460 | 7.3% |
ó | 444 | 7.1% |
ê | 315 | 5.0% |
ú | 121 | 1.9% |
É | 85 | 1.4% |
🤣 | 64 | 1.0% |
Other values (100) | 567 | 9.0% |
Value | Count | Frequency (%) |
😂 | 91 | |
😡 | 19 | 7.6% |
😍 | 14 | 5.6% |
😭 | 14 | 5.6% |
😘 | 12 | 4.8% |
🙄 | 9 | 3.6% |
😠 | 9 | 3.6% |
😉 | 7 | 2.8% |
🙏 | 7 | 2.8% |
😅 | 6 | 2.4% |
Other values (28) | 61 |
Value | Count | Frequency (%) |
… | 41 | |
| 20 | |
“ | 14 | 15.1% |
” | 9 | 9.7% |
— | 4 | 4.3% |
’ | 2 | 2.2% |
‘ | 2 | 2.2% |
• | 1 | 1.1% |
Value | Count | Frequency (%) |
️ | 24 |
Value | Count | Frequency (%) |
❤ | 13 | |
✌ | 1 | 7.1% |
Geometric Shapes Ext
Value | Count | Frequency (%) |
🟨 | 10 | |
🟩 | 6 |
Enclosed Alphanum Sup
Value | Count | Frequency (%) |
🇧 | 9 | |
🇷 | 9 | |
🇱 | 1 | 5.0% |
🇮 | 1 | 5.0% |
Misc Symbols
Value | Count | Frequency (%) |
♂ | 9 | |
♀ | 8 | |
♥ | 2 | 9.5% |
☹ | 2 | 9.5% |
Value | Count | Frequency (%) |
贸 | 2 | |
茅 | 2 | |
馃 | 2 | |
檮 | 2 | |
么 | 1 | |
茫 | 1 | |
铆 | 1 | |
谩 | 1 |
Letterlike Symbols
Value | Count | Frequency (%) |
™ | 1 |
Distinct | 2 |
Distinct (%) | 0.1% |
Missing | 0 |
Missing (%) | 0.0% |
Memory size | 117.0 KiB |
OFF | |
Max length | 3 |
Median length | 3 |
Mean length | 3 |
Min length | 3 |
Characters and Unicode
Total characters | 5985 |
Distinct characters | 4 |
Distinct categories | 1 |
Distinct scripts | 1 |
Distinct blocks | 1 |
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.
Unique | 0 |
Unique (%) | 0.0% |
1st row | OFF |
2nd row | OFF |
3rd row | OFF |
4th row | OFF |
5th row | OFF |
Common Values
Value | Count | Frequency (%) |
OFF | 1686 | |
NOT | 309 | 15.5% |
Histogram of lengths of the category
Category Frequency Plot
Value | Count | Frequency (%) |
off | 1686 | |
not | 309 | 15.5% |
Most occurring characters
Value | Count | Frequency (%) |
F | 3372 | |
O | 1995 | |
N | 309 | 5.2% |
T | 309 | 5.2% |
Most occurring categories
Value | Count | Frequency (%) |
Uppercase Letter | 5985 |
Most frequent character per category
Uppercase Letter
Value | Count | Frequency (%) |
F | 3372 | |
O | 1995 | |
N | 309 | 5.2% |
T | 309 | 5.2% |
Most occurring scripts
Value | Count | Frequency (%) |
Latin | 5985 |
Most frequent character per script
Value | Count | Frequency (%) |
F | 3372 | |
O | 1995 | |
N | 309 | 5.2% |
T | 309 | 5.2% |
Most occurring blocks
Value | Count | Frequency (%) |
ASCII | 5985 |
Most frequent character per block
Value | Count | Frequency (%) |
F | 3372 | |
O | 1995 | |
N | 309 | 5.2% |
T | 309 | 5.2% |
Distinct | 2 |
Distinct (%) | 0.1% |
Missing | 0 |
Missing (%) | 0.0% |
Memory size | 117.0 KiB |
TIN | |
Max length | 3 |
Median length | 3 |
Mean length | 3 |
Min length | 3 |
Characters and Unicode
Total characters | 5985 |
Distinct characters | 4 |
Distinct categories | 1 |
Distinct scripts | 1 |
Distinct blocks | 1 |
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.
Unique | 0 |
Unique (%) | 0.0% |
1st row | TIN |
2nd row | TIN |
3rd row | TIN |
4th row | UNT |
5th row | TIN |
Common Values
Value | Count | Frequency (%) |
TIN | 1335 | |
UNT | 660 |
Histogram of lengths of the category
Category Frequency Plot
Value | Count | Frequency (%) |
tin | 1335 | |
unt | 660 |
Most occurring characters
Value | Count | Frequency (%) |
T | 1995 | |
N | 1995 | |
I | 1335 | |
U | 660 | 11.0% |
Most occurring categories
Value | Count | Frequency (%) |
Uppercase Letter | 5985 |
Most frequent character per category
Uppercase Letter
Value | Count | Frequency (%) |
T | 1995 | |
N | 1995 | |
I | 1335 | |
U | 660 | 11.0% |
Most occurring scripts
Value | Count | Frequency (%) |
Latin | 5985 |
Most frequent character per script
Value | Count | Frequency (%) |
T | 1995 | |
N | 1995 | |
I | 1335 | |
U | 660 | 11.0% |
Most occurring blocks
Value | Count | Frequency (%) |
ASCII | 5985 |
Most frequent character per block
Value | Count | Frequency (%) |
T | 1995 | |
N | 1995 | |
I | 1335 | |
U | 660 | 11.0% |
Distinct | 3 |
Distinct (%) | 0.3% |
Missing | 812 |
Missing (%) | 40.7% |
Memory size | 94.8 KiB |
IND | |
GRP | |
Max length | 3 |
Median length | 3 |
Mean length | 3 |
Min length | 3 |
Characters and Unicode
Total characters | 3549 |
Distinct characters | 9 |
Distinct categories | 1 |
Distinct scripts | 1 |
Distinct blocks | 1 |
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.
Unique | 0 |
Unique (%) | 0.0% |
1st row | IND |
2nd row | IND |
3rd row | IND |
4th row | IND |
5th row | IND |
Common Values
Value | Count | Frequency (%) |
IND | 751 | |
GRP | 237 | 11.9% |
OTH | 195 | 9.8% |
(Missing) | 812 |
Histogram of lengths of the category
Category Frequency Plot
Value | Count | Frequency (%) |
ind | 751 | |
grp | 237 | 20.0% |
oth | 195 | 16.5% |
Most occurring characters
Value | Count | Frequency (%) |
I | 751 | |
N | 751 | |
D | 751 | |
G | 237 | 6.7% |
R | 237 | 6.7% |
P | 237 | 6.7% |
O | 195 | 5.5% |
T | 195 | 5.5% |
H | 195 | 5.5% |
Most occurring categories
Value | Count | Frequency (%) |
Uppercase Letter | 3549 |
Most frequent character per category
Uppercase Letter
Value | Count | Frequency (%) |
I | 751 | |
N | 751 | |
D | 751 | |
G | 237 | 6.7% |
R | 237 | 6.7% |
P | 237 | 6.7% |
O | 195 | 5.5% |
T | 195 | 5.5% |
H | 195 | 5.5% |
Most occurring scripts
Value | Count | Frequency (%) |
Latin | 3549 |
Most frequent character per script
Value | Count | Frequency (%) |
I | 751 | |
N | 751 | |
D | 751 | |
G | 237 | 6.7% |
R | 237 | 6.7% |
P | 237 | 6.7% |
O | 195 | 5.5% |
T | 195 | 5.5% |
H | 195 | 5.5% |
Most occurring blocks
Value | Count | Frequency (%) |
ASCII | 3549 |
Most frequent character per block
Value | Count | Frequency (%) |
I | 751 | |
N | 751 | |
D | 751 | |
G | 237 | 6.7% |
R | 237 | 6.7% |
P | 237 | 6.7% |
O | 195 | 5.5% |
T | 195 | 5.5% |
H | 195 | 5.5% |
Distinct | 2 |
Distinct (%) | 0.1% |
Missing | 0 |
Missing (%) | 0.0% |
Memory size | 2.1 KiB |
False | |
True | 39 |
Value | Count | Frequency (%) |
False | 1956 | |
True | 39 | 2.0% |
Distinct | 2 |
Distinct (%) | 0.1% |
Missing | 0 |
Missing (%) | 0.0% |
Memory size | 2.1 KiB |
False | |
True |
Value | Count | Frequency (%) |
False | 1714 | |
True | 281 | 14.1% |
Distinct | 2 |
Distinct (%) | 0.1% |
Missing | 0 |
Missing (%) | 0.0% |
Memory size | 2.1 KiB |
True | |
False |
Value | Count | Frequency (%) |
True | 1551 | |
False | 444 | 22.3% |
Distinct | 2 |
Distinct (%) | 0.1% |
Missing | 0 |
Missing (%) | 0.0% |
Memory size | 2.1 KiB |
False | |
True | 91 |
Value | Count | Frequency (%) |
False | 1904 | |
True | 91 | 4.6% |
Distinct | 2 |
Distinct (%) | 0.1% |
Missing | 0 |
Missing (%) | 0.0% |
Memory size | 2.1 KiB |
False | |
True | 52 |
Value | Count | Frequency (%) |
False | 1943 | |
True | 52 | 2.6% |
Distinct | 2 |
Distinct (%) | 0.1% |
Missing | 0 |
Missing (%) | 0.0% |
Memory size | 2.1 KiB |
False | |
True | 89 |
Value | Count | Frequency (%) |
False | 1906 | |
True | 89 | 4.5% |
Distinct | 2 |
Distinct (%) | 0.1% |
Missing | 0 |
Missing (%) | 0.0% |
Memory size | 2.1 KiB |
False | |
True |
Value | Count | Frequency (%) |
False | 1257 | |
True | 738 |
Distinct | 2 |
Distinct (%) | 0.1% |
Missing | 0 |
Missing (%) | 0.0% |
Memory size | 2.1 KiB |
False | |
True | 12 |
Value | Count | Frequency (%) |
False | 1983 | |
True | 12 | 0.6% |
Distinct | 1 |
Distinct (%) | 0.1% |
Missing | 0 |
Missing (%) | 0.0% |
Memory size | 2.1 KiB |
False |
Value | Count | Frequency (%) |
False | 1995 |
Distinct | 2 |
Distinct (%) | 0.1% |
Missing | 0 |
Missing (%) | 0.0% |
Memory size | 2.1 KiB |
False | |
True | 69 |
Value | Count | Frequency (%) |
False | 1926 | |
True | 69 | 3.5% |
Spearman's ρ
Spearman's rank correlation coefficient (ρ) is a measure of monotonic correlation between two variables, and is therefore better at catching nonlinear monotonic correlations than Pearson's r. Its value lies between -1 and +1, with -1 indicating total negative monotonic correlation, 0 indicating no monotonic correlation, and +1 indicating total positive monotonic correlation. To calculate ρ for two variables X and Y, one divides the covariance of the rank variables of X and Y by the product of their standard deviations.
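The rank-based calculation described above can be sketched in plain Python. This is an illustrative sketch, not the code the profiling tool runs; the helper names `average_ranks`, `pearson`, and `spearman` are assumptions made for this example, and ties are handled by assigning average ranks.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Covariance of x and y divided by the product of their standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(sxx * syy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation computed on the rank variables."""
    return pearson(average_ranks(x), average_ranks(y))


# A monotonic but nonlinear relationship: y = x cubed.
x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]
rho = spearman(x, y)
r = pearson(x, y)
print(f"Spearman rho = {rho:.4f}, Pearson r = {r:.4f}")
```

On this toy input the ranks of x and y agree exactly, so ρ reaches its maximum while r stays below it, which is the difference the paragraph above describes.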
Pearson's r
Pearson's correlation coefficient (r) is a measure of linear correlation between two variables. Its value lies between -1 and +1, with -1 indicating total negative linear correlation, 0 indicating no linear correlation, and +1 indicating total positive linear correlation. Furthermore, r is invariant under separate changes in location and scale of the two variables, implying that for a linear function the angle to the x-axis does not affect r. To calculate r for two variables X and Y, one divides the covariance of X and Y by the product of their standard deviations.
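The definition and the invariance property can be illustrated with a short sketch. The `pearson` helper and the toy values are assumptions made for this example. The invariance shown holds for positive scale factors; a negative factor flips the sign of r. Returning NaN for a constant input is a choice made here, since the denominator is zero in that case.

```python
from math import isnan, sqrt
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Covariance of x and y divided by the product of their standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        # A constant variable has zero variance, so r is undefined.
        return float("nan")
    return cov / sqrt(sxx * syy)


# Toy data, invented for illustration.
x = [1.0, 2.0, 4.0, 7.0, 11.0]
y = [2.0, 3.0, 3.5, 8.0, 9.0]

r_original = pearson(x, y)

# Shift and rescale each variable separately (positive scale factors).
x_shifted = [2.0 * a + 3.0 for a in x]
y_shifted = [0.5 * b - 1.0 for b in y]
r_transformed = pearson(x_shifted, y_shifted)

# A negative scale factor reverses the direction of the relationship.
r_flipped = pearson([-a for a in x], y)

print(r_original, r_transformed, r_flipped)
```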
Kendall's τ
Similarly to Spearman's rank correlation coefficient, the Kendall rank correlation coefficient (τ) measures ordinal association between two variables. Its value lies between -1 and +1, with -1 indicating total negative correlation, 0 indicating no correlation, and +1 indicating total positive correlation. To calculate τ for two variables X and Y, one determines the number of concordant and discordant pairs of observations. τ is the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs.
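The pair-counting recipe above can be sketched directly. This follows the formula exactly as stated, the variant usually called tau-a, which makes no adjustment for ties; statistical libraries often report a tie-adjusted variant instead, so values can differ on tied data. The function name and the toy values are assumptions made for this example.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau_a(x: Sequence[float], y: Sequence[float]) -> float:
    """(concordant pairs - discordant pairs) / total number of pairs."""
    concordant = 0
    discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:
            concordant += 1  # both variables move in the same direction
        elif direction < 0:
            discordant += 1  # the variables move in opposite directions
        # direction == 0 means a tie, counted as neither
    n = len(x)
    total_pairs = n * (n - 1) // 2
    return (concordant - discordant) / total_pairs


# Toy data: one swapped pair out of six.
x = [1, 2, 3, 4]
y = [1, 3, 2, 4]
tau = kendall_tau_a(x, y)
print(f"tau = {tau:.4f}")
```

With four observations there are six pairs; five are concordant and one is discordant, giving τ = 4/6.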
Cramér's V (φc)
Cramér's V is an association measure for nominal random variables. The coefficient ranges from 0 to 1, with 0 indicating independence and 1 indicating perfect association. The empirical estimators used for Cramér's V have been proved to be biased, even for large samples. We use a bias-corrected measure that has been proposed by Bergsma in 2013 that can be found here.
Phik (φk)
Phik (φk) is a new and practical correlation coefficient that works consistently between categorical, ordinal and interval variables, captures non-linear dependency and reverts to the Pearson correlation coefficient in case of a bivariate normal input distribution. There is extensive documentation available here.
A simple visualization of nullity by column.
The nullity matrix is a data-dense display that lets you quickly pick out patterns in data completion by eye.
The correlation heatmap measures nullity correlation: how strongly the presence or absence of one variable affects the presence of another.
The dendrogram allows you to more fully correlate variable completion, revealing trends deeper than the pairwise ones visible in the correlation heatmap.
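The nullity correlation shown in the heatmap can be approximated by correlating per-column presence indicators. The sketch below assumes that nullity correlation is the Pearson correlation between 0/1 "value present" flags. The rows are a hypothetical toy sample that only borrows the column names `targeted_type` and `toxic_spans` from this dataset; they are not actual records.

```python
from math import sqrt
from typing import Dict, List, Optional, Sequence

# Hypothetical toy rows; None marks a missing value.
rows: List[Dict[str, Optional[object]]] = [
    {"targeted_type": "IND", "toxic_spans": [31, 32]},
    {"targeted_type": None, "toxic_spans": []},
    {"targeted_type": None, "toxic_spans": None},
    {"targeted_type": "GRP", "toxic_spans": [4, 5]},
    {"targeted_type": None, "toxic_spans": None},
    {"targeted_type": "OTH", "toxic_spans": [0]},
]


def presence(records: Sequence[Dict[str, Optional[object]]], column: str) -> List[int]:
    """1 where the column has a value, 0 where it is missing."""
    return [0 if record[column] is None else 1 for record in records]


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Covariance divided by the product of the standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(sxx * syy)


targeted_present = presence(rows, "targeted_type")
spans_present = presence(rows, "toxic_spans")
nullity_corr = pearson(targeted_present, spans_present)
print(f"nullity correlation = {nullity_corr:.3f}")
```

A value near +1 means the two columns tend to be present or missing together; a value near -1 means one tends to be present when the other is missing. An empty list counts as present here, since only None is treated as missing.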
First rows
(index) | id | text | is_offensive | is_targeted | targeted_type | toxic_spans | health | ideology | insult | lgbtqphobia | other_lifestyle | physical_aspects | profanity_obscene | racism | religious_intolerance | sexism | xenophobia |
0 | ccc932bedff24efd91ade52c236d554b | USER USER USER pelo contrario, eu sou USER bolsonaro e USER USER, pelo seu xingamento já da pra perceber que não tem argumento, USER só é mais um ser humano que gosta de baba-ovo de outro ser-humano que é bolsonaro. | OFF | TIN | IND | [170, 171, 172, 173, 174, 175, 176, 177, 205, 206, 207, 208, 209, 210, 211, 212, 213] | False | True | True | False | False | False | True | False | False | False | False |
1 | 1e1fa5c2011f4b79ab8fa17d311abc8d | USER e a favor do Nazismo cabe safado | OFF | TIN | IND | [31, 32, 33, 34, 35, 36] | False | True | True | False | False | False | True | False | False | False | False |
2 | 847d23e970d84e3ca687911ec2791af7 | USER USER USER USER USER A transposição foi inaugurada pelo Lula, Dilma, Temer e Bolsonaro. Mas pra vcs a obra é só do Genocida. Quero saber, no q o país melhorou pro gado tentar reeleger esse lixo? | OFF | TIN | IND | [167, 168, 169, 170, 193, 194, 195, 196] | False | True | True | False | False | False | False | False | False | False | False |
3 | 8f4e011d21ea4ed6ab1702a2acf43c0b | USER tens de continuar, eu pratico ginásio à uns 3 anos e contino-o o mesmo pau de virar tripas xddd | OFF | UNT | None | [] | False | False | True | False | False | True | True | False | False | False | False |
4 | 08f67adbd9c54cd1a308b77ec733af77 | USER USER NAO VOU APAGAR A PAGINA PORRAAAAAAA VAMOOOOOO TIMEEEE | OFF | TIN | None | [34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45] | False | False | True | False | False | False | True | False | False | False | False |
5 | e9e09ac235d2427e84c28f2106d0d2a6 | USER USER Tu sabe falar outra coisa além de desvio de caráter, seu lambe ovo de miliciano? Já levou sua mãe pra o Bozo comer o 👌🏼 ? Puxa saco do crlh | OFF | TIN | IND | [67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 114, 115, 116, 117, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148] | False | True | True | False | False | False | True | False | False | False | False |
6 | 327d36f100b44c8f84f2844a11336d84 | Esses idiotas falam que é normal morrer por corona, porque não foi eles e nem ninguém da família, se USER é justo muitas pessoa aí vai pegar corona vírus, se USER quiser! | OFF | TIN | None | [5, 6, 7, 8, 9, 10, 11, 12, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152] | True | True | True | False | False | False | False | False | False | False | False |
7 | e67d642951dd42ed9da034a8765ba6c2 | ou em pedem esse maluco ou estamos perdidos. só faz merda | OFF | TIN | IND | [51, 52, 53, 54, 55, 56] | False | False | True | False | False | False | False | False | False | False | False |
8 | f3726ce30fbb4260b20a02d85f0a42c3 | USER o espírito de rico safado | OFF | TIN | IND | [24, 25, 26, 27, 28, 29] | False | False | True | False | False | False | True | False | False | False | False |
9 | 38b85624d15241dfa8f9eaab8ab5a556 | Globo jornalista apresentador de TV ator atriz apoiador Adélio USER não vai ter aonde vocês roubar .por isso q esta lamentando .pode chorar avontade q não tem mais leite . | OFF | UNT | None | [] | False | False | True | False | False | False | False | False | False | False | False |
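In the rows above, the toxic_spans values line up with 0-based character offsets into the text column: in row 1, offsets 31 to 36 cover the last word of the text. The following is a minimal sketch for turning such an offset list into substrings, assuming the column holds Python lists of integers (or None) rather than strings; the function name `extract_spans` is an assumption made for this example.

```python
from typing import List, Optional


def extract_spans(text: str, offsets: Optional[List[int]]) -> List[str]:
    """Group consecutive character offsets and return the substrings they cover."""
    if not offsets:
        # Covers both a missing value (None) and an empty list.
        return []
    ordered = sorted(offsets)
    spans = []
    start = prev = ordered[0]
    for pos in ordered[1:]:
        if pos != prev + 1:
            # A gap in the offsets closes the current span.
            spans.append(text[start:prev + 1])
            start = pos
        prev = pos
    spans.append(text[start:prev + 1])
    return spans


# Row 1 of the sample above.
text = "USER e a favor do Nazismo cabe safado"
offsets = [31, 32, 33, 34, 35, 36]
print(extract_spans(text, offsets))
```

Converting the lists to tuples or strings before profiling would be one way to make the column hashable, which may be why it is reported as an unsupported type; that cause is an assumption and is not stated in the report.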
Last rows
(index) | id | text | is_offensive | is_targeted | targeted_type | toxic_spans | health | ideology | insult | lgbtqphobia | other_lifestyle | physical_aspects | profanity_obscene | racism | religious_intolerance | sexism | xenophobia |
1985 | a0647638f0a6473e85d56a24c0f4ecdc | Empresa nojenta | OFF | TIN | OTH | [8, 9, 10, 11, 12, 13, 14] | False | False | True | False | False | False | False | False | False | False | False |
1986 | 31b64791facf43a2b27139dfa187740f | Olha que legal agora estamos chegando ao final da entrevistas percebendo que o professor entrou aí pra fazer campanha pro movimento lgbt né ! Só pra lembrar que esse movimento ODEIA ISRAEL 🇮🇱 E USER USER ,e amam os árabes que mandam pendurar gays no guindaste 🏗! USER o tanto que esse movimento tem força nos dias de hoje ! Atacam a família descaradamente ! | OFF | UNT | None | [] | False | False | True | True | False | False | False | False | False | False | False |
1987 | e9a20c214c154abbb738286417c79af9 | Ele fala tão bem que quase pegou os repórter USER USER | NOT | UNT | None | None | False | False | False | False | False | False | False | False | False | False | False |
1988 | ca8fe154762749689a5ecc8ad6ff68cf | USER Ya deje de violar la ley y de hacerle el caldo gordo al bueno para nada . Póngase a trabajar y deje de grillar . | OFF | UNT | None | [] | False | False | False | False | False | False | False | False | False | False | False |
1989 | f1f4ef44e2464384b7bc6fdf9e1660a8 | USER cara nojento do caralho | OFF | TIN | IND | [10, 11, 12, 13, 14, 15, 16, 21, 22, 23, 24, 25, 26, 27] | False | False | True | False | False | False | True | False | False | False | False |
1990 | b410d2c87b984e4ab1b8925647d9825b | USER Novidade...facada fake | NOT | UNT | None | None | False | False | False | False | False | False | False | False | False | False | False |
1991 | 0e3aeabd1cbb466cbfac54d39062a072 | Caguei se oque elas falam apoia quem sofre | OFF | TIN | None | [0, 1, 2, 3, 4, 5] | False | False | True | False | False | False | False | False | False | False | False |
1992 | d00cca68919e45c2a76309ce8c9d55e8 | foram embora só agora da minha casa, bando de vagabundo | OFF | TIN | GRP | [45, 46, 47, 48, 49, 50, 51, 52, 53, 54] | False | False | True | False | False | False | True | False | False | False | False |
1993 | 0e4dacc937ae456a973080bc3ae76eab | Parece que foi apenas uma fantasia sexual da parte dela. Ou ela é ninfomaníaca e estava com muito tesão. É difícil entender esse caso. Talvez o marido sabia dessa fantasia. Só que não aguentou presenciar a cena. Quem se deu mal e sofreu agressão, foi o mendingo. Agora, as especulações e as mentiras, são muitas. | NOT | UNT | None | None | False | False | False | False | False | False | False | False | False | False | False |
1994 | 437560d73d2b4433839bc1b9831a1a22 | BANDO DE HOMOFOBICOS!!! Vou continuar comprando! inclusive ontem mesmo comprei esse povo USER sabe desrespeitar mas quando desrespeitam eles ai USER pode ne? hipocrisia enfia amo os lanches crtz q quem fala que vai compra em outro lugar...USER USER tem dinheiro ne???😁😂 | OFF | UNT | None | [] | False | False | True | False | False | False | False | False | False | False | False |