Dataset statistics
Number of variables | 17 |
Number of observations | 2987 |
Missing cells | 1493 |
Missing cells (%) | 2.9% |
Duplicate rows | 0 |
Duplicate rows (%) | 0.0% |
Total size in memory | 2.3 MiB |
Average record size in memory | 791.4 B |
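As a quick consistency check, the missing-cell percentage can be reproduced from the other figures in this table. This is a minimal sketch that assumes the percentage is defined as missing cells divided by variables × observations; the variable names are illustrative.

```python
# Figures copied from the "Dataset statistics" table.
n_variables = 17
n_observations = 2987
missing_cells = 1493

# Assumption: the percentage is missing cells over the total number of cells.
total_cells = n_variables * n_observations
missing_pct = 100 * missing_cells / total_cells

# The two "Missing" alerts in this report (targeted_type and toxic_spans)
# should account for every missing cell.
missing_by_column = {"targeted_type": 1136, "toxic_spans": 357}
columns_add_up = sum(missing_by_column.values()) == missing_cells

print(f"{total_cells} cells, {missing_pct:.1f}% missing, columns add up: {columns_add_up}")
```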
Variable types
Categorical | 5 |
Unsupported | 1 |
Boolean | 11 |
religious_intolerance has constant value "False" | Constant |
id has a high cardinality: 2987 distinct values | High cardinality |
text has a high cardinality: 2987 distinct values | High cardinality |
lgbtqphobia is highly correlated with religious_intolerance | High correlation |
is_targeted is highly correlated with insult and 3 other fields | High correlation |
insult is highly correlated with is_targeted and 2 other fields | High correlation |
xenophobia is highly correlated with religious_intolerance | High correlation |
health is highly correlated with religious_intolerance | High correlation |
other_lifestyle is highly correlated with religious_intolerance | High correlation |
religious_intolerance is highly correlated with lgbtqphobia and 12 other fields | High correlation |
sexism is highly correlated with religious_intolerance | High correlation |
racism is highly correlated with religious_intolerance | High correlation |
physical_aspects is highly correlated with religious_intolerance | High correlation |
targeted_type is highly correlated with is_targeted and 2 other fields | High correlation |
is_offensive is highly correlated with is_targeted and 3 other fields | High correlation |
profanity_obscene is highly correlated with religious_intolerance | High correlation |
ideology is highly correlated with religious_intolerance | High correlation |
is_offensive is highly correlated with is_targeted and 1 other fields | High correlation |
is_targeted is highly correlated with is_offensive and 1 other fields | High correlation |
health is highly correlated with physical_aspects | High correlation |
insult is highly correlated with is_offensive and 1 other fields | High correlation |
physical_aspects is highly correlated with health | High correlation |
targeted_type has 1136 (38.0%) missing values | Missing |
toxic_spans has 357 (12.0%) missing values | Missing |
id is uniformly distributed | Uniform |
text is uniformly distributed | Uniform |
id has unique values | Unique |
text has unique values | Unique |
toxic_spans is an unsupported type, check if it needs cleaning or further analysis | Unsupported |
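The toxic_spans column holds lists of character offsets, which is presumably why the profiler cannot type it. One way to make it analysable is to collapse each list into contiguous ranges and slice the text. The sketch below assumes the offsets are zero-based indices into the Python string and that missing values arrive as None; the helper names group_offsets and extract_spans are hypothetical.

```python
from typing import List, Optional, Tuple


def group_offsets(offsets: Optional[List[int]]) -> List[Tuple[int, int]]:
    """Collapse character offsets into half-open (start, end) ranges."""
    if not offsets:  # None or empty list: no toxic span annotated
        return []
    ordered = sorted(offsets)
    ranges = []
    start = prev = ordered[0]
    for pos in ordered[1:]:
        if pos != prev + 1:  # gap found, so close the current run
            ranges.append((start, prev + 1))
            start = pos
        prev = pos
    ranges.append((start, prev + 1))
    return ranges


def extract_spans(text: str, offsets: Optional[List[int]]) -> List[str]:
    """Return the substrings of text covered by the offset ranges."""
    return [text[a:b] for a, b in group_offsets(offsets)]


# Row 5 of the "First rows" table in this report.
text = "Um ignorante vs um arrogante."
offsets = list(range(2, 13)) + list(range(18, 29))

print(group_offsets(offsets))
print(extract_spans(text, offsets))
```

Storing ranges instead of raw offsets gives a column that can be summarised (number of spans, span length) with ordinary numeric statistics.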
Analysis started | 2022-09-01 22:19:46.870158 |
Analysis finished | 2022-09-01 22:19:49.191505 |
Duration | 2.32 seconds |
Software version | pandas-profiling v3.2.0 |
id
Distinct | 2987 |
Distinct (%) | 100.0% |
Missing | 0 |
Missing (%) | 0.0% |
Memory size | 259.7 KiB |
Max length | 32 |
Median length | 32 |
Mean length | 32 |
Min length | 32 |
Characters and Unicode
Total characters | 95584 |
Distinct characters | 16 |
Distinct categories | 2 |
Distinct scripts | 2 |
Distinct blocks | 1 |
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.
Unique | 2987 |
Unique (%) | 100.0% |
1st row | 460cca26c7144f9ea5edabe45d33ad7d |
2nd row | 542c3afa48114e4890b27b32f4b88c05 |
3rd row | 2d65e628400149438210298a6972b3a0 |
4th row | e1dfbeaae05145b2afc5544149d22f61 |
5th row | 614aeda1a791406a96671958a7a49aae |
Common Values
Value | Count | Frequency (%) |
460cca26c7144f9ea5edabe45d33ad7d | 1 | < 0.1% |
6c2bc42785174e608f50d633a450ac38 | 1 | < 0.1% |
b4b6e8089ec04ba098d949484215d6fe | 1 | < 0.1% |
e4b3e1931eef473388c16d6e7923d664 | 1 | < 0.1% |
8c3be0ca235445c3890dc48c74430ca0 | 1 | < 0.1% |
dbee3d03d628433b99f29de2d3f7edae | 1 | < 0.1% |
89f1188b93c54b0694b19a813b5f4e48 | 1 | < 0.1% |
3df756e3743345b7935b1da23978c00d | 1 | < 0.1% |
ee8f9a3c67dc4e568186444d04bd6a58 | 1 | < 0.1% |
a5d9d4b7951d456f8b4289b979f3ea4f | 1 | < 0.1% |
Other values (2977) | 2977 |
Histogram of lengths of the category
Value | Count | Frequency (%) |
460cca26c7144f9ea5edabe45d33ad7d | 1 | < 0.1% |
04882791c0bd42c888983a5e554be036 | 1 | < 0.1% |
81ed0a68e6c84922885e438d12478f0b | 1 | < 0.1% |
161d8eaccf0442aaae0a13a454f1d034 | 1 | < 0.1% |
ee2c038d4d7b40a28001c8b87dd0e8d0 | 1 | < 0.1% |
2d65e628400149438210298a6972b3a0 | 1 | < 0.1% |
e1dfbeaae05145b2afc5544149d22f61 | 1 | < 0.1% |
614aeda1a791406a96671958a7a49aae | 1 | < 0.1% |
0a53c08cb0c74b22b476638086c650d6 | 1 | < 0.1% |
6c3b5f35b2b54cafaac825ac4edeb705 | 1 | < 0.1% |
Other values (2977) | 2977 |
Most occurring characters
Value | Count | Frequency (%) |
4 | 8447 | 8.8% |
b | 6426 | 6.7% |
8 | 6421 | 6.7% |
9 | 6307 | 6.6% |
a | 6144 | 6.4% |
0 | 5743 | 6.0% |
6 | 5716 | 6.0% |
1 | 5681 | 5.9% |
7 | 5677 | 5.9% |
e | 5633 | 5.9% |
Other values (6) | 33389 |
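The fixed length of 32, the 16 distinct characters, and the elevated share of the digit 4 are consistent with version 4 UUIDs written as plain hex, where one nibble is always 4 and another is always one of 8, 9, a or b. The report does not state this, so treat it as an inference. A minimal sketch of a format check, with the hypothetical helper is_uuid4_hex, could look like this:

```python
import uuid
from collections import Counter

# The five sample ids listed under "1st row" to "5th row" in this report.
sample_ids = [
    "460cca26c7144f9ea5edabe45d33ad7d",
    "542c3afa48114e4890b27b32f4b88c05",
    "2d65e628400149438210298a6972b3a0",
    "e1dfbeaae05145b2afc5544149d22f61",
    "614aeda1a791406a96671958a7a49aae",
]


def is_uuid4_hex(value: str) -> bool:
    """True if value is a version 4 UUID written as 32 lowercase hex characters."""
    try:
        parsed = uuid.UUID(hex=value)
    except ValueError:
        return False
    # uuid.UUID also accepts hyphens and braces, so compare against the plain form.
    return parsed.version == 4 and parsed.hex == value


all_valid = all(is_uuid4_hex(i) for i in sample_ids)
char_counts = Counter("".join(sample_ids))

print(all_valid)
print(char_counts.most_common(3))
```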
Most occurring categories
Value | Count | Frequency (%) |
Decimal Number | 60620 | |
Lowercase Letter | 34964 |
Most frequent character per category
Decimal Number
Value | Count | Frequency (%) |
4 | 8447 | |
8 | 6421 | |
9 | 6307 | |
0 | 5743 | |
6 | 5716 | |
1 | 5681 | |
7 | 5677 | |
5 | 5567 | |
2 | 5561 | |
3 | 5500 |
Lowercase Letter
Value | Count | Frequency (%) |
b | 6426 | |
a | 6144 | |
e | 5633 | |
f | 5622 | |
d | 5597 | |
c | 5542 |
Most occurring scripts
Value | Count | Frequency (%) |
Common | 60620 | |
Latin | 34964 |
Most frequent character per script
Value | Count | Frequency (%) |
4 | 8447 | |
8 | 6421 | |
9 | 6307 | |
0 | 5743 | |
6 | 5716 | |
1 | 5681 | |
7 | 5677 | |
5 | 5567 | |
2 | 5561 | |
3 | 5500 |
Value | Count | Frequency (%) |
b | 6426 | |
a | 6144 | |
e | 5633 | |
f | 5622 | |
d | 5597 | |
c | 5542 |
Most occurring blocks
Value | Count | Frequency (%) |
ASCII | 95584 |
Most frequent character per block
Value | Count | Frequency (%) |
4 | 8447 | 8.8% |
b | 6426 | 6.7% |
8 | 6421 | 6.7% |
9 | 6307 | 6.6% |
a | 6144 | 6.4% |
0 | 5743 | 6.0% |
6 | 5716 | 6.0% |
1 | 5681 | 5.9% |
7 | 5677 | 5.9% |
e | 5633 | 5.9% |
Other values (6) | 33389 |
text
Distinct | 2987 |
Distinct (%) | 100.0% |
Missing | 0 |
Missing (%) | 0.0% |
Memory size | 995.9 KiB |
Max length | 995 |
Median length | 454 |
Mean length | 119.9099431 |
Min length | 5 |
Characters and Unicode
Total characters | 358171 |
Distinct characters | 317 |
Distinct categories | 19 |
Distinct scripts | 6 |
Distinct blocks | 12 |
Unique | 2987 |
Unique (%) | 100.0% |
1st row | "vc é minha vida bela" vc não tem vida seu peido do capeta USER kkkkkkkkkk rachei |
2nd row | USER Caralho mano eu ia mandar um vai tomar no cu mas ai lembrei do meme, por um momento meu sangue subiu e olha que nem sigo vc foi tt da Iza |
3rd row | Os ignorantes nos comentários querendo dar desculpinhas pra comentar coisas preconceituosas. Que bom que as crianças assim tão novas já tem uma visão mais evoluída e empatica que vocês, bando de preconceituosos nojentos, que ao invés de estarem do lado do amor e de um mundo melhor estão do lado do preconceito e da dor, vocês só pensam assim porque não tem o mínimo de cultura e conhecimento, não sabem nem um terço da dor de um LGBTQ+, agora ao invés da gente ter que desenhar para as crianças entenderem temos que desenhar para os adultos entenderem, ENSINO USER IUSEROLOGIA USER GÊNERO NECESSÁRIO USER USER, POR UM MUNDO COM MAIS USER E MENOS USER! 💞 |
4th row | rafael é muito fdp, praga de garoto... |
5th row | PUTARIA 🔞 URL |
Common Values
Value | Count | Frequency (%) |
"vc é minha vida bela" vc não tem vida seu peido do capeta USER kkkkkkkkkk rachei | 1 | < 0.1% |
USER jô? indoidou de vez? esse cara só pode tá louco , ou e mais um dos grandes clubistas q trabalham como jornalista esportivo.... | 1 | < 0.1% |
USER USER preoUSERupão tanto d fika falando mal das modinhas USER a gente gosta USER nem USER preoUSERupão d proUSERurar no youtube mulheres USERantoras tentando fazer fama ai USER fika atraz d homens pra falar mal e USER USER USER tem mulheres no youtube tipo amannda USERondesUSERa USER homens ja perUSEReberão USER USER USER vem atraz d outros homens e nunUSERa d mulheres ow meu USER sera USER todos os homens vão vira gay desUSERulpa felipe mais USERresçe e vai atraz d alguma mulher pra fazer sexo pUSER tenho USERerteza homen USER fala muito d mulher virgem | 1 | < 0.1% |
Alguma bixona metida a hetero pode me dizer pq o brasil vai trocar de treinador antes da copa do mundo? | 1 | < 0.1% |
Tem coisa que não adianta debater. USER falar de um partido nazista é como dizer que tem coisa boa em um psicopata. Não pode defender o indefensável. | 1 | < 0.1% |
Escapou de morrer, se ele morresse ela iria dizer que foi estuprada. | 1 | < 0.1% |
USER Esse governo maldito precisa pagar por seus crimes contra o povo brasileiro. HASHTAG | 1 | < 0.1% |
Não reclame d assédio,se vc sai c/ um pervetido sexual q conheceu há 2 min. e exije q ele te respeite, mesmo vc estando alcoolizada /drogada | 1 | < 0.1% |
que porra e crepuscolo ?e uma merda! | 1 | < 0.1% |
USER USER tenho msm mana sou conhecida no tt por maria putinha, vc queria oq? q eu passasse na dm dos outros falando a palavra de Deus? | 1 | < 0.1% |
Other values (2977) | 2977 |
Histogram of lengths of the category
Value | Count | Frequency (%) |
user | 3922 | 6.0% |
que | 1950 | 3.0% |
de | 1802 | 2.7% |
e | 1598 | 2.4% |
o | 1555 | 2.4% |
a | 1337 | 2.0% |
é | 1136 | 1.7% |
não | 934 | 1.4% |
se | 670 | 1.0% |
do | 634 | 1.0% |
Other values (10633) | 50086 |
Most occurring characters
Value | Count | Frequency (%) |
(space) | 62637 | 17.5% |
a | 31631 | 8.8% |
e | 29340 | 8.2% |
o | 25968 | 7.3% |
s | 17702 | 4.9% |
r | 16069 | 4.5% |
i | 14470 | 4.0% |
d | 12263 | 3.4% |
n | 12222 | 3.4% |
m | 11926 | 3.3% |
Other values (307) | 123943 |
Most occurring categories
Value | Count | Frequency (%) |
Lowercase Letter | 247272 | |
Space Separator | 62637 | 17.5% |
Uppercase Letter | 36417 | 10.2% |
Other Punctuation | 9498 | 2.7% |
Other Symbol | 883 | 0.2% |
Decimal Number | 734 | 0.2% |
Dash Punctuation | 185 | 0.1% |
Math Symbol | 113 | < 0.1% |
Open Punctuation | 106 | < 0.1% |
Close Punctuation | 99 | < 0.1% |
Other values (9) | 227 | 0.1% |
Most frequent character per category
Other Symbol
Value | Count | Frequency (%) |
🤣 | 143 | 16.2% |
😂 | 123 | 13.9% |
🤮 | 110 | 12.5% |
🤬 | 26 | 2.9% |
😭 | 21 | 2.4% |
😡 | 19 | 2.2% |
❤ | 19 | 2.2% |
😈 | 15 | 1.7% |
💩 | 14 | 1.6% |
👏 | 13 | 1.5% |
Other values (134) | 380 |
Lowercase Letter
Value | Count | Frequency (%) |
a | 31631 | |
e | 29340 | |
o | 25968 | |
s | 17702 | 7.2% |
r | 16069 | 6.5% |
i | 14470 | 5.9% |
d | 12263 | 5.0% |
n | 12222 | 4.9% |
m | 11926 | 4.8% |
u | 11289 | 4.6% |
Other values (42) | 64392 |
Uppercase Letter
Value | Count | Frequency (%) |
E | 6327 | |
R | 5832 | |
S | 5703 | |
U | 5437 | |
A | 2175 | 6.0% |
O | 1589 | 4.4% |
T | 931 | 2.6% |
N | 809 | 2.2% |
M | 796 | 2.2% |
D | 789 | 2.2% |
Other values (32) | 6029 |
Other Punctuation
Value | Count | Frequency (%) |
. | 3538 | |
, | 2956 | |
! | 1533 | |
? | 619 | 6.5% |
" | 274 | 2.9% |
: | 253 | 2.7% |
' | 85 | 0.9% |
… | 74 | 0.8% |
* | 71 | 0.7% |
/ | 35 | 0.4% |
Other values (9) | 60 | 0.6% |
Other Letter
Value | Count | Frequency (%) |
茫 | 8 | |
ㅤ | 6 | |
莽 | 6 | |
茅 | 4 | |
贸 | 2 | 4.5% |
谩 | 2 | 4.5% |
锚 | 2 | 4.5% |
ا | 2 | 4.5% |
铆 | 2 | 4.5% |
脳 | 1 | 2.3% |
Other values (9) | 9 |
Decimal Number
Value | Count | Frequency (%) |
0 | 160 | |
2 | 134 | |
1 | 131 | |
3 | 76 | |
4 | 43 | 5.9% |
9 | 43 | 5.9% |
6 | 40 | 5.4% |
5 | 37 | 5.0% |
8 | 37 | 5.0% |
7 | 33 | 4.5% |
Math Symbol
Value | Count | Frequency (%) |
> | 61 | |
+ | 21 | 18.6% |
¬ | 17 | 15.0% |
= | 10 | 8.8% |
< | 2 | 1.8% |
~ | 1 | 0.9% |
| | 1 | 0.9% |
Modifier Symbol
Value | Count | Frequency (%) |
🏻 | 22 | |
🏼 | 8 | 16.3% |
🏾 | 7 | 14.3% |
^ | 6 | 12.2% |
🏽 | 3 | 6.1% |
🏿 | 2 | 4.1% |
´ | 1 | 2.0% |
Open Punctuation
Value | Count | Frequency (%) |
( | 96 | |
[ | 9 | 8.5% |
〝 | 1 | 0.9% |
Close Punctuation
Value | Count | Frequency (%) |
) | 88 | |
] | 10 | 10.1% |
〞 | 1 | 1.0% |
Dash Punctuation
Value | Count | Frequency (%) |
- | 181 | |
— | 4 | 2.2% |
Final Punctuation
Value | Count | Frequency (%) |
” | 26 | |
’ | 4 | 13.3% |
Space Separator
Value | Count | Frequency (%) |
(space) | 62637 |
Nonspacing Mark
Value | Count | Frequency (%) |
️ | 39 |
Initial Punctuation
Value | Count | Frequency (%) |
“ | 33 |
Format
Value | Count | Frequency (%) |
| 14 |
Connector Punctuation
Value | Count | Frequency (%) |
_ | 13 |
Currency Symbol
Value | Count | Frequency (%) |
$ | 3 |
Other Number
Value | Count | Frequency (%) |
¹ | 2 |
Most occurring scripts
Value | Count | Frequency (%) |
Latin | 283681 | |
Common | 74394 | 20.8% |
Inherited | 53 | < 0.1% |
Han | 28 | < 0.1% |
Arabic | 9 | < 0.1% |
Hangul | 6 | < 0.1% |
Most frequent character per script
Value | Count | Frequency (%) |
(space) | 62637 | 84.2% |
. | 3538 | 4.8% |
, | 2956 | 4.0% |
! | 1533 | 2.1% |
? | 619 | 0.8% |
" | 274 | 0.4% |
: | 253 | 0.3% |
- | 181 | 0.2% |
0 | 160 | 0.2% |
🤣 | 143 | 0.2% |
Other values (199) | 2100 | 2.8% |
Value | Count | Frequency (%) |
a | 31631 | 11.2% |
e | 29340 | 10.3% |
o | 25968 | 9.2% |
s | 17702 | 6.2% |
r | 16069 | 5.7% |
i | 14470 | 5.1% |
d | 12263 | 4.3% |
n | 12222 | 4.3% |
m | 11926 | 4.2% |
u | 11289 | 4.0% |
Other values (78) | 100801 |
Value | Count | Frequency (%) |
茫 | 8 | |
莽 | 6 | |
茅 | 4 | |
贸 | 2 | 7.1% |
谩 | 2 | 7.1% |
锚 | 2 | 7.1% |
铆 | 2 | 7.1% |
脳 | 1 | 3.6% |
么 | 1 | 3.6% |
Value | Count | Frequency (%) |
ا | 2 | |
م | 1 | |
ه | 1 | |
و | 1 | |
ل | 1 | |
ع | 1 | |
ن | 1 | |
ص | 1 |
Value | Count | Frequency (%) |
️ | 39 | |
| 14 | 26.4% |
Value | Count | Frequency (%) |
ㅤ | 6 |
Most occurring blocks
Value | Count | Frequency (%) |
ASCII | 349272 | |
None | 8266 | 2.3% |
Emoticons | 319 | 0.1% |
Punctuation | 158 | < 0.1% |
VS | 39 | < 0.1% |
Dingbats | 34 | < 0.1% |
CJK | 28 | < 0.1% |
Misc Symbols | 25 | < 0.1% |
Arabic | 9 | < 0.1% |
Math Alphanum | 9 | < 0.1% |
Other values (2) | 12 | < 0.1% |
Most frequent character per block
Value | Count | Frequency (%) |
(space) | 62637 | 17.9% |
a | 31631 | 9.1% |
e | 29340 | 8.4% |
o | 25968 | 7.4% |
s | 17702 | 5.1% |
r | 16069 | 4.6% |
i | 14470 | 4.1% |
d | 12263 | 3.5% |
n | 12222 | 3.5% |
m | 11926 | 3.4% |
Other values (82) | 115044 |
Value | Count | Frequency (%) |
ã | 1953 | |
é | 1631 | |
á | 912 | |
ç | 763 | 9.2% |
í | 626 | 7.6% |
ó | 582 | 7.0% |
ê | 384 | 4.6% |
ú | 197 | 2.4% |
É | 146 | 1.8% |
🤣 | 143 | 1.7% |
Other values (118) | 929 |
Value | Count | Frequency (%) |
😂 | 123 | |
😭 | 21 | 6.6% |
😡 | 19 | 6.0% |
😈 | 15 | 4.7% |
😒 | 12 | 3.8% |
😍 | 11 | 3.4% |
😅 | 8 | 2.5% |
😠 | 7 | 2.2% |
😹 | 7 | 2.2% |
🙏 | 7 | 2.2% |
Other values (33) | 89 |
Value | Count | Frequency (%) |
… | 74 | |
“ | 33 | |
” | 26 | 16.5% |
| 14 | 8.9% |
— | 4 | 2.5% |
’ | 4 | 2.5% |
• | 3 | 1.9% |
Value | Count | Frequency (%) |
️ | 39 |
Value | Count | Frequency (%) |
❤ | 19 | |
✌ | 5 | 14.7% |
✔ | 3 | 8.8% |
❓ | 2 | 5.9% |
✝ | 1 | 2.9% |
✍ | 1 | 2.9% |
❌ | 1 | 2.9% |
✨ | 1 | 2.9% |
✊ | 1 | 2.9% |
Misc Symbols
Value | Count | Frequency (%) |
♂ | 8 | |
♥ | 5 | |
♀ | 5 | |
♡ | 3 | 12.0% |
☑ | 1 | 4.0% |
⚖ | 1 | 4.0% |
☆ | 1 | 4.0% |
☺ | 1 | 4.0% |
Value | Count | Frequency (%) |
茫 | 8 | |
莽 | 6 | |
茅 | 4 | |
贸 | 2 | 7.1% |
谩 | 2 | 7.1% |
锚 | 2 | 7.1% |
铆 | 2 | 7.1% |
脳 | 1 | 3.6% |
么 | 1 | 3.6% |
Compat Jamo
Value | Count | Frequency (%) |
ㅤ | 6 |
Value | Count | Frequency (%) |
ا | 2 | |
م | 1 | |
ه | 1 | |
و | 1 | |
ل | 1 | |
ع | 1 | |
ن | 1 | |
ص | 1 |
Math Alphanum
Value | Count | Frequency (%) |
𝕡 | 2 | |
𝕒 | 2 | |
𝑹 | 1 | |
𝕟 | 1 | |
𝕖 | 1 | |
𝕆 | 1 | |
𝕜 | 1 |
Enclosed Alphanum Sup
Value | Count | Frequency (%) |
🇧 | 2 | |
🇷 | 2 | |
🇨 | 1 | |
🇳 | 1 |
is_offensive
Distinct | 2 |
Distinct (%) | 0.1% |
Missing | 0 |
Missing (%) | 0.0% |
Memory size | 175.1 KiB |
Max length | 3 |
Median length | 3 |
Mean length | 3 |
Min length | 3 |
Characters and Unicode
Total characters | 8961 |
Distinct characters | 4 |
Distinct categories | 1 |
Distinct scripts | 1 |
Distinct blocks | 1 |
Unique | 0 |
Unique (%) | 0.0% |
1st row | OFF |
2nd row | OFF |
3rd row | OFF |
4th row | OFF |
5th row | OFF |
Common Values
Value | Count | Frequency (%) |
OFF | 2630 | 88.0% |
NOT | 357 | 12.0% |
Histogram of lengths of the category
Category Frequency Plot
Value | Count | Frequency (%) |
off | 2630 | |
not | 357 | 12.0% |
Most occurring characters
Value | Count | Frequency (%) |
F | 5260 | |
O | 2987 | |
N | 357 | 4.0% |
T | 357 | 4.0% |
Most occurring categories
Value | Count | Frequency (%) |
Uppercase Letter | 8961 |
Most frequent character per category
Uppercase Letter
Value | Count | Frequency (%) |
F | 5260 | |
O | 2987 | |
N | 357 | 4.0% |
T | 357 | 4.0% |
Most occurring scripts
Value | Count | Frequency (%) |
Latin | 8961 |
Most frequent character per script
Value | Count | Frequency (%) |
F | 5260 | |
O | 2987 | |
N | 357 | 4.0% |
T | 357 | 4.0% |
Most occurring blocks
Value | Count | Frequency (%) |
ASCII | 8961 |
Most frequent character per block
Value | Count | Frequency (%) |
F | 5260 | |
O | 2987 | |
N | 357 | 4.0% |
T | 357 | 4.0% |
is_targeted
Distinct | 2 |
Distinct (%) | 0.1% |
Missing | 0 |
Missing (%) | 0.0% |
Memory size | 175.1 KiB |
Max length | 3 |
Median length | 3 |
Mean length | 3 |
Min length | 3 |
Characters and Unicode
Total characters | 8961 |
Distinct characters | 4 |
Distinct categories | 1 |
Distinct scripts | 1 |
Distinct blocks | 1 |
Unique | 0 |
Unique (%) | 0.0% |
1st row | TIN |
2nd row | TIN |
3rd row | TIN |
4th row | TIN |
5th row | TIN |
Common Values
Value | Count | Frequency (%) |
TIN | 1942 | 65.0% |
UNT | 1045 | 35.0% |
Histogram of lengths of the category
Category Frequency Plot
Value | Count | Frequency (%) |
tin | 1942 | |
unt | 1045 |
Most occurring characters
Value | Count | Frequency (%) |
T | 2987 | |
N | 2987 | |
I | 1942 | |
U | 1045 | 11.7% |
Most occurring categories
Value | Count | Frequency (%) |
Uppercase Letter | 8961 |
Most frequent character per category
Uppercase Letter
Value | Count | Frequency (%) |
T | 2987 | |
N | 2987 | |
I | 1942 | |
U | 1045 | 11.7% |
Most occurring scripts
Value | Count | Frequency (%) |
Latin | 8961 |
Most frequent character per script
Value | Count | Frequency (%) |
T | 2987 | |
N | 2987 | |
I | 1942 | |
U | 1045 | 11.7% |
Most occurring blocks
Value | Count | Frequency (%) |
ASCII | 8961 |
Most frequent character per block
Value | Count | Frequency (%) |
T | 2987 | |
N | 2987 | |
I | 1942 | |
U | 1045 | 11.7% |
targeted_type
Distinct | 3 |
Distinct (%) | 0.2% |
Missing | 1136 |
Missing (%) | 38.0% |
Memory size | 144.1 KiB |
Max length | 3 |
Median length | 3 |
Mean length | 3 |
Min length | 3 |
Characters and Unicode
Total characters | 5553 |
Distinct characters | 9 |
Distinct categories | 1 |
Distinct scripts | 1 |
Distinct blocks | 1 |
Unique | 0 |
Unique (%) | 0.0% |
1st row | IND |
2nd row | IND |
3rd row | GRP |
4th row | IND |
5th row | OTH |
Common Values
Value | Count | Frequency (%) |
IND | 1175 | 39.3% |
GRP | 375 | 12.6% |
OTH | 301 | 10.1% |
(Missing) | 1136 | 38.0% |
Histogram of lengths of the category
Category Frequency Plot
Value | Count | Frequency (%) |
ind | 1175 | |
grp | 375 | 20.3% |
oth | 301 | 16.3% |
Most occurring characters
Value | Count | Frequency (%) |
I | 1175 | |
N | 1175 | |
D | 1175 | |
G | 375 | 6.8% |
R | 375 | 6.8% |
P | 375 | 6.8% |
O | 301 | 5.4% |
T | 301 | 5.4% |
H | 301 | 5.4% |
Most occurring categories
Value | Count | Frequency (%) |
Uppercase Letter | 5553 |
Most frequent character per category
Uppercase Letter
Value | Count | Frequency (%) |
I | 1175 | |
N | 1175 | |
D | 1175 | |
G | 375 | 6.8% |
R | 375 | 6.8% |
P | 375 | 6.8% |
O | 301 | 5.4% |
T | 301 | 5.4% |
H | 301 | 5.4% |
Most occurring scripts
Value | Count | Frequency (%) |
Latin | 5553 |
Most frequent character per script
Value | Count | Frequency (%) |
I | 1175 | |
N | 1175 | |
D | 1175 | |
G | 375 | 6.8% |
R | 375 | 6.8% |
P | 375 | 6.8% |
O | 301 | 5.4% |
T | 301 | 5.4% |
H | 301 | 5.4% |
Most occurring blocks
Value | Count | Frequency (%) |
ASCII | 5553 |
Most frequent character per block
Value | Count | Frequency (%) |
I | 1175 | |
N | 1175 | |
D | 1175 | |
G | 375 | 6.8% |
R | 375 | 6.8% |
P | 375 | 6.8% |
O | 301 | 5.4% |
T | 301 | 5.4% |
H | 301 | 5.4% |
health
Distinct | 2 |
Distinct (%) | 0.1% |
Missing | 0 |
Missing (%) | 0.0% |
Memory size | 3.0 KiB |
Value | Count | Frequency (%) |
False | 2900 | 97.1% |
True | 87 | 2.9% |
ideology
Distinct | 2 |
Distinct (%) | 0.1% |
Missing | 0 |
Missing (%) | 0.0% |
Memory size | 3.0 KiB |
Value | Count | Frequency (%) |
False | 2520 | 84.4% |
True | 467 | 15.6% |
insult
Distinct | 2 |
Distinct (%) | 0.1% |
Missing | 0 |
Missing (%) | 0.0% |
Memory size | 3.0 KiB |
Value | Count | Frequency (%) |
True | 2405 | 80.5% |
False | 582 | 19.5% |
lgbtqphobia
Distinct | 2 |
Distinct (%) | 0.1% |
Missing | 0 |
Missing (%) | 0.0% |
Memory size | 3.0 KiB |
Value | Count | Frequency (%) |
False | 2828 | 94.7% |
True | 159 | 5.3% |
other_lifestyle
Distinct | 2 |
Distinct (%) | 0.1% |
Missing | 0 |
Missing (%) | 0.0% |
Memory size | 3.0 KiB |
Value | Count | Frequency (%) |
False | 2911 | 97.5% |
True | 76 | 2.5% |
physical_aspects
Distinct | 2 |
Distinct (%) | 0.1% |
Missing | 0 |
Missing (%) | 0.0% |
Memory size | 3.0 KiB |
Value | Count | Frequency (%) |
False | 2841 | 95.1% |
True | 146 | 4.9% |
profanity_obscene
Distinct | 2 |
Distinct (%) | 0.1% |
Missing | 0 |
Missing (%) | 0.0% |
Memory size | 3.0 KiB |
Value | Count | Frequency (%) |
False | 1837 | 61.5% |
True | 1150 | 38.5% |
racism
Distinct | 2 |
Distinct (%) | 0.1% |
Missing | 0 |
Missing (%) | 0.0% |
Memory size | 3.0 KiB |
Value | Count | Frequency (%) |
False | 2961 | 99.1% |
True | 26 | 0.9% |
religious_intolerance
Distinct | 1 |
Distinct (%) | < 0.1% |
Missing | 0 |
Missing (%) | 0.0% |
Memory size | 3.0 KiB |
Value | Count | Frequency (%) |
False | 2987 | 100.0% |
sexism
Distinct | 2 |
Distinct (%) | 0.1% |
Missing | 0 |
Missing (%) | 0.0% |
Memory size | 3.0 KiB |
Value | Count | Frequency (%) |
False | 2860 | 95.7% |
True | 127 | 4.3% |
Spearman's ρ
Spearman's rank correlation coefficient (ρ) measures monotonic correlation between two variables, and is therefore better than Pearson's r at catching nonlinear monotonic relationships. Its value lies between -1 and +1: -1 indicates total negative monotonic correlation, 0 indicates no monotonic correlation, and +1 indicates total positive monotonic correlation. To calculate ρ for two variables X and Y, one divides the covariance of the rank variables of X and Y by the product of their standard deviations.
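The calculation described above can be sketched in plain Python: rank both variables, then apply the covariance-over-standard-deviations formula to the ranks. This is an illustration only, not the profiler's implementation; the function names and the toy data are assumptions, and ties receive average ranks.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Covariance divided by the product of standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's rho: Pearson's r computed on the rank variables."""
    return pearson(average_ranks(x), average_ranks(y))


x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]  # nonlinear but strictly increasing

rho = spearman(x, y)
r = pearson(x, y)
print(rho, r)
```

On this toy data the relationship is perfectly monotonic but not linear, so ρ reaches its maximum while r stays below 1.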
Pearson's r
Pearson's correlation coefficient (r) measures linear correlation between two variables. Its value lies between -1 and +1: -1 indicates total negative linear correlation, 0 indicates no linear correlation, and +1 indicates total positive linear correlation. Furthermore, r is invariant under separate changes in location and scale of the two variables, implying that for a linear function the angle to the x-axis does not affect the magnitude of r. To calculate r for two variables X and Y, one divides the covariance of X and Y by the product of their standard deviations.
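A minimal sketch of this formula, with a check of the location and scale invariance mentioned above. The function name pearson_r and the toy data are illustrative assumptions.

```python
from math import sqrt


def pearson_r(x, y):
    """Covariance of x and y divided by the product of their standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # A constant variable has zero variance, so r is undefined.
        raise ValueError("r is undefined when a variable is constant")
    # The 1/n factors of the covariance and the standard deviations cancel out.
    return cov / sqrt(var_x * var_y)


x = [1.0, 2.0, 4.0, 7.0]
y = [2.0, 3.0, 5.0, 11.0]

r = pearson_r(x, y)
# Shift and rescale each variable separately (positive scale factors).
r_rescaled = pearson_r([3 * v + 10 for v in x], [0.5 * v - 2 for v in y])
print(r, r_rescaled)
```

The zero-variance guard matters for this dataset: religious_intolerance is constant, so r is undefined for any pair that includes it.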
Kendall's τ
Like Spearman's rank correlation coefficient, the Kendall rank correlation coefficient (τ) measures ordinal association between two variables. Its value lies between -1 and +1: -1 indicates total negative correlation, 0 indicates no correlation, and +1 indicates total positive correlation. To calculate τ for two variables X and Y, one determines the number of concordant and discordant pairs of observations; τ is the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs.
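The pair-counting definition above translates directly into code. The sketch below implements that plain form (often called tau-a); tie-corrected variants exist and a library may use one of those instead. The function name and the toy data are assumptions.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """(concordant pairs - discordant pairs) / total number of pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:
            concordant += 1  # both variables move in the same direction
        elif direction < 0:
            discordant += 1  # the variables move in opposite directions
        # Pairs tied in x or y count toward neither total.
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


x = [1, 2, 3, 4]
y = [1, 3, 2, 4]  # one swapped pair out of six

tau = kendall_tau_a(x, y)
print(tau)
```

With four observations there are six pairs; five are concordant and one is discordant, so τ is (5 - 1) / 6.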
Cramér's V (φc)
Cramér's V is an association measure for nominal random variables. The coefficient ranges from 0 to 1, with 0 indicating independence and 1 indicating perfect association. The empirical estimators used for Cramér's V have been shown to be biased, even for large samples. We use the bias-corrected measure proposed by Bergsma in 2013.
Phik (φk)
Phik (φk) is a new and practical correlation coefficient that works consistently between categorical, ordinal and interval variables, captures non-linear dependency, and reverts to the Pearson correlation coefficient for a bivariate normal input distribution.
A simple visualization of nullity by column.
The nullity matrix is a data-dense display that lets you quickly pick out patterns in data completion.
The correlation heatmap measures nullity correlation: how strongly the presence or absence of one variable affects the presence of another.
The dendrogram allows you to more fully correlate variable completion, revealing trends deeper than the pairwise ones visible in the correlation heatmap.
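To make the nullity correlation behind the heatmap concrete, the sketch below assumes it is the Pearson correlation between per-column present/missing indicators. The report does not spell out the formula, so this is an assumption, and the six rows are illustrative values shaped like the targeted_type and toxic_spans columns, not a sample drawn from the dataset.

```python
from math import sqrt


def nullity_correlation(col_a, col_b):
    """Pearson correlation between the present/missing indicators of two columns."""
    a = [0.0 if v is None else 1.0 for v in col_a]
    b = [0.0 if v is None else 1.0 for v in col_b]
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    if var_a == 0 or var_b == 0:
        # A column that is always present (or always missing) has no nullity variance.
        raise ValueError("nullity correlation is undefined for this pair")
    return cov / sqrt(var_a * var_b)


# Hypothetical rows: None marks a missing value.
targeted_type = ["IND", "GRP", None, None, None, "OTH"]
toxic_spans = [[1, 2], [4, 5], None, None, [7, 8], [0, 1]]

corr = nullity_correlation(targeted_type, toxic_spans)
print(round(corr, 3))
```

A value near +1 means the two columns tend to be missing on the same rows; a value near -1 means one tends to be present exactly when the other is missing.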
First rows
id | text | is_offensive | is_targeted | targeted_type | toxic_spans | health | ideology | insult | lgbtqphobia | other_lifestyle | physical_aspects | profanity_obscene | racism | religious_intolerance | sexism | xenophobia | |
0 | 460cca26c7144f9ea5edabe45d33ad7d | "vc é minha vida bela" vc não tem vida seu peido do capeta USER kkkkkkkkkk rachei | OFF | TIN | IND | [42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58] | False | False | True | False | False | False | True | False | False | False | False |
1 | 542c3afa48114e4890b27b32f4b88c05 | USER Caralho mano eu ia mandar um vai tomar no cu mas ai lembrei do meme, por um momento meu sangue subiu e olha que nem sigo vc foi tt da Iza | OFF | TIN | IND | [4, 5, 6, 7, 8, 9, 10, 11, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49] | False | False | True | False | False | False | True | False | False | False | False |
2 | 2d65e628400149438210298a6972b3a0 | Os ignorantes nos comentários querendo dar desculpinhas pra comentar coisas preconceituosas. Que bom que as crianças assim tão novas já tem uma visão mais evoluída e empatica que vocês, bando de preconceituosos nojentos, que ao invés de estarem do lado do amor e de um mundo melhor estão do lado do preconceito e da dor, vocês só pensam assim porque não tem o mínimo de cultura e conhecimento, não sabem nem um terço da dor de um LGBTQ+, agora ao invés da gente ter que desenhar para as crianças entenderem temos que desenhar para os adultos entenderem, ENSINO USER IUSEROLOGIA USER GÊNERO NECESSÁRIO USER USER, POR UM MUNDO COM MAIS USER E MENOS USER! 💞 | OFF | TIN | GRP | [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 211, 212, 213, 214, 215, 216, 217, 218] | False | True | True | False | False | False | False | False | False | False | False |
3 | e1dfbeaae05145b2afc5544149d22f61 | rafael é muito fdp, praga de garoto... | OFF | TIN | IND | [15, 16, 17, 20, 21, 22, 23, 24, 25] | False | False | True | False | False | False | True | False | False | False | False |
4 | 614aeda1a791406a96671958a7a49aae | PUTARIA 🔞 URL | OFF | TIN | OTH | [0, 1, 2, 3, 4, 5, 6, 7] | False | False | False | False | False | False | True | False | False | False | False |
5 | 0a53c08cb0c74b22b476638086c650d6 | Um ignorante vs um arrogante. | OFF | TIN | IND | [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28] | False | False | True | False | False | False | False | False | False | False | False |
6 | 6c3b5f35b2b54cafaac825ac4edeb705 | Pelos comentários aqui, da pra ver o porque o USER está entre os países que mais matam homossexuais no mundo. | NOT | UNT | None | None | False | False | False | False | False | False | False | False | False | False | False |
7 | 4c54dd18cc1e493bb970e5cfd5f8adb7 | É INTERESSANTE USER ESSA IMPRESSA SE USER PARA ENTREGAR USER ENTREVISTA LIXO PARA A SOCIEDADE. | OFF | TIN | OTH | [73, 74, 75, 76, 77, 81, 82, 83, 84] | False | False | True | False | False | False | False | False | False | False | False |
8 | 04882791c0bd42c888983a5e554be036 | Vão tomar vergonha na cara Burguer USER ,deixem nossas crianças em paz!!!!! | OFF | TIN | OTH | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13] | False | False | True | False | False | False | False | False | False | False | False |
9 | b85d10840170438ebe3be8bac9362dec | USER Achei q sua mãe ia comer seu cu | OFF | TIN | IND | [24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35] | False | False | True | False | False | False | True | False | False | False | False |
Last rows
id | text | is_offensive | is_targeted | targeted_type | toxic_spans | health | ideology | insult | lgbtqphobia | other_lifestyle | physical_aspects | profanity_obscene | racism | religious_intolerance | sexism | xenophobia | |
2977 | 0ffe522af137469aa7e9f38f51512e23 | Tem q tira esse cao dessa casa | NOT | UNT | None | None | False | False | False | False | False | False | False | False | False | False | False |
2978 | b8924d8d5bec4e6594ec1d00e8871d2b | USER VOTO USER USER USER | NOT | UNT | None | None | False | False | False | False | False | False | False | False | False | False | False |
2979 | 610ca49ff8004ca198d3c9d4ab5e9e9c | USER USER USER USER obrigado por responderem de forma coerente, do contrário desses moleques que se acham donos da verdade aqui nos comentários. Ainda sim, acho que o fim da carreira do USER foi uma tempestade no copo d'água porque ele falou merda e se retratou por isso. Que não vem muito ao caso, mas ele estava bêbado. Tais medidas seriam justas caso estivesse abertamente apoiando tal atrocidade, o que não foi o caso. O USER foi um pioneiro em podcast no Brasil, e empregou muita gente, particularmente acredito que todo esse cancelamento foi exagero. | OFF | TIN | None | [84, 85, 86, 87, 88, 89, 90, 91, 242, 243, 244, 245, 246, 247] | False | False | True | False | False | False | False | False | False | False | False |
2980 | e95d599466d14b35a9e5852e2bf1a9cb | 'trabalhando graças ao feminismo' RH decide: - vou escolher a mulher pq mesmo ela nao concluindo a faculdade, ela cagou na foto do bolsonaro | OFF | TIN | GRP | [113, 114, 115, 116, 117, 118, 119] | False | True | True | False | False | False | False | False | False | True | False |
2981 | e3d6694a5bbb4cb0905dbc4af924a6bc | Engraçado ver nego bater no peito e falar que vai fazer isso e aquilo, que não precisa de patrocínio pra nada, e agora enfia o rabo no meio das pernas por conta da opinião da pessoa...... não pode falar do partido alemão que assassinou milhões de pessoas, mas vamos falar livremente da Igreja e esquece tudo o que já foi feito em nome da Igreja... | OFF | TIN | OTH | [119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131] | False | True | True | False | False | False | True | False | False | False | False |
2982 | 97864419e3e54fb5b53557fe5c44389f | mulher insuportável... | OFF | TIN | IND | [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18] | False | False | True | False | False | False | False | False | False | False | False |
2983 | 68ab3c88a1134fe4ad9d7ee85abb8df7 | por que caralhos você precisa ir pro chile ou para a argentina se depois vai para frança? você é burro?faça letras (pt/fr) na usp, viva com todos os auxílios que a universidade dá e aproveite as bolsas de intercâmbio que abrem no início de cada semestre (se você tiver uma boa média ponderada consegue inclusive ser pago para fazer o intercâmbio).daí pra frente toca seu plano. | OFF | TIN | IND | [8, 9, 10, 11, 12, 13, 14, 15, 16, 96, 97, 98, 99, 100, 101] | False | False | True | False | False | False | True | False | False | False | False |
2984 | c1b41e051fd040f287a1d56192f8759c | Pobre diabo.kkkkkk URL | NOT | UNT | None | None | False | False | False | False | False | False | False | False | False | False | False |
2985 | 4b86de06a731413ebcbb9389f7ff5e8f | ele ficou mais puto por saber q tava geral cantando junto, aqui é nós n é a gente | OFF | TIN | IND | [14, 15, 16, 17, 18, 19] | False | False | True | False | False | False | True | False | False | False | False |
2986 | b90d33a84ee447f794e39981f21ecd41 | USER Que absurdo. Isso não pode passar. Segue adiante Leonel. Que Deus proteja sua vida. Que Deus proteja a vida do Lula e de todos os ameaçados. Que essa gente HIPÓCRITA e DESUMANA sejam condenados. Deus provê. Deus proverá. A sua misericórdia não faltará. 🙏🏻✝️🛐♥️😢✊🏻🚩 | OFF | UNT | None | [161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 173, 174, 175, 176, 177, 178, 179, 180] | False | True | True | False | False | False | False | False | False | False | False |