Dataset statistics
Number of variables | 17 |
---|---|
Number of observations | 2996 |
Missing cells | 1400 |
Missing cells (%) | 2.7% |
Duplicate rows | 0 |
Duplicate rows (%) | 0.0% |
Total size in memory | 2.7 MiB |
Average record size in memory | 939.2 B |
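The missing-cell percentage above can be checked by hand from figures that appear elsewhere in this report. The sketch below assumes the percentage is simply missing cells divided by total cells (variables × observations), and that the only columns with missing values are the two named in the alerts (`targeted_type` and `toxic_spans`); the variable names are illustrative.

```python
# Figures copied from this report; the formula itself is an assumption.
n_variables = 17
n_observations = 2996

# Per-column missing counts as listed in the "Missing" alerts.
missing_by_column = {"targeted_type": 1283, "toxic_spans": 117}

total_cells = n_variables * n_observations
missing_cells = sum(missing_by_column.values())
missing_pct = 100 * missing_cells / total_cells

print(total_cells)             # 50932
print(missing_cells)           # 1400
print(f"{missing_pct:.1f}%")   # 2.7%
```

Under these assumptions the two per-column counts add up to the reported 1400 missing cells, and the ratio rounds to the reported 2.7%.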
Variable types
Categorical | 5 |
---|---|
Unsupported | 1 |
Boolean | 11 |
religious_intolerance has constant value "False" | Constant |
id has a high cardinality: 2996 distinct values | High cardinality |
text has a high cardinality: 2996 distinct values | High cardinality |
is_offensive is highly correlated with targeted_type and 2 other fields | High correlation |
health is highly correlated with religious_intolerance | High correlation |
other_lifestyle is highly correlated with religious_intolerance | High correlation |
racism is highly correlated with religious_intolerance | High correlation |
sexism is highly correlated with religious_intolerance | High correlation |
is_targeted is highly correlated with targeted_type and 1 other fields | High correlation |
targeted_type is highly correlated with is_offensive and 2 other fields | High correlation |
profanity_obscene is highly correlated with religious_intolerance | High correlation |
insult is highly correlated with is_offensive and 1 other fields | High correlation |
religious_intolerance is highly correlated with is_offensive and 12 other fields | High correlation |
xenophobia is highly correlated with religious_intolerance | High correlation |
physical_aspects is highly correlated with religious_intolerance | High correlation |
lgbtqphobia is highly correlated with religious_intolerance | High correlation |
ideology is highly correlated with religious_intolerance | High correlation |
is_offensive is highly correlated with insult | High correlation |
insult is highly correlated with is_offensive | High correlation |
targeted_type has 1283 (42.8%) missing values | Missing |
toxic_spans has 117 (3.9%) missing values | Missing |
id is uniformly distributed | Uniform |
text is uniformly distributed | Uniform |
id has unique values | Unique |
text has unique values | Unique |
toxic_spans is an unsupported type, check if it needs cleaning or further analysis | Unsupported |
Reproduction
Analysis started | 2022-09-01 22:12:40.318763 |
---|---|
Analysis finished | 2022-09-01 22:12:53.568420 |
Duration | 13.25 seconds |
Software version | pandas-profiling v3.2.0 |
Distinct | 2996 |
---|---|
Distinct (%) | 100.0% |
Missing | 0 |
Missing (%) | 0.0% |
Memory size | 260.5 KiB |
da19df36730945f08df3d09efa354876 | 1 |
---|---|
49bf28b765484b9f963d6885eb48df31 | 1 |
ee5a5d40987f424ab71dd60abf56a23c | 1 |
3ca64c7132d042feaa0eadea0e76ff22 | 1 |
e1bffd1c19be401393ab91e196839854 | 1 |
Other values (2991) |
Length
Max length | 32 |
---|---|
Median length | 32 |
Mean length | 32 |
Min length | 32 |
Characters and Unicode
Total characters | 95872 |
---|---|
Distinct characters | 16 |
Distinct categories | 2 |
Distinct scripts | 2 |
Distinct blocks | 1 |
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.
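As a rough illustration of how such character properties can be read programmatically, the sketch below counts Unicode general categories with Python's standard `unicodedata` module. It is not the profiler's own implementation, and the helper name `category_counts` is hypothetical. Scripts and blocks are not exposed by `unicodedata`, so only categories are shown.

```python
import unicodedata
from collections import Counter


def category_counts(text):
    """Count Unicode general categories (e.g. 'Ll', 'Nd') in a string."""
    return Counter(unicodedata.category(ch) for ch in text)


# First id value from the sample rows of this report.
sample_id = "da19df36730945f08df3d09efa354876"
counts = category_counts(sample_id)

# 'Nd' is Decimal Number and 'Ll' is Lowercase Letter.
print(sorted(counts))         # ['Ll', 'Nd']
print(sum(counts.values()))   # 32
```

A hexadecimal identifier of this kind only contains digits and lowercase letters, which is consistent with the two distinct categories reported for this variable.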
Unique
Unique | 2996 |
---|---|
Unique (%) | 100.0% |
Sample
1st row | da19df36730945f08df3d09efa354876 |
---|---|
2nd row | 80f1a8c981864887b13963fed1261acc |
3rd row | 80eee9db811c4ea4b2ddb7863d12c5fe |
4th row | 2f67025f913e4a6292e3d000d9e2b5a8 |
5th row | cd92f539559e421ba61cf23ecd005511 |
Common Values
Value | Count | Frequency (%) |
da19df36730945f08df3d09efa354876 | 1 | < 0.1% |
49bf28b765484b9f963d6885eb48df31 | 1 | < 0.1% |
ee5a5d40987f424ab71dd60abf56a23c | 1 | < 0.1% |
3ca64c7132d042feaa0eadea0e76ff22 | 1 | < 0.1% |
e1bffd1c19be401393ab91e196839854 | 1 | < 0.1% |
4dab80b5088d4255a0f2372704246e4e | 1 | < 0.1% |
b564daa958ec4b7390c4674c298e72a4 | 1 | < 0.1% |
10978af756c94a479fbaf54a144c7052 | 1 | < 0.1% |
1dbe54562a824d42bd95a728c5d6afef | 1 | < 0.1% |
1a1b7b30ec08435889c578791b8188c3 | 1 | < 0.1% |
Other values (2986) | 2986 |
Length
Histogram of lengths of the category
Value | Count | Frequency (%) |
da19df36730945f08df3d09efa354876 | 1 | < 0.1% |
cc66b54eeec24607a67e2259134a1cdd | 1 | < 0.1% |
a223536974394b15b5e3bb658ebc596b | 1 | < 0.1% |
396d5049c82c430a9f4fa5694c00cbd2 | 1 | < 0.1% |
80eee9db811c4ea4b2ddb7863d12c5fe | 1 | < 0.1% |
2f67025f913e4a6292e3d000d9e2b5a8 | 1 | < 0.1% |
cd92f539559e421ba61cf23ecd005511 | 1 | < 0.1% |
430b13705cf34e13b74bc999425187c3 | 1 | < 0.1% |
c779826dc43f460cb18e8429ca443477 | 1 | < 0.1% |
e64148caa4474fc79298e01d0dda8f5e | 1 | < 0.1% |
Other values (2986) | 2986 |
Most occurring characters
Value | Count | Frequency (%) |
4 | 8805 | 9.2% |
9 | 6413 | 6.7% |
8 | 6392 | 6.7% |
b | 6381 | 6.7% |
a | 6270 | 6.5% |
2 | 5760 | 6.0% |
7 | 5663 | 5.9% |
0 | 5635 | 5.9% |
e | 5599 | 5.8% |
1 | 5590 | 5.8% |
Other values (6) | 33364 |
Most occurring categories
Value | Count | Frequency (%) |
Decimal Number | 60956 | |
Lowercase Letter | 34916 |
Most frequent character per category
Decimal Number
Value | Count | Frequency (%) |
4 | 8805 | |
9 | 6413 | |
8 | 6392 | |
2 | 5760 | |
7 | 5663 | |
0 | 5635 | |
1 | 5590 | |
3 | 5590 | |
6 | 5587 | |
5 | 5521 |
Lowercase Letter
Value | Count | Frequency (%) |
b | 6381 | |
a | 6270 | |
e | 5599 | |
f | 5583 | |
d | 5573 | |
c | 5510 |
Most occurring scripts
Value | Count | Frequency (%) |
Common | 60956 | |
Latin | 34916 |
Most frequent character per script
Common
Value | Count | Frequency (%) |
4 | 8805 | |
9 | 6413 | |
8 | 6392 | |
2 | 5760 | |
7 | 5663 | |
0 | 5635 | |
1 | 5590 | |
3 | 5590 | |
6 | 5587 | |
5 | 5521 |
Latin
Value | Count | Frequency (%) |
b | 6381 | |
a | 6270 | |
e | 5599 | |
f | 5583 | |
d | 5573 | |
c | 5510 |
Most occurring blocks
Value | Count | Frequency (%) |
ASCII | 95872 |
Most frequent character per block
ASCII
Value | Count | Frequency (%) |
4 | 8805 | 9.2% |
9 | 6413 | 6.7% |
8 | 6392 | 6.7% |
b | 6381 | 6.7% |
a | 6270 | 6.5% |
2 | 5760 | 6.0% |
7 | 5663 | 5.9% |
0 | 5635 | 5.9% |
e | 5599 | 5.8% |
1 | 5590 | 5.8% |
Other values (6) | 33364 |
Distinct | 2996 |
---|---|
Distinct (%) | 100.0% |
Missing | 0 |
Missing (%) | 0.0% |
Memory size | 1.0 MiB |
USER Adorei o comercial também Jesus. Só achei que faltou um beijinho gay estilo Jesus e USER sabe?! Rsrsrs | 1 |
---|---|
6 DE PDL É PUTARIA!!! Chega por hoje URL | 1 |
Perto dela ele fica calado | 1 |
O que esperar de uma empresa que vende lixo para as pessoas comerem ? | 1 |
USER USER é uma hamburgueria não uma escola, pai que precisa de uma hamburgueria pra educar filho é um pai vagabundo ou não é pai. | 1 |
Other values (2991) |
Length
Max length | 954 |
---|---|
Median length | 476.5 |
Mean length | 125.6048064 |
Min length | 4 |
Characters and Unicode
Total characters | 376312 |
---|---|
Distinct characters | 318 |
Distinct categories | 21 |
Distinct scripts | 6 |
Distinct blocks | 18 |
Unique
Unique | 2996 |
---|---|
Unique (%) | 100.0% |
Sample
1st row | USER Adorei o comercial também Jesus. Só achei que faltou um beijinho gay estilo Jesus e USER sabe?! Rsrsrs |
---|---|
2nd row | Cara isso foi muito babaca geral USER conhece o Monark e as merda que ele fala. Isso é muito merda eu USER E to decepcionada pra caralho mas não cabe a mim dizer |
3rd row | Quem liga pra judeu kkkk |
4th row | Se vc for porco, folgado e relaxado, você não ia conseguir viver com ela mesmo. Realmente, gente escrota não ia conseguir conviver com a Jojo |
5th row | Rapaziada chata, né?! O cara trabalha c funk, vive no meio de mulher, solteiro ou não ele ia gravar o clip.... mas aí porque ficou solteiro “tá querendo se mostrar” pqp, gente chata p caralho |
Common Values
Value | Count | Frequency (%) |
USER Adorei o comercial também Jesus. Só achei que faltou um beijinho gay estilo Jesus e USER sabe?! Rsrsrs | 1 | < 0.1% |
6 DE PDL É PUTARIA!!! Chega por hoje URL | 1 | < 0.1% |
Perto dela ele fica calado | 1 | < 0.1% |
O que esperar de uma empresa que vende lixo para as pessoas comerem ? | 1 | < 0.1% |
USER USER é uma hamburgueria não uma escola, pai que precisa de uma hamburgueria pra educar filho é um pai vagabundo ou não é pai. | 1 | < 0.1% |
Que merda de insônia mos, quero dormir inferno | 1 | < 0.1% |
cara, dedo quente nesse cu gelado foi ótimo vei, ri pa caralho!!!! | 1 | < 0.1% |
caralho mano pq eu fui ler essa porra, vou jogar ácido no meu olho da próxima vez que eu inventar de fazer essa palhaçada | 1 | < 0.1% |
USER KKKKKKKKKK VAGABUNDA | 1 | < 0.1% |
USER USER USER percebeu a merda q USER falou? Liberdade de expressão não é tirar a liberdade de outra pessoa só porque USER quer. Pensa antes de escrever. | 1 | < 0.1% |
Other values (2986) | 2986 |
Length
Histogram of lengths of the category
Value | Count | Frequency (%) |
user | 3945 | 5.7% |
que | 2076 | 3.0% |
de | 1853 | 2.7% |
e | 1675 | 2.4% |
o | 1668 | 2.4% |
a | 1429 | 2.1% |
é | 1329 | 1.9% |
não | 972 | 1.4% |
um | 714 | 1.0% |
do | 668 | 1.0% |
Other values (10786) | 52615 |
Most occurring characters
Value | Count | Frequency (%) |
65948 | ||
a | 33194 | 8.8% |
e | 31584 | 8.4% |
o | 27632 | 7.3% |
s | 19623 | 5.2% |
r | 17519 | 4.7% |
i | 15455 | 4.1% |
n | 12862 | 3.4% |
d | 12852 | 3.4% |
m | 12482 | 3.3% |
Other values (308) | 127161 |
Most occurring categories
Value | Count | Frequency (%) |
Lowercase Letter | 263150 | |
Space Separator | 65948 | 17.5% |
Uppercase Letter | 34911 | 9.3% |
Other Punctuation | 9976 | 2.7% |
Other Symbol | 961 | 0.3% |
Decimal Number | 718 | 0.2% |
Dash Punctuation | 113 | < 0.1% |
Close Punctuation | 96 | < 0.1% |
Open Punctuation | 88 | < 0.1% |
Math Symbol | 82 | < 0.1% |
Other values (11) | 269 | 0.1% |
Most frequent character per category
Other Symbol
Value | Count | Frequency (%) |
😂 | 141 | 14.7% |
🤣 | 110 | 11.4% |
👏 | 37 | 3.9% |
😭 | 35 | 3.6% |
🤮 | 27 | 2.8% |
🤦 | 24 | 2.5% |
🇧 | 20 | 2.1% |
🇷 | 20 | 2.1% |
😍 | 17 | 1.8% |
😡 | 17 | 1.8% |
Other values (144) | 513 |
Lowercase Letter
Value | Count | Frequency (%) |
a | 33194 | |
e | 31584 | |
o | 27632 | |
s | 19623 | 7.5% |
r | 17519 | 6.7% |
i | 15455 | 5.9% |
n | 12862 | 4.9% |
d | 12852 | 4.9% |
m | 12482 | 4.7% |
u | 11849 | 4.5% |
Other values (33) | 68098 |
Uppercase Letter
Value | Count | Frequency (%) |
E | 6049 | |
S | 5441 | |
R | 5406 | |
U | 5170 | |
A | 2061 | 5.9% |
O | 1477 | 4.2% |
T | 895 | 2.6% |
N | 829 | 2.4% |
M | 795 | 2.3% |
D | 773 | 2.2% |
Other values (28) | 6015 |
Other Letter
Value | Count | Frequency (%) |
茅 | 5 | |
茫 | 4 | 10.3% |
锚 | 4 | 10.3% |
谩 | 4 | 10.3% |
馃 | 3 | 7.7% |
º | 2 | 5.1% |
莽 | 2 | 5.1% |
鉁 | 2 | 5.1% |
檮 | 1 | 2.6% |
ツ | 1 | 2.6% |
Other values (11) | 11 |
Other Punctuation
Value | Count | Frequency (%) |
. | 3696 | |
, | 3054 | |
! | 1555 | |
? | 661 | 6.6% |
" | 406 | 4.1% |
: | 252 | 2.5% |
' | 97 | 1.0% |
… | 84 | 0.8% |
* | 68 | 0.7% |
/ | 44 | 0.4% |
Other values (8) | 59 | 0.6% |
Decimal Number
Value | Count | Frequency (%) |
0 | 153 | |
2 | 111 | |
1 | 110 | |
3 | 88 | |
4 | 65 | |
6 | 49 | 6.8% |
5 | 47 | 6.5% |
8 | 34 | 4.7% |
9 | 31 | 4.3% |
7 | 30 | 4.2% |
Math Symbol
Value | Count | Frequency (%) |
> | 47 | |
= | 13 | 15.9% |
+ | 13 | 15.9% |
< | 3 | 3.7% |
~ | 3 | 3.7% |
¬ | 2 | 2.4% |
| | 1 | 1.2% |
Modifier Symbol
Value | Count | Frequency (%) |
🏼 | 23 | |
🏻 | 14 | |
🏽 | 13 | |
🏾 | 9 | 12.2% |
🏿 | 8 | 10.8% |
^ | 5 | 6.8% |
´ | 2 | 2.7% |
Dash Punctuation
Value | Count | Frequency (%) |
- | 108 | |
— | 5 | 4.4% |
Close Punctuation
Value | Count | Frequency (%) |
) | 90 | |
] | 6 | 6.2% |
Open Punctuation
Value | Count | Frequency (%) |
( | 81 | |
[ | 7 | 8.0% |
Nonspacing Mark
Value | Count | Frequency (%) |
️ | 47 | |
͜ | 1 | 2.1% |
Format
Value | Count | Frequency (%) |
| 30 | |
| 1 | 3.2% |
Final Punctuation
Value | Count | Frequency (%) |
” | 27 | |
’ | 2 | 6.9% |
Connector Punctuation
Value | Count | Frequency (%) |
_ | 7 | |
﹏ | 1 | 12.5% |
Space Separator
Value | Count | Frequency (%) |
65948 |
Initial Punctuation
Value | Count | Frequency (%) |
“ | 31 |
Currency Symbol
Value | Count | Frequency (%) |
$ | 4 |
Control
Value | Count | Frequency (%) |
2 |
Enclosing Mark
Value | Count | Frequency (%) |
⃣ | 2 |
Modifier Letter
Value | Count | Frequency (%) |
ー | 1 |
Most occurring scripts
Value | Count | Frequency (%) |
Latin | 298064 | |
Common | 78132 | 20.8% |
Inherited | 80 | < 0.1% |
Han | 31 | < 0.1% |
Katakana | 4 | < 0.1% |
Hiragana | 1 | < 0.1% |
Most frequent character per script
Common
Value | Count | Frequency (%) |
65948 | ||
. | 3696 | 4.7% |
, | 3054 | 3.9% |
! | 1555 | 2.0% |
? | 661 | 0.8% |
" | 406 | 0.5% |
: | 252 | 0.3% |
0 | 153 | 0.2% |
😂 | 141 | 0.2% |
2 | 111 | 0.1% |
Other values (202) | 2155 | 2.8% |
Latin
Value | Count | Frequency (%) |
a | 33194 | 11.1% |
e | 31584 | 10.6% |
o | 27632 | 9.3% |
s | 19623 | 6.6% |
r | 17519 | 5.9% |
i | 15455 | 5.2% |
n | 12862 | 4.3% |
d | 12852 | 4.3% |
m | 12482 | 4.2% |
u | 11849 | 4.0% |
Other values (73) | 103012 |
Han
Value | Count | Frequency (%) |
茅 | 5 | |
茫 | 4 | |
锚 | 4 | |
谩 | 4 | |
馃 | 3 | |
莽 | 2 | 6.5% |
鉁 | 2 | 6.5% |
檮 | 1 | 3.2% |
芒 | 1 | 3.2% |
脭 | 1 | 3.2% |
Other values (4) | 4 |
Inherited
Value | Count | Frequency (%) |
️ | 47 | |
| 30 | |
⃣ | 2 | 2.5% |
͜ | 1 | 1.2% |
Katakana
Value | Count | Frequency (%) |
ツ | 1 | |
イ | 1 | |
メ | 1 | |
ジ | 1 |
Hiragana
Value | Count | Frequency (%) |
の | 1 |
Most occurring blocks
Value | Count | Frequency (%) |
ASCII | 366689 | |
None | 8876 | 2.4% |
Emoticons | 352 | 0.1% |
Punctuation | 179 | < 0.1% |
VS | 47 | < 0.1% |
Enclosed Alphanum Sup | 40 | < 0.1% |
Misc Symbols | 38 | < 0.1% |
Dingbats | 33 | < 0.1% |
CJK | 31 | < 0.1% |
Geometric Shapes Ext | 12 | < 0.1% |
Other values (8) | 15 | < 0.1% |
Most frequent character per block
ASCII
Value | Count | Frequency (%) |
65948 | ||
a | 33194 | 9.1% |
e | 31584 | 8.6% |
o | 27632 | 7.5% |
s | 19623 | 5.4% |
r | 17519 | 4.8% |
i | 15455 | 4.2% |
n | 12862 | 3.5% |
d | 12852 | 3.5% |
m | 12482 | 3.4% |
Other values (82) | 117538 |
None
Value | Count | Frequency (%) |
ã | 1978 | |
é | 1799 | |
á | 1017 | |
ç | 848 | |
ó | 684 | 7.7% |
í | 675 | 7.6% |
ê | 440 | 5.0% |
ú | 204 | 2.3% |
É | 155 | 1.7% |
🤣 | 110 | 1.2% |
Other values (117) | 966 |
Emoticons
Value | Count | Frequency (%) |
😂 | 141 | |
😭 | 35 | 9.9% |
😍 | 17 | 4.8% |
😡 | 17 | 4.8% |
😠 | 14 | 4.0% |
😒 | 13 | 3.7% |
😆 | 11 | 3.1% |
😤 | 9 | 2.6% |
😢 | 8 | 2.3% |
🙄 | 6 | 1.7% |
Other values (35) | 81 |
Punctuation
Value | Count | Frequency (%) |
… | 84 | |
“ | 31 | 17.3% |
| 30 | 16.8% |
” | 27 | 15.1% |
— | 5 | 2.8% |
’ | 2 | 1.1% |
VS
Value | Count | Frequency (%) |
️ | 47 |
Enclosed Alphanum Sup
Value | Count | Frequency (%) |
🇧 | 20 | |
🇷 | 20 |
Misc Symbols
Value | Count | Frequency (%) |
♂ | 14 | |
♀ | 12 | |
♡ | 4 | 10.5% |
☠ | 2 | 5.3% |
☺ | 2 | 5.3% |
♥ | 2 | 5.3% |
☄ | 1 | 2.6% |
⚖ | 1 | 2.6% |
Dingbats
Value | Count | Frequency (%) |
❤ | 14 | |
✅ | 8 | |
✌ | 5 | 15.2% |
✨ | 2 | 6.1% |
❗ | 1 | 3.0% |
❌ | 1 | 3.0% |
❞ | 1 | 3.0% |
❝ | 1 | 3.0% |
Geometric Shapes Ext
Value | Count | Frequency (%) |
🟩 | 10 | |
🟨 | 2 | 16.7% |
CJK
Value | Count | Frequency (%) |
茅 | 5 | |
茫 | 4 | |
锚 | 4 | |
谩 | 4 | |
馃 | 3 | |
莽 | 2 | 6.5% |
鉁 | 2 | 6.5% |
檮 | 1 | 3.2% |
芒 | 1 | 3.2% |
脭 | 1 | 3.2% |
Other values (4) | 4 |
Box Drawing
Value | Count | Frequency (%) |
╥ | 2 |
Specials
Value | Count | Frequency (%) |
� | 2 |
Geometric Shapes
Value | Count | Frequency (%) |
● | 1 | |
○ | 1 |
Katakana
Value | Count | Frequency (%) |
ツ | 1 | |
イ | 1 | |
メ | 1 | |
ジ | 1 | |
ー | 1 |
Hiragana
Value | Count | Frequency (%) |
の | 1 |
IPA Ext
Value | Count | Frequency (%) |
ʖ | 1 |
Diacriticals
Value | Count | Frequency (%) |
͜ | 1 |
CJK Compat Forms
Value | Count | Frequency (%) |
﹏ | 1 |
Distinct | 2 |
---|---|
Distinct (%) | 0.1% |
Missing | 0 |
Missing (%) | 0.0% |
Memory size | 175.7 KiB |
OFF | |
---|---|
NOT | 117 |
Length
Max length | 3 |
---|---|
Median length | 3 |
Mean length | 3 |
Min length | 3 |
Characters and Unicode
Total characters | 8988 |
---|---|
Distinct characters | 4 |
Distinct categories | 1 |
Distinct scripts | 1 |
Distinct blocks | 1 |
Unique
Unique | 0 |
---|---|
Unique (%) | 0.0% |
Sample
1st row | OFF |
---|---|
2nd row | OFF |
3rd row | OFF |
4th row | OFF |
5th row | OFF |
Common Values
Value | Count | Frequency (%) |
OFF | 2879 | |
NOT | 117 | 3.9% |
Length
Histogram of lengths of the category
Category Frequency Plot
Value | Count | Frequency (%) |
off | 2879 | |
not | 117 | 3.9% |
Most occurring characters
Value | Count | Frequency (%) |
F | 5758 | |
O | 2996 | |
N | 117 | 1.3% |
T | 117 | 1.3% |
Most occurring categories
Value | Count | Frequency (%) |
Uppercase Letter | 8988 |
Most frequent character per category
Uppercase Letter
Value | Count | Frequency (%) |
F | 5758 | |
O | 2996 | |
N | 117 | 1.3% |
T | 117 | 1.3% |
Most occurring scripts
Value | Count | Frequency (%) |
Latin | 8988 |
Most frequent character per script
Latin
Value | Count | Frequency (%) |
F | 5758 | |
O | 2996 | |
N | 117 | 1.3% |
T | 117 | 1.3% |
Most occurring blocks
Value | Count | Frequency (%) |
ASCII | 8988 |
Most frequent character per block
ASCII
Value | Count | Frequency (%) |
F | 5758 | |
O | 2996 | |
N | 117 | 1.3% |
T | 117 | 1.3% |
Distinct | 2 |
---|---|
Distinct (%) | 0.1% |
Missing | 0 |
Missing (%) | 0.0% |
Memory size | 175.7 KiB |
TIN | |
---|---|
UNT |
Length
Max length | 3 |
---|---|
Median length | 3 |
Mean length | 3 |
Min length | 3 |
Characters and Unicode
Total characters | 8988 |
---|---|
Distinct characters | 4 |
Distinct categories | 1 |
Distinct scripts | 1 |
Distinct blocks | 1 |
Unique
Unique | 0 |
---|---|
Unique (%) | 0.0% |
Sample
1st row | UNT |
---|---|
2nd row | TIN |
3rd row | UNT |
4th row | UNT |
5th row | TIN |
Common Values
Value | Count | Frequency (%) |
TIN | 1713 | |
UNT | 1283 |
Length
Histogram of lengths of the category
Category Frequency Plot
Value | Count | Frequency (%) |
tin | 1713 | |
unt | 1283 |
Most occurring characters
Value | Count | Frequency (%) |
T | 2996 | |
N | 2996 | |
I | 1713 | |
U | 1283 |
Most occurring categories
Value | Count | Frequency (%) |
Uppercase Letter | 8988 |
Most frequent character per category
Uppercase Letter
Value | Count | Frequency (%) |
T | 2996 | |
N | 2996 | |
I | 1713 | |
U | 1283 |
Most occurring scripts
Value | Count | Frequency (%) |
Latin | 8988 |
Most frequent character per script
Latin
Value | Count | Frequency (%) |
T | 2996 | |
N | 2996 | |
I | 1713 | |
U | 1283 |
Most occurring blocks
Value | Count | Frequency (%) |
ASCII | 8988 |
Most frequent character per block
ASCII
Value | Count | Frequency (%) |
T | 2996 | |
N | 2996 | |
I | 1713 | |
U | 1283 |
Distinct | 3 |
---|---|
Distinct (%) | 0.2% |
Missing | 1283 |
Missing (%) | 42.8% |
Memory size | 140.6 KiB |
IND | |
---|---|
GRP | |
OTH | 92 |
Length
Max length | 3 |
---|---|
Median length | 3 |
Mean length | 3 |
Min length | 3 |
Characters and Unicode
Total characters | 5139 |
---|---|
Distinct characters | 9 |
Distinct categories | 1 |
Distinct scripts | 1 |
Distinct blocks | 1 |
Unique
Unique | 0 |
---|---|
Unique (%) | 0.0% |
Sample
1st row | GRP |
---|---|
2nd row | GRP |
3rd row | GRP |
4th row | GRP |
5th row | OTH |
Common Values
Value | Count | Frequency (%) |
IND | 1047 | |
GRP | 574 | |
OTH | 92 | 3.1% |
(Missing) | 1283 |
Length
Histogram of lengths of the category
Category Frequency Plot
Value | Count | Frequency (%) |
ind | 1047 | |
grp | 574 | |
oth | 92 | 5.4% |
Most occurring characters
Value | Count | Frequency (%) |
I | 1047 | |
N | 1047 | |
D | 1047 | |
G | 574 | |
R | 574 | |
P | 574 | |
O | 92 | 1.8% |
T | 92 | 1.8% |
H | 92 | 1.8% |
Most occurring categories
Value | Count | Frequency (%) |
Uppercase Letter | 5139 |
Most frequent character per category
Uppercase Letter
Value | Count | Frequency (%) |
I | 1047 | |
N | 1047 | |
D | 1047 | |
G | 574 | |
R | 574 | |
P | 574 | |
O | 92 | 1.8% |
T | 92 | 1.8% |
H | 92 | 1.8% |
Most occurring scripts
Value | Count | Frequency (%) |
Latin | 5139 |
Most frequent character per script
Latin
Value | Count | Frequency (%) |
I | 1047 | |
N | 1047 | |
D | 1047 | |
G | 574 | |
R | 574 | |
P | 574 | |
O | 92 | 1.8% |
T | 92 | 1.8% |
H | 92 | 1.8% |
Most occurring blocks
Value | Count | Frequency (%) |
ASCII | 5139 |
Most frequent character per block
ASCII
Value | Count | Frequency (%) |
I | 1047 | |
N | 1047 | |
D | 1047 | |
G | 574 | |
R | 574 | |
P | 574 | |
O | 92 | 1.8% |
T | 92 | 1.8% |
H | 92 | 1.8% |
Distinct | 2 |
---|---|
Distinct (%) | 0.1% |
Missing | 0 |
Missing (%) | 0.0% |
Memory size | 3.1 KiB |
False | |
---|---|
True | 71 |
Value | Count | Frequency (%) |
False | 2925 | |
True | 71 | 2.4% |
Distinct | 2 |
---|---|
Distinct (%) | 0.1% |
Missing | 0 |
Missing (%) | 0.0% |
Memory size | 3.1 KiB |
False | |
---|---|
True |
Value | Count | Frequency (%) |
False | 2220 | |
True | 776 | 25.9% |
Distinct | 2 |
---|---|
Distinct (%) | 0.1% |
Missing | 0 |
Missing (%) | 0.0% |
Memory size | 3.1 KiB |
True | |
---|---|
False | 174 |
Value | Count | Frequency (%) |
True | 2822 | |
False | 174 | 5.8% |
Distinct | 2 |
---|---|
Distinct (%) | 0.1% |
Missing | 0 |
Missing (%) | 0.0% |
Memory size | 3.1 KiB |
False | |
---|---|
True | 213 |
Value | Count | Frequency (%) |
False | 2783 | |
True | 213 | 7.1% |
Distinct | 2 |
---|---|
Distinct (%) | 0.1% |
Missing | 0 |
Missing (%) | 0.0% |
Memory size | 3.1 KiB |
False | |
---|---|
True | 41 |
Value | Count | Frequency (%) |
False | 2955 | |
True | 41 | 1.4% |
Distinct | 2 |
---|---|
Distinct (%) | 0.1% |
Missing | 0 |
Missing (%) | 0.0% |
Memory size | 3.1 KiB |
False | |
---|---|
True | 178 |
Value | Count | Frequency (%) |
False | 2818 | |
True | 178 | 5.9% |
Distinct | 2 |
---|---|
Distinct (%) | 0.1% |
Missing | 0 |
Missing (%) | 0.0% |
Memory size | 3.1 KiB |
False | |
---|---|
True |
Value | Count | Frequency (%) |
False | 2064 | |
True | 932 |
Distinct | 2 |
---|---|
Distinct (%) | 0.1% |
Missing | 0 |
Missing (%) | 0.0% |
Memory size | 3.1 KiB |
False | |
---|---|
True | 77 |
Value | Count | Frequency (%) |
False | 2919 | |
True | 77 | 2.6% |
Distinct | 1 |
---|---|
Distinct (%) | < 0.1% |
Missing | 0 |
Missing (%) | 0.0% |
Memory size | 3.1 KiB |
False |
---|
Value | Count | Frequency (%) |
False | 2996 |
Distinct | 2 |
---|---|
Distinct (%) | 0.1% |
Missing | 0 |
Missing (%) | 0.0% |
Memory size | 3.1 KiB |
False | |
---|---|
True |
Value | Count | Frequency (%) |
False | 2616 | |
True | 380 | 12.7% |
Spearman's ρ
Spearman's rank correlation coefficient (ρ) is a measure of monotonic correlation between two variables, and is therefore better at catching nonlinear monotonic correlations than Pearson's r. Its value lies between -1 and +1, with -1 indicating total negative monotonic correlation, 0 indicating no monotonic correlation, and +1 indicating total positive monotonic correlation. To calculate ρ for two variables X and Y, one divides the covariance of the rank variables of X and Y by the product of their standard deviations.
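The rank-based calculation described here can be sketched in plain Python. This is a minimal illustration rather than the profiler's implementation: the function names are hypothetical, and ties are given average ranks, which is a common convention but an assumption here.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Covariance of the ranks divided by the product of their std devs."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# A nonlinear but strictly increasing relationship still gives rho close to 1.
print(spearman_rho([1, 2, 3, 4, 5], [1, 4, 9, 16, 25]))
```

Because only the ordering of the values enters the calculation, any strictly increasing transformation of either variable leaves ρ unchanged.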
Pearson's r
Pearson's correlation coefficient (r) is a measure of linear correlation between two variables. Its value lies between -1 and +1, with -1 indicating total negative linear correlation, 0 indicating no linear correlation, and +1 indicating total positive linear correlation. Furthermore, r is invariant under separate changes in location and scale of the two variables, implying that for a linear function the angle to the x-axis does not affect r. To calculate r for two variables X and Y, one divides the covariance of X and Y by the product of their standard deviations.
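A minimal sketch of this calculation, with a check of the location and scale invariance mentioned above, might look as follows. The function name `pearson_r` and the toy data are illustrative and not taken from the report.

```python
from math import sqrt


def pearson_r(x, y):
    """Covariance of x and y divided by the product of their std devs."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    # The 1/n factors in the covariance and std devs cancel, so they are omitted.
    return cov / (sx * sy)


x = [1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 2.0, 5.0]

r = pearson_r(x, y)
# Shifting and positively rescaling y should leave r unchanged.
r_shifted = pearson_r(x, [3.0 * v + 5.0 for v in y])
print(r, r_shifted)
```

Note that the invariance only holds for a positive scale factor; multiplying one variable by a negative constant flips the sign of r.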
Kendall's τ
Similarly to Spearman's rank correlation coefficient, the Kendall rank correlation coefficient (τ) measures ordinal association between two variables. Its value lies between -1 and +1, with -1 indicating total negative correlation, 0 indicating no correlation, and +1 indicating total positive correlation. To calculate τ for two variables X and Y, one determines the number of concordant and discordant pairs of observations. τ is given by the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs.
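The pair-counting definition given here corresponds to the variant usually called τ-a. The sketch below implements exactly that definition; it is an illustration with a hypothetical function name, and library implementations may instead use a tie-adjusted variant such as τ-b, so results can differ when the data contain ties.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """(concordant - discordant) / total number of pairs."""
    concordant = 0
    discordant = 0
    pairs = list(combinations(range(len(x)), 2))
    for i, j in pairs:
        direction = (x[i] - x[j]) * (y[i] - y[j])
        if direction > 0:
            concordant += 1   # both variables move the same way
        elif direction < 0:
            discordant += 1   # the variables move in opposite ways
        # Tied pairs (direction == 0) count toward neither total.
    return (concordant - discordant) / len(pairs)


# Six pairs in total: five concordant and one discordant, so tau = 4/6.
print(kendall_tau_a([1, 2, 3, 4], [1, 3, 2, 4]))
```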
Cramér's V (φc)
Cramér's V is an association measure for nominal random variables. The coefficient ranges from 0 to 1, with 0 indicating independence and 1 indicating perfect association. The empirical estimators used for Cramér's V have been shown to be biased, even for large samples. We use the bias-corrected measure proposed by Bergsma in 2013.
Phik (φk)
Phik (φk) is a new and practical correlation coefficient that works consistently between categorical, ordinal and interval variables, captures non-linear dependency, and reverts to the Pearson correlation coefficient in the case of a bivariate normal input distribution. Extensive documentation is available for it.
A simple visualization of nullity by column.
The nullity matrix is a data-dense display that lets you quickly pick out patterns in data completion by eye.
The correlation heatmap measures nullity correlation: how strongly the presence or absence of one variable affects the presence of another.
The dendrogram allows you to more fully correlate variable completion, revealing trends deeper than the pairwise ones visible in the correlation heatmap.
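One way to make the notion of nullity correlation concrete is to correlate the "is present" indicators of two columns. The sketch below does this with a Pearson correlation over 0/1 indicators; this is an assumption about how such a heatmap is typically computed, not a statement of this report's exact method, and the function name and toy columns are hypothetical.

```python
from math import sqrt


def nullity_correlation(col_a, col_b):
    """Pearson correlation between the presence indicators of two columns."""
    a = [0.0 if v is None else 1.0 for v in col_a]
    b = [0.0 if v is None else 1.0 for v in col_b]
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    var_a = sum((p - ma) ** 2 for p in a)
    var_b = sum((q - mb) ** 2 for q in b)
    if var_a == 0 or var_b == 0:
        # Undefined when a column is always present or always missing.
        return None
    return cov / sqrt(var_a * var_b)


# Toy columns, not taken from the dataset.
left = ["x", None, "y", None, "z", "w"]
same_pattern = [1, None, 2, None, 3, 4]
opposite_pattern = [None, 1, None, 2, None, None]

print(nullity_correlation(left, same_pattern))      # close to +1
print(nullity_correlation(left, opposite_pattern))  # close to -1
```

A value near +1 means the two columns tend to be missing together, a value near -1 means one tends to be present when the other is missing, and fully populated columns yield no defined value.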
First rows
index | id | text | is_offensive | is_targeted | targeted_type | toxic_spans | health | ideology | insult | lgbtqphobia | other_lifestyle | physical_aspects | profanity_obscene | racism | religious_intolerance | sexism | xenophobia |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | da19df36730945f08df3d09efa354876 | USER Adorei o comercial também Jesus. Só achei que faltou um beijinho gay estilo Jesus e USER sabe?! Rsrsrs | OFF | UNT | None | [52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86] | False | False | True | False | False | False | True | False | False | False | False |
1 | 80f1a8c981864887b13963fed1261acc | Cara isso foi muito babaca geral USER conhece o Monark e as merda que ele fala. Isso é muito merda eu USER E to decepcionada pra caralho mas não cabe a mim dizer | OFF | TIN | GRP | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105] | False | False | True | False | False | False | False | False | False | False | False |
2 | 80eee9db811c4ea4b2ddb7863d12c5fe | Quem liga pra judeu kkkk | OFF | UNT | None | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23] | False | True | True | False | False | False | False | False | False | False | True |
3 | 2f67025f913e4a6292e3d000d9e2b5a8 | Se vc for porco, folgado e relaxado, você não ia conseguir viver com ela mesmo. Realmente, gente escrota não ia conseguir conviver com a Jojo | OFF | UNT | None | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104] | False | False | True | False | False | False | False | False | False | False | False |
4 | cd92f539559e421ba61cf23ecd005511 | Rapaziada chata, né?! O cara trabalha c funk, vive no meio de mulher, solteiro ou não ele ia gravar o clip.... mas aí porque ficou solteiro “tá querendo se mostrar” pqp, gente chata p caralho | OFF | TIN | GRP | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 165, 166, 167, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190] | False | False | True | False | False | False | False | False | False | False | False |
5 | 430b13705cf34e13b74bc999425187c3 | USER USER é muito bom. USER ^^ E claro a equipe USER. | NOT | UNT | None | None | False | False | False | False | False | False | False | False | False | False | False |
6 | c779826dc43f460cb18e8429ca443477 | Pior do que adolescentezinhas de merda...são pessoas que levam filmes tão a sério! O livro/filme é dela, ela faz o que quiser! E por mais ruim que seja, ta rendendo milhões (: | OFF | UNT | None | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38] | False | False | True | False | False | False | True | False | False | True | False |
7 | e64148caa4474fc79298e01d0dda8f5e | USER Toma no cu é vitamina como tu e tua prima. | OFF | TIN | GRP | [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46] | False | False | True | False | False | False | True | False | False | False | False |
8 | 738ccd4476784f47af3a5a6cfdda4695 | Se fosse um sniper ia ser louco | OFF | UNT | None | [26, 27, 28, 29, 30] | False | False | True | False | False | True | False | False | False | False | False |
9 | cc66b54eeec24607a67e2259134a1cdd | Muito bom, pena a circunstâncias serem ruins, ele poderia ter vindo antes, até para dar uma aula ao Monark e fazer ele parar de falar tanta besteira... | OFF | UNT | None | [119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150] | False | False | True | False | False | False | False | False | False | False | False |
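In the rows above, the `toxic_spans` values look like lists of character offsets into `text`. Under that assumption, which is inferred from the sample rows and not stated by the report, the sketch below groups consecutive offsets into runs and slices out the corresponding substrings. The helper name `spans_to_substrings` is hypothetical.

```python
def spans_to_substrings(text, offsets):
    """Group consecutive character offsets into runs and slice the text."""
    runs = []
    for pos in sorted(offsets):
        if runs and pos == runs[-1][1] + 1:
            runs[-1][1] = pos          # extend the current run
        else:
            runs.append([pos, pos])    # start a new run
    return [text[start:end + 1] for start, end in runs]


# Row 8 from the table above.
text = "Se fosse um sniper ia ser louco"
offsets = [26, 27, 28, 29, 30]
print(spans_to_substrings(text, offsets))  # ['louco']
```

Offsets are treated as zero-based positions counted in Python string characters (code points); if the dataset counts bytes or another unit, the slices would need adjusting.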
Last rows
index | id | text | is_offensive | is_targeted | targeted_type | toxic_spans | health | ideology | insult | lgbtqphobia | other_lifestyle | physical_aspects | profanity_obscene | racism | religious_intolerance | sexism | xenophobia |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2986 | 358dd843965f429f9550f5a5eb41a46b | RT USER: TODO MUNDO NESSE CARALHO DE REDE SOCIAL TEM UM SQUAD UMA PANELINHA UM GRUPINHO E OS CARALHO A4 L, SÓ EU QUE NÃO QUE ODIO QUE PO… | OFF | TIN | GRP | [26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102] | False | False | True | False | False | False | True | False | False | False | False |
2987 | 1d8a7915a93847d9abc67d2564926920 | Mano o povo gosta que seja ignorante , por isso eu mando pra casa do caralho mesmo 🤬 | OFF | TIN | GRP | [27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76] | False | True | True | False | False | False | True | False | False | False | True |
2988 | 7545fcbcd9f5497987639797d2b19518 | Essa mulher é louca.kkkkkkkkk. | OFF | TIN | IND | [14, 15, 16, 17, 18] | False | False | True | False | False | False | False | False | False | True | False |
2989 | 8849dc11cdde4bcebf2b361e905811fd | USER Ladrão desgracado dos infernos. Voce é um câncer que faz mal ao pais. URL | OFF | TIN | IND | [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 27, 28, 29, 30, 31, 32, 33, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54] | True | True | True | False | False | False | False | False | False | False | False |
2990 | 78f5241468a54171893914852b0c6cce | Nós vamos colocar pra fuder cm esses bandidos do STF USER e senado | OFF | TIN | GRP | [22, 23, 24, 25, 26, 27, 37, 38, 39, 40, 41, 42, 43, 44, 45] | False | True | True | False | False | False | True | False | False | False | False |
2991 | 0315aa7b60dd48be9c5f47028422af13 | USER USER EU SO BURRA PERDÃO VO APAGA NGM AVISA NADA NESSE CARAIO TBM | OFF | TIN | IND | [16, 17, 18, 19, 20, 21, 59, 60, 61, 62, 63, 64, 65] | False | True | True | False | False | False | False | False | False | False | False |
2992 | 57be31ef8fd54e96afa8482f08df8987 | RT USER: POLÍTICOS CORRUTOS DEVERIAM APODRECER NA CADEIA. POR CAUSA DELES MORREM MILHARES DE PESSOAS POR FALTA DE RECURSOS. ELES… | OFF | TIN | GRP | [22, 23, 24, 25, 26, 27, 28, 29, 30, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58] | False | True | True | False | False | False | False | False | False | False | False |
2993 | dfbffbb11afa4d28a745282c422314d4 | USER é sem personalidade, fala tudo ensaiado… parece um robô. | OFF | TIN | IND | [46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59] | False | False | True | False | False | False | False | False | False | False | False |
2994 | d0e31661ac2e402b8d139b5efcd5c07c | FICO ME PERGUNTANDO Q MERDA A PESSOA TEM NA CABEÇA PRA VOLTAR PRA UM RELACIONAMENTO FUDIDO? PQP EU FICO INDIGNADA COM O TANTO Q EXISTE MINA TROUXA QUERIA Q TODAS ENTENDESSEM A MERDA DE RELACIONAMENTO Q TÃO E A PESSOA LIXO DO LADO DELAS | OFF | TIN | GRP | [22, 23, 24, 25, 26, 27, 84, 85, 86, 87, 88, 89, 92, 93, 94, 140, 141, 142, 143, 144, 145, 176, 177, 178, 179, 180, 181, 217, 218, 219, 220, 221] | False | True | True | False | False | False | False | False | False | True | False |
2995 | 012fdebdb224452a8666eea8ea86d35b | O Nosso Presidente Tem Mesmo Uma Paciência De Jó, Porque se Fosse Eu, Já Tinha Mandado Todos Estes Pilantras Trapaceiros Que Defende a Esquerda Para a Casa Do Caralho ! | OFF | TIN | GRP | [99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 156, 157, 159, 160, 161, 162, 163, 164, 165, 166] | False | True | False | False | False | False | False | False | False | False | False |