Dataset statistics
| Number of variables | 17 |
|---|---|
| Number of observations | 2996 |
| Missing cells | 1400 |
| Missing cells (%) | 2.7% |
| Duplicate rows | 0 |
| Duplicate rows (%) | 0.0% |
| Total size in memory | 2.7 MiB |
| Average record size in memory | 939.2 B |
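The missing-cell percentage can be checked from the counts in this table: missing cells divided by the total number of cells (variables times observations). The sketch below only redoes that arithmetic with the figures reported above; the variable names are illustrative and not part of the report.

```python
# Figures copied from the "Dataset statistics" table.
n_variables = 17
n_observations = 2996
missing_cells = 1400

# Total cells = variables x observations.
total_cells = n_variables * n_observations

# Share of cells that are missing, as a percentage.
missing_pct = 100 * missing_cells / total_cells

print(f"{total_cells} cells, {missing_pct:.1f}% missing")
```

The 1400 missing cells are the sum of the two per-column counts listed in the alerts (1283 for targeted_type and 117 for toxic_spans).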
Variable types
| Categorical | 5 |
|---|---|
| Unsupported | 1 |
| Boolean | 11 |
Alerts
religious_intolerance has constant value "False" | Constant |
id has a high cardinality: 2996 distinct values | High cardinality |
text has a high cardinality: 2996 distinct values | High cardinality |
is_offensive is highly correlated with targeted_type and 2 other fields | High correlation |
health is highly correlated with religious_intolerance | High correlation |
other_lifestyle is highly correlated with religious_intolerance | High correlation |
racism is highly correlated with religious_intolerance | High correlation |
sexism is highly correlated with religious_intolerance | High correlation |
is_targeted is highly correlated with targeted_type and 1 other fields | High correlation |
targeted_type is highly correlated with is_offensive and 2 other fields | High correlation |
profanity_obscene is highly correlated with religious_intolerance | High correlation |
insult is highly correlated with is_offensive and 1 other fields | High correlation |
religious_intolerance is highly correlated with is_offensive and 12 other fields | High correlation |
xenophobia is highly correlated with religious_intolerance | High correlation |
physical_aspects is highly correlated with religious_intolerance | High correlation |
lgbtqphobia is highly correlated with religious_intolerance | High correlation |
ideology is highly correlated with religious_intolerance | High correlation |
is_offensive is highly correlated with insult | High correlation |
insult is highly correlated with is_offensive | High correlation |
targeted_type has 1283 (42.8%) missing values | Missing |
toxic_spans has 117 (3.9%) missing values | Missing |
id is uniformly distributed | Uniform |
text is uniformly distributed | Uniform |
id has unique values | Unique |
text has unique values | Unique |
toxic_spans is an unsupported type, check if it needs cleaning or further analysis | Unsupported |
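The toxic_spans cells shown in the sample rows are lists of integers, which is presumably why the profiler reports the column as unsupported. Judging from those rows, the integers look like character offsets into text; under that assumption, a minimal sketch for turning one cell into readable substrings could look like this. The helper names `offsets_to_ranges` and `extract_spans` are hypothetical.

```python
from typing import List, Optional, Tuple


def offsets_to_ranges(offsets: List[int]) -> List[Tuple[int, int]]:
    """Group character offsets into inclusive (start, end) runs."""
    ranges: List[Tuple[int, int]] = []
    for pos in sorted(set(offsets)):
        if ranges and pos == ranges[-1][1] + 1:
            # Extend the current run by one character.
            ranges[-1] = (ranges[-1][0], pos)
        else:
            # Start a new run.
            ranges.append((pos, pos))
    return ranges


def extract_spans(text: str, offsets: Optional[List[int]]) -> List[str]:
    """Return the substrings of `text` covered by each run of offsets."""
    if not offsets:
        # Missing cells (None) and empty lists yield no spans.
        return []
    return [text[start:end + 1] for start, end in offsets_to_ranges(offsets)]


# Row 8 of the "First rows" sample in this report.
text = "Se fosse um sniper ia ser louco"
toxic_spans = [26, 27, 28, 29, 30]
spans = extract_spans(text, toxic_spans)
print(spans)
```

Converting each list to a count of spans or to a total covered length would give the profiler a scalar column it can summarise.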
Reproduction
| Analysis started | 2022-09-01 22:12:40.318763 |
|---|---|
| Analysis finished | 2022-09-01 22:12:53.568420 |
| Duration | 13.25 seconds |
| Software version | pandas-profiling v3.2.0 |
| Distinct | 2996 |
|---|---|
| Distinct (%) | 100.0% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Memory size | 260.5 KiB |
| da19df36730945f08df3d09efa354876 | 1 |
|---|---|
| 49bf28b765484b9f963d6885eb48df31 | 1 |
| ee5a5d40987f424ab71dd60abf56a23c | 1 |
| 3ca64c7132d042feaa0eadea0e76ff22 | 1 |
| e1bffd1c19be401393ab91e196839854 | 1 |
| Other values (2991) |
Length
| Max length | 32 |
|---|---|
| Median length | 32 |
| Mean length | 32 |
| Min length | 32 |
Characters and Unicode
| Total characters | 95872 |
|---|---|
| Distinct characters | 16 |
| Distinct categories | 2 |
| Distinct scripts | 2 |
| Distinct blocks | 1 |
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.
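As an illustration of those character properties, the general category of each character can be looked up with Python's standard `unicodedata` module. This is a stand-in, not necessarily what the report uses internally, and `unicodedata` does not expose scripts or blocks, so only categories are counted here. The function name `category_counts` is hypothetical.

```python
import unicodedata
from collections import Counter


def category_counts(value: str) -> Counter:
    """Count characters by Unicode general category (e.g. 'Ll', 'Nd')."""
    return Counter(unicodedata.category(ch) for ch in value)


# One id value taken from this report.
sample_id = "da19df36730945f08df3d09efa354876"
counts = category_counts(sample_id)
print(dict(counts))
```

The two-letter codes map to the names used in the tables: `Nd` is Decimal Number and `Ll` is Lowercase Letter, which matches the two distinct categories reported for id.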
Unique
| Unique | 2996 |
|---|---|
| Unique (%) | 100.0% |
Sample
| 1st row | da19df36730945f08df3d09efa354876 |
|---|---|
| 2nd row | 80f1a8c981864887b13963fed1261acc |
| 3rd row | 80eee9db811c4ea4b2ddb7863d12c5fe |
| 4th row | 2f67025f913e4a6292e3d000d9e2b5a8 |
| 5th row | cd92f539559e421ba61cf23ecd005511 |
Common Values
| Value | Count | Frequency (%) |
| da19df36730945f08df3d09efa354876 | 1 | < 0.1% |
| 49bf28b765484b9f963d6885eb48df31 | 1 | < 0.1% |
| ee5a5d40987f424ab71dd60abf56a23c | 1 | < 0.1% |
| 3ca64c7132d042feaa0eadea0e76ff22 | 1 | < 0.1% |
| e1bffd1c19be401393ab91e196839854 | 1 | < 0.1% |
| 4dab80b5088d4255a0f2372704246e4e | 1 | < 0.1% |
| b564daa958ec4b7390c4674c298e72a4 | 1 | < 0.1% |
| 10978af756c94a479fbaf54a144c7052 | 1 | < 0.1% |
| 1dbe54562a824d42bd95a728c5d6afef | 1 | < 0.1% |
| 1a1b7b30ec08435889c578791b8188c3 | 1 | < 0.1% |
| Other values (2986) | 2986 |
Length
Histogram of lengths of the category
| Value | Count | Frequency (%) |
| da19df36730945f08df3d09efa354876 | 1 | < 0.1% |
| cc66b54eeec24607a67e2259134a1cdd | 1 | < 0.1% |
| a223536974394b15b5e3bb658ebc596b | 1 | < 0.1% |
| 396d5049c82c430a9f4fa5694c00cbd2 | 1 | < 0.1% |
| 80eee9db811c4ea4b2ddb7863d12c5fe | 1 | < 0.1% |
| 2f67025f913e4a6292e3d000d9e2b5a8 | 1 | < 0.1% |
| cd92f539559e421ba61cf23ecd005511 | 1 | < 0.1% |
| 430b13705cf34e13b74bc999425187c3 | 1 | < 0.1% |
| c779826dc43f460cb18e8429ca443477 | 1 | < 0.1% |
| e64148caa4474fc79298e01d0dda8f5e | 1 | < 0.1% |
| Other values (2986) | 2986 |
Most occurring characters
| Value | Count | Frequency (%) |
| 4 | 8805 | 9.2% |
| 9 | 6413 | 6.7% |
| 8 | 6392 | 6.7% |
| b | 6381 | 6.7% |
| a | 6270 | 6.5% |
| 2 | 5760 | 6.0% |
| 7 | 5663 | 5.9% |
| 0 | 5635 | 5.9% |
| e | 5599 | 5.8% |
| 1 | 5590 | 5.8% |
| Other values (6) | 33364 |
Most occurring categories
| Value | Count | Frequency (%) |
| Decimal Number | 60956 | |
| Lowercase Letter | 34916 |
Most frequent character per category
Decimal Number
| Value | Count | Frequency (%) |
| 4 | 8805 | |
| 9 | 6413 | |
| 8 | 6392 | |
| 2 | 5760 | |
| 7 | 5663 | |
| 0 | 5635 | |
| 1 | 5590 | |
| 3 | 5590 | |
| 6 | 5587 | |
| 5 | 5521 |
Lowercase Letter
| Value | Count | Frequency (%) |
| b | 6381 | |
| a | 6270 | |
| e | 5599 | |
| f | 5583 | |
| d | 5573 | |
| c | 5510 |
Most occurring scripts
| Value | Count | Frequency (%) |
| Common | 60956 | |
| Latin | 34916 |
Most frequent character per script
Common
| Value | Count | Frequency (%) |
| 4 | 8805 | |
| 9 | 6413 | |
| 8 | 6392 | |
| 2 | 5760 | |
| 7 | 5663 | |
| 0 | 5635 | |
| 1 | 5590 | |
| 3 | 5590 | |
| 6 | 5587 | |
| 5 | 5521 |
Latin
| Value | Count | Frequency (%) |
| b | 6381 | |
| a | 6270 | |
| e | 5599 | |
| f | 5583 | |
| d | 5573 | |
| c | 5510 |
Most occurring blocks
| Value | Count | Frequency (%) |
| ASCII | 95872 |
Most frequent character per block
ASCII
| Value | Count | Frequency (%) |
| 4 | 8805 | 9.2% |
| 9 | 6413 | 6.7% |
| 8 | 6392 | 6.7% |
| b | 6381 | 6.7% |
| a | 6270 | 6.5% |
| 2 | 5760 | 6.0% |
| 7 | 5663 | 5.9% |
| 0 | 5635 | 5.9% |
| e | 5599 | 5.8% |
| 1 | 5590 | 5.8% |
| Other values (6) | 33364 |
| Distinct | 2996 |
|---|---|
| Distinct (%) | 100.0% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Memory size | 1.0 MiB |
| USER Adorei o comercial também Jesus. Só achei que faltou um beijinho gay estilo Jesus e USER sabe?! Rsrsrs | 1 |
|---|---|
| 6 DE PDL É PUTARIA!!! Chega por hoje URL | 1 |
| Perto dela ele fica calado | 1 |
| O que esperar de uma empresa que vende lixo para as pessoas comerem ? | 1 |
| USER USER é uma hamburgueria não uma escola, pai que precisa de uma hamburgueria pra educar filho é um pai vagabundo ou não é pai. | 1 |
| Other values (2991) |
Length
| Max length | 954 |
|---|---|
| Median length | 476.5 |
| Mean length | 125.6048064 |
| Min length | 4 |
Characters and Unicode
| Total characters | 376312 |
|---|---|
| Distinct characters | 318 |
| Distinct categories | 21 |
| Distinct scripts | 6 |
| Distinct blocks | 18 |
Unique
| Unique | 2996 |
|---|---|
| Unique (%) | 100.0% |
Sample
| 1st row | USER Adorei o comercial também Jesus. Só achei que faltou um beijinho gay estilo Jesus e USER sabe?! Rsrsrs |
|---|---|
| 2nd row | Cara isso foi muito babaca geral USER conhece o Monark e as merda que ele fala. Isso é muito merda eu USER E to decepcionada pra caralho mas não cabe a mim dizer |
| 3rd row | Quem liga pra judeu kkkk |
| 4th row | Se vc for porco, folgado e relaxado, você não ia conseguir viver com ela mesmo. Realmente, gente escrota não ia conseguir conviver com a Jojo |
| 5th row | Rapaziada chata, né?! O cara trabalha c funk, vive no meio de mulher, solteiro ou não ele ia gravar o clip.... mas aí porque ficou solteiro “tá querendo se mostrar” pqp, gente chata p caralho |
Common Values
| Value | Count | Frequency (%) |
| USER Adorei o comercial também Jesus. Só achei que faltou um beijinho gay estilo Jesus e USER sabe?! Rsrsrs | 1 | < 0.1% |
| 6 DE PDL É PUTARIA!!! Chega por hoje URL | 1 | < 0.1% |
| Perto dela ele fica calado | 1 | < 0.1% |
| O que esperar de uma empresa que vende lixo para as pessoas comerem ? | 1 | < 0.1% |
| USER USER é uma hamburgueria não uma escola, pai que precisa de uma hamburgueria pra educar filho é um pai vagabundo ou não é pai. | 1 | < 0.1% |
| Que merda de insônia mos, quero dormir inferno | 1 | < 0.1% |
| cara, dedo quente nesse cu gelado foi ótimo vei, ri pa caralho!!!! | 1 | < 0.1% |
| caralho mano pq eu fui ler essa porra, vou jogar ácido no meu olho da próxima vez que eu inventar de fazer essa palhaçada | 1 | < 0.1% |
| USER KKKKKKKKKK VAGABUNDA | 1 | < 0.1% |
| USER USER USER percebeu a merda q USER falou? Liberdade de expressão não é tirar a liberdade de outra pessoa só porque USER quer. Pensa antes de escrever. | 1 | < 0.1% |
| Other values (2986) | 2986 |
Length
Histogram of lengths of the category
| Value | Count | Frequency (%) |
| user | 3945 | 5.7% |
| que | 2076 | 3.0% |
| de | 1853 | 2.7% |
| e | 1675 | 2.4% |
| o | 1668 | 2.4% |
| a | 1429 | 2.1% |
| é | 1329 | 1.9% |
| não | 972 | 1.4% |
| um | 714 | 1.0% |
| do | 668 | 1.0% |
| Other values (10786) | 52615 |
Most occurring characters
| Value | Count | Frequency (%) |
| (space) | 65948 | 17.5% |
| a | 33194 | 8.8% |
| e | 31584 | 8.4% |
| o | 27632 | 7.3% |
| s | 19623 | 5.2% |
| r | 17519 | 4.7% |
| i | 15455 | 4.1% |
| n | 12862 | 3.4% |
| d | 12852 | 3.4% |
| m | 12482 | 3.3% |
| Other values (308) | 127161 |
Most occurring categories
| Value | Count | Frequency (%) |
| Lowercase Letter | 263150 | |
| Space Separator | 65948 | 17.5% |
| Uppercase Letter | 34911 | 9.3% |
| Other Punctuation | 9976 | 2.7% |
| Other Symbol | 961 | 0.3% |
| Decimal Number | 718 | 0.2% |
| Dash Punctuation | 113 | < 0.1% |
| Close Punctuation | 96 | < 0.1% |
| Open Punctuation | 88 | < 0.1% |
| Math Symbol | 82 | < 0.1% |
| Other values (11) | 269 | 0.1% |
Most frequent character per category
Other Symbol
| Value | Count | Frequency (%) |
| 😂 | 141 | 14.7% |
| 🤣 | 110 | 11.4% |
| 👏 | 37 | 3.9% |
| 😭 | 35 | 3.6% |
| 🤮 | 27 | 2.8% |
| 🤦 | 24 | 2.5% |
| 🇧 | 20 | 2.1% |
| 🇷 | 20 | 2.1% |
| 😍 | 17 | 1.8% |
| 😡 | 17 | 1.8% |
| Other values (144) | 513 |
Lowercase Letter
| Value | Count | Frequency (%) |
| a | 33194 | |
| e | 31584 | |
| o | 27632 | |
| s | 19623 | 7.5% |
| r | 17519 | 6.7% |
| i | 15455 | 5.9% |
| n | 12862 | 4.9% |
| d | 12852 | 4.9% |
| m | 12482 | 4.7% |
| u | 11849 | 4.5% |
| Other values (33) | 68098 |
Uppercase Letter
| Value | Count | Frequency (%) |
| E | 6049 | |
| S | 5441 | |
| R | 5406 | |
| U | 5170 | |
| A | 2061 | 5.9% |
| O | 1477 | 4.2% |
| T | 895 | 2.6% |
| N | 829 | 2.4% |
| M | 795 | 2.3% |
| D | 773 | 2.2% |
| Other values (28) | 6015 |
Other Letter
| Value | Count | Frequency (%) |
| 茅 | 5 | |
| 茫 | 4 | 10.3% |
| 锚 | 4 | 10.3% |
| 谩 | 4 | 10.3% |
| 馃 | 3 | 7.7% |
| º | 2 | 5.1% |
| 莽 | 2 | 5.1% |
| 鉁 | 2 | 5.1% |
| 檮 | 1 | 2.6% |
| ツ | 1 | 2.6% |
| Other values (11) | 11 |
Other Punctuation
| Value | Count | Frequency (%) |
| . | 3696 | |
| , | 3054 | |
| ! | 1555 | |
| ? | 661 | 6.6% |
| " | 406 | 4.1% |
| : | 252 | 2.5% |
| ' | 97 | 1.0% |
| … | 84 | 0.8% |
| * | 68 | 0.7% |
| / | 44 | 0.4% |
| Other values (8) | 59 | 0.6% |
Decimal Number
| Value | Count | Frequency (%) |
| 0 | 153 | |
| 2 | 111 | |
| 1 | 110 | |
| 3 | 88 | |
| 4 | 65 | |
| 6 | 49 | 6.8% |
| 5 | 47 | 6.5% |
| 8 | 34 | 4.7% |
| 9 | 31 | 4.3% |
| 7 | 30 | 4.2% |
Math Symbol
| Value | Count | Frequency (%) |
| > | 47 | |
| = | 13 | 15.9% |
| + | 13 | 15.9% |
| < | 3 | 3.7% |
| ~ | 3 | 3.7% |
| ¬ | 2 | 2.4% |
| | | 1 | 1.2% |
Modifier Symbol
| Value | Count | Frequency (%) |
| 🏼 | 23 | |
| 🏻 | 14 | |
| 🏽 | 13 | |
| 🏾 | 9 | 12.2% |
| 🏿 | 8 | 10.8% |
| ^ | 5 | 6.8% |
| ´ | 2 | 2.7% |
Dash Punctuation
| Value | Count | Frequency (%) |
| - | 108 | |
| — | 5 | 4.4% |
Close Punctuation
| Value | Count | Frequency (%) |
| ) | 90 | |
| ] | 6 | 6.2% |
Open Punctuation
| Value | Count | Frequency (%) |
| ( | 81 | |
| [ | 7 | 8.0% |
Nonspacing Mark
| Value | Count | Frequency (%) |
| ️ | 47 | |
| ͜ | 1 | 2.1% |
Format
| Value | Count | Frequency (%) |
| | 30 | |
| | 1 | 3.2% |
Final Punctuation
| Value | Count | Frequency (%) |
| ” | 27 | |
| ’ | 2 | 6.9% |
Connector Punctuation
| Value | Count | Frequency (%) |
| _ | 7 | |
| ﹏ | 1 | 12.5% |
Space Separator
| Value | Count | Frequency (%) |
| (space) | 65948 |
Initial Punctuation
| Value | Count | Frequency (%) |
| “ | 31 |
Currency Symbol
| Value | Count | Frequency (%) |
| $ | 4 |
Control
| Value | Count | Frequency (%) |
| 2 |
Enclosing Mark
| Value | Count | Frequency (%) |
| ⃣ | 2 |
Modifier Letter
| Value | Count | Frequency (%) |
| ー | 1 |
Most occurring scripts
| Value | Count | Frequency (%) |
| Latin | 298064 | |
| Common | 78132 | 20.8% |
| Inherited | 80 | < 0.1% |
| Han | 31 | < 0.1% |
| Katakana | 4 | < 0.1% |
| Hiragana | 1 | < 0.1% |
Most frequent character per script
Common
| Value | Count | Frequency (%) |
| (space) | 65948 | 84.4% |
| . | 3696 | 4.7% |
| , | 3054 | 3.9% |
| ! | 1555 | 2.0% |
| ? | 661 | 0.8% |
| " | 406 | 0.5% |
| : | 252 | 0.3% |
| 0 | 153 | 0.2% |
| 😂 | 141 | 0.2% |
| 2 | 111 | 0.1% |
| Other values (202) | 2155 | 2.8% |
Latin
| Value | Count | Frequency (%) |
| a | 33194 | 11.1% |
| e | 31584 | 10.6% |
| o | 27632 | 9.3% |
| s | 19623 | 6.6% |
| r | 17519 | 5.9% |
| i | 15455 | 5.2% |
| n | 12862 | 4.3% |
| d | 12852 | 4.3% |
| m | 12482 | 4.2% |
| u | 11849 | 4.0% |
| Other values (73) | 103012 |
Han
| Value | Count | Frequency (%) |
| 茅 | 5 | |
| 茫 | 4 | |
| 锚 | 4 | |
| 谩 | 4 | |
| 馃 | 3 | |
| 莽 | 2 | 6.5% |
| 鉁 | 2 | 6.5% |
| 檮 | 1 | 3.2% |
| 芒 | 1 | 3.2% |
| 脭 | 1 | 3.2% |
| Other values (4) | 4 |
Inherited
| Value | Count | Frequency (%) |
| ️ | 47 | |
| | 30 | |
| ⃣ | 2 | 2.5% |
| ͜ | 1 | 1.2% |
Katakana
| Value | Count | Frequency (%) |
| ツ | 1 | |
| イ | 1 | |
| メ | 1 | |
| ジ | 1 |
Hiragana
| Value | Count | Frequency (%) |
| の | 1 |
Most occurring blocks
| Value | Count | Frequency (%) |
| ASCII | 366689 | |
| None | 8876 | 2.4% |
| Emoticons | 352 | 0.1% |
| Punctuation | 179 | < 0.1% |
| VS | 47 | < 0.1% |
| Enclosed Alphanum Sup | 40 | < 0.1% |
| Misc Symbols | 38 | < 0.1% |
| Dingbats | 33 | < 0.1% |
| CJK | 31 | < 0.1% |
| Geometric Shapes Ext | 12 | < 0.1% |
| Other values (8) | 15 | < 0.1% |
Most frequent character per block
ASCII
| Value | Count | Frequency (%) |
| (space) | 65948 | 18.0% |
| a | 33194 | 9.1% |
| e | 31584 | 8.6% |
| o | 27632 | 7.5% |
| s | 19623 | 5.4% |
| r | 17519 | 4.8% |
| i | 15455 | 4.2% |
| n | 12862 | 3.5% |
| d | 12852 | 3.5% |
| m | 12482 | 3.4% |
| Other values (82) | 117538 |
None
| Value | Count | Frequency (%) |
| ã | 1978 | |
| é | 1799 | |
| á | 1017 | |
| ç | 848 | |
| ó | 684 | 7.7% |
| í | 675 | 7.6% |
| ê | 440 | 5.0% |
| ú | 204 | 2.3% |
| É | 155 | 1.7% |
| 🤣 | 110 | 1.2% |
| Other values (117) | 966 |
Emoticons
| Value | Count | Frequency (%) |
| 😂 | 141 | |
| 😭 | 35 | 9.9% |
| 😍 | 17 | 4.8% |
| 😡 | 17 | 4.8% |
| 😠 | 14 | 4.0% |
| 😒 | 13 | 3.7% |
| 😆 | 11 | 3.1% |
| 😤 | 9 | 2.6% |
| 😢 | 8 | 2.3% |
| 🙄 | 6 | 1.7% |
| Other values (35) | 81 |
Punctuation
| Value | Count | Frequency (%) |
| … | 84 | |
| “ | 31 | 17.3% |
| | 30 | 16.8% |
| ” | 27 | 15.1% |
| — | 5 | 2.8% |
| ’ | 2 | 1.1% |
VS
| Value | Count | Frequency (%) |
| ️ | 47 |
Enclosed Alphanum Sup
| Value | Count | Frequency (%) |
| 🇧 | 20 | |
| 🇷 | 20 |
Misc Symbols
| Value | Count | Frequency (%) |
| ♂ | 14 | |
| ♀ | 12 | |
| ♡ | 4 | 10.5% |
| ☠ | 2 | 5.3% |
| ☺ | 2 | 5.3% |
| ♥ | 2 | 5.3% |
| ☄ | 1 | 2.6% |
| ⚖ | 1 | 2.6% |
Dingbats
| Value | Count | Frequency (%) |
| ❤ | 14 | |
| ✅ | 8 | |
| ✌ | 5 | 15.2% |
| ✨ | 2 | 6.1% |
| ❗ | 1 | 3.0% |
| ❌ | 1 | 3.0% |
| ❞ | 1 | 3.0% |
| ❝ | 1 | 3.0% |
Geometric Shapes Ext
| Value | Count | Frequency (%) |
| 🟩 | 10 | |
| 🟨 | 2 | 16.7% |
CJK
| Value | Count | Frequency (%) |
| 茅 | 5 | |
| 茫 | 4 | |
| 锚 | 4 | |
| 谩 | 4 | |
| 馃 | 3 | |
| 莽 | 2 | 6.5% |
| 鉁 | 2 | 6.5% |
| 檮 | 1 | 3.2% |
| 芒 | 1 | 3.2% |
| 脭 | 1 | 3.2% |
| Other values (4) | 4 |
Box Drawing
| Value | Count | Frequency (%) |
| ╥ | 2 |
Specials
| Value | Count | Frequency (%) |
| � | 2 |
Geometric Shapes
| Value | Count | Frequency (%) |
| ● | 1 | |
| ○ | 1 |
Katakana
| Value | Count | Frequency (%) |
| ツ | 1 | |
| イ | 1 | |
| メ | 1 | |
| ジ | 1 | |
| ー | 1 |
Hiragana
| Value | Count | Frequency (%) |
| の | 1 |
IPA Ext
| Value | Count | Frequency (%) |
| ʖ | 1 |
Diacriticals
| Value | Count | Frequency (%) |
| ͜ | 1 |
CJK Compat Forms
| Value | Count | Frequency (%) |
| ﹏ | 1 |
| Distinct | 2 |
|---|---|
| Distinct (%) | 0.1% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Memory size | 175.7 KiB |
| OFF | |
|---|---|
| NOT | 117 |
Length
| Max length | 3 |
|---|---|
| Median length | 3 |
| Mean length | 3 |
| Min length | 3 |
Characters and Unicode
| Total characters | 8988 |
|---|---|
| Distinct characters | 4 |
| Distinct categories | 1 |
| Distinct scripts | 1 |
| Distinct blocks | 1 |
Unique
| Unique | 0 |
|---|---|
| Unique (%) | 0.0% |
Sample
| 1st row | OFF |
|---|---|
| 2nd row | OFF |
| 3rd row | OFF |
| 4th row | OFF |
| 5th row | OFF |
Common Values
| Value | Count | Frequency (%) |
| OFF | 2879 | 96.1% |
| NOT | 117 | 3.9% |
Length
Histogram of lengths of the category
Category Frequency Plot
| Value | Count | Frequency (%) |
| off | 2879 | |
| not | 117 | 3.9% |
Most occurring characters
| Value | Count | Frequency (%) |
| F | 5758 | |
| O | 2996 | |
| N | 117 | 1.3% |
| T | 117 | 1.3% |
Most occurring categories
| Value | Count | Frequency (%) |
| Uppercase Letter | 8988 |
Most frequent character per category
Uppercase Letter
| Value | Count | Frequency (%) |
| F | 5758 | |
| O | 2996 | |
| N | 117 | 1.3% |
| T | 117 | 1.3% |
Most occurring scripts
| Value | Count | Frequency (%) |
| Latin | 8988 |
Most frequent character per script
Latin
| Value | Count | Frequency (%) |
| F | 5758 | |
| O | 2996 | |
| N | 117 | 1.3% |
| T | 117 | 1.3% |
Most occurring blocks
| Value | Count | Frequency (%) |
| ASCII | 8988 |
Most frequent character per block
ASCII
| Value | Count | Frequency (%) |
| F | 5758 | |
| O | 2996 | |
| N | 117 | 1.3% |
| T | 117 | 1.3% |
| Distinct | 2 |
|---|---|
| Distinct (%) | 0.1% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Memory size | 175.7 KiB |
| TIN | |
|---|---|
| UNT |
Length
| Max length | 3 |
|---|---|
| Median length | 3 |
| Mean length | 3 |
| Min length | 3 |
Characters and Unicode
| Total characters | 8988 |
|---|---|
| Distinct characters | 4 |
| Distinct categories | 1 |
| Distinct scripts | 1 |
| Distinct blocks | 1 |
Unique
| Unique | 0 |
|---|---|
| Unique (%) | 0.0% |
Sample
| 1st row | UNT |
|---|---|
| 2nd row | TIN |
| 3rd row | UNT |
| 4th row | UNT |
| 5th row | TIN |
Common Values
| Value | Count | Frequency (%) |
| TIN | 1713 | 57.2% |
| UNT | 1283 | 42.8% |
Length
Histogram of lengths of the category
Category Frequency Plot
| Value | Count | Frequency (%) |
| tin | 1713 | |
| unt | 1283 |
Most occurring characters
| Value | Count | Frequency (%) |
| T | 2996 | |
| N | 2996 | |
| I | 1713 | |
| U | 1283 |
Most occurring categories
| Value | Count | Frequency (%) |
| Uppercase Letter | 8988 |
Most frequent character per category
Uppercase Letter
| Value | Count | Frequency (%) |
| T | 2996 | |
| N | 2996 | |
| I | 1713 | |
| U | 1283 |
Most occurring scripts
| Value | Count | Frequency (%) |
| Latin | 8988 |
Most frequent character per script
Latin
| Value | Count | Frequency (%) |
| T | 2996 | |
| N | 2996 | |
| I | 1713 | |
| U | 1283 |
Most occurring blocks
| Value | Count | Frequency (%) |
| ASCII | 8988 |
Most frequent character per block
ASCII
| Value | Count | Frequency (%) |
| T | 2996 | |
| N | 2996 | |
| I | 1713 | |
| U | 1283 |
| Distinct | 3 |
|---|---|
| Distinct (%) | 0.2% |
| Missing | 1283 |
| Missing (%) | 42.8% |
| Memory size | 140.6 KiB |
| IND | |
|---|---|
| GRP | |
| OTH | 92 |
Length
| Max length | 3 |
|---|---|
| Median length | 3 |
| Mean length | 3 |
| Min length | 3 |
Characters and Unicode
| Total characters | 5139 |
|---|---|
| Distinct characters | 9 |
| Distinct categories | 1 |
| Distinct scripts | 1 |
| Distinct blocks | 1 |
Unique
| Unique | 0 |
|---|---|
| Unique (%) | 0.0% |
Sample
| 1st row | GRP |
|---|---|
| 2nd row | GRP |
| 3rd row | GRP |
| 4th row | GRP |
| 5th row | OTH |
Common Values
| Value | Count | Frequency (%) |
| IND | 1047 | 34.9% |
| GRP | 574 | 19.2% |
| OTH | 92 | 3.1% |
| (Missing) | 1283 | 42.8% |
Length
Histogram of lengths of the category
Category Frequency Plot
| Value | Count | Frequency (%) |
| ind | 1047 | |
| grp | 574 | |
| oth | 92 | 5.4% |
Most occurring characters
| Value | Count | Frequency (%) |
| I | 1047 | |
| N | 1047 | |
| D | 1047 | |
| G | 574 | |
| R | 574 | |
| P | 574 | |
| O | 92 | 1.8% |
| T | 92 | 1.8% |
| H | 92 | 1.8% |
Most occurring categories
| Value | Count | Frequency (%) |
| Uppercase Letter | 5139 |
Most frequent character per category
Uppercase Letter
| Value | Count | Frequency (%) |
| I | 1047 | |
| N | 1047 | |
| D | 1047 | |
| G | 574 | |
| R | 574 | |
| P | 574 | |
| O | 92 | 1.8% |
| T | 92 | 1.8% |
| H | 92 | 1.8% |
Most occurring scripts
| Value | Count | Frequency (%) |
| Latin | 5139 |
Most frequent character per script
Latin
| Value | Count | Frequency (%) |
| I | 1047 | |
| N | 1047 | |
| D | 1047 | |
| G | 574 | |
| R | 574 | |
| P | 574 | |
| O | 92 | 1.8% |
| T | 92 | 1.8% |
| H | 92 | 1.8% |
Most occurring blocks
| Value | Count | Frequency (%) |
| ASCII | 5139 |
Most frequent character per block
ASCII
| Value | Count | Frequency (%) |
| I | 1047 | |
| N | 1047 | |
| D | 1047 | |
| G | 574 | |
| R | 574 | |
| P | 574 | |
| O | 92 | 1.8% |
| T | 92 | 1.8% |
| H | 92 | 1.8% |
| Distinct | 2 |
|---|---|
| Distinct (%) | 0.1% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Memory size | 3.1 KiB |
| False | |
|---|---|
| True | 71 |
| Value | Count | Frequency (%) |
| False | 2925 | 97.6% |
| True | 71 | 2.4% |
| Distinct | 2 |
|---|---|
| Distinct (%) | 0.1% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Memory size | 3.1 KiB |
| False | |
|---|---|
| True |
| Value | Count | Frequency (%) |
| False | 2220 | 74.1% |
| True | 776 | 25.9% |
| Distinct | 2 |
|---|---|
| Distinct (%) | 0.1% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Memory size | 3.1 KiB |
| True | |
|---|---|
| False | 174 |
| Value | Count | Frequency (%) |
| True | 2822 | 94.2% |
| False | 174 | 5.8% |
| Distinct | 2 |
|---|---|
| Distinct (%) | 0.1% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Memory size | 3.1 KiB |
| False | |
|---|---|
| True | 213 |
| Value | Count | Frequency (%) |
| False | 2783 | 92.9% |
| True | 213 | 7.1% |
| Distinct | 2 |
|---|---|
| Distinct (%) | 0.1% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Memory size | 3.1 KiB |
| False | |
|---|---|
| True | 41 |
| Value | Count | Frequency (%) |
| False | 2955 | 98.6% |
| True | 41 | 1.4% |
| Distinct | 2 |
|---|---|
| Distinct (%) | 0.1% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Memory size | 3.1 KiB |
| False | |
|---|---|
| True | 178 |
| Value | Count | Frequency (%) |
| False | 2818 | 94.1% |
| True | 178 | 5.9% |
| Distinct | 2 |
|---|---|
| Distinct (%) | 0.1% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Memory size | 3.1 KiB |
| False | |
|---|---|
| True |
| Value | Count | Frequency (%) |
| False | 2064 | 68.9% |
| True | 932 | 31.1% |
| Distinct | 2 |
|---|---|
| Distinct (%) | 0.1% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Memory size | 3.1 KiB |
| False | |
|---|---|
| True | 77 |
| Value | Count | Frequency (%) |
| False | 2919 | 97.4% |
| True | 77 | 2.6% |
| Distinct | 1 |
|---|---|
| Distinct (%) | < 0.1% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Memory size | 3.1 KiB |
| False |
|---|
| Value | Count | Frequency (%) |
| False | 2996 | 100.0% |
| Distinct | 2 |
|---|---|
| Distinct (%) | 0.1% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Memory size | 3.1 KiB |
| False | |
|---|---|
| True |
| Value | Count | Frequency (%) |
| False | 2616 | 87.3% |
| True | 380 | 12.7% |
Spearman's ρ
Spearman's rank correlation coefficient (ρ) measures monotonic correlation between two variables and is therefore better than Pearson's r at catching nonlinear monotonic relationships. Its value lies between -1 and +1: -1 indicates total negative monotonic correlation, 0 indicates no monotonic correlation, and +1 indicates total positive monotonic correlation. To calculate ρ for two variables X and Y, one divides the covariance of the rank variables of X and Y by the product of their standard deviations.
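The rank-based definition can be sketched in plain Python: rank both variables (here with average ranks for ties, which is an assumption about tie handling), then apply the covariance-over-standard-deviations formula to the ranks. The function names are illustrative and not taken from the profiling library.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Assign 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Advance j to the end of the run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Covariance of the ranks divided by the product of their standard deviations."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("rho is undefined for a constant variable")
    return cov / (sx * sy)


x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]  # monotonic but nonlinear
print(spearman_rho(x, y))
```

Because only the ordering matters, the cubic relationship in the example is treated as a perfect monotonic association.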
Pearson's r
Pearson's correlation coefficient (r) measures linear correlation between two variables. Its value lies between -1 and +1: -1 indicates total negative linear correlation, 0 indicates no linear correlation, and +1 indicates total positive linear correlation. Furthermore, r is invariant under separate changes in location and scale of the two variables, implying that for a linear function the angle to the x-axis does not affect r. To calculate r for two variables X and Y, one divides the covariance of X and Y by the product of their standard deviations.
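A minimal sketch of that formula in plain Python follows. The function name `pearson_r` and the toy data are illustrative; the example checks that the slope and intercept of a linear relationship do not change the magnitude of r.

```python
from math import sqrt
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Covariance of x and y divided by the product of their standard deviations."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        # A constant column has zero variance, so r is undefined.
        raise ValueError("r is undefined for a constant variable")
    # The 1/n factors in covariance and variances cancel, so raw sums suffice.
    return cov / sqrt(ss_x * ss_y)


x = [1.0, 2.0, 3.0, 4.0]
print(pearson_r(x, [2 * v + 5 for v in x]))   # increasing linear relationship
print(pearson_r(x, [-3 * v + 1 for v in x]))  # decreasing linear relationship
```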
Kendall's τ
Similarly to Spearman's rank correlation coefficient, the Kendall rank correlation coefficient (τ) measures ordinal association between two variables. Its value lies between -1 and +1: -1 indicates total negative correlation, 0 indicates no correlation, and +1 indicates total positive correlation. To calculate τ for two variables X and Y, one determines the number of concordant and discordant pairs of observations. τ is given by the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs.
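The pair-counting definition above corresponds to the variant usually called tau-a. A direct sketch, quadratic in the number of observations and with an illustrative function name, might look like this; the variant that adjusts for ties (tau-b) is a common library default and can give different values on tied data.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau_a(x: Sequence[float], y: Sequence[float]) -> float:
    """(concordant - discordant) / total pairs, with ties counted as neither."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            concordant += 1   # both variables move in the same direction
        elif product < 0:
            discordant += 1   # the variables move in opposite directions
    total_pairs = n * (n - 1) // 2
    return (concordant - discordant) / total_pairs


x = [1, 2, 3, 4]
y = [1, 3, 2, 4]  # one swapped pair out of six
print(kendall_tau_a(x, y))
```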
Cramér's V (φc)
Cramér's V is an association measure for nominal random variables. The coefficient ranges from 0 to 1, with 0 indicating independence and 1 indicating perfect association. The empirical estimators used for Cramér's V have been shown to be biased, even for large samples. We use the bias-corrected measure proposed by Bergsma in 2013.
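A sketch of the bias-corrected Cramér's V, computed from a contingency table of counts, is given below. The correction terms follow the commonly cited form of Bergsma's estimator; the function name and the toy tables are hypothetical and are not taken from this dataset.

```python
from math import sqrt
from typing import Sequence


def cramers_v_corrected(table: Sequence[Sequence[int]]) -> float:
    """Bias-corrected Cramér's V for an r x k contingency table of counts."""
    r = len(table)
    k = len(table[0])
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(row[j] for row in table) for j in range(k)]

    # Pearson chi-squared statistic against the independence model.
    chi2 = 0.0
    for i in range(r):
        for j in range(k):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                chi2 += (table[i][j] - expected) ** 2 / expected

    phi2 = chi2 / n
    # Bias correction: shrink phi^2 and the effective table dimensions.
    phi2_corr = max(0.0, phi2 - (k - 1) * (r - 1) / (n - 1))
    r_corr = r - (r - 1) ** 2 / (n - 1)
    k_corr = k - (k - 1) ** 2 / (n - 1)
    denom = min(r_corr - 1, k_corr - 1)
    return sqrt(phi2_corr / denom) if denom > 0 else 0.0


# Hypothetical 2 x 2 tables: perfect association and exact independence.
print(cramers_v_corrected([[50, 0], [0, 50]]))
print(cramers_v_corrected([[25, 25], [25, 25]]))
```

A constant column produces a table with a single row or column, so the denominator is zero; the sketch returns 0.0 in that case, which is a design choice made here rather than something stated in the report.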
Phik (φk)
Phik (φk) is a new and practical correlation coefficient that works consistently between categorical, ordinal and interval variables, captures non-linear dependency and reverts to the Pearson correlation coefficient in case of a bivariate normal input distribution. Extensive documentation is available.
A simple visualization of nullity by column.
The nullity matrix is a data-dense display that lets you quickly pick out patterns in data completion by eye.
The correlation heatmap measures nullity correlation: how strongly the presence or absence of one variable affects the presence of another.
The dendrogram allows you to more fully correlate variable completion, revealing trends deeper than the pairwise ones visible in the correlation heatmap.
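The nullity correlation mentioned above can be sketched as a Pearson correlation between the is-missing indicators of two columns. That definition is an assumption about how the heatmap is computed, and the toy columns below are hypothetical rather than rows from this dataset.

```python
from math import sqrt
from typing import Optional, Sequence


def nullity_correlation(a: Sequence[Optional[object]],
                        b: Sequence[Optional[object]]) -> float:
    """Pearson correlation between the is-missing indicators of two columns."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("columns must have the same length (at least 2)")
    # 1.0 where the value is missing (None), 0.0 where it is present.
    na = [1.0 if v is None else 0.0 for v in a]
    nb = [1.0 if v is None else 0.0 for v in b]
    n = len(na)
    mean_a, mean_b = sum(na) / n, sum(nb) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(na, nb))
    ss_a = sum((p - mean_a) ** 2 for p in na)
    ss_b = sum((q - mean_b) ** 2 for q in nb)
    if ss_a == 0 or ss_b == 0:
        raise ValueError("undefined when a column is fully present or fully missing")
    return cov / sqrt(ss_a * ss_b)


# Hypothetical toy columns with partly overlapping gaps.
col_a = ["GRP", None, None, "IND", None]
col_b = [[0, 1], [3], None, [2], None]
print(nullity_correlation(col_a, col_b))
```

A value near 1 means the two columns tend to be missing in the same rows, and a value near -1 means one tends to be present where the other is missing.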
First rows
| | id | text | is_offensive | is_targeted | targeted_type | toxic_spans | health | ideology | insult | lgbtqphobia | other_lifestyle | physical_aspects | profanity_obscene | racism | religious_intolerance | sexism | xenophobia |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | da19df36730945f08df3d09efa354876 | USER Adorei o comercial também Jesus. Só achei que faltou um beijinho gay estilo Jesus e USER sabe?! Rsrsrs | OFF | UNT | None | [52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86] | False | False | True | False | False | False | True | False | False | False | False |
| 1 | 80f1a8c981864887b13963fed1261acc | Cara isso foi muito babaca geral USER conhece o Monark e as merda que ele fala. Isso é muito merda eu USER E to decepcionada pra caralho mas não cabe a mim dizer | OFF | TIN | GRP | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105] | False | False | True | False | False | False | False | False | False | False | False |
| 2 | 80eee9db811c4ea4b2ddb7863d12c5fe | Quem liga pra judeu kkkk | OFF | UNT | None | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23] | False | True | True | False | False | False | False | False | False | False | True |
| 3 | 2f67025f913e4a6292e3d000d9e2b5a8 | Se vc for porco, folgado e relaxado, você não ia conseguir viver com ela mesmo. Realmente, gente escrota não ia conseguir conviver com a Jojo | OFF | UNT | None | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104] | False | False | True | False | False | False | False | False | False | False | False |
| 4 | cd92f539559e421ba61cf23ecd005511 | Rapaziada chata, né?! O cara trabalha c funk, vive no meio de mulher, solteiro ou não ele ia gravar o clip.... mas aí porque ficou solteiro “tá querendo se mostrar” pqp, gente chata p caralho | OFF | TIN | GRP | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 165, 166, 167, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190] | False | False | True | False | False | False | False | False | False | False | False |
| 5 | 430b13705cf34e13b74bc999425187c3 | USER USER é muito bom. USER ^^ E claro a equipe USER. | NOT | UNT | None | None | False | False | False | False | False | False | False | False | False | False | False |
| 6 | c779826dc43f460cb18e8429ca443477 | Pior do que adolescentezinhas de merda...são pessoas que levam filmes tão a sério! O livro/filme é dela, ela faz o que quiser! E por mais ruim que seja, ta rendendo milhões (: | OFF | UNT | None | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38] | False | False | True | False | False | False | True | False | False | True | False |
| 7 | e64148caa4474fc79298e01d0dda8f5e | USER Toma no cu é vitamina como tu e tua prima. | OFF | TIN | GRP | [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46] | False | False | True | False | False | False | True | False | False | False | False |
| 8 | 738ccd4476784f47af3a5a6cfdda4695 | Se fosse um sniper ia ser louco | OFF | UNT | None | [26, 27, 28, 29, 30] | False | False | True | False | False | True | False | False | False | False | False |
| 9 | cc66b54eeec24607a67e2259134a1cdd | Muito bom, pena a circunstâncias serem ruins, ele poderia ter vindo antes, até para dar uma aula ao Monark e fazer ele parar de falar tanta besteira... | OFF | UNT | None | [119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150] | False | False | True | False | False | False | False | False | False | False | False |
Last rows
| | id | text | is_offensive | is_targeted | targeted_type | toxic_spans | health | ideology | insult | lgbtqphobia | other_lifestyle | physical_aspects | profanity_obscene | racism | religious_intolerance | sexism | xenophobia |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2986 | 358dd843965f429f9550f5a5eb41a46b | RT USER: TODO MUNDO NESSE CARALHO DE REDE SOCIAL TEM UM SQUAD UMA PANELINHA UM GRUPINHO E OS CARALHO A4 L, SÓ EU QUE NÃO QUE ODIO QUE PO… | OFF | TIN | GRP | [26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102] | False | False | True | False | False | False | True | False | False | False | False |
| 2987 | 1d8a7915a93847d9abc67d2564926920 | Mano o povo gosta que seja ignorante , por isso eu mando pra casa do caralho mesmo 🤬 | OFF | TIN | GRP | [27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76] | False | True | True | False | False | False | True | False | False | False | True |
| 2988 | 7545fcbcd9f5497987639797d2b19518 | Essa mulher é louca.kkkkkkkkk. | OFF | TIN | IND | [14, 15, 16, 17, 18] | False | False | True | False | False | False | False | False | False | True | False |
| 2989 | 8849dc11cdde4bcebf2b361e905811fd | USER Ladrão desgracado dos infernos. Voce é um câncer que faz mal ao pais. URL | OFF | TIN | IND | [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 27, 28, 29, 30, 31, 32, 33, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54] | True | True | True | False | False | False | False | False | False | False | False |
| 2990 | 78f5241468a54171893914852b0c6cce | Nós vamos colocar pra fuder cm esses bandidos do STF USER e senado | OFF | TIN | GRP | [22, 23, 24, 25, 26, 27, 37, 38, 39, 40, 41, 42, 43, 44, 45] | False | True | True | False | False | False | True | False | False | False | False |
| 2991 | 0315aa7b60dd48be9c5f47028422af13 | USER USER EU SO BURRA PERDÃO VO APAGA NGM AVISA NADA NESSE CARAIO TBM | OFF | TIN | IND | [16, 17, 18, 19, 20, 21, 59, 60, 61, 62, 63, 64, 65] | False | True | True | False | False | False | False | False | False | False | False |
| 2992 | 57be31ef8fd54e96afa8482f08df8987 | RT USER: POLÍTICOS CORRUTOS DEVERIAM APODRECER NA CADEIA. POR CAUSA DELES MORREM MILHARES DE PESSOAS POR FALTA DE RECURSOS. ELES… | OFF | TIN | GRP | [22, 23, 24, 25, 26, 27, 28, 29, 30, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58] | False | True | True | False | False | False | False | False | False | False | False |
| 2993 | dfbffbb11afa4d28a745282c422314d4 | USER é sem personalidade, fala tudo ensaiado… parece um robô. | OFF | TIN | IND | [46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59] | False | False | True | False | False | False | False | False | False | False | False |
| 2994 | d0e31661ac2e402b8d139b5efcd5c07c | FICO ME PERGUNTANDO Q MERDA A PESSOA TEM NA CABEÇA PRA VOLTAR PRA UM RELACIONAMENTO FUDIDO? PQP EU FICO INDIGNADA COM O TANTO Q EXISTE MINA TROUXA QUERIA Q TODAS ENTENDESSEM A MERDA DE RELACIONAMENTO Q TÃO E A PESSOA LIXO DO LADO DELAS | OFF | TIN | GRP | [22, 23, 24, 25, 26, 27, 84, 85, 86, 87, 88, 89, 92, 93, 94, 140, 141, 142, 143, 144, 145, 176, 177, 178, 179, 180, 181, 217, 218, 219, 220, 221] | False | True | True | False | False | False | False | False | False | True | False |
| 2995 | 012fdebdb224452a8666eea8ea86d35b | O Nosso Presidente Tem Mesmo Uma Paciência De Jó, Porque se Fosse Eu, Já Tinha Mandado Todos Estes Pilantras Trapaceiros Que Defende a Esquerda Para a Casa Do Caralho ! | OFF | TIN | GRP | [99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 156, 157, 159, 160, 161, 162, 163, 164, 165, 166] | False | True | False | False | False | False | False | False | False | False | False |