The specific rules for each quality metric are as follows:
| Function Name | Type | Description | Reference |
|------------------------------|-------------------|---------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| RuleAlphaWords | EFFECTIVENESS | check whether the ratio of words that contain at least one alphabetic character > 0.6 | [Redpajama](https://www.together.ai/blog/redpajama-data-v2) [MAP-en](https://arxiv.org/abs/2405.19327) [Gopher](https://arxiv.org/abs/2112.11446) [Dolma](https://arxiv.org/abs/2402.00159) |
| RuleCapitalWords | UNDERSTANDABILITY | check whether capital words ratio > 0.2 | [Redpajama](https://www.together.ai/blog/redpajama-data-v2) [MAP-en](https://arxiv.org/abs/2405.19327) |
| RuleCharNumber | EFFECTIVENESS | check whether the number of characters > 100 | [MAP-en](https://arxiv.org/abs/2405.19327) |
| RuleColonEnd | COMPLETENESS | check whether the last char is ':' | |
| RuleContentNull | EFFECTIVENESS | check whether content is null | |
| RuleCurlyBracket | UNDERSTANDABILITY | check whether the ratio of curly brackets ({ and }) to total characters < 0.025 | [Redpajama](https://www.together.ai/blog/redpajama-data-v2) [C4](https://arxiv.org/abs/1910.10683) |
| RuleDocRepeat | SIMILARITY | check whether content repeats | [Redpajama](https://www.together.ai/blog/redpajama-data-v2) [MAP-en](https://arxiv.org/abs/2405.19327) [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) [Gopher](https://arxiv.org/abs/2112.11446) |
| RuleHtmlEntity | RELEVANCE | check whether content contains HTML entities | |
| RuleIDCard | SECURITY | check whether content contains ID card information | |
| RuleLineEndWithEllipsis | COMPLETENESS | check whether the ratio of lines ending with an ellipsis < 0.3 | [Redpajama](https://www.together.ai/blog/redpajama-data-v2) [MAP-en](https://arxiv.org/abs/2405.19327) [Gopher](https://arxiv.org/abs/2112.11446) [Dolma](https://arxiv.org/abs/2402.00159) |
| RuleLineEndWithTerminal | COMPLETENESS | check whether the ratio of lines ending with a terminal punctuation mark > 0.6 | [Redpajama](https://www.together.ai/blog/redpajama-data-v2) [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) [C4](https://arxiv.org/abs/1910.10683) |
| RuleLineStartWithBulletpoint | UNDERSTANDABILITY | check whether the ratio of lines starting with a bullet point < 0.9 | [Redpajama](https://www.together.ai/blog/redpajama-data-v2) [MAP-en](https://arxiv.org/abs/2405.19327) [Gopher](https://arxiv.org/abs/2112.11446) [Dolma](https://arxiv.org/abs/2402.00159) |
| RuleLineJavascriptCount | EFFECTIVENESS | check whether lines contain the word Javascript | [Redpajama](https://www.together.ai/blog/redpajama-data-v2) [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) [C4](https://arxiv.org/abs/1910.10683) |
| RuleLoremIpsum | EFFECTIVENESS | check whether the ratio of lorem ipsum < 3e-08 | [Redpajama](https://www.together.ai/blog/redpajama-data-v2) [MAP-en](https://arxiv.org/abs/2405.19327) [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) [C4](https://arxiv.org/abs/1910.10683) |
| RuleMeanWordLength | EFFECTIVENESS | check whether the mean word length is in [3, 10] | [Redpajama](https://www.together.ai/blog/redpajama-data-v2) [MAP-en](https://arxiv.org/abs/2405.19327) [Gopher](https://arxiv.org/abs/2112.11446) [Dolma](https://arxiv.org/abs/2402.00159) |
| RuleNoPunc | FLUENCY | check whether a paragraph has no punctuation | |
| RuleSentenceNumber | COMPLETENESS | check whether the number of sentences is in [3, 7500] | [Redpajama](https://www.together.ai/blog/redpajama-data-v2) [MAP-en](https://arxiv.org/abs/2405.19327) [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) [C4](https://arxiv.org/abs/1910.10683) |
| RuleSpecialCharacter | RELEVANCE | check whether content has special characters. | |
| RuleStopWord | EFFECTIVENESS | check whether the ratio of stop words > 0.06 | [Redpajama](https://www.together.ai/blog/redpajama-data-v2) [MAP-en](https://arxiv.org/abs/2405.19327) [Gopher](https://arxiv.org/abs/2112.11446) [Dolma](https://arxiv.org/abs/2402.00159) |
| RuleSymbolWordRatio | EFFECTIVENESS | check whether the ratio of symbol / word is > 0.4 | [Redpajama](https://www.together.ai/blog/redpajama-data-v2) [Gopher](https://arxiv.org/abs/2112.11446) [Dolma](https://arxiv.org/abs/2402.00159) |
| RuleUniqueWords | UNDERSTANDABILITY | check whether the ratio of unique words > 0.1 | [Redpajama](https://www.together.ai/blog/redpajama-data-v2) [MAP-en](https://arxiv.org/abs/2405.19327) |
| RuleWatermark | RELEVANCE | check whether content has watermarks. | |
| RuleWordNumber | EFFECTIVENESS | check whether the number of words is in [20, 100000] | [Redpajama](https://www.together.ai/blog/redpajama-data-v2) [MAP-en](https://arxiv.org/abs/2405.19327) [Gopher](https://arxiv.org/abs/2112.11446) [Dolma](https://arxiv.org/abs/2402.00159) |
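
To make a few of the conditions in the table concrete, here is a minimal Python sketch of how four of them (RuleAlphaWords, RuleMeanWordLength, RuleColonEnd, RuleCurlyBracket) could be evaluated. It is not the project's implementation: the function names, the whitespace-based word splitting, and the `evaluate_conditions` helper are all assumptions made for illustration. The table does not say whether a true condition marks a document as good or bad, so the sketch only reports whether each stated condition holds.

```python
# Illustrative sketch only: names and tokenization are hypothetical,
# not taken from the rule implementations referenced in the table.

def alpha_word_ratio(text: str) -> float:
    """Fraction of whitespace-separated words with at least one alphabetic character."""
    words = text.split()
    if not words:
        return 0.0
    return sum(any(ch.isalpha() for ch in w) for w in words) / len(words)


def mean_word_length(text: str) -> float:
    """Average length of whitespace-separated words (punctuation is not stripped)."""
    words = text.split()
    if not words:
        return 0.0
    return sum(len(w) for w in words) / len(words)


def ends_with_colon(text: str) -> bool:
    """True if the last character of the text is ':'."""
    return text.endswith(":")


def curly_bracket_ratio(text: str) -> float:
    """Number of '{' and '}' characters divided by the total number of characters."""
    if not text:
        return 0.0
    return (text.count("{") + text.count("}")) / len(text)


def evaluate_conditions(text: str) -> dict:
    """Evaluate the conditions as written in the table, using its thresholds."""
    return {
        "RuleAlphaWords": alpha_word_ratio(text) > 0.6,
        "RuleMeanWordLength": 3 <= mean_word_length(text) <= 10,
        "RuleColonEnd": ends_with_colon(text),
        "RuleCurlyBracket": curly_bracket_ratio(text) < 0.025,
    }


if __name__ == "__main__":
    sample = "The specific rules for each quality metric are as follows:"
    for rule, holds in evaluate_conditions(sample).items():
        print(f"{rule}: {holds}")
```

A real tokenizer would probably handle punctuation and non-English text differently, which can shift the ratios near the thresholds.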