System overview
Our system is designed with two main components: illegal text detection and high-quality comment filtering, with each component trained independently.
The process begins with illegal text detection, where the input text first undergoes sensitive word matching.
The text is also analyzed by a FastText model for screening.
Texts flagged as illegal by both sensitive word detection and FastText are classified as illegal.
However, if the two methods provide conflicting results, the BERT model makes the final determination.
Texts that are not classified as illegal proceed to high-quality comment filtering. Each text is converted into token embeddings using the pre-trained BERT model, producing a matrix representation. This matrix is then processed by a trained Auto-Encoder, which attempts to reconstruct the input. By evaluating the reconstruction error, the system determines whether the text meets the threshold for high-quality comments.
If the error falls within the acceptable range, the text is considered high quality; otherwise, it is discarded. This comprehensive pipeline ensures robust illegal text detection and effective identification of high-quality content.
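The decision flow described above can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: the callables `fasttext_flags`, `bert_flags`, `embed`, and `autoencode` are hypothetical stand-ins for the trained models, the use of mean squared error as the reconstruction error is an assumption (the error metric is not specified here), and `threshold` is a placeholder value that would have to be tuned.

```python
from typing import Callable, List, Sequence

Matrix = List[List[float]]


def is_illegal(
    text: str,
    sensitive_words: Sequence[str],
    fasttext_flags: Callable[[str], bool],  # hypothetical FastText wrapper
    bert_flags: Callable[[str], bool],      # hypothetical BERT classifier wrapper
) -> bool:
    """Combine sensitive word matching and FastText; BERT breaks ties."""
    word_hit = any(word in text for word in sensitive_words)
    fasttext_hit = fasttext_flags(text)
    if word_hit == fasttext_hit:
        # Both flag the text -> illegal. Both clear it -> not illegal
        # (the second case is an assumption implied by the description).
        return word_hit
    # The two methods disagree, so BERT makes the final determination.
    return bert_flags(text)


def reconstruction_error(matrix: Matrix, reconstructed: Matrix) -> float:
    """Mean squared error between two matrices (assumed error metric)."""
    total, count = 0.0, 0
    for row, rec_row in zip(matrix, reconstructed):
        for a, b in zip(row, rec_row):
            total += (a - b) ** 2
            count += 1
    return total / count


def classify(
    text: str,
    *,
    sensitive_words: Sequence[str],
    fasttext_flags: Callable[[str], bool],
    bert_flags: Callable[[str], bool],
    embed: Callable[[str], Matrix],          # hypothetical BERT embedding step
    autoencode: Callable[[Matrix], Matrix],  # hypothetical trained Auto-Encoder
    threshold: float,                        # hypothetical, tuned on held-out data
) -> str:
    """Return 'illegal', 'high_quality', or 'discarded' for one text."""
    if is_illegal(text, sensitive_words, fasttext_flags, bert_flags):
        return "illegal"
    matrix = embed(text)
    error = reconstruction_error(matrix, autoencode(matrix))
    return "high_quality" if error <= threshold else "discarded"
```

Passing the models in as plain callables keeps the sketch independent of any particular library; in practice each callable would wrap the corresponding trained model.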
High-quality comment classification
We assume that a highly liked comment is a high-quality comment, but a comment with few likes is not necessarily a low-quality comment.
- For example, some comments may have fewer likes simply because they have not been seen by enough people.
We also believe that having only a small number of people manually screen comments is neither realistic nor objective, whereas highly liked comments have already been vetted by a large number of readers.