Inside the Model That Outsmarts Popular AI Detection Tools

10 Feb 2026
  1. Abstract and Introduction

  2. Dataset

    2.1 Baseline

    2.2 Proposed Model

    2.3 Our System

    2.4 Results

    2.5 Comparison With Proprietary Systems

    2.6 Results Comparison

  3. Conclusion

    3.1 Strengths and Weaknesses

    3.2 Possible Improvements

    3.3 Possible Extensions and Applications

    3.4 Limitations and Potential for Misuse

A. Other Plots and information

B. System Description

C. Effect of Text boundary location on performance

A Other Plots and information

Information that could not be covered in the main paper due to page limitations, along with the details needed to replicate the systems, is provided here.

A.1 POS tag usage: humans vs. machines

It can be seen from Figures 10, 11, and 12 that machine-generated texts had a higher share of certain POS tags in the machine-generated parts than in the human-written parts. This was observed in all three sets: the train and dev sets had similar distributions because they used the same generator (ChatGPT), while the test set varied somewhat because it used multiple different generators (LLaMA2 and GPT-4). Although the percentile comparison did vary across the train, dev, and test sets, the variation was minimal.
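As a minimal sketch of how such per-tag shares can be compared — assuming tokens have already been POS-tagged (e.g. with Penn Treebank tags, as a tagger like `nltk.pos_tag` produces); the toy inputs below are illustrative, not drawn from the dataset:

```python
from collections import Counter

def pos_tag_shares(tagged_tokens):
    """Return each POS tag's share (in %) of a tagged token list.

    `tagged_tokens` is a list of (token, tag) pairs.
    """
    counts = Counter(tag for _, tag in tagged_tokens)
    total = sum(counts.values())
    return {tag: 100.0 * n / total for tag, n in counts.items()}

# Toy example: a human-written part and a machine-generated part
# of the same (hypothetical) sample.
human_part = [("The", "DT"), ("cat", "NN"), ("sat", "VBD")]
machine_part = [("Furthermore", "RB"), (",", ","), ("the", "DT"),
                ("cat", "NN"), ("is", "VBZ"), ("sitting", "VBG")]

human_shares = pos_tag_shares(human_part)
machine_shares = pos_tag_shares(machine_part)

# Tags over-represented in the machine part relative to the human part
diff = {tag: machine_shares.get(tag, 0.0) - human_shares.get(tag, 0.0)
        for tag in set(human_shares) | set(machine_shares)}
```

Aggregating such per-sample shares over a whole split gives the kind of human-vs-machine percentile distributions shown in the figures.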

Figure 8: Median MAE based on pre- and post-text-boundary POS tags: DeBERTa-CRF

Figure 9: Median MAE based on pre- and post-text-boundary POS tags: Longformer.pos-CRF

A.2 MAE characteristics: DeBERTa vs. Longformer

As discussed in the paper, there were instances where one model performed significantly better than the other, as seen in Figures 8 and 9, hinting that an ensemble of the two models' predictions might yield better results.
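One simple way such an ensemble could work — purely a sketch, not the submitted system — is to average the two models' predicted boundary positions per sample; the predictions and gold boundaries below are made up for illustration:

```python
def mae(preds, gold):
    """Mean absolute error between predicted and gold boundary indices."""
    return sum(abs(p - g) for p, g in zip(preds, gold)) / len(gold)

# Hypothetical boundary predictions (word indices) from each model.
deberta_preds    = [12, 40, 55, 90]
longformer_preds = [18, 38, 70, 88]
gold_boundaries  = [15, 39, 60, 92]

# Simple ensemble: average the two predicted indices per sample.
ensemble_preds = [round((a + b) / 2)
                  for a, b in zip(deberta_preds, longformer_preds)]

print(mae(deberta_preds, gold_boundaries))
print(mae(longformer_preds, gold_boundaries))
print(mae(ensemble_preds, gold_boundaries))
```

When the two models' errors point in opposite directions on a sample, the averaged prediction lands closer to the gold boundary, which is why such an ensemble can beat either model alone.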

Figure 10: Percentile distribution of each POS tag in the test set: human vs. machine

B System Description

DeBERTa-CRF was the official submission; longformer.pos-CRF had nearly identical performance on the test set (18.538 vs. 18.542).

Table 7 lists other models that were tested but trailed the models above by a large performance margin.

Due to time and computational resource limitations, only part of the hyperparameter space was explored.

Table 5: Official submission system description: DeBERTa-CRF

Figure 11: Percentile distribution of each POS tag in the train set: human vs. machine

Table 6: Unofficial submission system description: Longformer.pos-CRF

Figure 12: Percentile distribution of each POS tag in the dev set: human vs. machine

Table 7: Other models tested as part of the task

C Effect of Text boundary location on performance

The location of the text boundary relative to the length of the text sample varies between the training and testing sets, as seen in Figures 13 and 14. Despite being trained mostly on samples where the text boundary falls in the first half, the models performed well on the testing set, where a good share of samples have the boundary in the later half. This is an area where the proprietary systems struggled.
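The distributional shift described above can be quantified by normalizing each boundary index by its sample's length and counting which half it falls in — a sketch over made-up (boundary_index, text_length) pairs, not the actual dataset:

```python
def boundary_halves(samples):
    """Fraction of samples whose human/machine boundary falls in the
    first vs. second half of the text (by word position).

    `samples` is a list of (boundary_index, text_length) pairs.
    """
    first = sum(1 for boundary, length in samples if boundary < length / 2)
    frac_first = first / len(samples)
    return frac_first, 1 - frac_first

# Hypothetical pairs mimicking the described skew: boundaries mostly
# early in training-like data, mostly late in testing-like data.
train_like = [(10, 100), (20, 90), (30, 80), (70, 100)]
test_like  = [(60, 100), (80, 100), (30, 90), (75, 120)]

print(boundary_halves(train_like))  # skewed toward the first half
print(boundary_halves(test_like))   # skewed toward the later half
```

Comparing these fractions across splits makes the train/test mismatch in boundary location explicit.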

Figure 13: Location of the text boundary: testing set

Table 8: Hyperparameters explored for the models

Figure 14: Location of the text boundary: training set

Author:

(1) Ram Mohan Rao Kadiyala, University of Maryland, College Park ([email protected]).


This paper is available on arxiv under CC BY-NC-SA 4.0 license.