Frankie

Challenging the Downplay of Plagiarism in AI-Generated Code




The rise of artificial intelligence (AI) in coding has radically changed the way software is developed. AI tools like Copilot and ChatGPT are becoming essential contributors to the code base in many software projects. These advances, however, bring a significant concern: the generated code may infringe on existing copyrights. Despite the gravity of this issue, some parties tend to downplay the need for plagiarism validation in AI-generated code, notably companies whose Software Composition Analysis (SCA) tools lack the functionality to perform it.



One common misconception is that license compliance for AI-generated code is akin to managing code fragments common to all programming, or to the autocomplete feature in Google's search engine. This comparison is fundamentally flawed. AI-generated code has been proven to duplicate intricate, unique, and copyright-protected segments of code. Managing license compliance is therefore a more complex and more serious task than handling common expressions or autocomplete suggestions.



Some also hold the view that the ongoing class action lawsuit against GitHub is the sole issue in this space. However, the threat of copyright infringement by AI tools does not hinge on the outcome of a single lawsuit. It is a pervasive issue that extends beyond any one case and demands constant vigilance and comprehensive mitigation strategies.



Another area of contention is how AI-generated code should be validated. Several SCA tool providers advise using tools that can only recognize complete, unmodified open-source files. While this approach may detect blatant violations, it overlooks many subtler ones. AI tools can and do generate variations that closely resemble open-source code, deviating by a word or two, and such variations evade detection by whole-file SCA tools. A more discerning approach that can identify copyright infringement at the granular level of code fragments is therefore essential.
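The gap between whole-file and fragment-level matching can be illustrated with a minimal sketch. It is not the algorithm of any particular SCA product; the function names, the k-gram size, and the sample snippet are all assumptions made for illustration. A whole-file hash changes completely when a single identifier is renamed, while fingerprints over short token windows still overlap heavily.

```python
import hashlib
import re


def file_hash(source: str) -> str:
    """Whole-file fingerprint: any single-character change yields a new hash."""
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


def tokens(source: str) -> list[str]:
    """Crude tokenizer (an assumption): identifiers, numbers, single symbols."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", source)


def kgram_fingerprints(source: str, k: int = 5) -> set[str]:
    """Hash every window of k consecutive tokens."""
    toks = tokens(source)
    return {
        hashlib.sha1(" ".join(toks[i:i + k]).encode("utf-8")).hexdigest()
        for i in range(len(toks) - k + 1)
    }


def containment(candidate: str, reference: str, k: int = 5) -> float:
    """Fraction of the candidate's k-gram fingerprints found in the reference."""
    cand = kgram_fingerprints(candidate, k)
    if not cand:
        return 0.0
    return len(cand & kgram_fingerprints(reference, k)) / len(cand)


# Hypothetical "open-source" snippet and a near copy with one name changed.
REFERENCE = """
def clamp(value, low, high):
    if value < low:
        return low
    if value > high:
        return high
    return value
"""
GENERATED = REFERENCE.replace("clamp", "limit")

if __name__ == "__main__":
    print("same file hash:", file_hash(REFERENCE) == file_hash(GENERATED))
    print("fragment containment:", round(containment(GENERATED, REFERENCE), 2))
```

In this sketch the file hashes differ, so a whole-file lookup reports nothing, while most of the token windows in the renamed copy still match the reference. A production scanner would need a far larger index and a more robust tokenizer.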



A narrow focus on specific AI tools, while disregarding the rest, presents a skewed picture of the landscape. Conclusions drawn from such incomplete evidence could be dangerously misleading. Any conversation or risk mitigation strategy concerning AI-generated code must cover the full range of AI tools contributing to the coding space, not just a selected few.



False positives, or erroneous alerts of copyright infringement, are often raised as an objection to scanning code fragments. However, not every flagged fragment is a false positive. Some are genuine cases of copyright infringement. Instead of avoiding fragment or snippet scanning altogether, the focus should be on improving the accuracy of detection.
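One simple way to trade false positives against missed matches is to report a fragment only when it shares a sufficiently long run of tokens with a known source. The sketch below is a hypothetical illustration of that idea, not a description of how any existing scanner works; the threshold value, the tokenizer, and the sample snippets are assumptions.

```python
import re
from difflib import SequenceMatcher

# Hypothetical threshold: shorter shared runs are treated as common idioms.
MIN_MATCH_TOKENS = 12


def tokens(source: str) -> list[str]:
    """Crude tokenizer (an assumption): identifiers, numbers, single symbols."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", source)


def longest_shared_run(candidate: str, reference: str) -> int:
    """Length, in tokens, of the longest contiguous run shared by both inputs."""
    a, b = tokens(candidate), tokens(reference)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return matcher.find_longest_match(0, len(a), 0, len(b)).size


def is_reportable(candidate: str, reference: str,
                  min_tokens: int = MIN_MATCH_TOKENS) -> bool:
    """Flag the candidate only if the shared run reaches the threshold."""
    return longest_shared_run(candidate, reference) >= min_tokens


REFERENCE = """
def clamp(value, low, high):
    if value < low:
        return low
    if value > high:
        return high
    return value
"""
# Shares only a short, generic comparison with the reference.
IDIOM = "for i in range(10):\n    if value < low:\n        pass"
# The reference body copied under a different function name.
COPIED = REFERENCE.replace("clamp", "limit")

if __name__ == "__main__":
    print("idiom flagged:", is_reportable(IDIOM, REFERENCE))
    print("copy flagged:", is_reportable(COPIED, REFERENCE))
```

Raising the threshold suppresses alerts on generic idioms at the cost of missing short but distinctive fragments, so the value would have to be tuned against real data rather than fixed in advance.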



There is also a tendency to dismiss the copyrightability of code fragments and to treat them as unworthy of attention in risk management. Such assumptions are baseless and need reconsideration. Even small fragments of code can carry copyright claims and require careful scrutiny to avoid infringement.



We need to understand the reality of AI-generated code. AI tools can and do produce verbatim copies of copyrightable open-source code, which poses a substantial risk to license compliance. We should therefore resist any attempt to downplay the issue and work towards more effective solutions for plagiarism validation in AI-generated code. The ethical implications of AI use in software development call for robust, comprehensive, and vigilant plagiarism validation. Let us not obscure this reality with flawed logic or narrow perspectives.



Check out SCANOSS to learn more about staying compliant in the new landscape of AI-assisted coding.


