Copyleaks Announces the Addition of Cross-Language Plagiarism Detection

Dec 12, 2022

Originally Posted On: https://copyleaks.com/blog/copyleaks-announces-multilingual-plagiarism-detection

Copyleaks recently announced an innovative new tool to help fight plagiarism: cross-language plagiarism detection. This will be welcome news to companies, organizations, and academic institutions that value originality and authenticity.
The new cross-language detection tool will allow users to stay one step ahead of online translators, international essay mills, and unscrupulous writers.

What is cross-language plagiarism?

If a writer translates a text written by someone else from one language to another and then presents the translated text as their own, this is called cross-language plagiarism. The same applies if the writer translates part of an existing text and incorporates it into their work without proper citation or attribution.

One of the challenges of cross-language plagiarism is that it is often very difficult to detect.

Most plagiarism detection software works by text matching: it scans for similar phrases, identical words, or sentences copied verbatim. As a result, these systems will often miss ideas or text copied from a source written in a different language than the text being scanned.
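To make that limitation concrete, here is a minimal sketch of verbatim matching based on word n-gram overlap. It is an illustration under simplified assumptions, not the algorithm of any particular product; the function names and the sample sentences are hypothetical.

```python
def word_ngrams(text, n=3):
    """Return the set of lowercase word n-grams found in text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def verbatim_overlap(candidate, source, n=3):
    """Fraction of the candidate's n-grams that also appear in the source."""
    candidate_grams = word_ngrams(candidate, n)
    if not candidate_grams:
        return 0.0
    shared = candidate_grams & word_ngrams(source, n)
    return len(shared) / len(candidate_grams)


# Hypothetical sample texts used only for this illustration.
source = "the quick brown fox jumps over the lazy dog"
copied = "the quick brown fox jumps over the lazy dog"
translated = "le renard brun rapide saute par-dessus le chien paresseux"

print(verbatim_overlap(copied, source))      # identical text: full overlap
print(verbatim_overlap(translated, source))  # translation: no shared n-grams
```

The copied sentence matches completely, while the French translation shares no word sequences with the English source, so a matcher of this kind reports nothing.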

Copyleaks knew their innovative plagiarism detector was uniquely positioned to solve this problem.

What makes Copyleaks different?

Copyleaks was formed in 2013 to address what it felt to be a sizable gap in the plagiarism detection software market. What existing tools were doing, primarily, was looking for copy and paste: Did the writer copy text from an existing source and paste it into their own document? And this approach made sense 15 years ago when these tools first emerged.

But today, technology has advanced to the point that it is easy to take an existing text and drop it into a “text spinner.” There are a dozen free websites where a user can paste in a text, and the site then replaces words with synonyms or rearranges the sentence structure. This has made it altogether too easy to paraphrase someone else’s work with very little effort.

And Copyleaks realized that virtually no tool on the market was looking for that type of plagiarism.

Plagiarism was changing, but Copyleaks was innovating faster. The company realized that a plagiarism checker must detect more than identical sequences of words; the software would need to identify the meaning of a sentence. And that, they thought, was where Artificial Intelligence (AI) would help.

As AI and machine learning have advanced, we’ve seen the rise of tools dedicated to Natural Language Processing (NLP). In AI, NLP refers to programs that can process and interpret human language.

IBM explains that “NLP combines computational linguistics—rule-based modeling of human language—with statistical, machine learning, and deep learning models. Together, these technologies enable computers to process human language in text or voice data and ‘understand’ its full meaning, complete with the speaker or writer’s intent and sentiment.”

So instead of looking for groups of words that appear verbatim from another source, as existing plagiarism software did, Copyleaks harnessed the power of AI and NLP to improve the efficacy of their tools.

This application of AI enables Copyleaks to understand the meaning of the sentence and the meaning of the paragraph. This means that, even when a passage uses different words from its source, Copyleaks’ tool can look for patterns of shared meaning. First, the AI eliminates coincidences, and then it says, effectively, this is why we think this is questionable; pause and look at this.
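The idea of comparing meaning rather than wording can be sketched with a toy example. This is not Copyleaks’ method, which the article does not describe in detail; it simply assumes a small hand-built synonym table (hypothetical) that maps words to a shared label before comparing two texts.

```python
# Hypothetical synonym groups: every word maps to one canonical label.
# A production system would learn meaning from data; this table is a toy.
CANONICAL = {
    "quick": "fast", "rapid": "fast", "fast": "fast",
    "jumps": "jump", "leaps": "jump",
    "lazy": "idle", "idle": "idle",
    "dog": "dog", "hound": "dog",
    "fox": "fox",
}


def meaning_set(text):
    """Map each known word to its canonical label; ignore unknown words."""
    return {CANONICAL[w] for w in text.lower().split() if w in CANONICAL}


def meaning_similarity(a, b):
    """Jaccard similarity between the canonical-label sets of two texts."""
    set_a, set_b = meaning_set(a), meaning_set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


source = "the quick fox jumps over the lazy dog"
spun = "the rapid fox leaps over the idle hound"
unrelated = "quarterly revenue rose sharply"

print(meaning_similarity(source, spun))       # same labels despite new wording
print(meaning_similarity(source, unrelated))  # no labels in common
```

The spun sentence swaps several content words for synonyms, yet it resolves to the same set of labels as the source, which is the kind of pattern a word-for-word comparison would miss.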

Copyleaks and Cross-Language Plagiarism Detection

The Copyleaks team quickly realized that their NLP software was a potential solution for cross-language plagiarism. Traditional detectors, looking for verbatim text, often could not detect cross-language or multilingual plagiarism. The nuances of language, coupled with the wholesale change in vocabulary, meant that text could easily be copied without being identical to the source material.

The proliferation of Google Translate and similar tools has made the problem of cross-language detection urgent. It now takes seconds for someone to find a paper, an article, or a blog entry that a competitor wrote, paste it into Google Translate, and convert it from its original language into, say, French. In translating the text into the selected language, Google Translate effectively paraphrases it.

In moments, a user can take text that isn’t theirs and submit a translated version as their own. And the reality is that no existing tool has been able to find that. It’s complicated because, again, a translation is effectively paraphrased content. And because Copyleaks had already solved the paraphrased content question, this problem was a natural next step.

This isn’t a new problem–academics have discussed the need for detection tools for over a decade. But until recently, technology didn’t allow for large-scale indexing of material on the internet or in print in combination with natural language processing.

Cross-language plagiarism and the university campus

While the impact of plagiarism in academic settings is rather clear–plagiarism impacts learning and is therefore subject to often dire consequences–the applications of Copyleaks’ software extend far beyond catching and punishing cheaters on campus.

Copyleaks’ philosophy is that cross-language detection tools can and should be used as something more than punitive. The company would love for this tool to be a true teaching and learning tool.

Most students around the world understand that you can’t just copy and paste. But when a professor says, “Hey, turn this into your own words, make this your own,” that may be a concept some international students have never been taught.

Copyleaks has spoken with many large universities that enroll many international students. They confirm this breakdown in understanding: the kind of synthesis a professor is asking for is not universally taught or understood. Instead, the instruction to explain concepts in their own words is mistaken for simply swapping words and paraphrasing.

As a result, Copyleaks would like their cross-language plagiarism detection tool to be used as a teaching tool, where students can easily see originality reports and receive instruction on idea synthesis.

Cross-language plagiarism in the business world

Increasingly, businesses and organizations are outsourcing their writing. And it is not uncommon for website copy or marketing content to be outsourced to writers in international settings. Here, the ability of writers to copy/paste and translate can create substantial problems.

Consider an early-career writer on a deadline, perhaps one working in their second or third language. If that writer translates a resource from their native language into another, might it be tempting to borrow the translated text without attribution?

If they were to do so, the company receiving the content from the writer might never know that it wasn’t wholly new and original work. But they may find out the hard way, through a lawsuit or a decrease in search engine hits.

Considering widely varying international copyright laws and protections for intellectual property and the emphasis by search engines on originality and authority, cross-language plagiarism detection makes great business sense.

How big of a problem is cross-language plagiarism?

Copyleaks believes plagiarism is a problem and is deeply committed to helping businesses and educational institutions ensure originality, authenticity, and ownership.

A challenge with cross-language plagiarism is that it remains unclear how many instances there may be. As we said above, until recently this type of plagiarism has been nearly impossible to detect. But we do know that the international usage of Google Translate is immense.

A short time ago, Google reported that over 1 billion people had installed their Translate software. At 2021’s Google I/O event, CEO Sundar Pichai revealed that the software translated 20 billion web pages globally in April 2021 alone.

If we extrapolate from that latter statistic, it implies that billions of pages have been translated, and in effect paraphrased, by Google’s software, and that those new pages would be undetectable by traditional plagiarism detection software.
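As a rough back-of-the-envelope version of that extrapolation, the short sketch below scales the April 2021 figure to a full year. The assumption that every month matches April is ours, not Google’s, so the result is only an order-of-magnitude illustration.

```python
# Figure quoted above for April 2021.
pages_in_april_2021 = 20_000_000_000

# Assumption (not from the source): every month looks like April 2021.
months_per_year = 12

annual_estimate = pages_in_april_2021 * months_per_year
print(f"{annual_estimate:,} pages per year under this assumption")
```

Even if only a small fraction of those pages were ever reused as original writing, the volume would still be very large.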

Copyleaks has no doubt that many of these translated pages are being and have been repurposed as “original” text.

How Copyleaks cross-language plagiarism detection can help

Through an innovative application of AI, NLP, and machine learning, Copyleaks has created the capacity to detect cross-language plagiarism. Its text analysis lets the tool understand the meaning, the sentiment, and more, and that allows it to catch things like paraphrasing. Copyleaks offers the only platform that can do this, which makes this engine a key differentiator for the company.

Copyleaks takes advantage of a massive repository of papers and information to combat this plagiarism. Their users have several options they can choose from, depending on their needs and use.

First, they offer a real-time search of the entire web: anything and everything that’s publicly available, across trillions of web pages.
Second, clients can search Copyleaks’ internal repository of information compiled from its existing user base. This database hosts millions of papers and other content.

These results, rather uniquely in the anti-plagiarism software industry, are kept private from other users. All personal information, including names, email addresses, and even the institution from which the paper was submitted, is kept entirely masked from users in originality reports. Users will see the plagiarized, or potentially plagiarized, material, but no more.

Cross-language plagiarism detection can be performed on any uploaded document across nearly 30 languages, more than any other platform, with additional languages regularly added. For example, a document uploaded in English can be checked for potential plagiarism matches in Chinese, Spanish, German, or any other language selected.

The analysis is done by advanced AI and NLP algorithms that can understand the meaning and voice of the text across languages.
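One simple baseline for cross-language comparison is to translate the candidate text first and only then compare it with the source. The sketch below illustrates that baseline with a hypothetical word-by-word glossary; it is a stand-in for illustration only and should not be read as the analysis Copyleaks performs.

```python
# Hypothetical toy French-to-English glossary. A real pipeline would use a
# machine translation system rather than word-by-word lookup.
FR_TO_EN = {
    "le": "the", "renard": "fox", "rapide": "quick", "saute": "jumps",
    "par-dessus": "over", "chien": "dog", "paresseux": "lazy",
}


def translate_words(text, glossary):
    """Word-by-word lookup; unknown words pass through unchanged."""
    return [glossary.get(word, word) for word in text.lower().split()]


def jaccard(words_a, words_b):
    """Jaccard similarity between two collections of words (order ignored)."""
    set_a, set_b = set(words_a), set(words_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


english_source = "the quick fox jumps over the lazy dog"
french_candidate = "le renard rapide saute par-dessus le chien paresseux"

# Compare the raw French text with the English source, then the translated one.
before = jaccard(french_candidate.split(), english_source.split())
after = jaccard(translate_words(french_candidate, FR_TO_EN), english_source.split())

print(before)  # no words in common across languages
print(after)   # full word-set match once translated
```

Word order is ignored here on purpose, because translation usually reorders a sentence; a real system would also need to handle inflection, idiom, and words with several meanings.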

Copyleaks also provides a mechanism to proactively detect plagiarism online. Their business clients, including The United Nations, Cisco, Rakuten, and more, use an advanced API that crawls the web. This tool can be set to scan for similarity (now in multiple languages) and to report to the company any instances of similar text. In this way, Copyleaks can help protect their clients’ intellectual property from theft–even translation and reuse–by others.

With this new tool, Copyleaks continues to expand its AI-based text-analyzing technologies to support learners, faculty, and content creators. With these new and advanced cross-language plagiarism detection capabilities, Copyleaks can provide the next generation of content originality detection and text analysis that can be used across multiple use cases within and outside of education.

Book a demo today to learn more about Copyleaks’ new cross-language plagiarism detection tool or their powerful and innovative AI-driven plagiarism detection software.