Machine Translation with Transformers Using Pytorch


Translate from any language to another in a simple few steps!

Raymond Cheng

Published in Towards Data Science · 4 min read · Jan 18, 2021


Photo by Pisit Heng on Unsplash

Intro

Translation, or more formally, machine translation, is one of the most popular tasks in Natural Language Processing (NLP): converting text from one language to another. Early systems worked by simply substituting words in one language with words in another. However, this does not yield good results, since languages differ fundamentally and good translation requires understanding at a higher level (e.g. phrases/sentences). With the advent of deep learning, modern software adopts statistical and neural techniques, which have proven far more effective for translation.

Of course, everyone has access to the powerful Google Translate, but if you want to know how to implement translation in code, this article will teach you how. It shows how easily you can implement translation with the simple API provided by Huggingface Transformers, a library based on Pytorch.

Now without further ado, letā€™s get started!

Tutorial Overview

Install Library

English to German Translation Example

Custom Language Translation Example

Install Library

Before installing the Transformers library, you will need to have a working version of Pytorch installed. You can install Pytorch by going to its official website .

After installing Pytorch, you can install Transformers by:

pip install transformers

English to German Translation Example

Now, we are ready to do the translation! If you want to do English to German Translation, then you can start by importing the relevant pipeline module in Transformers:

from transformers import pipeline

The pipeline provides a simple API for running inference on many different tasks. You can learn more about the pipeline module here.
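To illustrate how general this API is, here are a few of the task identifiers the pipeline factory accepts (names taken from the Transformers documentation); each returns a ready-to-use inference object backed by a default model:

```python
# A few task identifiers accepted by the Transformers pipeline() factory.
# Translation tasks follow the "translation_<src>_to_<tgt>" pattern.
PIPELINE_TASKS = [
    "sentiment-analysis",
    "question-answering",
    "summarization",
    "translation_en_to_de",
    "text-generation",
]

for task in PIPELINE_TASKS:
    print(task)
```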

To do English to German translation, you need a model fine-tuned on this specific task. T5 is a model trained on the massive C4 dataset, which includes English-German translation data, so we can use it directly in the translation pipeline (here, the t5-base variant):

translation = pipeline("translation_en_to_de")
## same as:
## translation = pipeline("translation_en_to_de", model="t5-base", tokenizer="t5-base")

Note that we didn't specify any model in this line of code because, by default, t5-base is used for translation. If you want a different model and tokenizer, pass them via the model and tokenizer parameters (if they are available on Huggingface), or build your own model and tokenizer as demonstrated in the next example (for community-provided models). For more details on the translation pipeline, you can refer to the official documentation here.

Then, you can define the text you want to translate. Letā€™s try to translate this:

I like to study Data Science and Machine Learning

text = "I like to study Data Science and Machine Learning"

Finally, you can call the pipeline to translate, setting a max_length (e.g. 40 tokens):

translated_text = translation(text, max_length=40)[0]['translation_text']
print(translated_text)

Voilà! After a few dozen seconds, we get the German translation:

Ich studiere gerne Datenwissenschaft und maschinelles Lernen
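The indexing in the call above reflects the pipeline's return format: a list with one dict per input sentence, each holding a `translation_text` key. A minimal sketch of that unpacking, using a hard-coded result in place of a real model call:

```python
# The translation pipeline returns a list of dicts, one per input sentence.
# Here a hard-coded result stands in for an actual model call.
mock_output = [
    {"translation_text": "Ich studiere gerne Datenwissenschaft und maschinelles Lernen"}
]

translated_text = mock_output[0]["translation_text"]
print(translated_text)
```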

Custom Language Translation Example

If you want to translate between any other pair of languages, say English to Chinese, then you need a model fine-tuned on that specific pair. Fortunately, thanks to the community established by Huggingface, you most likely don't need to collect your own dataset and fine-tune a model on it. You can head directly to Huggingface's model website to see a list of translation models trained on different language pairs.
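The Helsinki-NLP translation models follow a predictable naming scheme, opus-mt-<src>-<tgt>, using ISO 639-1 language codes. Here is a tiny hypothetical helper that builds a model id for a given pair; whether that pair actually exists still has to be checked on the model hub:

```python
def opus_mt_model_id(src: str, tgt: str) -> str:
    """Build a Helsinki-NLP OPUS-MT model id from ISO 639-1 language codes.

    Hypothetical convenience helper: the resulting id must still exist on
    the Huggingface model hub for from_pretrained() to succeed.
    """
    return f"Helsinki-NLP/opus-mt-{src}-{tgt}"

print(opus_mt_model_id("en", "zh"))  # Helsinki-NLP/opus-mt-en-zh
```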

For our case of translating from English to Chinese, we can use the pretrained English-to-Chinese model from Helsinki-NLP directly. To start, we first import the necessary modules:

from transformers import AutoModelWithLMHead, AutoTokenizer

Then, we can build our model and tokenizer via (note that in recent versions of Transformers, AutoModelWithLMHead is deprecated; AutoModelForSeq2SeqLM can be used the same way here):

model = AutoModelWithLMHead.from_pretrained("Helsinki-NLP/opus-mt-en-zh")
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-zh")

Now, we can feed in the language pairs we want to translate, model, and tokenizer into the pipeline:

translation = pipeline("translation_en_to_zh", model=model, tokenizer=tokenizer)

Similar to the previous example, we can use the same code to define our text and translate:

text = "I like to study Data Science and Machine Learning"
translated_text = translation(text, max_length=40)[0]['translation_text']
print(translated_text)

After waiting for several seconds, you should see the Chinese version of the text!

ęˆ‘å–œę¬¢å­¦ä¹ ę•°ę®ē§‘学和ęœŗå™Ø学习

Conclusion

Congratulations! You now know how to implement translation using the pretrained models offered by Huggingface and the community around it. In case you want to see what the complete code looks like, here's the Jupyter code:

And that's all! If you have any questions, feel free to ask them below. If you like my work, you can follow me and sign up for my newsletter to get notified whenever I publish a new article! You can also take a look at my previous articles. See you all next time :D

Abstractive Summarization Using Pytorch Summarize any text using Transformers in a few simple steps! towardsdatascience.com

Semantic Similarity Using Transformers Compute Semantic Textual Similarity between two texts using Pytorch and SentenceTransformers towardsdatascience.com

BERT Text Classification Using Pytorch Text classification is a common task in NLP. We apply BERT, a popular Transformer model, on fake news detection usingā€¦ towardsdatascience.com

Fine-tuning GPT2 for Text Generation Using Pytorch Fine-tune GPT2 for text generation using Pytorch and Huggingface. We train on the CMU Book Summary Dataset to generateā€¦ towardsdatascience.com

Implementing Transformer for Language Modeling Training a transformer model using Fairseq towardsdatascience.com

References

[1] Transformers Github , Huggingface

[2] Transformers Official Documentation , Huggingface

[3] Pytorch Official Website , Facebook AI Research

[4] Raffel, Colin, et al. ā€œExploring the limits of transfer learning with a unified text-to-text transformer.ā€ arXiv preprint arXiv:1910.10683 (2019).

[5] Tensorflow Datasets , Google
