Machine Translation with Transformers Using Pytorch
Translate from any language to another in a simple few steps!
Raymond Cheng
Published in Towards Data Science · 4 min read · Jan 18, 2021
Photo by Pisit Heng on Unsplash
Intro
Translation, or more formally, machine translation, is one of the most popular tasks in Natural Language Processing (NLP): translating text from one language to another. In the early days, translation was done by simply substituting words in one language with words in another. However, this does not yield good results, since languages are fundamentally different and a higher level of understanding (e.g. of phrases and sentences) is needed. Modern software therefore adopts statistical and, with the advent of deep learning, neural techniques, which have proven far more effective at translation.
Of course, everyone has access to the powerful Google Translate, but in case you want to know how to implement translation in code yourself, this article will teach you how. It will show how easily you can implement translation with the simple API provided by Huggingface Transformers, a library based on Pytorch.
Now without further ado, let's get started!
Tutorial Overview
Install Library
English to German Translation Example
Custom Language Translation Example
Install Library
Before installing the Transformers library, you will need a working installation of Pytorch. You can install Pytorch by following the instructions on its official website.
After installing Pytorch, you can install Transformers by:
pip install transformers
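To sanity-check the setup (assuming both packages were installed into the currently active Python environment), you can print the installed versions:

```shell
# Confirm that Pytorch and Transformers both import correctly
python -c "import torch; print('torch', torch.__version__)"
python -c "import transformers; print('transformers', transformers.__version__)"
```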
English to German Translation Example
Now, we are ready to do the translation! To translate from English to German, start by importing the pipeline module from Transformers:
from transformers import pipeline
The pipeline provides an easy way to run inference on different tasks through a simple API. You can learn more about the pipeline module here.
To do English to German translation, you need a model that is fine-tuned specifically on this task. T5 is a model that was pre-trained on the massive C4 dataset and trained on a mixture of tasks that includes English-German translation, so we can use it directly in the translation pipeline (we are using the t5-base variant):
translation = pipeline("translation_en_to_de")
## equivalent to:
## translation = pipeline("translation_en_to_de", model="t5-base", tokenizer="t5-base")
Note that we didn't specify any model in this line of code because t5-base is used for translation by default. If you want to use a different model and tokenizer, you can pass them via the model and tokenizer parameters (if they are hosted on Huggingface), or build your own model and tokenizer as demonstrated in the next example (if provided by the community). For more details on the translation pipeline, refer to the official documentation.
Then, you can define the text you want to translate. Letās try to translate this:
I like to study Data Science and Machine Learning
text = "I like to study Data Science and Machine Learning"
Finally, you can use the API provided by the pipeline to translate the text, setting a max_length (e.g. 40 tokens):
translated_text = translation(text, max_length=40)[0]['translation_text']
print(translated_text)
Voila! After tens of seconds, we get the German translation:
Ich studiere gerne Datenwissenschaft und maschinelles Lernen
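As a side note, the translation pipeline also accepts a list of strings, so several sentences can be translated in one call. A minimal sketch (the second example sentence is my own addition for illustration):

```python
from transformers import pipeline

# Build the same English-to-German pipeline as before (t5-base by default)
translator = pipeline("translation_en_to_de", model="t5-base", tokenizer="t5-base")

texts = [
    "I like to study Data Science and Machine Learning",
    "The weather is nice today",
]

# Passing a list returns one result dict per input sentence
results = translator(texts, max_length=40)
for result in results:
    print(result["translation_text"])
```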
Custom Language Translation Example
If you want to translate between any other pair of languages, say English to Chinese, then you need a model fine-tuned on that specific task. Fortunately, thanks to the community established by Huggingface, you most likely don't need to collect your own dataset and fine-tune a model on it. You can head over to Huggingface's model website to see a list of translation models trained on different language pairs.
For our case of translating from English to Chinese, we can directly use the English-to-Chinese pretrained model by Helsinki-NLP. To start, we first import the necessary modules:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
Note that we use AutoModelForSeq2SeqLM here: it is the recommended class for sequence-to-sequence tasks such as translation, replacing the now-deprecated AutoModelWithLMHead.
Then, we can build our model and tokenizer via:
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-en-zh")
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-zh")
Now, we can feed the task name for our language pair, along with the model and tokenizer, into the pipeline:
translation = pipeline("translation_en_to_zh", model=model, tokenizer=tokenizer)
Similar to the previous example, we can use the same code to define our text and translate:
text = "I like to study Data Science and Machine Learning"
translated_text = translation(text, max_length=40)[0]['translation_text']
print(translated_text)
After waiting for several seconds, you should see the Chinese version of the text!
我喜欢学习数据科学和机器学习
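If you are curious what the pipeline does under the hood, here is a rough sketch of the same translation done manually: tokenize the text into tensors, call model.generate, and decode the output ids back into a string. (I use AutoModelForSeq2SeqLM, the current replacement for the deprecated AutoModelWithLMHead; the model checkpoint is the same Helsinki-NLP one.)

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load the same English-to-Chinese model and tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-en-zh")
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-zh")

text = "I like to study Data Science and Machine Learning"

# Tokenize into Pytorch tensors, generate output ids, then decode back to text
inputs = tokenizer(text, return_tensors="pt")
output_ids = model.generate(**inputs, max_length=40)
translated = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(translated)
```

This is exactly the tokenize-generate-decode loop the pipeline wraps; doing it manually gives you access to generation parameters such as num_beams if you want finer control.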
Conclusion
Congratulations! You should now know how to implement translation using the pretrained models offered by Huggingface and the community around it.
And that's all! If you have any questions, feel free to ask below. If you like my work, you can follow me and sign up for my newsletter so that you get notified whenever I publish a new article. You can also take a look at my previous articles if you like. See you all next time :D
Abstractive Summarization Using Pytorch: Summarize any text using Transformers in a few simple steps! (towardsdatascience.com)
Semantic Similarity Using Transformers: Compute Semantic Textual Similarity between two texts using Pytorch and SentenceTransformers (towardsdatascience.com)
BERT Text Classification Using Pytorch: Text classification is a common task in NLP. We apply BERT, a popular Transformer model, on fake news detection using… (towardsdatascience.com)
Fine-tuning GPT2 for Text Generation Using Pytorch: Fine-tune GPT2 for text generation using Pytorch and Huggingface. We train on the CMU Book Summary Dataset to generate… (towardsdatascience.com)
Implementing Transformer for Language Modeling: Training a transformer model using Fairseq (towardsdatascience.com)