If you are using standard formats such as TMX or XLIFF, no technical cleanup of the training data is required, and there is no need to remove the tags. However, the training data must be relevant, clean, and consistent from a language perspective, so linguistic cleaning is always recommended.
Yes and no. Using more training data usually improves the quality, but quantity is not everything. It is just as important to use relevant and consistent content for your training. Therefore, if you meet the minimum requirements in terms of volume, it is more important to use qualified data. It makes no sense to increase the training data set with unknown or unqualified content. In some cases, reducing the training data by removing obsolete content can even improve the quality.
Yes, your all-important training data is safe. Each Globalese system is a dedicated, single-tenant system. The training data is only used to train your engines, run your translations, and improve the quality of your engines. We do not provide or sell your data to third parties, nor do we use it to improve any other customer's engines.
These include the option for custom engines, AI-boosted engines, terminology support, custom prompts, tag handling, and, last but not least, the price-to-quality ratio. However, this is only our opinion. We highly recommend that you start a free trial so that you can judge for yourself.
We natively support the most important standard bilingual formats: XLIFF, TMX, TBX, and bilingual CAT tool files. Other formats can be supported through CAT tool or API integration.
This depends on the size of the training data, the engine, and the type of training. Stock+ engines can be trained quickly, usually taking between half an hour and a few hours. Domain-adapted engines take longer for the initial training, typically between 24 and 32 hours; a quick training, however, takes only a few hours.
A full training always begins from scratch, so the training time is longer. In exchange, the engine gains a better understanding of the content in the new or updated master corpora. A quick training runs faster, but it is better thought of as tuning the existing engine on the new or updated master corpora than as a full training.
A full retraining of an engine is recommended if there are significant changes to the master corpus. For example, if a large volume of new content has been added (over 10% of the existing master corpus size), or if terminology has been changed in the master training data, run a full training. If only a small amount of new training data has been added to the master corpus, a quick training is enough.
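The rule of thumb above can be written down as a small decision helper. This is only a sketch: the function name `recommend_training`, its parameters, and the choice of segments as the unit of corpus size are assumptions made for illustration and are not part of any Globalese API.

```python
def recommend_training(
    existing_segments: int,
    new_segments: int,
    terminology_changed: bool = False,
    threshold: float = 0.10,  # the "over 10%" guideline from the text
) -> str:
    """Return "full" or "quick" following the rule of thumb.

    Hypothetical helper: corpus size is counted in segments here,
    which is an assumption; any consistent unit would work.
    """
    # Assumption: with no existing master corpus there is nothing to tune.
    if existing_segments <= 0:
        return "full"
    # A terminology change in the master data calls for a full training.
    if terminology_changed:
        return "full"
    # Otherwise compare the relative growth of the master corpus.
    growth = new_segments / existing_segments
    return "full" if growth > threshold else "quick"


print(recommend_training(100_000, 15_000))  # 15% growth
print(recommend_training(100_000, 5_000))   # 5% growth
```

The threshold is kept as a parameter because the 10% figure is a guideline, not a hard limit.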
This depends largely on how often new training data is available for the master corpus of the engine. For a frequently used engine with quickly growing new training data sets, a typical training cycle can be one full retraining every month, with weekly quick trainings during the month. For less frequently used engines, a typical training cycle can be one full retraining every three months, with one quick training every month.
There are several options for measuring the quality of an engine. The most common method is to use automated metrics such as BLEU or chrF. These compare the MT output with a human reference translation and provide a score. Most CAT tools provide options for such measurements. However, these metrics do not actually show the translation quality. They only indicate how close the MT output is to the reference translation. For example, if the MT output uses a linguistically correct synonym or a slightly different word order, the automated metrics will penalize it, even though the translation might be as good as, or even better than, the reference. Therefore, we recommend at least spot checks by human translators or post-editors.
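To illustrate why a valid synonym lowers an overlap-based score, here is a toy character n-gram F-score in the spirit of chrF. It is a simplified sketch, not the real BLEU or chrF implementation; the function names and the example sentences are invented for this illustration.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after lowercasing and normalizing whitespace."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def toy_ngram_f1(hypothesis: str, reference: str, n: int = 3) -> float:
    """F1 over shared character n-grams (toy metric, for illustration only)."""
    hyp = char_ngrams(hypothesis, n)
    ref = char_ngrams(reference, n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((hyp & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "Press the button to start the engine."
exact = "Press the button to start the engine."
synonym = "Push the button to launch the motor."  # acceptable wording, different words

print(toy_ngram_f1(exact, reference))
print(toy_ngram_f1(synonym, reference))
```

The second sentence scores lower than the first purely because it shares fewer character sequences with the reference, which is why a human spot check remains necessary.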
The Document Translation service is an asynchronous translation service which can be used for batch pre-translation of files via the browser-based UI, CAT tool plugins or the API.
The Cloud Text Translation service is a synchronous translation service with an auto-scaling feature. It can be used to deliver MT matches on the fly while you translate in a CAT tool, for pre-translation from within your CAT tool, or via the API for any synchronous use case.
The two services are priced differently and use different plugins and API integrations.