1. Introduction and Motivation
Knowledge organization in Digital Humanities (DH) relies heavily on controlled vocabularies, thesauri, and ontologies, primarily modeled using the Simple Knowledge Organization System (SKOS). A significant barrier exists due to the predominance of English in these resources, which excludes non-native speakers and under-represents diverse cultures and languages. Multilingual thesauri are crucial for inclusive research infrastructures, yet their manual creation is not scalable. Classical Machine Translation (MT) methods fail in DH contexts due to a lack of domain-specific bilingual corpora. This paper introduces WOKIE (Well-translated Options for Knowledge Management in International Environments), an open-source, modular pipeline that combines external translation services with targeted refinement using Large Language Models (LLMs) to automate the translation of SKOS thesauri, balancing quality, scalability, and cost.
2. The WOKIE Pipeline: Architecture and Workflow
WOKIE is designed as a configurable, multi-stage pipeline that requires no prior expertise in MT or LLMs. It runs on everyday hardware and can utilize free translation services.
2.1 Core Components
The pipeline consists of three main stages:
- Initial Translation: A SKOS thesaurus is parsed, and its labels (prefLabel, altLabel) are sent to multiple configurable external translation services (e.g., Google Translate, DeepL API).
- Candidate Aggregation & Disagreement Detection: Translations for each term are collected. A key innovation is the detection of "disagreement" among the services: when the pairwise similarity of the candidate translations falls below a configurable threshold, the term is routed to the refinement stage.
- LLM-Based Refinement: For terms where initial translations disagree, the candidate translations and the original term are fed to an LLM (e.g., GPT-4, Llama 3) with a carefully crafted prompt asking for the best possible translation and justification.
2.2 LLM-Based Refinement Logic
The selective use of LLMs is central to WOKIE's design. Instead of translating every term with an LLM (costly, slow, potentially hallucinatory), LLMs are only deployed as arbiters for difficult cases. This hybrid approach leverages the speed and low cost of standard MT APIs for straightforward translations, reserving LLM compute for terms where consensus is lacking, thereby optimizing the trade-off between quality and resource expenditure.
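This routing decision can be sketched in a few lines. The function and threshold names below are hypothetical (WOKIE's actual interfaces may differ), and simple pairwise string similarity stands in for whatever similarity measure the pipeline uses:

```python
# Sketch of disagreement-based routing: candidates from external MT services
# are compared pairwise; if any pair falls below a similarity threshold,
# the term is flagged for LLM refinement. Names are illustrative, not WOKIE's API.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Normalized string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def needs_refinement(candidates: list[str], threshold: float = 0.8) -> bool:
    """Flag a term for LLM refinement when any two service translations disagree."""
    for i in range(len(candidates)):
        for j in range(i + 1, len(candidates)):
            if similarity(candidates[i], candidates[j]) < threshold:
                return True
    return False
```

With this logic, identical candidates pass straight through, while divergent ones (e.g. "wall painting" vs. "mural painting" at a strict threshold) trigger the LLM arbiter.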
3. Technical Details and Methodology
WOKIE is implemented in Python, leveraging libraries like RDFLib for SKOS parsing. The system's efficacy hinges on its intelligent routing mechanism.
3.1 Translation Quality Assessment Metric
To evaluate translation quality, the authors employed a combination of automated metrics and expert human evaluation. For automated scoring, they adapted the BLEU (Bilingual Evaluation Understudy) score, commonly used in MT research, but noted its limitations for short, terminological phrases. The core evaluation focused on the improvement in Ontology Matching (OM) performance, using standard OM systems like LogMap and AML. The hypothesis was that higher-quality translations would lead to better alignment scores. The performance gain $G$ for a thesaurus $T$ after translation can be formulated as:
$G(T) = \frac{Score_{matched}(T_{translated}) - Score_{matched}(T_{original})}{Score_{matched}(T_{original})}$
where $Score_{matched}$ is the F-measure from the ontology matching system.
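For concreteness, the gain can be computed directly from the two F-measures. This trivial helper is my own, not from the paper:

```python
def matching_gain(f_translated: float, f_original: float) -> float:
    """Relative gain G(T) = (F_translated - F_original) / F_original."""
    if f_original == 0:
        raise ValueError("original F-measure must be non-zero")
    return (f_translated - f_original) / f_original
```

For example, an F-measure rising from 0.50 to 0.65 yields G = 0.30, a 30% relative improvement.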
4. Experimental Results and Evaluation
The evaluation covered several DH thesauri across 15 languages, testing different parameters, translation services, and LLMs.
Key Experimental Statistics
- Thesauri Evaluated: Multiple (e.g., Getty AAT, GND)
- Languages: 15, including German, French, Spanish, Chinese, Arabic
- LLMs Tested: GPT-4, GPT-3.5-Turbo, Llama 3 70B
- Baseline Services: Google Translate, DeepL API
4.1 Translation Quality Across Languages
Human evaluation showed that the WOKIE pipeline (external MT + LLM refinement) consistently outperformed using any single external translation service alone. The quality improvement was most pronounced for:
- Low-resource languages: Where standard APIs often fail.
- Domain-specific terminology: Terms with cultural or historical nuance (e.g., "fresco secco," "codex") where generic MT provides literal but inaccurate translations.
Chart Description (Imagined): A bar chart comparing BLEU scores (or human evaluation scores) across four conditions: Google Translate alone, DeepL alone, WOKIE with GPT-3.5 refinement, and WOKIE with GPT-4 refinement. The bars for WOKIE configurations are significantly higher, especially for language pairs like English-to-Arabic or English-to-Chinese.
4.2 Ontology Matching Performance Improvement
This is the paper's primary quantitative result. After processing non-English thesauri through WOKIE to add English labels, the F-measure scores of ontology matching systems (LogMap, AML) increased substantially, by an average of 22-35% depending on the language and thesaurus complexity. This demonstrates the core utility of the pipeline: it directly enhances semantic interoperability by making non-English resources discoverable to English-centric OM tools.
Chart Description (Imagined): A line graph showing the F-measure of ontology matching on the y-axis against different translation methods on the x-axis. The line starts low for "No Translation," rises slightly for "Single MT Service," and peaks sharply for "WOKIE Pipeline."
4.3 Performance and Cost Analysis
By using LLMs selectively only for disagreed-upon terms (typically 10-25% of the total), WOKIE reduced LLM API costs by 75-90% compared to a naive full-LLM translation approach, while retaining ~95% of the quality benefit. Processing time was dominated by LLM calls, but the overall pipeline remained feasible for medium-sized thesauri on standard hardware.
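The claimed savings follow from simple arithmetic. A sketch of the cost model, with per-term costs as free parameters (the specific numbers below are assumptions for illustration, not figures from the paper):

```python
# Cost model for the hybrid pipeline: every term is translated by external MT;
# only the disagreed-upon fraction also incurs an LLM call.
def hybrid_cost(n_terms: int, disagreement_rate: float,
                mt_cost: float, llm_cost: float) -> float:
    """Total cost of the hybrid pipeline for n_terms."""
    return n_terms * mt_cost + n_terms * disagreement_rate * llm_cost

def saving_vs_full_llm(disagreement_rate: float,
                       mt_cost: float, llm_cost: float) -> float:
    """Fractional cost saving relative to sending every term to the LLM."""
    return 1.0 - hybrid_cost(1, disagreement_rate, mt_cost, llm_cost) / llm_cost
```

With MT effectively free relative to LLM calls and 10-25% of terms routed to the LLM, the saving works out to 75-90%, matching the range reported above.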
5. Analysis Framework: A Non-Code Case Study
Scenario: A European digital library holds a German-language thesaurus for medieval art techniques. Researchers in Japan cannot find relevant resources because their ontology matching tools only process English labels.
WOKIE Application:
- Input: German term "Wandmalerei" (wall painting).
- Stage 1 (External MT): Google Translate returns "wall painting." DeepL returns "mural painting." Microsoft Translator returns "wall painting." There is disagreement ("mural" vs. "wall").
- Stage 2 (Disagreement Detection): The similarity between candidates is below the threshold. LLM refinement is triggered.
- Stage 3 (LLM Refinement): Prompt: "Given the German art history term 'Wandmalerei' and candidate English translations ['wall painting', 'mural painting'], which is the most accurate and contextually appropriate term for a SKOS thesaurus in art history? Consider specificity and common usage in the field."
- LLM Output: "In the context of art history thesauri like the Getty AAT, 'mural painting' is the more precise and commonly used descriptor for 'Wandmalerei,' as it specifically denotes painting applied directly to a wall or ceiling."
- Result: The SKOS concept gets the prefLabel "mural painting," enabling accurate matching with English-language ontologies.
6. Future Applications and Research Directions
- Beyond Translation: Extending WOKIE to suggest new related concepts or altLabels in the target language, acting as a thesaurus augmentation tool.
- Integration with Foundational Models: Leveraging vision-language models (like CLIP) to translate concepts based on associated images in digital collections, not just text.
- Active Learning Loop: Incorporating human-in-the-loop feedback to correct LLM outputs, continuously improving the pipeline's domain-specific performance.
- Standardization of Evaluation: Developing a dedicated benchmark suite for evaluating SKOS/thesaurus translation quality, moving beyond BLEU to metrics that capture hierarchical and relational preservation.
- Broader Knowledge Organization Systems (KOS): Applying the hybrid MT+LLM refinement principle to more complex ontologies (OWL) beyond SKOS.
7. References
- Kraus, F., Blumenröhr, N., Tonne, D., & Streit, A. (2025). Mind the Language Gap in Digital Humanities: LLM-Aided Translation of SKOS Thesauri. arXiv preprint arXiv:2507.19537.
- Miles, A., & Bechhofer, S. (2009). SKOS Simple Knowledge Organization System Reference. W3C Recommendation. https://www.w3.org/TR/skos-reference/
- Vaswani, A., et al. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems 30 (NIPS 2017).
- Carroll, J. J., & Stickler, P. (2004). RDF Triples in the Semantic Web. IEEE Internet Computing.
- Getty Research Institute. (2024). Art & Architecture Thesaurus (AAT). https://www.getty.edu/research/tools/vocabularies/aat/
- Papineni, K., et al. (2002). BLEU: a Method for Automatic Evaluation of Machine Translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL).
8. Expert Analysis: Core Insight, Logical Flow, Strengths & Flaws, Actionable Insights
Core Insight: WOKIE isn't just another translation tool; it's a pragmatic, cost-conscious interoperability engine for the balkanized world of cultural heritage data. Its real innovation is recognizing that perfect AI translation is a fool's errand for niche domains, and instead, it uses LLMs as a high-precision scalpel rather than a blunt hammer. The paper correctly identifies the root problem in DH: English is the de facto query language for linked data, creating a silent exclusion of vast non-English knowledge reservoirs. WOKIE's goal isn't poetic translation but enabling discovery, a far more achievable and impactful target.
Logical Flow: The argument is compelling and well-structured. It starts with an undeniable pain point (language exclusion in DH), demolishes the obvious solutions (manual work is impossible, classic MT fails due to data scarcity), and positions LLMs as a potential but flawed savior (cost, hallucinations). Then, it introduces the elegant hybrid model: use cheap, fast APIs for the 80% easy cases, and deploy expensive, smart LLMs only as arbiters for the contentious 20%. This "disagreement detection" is the clever kernel of the project. The evaluation logically ties translation quality to the concrete, measurable outcome of improved ontology matching scores, proving real-world utility beyond subjective translation quality.
Strengths & Flaws:
Strengths: The hybrid architecture is commercially savvy and technically sound. The focus on SKOS, a W3C standard, ensures immediate relevance. The open-source nature and design for "everyday hardware" lower adoption barriers dramatically. Evaluating on OM performance is a masterstroke—it measures utility, not just aesthetics.
Flaws: The paper glosses over prompt engineering, which is the make-or-break factor for LLM refinement. A bad prompt could make the LLM layer useless or harmful. The evaluation, while sensible, is still somewhat siloed; how does WOKIE compare to fine-tuning a small, open-source model like NLLB on DH text? The long-term cost trajectory of LLM APIs is a risk factor for sustainability not fully addressed.
Actionable Insights:
- For DH Institutions: Pilot WOKIE immediately on one key non-English thesaurus. The ROI in improved resource discovery and alignment with major hubs like Europeana or the DPLA could be significant. Start with free-tier translation services to validate the approach.
- For Developers: Contribute to the WOKIE codebase, especially in creating a library of optimized, domain-tuned prompts for different DH sub-fields (archaeology, musicology, etc.).
- For Funders: Fund the creation of a gold-standard, multilingual DH terminology benchmark to move the field beyond BLEU scores. Support projects that integrate WOKIE's output into active learning systems.
- Critical Next Step: The community must develop a governance model for these machine-translated labels. They should be clearly tagged as "machine-augmented" to maintain scholarly integrity, following the data provenance principles championed by initiatives like the Research Data Alliance (RDA).
In conclusion, WOKIE represents the kind of pragmatic, use-case-driven AI application that will actually change workflows. It doesn't chase AGI; it solves a specific, painful problem with a clever blend of old and new tech. Its success will be measured not in BLEU points, but in the number of previously invisible historical records that suddenly become findable to a global researcher.