Congist’24 Social Sciences and Artificial Intelligence: Theory and Practice

Name: Congist’24 Social Sciences and Artificial Intelligence: Theory and Practice
Start: 2024-12-18T09:00:00+03:00
End: 2024-12-20T20:00:00+03:00
Location: Istanbul University Faculty of Letters

18–20 Dec 2024

Europe/Istanbul timezone

Contact

congist@istanbul.edu.tr

YOKUTM: Systematic Compilation and Analysis of Turkish Theses for Dataset Development

20 Dec 2024, 15:00

30 min

A6 (Istanbul University Faculty of Letters)

Oral Presentation Digital Tools and Techniques Session 3.1 (Day 3)

Efe Özyay (Istanbul University)

Natural Language Processing (NLP) has a long research history. Earlier linguistic research and language modeling relied largely on strict, rule-based frameworks such as the ITU Turkish Natural Language Processing Pipeline (Eryiğit, 2014). Advances in artificial intelligence have since shifted the field toward neural models such as BERT (Devlin et al., 2018). These data-driven models need large datasets of clear, concise, and grammatically accurate text to perform well.

Although open-source datasets are plentiful, many are available only in English, contain grammatical errors, and are poorly cleaned. High-quality multilingual datasets are therefore urgently needed to develop and train robust, diverse NLP models.

In response, we propose a methodology that draws on the freely accessible, comprehensive thesis database of the National Thesis Center (Council of Higher Education of Turkey, 2007). It aims to build and maintain a large linguistic dataset from Turkish academic theses using fast, robust open-source software, and to make that dataset accessible to everyone.
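The abstract does not describe the pipeline itself, so the following is only a minimal sketch of what one cleaning step in such a compilation could look like. It assumes thesis text has already been extracted to plain strings; the function names (`clean_text`, `build_records`, `to_jsonl`), the record fields, and the JSON Lines output are all hypothetical choices for illustration, not the authors' method.

```python
import json
import re
import unicodedata


def clean_text(raw: str) -> str:
    """Normalize Unicode, rejoin words hyphenated across line breaks,
    and collapse runs of whitespace into single spaces."""
    text = unicodedata.normalize("NFC", raw)
    # Rejoin words split by a hyphen at a line end, e.g. "iş-\nleme" -> "işleme".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip()


def build_records(theses):
    """Yield one record per thesis, skipping empty and exact-duplicate texts.

    `theses` is an iterable of (thesis_id, raw_text) pairs; both the input
    shape and the record fields are assumptions made for this sketch.
    """
    seen = set()
    for thesis_id, raw in theses:
        text = clean_text(raw)
        if not text or text in seen:
            continue
        seen.add(text)
        yield {"id": thesis_id, "text": text}


def to_jsonl(records) -> str:
    """Serialize records as JSON Lines, keeping Turkish characters readable."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


if __name__ == "__main__":
    sample = [
        ("t1", "Doğal dil iş-\nleme   çalışmaları"),
        ("t2", "Doğal dil işleme çalışmaları"),  # duplicate after cleaning
        ("t3", "   "),                            # empty after cleaning
    ]
    print(to_jsonl(build_records(sample)))
```

Exact-match deduplication is the simplest possible choice here; a real corpus of theses would likely need near-duplicate detection and language filtering as well.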

The curated dataset is intended as a resource for NLP researchers and developers, supporting work on language models, retrieval-augmented generation frameworks, text summarization, and other AI-driven linguistic applications. We hope that our work not only meets the need for high-quality Turkish linguistic data but also serves as a model for similar efforts in other languages, fostering inclusivity and diversity in NLP research.

Keywords: Natural Language Processing, Large Language Models, Retrieval-Augmented Generation, Dataset Creation, Academic Text Processing

Institution / Affiliation

Istanbul University, Faculty of Letters, Linguistics

Presentation language	TR (Turkish)
Disciplines	Linguistics
E-mail	efeozyay@ogr.iu.edu.tr
ORCID ID	0009-0005-5894-6098
Country	Turkey

efe_ozyay_yokutm_abstract.docx

efe_ozyay_yokutm_bio.docx
