Mathematical software and computer program for the problem of clustering text articles

Date

2023

Supervisor

Journal title

ISSN

Volume title

Publisher

Igor Sikorsky Kyiv Polytechnic Institute

Abstract

The thesis comprises 85 pages, 2 appendices, and a bibliography of 19 references; it includes 19 figures and 4 tables and is accompanied by presentation slides.
Topic relevance. Today's world is digital: many people work online and look for information on websites. Yet how is a reasonable result obtained so quickly when millions of pieces of information match the phrase that was entered? This work considers how to solve text classification problems using mathematical software and computer programs, and how to return, for a given phrase, as much exact or similar information as possible without errors or omissions. This is done through a number of models and algorithms, each of which is described in detail in the thesis. The thesis therefore addresses the problem of text classification through mathematics and software, so that these problems can be solved or largely eliminated.
Clustering text content is essential for extracting useful information from online or other text resources. The common approach to text clustering is to represent texts in a multidimensional space and partition the documents into groups, where each group contains similar documents. However, this strategy does not give people a comprehensive overview, since it cannot explain the main topic of each cluster. Using semantic information may solve this problem, but it requires a clearly defined ontology or a pre-labeled gold standard. This work presents a thematic algorithm for clustering text documents: thematic terms are extracted from the given text and used to cluster documents in a probabilistic framework.
Purpose and objectives of the study: Clustering aims to identify different groups in the data set.
The thesis develops mathematical software and a computer program for clustering text data, in order to improve the quality and productivity of staff working with text documents. The basic idea of model-based clustering is to approximate the density of the data with a mixture model. The purpose of the work is to develop mathematical software and a computer program for clustering text articles that visualize objects and automatically detect groups of semantically similar documents within a given fixed set.
End result of the work: mathematical software and a computer program for clustering text articles, intended to improve the quality and productivity of staff working with text documents.
Object of research: methods of clustering text data; data mining methods; methods for filtering out non-informative words (stop-word removal, stemming, and case normalization); methods for selecting keywords and classifying results (dictionary-based, statistical, the TF-IDF measure, the F-measure).
Subject of research: an algorithm for partitioning text articles into clusters; models for verifying the adequacy of the algorithm; a comparative analysis of clustering algorithms for text articles in terms of their mathematical and software support.
Connection of the work with scientific programs, plans, and topics: The work was carried out at the Department of Applied Mathematics of the National Technical University of Ukraine "Igor Sikorsky Kyiv Polytechnic Institute" within the topic "Mathematical software and computer program for the problem of clustering text articles".
Research methods: methods of systems analysis theory, systems engineering, modeling, data science systems design, natural language processing, mathematical statistics, classical data analysis, machine learning, big data theory, data visualization, and clustering.
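The methods named above (case normalization, stop-word removal, stemming, TF-IDF weighting) together with the k-means algorithm listed in the keywords can be illustrated with a minimal sketch. It is not the thesis's implementation: the stop-word list, the suffix-stripping rule used in place of a real stemmer, the TF-IDF variant (relative term frequency times log(N/df)), the choice of the first k documents as initial centroids, and the English sample texts are all assumptions made only for this illustration.

```python
import math
from collections import Counter

# Hypothetical stop-word list and suffix rules, chosen only for this sketch.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "is", "are", "to", "for"}
SUFFIXES = ("ing", "ed", "es", "s")


def preprocess(text):
    """Lowercase, keep letters only, drop stop words, strip one suffix."""
    tokens = []
    for raw in text.lower().split():
        word = "".join(ch for ch in raw if ch.isalpha())
        if not word or word in STOP_WORDS:
            continue
        # Naive stand-in for stemming: remove the first matching suffix.
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                word = word[: -len(suffix)]
                break
        tokens.append(word)
    return tokens


def tf_idf(docs):
    """Return the sorted vocabulary and one TF-IDF vector per document."""
    tokenized = [preprocess(d) for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(t for doc in tokenized for t in set(doc))
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for empty documents
        vectors.append(
            [(counts[t] / total) * math.log(n / df[t]) for t in vocab]
        )
    return vocab, vectors


def k_means(vectors, k, iterations=20):
    """Lloyd's k-means with squared Euclidean distance.

    The first k vectors serve as initial centroids (an assumption that
    keeps the sketch deterministic).
    """
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every vector.
        labels = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])),
            )
            for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Made-up sample documents on two topics.
docs = [
    "Cats and dogs are pets",
    "Stock markets and trading",
    "Dogs and cats playing",
    "Trading stocks in markets",
]
vocab, vectors = tf_idf(docs)
labels = k_means(vectors, k=2)
print(labels)
```

In this toy run the two documents about pets share one label and the two about markets share the other. A real system would need a language-appropriate stemmer and stop-word list, and a more careful centroid initialization.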
Scientific novelty: New scientific results are presented in the development and implementation of text classification methods and in the identification of their problems. The aim is to make it easier for the user to cover many subjects, and to recognize words and classify them by their most likely meanings, so that many results are returned and they are accurate. The mathematical software and computer program are implemented using machine learning algorithms within a purpose-built system.
Practical value of the obtained results: The developed system is significant for online and Internet applications. It addresses the problems that arise in text classification. The system can be used for search, for finding similar phrases, and for handling stop-word problems; it finds the best results in the shortest time, which saves the user time. The system recognizes the query and provides as many results as possible, as quickly as possible.
Approbation of the thesis results. Publications: V. Tretynyk, Naser J. Hamad. SYSTEM OF CLASTERIZATION OF ARABIC PAPERS // Applied Mathematics and Computing (Прикладна математика та комп’ютинг), ПМК 2022 : 15th scientific conference of master's and postgraduate students, Kyiv, 16-18 November 2022 : book of abstracts / [ed. board: Dychka I. et al.]. – Kyiv : Prosvita, 2022. – pp. 180-186.

Description

Keywords

text mining, stop word selection, stemming, k-means, clustering accuracy

Bibliographic description

Hamad Naser, J. H. Mathematical software and computer program for the problem of clustering text articles : master's thesis : 172 Applied Mathematics / Hamad Naser J. H. – Kyiv, 2023. – 95 p.

DOI