Stirenko, Sergii
Xu Jiashu
2024-03-11
2024-03-11
2024
Xu Jiashu. Research and development of self-supervised visual feature learning based on neural networks : thesis ... doctor of philosophy : 121 Software engineering / Xu Jiashu. – Kyiv, 2024. – 168 p.
https://ela.kpi.ua/handle/123456789/65406

Xu Jiashu. Research and development of self-supervised visual feature learning based on neural networks. Qualifying scientific work on the rights of a manuscript. Dissertation for the degree of Doctor of Philosophy in the specialty 121 - Software Engineering and 12 - Information Technologies. National Technical University of Ukraine "Igor Sikorsky Kyiv Polytechnic Institute", Kyiv, 2024.

This dissertation explores in depth the design and development of self-supervised learning algorithms, a subset of unsupervised learning techniques that operate without labeled datasets. These algorithms pre-train models without labels, and the resulting models perform on par with their supervised counterparts across a range of downstream applications. The approach reduces the heavy dependence on data labeling that is typical of deep learning, which improves efficiency and practical utility in real-world scenarios. Self-supervised learning is especially relevant to medical image analysis, where data annotation is laborious and demands high precision because of the critical nature of the data. Accurate annotations are also hard to obtain because few specialists can provide them, which underscores the potential of self-supervised learning in this domain.
This dissertation presents a self-supervised learning method that uses the Mixup Feature as the reconstruction target of the pretext task. The pretext task learns visual representations by predicting Mixup features from a masked image, using these feature maps to extract high-level semantic information. The dissertation validates the Mixup Feature as a prediction target in self-supervised learning frameworks. This involved careful tuning of the hyperparameter of the Mixup Feature operation, which produces blended feature maps that combine Sobel edge detection maps, Histogram of Oriented Gradients (HOG) maps, and Local Binary Pattern (LBP) maps, providing a rich, multifaceted representation of visual data. The Vision Transformer was selected as the principal architecture for the experiments because of its proficiency in handling complex visual inputs and its emphasis on critical image regions. This choice was reinforced by the Masked AutoEncoder (MAE) approach, which showed that full images can be reconstructed from partially visible inputs, enhancing the model's predictive capabilities in a self-supervised setting.

A denoising self-distillation Masked Autoencoder model for self-supervised learning was also developed. This model combines elements of Siamese Networks and Masked Autoencoders in a three-part architecture: a student network in the form of a masked autoencoder, an intermediate regressor, and a teacher network. The proxy task for this model is to restore input images that have been artificially corrupted with patches of random Gaussian noise.
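As a rough illustration of the two ingredients described above, the following minimal sketch builds a Mixup-style target from two hand-crafted feature maps and corrupts an image with Gaussian noise patches. It is not the dissertation's implementation. The function names, the weight name `lam`, the min-max normalization before blending, and the choice to overwrite (rather than add noise to) the selected patches are all assumptions made for illustration; HOG maps are omitted for brevity, and plain Python lists are used to keep the example dependency-free.

```python
import random

def sobel_magnitude(img):
    """Sobel gradient magnitude of a 2D grayscale image (list of rows); border stays 0."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = (img[y-1][x+1] + 2 * img[y][x+1] + img[y+1][x+1]
                  - img[y-1][x-1] - 2 * img[y][x-1] - img[y+1][x-1])
            gy = (img[y+1][x-1] + 2 * img[y+1][x] + img[y+1][x+1]
                  - img[y-1][x-1] - 2 * img[y-1][x] - img[y-1][x+1])
            out[y][x] = (gx * gx + gy * gy) ** 0.5
    return out

def lbp_codes(img):
    """Basic 8-neighbour Local Binary Pattern code per interior pixel; border stays 0."""
    h, w = len(img), len(img[0])
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            code = 0
            for bit, (dy, dx) in enumerate(offsets):
                if img[y + dy][x + dx] >= img[y][x]:
                    code |= 1 << bit
            out[y][x] = float(code)
    return out

def normalize(m):
    """Min-max scale a map to [0, 1] so maps with different ranges can be blended."""
    lo = min(min(row) for row in m)
    hi = max(max(row) for row in m)
    span = (hi - lo) or 1.0
    return [[(v - lo) / span for v in row] for row in m]

def mixup_feature(map_a, map_b, lam):
    """Blend two same-sized feature maps as lam * A + (1 - lam) * B."""
    return [[lam * a + (1 - lam) * b for a, b in zip(ra, rb)]
            for ra, rb in zip(map_a, map_b)]

def add_noise_patches(img, patch, ratio, sigma, rng):
    """Overwrite a random fraction of patch x patch blocks with zero-mean Gaussian noise."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    blocks = [(by, bx) for by in range(0, h, patch) for bx in range(0, w, patch)]
    chosen = rng.sample(blocks, int(len(blocks) * ratio))
    for by, bx in chosen:
        for y in range(by, min(by + patch, h)):
            for x in range(bx, min(bx + patch, w)):
                out[y][x] = rng.gauss(0.0, sigma)
    return out, chosen

img = [[0.0] * 4 + [1.0] * 4 for _ in range(8)]  # toy image with one vertical edge
target = mixup_feature(normalize(sobel_magnitude(img)), normalize(lbp_codes(img)), lam=0.5)
noisy, corrupted = add_noise_patches(img, patch=2, ratio=0.25, sigma=1.0, rng=random.Random(0))
```

In a setup of this kind, `target` would play the role of the regression target predicted from a masked image, while `noisy` would be the corrupted input that a denoising model is asked to restore.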
This proxy task is designed to encourage the model to learn robust feature representations by distilling clean signals from noisy inputs. The model is trained to reconstruct the degraded image, which teaches it to focus on the essence of the visual content. To ensure comprehensive learning, the model uses two loss functions. One reinforces global contextual understanding of the image, enabling the model to grasp the overall structure and scene configuration. The other refines the perception of local details, so that fine visual nuances are not lost during denoising and reconstruction. In this way the model aims to balance broad comprehension of visual scenes against precise reconstruction of local details, a balance that is important for sophisticated image analysis tasks in self-supervised learning frameworks.

The two proposed self-supervised learning algorithms were evaluated on three benchmark datasets, CIFAR-10, CIFAR-100, and STL-10, and compared against existing state-of-the-art self-supervised methods based on Masked Image Modeling. Against these methods, the mixed HOG-Sobel feature maps obtained using Mixup showed outstanding performance on CIFAR-10 and STL-10 after full fine-tuning, with an average performance improvement of 0.4%. The pre-trained Deep Masked Autoencoder (DMAE) model was also evaluated. When fully fine-tuned on the STL-10 dataset, it showed a modest edge over the conventional Masked Autoencoder (MAE), exceeding its performance by 0.1%.
This finding points to the potential of DMAE for improving model accuracy. The study also found that the Mixup Feature method is more efficient than traditional self-supervised learning strategies based on contrastive learning: it shortens training time and removes the need for conventional data augmentation. In conclusion, the two self-supervised learning algorithms introduced in this research add to the growing set of methods for masked image modeling. Their effectiveness on benchmark datasets indicates potential for broader application, particularly to larger and more complex datasets.

The algorithms were then applied to medical image analysis. Models were pre-trained with self-supervision on specifically curated medical image datasets and then used for the downstream tasks. The empirical results show that self-supervised pre-training outperforms direct training: accuracy improved by more than 5% after full fine-tuning on the two downstream datasets.

Data imbalance is a substantial challenge in medical image analysis, because inadequate representation of specific conditions or features can harm model training and feature extraction. The study therefore constructed an imbalanced dataset and examined the robustness of self-supervised pre-trained models under data imbalance. The experiments show that self-supervised pre-trained models are more robust to data imbalance than models trained from scratch.
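The abstract does not say how the imbalanced dataset was constructed. One common way to obtain a fixed positive-to-negative ratio such as 1:8 is to subsample the positive class, sketched below under that assumption. The function name `subsample_to_ratio`, the label convention (1 = positive, 0 = negative), and the synthetic labels are hypothetical and are not taken from the study.

```python
import random

def subsample_to_ratio(labels, neg_per_pos=8, seed=0):
    """Return sorted indices that keep all negatives and about 1 positive per `neg_per_pos` negatives."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n_pos = min(len(pos), len(neg) // neg_per_pos)
    return sorted(rng.sample(pos, n_pos) + neg)

labels = [1] * 50 + [0] * 80  # synthetic labels, not data from the study
kept = subsample_to_ratio(labels)
n_pos = sum(labels[i] for i in kept)
n_neg = len(kept) - n_pos
print(n_pos, n_neg)  # prints "10 80"
```

Fixing the seed keeps the subset reproducible, so that pre-trained and from-scratch models can be compared on the same imbalanced split.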
The self-supervised pre-trained models perform particularly well at a positive-to-negative sample ratio of 1:8, where they are more robust than traditional supervised pre-trained Convolutional Neural Network (CNN) models. These results affirm the effectiveness of the proposed self-supervised pre-trained models in handling dataset imbalance. The improved robustness strengthens the case for self-supervised learning algorithms as tools for medical image analysis and suggests a prospective gain in accuracy for intelligent assisted diagnostic systems.

168 p.
en
Keywords: Self-supervised learning; Image reconstruction; Feature extraction; Image edge detection; Masked Autoencoder; Vision Transformers; Siamese Networks; Medical image analysis
Research and development of self-supervised visual feature learning based on neural networks
Дослідження та розробка самонавчання візуальним особливостям на основі нейронних мереж
Thesis Doctoral
004.032.26