GEO DATA : GEO DATA
Data Article
GeoAI Dataset for Industrial Park Segmentation from Sentinel-2 Satellite Imagery and GEMS
Sung-Hyun Gong1,2, Hyung-Sup Jung3,4,*, Geun-han Kim5, Geun-Hyouk Han6, Il-Hoon Choi7, Jin-Sung Hong8
GEO DATA 2025;7(1):36-44.
DOI: https://doi.org/10.22761/GD.2024.0054
Published online: February 13, 2025

1Integrated Master and PhD Student, Department of Geoinformatics, University of Seoul, 163 Seoulsiripdae-ro, Dongdaemun-gu, 02504 Seoul, South Korea

2Integrated Master and PhD Student, Department of Smart Cities, University of Seoul, 163 Seoulsiripdae-ro, Dongdaemun-gu, 02504 Seoul, South Korea

3Professor, Department of Geoinformatics, University of Seoul, 163 Seoulsiripdae-ro, Dongdaemun-gu, 02504 Seoul, South Korea

4Professor, Department of Smart Cities, University of Seoul, 163 Seoulsiripdae-ro, Dongdaemun-gu, 02504 Seoul, South Korea

5Research Specialist, Division for Environmental Planning, Water and Land Research Group, Korea Environment Institute, 370 Sicheong-daero, 30147 Sejong, South Korea

6Director, Neighbor System, 135 Jungdae-ro, Songpa-gu, 05717 Seoul, South Korea

7Managing Director, Neighbor System, 135 Jungdae-ro, Songpa-gu, 05717 Seoul, South Korea

8Senior Manager, e-Terra, 551-17 Yangcheon-ro, Gangseo-gu, 07532 Seoul, South Korea

Corresponding Author Hyung-Sup Jung Tel: +82-2-6490-2892 E-mail: hsjung@uos.ac.kr
• Received: November 20, 2024   • Revised: December 26, 2024   • Accepted: January 7, 2025

Copyright © 2025 GeoAI Data Society

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract
Air pollution in East Asia presents critical environmental and health challenges, particularly in industrial regions affected by domestic and cross-border emissions. This study developed a GeoAI dataset for industrial park segmentation, integrating Sentinel-2 satellite imagery, Geostationary Environment Monitoring Spectrometer (GEMS) geostationary satellite data, and Air Quality Monitoring Network data. Optimized for semantic segmentation tasks, with labels constructed specifically for industrial park classification, this dataset serves as a foundational asset for the precise identification and spatial tracking of major air pollution sources. We validated the dataset’s applicability using a modified U-Net model, achieving a mean intersection over union of 0.8146 and pixel accuracy of 0.9608, thereby demonstrating its potential as a tool for detecting and monitoring pollutant sources in industrial areas. With future expansion through additional temporal data and diverse pollutant measurements, this dataset is anticipated to support regional air quality monitoring efforts and inform strategies for pollution control across East Asia.
Since the Industrial Revolution, human economic and industrial activities have increasingly contributed to air pollution, posing significant threats to human health and socio-economic stability through various pathways (Brauer et al., 2016). Northeast Asia, characterized by rapid population growth and industrial development, has become a major source of anthropogenic pollutants. In South Korea, complex interactions between domestically emitted pollutants and those transported from abroad, particularly from China, have resulted in persistently high concentrations of air pollution. Estimates indicate that 30-40% of South Korea’s air pollution originates from foreign sources, underscoring the need for stringent control measures and mitigation strategies. Assessing future air quality trends and establishing responsive countermeasures require continuous monitoring of major pollution sources, such as industrial parks in both domestic and international regions, especially in China (Choi et al., 2019; Jang and Yeo, 2015). Location data on emission sources is essential for understanding the movement and distribution of pollutants; however, obtaining detailed information on foreign sources remains challenging.
In this context, satellite-based remote sensing technology and artificial intelligence (AI) have emerged as effective tools for identifying air pollutant sources. This integrated approach is particularly advantageous for detecting large-scale air pollution sources, such as industrial parks, in regions that are challenging to access. Recently, advancements in AI have driven substantial progress in air pollution detection through satellite imagery analysis. AI techniques are increasingly applied in remote sensing tasks, including object detection, change detection, classification, and semantic segmentation, with deep learning methods demonstrating superior performance compared to traditional machine learning algorithms (LeCun et al., 2015; Men et al., 2021).
Various studies have explored the use of remote sensing and AI, including research using optical satellite imagery and deep learning models to classify industrial parks and quarries, as well as atmospheric monitoring satellites combined with deep learning models to predict air pollutant concentrations (Muthukumar et al., 2021; Park et al., 2023). However, few studies integrate diverse satellite imagery and input data to detect sources of air pollution effectively. To address this gap, this study aims to construct a multimodal dataset for detecting air pollutant emission sources by combining optical satellite imagery with Geostationary Environment Monitoring Spectrometer (GEMS) data and Air Quality Monitoring Network data, which are used for monitoring high concentrations of air pollutants. Specifically, 10 m-resolution Sentinel-2 satellite imagery acquired in 2023 and 2024, GEMS geostationary environmental satellite data, and Air Quality Monitoring Network data were used to construct an AI training dataset for detecting domestic and international sources of air pollution. Using this dataset, semantic segmentation experiments were conducted with the U-Net network, known for its superior performance in semantic object segmentation. Finally, the study evaluates the potential of the developed AI dataset as foundational data for detecting and monitoring air pollutant emission sources.
2.1 Study data
In this study, an AI dataset for large-scale industrial park segmentation was constructed using Sentinel-2 satellite imagery, GEMS data, and air pollutant data from the Air Quality Monitoring Network. Sentinel-2, operated by the European Space Agency (ESA), is an Earth observation mission that provides critical information for agriculture and food resource management, forest monitoring, land cover change detection, and natural disaster management. Launched in June 2015 (Sentinel-2A) and March 2017 (Sentinel-2B), Sentinel-2 offers 13 multispectral bands with spatial resolutions of 10, 20, and 60 m. Its minimal spectral distortion and capacity for high-quality data accumulation make Sentinel-2 well suited to constructing an AI dataset for classifying large industrial parks across South Korea and China. Level 2A imagery, which includes radiometric, geometric, and atmospheric corrections, was used. The blue, green, red, and near-infrared bands, corresponding to bands 2, 3, 4, and 8, were used as input data (Table 1).
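As a quick check against Table 1, the four input bands are exactly the Sentinel-2 bands delivered at 10 m resolution. The sketch below encodes Table 1 as a plain Python list and filters it by resolution. The names `SENTINEL2_BANDS` and `select_bands` are illustrative assumptions and are not part of the authors' pipeline.

```python
# Table 1 as (band, characteristic, central wavelength in µm, resolution in m).
SENTINEL2_BANDS = [
    ("Band 1", "Coastal aerosol", 0.443, 60),
    ("Band 2", "Blue", 0.490, 10),
    ("Band 3", "Green", 0.560, 10),
    ("Band 4", "Red", 0.665, 10),
    ("Band 5", "Vegetation red edge", 0.705, 20),
    ("Band 6", "Vegetation red edge", 0.740, 20),
    ("Band 7", "Vegetation red edge", 0.783, 20),
    ("Band 8", "NIR", 0.842, 10),
    ("Band 8A", "Vegetation red edge", 0.865, 20),
    ("Band 9", "Water vapour", 0.945, 60),
    ("Band 10", "SWIR - cirrus", 1.375, 60),
    ("Band 11", "SWIR", 1.610, 20),
    ("Band 12", "SWIR", 2.190, 20),
]


def select_bands(bands, resolution_m=10):
    """Return the names of the bands whose native resolution equals resolution_m."""
    return [name for name, _, _, res in bands if res == resolution_m]


# Hypothetical helper usage: pick the bands used as model input.
INPUT_BANDS = select_bands(SENTINEL2_BANDS)
print(INPUT_BANDS)  # ['Band 2', 'Band 3', 'Band 4', 'Band 8']
```

Selecting by native resolution avoids resampling the 20 m and 60 m bands, which is one plausible reason for the choice; the paper itself does not state the rationale.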
GEMS, launched in 2020, is the first geostationary satellite dedicated to environmental monitoring, observing climate-change-inducing substances and air pollutants across East Asia, including the Korean Peninsula, more than eight times per day. GEMS data were used in this study as supplementary input for the AI dataset: level 3 NO2 concentration data were averaged monthly from January to December and used as a 12-band dataset (Choi et al., 2018; Kim et al., 2020).
In addition to Sentinel-2 and GEMS data, air pollutant concentration data from national monitoring networks were included in the dataset. In particular, the concentrations of sulfur dioxide (SO2), carbon monoxide (CO), and nitrogen dioxide (NO2) provided by the Air Quality Monitoring Network are closely correlated with large-scale air pollutant sources such as industrial parks, providing critical information for identifying major pollutant sources (Wei et al., 2023). For South Korea, monthly average concentrations of SO2, CO, and NO2 were obtained from AirKorea, resulting in 12-band data for each pollutant from January through December. For China, data from the Air Quality Monitoring Platform of the UN Environment Programme were used to verify the locations of monitoring stations, and pollutant data for each site were acquired from the Air Pollution in World database.
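The monthly averaging that yields one 12-band stack per pollutant can be sketched as follows for a single grid cell or station. This is a minimal illustration, not the authors' code: the function name `monthly_means`, the sample readings, and the choice to mark empty months as `None` are all assumptions, since the paper does not say how gaps were handled.

```python
from collections import defaultdict
from datetime import date


def monthly_means(samples):
    """Average (date, value) samples into 12 monthly values, January first.

    Months without samples are returned as None (an assumption; the
    source does not describe gap handling).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for day, value in samples:
        sums[day.month] += value
        counts[day.month] += 1
    return [sums[m] / counts[m] if counts[m] else None for m in range(1, 13)]


# Hypothetical NO2 readings for one location (arbitrary units).
readings = [
    (date(2023, 1, 5), 10.0),
    (date(2023, 1, 20), 14.0),
    (date(2023, 2, 11), 8.0),
]
bands = monthly_means(readings)  # 12 values, one per calendar month
```

Applied per pixel (GEMS) or per station (monitoring network), the same reduction gives the January-to-December band order described above.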
Fig. 1 illustrates the acquisition range of Sentinel-2 imagery used in this study. The imagery acquisition range for constructing the AI dataset for industrial park classification encompasses South Korea and selected regions in China, including Beijing, Tianjin, Hebei Province, Shandong Province, Shanghai, Zhejiang Province, and Jiangsu Province. These areas in China were selected as study regions due to severe air pollution issues arising from large-scale industrial parks, high population density, and increasing traffic volumes (Jeon and Kim, 2015). A total of 67 Sentinel-2 images were acquired for South Korea and 170 for the selected regions in China between April 2023 and June 2024. The number of Sentinel-2 images acquired by region, year, and month is presented in Table 2.
2.2 AI dataset construction
The AI dataset for semantic segmentation based on satellite imagery consists of paired input and label data. Input data is represented by satellite imagery, while label data includes ground truth values specifically constructed for semantic segmentation tasks. In this study, input data was constructed by combining Sentinel-2 satellite imagery, GEMS satellite data, and pollutant data from the Air Quality Monitoring Network, with corresponding industrial park segmentation labels developed as part of the dataset.
Label data for industrial parks was constructed through annotation work using QGIS software (QGIS, Grüt, Switzerland). This process involved establishing explicit criteria for delineating the boundaries of industrial parks and then structuring the data accordingly. Industrial parks were defined as areas densely populated with large-scale factories intended for industrial use. The targeted regions included extensive industrial parks such as power plants, steel mills, and petrochemical facilities, with structures within these complexes marked for segmentation. Conversely, grasslands and objects that could not be distinctly identified through satellite imagery were excluded from labeling.
Since relying solely on Sentinel-2 satellite imagery may pose limitations in delineating industrial park boundaries, additional reference data, such as land cover maps, ESA WorldCover, and OpenStreetMap, were utilized to enhance the reliability of boundary demarcation. Following the delineation criteria, label data for industrial parks was initially created in polygon form and subsequently rasterized to TIFF format, suitable for AI model training. Sentinel-2 satellite data and the rasterized label data were divided into patches of 512×512 pixels with a 25% overlap rate. Correspondingly, GEMS satellite data and pollutant data from the Air Quality Monitoring Network, covering the same area, were divided into patches of 64×64 pixels, thus completing the dataset.
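The tiling step can be sketched as below. A 25% overlap on a 512-pixel patch corresponds to a stride of 384 pixels. How the authors treated image edges is not stated, so shifting the final patch back to the image border is an assumption, as are the names `patch_origins` and `patch_grid`.

```python
def patch_origins(length, patch=512, overlap=0.25):
    """Start offsets of patches along one axis of an image `length` pixels long."""
    if length < patch:
        raise ValueError("image is smaller than one patch")
    stride = int(patch * (1 - overlap))  # 384 for 512 px at 25% overlap
    origins = list(range(0, length - patch + 1, stride))
    # Assumption: if the regular grid leaves a remainder, add one patch
    # aligned with the image edge so that every pixel is covered.
    if origins[-1] + patch < length:
        origins.append(length - patch)
    return origins


def patch_grid(height, width, patch=512, overlap=0.25):
    """Top-left (row, col) corners of all patches covering the image."""
    rows = patch_origins(height, patch, overlap)
    cols = patch_origins(width, patch, overlap)
    return [(r, c) for r in rows for c in cols]


# Footprint arithmetic: a 512 px Sentinel-2 patch at 10 m spans 5,120 m.
S2_PIXEL_M = 10
footprint_m = 512 * S2_PIXEL_M
# If a 64x64 auxiliary patch spans exactly the same footprint, each cell is 80 m.
coarse_cell_m = footprint_m / 64
```

For example, `patch_origins(1280)` gives `[0, 384, 768]`, and `patch_origins(1000)` gives `[0, 384, 488]` under the edge-alignment assumption. The 80 m cell size follows only from the stated patch sizes and the "same area" condition; the paper does not report the resampled resolution of the GEMS or monitoring-network layers.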
A total of 10,000 training dataset instances were constructed using the methodology described above. Table 3 provides details on the dataset, including patch size, data format, and quantity. The dataset comprises Sentinel-2 satellite imagery, GEMS satellite data, and air pollutant measurements from monitoring networks as input data, while the labeled data serves as ground truth for semantic segmentation tasks. Fig. 2 offers examples of the constructed AI dataset, where Fig. 2A-C illustrates Sentinel-2 imagery, GEMS data, and air pollutant measurements as input data components, and Fig. 2D displays the corresponding labeled data used as ground truth.
To assess the applicability of the constructed AI dataset, a U-Net-based semantic segmentation model was applied to classify industrial parks. While the conventional U-Net model utilizes a single encoder-decoder structure, this study incorporated a modified U-Net architecture with multiple encoders, each handling Sentinel-2, GEMS, and air pollutant data independently (Du et al., 2020; Long et al., 2015). The dataset was split into training (80%), validation (10%), and testing (10%) sets, with the training data used for model learning, validation data employed for hyperparameter optimization and overfitting assessment, and test data reserved for objective performance evaluation. Additionally, data augmentation techniques were applied to the training dataset to enhance model generalization and improve classification accuracy for industrial parks (Baek et al., 2022).
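An 80/10/10 split of the 10,000 patch sets can be sketched as follows. The paper does not say whether the split was random, seeded, or grouped by scene, so the shuffled, seeded split and the name `split_indices` are assumptions made for illustration.

```python
import random


def split_indices(n, train=0.8, val=0.1, seed=0):
    """Shuffle indices 0..n-1 and split them into train/validation/test lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # local RNG keeps the split reproducible
    n_train = round(n * train)
    n_val = round(n * val)
    train_ids = indices[:n_train]
    val_ids = indices[n_train:n_train + n_val]
    test_ids = indices[n_train + n_val:]  # remainder goes to the test set
    return train_ids, val_ids, test_ids


train_ids, val_ids, test_ids = split_indices(10_000)
print(len(train_ids), len(val_ids), len(test_ids))  # 8000 1000 1000
```

Because neighbouring patches overlap by 25%, a purely random patch-level split can place shared pixels in different subsets. Grouping patches by source image before splitting is one way to avoid this; the source does not state which approach was taken.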
The model’s performance showed a mean intersection over union (mIoU) of 0.8146 and pixel accuracy of 0.9608, with a precision of 0.8853, recall of 0.9108, and F1 score of 0.8978 for industrial areas, indicating robust classification performance. Fig. 3 provides sample predictions from the modified U-Net model, with each row displaying outputs on a different test dataset and columns representing Sentinel-2 imagery, labeled ground truth, and the model’s predictions. Although minor misclassifications and missed detections were noted, the model generally demonstrated effective industrial area classification, accurately distinguishing between industrial and non-industrial regions. Fig. 3A-B shows strong alignment between labeled and predicted results, with clearer boundary demarcation in Fig. 3B, where industrial roads are evident, compared to Fig. 3A. Conversely, Fig. 3C reveals slight discrepancies, likely due to indistinct industrial boundaries, leading to minor variations between predictions and labeled data.
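The reported metrics follow standard definitions based on pixel-level confusion counts. The sketch below computes them for a binary mask (industrial versus background). Treating mIoU as the mean of the two class IoUs is an assumption, since the paper does not spell out its averaging, and the counts used in the example are made up.

```python
def binary_segmentation_metrics(tp, fp, fn, tn):
    """Pixel-level metrics from confusion counts of the industrial class.

    tp/fp/fn/tn count pixels; the industrial class is treated as positive.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    pixel_accuracy = (tp + tn) / (tp + fp + fn + tn)
    iou_industrial = tp / (tp + fp + fn)
    iou_background = tn / (tn + fp + fn)
    # Assumption: mIoU is the unweighted mean over the two classes.
    miou = (iou_industrial + iou_background) / 2
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "pixel_accuracy": pixel_accuracy,
        "miou": miou,
    }


# Hypothetical counts for a 100-pixel toy mask.
metrics = binary_segmentation_metrics(tp=8, fp=2, fn=2, tn=88)
```

As a consistency check on the reported values, the harmonic mean of the precision (0.8853) and recall (0.9108) is about 0.898, in line with the reported F1 score of 0.8978.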
This study developed a GeoAI dataset for monitoring air pollution sources in Northeast Asia, with a focus on industrial park segmentation. By integrating Sentinel-2 satellite imagery, GEMS environmental satellite data, and Air Quality Monitoring Network data, the dataset was designed as an AI training resource for industrial park segmentation that draws on diverse data types rather than optical imagery alone. The AI dataset developed in this study demonstrates strong potential as foundational data for identifying air pollutant emission source locations and supporting continuous monitoring efforts.
To validate its utility, the dataset was tested using a modified U-Net model, achieving high segmentation performance with an mIoU of 0.8146 and pixel accuracy of 0.9608. These results confirm the dataset’s effectiveness for practical detection and segmentation of air pollution sources, particularly large-scale industrial parks. Furthermore, the model’s strong performance suggests its potential for enhancing understanding of pollutant transport and distribution.
Future work will focus on expanding this dataset by incorporating additional temporal data and diverse pollutant types, enabling more refined monitoring of pollution sources. This AI dataset is expected to play a crucial role in supporting air quality monitoring and emission source management across South Korea and East Asia. In the long term, it holds promise as foundational data for shaping air pollution mitigation strategies and environmental policy development.

Conflict of Interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Funding Information

This research was supported by the project "#33 Air pollution source space distribution data" funded by the Ministry of Science and ICT, and by a National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 202303072004).

Data Availability Statement

The dataset supporting the findings of this study is currently under embargo. It is scheduled for public release on AI Hub in April 2025, at which time it will be assigned a DOI and made fully accessible. This approach ensures compliance with data-sharing requirements and facilitates reproducibility and further research.

Fig. 1.
Study area. Coverage area of Sentinel-2 Imagery in South Korea and China.
Fig. 2.
Examples of constructed AI dataset for industrial park segmentation. GEMS, Geostationary Environment Monitoring Spectrometer; AI, artificial intelligence.
Fig. 3.
Examples of prediction results for industrial park segmentation. (A1-C1) represent Sentinel-2 satellite images, (A2-C2) represent label data, and (A3-C3) represent predictions of the modified U-Net model.
Table 1.
Spectral bands of Sentinel-2
| Band | Characteristic | Central wavelength (µm) | Resolution (m) |
|---|---|---|---|
| Band 1 | Coastal aerosol | 0.443 | 60 |
| Band 2 | Blue | 0.490 | 10 |
| Band 3 | Green | 0.560 | 10 |
| Band 4 | Red | 0.665 | 10 |
| Band 5 | Vegetation red edge | 0.705 | 20 |
| Band 6 | Vegetation red edge | 0.740 | 20 |
| Band 7 | Vegetation red edge | 0.783 | 20 |
| Band 8 | NIR | 0.842 | 10 |
| Band 8A | Vegetation red edge | 0.865 | 20 |
| Band 9 | Water vapour | 0.945 | 60 |
| Band 10 | SWIR - cirrus | 1.375 | 60 |
| Band 11 | SWIR | 1.610 | 20 |
| Band 12 | SWIR | 2.190 | 20 |

NIR, near-infrared; SWIR, short-wave infrared.

Table 2.
Number of Sentinel-2 images acquired for the study
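| Region | Acquisition year | Acquisition month | Number of images |
|---|---|---|---|
| South Korea | 2023 | April | 12 |
| South Korea | 2023 | May | 7 |
| South Korea | 2023 | June | 3 |
| South Korea | 2023 | July | 1 |
| South Korea | 2023 | October | 6 |
| South Korea | 2023 | November | 10 |
| South Korea | 2024 | March | 2 |
| South Korea | 2024 | April | 10 |
| South Korea | 2024 | May | 10 |
| South Korea | 2024 | June | 6 |
| China | 2023 | April | 24 |
| China | 2023 | May | 22 |
| China | 2023 | June | 12 |
| China | 2023 | July | 12 |
| China | 2023 | August | 6 |
| China | 2023 | September | 7 |
| China | 2023 | October | 26 |
| China | 2023 | November | 21 |
| China | 2024 | March | 17 |
| China | 2024 | April | 7 |
| China | 2024 | May | 12 |
| China | 2024 | June | 4 |
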
Table 3.
Constructed AI dataset for industrial park segmentation
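| Data type | Data source | Patch size | Format | Quantity |
|---|---|---|---|---|
| Input data | Sentinel-2 | 512×512 | TIFF | 10,000 |
| Input data | GEMS | 64×64 | TIFF | 10,000 |
| Input data | Air Quality Monitoring Network | 64×64 | TIFF | 10,000 |
| Label data | | 512×512 | TIFF | 10,000 |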

AI, artificial intelligence; GEMS, Geostationary Environment Monitoring Spectrometer.

References
  • Baek WK, Lee MJ, Jung HS (2022) The performance improvement of U-Net model for landcover semantic segmentation through data augmentation. Korean J Remote Sens 38(6):1663–1676
  • Brauer M, Freedman G, Frostad J, et al (2016) Ambient air pollution exposure estimation for the global burden of disease 2013. Environ Sci Technol 50(1):79–88
  • Choi WJ, Moon KJ, Yoon J (2018) Introducing the geostationary environment monitoring spectrometer. J Appl Rem Sens 12(4):044005
  • Choi J, Park RJ, Lee HM, et al (2019) Impacts of local vs. trans-boundary emissions from different sectors on PM2.5 exposure in South Korea during the KORUS-AQ campaign. Atmos Environ 203:196–205
  • Du G, Cao X, Liang J, Chen X, Zhan Y (2020) Medical image segmentation based on U-Net: a review. J Imaging Sci Technol 64:020508
  • Jang KS, Yeo JH (2015) The effects of Korean and Chinese economic growth on particulate matter in Korea: time series cointegration analysis. JEPA 23(1):97–117
  • Jeon SH, Kim YP (2015) A study on the smog reduction strategies in China. Par Aerosol Res 11(3):63–75
  • Kim J, Jeong U, Ahn MH, et al (2020) New era of air quality monitoring from space: geostationary environment monitoring spectrometer (GEMS). Bull Am Meteorol Soc 101(1):E1–E22
  • LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521(7553):436–444
  • Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition, Boston, 7-12 Jun 2015, pp 3431–3440
  • Men G, He G, Wang G (2021) Concatenated residual attention UNet for semantic segmentation of urban green space. Forests 12(11):1441
  • Muthukumar P, Cocom E, Nagrecha K, et al (2021) Predicting PM2.5 atmospheric air pollution using deep learning with meteorological data and ground-based observations and remote-sensing satellite big data. Air Qual Atmos Health 15(7):1221–1234
  • Park CW, Jung HS, Lee WJ, et al (2023) GeoAI dataset for industrial park and quarry classification from KOMPSAT-3/3A optical satellite imagery. GEO DATA 5(4):238–243
  • Wei J, Li Z, Wang J, Li C, Gupta P, Cribb M (2023) Ground-level gaseous pollutants (NO2, SO2, and CO) in China: daily seamless mapping and spatiotemporal variations. Atmos Chem Phys 23(2):1511–1532
Meta Data for Dataset
Essential
| Field | Sub-Category | Value |
|---|---|---|
| Title of Dataset | | GeoAI Dataset for Industrial Park Segmentation from Sentinel-2 satellite imagery and GEMS |
| DOI | | The dataset supporting the findings of this study is currently under embargo. It is scheduled for public release on AI Hub in April 2025, at which time it will be assigned a DOI and made fully accessible |
| Category | | Environment |
| Temporal Coverage | | 2023.04.01.-2024.06.30. |
| Spatial Coverage | Address | South Korea, China |
| Spatial Coverage | WGS84 Coordinates | WGS84; [Latitude] N26°-43°; [Longitude] E112°-130° |
| Personnel | Name | Geun-Hyouk Han |
| Personnel | Affiliation | Neighbor System |
| Personnel | E-mail | hyouk@neighbor21.co.kr |
| CC License | | CC BY-NC |

Optional
| Field | Sub-Category | Value |
|---|---|---|
| Summary of Dataset | | GeoAI Dataset for Industrial Park Segmentation |
| Project | | #33 Air pollution source space distribution data |
| Instrument | | |
