We conducted research on the conversion of government information into AI learning data

Published: Jun 2, 2025
Last Updated: Jun 4, 2025

Based on the Priority Plan for the Realization of a Digital Society (Cabinet decision on June 21, 2024), we are working to improve and expand the data available for strengthening the development of large language models (LLMs) in Japan.

Under the initiative of the Cabinet Office's Council for Science, Technology and Innovation (CSTI), the Digital Agency has been conducting a survey on the needs for AI learning data since fiscal 2024, based on the "Bridge Program between R&D and Society 5.0 (BRIDGE)" (Cabinet Office), a program to promote the implementation of R&D results in society. The purpose of the survey is to show, based on recent technological trends, what kind of data should be disclosed and in what format, and what the staff who handle it in their work should be aware of, so that government-owned information can be provided in a form that contributes to the utilization and development of generative AI in Japan.

Most of the information held and published by central government agencies, local governments, and other relevant organizations (hereinafter referred to as the "Government, etc."), such as commentaries on laws and regulations, guidelines, statistical data, and other public information (hereinafter referred to as the "Government, etc. Held Information"), has already been processed for accuracy, rights, and anonymity. The Government, etc. Held Information is therefore expected to be useful for AI learning, including the training of large language models (LLMs).

On the other hand, there are many cases in which it is difficult to use the information held by the government for AI learning because it is in PDF format, and there are also cases in which it is difficult to use the information due to access rights.

Based on the above, this project surveyed trends in the latest technologies and the needs involved in converting data that cannot be immediately used for AI learning (PDF format, images, etc.) into a format that is easy to use for AI learning. It also collected data, converted and provided it on a trial basis, and verified the effect by actually having AI learn from the converted data.

Content of the investigation

In the past, simply publishing a large amount of Japanese text for use in AI learning was considered the main measure for improving the performance of large language models (LLMs). However, given the current direction of generative AI technology, the study confirmed that it is important to continuously publish "data that can respond to the unique background and information of Japan." To promote efficient and sustainable data disclosure on this basis, candidate data was prioritized along two axes, "area" and "type." The content of this research study is as follows.

Data prioritization

Area axis
1. Published data in the area is lacking
2. The data will contribute to improving AI capabilities
Type axis
1. For evaluation
2. For in-context learning
3. For parametric learning

As a result of the prioritization, we concluded that "evaluation data" is the most important, because the effects of learning cannot be judged objectively unless AI capabilities can be evaluated appropriately.
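The two-axis prioritization can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the candidate names, the boolean fields, the rule that a candidate must meet both area criteria, and the relative order of in-context and parametric data (the study only states that evaluation data ranks highest). The report defines the actual criteria.

```python
from dataclasses import dataclass

# Assumed ordering: evaluation data first (as the study concluded); the order
# of the other two types simply follows the list order and is hypothetical.
TYPE_RANK = {"evaluation": 0, "in_context": 1, "parametric": 2}


@dataclass(frozen=True)
class Candidate:
    name: str                # illustrative dataset name, not from the report
    data_type: str           # one of the keys of TYPE_RANK
    lacks_public_data: bool  # area axis, criterion 1
    improves_ai: bool        # area axis, criterion 2


def prioritize(candidates):
    """Keep candidates meeting both area criteria, then order them by type."""
    eligible = [c for c in candidates if c.lacks_public_data and c.improves_ai]
    return sorted(eligible, key=lambda c: TYPE_RANK[c.data_type])


candidates = [
    Candidate("statistics tables", "parametric", True, True),
    Candidate("legal multiple-choice questions", "evaluation", True, True),
    Candidate("general web text", "parametric", False, True),
    Candidate("guideline excerpts", "in_context", True, True),
]

ranked_names = [c.name for c in prioritize(candidates)]
print(ranked_names)
```

In this sketch the area axis acts as a filter and the type axis as a sort key; a real prioritization could just as well weight the two axes instead.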

Creation, evaluation and validation of high-priority data sets

Based on this data prioritization, we created the following four high-priority datasets and evaluated and validated each of them.

We prepared a multiple-choice question dataset based on information linking laws and regulations with their commentary, and evaluated whether generative AI can interpret the law when it is given sufficient information.
We prepared a dataset to verify whether the writing skills of generative AI can be evaluated mechanically against the evaluation standards that lawyers apply to actual writing work, and verified the "practical writing skills and appropriateness of the evaluation standards."
We prepared a dataset to evaluate the ability to interpret slides containing multiple charts, and evaluated the "ability to derive a single claim from multiple charts."
We prepared a dataset to evaluate the ability to recognize print layouts unique to Japanese, such as the Official Gazette, and evaluated the "processing capability of print formats including mixed vertical and horizontal writing and mathematical expressions."
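As an illustration of how a multiple-choice evaluation dataset such as the first one might be scored, the following is a minimal sketch. The record layout, the placeholder questions, and the `answer_fn` callback (a stand-in for a call to a generative AI model) are all hypothetical; the published dataset's actual schema and evaluation procedure are described in the report.

```python
from typing import Callable

# Hypothetical record layout with placeholder content.
questions = [
    {
        "context": "Article text and its commentary (placeholder).",
        "question": "Which option is consistent with the commentary?",
        "choices": {"A": "Option A", "B": "Option B", "C": "Option C"},
        "answer": "B",
    },
    {
        "context": "Another article and its commentary (placeholder).",
        "question": "Which option is consistent with the commentary?",
        "choices": {"A": "Option A", "B": "Option B", "C": "Option C"},
        "answer": "A",
    },
]


def accuracy(items: list, answer_fn: Callable[[dict], str]) -> float:
    """Return the fraction of items for which answer_fn picks the correct label."""
    if not items:
        return 0.0
    correct = sum(
        1 for q in items if answer_fn(q).strip().upper() == q["answer"]
    )
    return correct / len(items)


def always_first_choice(q: dict) -> str:
    """Trivial baseline standing in for a model: always pick the first label."""
    return next(iter(q["choices"]))


score = accuracy(questions, always_first_choice)
print(f"accuracy: {score:.2f}")
```

Passing the model in as a callback keeps the scoring logic independent of any particular AI service, so the same function can compare a baseline against one or more models.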

Developing a process for the sustained release of high-priority data

In this research, we developed a process to continuously disclose high-priority data.

Results of the investigation

In this research, a process to continuously disclose high-priority data was established, and the importance of the following points became clear.

Accurately understanding the needs of the users who will utilize the data
Clearly stating the purpose for which each dataset was created
Carrying out dissemination activities after the data is disclosed

The overall picture and details of this research are described in the Report. It was also found that the person in charge of a data disclosure project needs both domain-specific expertise and knowledge of how the data is used in AI learning.

Report

Final Report of Research Study on Conversion of Government-Owned Data into AI Learning Data (PDF / 10,991 KB) (updated on June 4, 2025)

Future Prospects

Based on the results of this research, we will redefine, in a way suited to the age of generative AI, what kind of data should be disclosed, in what format, and what the staff who handle it in their work should be aware of, so that government-owned information can be provided in a way that contributes to the utilization and development of generative AI.