Skip to content

Calling from within Python code

Document Analyzer の利用

The Document Analyzer performs OCR and layout analysis, integrating these results into a comprehensive analysis output. It can be used for various use cases, including paragraph and table structure analysis, extraction, and figure/table detection.

import cv2



from yomitoku import DocumentAnalyzer

from yomitoku.data.functions import load_pdf



if __name__ == "__main__":

    PATH_IMGE = "demo/sample.pdf"

    analyzer = DocumentAnalyzer(visualize=True, device="cuda")

    # PDFファイルを読み込み

    imgs = load_pdf(PATH_IMGE)

    for i, img in enumerate(imgs):

        results, ocr_vis, layout_vis = analyzer(img)

        # HTML形式で解析結果をエクスポート

        results.to_html(f"output_{i}.html", img=img)

        # 可視化画像を保存

        cv2.imwrite(f"output_ocr_{i}.jpg", ocr_vis)

        cv2.imwrite(f"output_layout_{i}.jpg", layout_vis)
  • Setting visualize to True enables the visualization of each processing result. The second and third return values will contain the OCR and layout analysis results, respectively. If set to False, None will be returned. Since visualization adds computational overhead, it is recommended to set it to False unless needed for debugging purposes.
  • The device parameter specifies the computation device to be used. The default is "cuda". If a GPU is unavailable, it automatically switches to CPU mode for processing.
  • The configs parameter allows you to set more detailed parameters for the pipeline processing.

The results of DocumentAnalyzer can be exported in the following formats:

to_json(): JSON format (.json) to_html(): HTML format (.html) to_csv(): Comma-separated CSV format (.csv) to_markdown(): Markdown format (.md)

Using AI-OCR Only

AI-OCR performs text detection and recognition on the detected text, returning the positions of the text within the image along with the

import cv2



from yomitoku import OCR

from yomitoku.data.functions import load_pdf



if __name__ == "__main__":

    ocr = OCR(visualize=True, device="cpu")

    # PDFファイルを読み込み

    imgs = load_pdf("demo/sample.pdf")

    for i, img in enumerate(imgs):

        results, ocr_vis = ocr(img)



        # JSON形式で解析結果をエクスポート

        results.to_json(f"output_{i}.json")

        cv2.imwrite(f"output_ocr_{i}.jpg", ocr_vis)
  • Setting visualize to True enables the visualization of each processing result. The second and third return values will contain the OCR and layout analysis results, respectively. If set to False, None will be returned. Since visualization adds computational overhead, it is recommended to set it to False unless needed for debugging purposes.
  • The device parameter specifies the computation device to be used. The default is "cuda". If a GPU is unavailable, it automatically switches to CPU mode for processing.
  • The configs parameter allows you to set more detailed parameters for the pipeline processing.

The results of OCR processing support export in JSON format (to_json()) only.

Using Layout Analyzer only

The LayoutAnalyzer performs text detection, followed by AI-based paragraph, figure/table detection, and table structure analysis. It analyzes the layout structure within the document.

import cv2



from yomitoku import LayoutAnalyzer

from yomitoku.data.functions import load_pdf



if __name__ == "__main__":

    analyzer = LayoutAnalyzer(visualize=True, device="cuda")

    # PDFファイルを読み込み

    imgs = load_pdf("demo/sample.pdf")

    for i, img in enumerate(imgs):

        results, layout_vis = analyzer(img)



        # JSON形式で解析結果をエクスポート

        results.to_json(f"output_{i}.json")

        cv2.imwrite(f"output_layout_{i}.jpg", layout_vis)
  • Setting visualize to True enables the visualization of each processing result. The second and third return values will contain the OCR and layout analysis results, respectively. If set to False, None will be returned. Since visualization adds computational overhead, it is recommended to set it to False unless needed for debugging purposes.
  • The device parameter specifies the computation device to be used. The default is cuda. If a GPU is unavailable, it automatically switches to CPU mode for processing.
  • The configs parameter allows you to set more detailed parameters for the pipeline processing.

The results of LayoutAnalyzer processing support export only in JSON format (to_json()).

Detailed Configuration of the Pipeline

By providing a config, you can adjust the behavior in greater detail.

  • model_name: Specifies the architecture of the model to be used.
  • path_cfg: Provides the path to the config file containing hyperparameters.
  • device: Specifies the device to be used for inference. Options are cuda, cpu, or mps.
  • visualize: Indicates whether to perform visualization of the processing results (boolean).
  • from_pretrained: Specifies whether to use a pretrained model (boolean).
  • infer_onnx: Indicates whether to use onnxruntime for inference instead of PyTorch (boolean).

Supported Model Types (model_name)

  • TextRecognizer: parseq, parseq-small
  • TextDetector: dbnet
  • LayoutParser: rtdetrv2
  • TableStructureRecognizer: rtdetrv2

How to Write Config

The config is provided in dictionary format. By using a config, you can execute processing on different devices for each module and set detailed parameters. For example, the following config allows the OCR processing to run on a GPU, while the layout analysis is performed on a CPU:

from yomitoku import DocumentAnalyzer

if __name__ == "__main__":
    configs = {
        "ocr": {
            "text_detector": {
                "device": "cuda",
            },
            "text_recognizer": {
                "device": "cuda",
            },
        },
        "layout_analyzer": {
            "layout_parser": {
                "device": "cpu",
            },
            "table_structure_recognizer": {
                "device": "cpu",
            },
        },
    }

    DocumentAnalyzer(configs=configs)

Defining Parameters in an YAML File

By providing the path to a YAML file in the config, you can adjust detailed parameters for inference. Examples of YAML files can be found in the configs directory within the repository. While the model's network parameters cannot be modified, certain aspects like post-processing parameters and input image size can be adjusted.Refer to configuration for configurable parameters.

For instance, you can define post-processing thresholds for the Text Detector in a YAML file and set its path in the config. The config file does not need to include all parameters; you only need to specify the parameters that require changes.

post_process:
  thresh: 0.1
  unclip_ratio: 2.5

Storing the Path to a YAML File in the Config

from yomitoku import DocumentAnalyzer



if __name__ == "__main__":

    configs = {"ocr": {"text_detector": {"path_cfg": "demo/text_detector.yaml"}}}



    analyzer = DocumentAnalyzer(configs=configs, visualize=True, device="cuda")

Using in an Offline Environment

Yomitoku automatically downloads models from Hugging Face Hub during the first execution, requiring an internet connection at that time. However, by manually downloading the models in advance, it can be executed in an offline environment.

  1. Install Git Large File Storage
  2. In an environment with internet access, download the model repository. Copy the cloned repository to your target environment using your preferred tools.

The following is the command to download the model repository from Hugging Face Hub.

git clone https://huggingface.co/KotaroKinoshita/yomitoku-table-structure-recognizer-rtdtrv2-open-beta

git clone https://huggingface.co/KotaroKinoshita/yomitoku-layout-parser-rtdtrv2-open-beta

git clone https://huggingface.co/KotaroKinoshita/yomitoku-text-detector-dbnet-open-beta

git clone https://huggingface.co/KotaroKinoshita/yomitoku-text-recognizer-parseq-open-beta
  1. Place the model repository directly under the root directory of the Yomitoku repository and reference the local model repository in the hf_hub_repo field of the YAML file. Below is an example of text_detector.yaml. Similarly, define YAML files for other modules as well.
hf_hub_repo: yomitoku-text-detector-dbnet-open-beta
  1. Storing the Path to a YAML File in the Config
from yomitoku import DocumentAnalyzer



if __name__ == "__main__":

    configs = {"ocr": {"text_detector": {"path_cfg": "demo/text_detector.yaml"}}}



    analyzer = DocumentAnalyzer(configs=configs, visualize=True, device="cuda")