{ "cells": [ { "cell_type": "markdown", "id": "fc02fcf9", "metadata": {}, "source": [ "
\n", "\n", "RESERVADO \n", "\n", "\n", "\n", "**MINISTÉRIO DA DEFESA NACIONAL** \n", "**EXÉRCITO PORTUGUÊS** \n", "**DIREÇÃO DE COMUNICAÇÕES E INFORMAÇÃO** \n", "**CENTRO DE DESENVOLVIMENTO APLICACIONAL E BI** \n", "**LLM PROJECT - ANNEX A**\n", "\n", "
" ] }, { "cell_type": "markdown", "id": "803bc097", "metadata": {}, "source": [ "# OCR with Tesseract\n", "\n", "\n", "## Requirements\n", "\n", "- Install Tesseract [`choco install tesseract -y`] ([chocolatey](https://dev.to/kevinkirsten/como-instalar-e-utilizar-o-chocolatey-guia-para-iniciantes-1i98)), then run the version-check cell below to confirm the installation; without Tesseract the OCR does not work.\n", "- If the OCR runs on Windows, [ghostscript](https://www.gnu.org/software/ghostscript/) is also required [`choco install ghostscript -y`].\n", "- Add the [Portuguese](https://github.com/tesseract-ocr/tessdata_best) language data (`por.traineddata`) to Tesseract by placing it in the Tesseract data folder (`C:\\Program Files\\Tesseract-OCR\\tessdata\\`).\n", "- Install OCRmyPDF in the Python environment used by this notebook (`pip install ocrmypdf`); the OCR cell calls it as `python -m ocrmypdf`.\n" ] }, { "cell_type": "markdown", "id": "82ca8b2f", "metadata": {}, "source": [ "## Libraries" ] }, { "cell_type": "code", "execution_count": 1, "id": "dda10e05", "metadata": {}, "outputs": [], "source": [ "import os\n", "import subprocess\n", "import sys\n", "import shutil\n", "from collections import Counter" ] }, { "cell_type": "code", "execution_count": null, "id": "06ce7747", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "tesseract v5.5.0.20241111\n", " leptonica-1.85.0\n", " libgif 5.2.2 : libjpeg 8d (libjpeg-turbo 3.0.4) : libpng 1.6.44 : libtiff 4.7.0 : zlib 1.3.1 : libwebp 1.4.0 : libopenjp2 2.5.2\n", " Found AVX512BW\n", " Found AVX512F\n", " Found AVX512VNNI\n", " Found AVX2\n", " Found AVX\n", " Found FMA\n", " Found SSE4.1\n", " Found libarchive 3.7.7 zlib/1.3.1 liblzma/5.6.3 bz2lib/1.0.8 liblz4/1.10.0 libzstd/1.5.6\n", " Found libcurl/8.11.0 Schannel zlib/1.3.1 brotli/1.1.0 zstd/1.5.6 libidn2/2.3.7 libpsl/0.21.5 libssh2/1.11.0\n", "\n", "\n" ] } ], "source": [ "# Confirm that Tesseract runs from PATH and report where Ghostscript is found (None if missing)\n", "r = subprocess.run([\"tesseract\", \"--version\"], capture_output=True, text=True)\n", "print(r.stdout)\n", "print(r.stderr)\n", "print(shutil.which(\"gswin64c\"))" ] }, { "cell_type": "markdown", "id": "fa1bb90b", "metadata": {}, 
"source": [ "## OCR" ] }, { "cell_type": "code", "execution_count": 2, "id": "f913c83d", "metadata": {}, "outputs": [], "source": [ "# Input folder with the PDFs to process and output folder for the OCR results\n", "pasta_entrada = r\"D:\\Trabalhos\\Bot Exército\\OCR\"\n", "pasta_saida = r\"D:\\Trabalhos\\Bot Exército\\OCR_limpo\"" ] }, { "cell_type": "code", "execution_count": null, "id": "92432db3", "metadata": {}, "outputs": [], "source": [ "os.makedirs(pasta_saida, exist_ok=True)" ] }, { "cell_type": "markdown", "id": "5f2a5b2b", "metadata": {}, "source": [ "Change the Tesseract and Ghostscript paths below to match your installation; on Linux, comment out the Ghostscript line. \n", "\n", "The next cell reports where both executables are located." ] }, { "cell_type": "code", "execution_count": null, "id": "4804893a", "metadata": {}, "outputs": [], "source": [ "# Executables already reachable through PATH (None means not found)\n", "print(\"TESSERACT:\", shutil.which(\"tesseract\"))\n", "print(\"GHOSTSCRIPT gswin64c:\", shutil.which(\"gswin64c\"))\n", "print(\"GHOSTSCRIPT gswin32c:\", shutil.which(\"gswin32c\"))\n", "\n", "# Common Windows install locations to probe\n", "possiveis_tesseract = [\n", "    r\"C:\\Program Files\\Tesseract-OCR\\tesseract.exe\",\n", "    r\"C:\\Program Files (x86)\\Tesseract-OCR\\tesseract.exe\",\n", "]\n", "\n", "possiveis_gs = [\n", "    r\"C:\\Program Files\\gs\\gs10.05.1\\bin\\gswin64c.exe\",\n", "    r\"C:\\Program Files\\gs\\gs10.04.0\\bin\\gswin64c.exe\",\n", "    r\"C:\\Program Files\\gs\\gs10.03.1\\bin\\gswin64c.exe\",\n", "    r\"C:\\Program Files\\gs\\gs10.02.1\\bin\\gswin64c.exe\",\n", "]\n", "\n", "print(\"\\nTesseract:\")\n", "for p in possiveis_tesseract:\n", "    print(p, \"->\", os.path.exists(p))\n", "\n", "print(\"\\nGhostscript:\")\n", "for p in possiveis_gs:\n", "    print(p, \"->\", os.path.exists(p))" ] }, { "cell_type": "code", "execution_count": 7, "id": "7f832f3b", "metadata": {}, "outputs": [], "source": [ "# Adjust both directories to the versions actually installed on this machine\n", "tesseract_dir = r\"C:\\Program Files\\Tesseract-OCR\"\n", "gs_dir = r\"C:\\Program Files\\gs\\gs10.07.0\\bin\"\n", "# Prepend them to PATH in a copy of the environment used by the OCR subprocess\n", "env = os.environ.copy()\n", "env[\"PATH\"] = tesseract_dir + os.pathsep + gs_dir + os.pathsep + env[\"PATH\"]" ] }, { "cell_type": "markdown", 
"id": "572456cc", "metadata": {}, "source": [ "Many documents are digitally signed, and in some cases this makes the OCR step fail, so [`--invalidate-digital-signatures`] was added ([OCRmyPDF documentation](https://ocrmypdf.readthedocs.io/en/latest/releasenotes/version14.html#v14-4-0)). The option is applied only on a retry, when the first attempt reports `DigitalSignatureError`." ] }, { "cell_type": "code", "execution_count": 8, "id": "a6e18fe5", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Total processados: 297\n", "Com erro: 4\n" ] } ], "source": [ "resultados = []\n", "for raiz, _, ficheiros in os.walk(pasta_entrada):\n", "    for ficheiro in ficheiros:\n", "        if not ficheiro.lower().endswith(\".pdf\"):\n", "            continue\n", "        origem = os.path.join(raiz, ficheiro)\n", "        # Mirror the input folder structure under the output folder\n", "        rel_path = os.path.relpath(raiz, pasta_entrada)\n", "        pasta_destino = os.path.join(pasta_saida, rel_path)\n", "        os.makedirs(pasta_destino, exist_ok=True)\n", "        nome_base, _ = os.path.splitext(ficheiro)\n", "        destino = os.path.join(pasta_destino, f\"{nome_base}_ocr.pdf\")\n", "        comando_base = [sys.executable, \"-m\", \"ocrmypdf\", \"-l\", \"por\", \"--skip-text\", \"--optimize\", \"1\", origem, destino]\n", "        try:\n", "            subprocess.run(comando_base, check=True, capture_output=True, text=True, env=env)\n", "            # Delete the original only after a non-empty output file exists\n", "            if os.path.exists(destino) and os.path.getsize(destino) > 0:\n", "                os.remove(origem)\n", "                resultados.append((ficheiro, \"OK - original removido\"))\n", "            else:\n", "                resultados.append((ficheiro, \"ERRO: OCR não gerou ficheiro válido\"))\n", "        except subprocess.CalledProcessError as e:\n", "            stderr = e.stderr or \"\"\n", "            if \"DigitalSignatureError\" in stderr:\n", "                # Retry signed PDFs, allowing OCRmyPDF to invalidate the signature\n", "                comando_assinado = [\n", "                    sys.executable, \"-m\", \"ocrmypdf\", \"-l\", \"por\", \"--skip-text\", \"--optimize\", \"1\",\n", "                    \"--invalidate-digital-signatures\", origem, destino,\n", "                ]\n", "                try:\n", "                    subprocess.run(comando_assinado, check=True, capture_output=True, text=True, env=env)\n", "                    # The signed original is kept in this branch\n", "                    if os.path.exists(destino) and os.path.getsize(destino) > 0:\n", "                        resultados.append((ficheiro, \"OK - assinatura invalidada\"))\n", "                    else:\n", "                        resultados.append((ficheiro, \"ERRO: ficheiro assinado não gerou output válido\"))\n", "                except subprocess.CalledProcessError as e2:\n", "                    resultados.append((ficheiro, f\"ERRO OCR ASSINADO: {e2.stderr}\"))\n", "                except Exception as e2:\n", "                    resultados.append((ficheiro, f\"ERRO ASSINADO: {e2}\"))\n", "            else:\n", "                resultados.append((ficheiro, f\"ERRO OCR: {stderr}\"))\n", "        except Exception as e:\n", "            resultados.append((ficheiro, f\"ERRO: {e}\"))\n", "print(\"Total processados:\", len(resultados))\n", "print(\"Com erro:\", sum(1 for r in resultados if \"ERRO\" in r[1]))" ] }, { "cell_type": "code", "execution_count": 9, "id": "83e6ec96", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "--- 2 ocorrência(s) ---\n", "ERRO OCR: InputFileError\n", "\n", "\n", "--- 1 ocorrência(s) ---\n", "ERRO OCR: Starting processing with 16 workers concurrently\n", "Parsing 19 pages with HocrParser\n", "Suppressing OCR output text with improbable aspect ratio\n", "Postprocessing...\n", "Auto mode: no verapdf available and input is not PDF/A, outputting PDF\n", "Image optimization did not improve the file - optimizations will not be used\n", "Image optimization ratio: 1.02 savings: 1.9%\n", "Total file size ratio: 0.97 savings: -3.0%\n", "Output file is a PDF (auto mode)\n", "WARNING: D:\\Trabalhos\\Bot Exército\\OCR_limpo\\.\\DIFE 2024_ocr.pd\n", "\n", "--- 1 ocorrência(s) ---\n", "ERRO OCR: Starting processing with 3 workers concurrently\n", " 1 [tesseract] lots of diacritics - possibly poor OCR\n", "Parsing 3 pages with HocrParser\n", "Postprocessing...\n", "Auto mode: no verapdf available and input is not PDF/A, outputting PDF\n", "Image optimization ratio: 1.18 savings: 15.2%\n", "Total file size ratio: 1.16 savings: 14.1%\n", "Output file is a PDF (auto mode)\n", "WARNING: D:\\Trabalhos\\Bot Exército\\OCR_limpo\\.\\NEP AGE.201_ocr.pdf (offset 1460): error decoding stream data for object 17 0: 
Pl_DCT::decompr\n" ] } ], "source": [ "# Group identical error messages and show the most frequent ones, truncated to 500 characters\n", "erros = [msg for _, msg in resultados if \"ERRO\" in msg]\n", "contagem = Counter(erros)\n", "\n", "for erro, n in contagem.most_common(20):\n", "    print(f\"\\n--- {n} ocorrência(s) ---\")\n", "    print(erro[:500])" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.14.2" } }, "nbformat": 4, "nbformat_minor": 5 }