Botexercito/Scripts/OCR.ipynb

2026-03-22 18:07:19 +00:00
{
"cells": [
{
"cell_type": "markdown",
"id": "fc02fcf9",
"metadata": {},
"source": [
"<div align=\"center\">\n",
"\n",
"<span style=\"color:red\"><b>RESERVADO</b></span> \n",
"\n",
"<img src=\"Imagens/logo_presidencia_republica.jpg\" width=\"100\"/>\n",
"\n",
"**MINISTÉRIO DA DEFESA NACIONAL** \n",
"**EXÉRCITO PORTUGUÊS** \n",
"**DIREÇÃO DE COMUNICAÇÕES E INFORMAÇÃO** \n",
"**CENTRO DE DESENVOLVIMENTO APLICACIONAL E BI** \n",
"**PROJECTO LLM - ANEXO A**\n",
"\n",
"</div>"
]
},
{
"cell_type": "markdown",
"id": "803bc097",
"metadata": {},
"source": [
"# OCR com tesseract\n",
"\n",
"\n",
"## Requesitos\n",
"\n",
"- Instalar o tesseract [`choco install tesseract -y`] ([chocolatey](https://dev.to/kevinkirsten/como-instalar-e-utilizar-o-chocolatey-guia-para-iniciantes-1i98)) correr o bloco a seguir para confirmar a instalação do tesseract sem isso o OCR nao funciona.\n",
"- Se o OCR estiver a correr em ambiente windows precisam do [ghostscript](https://www.gnu.org/software/ghostscript/) [`choco install ghostscript -y`]\n",
"- No tessaract temos que acresentar o [Portugues](https://github.com/tesseract-ocr/tessdata_best) meter na pasta do Tesseract (`C:\\Program Files\\Tesseract-OCR\\tessdata\\`)\n"
]
},
{
"cell_type": "markdown",
"id": "82ca8b2f",
"metadata": {},
"source": [
"## Bibliotecas"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "dda10e05",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import subprocess\n",
"import sys\n",
"import shutil\n",
"from collections import Counter"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "06ce7747",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"tesseract v5.5.0.20241111\n",
" leptonica-1.85.0\n",
" libgif 5.2.2 : libjpeg 8d (libjpeg-turbo 3.0.4) : libpng 1.6.44 : libtiff 4.7.0 : zlib 1.3.1 : libwebp 1.4.0 : libopenjp2 2.5.2\n",
" Found AVX512BW\n",
" Found AVX512F\n",
" Found AVX512VNNI\n",
" Found AVX2\n",
" Found AVX\n",
" Found FMA\n",
" Found SSE4.1\n",
" Found libarchive 3.7.7 zlib/1.3.1 liblzma/5.6.3 bz2lib/1.0.8 liblz4/1.10.0 libzstd/1.5.6\n",
" Found libcurl/8.11.0 Schannel zlib/1.3.1 brotli/1.1.0 zstd/1.5.6 libidn2/2.3.7 libpsl/0.21.5 libssh2/1.11.0\n",
"\n",
"\n"
]
}
],
"source": [
"r = subprocess.run([\"tesseract\", \"--version\"], capture_output=True, text=True)\n",
"print(r.stdout)\n",
"print(r.stderr)\n",
"print(shutil.which(\"gswin64c\"))"
]
},
{
"cell_type": "markdown",
"id": "fa1bb90b",
"metadata": {},
"source": [
"## OCR"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "f913c83d",
"metadata": {},
"outputs": [],
"source": [
"pasta_entrada = r\"D:\\Trabalhos\\Bot Exército\\OCR\"\n",
"pasta_saida = r\"D:\\Trabalhos\\Bot Exército\\OCR_limpo\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "92432db3",
"metadata": {},
"outputs": [],
"source": [
"os.makedirs(pasta_saida, exist_ok=True)"
]
},
{
"cell_type": "markdown",
"id": "5f2a5b2b",
"metadata": {},
"source": [
"Aqui mudar pelo caminho do tesseract e do Ghostscript se ambiente linux comentar a linha do Ghostscript. \n",
"\n",
"O proximo comando é para saber onde estão ambos os documentos"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4804893a",
"metadata": {},
"outputs": [],
"source": [
"print(\"TESSERACT:\", shutil.which(\"tesseract\"))\n",
"print(\"GHOSTSCRIPT gswin64c:\", shutil.which(\"gswin64c\"))\n",
"print(\"GHOSTSCRIPT gswin32c:\", shutil.which(\"gswin32c\"))\n",
"\n",
"possiveis_tesseract = [r\"C:\\Program Files\\Tesseract-OCR\\tesseract.exe\",r\"C:\\Program Files (x86)\\Tesseract-OCR\\tesseract.exe\",]\n",
"\n",
"possiveis_gs = [\n",
" r\"C:\\Program Files\\gs\\gs10.05.1\\bin\\gswin64c.exe\",\n",
" r\"C:\\Program Files\\gs\\gs10.04.0\\bin\\gswin64c.exe\",\n",
" r\"C:\\Program Files\\gs\\gs10.03.1\\bin\\gswin64c.exe\",\n",
" r\"C:\\Program Files\\gs\\gs10.02.1\\bin\\gswin64c.exe\",\n",
"]\n",
"\n",
"print(\"\\nTesseract:\")\n",
"for p in possiveis_tesseract:\n",
" print(p, \"->\", os.path.exists(p))\n",
"\n",
"print(\"\\nGhostscript:\")\n",
"for p in possiveis_gs:\n",
" print(p, \"->\", os.path.exists(p))"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "7f832f3b",
"metadata": {},
"outputs": [],
"source": [
"tesseract_dir = r\"C:\\Program Files\\Tesseract-OCR\"\n",
"gs_dir = r\"C:\\Program Files\\gs\\gs10.07.0\\bin\"\n",
"env = os.environ.copy()\n",
"env[\"PATH\"] = tesseract_dir + os.pathsep + gs_dir + os.pathsep + env[\"PATH\"]"
]
},
{
"cell_type": "markdown",
"id": "572456cc",
"metadata": {},
"source": [
"Como muitos documentos tem assinatura e isso invalida a transformação pelo OCR em alguns dos casos acrescentou-se [`--invalidate-digital-signatures`] ([documentação do OCR](https://ocrmypdf.readthedocs.io/en/latest/releasenotes/version14.html#v14-4-0))"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "a6e18fe5",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Total processados: 297\n",
"Com erro: 4\n"
]
}
],
"source": [
"resultados = []\n",
"for raiz, _, ficheiros in os.walk(pasta_entrada):\n",
" for ficheiro in ficheiros:\n",
" if not ficheiro.lower().endswith(\".pdf\"):\n",
" continue\n",
" origem = os.path.join(raiz, ficheiro)\n",
" rel_path = os.path.relpath(raiz, pasta_entrada)\n",
" pasta_destino = os.path.join(pasta_saida, rel_path)\n",
" os.makedirs(pasta_destino, exist_ok=True)\n",
" nome_base, _ = os.path.splitext(ficheiro)\n",
" destino = os.path.join(pasta_destino, f\"{nome_base}_ocr.pdf\")\n",
" comando_base = [sys.executable, \"-m\", \"ocrmypdf\",\"-l\", \"por\",\"--skip-text\",\"--optimize\", \"1\", origem,destino]\n",
" try:\n",
" subprocess.run(comando_base,check=True,capture_output=True,text=True,env=env)\n",
" if os.path.exists(destino) and os.path.getsize(destino) > 0:\n",
" os.remove(origem)\n",
" resultados.append((ficheiro, \"OK - original removido\"))\n",
" else:\n",
" resultados.append((ficheiro, \"ERRO: OCR não gerou ficheiro válido\"))\n",
" except subprocess.CalledProcessError as e:\n",
" stderr = e.stderr or \"\"\n",
" if \"DigitalSignatureError\" in stderr:\n",
" comando_assinado = [\n",
" sys.executable, \"-m\", \"ocrmypdf\",\"-l\", \"por\",\"--skip-text\",\"--optimize\", \"1\",\"--invalidate-digital-signatures\",origem,destino]\n",
" try:\n",
" subprocess.run(comando_assinado,check=True,capture_output=True,text=True,env=env)\n",
" if os.path.exists(destino) and os.path.getsize(destino) > 0:\n",
" resultados.append((ficheiro, \"OK - assinatura invalidada\"))\n",
" else:\n",
" resultados.append((ficheiro, \"ERRO: ficheiro assinado não gerou output válido\"))\n",
" except subprocess.CalledProcessError as e2:\n",
" resultados.append((ficheiro, f\"ERRO OCR ASSINADO: {e2.stderr}\"))\n",
" except Exception as e2:\n",
" resultados.append((ficheiro, f\"ERRO ASSINADO: {e2}\"))\n",
" else:\n",
" resultados.append((ficheiro, f\"ERRO OCR: {stderr}\"))\n",
" except Exception as e:\n",
" resultados.append((ficheiro, f\"ERRO: {e}\"))\n",
"print(\"Total processados:\", len(resultados))\n",
"print(\"Com erro:\", sum(1 for r in resultados if \"ERRO\" in r[1]))"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "83e6ec96",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"--- 2 ocorrência(s) ---\n",
"ERRO OCR: InputFileError\n",
"\n",
"\n",
"--- 1 ocorrência(s) ---\n",
"ERRO OCR: Starting processing with 16 workers concurrently\n",
"Parsing 19 pages with HocrParser\n",
"Suppressing OCR output text with improbable aspect ratio\n",
"Postprocessing...\n",
"Auto mode: no verapdf available and input is not PDF/A, outputting PDF\n",
"Image optimization did not improve the file - optimizations will not be used\n",
"Image optimization ratio: 1.02 savings: 1.9%\n",
"Total file size ratio: 0.97 savings: -3.0%\n",
"Output file is a PDF (auto mode)\n",
"WARNING: D:\\Trabalhos\\Bot Exército\\OCR_limpo\\.\\DIFE 2024_ocr.pd\n",
"\n",
"--- 1 ocorrência(s) ---\n",
"ERRO OCR: Starting processing with 3 workers concurrently\n",
" 1 [tesseract] lots of diacritics - possibly poor OCR\n",
"Parsing 3 pages with HocrParser\n",
"Postprocessing...\n",
"Auto mode: no verapdf available and input is not PDF/A, outputting PDF\n",
"Image optimization ratio: 1.18 savings: 15.2%\n",
"Total file size ratio: 1.16 savings: 14.1%\n",
"Output file is a PDF (auto mode)\n",
"WARNING: D:\\Trabalhos\\Bot Exército\\OCR_limpo\\.\\NEP AGE.201_ocr.pdf (offset 1460): error decoding stream data for object 17 0: Pl_DCT::decompr\n"
]
}
],
"source": [
"erros = [msg for _, msg in resultados if \"ERRO\" in msg]\n",
"contagem = Counter(erros)\n",
"\n",
"for erro, n in contagem.most_common(20):\n",
" print(f\"\\n--- {n} ocorrência(s) ---\")\n",
" print(erro[:500])"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.14.2"
}
},
"nbformat": 4,
"nbformat_minor": 5
}