{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "fc02fcf9",
   "metadata": {},
   "source": [
    "<div align=\"center\">\n",
    "\n",
    "<span style=\"color:red\"><b>RESERVADO</b></span> \n",
    "\n",
    "<img src=\"Imagens/logo_presidencia_republica.jpg\" width=\"100\"/>\n",
    "\n",
    "**MINISTÉRIO DA DEFESA NACIONAL** \n",
    "**EXÉRCITO PORTUGUÊS** \n",
    "**DIREÇÃO DE COMUNICAÇÕES E INFORMAÇÃO** \n",
    "**CENTRO DE DESENVOLVIMENTO APLICACIONAL E BI** \n",
    "**PROJECTO LLM - ANEXO A**\n",
    "\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "803bc097",
   "metadata": {},
   "source": [
    "# OCR with Tesseract\n",
    "\n",
    "\n",
    "## Requirements\n",
    "\n",
    "- Install Tesseract with `choco install tesseract -y` (see this [Chocolatey](https://dev.to/kevinkirsten/como-instalar-e-utilizar-o-chocolatey-guia-para-iniciantes-1i98) guide), then run the version-check cell below to confirm the installation. Without Tesseract the OCR does not work.\n",
    "- If the OCR runs on Windows, [Ghostscript](https://www.gnu.org/software/ghostscript/) is also required: `choco install ghostscript -y`.\n",
    "- Add the [Portuguese](https://github.com/tesseract-ocr/tessdata_best) language data (`por.traineddata`) to Tesseract by placing it in the Tesseract data folder (`C:\\Program Files\\Tesseract-OCR\\tessdata\\`).\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "82ca8b2f",
   "metadata": {},
   "source": [
    "## Libraries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "dda10e05",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import subprocess\n",
    "import sys\n",
    "import shutil\n",
    "from collections import Counter"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "06ce7747",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "tesseract v5.5.0.20241111\n",
      " leptonica-1.85.0\n",
      " libgif 5.2.2 : libjpeg 8d (libjpeg-turbo 3.0.4) : libpng 1.6.44 : libtiff 4.7.0 : zlib 1.3.1 : libwebp 1.4.0 : libopenjp2 2.5.2\n",
      " Found AVX512BW\n",
      " Found AVX512F\n",
      " Found AVX512VNNI\n",
      " Found AVX2\n",
      " Found AVX\n",
      " Found FMA\n",
      " Found SSE4.1\n",
      " Found libarchive 3.7.7 zlib/1.3.1 liblzma/5.6.3 bz2lib/1.0.8 liblz4/1.10.0 libzstd/1.5.6\n",
      " Found libcurl/8.11.0 Schannel zlib/1.3.1 brotli/1.1.0 zstd/1.5.6 libidn2/2.3.7 libpsl/0.21.5 libssh2/1.11.0\n",
      "\n",
      "\n"
     ]
    }
   ],
   "source": [
    "r = subprocess.run([\"tesseract\", \"--version\"], capture_output=True, text=True)\n",
    "print(r.stdout)\n",
    "print(r.stderr)\n",
    "print(shutil.which(\"gswin64c\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fa1bb90b",
   "metadata": {},
   "source": [
    "## OCR"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "f913c83d",
   "metadata": {},
   "outputs": [],
   "source": [
    "pasta_entrada = r\"D:\\Trabalhos\\Bot Exército\\OCR\"\n",
    "pasta_saida = r\"D:\\Trabalhos\\Bot Exército\\OCR_limpo\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "92432db3",
   "metadata": {},
   "outputs": [],
   "source": [
    "os.makedirs(pasta_saida, exist_ok=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5f2a5b2b",
   "metadata": {},
   "source": [
    "Change the Tesseract and Ghostscript paths here to match your installation. On Linux, comment out the Ghostscript line.\n",
    "\n",
    "The next cell shows where both executables are located."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4804893a",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"TESSERACT:\", shutil.which(\"tesseract\"))\n",
    "print(\"GHOSTSCRIPT gswin64c:\", shutil.which(\"gswin64c\"))\n",
    "print(\"GHOSTSCRIPT gswin32c:\", shutil.which(\"gswin32c\"))\n",
    "\n",
    "possiveis_tesseract = [\n",
    "    r\"C:\\Program Files\\Tesseract-OCR\\tesseract.exe\",\n",
    "    r\"C:\\Program Files (x86)\\Tesseract-OCR\\tesseract.exe\",\n",
    "]\n",
    "\n",
    "possiveis_gs = [\n",
    "    r\"C:\\Program Files\\gs\\gs10.05.1\\bin\\gswin64c.exe\",\n",
    "    r\"C:\\Program Files\\gs\\gs10.04.0\\bin\\gswin64c.exe\",\n",
    "    r\"C:\\Program Files\\gs\\gs10.03.1\\bin\\gswin64c.exe\",\n",
    "    r\"C:\\Program Files\\gs\\gs10.02.1\\bin\\gswin64c.exe\",\n",
    "]\n",
    "\n",
    "print(\"\\nTesseract:\")\n",
    "for p in possiveis_tesseract:\n",
    "    print(p, \"->\", os.path.exists(p))\n",
    "\n",
    "print(\"\\nGhostscript:\")\n",
    "for p in possiveis_gs:\n",
    "    print(p, \"->\", os.path.exists(p))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "7f832f3b",
   "metadata": {},
   "outputs": [],
   "source": [
    "tesseract_dir = r\"C:\\Program Files\\Tesseract-OCR\"\n",
    "gs_dir = r\"C:\\Program Files\\gs\\gs10.07.0\\bin\"\n",
    "env = os.environ.copy()\n",
    "env[\"PATH\"] = tesseract_dir + os.pathsep + gs_dir + os.pathsep + env[\"PATH\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "572456cc",
   "metadata": {},
   "source": [
    "Many documents carry a digital signature, which in some cases prevents the OCR conversion, so the `--invalidate-digital-signatures` option was added ([OCRmyPDF release notes](https://ocrmypdf.readthedocs.io/en/latest/releasenotes/version14.html#v14-4-0))."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "a6e18fe5",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Total processados: 297\n",
      "Com erro: 4\n"
     ]
    }
   ],
   "source": [
    "resultados = []\n",
    "# Walk the input tree and OCR every PDF, mirroring the folder layout in pasta_saida.\n",
    "for raiz, _, ficheiros in os.walk(pasta_entrada):\n",
    "    for ficheiro in ficheiros:\n",
    "        if not ficheiro.lower().endswith(\".pdf\"):\n",
    "            continue\n",
    "        origem = os.path.join(raiz, ficheiro)\n",
    "        rel_path = os.path.relpath(raiz, pasta_entrada)\n",
    "        pasta_destino = os.path.join(pasta_saida, rel_path)\n",
    "        os.makedirs(pasta_destino, exist_ok=True)\n",
    "        nome_base, _ = os.path.splitext(ficheiro)\n",
    "        destino = os.path.join(pasta_destino, f\"{nome_base}_ocr.pdf\")\n",
    "        comando_base = [sys.executable, \"-m\", \"ocrmypdf\", \"-l\", \"por\", \"--skip-text\", \"--optimize\", \"1\", origem, destino]\n",
    "        try:\n",
    "            subprocess.run(comando_base, check=True, capture_output=True, text=True, env=env)\n",
    "            # The original is deleted only when a non-empty output file exists.\n",
    "            if os.path.exists(destino) and os.path.getsize(destino) > 0:\n",
    "                os.remove(origem)\n",
    "                resultados.append((ficheiro, \"OK - original removido\"))\n",
    "            else:\n",
    "                resultados.append((ficheiro, \"ERRO: OCR não gerou ficheiro válido\"))\n",
    "        except subprocess.CalledProcessError as e:\n",
    "            stderr = e.stderr or \"\"\n",
    "            # Signed PDFs: retry once with the signature invalidated; the original is kept.\n",
    "            if \"DigitalSignatureError\" in stderr:\n",
    "                comando_assinado = [\n",
    "                    sys.executable, \"-m\", \"ocrmypdf\", \"-l\", \"por\", \"--skip-text\", \"--optimize\", \"1\", \"--invalidate-digital-signatures\", origem, destino]\n",
    "                try:\n",
    "                    subprocess.run(comando_assinado, check=True, capture_output=True, text=True, env=env)\n",
    "                    if os.path.exists(destino) and os.path.getsize(destino) > 0:\n",
    "                        resultados.append((ficheiro, \"OK - assinatura invalidada\"))\n",
    "                    else:\n",
    "                        resultados.append((ficheiro, \"ERRO: ficheiro assinado não gerou output válido\"))\n",
    "                except subprocess.CalledProcessError as e2:\n",
    "                    resultados.append((ficheiro, f\"ERRO OCR ASSINADO: {e2.stderr}\"))\n",
    "                except Exception as e2:\n",
    "                    resultados.append((ficheiro, f\"ERRO ASSINADO: {e2}\"))\n",
    "            else:\n",
    "                resultados.append((ficheiro, f\"ERRO OCR: {stderr}\"))\n",
    "        except Exception as e:\n",
    "            resultados.append((ficheiro, f\"ERRO: {e}\"))\n",
    "print(\"Total processados:\", len(resultados))\n",
    "print(\"Com erro:\", sum(1 for r in resultados if \"ERRO\" in r[1]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "83e6ec96",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "--- 2 ocorrência(s) ---\n",
      "ERRO OCR: InputFileError\n",
      "\n",
      "\n",
      "--- 1 ocorrência(s) ---\n",
      "ERRO OCR: Starting processing with 16 workers concurrently\n",
      "Parsing 19 pages with HocrParser\n",
      "Suppressing OCR output text with improbable aspect ratio\n",
      "Postprocessing...\n",
      "Auto mode: no verapdf available and input is not PDF/A, outputting PDF\n",
      "Image optimization did not improve the file - optimizations will not be used\n",
      "Image optimization ratio: 1.02 savings: 1.9%\n",
      "Total file size ratio: 0.97 savings: -3.0%\n",
      "Output file is a PDF (auto mode)\n",
      "WARNING: D:\\Trabalhos\\Bot Exército\\OCR_limpo\\.\\DIFE 2024_ocr.pd\n",
      "\n",
      "--- 1 ocorrência(s) ---\n",
      "ERRO OCR: Starting processing with 3 workers concurrently\n",
      " 1 [tesseract] lots of diacritics - possibly poor OCR\n",
      "Parsing 3 pages with HocrParser\n",
      "Postprocessing...\n",
      "Auto mode: no verapdf available and input is not PDF/A, outputting PDF\n",
      "Image optimization ratio: 1.18 savings: 15.2%\n",
      "Total file size ratio: 1.16 savings: 14.1%\n",
      "Output file is a PDF (auto mode)\n",
      "WARNING: D:\\Trabalhos\\Bot Exército\\OCR_limpo\\.\\NEP AGE.201_ocr.pdf (offset 1460): error decoding stream data for object 17 0: Pl_DCT::decompr\n"
     ]
    }
   ],
   "source": [
    "erros = [msg for _, msg in resultados if \"ERRO\" in msg]\n",
    "contagem = Counter(erros)\n",
    "\n",
    "for erro, n in contagem.most_common(20):\n",
    "    print(f\"\\n--- {n} ocorrência(s) ---\")\n",
    "    print(erro[:500])"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.14.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}