Scripts/OCR.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "fc02fcf9",
   "metadata": {},
   "source": [
    "<div align=\"center\">\n",
    "\n",
    "<span style=\"color:red\"><b>RESERVADO</b></span>  \n",
    "\n",
    "<img src=\"Imagens/logo_presidencia_republica.jpg\" width=\"100\"/>\n",
    "\n",
    "**MINISTÉRIO DA DEFESA NACIONAL**  \n",
    "**EXÉRCITO PORTUGUÊS**  \n",
    "**DIREÇÃO DE COMUNICAÇÕES E INFORMAÇÃO**  \n",
    "**CENTRO DE DESENVOLVIMENTO APLICACIONAL E BI**  \n",
    "**PROJECTO LLM - ANEXO A**\n",
    "\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "803bc097",
   "metadata": {},
   "source": [
    "# OCR with Tesseract\n",
    "\n",
    "\n",
    "## Requirements\n",
    "\n",
    "- Install Tesseract with `choco install tesseract -y` (see this [Chocolatey guide](https://dev.to/kevinkirsten/como-instalar-e-utilizar-o-chocolatey-guia-para-iniciantes-1i98)), then run the version-check cell below to confirm the installation. OCR does not work without Tesseract.\n",
    "- When running on Windows, [Ghostscript](https://www.gnu.org/software/ghostscript/) is also required: `choco install ghostscript -y`.\n",
    "- Tesseract needs the Portuguese language data: download it from [tessdata_best](https://github.com/tesseract-ocr/tessdata_best) and place it in the Tesseract data folder (`C:\\Program Files\\Tesseract-OCR\\tessdata\\`).\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "82ca8b2f",
   "metadata": {},
   "source": [
    "## Libraries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "dda10e05",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import subprocess\n",
    "import sys\n",
    "import shutil\n",
    "from collections import Counter"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "06ce7747",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "tesseract v5.5.0.20241111\n",
      " leptonica-1.85.0\n",
      "  libgif 5.2.2 : libjpeg 8d (libjpeg-turbo 3.0.4) : libpng 1.6.44 : libtiff 4.7.0 : zlib 1.3.1 : libwebp 1.4.0 : libopenjp2 2.5.2\n",
      " Found AVX512BW\n",
      " Found AVX512F\n",
      " Found AVX512VNNI\n",
      " Found AVX2\n",
      " Found AVX\n",
      " Found FMA\n",
      " Found SSE4.1\n",
      " Found libarchive 3.7.7 zlib/1.3.1 liblzma/5.6.3 bz2lib/1.0.8 liblz4/1.10.0 libzstd/1.5.6\n",
      " Found libcurl/8.11.0 Schannel zlib/1.3.1 brotli/1.1.0 zstd/1.5.6 libidn2/2.3.7 libpsl/0.21.5 libssh2/1.11.0\n",
      "\n",
      "\n"
     ]
    }
   ],
   "source": [
    "r = subprocess.run([\"tesseract\", \"--version\"], capture_output=True, text=True)\n",
    "print(r.stdout)\n",
    "print(r.stderr)\n",
    "print(shutil.which(\"gswin64c\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fa1bb90b",
   "metadata": {},
   "source": [
    "## OCR"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "f913c83d",
   "metadata": {},
   "outputs": [],
   "source": [
    "pasta_entrada = r\"D:\\Trabalhos\\Bot Exército\\OCR\"\n",
    "pasta_saida = r\"D:\\Trabalhos\\Bot Exército\\OCR_limpo\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "92432db3",
   "metadata": {},
   "outputs": [],
   "source": [
    "os.makedirs(pasta_saida, exist_ok=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5f2a5b2b",
   "metadata": {},
   "source": [
    "Set the Tesseract and Ghostscript paths below to match your installation. On Linux, comment out the Ghostscript line.\n",
    "\n",
    "The next cell shows where both executables are located."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4804893a",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"TESSERACT:\", shutil.which(\"tesseract\"))\n",
    "print(\"GHOSTSCRIPT gswin64c:\", shutil.which(\"gswin64c\"))\n",
    "print(\"GHOSTSCRIPT gswin32c:\", shutil.which(\"gswin32c\"))\n",
    "\n",
    "# Common Windows install locations; adjust to the versions installed on this machine.\n",
    "possiveis_tesseract = [\n",
    "    r\"C:\\Program Files\\Tesseract-OCR\\tesseract.exe\",\n",
    "    r\"C:\\Program Files (x86)\\Tesseract-OCR\\tesseract.exe\",\n",
    "]\n",
    "\n",
    "possiveis_gs = [\n",
    "    r\"C:\\Program Files\\gs\\gs10.07.0\\bin\\gswin64c.exe\",  # version configured in the next cell\n",
    "    r\"C:\\Program Files\\gs\\gs10.05.1\\bin\\gswin64c.exe\",\n",
    "    r\"C:\\Program Files\\gs\\gs10.04.0\\bin\\gswin64c.exe\",\n",
    "    r\"C:\\Program Files\\gs\\gs10.03.1\\bin\\gswin64c.exe\",\n",
    "    r\"C:\\Program Files\\gs\\gs10.02.1\\bin\\gswin64c.exe\",\n",
    "]\n",
    "\n",
    "print(\"\\nTesseract:\")\n",
    "for p in possiveis_tesseract:\n",
    "    print(p, \"->\", os.path.exists(p))\n",
    "\n",
    "print(\"\\nGhostscript:\")\n",
    "for p in possiveis_gs:\n",
    "    print(p, \"->\", os.path.exists(p))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "7f832f3b",
   "metadata": {},
   "outputs": [],
   "source": [
    "tesseract_dir = r\"C:\\Program Files\\Tesseract-OCR\"\n",
    "gs_dir = r\"C:\\Program Files\\gs\\gs10.07.0\\bin\"\n",
    "env = os.environ.copy()\n",
    "env[\"PATH\"] = tesseract_dir + os.pathsep + gs_dir + os.pathsep + env[\"PATH\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "572456cc",
   "metadata": {},
   "source": [
    "Many documents carry a digital signature, which in some cases makes OCRmyPDF refuse to process them. For those files the conversion is retried with `--invalidate-digital-signatures` ([OCRmyPDF release notes](https://ocrmypdf.readthedocs.io/en/latest/releasenotes/version14.html#v14-4-0))."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "a6e18fe5",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Total processados: 297\n",
      "Com erro: 4\n"
     ]
    }
   ],
   "source": [
    "resultados = []\n",
    "for raiz, _, ficheiros in os.walk(pasta_entrada):\n",
    "    for ficheiro in ficheiros:\n",
    "        if not ficheiro.lower().endswith(\".pdf\"):\n",
    "            continue\n",
    "        origem = os.path.join(raiz, ficheiro)\n",
    "        # Mirror the input folder structure under pasta_saida.\n",
    "        rel_path = os.path.relpath(raiz, pasta_entrada)\n",
    "        pasta_destino = os.path.join(pasta_saida, rel_path)\n",
    "        os.makedirs(pasta_destino, exist_ok=True)\n",
    "        nome_base, _ = os.path.splitext(ficheiro)\n",
    "        destino = os.path.join(pasta_destino, f\"{nome_base}_ocr.pdf\")\n",
    "        comando_base = [sys.executable, \"-m\", \"ocrmypdf\", \"-l\", \"por\", \"--skip-text\", \"--optimize\", \"1\", origem, destino]\n",
    "        try:\n",
    "            # Decode the output as UTF-8 so accented paths in the messages are not garbled.\n",
    "            subprocess.run(comando_base, check=True, capture_output=True, text=True, encoding=\"utf-8\", errors=\"replace\", env=env)\n",
    "            if os.path.exists(destino) and os.path.getsize(destino) > 0:\n",
    "                # Delete the original only after a non-empty output file exists.\n",
    "                os.remove(origem)\n",
    "                resultados.append((ficheiro, \"OK - original removido\"))\n",
    "            else:\n",
    "                resultados.append((ficheiro, \"ERRO: OCR não gerou ficheiro válido\"))\n",
    "        except subprocess.CalledProcessError as e:\n",
    "            stderr = e.stderr or \"\"\n",
    "            if \"DigitalSignatureError\" in stderr:\n",
    "                # Retry once, allowing the digital signature to be invalidated.\n",
    "                comando_assinado = comando_base[:-2] + [\"--invalidate-digital-signatures\", origem, destino]\n",
    "                try:\n",
    "                    subprocess.run(comando_assinado, check=True, capture_output=True, text=True, encoding=\"utf-8\", errors=\"replace\", env=env)\n",
    "                    if os.path.exists(destino) and os.path.getsize(destino) > 0:\n",
    "                        # The signed original is kept; only the OCR copy loses the signature.\n",
    "                        resultados.append((ficheiro, \"OK - assinatura invalidada\"))\n",
    "                    else:\n",
    "                        resultados.append((ficheiro, \"ERRO: ficheiro assinado não gerou output válido\"))\n",
    "                except subprocess.CalledProcessError as e2:\n",
    "                    resultados.append((ficheiro, f\"ERRO OCR ASSINADO: {e2.stderr}\"))\n",
    "                except Exception as e2:\n",
    "                    resultados.append((ficheiro, f\"ERRO ASSINADO: {e2}\"))\n",
    "            else:\n",
    "                resultados.append((ficheiro, f\"ERRO OCR: {stderr}\"))\n",
    "        except Exception as e:\n",
    "            resultados.append((ficheiro, f\"ERRO: {e}\"))\n",
    "print(\"Total processados:\", len(resultados))\n",
    "print(\"Com erro:\", sum(1 for r in resultados if \"ERRO\" in r[1]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "83e6ec96",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "--- 2 ocorrência(s) ---\n",
      "ERRO OCR: InputFileError\n",
      "\n",
      "\n",
      "--- 1 ocorrência(s) ---\n",
      "ERRO OCR: Starting processing with 16 workers concurrently\n",
      "Parsing 19 pages with HocrParser\n",
      "Suppressing OCR output text with improbable aspect ratio\n",
      "Postprocessing...\n",
      "Auto mode: no verapdf available and input is not PDF/A, outputting PDF\n",
      "Image optimization did not improve the file - optimizations will not be used\n",
      "Image optimization ratio: 1.02 savings: 1.9%\n",
      "Total file size ratio: 0.97 savings: -3.0%\n",
      "Output file is a PDF (auto mode)\n",
      "WARNING: D:\\Trabalhos\\Bot Exército\\OCR_limpo\\.\\DIFE 2024_ocr.pd\n",
      "\n",
      "--- 1 ocorrência(s) ---\n",
      "ERRO OCR: Starting processing with 3 workers concurrently\n",
      "    1 [tesseract] lots of diacritics - possibly poor OCR\n",
      "Parsing 3 pages with HocrParser\n",
      "Postprocessing...\n",
      "Auto mode: no verapdf available and input is not PDF/A, outputting PDF\n",
      "Image optimization ratio: 1.18 savings: 15.2%\n",
      "Total file size ratio: 1.16 savings: 14.1%\n",
      "Output file is a PDF (auto mode)\n",
      "WARNING: D:\\Trabalhos\\Bot Exército\\OCR_limpo\\.\\NEP AGE.201_ocr.pdf (offset 1460): error decoding stream data for object 17 0: Pl_DCT::decompr\n"
     ]
    }
   ],
   "source": [
    "erros = [msg for _, msg in resultados if \"ERRO\" in msg]\n",
    "contagem = Counter(erros)\n",
    "\n",
    "for erro, n in contagem.most_common(20):\n",
    "    print(f\"\\n--- {n} ocorrência(s) ---\")\n",
    "    print(erro[:500])"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.14.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
Botexercito 2026-03-22 18:07:19 +00:00