{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "fc02fcf9",
   "metadata": {},
   "source": [
    "<div align=\"center\">\n",
    "\n",
    "<span style=\"color:red\"><b>RESERVADO</b></span>  \n",
    "\n",
    "<img src=\"Imagens/logo_presidencia_republica.jpg\" width=\"100\"/>\n",
    "\n",
    "**MINISTÉRIO DA DEFESA NACIONAL**  \n",
    "**EXÉRCITO PORTUGUÊS**  \n",
    "**DIREÇÃO DE COMUNICAÇÕES E INFORMAÇÃO**  \n",
    "**CENTRO DE DESENVOLVIMENTO APLICACIONAL E BI**  \n",
    "**PROJECTO LLM - ANEXO A**\n",
    "\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "803bc097",
   "metadata": {},
   "source": [
    "# OCR com tesseract\n",
    "\n",
    "\n",
    "## Requesitos\n",
    "\n",
    "- Instalar o tesseract [`choco install tesseract -y`] ([chocolatey](https://dev.to/kevinkirsten/como-instalar-e-utilizar-o-chocolatey-guia-para-iniciantes-1i98)) correr o bloco a seguir para confirmar a instalação do tesseract sem isso o OCR nao funciona.\n",
    "- Se o OCR estiver a correr em ambiente windows precisam do [ghostscript](https://www.gnu.org/software/ghostscript/) [`choco install ghostscript -y`]\n",
    "- No tessaract temos que acresentar o [Portugues](https://github.com/tesseract-ocr/tessdata_best) meter na pasta do Tesseract (`C:\\Program Files\\Tesseract-OCR\\tessdata\\`)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "82ca8b2f",
   "metadata": {},
   "source": [
    "## Bibliotecas"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "dda10e05",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import subprocess\n",
    "import sys\n",
    "import shutil\n",
    "from collections import Counter"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "06ce7747",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "tesseract v5.5.0.20241111\n",
      " leptonica-1.85.0\n",
      "  libgif 5.2.2 : libjpeg 8d (libjpeg-turbo 3.0.4) : libpng 1.6.44 : libtiff 4.7.0 : zlib 1.3.1 : libwebp 1.4.0 : libopenjp2 2.5.2\n",
      " Found AVX512BW\n",
      " Found AVX512F\n",
      " Found AVX512VNNI\n",
      " Found AVX2\n",
      " Found AVX\n",
      " Found FMA\n",
      " Found SSE4.1\n",
      " Found libarchive 3.7.7 zlib/1.3.1 liblzma/5.6.3 bz2lib/1.0.8 liblz4/1.10.0 libzstd/1.5.6\n",
      " Found libcurl/8.11.0 Schannel zlib/1.3.1 brotli/1.1.0 zstd/1.5.6 libidn2/2.3.7 libpsl/0.21.5 libssh2/1.11.0\n",
      "\n",
      "\n"
     ]
    }
   ],
   "source": [
    "r = subprocess.run([\"tesseract\", \"--version\"], capture_output=True, text=True)\n",
    "print(r.stdout)\n",
    "print(r.stderr)\n",
    "print(shutil.which(\"gswin64c\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fa1bb90b",
   "metadata": {},
   "source": [
    "## OCR"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "f913c83d",
   "metadata": {},
   "outputs": [],
   "source": [
    "pasta_entrada = r\"D:\\Trabalhos\\Bot Exército\\OCR\"\n",
    "pasta_saida = r\"D:\\Trabalhos\\Bot Exército\\OCR_limpo\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "92432db3",
   "metadata": {},
   "outputs": [],
   "source": [
    "os.makedirs(pasta_saida, exist_ok=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5f2a5b2b",
   "metadata": {},
   "source": [
    "Aqui mudar pelo caminho do tesseract e do Ghostscript se ambiente linux comentar a linha do Ghostscript. \n",
    "\n",
    "O proximo comando é para saber onde estão ambos os documentos"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4804893a",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"TESSERACT:\", shutil.which(\"tesseract\"))\n",
    "print(\"GHOSTSCRIPT gswin64c:\", shutil.which(\"gswin64c\"))\n",
    "print(\"GHOSTSCRIPT gswin32c:\", shutil.which(\"gswin32c\"))\n",
    "\n",
    "possiveis_tesseract = [r\"C:\\Program Files\\Tesseract-OCR\\tesseract.exe\",r\"C:\\Program Files (x86)\\Tesseract-OCR\\tesseract.exe\",]\n",
    "\n",
    "possiveis_gs = [\n",
    "    r\"C:\\Program Files\\gs\\gs10.05.1\\bin\\gswin64c.exe\",\n",
    "    r\"C:\\Program Files\\gs\\gs10.04.0\\bin\\gswin64c.exe\",\n",
    "    r\"C:\\Program Files\\gs\\gs10.03.1\\bin\\gswin64c.exe\",\n",
    "    r\"C:\\Program Files\\gs\\gs10.02.1\\bin\\gswin64c.exe\",\n",
    "]\n",
    "\n",
    "print(\"\\nTesseract:\")\n",
    "for p in possiveis_tesseract:\n",
    "    print(p, \"->\", os.path.exists(p))\n",
    "\n",
    "print(\"\\nGhostscript:\")\n",
    "for p in possiveis_gs:\n",
    "    print(p, \"->\", os.path.exists(p))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "7f832f3b",
   "metadata": {},
   "outputs": [],
   "source": [
    "tesseract_dir = r\"C:\\Program Files\\Tesseract-OCR\"\n",
    "gs_dir = r\"C:\\Program Files\\gs\\gs10.07.0\\bin\"\n",
    "env = os.environ.copy()\n",
    "env[\"PATH\"] = tesseract_dir + os.pathsep + gs_dir + os.pathsep + env[\"PATH\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "572456cc",
   "metadata": {},
   "source": [
    "Como muitos documentos tem assinatura e isso invalida a transformação pelo OCR em alguns dos casos acrescentou-se [`--invalidate-digital-signatures`] ([documentação do OCR](https://ocrmypdf.readthedocs.io/en/latest/releasenotes/version14.html#v14-4-0))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "a6e18fe5",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Total processados: 297\n",
      "Com erro: 4\n"
     ]
    }
   ],
   "source": [
    "resultados = []\n",
    "for raiz, _, ficheiros in os.walk(pasta_entrada):\n",
    "    for ficheiro in ficheiros:\n",
    "        if not ficheiro.lower().endswith(\".pdf\"):\n",
    "            continue\n",
    "        origem = os.path.join(raiz, ficheiro)\n",
    "        rel_path = os.path.relpath(raiz, pasta_entrada)\n",
    "        pasta_destino = os.path.join(pasta_saida, rel_path)\n",
    "        os.makedirs(pasta_destino, exist_ok=True)\n",
    "        nome_base, _ = os.path.splitext(ficheiro)\n",
    "        destino = os.path.join(pasta_destino, f\"{nome_base}_ocr.pdf\")\n",
    "        comando_base = [sys.executable, \"-m\", \"ocrmypdf\",\"-l\", \"por\",\"--skip-text\",\"--optimize\", \"1\", origem,destino]\n",
    "        try:\n",
    "            subprocess.run(comando_base,check=True,capture_output=True,text=True,env=env)\n",
    "            if os.path.exists(destino) and os.path.getsize(destino) > 0:\n",
    "                os.remove(origem)\n",
    "                resultados.append((ficheiro, \"OK - original removido\"))\n",
    "            else:\n",
    "                resultados.append((ficheiro, \"ERRO: OCR não gerou ficheiro válido\"))\n",
    "        except subprocess.CalledProcessError as e:\n",
    "            stderr = e.stderr or \"\"\n",
    "            if \"DigitalSignatureError\" in stderr:\n",
    "                comando_assinado = [\n",
    "                    sys.executable, \"-m\", \"ocrmypdf\",\"-l\", \"por\",\"--skip-text\",\"--optimize\", \"1\",\"--invalidate-digital-signatures\",origem,destino]\n",
    "                try:\n",
    "                    subprocess.run(comando_assinado,check=True,capture_output=True,text=True,env=env)\n",
    "                    if os.path.exists(destino) and os.path.getsize(destino) > 0:\n",
    "                        resultados.append((ficheiro, \"OK - assinatura invalidada\"))\n",
    "                    else:\n",
    "                        resultados.append((ficheiro, \"ERRO: ficheiro assinado não gerou output válido\"))\n",
    "                except subprocess.CalledProcessError as e2:\n",
    "                    resultados.append((ficheiro, f\"ERRO OCR ASSINADO: {e2.stderr}\"))\n",
    "                except Exception as e2:\n",
    "                    resultados.append((ficheiro, f\"ERRO ASSINADO: {e2}\"))\n",
    "            else:\n",
    "                resultados.append((ficheiro, f\"ERRO OCR: {stderr}\"))\n",
    "        except Exception as e:\n",
    "            resultados.append((ficheiro, f\"ERRO: {e}\"))\n",
    "print(\"Total processados:\", len(resultados))\n",
    "print(\"Com erro:\", sum(1 for r in resultados if \"ERRO\" in r[1]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "83e6ec96",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "--- 2 ocorrência(s) ---\n",
      "ERRO OCR: InputFileError\n",
      "\n",
      "\n",
      "--- 1 ocorrência(s) ---\n",
      "ERRO OCR: Starting processing with 16 workers concurrently\n",
      "Parsing 19 pages with HocrParser\n",
      "Suppressing OCR output text with improbable aspect ratio\n",
      "Postprocessing...\n",
      "Auto mode: no verapdf available and input is not PDF/A, outputting PDF\n",
      "Image optimization did not improve the file - optimizations will not be used\n",
      "Image optimization ratio: 1.02 savings: 1.9%\n",
      "Total file size ratio: 0.97 savings: -3.0%\n",
      "Output file is a PDF (auto mode)\n",
      "WARNING: D:\\Trabalhos\\Bot ExÃ©rcito\\OCR_limpo\\.\\DIFE 2024_ocr.pd\n",
      "\n",
      "--- 1 ocorrência(s) ---\n",
      "ERRO OCR: Starting processing with 3 workers concurrently\n",
      "    1 [tesseract] lots of diacritics - possibly poor OCR\n",
      "Parsing 3 pages with HocrParser\n",
      "Postprocessing...\n",
      "Auto mode: no verapdf available and input is not PDF/A, outputting PDF\n",
      "Image optimization ratio: 1.18 savings: 15.2%\n",
      "Total file size ratio: 1.16 savings: 14.1%\n",
      "Output file is a PDF (auto mode)\n",
      "WARNING: D:\\Trabalhos\\Bot ExÃ©rcito\\OCR_limpo\\.\\NEP AGE.201_ocr.pdf (offset 1460): error decoding stream data for object 17 0: Pl_DCT::decompr\n"
     ]
    }
   ],
   "source": [
    "erros = [msg for _, msg in resultados if \"ERRO\" in msg]\n",
    "contagem = Counter(erros)\n",
    "\n",
    "for erro, n in contagem.most_common(20):\n",
    "    print(f\"\\n--- {n} ocorrência(s) ---\")\n",
    "    print(erro[:500])"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.14.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}