Drop #649 (2025-05-05): A Python State Of Mind

Docling; Notary; CVE Search MCP

No, I am not being held hostage, nor has this newsletter been hijacked.

The weekend Bonus Drop was thwarted by something virtually everyone who has owned a home for more than five years can likely sympathize with: H₂O. Since I had to suffer through a horrible mess on Sunday, I figured I could spread the pain around a bit by focusing on icky (but…sigh…useful) Python things today.


TL;DR

(This is an LLM/GPT-generated summary of today’s Drop using Ollama + Qwen 3 and a custom prompt.)

  • Docling is a Python library for document processing that preserves paragraph structure during conversion to formats like Markdown and JSON, ensuring accurate text segmentation for AI workflows (https://github.com/docling-project/docling).
  • Notary is a service that generates visual badges indicating whether Python packages on PyPI have provenance attestations, helping verify package trustworthiness (https://github.com/gojiplus/notary).
  • CVE Search MCP is an MCP server that provides access to CVE data through the CVE-Search API, enabling querying of vulnerabilities, vendors, products, and related details like CAPEC and CWE (https://github.com/roadwy/cve-search_mcp).

Docling

Docling is a FOSS document-processing Python library that prepares documents for text-processing/generative AI workflows. It parses various formats (PDF, DOCX, XLSX, HTML, images, and others) and exports them into structured representations like Markdown, HTML, and JSON. Its main strength is its advanced understanding of document layout, including recognition of paragraphs, headings, tables, formulas, images, and other structural elements. It’s similar to Jina.ai’s “Reader”, Microsoft’s nascent markitdown, and other converters.
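Getting started takes only a couple of lines; this mirrors the project’s documented quickstart (the arXiv URL is the sample input from Docling’s own README, its technical report):

```python
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
# Accepts local paths or URLs; PDF, DOCX, HTML, images, etc.
result = converter.convert("https://arxiv.org/pdf/2408.09869")

print(result.document.export_to_markdown())  # or export_to_dict() for JSON
```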

Docling has a relentless focus on paragraph handling. When parsing a document, Docling identifies and preserves paragraph boundaries as explicit, structured units in its internal representation (the DoclingDocument format). This ensures that paragraphs are recognized and exported as discrete blocks of text, maintaining their logical and visual separation regardless of original format. This structure is paramount for downstream processing/AI applications such as retrieval-augmented generation (RAG) or fine-tuning, which typically require clean, contextually coherent text segments for chunking and embedding.

This paragraph-centric focus is readily apparent in Docling’s Markdown and JSON exports. When converting a PDF or DOCX to Markdown, it outputs paragraphs as separate Markdown blocks, preserving the original document structure for both human readability and machine processing. This structured export integrates well with frameworks like LlamaIndex or LangChain, where paragraphs can function as atomic units for node parsing, chunking, and context retrieval.
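As a rough illustration of paragraphs-as-chunks, here’s a hedged sketch that walks the converted document’s text items. The `"paragraph"` label value is an assumption about the DoclingDocument schema (check docling-core’s `DocItemLabel` for the real values), and the input path is made up:

```python
from docling.document_converter import DocumentConverter

doc = DocumentConverter().convert("report.pdf").document  # illustrative path

# DoclingDocument exposes its text blocks via `.texts`, each with a
# structural `.label` and the raw `.text`; filter for paragraphs and
# use them as atomic retrieval chunks.
chunks = [item.text for item in doc.texts if item.label == "paragraph"]

# `chunks` can now feed chunking/embedding stages in LlamaIndex or LangChain.
```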

A significant challenge in paragraph detection occurs with plain text and Markdown inputs, particularly when documents contain line wrapping (hard line breaks for formatting rather than true paragraph breaks). Earlier versions of Docling sometimes misinterpreted single line breaks as paragraph boundaries, fragmenting text inappropriately. Recent updates addressed this issue, ensuring that only true paragraph breaks (such as double newlines in Markdown) are treated as new paragraphs, while single line breaks within paragraphs are ignored. This improvement aligns paragraph segmentation with the author’s intent and the document’s logical structure.
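A quick way to sanity-check that behavior yourself (the file name and wrapped text below are made up for the demo):

```python
from pathlib import Path
from docling.document_converter import DocumentConverter

# Hard-wrapped lines inside one paragraph, then a true paragraph break.
src = Path("wrapped.md")  # illustrative file name
src.write_text(
    "This sentence is hard-wrapped\n"
    "across two source lines.\n"
    "\n"
    "A blank line starts a second paragraph.\n"
)

doc = DocumentConverter().convert(src).document
print(doc.export_to_markdown())
# Recent Docling releases should merge the wrapped lines into one
# paragraph and split only at the blank line.
```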

For specialized workflows, Docling supports metadata tagging at the paragraph level. In semi-structured exports like its enhanced Markdown format, block-level metadata (such as paragraph_id) can be inserted before each paragraph. This enables precise tracking and referencing of paragraphs, valuable for AI training, document analysis, or reconstructing original layouts.

Rather than blather on, I’ll let the output speak: you can see the basic difference between Docling, Jina, and Markitdown in these three Markdown docs, which were generated by processing the Docling GitHub main page (the link at the top):


Notary

Notary is a digital-attestation badge (“shield”) generation service, crafted by friend-of-the-Drop soodoku, specifically designed for Python packages distributed via PyPI. Its purpose in life is to generate visual badges that indicate whether a given Python package release on PyPI has associated provenance attestations.

Provenance attestations are cryptographic statements that provide verifiable information about how and by whom a package was built, which helps users and developers assess the trustworthiness of a package.

The service works by querying the PyPI Integrity API for a package’s provenance data. When someone requests a badge for a specific package version and file, the service checks whether that package has provenance attestations. If attestations are present, it generates a green “Verified” badge; if not, it produces a red “None” badge. The badges are rendered using Shields.io, making them visually consistent with other status badges commonly seen in open source repositories.
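For the curious, that logic is simple enough to sketch in a few lines of Python. This is a hedged approximation, not Notary’s actual code: the Integrity API path and the Shields.io static-badge URL follow their documented public shapes, but the error handling and the example wheel filename are my guesses:

```python
import urllib.error
import urllib.request

def attestation_badge_url(project: str, version: str, filename: str) -> str:
    """Return a Shields.io badge URL reflecting a file's attestation status."""
    integrity = f"https://pypi.org/integrity/{project}/{version}/{filename}/provenance"
    try:
        with urllib.request.urlopen(integrity) as resp:
            verified = resp.status == 200  # provenance data exists
    except urllib.error.HTTPError:
        verified = False  # PyPI answers 404 when no attestations exist
    label, color = ("Verified", "green") if verified else ("None", "red")
    return f"https://img.shields.io/badge/PyPI_Attestation-{label}-{color}"

# Illustrative wheel filename; the real one comes from the release's files.
print(attestation_badge_url("pydantic", "2.7.2", "pydantic-2.7.2-py3-none-any.whl"))
```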

To use the service, you simply add a badge to your README or documentation using a specific URL format. For example, to display the attestation status for the pydantic package version 2.7.2, you would use:

![PyPI Attestation](https://notarypy.soodoku.workers.dev/badge/pydantic/2.7.2/pydantic-2) 

This approach provides a quick, visual way for folks to verify the supply chain integrity of Python packages directly from their documentation or repository pages.

We need far more of these services, and the R world could learn a thing or three from PEP 740.


CVE Search MCP

CVE Search MCP provides a Model Context Protocol (MCP) server that interfaces with the CIRCL-sponsored CVE-Search API. The server bridges the gap between MCP-capable clients and vulnerability data by letting folks query and interact with the CVE-Search database through the MCP framework. It enables comprehensive access to CVE data: browsing vendors and products, retrieving vulnerabilities by vendor and product, fetching details for specific CVE IDs, and obtaining the latest CVEs with expanded information, including CAPEC, CWE, and CPE details. The server also lets us check database metadata, such as update times.

To use it, you need Python 3.10 or later, the uv tool for dependency management and running the server, and an MCP-compatible client like Claude Desktop, Zed, Cline, or Roo Code. Setup is straightforward: clone the repository, install dependencies using uv, and configure your MCP client to point to the installation directory. Configuration is typical of current-gen MCP servers and involves specifying the command to run the server (using uv to execute main.py) and setting the appropriate directory path in your client’s configuration file.
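As an illustration, a Claude Desktop-style stanza might look like the following; the server name and directory path are placeholders, so check the repo’s README for the exact shape your client expects:

```json
{
  "mcpServers": {
    "cve-search_mcp": {
      "command": "uv",
      "args": ["--directory", "/ABSOLUTE/PATH/TO/cve-search_mcp", "run", "main.py"]
    }
  }
}
```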

The server exposes several endpoints or tools that return JSON data. You can retrieve lists of all vendors, all products for a given vendor, all vulnerabilities for a specific vendor and product, details for a particular CVE ID, and a list of the most recently updated CVEs. This makes it suitable for integration with security analysis tools, vulnerability management platforms, or custom dashboards that need real-time access to structured CVE information.
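To make that tool surface concrete, here’s a hedged sketch of the raw CVE-Search API calls those tools wrap, aimed at CIRCL’s public instance. The endpoint paths follow the classic CVE-Search API layout; verify them (and the response fields) before relying on this:

```python
import json
import urllib.request

BASE = "https://cve.circl.lu/api"  # public CIRCL instance

def get(path: str):
    """Fetch one API path and decode the JSON payload."""
    with urllib.request.urlopen(f"{BASE}/{path}") as resp:
        return json.load(resp)

vendors  = get("browse")                    # list every vendor
products = get("browse/microsoft")          # products for one vendor
vulns    = get("search/microsoft/office")   # CVEs for a vendor/product pair
cve      = get("cve/CVE-2021-44228")        # full record for one CVE ID
latest   = get("last")                      # most recently updated CVEs
db_info  = get("dbInfo")                    # database update metadata

print(cve.get("summary", "no summary field"))
```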

As an aside, The New Stack has a spiffy Python-based “howto” on “Building Your First Model Context Protocol Server”, should you wish to pollute your systems with more ugly Python code.


FIN

Remember, you can follow and interact with the full text of The Daily Drop’s free posts on:

  • 🐘 Mastodon via @dailydrop.hrbrmstr.dev@dailydrop.hrbrmstr.dev
  • 🦋 Bluesky via https://bsky.app/profile/dailydrop.hrbrmstr.dev.web.brid.gy

☮️