API Reference

The public API is centered on the package entry points, document model, library model, BibTeX handling, query handling, and import helpers.

Package

archivum project.

archivum.resolve_path(p)[source]

Resolves a path relative to BASE_DIR if it’s not absolute or root-relative.

Parameters:: p (str)
Return type:: Path

Document

Document class

Combines best of Gemini and my original document class.

renamer moved into utilities.

v 1.0 2025-12-06

class archivum.document.Document(doc_path, book_mode=False)[source]

Manages a physical PDF document. Uses a Gather -> Rank -> Verify strategy to reconcile Metadata, Filenames, and OCR.

Parameters:

doc_path (Path)
book_mode (bool)

key()[source]: A reasonable default key to make reviewing easy. Filename based.

process()[source]

Orchestrates the discovery pipeline by prioritizing evidence:

Gather: Collect raw info from Filename, PDF Metadata, and Visual OCR.
Prioritized Enhance: Attempt lookup using a found DOI or ArXiv ID. If successful, accept the result as definitive.
Fallback Enhance: If no ID was found, determine the best local ‘Anchor’, search external APIs, and validate the results.
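The prioritization above can be sketched as a small function. This is only an illustration under stated assumptions, not the code in Document.process(): the name discover, the evidence keys, the three injected callables, and the status strings are all hypothetical. It also falls through to the fallback when an ID lookup fails, which the description above does not specify.

```python
from typing import Callable, Optional


def discover(
    evidence: dict,
    lookup_by_id: Callable[[str, str], Optional[dict]],
    search_by_anchor: Callable[[str], list],
    validate: Callable[[dict, dict], bool],
) -> dict:
    """Hypothetical sketch of a Gather -> Rank -> Verify pipeline."""
    # Prioritized enhance: a DOI or arXiv ID lookup that succeeds is definitive.
    for kind in ("doi", "arxiv"):
        identifier = evidence.get(kind)
        if identifier:
            record = lookup_by_id(kind, identifier)
            if record is not None:
                return {"status": "definitive", "record": record}

    # Fallback enhance: pick the best local anchor (this ranking is an assumption).
    anchor = next(
        (evidence[k] for k in ("metadata_title", "ocr_title", "filename") if evidence.get(k)),
        None,
    )
    if anchor is None:
        return {"status": "unresolved", "record": None}

    # Verify: accept the first external candidate that passes validation.
    for candidate in search_by_anchor(anchor):
        if validate(evidence, candidate):
            return {"status": "validated", "record": candidate}
    return {"status": "unresolved", "record": None}
```

Passing the lookup, search, and validation steps in as callables keeps the sketch free of network access, so the ordering logic can be exercised on its own.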

text_path(text_dir_path, extractor)[source]

Return Path to where text is or will be stored. Mirrors the sharded structure: text_dir / first_2_of_fn / fn.md where fn starts with first 10 chars of hash.

Parameters:

text_dir_path (Path)
extractor (str)

Return type:

Path
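A minimal sketch of the sharded layout described for text_path, assuming the file hash is already known. The function name sharded_text_path and the way the extractor name is appended to the file name are assumptions; the description only states that the file name starts with the first 10 characters of the hash and that the shard directory is its first two characters.

```python
from pathlib import Path


def sharded_text_path(text_dir: Path, file_hash: str, extractor: str) -> Path:
    """Hypothetical helper: text_dir / first_2_of_fn / fn.md."""
    # fn starts with the first 10 chars of the hash; the "-extractor" suffix is assumed.
    fn = f"{file_hash[:10]}-{extractor}"
    # The shard directory is the first two characters of the file name.
    return Path(text_dir) / fn[:2] / f"{fn}.md"
```

Sharding on a two-character prefix keeps any single directory from holding every extract in the library.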

text_exists(text_dir_path, extractor)[source]

Check if text file exists.

Parameters:

text_dir_path (Path)
extractor (str)

Return type:

bool

extract_text(text_dir_path=None, extractor='pdftotext')[source]

Extracts text using pdftotext (or pymupdf as fallback/alternative). Stores result in self._text and returns it. If text_dir_path is provided, also saves to disk.

Parameters:

text_dir_path (Path | None)
extractor (str)

Return type:

str

report(print_fn=<built-in function print>)[source]: Prints a comprehensive report of the discovery process, including the steps taken, final status, and the resulting BibTeX entry.

show_log(print_fn=<built-in function print>)[source]: Show the process log information.

archivum.document.extract_text_for_paths(pdf_paths, text_dir_path, extractor='pdftotext', workers=4, hashes=None)[source]

Batch extract text from a list of PDF paths.

Parameters:

pdf_paths (List[Path])
text_dir_path (Path)
extractor (str)
workers (int)
hashes (Dict[Path, str] | None)

archivum.document.discover_docs(doc_path, lib)[source]

Discover documents in doc_path if it is a directory, or information about doc_path itself if it is a file.

Parameters:: doc_path (Path)

archivum.document.elaborate_duplicates(lib, duplicates, trim=True)[source]

Find the refs corresponding to duplicate hashes from discover_docs.

Returns the ref where available. Unmatched documents are returned in missing_refs.

Docs in missing_refs already exist in the Library but are orphans with no associated reference record.

Library

Manage config file and index database creation and updating.

Equivalent to and based on manager module in file_database.

Querying uses a combined regex-SQL (querex) querier, similar to the one in the file-database project.

exception archivum.library.LibraryImportBlocked[source]: Raised when an import analysis says the web ingest should not proceed.

class archivum.library.Library(library_dir_name='', **overrides)[source]

Library specified by config yaml (archivum-config) file.

Parameters:: library_dir_name (str)

class LibraryChangeHandler(library)[source]

on_modified(event)[source]

Called when a file or directory is modified.

Parameters:: event (DirModifiedEvent or FileModifiedEvent) – Event representing file/directory modification.

start_watcher()[source]: Start the background filesystem watcher.

stop_watcher()[source]: Stop the background filesystem watcher.

abspath(p)[source]

Resolve a library-relative path to an absolute path with caching.

Parameters:: p (str | Path)
Return type:: Path

textpath(p)[source]

Return full text path from doc_df path. Does not check existence.

Parameters:: p (str)
Return type:: Path

open_document(path)[source]

Try to open document at path (rel or abs).

Parameters:: path (str | Path)

link_document(tag, file_hash, version=0)[source]

Manually link a tag to a specific (hash, version).

Parameters:

tag (str)
file_hash (str)
version (int)

reset()[source]: Reset all cache variables.

get_status_info()[source]

Return a dictionary containing status information for the library.

Return type:: Dict

property doc_df: Return the document df, loading if needed.

property ref_df: Return the document df, loading if needed.

property ref_doc_df: Return the document df, loading if needed.

property read_df: Return the read history df, loading if needed.

record_read(file_hash, caller='')[source]

Record a read event for a specific file hash.

Parameters:

file_hash (str)
caller (str)

save_read()[source]

Persist only read.feather (read history).

record_read mutates nothing but read history, so it must not trigger the full save() pipeline — rewriting ref/doc/ref-doc feathers, resaving config, clearing every cache via reset(), and rebuilding bibtex.bib. This writes just the one small frame.

property database: Merged database, with exploded authors and read history.

update(importer)[source]

Update internal database and save.

Invalidate all caches to force clean re-load.

Called by the import routine, after figuring what needs to be added.

importer is an import_bibtex.Bib2df_Incremental object.

remove_reference(tag)[source]

Remove a reference and its links from the library.

Parameters:: tag (str)

update_reference(old_tag, new_data)[source]

Update or add a reference. Handles tag changes.

Parameters:

old_tag (str)
new_data (dict)

validate(task='sharding', execute=False)[source]

Audit and fix library structure. Tasks: ‘sharding’, ‘orphans’, ‘missing’

Parameters:

task (str)
execute (bool)

save()[source]: Save config and all dataframes with aggressive safety checks.

distinct(c)[source]: Return distinct occurrences of col c.

static get_library_path_list()[source]: Get a list of available libraries (no suffix) as list of Paths (see also list).

static list()[source]: List of projects in the default location.

static list_deets()[source]: Dataframe of all projects in default location.

static rename_library(old_name, new_name)[source]

Rename a library folder and update its internal name.

Parameters:

old_name (str)
new_name (str)

static copy_library(old_name, new_name)[source]

Copy a library folder and update its internal name.

Parameters:

old_name (str)
new_name (str)

to_name_ex(name, strict=False)[source]: Extend name to longest match using a Trie; in strict mode adds as key if missing.

next_tag(name, year)[source]

Return the next tag after name, year.

Remembers incremental tags handed out.

reset_tag_allocator()[source]: Reset the tag allocator. New tags are remembered during each dry run, but you still need to be able to accept them afterwards; resetting the allocator makes that possible.
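The interplay of next_tag and reset_tag_allocator can be illustrated with a small stand-alone allocator. This is a sketch under assumptions, not the library's implementation: the class TagAllocator is hypothetical, and so is the rule that the first tag for a name and year is bare and later ones take suffixes a to z.

```python
import string


class TagAllocator:
    """Hypothetical allocator that remembers the tags it has handed out."""

    def __init__(self, existing=()):
        # Tags already present in the library.
        self._used = set(existing)

    def next_tag(self, name: str, year) -> str:
        base = f"{name}{year}"
        # Assumed scheme: bare tag first, then AuthorYYYYa, AuthorYYYYb, ...
        for suffix in ("", *string.ascii_lowercase):
            candidate = base + suffix
            if candidate not in self._used:
                self._used.add(candidate)
                return candidate
        raise ValueError(f"no free tag for {base}")

    def reset(self, existing=()):
        # Forget tags handed out so far, e.g. after a dry run.
        self._used = set(existing)
```

After a dry run, calling reset with the library's current tags lets the real run hand out the same tags the dry run previewed.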

property tag_allocator: Return the loaded key allocator for tag generation.

run_ripgrep(pattern, args)[source]: Execute and format ripgrep search against library full text extracts.

write_bibtex()[source]

Write out bibtex file of the library.

Lives in library/LIB_NAME/lib-name.bib with a symlink to config location.

update_hashes()[source]: Update hashes, save and reset.

extract_all_text(force=False, workers=None, execute=False)[source]

Extract text for all documents in the library. If force=False, only extracts if the text file doesn’t exist. If execute=False, does nothing but log what would be done.

Parameters:

force (bool)
workers (int)
execute (bool)

get_text_info()[source]: Get info about the text: number of docs, number with text files etc., split by extension.

clean_text_extracts(execute=False)[source]

Find (and delete if execute) text files with no corresponding document in the library.

Parameters:: execute (bool)

audit()[source]

Perform a comprehensive structural audit of the library. Returns a dictionary of findings.

These three should all be empty:

missing_physical_files: files listed in doc_df that do not exist on disk.
broken_tag_links: tag in ref_doc with no matching ref.
broken_id_links: (hash, version) in ref_doc with no matching doc.

These may be non-empty:

refs_missing_doc: a ref with no doc; no document (afile) could be located. Expected.
docs_missing_ref: docs in doc_df with no reference, e.g. old versions of papers.
orphan_extracts: a text extract exists that is not expected based on doc_df.
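The link checks in the audit are essentially set differences. The sketch below shows the idea with plain Python sets; the function audit_links and its argument shapes are hypothetical, and the real method works on the library dataframes and also checks physical files and text extracts.

```python
def audit_links(ref_tags, doc_ids, ref_doc_links):
    """Hypothetical sketch of the four link checks.

    ref_tags: iterable of tags; doc_ids: iterable of (hash, version) tuples;
    ref_doc_links: iterable of (tag, (hash, version)) pairs.
    """
    ref_tags, doc_ids = set(ref_tags), set(doc_ids)
    links = set(ref_doc_links)
    linked_tags = {tag for tag, _ in links}
    linked_ids = {doc_id for _, doc_id in links}
    return {
        # Should be empty: links that point at nothing.
        "broken_tag_links": sorted(linked_tags - ref_tags),
        "broken_id_links": sorted(linked_ids - doc_ids),
        # May be non-empty: records with no partner.
        "refs_missing_doc": sorted(ref_tags - linked_tags),
        "docs_missing_ref": sorted(doc_ids - linked_ids),
    }
```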

reset_library()[source]

Reset a library back to empty state.

USE WITH CARE!

Deletes all data files and the bibtex link if it exists.

initial_import(*, dir_name='', dir_iterable=None, errors_mapper=None, qd=<function display>, update=False, incremental=False)[source]

Import from dir_name, or iterate over dir_iterable if it is provided. Finds the unique BibTeX file; it is an error if the BibTeX file is not unique.

E.g. uber library created from

initial_import_bibtex_file(bibtex_file, doc_dir=None, errors_mapper=None, qd=<function display>, update=True, incremental=False)[source]

Import a single bibtex file into library.

Use in prod when you know the bibtex will work to recreate from scratch.

import_staged_document(bibtex_text, staged_document_path, *, known_hash=None, source_label='web-ingest', extract_text=True)[source]

Import one staged document through the same incremental BibTeX path as the CLI.

The edited BibTeX is the metadata source. This helper only injects the staged document as a Mendeley-style file field so Bib2df_Incremental can perform its normal author, tag, duplicate, sharding, and audit work.

Parameters:

bibtex_text (str)
staged_document_path (str | Path)
known_hash (str | None)
source_label (str)
extract_text (bool)

preview_staged_document_import(bibtex_text, staged_document_path, *, known_hash=None, source_label='web-ingest-preview')[source]

Return the BibTeX and analysis that the real staged import would produce.

Parameters:

bibtex_text (str)
staged_document_path (str | Path)
known_hash (str | None)
source_label (str)

Return type:

dict

history()[source]: The history of how self was built from the audit files.

classmethod list_stats()[source]: Combine stats df for all libraries.

find(path)[source]

Hash a file and return the hash and any matching records.

Parameters:: path (str | Path)

find_docs(dir_path=None)[source]: Find all document files in provided dir_path.

enhance_refs(update=False)[source]

Run the enhancement process on references only, sort out duplicates etc.

Designed as a one-time run on initial import. Thereafter the import process itself guards against duplicates. It only addresses references and makes no change to docs. See enhance_docs for the corresponding doc version.

save_enhance_audit(obj, base_path, name)[source]: Save object as CSV if pandas, else JSON to the enhance-audit folder.

make_tag_mapper()[source]: Make a tag mapping dictionary for library

get_tag_info(tag)[source]

Collate all information about a tag from ref, doc, and ref-doc. Returns a 2-column DataFrame: [Field, Value]

Parameters:: tag (str)
Return type:: DataFrame

BibTeX

Functions for creating bibtex entries from various iterables.

v2 Hack off gemini, which actually was very poor for task at hand. v1 Gemini.

archivum.bibtex.sanitize_for_latex(val)[source]

Sanitize string for LaTeX compatibility.

Parameters:: val (Any)
Return type:: str

archivum.bibtex.format_mendeley_file(path)[source]

Format a path using Mendeley’s BibTeX file-field convention.

Windows paths are rendered as :C\:/path/to/file.pdf:pdf. Non-Windows paths keep their absolute/relative POSIX form and still include the final file type segment.

Parameters:: path (Any)
Return type:: str
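The convention above can be sketched with pathlib's pure path classes. The function mendeley_file_field is a hypothetical stand-in for format_mendeley_file; its treatment of the drive letter and its lower-casing of the file type are assumptions based only on the description.

```python
from pathlib import PurePosixPath, PureWindowsPath


def mendeley_file_field(path: str) -> str:
    """Hypothetical sketch of Mendeley's ':path:type' file-field format."""
    win = PureWindowsPath(path)
    if win.drive:
        # Windows path: escape the drive colon and use forward slashes.
        drive = win.drive.replace(":", "\\:")
        body = f"{drive}/" + "/".join(win.parts[1:])
        suffix = win.suffix
    else:
        # Non-Windows path: keep the absolute or relative POSIX form.
        posix = PurePosixPath(path)
        body = posix.as_posix()
        suffix = posix.suffix
    # The final segment is the file type, taken from the extension.
    return f":{body}:{suffix.lstrip('.').lower()}"
```

Using the pure path classes means the sketch behaves the same on any operating system, because no filesystem access is involved.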

archivum.bibtex.dict_to_bibtex(data, allowed_fields=None, raw_fields=None)[source]

Converts a dict-like object to a sanitized BibTeX string.

raw_fields bypasses LaTeX sanitization for fields where the literal value matters, such as Mendeley file paths.

Parameters:

data (Any)
allowed_fields (List[str])
raw_fields (Iterable[str])

Return type:

str

archivum.bibtex.rows_to_bibtex(rows, allowed_fields=None, *, include_hash=False, include_file=False, path_resolver=None)[source]

Convert dataframe-like rows to BibTeX text using dict_to_bibtex.

This is the shared path for library-level and ad hoc web exports. The optional include_hash and include_file flags produce Archivum’s enriched BibTeX+ export without changing normal library BibTeX output.

Parameters:

rows (Any)
allowed_fields (List[str])
include_hash (bool)
include_file (bool)
path_resolver (Callable[[Any], Path] | None)

Return type:

str

archivum.bibtex.bibtex_to_dict(bibtex_str)[source]

Very simple BibTeX parser for a single entry. Returns {tag: {field: value, ‘type’: entry_type}}

Parameters:: bibtex_str (str)
Return type:: dict[str, dict[str, str]]
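To make the return shape concrete, here is a minimal regex-based sketch of a single-entry parser. It is not the code in bibtex_to_dict: the function parse_single_entry is hypothetical, handles at most one level of nested braces, and ignores string concatenation and @string macros.

```python
import re

# field = {braced value} | "quoted value" | bareword
_FIELD = re.compile(
    r'(\w+)\s*=\s*(?:\{((?:[^{}]|\{[^{}]*\})*)\}|"([^"]*)"|(\w+))'
)


def parse_single_entry(bibtex_str: str) -> dict:
    """Hypothetical sketch returning {tag: {field: value, 'type': entry_type}}."""
    header = re.search(r"@(\w+)\s*\{\s*([^,\s]+)\s*,", bibtex_str)
    if header is None:
        return {}
    entry_type, tag = header.group(1).lower(), header.group(2)
    fields = {"type": entry_type}
    # Scan the text after the header for field = value pairs.
    for m in _FIELD.finditer(bibtex_str[header.end():]):
        value = next(g for g in m.groups()[1:] if g is not None)
        fields[m.group(1).lower()] = value.strip()
    return {tag: fields}
```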

archivum.bibtex.dict_to_bibtex_crossref(data)[source]

Converts a dict-like object to a BibTeX string.

Suitable for the return value from Crossref.

Gemini code.

Parameters:: data (Any)
Return type:: str

Import BibTeX

BibTeX import helpers for archivum.

This module uses Bib2df_Incremental, derived from the Mendeley porting logic, to incrementally import new references from a BibTeX file into an existing Library.

Each import run is recorded under a timestamped directory so that the original .bib and a copy of the PDFs are preserved and the ETL is, in principle, replayable.

class archivum.import_bibtex.Bib2df_Incremental(*, bibtex_file_path, doc_dir, reference_library, fillna=True, errors_mapper=None, remap_dashes=False, add_hashes=False, incremental=False, qd=None, write_audit=True)[source]

Bibtex file to dataframe - library aware.

Code started as a copy of mendeley_port.Bib2df, adjusted for incremental, library-aware use.

property raw_df: DataFrame of raw(ish) information read directly from bibtex file.

property ref_df: The reference df contains no file information and has tag NOT as the index.

property doc_df

Read file information for the current library’s pdf store.

Returns dataframe describing actual files (afiles). These may or may not be referenced in library.database. Currently only PDFs.

property vfile_df

Information about virtual files (vfiles) found in the file field in the Mendeley bibtex file.

Parses file field created by Mendeley in order to discover them.

Mendeley’s internal file(s) field added to bibtex files. Looks like a semicolon separated list of the form:

:C\:/S/new-papers/Blackwell/1953_Equivalent Comparisons of Experiments.pdf:pdf

Oddly, empty vfiles are represented as ::.
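A short sketch of splitting such a file field into its virtual files. The function parse_mendeley_files is hypothetical and is not the vfile_df implementation; it assumes semicolons never occur inside a path and that the escaped drive colon is the only escape that needs undoing.

```python
def parse_mendeley_files(field: str) -> list:
    """Hypothetical sketch: return (path, file_type) pairs from a Mendeley file field."""
    vfiles = []
    for item in field.split(";"):
        item = item.strip()
        # Empty vfiles are represented as '::' and carry no information.
        if not item or item == "::":
            continue
        # The file type follows the last colon.
        body, _, file_type = item.rpartition(":")
        # Drop the leading colon and undo the escaped drive colon.
        if body.startswith(":"):
            body = body[1:]
        vfiles.append((body.replace("\\:", ":"), file_type))
    return vfiles
```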

property ref_doc_df

Make the reference/document dataframe by matching vfiles to afiles.

vfiles (virtual files) are references within the file field in the mendeley bibtex file.

afiles are actual files that exist in the pdf_path directory.

property author_map_df

DataFrame of author name showing a transition to a normalized form.

Adjusts for initials (puts periods in), takes the longest unique name using a Trie, and adjusts for accents (guesswork!).

For a new import into an empty library, this needs to be run on the authors in raw_df to prime the pump.

property database: Merged database, with exploded authors.

raw_no_file()[source]: Raw entries with no files listed.

static distinct(column_name, df)[source]: Return distinct occurrences of col c in df.

tex_to_unicode(s_in)[source]

Convert TeX codes in a string to Unicode and remove braces around a single character.

Errors are added to self._last_decode and looked up in the dictionary self.errors_mapper. Work iteratively: run, look at errors, add or update entries in self.errors_mapper.

Parameters:: s_in (str)
Return type:: str

map_tags()[source]

Remap the tags into standard AuthorYYYY[a-z] format for named df.

Saves a dataframe showing what was done as part of import.

Updated to use reference library.

author_mapper()[source]: dict mapper for author name.

map_authors(df_name)[source]: Actually apply the author mapper to the author column.

import_bibtex_file()[source]

The work happens here! Do the actual import, and normalize each text-based field.

Runs through each task in turn, see comments.

For the initial port choose run_add_hoc=True, but for incremental updates use False.

Updated to remove ad_hoc adjustments, dropped extract citations from abstract, tags use library, etc.

Called automatically by ported_df property if needed.

import_analysis(lib_test=True)[source]: Prepare a detailed analysis of the import. Returns a DataFrame with columns: tag | author | title | hash match | doi match | title match | action

import_analysis_full(lib_test=True, strict=False)[source]: Original detailed diagnostic analysis. Shows scores, field changes, and raw vs ported comparison.

update_library(save=True)[source]

Update self.library underlying files and save.

If self.incremental is True, also shards the new documents into the library’s document store.

save_audit_file(df, suffix)[source]: Save df audit file with a standard filename.

show_audit_files(top=5, trim=100, bib=False)[source]: Display all the audit files using qd.

show_generated_dfs()[source]: Use self.qd to display the main generated dfs.

show_unicode_errors()[source]: Accumulated Unicode errors.

Query Helpers

class archivum.search.query.QuerySpec(raw: str, kind: str, query: str, expression: str)[source]

Parameters:

raw (str)
kind (str)
query (str)
expression (str)

archivum.search.query.has_querex_symbol(query)[source]

Return True when query contains explicit querexfuzz syntax markers.

Parameters:: query (str)
Return type:: bool

archivum.search.query.normalize_query(raw_query, *, default_limit=50, recent=True, projection='type, *', q_projection=None, fuzzy_projection=None, default_empty_pattern=None)[source]

Convert a web search string into the expression passed to querexfuzz.

Parameters:

raw_query (str)
default_limit (int | None)
recent (bool)
projection (str)
q_projection (str | None)
fuzzy_projection (str | None)
default_empty_pattern (str | None)

Return type:

QuerySpec

CrossRef

Code for interacting with the crossref api.

With GPT.

v2 Gemini updates v1 GPT

archivum.crossref.lookup_doi(doi)[source]

Lookup an individual doi string.

Parameters:: doi (str)
Return type:: Dict[str, Any] | None
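For orientation, a DOI lookup against the public Crossref REST API can be sketched with the standard library alone. The function names crossref_work_url and fetch_work are hypothetical and this is not the code behind lookup_doi; the sketch assumes the works/{doi} endpoint and that the record sits under the "message" key of the JSON response.

```python
import json
import urllib.error
import urllib.parse
import urllib.request
from typing import Any, Dict, Optional


def crossref_work_url(doi: str) -> str:
    """Build the works URL for a DOI, leaving the DOI's slash unescaped."""
    return "https://api.crossref.org/works/" + urllib.parse.quote(doi.strip(), safe="/")


def fetch_work(doi: str, timeout: float = 10.0) -> Optional[Dict[str, Any]]:
    """Hypothetical sketch: return the Crossref record for a DOI, or None."""
    try:
        with urllib.request.urlopen(crossref_work_url(doi), timeout=timeout) as resp:
            payload = json.load(resp)
    except (urllib.error.URLError, json.JSONDecodeError):
        # Covers HTTP errors such as 404 for an unknown DOI, and network failures.
        return None
    return payload.get("message")
```

Returning None on any failure mirrors the documented return type of Dict[str, Any] | None.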

archivum.crossref.search_by_title(title, rows=1)[source]

Reverse lookup via title search.

Parameters:

title (str)
rows (int)

Return type:

Dict[str, Any] | None

archivum.crossref.search(query=None, title=None, author=None, rows=5, book_mode=False)[source]

Generic search: keywords, title, and/or author.

Parameters:

query (str | None)
title (str | None)
author (str | None)
rows (int)
book_mode (bool)

Return type:

List[Dict[str, Any]]

Utilities

Various utilities for archivum.

archivum.utilities.djvu_convert_file(in_path, out_path, verbose=False, config=None)[source]

Convert a DjVu file to a searchable PDF.

Uses ddjvu for initial conversion and ocrmypdf (via WSL) for OCR.

Parameters:

in_path (Path)
out_path (Path)
verbose (bool)

Return type:

bool

archivum.utilities.safe_int(s)[source]

Safe format of s as a year for greater_tables.

By default s may be interpreted as a float, so str(s) gives 2015.0, which is not wanted. Hence this function is needed.
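The float problem is easy to reproduce. The sketch below shows one way to format such a value as a whole number; format_year is a hypothetical name, and returning the input unchanged as a string when it is not numeric is an assumption about the desired fallback.

```python
def format_year(s) -> str:
    """Hypothetical sketch: render 2015.0 or '2015.0' as '2015'."""
    try:
        # Go through float first so strings like '2015.0' are accepted.
        return str(int(float(s)))
    except (TypeError, ValueError, OverflowError):
        # Not a finite number: fall back to the plain string form.
        return "" if s is None else str(s)
```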

archivum.utilities.safe_file_size(s)[source]

Safe format of s as a file size for greater_tables.

By default s may be interpreted as a float, so str(s) gives a trailing .0, which is not wanted. Hence this function is needed.

archivum.utilities.make_qd(max_string_length=50, max_rows=10, display_func=None, **gt_kwargs)[source]

Make a qd function with sensible defaults.

If display_func is None use IPython.display display.

archivum.utilities.remove_accents(s)[source]

Remove accents from a string.

Parameters:: s (str)
Return type:: str
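A common way to remove accents is Unicode decomposition followed by dropping the combining marks. The sketch below uses that technique; strip_accents is a hypothetical name and the library's own approach may differ.

```python
import unicodedata


def strip_accents(s: str) -> str:
    """Hypothetical sketch: decompose characters, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", s)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

Letters with no decomposition, such as ø, pass through unchanged with this approach, so a separate mapping is needed if they should be replaced too.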

archivum.utilities.trim_author(s)[source]: Clean author string: short names, truncate at 3 with et al. if more.

archivum.utilities.clean_latex(s)[source]: Remove LaTeX braces from a string.

archivum.utilities.accent_mapper_dict(names, verbose=False)[source]: Make dict mapper for name -> accented name from list of names.

archivum.utilities.suggest_filename(s)[source]: Clean file name for windows.

archivum.utilities.sanitize_windows_component(name, max_length=255)[source]

Sanitize a string so it can be safely used as a single Windows path component (file or directory name).

Rules:

Remove control characters and invalid punctuation.
Strip leading/trailing spaces and dots.
Avoid reserved device names (CON, PRN, AUX, NUL, COM1-9, LPT1-9).
Enforce a per-component length cap.

Parameters:

name (str)
max_length (int)

Return type:

str
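The four rules can be sketched as follows. The function sanitize_component is a hypothetical stand-in; the exact set of punctuation removed, the underscore prefix used to avoid reserved names, and the underscore returned for an empty result are assumptions.

```python
import re

_RESERVED = {
    "CON", "PRN", "AUX", "NUL",
    *(f"COM{i}" for i in range(1, 10)),
    *(f"LPT{i}" for i in range(1, 10)),
}


def sanitize_component(name: str, max_length: int = 255) -> str:
    """Hypothetical sketch of a Windows-safe path component."""
    # Rule 1: remove control characters and characters Windows forbids in names.
    cleaned = re.sub(r'[\x00-\x1f<>:"/\\|?*]', "", name)
    # Rule 2: strip leading and trailing spaces and dots.
    cleaned = cleaned.strip(" .")
    # Rule 3: avoid reserved device names (checked on the part before the first dot).
    if cleaned.split(".", 1)[0].upper() in _RESERVED:
        cleaned = "_" + cleaned
    # Rule 4: enforce the length cap, then re-strip in case the cut exposed a dot.
    cleaned = cleaned[:max_length].rstrip(" .")
    return cleaned or "_"
```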

archivum.utilities.suggest_name(author, title, year)[source]

Figure standard dir name and file name from author, title, and year.

Names are made filename safe.

author: str, in standard last, first [and…] format
title: str
year: str

Parameters:

author (str)
title (str)
year (str | int)

archivum.utilities.rename(original_doc_name, doc_hash, pdf_dir_path, dir_name, file_name, hash_len=6, execute=False)[source]

Hard link original file into pdf_dir/dir_name/file_name.

Returns True if copied, else False.

Parameters:

original_doc_name (str)
doc_hash (str)
pdf_dir_path (Path)
dir_name (str)
file_name (str)
hash_len (int)
execute (bool)

Return type:

bool