API Reference
The public API is centered on the package entry points, document model, library model, BibTeX handling, query handling, and import helpers.
Package
archivum project.
Document
Document class
Combines best of Gemini and my original document class.
renamer moved into utilities.
v 1.0 2025-12-06
- class archivum.document.Document(doc_path, book_mode=False)[source]
Manages a physical PDF document. Uses a Gather -> Rank -> Verify strategy to reconcile Metadata, Filenames, and OCR.
- Parameters:
doc_path (Path)
book_mode (bool)
- process()[source]
Orchestrates the discovery pipeline by prioritizing evidence:
Gather: Collect raw info from Filename, PDF Metadata, and Visual OCR.
Prioritized Enhance: Attempt lookup using a found DOI or ArXiv ID. If successful, accept the result as definitive.
Fallback Enhance: If no ID was found, determine the best local ‘Anchor’, search external APIs, and validate the results.
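The prioritization above can be sketched roughly as follows. This is an illustration only, not the library's implementation: the `Evidence` dataclass, its field names, the `choose_strategy` function, and the ranking of fallback anchors are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Evidence:
    """Hypothetical container for what the Gather step collected."""
    doi: Optional[str] = None
    arxiv_id: Optional[str] = None
    filename_title: Optional[str] = None
    metadata_title: Optional[str] = None
    ocr_title: Optional[str] = None


def choose_strategy(ev: Evidence) -> Tuple[str, Optional[str]]:
    """Return (strategy, key): an ID lookup when possible, else an anchor search."""
    # Prioritized Enhance: a DOI or arXiv ID is treated as definitive.
    if ev.doi:
        return "doi_lookup", ev.doi
    if ev.arxiv_id:
        return "arxiv_lookup", ev.arxiv_id
    # Fallback Enhance: pick the best local anchor. The order used here
    # (metadata, then OCR, then filename) is an assumption for illustration.
    for anchor in (ev.metadata_title, ev.ocr_title, ev.filename_title):
        if anchor:
            return "anchor_search", anchor
    return "no_evidence", None
```

In the fallback branch the real pipeline also validates the external search results; that step is omitted here.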
- text_path(text_dir_path, extractor)[source]
Return Path to where text is or will be stored. Mirrors the sharded structure: text_dir / first_2_of_fn / fn.md where fn starts with first 10 chars of hash.
- Parameters:
text_dir_path (Path)
extractor (str)
- Return type:
Path
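A minimal sketch of the sharded layout described for text_path, assuming the file name is exactly the first 10 characters of the hash. The real name may carry more after that prefix, and `sharded_text_path` is a hypothetical helper, not part of archivum.

```python
from pathlib import Path


def sharded_text_path(text_dir: Path, file_hash: str) -> Path:
    """Build text_dir / first_2_of_fn / fn.md for a given content hash."""
    # Assumption: fn is just the first 10 characters of the hash; the real
    # file name may append more (for example, something extractor-specific).
    fn = file_hash[:10]
    return text_dir / fn[:2] / f"{fn}.md"
```

Sharding on the first two characters keeps any single directory from accumulating every extract in the library.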
- text_exists(text_dir_path, extractor)[source]
Check if text file exists.
- Parameters:
text_dir_path (Path)
extractor (str)
- Return type:
bool
- extract_text(text_dir_path=None, extractor='pdftotext')[source]
Extracts text using pdftotext (or pymupdf as fallback/alternative). Stores result in self._text and returns it. If text_dir_path is provided, also saves to disk.
- Parameters:
text_dir_path (Path | None)
extractor (str)
- Return type:
str
- archivum.document.extract_text_for_paths(pdf_paths, text_dir_path, extractor='pdftotext', workers=4, hashes=None)[source]
Batch extract text from a list of PDF paths.
- Parameters:
pdf_paths (List[Path])
text_dir_path (Path)
extractor (str)
workers (int)
hashes (Dict[Path, str] | None)
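Batch extraction with a bounded number of workers might look like the sketch below. Whether archivum uses threads or processes is not stated here, so the thread pool is an assumption, and `extract_many` and `extract_one` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Dict, List


def extract_many(
    pdf_paths: List[Path],
    extract_one: Callable[[Path], str],
    workers: int = 4,
) -> Dict[Path, str]:
    """Run extract_one over pdf_paths with a bounded worker pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map preserves input order, so zip pairs each path with its text.
        texts = list(pool.map(extract_one, pdf_paths))
    return dict(zip(pdf_paths, texts))
```

Passing the extractor as a callable keeps the sketch independent of pdftotext or pymupdf being installed.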
- archivum.document.discover_docs(doc_path, lib)[source]
Discover documents in doc_path if it is a directory, or examine doc_path itself if it is a file.
- Parameters:
doc_path (Path)
- archivum.document.elaborate_duplicates(lib, duplicates, trim=True)[source]
Find the refs corresponding to duplicate hashes from discover_docs.
Return the ref if available. Unmatched hashes are returned in missing_refs.
Docs in missing_refs already exist in the Library but are orphans with no associated reference record.
Library
Manage config file and index database creation and updating.
Equivalent to and based on manager module in file_database.
Querying uses a file-database project-like combo regex-sql (querex) querier.
- exception archivum.library.LibraryImportBlocked[source]
Raised when an import analysis says the web ingest should not proceed.
- class archivum.library.Library(library_dir_name='', **overrides)[source]
Library specified by config yaml (archivum-config) file.
- Parameters:
library_dir_name (str)
- abspath(p)[source]
Resolve a library-relative path to an absolute path with caching.
- Parameters:
p (str | Path)
- Return type:
Path
- textpath(p)[source]
Return full text path from doc_df path. Does not check existence.
- Parameters:
p (str)
- Return type:
Path
- open_document(path)[source]
Try to open document at path (rel or abs).
- Parameters:
path (str | Path)
- link_document(tag, file_hash, version=0)[source]
Manually link a tag to a specific (hash, version).
- Parameters:
tag (str)
file_hash (str)
version (int)
- get_status_info()[source]
Return a dictionary containing status information for the library.
- Return type:
Dict
- property doc_df
Return the document df, loading if needed.
- property ref_df
Return the reference df, loading if needed.
- property ref_doc_df
Return the reference-document link df, loading if needed.
- property read_df
Return the read history df, loading if needed.
- record_read(file_hash, caller='')[source]
Record a read event for a specific file hash.
- Parameters:
file_hash (str)
caller (str)
- property database
Merged database, with exploded authors and read history.
- update(importer)[source]
Update internal database and save.
Invalidate all caches to force clean re-load.
Called by the import routine, after figuring what needs to be added.
importer is an import_bibtex.Bib2df_Incremental object.
- remove_reference(tag)[source]
Remove a reference and its links from the library.
- Parameters:
tag (str)
- update_reference(old_tag, new_data)[source]
Update or add a reference. Handles tag changes.
- Parameters:
old_tag (str)
new_data (dict)
- validate(task='sharding', execute=False)[source]
Audit and fix library structure. Tasks: ‘sharding’, ‘orphans’, ‘missing’
- Parameters:
task (str)
execute (bool)
- static get_library_path_list()[source]
Get a list of available libraries (no suffix) as a list of Paths (see also list).
- static rename_library(old_name, new_name)[source]
Rename a library folder and update its internal name.
- Parameters:
old_name (str)
new_name (str)
- static copy_library(old_name, new_name)[source]
Copy a library folder and update its internal name.
- Parameters:
old_name (str)
new_name (str)
- to_name_ex(name, strict=False)[source]
Extend name to longest match using a Trie; in strict mode adds as key if missing.
- next_tag(name, year)[source]
Return the next tag after name, year.
Remembers incremental tags handed out.
- reset_tag_allocator()[source]
Forget the tags handed out so far. Tags must be remembered within each dry run, yet the same tags must remain available to accept afterwards, so the allocator is reset between runs.
- property tag_allocator
Return the loaded key allocator for tag generation.
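The interplay of next_tag and reset_tag_allocator can be sketched as below. The `TagAllocator` class is hypothetical. The suffix rule (bare AuthorYYYY first, then a, b, ...) is an assumption based on the AuthorYYYY[a-z] tag format this reference gives for map_tags.

```python
import string


class TagAllocator:
    """Hypothetical allocator for AuthorYYYY, AuthorYYYYa, AuthorYYYYb, ..."""

    def __init__(self, existing=()):
        self._existing = set(existing)  # tags already in the library
        self._handed_out = set()        # tags issued since the last reset

    def next_tag(self, name: str, year: int) -> str:
        base = f"{name}{year}"
        # Try the bare tag first, then suffixes a..z (nothing beyond z here).
        for suffix in [""] + list(string.ascii_lowercase):
            tag = base + suffix
            if tag not in self._existing and tag not in self._handed_out:
                self._handed_out.add(tag)
                return tag
        raise ValueError(f"no free tag for {base}")

    def reset(self) -> None:
        """Forget tags issued during a dry run so they can be issued again."""
        self._handed_out.clear()
```

Keeping issued tags separate from library tags is what lets a dry run and the real run produce the same sequence.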
- run_ripgrep(pattern, args)[source]
Execute and format ripgrep search against library full text extracts.
- write_bibtex()[source]
Write out bibtex file of the library.
Lives in library/LIB_NAME/lib-name.bib with a symlink to config location.
- extract_all_text(force=False, workers=None, execute=False)[source]
Extract text for all documents in the library. If force=False, only extracts if the text file doesn’t exist. If execute=False, does nothing but log what would be done.
- Parameters:
force (bool)
workers (int)
execute (bool)
- get_text_info()[source]
Get info about the text: number of docs, number with text files etc., split by extension.
- clean_text_extracts(execute=False)[source]
Find (and delete if execute) text files with no corresponding document in the library.
- Parameters:
execute (bool)
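The execute flag follows a dry-run pattern: report by default, act only on request. A minimal sketch, assuming extracts are .md files whose stem identifies the document; `find_orphan_extracts` and `known_stems` are hypothetical names.

```python
from pathlib import Path
from typing import Iterable, List


def find_orphan_extracts(
    text_dir: Path, known_stems: Iterable[str], execute: bool = False
) -> List[Path]:
    """List .md extracts whose stem is not in known_stems; delete if execute."""
    known = set(known_stems)
    orphans = sorted(p for p in text_dir.rglob("*.md") if p.stem not in known)
    if execute:
        # Only touch the file system when explicitly asked to.
        for p in orphans:
            p.unlink()
    return orphans
```

Returning the same list in both modes makes it easy to review the dry-run output before repeating the call with execute=True.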
- audit()[source]
Perform a comprehensive structural audit of the library. Returns a dictionary of findings.
- These three should all be empty:
missing_physical_files: check that all files in doc_df actually exist.
broken_tag_links: tag in ref_doc but no actual ref.
broken_id_links: (hash, version) in ref_doc but no actual doc.
- These may be non-empty:
refs_missing_doc: a ref with no doc; no doc (afile) could be located. Expected.
docs_missing_ref: docs in doc_df with no reference, e.g. old versions of papers.
orphan_extracts: an actual extract exists that is not expected based on doc_df.
- reset_library()[source]
Reset a library back to empty state.
USE WITH CARE!
Deletes all data files and the bibtex link if it exists.
- initial_import(*, dir_name='', dir_iterable=None, errors_mapper=None, qd=<function display>, update=False, incremental=False)[source]
Import from dir_name, or iterate over dir_iterable if one is given. Find the unique BibTeX file; it is an error if the BibTeX file is not unique.
E.g. uber library created from
- initial_import_bibtex_file(bibtex_file, doc_dir=None, errors_mapper=None, qd=<function display>, update=True, incremental=False)[source]
Import a single bibtex file into library.
Use in prod when you know the bibtex will work to recreate from scratch.
- import_staged_document(bibtex_text, staged_document_path, *, known_hash=None, source_label='web-ingest', extract_text=True)[source]
Import one staged document through the same incremental BibTeX path as the CLI.
The edited BibTeX is the metadata source. This helper only injects the staged document as a Mendeley-style file field so Bib2df_Incremental can perform its normal author, tag, duplicate, sharding, and audit work.
- Parameters:
bibtex_text (str)
staged_document_path (str | Path)
known_hash (str | None)
source_label (str)
extract_text (bool)
- preview_staged_document_import(bibtex_text, staged_document_path, *, known_hash=None, source_label='web-ingest-preview')[source]
Return the BibTeX and analysis that the real staged import would produce.
- Parameters:
bibtex_text (str)
staged_document_path (str | Path)
known_hash (str | None)
source_label (str)
- Return type:
dict
- find(path)[source]
Hash a file and return the hash and any matching records.
- Parameters:
path (str | Path)
- enhance_refs(update=False)[source]
Run the enhancement process on references only, sort out duplicates etc.
Designed as a one-time run on initial import. Thereafter the import process itself guards against duplicates. It only addresses references and makes no change to docs. See enhance_docs for the corresponding doc version.
BibTeX
Functions for creating bibtex entries from various iterables.
v2 Hack off gemini, which actually was very poor for task at hand. v1 Gemini.
- archivum.bibtex.sanitize_for_latex(val)[source]
Sanitize string for LaTeX compatibility.
- Parameters:
val (Any)
- Return type:
str
- archivum.bibtex.format_mendeley_file(path)[source]
Format a path using Mendeley’s BibTeX file-field convention.
Windows paths are rendered as :C\:/path/to/file.pdf:pdf. Non-Windows paths keep their absolute/relative POSIX form and still include the final file type segment.
- Parameters:
path (Any)
- Return type:
str
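The file-field convention can be sketched as follows. `mendeley_file_field` is a hypothetical stand-in, not the library function, and detecting Windows paths by a leading drive letter is an assumption.

```python
from pathlib import PurePosixPath, PureWindowsPath


def mendeley_file_field(path: str) -> str:
    """Format path as ':<path>:<type>', escaping the Windows drive colon."""
    # Assumption: a Windows path is recognized by a leading drive letter.
    is_windows = len(path) > 1 and path[0].isalpha() and path[1] == ":"
    if is_windows:
        p = PureWindowsPath(path)
        # Forward slashes throughout; only the drive colon is escaped.
        body = p.as_posix().replace(":", "\\:", 1)
    else:
        p = PurePosixPath(path)
        body = p.as_posix()
    file_type = p.suffix.lstrip(".").lower()
    return f":{body}:{file_type}"
```

The trailing segment is taken from the file suffix, so a PDF always ends in `:pdf` regardless of platform.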
- archivum.bibtex.dict_to_bibtex(data, allowed_fields=None, raw_fields=None)[source]
Converts a dict-like object to a sanitized BibTeX string.
raw_fields bypasses LaTeX sanitization for fields where the literal value matters, such as Mendeley file paths.
- Parameters:
data (Any)
allowed_fields (List[str])
raw_fields (Iterable[str])
- Return type:
str
- archivum.bibtex.rows_to_bibtex(rows, allowed_fields=None, *, include_hash=False, include_file=False, path_resolver=None)[source]
Convert dataframe-like rows to BibTeX text using dict_to_bibtex.
This is the shared path for library-level and ad hoc web exports. The optional include_hash and include_file flags produce Archivum’s enriched BibTeX+ export without changing normal library BibTeX output.
- Parameters:
rows (Any)
allowed_fields (List[str])
include_hash (bool)
include_file (bool)
path_resolver (Callable[[Any], Path] | None)
- Return type:
str
Import BibTeX
BibTeX import helpers for archivum.
This module uses Bib2df_Incremental, derived from the Mendeley porting logic, to incrementally import new references from a BibTeX file into an existing Library.
Each import run is recorded under a timestamped directory so that the original .bib and a copy of the PDFs are preserved and the ETL is, in principle, replayable.
- class archivum.import_bibtex.Bib2df_Incremental(*, bibtex_file_path, doc_dir, reference_library, fillna=True, errors_mapper=None, remap_dashes=False, add_hashes=False, incremental=False, qd=None, write_audit=True)[source]
Bibtex file to dataframe - library aware.
Code started as a copy of mendeley_port.Bib2df, adjusted for incremental, library-aware use.
- property raw_df
DataFrame of raw(ish) information read directly from bibtex file.
- property ref_df
The reference df contains no file information and does not use tag as the index.
- property doc_df
Read file information for the current library’s pdf store.
Returns dataframe describing actual files (afiles). These may or may not be referenced in library.database. Currently only PDFs.
- property vfile_df
Information about virtual files (vfiles) found in the file field in the Mendeley bibtex file.
Parses file field created by Mendeley in order to discover them.
Mendeley’s internal file(s) field added to bibtex files. Looks like a semicolon separated list of the form:
:C\:/S/new-papers/Blackwell/1953_Equivalent Comparisons of Experiments.pdf:pdf
Oddly, empty vfiles are represented as ::.
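Parsing a file field of this shape could look like the sketch below. It assumes paths never contain a semicolon and that only the drive colon is escaped. `parse_file_field` is a hypothetical helper, not the property's actual implementation.

```python
import re
from typing import List, Tuple

# ':<path>:<type>' where <type> holds no colon; <path> may hold an escaped one.
_ENTRY = re.compile(r"^:(?P<path>.*):(?P<type>[^:]*)$")


def parse_file_field(field: str) -> List[Tuple[str, str]]:
    """Split a Mendeley-style file field into (path, file_type) pairs."""
    out = []
    # Assumption: paths never contain ';', so a plain split is enough.
    for entry in field.split(";"):
        m = _ENTRY.match(entry.strip())
        if not m or not m.group("path"):
            continue  # skips the empty '::' placeholder and malformed entries
        # Undo the escaped drive colon: 'C\:/S/...' becomes 'C:/S/...'.
        out.append((m.group("path").replace("\\:", ":"), m.group("type")))
    return out
```

Dropping empty entries at parse time means later matching against actual files never sees a blank path.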
- property ref_doc_df
Make the reference/document dataframe by matching vfiles to afiles.
vfiles (virtual files) are references within the file field in the mendeley bibtex file.
afiles are actual files that exist in the pdf_path directory.
- property author_map_df
DataFrame of author name showing a transition to a normalized form.
Adjusts for initials (puts periods in), takes the longest unique name using a Trie, and adjusts for accents (guesswork!).
For a new import into an empty library, this needs to be run on the authors in raw_df to prime the pump.
- property database
Merged database, with exploded authors.
- tex_to_unicode(s_in)[source]
Convert TeX codes in a string to Unicode and remove braces around single characters.
Errors are added to self._last_decode and looked up in the dictionary self.errors_mapper. Work iteratively: run, look at errors, add or update entries in self.errors_mapper.
- Parameters:
s_in (str)
- Return type:
str
- map_tags()[source]
Remap the tags into standard AuthorYYYY[a-z] format for named df.
Saves a dataframe showing what was done as part of import.
Updated to use reference library.
- import_bibtex_file()[source]
The work happens here! Do the actual import, and normalize each text-based field.
Runs through each task in turn, see comments.
For the initial port choose run_add_hoc=True, but for incremental updates use False.
Updated to remove ad_hoc adjustments, dropped extract citations from abstract, tags use library, etc.
Called automatically by ported_df property if needed.
- import_analysis(lib_test=True)[source]
Prepare a detailed analysis of the import. Returns a DataFrame with columns: tag | author | title | hash match | doi match | title match | action
- import_analysis_full(lib_test=True, strict=False)[source]
Original detailed diagnostic analysis. Shows scores, field changes, and raw vs ported comparison.
Query Helpers
- class archivum.search.query.QuerySpec(raw: str, kind: str, query: str, expression: str)[source]
- Parameters:
raw (str)
kind (str)
query (str)
expression (str)
- archivum.search.query.has_querex_symbol(query)[source]
Return True when query contains explicit querexfuzz syntax markers.
- Parameters:
query (str)
- Return type:
bool
- archivum.search.query.normalize_query(raw_query, *, default_limit=50, recent=True, projection='type, *', q_projection=None, fuzzy_projection=None, default_empty_pattern=None)[source]
Convert a web search string into the expression passed to querexfuzz.
- Parameters:
raw_query (str)
default_limit (int | None)
recent (bool)
projection (str)
q_projection (str | None)
fuzzy_projection (str | None)
default_empty_pattern (str | None)
- Return type:
CrossRef
Code for interacting with the crossref api.
With GPT.
v2 Gemini updates v1 GPT
- archivum.crossref.lookup_doi(doi)[source]
Lookup an individual doi string.
- Parameters:
doi (str)
- Return type:
Dict[str, Any] | None
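A single-DOI lookup against Crossref's public REST API can be sketched as below. The endpoint `https://api.crossref.org/works/{doi}` and the `message` key of its JSON response belong to the Crossref API. The function names, the injectable `fetch` parameter, and the use of urllib are assumptions, not the module's actual code.

```python
import json
import urllib.parse
import urllib.request
from typing import Any, Callable, Dict, Optional


def crossref_url(doi: str) -> str:
    """Build the Crossref REST API URL for a single work."""
    return "https://api.crossref.org/works/" + urllib.parse.quote(doi.strip(), safe="/")


def _http_get(url: str) -> str:
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8")


def lookup_doi_sketch(
    doi: str, fetch: Callable[[str], str] = _http_get
) -> Optional[Dict[str, Any]]:
    """Return the Crossref 'message' record for doi, or None on failure."""
    try:
        payload = json.loads(fetch(crossref_url(doi)))
    except (OSError, ValueError):
        # OSError covers network and HTTP errors; ValueError covers bad JSON.
        return None
    return payload.get("message") if isinstance(payload, dict) else None
```

Injecting `fetch` lets the parsing be exercised offline with a canned response.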
Utilities
Various utilities for archivum.
- archivum.utilities.djvu_convert_file(in_path, out_path, verbose=False, config=None)[source]
Convert a DjVu file to a searchable PDF.
Uses ddjvu for initial conversion and ocrmypdf (via WSL) for OCR.
- Parameters:
in_path (Path)
out_path (Path)
verbose (bool)
- Return type:
bool
- archivum.utilities.safe_int(s)[source]
Safe format of s as a year for greater_tables.
By default s may be interpreted as a float, so str(s) gives 2015.0, which is not wanted. Hence this function is needed.
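The float problem described for safe_int can be illustrated with a small sketch. `format_year` is a hypothetical stand-in, and its fallback behaviour for non-numeric input is an assumption.

```python
def format_year(s) -> str:
    """Render a year-like value without a trailing '.0'."""
    try:
        return str(int(float(s)))
    except (TypeError, ValueError, OverflowError):
        # Not numeric (None, 'n.d.', NaN, inf): fall back to plain text.
        return "" if s is None else str(s)
```

Going through float first means both 2015.0 and the string "2015.0" come out as "2015".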
- archivum.utilities.safe_file_size(s)[source]
Safe format of s as a file size for greater_tables.
By default s may be interpreted as a float, so str(s) gives a trailing .0, which is not wanted. Hence this function is needed.
- archivum.utilities.make_qd(max_string_length=50, max_rows=10, display_func=None, **gt_kwargs)[source]
Make a qd function with sensible defaults.
If display_func is None use IPython.display display.
- archivum.utilities.remove_accents(s)[source]
Remove accents from a string.
- Parameters:
s (str)
- Return type:
str
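One common way to remove accents is Unicode decomposition followed by dropping the combining marks. The sketch below uses the standard library's unicodedata module. `strip_accents` is a hypothetical name, and the library function may work differently.

```python
import unicodedata


def strip_accents(s: str) -> str:
    """Drop combining marks after NFKD decomposition."""
    decomposed = unicodedata.normalize("NFKD", s)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

Letters that have no decomposition, such as Ł, pass through unchanged with this approach and would need an explicit mapping.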
- archivum.utilities.trim_author(s)[source]
Clean author string: shorten names and, if there are more than three authors, truncate at three followed by et al.
- archivum.utilities.accent_mapper_dict(names, verbose=False)[source]
Make dict mapper for name -> accented name from list of names.
- archivum.utilities.sanitize_windows_component(name, max_length=255)[source]
Sanitize a string so it can be safely used as a single Windows path component (file or directory name).
- Rules:
Remove control characters and invalid punctuation.
Strip leading/trailing spaces and dots.
Avoid reserved device names (CON, PRN, AUX, NUL, COM1-9, LPT1-9).
Enforce a per-component length cap.
- Parameters:
name (str)
max_length (int)
- Return type:
str
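The rules above can be sketched as follows. The exact set of invalid characters, the underscore prefix for reserved names, and the "_" fallback for empty results are assumptions, and `sanitize_component` is a hypothetical stand-in for the library function.

```python
import re

# Reserved Windows device names: CON, PRN, AUX, NUL, COM1-9, LPT1-9.
RESERVED = {"CON", "PRN", "AUX", "NUL"}
RESERVED |= {f"COM{i}" for i in range(1, 10)} | {f"LPT{i}" for i in range(1, 10)}

# Control characters plus punctuation that Windows rejects in file names.
INVALID = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def sanitize_component(name: str, max_length: int = 255) -> str:
    """Make name usable as a single Windows path component."""
    cleaned = INVALID.sub("", name).strip(" .")
    # Device names are reserved with or without an extension (CON, con.txt).
    if cleaned.split(".")[0].upper() in RESERVED:
        cleaned = "_" + cleaned
    # Truncation can expose a trailing dot or space, so strip again.
    cleaned = cleaned[:max_length].rstrip(" .")
    return cleaned or "_"
```

Applying the length cap last ensures the prefix added for reserved names is counted against the limit.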
- archivum.utilities.suggest_name(author, title, year)[source]
Figure standard dir name and file name from author, title, and year.
Names are made filename safe.
author: str, in standard last, first [and…] format. title: str. year: str.
- Parameters:
author (str)
title (str)
year (str | int)
- archivum.utilities.rename(original_doc_name, doc_hash, pdf_dir_path, dir_name, file_name, hash_len=6, execute=False)[source]
Hard link original file into pdf_dir/dir_name/file_name.
Returns True if copied, else False.
- Parameters:
original_doc_name (str)
doc_hash (str)
pdf_dir_path (Path)
dir_name (str)
file_name (str)
hash_len (int)
execute (bool)
- Return type:
bool