API Reference

Types

PaperFetch.BibEntry (Type)
BibEntry(key, type, fields)

Stable internal representation of one BibTeX entry.

fields stores lower-case BibTeX-style field names mapped to string values. The input file is never edited; BibEntry is only an analysis view.

Example

entry = BibEntry("smith2020", "article", Dict("doi" => "10.1000/example"))
entry.key

PaperFetch.WorkIdentifier (Type)
WorkIdentifier(kind, value)

A normalized identifier extracted from a bibliography entry.

kind is one of :doi, :isbn, :url, :arxiv, :pmid, or :openalex. value is the normalized identifier string.

Example

id = WorkIdentifier(:doi, "10.1000/example")
id.kind == :doi

PaperFetch.CandidateSource (Type)
CandidateSource(record, identifier)

A possible authority for a bibliography entry, recording which WorkIdentifier was used to find it.

Example

r = SourceRecord(provider="test", id="x")
id = WorkIdentifier(:doi, "10.1000/x")
cs = CandidateSource(r, id)
cs.identifier.kind == :doi

PaperFetch.SourceRecord (Type)
SourceRecord(; provider, id="", title=nothing, authors=String[], year=nothing, doi=nothing,
               url=nothing, journal=nothing, pages=nothing, publisher=nothing, pdf_url=nothing,
               raw=Dict{String,Any}())

Metadata about a work returned by an API, fixture, or landing-page adapter.

authors stores the creator list returned by the provider. For edited books and book chapters this may be compared with a BibTeX editor field when the entry has no author field.

Example

source = SourceRecord(provider="fixture", doi="10.1000/example", title="Example")
source.provider

PaperFetch.FieldComparison (Type)
FieldComparison(field, status, input, source, note)

Result of comparing one bibliography field with source metadata.

status is one of :exact, :normalized, :equivalent, :missing_input, :missing_source, :conflict, or :ambiguous.

Example

cmp = FieldComparison("doi", :exact, "10.1000/x", "10.1000/x", "same DOI")
cmp.status

PaperFetch.EntryReport (Type)
EntryReport(entry, sources, comparisons, confidence, notes, pdf_candidates)

Review result for a single bibliography entry.

notes contains entry-level diagnostics such as provider errors, discarded candidate sources, and warnings about self-comparison fallback. Field-level diagnostics live in comparisons.

Example

entry = BibEntry("x", "misc", Dict("title" => "Example"))
report = EntryReport(entry, SourceRecord[], FieldComparison[], 0.0, ["no source"], String[])
report.entry.key

PaperFetch.FetchResult (Type)
FetchResult(key, status, file, source_url, final_url, note, sha256, bytes)

Manifest record for one PDF fetch attempt.

Example

result = FetchResult("x", "skipped", nothing, nothing, nothing, "no PDF", nothing, 0)
result.status

Input And Normalization

PaperFetch.read_bibtex (Function)
read_bibtex(path; check=:warn)

Read a BibTeX file into BibEntry values using BibParser.jl.

Entries are returned sorted by key for stable, reproducible ordering.

Example

entries = read_bibtex("examples/01_exact_article.bib"; check=:none)
length(entries) >= 1

PaperFetch.read_items (Function)
read_items(path; check=:warn)

Read bibliography input. BibTeX files are parsed with BibParser; plain text files are interpreted as one DOI or URL per non-comment line.

Item keys for plain-text input are item1, item2, … in line order, skipping blank lines and comments.

Example

items = read_items("examples/11_plain_dois.txt"; check=:none)
length(items) == 2
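
The plain-text rule above can be pictured with a short sketch. It is not PaperFetch's implementation: plain_items is a hypothetical helper, and the # comment marker is an assumption, since this page does not say which comment syntax read_items accepts.

```julia
# Hypothetical helper, not part of PaperFetch.
# Assumption: comment lines start with '#'.
function plain_items(text::AbstractString)
    items = Pair{String,String}[]
    for line in eachline(IOBuffer(text))
        s = strip(line)
        # Blank lines and comments do not consume an item number.
        (isempty(s) || startswith(s, "#")) && continue
        push!(items, "item$(length(items) + 1)" => String(s))
    end
    return items
end

items = plain_items("""
# two works
10.1000/example

https://example.org/paper
""")
items == ["item1" => "10.1000/example", "item2" => "https://example.org/paper"]
```

Keys are numbered by accepted line rather than by physical line number, which is how the description above reads.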

PaperFetch.extract_identifiers (Function)
extract_identifiers(entry)

Extract normalized WorkIdentifier values from a BibEntry.

Checks the DOI, arXiv eprint (when archiveprefix is arXiv), ISBN, PMID, and URL fields, in that priority order. DOI-like strings, arXiv identifiers, and URLs are also recovered from common misplaced locations such as the note and howpublished fields and LaTeX \url{...} macros.

Example

entry = BibEntry("x", "article", Dict("doi" => "10.1000/example"))
ids = extract_identifiers(entry)
ids[1].kind == :doi
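
Recovery from a misplaced field might look roughly like the sketch below. The helper find_doi, the regular expression, and the field search order are all assumptions made for illustration; the package's actual matching rules are not shown on this page.

```julia
# Hypothetical helper, not part of PaperFetch.
# Assumption: a DOI looks like "10.<4-9 digits>/<suffix>" and ends at
# whitespace or a brace.
function find_doi(fields::Dict{String,String})
    pattern = r"\b10\.\d{4,9}/[^\s{}]+"
    # Assumed search order: the proper field first, then misplaced ones.
    for name in ("doi", "note", "howpublished", "url")
        haskey(fields, name) || continue
        m = match(pattern, fields[name])
        m === nothing && continue
        # Drop sentence punctuation that trails the DOI, then lower-case it.
        return lowercase(rstrip(m.match, ['.', ',', ';']))
    end
    return nothing
end

find_doi(Dict("note" => "See https://doi.org/10.1000/Example.")) == "10.1000/example"
find_doi(Dict("title" => "No identifier here")) === nothing
```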

PaperFetch.normalize_doi (Function)
normalize_doi(value)

Normalize a DOI to a lower-case bare DOI string.

Example

normalize_doi("https://doi.org/10.1000/ABC") == "10.1000/abc"
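
As a minimal sketch of what "lower-case bare DOI" means, the hypothetical function below strips a resolver prefix and lower-cases the rest. Handling of the doi: prefix is an assumption; this page only shows the https://doi.org/ form.

```julia
# Hypothetical stand-in, not PaperFetch.normalize_doi itself.
function sketch_normalize_doi(value::AbstractString)
    s = lowercase(strip(value))
    # Remove a resolver prefix such as https://doi.org/ or http://dx.doi.org/.
    s = replace(s, r"^https?://(dx\.)?doi\.org/" => "")
    # Assumption: a leading "doi:" label is also dropped.
    s = replace(s, r"^doi:\s*" => "")
    return s
end

sketch_normalize_doi("https://doi.org/10.1000/ABC") == "10.1000/abc"
sketch_normalize_doi("doi:10.1000/ABC") == "10.1000/abc"
```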

PaperFetch.normalize_url (Function)
normalize_url(value)

Normalize a URL for tolerant comparison.

DOI resolver URLs (https://doi.org/10.x/y, https://dx.doi.org/10.x/y) are canonicalized to doi:<normalized-doi>. Other HTTP(S) URLs have their scheme stripped, host lowercased, default port removed, and trailing slashes or punctuation removed. URL paths and queries keep their original case because many web servers treat those components as case-sensitive.

Example

normalize_url("https://doi.org/10.1000/ABC") == "doi:10.1000/abc"
normalize_url("https://doi.org/10.1000/abc") == normalize_url("https://dx.doi.org/10.1000/ABC")
normalize_url("https://example.org/") == "example.org"
normalize_url("https://Example.org/Data/File.pdf?ID=ABC") == "example.org/Data/File.pdf?ID=ABC"
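
The rules above can be approximated in a few lines of plain Julia. This sketch is an illustration under stated assumptions, not the package's code: sketch_normalize_url is a hypothetical name, and the set of trailing punctuation characters it strips is a guess.

```julia
# Hypothetical stand-in, not PaperFetch.normalize_url itself.
function sketch_normalize_url(value::AbstractString)
    s = strip(value)
    # DOI resolver URLs collapse to "doi:<lower-case DOI>".
    m = match(r"^https?://(?:dx\.)?doi\.org/(.+)$"i, s)
    m !== nothing && return "doi:" * lowercase(m.captures[1])
    # Other HTTP(S) URLs: split into scheme, host[:port], and the rest.
    m = match(r"^(https?)://([^/?#]+)(.*)$"i, s)
    m === nothing && return String(s)
    scheme = lowercase(m.captures[1])
    host = lowercase(m.captures[2])
    rest = m.captures[3]            # path and query keep their case
    # Remove the default port for the scheme.
    host = replace(host, (scheme == "https" ? r":443$" : r":80$") => "")
    # Assumption: trailing '/', '.', ',' and ';' count as removable punctuation.
    return String(rstrip(host * rest, ['/', '.', ',', ';']))
end

sketch_normalize_url("http://Example.org:80/Data/") == "example.org/Data"
```

Only the host is lower-cased, so two URLs that differ in path case stay distinct, matching the case-sensitivity note above.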

PaperFetch.normalize_text (Function)
normalize_text(value)

Normalize bibliographic text for tolerant comparison.

Removes BibTeX braces and LaTeX accent commands, applies Unicode normalization (NFD + stripmark), lowercases, and collapses punctuation and whitespace.

Example

normalize_text(raw"{Caf\'e} Data") == "cafe data"
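
The steps listed above map onto the Unicode standard library fairly directly. The sketch below is a hypothetical approximation: it handles only a few LaTeX accent commands and treats all punctuation as a separator, which may differ from what the package does.

```julia
using Unicode

# Hypothetical stand-in, not PaperFetch.normalize_text itself.
function sketch_normalize_text(value::AbstractString)
    # Assumption: only the accent commands \' \` \^ \~ are handled here.
    s = replace(value, r"\\[`'^~]" => "")
    s = replace(s, r"[{}]" => "")                      # BibTeX braces
    s = Unicode.normalize(s, decompose=true, stripmark=true)
    s = lowercase(s)
    s = replace(s, r"[[:punct:]]+" => " ")             # punctuation to spaces
    return join(split(s), " ")                         # collapse whitespace
end

sketch_normalize_text(raw"{Caf\'e} Data") == "cafe data"
sketch_normalize_text("Café   Data!") == "cafe data"
```

A raw string literal is used for the LaTeX input so that the backslash reaches the function unchanged.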

Checking

PaperFetch.compare_entry (Function)
compare_entry(entry, sources; fields=nothing)

Compare one BibEntry with candidate source records and return an EntryReport.

By default, proceedings and chapter-style entries compare their container as booktitle; articles compare journal. Books and chapter-style entries with an editor but no author compare editor as the creator field.

Source records are treated as candidates, not automatic truth. Candidate resolution first rejects hard title, creator, or year mismatches, then requires enough identity evidence such as a matching DOI, matching title and creator, or matching title and year. A close-but-not-identical title can still be accepted when creator and year evidence are strong, with the title comparison marked for manual review. Extra source fields that are absent from the BibTeX entry remain visible as missing-input comparisons, but they do not by themselves make the source less likely to be the same work.

When journal-article metadata and arXiv preprint metadata both match the same entry with equal source-resolution confidence, the journal article is preferred and the report records that choice in its notes.

The comparison tolerates differences in bibliographic formatting, but conflicts are still reported explicitly. DOI values must match after DOI normalization. Author and editor names share a single normalization, which covers accents and initials.

Example

entry = BibEntry("x", "article", Dict("doi" => "10.1000/x", "title" => "A"))
source = SourceRecord(provider="fixture", doi="10.1000/x", title="A")
compare_entry(entry, [source]).confidence == 1.0
book = BibEntry("edited", "book", Dict("editor" => "Example, Erin", "title" => "Edited"))
src = SourceRecord(provider="fixture", title="Edited", authors=["Erin Example"])
any(cmp -> cmp.field == "editor", compare_entry(book, [src]).comparisons)
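
The candidate-resolution rule described above can be summarized as a small decision function. This is a simplified sketch, not the package's logic: accept_candidate and its :match / :mismatch / :unknown inputs are hypothetical, and the close-but-not-identical title case is left out.

```julia
# Hypothetical helper, not part of PaperFetch.
# Each keyword is :match, :mismatch, or :unknown.
function accept_candidate(; doi=:unknown, title=:unknown,
                          creator=:unknown, year=:unknown)
    # Step 1: a hard title, creator, or year mismatch rejects the candidate.
    any(==(:mismatch), (title, creator, year)) && return false
    # Step 2: require enough identity evidence.
    doi == :match && return true
    return title == :match && (creator == :match || year == :match)
end

accept_candidate(doi=:match) == true
accept_candidate(title=:match, year=:match) == true
accept_candidate(title=:match) == false
accept_candidate(doi=:match, year=:mismatch) == false
```

A matching title alone is not accepted in this sketch, reflecting the requirement for a second piece of evidence.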

PaperFetch.check_bibliography (Function)
check_bibliography(path; providers=AbstractProvider[], fixture=nothing,
                   email="noreply@example.org", use_apis=false,
                   cache_dir=nothing, rate_limit_seconds=0.05,
                   ignore_keys=Set(["anon"]), check=:warn,
                   progress_io=nothing)

Read a bibliography, collect source metadata, and return one EntryReport per entry.

The input file is not edited. Reports preserve the original BibTeX keys and are intended to guide a human or a separate editing step.

Provider selection order:

  1. A FixtureProvider is added when fixture is set.
  2. Explicitly supplied providers are appended.
  3. An ApiProvider is added when use_apis=true. It can query Crossref, OpenAlex, Unpaywall, DataCite, arXiv, Semantic Scholar, PubMed, CORE, Figshare, Open Library, Google Books, and URL landing pages as appropriate. For books without an ISBN, title/creator search results can supply an ISBN that is then used for ISBN-specific Open Library and Google Books lookups. GitHub repository URLs can use CITATION.cff as structured software citation metadata when such a file is available.
  4. If the provider list is still empty, a CandidateProvider is used as a read-only fallback that only echoes each entry's own title/doi/url back as its "source". This fallback cannot detect an incorrect DOI, title, or author. A @warn is emitted when it is used, and affected EntryReports carry a matching note.

Set cache_dir to a directory path to cache API responses between runs. Set rate_limit_seconds to the minimum delay between uncached live API requests made by the default ApiProvider. Set ignore_keys=nothing to keep all entries, including review artifacts such as anon. Set progress_io to an IO stream such as stderr to print entry-by-entry progress; leave it as nothing for quiet programmatic use.

Identifier recovery is deliberately forgiving: DOI, arXiv, PMID, ISBN, and URL values can be extracted from standard fields and common misplaced fields such as note and howpublished. Later comparison remains explicit about conflicts.

Example

reports = check_bibliography("examples/01_exact_article.bib";
    fixture="examples/metadata_fixture.json", check=:none)
length(reports) == 1
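
The provider selection order can be mimicked with provider names as plain strings. The function provider_order below is a hypothetical illustration of the four steps and does not construct real provider objects.

```julia
# Hypothetical helper, not part of PaperFetch; providers are just names here.
function provider_order(; fixture=nothing, providers=String[], use_apis=false)
    chosen = String[]
    fixture === nothing || push!(chosen, "FixtureProvider")  # step 1
    append!(chosen, providers)                               # step 2
    use_apis && push!(chosen, "ApiProvider")                 # step 3
    # Step 4: self-comparison fallback only when nothing else was selected.
    isempty(chosen) && push!(chosen, "CandidateProvider")
    return chosen
end

provider_order() == ["CandidateProvider"]
provider_order(fixture="examples/metadata_fixture.json", use_apis=true) ==
    ["FixtureProvider", "ApiProvider"]
```

The fallback appears only in the first call, which is the situation in which the warning and the matching report note are described as being emitted.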

Reports And Fetching

PaperFetch.write_reports (Function)
write_reports(reports, outdir; basename="paperfetch_report")

Write Markdown and INC reports for the given EntryReport values.

The default basename is paperfetch_report for direct API calls. CLI-generated reports use the input file stem unless --report-basename is supplied. Pass basename explicitly when a different output name is needed.

Markdown reports include entry-level general flags and field-level comparison flags. INC reports contain one row per compared field, or one red no_comparison row when no source comparison was possible.

Example

entry = BibEntry("x", "misc", Dict("title" => "Example"))
report = EntryReport(entry, SourceRecord[], FieldComparison[], 0.0, String[], String[])
paths = write_reports([report], mktempdir())
haskey(paths, :markdown) && haskey(paths, :inc)

PaperFetch.fetch_pdfs (Function)
fetch_pdfs(reports, outdir; cookie_file=nothing, ezproxy=nothing, progress_io=nothing)

Download PDF candidates from reports and write INC and Markdown manifests.

Only explicit PDF candidate URLs are attempted. Missing PDFs are recorded as skipped, not as validation failures. The function returns the fetch results and the path to manifest.inc; manifest.md is written in the same directory for human review. Set progress_io to an IO stream such as stderr to print per-reference fetch progress.

Example

entry = BibEntry("x", "misc", Dict("title" => "No PDF"))
report = EntryReport(entry, SourceRecord[], FieldComparison[], 0.0, String[], String[])
results, manifest = fetch_pdfs([report], mktempdir())
results[1].status == "skipped" && basename(manifest) == "manifest.inc"

Command Line

PaperFetch.main (Function)
main(args=ARGS)

Command-line entry point.

The check command writes Markdown and INC reports. The fetch command also writes manifest.inc, manifest.md, and any successfully downloaded PDFs. Report basenames default to the input file stem unless --report-basename is supplied. Progress is written to stderr by default; pass --quiet to suppress it.

Example

PaperFetch.main(["check", "examples/01_exact_article.bib", "--fixture", "examples/metadata_fixture.json", "--outdir", mktempdir()])