Indexing content using the Python API
Pagefind provides an interface to the indexing binary as a Python package you can install and import.
There are situations where using this Python package is beneficial:
- Integrating Pagefind into an existing Python project, e.g. writing a plugin for a static site generator that can pass in-memory HTML files to Pagefind. Pagefind can also return the search index in-memory, to be hosted via the dev mode alongside the files.
- Users looking to index their site and augment that index with extra non-HTML pages can run a standard Pagefind crawl with `add_directory` and augment it with `add_custom_record`.
- Users looking to use Pagefind’s engine for searching miscellaneous content such as PDFs or subtitles, where `add_custom_record` can be used to build the entire index from scratch.
#Installation
To install just the Python wrapper, and use a `pagefind` executable from your system:
python3 -m pip install 'pagefind'
To install the Python wrapper as well as the standard binary for your platform:
python3 -m pip install 'pagefind[bin]'
To install the Python wrapper as well as the extended binary for your platform:
python3 -m pip install 'pagefind[extended]'
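To confirm the wrapper installed correctly, the class used throughout the examples below can be imported from the command line (a quick sanity check, not part of the package’s own tooling):

python3 -c 'from pagefind.index import PagefindIndex; print("ok")'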
#Example Usage
import asyncio
import json
import logging
import os

from pagefind.index import PagefindIndex, IndexConfig

logging.basicConfig(level=os.environ.get("LOG_LEVEL", "INFO"))
log = logging.getLogger(__name__)

html_content = (
    "<html>"
    "  <body>"
    "    <main>"
    "      <h1>Example HTML</h1>"
    "      <p>This is an example HTML page.</p>"
    "    </main>"
    "  </body>"
    "</html>"
)


def prefix(pre: str, s: str) -> str:
    return pre + s.replace("\n", f"\n{pre}")


async def main():
    config = IndexConfig(
        root_selector="main", logfile="index.log", output_path="./output", verbose=True
    )
    async with PagefindIndex(config=config) as index:
        log.debug("opened index")
        new_file, new_record, new_dir = await asyncio.gather(
            index.add_html_file(
                content=html_content,
                url="https://example.com",
                source_path="other/example.html",
            ),
            index.add_custom_record(
                url="/elephants/",
                content="Some testing content regarding elephants",
                language="en",
                meta={"title": "Elephants"},
            ),
            index.add_directory("./public"),
        )
        print(prefix("new_file ", json.dumps(new_file, indent=2)))
        print(prefix("new_record ", json.dumps(new_record, indent=2)))
        print(prefix("new_dir ", json.dumps(new_dir, indent=2)))

        files = await index.get_files()
        for file in files:
            print(prefix("files", f"{len(file['content']):10}B {file['path']}"))


if __name__ == "__main__":
    asyncio.run(main())
All interactions with Pagefind are asynchronous, as they communicate with the native Pagefind binary in the background.
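Because each call returns a coroutine, additions can also be awaited one at a time instead of batched with asyncio.gather; a sequential sketch of part of the example above, run inside the same async with block:

# sequential equivalent of part of the asyncio.gather call above
new_record = await index.add_custom_record(
    url="/elephants/",
    content="Some testing content regarding elephants",
    language="en",
    meta={"title": "Elephants"},
)
new_dir = await index.add_directory("./public")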
#PagefindIndex
`pagefind.index.PagefindIndex` manages a Pagefind index. `PagefindIndex` operates as an async context manager: entering the context starts a backing Pagefind service and creates an in-memory index in the backing service, and exiting the context writes the in-memory index to disk and then shuts down the backing Pagefind service.
from pagefind.index import PagefindIndex

async def main():
    async with PagefindIndex() as index:  # open the index
        ...  # update the index
    # the index is closed here and files are written to disk.
Each method of `PagefindIndex` that talks to the backing Pagefind service can raise errors. If an error is thrown inside `PagefindIndex`’s context, the context closes without writing the index files to disk.
async def main():
    async with PagefindIndex() as index:  # open the index
        await index.add_directory("./public")
        raise Exception("not today")
        # the index closes without writing anything to disk
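A caller that wants to recover from a failed build rather than crash can catch the exception outside the context; a minimal sketch, where the build_index helper and its error-handling policy are illustrative rather than part of the API:

import logging
from pagefind.index import PagefindIndex

log = logging.getLogger(__name__)

async def build_index() -> bool:
    try:
        async with PagefindIndex() as index:
            await index.add_directory("./public")
    except Exception:
        log.exception("indexing failed; nothing was written to disk")
        return False
    return True  # the context exited cleanly, so the index files were written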
`PagefindIndex` optionally takes a configuration dictionary that can apply parts of the Pagefind CLI config. The options available at this level are:
from pagefind.index import PagefindIndex, IndexConfig

config = IndexConfig(
    root_selector="main",
    exclude_selectors="nav",
    force_language="en",
    verbose=True,
    logfile="index.log",
    keep_index_url=True,
    output_path="./output",
)

async def main():
    async with PagefindIndex(config=config) as index:
        ...
See Configuring the Pagefind CLI for documentation on these configuration options.
#index.add_directory
Indexes a directory from disk using the standard Pagefind indexing behaviour.
This is equivalent to running the Pagefind binary with `--site <dir>`.
# Index all the HTML files in the public directory
indexed_dir = await index.add_directory("./public")
page_count: int = indexed_dir["page_count"]
If the path provided is relative, it will be relative to the current working directory of your Python process.
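If the working directory is not predictable, an absolute path avoids the ambiguity; a sketch assuming the site directory sits next to the indexing script:

from pathlib import Path

# hypothetical: resolve the site directory relative to this script, not the CWD
site_dir = (Path(__file__).parent / "public").resolve()
indexed_dir = await index.add_directory(str(site_dir))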
# Index files in a directory matching a given glob pattern.
indexed_dir = await index.add_directory("./public", glob="**/*.{html}")
Optionally, a custom `glob` can be supplied, which controls which files Pagefind will consume within the directory. The default is shown, and the `glob` option can be omitted entirely.
See Wax patterns documentation for more details.
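As a sketch, a pattern covering more than one extension might look like the following; the exact pattern here is an assumption, so check the Wax syntax for specifics:

# hypothetical: also pick up .htm files within ./public
indexed_dir = await index.add_directory("./public", glob="**/*.{html,htm}")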
#index.add_html_file
Adds a virtual HTML file to the Pagefind index. Useful for files that don’t exist on disk, for example a static site generator that is serving files from memory.
html_content = (
    "<html lang='en'><body>"
    "  <h1>A Full HTML Document</h1>"
    "  <p> ... </p>"
    "</body></html>"
)

# Index a file as if Pagefind was indexing from disk
new_file = await index.add_html_file(
    content=html_content,
    source_path="other/example.html",
)

# Index HTML content, giving it a specific URL
new_file = await index.add_html_file(
    content=html_content,
    url="https://example.com",
)
The `source_path` should represent the path of this HTML file if it were to exist on disk. Pagefind will use this path to generate the URL. It should be a relative path, or an absolute path that points within the current working directory.

Instead of `source_path`, a `url` may be supplied to explicitly set the URL of this search result.

The `content` should be the full HTML source, including the outer `<html></html>` tags. This will be run through Pagefind’s standard HTML indexing process, and should contain any required Pagefind attributes to control behaviour.
If successful, the `file` object is returned, containing metadata about the completed indexing.
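For example, assuming the returned dictionary carries the same keys as the `add_custom_record` response shown below:

# assumption: the returned dict mirrors the add_custom_record response below
page_url: str = new_file["page_url"]
page_word_count: int = new_file["page_word_count"]
page_meta: dict[str, str] = new_file["page_meta"]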
#index.add_custom_record
Adds a direct record to the Pagefind index. Useful for adding non-HTML content to the search results.
custom_record = await index.add_custom_record(
    url="/contact/",
    content=(
        "My raw content to be indexed for search. "
        "Will be lightly processed by Pagefind."
    ),
    language="en",
    meta={
        "title": "Contact",
        "category": "Landing Page",
    },
    filters={"tags": ["landing", "company"]},
    sort={"weight": "20"},
)
page_word_count: int = custom_record["page_word_count"]
page_url: str = custom_record["page_url"]
page_meta: dict[str, str] = custom_record["page_meta"]
The `url`, `content`, and `language` fields are all required. `language` should be an ISO 639-1 code.

`meta` is optional, and is strictly a flat object of keys to string values. See the Metadata documentation for semantics.

`filters` is optional, and is strictly a flat object of keys to arrays of string values. See the Filters documentation for semantics.

`sort` is optional, and is strictly a flat object of keys to string values. See the Sort documentation for semantics.
When Pagefind is processing an index, number-like strings will be sorted numerically rather than alphabetically. As such, the value passed in should be `"20"` and not `20`.

If successful, the `file` object is returned, containing metadata about the completed indexing.
#index.get_files
Get raw data of all files in the Pagefind index. Useful for integrating a Pagefind index into the development mode of a static site generator and hosting these files yourself.
WATCH OUT: these files can be large enough to clog the pipe reading from the `pagefind` binary’s subprocess, causing a deadlock.
for file in (await index.get_files()):
    path: str = file["path"]
    content: str = file["content"]
    ...
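As a sketch of the “host these files yourself” use case, the entries can be collected into a path-keyed mapping that a hypothetical dev server then serves under the Pagefind bundle URL; only the path and content keys shown above are assumed:

# hypothetical in-memory bundle for a dev server to serve under /pagefind/
bundle: dict[str, str] = {}
for file in await index.get_files():
    bundle[file["path"]] = file["content"]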
#index.write_files
Calling index.write_files()
writes the index files to disk, as they would be written when running the standard Pagefind binary directly.
Closing the PagefindIndex
’s context automatically calls index.write_files
, so calling this function is not necessary in normal operation.
Calling this function won’t prevent files being written when the context closes, which may cause duplicate files to be written.
If calling this function manually, you probably want to also call index.delete_index()
.
config = IndexConfig(
    output_path="./public/pagefind",
)

async with PagefindIndex(config=config) as index:
    # ... add content to index

    # write files to the configured output path for the index:
    await index.write_files()

    # write files to a different output path:
    await index.write_files(output_path="./custom/pagefind")

    # prevent also writing files when closing the `PagefindIndex`:
    await index.delete_index()
The `output_path` option should contain the path to the desired Pagefind bundle directory. If relative, it is relative to the current working directory of your Python process.
#index.delete_index
Deletes the data for the given index from its backing Pagefind service. Doesn’t affect any written files or data returned by `get_files()`.
await index.delete_index()
Calling `index.get_files()` or `index.write_files()` doesn’t consume the index, and further modifications can be made. In situations where many indexes are being created, the `delete_index` call helps clear out memory from a shared Pagefind binary service.

Reusing a `PagefindIndex` object after calling `index.delete_index()` will cause errors to be returned.

Not calling this method is fine: these indexes will be cleaned up when your `PagefindIndex`’s context closes, its backing Pagefind service closes, or your Python process exits.
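As a sketch, when several small indexes are built against one long-lived service (`PagefindService` is covered below), each index can be written and then released; the list of site directories here is hypothetical:

from pagefind.index import IndexConfig
from pagefind.service import PagefindService

async def build_all():
    async with PagefindService() as service:
        for site_dir in ["./site-a", "./site-b"]:  # hypothetical input directories
            index = await service.create_index(
                config=IndexConfig(output_path=f"{site_dir}/pagefind"),
            )
            await index.add_directory(site_dir)
            await index.write_files()
            await index.delete_index()  # free this index’s memory in the shared service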
#PagefindService
`PagefindService` manages a Pagefind service running in a subprocess. `PagefindService` operates as an async context manager: when the context is entered, the backing service starts, and when the context exits, the backing service shuts down.
from pagefind.service import PagefindService

async def main():
    async with PagefindService() as service:  # the service launches
        ...
    # the service closes here

    # or you can write:
    service = await PagefindService().launch()
    ...
    await service.close()
You should invoke `PagefindService` directly when you want to use the same backing service for many indexes:
async with PagefindService() as service:
    default_index = await service.create_index()
    other_index = await service.create_index(
        config=IndexConfig(output_path="./search/nonstandard"),
    )
    await asyncio.gather(
        default_index.add_directory("./a"),
        other_index.add_directory("./b"),
    )
    await asyncio.gather(
        default_index.write_files(),
        other_index.write_files(),
    )