Commit Graph

13 Commits

Author SHA1 Message Date
Jake Kenneally
f0ac68d26c Address review comments
- Renamed `getIndentWidth` to `getTextAdvanceX`
- Collapsed `Style` and `BlockStyle` into a single struct, and switched to using bitflag setup for determining font style in `EpdFontFamily::Style`, including underlined text
- Added caching for parsed CSS rules
- Reverted changes for fixing spurious spaces
- Skipped loading CSS on Sleep and HomeScreen activities, since we only need BookMetadata and the cover image
- Reverted changes to BookMetadataCache, since we don't need to cache the individual CSS files and can instead use the parsed CSS rules (and the new cache file for those)
- Switched intermediary values to direct assignment in `CssParser.cpp`
- Added function in `BlockStyle.h` to directly convert from a `CssStyle` to a `BlockStyle`, as well as combined multiple `BlockStyle`s together for nested elements that should inherit the parent's style when the child's is unspecified
- Updated names of variables in `CssStyle` to match those of the CSS they represent (e.g. alignment -> textAlign, indent -> textIndent)
- General cleaning up and simplifying the code
2026-02-02 22:18:06 -05:00
Jake Kenneally
834440aab4 Merge branch 'master' into feature/add-epub-css-parsing
* master: (33 commits)
  feat: add HalDisplay and HalGPIO (#522)
  feat: Display epub metadata on Recents (#511)
  chore: Cut release 0.16.0
  fix: Correctly render italics on image alt placeholders (#569)
  chore: .gitignore: add compile_commands.json & .cache (#568)
  fix: Render keyboard entry over multiple lines (#567)
  fix: missing front layout in mapLabels() (#564)
  refactor: Re-work for OTA feature (#509)
  perf: optimize large EPUB indexing from O(n^2) to O(n) (#458)
  feat: Add Spanish hyphenation support (#558)
  feat: Add support to B&W filters to image covers (#476)
  feat(ux): page turning on button pressed if long-press chapter skip is disabled (#451)
  feat: Add status bar option "Full w/ Progress Bar" (#438)
  fix: Validate settings on read. (#492)
  fix: rotate origin in drawImage (#557)
  feat: Extract author from XTC/XTCH files (#563)
  fix: add txt books to recent tab (#526)
  docs: add font generation commands to builtin font headers (#547)
  docs: Update README with supported languages for EPUB  (#530)
  fix: Fix KOReader document md5 calculation for binary matching progress sync (#529)
  ...
2026-01-27 20:24:38 -05:00
Daniel Chelling
83315b6179
perf: optimize large EPUB indexing from O(n^2) to O(n) (#458)
## Summary

Optimizes EPUB metadata indexing for large books (2000+ chapters) from
~30 minutes to ~50 seconds by replacing O(n²) algorithms with O(n log n)
hash-indexed lookups.

Fixes #134

## Problem

Three phases had O(n²) complexity due to nested loops:

| Phase | Operation | Before (2768 chapters) |
|-------|-----------|------------------------|
| OPF Pass | For each spine ref, scan all manifest items | ~25 min |
| TOC Pass | For each TOC entry, scan all spine items | ~5 min |
| buildBookBin | For each spine item, scan ZIP central directory | ~8.4
min |

Total: **~30+ minutes** for first-time indexing of large EPUBs.

## Solution

Replace linear scans with sorted hash indexes + binary search:

- **OPF Pass**: Build `{hash(id), len, offset}` index from manifest,
binary search for each spine ref
- **TOC Pass**: Build `{hash(href), len, spineIndex}` index from spine,
binary search for each TOC entry
- **buildBookBin**: New `ZipFile::fillUncompressedSizes()` API - single
ZIP central directory scan with batch hash matching

All indexes use FNV-1a hashing with length as secondary key to minimize
collisions. Indexes are freed immediately after each phase.

## Results

**Shadow Slave EPUB (2768 chapters):**

| Phase | Before | After | Speedup |
|-------|--------|-------|---------|
| OPF pass | ~25 min | 10.8 sec | ~140x |
| TOC pass | ~5 min | 4.7 sec | ~60x |
| buildBookBin | 506 sec | 34.6 sec | ~15x |
| **Total** | **~30+ min** | **~50 sec** | **~36x** |

**Normal EPUB (87 chapters):** 1.7 sec - no regression.

## Memory

Peak temporary memory during indexing:
- OPF index: ~33KB (2770 items × 12 bytes)
- TOC index: ~33KB (2768 items × 12 bytes)
- ZIP batch: ~44KB (targets + sizes arrays)

All indexes cleared immediately after each phase. No OOM risk on
ESP32-C3.

## Note on Threshold

All optimizations are gated by `LARGE_SPINE_THRESHOLD = 400` to preserve
existing behavior for small books. However, the algorithms work
correctly for any book size and are faster even for small books:

| Book Size | Old O(n²) | New O(n log n) | Improvement |
|-----------|-----------|----------------|-------------|
| 10 ch | 100 ops | 50 ops | 2x |
| 100 ch | 10K ops | 800 ops | 12x |
| 400 ch | 160K ops | 4K ops | 40x |

If preferred, the threshold could be removed to use the optimized path
universally.

## Testing

- [x] Shadow Slave (2768 chapters): 50s first-time indexing, loads and
navigates correctly
- [x] Normal book (87 chapters): 1.7s indexing, no regression
- [x] Build passes
- [x] clang-format passes

## Files Changed

- `lib/Epub/Epub/parsers/ContentOpfParser.h/.cpp` - OPF manifest index
- `lib/Epub/Epub/BookMetadataCache.h/.cpp` - TOC index + batch size
lookup
- `lib/ZipFile/ZipFile.h/.cpp` - New `fillUncompressedSizes()` API
- `lib/Epub/Epub.cpp` - Timing logs

<details>
<summary><b>Algorithm Details</b> (click to expand)</summary>

### Phase 1: OPF Pass - Manifest to Spine Lookup

**Problem**: Each `<itemref idref="ch001">` in spine must find matching
`<item id="ch001" href="...">` in manifest.

```
OLD: For each of 2768 spine refs, scan all 2770 manifest items
     = 7.6M string comparisons

NEW: While parsing manifest, build index:
     { hash("ch001"), len=5, file_offset=120 }
     
     Sort index, then binary search for each spine ref:
     2768 × log₂(2770) ≈ 2768 × 11 = 30K comparisons
```

### Phase 2: TOC Pass - TOC Entry to Spine Index Lookup

**Problem**: Each TOC entry with `href="chapter0001.xhtml"` must find
its spine index.

```
OLD: For each of 2768 TOC entries, scan all 2768 spine entries
     = 7.6M string comparisons

NEW: At beginTocPass(), read spine once and build index:
     { hash("OEBPS/chapter0001.xhtml"), len=25, spineIndex=0 }
     
     Sort index, binary search for each TOC entry:
     2768 × log₂(2768) ≈ 30K comparisons
     
     Clear index at endTocPass() to free memory.
```

### Phase 3: buildBookBin - ZIP Size Lookup

**Problem**: Need uncompressed file size for each spine item (for
reading progress). Sizes are in ZIP central directory.

```
OLD: For each of 2768 spine items, scan ZIP central directory (2773 entries)
     = 7.6M filename reads + string comparisons
     Time: 506 seconds

NEW: 
  Step 1: Build targets from spine
          { hash("OEBPS/chapter0001.xhtml"), len=25, index=0 }
          Sort by (hash, len)
  
  Step 2: Single pass through ZIP central directory
          For each entry:
            - Compute hash ON THE FLY (no string allocation)
            - Binary search targets
            - If match: sizes[target.index] = uncompressedSize
  
  Step 3: Use sizes array directly (O(1) per spine item)
  
  Total: 2773 entries × log₂(2768) ≈ 33K comparisons
  Time: 35 seconds
```

### Why Hash + Length?

Using 64-bit FNV-1a hash + string length as a composite key:
- Collision probability: ~1 in 2⁶⁴ × typical_path_lengths
- No string storage needed in index (just 12-16 bytes per entry)
- Integer comparisons are faster than string comparisons
- Verification on match handles the rare collision case

</details>

---

_AI-assisted development. All changes tested on hardware._
2026-01-28 01:29:15 +11:00
Jake Kenneally
8f3d226bf3 increment versions to prevent error when opening cached EPUBs 2026-01-20 10:27:55 -06:00
Jake Kenneally
be2de1123b Merge remote-tracking branch 'origin' into feature/add-epub-css-parsing
* origin:
  fix: truncate chapter names that are too long (#422)
  feat: dict based Hyphenation (#305)
  fix: render U+FFFD replacement character instead of ? (#366)
  fix: Invert colors on home screen cover overlay when recent book is selected (#390)
  Adds KOReader Sync support (#232)
  feat: Change keyboard "caps" to "shift" & Wrap Keyboard (#377)
  fix: XTC 1-bit thumb BMP polarity inversion (#373)
2026-01-19 22:37:37 -06:00
Arthur Tazhitdinov
8824c87490
feat: dict based Hyphenation (#305)
## Summary

* Adds (optional) Hyphenation for English, French, German, Russian
languages

## Additional Context

* Included hyphenation dictionaries add approximately 280kb to the flash
usage (German alone takes 200kb)
* Trie encoded dictionaries are adopted from hypher project
(https://github.com/typst/hypher)
* Soft hyphens (and other explicit hyphens) take precedence over
dict-based hyphenation. Overall, the hyphenation rules are quite
aggressive, as I believe it makes more sense on our smaller screen.

---------

Co-authored-by: Dave Allie <dave@daveallie.com>
2026-01-19 12:56:26 +00:00
Jake Kenneally
94ce987f2c feat: Add CSS parsing and CSS support in EPUBs 2026-01-17 17:57:04 -05:00
Dave Allie
8f3df7e10e
fix: Handle EPUB 3 TOC to spine mapping when nav file in subdirectory (#332)
## Summary

- Nav file in EPUB 3 file is a HTML file with relative hrefs
- If this file exists anywhere but in the same location as the
content.opf file, navigating in the book will fail
- Bump the book cache version to rebuild potentially broken books

## Additional Context

- Fixes https://github.com/daveallie/crosspoint-reader/issues/264

---

### AI Usage

While CrossPoint doesn't have restrictions on AI tools in contributing,
please be transparent about their usage as it
helps set the right context for reviewers.

Did you use AI tools to help write this code?

- [ ] Yes
- [ ] Partially
- [x] No
2026-01-13 00:57:34 +11:00
Jonas Diemer
03f0ce04cc
Feature: go to text/start reference in epub guide section at first start (#156)
This parses the guide section in the content.opf for text/start
references and jumps to this on first open of the book.

Currently, this behavior will be repeated in case the reader manually
jumps to Chapter 0 and then re-opens the book. IMO, this is an
acceptable edge case (for which I couldn't see a good fix other than to
drag a "first open" boolean around).

---------

Co-authored-by: Sam Davis <sam@sjd.co>
Co-authored-by: Dave Allie <dave@daveallie.com>
2025-12-30 23:02:46 +11:00
Dave Allie
9f31f80c80
Show previous title for unnamed spines (#158)
## Summary

* Show previous title for unnamed spines
* The spec is a little unclear, but there are plenty of cases where
chapters are split up in parts and should show the previous chapter's
title
* List TOC items instead of spine items in chapter select
* Bump `BOOK_CACHE_VERSION` to `2` to force regeneration of spine item's
TOC indexes
2025-12-30 18:52:42 +11:00
Dave Allie
fb5fc32c5d
Add exFAT support (#150)
## Summary

* Swap to updated SDCardManager which uses SdFat
* Add exFAT support
  * Swap to using FsFile everywhere
* Use newly exposed `SdMan` macro to get to static instance of
SDCardManager
* Move a bunch of FsHelpers up to SDCardManager
2025-12-30 16:09:30 +11:00
Dave Allie
071ccb9d1b
Custom zip parsing (#140)
## Summary

* Use custom zip central directory parsing to lower memory usage when
loading zipped epub content
2025-12-29 21:17:29 +11:00
Dave Allie
b6bc1f7ed3
New book.bin spine and table of contents cache (#104)
## Summary

* Use single unified cache file for book spine, table of contents, and
core metadata (title, author, cover image)
* Use new temp item store file in OPF parsing to store items to be
rescaned when parsing spine
  * This avoids us holding these items in memory
* Use new toc.bin.tmp and spine.bin.tmp to build out partial toc / spine
data as part of parsing content.opf and the NCX file
  * These files are re-read multiple times to ultimately build book.bin

## Additional Context

* Spec for file format included below as an image
* This should help with:
  * #10 
  * #60 
  * #99
2025-12-24 22:36:13 +11:00