···
 
 ## Added
 
+### 2026-01-03
+
+- Implemented `malfestio-readability` crate: A custom, rule-based content extraction engine replacing `dom_smoothie`, featuring XPath support (ftr-site-config compatible) and a Mozilla Readability-based generic fallback.
+
 ### 2026-01-02
 
 - Published AT Protocol Lexicons for all core types (`org.stormlightlabs.malfestio.*`)
+
+### 2025-12-*
+
+- *TODO*
···
 - [Information Architecture](./docs/information-architecture.md) - Navigation and URL structure
 - [Data Model Mapping](./docs/data-model-mapping.md) - Lexicon to database mapping
 - [Roadmap](./docs/todo.md) - Development milestones
+
+## Test Data
+
+The file `crates/server/tests/data/1904.09828v2.pdf` is included for testing purposes.
+It contains the paper "Magic: The Gathering is Turing Complete" (arXiv:1904.09828).
+2
crates/server/Cargo.toml
···
 tracing-subscriber = { version = "0.3.22", features = ["env-filter"] }
 uuid = { version = "1.19.0", features = ["v4", "fast-rng"] }
 hickory-resolver = "0.24"
+pdf-extract = "0.10"
+docx-rs = "0.4"
···
+use anyhow::Result;
+use std::path::Path;
+
+pub mod docx;
+pub mod pdf;
+
+/// Trait for parsing documents (PDF, DOCX, etc.) and extracting text.
+pub trait DocumentParser {
+    /// Parse the document at the given path and return the extracted text.
+    fn parse(&self, path: &Path) -> Result<String>;
+}
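One way a caller might use the `DocumentParser` trait above is to pick an implementation from the file extension. The following is a minimal sketch, not code from this change: `parser_for` and `PlainTextParser` are hypothetical names introduced for illustration, and a boxed `std::error::Error` stands in for `anyhow::Result` so the example needs no external crates.

```rust
use std::error::Error;
use std::fs;
use std::path::Path;

// Assumption: stand-in for `anyhow::Result` to keep the sketch dependency-free.
type Result<T> = std::result::Result<T, Box<dyn Error>>;

/// Same shape as the trait in the diff.
pub trait DocumentParser {
    fn parse(&self, path: &Path) -> Result<String>;
}

/// Hypothetical parser used only for illustration: reads a file as UTF-8 text.
pub struct PlainTextParser;

impl DocumentParser for PlainTextParser {
    fn parse(&self, path: &Path) -> Result<String> {
        // `?` converts the `io::Error` into the boxed error type.
        Ok(fs::read_to_string(path)?)
    }
}

/// Hypothetical helper: choose a parser from the (case-insensitive) extension.
/// Returns `None` when there is no extension or no parser is registered for it.
pub fn parser_for(path: &Path) -> Option<Box<dyn DocumentParser>> {
    let ext = path.extension()?.to_str()?.to_ascii_lowercase();
    match ext.as_str() {
        "txt" | "md" => Some(Box::new(PlainTextParser)),
        // In the real crate, arms for "pdf" and "docx" would presumably
        // return `PdfParser` and `DocxParser`.
        _ => None,
    }
}
```

Returning `Box<dyn DocumentParser>` keeps the call site independent of the concrete parser type, which is the main benefit of defining the trait with `&self` and no generics.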
+13
crates/server/src/import/pdf.rs
···
+use super::DocumentParser;
+use anyhow::{Context, Result};
+use std::path::Path;
+
+pub struct PdfParser;
+
+impl DocumentParser for PdfParser {
+    fn parse(&self, path: &Path) -> Result<String> {
+        let text =
+            pdf_extract::extract_text(path).with_context(|| format!("Failed to extract text from PDF: {:?}", path))?;
+        Ok(text)
+    }
+}
+1
crates/server/src/lib.rs
···
 pub mod api;
 pub mod db;
 pub mod firehose;
+pub mod import;
 pub mod middleware;
 pub mod oauth;
 pub mod pds;
crates/server/tests/data/1904.09828v2.pdf
This is a binary file and will not be displayed.
+74
crates/server/tests/import_tests.rs
···
+use malfestio_server::import::{DocumentParser, docx::DocxParser, pdf::PdfParser};
+use std::path::PathBuf;
+
+fn get_test_data_path(filename: &str) -> PathBuf {
+    let mut path = PathBuf::from(env!("CARGO_MANIFEST_DIR"));
+    path.push("tests/data");
+    path.push(filename);
+    path
+}
+
+#[test]
+fn test_pdf_extraction() {
+    let path = get_test_data_path("1904.09828v2.pdf");
+    assert!(path.exists(), "Test PDF not found at {:?}", path);
+
+    let parser = PdfParser;
+    let result = parser.parse(&path);
+    assert!(result.is_ok(), "PDF parsing should succeed");
+
+    let content = result.unwrap();
+    assert!(!content.is_empty(), "Extracted content should not be empty");
+    assert!(
+        content.contains("Magic: The Gathering"),
+        "Content should contain 'Magic: The Gathering'"
+    );
+    assert!(
+        content.contains("Turing Complete"),
+        "Content should contain 'Turing Complete'"
+    );
+
+    assert!(
+        content.contains("Alex Churchill"),
+        "Content should contain author 'Alex Churchill'"
+    );
+
+    let content_lower = content.to_lowercase();
+    assert!(content_lower.contains("abstract"), "Content should contain 'Abstract'");
+
+    assert!(
+        content_lower.contains("introduction"),
+        "Content should contain 'Introduction'"
+    );
+    assert!(
+        content_lower.contains("references"),
+        "Content should contain 'References'"
+    );
+
+    assert!(
+        content.len() > 5000,
+        "Content should be substantial (likely > 5000 chars)"
+    );
+}
+
+#[test]
+fn test_docx_stub_extraction() {
+    let path = get_test_data_path("dummy.docx");
+    let parser = DocxParser;
+    let result = parser.parse(&path);
+
+    assert!(result.is_ok(), "DOCX stub should return Ok");
+    let content = result.unwrap();
+    assert!(
+        content.contains("not yet implemented"),
+        "Content should indicate stub implementation"
+    );
+}
+
+#[test]
+fn test_pdf_missing_file() {
+    let path = get_test_data_path("non_existent.pdf");
+    let parser = PdfParser;
+    let result = parser.parse(&path);
+    assert!(result.is_err(), "Parsing missing file should return error");
+}
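The `docx` module is not shown in this diff, so its contents are unknown here. Judging only from `test_docx_stub_extraction`, a stub that satisfies the test could look roughly like the sketch below. This is an assumption, not the author's implementation: the message text is invented apart from the phrase "not yet implemented" that the test checks for, and a boxed `std::error::Error` replaces `anyhow::Result` to keep the example dependency-free.

```rust
use std::error::Error;
use std::path::Path;

// Assumption: stand-in for `anyhow::Result`.
type Result<T> = std::result::Result<T, Box<dyn Error>>;

/// Same shape as the trait in the diff.
pub trait DocumentParser {
    fn parse(&self, path: &Path) -> Result<String>;
}

pub struct DocxParser;

impl DocumentParser for DocxParser {
    fn parse(&self, path: &Path) -> Result<String> {
        // Placeholder: the file is never opened, so a missing path still
        // yields `Ok`, which is what the stub test relies on.
        Ok(format!(
            "DOCX parsing is not yet implemented (requested: {})",
            path.display()
        ))
    }
}
```

A stub that returns `Ok` with a sentinel string keeps the test green, but callers cannot distinguish it from real extracted text; returning an error for unimplemented formats would be the stricter alternative.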
+40
docs/todo.md
···
 
 **Reference:** [Ozone Moderation Service](https://github.com/bluesky-social/ozone)
 
+### Milestone P - Readability Updates
+
+#### Deliverables
+
+**Multi-page Support:**
+
+- [ ] `single_page_link` directive - find "view full article" link
+- [ ] `next_page_link` directive - paginate through article pages
+- [ ] Concatenate content from multiple pages
+- [ ] Avoid circular pagination
+
+**Advanced Directives:**
+
+- [ ] `http_header(name)` directive - custom headers for fetching
+- [ ] `replace_string(find): replace` directive - text replacement
+- [ ] `find_string` directive - text pattern matching
+
+**Quality:**
+
+- [ ] Better table handling in markdown
+- [ ] Image caption extraction
+- [ ] JSON-LD support
+
+**Performance:**
+
+- [ ] LRU cache for parsed configs
+- [ ] Parallel candidate scoring
+- [ ] Lazy XPath evaluation
+
+**Markdown Conversion:**
+
+- [ ] Custom markdown converter (more control than html2md)
+- [ ] Code block language detection
+
+#### Acceptance
+
+- [ ] Can correctly extract multi-page articles (e.g., long news reports).
+- [ ] Advanced string manipulation allows for cleaner output on tricky sites.
+- [ ] Performance remains stable under high load.
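The "Avoid circular pagination" item in the milestone above could be handled with a visited set plus a page cap. The sketch below is one possible approach under stated assumptions, not the planned implementation: `collect_pages` is a hypothetical name, and the `next_page` closure is a stand-in for fetching a page and evaluating its `next_page_link` rule.

```rust
use std::collections::HashSet;

/// Follow next-page links starting at `start`, returning the URLs in visit order.
/// Stops when there is no next link, when a URL repeats (a cycle), or when
/// `max_pages` URLs have been collected.
///
/// `next_page` is hypothetical: given the current URL, it returns the URL of
/// the following page, if any.
pub fn collect_pages<F>(start: &str, max_pages: usize, mut next_page: F) -> Vec<String>
where
    F: FnMut(&str) -> Option<String>,
{
    let mut visited: HashSet<String> = HashSet::new();
    let mut order: Vec<String> = Vec::new();
    let mut current = Some(start.to_string());

    while let Some(url) = current {
        // `insert` returns false when the URL was already seen.
        if order.len() >= max_pages || !visited.insert(url.clone()) {
            break;
        }
        current = next_page(&url);
        order.push(url);
    }
    order
}
```

With links `a -> b -> c -> a`, the function visits `a`, `b`, `c` and then stops when `a` comes up again. The cap guards against sites that generate an unbounded sequence of distinct URLs, which a visited set alone would not catch.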
+
 ## Open Question/Parked Decisions
 
 - Full offline authoring + later publish