# streaming uploads

**status**: implemented in PR #182
**date**: 2025-11-03

## overview

plyr.fm uses streaming uploads for audio files to maintain constant memory usage regardless of file size. this prevents out-of-memory errors when handling large files in constrained environments (fly.io shared-cpu VMs with 256MB RAM).

## problem (pre-implementation)

the original upload implementation loaded entire audio files into memory, causing OOM risk:

### original flow (memory intensive)
```python
# 1. read entire file into memory
content = file.read()  # 40MB WAV → 40MB in RAM

# 2. hash entire content in memory
file_id = hashlib.sha256(content).hexdigest()[:16]  # another 40MB

# 3. upload entire content
client.put_object(Body=content, ...)  # entire file in RAM
```

### memory profile
- single 40MB upload: ~80-120MB peak memory
- 3 concurrent uploads: ~240-360MB peak
- fly.io shared-cpu VM: 256MB total RAM
- **result**: OOM, worker restarts, service degradation

## solution: streaming approach (implemented)

### goals achieved
1. constant memory usage regardless of file size
2. maintained backward compatibility (same file_id generation)
3. supports both R2 and filesystem backends
4. no changes to upload endpoint API
5. proper test coverage added

### current flow (constant memory)
```python
# 1. compute hash in chunks (8MB at a time)
hasher = hashlib.sha256()
while chunk := file.read(8 * 1024 * 1024):
    hasher.update(chunk)
file_id = hasher.hexdigest()[:16]

# 2. stream upload to R2
file.seek(0)  # reset after hashing
client.upload_fileobj(Fileobj=file, Bucket=bucket, Key=key)
```

### memory profile (improved)
- single 40MB upload: ~10-16MB peak (just the chunk buffer)
- 3 concurrent uploads: ~30-48MB peak
- **result**: stable, no OOM risk

## implementation details

### 1. chunked hash utility

reusable utility for streaming hash calculation:

**location**: `src/backend/utilities/hashing.py`

```python
# actual implementation from src/backend/utilities/hashing.py
import hashlib
from typing import BinaryIO

# 8MB chunks balance memory usage and performance
CHUNK_SIZE = 8 * 1024 * 1024

def hash_file_chunked(file_obj: BinaryIO, algorithm: str = "sha256") -> str:
    """compute hash by reading file in chunks.

    this prevents loading the entire file into memory, enabling constant
    memory usage regardless of file size.

    args:
        file_obj: file-like object to hash
        algorithm: hash algorithm (default: sha256)

    returns:
        hexadecimal digest string

    note:
        file pointer is reset to beginning after hashing so subsequent
        operations (like upload) can read from start
    """
    hasher = hashlib.new(algorithm)

    # ensure we start from beginning
    file_obj.seek(0)

    # read and hash in chunks
    while chunk := file_obj.read(CHUNK_SIZE):
        hasher.update(chunk)

    # reset pointer for next operation
    file_obj.seek(0)

    return hasher.hexdigest()
```

### 2. R2 storage backend

**file**: `src/backend/storage/r2.py`

**implementation**:
- uses `hash_file_chunked()` for constant memory hashing
- uses `aioboto3` async client with `upload_fileobj()` for streaming uploads
- boto3's `upload_fileobj` automatically switches to multipart upload for files above the transfer config threshold (8MB by default)
- supports both audio and image files

```python
# actual implementation (simplified)
async def save(self, file: BinaryIO, filename: str) -> str:
    """save media file to R2 using streaming upload.

    uses chunked hashing and aioboto3's upload_fileobj for constant
    memory usage regardless of file size.
    """
    # compute hash in chunks (constant memory)
    file_id = hash_file_chunked(file)[:16]

    # determine file extension and type
    ext = Path(filename).suffix.lower()

    # try audio format first
    audio_format = AudioFormat.from_extension(ext)
    if audio_format:
        key = f"audio/{file_id}{ext}"
        media_type = audio_format.media_type
        bucket = self.audio_bucket_name
    else:
        # handle image formats...
        pass

    # stream upload to R2 (constant memory, non-blocking)
    # file pointer already reset by hash_file_chunked
    async with self.async_session.client("s3", ...) as client:
        await client.upload_fileobj(
            Fileobj=file,
            Bucket=bucket,
            Key=key,
            ExtraArgs={"ContentType": media_type},
        )

    return file_id
```

### 3. filesystem storage backend

**file**: `src/backend/storage/filesystem.py`

**implementation**:
- uses `hash_file_chunked()` for constant memory hashing
- uses `anyio` for async file I/O instead of blocking operations
- writes file in chunks for constant memory usage
- supports both audio and image files

```python
# actual implementation (simplified)
async def save(self, file: BinaryIO, filename: str) -> str:
    """save media file using streaming write.

    uses chunked hashing and async file I/O for constant
    memory usage regardless of file size.
    """
    # compute hash in chunks (constant memory)
    file_id = hash_file_chunked(file)[:16]

    # determine file extension and type
    ext = Path(filename).suffix.lower()

    # try audio format first
    audio_format = AudioFormat.from_extension(ext)
    if audio_format:
        file_path = self.base_path / "audio" / f"{file_id}{ext}"
    else:
        # handle image formats...
        pass

    # write file using async I/O in chunks (constant memory, non-blocking)
    # file pointer already reset by hash_file_chunked
    async with await anyio.open_file(file_path, "wb") as dest:
        while True:
            chunk = file.read(CHUNK_SIZE)
            if not chunk:
                break
            await dest.write(chunk)

    return file_id
```

### 4. upload endpoint

**file**: `src/backend/api/tracks.py`

**implementation**: no changes required!

FastAPI's `UploadFile` already uses `SpooledTemporaryFile`:
- keeps small files (<1MB) in memory
- automatically spools larger files to disk
- provides the file-like interface that our streaming functions expect
- works seamlessly with both storage backends (see the sketch below)

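for illustration, a rough, hypothetical sketch of the endpoint shape — not the actual handler in `src/backend/api/tracks.py`, and the import path and handler signature are assumptions — showing why the spooled `UploadFile` plugs straight into the chunked hashing and streaming save paths:

```python
# hypothetical sketch (not the real handler): UploadFile wraps a
# SpooledTemporaryFile, and its .file attribute is exactly the file-like
# object that hash_file_chunked() and the storage save() methods read in chunks
from fastapi import FastAPI, File, Form, UploadFile

from backend.utilities.hashing import hash_file_chunked  # import path assumed

app = FastAPI()

@app.post("/tracks/")
async def create_track(title: str = Form(...), file: UploadFile = File(...)):
    # small uploads stay in memory, larger ones spool to disk; either way the
    # object supports read()/seek(), so chunked hashing works unchanged
    file_id = hash_file_chunked(file.file)[:16]
    # in the real endpoint the same file.file is then passed to the
    # storage backend's save() for the streaming upload
    return {"id": file_id, "title": title}
```
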
## testing

### 1. unit tests for hash utility

**file**: `tests/utilities/test_hashing.py`

```python
import hashlib
import io

from backend.utilities.hashing import hash_file_chunked  # import path assumed

def test_hash_file_chunked_correctness():
    """verify chunked hashing matches standard approach."""
    # create test file
    test_data = b"test data" * 1000000  # ~9MB

    # standard hash
    expected = hashlib.sha256(test_data).hexdigest()

    # chunked hash
    file_obj = io.BytesIO(test_data)
    actual = hash_file_chunked(file_obj)

    assert actual == expected


def test_hash_file_chunked_resets_pointer():
    """verify file pointer is reset after hashing."""
    file_obj = io.BytesIO(b"test data")
    hash_file_chunked(file_obj)
    assert file_obj.tell() == 0  # pointer at start
```

### 2. integration tests for uploads

**file**: `tests/api/test_tracks.py`

```python
async def test_upload_large_file_r2():
    """verify large file upload doesn't OOM."""
    # create 50MB test file
    large_file = create_test_audio_file(size_mb=50)

    # upload should succeed with constant memory
    response = await client.post(
        "/tracks/",
        files={"file": large_file},
        data={"title": "large track test"},
    )
    assert response.status_code == 200


async def test_concurrent_uploads():
    """verify multiple concurrent uploads don't OOM."""
    files = [create_test_audio_file(size_mb=30) for _ in range(3)]

    # all should succeed
    results = await asyncio.gather(
        *[upload_file(f) for f in files]
    )
    assert all(r.status_code == 200 for r in results)
```

### 3. memory profiling

manual testing with memory monitoring:

```bash
# monitor memory during upload
watch -n 1 'ps aux | grep uvicorn'

# upload large file
curl -F "file=@test-50mb.wav" -F "title=test" http://localhost:8000/tracks/
```

expected results:
- memory should stay under 50MB regardless of file size
- no memory spikes or gradual leaks
- consistent performance across multiple uploads

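this spot check can also be scripted; a minimal sketch, assuming `psutil` is installed and you know the uvicorn worker's PID (the helper below is illustrative, not part of the repo):

```python
# illustrative helper, not part of the repo: sample a worker's RSS while an
# upload is in flight and report the peak
import time

import psutil

def peak_rss_mb(pid: int, duration_s: float = 30.0, interval_s: float = 0.5) -> float:
    """poll the process RSS for duration_s seconds and return the peak in MB."""
    proc = psutil.Process(pid)
    peak = 0
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        peak = max(peak, proc.memory_info().rss)
        time.sleep(interval_s)
    return peak / (1024 * 1024)

# run against the uvicorn worker PID while the curl upload above is running,
# e.g. print(peak_rss_mb(worker_pid))
```
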
## deployment

implemented in PR #182 and deployed to production.

### validation results
- memory usage stays constant (~10-16MB per upload)
- file_id generation remains consistent (backward compatible)
- supports concurrent uploads without OOM
- both R2 and filesystem backends working correctly

## backward compatibility

successfully maintained during implementation:

### file_id generation
- hash algorithm: SHA256 (unchanged)
- truncation: 16 chars (unchanged)
- result: existing file_ids remain valid

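a minimal sanity check of this equivalence (the import path is assumed from the file location above; this mirrors the correctness unit test but asserts the truncated file_id directly):

```python
# illustrative check that the streamed hash yields the same 16-char file_id
# as the old in-memory approach; import path assumed from
# src/backend/utilities/hashing.py
import hashlib
import io

from backend.utilities.hashing import hash_file_chunked

data = b"\x01\x02\x03" * (5 * 1024 * 1024)  # ~15MB of fake audio bytes
old_file_id = hashlib.sha256(data).hexdigest()[:16]
new_file_id = hash_file_chunked(io.BytesIO(data))[:16]
assert old_file_id == new_file_id
```
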
### API contract
- endpoint: `POST /tracks/` (unchanged)
- parameters: title, file, album, features, image (unchanged)
- response: same structure (unchanged)
- result: no breaking changes for clients

## edge cases

### very large files (>100MB)
- boto3 automatically handles multipart upload
- filesystem streaming works for any size
- only limited by storage capacity, not RAM

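if the multipart behaviour ever needs tuning, boto3 exposes it through a transfer config; a hedged sketch (the threshold and part size below are illustrative, not the repo's configuration):

```python
# optional tuning sketch: control when upload_fileobj switches to multipart
# and how large each part is; the values here are illustrative
from boto3.s3.transfer import TransferConfig

transfer_config = TransferConfig(
    multipart_threshold=8 * 1024 * 1024,  # switch to multipart above 8MB
    multipart_chunksize=8 * 1024 * 1024,  # 8MB parts, matching CHUNK_SIZE
)

# passed as Config=transfer_config to the upload_fileobj call in the
# R2 backend's save() shown earlier
```
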
### network failures during upload
- boto3 multipart uploads can retry failed parts
- filesystem writes proceed chunk by chunk; an interrupted upload leaves at most a partial file on disk
- FastAPI handles connection errors

### concurrent uploads
- each upload uses an independent chunk buffer
- total memory = num_concurrent * CHUNK_SIZE
- 5 concurrent @ 8MB chunks = 40MB total (well within the 256MB limit)

## observability

metrics tracked in Logfire:

1. upload duration - remains constant regardless of file size
2. memory usage - stays under 50MB per upload
3. upload success rate - consistently >99%
4. concurrent upload handling - no degradation

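as an illustration of the kind of instrumentation involved — a hedged sketch, not the exact spans plyr.fm emits in production:

```python
# illustrative logfire instrumentation: wrap the storage save in a span so
# duration and file size are recorded per upload; names here are hypothetical
import logfire

async def save_with_tracing(storage, file, filename: str, size_bytes: int) -> str:
    with logfire.span("upload {filename}", filename=filename, size_bytes=size_bytes):
        return await storage.save(file, filename)
```
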
## future optimizations

### potential improvements (not in scope for this PR)

1. **progressive hashing during upload**
   - hash chunks as they arrive instead of in a separate pass
   - saves one full iteration over the file (see the sketch after this list)

2. **client-side chunked uploads**
   - browser sends file in chunks
   - server assembles and validates
   - enables upload progress tracking

3. **parallel multipart upload**
   - split large files into parts
   - upload parts in parallel
   - faster for very large files (>100MB)

4. **deduplication before full upload**
   - send hash first to check if file exists
   - skip upload if duplicate found
   - saves bandwidth and storage

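to make idea 1 concrete, a rough sketch of a wrapper that hashes bytes as they are read (a hypothetical helper, not implemented; a real version would also have to cope with the transfer manager seeking or re-reading on retries, and with the fact that the current key layout needs file_id before the upload starts):

```python
# hypothetical sketch for future idea 1: hash while uploading so the separate
# hashing pass disappears; not part of the current implementation
import hashlib
from typing import BinaryIO

class HashingReader:
    """file-like wrapper that updates a sha256 digest on every read()."""

    def __init__(self, inner: BinaryIO):
        self.inner = inner
        self.hasher = hashlib.sha256()

    def read(self, size: int = -1) -> bytes:
        chunk = self.inner.read(size)
        self.hasher.update(chunk)
        return chunk

    def file_id(self) -> str:
        return self.hasher.hexdigest()[:16]

# usage idea: wrap the upload source, upload to a temporary key, then read
# reader.file_id() afterwards and move the object to its final key
```
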
## references

- implementation: `src/backend/storage/r2.py`, `src/backend/storage/filesystem.py`
- utilities: `src/backend/utilities/hashing.py`
- tests: `tests/utilities/test_hashing.py`, `tests/api/test_tracks.py`
- PR: #182
- boto3 upload_fileobj: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3/client/upload_fileobj.html
- FastAPI UploadFile: https://fastapi.tiangolo.com/tutorial/request-files/