We spent months watching users struggle with the same problem. They had data locked up in PDFs, Word docs, spreadsheets, PowerPoint decks, XML feeds, and a dozen other formats. They wanted to use that data with AI. Every time, the workflow was the same: write a Python script, debug it, run it, realize the output format was wrong, rewrite it. For every single file type.

So we built the File Processor module. You upload files. You pick an output format. You click a button. That is the entire workflow.

At a glance: 50+ input formats · 5 output formats · 10 extractors · 41+ extensions.

What Goes In

The short answer: almost everything. The module accepts 41+ file extensions across every common document type you run into in actual work. Not toy examples. Real files from real organizations.

Category | Formats | Library
Documents | PDF, DOCX, DOC, RTF, ODT | PyMuPDF, python-docx
Spreadsheets | XLSX, XLS, CSV, TSV, ODS | openpyxl, pandas
Presentations | PPTX, PPT, ODP | python-pptx
Structured Data | JSON, XML, YAML, TOML | Built-in parsers
Text / Code | TXT, MD, HTML, CSS, JS, PY, etc. | Plain text extraction
Images (OCR) | PNG, JPG, TIFF, BMP, WebP | Tesseract OCR
Audio | MP3, WAV, M4A, OGG, FLAC | Transcription pipeline
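The matrix above boils down to routing by file extension. This is a minimal sketch of that idea, not the module's actual code; the category names and mapping are illustrative:

```python
from pathlib import Path

# Hypothetical routing table distilled from the format matrix above.
CATEGORIES = {
    "document": {".pdf", ".docx", ".doc", ".rtf", ".odt"},
    "spreadsheet": {".xlsx", ".xls", ".csv", ".tsv", ".ods"},
    "presentation": {".pptx", ".ppt", ".odp"},
    "structured": {".json", ".xml", ".yaml", ".toml"},
    "image_ocr": {".png", ".jpg", ".tiff", ".bmp", ".webp"},
    "audio": {".mp3", ".wav", ".m4a", ".ogg", ".flac"},
}

def categorize(filename: str) -> str:
    """Return the processing category for a file, defaulting to plain text."""
    ext = Path(filename).suffix.lower()
    for category, extensions in CATEGORIES.items():
        if ext in extensions:
            return category
    return "text"
```

Anything that does not match a known extension falls through to plain text extraction, which is why the "Text / Code" row ends with "etc."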

The image support is worth calling out specifically. If you have scanned contracts, old invoices, or any document that got saved as an image, OCR handles it. We run Tesseract under the hood. It is not perfect on handwriting, but for typed text in scanned documents, it works well.

Audio files go through a transcription pipeline before processing. Upload a recorded meeting, get back structured text you can actually work with.

What Comes Out

Five output formats cover the vast majority of use cases we see:

JSONL

One JSON object per line. This is the standard format for fine-tuning with OpenAI, Anthropic, and most other LLM providers. If your goal is training data, pick this.
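The JSONL shape is easy to picture with the standard library. The records below are hypothetical stand-ins for extracted chunks; the point is one compact JSON object per line, no enclosing array:

```python
import json

# Hypothetical records, e.g. chunks extracted from a processed document.
records = [
    {"source": "handbook.pdf", "page": 1, "text": "Welcome to the company."},
    {"source": "handbook.pdf", "page": 2, "text": "Vacation policy overview."},
]

# JSONL: each record serialized on its own line.
jsonl = "\n".join(json.dumps(r, ensure_ascii=False) for r in records)
print(jsonl)
```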

JSON

Standard JSON array. Good for downstream applications, API ingestion, or any system that expects structured data.

Markdown

Clean Markdown with headers, lists, and tables preserved. Good for documentation, knowledge bases, or human review.

CSV and TXT

CSV for tabular output, plain text for everything else. Sometimes simple is what you need.
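For tabular output, the same kind of records flatten naturally to CSV with the standard library. A minimal sketch with made-up rows:

```python
import csv
import io

# Hypothetical rows extracted from a processed spreadsheet.
rows = [
    {"source": "budget.xlsx", "sheet": "Q1", "cell_count": 120},
    {"source": "budget.xlsx", "sheet": "Q2", "cell_count": 98},
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["source", "sheet", "cell_count"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```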

10 Extractors Under the Hood

The ChunkingService runs 10 specialized extractors. Each one knows how to handle a specific file structure. A PDF with tables gets a different extractor than a PDF with flowing text. A spreadsheet with formulas gets parsed differently than a simple CSV.

This matters because naive text extraction produces garbage. If you have ever tried to pull text from a PDF and gotten a jumbled mess of headers mixed with footers mixed with body text, you know exactly what we are talking about.

Our extractors understand document structure. They know what a heading is. They know what a table cell is. They know where one section ends and another begins. The output reflects that structure, which means the AI model trained on that output actually understands the relationships in your data.
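To make "understands document structure" concrete, here is a toy illustration, not the module's actual code: an extractor that keeps headings distinct from body text. Real extractors use format-specific cues (font size in PDFs, style names in DOCX); the Title Case heuristic here is just a stand-in:

```python
def extract_structured(raw: str) -> list:
    """Toy structure-aware pass: tag each line as heading or body.

    A line in Title Case with no final period stands in for a heading;
    real extractors use richer, format-specific signals.
    """
    blocks = []
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue
        is_heading = line.istitle() and not line.endswith(".")
        blocks.append({"type": "heading" if is_heading else "body", "text": line})
    return blocks
```

The payoff is that the output carries labels like `heading` and `body` instead of one undifferentiated blob, so downstream consumers see the relationships.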

Under the hood: PDF extraction uses PyMuPDF (not pdfminer, not PyPDF2). We tested all the major libraries and PyMuPDF gave us the best balance of speed, accuracy, and table handling. Word documents go through python-docx, and spreadsheets use openpyxl.

The Pipeline: Upload to Knowledge Base

This part surprised even us with how useful it turned out to be. When you process files, the output does not just sit there as a download. It automatically saves to your My Files database and gets indexed by NeuroGen Cortex.

Upload Files → Process & Extract → Save to My Files → Smart Index → Knowledge Base

That last step is the real payoff. You can create a Knowledge Base directly from your processed files using what we call the "fast path." Since the content is already indexed by NeuroGen Cortex, creating a KB skips the re-reading step entirely and goes straight to embedding generation. On a batch of 50 documents, that saves minutes of processing time.
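The fast path amounts to a cache check before the expensive step. A hedged sketch of the idea; every name here (`reread_and_chunk`, `embed`, the `index` dict standing in for the NeuroGen Cortex index) is hypothetical:

```python
def reread_and_chunk(doc_id):
    # Stand-in for the slow path: re-open and re-chunk the source file.
    return [f"{doc_id}:chunk"]

def embed(chunk):
    # Stand-in for embedding generation (the step both paths share).
    return {"chunk": chunk, "vector": [0.0]}

def create_knowledge_base(doc_ids, index):
    """Fast path: reuse chunks already indexed; fall back to re-reading."""
    chunks = []
    for doc_id in doc_ids:
        cached = index.get(doc_id)
        if cached is not None:
            chunks.extend(cached)                    # indexed: skip re-reading
        else:
            chunks.extend(reread_and_chunk(doc_id))  # not indexed: slow path
    return [embed(c) for c in chunks]
```

On a freshly processed batch, every document hits the cached branch, which is why KB creation can jump straight to embedding generation.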

Once you have a Knowledge Base, any assistant or agent you build on NeuroGen can use that KB to answer questions grounded in your actual data. Upload a company handbook, process it, create a KB, and now your chatbot knows your internal policies.

Chat With Your Processed Results

We added something we call session-based chat. After processing, you get a chat interface that lets you ask questions about the processed results. This runs on NeuroGen Cortex's interactive approach, which is fundamentally different from dumping everything into a prompt.

The processed content sits in a Python variable (stored in Redis, 24-hour TTL), and the LLM writes code to explore it. Think of it as giving the AI a Python notebook with your data already loaded. It can filter, search, summarize, compare, and extract specific information from your processed files.
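The session store pattern (content keyed by session, expiring after 24 hours) looks roughly like this; a plain dict stands in for Redis, and the function names are illustrative, not the module's API:

```python
import time

TTL_SECONDS = 24 * 60 * 60  # 24-hour session lifetime

_store = {}  # stand-in for Redis: session_id -> (expiry, content)

def save_session(session_id, content, now=None):
    """Store processed content with an expiry, like SET with EX in Redis."""
    now = time.time() if now is None else now
    _store[session_id] = (now + TTL_SECONDS, content)

def load_session(session_id, now=None):
    """Return the content, or None if the key is unknown or past its TTL."""
    now = time.time() if now is None else now
    entry = _store.get(session_id)
    if entry is None or entry[0] < now:
        return None
    return entry[1]
```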

The API endpoints are straightforward: POST /api/fp-chat for chat messages and GET /api/fp-kb-status/<session_id> to check Knowledge Base creation progress. Every LLM call during the chat is credit-tracked per session.
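A chat request could be assembled like this with the standard library. The base URL and payload field names are assumptions for illustration; only the `/api/fp-chat` path comes from the text, and the request is built but deliberately not sent:

```python
import json
import urllib.request

# Hypothetical payload; the actual field names are placeholders.
payload = {"session_id": "abc123", "message": "Summarize the uploaded PDFs"}

request = urllib.request.Request(
    "https://example.com/api/fp-chat",  # base URL is a placeholder
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(request) would send it; omitted here.
```

Polling `GET /api/fp-kb-status/<session_id>` for KB progress follows the same shape with `method="GET"` and no body.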

Actual Use Cases We See

We are not going to pretend every user needs this for ML training. What people actually do with the File Processor:

Research Teams

Upload 30 PDFs from a literature review. Process to JSONL. Create a Knowledge Base. Now the team has a chatbot that can answer questions about the entire body of research with citations back to specific papers.

Operations Teams

Take a folder of SOPs, training manuals, and policy documents in mixed formats (some PDF, some Word, some old HTML). Process the whole batch. Output as Markdown for a documentation site, or create a KB for an internal help chatbot.

Data Teams

Pull data from Excel files, CSVs, and XML feeds into a consistent JSON format. Use the chat feature to do quick analysis before committing to a full data pipeline.

Legal Teams

Process contract PDFs (including scanned ones via OCR). Extract structured data. Feed into the Legal Discovery module for deeper analysis, or create a KB for a contract review assistant.

The common thread: people have data in messy formats, and they need it in clean formats. The File Processor is the bridge.

What It Costs

File processing itself uses credits based on file size and complexity. OCR and audio transcription cost more than plain text extraction because they require more compute. The session chat feature tracks credits per LLM call, and Knowledge Base creation costs credits for embedding generation.

All of this shows up in your credit usage dashboard with full breakdowns by module, so you can see exactly what each processing job cost.

Try the File Processor

Upload a file and see the output in under a minute. The demo tier includes 100 credits to get started.

Start Free Trial