A powerful Go library for extracting and manipulating text in Microsoft Word (DOCX) files, with zero external dependencies. Transform your documents using Go templates or custom replacers while maintaining document structure.
- Text extraction from DOCX files
- Text modification using Go templates or custom replacers
- Support for:
- Main document body
- Headers and footers
- Tables and cells
- Bullet points and numbered lists
- Zero external dependencies
- Efficient (single pass processing)
- OOXML standard compliant
- MIT License
go get github.com/xavier268/mydocx
import "github.com/xavier268/mydocx"
func main() {
// Extract text from all document parts (main body, headers, footers)
content, err := mydocx.ExtractText("document.docx")
if err != nil {
log.Fatal(err)
}
// content is a map[string][]string where:
// - key is the container name (e.g., "word/document.xml", "word/footer1.xml")
// - value is a slice of strings, one for each paragraph
for container, paragraphs := range content {
fmt.Printf("Content from %s:\n", container)
for _, para := range paragraphs {
fmt.Println(para)
}
}
}
import "github.com/xavier268/mydocx"
func main() {
// Define template data
data := struct {
Name string
Company string
Date string
}{
Name: "John Doe",
Company: "ACME Corp",
Date: time.Now().Format("2006-01-02"),
}
// Create a template-based replacer
replacer := mydocx.NewTplReplacer(data)
// Modify the document
err := mydocx.ModifyText("template.docx", replacer, "output.docx")
if err != nil {
log.Fatal(err)
}
}
// Define your custom replacer
func myReplacer(container, text string) []string {
// container: "word/document.xml", "word/footer1.xml", etc.
// text: original paragraph text
// Return:
// - empty slice to remove the paragraph
// - slice with multiple strings to create multiple paragraphs
// - slice with one string to replace paragraph content
switch {
case strings.Contains(text, "DELETE"):
return []string{} // Remove paragraph
case strings.Contains(text, "DUPLICATE"):
return []string{text, text} // Duplicate paragraph
default:
return []string{strings.ToUpper(text)} // Convert to uppercase
}
}
// Use your replacer
err := mydocx.ModifyText("input.docx", myReplacer, "output.docx")
All go template functions are available. In addition, the following built-in functions are always available :
{{nl}}
- Inserts a new paragraph{{version}}
- Returns version information{{copyright}}
- Returns copyright text{{date}}
- Returns the current date, as 2006-02-10{{join}}
- Expects an array of strings and a delimiter string, returns a single concatenated string with the delimiter (see go functionstrings.Join
){{keepEmpty}}
- From this point, will never remove a paragraph that becomes empty after modification.{{removeEmpty}}
- From this point, non empty paragraphs tat become empty afterReplacer
is applied are removed. This is the default.
// Register a custom function
mydocx.RegisterTplFunction("upper", strings.ToUpper)
// Use in template
// {{upper .Name}}
-
Each paragraph is an independent template
-
Templates cannot span across paragraphs
-
Valid example:
Hello {{.Name}}! Your order #{{.OrderID}} has been processed.
-
Invalid example:
Hello {{if .Premium}} Premium customer {{.Name}}! {{else}} Valued customer {{.Name}}! {{end}}
The Replacer function controls paragraph creation and removal through its return value:
type Replacer func(container string, text string) []string
-
Remove Paragraph
// Return empty slice to remove the paragraph (unless {{keepEmpty}} was called earlier) func myReplacer(container, text string) []string { if strings.Contains(text, "DELETE") { return []string{} // Paragraph will be removed } return []string{text} }
-
Create Multiple Paragraphs
// Return multiple strings to create multiple paragraphs func myReplacer(container, text string) []string { if strings.Contains(text, "DUPLICATE") { // Creates three identical paragraphs with the same formatting return []string{text, text, text} } return []string{text} }
Each new paragraph inherits the formatting of the original paragraph.
When using the template-based replacer (NewTplReplacer
), paragraph management is controlled by newlines in the template output:
-
Remove Paragraph
{{if .ShouldDelete}}{{else}}Original content{{end}}
If
.ShouldDelete
is true, the empty output will remove the paragraph (unless {{keepEmpty}} was called before). -
Create Multiple Paragraphs
{{.Title}} Items: {{range .Items}} - {{.}}{{nl}}{{end}} Contact: {{.Contact}}
The {{nl}}
function inserts a newline, and the template output is split on newlines to create new paragraphs. Each resulting paragraph inherits the formatting of the original paragraph. Notice how {{range}} ... {{end}} fits within a single source paragraph but will create multiple paragraphs !
-
Initially empty paragraphs
- Initially empty paragraphs are always left unchanged
-
Empty Result
- If the Replacer returns an empty slice β paragraph is removed
- If a template produces empty output β paragraph is removed
Note : this is the default, it can be changed with {{keepEmpty}}
-
Multiple Paragraphs
- Custom Replacer: Each string in the returned slice becomes a new paragraph
- Template: Output is split on newlines (
\n
), each line becomes a new paragraph - All new paragraphs inherit formatting from the original paragraph
-
Examples with Templates
// Template in document Dear {{.Name}}, {{if .Premium}}Thank you for being a premium member!{{end}} Your balance is ${{.Balance}}.
This template could produce:
Dear John Doe, Thank you for being a premium member! Your balance is $100.
Or (if not premium):
Dear John Doe, Your balance is $100.
Microsoft Word splits text into "runs" - segments sharing the same formatting. This creates challenges for text replacement:
Example: "Hello {{.Name}}!" might be split into:
Run 1: "Hello "
Run 2: "{{.Name"
Run 3: "}}!"
Our solution:
- Consolidates all runs in a paragraph into the first run
- Processes the complete text with a
Replacer
- Creates new paragraphs for each line in the result
- Tables and lists are fully supported
- Each cell must contain at least one paragraph (even if empty)
- Word will show an error when opening files with empty cells but can recover
-
Formatting
- Paragraph formatting is unified based on the first run
- In-paragraph formatting variations are lost
-
Template Boundaries
- Templates must be contained within a single paragraph
- Cross-paragraph templates are not supported
-
Table Cells
- Avoid creating completely empty cells
- Always include at least one (empty) paragraph in cells
Contributions are welcome! Please feel free to submit a Pull Request.
MIT License - See LICENSE file for details