Pure Go reader for the ZIM file format

go-zim

Package zim implements reading support for the ZIM File Format.

Documentation at https://pkg.go.dev/github.com/tim-st/go-zim.

Download and install package zim and its tools with

go get -u github.com/tim-st/go-zim/...

or download a prebuilt binary from the releases section.

You can download a ZIM file for testing here.

Commands

The command above installs the tools of this package to $GOPATH/bin/.

zimserver

Tool for browsing a ZIM file in your web browser via an HTTP interface.

  • Start a ZIM server on TCP port 8080: zimserver -filename="filename.zim" -port=8080
  • The ZIM file can then be browsed at http://localhost:8080/
  • For a basic prefix search, pass the search term after the last / in the URL
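
The prefix search URL described above can be built programmatically. The following is a minimal sketch using only the standard library; `prefixSearchURL` is a hypothetical helper and not part of go-zim, and it assumes the server accepts a percent-encoded search term as the last path segment.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// prefixSearchURL appends a percent-encoded search term to the base URL
// of a running zimserver instance.
// Hypothetical helper for illustration; it is not part of go-zim.
func prefixSearchURL(base, term string) string {
	// Avoid a double slash when the base URL already ends with "/".
	return strings.TrimRight(base, "/") + "/" + url.PathEscape(term)
}

func main() {
	// Assumes zimserver was started with -port=8080 as shown above.
	fmt.Println(prefixSearchURL("http://localhost:8080/", "Albert Einstein"))
}
```

The resulting URL can be opened in a browser or fetched with `http.Get` from `net/http`.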

zimindex

Tool for creating a full text index of a given ZIM file.

zimsearch

Tool that lists search results for a given ZIM file and text query. If no index file created by zimindex is found, a built-in prefix search is used. Otherwise the index file is used to retrieve search results sorted by score, and the result set can be computed by either a union or an intersection operation.
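
To illustrate the difference between the two operations, here is a minimal sketch of combining per-term matches by union or intersection and sorting them by score. It is not the implementation used by zimsearch: the `Result` type, the `combine` function, the score values, and the choice to sum scores are all assumptions made for this example.

```go
package main

import (
	"fmt"
	"sort"
)

// Result pairs an entry title with its accumulated score.
// Hypothetical type for illustration; not part of go-zim.
type Result struct {
	Title string
	Score float64
}

// combine merges one score map per query term.
// With intersect == false every entry matching any term is kept (union);
// with intersect == true only entries matching all terms are kept.
// Scores of the same entry are summed (an assumption of this sketch).
func combine(postings []map[string]float64, intersect bool) []Result {
	sum := make(map[string]float64)
	hits := make(map[string]int)
	for _, p := range postings {
		for title, score := range p {
			sum[title] += score
			hits[title]++
		}
	}
	results := make([]Result, 0, len(sum))
	for title, score := range sum {
		if intersect && hits[title] < len(postings) {
			continue // entry is missing for at least one term
		}
		results = append(results, Result{Title: title, Score: score})
	}
	// Highest score first; ties are broken by title for a stable order.
	sort.Slice(results, func(i, j int) bool {
		if results[i].Score != results[j].Score {
			return results[i].Score > results[j].Score
		}
		return results[i].Title < results[j].Title
	})
	return results
}

func main() {
	// Made-up scores for the two query terms of a two-word query.
	postings := []map[string]float64{
		{"Berlin": 2.0, "Germany": 1.0},
		{"Berlin": 1.5, "Wall": 3.0},
	}
	fmt.Println("union:       ", combine(postings, false))
	fmt.Println("intersection:", combine(postings, true))
}
```

With this sample data the union contains all three entries, while the intersection contains only the entry that matches both terms.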

zimtext

Tool to extract clean texts from a Wikipedia ZIM file. Each clean HTML paragraph is written on a single line in a text file.

  • Extracting the first 1000 clean texts from a ZIM file: zimtext -zim="filename.zim" -txt="lines.txt" -limit=1000
  • Extracting all clean texts from a ZIM file: zimtext -zim="filename.zim" -txt="lines.txt"
  • Extracting the first 1000 clean sentences (lines that are likely a sentence) from a ZIM file: zimtext -zim="filename.zim" -txt="lines.txt" -limit=1000 -sentences
  • Extracting all clean sentences (lines that are likely a sentence) from a ZIM file: zimtext -zim="filename.zim" -txt="lines.txt" -sentences
  • For better support of your language or use case, define your own regular expression so that only the texts you accept are extracted. The RE syntax is defined here and can be tested here (select Flavor=Golang).

Example:

zimtext -zim="wikipedia_de_top_nopic_2019-08.zim" -txt="de.txt" -limit=10000 -regexFilter="^(?:\p{Lu}|\p{N})[ \pL\pN\,\;\:\-]{10,}[\.\)\]\?\"…«»›‹‘“’”]{1}$"
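
Before running a long extraction, it can help to try a filter expression on a few sample lines. The sketch below compiles the expression from the example above with Go's `regexp` package and reports which lines it accepts. The `accept` function and the sample lines are made up for illustration, and the sketch makes no claim about how zimtext applies the filter internally.

```go
package main

import (
	"fmt"
	"regexp"
)

// lineFilter is the expression from the example above: an uppercase letter
// or a digit, then at least 10 letters, digits, spaces, or , ; : - and
// exactly one closing punctuation character at the end of the line.
var lineFilter = regexp.MustCompile(`^(?:\p{Lu}|\p{N})[ \pL\pN\,\;\:\-]{10,}[\.\)\]\?\"…«»›‹‘“’”]{1}$`)

// accept reports whether a line matches the filter.
// Hypothetical helper for illustration; not part of go-zim.
func accept(line string) bool {
	return lineFilter.MatchString(line)
}

func main() {
	samples := []string{
		"Berlin ist die Hauptstadt von Deutschland.", // accepted
		"berlin ist die Hauptstadt von Deutschland.", // rejected: lowercase start
		"Kurz.",                           // rejected: too short
		"Siehe auch die Liste (Auswahl).", // rejected: "(" is not allowed mid-line
	}
	for _, line := range samples {
		fmt.Printf("%-5t %s\n", accept(line), line)
	}
}
```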