Comments (3)
Comment by Brion:
Can this be done with the standard POSIX file APIs? Or do we need to use Win32 APIs for file access? :P
(Use of legacy 8-bit encodings in POSIX APIs is a problem for PHP, and thus MediaWiki, when running on Windows servers as well.)
Sounds like you gotta use the Win32 wide-char APIs to get Unicode filename support... at least that's what glib does on Windows for this reason -- https://developer.gnome.org/glib/2.26/glib-File-Utilities.html#glib-File-Utilities.description
from libzim.
lvandeve wrote:
Hello,
I have done a little bit of research on this (I must admit it was on a Linux computer however).
I hope this may help at least with fixing this issue.
Most File, FileImpl, zim::ifstream and zim::streambuf works with string. There is one intermediate string -> const char* -> string conversion in FileImpl, which the attached patch removes (to use string for everything). This to hopefully support the NUL-characters that may appear in the octets of UTF-16 strings put in a 8-bit char string.
In addition, the patch has a comment at the locations in fstream.cpp where probably different Windows API needs to be called.
I think that this way, file.h does not need to be changed to have wchar_t or wstring constructors.
What follows below the line is the patch:
diff --git a/zimlib/include/zim/file.h b/zimlib/include/zim/file.h
index a6ac75b..8aa5f0d 100644
--- a/zimlib/include/zim/file.h
+++ b/zimlib/include/zim/file.h
@@ -39,7 +39,7 @@ namespace zim
File()
{ }
explicit File(const std::string& fname)
-
: impl(new FileImpl(fname.c_str()))
-
: impl(new FileImpl(fname)) { } const std::string& getFilename() const { return impl->getFilename(); }
diff --git a/zimlib/include/zim/fileimpl.h b/zimlib/include/zim/fileimpl.h
index 1cf584d..ada65ee 100644
--- a/zimlib/include/zim/fileimpl.h
+++ b/zimlib/include/zim/fileimpl.h
@@ -53,7 +53,7 @@ namespace zim
offset_type getOffset(offset_type ptrOffset, size_type idx);
public:
-
explicit FileImpl(const char* fname);
-
explicit FileImpl(const std::string& fname); time_t getMTime() const { return zimFile.getMTime(); }
diff --git a/zimlib/src/fileimpl.cpp b/zimlib/src/fileimpl.cpp
index 8c072eb..d7a7f4a 100644
--- a/zimlib/src/fileimpl.cpp
+++ b/zimlib/src/fileimpl.cpp
@@ -38,7 +38,7 @@ namespace zim
//////////////////////////////////////////////////////////////////////
// FileImpl
//
- FileImpl::FileImpl(const char* fname)
- FileImpl::FileImpl(const std::string& fname)
: zimFile(fname),
direntCache(envValue("ZIM_DIRENTCACHE", DIRENT_CACHE_SIZE)),
clusterCache(envValue("ZIM_CLUSTERCACHE", CLUSTER_CACHE_SIZE))
diff --git a/zimlib/src/fstream.cpp b/zimlib/src/fstream.cpp
index 5ce72f5..e4accf8 100644
--- a/zimlib/src/fstream.cpp
+++ b/zimlib/src/fstream.cpp
@@ -59,6 +59,24 @@ namespace zim
//
streambuf::OpenfileInfo::OpenfileInfo(const std::string& fname_)
: fname(fname_),
+// I think regular std::string with 8-bit characters
+// is OK up to this point, that is, in the ctor and fields of File,
+// FileImpl, ifstream, streambuf etc... (that is, no wchar_t or wstring
+// necessary there and no API change needed), the string just may have
+// NULL-characters if bytes of UTF-16 are present, but std::string
+// supports that (const char* does not).
+// So c_str() or any C/C++ API that uses C strings should not be used at
+// this point or anywhere before.
+// So here a Windows-specific API that supports UTF-16 is needed for
+// WIN32, and the fname string needs to converted here to what it
+// needs if necessary. That is still TODO, hence the comment here.
+// Apparently the fname received here on Windows is UTF-16 encoded with
+// two chars per wchar_t, if used by Kiwix (the other possibility would
+// have been that it was UTF-8 encoded and needed to be converted to
+// UTF-16). Maybe a comment in the File constructor could also be nice
+// to ensure all users will be consistent with this?
+// This concludes my investigation, I hope it can help fixing the
+// issue :)
#ifdef HAVE_OPEN64
fd(::open64(fname.c_str(), O_RDONLY | O_LARGEFILE | O_BINARY))
#else
@@ -293,6 +311,8 @@ time_t streambuf::getMTime() const
if (mtime || files.empty())
return mtime;
+// See comment under streambuf::OpenfileInfo::OpenfileInfo about UTF-16:
+// the same applies here as well to the c_str() call and stat.
const char* fname = files.front()->fname.c_str();
#ifdef HAVE_STAT64
from libzim.
@mgautierfr You will probably need to have a look for kiwix-desktop 2.0
from libzim.
Related Issues (20)
- Libzim should support alias creation. HOT 5
- Support storing and retrieving fuzzy rules HOT 1
- Extend API to support fuzzy rules matching. HOT 1
- [REGRESSION] Fulltext search crashes on javascript libzim since version 8.2.1 HOT 11
- Unable to compile libzim in fresh Ubuntu 22.04
- Failling iOS cross-compilation on macOS 13 (CI) HOT 4
- Release 9.1.0
- Potential minor difference between specification and implementation of counter HOT 3
- Been able to load split ZIM file (chunks) by filedescriptor HOT 11
- Can't get libzim version >= 8.0.0 HOT 1
- Mess with `zim::Archive::getEntryByPath(const std::string& path)` HOT 12
- Bump minor version of zimfile format
- Looking for direction on building for Windows HOT 3
- Add option for disabling tests HOT 8
- Proposal: Harvest the power of xapian to provide advanced search and filter capabilities HOT 3
- Allow extern `getFd()` to get file descriptors from Java for Kiwix Android HOT 5
- Document ZIM major and minor versions HOT 6
- Release 9.2.0
- Impossible to drop entries from the kiwix::Downloader's cache HOT 1
- Last version of libzim seems to no be present in debian repo HOT 1
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from libzim.