I am playing around with xml string and with attribute Name which contains some define

problem with encoded additional HTML characters about fox HOT 5 CLOSED

andreww commented on May 30, 2024

problem with encoded additional HTML characters

from fox.

Comments (5)

andreww commented on May 30, 2024 1

Yes, I think so - I certainly don't know of an XML parser written in Fortran that supports character references to Unicode characters.

from fox.

andreww commented on May 30, 2024

Could you post the Fortran code you are using along with the (X)HTML document (here or elsewhere)?

from fox.

zmiimz commented on May 30, 2024

program    xml_mini
   use FoX_dom
   use FoX_sax
   implicit none
   integer :: i
   type(Node), pointer :: doc => null()
   type(Node), pointer :: p1 => null()
   type(Node), pointer :: p2 => null()
   type(NodeList), pointer :: pointList => null()
   character(len=100) :: name


   doc => parseFile("file.xml")
   if(.not. associated(doc)) stop "error doc"


   p1 => item(getElementsByTagName(doc, "Students"), 0)
   if(.not. associated(p1)) stop "error p1"
   write(*,*) getNodeName(p1)


   pointList => getElementsByTagname(p1, "Student")
   write(*,*) getLength(pointList), "Student elements"

   do i = 0, getLength(pointList) - 1
      p2 => item(pointList, i)
      call extractDataAttribute(p2, "Name", name)
      write(*,*) "number ", i," name = ", name
   enddo


   call destroy(doc)

end program xml_mini

file.xml

<Students>
  <Student Name="April" Gender="F" DateOfBirth="1989-01-02" />
  <Student Name="Bob" Gender="M"  DateOfBirth="1990-03-04" />
  <Student Name="Chad" Gender="M"  DateOfBirth="1991-05-06" />
  <Student Name="Dave" Gender="M"  DateOfBirth="1992-07-08">
    <Pet Type="dog" Name="Rover" />
  </Student>
  <Student DateOfBirth="1993-09-10" Gender="F" Name="&#x00C9;mily" />
</Students>

output

./xml_mini.x
Students
5 Student elements
number 0 name = April
number 1 name = Bob
number 2 name = Chad
number 3 name = Dave
number 4 name = &mily

from fox.

andreww commented on May 30, 2024

I've now had a chance to take a proper look at this. I'm afraid the way FoX is set up (and, in particular, the way the SAX parser works) makes it impossible to 'smuggle' a non-ASCII character in and out of the DOM as a character reference. The main problem is that tokenisation of the document involves converting character references into their ASCII representation and putting the result into an array of Fortran characters.
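To make the limitation concrete, here is a minimal sketch (not FoX's tokeniser; the program and variable names are made up for illustration) that decodes the character reference from the example file into its code point and checks whether it fits into the ASCII range:

```fortran
program charref_demo
   implicit none
   character(len=*), parameter :: ref = "&#x00C9;"   ! reference from file.xml
   character(len=16) :: fmt
   integer :: code, ios, ndigits

   ! Hexadecimal digits sit between the leading "&#x" and the trailing ";".
   ndigits = len(ref) - 4

   ! Build a format such as (Z4) that matches the number of hex digits.
   write(fmt, '(A,I0,A)') '(Z', ndigits, ')'
   read(ref(4:len(ref)-1), fmt, iostat=ios) code
   if (ios /= 0) stop "could not parse character reference"

   write(*, '(A,I0)') "code point = ", code
   if (code > 127) then
      ! No ASCII character exists for this code point.
      write(*, '(A)') "not representable as an ASCII character"
   else
      write(*, '(A)') "ASCII character: " // achar(code)
   end if
end program charref_demo
```

For &#x00C9; this reports code point 201, which lies outside the 0-127 ASCII range, so there is no single default-kind ASCII character to store it as.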

If &#x00C9; is included in text (between element tags), the SAX parser gives an error apologising that it "cannot digest" the character reference. This is the intended behaviour. When using the DOM you just end up with a "parsing failed" error, but this is ultimately the same error. I think it's a bug that you don't see this error when the character reference is part of an attribute value. This should probably be fixed...

To properly fix this would involve finally making the upgrade to allow FoX to handle Unicode. Those arrays of Fortran characters would need replacing with integer arrays of Unicode code points, and the reading and writing would need to be sorted out (Toby White once figured out this bit; it is possible in modern Fortran).
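As a rough illustration of the reading-and-writing part (a sketch only, not FoX code, and not necessarily the approach Toby White took), Fortran 2003 lets a program hold text in an ISO 10646 character kind and write it to a unit opened with UTF-8 encoding. The program name and the code-point array are assumptions made for this example, and it requires a compiler that provides the ISO_10646 kind (gfortran does):

```fortran
program ucs4_demo
   use, intrinsic :: iso_fortran_env, only: output_unit
   implicit none
   ! Kind for 4-byte ISO 10646 characters; compilation fails on the
   ! declaration of "name" below if the compiler returns -1 here.
   integer, parameter :: ucs4 = selected_char_kind('ISO_10646')
   ! Unicode code points for the name in file.xml: É m i l y
   integer, parameter :: codes(5) = [201, 109, 105, 108, 121]
   character(kind=ucs4, len=size(codes)) :: name
   integer :: i

   ! Convert each integer code point into a character of the wide kind.
   do i = 1, size(codes)
      name(i:i) = char(codes(i), kind=ucs4)
   end do

   ! Ask the runtime to encode output on standard output as UTF-8.
   open(output_unit, encoding='UTF-8')
   write(output_unit, '(A)') name
end program ucs4_demo
```

Storing the code points as integers, as suggested above, keeps the parser independent of the character kinds a given compiler supports; conversion to a character kind is then only needed at the I/O boundary.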

I think any quick fix to try to avoid the problem by storing the character reference is going to be very messy and involve surgery to the SAX parser and, I think, modifications to the DOM code. I really wouldn't want to go down that road.

from fox.

zmiimz commented on May 30, 2024

Dear Andrew,
thank you for the answer. I am aware of the difficulties with Unicode characters in Fortran, but the ability to handle (or ignore) extended XHTML character references is a basic and expected feature of any modern XML parser (this example comes from http://rosettacode.org/wiki/XML/Input#C, and most of the parsers used there support such character conversion). So, without changing the input file mentioned above, is the only option for Fortran at present to write an interface to LIBXML2 and use that?

from fox.
