Comments (5)

andreww commented on May 30, 2024

Yes, I think so. I certainly don't know of an XML parser written in Fortran that supports character references to Unicode characters.

from fox.

andreww commented on May 30, 2024

Could you post the Fortran code you are using along with the (X)HTML document (here or elsewhere)?

from fox.

zmiimz commented on May 30, 2024

program xml_mini
   use FoX_dom
   use FoX_sax
   implicit none
   integer :: i
   type(Node), pointer :: doc => null()
   type(Node), pointer :: p1 => null()
   type(Node), pointer :: p2 => null()
   type(NodeList), pointer :: pointList => null()
   character(len=100) :: name

   ! Parse the whole document into a DOM tree.
   doc => parseFile("file.xml")
   if (.not. associated(doc)) stop "error doc"

   ! Take the first <Students> element (DOM lists are 0-indexed).
   p1 => item(getElementsByTagName(doc, "Students"), 0)
   if (.not. associated(p1)) stop "error p1"
   write(*,*) getNodeName(p1)

   ! Collect every <Student> element below it.
   pointList => getElementsByTagName(p1, "Student")
   write(*,*) getLength(pointList), "Student elements"

   ! Print the Name attribute of each student.
   do i = 0, getLength(pointList) - 1
      p2 => item(pointList, i)
      call extractDataAttribute(p2, "Name", name)
      write(*,*) "number ", i, " name = ", name
   enddo

   ! Release the memory held by the DOM tree.
   call destroy(doc)

end program xml_mini

file.xml

<Students>
  <Student Name="April" Gender="F" DateOfBirth="1989-01-02" />
  <Student Name="Bob" Gender="M"  DateOfBirth="1990-03-04" />
  <Student Name="Chad" Gender="M"  DateOfBirth="1991-05-06" />
  <Student Name="Dave" Gender="M"  DateOfBirth="1992-07-08">
    <Pet Type="dog" Name="Rover" />
  </Student>
  <Student DateOfBirth="1993-09-10" Gender="F" Name="&#x00C9;mily" />
</Students>

output

./xml_mini.x
Students
5 Student elements
number 0 name = April
number 1 name = Bob
number 2 name = Chad
number 3 name = Dave
number 4 name = &mily

from fox.

andreww commented on May 30, 2024

I've now had a chance to take a proper look at this. I'm afraid the way FoX is set up (and, in particular, the way the SAX parser works) makes it impossible to 'smuggle' a non-ASCII character in and out of the DOM as a character reference. The main problem is that tokenisation of the document converts character references into their ASCII representation and stores the result in an array of Fortran characters.

If &#x00C9; is included in text (between element tags) the SAX parser gives an error apologising that it "cannot digest" the character reference. This is the intended behaviour. When using the DOM you just end up with a "parsing failed" error, but this is ultimately the same error. I think it's a bug that you don't see this error when the character reference is part of an attribute value. This should probably be fixed...

Fixing this properly would involve finally making the upgrade to allow FoX to handle Unicode. Those arrays of Fortran characters would need replacing with integer arrays of Unicode code points, and the reading and writing would need sorting out (Toby White once figured out this bit; it is possible in modern Fortran).

I think any quick fix to try to avoid the problem by storing the character reference is going to be very messy and involve surgery to the SAX parser and, I think, modifications to the DOM code. I really wouldn't want to go down that road.
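
To make the representation change described above concrete (integer code points instead of Fortran characters), here is a minimal sketch in plain Fortran. It is not FoX code and does not use any FoX API: the module name `charref_sketch` and both function names are hypothetical, chosen only for illustration. It also assumes the compiler's `char()` intrinsic maps the values 128 to 255 to the corresponding raw bytes, which is processor-dependent although common in practice.

```fortran
module charref_sketch
  ! Hypothetical helper module for illustration; nothing here is part of FoX.
  implicit none
  private
  public :: charref_to_codepoint, codepoint_to_utf8

contains

  ! Parse a numeric character reference such as "&#x00C9;" or "&#201;"
  ! into an integer code point. Returns -1 if the string is malformed
  ! or the value lies beyond U+10FFFF.
  function charref_to_codepoint(ref) result(cp)
    character(len=*), intent(in) :: ref
    integer :: cp
    character(len=*), parameter :: digits = "0123456789ABCDEF"
    character(len=1) :: c
    integer :: n, i, first, base, d, val

    cp = -1
    n = len_trim(ref)
    if (n < 4) return
    if (ref(1:2) /= "&#" .or. ref(n:n) /= ";") return

    if (ref(3:3) == "x" .or. ref(3:3) == "X") then
      base = 16
      first = 4
    else
      base = 10
      first = 3
    end if
    if (first > n - 1) return          ! no digits at all, e.g. "&#x;"

    val = 0
    do i = first, n - 1
      c = ref(i:i)
      ! Fold lowercase hex digits to uppercase (ASCII comparison).
      if (lge(c, "a") .and. lle(c, "f")) c = achar(iachar(c) - 32)
      d = index(digits(1:base), c) - 1
      if (d < 0) return                ! not a digit in this base
      val = val * base + d
      if (val > 1114111) return        ! beyond U+10FFFF
    end do
    cp = val
  end function charref_to_codepoint

  ! Encode one code point as a UTF-8 byte sequence held in a default
  ! character string. Returns an empty string for invalid input
  ! (negative, above U+10FFFF, or a surrogate).
  function codepoint_to_utf8(cp) result(bytes)
    integer, intent(in) :: cp
    character(len=:), allocatable :: bytes

    if (cp < 0 .or. cp > 1114111 .or. (cp >= 55296 .and. cp <= 57343)) then
      bytes = ""
    else if (cp < 128) then
      bytes = achar(cp)
    else if (cp < 2048) then
      bytes = char(192 + ishft(cp, -6)) // &
              char(128 + iand(cp, 63))
    else if (cp < 65536) then
      bytes = char(224 + ishft(cp, -12)) // &
              char(128 + iand(ishft(cp, -6), 63)) // &
              char(128 + iand(cp, 63))
    else
      bytes = char(240 + ishft(cp, -18)) // &
              char(128 + iand(ishft(cp, -12), 63)) // &
              char(128 + iand(ishft(cp, -6), 63)) // &
              char(128 + iand(cp, 63))
    end if
  end function codepoint_to_utf8

end module charref_sketch
```

With this sketch, `charref_to_codepoint("&#x00C9;")` gives the integer 201 (U+00C9), and `codepoint_to_utf8(201)` gives the two bytes C3 89, which a UTF-8 terminal displays as É. The point is only that text kept as integers survives intact and is turned into bytes at the output boundary; wiring such a representation through the tokeniser and the DOM is the larger job described above.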

from fox.

zmiimz commented on May 30, 2024

Dear Andrew,
thank you for the answer. I am aware of the problems with Unicode characters in Fortran, but the ability to handle (or ignore) extended special XHTML characters is a basic and expected feature of any modern XML parser (this example comes from http://rosettacode.org/wiki/XML/Input#C, and most of the parsers used there support such character transformations). So, without changing the input file, is the only option for Fortran now to write an interface to LIBXML2 and use that?

from fox.
