idna-convert's Introduction

IDNA Convert - pure PHP IDNA converter

Project homepage: http://idnaconv.net
by Matthias Sommerfeld [email protected]

Introduction

The IdnaConvert library converts internationalized domain names (see RFC 3492, RFC 5890, RFC 5891, RFC 5892, RFC 5893, RFC 5894 and RFC 6452 for details), as used by various registries worldwide, between their original (localized) form and the encoded form used in the DNS (Domain Name System).

The library provides two classes (ToIdn and ToUnicode respectively), which expose three public methods to convert between the respective forms. See the Example section below. This allows you to convert host names (simple labels like localhost or FQHNs like some-host.domain.example), email addresses and complete URLs.

Errors such as incorrectly encoded or invalid strings lead to various exceptions, which should help you find out what went wrong.

Unicode strings are expected to be UTF-8 strings. ACE strings (the Punycode form) are always 7-bit ASCII strings.

Installation

Via Composer

composer require algo26-matthias/idna-convert

Official ZIP Package

The official ZIP packages are discontinued. Stick to Composer or GitHub to acquire your copy, please.

Upgrading from a previous version

See the upgrading notes to learn about upgrading from a previous version.

Examples

Example 1.

Say we wish to encode the domain name nörgler.com:

<?php  
// Include the class
use Algo26\IdnaConvert\ToIdn;
// Instantiate it
$IDN = new ToIdn();
// The input string must be UTF-8. utf8_encode() assumes ISO-8859-1 input
// (and is deprecated as of PHP 8.2); skip it if the string is already UTF-8
$input = utf8_encode('nörgler.com');
// Encode it to its punycode presentation  
$output = $IDN->convert($input);  
// Output, what we got now  
echo $output; // This will read: xn--nrgler-wxa.com

Example 2.

We received an email from an internationalized domain and want to decode the address to its Unicode form.

<?php  
// Include the class
use Algo26\IdnaConvert\ToUnicode;
// Instantiate it
$IDN = new ToUnicode();
// The input string  
$input = '[email protected]';  
// Decode it to its Unicode presentation
$output = $IDN->convertEmailAddress($input);  
// Output, what we got now, if output should be in a format different to UTF-8  
// or UCS-4, you will have to convert it before outputting it  
echo utf8_decode($output); // This will read: andre@börse.knörz.info

Example 3.

The input is read from a UCS-4 coded file and encoded line by line. Each line is first transcoded to UTF-8 with the TranscodeUnicode class and then passed to convert().

<?php  
// Include the class
use Algo26\IdnaConvert\ToIdn;
use Algo26\IdnaConvert\TranscodeUnicode\TranscodeUnicode;
// Instantiate
$IDN = new ToIdn();
$UCTC = new TranscodeUnicode();
// Iterate through the input file line by line  
foreach (file('ucs4-domains.txt') as $line) {
    $utf8String = $UCTC->convert(trim($line), 'ucs4', 'utf8');
    echo $IDN->convert($utf8String);
    echo "\n";
}

Example 4.

We wish to convert a whole URI into the IDNA form, but leave the path or query string component of it alone. Just using convert() would lead to mangled paths or query strings. Here the public method convertUrl() comes into play:

<?php  
// Include the class
use Algo26\IdnaConvert\ToIdn;
// Instantiate it
$IDN = new ToIdn();
// The input string, a whole URI in UTF-8 (!)  
$input = 'http://nörgler:secret@nörgler.com/my_päth_is_not_ÄSCII/';
// Encode it to its punycode presentation  
$output = $IDN->convertUrl($input);
// Output, what we got now  
echo $output; // http://nörgler:[email protected]/my_päth_is_not_ÄSCII/

Example 5.

By default, the class converts strings according to IDNA version 2008. To support IDNA 2003, the class needs to be instantiated with an additional parameter.

<?php  
// Include the class  
use Algo26\IdnaConvert\ToIdn;
// Instantiate it, explicitly selecting IDNA 2008 (the default)
$IDN = new ToIdn(2008);
// Sth. containing the German letter ß
$input = 'meine-straße.example';
// Encode it to its punycode presentation
$output = $IDN->convert($input);
// Output, what we got now
echo $output; // xn--meine-strae-46a.example

// Switch to IDNA 2003, the original, now outdated standard
$IDN = new ToIdn(2003);
// Sth. containing the German letter ß  
$input = 'meine-straße.example';  
// Encode it to its punycode presentation  
$output = $IDN->convert($input);
// Output, what we got now  
echo $output; // meine-strasse.example

Encoding helper

In case you have strings in encodings other than ISO-8859-1 and UTF-8 you might need to translate these strings to UTF-8 before feeding them to the IDNA converter. PHP's built-in functions utf8_encode() and utf8_decode() can only deal with ISO-8859-1.
Use the encoding helper class supplied with this package for the conversion. It requires either iconv, libiconv or mbstring installed together with one of the relevant PHP extensions. The functions you will find useful are toUtf8() as a replacement for utf8_encode() and fromUtf8() as a replacement for utf8_decode().

Example usage:

<?php  
use Algo26\IdnaConvert\ToIdn;
use Algo26\IdnaConvert\EncodingHelper\ToUtf8;

$IDN = new ToIdn();
$encodingHelper = new ToUtf8();

$mystring = $encodingHelper->convert('<something in e.g. ISO-8859-15>', 'ISO-8859-15');
echo $IDN->convert($mystring);

UCTC — Unicode Transcoder

Another class you might find useful when dealing with one or more of the Unicode encoding flavours. It can transcode between the following formats:

  • UCS-4 string / array
  • UTF-8
  • UTF-7
  • UTF-7 IMAP (modified UTF-7)

All encodings expect / return a string in the given format, with one major exception: UCS-4 array is just an array, where each value represents one code-point in the string, i.e. every value is a 32-bit integer value.
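
To make the UCS-4 array form concrete without relying on this package, here is a minimal sketch built on PHP's mbstring functions (PHP 7.4 or later). The helper name toCodePoints() is hypothetical and not part of the library; it only illustrates what "one integer per code point" looks like.

```php
<?php
// Hypothetical helper (not part of idna-convert): split a UTF-8 string
// into an array of Unicode code points, one integer per character.
function toCodePoints(string $utf8): array
{
    $codePoints = [];
    foreach (mb_str_split($utf8, 1, 'UTF-8') as $char) {
        $codePoints[] = mb_ord($char, 'UTF-8');
    }
    return $codePoints;
}

// "\u{F6}" is the letter ö, so this is the label "nörgler"
print_r(toCodePoints("n\u{F6}rgler"));
```

The second element of the result is 246 (U+00F6), a single value for ö even though it occupies two bytes in UTF-8.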

Example usage:

<?php  
use Algo26\IdnaConvert\TranscodeUnicode\TranscodeUnicode;
$transcodeUnicode = new TranscodeUnicode();

$mystring = 'nörgler.com';  
echo $transcodeUnicode->convert($mystring, 'utf8', 'utf7imap');

Run PHPUnit tests

The library is supplied with a docker-compose.yml that lets you run the supplied tests. This assumes you have Docker installed and docker-compose available as a command. Just issue

docker-compose up

in your local command line and see the output of PHPUnit.

Reporting bugs

Please use the issues tab on GitHub to report any bugs or feature requests.

Contact the author

For questions, bug reports and security issues just send me an email.

algo26 Beratungs GmbH
c/o Matthias Sommerfeld
Zedernweg 1
D-16348 Wandlitz

Germany

mailto:[email protected]

idna-convert's People

Contributors

algo26-matthias, d00p, fubbyb, glensc, hackwar, josefglatz, jsmitty12, makarms, usox, vertexvaar, webignition

idna-convert's Issues

IdnaConvert::decode() fails to convert simple domain name

$converter = new IdnaConvert();
$converter->decode('xn--zcaj8cya.bar.baz');

The above code will fail with the following exception: This is not a punycode string

This is because it's pushing every domain part into Punycode::decode(), which will raise the aforementioned exception when the provided string (bar in this case) is not punycoded.

Imho parse_url() shouldn't be used within the decode() method. There should be a separate decodeUri() method, to reflect how encoding is handled. And of course the exception must be handled within the IdnaConvert class, i.e. just ignoring it and using the input string.
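
A minimal sketch of the label-wise handling suggested above, assuming that only labels carrying the ACE prefix xn-- need decoding. The function name decodeHostLabels() and the $decodeLabel callback are hypothetical stand-ins; the actual Punycode decoder is not reproduced here.

```php
<?php
// Hypothetical sketch: decode only those labels that carry the ACE prefix
// and pass every other label through unchanged.
// $decodeLabel stands in for a real Punycode label decoder.
function decodeHostLabels(string $host, callable $decodeLabel): string
{
    $labels = explode('.', $host);
    foreach ($labels as $i => $label) {
        // The ACE prefix is matched case-insensitively
        if (stripos($label, 'xn--') === 0) {
            $labels[$i] = $decodeLabel($label);
        }
    }
    return implode('.', $labels);
}

// Stub decoder that only marks which labels it was asked to decode
$stub = static function (string $label): string {
    return '<' . $label . '>';
};
echo decodeHostLabels('xn--zcaj8cya.bar.baz', $stub), "\n";
```

With the stub, only the first label is handed to the decoder; bar and baz are returned as they are, so no "not a punycode string" exception can arise from them.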

Needle and haystack mixed up

Hi all, while looking at the code, I came across an interesting block:

if (strpos('/', $host) !== false
            || strpos(':', $host) !== false
            || strpos('?', $host) !== false
            || strpos('@', $host) !== false
        ) {
            throw new InvalidCharacterException('Neither email addresses nor URLs are allowed', 205);
}

ToIdn.php Line: 39 v. 3.1.0
I think needle and haystack are mixed up in places
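
For reference, PHP's signature is strpos($haystack, $needle), so the host has to come first. A minimal sketch of the check with the arguments in that order follows; the function name containsUrlOrEmailCharacters() is hypothetical and only wraps the four tests from the quoted block.

```php
<?php
// Hypothetical helper: true if the host contains a character that
// indicates a URL or an email address rather than a bare host name.
function containsUrlOrEmailCharacters(string $host): bool
{
    foreach (['/', ':', '?', '@'] as $needle) {
        // Haystack first, needle second
        if (strpos($host, $needle) !== false) {
            return true;
        }
    }
    return false;
}

var_dump(containsUrlOrEmailCharacters('example.com/path')); // bool(true)
var_dump(containsUrlOrEmailCharacters('example.com'));      // bool(false)
// With swapped arguments the slash goes undetected:
var_dump(strpos('/', 'example.com/path'));                  // bool(false)
```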

Endless loops in `ToIdn::convert`

Describe the bug
I found two unicode points that lead to an endless loop in ToIdn::convert, see below.

To Reproduce
(new \Algo26\IdnaConvert\ToIdn(2003))->convert("\u{37a}");
or
(new \Algo26\IdnaConvert\ToIdn(2003))->convert("\u{33c7}");

Reproducible with v3.0.5 and dev-master. PHP version is PHP 8.1.12.

Use Travis CI for automated build testing

Is your feature request related to a problem? Please describe.
Issue #19 is trivial to catch and it looks like the tests cover it.

Having the tests run automatically when changes are pushed would prevent such matters from getting into master and releases.

Describe the solution you'd like

  • add require-dev dependencies to composer.json to make phpunit available to the test environment (instead of depending on the host machine version, if any)
  • add a .travis.yml build script to run the tests

Russian domains

Describe the bug
Blank screen without errors when converting Russian domains.

To Reproduce
utf-8 encoding

$domain = '*.фтс.рф';
$idn = new ToIdn();
$domain = $idn->convert($domain);

Desktop (please complete the following information):

  • Ubuntu 22
  • PHP 8.3
  • "mso/idna-convert": "^v4.0.1"

On version 3.1.0 everything works fine, as it should.

Class 'Algo26\IdnaConvert\Exception\AlreadyPunycodeException' not found

Describe the bug
There is a typing mistake in the name of the file defining \Algo26\IdnaConvert\Exception\AlreadyPunycodeException.

The file in which this exception is defined is src/Exception/AlreadyPunyocdeException.php. This filename has a typo - Punyocde instead of Punycode.

The classname to filename mismatch prevents the exception class from autoloading.
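
To see why the typo matters, here is a minimal sketch of how a PSR-4 style autoloader derives a file path from a class name. It assumes the package maps the namespace prefix Algo26\IdnaConvert\ to the src directory; that mapping and the function name psr4Path() are assumptions for illustration, not taken from the package's composer.json.

```php
<?php
// Hypothetical sketch of the PSR-4 lookup rule: strip the namespace prefix,
// turn the remaining namespace separators into directory separators and
// append ".php".
function psr4Path(string $class, string $prefix, string $baseDir): ?string
{
    if (strncmp($class, $prefix, strlen($prefix)) !== 0) {
        return null; // class is outside this prefix
    }
    $relative = substr($class, strlen($prefix));
    return rtrim($baseDir, '/') . '/' . str_replace('\\', '/', $relative) . '.php';
}

echo psr4Path(
    'Algo26\\IdnaConvert\\Exception\\AlreadyPunycodeException',
    'Algo26\\IdnaConvert\\', // assumed prefix
    'src'                    // assumed base directory
), "\n";
```

Under these assumptions the autoloader looks for src/Exception/AlreadyPunycodeException.php, which does not exist as long as the file is named AlreadyPunyocdeException.php.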

To Reproduce

use Algo26\IdnaConvert\ToIdn;

$idn = new ToIdn();
$idn->convert('xn--g6h');

// PHP exits with:
// Error: Class 'Algo26\IdnaConvert\Exception\AlreadyPunycodeException' not found

Expected behavior
\Algo26\IdnaConvert\Exception\AlreadyPunycodeException is thrown.

Question about B/C

We use idna convert for Joomla.
Is this library B/C with the former mso one?

Error: Prohibited input U+00000081

Hi! I have been using idna_convert.class.php (version 0.8.1, 2011-12-19) and it works OK.
Now I tried to use https://packagist.org/packages/algo26-matthias/idna-convert and I get an error:

CRITICAL - 2021-05-21 23:22:06 --> Prohibited input U+00000081
#0 .../vendor/algo26-matthias/idna-convert/src/NamePrep/NamePrep.php(54): Algo26\IdnaConvert\NamePrep\NamePrep->applyCharacterMaps(Array)
#1 .../vendor/algo26-matthias/idna-convert/src/Punycode/ToPunycode.php(51): Algo26\IdnaConvert\NamePrep\NamePrep->do(Array)
#2 .../vendor/algo26-matthias/idna-convert/src/ToIdn.php(58): Algo26\IdnaConvert\Punycode\ToPunycode->convert(Array)
#3 .../app/Helpers/domain_helper.php(11): Algo26\IdnaConvert\ToIdn->convert('\xC3\x90\xC2\xBC\xC3\x90\xC2\xB0\xC3\x91\xC2\x81\xC3\x91\xC2...')
#4 .../app/Controllers/Home.php(111): get_sitename('\xD0\xBC\xD0\xB0\xD1\x81\xD1\x82\xD0\xB5\xD1\x80\xD1\x81\xD0...')
#5 .../system/CodeIgniter.php(918): App\Controllers\Home->index()
#6 .../system/CodeIgniter.php(404): CodeIgniter\CodeIgniter->runController(Object(App\Controllers\Home))
#7 .../system/CodeIgniter.php(312): CodeIgniter\CodeIgniter->handleRequest(NULL, Object(Config\Cache), false)
#8 .../public/index.php(45): CodeIgniter\CodeIgniter->run()
#9 {main}

I tried to input a Cyrillic domain and got the error. Then I tried the domain from your example, nörgler.com, and got an error as well.

To Reproduce

  1. I required https://packagist.org/packages/algo26-matthias/idna-convert to my CodeIgniter 4 project
  2. I created helper app/Helpers/domain_helper.php
  3. I wrote code:
<?php
use Algo26\IdnaConvert\ToIdn;

function get_sitename($domain)
{    
    if (!preg_match("/[a-z.-]+$/", $domain)) {
        $IDN = new ToIdn(2003);
        $input = utf8_encode($domain);
        $domain = $IDN->convert($input);  
    }
    
    return $domain;
}

Expected behavior
I would like to transform cyrillic domain to punycode in my form

Finally I resolved the problem this way:

// $input = utf8_encode($domain);
$input = mb_convert_encoding($domain, 'utf-8', mb_detect_encoding($domain));
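
The byte sequences in frames #3 and #4 of the trace are consistent with double encoding: the domain was already UTF-8, and treating it as ISO-8859-1 turned every byte into its own character, including 0x81, which becomes U+0081. The sketch below reproduces that effect with mbstring and shows one possible guard. The helper name ensureUtf8() and its ISO-8859-1 fallback are assumptions for illustration, not part of the library.

```php
<?php
// Hypothetical guard: convert only if the input is not valid UTF-8 already.
// The fallback encoding is an assumption; pass the real one if it is known.
function ensureUtf8(string $input, string $fallbackEncoding = 'ISO-8859-1'): string
{
    if (mb_check_encoding($input, 'UTF-8')) {
        return $input;
    }
    return mb_convert_encoding($input, 'UTF-8', $fallbackEncoding);
}

$cyrillic = "\xD0\xBC\xD0\xB0"; // the letters "ма" in UTF-8

// Treating UTF-8 bytes as ISO-8859-1 encodes each byte a second time
$doubleEncoded = mb_convert_encoding($cyrillic, 'UTF-8', 'ISO-8859-1');
echo bin2hex($doubleEncoded), "\n"; // c390c2bcc390c2b0

// The guard leaves valid UTF-8 untouched
echo bin2hex(ensureUtf8($cyrillic)), "\n"; // d0bcd0b0
```

Checking validity first avoids guessing with mb_detect_encoding(), although it still cannot tell apart different single-byte encodings.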

Outdated Composer package

Hello,

I've noticed that "Mso\IdnaConvert" package is outdated and it recommends using "algo26-matthias/idna-convert". If this is the case, please update Composer package to have the latest code changes, as current version is still using "Mso\IdnaConvert\IdnaConvert" namespace instead of "Algo26\IdnaConvert\IdnaConvert". All examples mention the newer namespace.

Punycode.php Exception

Hi Guys,

is there any logical reason, why an exception is thrown, if the string is already a punycode string?
Punycode.php

 public function encode($decoded)
    {
        // We cannot encode a domain name containing the Punycode prefix
        $extract = self::byteLength(self::punycodePrefix);
        $check_pref = $this->UnicodeTranscoder->utf8_ucs4array(self::punycodePrefix);
        $check_deco = array_slice($decoded, 0, $extract);

        if ($check_pref == $check_deco) {
            throw new \InvalidArgumentException('This is already a Punycode string');
        }

In my opinion, it should not throw an exception. Instead, the string is valid and should be directly returned as is.
What do you think?

domains with idna localpart and "normal" tld can't be encoded

<?php

require_once(__DIR__ . '/IdnaConvert.php');
use Mso\IdnaConvert\IdnaConvert;

$IDN = new IdnaConvert();
$output = $IDN->encode('knörz.info');
error_log($output);
$output2 = $IDN->decode($output);
error_log($output2);

results in

$ php test.php
xn--knrz-6qa.info
PHP Fatal error:  Uncaught exception 'InvalidArgumentException' with message 'This is not a punycode string' in <dir>/IdnaConvert.php:431
Stack trace:
#0 <dir>/IdnaConvert.php(279): Mso\IdnaConvert\IdnaConvert->_decode('info')
#1 <dir>/test.php(9): Mso\IdnaConvert\IdnaConvert->decode('xn--knrz-6qa.in...')
#2 {main}
  thrown in <dir>/IdnaConvert.php on line 431

because "_decode" also tries to decode the "info" which is not punycode. If I try it with "knörz.knörz" it works. Is this a bug or am I doing something else wrong?

php infos:

$ php --version
PHP 5.4.39 (cli) (built: Mar 19 2015 06:59:35) 
Copyright (c) 1997-2014 The PHP Group
Zend Engine v2.4.0, Copyright (c) 1998-2014 Zend Technologies
    with Suhosin v0.9.37.1, Copyright (c) 2007-2014, by SektionEins GmbH

update: btw. it works with version 0.9.0

Tilde causes exceptions

Describe the bug

A URL like http://example.com/~test/ causes the exception NAMEPREP: Prohibited input U+0000007E. in file idna-convert/src/Punycode.php in function namePrep near line 335.

To Reproduce
Steps to reproduce the behavior:

  1. Call (new IdnaConvert())->encode('http://example.com/~test/')

Expected behavior

No exception because tilde is a valid character in URLs according to RFC 3986:

2.3. Unreserved Characters

Characters that are allowed in a URI but do not have a reserved purpose are called unreserved. These include uppercase and lowercase letters, decimal digits, hyphen, period, underscore, and tilde.

 unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"

URIs that differ in the replacement of an unreserved character with its corresponding percent-encoded US-ASCII octet are equivalent: they identify the same resource. However, URI comparison implementations do not always perform normalization prior to comparison (see Section 6). For consistency, percent-encoded octets in the ranges of ALPHA (%41-%5A and %61-%7A), DIGIT (%30-%39), hyphen (%2D), period (%2E), underscore (%5F), or tilde (%7E) should not be created by URI producers and, when found in a URI, should be decoded to their corresponding unreserved characters by URI normalizers.

Solution

NamePrepData::$generalProhibited should not include 126.

Additional link

TYPO3 CMS is affected by this bug: https://forge.typo3.org/issues/86921
