Comments (6)

blagoySimandov commented on June 10, 2024

Pretty sure this is not a problem with Colly but with the terminal. Most terminals do not support Cyrillic output. If you put everything in a database, it should look fine (I've crawled Cyrillic pages before and I know that it works). But if you really need the output displayed in the terminal, try something like Windows PowerShell ISE, which has fairly good support for displaying Unicode.

from colly.

Dinver commented on June 10, 2024

It's not about the terminal; this example is just to reproduce the error. The data is also sent incorrectly over the API.

WGH- commented on June 10, 2024

Yeah, I can reproduce it with colly/v2, too

Dinver commented on June 10, 2024

Solved the problem by adding a check for meta[http-equiv='Content-Type'] in the body when the Content-Type header contains "text/html" but no "charset". I don't know if this is the correct approach, but it solves the problem.

response.go:

package colly

import (
	"bytes"
	"fmt"
	"io/ioutil"
	"mime"
	"net/http"
	"strings"

	"github.com/PuerkitoBio/goquery"
	"github.com/saintfish/chardet"
	"golang.org/x/net/html/charset"
)

// Response is the representation of a HTTP response made by a Collector
type Response struct {
	// StatusCode is the status code of the Response
	StatusCode int
	// Body is the content of the Response
	Body []byte
	// Ctx is a context between a Request and a Response
	Ctx *Context
	// Request is the Request object of the response
	Request *Request
	// Headers contains the Response's HTTP headers
	Headers *http.Header
	// Trace contains the HTTPTrace for the request. Will only be set by the
	// collector if Collector.TraceHTTP is set to true.
	Trace *HTTPTrace
}

// Save writes response body to disk
func (r *Response) Save(fileName string) error {
	return ioutil.WriteFile(fileName, r.Body, 0644)
}

// FileName returns the sanitized file name parsed from "Content-Disposition"
// header or from URL
func (r *Response) FileName() string {
	_, params, err := mime.ParseMediaType(r.Headers.Get("Content-Disposition"))
	if fName, ok := params["filename"]; ok && err == nil {
		return SanitizeFileName(fName)
	}
	if r.Request.URL.RawQuery != "" {
		return SanitizeFileName(fmt.Sprintf("%s_%s", r.Request.URL.Path, r.Request.URL.RawQuery))
	}
	return SanitizeFileName(strings.TrimPrefix(r.Request.URL.Path, "/"))
}

func (r *Response) fixCharset(detectCharset bool, defaultEncoding string) error {
	if len(r.Body) == 0 {
		return nil
	}
	if defaultEncoding != "" {
		tmpBody, err := encodeBytes(r.Body, "text/plain; charset="+defaultEncoding)
		if err != nil {
			return err
		}
		r.Body = tmpBody
		return nil
	}
	contentType := strings.ToLower(r.Headers.Get("Content-Type"))

	if strings.Contains(contentType, "image/") ||
		strings.Contains(contentType, "video/") ||
		strings.Contains(contentType, "audio/") ||
		strings.Contains(contentType, "font/") {
		// These MIME types should not have textual data.

		return nil
	}

	if !strings.Contains(contentType, "charset") && strings.Contains(contentType, "text/html") {
		if !detectCharset {
			return nil
		}
		contentTypeBody := checkContentTypeInBody(string(r.Body))
		if contentTypeBody != "" {
			// Lowercase the value so the "utf-8" check below also matches "UTF-8".
			contentType = strings.ToLower(contentTypeBody)
		}
	}

	if !strings.Contains(contentType, "charset") {
		if !detectCharset {
			return nil
		}
		d := chardet.NewTextDetector()
		// Named "result" so it does not shadow the receiver r.
		result, err := d.DetectBest(r.Body)
		if err != nil {
			return err
		}
		contentType = "text/plain; charset=" + result.Charset
	}
	if strings.Contains(contentType, "utf-8") || strings.Contains(contentType, "utf8") {
		return nil
	}
	tmpBody, err := encodeBytes(r.Body, contentType)
	if err != nil {
		return err
	}
	r.Body = tmpBody
	return nil
}

func encodeBytes(b []byte, contentType string) ([]byte, error) {
	r, err := charset.NewReader(bytes.NewReader(b), contentType)
	if err != nil {
		return nil, err
	}
	return ioutil.ReadAll(r)
}

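// checkContentTypeInBody returns the content attribute of the first
// <meta http-equiv="Content-Type"> element in b, or "" if there is none.
func checkContentTypeInBody(b string) string {
	doc, err := goquery.NewDocumentFromReader(strings.NewReader(b))
	if err != nil {
		// doc is nil on error, so return early instead of dereferencing it.
		return ""
	}
	// Attr returns "" when the selection is empty.
	metaContent, _ := doc.Find("meta[http-equiv='Content-Type']").Attr("content")
	return metaContent
}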

WGH- commented on June 10, 2024

There's a specific algorithm for detecting the encoding of an HTML document, defined here: https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding. It also handles <meta> tags.

It's implemented in Go here: https://pkg.go.dev/golang.org/x/net/html/charset#DetermineEncoding

There's even a recipe for integrating it into goquery: https://github.com/PuerkitoBio/goquery/wiki/Tips-and-tricks/7fad3f848d40fbc4504912e57fb52f8fcee7e348

We really should incorporate it into Colly.

blagoySimandov commented on June 10, 2024

Just did some testing. Apparently the default Colly charset detection thinks the encoding is ISO-8859-1. I checked that by having the "fixCharset" function in the response file print out the encoding. Maybe we can try to implement a new type of encoding detection, or try to fix any bugs in the current one?
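
To show why an ISO-8859-1 guess garbles Cyrillic text, here is a small standard-library-only sketch. It assumes the page is actually windows-1251 (an assumption for this example; the thread does not state the page's real encoding). Both helper functions are hypothetical, and decodeWindows1251Letters covers only ASCII and the А-я letter range, not the full code page.

```go
package main

import "fmt"

// decodeLatin1 interprets each byte as the Unicode code point with the same
// value, which is how ISO-8859-1 maps bytes to characters.
func decodeLatin1(b []byte) string {
	runes := make([]rune, len(b))
	for i, c := range b {
		runes[i] = rune(c)
	}
	return string(runes)
}

// decodeWindows1251Letters is a deliberately partial windows-1251 decoder:
// ASCII passes through, bytes 0xC0-0xFF map to U+0410-U+044F (А-я), and
// everything else becomes U+FFFD.
func decodeWindows1251Letters(b []byte) string {
	runes := make([]rune, len(b))
	for i, c := range b {
		switch {
		case c < 0x80:
			runes[i] = rune(c)
		case c >= 0xC0:
			runes[i] = rune(0x0410 + int(c-0xC0))
		default:
			runes[i] = '\uFFFD'
		}
	}
	return string(runes)
}

func main() {
	// "Привет" as windows-1251 bytes.
	raw := []byte{0xCF, 0xF0, 0xE8, 0xE2, 0xE5, 0xF2}

	fmt.Println(decodeLatin1(raw))             // prints "Ïðèâåò"
	fmt.Println(decodeWindows1251Letters(raw)) // prints "Привет"
}
```

The same bytes yield readable Cyrillic or accented Latin letters depending only on which encoding the decoder assumes, which matches the symptom described in this issue.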
