Comments (6)
Pretty sure this is not a problem with Colly but with the terminal. Most terminals do not support Cyrillic output. If you put everything in a database, everything should look fine (I've crawled Cyrillic pages before and I know that it works). But if you really need the output displayed in the terminal, try something like Windows PowerShell ISE, which has fairly good support for displaying Unicode.
from colly.
It's not about the terminal; this example is just to reproduce the error. The data is also sent to the API incorrectly.
Yeah, I can reproduce it with colly/v2, too.
Solved the problem by adding a check for meta[http-equiv='Content-Type'] in the body when the Content-Type header contains "text/html" but no "charset". I don't know if this is the correct approach, but it solves the problem.
response.go:
package colly

import (
	"bytes"
	"fmt"
	"io/ioutil"
	"mime"
	"net/http"
	"strings"

	"github.com/PuerkitoBio/goquery"
	"github.com/saintfish/chardet"
	"golang.org/x/net/html/charset"
)

// Response is the representation of a HTTP response made by a Collector
type Response struct {
	// StatusCode is the status code of the Response
	StatusCode int
	// Body is the content of the Response
	Body []byte
	// Ctx is a context between a Request and a Response
	Ctx *Context
	// Request is the Request object of the response
	Request *Request
	// Headers contains the Response's HTTP headers
	Headers *http.Header
	// Trace contains the HTTPTrace for the request. Will only be set by the
	// collector if Collector.TraceHTTP is set to true.
	Trace *HTTPTrace
}

// Save writes response body to disk
func (r *Response) Save(fileName string) error {
	return ioutil.WriteFile(fileName, r.Body, 0644)
}

// FileName returns the sanitized file name parsed from "Content-Disposition"
// header or from URL
func (r *Response) FileName() string {
	_, params, err := mime.ParseMediaType(r.Headers.Get("Content-Disposition"))
	if fName, ok := params["filename"]; ok && err == nil {
		return SanitizeFileName(fName)
	}
	if r.Request.URL.RawQuery != "" {
		return SanitizeFileName(fmt.Sprintf("%s_%s", r.Request.URL.Path, r.Request.URL.RawQuery))
	}
	return SanitizeFileName(strings.TrimPrefix(r.Request.URL.Path, "/"))
}

// fixCharset converts the response body to UTF-8 when its encoding can be
// determined from defaultEncoding, the headers, the body, or detection.
func (r *Response) fixCharset(detectCharset bool, defaultEncoding string) error {
	if len(r.Body) == 0 {
		return nil
	}
	if defaultEncoding != "" {
		tmpBody, err := encodeBytes(r.Body, "text/plain; charset="+defaultEncoding)
		if err != nil {
			return err
		}
		r.Body = tmpBody
		return nil
	}
	contentType := strings.ToLower(r.Headers.Get("Content-Type"))

	if strings.Contains(contentType, "image/") ||
		strings.Contains(contentType, "video/") ||
		strings.Contains(contentType, "audio/") ||
		strings.Contains(contentType, "font/") {
		// These MIME types should not have textual data.
		return nil
	}

	// New check: the header says text/html but names no charset, so look for
	// a <meta http-equiv="Content-Type"> declaration in the body.
	if !strings.Contains(contentType, "charset") && strings.Contains(contentType, "text/html") {
		if !detectCharset {
			return nil
		}
		contentTypeBody := checkContentTypeInBody(string(r.Body))
		if contentTypeBody != "" {
			// Lowercase so the "charset" and "utf-8" checks below still match.
			contentType = strings.ToLower(contentTypeBody)
		}
	}

	// Still no charset: fall back to statistical detection.
	if !strings.Contains(contentType, "charset") {
		if !detectCharset {
			return nil
		}
		d := chardet.NewTextDetector()
		res, err := d.DetectBest(r.Body)
		if err != nil {
			return err
		}
		contentType = "text/plain; charset=" + res.Charset
	}
	if strings.Contains(contentType, "utf-8") || strings.Contains(contentType, "utf8") {
		return nil
	}
	tmpBody, err := encodeBytes(r.Body, contentType)
	if err != nil {
		return err
	}
	r.Body = tmpBody
	return nil
}

// encodeBytes decodes b to UTF-8 using the charset named in contentType.
func encodeBytes(b []byte, contentType string) ([]byte, error) {
	r, err := charset.NewReader(bytes.NewReader(b), contentType)
	if err != nil {
		return nil, err
	}
	return ioutil.ReadAll(r)
}

// checkContentTypeInBody returns the content attribute of the first
// <meta http-equiv='Content-Type'> element, or "" if there is none.
func checkContentTypeInBody(b string) string {
	doc, err := goquery.NewDocumentFromReader(strings.NewReader(b))
	if err != nil {
		// doc is nil on error, so return early instead of dereferencing it.
		return ""
	}
	metaContent, exists := doc.Find("meta[http-equiv='Content-Type']").Attr("content")
	if !exists {
		return ""
	}
	return metaContent
}
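A possible refinement, sketched under assumptions: the patch above tests the header with strings.Contains, which also matches a stray "charset" substring anywhere in the value. Parsing the header with the standard library's mime.ParseMediaType is stricter. The helper name needsMetaSniff is hypothetical and is not part of Colly.

```go
package main

import (
	"fmt"
	"mime"
)

// needsMetaSniff (hypothetical helper) reports whether a Content-Type header
// declares text/html without a charset parameter, which is the case where the
// patch falls back to inspecting <meta> tags in the body.
func needsMetaSniff(header string) bool {
	mediaType, params, err := mime.ParseMediaType(header)
	if err != nil {
		// Empty or malformed header: nothing reliable to act on.
		return false
	}
	_, hasCharset := params["charset"]
	return mediaType == "text/html" && !hasCharset
}

func main() {
	headers := []string{
		"text/html",
		"text/html; charset=windows-1251",
		"image/png",
	}
	for _, h := range headers {
		fmt.Printf("%q -> %v\n", h, needsMetaSniff(h))
	}
}
```

ParseMediaType lowercases the media type and the parameter names, so the comparison does not need a separate strings.ToLower call.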
There's a specific algorithm for detecting the encoding of an HTML document defined here: https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding. It also handles the <meta> tags.
It's implemented in Go here: https://pkg.go.dev/golang.org/x/net/html/charset#DetermineEncoding
There's even a recipe for integrating it into goquery: https://github.com/PuerkitoBio/goquery/wiki/Tips-and-tricks/7fad3f848d40fbc4504912e57fb52f8fcee7e348
We really should incorporate it into Colly.
Just did some testing. Apparently the default Colly charset detection thinks the encoding is actually ISO-8859-1. I checked that by having the fixCharset function in response.go print out the detected encoding. Maybe we can implement a new type of encoding detection, or fix any bugs in the current one?
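To show why an ISO-8859-1 guess garbles Cyrillic text, here is a small standard-library sketch. It assumes the page is actually windows-1251, which the thread does not confirm, and the helper name latin1ToUTF8 is made up for this example.

```go
package main

import "fmt"

// latin1ToUTF8 (hypothetical helper) decodes bytes as ISO-8859-1, where every
// byte value maps to the Unicode code point with the same number.
func latin1ToUTF8(b []byte) string {
	runes := make([]rune, len(b))
	for i, c := range b {
		runes[i] = rune(c)
	}
	return string(runes)
}

func main() {
	// "Привет" encoded as windows-1251 (assumed source encoding).
	raw := []byte{0xCF, 0xF0, 0xE8, 0xE2, 0xE5, 0xF2}

	// Misreading those bytes as ISO-8859-1 yields accented Latin letters.
	fmt.Println(latin1ToUTF8(raw)) // prints Ïðèâåò
}
```

The conversion itself succeeds without an error, so a wrong guess like this would pass through fixCharset silently and only show up as garbled output.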