Hello Elias, Advertools is a really great package ! Many thanks for


How to get initial url in output.jl?

eliasdabbas commented on May 24, 2024

How to get initial url in output.jl ?


Comments (2)

eliasdabbas commented on May 24, 2024

Thanks a lot @caroheymes
Glad you're using it!

This is a very important issue. The good news is that the solution already exists, in all the crawl files you already have. Maybe I should do a better job of documenting and explaining it.

In the crawl df, url refers to the crawled URL (the one that was downloaded, parsed, and had its elements saved to a file on disk), but as you know it might not be the URL that was requested. This typically happens when there are redirects, and there might be one, two, three, or even more of them.

Example:

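import pandas as pd

# The crawl output is a jsonlines file: one JSON object per line
crawl_df = pd.read_json('output_file.jl', lines=True)
# Keep url plus the redirect_* columns, and only the rows that were redirected
crawl_df.filter(regex='^url$|redirect_').dropna()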

	url	redirect_times	redirect_ttl	redirect_urls	redirect_reasons
0	https://www.nytimes.com/	1	19	https://nytimes.com/	301
83	https://www.nytimes.com/newsletters/realestate	2	18	http://www.nytimes.com/newsletters/realestate/@@https://www.nytimes.com/newsletters/realestate/	301@@301

The redirect_urls column contains all the URLs that were requested and redirected, leading finally to url. The first row has one redirect (301) and the second has two. Each row holds the full redirect chain (one URL in the first row, two in the second), so the first URL in redirect_urls is the one that was initially requested, and you can also see intermediate pages, if any.

As in all other columns, multiple values are separated by @@.
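
To make that concrete, here is a minimal sketch of pulling the initially requested URL into its own column. It assumes the first entry in redirect_urls is the URL that was originally requested (as in the two rows above), and that rows without any redirect have no value in redirect_urls. The initial_url column name and the third sample row are made up for illustration; in practice crawl_df would come from pd.read_json as shown earlier.

```python
import pandas as pd

# Sample rows mirroring the table above. The third row is a hypothetical
# page that was crawled without any redirect.
crawl_df = pd.DataFrame({
    'url': [
        'https://www.nytimes.com/',
        'https://www.nytimes.com/newsletters/realestate',
        'https://www.nytimes.com/section/world',
    ],
    'redirect_urls': [
        'https://nytimes.com/',
        'http://www.nytimes.com/newsletters/realestate/@@https://www.nytimes.com/newsletters/realestate/',
        None,
    ],
})

# Split each redirect chain on the @@ separator and take its first element,
# assumed here to be the URL that was originally requested.
first_requested = crawl_df['redirect_urls'].str.split('@@').str[0]

# Rows with no redirects have nothing in redirect_urls, so for those the
# crawled URL is taken to be the requested one.
crawl_df['initial_url'] = first_requested.fillna(crawl_df['url'])

print(crawl_df[['initial_url', 'url']])
```

Keeping initial_url next to url makes it easy to filter for rows where the two differ, which are the pages that were redirected.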

I hope this clarifies it.


eliasdabbas commented on May 24, 2024

Assuming the question is answered. Feel free to re-open if you have other questions on this topic.

