Comments (2)
Thanks a lot @caroheymes
Glad you're using it!
This is a very important issue. The good news is that the solution already exists, in all the crawl files you already have. Maybe I should do a better job of documenting and explaining it.
In the crawl df, url refers to the crawled URL (downloaded, parsed, and its elements saved to a file on disk), but as you know it might not be the URL that was requested. This typically happens when there are redirects. The issue is that we might have one, two, three, or even more redirects.
Example:
import pandas as pd
crawl_df = pd.read_json('output_file.jl', lines=True)
crawl_df.filter(regex='^url$|redirect_').dropna()
| | url | redirect_times | redirect_ttl | redirect_urls | redirect_reasons |
|---|---|---|---|---|---|
| 0 | https://www.nytimes.com/ | 1 | 19 | https://nytimes.com/ | 301 |
| 83 | https://www.nytimes.com/newsletters/realestate | 2 | 18 | http://www.nytimes.com/newsletters/realestate/@@https://www.nytimes.com/newsletters/realestate/ | 301@@301 |
The redirect_urls column contains all URLs requested and redirected, leading finally to url. In the first row we have one redirect (301), and in the second we have two. We also have the full redirect chain in both rows (one URL in the first and two in the second), so you can also see intermediate pages, if any.
As in all other columns, multiple values are separated by @@.
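To work with each hop separately, the @@-delimited values can be split into lists. The following is a minimal sketch using the two rows from the table above as sample data. The names redirects, redirect_chain, status_chain, requested_url, and hops are made up for illustration and are not part of advertools. It also assumes that the first entry in redirect_urls is the URL that was originally requested, which matches the two rows shown here.

```python
import pandas as pd

# Sample rows copied from the table above; in practice these would come
# from the filtered crawl_df read from the crawl output file.
redirects = pd.DataFrame({
    "url": [
        "https://www.nytimes.com/",
        "https://www.nytimes.com/newsletters/realestate",
    ],
    "redirect_urls": [
        "https://nytimes.com/",
        "http://www.nytimes.com/newsletters/realestate/"
        "@@https://www.nytimes.com/newsletters/realestate/",
    ],
    "redirect_reasons": ["301", "301@@301"],
})

# Split the "@@"-delimited strings into lists, one element per hop.
redirects["redirect_chain"] = redirects["redirect_urls"].str.split("@@")

# Status codes may be read as numbers when there is a single redirect,
# so cast to str before splitting (assumes no missing values here).
redirects["status_chain"] = (
    redirects["redirect_reasons"].astype(str).str.split("@@")
)

# Assumption: the first URL in the chain is the one originally requested.
redirects["requested_url"] = redirects["redirect_chain"].str[0]

# Number of hops before reaching the final crawled URL in the url column.
redirects["hops"] = redirects["redirect_chain"].str.len()

print(redirects[["requested_url", "url", "hops"]])
```

If each hop should be its own row instead, calling explode on the redirect_chain column of the resulting DataFrame is one option.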
I hope this clarifies it.
from advertools.
Assuming the question is answered. Feel free to re-open if you have other questions on this topic.