medium_stats's Introduction

Medium Stats Scraper

A command-line tool and Python package to fetch your Medium profile statistics and save the data as JSON.

It executes the same API and GraphQL requests as the Medium front-end does, providing you with the data as-is, before it is rendered.

Install

$ pip install medium-stats

Setup

Step 1

To make authenticated requests to Medium for your stats, the CLI tool needs two cookies from a signed-in Medium session: "sid" and "uid".

If you want to manually find and store your cookies:

  • Sign in to Medium
  • Open your browser's developer tools and find the tab that holds cookies:
    • Application for Chrome
    • Storage for Firefox
  • Scroll through the cookies for medium.com until you find 'sid' and 'uid'

Create a .medium_creds.ini file to hold these cookie values:

cat > path_to_directory/.medium_creds.ini << EOF
[your_medium_handle_here]
sid=insert_sid_value_here
uid=insert_uid_value_here
EOF

# Note: by default, the package searches your home directory for this file,
# but you are welcome to keep it in whatever directory you like and pass
# that path as an argument to the CLI tool.

# Note: your Medium "handle" is your Medium username without the "@" prefix,
# e.g. "olivertosky" from https://medium.com/@olivertosky

If you want to automatically find and store your cookies:

$ pip install medium-stats[selenium]

This installs some extra dependencies that allow a web scraper to authenticate to Medium on your behalf and grab your "sid" and "uid" cookies. Note: you must already have Chrome installed.

Currently only valid for Gmail OAuth:

$ medium-stats fetch_cookies -u [HANDLE] --email [EMAIL] --pwd [PASSWORD]

# Or specify that your password should be pulled from an environment variable:
$ export MEDIUM_AUTH_PWD='[PASSWORD]'
$ medium-stats fetch_cookies -u [HANDLE] --email [EMAIL] --pwd-in-env

Step 2 - Optional:

Create a directory for your stats exports; by default the CLI tool runs under the current working directory.

$ mkdir path_to_target_stats_directory

Once executed, the CLI tool creates the following directory structure:

target_stats_directory/
    stats_exports/
        [HANDLE]/
            agg_stats/ 
            agg_events/ 
            post_events/
            post_referrers/

Usage

Command-Line

Simple Use:

$ medium-stats scrape_user -u [USERNAME] --all

This fetches all of the user's Medium stats up to the present.

For a publication:

$ medium-stats scrape_publication -u [USERNAME] -s [SLUG] --all

# The "slug" parameter is typically your publication's name in lower-case,
# with spaces delimited by dashes, and is the portion of your page's URL after "medium.com/"
# e.g. "test-publication" if the URL is https://medium.com/test-publication and name is "Test Publication"

General Use pattern:

medium-stats (scrape_user | scrape_publication) -u USERNAME/URL -s [PUBLICATION_SLUG] \
    [--output_dir DIR] (--creds PATH | (--sid SID --uid UID)) \
    (--all | [--start PERIOD_START] [--end PERIOD_END]) [--is-utc] \
    [--mode {summary, events, articles, referrers, story_overview}]

FLAGS:

flag          function                                             default
--all         gets all stats until now
--end         end of period for stats fetched [exclusive]          now (UTC)
--start       beginning of period for stats fetched [inclusive]    --end minus 1 day @ midnight
--is-utc      whether start/stop are already in UTC time           False
--output_dir  directory to hold stats exports                      current working directory
--creds       path to credentials file                             ~/.medium_creds.ini
--sid         your Medium session id from cookie
--uid         your Medium user id from cookie
--mode        limits retrieval to particular statistics            ['summary', 'events', 'articles', 'referrers'] for scrape_user
                                                                   ['events', 'story_overview', 'articles', 'referrers'] for scrape_publication
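
For example, a sketch of a scoped run (the flags come from the table above; the exact datetime format accepted by --start/--end isn't documented here, so ISO-style dates are an assumption):

$ medium-stats scrape_user -u your_handle --creds ~/.medium_creds.ini \
    --start 2020-01-01 --end 2020-04-01 --mode summary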

Python

Basic Usage:

#### SETUP ####
from datetime import datetime

start = datetime(year=2020, month=1, day=1)
stop = datetime(year=2020, month=4, day=1)

#### FOR A USER ####
from medium_stats.scraper import StatGrabberUser

# get aggregated summary statistics; note: start/stop will be converted to UTC
me = StatGrabberUser('username', sid='sid', uid='uid', start=start, stop=stop)
data = me.get_summary_stats()

# get the unattributed event logs for all your stories:
data_events = me.get_summary_stats(events=True)

# get individual article statistics
articles = me.get_article_ids(data) # returns a list of article_ids

article_events = me.get_all_story_stats(articles) # daily event logs
referrers = me.get_all_story_stats(articles, type_='referrer') # all-time referral sources

#### FOR A PUBLICATION ####
from medium_stats.scraper import StatGrabberPublication

# first argument should be your publication slug, i.e. what follows the URL after "medium.com/"
pub = StatGrabberPublication('test-publication', 'sid', 'uid', start, stop)

# get publication views & visitors (like the stats landing page)
views = pub.get_events(type_='views')
visitors = pub.get_events(type_='visitors')

# get summary stats for all publication articles
story_stats = pub.get_all_story_overview()

# get individual article statistics
articles = pub.get_article_ids(story_stats)

article_events = pub.get_all_story_stats(articles)
referrers = pub.get_all_story_stats(articles, type_='referrer')

# Note: if you want to pass naive-UTC datetimes, set already_utc=True when you instantiate
# the class, to avoid a timezone offset being applied. Better practice is to pass tz-aware
# datetimes to the "start" & "stop" params in the first place.

Note: "summary_stats" and "referrer" data pre-aggregates to your full history, i.e. they don't take into account "start" & "stop" parameters.

Example output:

data (summary):

[   {   'claps': 3,
        'collectionId': '',
        'createdAt': 1570229100438,
        'creatorId': 'UID',
        'firstPublishedAt': 1583526956495,
        'friendsLinkViews': 46,
        'internalReferrerViews': 17,
        'isSeries': False,
        'postId': 'ARTICLE_ID',
        'previewImage': {   'id': 'longstring.png',
                            'isFeatured': True,
                            'originalHeight': 311,
                            'originalWidth': 627},
        'readingTime': 7,
        'reads': 67,
        'slug': 'this-will-be-a-title',
        'syndicatedViews': 0,
        'title': 'This Will Be A Title',
        'type': 'PostStat',
        'updateNotificationSubscribers': 0,
        'upvotes': 3,
        'views': 394,
        'visibility': 0},
        ...
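
Each element is a per-post summary dict, so simple aggregates are a comprehension away; for example, using only the fields shown above:

total_views = sum(post["views"] for post in data)
total_reads = sum(post["reads"] for post in data)
read_ratio = total_reads / total_views if total_views else 0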

data_events:

[{  'claps': 0,
    'flaggedSpam': 0,
    'reads': 0,
    'timestampMs': 1585695600000,
    'updateNotificationSubscribers': 0,
    'upvotes': 0,
    'userId': 'UID',
    'views': 1},
        ...

article_events:

{'data': {
    'post': [{
        '__typename': 'Post',
        'dailyStats': [
            {   '__typename': 'DailyPostStat',
                'internalReferrerViews': 1,
                'memberTtr': 119,
                'periodStartedAt': 1583452800000,
                'views': 8},
            ... 
            {   '__typename': 'DailyPostStat',
                'internalReferrerViews': 5,
                'memberTtr': 375,
                'periodStartedAt': 1583539200000,
                'views': 40}],
        'earnings': {
            '__typename': 'PostEarnings',
            'dailyEarnings': [],
            'lastCommittedPeriodStartedAt': 1585526400000},
        'id': 'ARTICLE_ID'},
        ...
    ]}
}
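
If you prefer the nested GraphQL response as flat rows, a sketch based on the shape shown above:

rows = []
for post in article_events["data"]["post"]:
    for day in post["dailyStats"]:
        rows.append({
            "post_id": post["id"],
            "period_started_at": day["periodStartedAt"],  # epoch milliseconds
            "views": day["views"],
            "member_ttr": day["memberTtr"],
        })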

referrers:

{'data': {'post': [{'__typename': 'Post',
                    'id': 'POST_ID',
                    'referrers': [{'__typename': 'Referrer',
                                   'internal': None,
                                   'platform': None,
                                   'postId': 'POST_ID',
                                   'search': None,
                                   'site': None,
                                   'sourceIdentifier': 'direct',
                                   'totalCount': 222,
                                   'type': 'DIRECT'},
                                  ...
                                  {'__typename': 'Referrer',
                                   'internal': None,
                                   'platform': None,
                                   'postId': 'POST_ID',
                                   'search': None,
                                   'site': {'__typename': 'SiteReferrer',
                                            'href': 'https://www.inoreader.com/',
                                            'title': None},
                                   'sourceIdentifier': 'inoreader.com',
                                   'totalCount': 1,
                                   'type': 'SITE'}],
                    'title': 'TITLE_HERE',
                    'totalStats': {'__typename': 'SummaryPostStat',
                                   'views': 395}},
                    ...
                   ]
            }
}
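
Similarly, a sketch that tallies all-time views by referrer source, using only the fields shown:

from collections import Counter

by_source = Counter()
for post in referrers["data"]["post"]:
    for ref in post["referrers"]:
        by_source[ref["sourceIdentifier"]] += ref["totalCount"]

print(by_source.most_common(10))  # top referral sources across all posts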

If you have already set up your credentials file, there is a helper class, MediumConfigHelper, that wraps the standard configparser:

import os
from medium_stats.cli import MediumConfigHelper

default_creds = os.path.join(os.path.expanduser('~'), '.medium_creds.ini')

cookies = MediumConfigHelper(config_path=default_creds, account_name='your_handle')
sid = cookies.sid
uid = cookies.uid
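
These values drop straight into the scraper classes shown earlier:

from datetime import datetime
from medium_stats.scraper import StatGrabberUser

me = StatGrabberUser('your_handle', sid=sid, uid=uid,
                     start=datetime(2020, 1, 1), stop=datetime(2020, 4, 1))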

TODO:

  • Add story author and title to post stats

medium_stats's People

Contributors

daveflynn, otosky

medium_stats's Issues

Add aliases in medium_creds.ini file

I think it's common for people to be part of multiple publications. Right now, you have to make a separate section for each and copy the same sid/uid values, which would all have to be updated when they change.

It would be great to have a field like aliases to list the names of publications the values in a section also apply to.

Canonical name of publications for creds matching, etc.

When scraping a publication, it seems that the creds file is matched against the URL the user specifies, which might include the medium.com/ part or not. There should be a canonical name for a publication (IMHO just the part of the URL after https://medium.com/) that is used everywhere. That way, even if I specified the full URL, it would still match the creds based on just the canonical name.

Adding story topics from story settings page

It would be awesome to get the story topics for a specific story, as it makes it easier to see which story topics were getting the most traction during a given time. It would then help any user to focus on content around the topics which are getting the most hits.

Story topics can be fetched from: https://medium.com/p/<POST_ID>/settings

I would be glad to contribute towards this issue. Would really appreciate it if you could push me in the right direction.

doesn't seem to install via pip?

I'm sure I'm just missing something. I'm fairly new to python.

I'm running pip install medium_stats (and also medium-stats) but when I run the sample script included in the readme.md (with my info) it errors with:

File "<directory path>/medium_stats.py", line 3, in <module> from medium_stats.scraper import StatGrabberUser
ModuleNotFoundError: No module named 'medium_stats.scraper'; 'medium_stats' is not a package

This is happening on 2 different systems where other python scripts are running fine.

Fetch more than 50 stories (get_all_story_overview)

This is a fantastic library, thanks for building it!

I've got a publication with more than 50 articles, and would like to fetch stats for all of them. However, it seems like it's not possible to fetch more than 50 at the moment due to Medium's pagination - see your comment here:

# TODO: need to figure out how pagination works after limit exceeded

I am happy to test this using the publication I manage, if this helps.

Unable to scrape any stats: 404

Issue

Until a week ago, scraping publication stats worked.
Suddenly, last week, it stopped working.

Command:

medium-stats scrape_publication -u <username> -s <pubname> --output_dir . --sid "<mySID>" --uid "<myUID>" --all

The error:

Traceback (most recent call last):
  File "/Users/dave/data-projects/marketing-pipeline/venv/bin/medium-stats", line 8, in <module>
    sys.exit(main())
  File "/Users/dave/data-projects/marketing-pipeline/venv/lib/python3.10/site-packages/medium_stats/__main__.py", line 220, in main
    data = sg.get_all_story_overview()
  File "/Users/dave/data-projects/marketing-pipeline/venv/lib/python3.10/site-packages/medium_stats/scraper.py", line 294, in get_all_story_overview
    data = self._decode_json(response)
  File "/Users/dave/data-projects/marketing-pipeline/venv/lib/python3.10/site-packages/medium_stats/scraper.py", line 146, in _decode_json
    return json.loads(cleaned)["payload"]
  File "/Users/dave/.pyenv/versions/3.10.9/lib/python3.10/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/Users/dave/.pyenv/versions/3.10.9/lib/python3.10/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/Users/dave/.pyenv/versions/3.10.9/lib/python3.10/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

JSON is expected, but not returned.

Expected result

medium_stats would output the stats to ./stats_export/<publication>

Debugging steps

  • Changed cookie
  • Tried with VPN on/off
  • Dumped the response from the server and it seems to be a 404 page (though I can load the publication stats page directly)

Anyone else running into issues, got a workaround? Or is Medium updating its stats pages?

scrape_publication seems to require full URL

When I'm trying to scrape a publication under a custom URL, that works fine. But a regular publication under medium.com seems to require the full URL (at least including the domain).

So this works:
scrape_publication -u medium.com/name-of-publication
while this doesn't:
scrape_publication -u name-of-publication

The error is socket.gaierror: [Errno 8] nodename nor servname provided, or not known, which to me suggests that the script doesn't prepend the medium.com/ bit.
