
equitycharacteristics's Introduction


For financial research, we need equity characteristics. This repository is a toolkit for calculating asset characteristics at the individual-equity and portfolio levels.

Prerequisite

  • Read the listed papers
  • WRDS account with subscriptions to CRSP, Compustat, and IBES
  • Python

Files

Main Files

  • accounting_60_hxz.py -- most annual, quarterly, and monthly frequency characteristics
  • functions.py -- imputation and ranking functions
  • merge_chars.py -- merge all the characteristics from different pickle files into one pickle file
  • impute_rank_output_bchmk.py -- impute the missing values and standardize the raw data
  • iclink.py -- preparation for IBES
  • pkl_to_csv.py -- convert the pickle file to csv

Single Characteristic Files

  • beta.py -- 3-month rolling CAPM beta
  • rvar_capm.py, rvar_ff3.py -- residual variance of the CAPM and the Fama-French three-factor model, rolling window is 3 months
  • rvar_mean.py -- variance of returns, rolling window is 3 months
  • abr.py -- cumulative abnormal returns around earnings announcement dates
  • myre.py -- revisions in analysts’ earnings forecasts
  • sue.py -- unexpected quarterly earnings
  • ill.py -- illiquidity, rolling window is 3 months
  • maxret_d.py -- maximum daily returns, rolling window is 3 months
  • std_dolvol.py -- std of dollar trading volume, rolling window is 3 months
  • std_turn.py -- std of share turnover, rolling window is 3 months
  • bid_ask_spread.py -- bid-ask spread, rolling window is 3 months
  • zerotrade.py -- number of zero-trading days, rolling window is 3 months

How to use

  1. run accounting_60_hxz.py
  2. run all the single characteristic files (you can run them in parallel; see the sketch after this list)
  3. run merge_chars.py
  4. run impute_rank_output_bchmk.py (you may want to comment out the sp1500 part of this file if you just need the all-stocks version)
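
A minimal sketch of step 2's parallel run, assuming each single-characteristic file can be launched as a standalone script from one folder (the file list mirrors the Single Characteristic Files section; adjust paths to your layout):

    # run_single_chars.py -- launch every single-characteristic script in parallel
    import subprocess

    scripts = ['beta.py', 'rvar_capm.py', 'rvar_ff3.py', 'rvar_mean.py',
               'abr.py', 'myre.py', 'sue.py', 'ill.py', 'maxret_d.py',
               'std_dolvol.py', 'std_turn.py', 'bid_ask_spread.py', 'zerotrade.py']

    # start all scripts at once, then wait for each to finish
    procs = [subprocess.Popen(['python', s]) for s in scripts]
    for p in procs:
        p.wait()

Note that each script opens its own WRDS connection and worker pool, so running all of them at once multiplies the memory and database-connection load.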

Outputs

Data

The date range is 1972 to 2019. The stock universe is the three major US exchanges (NYSE/AMEX/NASDAQ).

The timing convention is $ret_t = chars_{t-1}$: the return in month $t$ is matched with the characteristics observed at the end of month $t-1$.
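
A minimal pandas sketch of this alignment, assuming a long panel with one row per permno-month (column names here are illustrative):

    import pandas as pd

    # panel: one row per (permno, date) with the month-t return and same-month characteristics
    panel = panel.sort_values(['permno', 'date'])
    char_cols = ['beta', 'me']  # illustrative characteristic columns
    # lag characteristics one month within each stock, so row t holds ret_t and chars_{t-1}
    panel[char_cols] = panel.groupby('permno')[char_cols].shift(1)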

  1. chars_raw_no_impute.feather (all data, with missing values left as-is)
  2. chars_raw_imputed.feather (missing values imputed with industry median/mean values)
  3. chars_rank_no_imputed.feather (standardized version of chars_raw_no_impute.feather)
  4. chars_rank_imputed.feather (standardized version of chars_raw_imputed.feather)

Information Variables:

  • stock indicator: gvkey, permno
  • time: datadate, date, year ('datadate' is when the data becomes available and 'date' is the date of the return)
  • industry: sic, ffi49
  • exchange info: exchcd, shrcd
  • return: ret (we also provide the original return and the return without dividends; you can keep them by modifying impute_rank_output_bchmk.py)
  • market equity: me/rank_me

Method

Equity Characteristics

This topic is surveyed by Green, Hand, and Zhang and by Hou, Xue, and Zhang.

Portfolio Characteristics

A portfolio characteristic is the equal-weighted / value-weighted average of that characteristic across all equities in the portfolio.
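
A minimal pandas sketch of the value-weighted version, assuming a long panel with one row per stock-month and columns for the date, a portfolio label, market equity, and the characteristic (all names here are illustrative):

    import pandas as pd

    def vw_portfolio_char(df, char, weight='me'):
        # value-weighted average of one characteristic within each portfolio-month
        out = df.groupby(['portfolio', 'date']).apply(
            lambda g: (g[char] * g[weight]).sum() / g[weight].sum())
        return out.rename(char).reset_index()

The equal-weighted version is simply df.groupby(['portfolio', 'date'])[char].mean().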

The portfolios include, but are not limited to:

Reference

Papers

Many papers have contributed to this repository; apologies for listing only the following.

Codes

All comments are welcome.

equitycharacteristics's People

Contributors

ericma4, velonisa, xinhe97


equitycharacteristics's Issues

Out of memory when running chars60/beta.py

When running python char60/beta.py, my machine runs for about 37 minutes before running out of memory and crashing with the following standard output in my Ubuntu 22 terminal:

[1] 7949 killed python char60/beta.py

I am not sure if this is a memory-related error, but some aspects make this seem the most likely conclusion to me:

  1. The memory usage of Python after about 10 minutes is still only 9.5 GB and keeps increasing.
  2. After about 15-20 minutes, no network traffic is detected anymore, so I assume that the call to WRDS was successful. CPU usage increases significantly when network traffic falls to zero.
  3. Memory usage keeps increasing steadily, then hovers around 24 GB after about 30 minutes.

Questions

A) I'm using a computer with 32 GB RAM - is this too little for running the scripts from this repository?
B) Am I doing something wrong when running this script?
C) Should I even run the script or should I only run pychars/beta.py?
  D) If this is a bug, would you like me to look into it? In issue #11, williamjin1992 mentions that the script is unnecessarily slow.

Thank you!

Possible faster method for calculating Beta

In the file char60/beta.py, I noticed that there is a TODO for a faster way to get the rolling beta estimate. The original function get_beta uses the matrix form of the OLS formula to estimate beta; I think the covariance representation can be faster. Here is my suggestion:

    def get_beta_VCV(df):
        temp = crsp.loc[df.index, :]
        vcv = temp.loc[:, ['exret', 'mktrf']].cov(min_periods=None, ddof=1).values
        beta = vcv[0, 1] / vcv[1, 1]
        return beta
I've tested it on my computer, and it is faster than the matrix operation approach.
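
A self-contained check on synthetic data that the covariance representation matches the OLS slope (names and sizes here are illustrative):

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    df = pd.DataFrame({'mktrf': rng.normal(size=60)})
    df['exret'] = 1.2 * df['mktrf'] + rng.normal(scale=0.5, size=60)

    # OLS slope via the normal equations (intercept included)
    X = np.column_stack([np.ones(len(df)), df['mktrf']])
    beta_ols = np.linalg.solve(X.T @ X, X.T @ df['exret'])[1]

    # covariance representation: beta = cov(exret, mktrf) / var(mktrf)
    vcv = df[['exret', 'mktrf']].cov().values
    beta_cov = vcv[0, 1] / vcv[1, 1]

    assert np.isclose(beta_ols, beta_cov)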

Add a Constant to allow clearer indication where data is written

The output from the char files is written to the current directory. This may not be desirable if the directory is network-mounted or synced across multiple computers.

Proposed solution: add a constant at the top of affected files as follows:

# Output directory, e.g. the current directory './' or 'c:/temp/'
OUT_DIR = 'c:/temp/'

Change the code to use OUT_DIR:

    with open(OUT_DIR + 'zerotrade.feather', 'wb') as f:
        feather.write_feather(crsp, f)

Files and lines impacted:

abr.py:238:with open('abr.feather', 'wb') as f:
accounting_100.py:1637:with open('chars_a_60.pkl', 'wb') as f:
accounting_100.py:1640:with open('chars_q_60.pkl', 'wb') as f:
accounting_60_hxz.py:1231:with open('chars_a_60.feather', 'wb') as f:
accounting_60_hxz.py:1234:with open('chars_q_60.feather', 'wb') as f:
accounting_60.py:1214:with open('chars_a_60.feather', 'wb') as f:
accounting_60.py:1217:with open('chars_q_60.feather', 'wb') as f:
beta.py:181:with open('beta.feather', 'wb') as f:
bid_ask_spread.py:160:with open('baspread.feather', 'wb') as f:
feather_to_csv.py:5:# with open('chars60_raw_imputed.feather', 'rb') as f:
feather_to_csv.py:8:with open('chars60_rank_imputed.feather', 'rb') as f:
iclink.py:243:with open('iclink.feather', 'wb') as f:
ill.py:174:with open('ill.feather', 'wb') as f:
impute_rank_output_bchmk_60.py:11:with open('chars_q_raw.feather', 'rb') as f:
impute_rank_output_bchmk_60.py:19:with open('chars_a_raw.feather', 'rb') as f:
impute_rank_output_bchmk_60.py:96:with open('chars60_raw_no_impute.feather', 'wb') as f:
impute_rank_output_bchmk_60.py:118:with open('chars60_raw_imputed.feather', 'wb') as f:
impute_rank_output_bchmk_60.py:131:with open('chars60_rank_no_impute.feather', 'wb') as f:
impute_rank_output_bchmk_60.py:143:with open('chars60_rank_imputed.feather', 'wb') as f:
impute_rank_output_bchmk_60.py:150:# with open('/home/jianxinma/chars/data/sp1500_impute_benchmark.feather', 'rb') as f:
impute_rank_output_bchmk_60.py:160:# with open('sp1500_impute_60.feather', 'wb') as f:
impute_rank_output_bchmk_60.py:166:# with open('sp1500_rank_60.feather', 'wb') as f:
maxret_d.py:158:with open('maxret.feather', 'wb') as f:
merge_chars_60.py:9:with open('chars_a_60.feather', 'rb') as f:
merge_chars_60.py:17:with open('beta.feather', 'rb') as f:
merge_chars_60.py:27:with open('rvar_capm.feather', 'rb') as f:
merge_chars_60.py:37:with open('rvar_mean.feather', 'rb') as f:
merge_chars_60.py:47:with open('rvar_ff3.feather', 'rb') as f:
merge_chars_60.py:57:with open('sue.feather', 'rb') as f:
merge_chars_60.py:67:with open('myre.feather', 'rb') as f:
merge_chars_60.py:77:with open('abr.feather', 'rb') as f:
merge_chars_60.py:87:with open('baspread.feather', 'rb') as f:
merge_chars_60.py:97:with open('maxret.feather', 'rb') as f:
merge_chars_60.py:107:with open('std_dolvol.feather', 'rb') as f:
merge_chars_60.py:117:with open('ill.feather', 'rb') as f:
merge_chars_60.py:127:with open('std_turn.feather', 'rb') as f:
merge_chars_60.py:137:with open('zerotrade.feather', 'rb') as f:
merge_chars_60.py:148:with open('chars_a_raw.feather', 'wb') as f:
merge_chars_60.py:155:with open('chars_q_60.feather', 'rb') as f:
merge_chars_60.py:163:with open('beta.feather', 'rb') as f:
merge_chars_60.py:173:with open('rvar_capm.feather', 'rb') as f:
merge_chars_60.py:183:with open('rvar_mean.feather', 'rb') as f:
merge_chars_60.py:193:with open('rvar_ff3.feather', 'rb') as f:
merge_chars_60.py:203:with open('sue.feather', 'rb') as f:
merge_chars_60.py:213:with open('myre.feather', 'rb') as f:
merge_chars_60.py:223:with open('abr.feather', 'rb') as f:
merge_chars_60.py:233:with open('baspread.feather', 'rb') as f:
merge_chars_60.py:243:with open('maxret.feather', 'rb') as f:
merge_chars_60.py:253:with open('std_dolvol.feather', 'rb') as f:
merge_chars_60.py:263:with open('ill.feather', 'rb') as f:
merge_chars_60.py:273:with open('std_turn.feather', 'rb') as f:
merge_chars_60.py:283:with open('zerotrade.feather', 'rb') as f:
merge_chars_60.py:294:with open('chars_q_raw.feather', 'wb') as f:
myre.py:23:with open('iclink.feather', 'rb')as f:
myre.py:120:with open('myre.feather', 'wb') as f:
rvar_capm.py:185:with open('rvar_capm.feather', 'wb') as f:
rvar_ff3.py:218:with open('rvar_ff3.feather', 'wb') as f:
rvar_mean.py:167:with open('rvar_mean.feather', 'wb') as f:
std_dolvol.py:158:with open('std_dolvol.feather', 'wb') as f:
std_turn.py:158:with open('std_turn.feather', 'wb') as f:
sue.py:106:with open('sue.feather', 'wb') as f:
zerotrade.py:161:with open('zerotrade.feather', 'wb') as f:

Characteristic .py files open too many connections

A connection to WRDS is created in the main Python process and also in each pool process. With a large pool (e.g., 20 workers), PostgreSQL "too many database connections" errors can occur. Other redundant code is also executed unnecessarily.

The scripts already test for __name__ == '__main__'; this guard needs to be extended to the top and bottom sections of code that only need to execute in the main process.

A judicious conn.close() is also warranted, to free the connection(s) for other characteristic scripts running in parallel.
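
A minimal sketch of the guard, assuming the wrds package and a connection that only the main process should own (the query is hypothetical; the real scripts pull full CRSP data):

    import wrds

    def load_crsp(conn):
        # hypothetical query standing in for the scripts' real SQL
        return conn.raw_sql("select permno, date, ret from crsp.msf")

    if __name__ == '__main__':
        conn = wrds.Connection()      # created only in the main process
        try:
            crsp = load_crsp(conn)
        finally:
            conn.close()              # release the PostgreSQL connection promptly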

Use of mp.Pool() Needs Correction.

In many characteristic .py files there is an attempt to split the dataframe and process it according to a specific CPU configuration.

For example zerotrade.py line 153:

if __name__ == '__main__':
    crsp = main(0, 1, 0.05)

This leads to zerotrade.py line 137:

    pool = mp.Pool()

However, the Python documentation describes the processes argument of Pool() as the number of worker processes to use; if processes is None, the number returned by os.cpu_count() is used.

This is inefficient both when debugging (with limited cores and shorter SQL date ranges) and on large machines with many cores -- especially if the README.md advice to run the characteristic files in parallel is followed.

I see a 107,616 K working set and 703,800 K private bytes for each of my 20 cores, even if I change the main function to select 1 core.

Suggested changes to all impacted .py files: add a constant at the top, call Pool() with an explicit value, and change the main() call to show the intent more clearly.

# Number of CPU cores to use. Usually use 20 or more. 1 for debugging.
CPU_CORE_COUNT = 10
# ...
    pool = mp.Pool(CPU_CORE_COUNT)
# ...
    crsp = main(0, 1, 1/CPU_CORE_COUNT)

Files to be changed:
beta.py
bid_ask_spread.py
ill.py
maxret_d.py
rvar_capm.py
rvar_ff3.py
rvar_mean.py
std_dolvol.py
std_turn.py
zerotrade.py

Need a utility to show characteristic .feather stats

I propose a utility that reports any missing characteristic .feather files before merge_chars is run.

In addition, a .feather file can report its own row, column, and byte counts.

The file would be named char_file_stats.py. The output would be:

WARNING: File ./abr.feather does not exist
WARNING: File ./sue.feather does not exist
file: baspread.feather rows: 4,547,622, cols: 3, bytes: 90,952,440
file: beta.feather rows: 4,631,954, cols: 3, bytes: 92,639,080
file: chars_a_60.feather rows: 2,969,208, cols: 90, bytes: 2,196,523,973
file: chars_q_60.feather rows: 2,697,915, cols: 62, bytes: 1,359,411,950
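
A minimal sketch of such a utility, assuming pyarrow is installed and that the expected file list mirrors what merge_chars reads (the list below is illustrative):

    # char_file_stats.py -- report missing .feather files and basic stats
    import os
    import pyarrow.feather as feather

    EXPECTED = ['abr.feather', 'sue.feather', 'baspread.feather', 'beta.feather',
                'chars_a_60.feather', 'chars_q_60.feather']

    for name in EXPECTED:
        path = './' + name
        if not os.path.exists(path):
            print(f'WARNING: File {path} does not exist')
            continue
        table = feather.read_table(path)
        print(f'file: {name} rows: {table.num_rows:,}, '
              f'cols: {table.num_columns:,}, bytes: {table.nbytes:,}')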

Many characteristics runs generate Pandas FutureWarnings

Many characteristic runs generate pandas FutureWarnings. As an example, let's look at beta.py.

Line 195 is
crsp['month_count'] = crsp.groupby(['permno'])['month_count'].fillna(method='bfill')
and warns as
FutureWarning: Series.fillna with 'method' is deprecated and will raise in a future version. Use obj.ffill() or obj.bfill() instead.
See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html and https://www.geeksforgeeks.org/python-pandas-dataframe-bfill/ for more information.
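
A hedged one-line replacement for that call, using the GroupBy bfill() method the warning recommends:

    crsp['month_count'] = crsp.groupby('permno')['month_count'].bfill()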

When fillna(method=...) is removed from pandas, the impact on EquityCharacteristics will be significant. A grep of fillna gives:

abr.py:48:ccm['linkenddt'] = ccm['linkenddt'].fillna(pd.to_datetime('today'))
accounting_100.py:160:crsp['me'] = np.where(crsp['permno'] == crsp['permno'].shift(1), crsp['me'].fillna(method='ffill'), crsp['me'])
accounting_100.py:198:ccm['linkenddt'] = ccm['linkenddt'].fillna(pd.to_datetime('today'))
accounting_100.py:248:data_rawa['txditc'] = data_rawa['txditc'].fillna(0)
accounting_100.py:373:data_rawa['noa'] = ((data_rawa['at']-data_rawa['che']-data_rawa['ivao'].fillna(0))-
accounting_100.py:374: (data_rawa['at']-data_rawa['dlc'].fillna(0)-data_rawa['dltt'].fillna(0)-data_rawa['mib'].fillna(0)
accounting_100.py:375: -data_rawa['pstk'].fillna(0)-data_rawa['ceq'])/data_rawa['at_l1'])
accounting_100.py:582:data_rawa['ffi49'] = data_rawa['ffi49'].fillna(49)
accounting_100.py:1326:crsp_mom['dlret'] = crsp_mom['dlret'].fillna(0)
accounting_100.py:1327:crsp_mom['ret'] = crsp_mom['ret'].fillna(0)
accounting_100.py:1448:data_rawa['datadate'] = data_rawa.groupby(['permno'])['datadate'].fillna(method='ffill')
accounting_100.py:1449:data_rawa = data_rawa.groupby(['permno', 'datadate'], as_index=False).fillna(method='ffill')
accounting_100.py:1456:data_rawq['datadate'] = data_rawq.groupby(['permno'])['datadate'].fillna(method='ffill')
accounting_100.py:1457:data_rawq = data_rawq.groupby(['permno', 'datadate'], as_index=False).fillna(method='ffill')
accounting_60_hxz.py:158:crsp['me'] = np.where(crsp['permno'] == crsp['permno'].shift(1), crsp['me'].fillna(method='ffill'), crsp['me'])
accounting_60_hxz.py:196:ccm['linkenddt'] = ccm['linkenddt'].fillna(pd.to_datetime('today'))
accounting_60_hxz.py:247:data_rawa['txditc'] = data_rawa['txditc'].fillna(0)
accounting_60_hxz.py:363:data_rawa['noa'] = ((data_rawa['at']-data_rawa['che']-data_rawa['ivao'].fillna(0))-
accounting_60_hxz.py:364: (data_rawa['at']-data_rawa['dlc'].fillna(0)-data_rawa['dltt'].fillna(0)-data_rawa['mib'].fillna(0)
accounting_60_hxz.py:365: -data_rawa['pstk'].fillna(0)-data_rawa['ceq'])/data_rawa['at_l1'])
accounting_60_hxz.py:574:data_rawa['ffi49'] = data_rawa['ffi49'].fillna(49)
accounting_60_hxz.py:1038:crsp_mom['dlret'] = crsp_mom['dlret'].fillna(0)
accounting_60_hxz.py:1039:crsp_mom['ret'] = crsp_mom['ret'].fillna(0)
accounting_60_hxz.py:1106:data_rawa['datadate'] = data_rawa.groupby(['permno'])['datadate'].fillna(method='ffill')
accounting_60_hxz.py:1108:data_rawa = data_rawa.groupby(['permno1', 'datadate1'], as_index=False).fillna(method='ffill')
accounting_60_hxz.py:1115:data_rawq['datadate'] = data_rawq.groupby(['permno'])['datadate'].fillna(method='ffill')
accounting_60_hxz.py:1117:data_rawq = data_rawq.groupby(['permno1', 'datadate1'], as_index=False).fillna(method='ffill')
accounting_60.py:158:crsp['me'] = np.where(crsp['permno'] == crsp['permno'].shift(1), crsp['me'].fillna(method='ffill'), crsp['me'])
accounting_60.py:196:ccm['linkenddt'] = ccm['linkenddt'].fillna(pd.to_datetime('today'))
accounting_60.py:247:data_rawa['txditc'] = data_rawa['txditc'].fillna(0)
accounting_60.py:363:data_rawa['noa'] = ((data_rawa['at']-data_rawa['che']-data_rawa['ivao'].fillna(0))-
accounting_60.py:364: (data_rawa['at']-data_rawa['dlc'].fillna(0)-data_rawa['dltt'].fillna(0)-data_rawa['mib'].fillna(0)
accounting_60.py:365: -data_rawa['pstk'].fillna(0)-data_rawa['ceq'])/data_rawa['at_l1'])
accounting_60.py:574:data_rawa['ffi49'] = data_rawa['ffi49'].fillna(49)
accounting_60.py:1021:crsp_mom['dlret'] = crsp_mom['dlret'].fillna(0)
accounting_60.py:1022:crsp_mom['ret'] = crsp_mom['ret'].fillna(0)
accounting_60.py:1089:data_rawa['datadate'] = data_rawa.groupby(['permno'])['datadate'].fillna(method='ffill')
accounting_60.py:1091:data_rawa = data_rawa.groupby(['permno1', 'datadate1'], as_index=False).fillna(method='ffill')
accounting_60.py:1098:data_rawq['datadate'] = data_rawq.groupby(['permno'])['datadate'].fillna(method='ffill')
accounting_60.py:1100:data_rawq = data_rawq.groupby(['permno1', 'datadate1'], as_index=False).fillna(method='ffill')
beta.py:56:crsp['dlret'] = crsp['dlret'].fillna(0)
beta.py:57:crsp['ret'] = crsp['ret'].fillna(0)
beta.py:80:crsp['month_count'] = crsp.groupby(['permno'])['month_count'].fillna(method='bfill')
bid_ask_spread.py:63:crsp['month_count'] = crsp.groupby(['permno'])['month_count'].fillna(method='bfill')
functions.py:708:def fillna_atq(df_q, df_a):
functions.py:733:def fillna_ind(df, method, ffi):
functions.py:765: df['%s' % na_column] = df['%s' % na_column].fillna(df['%s_mean' % na_column])
functions.py:768: df['%s' % na_column] = df['%s' % na_column].fillna(df['%s_median' % na_column])
functions.py:775:def fillna_all(df, method):
functions.py:805: df['%s' % na_column] = df['%s' % na_column].fillna(df['%s_mean' % na_column])
functions.py:808: df['%s' % na_column] = df['%s' % na_column].fillna(df['%s_median' % na_column])
functions.py:832: df = df.fillna(0)
ill.py:54:crsp['dlret'] = crsp['dlret'].fillna(0)
ill.py:55:crsp['ret'] = crsp['ret'].fillna(0)
ill.py:77:crsp['month_count'] = crsp.groupby(['permno'])['month_count'].fillna(method='bfill')
impute_rank_output_bchmk_60.py:105:df_impute['ffi49'] = df_impute['ffi49'].fillna(49) # we treat na in ffi49 as 'other'
impute_rank_output_bchmk_60.py:109:df_impute = fillna_ind(df_impute, method='median', ffi=49)
impute_rank_output_bchmk_60.py:111:df_impute = fillna_all(df_impute, method='median')
impute_rank_output_bchmk_60.py:112:df_impute['re'] = df_impute['re'].fillna(0) # re use IBES database, there are lots of missing data
maxret_d.py:61:crsp['month_count'] = crsp.groupby(['permno'])['month_count'].fillna(method='bfill')
rvar_capm.py:56:crsp['dlret'] = crsp['dlret'].fillna(0)
rvar_capm.py:57:crsp['ret'] = crsp['ret'].fillna(0)
rvar_capm.py:80:crsp['month_count'] = crsp.groupby(['permno'])['month_count'].fillna(method='bfill')
rvar_ff3.py:56:crsp['dlret'] = crsp['dlret'].fillna(0)
rvar_ff3.py:57:crsp['ret'] = crsp['ret'].fillna(0)
rvar_ff3.py:80:crsp['month_count'] = crsp.groupby(['permno'])['month_count'].fillna(method='bfill')
rvar_mean.py:47:crsp['dlret'] = crsp['dlret'].fillna(0)
rvar_mean.py:48:crsp['ret'] = crsp['ret'].fillna(0)
rvar_mean.py:71:crsp['month_count'] = crsp.groupby(['permno'])['month_count'].fillna(method='bfill')
std_dolvol.py:61:crsp['month_count'] = crsp.groupby(['permno'])['month_count'].fillna(method='bfill')
std_turn.py:61:crsp['month_count'] = crsp.groupby(['permno'])['month_count'].fillna(method='bfill')
sue.py:47:ccm['linkenddt'] = ccm['linkenddt'].fillna(pd.to_datetime('today'))
zerotrade.py:61:crsp['month_count'] = crsp.groupby(['permno'])['month_count'].fillna(method='bfill')

Possible little bug & typo in accounting.py

Line 118, filling nan with 0

comp['xsga0'] = np.where(comp['xsga'].isnull, 0, 0)

might be

comp['xsga0'] = np.where(comp['xsga'].isnull(), 0, comp['xsga'])

Line 166-175, dealing with multiple "permno" under the same "permco"

crsp1 = pd.merge(crsp, crsp_maxme, how='inner', on=['monthend', 'permco', 'me'])

Because there can be a few different "permno" values with the same "me", the merge returns a slightly larger DataFrame than expected. Though this has little influence on the empirical results, we should be cautious when merging on numeric columns (they are probably not unique).

I think the following procedure may be better:

crsp_summe = crsp.groupby(['monthend', 'permco'])['me'].sum().reset_index()
crsp1 = crsp.sort_values(by=['permco', 'monthend', 'me'], ascending=[True, True, False]).drop_duplicates(['monthend', 'permco'])
crsp1 = crsp1.drop(['me'], axis=1)
crsp2 = pd.merge(crsp1, crsp_summe, how='left', on=['monthend', 'permco'])

Line 231-235, dealing with the duplicates: we should drop the "temp" column or generate distinct temp columns, like

data_rawa.loc[data_rawa.groupby(['datadate', 'permno', 'linkprim'], as_index=False).nth([0]).index, 'temp1'] = 1
data_rawa = data_rawa[data_rawa['temp1'].notna()]
data_rawa.loc[data_rawa.groupby(['permno', 'yearend', 'datadate'], as_index=False).nth([-1]).index, 'temp2'] = 1
data_rawa = data_rawa[data_rawa['temp2'].notna()]

Otherwise, the last two lines will filter nothing out.

README.md Update for char_file_stats.py

Add a line to README.md in the How to use section, after item #2, reading:

  3. run char_file_stats.py and check for missing .feather files. (Run anytime to see .feather stats for rows, cols, bytes)

iclink.py fails to run with current version of Pandas

Near the end of the iclink.py file is this line:
iclink = _link1_2.append(_link2_3)

The DataFrame append() method was deprecated in pandas 1.4 and removed in pandas 2.0. Use the pandas pd.concat() function instead.

A correction:
iclink = pd.concat([_link1_2, _link2_3])

Note that the single parameter to concat() is a Python list of the objects to combine.

Question regarding the README: Clarification on which files need to be run

I suggest a small change to the README so that it is clearer and easier to understand.

From reading the README, it is unclear which folders the executable files are located in.

I would therefore suggest adding the relative file path to the README.

Old README Section

  1. run accounting_60_hxz.py
  2. run all the single characteristic files (you can run them in parallel)
  3. run merge_chars.py
  4. run impute_rank_output_bckmk.py (you may want to comment the part of sp1500 in this file if you just need the all stocks version)

New README Section

  1. run char60/accounting_60_hxz.py by running python char60/accounting_60_hxz.py.
  2. run all the single characteristic files (you can run them in parallel) by running all files in the char60 folder.
  3. run pychars/merge_chars.py.
  4. run pychars/impute_rank_output_bchmk.py (you may want to comment out the sp1500 part of this file if you just need the all-stocks version).

Next Steps

Should I make a PR with the above changes?

ValueError with misaligned shapes in pychars/beta.py

When running python pychars/beta.py, I get the following error:

/EquityCharacteristics/pychars/beta.py", line 59, in <module>
    crsp_temp = crsp.groupby('permno').rolling(rolling_window).apply(get_beta, raw=False)

[...]

/EquityCharacteristics/pychars/beta.py", line 54, in get_beta
    beta = (X.T.dot(M).dot(X)).I.dot((X.T.dot(M).dot(Y)))
ValueError: shapes (1,600) and (60,60) not aligned: 600 (dim 1) != 60 (dim 0)

Is this a known issue, and can it be mitigated? Maybe I'm doing something wrong. Please let me know if you consider this a bug; I can try to find a solution and make a PR.
