miha42-github / company_dns

An open source microservice that provides company data from EDGAR and Wikipedia, plus SIC lookup.

Home Page: https://miha42-github.github.io/company_dns/

License: Apache License 2.0

Python 77.49% Dockerfile 0.68% HTML 5.17% Shell 3.52% JavaScript 7.03% CSS 6.10%

edgar mediumroast public-companies sec sic sqlite-database starlette

company_dns's People

christensenjoe snacey

company_dns's Issues

Create /V1.1/company/summary/<string:query> endpoint

Create a simpler summary endpoint that enables users to more quickly get data back.

Created the summary logic to enable this capability
Mounted in the endpoint at /V1.1/edgar/companies/summary
Reformatted the previous endpoint to /V1.1/edgar/companies/detail
Updated /V1.1/help accordingly.

Add logo url

The logo filename can be found in the Wikipedia data. If you know the Wikipedia URL, you can derive the full URL to the logo, if one exists.

Example:
https://en.wikipedia.org/wiki/Alphabet_Inc.#/media/File:Alphabet_Inc_Logo_2015.svg

Ideally, the logo name would be Alphabet_Inc_Logo_2015.svg and the URL would be https://en.wikipedia.org/wiki/Alphabet_Inc.
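
A minimal sketch of the derivation, assuming the logo filename and the page URL are already known. The function name build_logo_url is hypothetical, and the "#/media/File:" join simply mirrors the example URL above.

```python
def build_logo_url(wikipedia_url: str, logo_filename: str) -> str:
    """Join a Wikipedia page URL and a logo filename into a media-viewer URL."""
    # Assumption: the page URL carries no fragment of its own and the filename
    # already uses underscores instead of spaces, as in the example above.
    return f"{wikipedia_url}#/media/File:{logo_filename}"


logo_url = build_logo_url(
    "https://en.wikipedia.org/wiki/Alphabet_Inc.",
    "Alphabet_Inc_Logo_2015.svg",
)
print(logo_url)
```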

Data Lineage endpoint and features

New Feature Proposal: Data Lineage

Given that facts aren't always reliable and data sourcing can be suspect, it is important to show where data originated. The intention of this feature set is therefore to create a digital map of the data source(s), both in the general context of the entire cached data set and in the specific context of each endpoint.

What is Data Lineage?

According to Wikipedia, "Data lineage includes the data origin, what happens to it, and where it moves over time. Data lineage gives visibility while greatly simplifying the ability to trace errors back to the root cause in a data analytics process."

General Case: `/lineage` endpoint

For data that is stored in the cache and for endpoints where data is gathered dynamically, an endpoint for lineage is needed. The report should be digital, ideally in JSON format, so that users can programmatically trace the source and understand the processes. All steps should likewise be described in a JSON structure, and important included libraries can be referenced as the method used to capture the data. Additionally, the report should state when the last local update was run to create the cache and what was used, and it should call out both static and dynamic sources. Additional details will be provided in this issue, in the sections below, as the feature is designed.

Ideas

The report should be digital, ideally in JSON format, so that users can programmatically trace the source and understand the processes.
All steps should likewise be described in a JSON structure
Important included libraries can be referenced as the method to capture the data
When the last local update was run for the cache creation and what was used should be included
Both static and dynamic sources should be called out
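
To make the ideas above concrete, here is one possible shape for the report, offered as a sketch only. Every field name, the static/dynamic labels, and the step descriptions are assumptions made for illustration; nothing here reflects an implemented /lineage response.

```python
import json
from datetime import datetime, timezone


def build_lineage_report(cache_updated_at: datetime) -> dict:
    """Assemble a hypothetical lineage report as a plain dict."""
    return {
        # When the local cache was last built and what built it (placeholder).
        "cache": {
            "lastUpdated": cache_updated_at.isoformat(),
            "builtWith": ["<library or script name goes here>"],
        },
        # Static sources are cached locally; dynamic ones are queried live.
        # Which source falls in which bucket is an assumption of this sketch.
        "sources": [
            {
                "name": "EDGAR",
                "kind": "static",
                "steps": ["download", "parse", "load into cache"],
            },
            {
                "name": "Wikipedia",
                "kind": "dynamic",
                "steps": ["query at request time", "parse infobox"],
            },
        ],
    }


report = build_lineage_report(datetime(2023, 1, 1, tzinfo=timezone.utc))
print(json.dumps(report, indent=2))
```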

Specific case lineage for each query endpoint

When a query is run, each endpoint should report the data source(s) including those inside the system via the cache. Ideally, these sources should be linkable, if they are digital, so those interested can follow the trail. The URLs, in particular for Wikipedia and EDGAR, should be disclosed with each query result as an additional JSON field. If locally cached data is used in the result, that should be referred to as well. Note that there can be references to the general lineage endpoint.

Ideas

The URLs, in particular for Wikipedia and EDGAR, should be disclosed with each query result as an additional JSON field
If locally cached data is used in the result, then that should be referred to as well. Note that there can be references to the general lineage endpoint.
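
A sketch of how a query result might carry its own lineage. The helper attach_lineage and the field names lineage, sourceUrls, usedLocalCache and generalLineage are hypothetical.

```python
def attach_lineage(result: dict, source_urls: list, used_cache: bool) -> dict:
    """Return a copy of a query result with a hypothetical 'lineage' field."""
    enriched = dict(result)  # shallow copy so the caller's dict is untouched
    enriched["lineage"] = {
        "sourceUrls": list(source_urls),   # e.g. Wikipedia and EDGAR pages
        "usedLocalCache": used_cache,      # True when cached data contributed
        "generalLineage": "/lineage",      # pointer to the general report
    }
    return enriched


original = {"name": "Example Corp"}
enriched = attach_lineage(
    original,
    ["https://en.wikipedia.org/wiki/Alphabet_Inc."],
    used_cache=True,
)
print(enriched)
```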

Malformed ISIN

Introduction

The ISIN isn't being correctly parsed and output for at least the following companies:

SAP - malformed ISIN example: "isin": "{{ISIN|sl|=|n|pl|=|y|DE0007164600}}"
Apple Inc - malformed ISIN example: "isin": "{{ISIN|sl|=|n|pl|=|y|US0378331005}}"
Schlumberger - malformed ISIN example: "isin": "{{ISIN|sl|=|n|pl|=|y|AN8068571086}}"

Action

Fix the parsing of the ISIN by adding additional logic/regex to account for this format.
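
One possible regex for this, as a sketch rather than the fix that was applied. It relies on the standard ISIN shape (two letters, nine alphanumerics, one check digit) and does not validate the check digit; the name extract_isin is hypothetical.

```python
import re
from typing import Optional

# Two-letter prefix, nine alphanumerics, one trailing check digit.
ISIN_PATTERN = re.compile(r"\b[A-Z]{2}[A-Z0-9]{9}[0-9]\b")


def extract_isin(raw: Optional[str]) -> Optional[str]:
    """Pull a bare ISIN out of raw wikitext such as '{{ISIN|...|DE0007164600}}'."""
    if not raw:
        return None
    match = ISIN_PATTERN.search(raw)
    return match.group(0) if match else None


for sample in (
    "{{ISIN|sl|=|n|pl|=|y|DE0007164600}}",
    "{{ISIN|sl|=|n|pl|=|y|US0378331005}}",
    "{{ISIN|sl|=|n|pl|=|y|AN8068571086}}",
):
    print(extract_isin(sample))
```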

Create /V1.1/company/detail/<string:query> endpoint

Provide an endpoint which enables a user to specify either the int:CIK or string:query to get information about the company. This option should avoid talking to GeoPy. Research should be done to check whether a CIK alone can be used to gather the information. If so, then that is likely the simplest approach. Specifically, the full set of filings returned by EDGAR might contain all of the information required to generate the company detail.

Equivalent function needs to be created in the dbshell
Version the APIs appropriately
Update help with the appropriate information
Create an additional endpoint to just return the CIKs
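
A small sketch of how an endpoint could tell a CIK from a free-text query. The function classify_lookup is hypothetical, and zero-padding the CIK to ten digits is an assumption about what downstream lookups expect.

```python
def classify_lookup(value: str) -> tuple:
    """Classify endpoint input as ('cik', padded_cik) or ('query', text)."""
    cleaned = value.strip()
    # An all-digit ASCII value is treated as a CIK; anything else is a name query.
    if cleaned.isascii() and cleaned.isdigit():
        return ("cik", cleaned.zfill(10))
    return ("query", cleaned)


print(classify_lookup("320193"))
print(classify_lookup("Apple Inc"))
```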

Augment embedded help with active links

Create active links for each endpoint so users can actually see the results from an API call.

To do this it will be necessary to use the external IP and port definitions. This means we will likely need to create a configuration file that contains the external server name.
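
As a sketch of that idea, the help page could build its links from the configured external host and port. The function build_help_links, its parameters, and the example host name are all hypothetical.

```python
def build_help_links(external_host: str, external_port: int,
                     endpoints: list, scheme: str = "http") -> dict:
    """Map each endpoint path to a clickable absolute URL."""
    base = f"{scheme}://{external_host}:{external_port}"
    return {path: f"{base}{path}" for path in endpoints}


# The host and port would come from the proposed configuration file.
links = build_help_links("dns.example.com", 8000, ["/V1.1/help"])
print(links)
```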

SEC forms are thin

Introduction

For at least the following companies a small number of SEC forms is reported back.

Apple Inc
IBM
Schlumberger

Potential cause

A change was introduced to detect multiple CIKs and return the values from only one CIK. This change could result in only the first match being returned. Specifically, the SQL change to include the DISTINCT keyword could be the culprit. This should be quickly investigated and resolved.

Action

Test the removal of the DISTINCT clause from the SQL statements in edgar.py: This did not resolve the matter
Fix how the names of companies are used as keys in the reporting dict(): This resolved the problem

Simple automation

Produce some simple automation which will enable users to quickly start, stop, initialize, etc.

Initialization functions either in Docker or standalone
Startup functions either standalone or in systemd
Stop functions either standalone or in systemd
Cleanup script to remove docker images, cache DB, etc. and start again

Update SQL and associated method in apputils.py to be more efficient and flexible

The SQL can be more efficient by selecting only the requested form type, which should both minimize the selection output and improve flexibility. The current naive approach filters with Python's regex engine after the query, which is likely inefficient. Consider the following points when working this issue:

Enable various form type checking in all entry points: RESTful and command shell.
Ensure that the SQL is as secure as possible
Consider if some amount of the derivable data (filing URLs, corporate info, SIC, address, etc.) can be put into the DB cache in general to further simplify client coding so that call-outs to gather additional data are not required
Mark in the client code which portions could be moved into the DB cache
When appropriate update decontrol.py and the clients with the required changes
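
A sketch of pushing the form-type filter into SQL with bound parameters, which also addresses the security point. The table and column names here are hypothetical and do not reflect the actual cache schema.

```python
import sqlite3


def select_forms(conn: sqlite3.Connection, company_like: str, form_type: str) -> list:
    """Select only rows of one form type, using bound parameters throughout."""
    # Placeholders keep user input out of the SQL text entirely.
    sql = "SELECT cik, name, form FROM companies WHERE name LIKE ? AND form = ?"
    return conn.execute(sql, (f"%{company_like}%", form_type)).fetchall()


# In-memory database with made-up rows, purely for demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE companies (cik INTEGER, name TEXT, form TEXT)")
conn.executemany(
    "INSERT INTO companies VALUES (?, ?, ?)",
    [(1, "EXAMPLE CORP", "10-K"), (1, "EXAMPLE CORP", "8-K"), (2, "SAMPLE INC", "10-K")],
)
rows = select_forms(conn, "EXAMPLE", "10-K")
print(rows)
```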

Rewrite the README.md file

The README.md file doesn't reflect the current state of the service and needs to be improved. The major focus will be on the installation and startup process. Additionally, the overall motivation section, status, etc. need to be updated to reflect the change in strategy. We will reuse as much as possible and add appropriate screenshots to enrich the documentation. We should also consider adding links to the documentation so users can see an example in action.

Wikipedia's stock exchange name and the official stock exchange name differ

Wikipedia's stock exchange name and the official stock exchange name differ, and this matters especially when creating a link to Google Finance. Essentially, Google expects the official name; when it is missing or slightly off, Google seems to recommend the right company anyway. More investigation is needed, but the current behavior is reasonable.
Thus far the encountered problems are:

Aramco - Wikipedia: Saudi Stock Exchange; Google expects: TADAWUL
Hitachi - Wikipedia: NAG; Google expects: TYO
Fujitsu - Wikipedia: NAG; Google expects: TYO
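
One way to handle the cases listed above is a small lookup table, sketched here. The table holds only the mismatches recorded in this issue, and both names (WIKIPEDIA_TO_GOOGLE_EXCHANGE, to_google_exchange) are hypothetical.

```python
# Only the mismatches recorded in this issue; anything else passes through.
WIKIPEDIA_TO_GOOGLE_EXCHANGE = {
    "SAUDI STOCK EXCHANGE": "TADAWUL",
    # Observed for Hitachi and Fujitsu; may not hold for every NAG listing.
    "NAG": "TYO",
}


def to_google_exchange(wikipedia_exchange: str) -> str:
    """Translate a Wikipedia exchange label to the code Google expects."""
    key = wikipedia_exchange.strip().upper()
    return WIKIPEDIA_TO_GOOGLE_EXCHANGE.get(key, key)


print(to_google_exchange("Saudi Stock Exchange"))
print(to_google_exchange("NAG"))
print(to_google_exchange("nyse"))
```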

Implement TLS

Introduction

To protect privacy, TLS needs to be implemented for company_dns when it runs behind nginx. This has already been accomplished, making the remaining work largely about documentation.

Actions

@TeraBlitz will provide a how-to that records the steps taken to put company_dns behind nginx with TLS enabled.

Example implementation

The company_dns can be found operating behind nginx with TLS enabled at https://www.mediumroast.io/company_dns/help.

Searching for Oracle via wikipedia errors out

Error message:

  File "/home/mediumroast/company_dns/company_dns/app/main.py", line 143, in get
    wiki_data = self.f.get_firmographics_wikipedia()
  File "/home/mediumroast/company_dns/company_dns/app/lib/firmographics.py", line 89, in get_firmographics_wikipedia
    return my_query.get_firmographics()
  File "/home/mediumroast/company_dns/company_dns/app/lib/wikipedia.py", line 165, in get_firmographics
    if 'type' in company_info and company_info['type'] != None:
TypeError: argument of type 'NoneType' is not iterable
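
The traceback shows company_info being None when the membership test runs. A guarded lookup along the following lines would avoid the TypeError; this is a sketch, not the fix that was committed, and both the helper name get_company_type and the "Unknown" default are assumptions.

```python
from typing import Optional


def get_company_type(company_info: Optional[dict]) -> str:
    """Return the 'type' field, tolerating a missing or empty company_info."""
    # Check for None (or an empty dict) before any membership test.
    if not company_info:
        return "Unknown"
    company_type = company_info.get("type")
    return company_type if company_type is not None else "Unknown"


print(get_company_type(None))
print(get_company_type({"type": "Public"}))
```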

Add ticker data from Wikipedia

Accessing the ticker data from Wikipedia is problematic. We've researched a Python regex (see below) which should capture the ticker data for many, if not most, companies. Work remains to implement and test it, and then to look for alternative ways to capture the rest.

import re
# fomoco holds Ford's ticker wikitext (see "Ford Motors" in the examples below)
pattern = r'\{\{.+?\|.+?\}\}'
re.findall(pattern, fomoco)[-1]

The above might be too greedy, but we can start there.

Some examples to work with are below:

Hitachi: 
{{plainlist|
*|TYO|6501|
*|NAG|6501|
*[[Nikkei 225]] component (TYO)
*[[TOPIX]] Core30 component (TYO)}} {{TYO|6501}} * {{NAG|6501}} *[[Nikkei 225]] component (TYO)
*[[TOPIX]] Core30 component (TYO)
 
Aramco:
{{Saudi Stock Exchange|2222}}
 
IBM:
{{ubl|NYSE|IBM|[[DJIA]] component|[[S&P 100]] component|[[S&P 500]] component}} {{NYSE|IBM}}
 
Fujitsu:
{{Unbulleted list|tyo|6702|NAG|6702|[[Nikkei 225]] component (TYO)|[[TOPIX]] Large70 component (TYO)}} {{tyo|6702}} {{NAG|6702}}
 
Ford Motors:
{{unbulleted list|nyse|F|[[S&P 100|S&P 100 Component]]|[[S&P 500|S&P 500 Component]]}} {{nyse|F}}
 
HSBC:
{{plainlist|
*|LSE|HSBA|
*|SEHK|5|
*|NYSE|HSBC|
*|bsx|id|=|1077223879|HSBC.BH|
*[[FTSE 100 Index|FTSE 100]] component (HSBA)
*[[Hang Seng Index|Hang Seng]] component (5)}} {{LSE|HSBA}} * {{SEHK|5}} * {{NYSE|HSBC}} * {{bsx|id|=|1077223879|HSBC.BH}} *[[FTSE 100 Index|FTSE 100]] component (HSBA)
*[[Hang Seng Index|Hang Seng]] component (5)
 
SAP:
{{FWB|SAP}} <br />[[DAX|DAX Component]]
 
Tesla Inc:
{{Unbulleted list
   | |NASDAQ|TSLA|
   | [[Nasdaq-100]] component
   | [[S&P 100]] component
   | [[S&P 500]] component}} {{NASDAQ|TSLA}}
 
Sony:
{{plainlist|
* |Tyo|6758|
* |Nyse|SONY|
* [[Nikkei 225]] component (6758)
* [[TOPIX]] Core30 component (6758)}} {{Tyo|6758}} * {{Nyse|SONY}} * [[Nikkei 225]] component (6758)
* [[TOPIX]] Core30 component (6758)
 
Teradata:
{{Unbulleted list|nyse|TDC|[[S&P 400]] component}} {{nyse|TDC}}

An idea to consider is to create some if/then/else logic to look at plainlist, unbulleted list, nothing, etc. This would let different regexes act on the strings.
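
As an alternative worth testing first, the sketch below skips the dispatch entirely. It relies on the observation that every example above ends with plain {{EXCHANGE|TICKER}} templates, and it matches only those. Templates with extra fields, such as the bsx entry for HSBC, are ignored, so wrapper-specific regexes would still be needed where the plain templates are absent. The names SIMPLE_TICKER and extract_tickers are hypothetical.

```python
import re

# Matches a two-field {{EXCHANGE|TICKER}} template and nothing with extra fields.
SIMPLE_TICKER = re.compile(
    r"\{\{\s*([A-Za-z][A-Za-z ]*?)\s*\|\s*([A-Za-z0-9.]+)\s*\}\}"
)


def extract_tickers(wikitext: str) -> list:
    """Return (exchange, ticker) pairs in order, with the exchange upper-cased."""
    return [(exchange.upper(), ticker)
            for exchange, ticker in SIMPLE_TICKER.findall(wikitext)]


ibm = "{{ubl|NYSE|IBM|[[DJIA]] component|[[S&P 100]] component}} {{NYSE|IBM}}"
fujitsu = ("{{Unbulleted list|tyo|6702|NAG|6702|[[Nikkei 225]] component (TYO)}} "
           "{{tyo|6702}} {{NAG|6702}}")
print(extract_tickers(ibm))
print(extract_tickers(fujitsu))
```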

Enhance SIC detail

Introduction

SIC search includes additional details that could be helpful for the consumer of the service. Therefore, additional information such as major group, industry group, etc. will be added to the service. Additionally, other SIC endpoints will be considered to help obtain more information about Standard Industrial Classification codes.

Tasks

Add major group code to EDGAR
Add major group description to EDGAR
Add industry group code to EDGAR
Add major group endpoint(s)
Consider adding industry group endpoint(s)
Add division endpoint(s)
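
For the group codes, a sketch based on the SIC numbering scheme, in which the first two digits of a four-digit code identify the major group and the first three the industry group. The function name sic_groups and the returned field names are hypothetical; descriptions and divisions would still need a lookup table.

```python
def sic_groups(sic_code: str) -> dict:
    """Derive major group and industry group codes from a four-digit SIC code."""
    code = sic_code.strip()
    if len(code) != 4 or not code.isdigit():
        raise ValueError(f"expected a four-digit SIC code, got {sic_code!r}")
    return {
        "sic": code,
        "majorGroup": code[:2],      # first two digits
        "industryGroup": code[:3],   # first three digits
    }


print(sic_groups("7372"))
```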

Add additional URLs to service

Google patents
Google news
Google finance or Bing finance
Google maps or Bing maps
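
A sketch of how such links could be generated from a company name. The URL patterns are assumptions that should be verified against each service before use, and the function and key names are hypothetical. Finance links are left out because they also need an exchange and ticker.

```python
from urllib.parse import quote_plus


def build_company_urls(company_name: str) -> dict:
    """Build search links for a company name (URL patterns are assumptions)."""
    query = quote_plus(company_name)  # spaces become '+', reserved chars escaped
    return {
        "googleNews": f"https://news.google.com/search?q={query}",
        "googlePatents": f"https://patents.google.com/?assignee={query}",
        "googleMaps": f"https://www.google.com/maps/search/?api=1&query={query}",
    }


urls = build_company_urls("Ford Motor Company")
print(urls)
```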

Change the dictionary structure in edgar.get_all_details()

Introduction

To make companies that share a CIK report back correctly, the name format was changed so that all names are uppercase without any punctuation. This is a weak implementation, as companies could change their names slightly and force the code to be reworked again. A better approach is needed.

Proposed approach

For EDGAR the durable identifier is the Central Index Key (CIK). This identifier should be used instead of the name as the name can change even for public companies. The present code for temporarily tracking a company is:

            # If we've seen this company before then add the form, otherwise include both firmographics and the initial form definition
            if tmp_companies.get(company_name) == None:
                tmp_companies[company_name] = company_info
                tmp_companies[company_name]['forms'] = {accession_key: form}
            else:
                tmp_companies[company_name]['forms'][accession_key] = form

The proposed change could look something like this:

            # If we've seen this company before then add the form, otherwise include both firmographics and the initial form definition
            if tmp_companies.get(cik_no) is None:
                tmp_companies[cik_no] = company_info
                tmp_companies[cik_no]['forms'] = {accession_key: form}
            else:
                tmp_companies[cik_no]['forms'][accession_key] = form

Since company_info is a dict() that also keeps the companyName attribute, the name is still tracked there. Because modules that use these data require a dict() keyed on companyName, a function to rekey by companyName is needed. This function would loop over all cik_no keys, replace them with companyName and return a new dict(). The exact details of this change are left to the time of implementation.
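
A minimal sketch of the rekeying function described above. The name rekey_by_company_name is hypothetical, and the sample data is made up.

```python
def rekey_by_company_name(companies_by_cik: dict) -> dict:
    """Return a new dict keyed on companyName instead of the CIK."""
    # If two CIKs ever share a companyName, the later entry wins; a real
    # implementation would need to decide how to handle that collision.
    return {info["companyName"]: info for info in companies_by_cik.values()}


by_cik = {
    "0000000001": {"companyName": "EXAMPLE CORP", "forms": {"a-1": "10-K"}},
    "0000000002": {"companyName": "SAMPLE INC", "forms": {}},
}
by_name = rekey_by_company_name(by_cik)
print(sorted(by_name))
```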

For select companies company type and location data aren't properly reported

McKinsey and Company:

Company type looks like Incorporated [[partnership, which calls for a fix to the parsing operation for this field.
Additionally, the city and other location data are None, meaning we've not correctly set them to Unknown.
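
Both symptoms could be handled by one cleanup helper, sketched below. The function name clean_wikitext_value is hypothetical, and treating every missing value as "Unknown" is an assumption drawn from the note above.

```python
import re
from typing import Optional

# [[target|label]] or [[target]]; the captured group is the text to keep.
WIKI_LINK = re.compile(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]")


def clean_wikitext_value(value: Optional[str]) -> str:
    """Strip wiki link markup from a field and default missing data to 'Unknown'."""
    if value is None or not value.strip():
        return "Unknown"
    text = WIKI_LINK.sub(r"\1", value)
    # Remove any unbalanced brackets left behind by truncated links.
    text = text.replace("[[", "").replace("]]", "")
    return text.strip()


print(clean_wikitext_value("Incorporated [[partnership"))
print(clean_wikitext_value(None))
```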

Collection of incorrect reporting outputs/data fields for companies

Introduction

For select companies, improvements are required to ensure that as much data as possible is reported back. This is a master issue that will be updated with companies showing odd reporting behaviors. Each of these should then turn into a fix for the relevant behavior.

Thus far the following companies do not report back correctly:

SAP - no lat long data, no google links, malformed isin
- malformed isin example: "isin": "{{ISIN|sl|=|n|pl|=|y|DE0007164600}}"
- Lat long data should be reported with only country and city data available
- Ticker data is available for SAP but no google finance reporting
- No google patents URL
- No google news URL
Apple Inc - SEC forms are thin and malformed isin
- malformed isin example: "isin": "{{ISIN|sl|=|n|pl|=|y|US0378331005}}"
- thin forms: forms date back only to 2018, which could be the result of choosing the first instance of a duplicated CIK. Is it possible to merge CIK data when multiples are reported?
IBM - SEC forms are thin
Schlumberger - SEC forms are thin and malformed isin
McKinsey - malformed company type, incorrect site and location data
- Company type looks like Incorporated [[partnership, which calls for a fix to the parsing operation for this field.
- Additionally, the city and other location data are None, meaning we've not correctly set them to Unknown.

Actions

Open issue for malformed isin
Open issue for SEC forms being thin
Open issue for SAP to report back at least google and lat, long data using country, city info
Open issue for better company type parsing
