
rumors-db's Introduction

Scripts for managing rumors db


Installation

Please install Node.js 18 first.

$ npm i

Configuration

For development, copy .env.sample to .env and make necessary changes.

Elasticsearch

Anatomy of a schema file

A schema file under the schema/ directory consists of the following (a minimal sketch follows the list):

  • VERSION -- see the next section for details.
  • The default export -- an object that represents the mapping of the index.
  • A zod schema export named <indexName>Schema, which can be used to generate Typescript definitions as well as serve as a validator.
  • A Typescript type export for the index in UpperCamelCase, created from the zod schema.
  • An examples export -- an array of example documents that can be inserted into the index and correctly type-check. We use the examples to:
    • Provide readable examples of what is actually stored in the ES index
    • Check that the index schema is as expected
    • Check that the Typescript definition is as expected
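For illustration, here is a hedged sketch of what such a schema file could look like (the fields shown are hypothetical, not the actual articles schema):

// schema/articles.ts -- a sketch, not the real file
import { z } from 'zod';

export const VERSION = '1.0.0';

// zod schema named <indexName>Schema; doubles as a validator
export const articlesSchema = z.object({
  text: z.string(),
  createdAt: z.string(), // assuming ISO date strings are stored
});

// Typescript definition in UpperCamelCase, created from zod
export type Articles = z.infer<typeof articlesSchema>;

// Example documents that can be inserted into the index and type-check
export const examples: Articles[] = [
  { text: 'Example rumor text', createdAt: '2020-01-01T00:00:00.000Z' },
];

// Default export: the mapping of the index
export default {
  properties: {
    text: { type: 'text', analyzer: 'cjk' },
    createdAt: { type: 'date' },
  },
};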

Index mapping versions

All mappings live in the schema/ directory, with schema/index.js being the entry point.

When loading schemas into the DB using npm run schema, the script appends _v${VERSION} (with dots replaced by underscores) to the name of each created index, then creates an alias from the plain index name to the versioned index, according to the VERSION constant in the respective schema file.

For example, given that VERSION in schema/articles.js is 1.0.0, running npm run schema would put the mappings in schema/articles.js into the index articles_v1_0_0 and create an alias from articles to articles_v1_0_0.
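In Elasticsearch terms, the effect is roughly the following (a sketch using the @elastic/elasticsearch client; the actual npm script may differ):

import { Client } from '@elastic/elasticsearch';
// Hypothetical import of the schema file described above
import articlesMapping, { VERSION } from './schema/articles';

const client = new Client({ node: 'http://localhost:62223' });

const indexName = `articles_v${VERSION.replace(/\./g, '_')}`; // articles_v1_0_0

// Create the versioned index with the mapping from the schema file...
await client.indices.create({ index: indexName, mappings: articlesMapping });

// ...then point the plain index name at it via an alias.
await client.indices.putAlias({ index: indexName, name: 'articles' });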

Running migrations

The index mappings in schema/ are always the latest version, so if you are starting a database with fresh data, there is no need for migrations.

However, if you are reading data written under a legacy version of the mappings, you may need migrations.

Migration scripts live under db/migrations and can be run as:

$ ./node_modules/.bin/babel-node db/migrations/<migration script name>
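What a migration script does depends on the mapping change; below is a minimal hedged sketch, assuming the @elastic/elasticsearch client and a made-up field rename:

import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:62223' });

// Hypothetical example: move the legacy field `references`
// into a new field `articleReferences` on every document.
await client.updateByQuery({
  index: 'articles',
  query: { exists: { field: 'references' } },
  script: {
    source: "ctx._source.articleReferences = ctx._source.remove('references')",
    lang: 'painless',
  },
});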

Prepare database for unit tests

See rumors-api

Backup production database and run on local machine

According to rumors-deploy, the production DB raw data should be available in rumors-deploy/volumes/db-production (staging data is in db-staging instead).

Tar up rumors-deploy/volumes/db-production, download it to your local machine, and extract it into the esdata directory at this project's root.
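For example, assuming SSH access to the production host (the host name my-server is hypothetical):

$ ssh my-server 'tar czf - rumors-deploy/volumes/db-production' > db-production.tar.gz
$ tar xzf db-production.tar.gz
$ mkdir -p esdata && mv rumors-deploy/volumes/db-production/* esdata/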

Then run:

$ docker-compose up

This spins up Elasticsearch on localhost:62223, with Kibana available at localhost:62224, using the data in esdata.

Updating schema for one index

After adding or removing fields in an index's schema file, you will need to reload the schema, because existing field mappings are not editable in Elasticsearch.

This can be done by:

  1. Manually bump the VERSION in the schema file
  2. Run npm run reload -- <index file name> (for instance, npm run reload -- replyrequests)

The script creates indices with the latest schema & package.json version postfix, performs a reindex, updates the alias, and removes the old index.
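Conceptually, the reload performs the following Elasticsearch operations (a hedged sketch with the @elastic/elasticsearch client; the old index name is hypothetical):

import { Client } from '@elastic/elasticsearch';
import newMapping, { VERSION } from './schema/replyrequests';

const client = new Client({ node: 'http://localhost:62223' });
const oldIndex = 'replyrequests_v1_0_0'; // hypothetical previous version
const newIndex = `replyrequests_v${VERSION.replace(/\./g, '_')}`;

// 1. Create the new index with the latest mapping
await client.indices.create({ index: newIndex, mappings: newMapping });

// 2. Copy all documents from the old index into the new one
await client.reindex({
  source: { index: oldIndex },
  dest: { index: newIndex },
  wait_for_completion: true,
});

// 3. Atomically repoint the alias at the new index
await client.indices.updateAliases({
  actions: [
    { remove: { index: oldIndex, alias: 'replyrequests' } },
    { add: { index: newIndex, alias: 'replyrequests' } },
  ],
});

// 4. Remove the old index
await client.indices.delete({ index: oldIndex });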

BigQuery

Please manually create the dataset and set up permissions on Google Cloud, then set the related environment variables in .env.

Run the following script to create BigQuery tables under the dataset specified in the environment variable:

./node_modules/.bin/babel-node db/setBqTables.ts --extensions .ts,.js
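For reference, table creation with @google-cloud/bigquery looks roughly like this sketch (the environment variable name and the field list are assumptions, not the actual setBqTables.ts contents):

import { BigQuery } from '@google-cloud/bigquery';

const bigquery = new BigQuery(); // picks up GOOGLE_APPLICATION_CREDENTIALS

const datasetId = process.env.BIGQUERY_DATASET; // env var name is an assumption
if (!datasetId) throw new Error('dataset env var is not set');

// Hypothetical table & fields, for illustration only
await bigquery.dataset(datasetId).createTable('articles', {
  schema: {
    fields: [
      { name: 'id', type: 'STRING' },
      { name: 'text', type: 'STRING' },
      { name: 'createdAt', type: 'TIMESTAMP' },
    ],
  },
});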

Other commands

These commands are invoked by the commands mentioned above. See package.json for details.

npm run clear

Deletes all indices.

npm run schema [-- indexName]

Creates indices with specified mappings.

By default it creates all indexes that exist in the schema/ directory, and errors if an index already exists.

You can create a single index by specifying indexName in the command.

npm run scan [-- indexName]

Scans through all existing documents in indexName to check whether they match the current zod schema.

If indexName is not given, all indexes in schema/ are scanned.
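Under the hood, a scan amounts to validating each stored document against the zod schema. A minimal sketch (a real scan would page through every document with the scroll or search_after API):

import { Client } from '@elastic/elasticsearch';
import { articlesSchema } from './schema/articles';

const client = new Client({ node: 'http://localhost:62223' });

const { hits } = await client.search({
  index: 'articles',
  size: 1000, // only the first batch; a real scan paginates further
  query: { match_all: {} },
});

for (const hit of hits.hits) {
  const result = articlesSchema.safeParse(hit._source);
  if (!result.success) {
    console.error(`Document ${hit._id} does not match schema:`, result.error.issues);
  }
}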

npm run seed

Inserts the examples in each schema into the database.


rumors-db's Issues

Open Dataset scripts & docs

Goal

From Johnson's 1122 goals:
Open up the database data so that statistics and AI work can get involved:
(a) Analyze the composition of messages, so that interested people can look for answers to questions like "what do people in Taiwan get confused about" and "how many messages come from abroad, and from where", or use social network graphical models to understand how accounts relate to one another.
(b) Automatically assist categorization by topic and domain.

Previous decisions & spec

http://beta.hackfoldr.org/cofacts/https%253A%252F%252Fhackmd.io%252Fs%252FSysG-Jxo- (the "preparing data" page)

Related discussion

https://g0v-tw.slackarchive.io/cofacts/page-9/ts-1497944951784497 (Web QA)
https://g0v-tw.slackarchive.io/cofacts/page-17/ts-1505700819000037 (Discussion)

Actionable steps

  1. Finalize the indexes & fields to put in the dataset
  2. Design the schema for this dataset
  3. Build a script that outputs a bundled dataset
  4. Prepare a page (hackmd, GitHub Pages, or README.md) to describe the dataset and its fields

Update seed script to use a static set of seed data

Currently inside /data we use CSV files directly downloaded from Airtable.

Very soon we are going to ditch Airtable and allow users to edit the rumor database (Elasticsearch) directly. Therefore, we will no longer update our seed data using CSVs from Airtable.

We can provide a CSV file that already has duplicated answers and rumors removed.

In this issue, we need to:

  1. Populate rumor IDs and answer IDs in the CSV file. Add an answerIds column to rumors in the CSV file so that we don't need to calculate the dependency in the seed script.
  2. Clean up the CSV files so that they contain no duplicated rumors or answers. Use answerIds to keep track of the many-to-many relationship between rumors and answers.
  3. Since the IDs and relationship data will be ready in the CSV, we no longer need to calculate them in script/csvToElasticSearch.js; that logic should be removed.

Refactor DB mapping relations

From the conversation:

I will find time to sort out the relations between the current DB mappings.
Previously, foreign keys were placed in a somewhat unusual way (not a traditional RDBMS structure) so that users could filter things more easily, but that placement is of no help when sorting by "time of the last report" orz

After the mappings are sorted out, I will also go through the query requirements (e.g. what to sort by, building a list of articles that "have non-article" entries, etc.), and I hope to discuss with @darkbtf or @sayuan how best to adjust the mappings @@
  • Update UML diagram
  • Conduct discussions & reach consensus
  • Implement as index
  • Change API server, make unit tests pass

data visualization

  • Sort by how often a rumor is asked about
  • Distribution of query times
  • Rumor life cycle (number of queries vs. days, number of replies vs. days)
  • Rumor classifier (is a rumor / not a rumor / needs no handling)
  • Rumors idle for the longest time

Some articles have null createdAt

Null values cause errors in the ListArticles query. For example, the last cursor returned by this query is broken by a null value:

{
  ListArticles(orderBy: {createdAt: DESC}, filter: {replyCount: {GT: 0}}, first: 50, after: "WzE0ODE4ODk2MDAwMDAsIjUzNTg1NDY2MDEwMTUtcnVtb3IiXQ==") {
    pageInfo {
      firstCursor
      lastCursor
    }
    edges {
      cursor
      node {
        id
      }
    }
  }
}

When the broken cursor is used, it causes an error.

Query:

{
  ListArticles(orderBy: {createdAt: DESC}, filter: {replyCount: {GT: 0}}, first: 50, after: "Wy05MjIzMzcyMDM2ODU0Nzc2MDAwLCIwNmRiMTAxMTE3ZTFlYjgyYzE4MjI0MTA0YmQwYTgxYS1ydW1vciJd") {
    pageInfo {
      firstCursor
      lastCursor
    }
    edges {
      cursor
      node {
        id
      }
    }
  }
}

Result:

{
  "data": {
    "ListArticles": {
      "pageInfo": {
        "firstCursor": null,
        "lastCursor": null
      },
      "edges": null
    }
  },
  "errors": [
    {
      "message": "[illegal_state_exception] No matching token for number_type [BIG_INTEGER]",
      "locations": [
        {
          "line": 4,
          "column": 9
        }
      ],
      "path": [
        "ListArticles",
        "pageInfo",
        "firstCursor"
      ],
      "authError": false
    },
    {
      "message": "[illegal_state_exception] No matching token for number_type [BIG_INTEGER]",
      "locations": [
        {
          "line": 5,
          "column": 9
        }
      ],
      "path": [
        "ListArticles",
        "pageInfo",
        "lastCursor"
      ],
      "authError": false
    },
    {
      "message": "[illegal_state_exception] No matching token for number_type [BIG_INTEGER]",
      "locations": [
        {
          "line": 7,
          "column": 7
        }
      ],
      "path": [
        "ListArticles",
        "edges"
      ],
      "authError": false
    }
  ]
}

We should either handle the null values or properly fill in the missing values so that Elasticsearch doesn't generate broken cursors.
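If we go with filling in the missing values, here is a hedged sketch of the back-fill (the placeholder date is arbitrary, chosen only for illustration):

import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:62223' });

// Give every article without createdAt a placeholder value so that
// cursors computed from createdAt no longer degenerate to BIG_INTEGER.
await client.updateByQuery({
  index: 'articles',
  query: { bool: { must_not: { exists: { field: 'createdAt' } } } },
  script: {
    source: "ctx._source.createdAt = '2017-01-01T00:00:00.000Z'", // placeholder
    lang: 'painless',
  },
});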

Category labeling mechanism DB fields

Complete the Cofacts crowd-sourced label mechanism API (including unit tests and DB migrations)

Related:

// articles
// (Remove "tags" field)

articles = {
  articleCategories: {
    type: 'nested',
    properties: {
      // Who connected the replyId with the article.
      // Empty if the category is added by AI
      userId: { type: 'keyword' },
      appId: { type: 'keyword' },
      
      // exists only for AI tags
      aiModel: {type: 'keyword'},
      aiConfidence: { type: 'double' },
      
      // Counter cache for feedbacks
      positiveFeedbackCount: { type: 'long' },
      negativeFeedbackCount: { type: 'long' },

      // Foreign key
      categoryId: { type: 'keyword' },

      status: { type: 'keyword' }, // NORMAL, DELETED
      createdAt: { type: 'date' },
      updatedAt: { type: 'date' },
    },
  },
}

articlecategoryfeedbacks = {
  // The article ID and category ID are used in calculating this feedback's ID.
  articleId: { type: 'keyword' },
  categoryId: { type: 'keyword' },

  // Auth
  userId: { type: 'keyword' },
  // The user submits the feedback with which client.
  // Should be one of backend APP ID, 'BOT_LEGACY', 'RUMORS_LINE_BOT' or 'WEBSITE'
  appId: { type: 'keyword' },

  score: { type: 'byte' }, // 1, -1
  comment: { type: 'text', analyzer: 'cjk_url_email' },   // user comment for the article category

  createdAt: { type: 'date' },
  updatedAt: { type: 'date' },
}

// categories
categories = {
  properties: {
    title: { type: 'text', analyzer: 'cjk' },
    description: { type: 'text', analyzer: 'cjk' },
    createdAt: { type: 'date' },
    updatedAt: { type: 'date' },
  }
};

Discussion: https://hackmd.io/SR5H5bYVRUGCYXF8ch7doQ

create users from existing entities

  • Fetch all entities with userId and appId (articles, replyrequests, articlereplyfeedbacks, etc.) and create a user for each unique (appId, userId) pair, where the new user ID is appId_sha(userId) (see the sketch after this list)
  • Update userId for the entities mentioned above
  • analytics only has docUserId and no appId; we need to refetch all data from replies
  • Unit tests / integration tests
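A hedged sketch of the new user ID derivation; the issue only specifies appId_sha(userId), so the digest algorithm below (SHA-256) is an assumption:

import { createHash } from 'crypto';

// New user ID for each unique (appId, userId) pair: appId_sha(userId).
function newUserId(appId: string, userId: string): string {
  const sha = createHash('sha256').update(userId).digest('hex'); // algorithm is assumed
  return `${appId}_${sha}`;
}

console.log(newUserId('RUMORS_LINE_BOT', 'U1234567890')); // hypothetical inputs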
