
rumors-db's Introduction

Scripts for managing rumors db


Installation

Please install Node.js 18 first.

$ npm i

Configuration

For development, copy .env.sample to .env and make necessary changes.

Elasticsearch

Anatomy of a schema file

A schema file under the schema/ directory consists of the following (a minimal sketch follows the list):

  • VERSION -- see the next section for details.
  • The default export -- an object that represents the mapping of the index.
  • A zod schema export named <indexName>Schema, which can be used to generate Typescript definitions as well as serve as a validator.
  • A Typescript type export for the index in UpperCamelCase, created from the zod schema.
  • An examples export -- an array of example documents that can be inserted into the index and correctly type-check. We use the examples to:
    • Provide readable examples of what is actually stored in the ES index
    • Check that the index schema is as expected
    • Check that the Typescript definition is as expected
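For illustration, here is a hedged sketch of what such a schema file could look like (the fields shown are hypothetical, not the actual articles schema):

// schema/articles.ts -- a sketch, not the real file
import { z } from 'zod';

export const VERSION = '1.0.0';

// zod schema named <indexName>Schema; doubles as a validator
export const articlesSchema = z.object({
  text: z.string(),
  createdAt: z.string(), // assuming ISO date strings are stored
});

// Typescript definition in UpperCamelCase, created from zod
export type Articles = z.infer<typeof articlesSchema>;

// Example documents that can be inserted into the index and type-check
export const examples: Articles[] = [
  { text: 'Example rumor text', createdAt: '2020-01-01T00:00:00.000Z' },
];

// Default export: the mapping of the index
export default {
  properties: {
    text: { type: 'text', analyzer: 'cjk' },
    createdAt: { type: 'date' },
  },
};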

Index mapping versions

All mappings live in the schema/ directory, with schema/index.js being the entry point.

When loading schemas into the DB using npm run schema, the script appends _v${VERSION} (with dots replaced by underscores) to the name of each created index, then creates an alias from the plain index name to the versioned index, according to the VERSION constant in the respective schema file.

For example, given that VERSION in schema/articles.js is 1.0.0, running npm run schema would put the mappings in schema/articles.js into the index articles_v1_0_0 and create an alias from articles to articles_v1_0_0.
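In Elasticsearch terms, the effect is roughly the following (a sketch using the @elastic/elasticsearch client; the actual npm script may differ):

import { Client } from '@elastic/elasticsearch';
// Hypothetical import of the schema file described above
import articlesMapping, { VERSION } from './schema/articles';

const client = new Client({ node: 'http://localhost:62223' });

const indexName = `articles_v${VERSION.replace(/\./g, '_')}`; // articles_v1_0_0

// Create the versioned index with the mapping from the schema file...
await client.indices.create({ index: indexName, mappings: articlesMapping });

// ...then point the plain index name at it via an alias.
await client.indices.putAlias({ index: indexName, name: 'articles' });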

Running migrations

The index mappings in schema/ are always the latest version, so if you are starting a database with fresh data, there is no need for migrations.

However, if you are reading data written under a legacy version of the mappings, you may need migrations.

Migration scripts live under db/migrations and can be run as:

$ ./node_modules/.bin/babel-node db/migrations/<migration script name>
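What a migration script does depends on the mapping change; below is a minimal hedged sketch, assuming the @elastic/elasticsearch client and a made-up field rename:

import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:62223' });

// Hypothetical example: move the legacy field `references`
// into a new field `articleReferences` on every document.
await client.updateByQuery({
  index: 'articles',
  query: { exists: { field: 'references' } },
  script: {
    source: "ctx._source.articleReferences = ctx._source.remove('references')",
    lang: 'painless',
  },
});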

Prepare database for unit tests

See rumors-api

Backup production database and run on local machine

According to rumors-deploy, the production DB raw data should be available in rumors-deploy/volumes/db-production (staging data is in db-staging instead).

Tar up rumors-deploy/volumes/db-production, download it to your local machine, and extract it into the esdata directory at this project's root.
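For example, assuming SSH access to the production host (the host name my-server is hypothetical):

$ ssh my-server 'tar czf - rumors-deploy/volumes/db-production' > db-production.tar.gz
$ tar xzf db-production.tar.gz
$ mkdir -p esdata && mv rumors-deploy/volumes/db-production/* esdata/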

Then run:

$ docker-compose up

This spins up Elasticsearch on localhost:62223, with Kibana available at localhost:62224, using the data in esdata.

Updating schema for one index

After adding or removing fields in an index's schema file, you will need to reload the schema, because existing field mappings are not editable in Elasticsearch.

This can be done by:

  1. Manually bump the VERSION in the schema file
  2. Run npm run reload -- <index file name> (for instance, npm run reload -- replyrequests)

The script creates indices with the latest schema & package.json version postfix, performs a reindex, updates the alias, and removes the old index.
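Conceptually, the reload performs the following Elasticsearch operations (a hedged sketch with the @elastic/elasticsearch client; the old index name is hypothetical):

import { Client } from '@elastic/elasticsearch';
import newMapping, { VERSION } from './schema/replyrequests';

const client = new Client({ node: 'http://localhost:62223' });
const oldIndex = 'replyrequests_v1_0_0'; // hypothetical previous version
const newIndex = `replyrequests_v${VERSION.replace(/\./g, '_')}`;

// 1. Create the new index with the latest mapping
await client.indices.create({ index: newIndex, mappings: newMapping });

// 2. Copy all documents from the old index into the new one
await client.reindex({
  source: { index: oldIndex },
  dest: { index: newIndex },
  wait_for_completion: true,
});

// 3. Atomically repoint the alias at the new index
await client.indices.updateAliases({
  actions: [
    { remove: { index: oldIndex, alias: 'replyrequests' } },
    { add: { index: newIndex, alias: 'replyrequests' } },
  ],
});

// 4. Remove the old index
await client.indices.delete({ index: oldIndex });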

BigQuery

Please manually create the dataset and set up permissions on Google Cloud, then set the related environment variables in .env.

Run the following script to create BigQuery tables under the dataset specified in the environment variable:

./node_modules/.bin/babel-node db/setBqTables.ts --extensions .ts,.js
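For reference, table creation with @google-cloud/bigquery looks roughly like this sketch (the environment variable name and the field list are assumptions, not the actual setBqTables.ts contents):

import { BigQuery } from '@google-cloud/bigquery';

const bigquery = new BigQuery(); // picks up GOOGLE_APPLICATION_CREDENTIALS

const datasetId = process.env.BIGQUERY_DATASET; // env var name is an assumption
if (!datasetId) throw new Error('dataset env var is not set');

// Hypothetical table & fields, for illustration only
await bigquery.dataset(datasetId).createTable('articles', {
  schema: {
    fields: [
      { name: 'id', type: 'STRING' },
      { name: 'text', type: 'STRING' },
      { name: 'createdAt', type: 'TIMESTAMP' },
    ],
  },
});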

Other commands

These commands are invoked by the commands mentioned above. See package.json for details.

npm run clear

Deletes all indices.

npm run schema [-- indexName]

Creates indices with specified mappings.

By default it creates all indexes that exist in the schema/ directory, and errors if an index already exists.

You can create a single index by specifying indexName in the command.

npm run scan [-- indexName]

Scans through all existing documents in indexName to check whether they match the current zod schema.

If indexName is not given, all indexes in schema/ are scanned.
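Under the hood, a scan amounts to validating each stored document against the zod schema. A minimal sketch (a real scan would page through every document with the scroll or search_after API):

import { Client } from '@elastic/elasticsearch';
import { articlesSchema } from './schema/articles';

const client = new Client({ node: 'http://localhost:62223' });

const { hits } = await client.search({
  index: 'articles',
  size: 1000, // only the first batch; a real scan paginates further
  query: { match_all: {} },
});

for (const hit of hits.hits) {
  const result = articlesSchema.safeParse(hit._source);
  if (!result.success) {
    console.error(`Document ${hit._id} does not match schema:`, result.error.issues);
  }
}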

npm run seed

Inserts the examples in each schema into the database.


rumors-db's Issues

Open Dataset scripts & docs

Goal

From Johnson's 1122 goals:
Open up the database data so that statistics and AI work can get involved:
(a) Analyze the composition of messages, so that interested people can look for answers to questions like "what do people in Taiwan get confused about" and "how many messages come from abroad, and from where", or use social network graphical models to understand how accounts relate to one another.
(b) Automatically assist categorization by topic and domain.

Previous decisions & spec

http://beta.hackfoldr.org/cofacts/https%253A%252F%252Fhackmd.io%252Fs%252FSysG-Jxo- (the "preparing data" page)

Related discussion

https://g0v-tw.slackarchive.io/cofacts/page-9/ts-1497944951784497 (Web QA)
https://g0v-tw.slackarchive.io/cofacts/page-17/ts-1505700819000037 (Discussion)

Actionable steps

  1. Finalize the indexes & fields to put in the dataset
  2. Design the schema for this dataset
  3. Build a script that outputs a bundled dataset
  4. Prepare a page (hackmd, GitHub Pages, or README.md) to describe the dataset and its fields

Update seed script to use a static set of seed data

Currently inside /data we use CSV files directly downloaded from Airtable.

Very soon we are going to ditch Airtable and allow users to edit the rumor database (Elasticsearch) directly. Therefore, we will no longer update our seed data using CSVs from Airtable.

We can provide a CSV file that already has duplicated answers and rumors removed.

In this issue, we need to:

  1. Populate rumor IDs and answer IDs in the CSV file. Add an answerIds column to rumors in the CSV file so that we don't need to calculate the dependency in the seed script.
  2. Clean up the CSV files so that they contain no duplicated rumors or answers. Use answerIds to keep track of the many-to-many relationship between rumors and answers.
  3. Since the IDs and relationship data will be ready in the CSV, we no longer need to calculate them in script/csvToElasticSearch.js; that logic should be removed.

Refactor DB mapping relations

From the conversation:

I will find time to sort out the relations between the current DB mappings.
Previously, foreign keys were placed in a somewhat unusual way (not a traditional RDBMS structure) so that users could filter things more easily, but that placement is of no help when sorting by "time of the last report" orz

After the mappings are sorted out, I will also go through the query requirements (e.g. what to sort by, building a list of articles that "have non-article" entries, etc.), and I hope to discuss with @darkbtf or @sayuan how best to adjust the mappings @@
  • Update UML diagram
  • Conduct discussions & reach consensus
  • Implement as index
  • Change API server, make unit tests pass

data visualization

  • Sort by how often a rumor is asked about
  • Distribution of query times
  • Rumor life cycle (number of queries vs. days, number of replies vs. days)
  • Rumor classifier (is a rumor / not a rumor / needs no handling)
  • Rumors idle for the longest time

Some articles have null createdAt

Null values cause errors in the ListArticles query. For example, the last cursor returned by this query is broken by a null value:

{
  ListArticles(orderBy: {createdAt: DESC}, filter: {replyCount: {GT: 0}}, first: 50, after: "WzE0ODE4ODk2MDAwMDAsIjUzNTg1NDY2MDEwMTUtcnVtb3IiXQ==") {
    pageInfo {
      firstCursor
      lastCursor
    }
    edges {
      cursor
      node {
        id
      }
    }
  }
}

When the broken cursor is used, it causes an error.

Query:

{
  ListArticles(orderBy: {createdAt: DESC}, filter: {replyCount: {GT: 0}}, first: 50, after: "Wy05MjIzMzcyMDM2ODU0Nzc2MDAwLCIwNmRiMTAxMTE3ZTFlYjgyYzE4MjI0MTA0YmQwYTgxYS1ydW1vciJd") {
    pageInfo {
      firstCursor
      lastCursor
    }
    edges {
      cursor
      node {
        id
      }
    }
  }
}

Result:

{
  "data": {
    "ListArticles": {
      "pageInfo": {
        "firstCursor": null,
        "lastCursor": null
      },
      "edges": null
    }
  },
  "errors": [
    {
      "message": "[illegal_state_exception] No matching token for number_type [BIG_INTEGER]",
      "locations": [
        {
          "line": 4,
          "column": 9
        }
      ],
      "path": [
        "ListArticles",
        "pageInfo",
        "firstCursor"
      ],
      "authError": false
    },
    {
      "message": "[illegal_state_exception] No matching token for number_type [BIG_INTEGER]",
      "locations": [
        {
          "line": 5,
          "column": 9
        }
      ],
      "path": [
        "ListArticles",
        "pageInfo",
        "lastCursor"
      ],
      "authError": false
    },
    {
      "message": "[illegal_state_exception] No matching token for number_type [BIG_INTEGER]",
      "locations": [
        {
          "line": 7,
          "column": 7
        }
      ],
      "path": [
        "ListArticles",
        "edges"
      ],
      "authError": false
    }
  ]
}

We should either handle the null values or properly fill in the missing values so that Elasticsearch doesn't generate broken cursors.
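If we go with filling in the missing values, here is a hedged sketch of the back-fill (the placeholder date is arbitrary, chosen only for illustration):

import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:62223' });

// Give every article without createdAt a placeholder value so that
// cursors computed from createdAt no longer degenerate to BIG_INTEGER.
await client.updateByQuery({
  index: 'articles',
  query: { bool: { must_not: { exists: { field: 'createdAt' } } } },
  script: {
    source: "ctx._source.createdAt = '2017-01-01T00:00:00.000Z'", // placeholder
    lang: 'painless',
  },
});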

Category labeling mechanism DB fields

Complete the Cofacts crowd-sourced label mechanism API (including unit tests and DB migrations)

Related:

// articles
// (Remove "tags" field)

articles = {
  articleCategories: {
    type: 'nested',
    properties: {
      // Who connected the replyId with the article.
      // Empty if the category is added by AI
      userId: { type: 'keyword' },
      appId: { type: 'keyword' },
      
      // exists only for AI tags
      aiModel: {type: 'keyword'},
      aiConfidence: { type: 'double' },
      
      // Counter cache for feedbacks
      positiveFeedbackCount: { type: 'long' },
      negativeFeedbackCount: { type: 'long' },

      // Foreign key
      categoryId: { type: 'keyword' },

      status: { type: 'keyword' }, // NORMAL, DELETED
      createdAt: { type: 'date' },
      updatedAt: { type: 'date' },
    },
  },
}

articlecategoryfeedbacks = {
  // The article ID and category ID are used in calculating this feedback's ID.
  articleId: { type: 'keyword' },
  categoryId: { type: 'keyword' },

  // Auth
  userId: { type: 'keyword' },
  // The user submits the feedback with which client.
  // Should be one of backend APP ID, 'BOT_LEGACY', 'RUMORS_LINE_BOT' or 'WEBSITE'
  appId: { type: 'keyword' },

  score: { type: 'byte' }, // 1, -1
  comment: { type: 'text', analyzer: 'cjk_url_email' },   // user comment for the article category

  createdAt: { type: 'date' },
  updatedAt: { type: 'date' },
}

// categories
categories = {
  properties: {
    title: { type: 'text', analyzer: 'cjk' },
    description: { type: 'text', analyzer: 'cjk' },
    createdAt: { type: 'date' },
    updatedAt: { type: 'date' },
  }
};

Discussion: https://hackmd.io/SR5H5bYVRUGCYXF8ch7doQ

create users from existing entities

  • Fetch all entities with userId and appId (articles, replyrequests, articlereplyfeedbacks, etc.) and create a user for each unique (appId, userId) pair, where the new user ID is appId_sha(userId) (see the sketch after this list)
  • Update userId for the entities mentioned above
  • analytics only has docUserId and no appId; we need to refetch all data from replies
  • Unit tests / integration tests
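A hedged sketch of the new user ID derivation; the issue only specifies appId_sha(userId), so the digest algorithm below (SHA-256) is an assumption:

import { createHash } from 'crypto';

// New user ID for each unique (appId, userId) pair: appId_sha(userId).
function newUserId(appId: string, userId: string): string {
  const sha = createHash('sha256').update(userId).digest('hex'); // algorithm is assumed
  return `${appId}_${sha}`;
}

console.log(newUserId('RUMORS_LINE_BOT', 'U1234567890')); // hypothetical inputs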
