data management and access
core4 uses MongoDB for system management and for project data management. To support ad-hoc data engineering and analytics, core4 also provides custom user databases.
core4 system collections
core4 uses collections in its MongoDB system database to manage the runtime
environment. The database name defaults to test and can be changed with the
core4 configuration key DEFAULT.mongo_database. The suggested name for the
production database is core4.
The following collections enable system operations:
| collection | purpose |
|---|---|
| sys.log | logging messages |
| sys.role | users and roles |
| sys.worker | registered workers |
| sys.job | registered jobs |
| sys.handler | registered API request handlers |
| sys.cookie | cookies of jobs and API request handlers |
| sys.queue | job queue of active jobs |
| sys.lock | job processing lock |
| sys.journal | job journal of processed jobs |
| sys.stdout | job stdout |
| sys.event | events |
Note
Collection sys.stdout has a time-to-live (TTL) which can be
defined with the core4 configuration key worker.stdout_ttl.
project collections
database name setting
Best practice is to operate one MongoDB database for each project. The suggested database name is the same as the project name.
The place to define this project database name is in the project YAML file,
e.g. demo.yaml:
DEFAULT:
  mongo_database: voting
collection:
  client: !connect mongodb://client
  session: !connect mongodb://session
  event: !connect mongodb://event
  csv: !connect mongodb://csv
The above example defines the default database name voting. This name
cascades into all !connect settings if no explicit database name is given.
See also !connect tag.
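The cascade can be pictured with a minimal sketch. The helper `resolve_connect` below is hypothetical and not part of core4. It assumes that a connect string with one path element names only a collection, while one with two elements names a database and a collection.

```python
def resolve_connect(url, default_database):
    """Split a mongodb:// connect string into (database, collection).

    Hypothetical helper for illustration only, not core4's implementation.
    Assumes "mongodb://collection" or "mongodb://database/collection".
    """
    prefix = "mongodb://"
    if not url.startswith(prefix):
        raise ValueError("expected a mongodb:// URL, got %r" % url)
    parts = [p for p in url[len(prefix):].split("/") if p]
    if len(parts) == 1:
        # no explicit database: fall back to DEFAULT.mongo_database
        return default_database, parts[0]
    if len(parts) == 2:
        # explicit database given: the default does not apply
        return parts[0], parts[1]
    raise ValueError("unsupported connect string %r" % url)


print(resolve_connect("mongodb://client", "voting"))        # ('voting', 'client')
print(resolve_connect("mongodb://other/client", "voting"))  # ('other', 'client')
```

With the YAML above, all four collections would therefore live in database voting under this assumption.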
job load trace
A job extracts, downloads or retrieves data and feeds it into the job’s project MongoDB database. To trace data loads, each job must add source information before inserting data into a MongoDB collection:
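# runs inside a core4 job; requires "import pandas as pd"
for source in self.list_proc(r".+\.csv$"):
    # register the file as the source of all following inserts
    self.set_source(source)
    # remove records from an earlier load of the same source
    self.config.mypro.csv_collection.delete_many(
        {"_src": self.get_source()})
    df = pd.read_csv(source)
    # insert_many expects a list of documents, not a CSV string
    self.config.mypro.csv_collection.insert_many(df.to_dict("records"))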
This job snippet lists all CSV files in the processing folder. For each file
it sets the source, deletes any records previously loaded from that source,
reads the data using pandas and finally inserts the data into the MongoDB
collection csv defined by the core4 configuration setting
mypro.csv_collection:
# content snippet of mypro.yaml
DEFAULT:
  mongo_database: mypro
csv_collection: !connect mongodb://csv
This mechanism of setting the source makes the data loaded from the CSV file
traceable. Each record carries the attribute _job_id and the source
identifier _src. This approach also makes a job restartable: delete all
records of the current source from collection csv before reloading the
data:
self.config.mypro.csv_collection.delete_many(
    {"_src": self.get_source()})
Note
The _src attribute stores only the basename of the source
filename.
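The delete-then-insert pattern can be tried without a MongoDB server. The following is a minimal sketch, not core4 code: `FakeCollection` and `load` are hypothetical stand-ins, and the stamping of `_src` and `_job_id` is done by hand here only to imitate the attributes described above.

```python
import os


class FakeCollection:
    """In-memory stand-in for a MongoDB collection (illustration only)."""

    def __init__(self):
        self.docs = []

    def delete_many(self, query):
        # keep only documents that differ from the query in at least one key
        self.docs = [d for d in self.docs
                     if any(d.get(k) != v for k, v in query.items())]

    def insert_many(self, records, source, job_id):
        # stamp each record with the source basename and the job id
        src = os.path.basename(source)
        for rec in records:
            self.docs.append(dict(rec, _src=src, _job_id=job_id))


def load(coll, source, records, job_id):
    """Delete earlier records of this source, then insert the new ones."""
    coll.delete_many({"_src": os.path.basename(source)})
    coll.insert_many(records, source, job_id)


coll = FakeCollection()
rows = [{"id": 1}, {"id": 2}]
load(coll, "/data/proc/a.csv", rows, job_id="job-1")
load(coll, "/data/proc/a.csv", rows, job_id="job-2")  # simulated restart
print(len(coll.docs))  # 2
```

Because the second run first removes everything carrying the same `_src`, the restart leaves two records rather than four, all of them attributed to the second job.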
user databases
core4 authorization manages read-only access to MongoDB databases. See
authentication, authorisation and access management. Additionally, each user has read/write access to his or her
user database. These databases adhere to the naming convention
user![username] which can be modified with core4 configuration setting
sys.userdb.
To access the database, you first have to create an access token. Use this token like a password to connect to your personal user database:
mongo \
--host [hostname] \
--port [port] \
--username [username] \
--password [token] \
--authenticationDatabase admin
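The same credentials can also be written as a standard MongoDB connection URI, for example to hand over to a driver. The sketch below is illustrative only: `user_db_uri` is a hypothetical helper, and the host, port, username and token values are placeholders. The username and token are percent-encoded because characters such as `/` or `@` are not allowed unescaped in the credentials part of a MongoDB URI.

```python
from urllib.parse import quote_plus


def user_db_uri(host, port, username, token):
    """Build a MongoDB URI that authenticates against the admin database."""
    return "mongodb://%s:%s@%s:%d/?authSource=admin" % (
        quote_plus(username), quote_plus(token), host, port)


# placeholder values, replace with your own hostname, user and token
print(user_db_uri("localhost", 27017, "mra", "s3cr3t/token"))
# mongodb://mra:s3cr3t%2Ftoken@localhost:27017/?authSource=admin
```

The `authSource=admin` option corresponds to `--authenticationDatabase admin` in the shell command above.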