Vulnerability History

| Date | High Risk | Low Risk |
|---|---|---|
| 2024-12-07 | 1 | 1 |

Audit Report Details
- 13032 Lines of Code
- 8 Open
- 0 Resolved

🚨 High Risk Vulnerabilities
⚠️ Low Risk Vulnerabilities

Vulnerable Code:
1---
2File: /.github/pull_request_template.md
3---
4
5## Describe your changes
6
7## Issue ticket number and link
8
9[Task Title](https://www.notion.so/Compute-SN-c27d35dd084e4c4d92374f55cdd293f2?p=f9b26856f1a6406892b5db46446260da&pm=s)
10
11## Checklist before requesting a review
12- [ ] I have performed a self-review of my code
13- [ ] I wrote tests.
14- [ ] Need to take care of performance?
15
16
17
18---
19File: /contrib/CODE_REVIEW_DOCS.md
20---
21
22# Code Review
23### Conceptual Review
24
25A review can be a conceptual review, where the reviewer leaves a comment
26 * `Concept (N)ACK`, meaning "I do (not) agree with the general goal of this pull
27 request",
28 * `Approach (N)ACK`, meaning `Concept ACK`, but "I do (not) agree with the
29 approach of this change".
30
31A `NACK` needs to include a rationale why the change is not worthwhile.
32NACKs without accompanying reasoning may be disregarded.
33After conceptual agreement on the change, code review can be provided. A review
34begins with `ACK BRANCH_COMMIT`, where `BRANCH_COMMIT` is the top of the PR
35branch, followed by a description of how the reviewer did the review. The
36following language is used within pull request comments:
37
38 - "I have tested the code", involving change-specific manual testing in
39 addition to running the unit, functional, or fuzz tests, and in case it is
40 not obvious how the manual testing was done, it should be described;
41 - "I have not tested the code, but I have reviewed it and it looks
42 OK, I agree it can be merged";
43 - A "nit" refers to a trivial, often non-blocking issue.
44
45### Code Review
46Project maintainers reserve the right to weigh the opinions of peer reviewers
47using common sense judgement and may also weigh based on merit. Reviewers that
48have demonstrated a deeper commitment and understanding of the project over time
49or who have clear domain expertise may naturally have more weight, as one would
50expect in all walks of life.
51
52Where a patch set affects consensus-critical code, the bar will be much
53higher in terms of discussion and peer review requirements, keeping in mind that
54mistakes could be very costly to the wider community. This includes refactoring
55of consensus-critical code.
56
57Where a patch set proposes to change the Bittensor consensus, it must have been
58discussed extensively on the discord server and other channels, be accompanied by a widely
59discussed BIP and have a generally widely perceived technical consensus of being
60a worthwhile change based on the judgement of the maintainers.
61
62### Finding Reviewers
63
64As most reviewers are themselves developers with their own projects, the review
65process can be quite lengthy, and some amount of patience is required. If you find
66that you've been waiting for a pull request to be given attention for several
67months, there may be a number of reasons for this, some of which you can do something
68about:
69
70 - It may be because of a feature freeze due to an upcoming release. During this time,
71 only bug fixes are taken into consideration. If your pull request is a new feature,
72 it will not be prioritized until after the release. Wait for the release.
73 - It may be because the changes you are suggesting do not appeal to people. Rather than
74 nits and critique, which require effort and means they care enough to spend time on your
75 contribution, thundering silence is a good sign of widespread (mild) dislike of a given change
76 (because people don't assume *others* won't actually like the proposal). Don't take
77 that personally, though! Instead, take another critical look at what you are suggesting
78 and see if it: changes too much, is too broad, doesn't adhere to the
79 [developer notes](DEVELOPMENT_WORKFLOW.md), is dangerous or insecure, is messily written, etc.
80 Identify and address any of the issues you find. Then ask e.g. on IRC if someone could give
81 their opinion on the concept itself.
82 - It may be because your code is too complex for all but a few people, and those people
83 may not have realized your pull request even exists. A great way to find people who
84 are qualified and care about the code you are touching is the
85 [Git Blame feature](https://docs.github.com/en/github/managing-files-in-a-repository/managing-files-on-github/tracking-changes-in-a-file). Simply
86 look up who last modified the code you are changing and see if you can find
87 them and give them a nudge. Don't be incessant about the nudging, though.
88 - Finally, if all else fails, ask on IRC or elsewhere for someone to give your pull request
89 a look. If you think you've been waiting for an unreasonably long time (say,
90 more than a month) for no particular reason (a few lines changed, etc.),
91 this is totally fine. Try to return the favor when someone else is asking
92 for feedback on their code, and the universe balances out.
93 - Remember that the best thing you can do while waiting is give review to others!
94
95
96---
97File: /contrib/CONTRIBUTING.md
98---
99
100# Contributing to Bittensor Subnet Development
101
102The following is a set of guidelines for contributing to the Bittensor ecosystem. These are **HIGHLY RECOMMENDED** guidelines, but not hard-and-fast rules. Use your best judgment, and feel free to propose changes to this document in a pull request.
103
104## Table Of Contents
1051. [How Can I Contribute?](#how-can-i-contribute)
106 1. [Communication Channels](#communication-channels)
107 1. [Code Contribution General Guidelines](#code-contribution-general-guidelines)
108 1. [Pull Request Philosophy](#pull-request-philosophy)
109 1. [Pull Request Process](#pull-request-process)
110 1. [Addressing Feedback](#addressing-feedback)
111 1. [Squashing Commits](#squashing-commits)
112 1. [Refactoring](#refactoring)
113 1. [Peer Review](#peer-review)
114 1. [Suggesting Enhancements and Features](#suggesting-enhancements-and-features)
115
116
117## How Can I Contribute?
118TODO(developer): Define your desired contribution procedure.
119
120## Communication Channels
121TODO(developer): Place your communication channels here
122
123> Please follow the Bittensor Subnet [style guide](./STYLE.md) regardless of your contribution type.
124
125Here is a high-level summary:
126- Code consistency is crucial; adhere to established programming language conventions.
127- Use `black` to format your Python code; it ensures readability and consistency.
128- Write concise Git commit messages; summarize changes in ~50 characters.
129- Follow these six commit rules:
130 - Atomic Commits: Focus on one task or fix per commit.
131 - Subject and Body Separation: Use a blank line to separate the subject from the body.
132 - Subject Line Length: Keep it under 50 characters for readability.
133 - Imperative Mood: Write subject line as if giving a command or instruction.
134 - Body Text Width: Wrap text manually at 72 characters.
135 - Body Content: Explain what changed and why, not how.
136- Make use of your commit messages to simplify project understanding and maintenance.
137
138> For clear examples of each of the commit rules, see the style guide's [rules](./STYLE.md#the-six-rules-of-a-great-commit) section.
139
140### Code Contribution General Guidelines
141
142> Review the Bittensor Subnet [style guide](./STYLE.md) and [development workflow](./DEVELOPMENT_WORKFLOW.md) before contributing.
143
144
145#### Pull Request Philosophy
146
147Patchsets and enhancements should always be focused. A pull request could add a feature, fix a bug, or refactor code, but it should not contain a mixture of these. Please also avoid 'super' pull requests which attempt to do too much, are overly large, or overly complex as this makes review difficult.
148
149Specifically, pull requests must adhere to the following criteria:
150- Contain fewer than 50 files. PRs with more than 50 files will be closed.
151- If a PR introduces a new feature, it *must* include corresponding tests.
152- Other PRs (bug fixes, refactoring, etc.) should ideally also have tests, as they provide proof of concept and prevent regression.
153- Categorize your PR properly by using GitHub labels. This aids in the review process by informing reviewers about the type of change at a glance.
154- Make sure your code includes adequate comments. These should explain why certain decisions were made and how your changes work.
155- If your changes are extensive, consider breaking your PR into smaller, related PRs. This makes your contributions easier to understand and review.
156- Be active in the discussion about your PR. Respond promptly to comments and questions to help reviewers understand your changes and speed up the acceptance process.
157
158Generally, all pull requests must:
159
160 - Have a clear use case, fix a demonstrable bug or serve the greater good of the project (e.g. refactoring for modularisation).
161 - Be well peer-reviewed.
162 - Follow code style guidelines.
163 - Not break the existing test suite.
164 - Where bugs are fixed, where possible, there should be unit tests demonstrating the bug and also proving the fix.
165 - Change relevant comments and documentation when behaviour of code changes.
166
167#### Pull Request Process
168
169Please follow these steps to have your contribution considered by the maintainers:
170
171*Before* creating the PR:
1721. Read the [development workflow](./DEVELOPMENT_WORKFLOW.md) defined for this repository to understand our workflow.
1732. Ensure your PR meets the criteria stated in the 'Pull Request Philosophy' section.
1743. Include relevant tests for any fixed bugs or new features as stated in the [testing guide](./TESTING.md).
1754. Ensure your commit messages are clear and concise. Include the issue number if applicable.
1765. If you have multiple commits, rebase them into a single commit using `git rebase -i`.
1776. Explain what your changes do and why you think they should be merged in the PR description consistent with the [style guide](./STYLE.md).
178
179*After* creating the PR:
1801. Verify that all [status checks](https://help.github.com/articles/about-status-checks/) are passing after you submit your pull request.
1812. Label your PR using GitHub's labeling feature. The labels help categorize the PR and streamline the review process.
1823. Document your code with comments that provide a clear understanding of your changes. Explain any non-obvious parts of your code or design decisions you've made.
1834. If your PR has extensive changes, consider splitting it into smaller, related PRs. This reduces the cognitive load on the reviewers and speeds up the review process.
184
185Please be responsive and participate in the discussion on your PR! This aids in clarifying any confusion or concerns and leads to quicker resolution and merging of your PR.
186
187> Note: If your changes are not ready for merge but you want feedback, create a draft pull request.
188
189Following these criteria will aid in quicker review and potential merging of your PR.
190While the prerequisites above must be satisfied prior to having your pull request reviewed, the reviewer(s) may ask you to complete additional design work, tests, or other changes before your pull request can be ultimately accepted.
191
192When you are ready to submit your changes, create a pull request:
193
194> **Always** follow the [style guide](./STYLE.md) and [development workflow](./DEVELOPMENT_WORKFLOW.md) before submitting pull requests.
195
196After you submit a pull request, it will be reviewed by the maintainers. They may ask you to make changes. Please respond to any comments and push your changes as a new commit.
197
198> Note: Be sure to merge the latest from "upstream" before making a pull request:
199
200```bash
201git remote add upstream https://github.com/opentensor/bittensor.git # TODO(developer): replace with your repo URL
202git fetch upstream
203git merge upstream/staging  # or whichever upstream branch your PR targets
204git push origin <your-branch-name>
205```
206
207#### Addressing Feedback
208
209After submitting your pull request, expect comments and reviews from other contributors. You can add more commits to your pull request by committing them locally and pushing to your fork.
210
211You are expected to reply to any review comments before your pull request is merged. You may update the code or reject the feedback if you do not agree with it, but you should express so in a reply. If there is outstanding feedback and you are not actively working on it, your pull request may be closed.
212
213#### Squashing Commits
214
215If your pull request contains fixup commits (commits that change the same line of code repeatedly) or too fine-grained commits, you may be asked to [squash](https://git-scm.com/docs/git-rebase#_interactive_mode) your commits before it will be reviewed. The basic squashing workflow is shown below.
216
217 git checkout your_branch_name
218 git rebase -i HEAD~n
219 # n is normally the number of commits in the pull request.
220 # Set commits (except the one in the first line) from 'pick' to 'squash', save and quit.
221 # On the next screen, edit/refine commit messages.
222 # Save and quit.
223 git push -f # (force push to GitHub)
224
225Please update the resulting commit message, if needed. It should read as a coherent message. In most cases, this means not just listing the interim commits.
226
227If your change contains a merge commit, the above workflow may not work and you will need to remove the merge commit first. See the next section for details on how to rebase.
228
229Please refrain from creating several pull requests for the same change. Use the pull request that is already open (or was created earlier) to amend changes. This preserves the discussion and review that happened earlier for the respective change set.
230
231The length of time required for peer review is unpredictable and will vary from pull request to pull request.
232
233#### Refactoring
234
235Refactoring is a necessary part of any software project's evolution. The following guidelines cover refactoring pull requests for the project.
236
237There are three categories of refactoring: code-only moves, code style fixes, and code refactoring. In general, refactoring pull requests should not mix these three kinds of activities in order to make refactoring pull requests easy to review and uncontroversial. In all cases, refactoring PRs must not change the behaviour of code within the pull request (bugs must be preserved as is).
238
239Project maintainers aim for a quick turnaround on refactoring pull requests, so where possible keep them short, uncomplex and easy to verify.
240
241Pull requests that refactor the code should not be made by new contributors. It requires a certain level of experience to know where the code belongs and to understand the full ramifications (including the rebase effort for open pull requests). Trivial pull requests or pull requests that refactor the code with no clear benefits may be immediately closed by the maintainers to reduce unnecessary review workload.
242
243#### Peer Review
244
245Anyone may participate in peer review which is expressed by comments in the pull request. Typically reviewers will review the code for obvious errors, as well as test out the patch set and opine on the technical merits of the patch. Project maintainers take into account the peer review when determining if there is consensus to merge a pull request (remember that discussions may have taken place elsewhere, not just on GitHub). The following language is used within pull-request comments:
246
247- ACK means "I have tested the code and I agree it should be merged";
248- NACK means "I disagree this should be merged", and must be accompanied by sound technical justification. NACKs without accompanying reasoning may be disregarded;
249- utACK means "I have not tested the code, but I have reviewed it and it looks OK, I agree it can be merged";
250- Concept ACK means "I agree in the general principle of this pull request";
251- Nit refers to trivial, often non-blocking issues.
252
253Reviewers should include the commit(s) they have reviewed in their comments. This can be done by copying the commit SHA1 hash.
254
255A pull request that changes consensus-critical code is considerably more involved than a pull request that adds a feature to the wallet, for example. Such patches must be reviewed and thoroughly tested by several reviewers who are knowledgeable about the changed subsystems. Where new features are proposed, it is helpful for reviewers to try out the patch set on a test network and indicate that they have done so in their review. Project maintainers will take this into consideration when merging changes.
256
257For a more detailed description of the review process, see the [Code Review Guidelines](CODE_REVIEW_DOCS.md).
258
259> **Note:** If you find a **Closed** issue that seems like it is the same thing that you're experiencing, open a new issue and include a link to the original issue in the body of your new one.
260
261#### How Do I Submit A (Good) Bug Report?
262
263Please track bugs as GitHub issues.
264
265Explain the problem and include additional details to help maintainers reproduce the problem:
266
267* **Use a clear and descriptive title** for the issue to identify the problem.
268* **Describe the exact steps which reproduce the problem** in as many details as possible. For example, start by explaining how you started the application, e.g. which command exactly you used in the terminal, or how you started Bittensor otherwise. When listing steps, **don't just say what you did, but explain how you did it**. For example, if you ran with a set of custom configs, explain if you used a config file or command line arguments.
269* **Provide specific examples to demonstrate the steps**. Include links to files or GitHub projects, or copy/pasteable snippets, which you use in those examples. If you're providing snippets in the issue, use [Markdown code blocks](https://help.github.com/articles/markdown-basics/#multiple-lines).
270* **Describe the behavior you observed after following the steps** and point out what exactly is the problem with that behavior.
271* **Explain which behavior you expected to see instead and why.**
272* **Include screenshots and animated GIFs** which show you following the described steps and clearly demonstrate the problem. You can use [this tool](https://www.cockos.com/licecap/) to record GIFs on macOS and Windows, and [this tool](https://github.com/colinkeenan/silentcast) or [this tool](https://github.com/GNOME/byzanz) on Linux.
273* **If you're reporting that Bittensor crashed**, include a crash report with a stack trace from the operating system. On macOS, the crash report will be available in `Console.app` under "Diagnostic and usage information" > "User diagnostic reports". Include the crash report in the issue in a [code block](https://help.github.com/articles/markdown-basics/#multiple-lines), a [file attachment](https://help.github.com/articles/file-attachments-on-issues-and-pull-requests/), or put it in a [gist](https://gist.github.com/) and provide link to that gist.
274* **If the problem is related to performance or memory**, include a CPU profile capture with your report, if you're using a GPU then include a GPU profile capture as well. Look into the [PyTorch Profiler](https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html) to look at memory usage of your model.
275* **If the problem wasn't triggered by a specific action**, describe what you were doing before the problem happened and share more information using the guidelines below.
276
277Provide more context by answering these questions:
278
279* **Did the problem start happening recently** (e.g. after updating to a new version) or was this always a problem?
280* If the problem started happening recently, **can you reproduce the problem in an older version of Bittensor?**
281* **Can you reliably reproduce the issue?** If not, provide details about how often the problem happens and under which conditions it normally happens.
282
283Include details about your configuration and environment:
284
285* **Which version of Bittensor Subnet are you using?**
286* **What commit hash are you on?** You can get the exact commit hash by checking `git log` and pasting the full commit hash.
287* **What's the name and version of the OS you're using**?
288* **Are you running Bittensor Subnet in a virtual machine?** If so, which VM software are you using and which operating systems and versions are used for the host and the guest?
289* **Are you running Bittensor Subnet in a dockerized container?** If so, have you made sure that your docker container contains your latest changes and is up to date with Master branch?
290
291### Suggesting Enhancements and Features
292
293This section guides you through submitting an enhancement suggestion, including completely new features and minor improvements to existing functionality. Following these guidelines helps maintainers and the community understand your suggestion :pencil: and find related suggestions :mag_right:.
294
295When you are creating an enhancement suggestion, please [include as many details as possible](#how-do-i-submit-a-good-enhancement-suggestion). Fill in [the template](https://bit.ly/atom-behavior-pr), including the steps that you imagine you would take if the feature you're requesting existed.
296
297#### Before Submitting An Enhancement Suggestion
298
299* **Check the [debugging guide](./DEBUGGING.md)** for tips — you might discover that the enhancement is already available. Most importantly, check if you're using the latest version of the project first.
300
301#### How to Submit A (Good) Feature Suggestion
302
303* **Use a clear and descriptive title** for the issue to identify the problem.
304* **Provide a step-by-step description of the suggested enhancement** in as many details as possible.
305* **Provide specific examples to demonstrate the steps**. Include copy/pasteable snippets which you use in those examples, as [Markdown code blocks](https://help.github.com/articles/markdown-basics/#multiple-lines).
306* **Describe the current behavior** and **explain which behavior you expected to see instead** and why.
307* **Include screenshots and animated GIFs** which help you demonstrate the steps or point out the part of the project which the suggestion is related to. You can use [this tool](https://www.cockos.com/licecap/) to record GIFs on macOS and Windows, and [this tool](https://github.com/colinkeenan/silentcast) or [this tool](https://github.com/GNOME/byzanz) on Linux.
308* **Explain why this enhancement would be useful** to most users.
309* **List some other text editors or applications where this enhancement exists.**
310* **Specify the name and version of the OS you're using.**
311
312Thank you for considering contributing to Bittensor! Any help is greatly appreciated along this journey to incentivize open and permissionless intelligence.
313
314
315
316---
317File: /contrib/DEVELOPMENT_WORKFLOW.md
318---
319
320# Bittensor Subnet Development Workflow
321
322This is a highly advisable workflow to follow to keep your subtensor project organized and foster ease of contribution.
323
324## Table of contents
325
326- [Bittensor Subnet Development Workflow](#bittensor-subnet-development-workflow)
327 - [Main Branches](#main-branches)
328 - [Development Model](#development-model)
329 - [Feature Branches](#feature-branches)
330 - [Release Branches](#release-branches)
331 - [Hotfix Branches](#hotfix-branches)
332 - [Git Operations](#git-operations)
333 - [Creating a Feature Branch](#creating-a-feature-branch)
334 - [Merging Feature Branch into Staging](#merging-feature-branch-into-staging)
335 - [Creating a Release Branch](#creating-a-release-branch)
336 - [Finishing a Release Branch](#finishing-a-release-branch)
337 - [Creating a Hotfix Branch](#creating-a-hotfix-branch)
338 - [Finishing a Hotfix Branch](#finishing-a-hotfix-branch)
339 - [Continuous Integration (CI) and Continuous Deployment (CD)](#continuous-integration-ci-and-continuous-deployment-cd)
340 - [Versioning and Release Notes](#versioning-and-release-notes)
341 - [Pending Tasks](#pending-tasks)
342
343## Main Branches
344
345Bittensor's codebase consists of two main branches: **main** and **staging**.
346
347**main**
348- This is Bittensor's live production branch, which should only be updated by the core development team. This branch is protected, so refrain from pushing or merging into it unless authorized.
349
350**staging**
351- This branch is continuously updated and is where you propose and merge changes. It's essentially Bittensor's active development branch.
352
353## Development Model
354
355### Feature Branches
356
357- Branch off from: `staging`
358- Merge back into: `staging`
359- Naming convention: `feature/<ticket>/<descriptive-sentence>`
360
361Feature branches are used to develop new features for upcoming or future releases. They exist as long as the feature is in development, but will eventually be merged into `staging` or discarded. Always delete your feature branch after merging to avoid unnecessary clutter.
362
363### Release Branches
364
365- Branch off from: `staging`
366- Merge back into: `staging` and then `main`
367- Naming convention: `release/<version>/<descriptive-message>/<creator's-name>`
368
369Release branches support the preparation of a new production release, allowing for minor bug fixes and preparation of metadata (version number, configuration, etc). All new features should be merged into `staging` and wait for the next big release.
370
371### Hotfix Branches
372
373General workflow:
374
375- Branch off from: `main` or `staging`
376- Merge back into: `staging` then `main`
377- Naming convention: `hotfix/<version>/<descriptive-message>/<creator's-name>`
378
379Hotfix branches are meant for quick fixes in the production environment. When a critical bug in a production version must be resolved immediately, a hotfix branch is created.
380
381## Git Operations
382
383#### Create a feature branch
384
3851. Branch from the **staging** branch.
386 1. Command: `git checkout -b feature/my-feature staging`
387
388> Rebase frequently with the updated staging branch so you do not face big conflicts before submitting your pull request. Remember, syncing your changes with other developers could also help you avoid big conflicts.
389
390#### Merge feature branch into staging
391
392In other words, integrate your changes into a branch that will be tested and prepared for release.
393
3941. Switch branch to staging: `git checkout staging`
3952. Merging feature branch into staging: `git merge --no-ff feature/my-feature`
3963. Pushing changes to staging: `git push origin staging`
3974. Delete feature branch: `git branch -d feature/my-feature` (alternatively, this can be navigated on the GitHub web UI)
398
399This operation is done by GitHub when merging a PR.
400
401So, what you have to keep in mind is:
402- Open the PR against the `staging` branch.
403- After merging a PR you should delete your feature branch. This will be strictly enforced.
404
405#### Creating a release branch
406
4071. Create branch from staging: `git checkout -b release/3.4.0/descriptive-message/creator's_name staging`
4082. Updating version with major or minor: `./scripts/update_version.sh major|minor`
4093. Commit file changes with new version: `git commit -a -m "Updated version to 3.4.0"`
410
411
412#### Finishing a Release Branch
413
414This involves releasing stable code and generating a new version for bittensor.
415
4161. Switch branch to main: `git checkout main`
4172. Merge release branch into main: `git merge --no-ff release/3.4.0/optional-descriptive-message`
4183. Tag changeset: `git tag -a v3.4.0 -m "Releasing v3.4.0: some comment about it"`
4194. Push changes to main: `git push origin main`
4205. Push tags to origin: `git push origin --tags`
421
422To keep the changes made in the __release__ branch, we need to merge those back into `staging`:
423
424- Switch branch to staging: `git checkout staging`.
425- Merging release branch into staging: `git merge --no-ff release/3.4.0/optional-descriptive-message`
426
427This step may well lead to a merge conflict (probably even, since we have changed the version number). If so, fix it and commit.
428
429
430#### Creating a hotfix branch
4311. Create branch from main: `git checkout -b hotfix/3.3.4/descriptive-message/creator's-name main`
4322. Update patch version: `./scripts/update_version.sh patch`
4333. Commit file changes with new version: `git commit -a -m "Updated version to 3.3.4"`
4344. Fix the bug and commit the fix: `git commit -m "Fixed critical production issue X"`
435
436#### Finishing a Hotfix Branch
437
438Finishing a hotfix branch involves merging the bugfix into both `main` and `staging`.
439
4401. Switch branch to main: `git checkout main`
4412. Merge hotfix into main: `git merge --no-ff hotfix/3.3.4/optional-descriptive-message`
4423. Tag new version: `git tag -a v3.3.4 -m "Releasing v3.3.4: descriptive comment about the hotfix"`
4434. Push changes to main: `git push origin main`
4445. Push tags to origin: `git push origin --tags`
4456. Switch branch to staging: `git checkout staging`
4467. Merge hotfix into staging: `git merge --no-ff hotfix/3.3.4/descriptive-message/creator's-name`
4478. Push changes to origin/staging: `git push origin staging`
4489. Delete hotfix branch: `git branch -d hotfix/3.3.4/optional-descriptive-message`
449
450The one exception to the rule here is that, **when a release branch currently exists, the hotfix changes need to be merged into that release branch, instead of** `staging`. Back-merging the bugfix into the __release__ branch will eventually result in the bugfix being merged into `staging` too, when the release branch is finished. (If work in `staging` immediately requires this bugfix and cannot wait for the release branch to be finished, you may safely merge the bugfix into `staging` right away as well.)
451
452Finally, we remove the temporary branch:
453
454- `git branch -d hotfix/3.3.4/optional-descriptive-message`
455## Continuous Integration (CI) and Continuous Deployment (CD)
456
457Continuous Integration (CI) is a software development practice where members of a team integrate their work frequently. Each integration is verified by an automated build and test process to detect integration errors as quickly as possible.
458
459Continuous Deployment (CD) is a software engineering approach in which software functionalities are delivered frequently through automated deployments.
460
461- **CircleCI jobs**: Create jobs in CircleCI to automate merging staging into main and cutting the release version (needed to release code), and building and testing Bittensor (needed to merge PRs).
462
463> It is highly recommended to set up your own CircleCI pipeline for your subnet.
464
465## Versioning and Release Notes
466
467Semantic versioning helps keep track of the different versions of the software. When code is merged into main, generate a new version.
468
469Release notes provide documentation for each version released to the users, highlighting the new features, improvements, and bug fixes. When merged into main, generate GitHub release and release notes.
470
471## Pending Tasks
472
473Follow these steps when you are contributing to the bittensor subnet:
474
475- Determine if main and staging are different
476- Determine what is in staging that is not merged yet
477 - Document not released developments
478 - When merged into staging, generate information about what's merged into staging but not released.
479 - When merged into main, generate GitHub release and release notes.
480- CircleCI jobs
481 - Merge staging into main and release version (needed to release code)
482 - Build and Test Bittensor (needed to merge PRs)
483
484This document can be improved as the Bittensor project continues to develop and change.
485
486
487
488---
489File: /contrib/STYLE.md
490---
491
492# Style Guide
493
494A project’s long-term success rests (among other things) on its maintainability, and a maintainer has few tools more powerful than his or her project’s log. It’s worth taking the time to learn how to care for one properly. What may be a hassle at first soon becomes habit, and eventually a source of pride and productivity for all involved.
495
496Most programming languages have well-established conventions as to what constitutes idiomatic style, i.e. naming, formatting and so on. There are variations on these conventions, of course, but most developers agree that picking one and sticking to it is far better than the chaos that ensues when everybody does their own thing.
497
498# Table of Contents
4991. [Code Style](#code-style)
5002. [Naming Conventions](#naming-conventions)
5013. [Git Commit Style](#git-commit-style)
5024. [The Six Rules of a Great Commit](#the-six-rules-of-a-great-commit)
503 - [1. Atomic Commits](#1-atomic-commits)
504 - [2. Separate Subject from Body with a Blank Line](#2-separate-subject-from-body-with-a-blank-line)
505 - [3. Limit the Subject Line to 50 Characters](#3-limit-the-subject-line-to-50-characters)
506 - [4. Use the Imperative Mood in the Subject Line](#4-use-the-imperative-mood-in-the-subject-line)
507 - [5. Wrap the Body at 72 Characters](#5-wrap-the-body-at-72-characters)
508 - [6. Use the Body to Explain What and Why vs. How](#6-use-the-body-to-explain-what-and-why-vs-how)
5095. [Tools Worth Mentioning](#tools-worth-mentioning)
510 - [Using `--fixup`](#using---fixup)
511 - [Interactive Rebase](#interactive-rebase)
5126. [Pull Request and Squashing Commits Caveats](#pull-request-and-squashing-commits-caveats)
513
514
515### Code style
516
517#### General Style
518Python's official style guide is PEP 8, which provides conventions for writing code for the main Python distribution. Here are some key points:
519
520- `Indentation:` Use 4 spaces per indentation level.
521
522- `Line Length:` Limit all lines to a maximum of 79 characters.
523
524- `Blank Lines:` Surround top-level function and class definitions with two blank lines. Method definitions inside a class are surrounded by a single blank line.
525
526- `Imports:` Imports should usually be on separate lines and should be grouped in the following order:
527
528 - Standard library imports.
529 - Related third party imports.
530 - Local application/library specific imports.
531- `Whitespace:` Avoid extraneous whitespace in the following situations:
532
533 - Immediately inside parentheses, brackets or braces.
534 - Immediately before a comma, semicolon, or colon.
535 - Immediately before the open parenthesis that starts the argument list of a function call.
536- `Comments:` Comments should be complete sentences and should be used to clarify code; they are not a substitute for clearly written code.
537
538#### For Python
539
540- `List Comprehensions:` Use list comprehensions for concise and readable creation of lists.
541
542- `Generators:` Use generators when dealing with large amounts of data to save memory.
543
544- `Context Managers:` Use context managers (with statement) for resource management.
545
546- `String Formatting:` Use f-strings for formatting strings in Python 3.6 and above.
547
548- `Error Handling:` Use exceptions for error handling whenever possible.
549
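As a minimal sketch, for illustration only, here is a small function that combines several of these idioms (the file format and all names are hypothetical, not taken from this codebase):

```python
# Illustrative only: context manager, list comprehension, f-string, exceptions.
from pathlib import Path


def summarize_scores(path: str) -> str:
    # Context manager closes the file even if an error occurs.
    with Path(path).open() as fh:
        # List comprehension builds the list of scores concisely.
        scores = [float(line) for line in fh if line.strip()]

    if not scores:
        # Signal error conditions with exceptions rather than sentinel values.
        raise ValueError(f"no scores found in {path}")

    # f-string formatting (Python 3.6+).
    return f"{len(scores)} scores, mean {sum(scores) / len(scores):.2f}"
```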
550#### More details
551
552Use `black` to format your Python code before committing; it keeps formatting consistent across a large pool of contributors. Black's code [style](https://black.readthedocs.io/en/stable/the_black_code_style/current_style.html#code-style) ensures consistent and opinionated code formatting. It automatically formats your Python code according to the Black style guide, enhancing code readability and maintainability.
553
554Key Features of Black:
555
556 Consistency: Black enforces a single, consistent coding style across your project, eliminating style debates and allowing developers to focus on code logic.
557
558 Readability: By applying a standard formatting style, Black improves code readability, making it easier to understand and collaborate on projects.
559
560 Automation: Black automates the code formatting process, saving time and effort. It eliminates the need for manual formatting and reduces the likelihood of inconsistencies.
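As a rough before/after illustration (a hypothetical function, not taken from this codebase), Black would turn the first definition below into the second:

```python
# Hypothetical example: the same function before and after running `black`.

# Before formatting:
def normalize_weights( weights,total ) :
    return [ w/total for w in weights ]

# After running `black .`:
def normalize_weights(weights, total):
    return [w / total for w in weights]
```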
561
562### Naming Conventions
563
564- `Classes:` Class names should normally use the CapWords Convention.
565- `Functions and Variables:` Function names should be lowercase, with words separated by underscores as necessary to improve readability. Variable names follow the same convention as function names.
566
567- `Constants:` Constants are usually defined on a module level and written in all capital letters with underscores separating words.
568
569- `Non-public Methods and Instance Variables:` Use a single leading underscore (_). This is a weak "internal use" indicator.
570
571- `Strongly "private" methods and variables:` Use a double leading underscore (__). This triggers name mangling in Python.
572
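A compact sketch of these conventions in one place (all names are hypothetical):

```python
# Hypothetical names, purely to illustrate the conventions above.
MAX_RETRIES = 3  # constant: all caps with underscores


class ExecutorRegistry:  # class: CapWords
    def __init__(self):
        self._cache = {}          # single leading underscore: internal use
        self.__secret_token = ""  # double leading underscore: name-mangled, "private"

    def register_executor(self, executor_id: str) -> None:  # method: snake_case
        retry_count = 0           # variable: snake_case
        self._cache[executor_id] = retry_count

    def _evict_stale_entries(self) -> None:  # non-public helper method
        self._cache.clear()
```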
573
574### Git commit style
575
576Here’s a model Git commit message when contributing:
577```
578Summarize changes in around 50 characters or less
579
580More detailed explanatory text, if necessary. Wrap it to about 72
581characters or so. In some contexts, the first line is treated as the
582subject of the commit and the rest of the text as the body. The
583blank line separating the summary from the body is critical (unless
584you omit the body entirely); various tools like `log`, `shortlog`
585and `rebase` can get confused if you run the two together.
586
587Explain the problem that this commit is solving. Focus on why you
588are making this change as opposed to how (the code explains that).
589Are there side effects or other unintuitive consequences of this
590change? Here's the place to explain them.
591
592Further paragraphs come after blank lines.
593
594 - Bullet points are okay, too
595
596 - Typically a hyphen or asterisk is used for the bullet, preceded
597 by a single space, with blank lines in between, but conventions
598 vary here
599
600If you use an issue tracker, put references to them at the bottom,
601like this:
602
603Resolves: #123
604See also: #456, #789
605```
606
607
608## The six rules of a great commit
609
610#### 1. Atomic Commits
611An “atomic” change revolves around one task or one fix.
612
613Atomic Approach
614 - Commit each fix or task as a separate change
615 - Only commit when a block of work is complete
616 - Commit each layout change separately
617 - Joint commit for layout file, code behind file, and additional resources
618
619Benefits
620
621- Easy to roll back without affecting other changes
622- Easy to make other changes on the fly
623- Easy to merge features to other branches
624
625#### Avoid trivial commit messages
626
627Commit messages like "fix", "fix2", or "fix3" don't provide any context or clear understanding of what changes the commit introduces. Here are some examples of good vs. bad commit messages:
628
629**Bad Commit Message:**
630
631 $ git commit -m "fix"
632
633**Good Commit Message:**
634
635 $ git commit -m "Fix typo in README file"
636
637> **Caveat**: When working with new features, an atomic commit will often consist of multiple files, since a layout file, code behind file, and additional resources may have been added/modified. You don’t want to commit all of these separately, because if you had to roll back the application to a state before the feature was added, it would involve multiple commit entries, and that can get confusing.
638
639#### 2. Separate subject from body with a blank line
640
641Not every commit requires both a subject and a body. Sometimes a single line is fine, especially when the change is so simple that no further context is necessary.
642
643For example:
644
645 Fix typo in introduction to user guide
646
647Nothing more need be said; if the reader wonders what the typo was, she can simply take a look at the change itself, i.e. use git show or git diff or git log -p.
648
649If you’re committing something like this at the command line, it’s easy to use the -m option to git commit:
650
651 $ git commit -m"Fix typo in introduction to user guide"
652
653However, when a commit merits a bit of explanation and context, you need to write a body. For example:
654
655 Derezz the master control program
656
657 MCP turned out to be evil and had become intent on world domination.
658 This commit throws Tron's disc into MCP (causing its deresolution)
659 and turns it back into a chess game.
660
661Commit messages with bodies are not so easy to write with the -m option. You’re better off writing the message in a proper text editor. [See Pro Git](https://git-scm.com/book/en/v2/Customizing-Git-Git-Configuration).
662
663In any case, the separation of subject from body pays off when browsing the log. Here’s the full log entry:
664
665 $ git log
666 commit 42e769bdf4894310333942ffc5a15151222a87be
667 Author: Kevin Flynn <[email protected]>
668 Date: Fri Jan 01 00:00:00 1982 -0200
669
670 Derezz the master control program
671
672 MCP turned out to be evil and had become intent on world domination.
673 This commit throws Tron's disc into MCP (causing its deresolution)
674 and turns it back into a chess game.
675
676
677#### 3. Limit the subject line to 50 characters
67850 characters is not a hard limit, just a rule of thumb. Keeping subject lines at this length ensures that they are readable, and forces the author to think for a moment about the most concise way to explain what’s going on.
679
680GitHub’s UI is fully aware of these conventions. It will warn you if you go past the 50 character limit, and it will truncate any subject line longer than 72 characters with an ellipsis, so keeping subject lines to 50 characters is best practice.
681
682#### 4. Use the imperative mood in the subject line
683Imperative mood just means “spoken or written as if giving a command or instruction”. A few examples:
684
685 Clean your room
686 Close the door
687 Take out the trash
688
689Each of the six rules you’re reading about right now is written in the imperative (“Wrap the body at 72 characters”, etc.).
690
691The imperative can sound a little rude; that’s why we don’t often use it. But it’s perfect for Git commit subject lines. One reason for this is that Git itself uses the imperative whenever it creates a commit on your behalf.
692
693For example, the default message created when using git merge reads:
694
695 Merge branch 'myfeature'
696
697And when using git revert:
698
699 Revert "Add the thing with the stuff"
700
701 This reverts commit cc87791524aedd593cff5a74532befe7ab69ce9d.
702
703Or when clicking the “Merge” button on a GitHub pull request:
704
705 Merge pull request #123 from someuser/somebranch
706
707So when you write your commit messages in the imperative, you’re following Git’s own built-in conventions. For example:
708
709 Refactor subsystem X for readability
710 Update getting started documentation
711 Remove deprecated methods
712 Release version 1.0.0
713
714Writing this way can be a little awkward at first. We’re more used to speaking in the indicative mood, which is all about reporting facts. That’s why commit messages often end up reading like this:
715
716 Fixed bug with Y
717 Changing behavior of X
718
719And sometimes commit messages get written as a description of their contents:
720
721 More fixes for broken stuff
722 Sweet new API methods
723
724To remove any confusion, here’s a simple rule to get it right every time.
725
726**A properly formed Git commit subject line should always be able to complete the following sentence:**
727
728 If applied, this commit will <your subject line here>
729
730For example:
731
732 If applied, this commit will refactor subsystem X for readability
733 If applied, this commit will update getting started documentation
734 If applied, this commit will remove deprecated methods
735 If applied, this commit will release version 1.0.0
736 If applied, this commit will merge pull request #123 from user/branch
737
738#### 5. Wrap the body at 72 characters
739Git never wraps text automatically. When you write the body of a commit message, you must mind its right margin, and wrap text manually.
740
741The recommendation is to do this at 72 characters, so that Git has plenty of room to indent text while still keeping everything under 80 characters overall.
742
743A good text editor can help here. It’s easy to configure Vim, for example, to wrap text at 72 characters when you’re writing a Git commit.
744
745#### 6. Use the body to explain what and why vs. how
746This [commit](https://github.com/bitcoin/bitcoin/commit/eb0b56b19017ab5c16c745e6da39c53126924ed6) from Bitcoin Core is a great example of explaining what changed and why:
747
748```
749commit eb0b56b19017ab5c16c745e6da39c53126924ed6
750Author: Pieter Wuille <[email protected]>
751Date: Fri Aug 1 22:57:55 2014 +0200
752
753 Simplify serialize.h's exception handling
754
755 Remove the 'state' and 'exceptmask' from serialize.h's stream
756 implementations, as well as related methods.
757
758 As exceptmask always included 'failbit', and setstate was always
759 called with bits = failbit, all it did was immediately raise an
760 exception. Get rid of those variables, and replace the setstate
761 with direct exception throwing (which also removes some dead
762 code).
763
764 As a result, good() is never reached after a failure (there are
765 only 2 calls, one of which is in tests), and can just be replaced
766 by !eof().
767
768 fail(), clear(n) and exceptions() are just never called. Delete
769 them.
770```
771
772Take a look at the [full diff](https://github.com/bitcoin/bitcoin/commit/eb0b56b19017ab5c16c745e6da39c53126924ed6) and just think how much time the author is saving fellow and future committers by taking the time to provide this context here and now. If he didn’t, it would probably be lost forever.
773
774In most cases, you can leave out details about how a change has been made. Code is generally self-explanatory in this regard (and if the code is so complex that it needs to be explained in prose, that’s what source comments are for). Just focus on making clear the reasons why you made the change in the first place—the way things worked before the change (and what was wrong with that), the way they work now, and why you decided to solve it the way you did.
775
776The future maintainer that thanks you may be yourself!
777
778
779
780#### Tools worth mentioning
781
782##### Using `--fixup`
783
784If you've made a commit and then realize you've missed something or made a minor mistake, you can use the `--fixup` option.
785
786For example, suppose you've made a commit with a hash `9fceb02`. Later, you realize you've left a debug statement in your code. Instead of making a new commit titled "remove debug statement" or "fix", you can do the following:
787
788 $ git commit --fixup 9fceb02
789
790After staging your fix, this creates a new commit whose message reads `fixup! <subject of the original commit>`.
791
792##### Interactive Rebase
793
794Interactive rebase, or `rebase -i`, can be used to squash these fixup commits into the original commits they're fixing, which cleans up your commit history. You can use the `--autosquash` option to automatically squash any commits whose subject starts with `fixup!` into their target commits.
795
796For example:
797
798 $ git rebase -i --autosquash HEAD~5
799
800This command starts an interactive rebase for the last 5 commits (`HEAD~5`). Any commits marked as "fixup" will be automatically moved to squash with their target commits.
801
802The benefit of using `--fixup` and interactive rebase is that it keeps your commit history clean and readable. It groups fixes with the commits they are related to, rather than having a separate "fix" commit that might not make sense to other developers (or even to you) in the future.
803
804
805---
806
807#### Pull Request and Squashing Commits Caveats
808
809While atomic commits are great for development and for understanding the changes within the branch, the commit history can get messy when merging to the main branch. To keep a cleaner and more understandable commit history in our main branch, we encourage squashing all the commits of a PR into one when merging.
810
811This single commit should provide an overview of the changes that the PR introduced. It should follow the guidelines for atomic commits (an atomic commit is complete, self-contained, and understandable) but on the scale of the entire feature, task, or fix that the PR addresses. This approach combines the benefits of atomic commits during development with a clean commit history in our main branch.
812
813Here is how you can squash commits:
814
815```bash
816git rebase -i HEAD~n
817```
818
819where `n` is the number of commits to squash. After running the command, replace `pick` with `squash` for the commits you want to squash into the previous commit. This will combine the commits and allow you to write a new commit message.
820
821In this context, an atomic commit message could look like:
822
823```
824Add feature X
825
826This commit introduces feature X which does A, B, and C. It adds
827new files for layout, updates the code behind the file, and introduces
828new resources. This change is important because it allows users to
829perform task Y more efficiently.
830
831It includes:
832- Creation of new layout file
833- Updates in the code-behind file
834- Addition of new resources
835
836Resolves: #123
837```
838
839In your PRs, remember to detail what the PR is introducing or fixing. This will be helpful for reviewers to understand the context and the reason behind the changes.
840
841
842
843---
844File: /datura/datura/consumers/base.py
845---
846
847import abc
848import logging
849
850from fastapi import WebSocket, WebSocketDisconnect
851
852from ..requests.base import BaseRequest
853
854logger = logging.getLogger(__name__)
855
856
857class BaseConsumer(abc.ABC):
858 def __init__(self, websocket: WebSocket):
859 self.websocket = websocket
860
861 @abc.abstractmethod
862 def accepted_request_type(self) -> type[BaseRequest]:
863 pass
864
865 async def connect(self):
866 await self.websocket.accept()
867
868 async def receive_message(self) -> BaseRequest:
869 data = await self.websocket.receive_text()
870 return self.accepted_request_type().parse(data)
871
872 async def send_message(self, msg: BaseRequest):
873 await self.websocket.send_text(msg.json())
874
875 async def disconnect(self):
876 try:
877 await self.websocket.close()
878 except Exception:
879 pass
880
881 @abc.abstractmethod
882 async def handle_message(self, data: BaseRequest):
883 raise NotImplementedError
884
885 async def handle(self):
886 # await self.connect()
887 try:
888 while True:
889 data: BaseRequest = await self.receive_message()
890 await self.handle_message(data)
891 except WebSocketDisconnect as ex:
892 logger.info("Websocket connection closed, e: %s", str(ex))
893 await self.disconnect()
894 except Exception as ex:
895 logger.info("Handling message error: %s", str(ex))
896 await self.disconnect()
897
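The abstract consumer above leaves two hooks to implement: `accepted_request_type()` and `handle_message()`. Note that `handle()` does not call `connect()`, so the caller must accept the socket first. A minimal sketch of wiring a concrete consumer into a FastAPI WebSocket route could look like this (the route path, class name, and the choice of `BaseValidatorRequest` as the accepted type are assumptions for illustration):

```python
# Hypothetical sketch, not part of the repository.
from fastapi import FastAPI, WebSocket

from datura.consumers.base import BaseConsumer
from datura.requests.base import BaseRequest
from datura.requests.validator_requests import BaseValidatorRequest


class EchoConsumer(BaseConsumer):
    def accepted_request_type(self) -> type[BaseRequest]:
        # receive_message() calls .parse() on this class, which dispatches
        # to the concrete validator request subclass by message_type.
        return BaseValidatorRequest

    async def handle_message(self, data: BaseRequest):
        # Trivial handler: send the decoded request straight back.
        await self.send_message(data)


app = FastAPI()


@app.websocket("/ws")
async def websocket_endpoint(websocket: WebSocket):
    consumer = EchoConsumer(websocket)
    await consumer.connect()  # handle() does not accept the socket itself
    await consumer.handle()
```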
898
899
900---
901File: /datura/datura/errors/__init__.py
902---
903
904
905
906
907---
908File: /datura/datura/errors/protocol.py
909---
910
911from datura.requests.base import BaseRequest
912
913
914class UnsupportedMessageReceived(Exception):
915 def __init__(self, msg: BaseRequest):
916 self.msg = msg
917
918 def __str__(self):
919 return f"{type(self).__name__}: {self.msg.json()}"
920
921 __repr__ = __str__
922
923
924
925---
926File: /datura/datura/requests/base.py
927---
928
929import abc
930import enum
931import json
932
933import pydantic
934
935
936class ValidationError(Exception):
937 def __init__(self, msg):
938 self.msg = msg
939
940 @classmethod
941 def from_json_decode_error(cls, exc: json.JSONDecodeError):
942 return cls(exc.args[0])
943
944 @classmethod
945 def from_pydantic_validation_error(cls, exc: pydantic.ValidationError):
946 return cls(json.dumps(exc.json()))
947
948 def __repr__(self):
949 return f"{type(self).__name__}({self.msg})"
950
951
952def all_subclasses(cls: type):
953 for subcls in cls.__subclasses__():
954 yield subcls
955 yield from all_subclasses(subcls)
956
957
958base_class_to_request_type_mapping = {}
959
960
961class BaseRequest(pydantic.BaseModel, abc.ABC):
962 message_type: enum.Enum
963
964 @classmethod
965 def type_to_model(cls, type_: enum.Enum) -> type["BaseRequest"]:
966 mapping = base_class_to_request_type_mapping.get(cls)
967 if not mapping:
968 mapping = {}
969 for klass in all_subclasses(cls):
970 if not (message_type := klass.__fields__.get("message_type")):
971 continue
972 if not message_type.default:
973 continue
974 mapping[message_type.default] = klass
975 base_class_to_request_type_mapping[cls] = mapping
976
977 return mapping[type_]
978
979 @classmethod
980 def parse(cls, str_: str):
981 try:
982 json_ = json.loads(str_)
983 except json.JSONDecodeError as exc:
984 raise ValidationError.from_json_decode_error(exc)
985
986 try:
987 base_model_object = cls.parse_obj(json_)
988 except pydantic.ValidationError as exc:
989 raise ValidationError.from_pydantic_validation_error(exc)
990
991 target_model = cls.type_to_model(base_model_object.message_type)
992
993 try:
994 return target_model.parse_obj(json_)
995 except pydantic.ValidationError as exc:
996 raise ValidationError.from_pydantic_validation_error(exc)
997
998
999
1000---
1001File: /datura/datura/requests/miner_requests.py
1002---
1003
1004import enum
1005
1006import pydantic
1007from datura.requests.base import BaseRequest
1008
1009
1010class RequestType(enum.Enum):
1011 GenericError = "GenericError"
1012 AcceptJobRequest = "AcceptJobRequest"
1013 DeclineJobRequest = "DeclineJobRequest"
1014 AcceptSSHKeyRequest = "AcceptSSHKeyRequest"
1015 FailedRequest = "FailedRequest"
1016 UnAuthorizedRequest = "UnAuthorizedRequest"
1017 SSHKeyRemoved = "SSHKeyRemoved"
1018
1019
1020class Executor(pydantic.BaseModel):
1021 uuid: str
1022 address: str
1023 port: int
1024
1025
1026class BaseMinerRequest(BaseRequest):
1027 message_type: RequestType
1028
1029
1030class GenericError(BaseMinerRequest):
1031 message_type: RequestType = RequestType.GenericError
1032 details: str | None = None
1033
1034
1035class AcceptJobRequest(BaseMinerRequest):
1036 message_type: RequestType = RequestType.AcceptJobRequest
1037 executors: list[Executor]
1038
1039
1040class DeclineJobRequest(BaseMinerRequest):
1041 message_type: RequestType = RequestType.DeclineJobRequest
1042
1043
1044class ExecutorSSHInfo(pydantic.BaseModel):
1045 uuid: str
1046 address: str
1047 port: int
1048 ssh_username: str
1049 ssh_port: int
1050 python_path: str
1051 root_dir: str
1052 port_range: str | None = None
1053 port_mappings: str | None = None
1054
1055class AcceptSSHKeyRequest(BaseMinerRequest):
1056 message_type: RequestType = RequestType.AcceptSSHKeyRequest
1057 executors: list[ExecutorSSHInfo]
1058
1059
1060class SSHKeyRemoved(BaseMinerRequest):
1061 message_type: RequestType = RequestType.SSHKeyRemoved
1062
1063
1064class FailedRequest(BaseMinerRequest):
1065 message_type: RequestType = RequestType.FailedRequest
1066 details: str | None = None
1067
1068
1069class UnAuthorizedRequest(FailedRequest):
1070 message_type: RequestType = RequestType.UnAuthorizedRequest
1071
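For context, here is a hedged sketch of how `BaseRequest.parse()` (defined in `datura/requests/base.py` above) resolves one of these message types; the JSON payload is invented for illustration:

```python
# Hypothetical payload, for illustration only.
from datura.requests.miner_requests import AcceptJobRequest, BaseMinerRequest

raw = (
    '{"message_type": "AcceptJobRequest", '
    '"executors": [{"uuid": "abc-123", "address": "10.0.0.1", "port": 8001}]}'
)

# parse() validates the JSON against the base model, reads message_type,
# then re-validates against the matching subclass (AcceptJobRequest here).
request = BaseMinerRequest.parse(raw)
assert isinstance(request, AcceptJobRequest)
print(request.executors[0].address)  # "10.0.0.1"
```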
1072
1073
1074---
1075File: /datura/datura/requests/validator_requests.py
1076---
1077
1078import enum
1079import json
1080from typing import Optional
1081
1082import pydantic
1083from datura.requests.base import BaseRequest
1084
1085
1086class RequestType(enum.Enum):
1087 AuthenticateRequest = "AuthenticateRequest"
1088 SSHPubKeySubmitRequest = "SSHPubKeySubmitRequest"
1089 SSHPubKeyRemoveRequest = "SSHPubKeyRemoveRequest"
1090
1091
1092class BaseValidatorRequest(BaseRequest):
1093 message_type: RequestType
1094
1095
1096class AuthenticationPayload(pydantic.BaseModel):
1097 validator_hotkey: str
1098 miner_hotkey: str
1099 timestamp: int
1100
1101 def blob_for_signing(self):
1102 instance_dict = self.model_dump()
1103 return json.dumps(instance_dict, sort_keys=True)
1104
1105
1106class AuthenticateRequest(BaseValidatorRequest):
1107 message_type: RequestType = RequestType.AuthenticateRequest
1108 payload: AuthenticationPayload
1109 signature: str
1110
1111 def blob_for_signing(self):
1112 return self.payload.blob_for_signing()
1113
1114
1115class SSHPubKeySubmitRequest(BaseValidatorRequest):
1116 message_type: RequestType = RequestType.SSHPubKeySubmitRequest
1117 public_key: bytes
1118 executor_id: Optional[str] = None
1119
1120
1121class SSHPubKeyRemoveRequest(BaseValidatorRequest):
1122 message_type: RequestType = RequestType.SSHPubKeyRemoveRequest
1123 public_key: bytes
1124 executor_id: Optional[str] = None
1125
1126
1127
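The authentication flow above signs only the `AuthenticationPayload` blob, so the receiving miner can verify the signature knowing nothing but the claimed validator hotkey address. A minimal sketch of that handshake (illustrative only; the throwaway keypair and the miner hotkey below are placeholders, not repository values):

```python
# Illustrative sketch of building, signing, and verifying an AuthenticateRequest (not repository code).
import time

import bittensor
from datura.requests.validator_requests import AuthenticateRequest, AuthenticationPayload

# A throwaway keypair stands in for the validator hotkey.
validator_keypair = bittensor.Keypair.create_from_mnemonic(bittensor.Keypair.generate_mnemonic())

payload = AuthenticationPayload(
    validator_hotkey=validator_keypair.ss58_address,
    miner_hotkey="miner-hotkey-placeholder",  # placeholder miner hotkey
    timestamp=int(time.time()),
)
request = AuthenticateRequest(
    payload=payload,
    signature=f"0x{validator_keypair.sign(payload.blob_for_signing()).hex()}",
)

# Receiving side: rebuild a verify-only keypair from the claimed address and check the signature,
# mirroring the miner-side consumer check shown later in this listing.
verifier = bittensor.Keypair(ss58_address=request.payload.validator_hotkey)
assert verifier.verify(request.blob_for_signing(), request.signature)
```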
1128---
1129File: /datura/datura/__init__.py
1130---
1131
1132
1133
1134
1135---
1136File: /datura/tests/__init__.py
1137---
1138
1139
1140
1141
1142---
1143File: /datura/README.md
1144---
1145
1146# datura
1147
1148
1149
1150---
1151File: /neurons/executor/src/core/__init__.py
1152---
1153
1154
1155
1156
1157---
1158File: /neurons/executor/src/core/config.py
1159---
1160
1161from typing import Optional
1162from pydantic import Field
1163from pydantic_settings import BaseSettings, SettingsConfigDict
1164
1165
1166class Settings(BaseSettings):
1167 model_config = SettingsConfigDict(env_file=".env", extra="ignore")
1168 PROJECT_NAME: str = "compute-subnet-executor"
1169
1170 INTERNAL_PORT: int = Field(env="INTERNAL_PORT", default=8001)
1171 SSH_PORT: int = Field(env="SSH_PORT", default=2200)
1172 SSH_PUBLIC_PORT: Optional[int] = Field(env="SSH_PUBLIC_PORT", default=None)
1173
1174 MINER_HOTKEY_SS58_ADDRESS: str = Field(env="MINER_HOTKEY_SS58_ADDRESS")
1175
1176 RENTING_PORT_RANGE: Optional[str] = Field(env="RENTING_PORT_RANGE", default=None)
1177 RENTING_PORT_MAPPINGS: Optional[str] = Field(env="RENTING_PORT_MAPPINGS", default=None)
1178
1179 ENV: str = Field(env="ENV", default="dev")
1180
1181
1182settings = Settings()
1183
1184
1185
1186---
1187File: /neurons/executor/src/core/logger.py
1188---
1189
1190import logging
1191import json
1192
1193
1194def get_logger(name: str):
1195 logger = logging.getLogger(name)
1196 handler = logging.StreamHandler()
1197 formatter = logging.Formatter(
1198 "Name: %(name)s | Time: %(asctime)s | Level: %(levelname)s | File: %(filename)s | Function: %(funcName)s | Line: %(lineno)s | Process: %(process)d | Message: %(message)s"
1199 )
1200 handler.setFormatter(formatter)
1201 logger.addHandler(handler)
1202 logger.setLevel(logging.INFO)
1203 return logger
1204
1205
1206class StructuredMessage:
1207 def __init__(self, message, extra: dict):
1208 self.message = message
1209 self.extra = extra
1210
1211 def __str__(self):
1212 return "%s >>> %s" % (self.message, json.dumps(self.extra)) # noqa
1213
1214
1215_m = StructuredMessage
1216
1217
1218
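The `StructuredMessage` helper (`_m`) above appends a JSON blob to the human-readable message. A tiny illustration (assuming the executor's `src` directory is on the import path):

```python
# Illustrative use of the structured-message helper defined above (not repository code).
from core.logger import _m, get_logger

logger = get_logger("example")
logger.info(_m("ssh key uploaded", extra={"ssh_port": 2200, "user": "root"}))
# Emits roughly: '... Message: ssh key uploaded >>> {"ssh_port": 2200, "user": "root"}'
```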
1219---
1220File: /neurons/executor/src/middlewares/__init__.py
1221---
1222
1223
1224
1225
1226---
1227File: /neurons/executor/src/middlewares/miner.py
1228---
1229
1230import bittensor
1231from fastapi.responses import JSONResponse
1232from payloads.miner import MinerAuthPayload
1233from pydantic import ValidationError
1234from starlette.middleware.base import BaseHTTPMiddleware
1235
1236from core.config import settings
1237from core.logger import _m, get_logger
1238
1239logger = get_logger(__name__)
1240
1241
1242class MinerMiddleware(BaseHTTPMiddleware):
1243 def __init__(self, app) -> None:
1244 super().__init__(app)
1245
1246 async def dispatch(self, request, call_next):
1247 try:
1248 body_bytes = await request.body()
1249 miner_ip = request.client.host
1250 default_extra = {"miner_ip": miner_ip}
1251
1252 # Parse it into the Pydantic model
1253 payload = MinerAuthPayload.model_validate_json(body_bytes)
1254
1255 logger.info(_m("miner ip", extra=default_extra))
1256
1257 keypair = bittensor.Keypair(ss58_address=settings.MINER_HOTKEY_SS58_ADDRESS)
1258 if not keypair.verify(payload.public_key, payload.signature):
1259 logger.error(
1260 _m(
1261 "Auth failed. incorrect signature",
1262 extra={
1263 **default_extra,
1264 "signature": payload.signature,
1265 "public_key": payload.public_key,
1266 "miner_hotkey": settings.MINER_HOTKEY_SS58_ADDRESS,
1267 },
1268 )
1269 )
1270 return JSONResponse(status_code=401, content="Unauthorized")
1271
1272 response = await call_next(request)
1273 return response
1274 except ValidationError as e:
1275 # Handle validation error if needed
1276 error_message = str(_m("Validation Error", extra={"errors": str(e.errors())}))
1277 logger.error(error_message)
1278 return JSONResponse(status_code=422, content=error_message)
1279
1280
1281
1282---
1283File: /neurons/executor/src/payloads/__init__.py
1284---
1285
1286
1287
1288
1289---
1290File: /neurons/executor/src/payloads/miner.py
1291---
1292
1293from pydantic import BaseModel
1294
1295
1296class MinerAuthPayload(BaseModel):
1297 public_key: str
1298 signature: str
1299
1300
1301
1302---
1303File: /neurons/executor/src/routes/__init__.py
1304---
1305
1306
1307
1308
1309---
1310File: /neurons/executor/src/routes/apis.py
1311---
1312
1313from typing import Annotated
1314
1315from fastapi import APIRouter, Depends
1316from services.miner_service import MinerService
1317
1318from payloads.miner import MinerAuthPayload
1319
1320apis_router = APIRouter()
1321
1322
1323@apis_router.post("/upload_ssh_key")
1324async def upload_ssh_key(
1325 payload: MinerAuthPayload, miner_service: Annotated[MinerService, Depends(MinerService)]
1326):
1327 return await miner_service.upload_ssh_key(payload)
1328
1329
1330@apis_router.post("/remove_ssh_key")
1331async def remove_ssh_key(
1332 payload: MinerAuthPayload, miner_service: Annotated[MinerService, Depends(MinerService)]
1333):
1334 return await miner_service.remove_ssh_key(payload)
1335
1336
1337
1338---
1339File: /neurons/executor/src/services/miner_service.py
1340---
1341
1342import asyncio
1343import sys
1344import logging
1345from pathlib import Path
1346
1347from typing import Annotated
1348from fastapi import Depends
1349
1350from core.config import settings
1351from services.ssh_service import SSHService
1352
1353from payloads.miner import MinerAuthPayload
1354
1355logger = logging.getLogger(__name__)
1356
1357
1358class MinerService:
1359 def __init__(
1360 self,
1361 ssh_service: Annotated[SSHService, Depends(SSHService)],
1362 ):
1363 self.ssh_service = ssh_service
1364
1365    async def upload_ssh_key(self, payload: MinerAuthPayload):
1366        self.ssh_service.add_pubkey_to_host(payload.public_key)
1367
1368 return {
1369 "ssh_username": self.ssh_service.get_current_os_user(),
1370 "ssh_port": settings.SSH_PUBLIC_PORT or settings.SSH_PORT,
1371 "python_path": sys.executable,
1372 "root_dir": str(Path(__file__).resolve().parents[2]),
1373 "port_range": settings.RENTING_PORT_RANGE,
1374 "port_mappings": settings.RENTING_PORT_MAPPINGS
1375 }
1376
1377    async def remove_ssh_key(self, payload: MinerAuthPayload):
1378        return self.ssh_service.remove_pubkey_from_host(payload.public_key)
1379
1380
1381
1382---
1383File: /neurons/executor/src/services/ssh_service.py
1384---
1385
1386import getpass
1387import os
1388
1389
1390class SSHService:
1391 def add_pubkey_to_host(self, pub_key: str):
1392 with open(os.path.expanduser("~/.ssh/authorized_keys"), "a") as file:
1393 file.write(pub_key + "\n")
1394
1395 def remove_pubkey_from_host(self, pub_key: str):
1396 authorized_keys_path = os.path.expanduser("~/.ssh/authorized_keys")
1397
1398 with open(authorized_keys_path, "r") as file:
1399 lines = file.readlines()
1400
1401 with open(authorized_keys_path, "w") as file:
1402 for line in lines:
1403 if line.strip() != pub_key:
1404 file.write(line)
1405
1406 def get_current_os_user(self) -> str:
1407 return getpass.getuser()
1408
1409
1410
1411---
1412File: /neurons/executor/src/executor.py
1413---
1414
1415import logging
1416
1417from fastapi import FastAPI
1418import uvicorn
1419
1420from core.config import settings
1421from middlewares.miner import MinerMiddleware
1422from routes.apis import apis_router
1423
1424# Set up logging
1425logging.basicConfig(level=logging.INFO)
1426
1427app = FastAPI(
1428 title=settings.PROJECT_NAME,
1429)
1430
1431app.add_middleware(MinerMiddleware)
1432app.include_router(apis_router)
1433
1434reload = settings.ENV == "dev"
1435
1436if __name__ == "__main__":
1437 uvicorn.run("executor:app", host="0.0.0.0", port=settings.INTERNAL_PORT, reload=reload)
1438
1439
1440
1441---
1442File: /neurons/executor/src/gpus_utility.py
1443---
1444
1445import asyncio
1446import logging
1447import time
1448
1449import aiohttp
1450import click
1451import pynvml
1452import psutil
1453
1454logger = logging.getLogger(__name__)
1455logger.setLevel(logging.INFO)
1456
1457
1458class GPUMetricsTracker:
1459 def __init__(self, threshold_percent: float = 10.0):
1460 self.previous_metrics: dict[int, dict] = {}
1461 self.threshold = threshold_percent
1462
1463 def has_significant_change(self, gpu_id: int, util: float, mem_used: float) -> bool:
1464 if gpu_id not in self.previous_metrics:
1465 self.previous_metrics[gpu_id] = {"util": util, "mem_used": mem_used}
1466 return True
1467
1468 prev = self.previous_metrics[gpu_id]
1469 util_diff = abs(util - prev["util"])
1470        mem_diff_percent = abs(mem_used - prev["mem_used"]) / max(prev["mem_used"], 1) * 100  # guard against division by zero
1471
1472 if util_diff >= self.threshold or mem_diff_percent >= self.threshold:
1473 self.previous_metrics[gpu_id] = {"util": util, "mem_used": mem_used}
1474 return True
1475 return False
1476
1477
1478async def scrape_gpu_metrics(
1479 interval: int,
1480 program_id: str,
1481 signature: str,
1482 executor_id: str,
1483 validator_hotkey: str,
1484 compute_rest_app_url: str,
1485):
1486 try:
1487 pynvml.nvmlInit()
1488 device_count = pynvml.nvmlDeviceGetCount()
1489 if device_count == 0:
1490 logger.warning("No NVIDIA GPUs found in the system")
1491 return
1492 except pynvml.NVMLError as e:
1493 logger.error(f"Failed to initialize NVIDIA Management Library: {e}")
1494 logger.error(
1495 "This might be because no NVIDIA GPU is present or drivers are not properly installed"
1496 )
1497 return
1498
1499 http_url = f"{compute_rest_app_url}/validator/{validator_hotkey}/update-gpu-metrics"
1500 logger.info(f"Will send metrics to: {http_url}")
1501
1502 # Initialize the tracker
1503 tracker = GPUMetricsTracker(threshold_percent=10.0)
1504
1505 async with aiohttp.ClientSession() as session:
1506 logger.info(f"Scraping metrics for {device_count} GPUs...")
1507 try:
1508 while True:
1509 try:
1510 gpu_utilization = []
1511 should_send = False
1512
1513 for i in range(device_count):
1514 handle = pynvml.nvmlDeviceGetHandleByIndex(i)
1515
1516 name = pynvml.nvmlDeviceGetName(handle)
1517 if isinstance(name, bytes):
1518 name = name.decode("utf-8")
1519
1520 utilization = pynvml.nvmlDeviceGetUtilizationRates(handle)
1521 memory = pynvml.nvmlDeviceGetMemoryInfo(handle)
1522
1523 gpu_util = utilization.gpu
1524 mem_used = memory.used
1525 mem_total = memory.total
1526 timestamp = time.strftime("%Y-%m-%d %H:%M:%S")
1527
1528 # Check if there's a significant change for this GPU
1529 if tracker.has_significant_change(i, gpu_util, mem_used):
1530 should_send = True
1531 logger.info(f"Significant change detected for GPU {i}")
1532
1533 gpu_utilization.append(
1534 {
1535 "utilization_in_percent": gpu_util,
1536 "memory_utilization_in_bytes": mem_used,
1537 "memory_utilization_in_percent": round(mem_used / mem_total * 100, 1)
1538 }
1539 )
1540
1541 # Get CPU, RAM, and Disk metrics using psutil
1542 cpu_percent = psutil.cpu_percent(interval=1)
1543 ram = psutil.virtual_memory()
1544 disk = psutil.disk_usage('/')
1545
1546 cpu_ram_utilization = {
1547 "cpu_utilization_in_percent": cpu_percent,
1548 "ram_utilization_in_bytes": ram.used,
1549 "ram_utilization_in_percent": ram.percent
1550 }
1551
1552 disk_utilization = {
1553 "disk_utilization_in_bytes": disk.used,
1554 "disk_utilization_in_percent": disk.percent
1555 }
1556
1557 # Only send if there's a significant change in any GPU
1558 if should_send:
1559 payload = {
1560 "gpu_utilization": gpu_utilization,
1561 "cpu_ram_utilization": cpu_ram_utilization,
1562 "disk_utilization": disk_utilization,
1563 "timestamp": timestamp,
1564 "program_id": program_id,
1565 "signature": signature,
1566 "executor_id": executor_id,
1567 }
1568 # Send HTTP POST request
1569 async with session.post(http_url, json=payload) as response:
1570 if response.status == 200:
1571 logger.info("Successfully sent metrics to backend")
1572 else:
1573 logger.error(f"Failed to send metrics. Status: {response.status}")
1574 text = await response.text()
1575 logger.error(f"Response: {text}")
1576
1577 await asyncio.sleep(interval)
1578
1579 except Exception as e:
1580 logger.error(f"Error in main loop: {e}")
1581 await asyncio.sleep(5) # Wait before retrying
1582
1583 except KeyboardInterrupt:
1584 logger.info("Stopping GPU scraping...")
1585 finally:
1586 pynvml.nvmlShutdown()
1587
1588
1589@click.command()
1590@click.option("--program_id", prompt="Program ID", help="Program ID for monitoring")
1591@click.option("--signature", prompt="Signature", help="Signature for verification")
1592@click.option("--executor_id", prompt="Executor ID", help="Executor ID")
1593@click.option("--validator_hotkey", prompt="Validator Hotkey", help="Validator hotkey")
1594@click.option("--compute_rest_app_url", prompt="Compute-app Url", help="Compute-app Url")
1595@click.option("--interval", default=5, type=int, help="Scraping interval in seconds")
1596def main(
1597 interval: int,
1598 program_id: str,
1599 signature: str,
1600 executor_id: str,
1601 validator_hotkey: str,
1602 compute_rest_app_url: str,
1603):
1604 asyncio.run(
1605 scrape_gpu_metrics(
1606 interval, program_id, signature, executor_id, validator_hotkey, compute_rest_app_url
1607 )
1608 )
1609
1610
1611if __name__ == "__main__":
1612 main()
1613
1614
1615
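`GPUMetricsTracker` gates the HTTP posts on relative change: utilization is compared in absolute percentage points and memory as a percentage of the previous reading, both against the same threshold. A short illustration of those semantics (assuming the module is importable as `gpus_utility`):

```python
# Illustrative check of GPUMetricsTracker's change-detection behavior (not repository code).
from gpus_utility import GPUMetricsTracker

tracker = GPUMetricsTracker(threshold_percent=10.0)

# The first observation for a GPU always counts as significant and seeds the baseline.
assert tracker.has_significant_change(0, util=50.0, mem_used=8_000_000_000)

# A 5-point utilization move and a ~1% memory move stay below the 10% threshold.
assert not tracker.has_significant_change(0, util=55.0, mem_used=8_100_000_000)

# A 12-point utilization jump (relative to the unchanged baseline) crosses the threshold.
assert tracker.has_significant_change(0, util=62.0, mem_used=8_100_000_000)
```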
1616---
1617File: /neurons/executor/docker_build.sh
1618---
1619
1620#!/bin/bash
1621set -eux -o pipefail
1622
1623IMAGE_NAME="daturaai/compute-subnet-executor:$TAG"
1624
1625docker build --build-context datura=../../datura -t $IMAGE_NAME .
1626
1627
1628---
1629File: /neurons/executor/docker_publish.sh
1630---
1631
1632#!/bin/bash
1633set -eux -o pipefail
1634
1635source ./docker_build.sh
1636
1637echo "$DOCKERHUB_PAT" | docker login -u "$DOCKERHUB_USERNAME" --password-stdin
1638docker push "$IMAGE_NAME"
1639
1640
1641---
1642File: /neurons/executor/docker_runner_build.sh
1643---
1644
1645#!/bin/bash
1646set -eux -o pipefail
1647
1648IMAGE_NAME="daturaai/compute-subnet-executor-runner:$TAG"
1649
1650docker build --file Dockerfile.runner -t $IMAGE_NAME .
1651
1652
1653---
1654File: /neurons/executor/docker_runner_publish.sh
1655---
1656
1657#!/bin/bash
1658set -eux -o pipefail
1659
1660source ./docker_runner_build.sh
1661
1662echo "$DOCKERHUB_PAT" | docker login -u "$DOCKERHUB_USERNAME" --password-stdin
1663docker push "$IMAGE_NAME"
1664
1665
1666---
1667File: /neurons/executor/entrypoint.sh
1668---
1669
1670#!/bin/sh
1671set -eu
1672
1673docker compose up --pull always --detach --wait --force-recreate
1674
1675# Clean docker images
1676docker image prune -f
1677
1678while true
1679do
1680 docker compose logs -f
1681 echo 'All containers died'
1682 sleep 10
1683done
1684
1685
1686
1687---
1688File: /neurons/executor/README.md
1689---
1690
1691# Executor
1692
1693## Setup project
1694### Requirements
1695* Ubuntu machine
1696* Install [docker](https://docs.docker.com/engine/install/ubuntu/)
1697
1698
1699### Step 1: Clone project
1700
1701```
1702git clone https://github.com/Datura-ai/compute-subnet.git
1703```
1704
1705### Step 2: Install Required Tools
1706
1707Run the following command to install the required tools:
1708```shell
1709cd compute-subnet && chmod +x scripts/install_executor_on_ubuntu.sh && scripts/install_executor_on_ubuntu.sh
1710```
1711
1712If you don't have sudo on your machine, run
1713```shell
1714sed -i 's/sudo //g' scripts/install_executor_on_ubuntu.sh
1715```
1716to remove `sudo` from the setup script commands.
1717
1718### Step 3: Configure Docker for Nvidia
1719
1720Please follow [this guide](https://stackoverflow.com/questions/72932940/failed-to-initialize-nvml-unknown-error-in-docker-after-few-hours) to set up Docker for NVIDIA properly.
1721
1722
1723### Step 4: Install and Run
1724
1725* Go to executor root
1726```shell
1727cd neurons/executor
1728```
1729
1730* Add a `.env` file to the project
1731```shell
1732cp .env.template .env
1733```
1734
1735Set the correct miner wallet hotkey address for `MINER_HOTKEY_SS58_ADDRESS`.
1736You can change the `INTERNAL_PORT`, `EXTERNAL_PORT`, and `SSH_PORT` values as needed.
1737
1738- **INTERNAL_PORT**: internal port of your executor docker container
1739- **EXTERNAL_PORT**: external expose port of your executor docker container
1740- **SSH_PORT**: SSH port mapped to port 22 of your executor docker container
1741- **SSH_PUBLIC_PORT**: [Optional] publicly accessible SSH port of your executor docker container. If `SSH_PUBLIC_PORT` is equal to `SSH_PORT`, you don't have to specify this port.
1742- **MINER_HOTKEY_SS58_ADDRESS**: the miner hotkey address
1743- **RENTING_PORT_RANGE**: The range of ports that is publicly accessible. This can be empty if all ports are open. Available formats are:
1744  - Range specification (`from-to`): Miners can specify a range of ports, such as 2000-2005. This means ports 2000 through 2005 will be open for the validator to select.
1745  - Specific ports (`port1,port2,port3`): Miners can specify individual ports, such as 2000,2001,2002. This means only ports 2000, 2001, and 2002 will be available to the validator.
1746  - Default behavior: If no ports are specified, the validator will assume that all ports on the executor are available.
1747- **RENTING_PORT_MAPPINGS**: Internal-to-external port mappings. Use this variable when you run a proxy in front of your executors and the internal and external ports can't be the same. You can ignore it if all ports are open or if the internal and external ports match. Example:
1748  - If internal port 46681 is mapped to external port 56681 and internal port 46682 is mapped to external port 56682, then RENTING_PORT_MAPPINGS="[[46681, 56681], [46682, 56682]]"
1749
1750Note: When only specific ports are available, use either **RENTING_PORT_RANGE** or **RENTING_PORT_MAPPINGS**, but DO NOT use both. A short parsing sketch of both formats is shown below.
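A minimal parsing sketch of the two formats above (illustrative only; these helper functions are not part of the codebase):

```python
# Illustrative only: how the RENTING_PORT_RANGE and RENTING_PORT_MAPPINGS formats can be read.
import json


def parse_port_range(value: str) -> list[int]:
    """Accepts "2000-2005" or "2000,2001,2002"; an empty value means all ports."""
    if not value:
        return []
    if "-" in value:
        start, end = (int(part) for part in value.split("-", 1))
        return list(range(start, end + 1))
    return [int(part) for part in value.split(",")]


def parse_port_mappings(value: str) -> dict[int, int]:
    """Accepts '[[46681, 56681], [46682, 56682]]' as internal-to-external pairs."""
    return {internal: external for internal, external in json.loads(value)}


assert parse_port_range("2000-2005") == [2000, 2001, 2002, 2003, 2004, 2005]
assert parse_port_mappings("[[46681, 56681], [46682, 56682]]") == {46681: 56681, 46682: 56682}
```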
1751
1752
1753* Run project
1754```shell
1755docker compose up -d
1756```
1757
1758
1759
1760---
1761File: /neurons/executor/run.sh
1762---
1763
1764#!/bin/bash
1765set -eux -o pipefail
1766
1767# start ssh service
1768ssh-keygen -A
1769service ssh start
1770
1771# run fastapi app
1772python src/executor.py
1773
1774
1775---
1776File: /neurons/miners/migrations/versions/8e52603bd563_create_validator_model.py
1777---
1778
1779"""create validator model
1780
1781Revision ID: 8e52603bd563
1782Revises:
1783Create Date: 2024-07-15 10:47:41.596221
1784
1785"""
1786
1787from collections.abc import Sequence
1788
1789import sqlalchemy as sa
1790import sqlmodel
1791import sqlmodel.sql.sqltypes
1792from alembic import op
1793
1794# revision identifiers, used by Alembic.
1795revision: str = "8e52603bd563"
1796down_revision: str | None = None
1797branch_labels: str | Sequence[str] | None = None
1798depends_on: str | Sequence[str] | None = None
1799
1800
1801def upgrade() -> None:
1802 # ### commands auto generated by Alembic - please adjust! ###
1803 op.create_table(
1804 "validator",
1805 sa.Column('uuid', sa.Uuid(), nullable=False),
1806 sa.Column('validator_hotkey', sqlmodel.sql.sqltypes.AutoString(), nullable=False),
1807 sa.Column('active', sa.Boolean(), nullable=False),
1808 sa.PrimaryKeyConstraint('uuid'),
1809 sa.UniqueConstraint('validator_hotkey')
1810 )
1811 # ### end Alembic commands ###
1812
1813
1814def downgrade() -> None:
1815 # ### commands auto generated by Alembic - please adjust! ###
1816 op.drop_table("validator")
1817 # ### end Alembic commands ###
1818
1819
1820
1821---
1822File: /neurons/miners/migrations/versions/eb0b92cbc38e_add_executors_table.py
1823---
1824
1825"""Add executors table
1826
1827Revision ID: eb0b92cbc38e
1828Revises: 8e52603bd563
1829Create Date: 2024-09-06 06:56:04.990324
1830
1831"""
1832
1833from collections.abc import Sequence
1834
1835import sqlalchemy as sa
1836import sqlmodel
1837import sqlmodel.sql.sqltypes
1838from alembic import op
1839
1840# revision identifiers, used by Alembic.
1841revision: str = "eb0b92cbc38e"
1842down_revision: str | None = "8e52603bd563"
1843branch_labels: str | Sequence[str] | None = None
1844depends_on: str | Sequence[str] | None = None
1845
1846
1847def upgrade() -> None:
1848 # ### commands auto generated by Alembic - please adjust! ###
1849 op.create_table(
1850 "executor",
1851 sa.Column("uuid", sa.Uuid(), nullable=False),
1852 sa.Column("address", sqlmodel.sql.sqltypes.AutoString(), nullable=False),
1853 sa.Column("port", sa.Integer(), nullable=False),
1854 sa.Column("validator", sqlmodel.sql.sqltypes.AutoString(), nullable=False),
1855 sa.PrimaryKeyConstraint("uuid"),
1856 sa.UniqueConstraint("address", "port", name="unique_contraint_address_port"),
1857 )
1858 # ### end Alembic commands ###
1859
1860
1861def downgrade() -> None:
1862 # ### commands auto generated by Alembic - please adjust! ###
1863 op.drop_table("executor")
1864 # ### end Alembic commands ###
1865
1866
1867
1868---
1869File: /neurons/miners/migrations/env.py
1870---
1871
1872import os
1873from logging.config import fileConfig
1874from pathlib import Path
1875
1876from alembic import context
1877from dotenv import load_dotenv
1878from sqlalchemy import engine_from_config, pool
1879from sqlmodel import SQLModel
1880
1881from models.executor import * # noqa
1882from models.validator import * # noqa
1883
1884# this is the Alembic Config object, which provides
1885# access to the values within the .ini file in use.
1886config = context.config
1887
1888# Interpret the config file for Python logging.
1889# This line sets up loggers basically.
1890if config.config_file_name is not None:
1891 fileConfig(config.config_file_name)
1892
1893# add your model's MetaData object here
1894# for 'autogenerate' support
1895# from myapp import mymodel
1896# target_metadata = mymodel.Base.metadata
1897
1898target_metadata = SQLModel.metadata
1899
1900# other values from the config, defined by the needs of env.py,
1901# can be acquired:
1902# my_important_option = config.get_main_option("my_important_option")
1903# ... etc.
1904
1905current_dir = Path(__file__).parent
1906
1907load_dotenv(str(current_dir / ".." / ".env"))
1908
1909
1910def get_url():
1911 url = os.getenv("SQLALCHEMY_DATABASE_URI")
1912 return url
1913
1914
1915def run_migrations_offline() -> None:
1916 """Run migrations in 'offline' mode.
1917
1918 This configures the context with just a URL
1919 and not an Engine, though an Engine is acceptable
1920 here as well. By skipping the Engine creation
1921 we don't even need a DBAPI to be available.
1922
1923 Calls to context.execute() here emit the given string to the
1924 script output.
1925
1926 """
1927 url = get_url()
1928 context.configure(
1929 url=url,
1930 target_metadata=target_metadata,
1931 literal_binds=True,
1932 dialect_opts={"paramstyle": "named"},
1933 )
1934
1935 with context.begin_transaction():
1936 context.run_migrations()
1937
1938
1939def run_migrations_online() -> None:
1940 """Run migrations in 'online' mode.
1941
1942 In this scenario we need to create an Engine
1943 and associate a connection with the context.
1944
1945 """
1946 configuration = config.get_section(config.config_ini_section)
1947 configuration["sqlalchemy.url"] = get_url()
1948 connectable = engine_from_config(
1949 configuration,
1950 prefix="sqlalchemy.",
1951 poolclass=pool.NullPool,
1952 )
1953
1954 with connectable.connect() as connection:
1955 context.configure(connection=connection, target_metadata=target_metadata)
1956
1957 with context.begin_transaction():
1958 context.run_migrations()
1959
1960
1961if context.is_offline_mode():
1962 run_migrations_offline()
1963else:
1964 run_migrations_online()
1965
1966
1967
1968---
1969File: /neurons/miners/src/consumers/validator_consumer.py
1970---
1971
1972import asyncio
1973import logging
1974import time
1975from typing import Annotated
1976
1977import bittensor
1978from datura.consumers.base import BaseConsumer
1979from datura.requests.miner_requests import (
1980 AcceptJobRequest,
1981 AcceptSSHKeyRequest,
1982 DeclineJobRequest,
1983 Executor,
1984 ExecutorSSHInfo,
1985 FailedRequest,
1986 UnAuthorizedRequest,
1987)
1988from datura.requests.validator_requests import (
1989 AuthenticateRequest,
1990 BaseValidatorRequest,
1991 SSHPubKeyRemoveRequest,
1992 SSHPubKeySubmitRequest,
1993)
1994from fastapi import Depends, WebSocket
1995
1996from core.config import settings
1997from services.executor_service import ExecutorService
1998from services.ssh_service import MinerSSHService
1999from services.validator_service import ValidatorService
2000
2001AUTH_MESSAGE_MAX_AGE = 10
2002MAX_MESSAGE_COUNT = 10
2003
2004logger = logging.getLogger(__name__)
2005
2006
2007class ValidatorConsumer(BaseConsumer):
2008 def __init__(
2009 self,
2010 websocket: WebSocket,
2011 validator_key: str,
2012 ssh_service: Annotated[MinerSSHService, Depends(MinerSSHService)],
2013 validator_service: Annotated[ValidatorService, Depends(ValidatorService)],
2014 executor_service: Annotated[ExecutorService, Depends(ExecutorService)],
2015 ):
2016 super().__init__(websocket)
2017 self.ssh_service = ssh_service
2018 self.validator_service = validator_service
2019 self.executor_service = executor_service
2020 self.validator_key = validator_key
2021 self.my_hotkey = settings.get_bittensor_wallet().get_hotkey().ss58_address
2022 self.validator_authenticated = False
2023 self.msg_queue = []
2024
2025 def accepted_request_type(self):
2026 return BaseValidatorRequest
2027
2028 def verify_auth_msg(self, msg: AuthenticateRequest) -> tuple[bool, str]:
2029 if msg.payload.timestamp < time.time() - AUTH_MESSAGE_MAX_AGE:
2030 return False, "msg too old"
2031 if msg.payload.miner_hotkey != self.my_hotkey:
2032 return False, f"wrong miner hotkey ({self.my_hotkey}!={msg.payload.miner_hotkey})"
2033 if msg.payload.validator_hotkey != self.validator_key:
2034 return (
2035 False,
2036 f"wrong validator hotkey ({self.validator_key}!={msg.payload.validator_hotkey})",
2037 )
2038
2039 keypair = bittensor.Keypair(ss58_address=self.validator_key)
2040        if keypair.verify(msg.blob_for_signing(), msg.signature):
2041            return True, ""
2042        return False, "incorrect signature"
2043 async def handle_authentication(self, msg: AuthenticateRequest):
2044 # check if validator is registered
2045 if not self.validator_service.is_valid_validator(self.validator_key):
2046 await self.send_message(UnAuthorizedRequest(details="Validator is not registered"))
2047 await self.disconnect()
2048 return
2049
2050 authenticated, error_msg = self.verify_auth_msg(msg)
2051 if not authenticated:
2052 response_msg = f"Validator {self.validator_key} not authenticated due to: {error_msg}"
2053 logger.info(response_msg)
2054 await self.send_message(UnAuthorizedRequest(details=response_msg))
2055 await self.disconnect()
2056 return
2057
2058 self.validator_authenticated = True
2059 for msg in self.msg_queue:
2060 await self.handle_message(msg)
2061
2062 async def check_validator_allowance(self):
2063        """Check whether any executors are opened for the current validator.
2064
2065        If executors are available, send an accept-job request to the validator with the
2066        list of executors available to it.
2067
2068        If there are none, send a decline-job request and disconnect.
2069        """
2070 executors = self.executor_service.get_executors_for_validator(self.validator_key)
2071 if len(executors):
2072 logger.info("Found %d executors for validator(%s)", len(executors), self.validator_key)
2073 await self.send_message(
2074 AcceptJobRequest(
2075 executors=[
2076 Executor(uuid=str(executor.uuid), address=executor.address, port=executor.port)
2077 for executor in executors
2078 ]
2079 )
2080 )
2081 else:
2082            logger.info("No executors found for validator(%s)", self.validator_key)
2083 await self.send_message(DeclineJobRequest())
2084 await self.disconnect()
2085
2086 async def handle_message(self, msg: BaseValidatorRequest):
2087 if isinstance(msg, AuthenticateRequest):
2088 await self.handle_authentication(msg)
2089 if self.validator_authenticated:
2090 await self.check_validator_allowance()
2091 return
2092
2093        # TODO: update logic here; for now, it sends AcceptJobRequest regardless
2094 # if self.validator_authenticated:
2095 # await self.send_message(AcceptJobRequest())
2096
2097 if not self.validator_authenticated:
2098 if len(self.msg_queue) <= MAX_MESSAGE_COUNT:
2099 self.msg_queue.append(msg)
2100 return
2101
2102 if isinstance(msg, SSHPubKeySubmitRequest):
2103 logger.info("Validator %s sent SSH Pubkey.", self.validator_key)
2104
2105 try:
2106 msg: SSHPubKeySubmitRequest
2107 executors: list[ExecutorSSHInfo] = await self.executor_service.register_pubkey(
2108 self.validator_key, msg.public_key, msg.executor_id
2109 )
2110 await self.send_message(AcceptSSHKeyRequest(executors=executors))
2111 logger.info("Sent AcceptSSHKeyRequest to validator %s", self.validator_key)
2112 except Exception as e:
2113 logger.error("Storing SSH key or Sending AcceptSSHKeyRequest failed: %s", str(e))
2114 self.ssh_service.remove_pubkey_from_host(msg.public_key)
2115 await self.send_message(FailedRequest(details=str(e)))
2116 return
2117
2118 if isinstance(msg, SSHPubKeyRemoveRequest):
2119 logger.info("Validator %s sent remove SSH Pubkey.", self.validator_key)
2120 try:
2121 await self.executor_service.deregister_pubkey(self.validator_key, msg.public_key, msg.executor_id)
2122                logger.info("Removed SSH key(s) for validator %s", self.validator_key)
2123 except Exception as e:
2124 logger.error("Failed SSHKeyRemoved request: %s", str(e))
2125 await self.send_message(FailedRequest(details=str(e)))
2126 return
2127
2128
2129class ValidatorConsumerManger:
2130 def __init__(
2131 self,
2132 ):
2133 self.active_consumer: ValidatorConsumer | None = None
2134 self.lock = asyncio.Lock()
2135
2136 async def addConsumer(
2137 self,
2138 websocket: WebSocket,
2139 validator_key: str,
2140 ssh_service: Annotated[MinerSSHService, Depends(MinerSSHService)],
2141 validator_service: Annotated[ValidatorService, Depends(ValidatorService)],
2142 executor_service: Annotated[ExecutorService, Depends(ExecutorService)],
2143 ):
2144 consumer = ValidatorConsumer(
2145 websocket=websocket,
2146 validator_key=validator_key,
2147 ssh_service=ssh_service,
2148 validator_service=validator_service,
2149 executor_service=executor_service,
2150 )
2151 await consumer.connect()
2152
2153 if self.active_consumer is not None:
2154 await consumer.send_message(DeclineJobRequest())
2155 await consumer.disconnect()
2156 return
2157
2158 async with self.lock:
2159 self.active_consumer = consumer
2160
2161 await self.active_consumer.handle()
2162
2163 self.active_consumer = None
2164
2165
2166validatorConsumerManager = ValidatorConsumerManger()
2167
2168
2169
2170---
2171File: /neurons/miners/src/core/__init__.py
2172---
2173
2174
2175
2176
2177---
2178File: /neurons/miners/src/core/config.py
2179---
2180
2181from typing import TYPE_CHECKING
2182import argparse
2183import pathlib
2184
2185import bittensor
2186from pydantic import Field
2187from pydantic_settings import BaseSettings, SettingsConfigDict
2188
2189if TYPE_CHECKING:
2190 from bittensor_wallet import Wallet
2191
2192
2193class Settings(BaseSettings):
2194 model_config = SettingsConfigDict(env_file=".env", extra="ignore")
2195 PROJECT_NAME: str = "compute-subnet-miner"
2196
2197 BITTENSOR_WALLET_DIRECTORY: pathlib.Path = Field(
2198 env="BITTENSOR_WALLET_DIRECTORY",
2199 default=pathlib.Path("~").expanduser() / ".bittensor" / "wallets",
2200 )
2201 BITTENSOR_WALLET_NAME: str = Field(env="BITTENSOR_WALLET_NAME")
2202 BITTENSOR_WALLET_HOTKEY_NAME: str = Field(env="BITTENSOR_WALLET_HOTKEY_NAME")
2203 BITTENSOR_NETUID: int = Field(env="BITTENSOR_NETUID")
2204 BITTENSOR_CHAIN_ENDPOINT: str | None = Field(env="BITTENSOR_CHAIN_ENDPOINT", default=None)
2205 BITTENSOR_NETWORK: str = Field(env="BITTENSOR_NETWORK")
2206
2207 SQLALCHEMY_DATABASE_URI: str = Field(env="SQLALCHEMY_DATABASE_URI")
2208
2209 EXTERNAL_IP_ADDRESS: str = Field(env="EXTERNAL_IP_ADDRESS")
2210 INTERNAL_PORT: int = Field(env="INTERNAL_PORT", default=8000)
2211 EXTERNAL_PORT: int = Field(env="EXTERNAL_PORT", default=8000)
2212 ENV: str = Field(env="ENV", default="dev")
2213 DEBUG: bool = Field(env="DEBUG", default=False)
2214
2215 def get_bittensor_wallet(self) -> "Wallet":
2216 if not self.BITTENSOR_WALLET_NAME or not self.BITTENSOR_WALLET_HOTKEY_NAME:
2217 raise RuntimeError("Wallet not configured")
2218 wallet = bittensor.wallet(
2219 name=self.BITTENSOR_WALLET_NAME,
2220 hotkey=self.BITTENSOR_WALLET_HOTKEY_NAME,
2221 path=str(self.BITTENSOR_WALLET_DIRECTORY),
2222 )
2223 wallet.hotkey_file.get_keypair() # this raises errors if the keys are inaccessible
2224 return wallet
2225
2226 def get_bittensor_config(self) -> bittensor.config:
2227 parser = argparse.ArgumentParser()
2228 # bittensor.wallet.add_args(parser)
2229 # bittensor.subtensor.add_args(parser)
2230 # bittensor.axon.add_args(parser)
2231
2232 if self.BITTENSOR_NETWORK:
2233 if "--subtensor.network" in parser._option_string_actions:
2234 parser._handle_conflict_resolve(
2235 None,
2236 [("--subtensor.network", parser._option_string_actions["--subtensor.network"])],
2237 )
2238
2239 parser.add_argument(
2240 "--subtensor.network",
2241 type=str,
2242 help="network",
2243 default=self.BITTENSOR_NETWORK,
2244 )
2245
2246 if self.BITTENSOR_CHAIN_ENDPOINT:
2247 if "--subtensor.chain_endpoint" in parser._option_string_actions:
2248 parser._handle_conflict_resolve(
2249 None,
2250 [
2251 (
2252 "--subtensor.chain_endpoint",
2253 parser._option_string_actions["--subtensor.chain_endpoint"],
2254 )
2255 ],
2256 )
2257
2258 parser.add_argument(
2259 "--subtensor.chain_endpoint",
2260 type=str,
2261 help="chain endpoint",
2262 default=self.BITTENSOR_CHAIN_ENDPOINT,
2263 )
2264
2265 return bittensor.config(parser)
2266
2267
2268settings = Settings()
2269
2270
2271
2272---
2273File: /neurons/miners/src/core/db.py
2274---
2275
2276from collections.abc import Generator
2277from typing import Annotated
2278
2279from fastapi import Depends
2280from sqlmodel import Session, create_engine
2281
2282from core.config import settings
2283
2284engine = create_engine(str(settings.SQLALCHEMY_DATABASE_URI))
2285
2286
2287def get_db() -> Generator[Session, None, None]:
2288 with Session(engine) as session:
2289 yield session
2290
2291
2292SessionDep = Annotated[Session, Depends(get_db)]
2293
2294
2295
2296---
2297File: /neurons/miners/src/core/miner.py
2298---
2299
2300from typing import TYPE_CHECKING
2301import logging
2302import traceback
2303import asyncio
2304import bittensor
2305from websockets.protocol import State as WebSocketClientState
2306
2307from core.config import settings
2308from core.db import get_db
2309from core.utils import _m, get_extra_info
2310from daos.validator import ValidatorDao, Validator
2311
2312if TYPE_CHECKING:
2313 from bittensor_wallet import Wallet
2314
2315logger = logging.getLogger(__name__)
2316
2317MIN_STAKE = 10
2318VALIDATORS_LIMIT = 24
2319SYNC_CYCLE = 2 * 60
2320
2321
2322class Miner:
2323 wallet: "Wallet"
2324 subtensor: bittensor.subtensor
2325 netuid: int
2326
2327 def __init__(self):
2328 self.config = settings.get_bittensor_config()
2329 self.wallet = settings.get_bittensor_wallet()
2330 self.netuid = settings.BITTENSOR_NETUID
2331
2332 self.default_extra = {
2333 "external_port": settings.EXTERNAL_PORT,
2334 "external_ip": settings.EXTERNAL_IP_ADDRESS,
2335 }
2336
2337 self.axon = bittensor.axon(
2338 wallet=self.wallet,
2339 external_port=settings.EXTERNAL_PORT,
2340 external_ip=settings.EXTERNAL_IP_ADDRESS,
2341 port=settings.INTERNAL_PORT,
2342 ip=settings.EXTERNAL_IP_ADDRESS,
2343 )
2344 self.subtensor = None
2345 self.set_subtensor()
2346
2347 self.should_exit = False
2348 self.session = next(get_db())
2349 self.validator_dao = ValidatorDao(session=self.session)
2350 self.last_announced_block = 0
2351
2352 def set_subtensor(self):
2353 try:
2354 if (
2355 self.subtensor
2356 and self.subtensor.substrate
2357 and self.subtensor.substrate.websocket
2358 and self.subtensor.substrate.websocket.state is WebSocketClientState.OPEN
2359 ):
2360 return
2361
2362 logger.info(
2363 _m(
2364 "Getting subtensor",
2365 extra=get_extra_info(self.default_extra),
2366 ),
2367 )
2368
2369 self.subtensor = bittensor.subtensor(config=self.config)
2370
2371 # check registered
2372 self.check_registered()
2373 except Exception as e:
2374            logger.error(
2375 _m(
2376 "[Error] Getting subtensor",
2377 extra=get_extra_info({
2378 ** self.default_extra,
2379 "error": str(e),
2380 }),
2381 ),
2382 )
2383
2384 def check_registered(self):
2385 try:
2386 logger.info(
2387 _m(
2388 '[check_registered] checking miner is registered',
2389 extra=get_extra_info(self.default_extra),
2390 ),
2391 )
2392
2393 if not self.subtensor.is_hotkey_registered(
2394 netuid=self.netuid,
2395 hotkey_ss58=self.wallet.get_hotkey().ss58_address,
2396 ):
2397 logger.error(
2398 _m(
2399 f"[check_registered] Wallet: {self.wallet} is not registered on netuid {self.netuid}.",
2400 extra=get_extra_info(self.default_extra),
2401 ),
2402 )
2403 exit()
2404 except Exception as e:
2405 logger.error(
2406 _m(
2407 '[check_registered] Checking miner registered failed',
2408 extra=get_extra_info({
2409 **self.default_extra,
2410 "error": str(e)
2411 }),
2412 ),
2413 )
2414
2415 def get_node(self):
2416 # return SubstrateInterface(url=self.config.subtensor.chain_endpoint)
2417 return self.subtensor.substrate
2418
2419 def get_current_block(self):
2420 node = self.get_node()
2421 return node.query("System", "Number", []).value
2422
2423 def get_tempo(self):
2424 return self.subtensor.tempo(self.netuid)
2425
2426 def get_serving_rate_limit(self):
2427 node = self.get_node()
2428 return node.query("SubtensorModule", "ServingRateLimit", [self.netuid]).value
2429
2430 def announce(self):
2431 try:
2432 current_block = self.get_current_block()
2433 tempo = self.get_tempo()
2434
2435 if current_block - self.last_announced_block >= tempo:
2436 self.last_announced_block = current_block
2437
2438 logger.info(
2439 _m(
2440 '[announce] Announce miner',
2441 extra=get_extra_info(self.default_extra),
2442 ),
2443 )
2444 self.axon.serve(netuid=self.netuid, subtensor=self.subtensor)
2445 except Exception as e:
2446 logger.error(
2447 _m(
2448 '[announce] Announcing miner error',
2449 extra=get_extra_info({
2450 **self.default_extra,
2451 "error": str(e)
2452 }),
2453 ),
2454 )
2455
2456 async def fetch_validators(self):
2457 metagraph = self.subtensor.metagraph(netuid=self.netuid)
2458 neurons = [n for n in metagraph.neurons if (n.stake.tao >= MIN_STAKE)]
2459 return neurons[:VALIDATORS_LIMIT]
2460
2461 async def save_validators(self, validators):
2462 logger.info(
2463 _m(
2464 '[save_validators] Sync validators',
2465 extra=get_extra_info(self.default_extra),
2466 ),
2467 )
2468 for v in validators:
2469 existing = self.validator_dao.get_validator_by_hotkey(v.hotkey)
2470 if not existing:
2471 self.validator_dao.save(
2472 Validator(
2473 validator_hotkey=v.hotkey,
2474 active=True
2475 )
2476 )
2477
2478 async def sync(self):
2479 try:
2480 self.set_subtensor()
2481
2482 self.announce()
2483
2484 validators = await self.fetch_validators()
2485 await self.save_validators(validators)
2486 except Exception as e:
2487 logger.error(
2488 _m(
2489 '[sync] Miner sync failed',
2490 extra=get_extra_info({
2491 **self.default_extra,
2492 "error": str(e)
2493 }),
2494 ),
2495 )
2496
2497 async def start(self):
2498 logger.info(
2499 _m(
2500 'Start Miner in background',
2501 extra=get_extra_info(self.default_extra),
2502 ),
2503 )
2504 try:
2505 while not self.should_exit:
2506 await self.sync()
2507
2508 # sync every 2 mins
2509 await asyncio.sleep(SYNC_CYCLE)
2510 except KeyboardInterrupt:
2511 logger.debug('Miner killed by keyboard interrupt.')
2512 exit()
2513 except Exception as e:
2514 logger.error(traceback.format_exc())
2515
2516 async def stop(self):
2517 logger.info(
2518 _m(
2519 'Stop Miner process',
2520 extra=get_extra_info(self.default_extra),
2521 ),
2522 )
2523 self.should_exit = True
2524
2525
2526
2527---
2528File: /neurons/miners/src/core/utils.py
2529---
2530
2531import asyncio
2532import contextvars
2533import json
2534import logging
2535
2536from core.config import settings
2537
2538logger = logging.getLogger(__name__)
2539
2540# Create a ContextVar to hold the context information
2541context = contextvars.ContextVar("context", default="ValidatorService")
2542context.set("ValidatorService")
2543
2544
2545def wait_for_services_sync(timeout=30):
2546 """Wait until PostgreSQL connections are working."""
2547 from sqlalchemy import create_engine, text
2548
2549 from core.config import settings
2550
2551 logger.info("Waiting for services to be available...")
2552
2553 while True:
2554 try:
2555 # Check PostgreSQL connection using SQLAlchemy
2556 engine = create_engine(settings.SQLALCHEMY_DATABASE_URI)
2557 with engine.connect() as connection:
2558 connection.execute(text("SELECT 1"))
2559 logger.info("Connected to PostgreSQL.")
2560
2561 break
2562 except Exception as e:
2563 logger.error("Failed to connect to PostgreSQL.")
2564 raise e
2565
2566
2567def get_extra_info(extra: dict) -> dict:
2568 task = asyncio.current_task()
2569 coro_name = task.get_coro().__name__ if task else "NoTask"
2570 task_id = id(task) if task else "NoTaskID"
2571 extra_info = {
2572 "coro_name": coro_name,
2573 "task_id": task_id,
2574 **extra,
2575 }
2576 return extra_info
2577
2578
2579def configure_logs_of_other_modules():
2580 miner_hotkey = settings.get_bittensor_wallet().get_hotkey().ss58_address
2581
2582 logging.basicConfig(
2583 level=logging.INFO,
2584 format=f"Miner: {miner_hotkey} | Name: %(name)s | Time: %(asctime)s | Level: %(levelname)s | File: %(filename)s | Function: %(funcName)s | Line: %(lineno)s | Process: %(process)d | Message: %(message)s",
2585 )
2586
2587 sqlalchemy_logger = logging.getLogger("sqlalchemy")
2588 sqlalchemy_logger.setLevel(logging.WARNING)
2589
2590 # Create a custom formatter that adds the context to the log messages
2591 class CustomFormatter(logging.Formatter):
2592 def format(self, record):
2593 try:
2594 task = asyncio.current_task()
2595 coro_name = task.get_coro().__name__ if task else "NoTask"
2596 task_id = id(task) if task else "NoTaskID"
2597 return f"{getattr(record, 'context', 'Default')} | {coro_name} | {task_id} | {super().format(record)}"
2598 except Exception:
2599 return ""
2600
2601 # Create a handler for the logger
2602 handler = logging.StreamHandler()
2603
2604 # Set the formatter for the handler
2605 handler.setFormatter(
2606 CustomFormatter("%(name)s %(asctime)s %(levelname)s %(filename)s %(process)d %(message)s")
2607 )
2608
2609
2610class StructuredMessage:
2611 def __init__(self, message, extra: dict):
2612 self.message = message
2613 self.extra = extra
2614
2615 def __str__(self):
2616 return "%s >>> %s" % (self.message, json.dumps(self.extra)) # noqa
2617
2618
2619_m = StructuredMessage
2620
2621
2622
2623---
2624File: /neurons/miners/src/daos/__init__.py
2625---
2626
2627
2628
2629
2630---
2631File: /neurons/miners/src/daos/base.py
2632---
2633
2634from core.db import SessionDep
2635
2636
2637class BaseDao:
2638 def __init__(self, session: SessionDep):
2639 self.session = session
2640
2641
2642
2643---
2644File: /neurons/miners/src/daos/executor.py
2645---
2646
2647from typing import Optional
2648from daos.base import BaseDao
2649from models.executor import Executor
2650
2651
2652class ExecutorDao(BaseDao):
2653 def save(self, executor: Executor) -> Executor:
2654 self.session.add(executor)
2655 self.session.commit()
2656 self.session.refresh(executor)
2657 return executor
2658
2659 def delete_by_address_port(self, address: str, port: int) -> None:
2660 executor = self.session.query(Executor).filter_by(
2661 address=address, port=port).first()
2662 if executor:
2663 self.session.delete(executor)
2664 self.session.commit()
2665
2666 def get_executors_for_validator(self, validator_key: str, executor_id: Optional[str] = None) -> list[Executor]:
2667        """Get executors that are opened to the validator.
2668
2669 Args:
2670 validator_key (str): validator hotkey string
2671
2672 Return:
2673 List[Executor]: list of Executors
2674 """
2675 if executor_id:
2676 return list(self.session.query(Executor).filter_by(validator=validator_key, uuid=executor_id))
2677
2678 return list(self.session.query(Executor).filter_by(validator=validator_key))
2679
2680 def get_all_executors(self) -> list[Executor]:
2681 return list(self.session.query(Executor).all())
2682
2683
2684
2685---
2686File: /neurons/miners/src/daos/validator.py
2687---
2688
2689from daos.base import BaseDao
2690
2691from models.validator import Validator
2692
2693
2694class ValidatorDao(BaseDao):
2695 def save(self, validator: Validator) -> Validator:
2696 self.session.add(validator)
2697 self.session.commit()
2698 self.session.refresh(validator)
2699 return validator
2700
2701 def get_validator_by_hotkey(self, hotkey: str):
2702 return self.session.query(Validator).filter_by(validator_hotkey=hotkey).first()
2703
2704
2705
2706---
2707File: /neurons/miners/src/models/__init__.py
2708---
2709
2710
2711
2712
2713---
2714File: /neurons/miners/src/models/executor.py
2715---
2716
2717import uuid
2718from uuid import UUID
2719
2720from sqlmodel import Field, SQLModel, UniqueConstraint
2721
2722
2723class Executor(SQLModel, table=True):
2724 """Task model."""
2725
2726 __table_args__ = (UniqueConstraint("address", "port", name="unique_contraint_address_port"),)
2727
2728 uuid: UUID | None = Field(default_factory=uuid.uuid4, primary_key=True)
2729 address: str
2730 port: int
2731 validator: str
2732
2733 def __str__(self):
2734 return f"{self.address}:{self.port}"
2735
2736
2737
2738---
2739File: /neurons/miners/src/models/validator.py
2740---
2741
2742import uuid
2743from uuid import UUID
2744
2745from sqlmodel import Field, SQLModel
2746
2747
2748class Validator(SQLModel, table=True):
2749 """Task model."""
2750
2751 uuid: UUID | None = Field(default_factory=uuid.uuid4, primary_key=True)
2752 validator_hotkey: str = Field(unique=True)
2753 active: bool
2754
2755
2756
2757---
2758File: /neurons/miners/src/routes/__init__.py
2759---
2760
2761
2762
2763
2764---
2765File: /neurons/miners/src/routes/debug_routes.py
2766---
2767
2768from typing import Annotated
2769
2770from fastapi import APIRouter, Depends
2771
2772from core.config import settings
2773from services.executor_service import ExecutorService
2774
2775debug_apis_router = APIRouter()
2776
2777
2778@debug_apis_router.get("/debug/get-executors-for-validator/{validator_hotkey}")
2779async def get_executors_for_validator(
2780 validator_hotkey: str, executor_service: Annotated[ExecutorService, Depends(ExecutorService)]
2781):
2782 if not settings.DEBUG:
2783 return None
2784 return executor_service.get_executors_for_validator(validator_hotkey)
2785
2786
2787@debug_apis_router.post("/debug/register_pubkey/{validator_hotkey}")
2788async def register_pubkey(
2789 validator_hotkey: str, executor_service: Annotated[ExecutorService, Depends(ExecutorService)]
2790):
2791 if not settings.DEBUG:
2792 return None
2793 pub_key = "Test Pubkey"
2794 return await executor_service.register_pubkey(validator_hotkey, pub_key.encode("utf-8"))
2795
2796
2797@debug_apis_router.post("/debug/remove_pubkey/{validator_hotkey}")
2798async def remove_pubkey_from_executor(
2799 validator_hotkey: str, executor_service: Annotated[ExecutorService, Depends(ExecutorService)]
2800):
2801 if not settings.DEBUG:
2802 return None
2803 pub_key = "Test Pubkey"
2804 await executor_service.deregister_pubkey(validator_hotkey, pub_key.encode("utf-8"))
2805
2806
2807
2808---
2809File: /neurons/miners/src/routes/validator_interface.py
2810---
2811
2812from typing import Annotated
2813
2814from fastapi import APIRouter, Depends, WebSocket
2815
2816from consumers.validator_consumer import ValidatorConsumer
2817validator_router = APIRouter()
2818
2819
2820@validator_router.websocket("/jobs/{validator_key}")
2821async def validator_interface(consumer: Annotated[ValidatorConsumer, Depends(ValidatorConsumer)]):
2822 await consumer.connect()
2823 await consumer.handle()
2824
2825
2826@validator_router.websocket("/resources/{validator_key}")
2827async def validator_resources_interface(consumer: Annotated[ValidatorConsumer, Depends(ValidatorConsumer)]):
2828 await consumer.connect()
2829 await consumer.handle()
2830
2831
2832
2833---
2834File: /neurons/miners/src/services/executor_service.py
2835---
2836
2837import asyncio
2838import json
2839import logging
2840from typing import Annotated, Optional
2841
2842import aiohttp
2843import bittensor
2844from datura.requests.miner_requests import ExecutorSSHInfo
2845from fastapi import Depends
2846
2847from core.config import settings
2848from daos.executor import ExecutorDao
2849from models.executor import Executor
2850
2851logging.basicConfig(level=logging.INFO)
2852logger = logging.getLogger(__name__)
2853
2854
2855class ExecutorService:
2856 def __init__(self, executor_dao: Annotated[ExecutorDao, Depends(ExecutorDao)]):
2857 self.executor_dao = executor_dao
2858
2859 def get_executors_for_validator(self, validator_hotkey: str, executor_id: Optional[str] = None):
2860 return self.executor_dao.get_executors_for_validator(validator_hotkey, executor_id)
2861
2862 async def send_pubkey_to_executor(
2863 self, executor: Executor, pubkey: str
2864 ) -> ExecutorSSHInfo | None:
2865        """Send an API request to the executor with the validator's pubkey.
2866
2867 Args:
2868 executor (Executor): Executor instance that register validator hotkey
2869 pubkey (str): SSH public key from validator
2870
2871 Return:
2872 response (ExecutorSSHInfo | None): Executor SSH connection info.
2873 """
2874        timeout = aiohttp.ClientTimeout(total=10)  # 10 second timeout
2875 url = f"http://{executor.address}:{executor.port}/upload_ssh_key"
2876 keypair: bittensor.Keypair = settings.get_bittensor_wallet().get_hotkey()
2877 payload = {"public_key": pubkey, "signature": f"0x{keypair.sign(pubkey).hex()}"}
2878 async with aiohttp.ClientSession(timeout=timeout) as session:
2879 try:
2880 async with session.post(url, json=payload) as response:
2881 if response.status != 200:
2882 logger.error("API request failed to register SSH key. url=%s", url)
2883 return None
2884 response_obj: dict = await response.json()
2885 logger.info(
2886 "Get response from Executor(%s:%s): %s",
2887 executor.address,
2888 executor.port,
2889 json.dumps(response_obj),
2890 )
2891 response_obj["uuid"] = str(executor.uuid)
2892 response_obj["address"] = executor.address
2893 response_obj["port"] = executor.port
2894 return ExecutorSSHInfo.parse_obj(response_obj)
2895 except Exception as e:
2896 logger.error(
2897 "API request failed to register SSH key. url=%s, error=%s", url, str(e)
2898 )
2899
2900 async def remove_pubkey_from_executor(self, executor: Executor, pubkey: str):
2901        """Send an API request to the executor to clean up the pubkey.
2902
2903 Args:
2904 executor (Executor): Executor instance that needs to remove pubkey
2905 """
2906        timeout = aiohttp.ClientTimeout(total=10)  # 10 second timeout
2907 url = f"http://{executor.address}:{executor.port}/remove_ssh_key"
2908 keypair: bittensor.Keypair = settings.get_bittensor_wallet().get_hotkey()
2909 payload = {"public_key": pubkey, "signature": f"0x{keypair.sign(pubkey).hex()}"}
2910 async with aiohttp.ClientSession(timeout=timeout) as session:
2911 try:
2912 async with session.post(url, json=payload) as response:
2913 if response.status != 200:
2914                        logger.error("API request failed to remove SSH key. url=%s", url)
2915 return None
2916 except Exception as e:
2917 logger.error(
2918                    "API request failed to remove SSH key. url=%s, error=%s", url, str(e)
2919 )
2920
2921 async def register_pubkey(self, validator_hotkey: str, pubkey: bytes, executor_id: Optional[str] = None):
2922 """Register pubkeys to executors for given validator.
2923
2924 Args:
2925 validator_hotkey (str): Validator hotkey
2926 pubkey (bytes): SSH pubkey from validator.
2927
2928 Return:
2929 List[dict/object]: Executors SSH connection infos that accepted validator pubkey.
2930 """
2931 tasks = [
2932 asyncio.create_task(
2933 self.send_pubkey_to_executor(executor, pubkey.decode("utf-8")),
2934 name=f"{executor}.send_pubkey_to_executor",
2935 )
2936 for executor in self.get_executors_for_validator(validator_hotkey, executor_id)
2937 ]
2938
2939 total_executors = len(tasks)
2940 results = [
2941 result for result in await asyncio.gather(*tasks, return_exceptions=True) if result
2942 ]
2943 logger.info(
2944            "Sent pubkey registration API requests to %d executors and received results from %d executors",
2945 total_executors,
2946 len(results),
2947 )
2948 return results
2949
2950 async def deregister_pubkey(self, validator_hotkey: str, pubkey: bytes, executor_id: Optional[str] = None):
2951 """Deregister pubkey from executors.
2952
2953 Args:
2954 validator_hotkey (str): Validator hotkey
2955 pubkey (bytes): validator pubkey
2956 """
2957 tasks = [
2958 asyncio.create_task(
2959 self.remove_pubkey_from_executor(executor, pubkey.decode("utf-8")),
2960 name=f"{executor}.remove_pubkey_from_executor",
2961 )
2962 for executor in self.get_executors_for_validator(validator_hotkey, executor_id)
2963 ]
2964 await asyncio.gather(*tasks, return_exceptions=True)
2965
2966
2967
2968---
2969File: /neurons/miners/src/services/ssh_service.py
2970---
2971
2972import getpass
2973import os
2974
2975
2976class MinerSSHService:
2977 def add_pubkey_to_host(self, pub_key: bytes):
2978 with open(os.path.expanduser("~/.ssh/authorized_keys"), "a") as file:
2979 file.write(pub_key.decode() + "\n")
2980
2981 def remove_pubkey_from_host(self, pub_key: bytes):
2982 pub_key_str = pub_key.decode().strip()
2983 authorized_keys_path = os.path.expanduser("~/.ssh/authorized_keys")
2984
2985 with open(authorized_keys_path, "r") as file:
2986 lines = file.readlines()
2987
2988 with open(authorized_keys_path, "w") as file:
2989 for line in lines:
2990 if line.strip() != pub_key_str:
2991 file.write(line)
2992
2993 def get_current_os_user(self) -> str:
2994 return getpass.getuser()
2995
2996
2997
2998---
2999File: /neurons/miners/src/services/validator_service.py
3000---
3001
3002from typing import Annotated
3003
3004from fastapi import Depends
3005
3006from daos.validator import ValidatorDao
3007
3008
3009class ValidatorService:
3010 def __init__(self, validator_dao: Annotated[ValidatorDao, Depends(ValidatorDao)]):
3011 self.validator_dao = validator_dao
3012
3013 def is_valid_validator(self, validator_hotkey: str) -> bool:
3014        return bool(self.validator_dao.get_validator_by_hotkey(validator_hotkey))
3015
3016
3017
3018---
3019File: /neurons/miners/src/_miner.py
3020---
3021
3022# The MIT License (MIT)
3023# Copyright © 2023 Yuma Rao
3024# TODO(developer): Set your name
3025# Copyright © 2023 <your name>
3026
3027# Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated
3028# documentation files (the “Software”), to deal in the Software without restriction, including without limitation
3029# the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software,
3030# and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
3031
3032# The above copyright notice and this permission notice shall be included in all copies or substantial portions of
3033# the Software.
3034
3035# THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO
3036# THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
3037# THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
3038# OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
3039# DEALINGS IN THE SOFTWARE.
3040
3041import time
3042
3043import bittensor as bt
3044
3045# Bittensor Miner Template:
3046import template
3047
3048# import base miner class which takes care of most of the boilerplate
3049from template.base.miner import BaseMinerNeuron
3050
3051
3052class Miner(BaseMinerNeuron):
3053 """
3054 Your miner neuron class. You should use this class to define your miner's behavior. In particular, you should replace the forward function with your own logic. You may also want to override the blacklist and priority functions according to your needs.
3055
3056 This class inherits from the BaseMinerNeuron class, which in turn inherits from BaseNeuron. The BaseNeuron class takes care of routine tasks such as setting up wallet, subtensor, metagraph, logging directory, parsing config, etc. You can override any of the methods in BaseNeuron if you need to customize the behavior.
3057
3058    This class provides reasonable default behavior for a miner such as blacklisting unrecognized hotkeys, prioritizing requests based on stake, and forwarding requests to the forward function. If you need to define custom behavior, override these methods as needed.
3059 """
3060
3061 def __init__(self, config=None):
3062 super().__init__(config=config)
3063
3064 # TODO(developer): Anything specific to your use case you can do here
3065
3066 async def forward(self, synapse: template.protocol.Dummy) -> template.protocol.Dummy:
3067 """
3068 Processes the incoming 'Dummy' synapse by performing a predefined operation on the input data.
3069 This method should be replaced with actual logic relevant to the miner's purpose.
3070
3071 Args:
3072 synapse (template.protocol.Dummy): The synapse object containing the 'dummy_input' data.
3073
3074 Returns:
3075 template.protocol.Dummy: The synapse object with the 'dummy_output' field set to twice the 'dummy_input' value.
3076
3077 The 'forward' function is a placeholder and should be overridden with logic that is appropriate for
3078 the miner's intended operation. This method demonstrates a basic transformation of input data.
3079 """
3080 # TODO(developer): Replace with actual implementation logic.
3081 synapse.dummy_output = synapse.dummy_input * 2
3082 return synapse
3083
3084 async def blacklist(self, synapse: template.protocol.Dummy) -> tuple[bool, str]:
3085 """
3086 Determines whether an incoming request should be blacklisted and thus ignored. Your implementation should
3087 define the logic for blacklisting requests based on your needs and desired security parameters.
3088
3089 Blacklist runs before the synapse data has been deserialized (i.e. before synapse.data is available).
3090        The synapse is instead constructed from the headers of the request. It is important to blacklist
3091 requests before they are deserialized to avoid wasting resources on requests that will be ignored.
3092
3093 Args:
3094 synapse (template.protocol.Dummy): A synapse object constructed from the headers of the incoming request.
3095
3096 Returns:
3097 Tuple[bool, str]: A tuple containing a boolean indicating whether the synapse's hotkey is blacklisted,
3098 and a string providing the reason for the decision.
3099
3100 This function is a security measure to prevent resource wastage on undesired requests. It should be enhanced
3101 to include checks against the metagraph for entity registration, validator status, and sufficient stake
3102 before deserialization of synapse data to minimize processing overhead.
3103
3104 Example blacklist logic:
3105 - Reject if the hotkey is not a registered entity within the metagraph.
3106 - Consider blacklisting entities that are not validators or have insufficient stake.
3107
3108 In practice it would be wise to blacklist requests from entities that are not validators, or do not have
3109 enough stake. This can be checked via metagraph.S and metagraph.validator_permit. You can always attain
3110 the uid of the sender via a metagraph.hotkeys.index( synapse.dendrite.hotkey ) call.
3111
3112 Otherwise, allow the request to be processed further.
3113 """
3114
3115 if synapse.dendrite is None or synapse.dendrite.hotkey is None:
3116 bt.logging.warning("Received a request without a dendrite or hotkey.")
3117 return True, "Missing dendrite or hotkey"
3118
3119        # TODO(developer): Define how miners should blacklist requests.
3120        if (
3121            not self.config.blacklist.allow_non_registered
3122            and synapse.dendrite.hotkey not in self.metagraph.hotkeys
3123        ):
3124            # Ignore requests from un-registered entities.
3125            bt.logging.trace(f"Blacklisting un-registered hotkey {synapse.dendrite.hotkey}")
3126            return True, "Unrecognized hotkey"
3127        uid = self.metagraph.hotkeys.index(synapse.dendrite.hotkey)  # look up uid only after the registration check
3128
3129 if self.config.blacklist.force_validator_permit:
3130 # If the config is set to force validator permit, then we should only allow requests from validators.
3131 if not self.metagraph.validator_permit[uid]:
3132 bt.logging.warning(
3133 f"Blacklisting a request from non-validator hotkey {synapse.dendrite.hotkey}"
3134 )
3135 return True, "Non-validator hotkey"
3136
3137 bt.logging.trace(f"Not Blacklisting recognized hotkey {synapse.dendrite.hotkey}")
3138 return False, "Hotkey recognized!"
3139
3140 async def priority(self, synapse: template.protocol.Dummy) -> float:
3141 """
3142 The priority function determines the order in which requests are handled. More valuable or higher-priority
3143 requests are processed before others. You should design your own priority mechanism with care.
3144
3145 This implementation assigns priority to incoming requests based on the calling entity's stake in the metagraph.
3146
3147 Args:
3148 synapse (template.protocol.Dummy): The synapse object that contains metadata about the incoming request.
3149
3150 Returns:
3151 float: A priority score derived from the stake of the calling entity.
3152
3153 Miners may receive messages from multiple entities at once. This function determines which request should be
3154 processed first. Higher values indicate that the request should be processed first. Lower values indicate
3155 that the request should be processed later.
3156
3157 Example priority logic:
3158 - A higher stake results in a higher priority value.
3159 """
3160 if synapse.dendrite is None or synapse.dendrite.hotkey is None:
3161 bt.logging.warning("Received a request without a dendrite or hotkey.")
3162 return 0.0
3163
3164 # TODO(developer): Define how miners should prioritize requests.
3165 caller_uid = self.metagraph.hotkeys.index(synapse.dendrite.hotkey) # Get the caller index.
3166 priority = float(self.metagraph.S[caller_uid]) # Return the stake as the priority.
3167 bt.logging.trace(f"Prioritizing {synapse.dendrite.hotkey} with value: {priority}")
3168 return priority
3169
3170
3171# This is the main function, which runs the miner.
3172if __name__ == "__main__":
3173 with Miner() as miner:
3174 while True:
3175 bt.logging.info(f"Miner running... {time.time()}")
3176 time.sleep(5)
3177
3178
3179
3180---
3181File: /neurons/miners/src/cli.py
3182---
3183
3184import asyncio
3185import logging
3186import uuid
3187
3188import click
3189import sqlalchemy
3190
3191from core.db import get_db
3192from daos.executor import ExecutorDao
3193from models.executor import Executor
3194
3195logging.basicConfig(level=logging.INFO)
3196logger = logging.getLogger(__name__)
3197
3198
3199async def async_add_executor(address: str, port: int, validator: str):
3200 """Add executor machine to the database"""
3201    logger.info("Adding a new executor (%s:%d) open to validator (%s)", address, port, validator)
3202 executor_dao = ExecutorDao(session=next(get_db()))
3203 try:
3204 executor = executor_dao.save(
3205 Executor(uuid=uuid.uuid4(), address=address, port=port, validator=validator)
3206 )
3207 except sqlalchemy.exc.IntegrityError as e:
3208        logger.error("Failed to add an executor: %s", str(e))
3209 else:
3210 logger.info("Added an executor(id=%s)", str(executor.uuid))
3211
3212
3213@click.group()
3214def cli():
3215 pass
3216
3217
3218@cli.command()
3219@click.option("--address", prompt="IP Address", help="IP address of executor")
3220@click.option("--port", type=int, prompt="Port", help="Port of executor")
3221@click.option(
3222 "--validator", prompt="Validator Hotkey", help="Validator hotkey that executor opens to."
3223)
3224def add_executor(address: str, port: int, validator: str):
3225 """Add executor machine to the database"""
3226 asyncio.run(async_add_executor(address, port, validator))
3227
3228
3229@cli.command()
3230@click.option("--address", prompt="IP Address", help="IP address of executor")
3231@click.option("--port", type=int, prompt="Port", help="Port of executor")
3232def remove_executor(address: str, port: int):
3233    """Remove an executor machine from the database"""
3234 logger.info("Removing executor (%s:%d)", address, port)
3235 executor_dao = ExecutorDao(session=next(get_db()))
3236 try:
3237 executor_dao.delete_by_address_port(address, port)
3238 except sqlalchemy.exc.IntegrityError as e:
3239        logger.error("Failed to remove an executor: %s", str(e))
3240 else:
3241 logger.info("Removed an executor(%s:%d)", address, port)
3242
3243
3244@cli.command()
3245def show_executors():
3246    """Show all executor machines in the database"""
3247 executor_dao = ExecutorDao(session=next(get_db()))
3248 try:
3249 for executor in executor_dao.get_all_executors():
3250 logger.info("%s:%d -> %s", executor.address, executor.port, executor.validator)
3251 except sqlalchemy.exc.IntegrityError as e:
3252        logger.error("Failed to list executors: %s", str(e))
3253
3254
3255if __name__ == "__main__":
3256 cli()
3257
3258
3259
3260---
3261File: /neurons/miners/src/gpt2-training-model.py
3262---
3263
3264import time
3265import torch
3266from datasets import load_dataset
3267from torch.utils.data import DataLoader
3268from transformers import AdamW, GPT2LMHeadModel, GPT2Tokenizer
3269
3270# Load a small dataset
3271dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:1000]")
3272
3273# Initialize tokenizer and model
3274tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
3275model = GPT2LMHeadModel.from_pretrained("gpt2")
3276
3277tokenizer.pad_token = tokenizer.eos_token
3278
3279
3280# Tokenize the dataset
3281def tokenize_function(examples):
3282 return tokenizer(examples["text"], truncation=True, max_length=128, padding="max_length")
3283
3284
3285start_time = time.time()
3286tokenized_dataset = dataset.map(tokenize_function, batched=True)
3287tokenized_dataset = tokenized_dataset.remove_columns(["text"])
3288tokenized_dataset.set_format("torch")
3289
3290# Create DataLoader
3291dataloader = DataLoader(tokenized_dataset, batch_size=4, shuffle=True)
3292
3293# Training loop
3294device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
3295print("device", device)
3296model.to(device)
3297
3298
3299# Evaluation function
3300def evaluate(model, dataloader):
3301 model.eval()
3302 total_loss = 0
3303 with torch.no_grad():
3304 for batch in dataloader:
3305 inputs = batch["input_ids"].to(device)
3306 outputs = model(input_ids=inputs, labels=inputs)
3307 total_loss += outputs.loss.item()
3308 return total_loss / len(dataloader)
3309
3310
3311# Initial evaluation
3312initial_loss = evaluate(model, dataloader)
3313print(f"Initial Loss: {initial_loss:.4f}")
3314print(f"Initial Perplexity: {torch.exp(torch.tensor(initial_loss)):.4f}")
3315optimizer = AdamW(model.parameters(), lr=5e-5, no_deprecation_warning=True)
3316
3317num_epochs = 1
3318for epoch in range(num_epochs):
3319 model.train()
3320 for batch in dataloader:
3321 batch = {k: v.to(device) for k, v in batch.items()}
3322 outputs = model(input_ids=batch["input_ids"], labels=batch["input_ids"])
3323 loss = outputs.loss
3324 loss.backward()
3325 optimizer.step()
3326 optimizer.zero_grad()
3327 print(f"Epoch {epoch+1}/{num_epochs} completed")
3328
3329# Final evaluation
3330final_loss = evaluate(model, dataloader)
3331print(f"Final Loss: {final_loss:.4f}")
3332print(f"Final Perplexity: {torch.exp(torch.tensor(final_loss)):.4f}")
3333
3334print(f"Loss decreased by: {initial_loss - final_loss:.4f}")
3335print(
3336 f"Perplexity decreased by: {torch.exp(torch.tensor(initial_loss)) - torch.exp(torch.tensor(final_loss)):.4f}"
3337)
3338
3339print("Job finished")
3340print(time.time() - start_time)
3341
3342
3343---
3344File: /neurons/miners/src/miner.py
3345---
3346
3347import asyncio
3348import logging
3349from contextlib import asynccontextmanager
3350
3351import uvicorn
3352from fastapi import FastAPI
3353
3354from core.config import settings
3355from core.miner import Miner
3356from routes.debug_routes import debug_apis_router
3357from routes.validator_interface import validator_router
3358from core.utils import configure_logs_of_other_modules, wait_for_services_sync
3359
3360configure_logs_of_other_modules()
3361wait_for_services_sync()
3362
3363
3364@asynccontextmanager
3365async def app_lifespan(app: FastAPI):
3366 miner = Miner()
3367 # Run the miner in the background
3368 task = asyncio.create_task(miner.start())
3369
3370 try:
3371 yield
3372 finally:
3373 await miner.stop() # Ensure proper cleanup
3374 await task # Wait for the background task to complete
3375 logging.info("Miner exited successfully.")
3376
3377
3378app = FastAPI(
3379 title=settings.PROJECT_NAME,
3380 lifespan=app_lifespan,
3381)
3382
3383app.include_router(validator_router)
3384app.include_router(debug_apis_router)
3385
3386reload = settings.ENV == "dev"
3387
3388if __name__ == "__main__":
3389 uvicorn.run("miner:app", host="0.0.0.0", port=settings.INTERNAL_PORT, reload=reload)
3390
3391
3392
3393---
3394File: /neurons/miners/tests/__init__.py
3395---
3396
3397
3398
3399
3400---
3401File: /neurons/miners/assigning_validator_hotkeys.md
3402---
3403
3404# Best Practices for Assigning Validator Hotkeys
3405
3406In the Compute Subnet, validators play a critical role in ensuring the performance and security of the network. However, miners must assign executors carefully to the validators to maximize incentives. This guide explains the best strategy for assigning validator hotkeys based on stake distribution within the network.
3407
3408## Why Validator Hotkey Assignment Matters
3409
3410You will **not receive any rewards** if your executors are not assigned to validators that control a **majority of the stake** in the network. Therefore, it’s crucial to understand how stake distribution works and how to assign your executors effectively.
3411
3412## Step-by-Step Strategy for Assigning Validator Hotkeys
3413
3414### 1. Check the Validator Stakes
3415
3416The first step is to determine how much stake each validator controls in the network. You can find the current stake distribution of all validators by visiting:
3417
3418[**TaoMarketCap Subnet 51 Validators**](https://taomarketcap.com/subnets/51/validators)
3419
3420This page lists each validator and their respective stake, which is essential for making decisions about hotkey assignments.
3421
3422### 2. Assign Executors to Cover at Least 50% of the Stake
3423
3424To begin, you need to ensure that your executors are covering **at least 50%** of the total network stake. This guarantees that your executors will be actively validated and you’ll receive rewards.
3425
3426#### Example:
3427
3428Suppose you have **100 executors** (GPUs) and the stake distribution of the validators is as follows:
3429
3430| Validator | Stake (%) |
3431|-----------|-----------|
3432| Validator 1 | 50% |
3433| Validator 2 | 25% |
3434| Validator 3 | 15% |
3435| Validator 4 | 5% |
3436| Validator 5 | 1% |
3437
3438- To cover 50% of the total stake, assign **enough executors** to cover **Validator 1** (50% stake).
3439- In this case, assign at least **one executor** to **Validator 1** because they control 50% of the network stake.
3440
3441### 3. Stake-Weighted Assignment for Remaining Executors
3442
3443Once you’ve ensured that you’re covering at least 50% of the network stake, the remaining executors should be assigned in a **stake-weighted** fashion to maximize rewards.
3444
3445#### Continuing the Example:
3446
3447You have **99 remaining executors** to assign. Distribute your full fleet of 100 in proportion to stake (the executor already assigned to Validator 1 in Step 2 counts toward its share, and since the listed stakes sum to only 96%, give any leftover executors to the highest-stake validators). For example (see the sketch after this list):
3448
3449- **Validator 1 (50% stake)**: Assign **50% of executors** to Validator 1.
3450 - Assign 50 executors.
3451- **Validator 2 (25% stake)**: Assign **25% of executors** to Validator 2.
3452 - Assign 25 executors.
3453- **Validator 3 (15% stake)**: Assign **15% of executors** to Validator 3.
3454 - Assign 15 executors.
3455- **Validator 4 (5% stake)**: Assign **5% of executors** to Validator 4.
3456 - Assign 5 executors.
3457- **Validator 5 (1% stake)**: Assign **1% of executors** to Validator 5.
3458 - Assign 1 executor.
3459
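If you prefer to compute the split programmatically, the following is a minimal sketch of the stake-weighted allocation described above. The `allocate_executors` helper is hypothetical (it is not part of this repository); it normalizes over the listed stakes so that every executor is assigned, handing rounding leftovers to the highest-stake validators first.

```python
# Hypothetical helper illustrating the stake-weighted split described above.
# Stake values are taken from the example table; replace them with live data
# from TaoMarketCap before using anything like this for real assignments.
from math import floor


def allocate_executors(stakes: dict[str, float], total_executors: int) -> dict[str, int]:
    """Split executors proportionally to stake; rounding leftovers go to the largest stakes."""
    total_stake = sum(stakes.values())
    allocation = {v: floor(total_executors * s / total_stake) for v, s in stakes.items()}
    # Hand out the executors lost to flooring, biggest stake first.
    leftover = total_executors - sum(allocation.values())
    for v in sorted(stakes, key=stakes.get, reverse=True)[:leftover]:
        allocation[v] += 1
    return allocation


stakes = {"Validator 1": 50, "Validator 2": 25, "Validator 3": 15, "Validator 4": 5, "Validator 5": 1}
print(allocate_executors(stakes, total_executors=100))
# -> {'Validator 1': 53, 'Validator 2': 26, 'Validator 3': 15, 'Validator 4': 5, 'Validator 5': 1}
```

Because the helper redistributes the unlisted 4% of stake, its counts differ slightly from the hand-worked numbers above; either split satisfies the "cover at least 50% of stake, then weight by stake" strategy.
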
3460### 4. Adjust Based on Network Dynamics
3461
3462The stake of validators can change over time. Make sure to periodically check the **validator stakes** on [TaoMarketCap](https://taomarketcap.com/subnets/51/validators) and **reassign your executors** as needed to maintain optimal rewards. If a validator’s stake increases significantly, you may want to adjust your assignments accordingly.
3463
3464## Summary of the Best Strategy
3465
3466- **Step 1**: Check the validator stakes on [TaoMarketCap](https://taomarketcap.com/subnets/51/validators).
3467- **Step 2**: Ensure your executors are covering at least **50% of the total network stake**.
3468- **Step 3**: Use a **stake-weighted** strategy to assign your remaining executors, matching the proportion of the stake each validator controls.
3469- **Step 4**: Periodically recheck the stake distribution and adjust assignments as needed.
3470
3471By following this strategy, you’ll ensure that your executors are assigned to validators in the most efficient way possible, maximizing your chances of receiving rewards.
3472
3473## Additional Resources
3474
3475- [TaoMarketCap Subnet 51 Validators](https://taomarketcap.com/subnets/51/validators)
3476- [Compute Subnet Miner README](README.md)
3477
3478
3479
3480
3481---
3482File: /neurons/miners/docker_build.sh
3483---
3484
3485#!/bin/bash
3486set -eux -o pipefail
3487
3488IMAGE_NAME="daturaai/compute-subnet-miner:$TAG"
3489
3490docker build --build-context datura=../../datura -t "$IMAGE_NAME" .
3491
3492
3493---
3494File: /neurons/miners/docker_publish.sh
3495---
3496
3497#!/bin/bash
3498set -eux -o pipefail
3499
3500source ./docker_build.sh
3501
3502echo "$DOCKERHUB_PAT" | docker login -u "$DOCKERHUB_USERNAME" --password-stdin
3503docker push "$IMAGE_NAME"
3504
3505
3506---
3507File: /neurons/miners/docker_runner_build.sh
3508---
3509
3510#!/bin/bash
3511set -eux -o pipefail
3512
3513IMAGE_NAME="daturaai/compute-subnet-miner-runner:$TAG"
3514
3515docker build --file Dockerfile.runner -t "$IMAGE_NAME" .
3516
3517
3518---
3519File: /neurons/miners/docker_runner_publish.sh
3520---
3521
3522#!/bin/bash
3523set -eux -o pipefail
3524
3525source ./docker_runner_build.sh
3526
3527echo "$DOCKERHUB_PAT" | docker login -u "$DOCKERHUB_USERNAME" --password-stdin
3528docker push "$IMAGE_NAME"
3529
3530
3531---
3532File: /neurons/miners/entrypoint.sh
3533---
3534
3535#!/bin/sh
3536set -eu
3537
3538docker compose up --pull always --detach --wait --force-recreate
3539
3540# Clean docker images
3541docker image prune -f
3542
3543while true
3544do
3545 docker compose logs -f
3546 echo 'All containers died'
3547 sleep 10
3548done
3549
3550
3551
3552---
3553File: /neurons/miners/README.md
3554---
3555
3556# Miner
3557
3558## Overview
3559
3560This miner allows you to contribute your GPU resources to the Compute Subnet and earn compensation for providing computational power. You will run a central miner on a CPU server, which manages multiple executors running on GPU-equipped machines.
3561
3562### Central Miner Server Requirements
3563
3564To run the central miner, you only need a CPU server with the following specifications:
3565
3566- **CPU**: 4 cores
3567- **RAM**: 8GB
3568- **Storage**: 50GB available disk space
3569- **OS**: Ubuntu (recommended)
3570
3571### Executors
3572
3573Executors are GPU-equipped machines that perform the computational tasks. The central miner manages these executors, which can be easily added or removed from the network.
3574
3575To see which GPUs are compatible for mining and their relative rewards, see the dictionary [here](https://github.com/Datura-ai/compute-subnet/blob/main/neurons/validators/src/services/const.py#L3).
3576
3577## Installation
3578
3579### Using Docker
3580
3581#### Step 1: Clone the Git Repository
3582
3583```
3584git clone https://github.com/Datura-ai/compute-subnet.git
3585```
3586
3587#### Step 2: Install Required Tools
3588
3589```
3590cd compute-subnet && chmod +x scripts/install_miner_on_ubuntu.sh && ./scripts/install_miner_on_ubuntu.sh
3591```
3592
3593Verify that bittensor and docker are installed:
3594```
3595btcli --version
3596```
3597
3598```
3599docker --version
3600```
3601
3602If either one isn't installed properly, install it using the following links:
3603For bittensor, use [This Link](https://github.com/opentensor/bittensor/blob/master/README.md#install-bittensor-sdk)
3604For docker, use [This Link](https://docs.docker.com/engine/install/)
3605
3606#### Step 3: Setup ENV
3607```
3608cp neurons/miners/.env.template neurons/miners/.env
3609```
3610
3611Fill in your information for:
3612
3613`BITTENSOR_WALLET_NAME`: Your wallet name for Bittensor. You can check this with `btcli wallet list`
3614
3615`BITTENSOR_WALLET_HOTKEY_NAME`: The hotkey name of your wallet's registered hotkey. If it is not registered, run `btcli subnet register --netuid 51`.
3616
3617`EXTERNAL_IP_ADDRESS`: The external IP address of your central miner server. Make sure it is open to external connections on the `EXTERNAL_PORT`.
3618
3619`HOST_WALLET_DIR`: The directory path of your wallet on the machine.
3620
3621`INTERNAL_PORT` and `EXTERNAL_PORT`: Optionally customize these ports. Make sure `EXTERNAL_PORT` is open so validators can connect to it.
3622
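For reference, a filled-in `.env` might look like the following. All values are illustrative placeholders (not defaults from the repository); replace them with your own configuration:

```
BITTENSOR_WALLET_NAME=my_wallet
BITTENSOR_WALLET_HOTKEY_NAME=my_hotkey
EXTERNAL_IP_ADDRESS=203.0.113.10
HOST_WALLET_DIR=/home/ubuntu/.bittensor/wallets
INTERNAL_PORT=8000
EXTERNAL_PORT=8000
```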
3623
3624#### Step 4: Start the Miner
3625
3626```
3627cd neurons/miners && docker compose up -d
3628```
3629
3630## Managing Executors
3631
3632### Adding an Executor
3633
3634Executors are machines running on GPUs that you can add to your central miner. The more executors (GPUs) you have, the greater your compensation will be. Here's how to add them:
3635
36361. Ensure the executor machine is set up and running Docker. For more information, follow the [executor README.md here](../executor/README.md)
36372. Use the following command to add an executor to the central miner:
3638
3639 ```bash
3640 docker exec <container-id or name> python /root/app/src/cli.py add-executor --address <executor-ip-address> --port <executor-port> --validator <validator-hotkey>
3641 ```
3642
3643 - `<executor-ip-address>`: The IP address of the executor machine.
3644 - `<executor-port>`: The port number used for the executor (default: `8001`).
3645 - `<validator-hotkey>`: The validator hotkey that you want to give access to this executor. Which validator hotkey should you pick? Follow [this guide](assigning_validator_hotkeys.md)
3646
3647### What is a Validator Hotkey?
3648
3649The **validator hotkey** is a unique identifier tied to a validator that authenticates and verifies the performance of your executor machines. When you specify a validator hotkey during executor registration, it ensures that your executor is validated by this specific validator.
3650
3651To switch to a different validator, first follow the instructions for removing an executor, then re-register it with the new validator hotkey by running the add-executor command again (Step 2 of Adding an Executor).
3652
3653### Removing an Executor
3654
3655To remove an executor from the central miner, follow these steps:
3656
36571. Run the following command to remove the executor:
3658
3659 ```bash
3660 docker exec <docker instance> python /root/app/src/cli.py remove-executor --address <executor public ip> --port <executor external port>
3661 ```
3662
3663
3664### Monitoring earnings
3665
3666To monitor your earnings, use [Taomarketcap.com](https://taomarketcap.com/subnets/51/miners)'s subnet 51 miner page to track your daily rewards and your performance relative to other miners.
3667
3668
3669
3670---
3671File: /neurons/miners/run.sh
3672---
3673
3674#!/bin/sh
3675
3676# db migrate
3677alembic upgrade head
3678
3679# run fastapi app
3680python src/miner.py
3681
3682
3683---
3684File: /neurons/validators/migrations/versions/0653dc97382a_add_executors_table.py
3685---
3686
3687"""Add executors table
3688
3689Revision ID: 0653dc97382a
3690Revises: d5037a3f7b99
3691Create Date: 2024-09-10 09:42:38.878136
3692
3693"""
3694from typing import Sequence, Union
3695
3696from alembic import op
3697import sqlalchemy as sa
3698import sqlmodel
3699import sqlmodel.sql.sqltypes
3700
3701
3702# revision identifiers, used by Alembic.
3703revision: str = '0653dc97382a'
3704down_revision: Union[str, None] = 'd5037a3f7b99'
3705branch_labels: Union[str, Sequence[str], None] = None
3706depends_on: Union[str, Sequence[str], None] = None
3707
3708
3709def upgrade() -> None:
3710 # ### commands auto generated by Alembic - please adjust! ###
3711 op.create_table('executor',
3712 sa.Column('uuid', sa.Uuid(), nullable=False),
3713 sa.Column('miner_address', sqlmodel.sql.sqltypes.AutoString(), nullable=False),
3714 sa.Column('miner_port', sa.Integer(), nullable=False),
3715 sa.Column('miner_hotkey', sqlmodel.sql.sqltypes.AutoString(), nullable=False),
3716 sa.Column('executor_id', sa.Uuid(), nullable=False),
3717 sa.Column('executor_ip_address', sqlmodel.sql.sqltypes.AutoString(), nullable=False),
3718 sa.Column('executor_ssh_username', sqlmodel.sql.sqltypes.AutoString(), nullable=False),
3719 sa.Column('executor_ssh_port', sa.Integer(), nullable=False),
3720 sa.Column('rented', sa.Boolean(), nullable=True),
3721 sa.PrimaryKeyConstraint('uuid')
3722 )
3723 op.add_column('task', sa.Column('executor_id', sa.Uuid(), nullable=False))
3724 op.drop_column('task', 'ssh_private_key')
3725 # ### end Alembic commands ###
3726
3727
3728def downgrade() -> None:
3729 # ### commands auto generated by Alembic - please adjust! ###
3730 op.add_column('task', sa.Column('ssh_private_key', sa.VARCHAR(), autoincrement=False, nullable=False))
3731 op.drop_column('task', 'executor_id')
3732 op.drop_table('executor')
3733 # ### end Alembic commands ###
3734
3735
3736
3737---
3738File: /neurons/validators/migrations/versions/d5037a3f7b99_create_task_model.py
3739---
3740
3741"""create task model
3742
3743Revision ID: d5037a3f7b99
3744Revises:
3745Create Date: 2024-08-19 17:57:42.735518
3746
3747"""
3748from typing import Sequence, Union
3749
3750from alembic import op
3751import sqlalchemy as sa
3752import sqlmodel
3753import sqlmodel.sql.sqltypes
3754
3755
3756# revision identifiers, used by Alembic.
3757revision: str = 'd5037a3f7b99'
3758down_revision: Union[str, None] = None
3759branch_labels: Union[str, Sequence[str], None] = None
3760depends_on: Union[str, Sequence[str], None] = None
3761
3762
3763def upgrade() -> None:
3764 # ### commands auto generated by Alembic - please adjust! ###
3765 op.create_table('task',
3766 sa.Column('uuid', sa.Uuid(), nullable=False),
3767 sa.Column('task_status', sa.Enum('Initiated', 'SSHConnected', 'Failed', 'Finished', name='taskstatus'), nullable=True),
3768 sa.Column('miner_hotkey', sqlmodel.sql.sqltypes.AutoString(), nullable=False),
3769 sa.Column('ssh_private_key', sqlmodel.sql.sqltypes.AutoString(), nullable=False),
3770 sa.Column('created_at', sa.DateTime(), nullable=False),
3771 sa.Column('proceed_time', sa.Integer(), nullable=True),
3772 sa.Column('score', sa.Float(), nullable=True),
3773 sa.PrimaryKeyConstraint('uuid')
3774 )
3775 # ### end Alembic commands ###
3776
3777
3778def downgrade() -> None:
3779 # ### commands auto generated by Alembic - please adjust! ###
3780 op.drop_table('task')
3781 # ### end Alembic commands ###
3782
3783
3784
3785---
3786File: /neurons/validators/migrations/env.py
3787---
3788
3789import os
3790from logging.config import fileConfig
3791from pathlib import Path
3792
3793from alembic import context
3794from dotenv import load_dotenv
3795from sqlalchemy import engine_from_config, pool
3796from sqlmodel import SQLModel
3797
3798from models.executor import * # noqa
3799from models.task import * # noqa
3800
3801# this is the Alembic Config object, which provides
3802# access to the values within the .ini file in use.
3803config = context.config
3804
3805# Interpret the config file for Python logging.
3806# This line sets up loggers basically.
3807if config.config_file_name is not None:
3808 fileConfig(config.config_file_name)
3809
3810# add your model's MetaData object here
3811# for 'autogenerate' support
3812# from myapp import mymodel
3813# target_metadata = mymodel.Base.metadata
3814
3815target_metadata = SQLModel.metadata
3816
3817# other values from the config, defined by the needs of env.py,
3818# can be acquired:
3819# my_important_option = config.get_main_option("my_important_option")
3820# ... etc.
3821
3822current_dir = Path(__file__).parent
3823
3824load_dotenv(str(current_dir / ".." / ".env"))
3825
3826
3827def get_url():
3828 url = os.getenv("SQLALCHEMY_DATABASE_URI")
3829 return url
3830
3831
3832def run_migrations_offline() -> None:
3833 """Run migrations in 'offline' mode.
3834
3835 This configures the context with just a URL
3836 and not an Engine, though an Engine is acceptable
3837 here as well. By skipping the Engine creation
3838 we don't even need a DBAPI to be available.
3839
3840 Calls to context.execute() here emit the given string to the
3841 script output.
3842
3843 """
3844 url = get_url()
3845 context.configure(
3846 url=url,
3847 target_metadata=target_metadata,
3848 literal_binds=True,
3849 dialect_opts={"paramstyle": "named"},
3850 )
3851
3852 with context.begin_transaction():
3853 context.run_migrations()
3854
3855
3856def run_migrations_online() -> None:
3857 """Run migrations in 'online' mode.
3858
3859 In this scenario we need to create an Engine
3860 and associate a connection with the context.
3861
3862 """
3863 configuration = config.get_section(config.config_ini_section)
3864 configuration["sqlalchemy.url"] = get_url()
3865 connectable = engine_from_config(
3866 configuration,
3867 prefix="sqlalchemy.",
3868 poolclass=pool.NullPool,
3869 )
3870
3871 with connectable.connect() as connection:
3872 context.configure(connection=connection, target_metadata=target_metadata)
3873
3874 with context.begin_transaction():
3875 context.run_migrations()
3876
3877
3878if context.is_offline_mode():
3879 run_migrations_offline()
3880else:
3881 run_migrations_online()
3882
3883
3884
3885---
3886File: /neurons/validators/src/clients/__init__.py
3887---
3888
3889
3890
3891
3892---
3893File: /neurons/validators/src/clients/compute_client.py
3894---
3895
3896import asyncio
3897import json
3898import logging
3899from typing import NoReturn
3900
3901import bittensor
3902import pydantic
3903import redis.asyncio as aioredis
3904import tenacity
3905import websockets
3906from datura.requests.base import BaseRequest
3907from payload_models.payloads import (
3908 ContainerBaseRequest,
3909 ContainerCreated,
3910 ContainerCreateRequest,
3911 ContainerDeleteRequest,
3912 ContainerStartRequest,
3913 ContainerStopRequest,
3914 DuplicateExecutorsResponse,
3915 FailedContainerRequest,
3916)
3917from protocol.vc_protocol.compute_requests import Error, RentedMachineResponse, Response
3918from protocol.vc_protocol.validator_requests import (
3919 AuthenticateRequest,
3920 DuplicateExecutorsRequest,
3921 ExecutorSpecRequest,
3922 LogStreamRequest,
3923 RentedMachineRequest,
3924)
3925from pydantic import BaseModel
3926from websockets.asyncio.client import ClientConnection
3927
3928from clients.metagraph_client import create_metagraph_refresh_task, get_miner_axon_info
3929from core.utils import _m, get_extra_info
3930from services.miner_service import MinerService
3931from services.redis_service import (
3932 DUPLICATED_MACHINE_SET,
3933 MACHINE_SPEC_CHANNEL_NAME,
3934 RENTED_MACHINE_SET,
3935 STREAMING_LOG_CHANNEL,
3936)
3937
3938logger = logging.getLogger(__name__)
3939
3940
3941class AuthenticationError(Exception):
3942 def __init__(self, reason: str, errors: list[Error]):
3943 self.reason = reason
3944 self.errors = errors
3945
3946
3947class ComputeClient:
3948 HEARTBEAT_PERIOD = 60
3949
3950 def __init__(
3951 self, keypair: bittensor.Keypair, compute_app_uri: str, miner_service: MinerService
3952 ):
3953 self.keypair = keypair
3954 self.ws: ClientConnection | None = None
3955 self.compute_app_uri = compute_app_uri
3956 self.miner_drivers = asyncio.Queue()
3957 self.miner_driver_awaiter_task = asyncio.create_task(self.miner_driver_awaiter())
3958 # self.heartbeat_task = asyncio.create_task(self.heartbeat())
3959 self.refresh_metagraph_task = self.create_metagraph_refresh_task()
3960 self.miner_service = miner_service
3961
3962 self.logging_extra = get_extra_info(
3963 {
3964 "validator_hotkey": self.my_hotkey(),
3965 "compute_app_uri": compute_app_uri,
3966 }
3967 )
3968
3969 def accepted_request_type(self) -> type[BaseRequest]:
3970 return ContainerBaseRequest
3971
3972 def connect(self):
3973 """Create an awaitable/async-iterable websockets.connect() object"""
3974 logger.info(
3975 _m(
3976 "Connecting to backend app",
3977 extra=self.logging_extra,
3978 )
3979 )
3980 return websockets.connect(self.compute_app_uri)
3981
3982 async def miner_driver_awaiter(self):
3983 """avoid memory leak by awaiting miner driver tasks"""
3984 while True:
3985 task = await self.miner_drivers.get()
3986 if task is None:
3987 return
3988
3989 try:
3990 await task
3991 except Exception as exc:
3992 logger.error(
3993 _m(
3994 "Error occurred during driving a miner client",
3995 extra={**self.logging_extra, "error": str(exc)},
3996 )
3997 )
3998
3999 async def __aenter__(self):
4000        return self
4001
4002 async def __aexit__(self, exc_type, exc_val, exc_tb):
4003 await self.miner_drivers.put(None)
4004 await self.miner_driver_awaiter_task
4005
4006 def my_hotkey(self) -> str:
4007 return self.keypair.ss58_address
4008
4009 async def run_forever(self) -> NoReturn:
4010 """connect (and re-connect) to facilitator and keep reading messages ... forever"""
4011 try:
4012 # subscribe to channel to get machine specs
4013 pubsub = await self.miner_service.redis_service.subscribe(MACHINE_SPEC_CHANNEL_NAME)
4014 log_channel = await self.miner_service.redis_service.subscribe(STREAMING_LOG_CHANNEL)
4015
4016 # send machine specs to facilitator
4017 self.specs_task = asyncio.create_task(self.wait_for_specs(pubsub))
4018 asyncio.create_task(self.wait_for_log_streams(log_channel))
4019 except Exception as exc:
4020 logger.error(
4021 _m("redis connection error", extra={**self.logging_extra, "error": str(exc)})
4022 )
4023
4024 asyncio.create_task(self.poll_rented_machines())
4025
4026 try:
4027 while True:
4028 async for ws in self.connect():
4029 try:
4030 logger.info(
4031 _m(
4032 "Connected to backend app",
4033 extra=self.logging_extra,
4034 )
4035 )
4036 await self.handle_connection(ws)
4037 except websockets.ConnectionClosed as exc:
4038 self.ws = None
4039 logger.warning(
4040 _m(
4041 f"validator connection to backend app closed with code {exc.code} and reason {exc.reason}, reconnecting...",
4042 extra=self.logging_extra,
4043 )
4044 )
4045 except asyncio.exceptions.CancelledError:
4046 self.ws = None
4047 logger.warning(
4048 _m(
4049 "Facilitator client received cancel, stopping",
4050 extra=self.logging_extra,
4051 )
4052 )
4053 except Exception:
4054 self.ws = None
4055 logger.error(
4056 _m(
4057 "Error in connecting to compute app",
4058 extra=self.logging_extra,
4059 )
4060 )
4061
4062 except Exception as exc:
4063 self.ws = None
4064 logger.error(
4065 _m(
4066 "Connecting to compute app failed",
4067 extra={**self.logging_extra, "error": str(exc)},
4068 ),
4069 exc_info=True,
4070 )
4071
4072 async def handle_connection(self, ws: ClientConnection):
4073 """handle a single websocket connection"""
4074 await ws.send(AuthenticateRequest.from_keypair(self.keypair).model_dump_json())
4075
4076 raw_msg = await ws.recv()
4077 try:
4078 response = Response.model_validate_json(raw_msg)
4079 except pydantic.ValidationError as exc:
4080 raise AuthenticationError(
4081 "did not receive Response for AuthenticationRequest", []
4082 ) from exc
4083 if response.status != "success":
4084 raise AuthenticationError("auth request received failed response", response.errors)
4085
4086 self.ws = ws
4087
4088 async for raw_msg in ws:
4089 await self.handle_message(raw_msg)
4090
4091 async def wait_for_specs(self, channel: aioredis.client.PubSub):
4092 specs_queue = []
4093 while True:
4094 validator_hotkey = self.my_hotkey()
4095
4096 logger.info(
4097 _m(
4098 f"Waiting for machine specs from validator app: {validator_hotkey}",
4099 extra=self.logging_extra,
4100 )
4101 )
4102 try:
4103 msg = await channel.get_message(ignore_subscribe_messages=True, timeout=100 * 60)
4104 logger.info(
4105 _m(
4106 "Received machine specs from validator app.",
4107 extra={**self.logging_extra},
4108 )
4109 )
4110
4111 if msg is None:
4112 logger.warning(
4113 _m(
4114 "No message received from validator app.",
4115 extra=self.logging_extra,
4116 )
4117 )
4118 continue
4119
4120 msg = json.loads(msg["data"])
4121 specs = None
4122 executor_logging_extra = {}
4123 try:
4124 specs = ExecutorSpecRequest(
4125 specs=msg["specs"],
4126 score=msg["score"],
4127 synthetic_job_score=msg["synthetic_job_score"],
4128 log_status=msg["log_status"],
4129 job_batch_id=msg["job_batch_id"],
4130 log_text=msg["log_text"],
4131 miner_hotkey=msg["miner_hotkey"],
4132 validator_hotkey=validator_hotkey,
4133 executor_uuid=msg["executor_uuid"],
4134 executor_ip=msg["executor_ip"],
4135 executor_port=msg["executor_port"],
4136 )
4137 executor_logging_extra = {
4138 "executor_uuid": msg["executor_uuid"],
4139 "executor_ip": msg["executor_ip"],
4140 "executor_port": msg["executor_port"],
4141 "job_batch_id": msg["job_batch_id"],
4142 }
4143 except Exception as exc:
4144 msg = "Error occurred while parsing msg"
4145 logger.error(
4146 _m(
4147 msg,
4148 extra={
4149 **self.logging_extra,
4150 **executor_logging_extra,
4151 "error": str(exc),
4152 },
4153 )
4154 )
4155 continue
4156
4157 logger.info(
4158 "Sending machine specs update of executor to compute app",
4159 extra={**self.logging_extra, **executor_logging_extra, "specs": str(specs)},
4160 )
4161
4162 specs_queue.append(specs)
4163 if self.ws is not None:
4164 while len(specs_queue) > 0:
4165 spec_to_send = specs_queue.pop(0)
4166 try:
4167 await self.send_model(spec_to_send)
4168 except Exception as exc:
4169 specs_queue.insert(0, spec_to_send)
4170 msg = "Error occurred while sending specs of executor"
4171 logger.error(
4172 _m(
4173 msg,
4174 extra={
4175 **self.logging_extra,
4176 **executor_logging_extra,
4177 "error": str(exc),
4178 },
4179 )
4180 )
4181 break
4182 except TimeoutError:
4183 logger.error(
4184 _m(
4185 "wait_for_specs still running",
4186 extra=self.logging_extra,
4187 )
4188 )
4189
4190 async def wait_for_log_streams(self, channel: aioredis.client.PubSub):
4191 logs_queue: list[LogStreamRequest] = []
4192 while True:
4193 validator_hotkey = self.my_hotkey()
4194 logger.info(
4195 _m(
4196 f"Waiting for log streams: {validator_hotkey}",
4197 extra=self.logging_extra,
4198 )
4199 )
4200 try:
4201 msg = await channel.get_message(ignore_subscribe_messages=True, timeout=100 * 60)
4202 if msg is None:
4203 logger.warning(
4204 _m(
4205 "No log streams yet",
4206 extra=self.logging_extra,
4207 )
4208 )
4209 continue
4210
4211 msg = json.loads(msg["data"])
4212 log_stream = None
4213
4214 try:
4215 log_stream = LogStreamRequest(
4216 logs=msg["logs"],
4217 miner_hotkey=msg["miner_hotkey"],
4218 validator_hotkey=validator_hotkey,
4219 executor_uuid=msg["executor_uuid"],
4220 )
4221
4222 logger.info(
4223 _m(
4224 f'Successfully created LogStreamRequest instance with {len(msg["logs"])} logs',
4225 extra=self.logging_extra,
4226 )
4227 )
4228 except Exception as exc:
4229 logger.error(
4230 _m(
4231 "Failed to get LogStreamRequest instance",
4232 extra={
4233 **self.logging_extra,
4234 "error": str(exc),
4235 "msg": str(msg),
4236 },
4237 )
4238 )
4239 continue
4240
4241 logs_queue.append(log_stream)
4242 if self.ws is not None:
4243 while len(logs_queue) > 0:
4244 log_to_send = logs_queue.pop(0)
4245 try:
4246 await self.send_model(log_to_send)
4247 except Exception as exc:
4248 logs_queue.insert(0, log_to_send)
4249 logger.error(
4250 _m(
4251                                    "Error occurred while sending log streams",
4252 extra={
4253 **self.logging_extra,
4254 "error": str(exc),
4255 },
4256 )
4257 )
4258 break
4259 except TimeoutError:
4260 pass
4261
4262 def create_metagraph_refresh_task(self, period=None):
4263 return create_metagraph_refresh_task(period=period)
4264
4265 async def heartbeat(self):
4266 pass
4267 # while True:
4268 # if self.ws is not None:
4269 # try:
4270 # await self.send_model(Heartbeat())
4271 # except Exception as exc:
4272 # msg = f"Error occurred while sending heartbeat: {exc}"
4273 # logger.warning(msg)
4274 # await asyncio.sleep(self.HEARTBEAT_PERIOD)
4275
4276 @tenacity.retry(
4277 stop=tenacity.stop_after_attempt(7),
4278 wait=tenacity.wait_exponential(multiplier=1, exp_base=2, min=1, max=10),
4279 retry=tenacity.retry_if_exception_type(websockets.ConnectionClosed),
4280 )
4281 async def send_model(self, msg: BaseModel):
4282 if self.ws is None:
4283 raise websockets.ConnectionClosed(rcvd=None, sent=None)
4284 await self.ws.send(msg.model_dump_json())
4285 # Summary: https://github.com/python-websockets/websockets/issues/867
4286 # Longer discussion: https://github.com/python-websockets/websockets/issues/865
4287 await asyncio.sleep(0)
4288
4289 async def poll_rented_machines(self):
4290 while True:
4291 if self.ws is not None:
4292 logger.info(
4293 _m(
4294 "Request rented machines",
4295 extra=self.logging_extra,
4296 )
4297 )
4298 await self.send_model(RentedMachineRequest())
4299
4300 logger.info(
4301 _m(
4302 "Request duplicated machines",
4303 extra=self.logging_extra,
4304 )
4305 )
4306 await self.send_model(DuplicateExecutorsRequest())
4307
4308 await asyncio.sleep(10 * 60)
4309 else:
4310 await asyncio.sleep(10)
4311
4312 async def handle_message(self, raw_msg: str | bytes):
4313 """handle message received from facilitator"""
4314 try:
4315 response = Response.model_validate_json(raw_msg)
4316 except pydantic.ValidationError:
4317 pass
4318 else:
4319 if response.status != "success":
4320 logger.error(
4321 _m(
4322 "received error response from facilitator",
4323 extra={**self.logging_extra, "response": str(response)},
4324 )
4325 )
4326 return
4327
4328 try:
4329 response = pydantic.TypeAdapter(RentedMachineResponse).validate_json(raw_msg)
4330 except pydantic.ValidationError:
4331 pass
4332 else:
4333 logger.info(
4334 _m(
4335 "Rented machines",
4336 extra={**self.logging_extra, "machines": len(response.machines)},
4337 )
4338 )
4339
4340 redis_service = self.miner_service.redis_service
4341 await redis_service.delete(RENTED_MACHINE_SET)
4342
4343 for machine in response.machines:
4344 await redis_service.add_rented_machine(machine)
4345
4346 return
4347
4348 try:
4349 response = pydantic.TypeAdapter(DuplicateExecutorsResponse).validate_json(raw_msg)
4350 except pydantic.ValidationError:
4351 pass
4352 else:
4353 logger.info(
4354 _m(
4355 "Duplicated executors",
4356 extra={**self.logging_extra, "executors": len(response.executors)},
4357 )
4358 )
4359
4360 redis_service = self.miner_service.redis_service
4361 await redis_service.delete(DUPLICATED_MACHINE_SET)
4362
4363 for _, details_list in response.executors.items():
4364 for detail in details_list:
4365 executor_id = detail.get("executor_id")
4366 miner_hotkey = detail.get("miner_hotkey")
4367 await redis_service.sadd(
4368 DUPLICATED_MACHINE_SET, f"{miner_hotkey}:{executor_id}"
4369 )
4370
4371 return
4372
4373 try:
4374 job_request = self.accepted_request_type().parse(raw_msg)
4375 except Exception as ex:
4376 error_msg = f"Invalid message received from celium backend: {str(ex)}"
4377 logger.error(
4378 _m(
4379 error_msg,
4380 extra={**self.logging_extra, "error": str(ex), "raw_msg": raw_msg},
4381 )
4382 )
4383 else:
4384 task = asyncio.create_task(self.miner_driver(job_request))
4385 await self.miner_drivers.put(task)
4386 return
4387 # logger.error("unsupported message received from facilitator: %s", raw_msg)
4388
4389 async def get_miner_axon_info(self, hotkey: str) -> bittensor.AxonInfo:
4390 return await get_miner_axon_info(hotkey)
4391
4392 async def miner_driver(
4393 self,
4394 job_request: ContainerCreateRequest
4395 | ContainerDeleteRequest
4396 | ContainerStopRequest
4397 | ContainerStartRequest,
4398 ):
4399 """drive a miner client from job start to completion, then close miner connection"""
4400 miner_axon_info = await self.get_miner_axon_info(job_request.miner_hotkey)
4401 logging_extra = {
4402 **self.logging_extra,
4403 "miner_hotkey": job_request.miner_hotkey,
4404 "miner_ip": miner_axon_info.ip,
4405 "miner_port": miner_axon_info.port,
4406 "job_request": str(job_request),
4407 "executor_id": str(job_request.executor_id),
4408 }
4409 logger.info(
4410 _m(
4411 "Miner driver to miner",
4412 extra=logging_extra,
4413 )
4414 )
4415
4416 if isinstance(job_request, ContainerCreateRequest):
4417 logger.info(
4418 _m(
4419 "Creating container for executor.",
4420 extra={**logging_extra, "job_request": str(job_request)},
4421 )
4422 )
4423 job_request.miner_address = miner_axon_info.ip
4424 job_request.miner_port = miner_axon_info.port
4425 container_created: (
4426 ContainerCreated | FailedContainerRequest
4427 ) = await self.miner_service.handle_container(job_request)
4428
4429 logger.info(
4430 _m(
4431 "Sending back created container info to compute app",
4432 extra={**logging_extra, "container_created": str(container_created)},
4433 )
4434 )
4435 await self.send_model(container_created)
4436 elif isinstance(job_request, ContainerDeleteRequest):
4437 job_request.miner_address = miner_axon_info.ip
4438 job_request.miner_port = miner_axon_info.port
4439 response: (
4440 ContainerDeleteRequest | FailedContainerRequest
4441 ) = await self.miner_service.handle_container(job_request)
4442
4443 logger.info(
4444 _m(
4445 "Sending back deleted container info to compute app",
4446 extra={**logging_extra, "response": str(response)},
4447 )
4448 )
4449 await self.send_model(response)
4450 elif isinstance(job_request, ContainerStopRequest):
4451 job_request.miner_address = miner_axon_info.ip
4452 job_request.miner_port = miner_axon_info.port
4453 response: (
4454 ContainerStopRequest | FailedContainerRequest
4455 ) = await self.miner_service.handle_container(job_request)
4456
4457 logger.info(
4458 _m(
4459 "Sending back stopped container info to compute app",
4460 extra={**logging_extra, "response": str(response)},
4461 )
4462 )
4463 await self.send_model(response)
4464 elif isinstance(job_request, ContainerStartRequest):
4465 job_request.miner_address = miner_axon_info.ip
4466 job_request.miner_port = miner_axon_info.port
4467 response: (
4468 ContainerStartRequest | FailedContainerRequest
4469 ) = await self.miner_service.handle_container(job_request)
4470
4471 logger.info(
4472 _m(
4473 "Sending back started container info to compute app",
4474 extra={**logging_extra, "response": str(response)},
4475 )
4476 )
4477 await self.send_model(response)
4478
4479
4480
4481---
4482File: /neurons/validators/src/clients/metagraph_client.py
4483---
4484
4485import asyncio
4486import datetime as dt
4487import logging
4488
4489import bittensor
4490from asgiref.sync import sync_to_async
4491
4492from core.config import settings
4493
4494logger = logging.getLogger(__name__)
4495
4496
4497class AsyncMetagraphClient:
4498 def __init__(self, cache_time=dt.timedelta(minutes=5)):
4499 self.cache_time = cache_time
4500 self._metagraph_future = None
4501 self._future_lock = asyncio.Lock()
4502 self._cached_metagraph = None
4503 self._cache_timestamp = None
4504 self.config = settings.get_bittensor_config()
4505
4506 async def get_metagraph(self, ignore_cache=False):
4507 future = None
4508 set_result = False
4509 if self._cached_metagraph is not None:
4510 if not ignore_cache and dt.datetime.now() - self._cache_timestamp < self.cache_time:
4511 return self._cached_metagraph
4512 async with self._future_lock:
4513 if self._metagraph_future is None:
4514 loop = asyncio.get_running_loop()
4515 future = self._metagraph_future = loop.create_future()
4516 set_result = True
4517 else:
4518 future = self._metagraph_future
4519 if set_result:
4520 try:
4521 result = await self._get_metagraph()
4522 except Exception as exc:
4523 future.set_exception(exc)
4524 raise
4525 else:
4526 future.set_result(result)
4527 self._cache_timestamp = dt.datetime.now()
4528 self._cached_metagraph = result
4529 return result
4530 finally:
4531 async with self._future_lock:
4532 self._metagraph_future = None
4533 else:
4534 return await future
4535
4536 def _get_subtensor(self):
4537 return bittensor.subtensor(config=self.config)
4538
4539 @sync_to_async(thread_sensitive=False)
4540 def _get_metagraph(self):
4541 return self._get_subtensor().metagraph(netuid=settings.BITTENSOR_NETUID)
4542
4543 async def periodic_refresh(self, period=None):
4544 if period is None:
4545 period = self.cache_time.total_seconds()
4546 while True:
4547 try:
4548 await self.get_metagraph(ignore_cache=True)
4549 except Exception as exc:
4550 msg = f"Failed to refresh metagraph: {exc}"
4551 logger.warning(msg)
4552
4553 await asyncio.sleep(period)
4554
4555
4556async_metagraph_client = AsyncMetagraphClient()
4557
4558
4559async def get_miner_axon_info(hotkey: str) -> bittensor.AxonInfo:
4560 metagraph = await async_metagraph_client.get_metagraph()
4561 neurons = [n for n in metagraph.neurons if n.hotkey == hotkey]
4562 if not neurons:
4563 raise ValueError(f"Miner with {hotkey=} not present in this subnetwork")
4564 return neurons[0].axon_info
4565
4566
4567def create_metagraph_refresh_task(period=None):
4568 return asyncio.create_task(async_metagraph_client.periodic_refresh(period=period))
4569
4570
4571
4572---
4573File: /neurons/validators/src/clients/miner_client.py
4574---
4575
4576import abc
4577import asyncio
4578import logging
4579import random
4580import time
4581
4582import bittensor
4583import websockets
4584from websockets.asyncio.client import ClientConnection
4585from websockets.protocol import State as WebSocketClientState
4586from datura.errors.protocol import UnsupportedMessageReceived
4587from datura.requests.base import BaseRequest
4588from datura.requests.miner_requests import (
4589 AcceptJobRequest,
4590 AcceptSSHKeyRequest,
4591 BaseMinerRequest,
4592 DeclineJobRequest,
4593 FailedRequest,
4594 GenericError,
4595 SSHKeyRemoved,
4596 UnAuthorizedRequest,
4597)
4598from datura.requests.validator_requests import AuthenticateRequest, AuthenticationPayload
4599
4600from core.utils import _m, get_extra_info
4601
4602logger = logging.getLogger(__name__)
4603
4604
4605class JobState:
4606 def __init__(self):
4607 self.miner_ready_or_declining_future = asyncio.Future()
4608 self.miner_ready_or_declining_timestamp: int = 0
4609 self.miner_accepted_ssh_key_or_failed_future = asyncio.Future()
4610 self.miner_accepted_ssh_key_or_failed_timestamp: int = 0
4611 self.miner_removed_ssh_key_future = asyncio.Future()
4612
4613
4614class MinerClient(abc.ABC):
4615 def __init__(
4616 self,
4617 loop: asyncio.AbstractEventLoop,
4618 miner_address: str,
4619 my_hotkey: str,
4620 miner_hotkey: str,
4621 miner_port: int,
4622 keypair: bittensor.Keypair,
4623 miner_url: str,
4624 ):
4625 self.debounce_counter = 0
4626 self.max_debounce_count: int | None = 5 # set to None for unlimited debounce
4627 self.loop = loop
4628 self.miner_name = f"{miner_hotkey}({miner_address}:{miner_port})"
4629 self.ws: ClientConnection | None = None
4630 self.read_messages_task: asyncio.Task | None = None
4631 self.deferred_send_tasks: list[asyncio.Task] = []
4632
4633 self.miner_hotkey = miner_hotkey
4634 self.my_hotkey = my_hotkey
4635 self.miner_address = miner_address
4636 self.miner_port = miner_port
4637 self.keypair = keypair
4638
4639 self.miner_url = miner_url
4640
4641 self.job_state = JobState()
4642
4643 self.logging_extra = {
4644 "miner_hotkey": miner_hotkey,
4645 "miner_address": miner_address,
4646 "miner_port": miner_port,
4647 }
4648
4649 def accepted_request_type(self) -> type[BaseRequest]:
4650 return BaseMinerRequest
4651
4652 async def handle_message(self, msg: BaseRequest):
4653 """
4654 Handle the message based on its type or raise UnsupportedMessageReceived
4655 """
4656 if isinstance(msg, AcceptJobRequest):
4657 if not self.job_state.miner_ready_or_declining_future.done():
4658 self.job_state.miner_ready_or_declining_timestamp = time.time()
4659 self.job_state.miner_ready_or_declining_future.set_result(msg)
4660 elif isinstance(
4661 msg, AcceptSSHKeyRequest | FailedRequest | UnAuthorizedRequest | DeclineJobRequest
4662 ):
4663 if not self.job_state.miner_accepted_ssh_key_or_failed_future.done():
4664 self.job_state.miner_accepted_ssh_key_or_failed_timestamp = time.time()
4665 self.job_state.miner_accepted_ssh_key_or_failed_future.set_result(msg)
4666 elif isinstance(msg, SSHKeyRemoved):
4667 if not self.job_state.miner_removed_ssh_key_future.done():
4668 self.job_state.miner_removed_ssh_key_future.set_result(msg)
4669
4670 async def __aenter__(self):
4671 await self.await_connect()
4672
4673 async def __aexit__(self, exc_type, exc_val, exc_tb):
4674 for t in self.deferred_send_tasks:
4675 t.cancel()
4676
4677 if self.read_messages_task is not None and not self.read_messages_task.done():
4678 self.read_messages_task.cancel()
4679
4680 if self.ws is not None and self.ws.state is WebSocketClientState.OPEN:
4681 try:
4682 await self.ws.close()
4683 except Exception:
4684 pass
4685
4686 def generate_authentication_message(self) -> AuthenticateRequest:
4687 """Generate authentication request/message for miner."""
4688 payload = AuthenticationPayload(
4689 validator_hotkey=self.my_hotkey,
4690 miner_hotkey=self.miner_hotkey,
4691 timestamp=int(time.time()),
4692 )
4693 return AuthenticateRequest(
4694 payload=payload, signature=f"0x{self.keypair.sign(payload.blob_for_signing()).hex()}"
4695 )
4696
4697 async def _connect(self):
4698 ws = await websockets.connect(self.miner_url, max_size=50 * (2**20)) # 50MB
4699 await ws.send(self.generate_authentication_message().json())
4700 return ws
4701
4702 async def await_connect(self):
4703 start_time = time.time()
4704 while True:
4705 try:
4706 if (
4707 self.max_debounce_count is not None
4708 and self.debounce_counter > self.max_debounce_count
4709 ):
4710 time_took = time.time() - start_time
4711 raise Exception(
4712 f"Could not connect to miner {self.miner_name} after {self.max_debounce_count} tries"
4713 f" in {time_took:0.2f} seconds"
4714 )
4715 if self.debounce_counter:
4716 sleep_time = self.sleep_time()
4717 logger.info(
4718 _m(
4719                        f"Retrying connection to miner in {sleep_time:0.2f} seconds",
4720 extra=get_extra_info(self.logging_extra)
4721 )
4722 )
4723 await asyncio.sleep(sleep_time)
4724 self.ws = await self._connect()
4725 self.read_messages_task = self.loop.create_task(self.read_messages())
4726
4727 if self.debounce_counter:
4728 logger.info(
4729 _m(
4730 f"Connected to miner after {self.debounce_counter + 1} attempts",
4731 extra=get_extra_info(self.logging_extra),
4732 )
4733 )
4734 return
4735 except (websockets.WebSocketException, OSError) as ex:
4736 self.debounce_counter += 1
4737 logger.error(
4738 _m(
4739 f"Could not connect to miner: {str(ex)}",
4740 extra=get_extra_info(
4741 {**self.logging_extra, "debounce_counter": self.debounce_counter}
4742 ),
4743 )
4744 )
4745
4746 def sleep_time(self):
4747 return (2**self.debounce_counter) + random.random()
4748
4749 async def ensure_connected(self):
4750 if self.ws is None or self.ws.state is not WebSocketClientState.OPEN:
4751 if self.read_messages_task is not None and not self.read_messages_task.done():
4752 self.read_messages_task.cancel()
4753 await self.await_connect()
4754
4755 async def send_model(self, model: BaseRequest):
4756 while True:
4757 await self.ensure_connected()
4758 try:
4759 await self.ws.send(model.json())
4760 except websockets.WebSocketException:
4761 logger.error(
4762 _m(
4763 "Could not send to miner. Retrying 1+ seconds later...",
4764 extra=get_extra_info({**self.logging_extra, "model": str(model)}),
4765 )
4766 )
4767 await asyncio.sleep(1 + random.random())
4768 continue
4769 return
4770
4771 def deferred_send_model(self, model: BaseRequest):
4772 task = self.loop.create_task(self.send_model(model))
4773 self.deferred_send_tasks.append(task)
4774
4775 async def read_messages(self):
4776 while True:
4777 try:
4778 msg = await self.ws.recv()
4779 except websockets.WebSocketException as ex:
4780 self.debounce_counter += 1
4781 logger.error(
4782 _m(
4783 "Connection to miner lost",
4784 extra=get_extra_info(
4785 {
4786 **self.logging_extra,
4787 "debounce_counter": self.debounce_counter,
4788 "error": str(ex),
4789 }
4790 ),
4791 )
4792 )
4793 self.loop.create_task(self.await_connect())
4794 return
4795
4796 try:
4797 msg = self.accepted_request_type().parse(msg)
4798 except Exception as ex:
4799 error_msg = f"Malformed message from miner: {str(ex)}"
4800 logger.error(
4801 _m(
4802 error_msg,
4803 extra=get_extra_info({**self.logging_extra, "error": str(ex)}),
4804 )
4805 )
4806 continue
4807
4808            if isinstance(msg, GenericError):
4809                logger.error(
4810                    _m(
4811                        f"Received error message from miner: {msg.json()}",
4812                        extra=get_extra_info(self.logging_extra),
4813                    )
4814                )
4815                try:
4816                    await self.ws.close()
4817                except Exception:
4818                    pass
4819                continue
4820
4821 try:
4822 await self.handle_message(msg)
4823 except UnsupportedMessageReceived:
4824 error_msg = "Unsupported message from miner"
4825 logger.error(_m(error_msg, extra=get_extra_info(self.logging_extra)))
4826 else:
4827 if self.debounce_counter:
4828 logger.info(
4829 _m(
4830 f"Received valid message from miner after {self.debounce_counter + 1} connection attempts",
4831 extra=get_extra_info(self.logging_extra),
4832 )
4833
4834 )
4835 self.debounce_counter = 0
4836
4837
4838
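The reconnect loop above backs off exponentially with jitter: sleep_time() returns 2**debounce_counter plus a random fraction of a second, and await_connect() sleeps for that long before each retry until max_debounce_count is exceeded. A minimal standalone sketch of that schedule (backoff_delay is a hypothetical helper mirroring sleep_time()):

import random

def backoff_delay(attempt: int) -> float:
    # Mirrors sleep_time(): exponential growth in the retry counter plus 0-1s of jitter.
    return (2 ** attempt) + random.random()

# Roughly 2s, 4s, 8s, 16s, 32s (plus jitter) for the first five retries.
print([round(backoff_delay(n), 2) for n in range(1, 6)])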
4839---
4840File: /neurons/validators/src/core/__init__.py
4841---
4842
4843
4844
4845
4846---
4847File: /neurons/validators/src/core/config.py
4848---
4849
4850import argparse
4851import pathlib
4852from typing import TYPE_CHECKING
4853
4854import bittensor
4855from pydantic import Field
4856from pydantic_settings import BaseSettings, SettingsConfigDict
4857
4858if TYPE_CHECKING:
4859 from bittensor_wallet import Wallet
4860
4861
4862class Settings(BaseSettings):
4863 model_config = SettingsConfigDict(env_file=".env", extra="ignore")
4864 PROJECT_NAME: str = "compute-subnet-validator"
4865
4866 BITTENSOR_WALLET_DIRECTORY: pathlib.Path = Field(
4867 env="BITTENSOR_WALLET_DIRECTORY",
4868 default=pathlib.Path("~").expanduser() / ".bittensor" / "wallets",
4869 )
4870 BITTENSOR_WALLET_NAME: str = Field(env="BITTENSOR_WALLET_NAME")
4871 BITTENSOR_WALLET_HOTKEY_NAME: str = Field(env="BITTENSOR_WALLET_HOTKEY_NAME")
4872 BITTENSOR_NETUID: int = Field(env="BITTENSOR_NETUID")
4873 BITTENSOR_CHAIN_ENDPOINT: str | None = Field(env="BITTENSOR_CHAIN_ENDPOINT", default=None)
4874 BITTENSOR_NETWORK: str = Field(env="BITTENSOR_NETWORK")
4875
4876 SQLALCHEMY_DATABASE_URI: str = Field(env="SQLALCHEMY_DATABASE_URI")
4877 ASYNC_SQLALCHEMY_DATABASE_URI: str = Field(env="ASYNC_SQLALCHEMY_DATABASE_URI")
4878 DEBUG: bool = Field(env="DEBUG", default=False)
4879 DEBUG_MINER_HOTKEY: str = Field(env="DEBUG_MINER_HOTKEY", default="")
4880 DEBUG_MINER_ADDRESS: str | None = Field(env="DEBUG_MINER_ADDRESS", default=None)
4881 DEBUG_MINER_PORT: int | None = Field(env="DEBUG_MINER_PORT", default=None)
4882
4883 INTERNAL_PORT: int = Field(env="INTERNAL_PORT", default=8000)
4884 BLOCKS_FOR_JOB: int = 50
4885
4886 REDIS_HOST: str = Field(env="REDIS_HOST", default="localhost")
4887 REDIS_PORT: int = Field(env="REDIS_PORT", default=6379)
4888 COMPUTE_APP_URI: str = "wss://celiumcompute.ai"
4889 COMPUTE_REST_API_URL: str | None = Field(
4890 env="COMPUTE_REST_API_URL", default="https://celiumcompute.ai/api"
4891 )
4892
4893 ENV: str = Field(env="ENV", default="dev")
4894
4895 # Read version from version.txt
4896 VERSION: str = (pathlib.Path(__file__).parent / ".." / ".." / "version.txt").read_text().strip()
4897
4898 def get_bittensor_wallet(self) -> "Wallet":
4899 if not self.BITTENSOR_WALLET_NAME or not self.BITTENSOR_WALLET_HOTKEY_NAME:
4900 raise RuntimeError("Wallet not configured")
4901 wallet = bittensor.wallet(
4902 name=self.BITTENSOR_WALLET_NAME,
4903 hotkey=self.BITTENSOR_WALLET_HOTKEY_NAME,
4904 path=str(self.BITTENSOR_WALLET_DIRECTORY),
4905 )
4906 wallet.hotkey_file.get_keypair() # this raises errors if the keys are inaccessible
4907 return wallet
4908
4909 def get_bittensor_config(self) -> bittensor.config:
4910 parser = argparse.ArgumentParser()
4911 # bittensor.wallet.add_args(parser)
4912 # bittensor.subtensor.add_args(parser)
4913 # bittensor.axon.add_args(parser)
4914
4915 if self.BITTENSOR_NETWORK:
4916 if "--subtensor.network" in parser._option_string_actions:
4917 parser._handle_conflict_resolve(
4918 None,
4919 [("--subtensor.network", parser._option_string_actions["--subtensor.network"])],
4920 )
4921
4922 parser.add_argument(
4923 "--subtensor.network",
4924 type=str,
4925 help="network",
4926 default=self.BITTENSOR_NETWORK,
4927 )
4928
4929 if self.BITTENSOR_CHAIN_ENDPOINT:
4930 if "--subtensor.chain_endpoint" in parser._option_string_actions:
4931 parser._handle_conflict_resolve(
4932 None,
4933 [
4934 (
4935 "--subtensor.chain_endpoint",
4936 parser._option_string_actions["--subtensor.chain_endpoint"],
4937 )
4938 ],
4939 )
4940
4941 parser.add_argument(
4942 "--subtensor.chain_endpoint",
4943 type=str,
4944 help="chain endpoint",
4945 default=self.BITTENSOR_CHAIN_ENDPOINT,
4946 )
4947
4948 return bittensor.config(parser)
4949
4950 def get_debug_miner(self) -> dict:
4951 if not self.DEBUG_MINER_ADDRESS or not self.DEBUG_MINER_PORT:
4952 raise RuntimeError("Debug miner not configured")
4953
4954 miner = type("Miner", (object,), {})()
4955 miner.hotkey = self.DEBUG_MINER_HOTKEY
4956 miner.axon_info = type("AxonInfo", (object,), {})()
4957 miner.axon_info.ip = self.DEBUG_MINER_ADDRESS
4958 miner.axon_info.port = self.DEBUG_MINER_PORT
4959 return miner
4960
4961
4962settings = Settings()
4963
4964
4965
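Settings is a pydantic-settings model, so every field above can be supplied through environment variables or a .env file, and settings = Settings() at module load fails fast if a required variable is missing. A minimal sketch of providing the required variables programmatically (all values are placeholders, and it assumes the repository layout, including version.txt, is intact):

import os

# Placeholder values for illustration only.
os.environ.update({
    "BITTENSOR_WALLET_NAME": "validator",
    "BITTENSOR_WALLET_HOTKEY_NAME": "default",
    "BITTENSOR_NETUID": "12",
    "BITTENSOR_NETWORK": "finney",
    "SQLALCHEMY_DATABASE_URI": "postgresql://user:pass@localhost/validator",
    "ASYNC_SQLALCHEMY_DATABASE_URI": "postgresql+asyncpg://user:pass@localhost/validator",
})

from core.config import Settings

settings = Settings()
print(settings.BITTENSOR_NETUID, settings.REDIS_HOST)  # 12 localhost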
4966---
4967File: /neurons/validators/src/core/db.py
4968---
4969
4970from collections.abc import AsyncGenerator
4971from typing import Annotated
4972
4973from fastapi import Depends
4974from sqlalchemy.ext.asyncio import create_async_engine
4975from sqlalchemy.orm import sessionmaker
4976from sqlmodel.ext.asyncio.session import AsyncSession
4977
4978from core.config import settings
4979
4980engine = create_async_engine(str(settings.ASYNC_SQLALCHEMY_DATABASE_URI), echo=True, future=True)
4981
4982
4983async def get_db() -> AsyncGenerator[AsyncSession, None]:
4984 async_session = sessionmaker(bind=engine, class_=AsyncSession, expire_on_commit=False)
4985 async with async_session() as session:
4986 yield session
4987
4988
4989SessionDep = Annotated[AsyncSession, Depends(get_db)]
4990
4991
4992
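SessionDep wires get_db() into FastAPI's dependency injection, so each request handler receives its own AsyncSession. A minimal sketch of a route using it (the route path and query are illustrative, not part of the validator API):

from fastapi import FastAPI
from sqlalchemy import text

from core.db import SessionDep

app = FastAPI()

@app.get("/health/db")
async def db_health(session: SessionDep):
    # One AsyncSession per request, yielded by get_db() and closed afterwards.
    result = await session.execute(text("SELECT 1"))
    return {"db": result.scalar_one() == 1}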
4993---
4994File: /neurons/validators/src/core/utils.py
4995---
4996
4997import asyncio
4998import contextvars
4999import json
5000import logging
5001from logging.config import dictConfig # noqa
5002
5003from core.config import settings
5004
5005logger = logging.getLogger(__name__)
5006
5007# Create a ContextVar to hold the context information
5008context = contextvars.ContextVar("context", default="TaskService")
5009context.set("TaskService")
5010
5011
5012def wait_for_services_sync(timeout=30):
5013 """Wait until Redis and PostgreSQL connections are working."""
5014 import time
5015
5016 import psycopg2
5017 from redis import Redis
5018 from redis.exceptions import ConnectionError as RedisConnectionError
5019
5020 from core.config import settings
5021
5022 # Initialize Redis client
5023 redis_client = Redis(host=settings.REDIS_HOST, port=settings.REDIS_PORT)
5024
5025 start_time = time.time()
5026
5027 logger.info("Waiting for services to be available...")
5028
5029 while True:
5030 try:
5031 # Check Redis connection
5032 redis_client.ping()
5033 logger.info("Connected to Redis.")
5034
5035 # Check PostgreSQL connection using SQLAlchemy
5036 from sqlalchemy import create_engine, text
5037 from sqlalchemy.exc import SQLAlchemyError
5038
5039 engine = create_engine(settings.SQLALCHEMY_DATABASE_URI)
5040 try:
5041 with engine.connect() as connection:
5042 connection.execute(text("SELECT 1"))
5043 logger.info("Connected to PostgreSQL.")
5044 except SQLAlchemyError as e:
5045 logger.error("Failed to connect to PostgreSQL.")
5046 raise e
5047
5048 break # Exit loop if both connections are successful
5049 except (psycopg2.OperationalError, RedisConnectionError) as e:
5050 if time.time() - start_time > timeout:
5051 logger.error("Timeout while waiting for services to be available.")
5052 raise e
5053 logger.warning("Waiting for services to be available...")
5054 time.sleep(1)
5055
5056
5057def get_extra_info(extra: dict) -> dict:
5058 try:
5059 task = asyncio.current_task()
5060 coro_name = task.get_coro().__name__ if task else "NoTask"
5061 task_id = id(task) if task else "NoTaskID"
5062 except Exception:
5063 coro_name = "NoTask"
5064 task_id = "NoTaskID"
5065 extra_info = {
5066 "coro_name": coro_name,
5067 "task_id": task_id,
5068 **extra,
5069 }
5070 return extra_info
5071
5072
5073def configure_logs_of_other_modules():
5074 validator_hotkey = settings.get_bittensor_wallet().get_hotkey().ss58_address
5075
5076 logging.basicConfig(
5077 level=logging.INFO,
5078 format=f"Validator: {validator_hotkey} | Name: %(name)s | Time: %(asctime)s | Level: %(levelname)s | File: %(filename)s | Function: %(funcName)s | Line: %(lineno)s | Process: %(process)d | Message: %(message)s",
5079 )
5080
5081 sqlalchemy_logger = logging.getLogger("sqlalchemy")
5082 sqlalchemy_logger.setLevel(logging.WARNING)
5083
5084 class ContextFilter(logging.Filter):
5085 """
5086 This is a filter which injects contextual information into the log.
5087 """
5088
5089 def filter(self, record):
5090 record.context = context.get() or "Default"
5091 return True
5092
5093 # Create a custom formatter that adds the context to the log messages
5094 class CustomFormatter(logging.Formatter):
5095 def format(self, record):
5096 try:
5097 task = asyncio.current_task()
5098 coro_name = task.get_coro().__name__ if task else "NoTask"
5099 task_id = id(task) if task else "NoTaskID"
5100 return f"{getattr(record, 'context', 'Default')} | {coro_name} | {task_id} | {super().format(record)}"
5101 except Exception:
5102 return ""
5103
5104 asyncssh_logger = logging.getLogger("asyncssh")
5105 asyncssh_logger.setLevel(logging.WARNING)
5106
5107 # Add the filter to the logger
5108 asyncssh_logger.addFilter(ContextFilter())
5109
5110 # Create a handler for the logger
5111 handler = logging.StreamHandler()
5112
5113 # Add the handler to the logger
5114 asyncssh_logger.handlers = []
5115 asyncssh_logger.addHandler(handler)
5116
5117 # Set the formatter for the handler
5118 handler.setFormatter(
5119 CustomFormatter("%(name)s %(asctime)s %(levelname)s %(filename)s %(process)d %(message)s")
5120 )
5121
5122
5123def get_logger(name: str):
5124 LOGGING = {
5125 "version": 1,
5126 "disable_existing_loggers": False,
5127 "formatters": {
5128 "verbose": {
5129 "format": "%(levelname)-8s %(asctime)s --- "
5130 "%(lineno)-8s [%(name)s] %(funcName)-24s : %(message)s",
5131 }
5132 },
5133 "handlers": {
5134 "console": {
5135 "class": "logging.StreamHandler",
5136 "formatter": "verbose",
5137 },
5138 },
5139 "root": {
5140 "level": "INFO",
5141 "handlers": ["console"],
5142 },
5143 "loggers": {
5144 "connector": {
5145 "level": "INFO",
5146 "handlers": ["console"],
5147 "propagate": False,
5148 },
5149 "asyncssh": {
5150 "level": "WARNING",
5151 "propagate": True,
5152 },
5153 },
5154 }
5155
5156 dictConfig(LOGGING)
5157 logger = logging.getLogger(name)
5158 return logger
5159
5160
5161class StructuredMessage:
5162 def __init__(self, message, extra: dict):
5163 self.message = message
5164 self.extra = extra
5165
5166 def __str__(self):
5167 return "%s >>> %s" % (self.message, json.dumps(self.extra)) # noqa
5168
5169
5170_m = StructuredMessage
5171
5172
5173
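_m (StructuredMessage) renders a human-readable message followed by a JSON blob of context, and get_extra_info folds the current asyncio task's coroutine name and id into that blob. A minimal sketch of the resulting log line (the extra fields are made up):

import logging

from core.utils import _m, get_extra_info

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("example")

# Outside a running event loop, coro_name/task_id fall back to "NoTask"/"NoTaskID".
logger.info(_m("job finished", extra=get_extra_info({"miner_hotkey": "5F...", "score": 1.5})))
# INFO:example:job finished >>> {"coro_name": "NoTask", "task_id": "NoTaskID", "miner_hotkey": "5F...", "score": 1.5}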
5174---
5175File: /neurons/validators/src/core/validator.py
5176---
5177
5178import asyncio
5179import json
5180from datetime import datetime
5181from typing import TYPE_CHECKING
5182
5183import bittensor
5184import numpy as np
5185from bittensor.utils.weight_utils import (
5186 convert_weights_and_uids_for_emit,
5187 process_weights_for_netuid,
5188)
5189from payload_models.payloads import MinerJobRequestPayload
5190from websockets.protocol import State as WebSocketClientState
5191
5192from core.config import settings
5193from core.utils import _m, get_extra_info, get_logger
5194from services.docker_service import REPOSITORIES, DockerService
5195from services.file_encrypt_service import FileEncryptService
5196from services.miner_service import MinerService
5197from services.redis_service import EXECUTOR_COUNT_PREFIX, RedisService
5198from services.ssh_service import SSHService
5199from services.task_service import TaskService
5200
5201if TYPE_CHECKING:
5202 from bittensor_wallet import Wallet
5203
5204logger = get_logger(__name__)
5205
5206SYNC_CYCLE = 12
5207WEIGHT_MAX_COUNTER = 6
5208MINER_SCORES_KEY = "miner_scores"
5209
5210
5211class Validator:
5212 wallet: "Wallet"
5213 netuid: int
5214 subtensor: bittensor.Subtensor
5215
5216 def __init__(self, debug_miner=None):
5217 self.config = settings.get_bittensor_config()
5218
5219 self.wallet = settings.get_bittensor_wallet()
5220 self.netuid = settings.BITTENSOR_NETUID
5221
5222 self.should_exit = False
5223 self.is_running = False
5224 self.last_job_run_blocks = 0
5225 self.default_extra = {}
5226
5227 self.subtensor = None
5228 self.set_subtensor()
5229
5230 loop = asyncio.get_event_loop()
5231 loop.run_until_complete(self.initiate_services())
5232
5233 self.debug_miner = debug_miner
5234
5235 async def initiate_services(self):
5236 ssh_service = SSHService()
5237 self.redis_service = RedisService()
5238 task_service = TaskService(
5239 ssh_service=ssh_service,
5240 redis_service=self.redis_service,
5241 )
5242 self.docker_service = DockerService(
5243 ssh_service=ssh_service,
5244 redis_service=self.redis_service,
5245 )
5246 self.miner_service = MinerService(
5247 ssh_service=ssh_service,
5248 task_service=task_service,
5249 redis_service=self.redis_service,
5250 )
5251 self.file_encrypt_service = FileEncryptService(ssh_service=ssh_service)
5252
5253 # init miner_scores
5254 try:
5255 if await self.should_set_weights():
5256 self.miner_scores = {}
5257
5258 # clear executor_counts
5259 try:
5260 await self.redis_service.clear_all_executor_counts()
5261 logger.info(
5262 _m(
5263 "[initiate_services] Cleared executor_counts",
5264 extra=get_extra_info(self.default_extra),
5265 ),
5266 )
5267 except Exception as e:
5268 logger.error(
5269 _m(
5270 "[initiate_services] Failed to clear executor_counts",
5271 extra=get_extra_info({**self.default_extra, "error": str(e)}),
5272 ),
5273 )
5274 else:
5275 miner_scores_json = await self.redis_service.get(MINER_SCORES_KEY)
5276 if miner_scores_json is None:
5277 logger.info(
5278 _m(
5279 "[initiate_services] No data found in Redis for MINER_SCORES_KEY, initializing empty miner_scores.",
5280 extra=get_extra_info(self.default_extra),
5281 ),
5282 )
5283 self.miner_scores = {}
5284 else:
5285 self.miner_scores = json.loads(miner_scores_json)
5286
5287 # await self.redis_service.clear_all_ssh_ports()
5288 except Exception as e:
5289 logger.error(
5290 _m(
5291 "[initiate_services] Failed to initialize miner_scores",
5292 extra=get_extra_info({**self.default_extra, "error": str(e)}),
5293 ),
5294 )
5295 self.miner_scores = {}
5296
5297 logger.info(
5298 _m(
5299 "[initiate_services] miner scores",
5300 extra=get_extra_info(
5301 {
5302 **self.default_extra,
5303 **self.miner_scores,
5304 }
5305 ),
5306 ),
5307 )
5308
5309 def set_subtensor(self):
5310 try:
5311 if (
5312 self.subtensor
5313 and self.subtensor.substrate
5314 and self.subtensor.substrate.websocket
5315 and self.subtensor.substrate.websocket.state is WebSocketClientState.OPEN
5316 ):
5317 return
5318
5319 logger.info(
5320 _m(
5321 "Getting subtensor",
5322 extra=get_extra_info(self.default_extra),
5323 ),
5324 )
5325 subtensor = bittensor.subtensor(config=self.config)
5326
5327 # check registered
5328 self.check_registered(subtensor)
5329
5330 self.subtensor = subtensor
5331 except Exception as e:
5332 logger.info(
5333 _m(
5334 "[Error] Getting subtensor",
5335 extra=get_extra_info(
5336 {
5337 **self.default_extra,
5338 "error": str(e),
5339 }
5340 ),
5341 ),
5342 )
5343
5344 def check_registered(self, subtensor: bittensor.subtensor):
5345 try:
5346 if not subtensor.is_hotkey_registered(
5347 netuid=self.netuid,
5348 hotkey_ss58=self.wallet.get_hotkey().ss58_address,
5349 ):
5350 logger.error(
5351 _m(
5352 f"[check_registered] Wallet: {self.wallet} is not registered on netuid {self.netuid}.",
5353 extra=get_extra_info(self.default_extra),
5354 ),
5355 )
5356 exit()
5357 logger.info(
5358 _m(
5359 "[check_registered] Validator is registered",
5360 extra=get_extra_info(self.default_extra),
5361 ),
5362 )
5363 except Exception as e:
5364 logger.error(
5365 _m(
5366 "[check_registered] Checking validator registered failed",
5367 extra=get_extra_info({**self.default_extra, "error": str(e)}),
5368 ),
5369 )
5370
5371 def get_metagraph(self):
5372 return self.subtensor.metagraph(netuid=self.netuid)
5373
5374 def get_node(self):
5375 # return SubstrateInterface(url=self.config.subtensor.chain_endpoint)
5376 return self.subtensor.substrate
5377
5378 def get_current_block(self):
5379 node = self.get_node()
5380 return node.query("System", "Number", []).value
5381
5382 def get_weights_rate_limit(self):
5383 node = self.get_node()
5384 return node.query("SubtensorModule", "WeightsSetRateLimit", [self.netuid]).value
5385
5386 def get_my_uid(self):
5387 metagraph = self.get_metagraph()
5388 return metagraph.hotkeys.index(self.wallet.hotkey.ss58_address)
5389
5390 def get_tempo(self):
5391 return self.subtensor.tempo(self.netuid)
5392
5393 def fetch_miners(self):
5394 logger.info(
5395 _m(
5396 "[fetch_miners] Fetching miners",
5397 extra=get_extra_info(self.default_extra),
5398 ),
5399 )
5400
5401 if self.debug_miner:
5402 miners = [self.debug_miner]
5403 else:
5404 metagraph = self.get_metagraph()
5405 miners = [
5406 neuron
5407 for neuron in metagraph.neurons
5408 if neuron.axon_info.is_serving
5409 and (
5410 not settings.DEBUG
5411 or not settings.DEBUG_MINER_HOTKEY
5412 or settings.DEBUG_MINER_HOTKEY == neuron.axon_info.hotkey
5413 )
5414 ]
5415 logger.info(
5416 _m(
5417 f"[fetch_miners] Found {len(miners)} miners",
5418 extra=get_extra_info(self.default_extra),
5419 ),
5420 )
5421 return miners
5422
5423 async def set_weights(self, miners):
5424 logger.info(
5425 _m(
5426 "[set_weights] scores",
5427 extra=get_extra_info(
5428 {
5429 **self.default_extra,
5430 **self.miner_scores,
5431 }
5432 ),
5433 ),
5434 )
5435
5436 if not self.miner_scores:
5437 logger.info(
5438 _m(
5439 "[set_weights] No miner scores available, skipping set_weights.",
5440 extra=get_extra_info(self.default_extra),
5441 ),
5442 )
5443 return
5444
5445 uids = np.zeros(len(miners), dtype=np.int64)
5446 weights = np.zeros(len(miners), dtype=np.float32)
5447 for ind, miner in enumerate(miners):
5448 uids[ind] = miner.uid
5449 weights[ind] = self.miner_scores.get(miner.hotkey, 0.0)
5450
5451 logger.info(
5452 _m(
5453 f"[set_weights] uids: {uids} weights: {weights}",
5454 extra=get_extra_info(self.default_extra),
5455 ),
5456 )
5457
5458 metagraph = self.get_metagraph()
5459 processed_uids, processed_weights = process_weights_for_netuid(
5460 uids=uids,
5461 weights=weights,
5462 netuid=self.netuid,
5463 subtensor=self.subtensor,
5464 metagraph=metagraph,
5465 )
5466
5467 logger.info(
5468 _m(
5469 f"[set_weights] processed_uids: {processed_uids} processed_weights: {processed_weights}",
5470 extra=get_extra_info(self.default_extra),
5471 ),
5472 )
5473
5474 uint_uids, uint_weights = convert_weights_and_uids_for_emit(
5475 uids=processed_uids, weights=processed_weights
5476 )
5477
5478 logger.info(
5479 _m(
5480 f"[set_weights] uint_uids: {uint_uids} uint_weights: {uint_weights}",
5481 extra=get_extra_info(self.default_extra),
5482 ),
5483 )
5484
5485 result, msg = self.subtensor.set_weights(
5486 wallet=self.wallet,
5487 netuid=self.netuid,
5488 uids=uint_uids,
5489 weights=uint_weights,
5490 wait_for_finalization=False,
5491 wait_for_inclusion=False,
5492 )
5493 if result is True:
5494 logger.info(
5495 _m(
5496 "[set_weights] set weights successfully",
5497 extra=get_extra_info(self.default_extra),
5498 ),
5499 )
5500 else:
5501 logger.error(
5502 _m(
5503 "[set_weights] set weights failed",
5504 extra=get_extra_info(
5505 {
5506 **self.default_extra,
5507 "msg": msg,
5508 }
5509 ),
5510 ),
5511 )
5512
5513 self.miner_scores = {}
5514
5515 # clear executor_counts
5516 try:
5517 await self.redis_service.clear_all_executor_counts()
5518 logger.info(
5519 _m(
5520 "[set_weights] Cleared executor_counts",
5521 extra=get_extra_info(self.default_extra),
5522 ),
5523 )
5524 except Exception as e:
5525 logger.error(
5526 _m(
5527 "[set_weights] Failed to clear executor_counts",
5528 extra=get_extra_info(
5529 {
5530 **self.default_extra,
5531 "error": str(e),
5532 }
5533 ),
5534 ),
5535 )
5536
5537 def get_last_update(self, block):
5538 try:
5539 node = self.get_node()
5540 last_update_blocks = (
5541 block
5542 - node.query("SubtensorModule", "LastUpdate", [self.netuid]).value[
5543 self.get_my_uid()
5544 ]
5545 )
5546 except Exception as e:
5547 logger.error(
5548 _m(
5549 "[get_last_update] Error getting last update",
5550 extra=get_extra_info(
5551 {
5552 **self.default_extra,
5553 "error": str(e),
5554 }
5555 ),
5556 ),
5557 )
5558 # means that the validator is not registered yet. The validator should break if this is the case anyway
5559 last_update_blocks = 1000
5560
5561 logger.info(
5562 _m(
5563 f"[get_last_update] Weights were last set {last_update_blocks} blocks ago",
5564 extra=get_extra_info(self.default_extra),
5565 ),
5566 )
5567 return last_update_blocks
5568
5569 async def should_set_weights(self) -> bool:
5570 """Check if current block is for setting weights."""
5571 try:
5572 current_block = self.get_current_block()
5573 last_update = self.get_last_update(current_block)
5574 tempo = self.get_tempo()
5575 weights_rate_limit = self.get_weights_rate_limit()
5576
5577 blocks_till_epoch = tempo - (current_block + self.netuid + 1) % (tempo + 1)
5578
5579 should_set_weights = last_update >= tempo
5580
5581 logger.info(
5582 _m(
5583 "[should_set_weights] Checking should set weights",
5584 extra=get_extra_info(
5585 {
5586 **self.default_extra,
5587 "weights_rate_limit": weights_rate_limit,
5588 "tempo": tempo,
5589 "current_block": current_block,
5590 "last_update": last_update,
5591 "blocks_till_epoch": blocks_till_epoch,
5592 "should_set_weights": should_set_weights,
5593 }
5594 ),
5595 ),
5596 )
5597 return should_set_weights
5598 except Exception as e:
5599 logger.error(
5600 _m(
5601 "[should_set_weights] Checking set weights failed",
5602 extra=get_extra_info(
5603 {
5604 **self.default_extra,
5605 "error": str(e),
5606 }
5607 ),
5608 ),
5609 )
5610 return False
5611
5612 async def get_time_from_block(self, block: int):
5613 max_retries = 3
5614 retries = 0
5615 while retries < max_retries:
5616 try:
5617 node = self.get_node()
5618 block_hash = node.get_block_hash(block)
5619 return datetime.fromtimestamp(
5620 node.query("Timestamp", "Now", block_hash=block_hash).value / 1000
5621 ).strftime("%Y-%m-%d %H:%M:%S")
5622 except Exception as e:
5623 logger.error(
5624 _m(
5625 "[get_time_from_block] Error getting time from block",
5626 extra=get_extra_info(
5627 {
5628 **self.default_extra,
5629 "retries": retries,
5630 "error": str(e),
5631 }
5632 ),
5633 ),
5634 )
5635 retries += 1
5636 return "Unknown"
5637
5638 async def sync(self):
5639 try:
5640 self.set_subtensor()
5641
5642 logger.info(
5643 _m(
5644 "[sync] Syncing with subtensor",
5645 extra=get_extra_info(self.default_extra),
5646 ),
5647 )
5648
5649 # fetch miners
5650 miners = self.fetch_miners()
5651
5652 if await self.should_set_weights():
5653 await self.set_weights(miners=miners)
5654
5655 current_block = self.get_current_block()
5656 logger.info(
5657 _m(
5658 "[sync] Current block",
5659 extra=get_extra_info(
5660 {
5661 **self.default_extra,
5662 "current_block": current_block,
5663 }
5664 ),
5665 ),
5666 )
5667
5668 if current_block - self.last_job_run_blocks >= settings.BLOCKS_FOR_JOB:
5669 job_block = (current_block // settings.BLOCKS_FOR_JOB) * settings.BLOCKS_FOR_JOB
5670 job_batch_id = await self.get_time_from_block(job_block)
5671
5672 logger.info(
5673 _m(
5674 "[sync] Send jobs to miners",
5675 extra=get_extra_info(
5676 {
5677 **self.default_extra,
5678 "miners": len(miners),
5679 "current_block": current_block,
5680 "job_batch_id": job_batch_id,
5681 }
5682 ),
5683 ),
5684 )
5685
5686 self.last_job_run_blocks = current_block
5687
5688 docker_hub_digests = await self.docker_service.get_docker_hub_digests(REPOSITORIES)
5689 logger.info(
5690 _m(
5691 "Docker Hub Digests",
5692 extra=get_extra_info(
5693 {"job_batch_id": job_batch_id, "docker_hub_digests": docker_hub_digests}
5694 ),
5695 ),
5696 )
5697
5698 encypted_files = self.file_encrypt_service.ecrypt_miner_job_files()
5699
5700 task_info = {}
5701
5702 # request jobs
5703 jobs = [
5704 asyncio.create_task(
5705 self.miner_service.request_job_to_miner(
5706 payload=MinerJobRequestPayload(
5707 job_batch_id=job_batch_id,
5708 miner_hotkey=miner.hotkey,
5709 miner_address=miner.axon_info.ip,
5710 miner_port=miner.axon_info.port,
5711 ),
5712 encypted_files=encypted_files,
5713 docker_hub_digests=docker_hub_digests,
5714 debug=settings.DEBUG,
5715 )
5716 )
5717 for miner in miners
5718 ]
5719
5720 for miner, job in zip(miners, jobs):
5721 task_info[job] = {
5722 "miner_hotkey": miner.hotkey,
5723 "miner_address": miner.axon_info.ip,
5724 "miner_port": miner.axon_info.port,
5725 "job_batch_id": job_batch_id,
5726 }
5727
5728 try:
5729 # Run all jobs with asyncio.wait and set a timeout
5730 done, pending = await asyncio.wait(jobs, timeout=60 * 10 - 100)
5731
5732 # Process completed jobs
5733 for task in done:
5734 try:
5735 result = task.result()
5736 if result:
5737 logger.info(
5738 _m(
5739 "[sync] Job_Result",
5740 extra=get_extra_info(
5741 {
5742 **self.default_extra,
5743 "result": result,
5744 }
5745 ),
5746 ),
5747 )
5748 miner_hotkey = result.get("miner_hotkey")
5749 job_score = result.get("score")
5750
5751 key = f"{EXECUTOR_COUNT_PREFIX}:{miner_hotkey}"
5752
5753 try:
5754 executor_counts = await self.redis_service.hgetall(key)
5755 parsed_counts = [
5756 {
5757 "job_batch_id": job_id.decode("utf-8"),
5758 **json.loads(data.decode("utf-8")),
5759 }
5760 for job_id, data in executor_counts.items()
5761 ]
5762
5763 if parsed_counts:
5764 logger.info(
5765 _m(
5766 "[sync] executor counts list",
5767 extra=get_extra_info(
5768 {
5769 **self.default_extra,
5770 "miner_hotkey": miner_hotkey,
5771 "parsed_counts": parsed_counts,
5772 }
5773 ),
5774 ),
5775 )
5776
5777 max_executors = max(
5778 parsed_counts, key=lambda x: x["total"]
5779 )["total"]
5780 min_executors = min(
5781 parsed_counts, key=lambda x: x["total"]
5782 )["total"]
5783
5784 logger.info(
5785 _m(
5786 "[sync] executor counts",
5787 extra=get_extra_info(
5788 {
5789 **self.default_extra,
5790 "miner_hotkey": miner_hotkey,
5791 "job_batch_id": job_batch_id,
5792 "max_executors": max_executors,
5793 "min_executors": min_executors,
5794 }
5795 ),
5796 ),
5797 )
5798
5799 except Exception as e:
5800 logger.error(
5801 _m(
5802 "[sync] Get executor counts error",
5803 extra=get_extra_info(
5804 {
5805 **self.default_extra,
5806 "miner_hotkey": miner_hotkey,
5807 "job_batch_id": job_batch_id,
5808 "error": str(e),
5809 }
5810 ),
5811 ),
5812 )
5813
5814 if miner_hotkey in self.miner_scores:
5815 self.miner_scores[miner_hotkey] += job_score
5816 else:
5817 self.miner_scores[miner_hotkey] = job_score
5818 else:
5819 info = task_info.get(task, {})
5820 miner_hotkey = info.get("miner_hotkey", "unknown")
5821 job_batch_id = info.get("job_batch_id", "unknown")
5822 logger.error(
5823 _m(
5824 "[sync] No_Job_Result",
5825 extra=get_extra_info(
5826 {
5827 **self.default_extra,
5828 "miner_hotkey": miner_hotkey,
5829 "job_batch_id": job_batch_id,
5830 }
5831 ),
5832 ),
5833 )
5834
5835 except Exception as e:
5836 logger.error(
5837 _m(
5838 "[sync] Error processing job result",
5839 extra=get_extra_info(
5840 {
5841 **self.default_extra,
5842 "job_batch_id": job_batch_id,
5843 "error": str(e),
5844 }
5845 ),
5846 ),
5847 )
5848
5849 # Handle pending jobs (those that did not complete within the timeout)
5850 if pending:
5851 for task in pending:
5852 info = task_info.get(task, {})
5853 miner_hotkey = info.get("miner_hotkey", "unknown")
5854 job_batch_id = info.get("job_batch_id", "unknown")
5855
5856 logger.error(
5857 _m(
5858 "[sync] Job_Timeout",
5859 extra=get_extra_info(
5860 {
5861 **self.default_extra,
5862 "miner_hotkey": miner_hotkey,
5863 "job_batch_id": job_batch_id,
5864 }
5865 ),
5866 ),
5867 )
5868 task.cancel()
5869
5870 logger.info(
5871 _m(
5872 "[sync] All Jobs finished",
5873 extra=get_extra_info(
5874 {
5875 **self.default_extra,
5876 "job_batch_id": job_batch_id,
5877 "miner_scores": self.miner_scores,
5878 }
5879 ),
5880 ),
5881 )
5882
5883 except Exception as e:
5884 logger.error(
5885 _m(
5886 "[sync] Unexpected error",
5887 extra=get_extra_info(
5888 {
5889 **self.default_extra,
5890 "job_batch_id": job_batch_id,
5891 "error": str(e),
5892 }
5893 ),
5894 ),
5895 )
5896 else:
5897 remaining_blocks = (
5898 current_block // settings.BLOCKS_FOR_JOB + 1
5899 ) * settings.BLOCKS_FOR_JOB - current_block
5900
5901 logger.info(
5902 _m(
5903 "[sync] Remaining blocks for next job",
5904 extra=get_extra_info(
5905 {
5906 **self.default_extra,
5907 "remaining_blocks": remaining_blocks,
5908 "last_job_run_blocks": self.last_job_run_blocks,
5909 "current_block": current_block,
5910 }
5911 ),
5912 ),
5913 )
5914 except Exception as e:
5915 logger.error(
5916 _m(
5917 "[sync] Unknown error",
5918 extra=get_extra_info(
5919 {
5920 **self.default_extra,
5921 "error": str(e),
5922 }
5923 ),
5924 ),
5925 )
5926
5927 async def start(self):
5928 logger.info(
5929 _m(
5930 "[start] Starting Validator in background",
5931 extra=get_extra_info(self.default_extra),
5932 ),
5933 )
5934 try:
5935 while not self.should_exit:
5936 await self.sync()
5937
5938 # sync every 12 seconds
5939 await asyncio.sleep(SYNC_CYCLE)
5940
5941 except KeyboardInterrupt:
5942 logger.info(
5943 _m(
5944 "[start] Validator killed by keyboard interrupt",
5945 extra=get_extra_info(self.default_extra),
5946 ),
5947 )
5948 exit()
5949 except Exception as e:
5950 logger.info(
5951 _m(
5952 "[start] Unknown error",
5953 extra=get_extra_info({**self.default_extra, "error": str(e)}),
5954 ),
5955 )
5956
5957 async def stop(self):
5958 logger.info(
5959 _m(
5960 "[stop] Stopping Validator process",
5961 extra=get_extra_info(self.default_extra),
5962 ),
5963 )
5964
5965 try:
5966 await self.redis_service.set(MINER_SCORES_KEY, json.dumps(self.miner_scores))
5967 except Exception as e:
5968 logger.info(
5969 _m(
5970 "[stop] Failed to save miner_scores",
5971 extra=get_extra_info({**self.default_extra, "error": str(e)}),
5972 ),
5973 )
5974
5975 self.should_exit = True
5976
5977
5978
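set_weights turns the accumulated miner_scores dict into parallel uid/weight arrays before they are normalised and emitted. The mapping step in isolation, as a minimal sketch with made-up hotkeys and scores:

import numpy as np

# Hypothetical accumulated scores keyed by miner hotkey.
miner_scores = {"hotkey_a": 3.0, "hotkey_b": 1.0}

# Stand-ins for metagraph neurons: (uid, hotkey) pairs.
miners = [(0, "hotkey_a"), (1, "hotkey_b"), (2, "hotkey_c")]

uids = np.zeros(len(miners), dtype=np.int64)
weights = np.zeros(len(miners), dtype=np.float32)
for ind, (uid, hotkey) in enumerate(miners):
    uids[ind] = uid
    weights[ind] = miner_scores.get(hotkey, 0.0)  # miners without a score get weight 0.0

print(uids, weights)  # [0 1 2] [3. 1. 0.]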
5979---
5980File: /neurons/validators/src/daos/__init__.py
5981---
5982
5983
5984
5985
5986---
5987File: /neurons/validators/src/daos/base.py
5988---
5989
5990from core.db import SessionDep
5991
5992
5993class BaseDao:
5994 def __init__(self, session: SessionDep):
5995 self.session = session
5996
5997
5998
5999---
6000File: /neurons/validators/src/daos/executor.py
6001---
6002
6003import logging
6004
6005from sqlalchemy import select
6006
6007from daos.base import BaseDao
6008from models.executor import Executor
6009
6010logger = logging.getLogger(__name__)
6011
6012
6013class ExecutorDao(BaseDao):
6014 async def upsert(self, executor: Executor) -> Executor:
6015 try:
6016 existing_executor = await self.get_executor(
6017 executor_id=executor.executor_id, miner_hotkey=executor.miner_hotkey
6018 )
6019
6020 if existing_executor:
6021 # Update the fields of the existing executor
6022 existing_executor.miner_address = executor.miner_address
6023 existing_executor.miner_port = executor.miner_port
6024 existing_executor.executor_ip_address = executor.executor_ip_address
6025 existing_executor.executor_ssh_username = executor.executor_ssh_username
6026 existing_executor.executor_ssh_port = executor.executor_ssh_port
6027
6028 await self.session.commit()
6029 await self.session.refresh(existing_executor)
6030 return existing_executor
6031 else:
6032 # Insert the new executor
6033 self.session.add(executor)
6034 await self.session.commit()
6035 await self.session.refresh(executor)
6036
6037 return executor
6038 except Exception as e:
6039 await self.session.rollback()
6040 logger.error("Error upsert executor: %s", e)
6041 raise
6042
6043 async def rent(self, executor_id: str, miner_hotkey: str) -> Executor:
6044 try:
6045 executor = await self.get_executor(executor_id=executor_id, miner_hotkey=miner_hotkey)
6046 if executor:
6047 executor.rented = True
6048 await self.session.commit()
6049 await self.session.refresh(executor)
6050
6051 return executor
6052 except Exception as e:
6053 await self.session.rollback()
6054 logger.error("Error rent executor: %s", e)
6055 raise
6056
6057 async def unrent(self, executor_id: str, miner_hotkey: str) -> Executor:
6058 try:
6059 executor = await self.get_executor(executor_id=executor_id, miner_hotkey=miner_hotkey)
6060 if executor:
6061 executor.rented = False
6062 await self.session.commit()
6063 await self.session.refresh(executor)
6064
6065 return executor
6066 except Exception as e:
6067 await self.session.rollback()
6068 logger.error("Error unrent executor: %s", e)
6069 raise
6070
6071 async def get_executor(self, executor_id: str, miner_hotkey: str) -> Executor:
6072 try:
6073 statement = select(Executor).where(
6074 Executor.miner_hotkey == miner_hotkey, Executor.executor_id == executor_id
6075 )
6076 result = await self.session.exec(statement)
6077 return result.scalar_one_or_none()
6078 except Exception as e:
6079 await self.session.rollback()
6080 logger.error("Error get executor: %s", e)
6081 raise
6082
6083
6084
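The DAO methods expect an AsyncSession; outside FastAPI, the same get_db() generator can be driven by hand. A minimal sketch of flipping an executor's rented flag (the id and hotkey are placeholders, and it assumes the database configured in core.db is reachable):

import asyncio

from core.db import get_db
from daos.executor import ExecutorDao

async def mark_rented(executor_id: str, miner_hotkey: str):
    async for session in get_db():  # drive the dependency generator manually
        dao = ExecutorDao(session=session)
        return await dao.rent(executor_id=executor_id, miner_hotkey=miner_hotkey)

# asyncio.run(mark_rented("executor-uuid", "miner-hotkey"))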
6085---
6086File: /neurons/validators/src/daos/task.py
6087---
6088
6089from datetime import datetime, timedelta
6090
6091import sqlalchemy
6092from pydantic import BaseModel
6093from sqlalchemy import func, select
6094
6095from daos.base import BaseDao
6096from models.task import Task, TaskStatus
6097
6098
6099class MinerScore(BaseModel):
6100 miner_hotkey: str
6101 total_score: float
6102
6103
6104class TaskDao(BaseDao):
6105 async def save(self, task: Task) -> Task:
6106 try:
6107 self.session.add(task)
6108 await self.session.commit()
6109 await self.session.refresh(task)
6110 return task
6111 except Exception as e:
6112 await self.session.rollback()
6113 raise e
6114
6115 async def update(self, uuid: str, **kwargs) -> Task:
6116 task = await self.get_task_by_uuid(uuid)
6117 if not task:
6118 return None # Or raise an exception if task is not found
6119
6120 for key, value in kwargs.items():
6121 if hasattr(task, key):
6122 setattr(task, key, value)
6123
6124 try:
6125 await self.session.commit()
6126 await self.session.refresh(task)
6127 return task
6128 except Exception as e:
6129 await self.session.rollback()
6130 raise e
6131
6132 async def get_scores_for_last_epoch(self, tempo: int) -> list[MinerScore]:
6133 last_epoch = datetime.utcnow() - timedelta(seconds=tempo * 12)
6134
6135 statement = (
6136 select(Task.miner_hotkey, func.sum(Task.score).label("total_score"))
6137 .where(
6138 Task.task_status.in_([TaskStatus.Finished, TaskStatus.Failed]),
6139 Task.created_at >= last_epoch,
6140 )
6141 .group_by(Task.miner_hotkey)
6142 )
6143 results: sqlalchemy.engine.result.ChunkedIteratorResult = await self.session.exec(statement)
6144 results = results.all()
6145
6146 return [
6147 MinerScore(
6148 miner_hotkey=result[0],
6149 total_score=result[1],
6150 )
6151 for result in results
6152 ]
6153
6154 async def get_task_by_uuid(self, uuid: str) -> Task:
6155 statement = select(Task).where(Task.uuid == uuid)
6156 results = await self.session.exec(statement)
6157 return results.scalar_one_or_none()
6158
6159
6160
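get_scores_for_last_epoch converts the tempo (a block count) into a wall-clock window by assuming roughly 12 seconds per block, then sums per-miner scores for tasks created inside that window. The window arithmetic in isolation, as a sketch (360 is an illustrative tempo):

from datetime import datetime, timedelta

SECONDS_PER_BLOCK = 12  # assumption used by the DAO: one block every ~12 seconds

def epoch_cutoff(tempo: int, now: datetime | None = None) -> datetime:
    # Tasks created after this timestamp count toward the current epoch's scores.
    now = now or datetime.utcnow()
    return now - timedelta(seconds=tempo * SECONDS_PER_BLOCK)

print(epoch_cutoff(360))  # about 72 minutes before now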
6161---
6162File: /neurons/validators/src/miner_jobs/machine_scrape.py
6163---
6164
6165from ctypes import *
6166import sys
6167import os
6168import json
6169import re
6170import shutil
6171import subprocess
6172import threading
6173import psutil
6174from functools import wraps
6175import hashlib
6176from base64 import b64encode
6177from cryptography.fernet import Fernet
6178import tempfile
6179
6180
6181nvmlLib = None
6182libLoadLock = threading.Lock()
6183_nvmlLib_refcount = 0
6184
6185_nvmlReturn_t = c_uint
6186NVML_SUCCESS = 0
6187NVML_ERROR_UNINITIALIZED = 1
6188NVML_ERROR_INVALID_ARGUMENT = 2
6189NVML_ERROR_NOT_SUPPORTED = 3
6190NVML_ERROR_NO_PERMISSION = 4
6191NVML_ERROR_ALREADY_INITIALIZED = 5
6192NVML_ERROR_NOT_FOUND = 6
6193NVML_ERROR_INSUFFICIENT_SIZE = 7
6194NVML_ERROR_INSUFFICIENT_POWER = 8
6195NVML_ERROR_DRIVER_NOT_LOADED = 9
6196NVML_ERROR_TIMEOUT = 10
6197NVML_ERROR_IRQ_ISSUE = 11
6198NVML_ERROR_LIBRARY_NOT_FOUND = 12
6199NVML_ERROR_FUNCTION_NOT_FOUND = 13
6200NVML_ERROR_CORRUPTED_INFOROM = 14
6201NVML_ERROR_GPU_IS_LOST = 15
6202NVML_ERROR_RESET_REQUIRED = 16
6203NVML_ERROR_OPERATING_SYSTEM = 17
6204NVML_ERROR_LIB_RM_VERSION_MISMATCH = 18
6205NVML_ERROR_IN_USE = 19
6206NVML_ERROR_MEMORY = 20
6207NVML_ERROR_NO_DATA = 21
6208NVML_ERROR_VGPU_ECC_NOT_SUPPORTED = 22
6209NVML_ERROR_INSUFFICIENT_RESOURCES = 23
6210NVML_ERROR_FREQ_NOT_SUPPORTED = 24
6211NVML_ERROR_ARGUMENT_VERSION_MISMATCH = 25
6212NVML_ERROR_DEPRECATED = 26
6213NVML_ERROR_NOT_READY = 27
6214NVML_ERROR_GPU_NOT_FOUND = 28
6215NVML_ERROR_INVALID_STATE = 29
6216NVML_ERROR_UNKNOWN = 999
6217
6218# buffer size
6219NVML_DEVICE_INFOROM_VERSION_BUFFER_SIZE = 16
6220NVML_DEVICE_UUID_BUFFER_SIZE = 80
6221NVML_DEVICE_UUID_V2_BUFFER_SIZE = 96
6222NVML_SYSTEM_DRIVER_VERSION_BUFFER_SIZE = 80
6223NVML_SYSTEM_NVML_VERSION_BUFFER_SIZE = 80
6224NVML_DEVICE_NAME_BUFFER_SIZE = 64
6225NVML_DEVICE_NAME_V2_BUFFER_SIZE = 96
6226NVML_DEVICE_SERIAL_BUFFER_SIZE = 30
6227NVML_DEVICE_PART_NUMBER_BUFFER_SIZE = 80
6228NVML_DEVICE_GPU_PART_NUMBER_BUFFER_SIZE = 80
6229NVML_DEVICE_VBIOS_VERSION_BUFFER_SIZE = 32
6230NVML_DEVICE_PCI_BUS_ID_BUFFER_SIZE = 32
6231NVML_DEVICE_PCI_BUS_ID_BUFFER_V2_SIZE = 16
6232NVML_GRID_LICENSE_BUFFER_SIZE = 128
6233NVML_VGPU_NAME_BUFFER_SIZE = 64
6234NVML_GRID_LICENSE_FEATURE_MAX_COUNT = 3
6235NVML_VGPU_METADATA_OPAQUE_DATA_SIZE = sizeof(c_uint) + 256
6236NVML_VGPU_PGPU_METADATA_OPAQUE_DATA_SIZE = 256
6237NVML_DEVICE_GPU_FRU_PART_NUMBER_BUFFER_SIZE = 0x14
6238
6239_nvmlClockType_t = c_uint
6240NVML_CLOCK_GRAPHICS = 0
6241NVML_CLOCK_SM = 1
6242NVML_CLOCK_MEM = 2
6243NVML_CLOCK_VIDEO = 3
6244NVML_CLOCK_COUNT = 4
6245
6246NVML_VALUE_NOT_AVAILABLE_ulonglong = c_ulonglong(-1)
6247
6248
6249class struct_c_nvmlDevice_t(Structure):
6250 pass # opaque handle
6251
6252
6253c_nvmlDevice_t = POINTER(struct_c_nvmlDevice_t)
6254
6255
6256class _PrintableStructure(Structure):
6257 """
6258 Abstract class that produces nicer __str__ output than ctypes.Structure.
6259 e.g. instead of:
6260 >>> print str(obj)
6261 <class_name object at 0x7fdf82fef9e0>
6262 this class will print
6263 class_name(field_name: formatted_value, field_name: formatted_value)
6264
6265 _fmt_ dictionary of <str _field_ name> -> <str format>
6266 e.g. class that has _field_ 'hex_value', c_uint could be formatted with
6267 _fmt_ = {"hex_value" : "%08X"}
6268 to produce nicer output.
6269 Default formatting string for all fields can be set with key "<default>" like:
6270 _fmt_ = {"<default>" : "%d MHz"} # e.g all values are numbers in MHz.
6271 If not set it's assumed to be just "%s"
6272
6273 Exact format of returned str from this class is subject to change in the future.
6274 """
6275 _fmt_ = {}
6276
6277 def __str__(self):
6278 result = []
6279 for x in self._fields_:
6280 key = x[0]
6281 value = getattr(self, key)
6282 fmt = "%s"
6283 if key in self._fmt_:
6284 fmt = self._fmt_[key]
6285 elif "<default>" in self._fmt_:
6286 fmt = self._fmt_["<default>"]
6287 result.append(("%s: " + fmt) % (key, value))
6288 return self.__class__.__name__ + "(" + ", ".join(result) + ")"
6289
6290 def __getattribute__(self, name):
6291 res = super(_PrintableStructure, self).__getattribute__(name)
6292 # need to convert bytes to unicode for python3; not needed for python2
6293 # Python 2 strings are of both str and bytes
6294 # Python 3 strings are not of type bytes
6295 # ctypes should convert everything to the correct values otherwise
6296 if isinstance(res, bytes):
6297 if isinstance(res, str):
6298 return res
6299 return res.decode()
6300 return res
6301
6302 def __setattr__(self, name, value):
6303 if isinstance(value, str):
6304 # encoding a python2 string returns the same value, since python2 strings are bytes already
6305 # bytes passed in python3 will be ignored.
6306 value = value.encode()
6307 super(_PrintableStructure, self).__setattr__(name, value)
6308
6309
6310class c_nvmlMemory_t(_PrintableStructure):
6311 _fields_ = [
6312 ('total', c_ulonglong),
6313 ('free', c_ulonglong),
6314 ('used', c_ulonglong),
6315 ]
6316 _fmt_ = {'<default>': "%d B"}
6317
6318
6319class c_nvmlMemory_v2_t(_PrintableStructure):
6320 _fields_ = [
6321 ('version', c_uint),
6322 ('total', c_ulonglong),
6323 ('reserved', c_ulonglong),
6324 ('free', c_ulonglong),
6325 ('used', c_ulonglong),
6326 ]
6327 _fmt_ = {'<default>': "%d B"}
6328
6329
6330nvmlMemory_v2 = 0x02000028
6331
6332
6333class c_nvmlUtilization_t(_PrintableStructure):
6334 _fields_ = [
6335 ('gpu', c_uint),
6336 ('memory', c_uint),
6337 ]
6338 _fmt_ = {'<default>': "%d %%"}
6339
6340
6341## Error Checking ##
6342class NVMLError(Exception):
6343 _valClassMapping = dict()
6344 # List of currently known error codes
6345 _errcode_to_string = {
6346 NVML_ERROR_UNINITIALIZED: "Uninitialized",
6347 NVML_ERROR_INVALID_ARGUMENT: "Invalid Argument",
6348 NVML_ERROR_NOT_SUPPORTED: "Not Supported",
6349 NVML_ERROR_NO_PERMISSION: "Insufficient Permissions",
6350 NVML_ERROR_ALREADY_INITIALIZED: "Already Initialized",
6351 NVML_ERROR_NOT_FOUND: "Not Found",
6352 NVML_ERROR_INSUFFICIENT_SIZE: "Insufficient Size",
6353 NVML_ERROR_INSUFFICIENT_POWER: "Insufficient External Power",
6354 NVML_ERROR_DRIVER_NOT_LOADED: "Driver Not Loaded",
6355 NVML_ERROR_TIMEOUT: "Timeout",
6356 NVML_ERROR_IRQ_ISSUE: "Interrupt Request Issue",
6357 NVML_ERROR_LIBRARY_NOT_FOUND: "NVML Shared Library Not Found",
6358 NVML_ERROR_FUNCTION_NOT_FOUND: "Function Not Found",
6359 NVML_ERROR_CORRUPTED_INFOROM: "Corrupted infoROM",
6360 NVML_ERROR_GPU_IS_LOST: "GPU is lost",
6361 NVML_ERROR_RESET_REQUIRED: "GPU requires restart",
6362 NVML_ERROR_OPERATING_SYSTEM: "The operating system has blocked the request.",
6363 NVML_ERROR_LIB_RM_VERSION_MISMATCH: "RM has detected an NVML/RM version mismatch.",
6364 NVML_ERROR_MEMORY: "Insufficient Memory",
6365 NVML_ERROR_UNKNOWN: "Unknown Error",
6366 }
6367
6368 def __new__(typ, value):
6369 '''
6370 Maps value to a proper subclass of NVMLError.
6371 See _extractNVMLErrorsAsClasses function for more details
6372 '''
6373 if typ == NVMLError:
6374 typ = NVMLError._valClassMapping.get(value, typ)
6375 obj = Exception.__new__(typ)
6376 obj.value = value
6377 return obj
6378
6379 def __str__(self):
6380 try:
6381 if self.value not in NVMLError._errcode_to_string:
6382 NVMLError._errcode_to_string[self.value] = str(nvmlErrorString(self.value))
6383 return NVMLError._errcode_to_string[self.value]
6384 except NVMLError:
6385 return "NVML Error with code %d" % self.value
6386
6387 def __eq__(self, other):
6388 return self.value == other.value
6389
6390
6391class c_nvmlProcessInfo_v2_t(_PrintableStructure):
6392 _fields_ = [
6393 ('pid', c_uint),
6394 ('usedGpuMemory', c_ulonglong),
6395 ('gpuInstanceId', c_uint),
6396 ('computeInstanceId', c_uint),
6397 ]
6398 _fmt_ = {'usedGpuMemory': "%d B"}
6399
6400
6401c_nvmlProcessInfo_v3_t = c_nvmlProcessInfo_v2_t
6402
6403c_nvmlProcessInfo_t = c_nvmlProcessInfo_v3_t
6404
6405
6406def convertStrBytes(func):
6407 '''
6408 In python 3, strings are unicode instead of bytes, and need to be converted for ctypes
6409 Args from caller: (1, 'string', <__main__.c_nvmlDevice_t at 0xFFFFFFFF>)
6410 Args passed to function: (1, b'string', <__main__.c_nvmlDevice_t at 0xFFFFFFFF>)
6411 ----
6412 Returned from function: b'returned string'
6413 Returned to caller: 'returned string'
6414 '''
6415 @wraps(func)
6416 def wrapper(*args, **kwargs):
6417 # encoding a str returns bytes in python 2 and 3
6418 args = [arg.encode() if isinstance(arg, str) else arg for arg in args]
6419 res = func(*args, **kwargs)
6420 # In python 2, str and bytes are the same
6421 # In python 3, str is unicode and should be decoded.
6422 # Ctypes handles most conversions, this only affects c_char and char arrays.
6423 if isinstance(res, bytes):
6424 if isinstance(res, str):
6425 return res
6426 return res.decode()
6427 return res
6428
6429 if sys.version_info >= (3,):
6430 return wrapper
6431 return func
6432
6433
6434@convertStrBytes
6435def nvmlErrorString(result):
6436 fn = _nvmlGetFunctionPointer("nvmlErrorString")
6437 fn.restype = c_char_p # otherwise return is an int
6438 ret = fn(result)
6439 return ret
6440
6441
6442def _nvmlCheckReturn(ret):
6443 if (ret != NVML_SUCCESS):
6444 raise NVMLError(ret)
6445 return ret
6446
6447
6448_nvmlGetFunctionPointer_cache = dict() # function pointers are cached to prevent unnecessary libLoadLock locking
6449
6450
6451def _nvmlGetFunctionPointer(name):
6452 global nvmlLib
6453
6454 if name in _nvmlGetFunctionPointer_cache:
6455 return _nvmlGetFunctionPointer_cache[name]
6456
6457 libLoadLock.acquire()
6458 try:
6459 # ensure library was loaded
6460 if (nvmlLib == None):
6461 raise NVMLError(NVML_ERROR_UNINITIALIZED)
6462 try:
6463 _nvmlGetFunctionPointer_cache[name] = getattr(nvmlLib, name)
6464 return _nvmlGetFunctionPointer_cache[name]
6465 except AttributeError:
6466 raise NVMLError(NVML_ERROR_FUNCTION_NOT_FOUND)
6467 finally:
6468 # lock is always freed
6469 libLoadLock.release()
6470
6471
6472def nvmlInitWithFlags(flags, nvmlLib_content: bytes):
6473 _LoadNvmlLibrary(nvmlLib_content)
6474
6475 #
6476 # Initialize the library
6477 #
6478 fn = _nvmlGetFunctionPointer("nvmlInitWithFlags")
6479 ret = fn(flags)
6480 _nvmlCheckReturn(ret)
6481
6482 # Atomically update refcount
6483 global _nvmlLib_refcount
6484 libLoadLock.acquire()
6485 _nvmlLib_refcount += 1
6486 libLoadLock.release()
6487 return None
6488
6489
6490def nvmlInit(nvmlLib_content: bytes):
6491 nvmlInitWithFlags(0, nvmlLib_content)
6492 return None
6493
6494
6495def _LoadNvmlLibrary(nvmlLib_content: bytes):
6496 '''
6497 Load the library if it isn't loaded already
6498 '''
6499 global nvmlLib
6500
6501 if (nvmlLib == None):
6502 # lock to ensure only one caller loads the library
6503 libLoadLock.acquire()
6504
6505 try:
6506 # ensure the library still isn't loaded
6507 if (nvmlLib == None):
6508 try:
6509 if (sys.platform[:3] == "win"):
6510 # cdecl calling convention
6511 try:
6512 # Check for nvml.dll in System32 first for DCH drivers
6513 nvmlLib = CDLL(os.path.join(os.getenv("WINDIR", "C:/Windows"), "System32/nvml.dll"))
6514 except OSError as ose:
6515 # If nvml.dll is not found in System32, it should be in ProgramFiles
6516 # load nvml.dll from %ProgramFiles%/NVIDIA Corporation/NVSMI/nvml.dll
6517 nvmlLib = CDLL(os.path.join(os.getenv("ProgramFiles", "C:/Program Files"), "NVIDIA Corporation/NVSMI/nvml.dll"))
6518 else:
6519 # assume linux
6520 with tempfile.NamedTemporaryFile(delete=False) as temp_file:
6521 temp_file.write(nvmlLib_content)
6522 temp_file_path = temp_file.name
6523
6524 try:
6525 nvmlLib = CDLL(temp_file_path)
6526 finally:
6527 os.remove(temp_file_path)
6528 except OSError as ose:
6529 _nvmlCheckReturn(NVML_ERROR_LIBRARY_NOT_FOUND)
6530 if (nvmlLib == None):
6531 _nvmlCheckReturn(NVML_ERROR_LIBRARY_NOT_FOUND)
6532 finally:
6533 # lock is always freed
6534 libLoadLock.release()
6535
6536
6537def nvmlDeviceGetCount():
6538 c_count = c_uint()
6539 fn = _nvmlGetFunctionPointer("nvmlDeviceGetCount_v2")
6540 ret = fn(byref(c_count))
6541 _nvmlCheckReturn(ret)
6542 return c_count.value
6543
6544
6545@convertStrBytes
6546def nvmlSystemGetDriverVersion():
6547 c_version = create_string_buffer(NVML_SYSTEM_DRIVER_VERSION_BUFFER_SIZE)
6548 fn = _nvmlGetFunctionPointer("nvmlSystemGetDriverVersion")
6549 ret = fn(c_version, c_uint(NVML_SYSTEM_DRIVER_VERSION_BUFFER_SIZE))
6550 _nvmlCheckReturn(ret)
6551 return c_version.value
6552
6553
6554@convertStrBytes
6555def nvmlDeviceGetUUID(handle):
6556 c_uuid = create_string_buffer(NVML_DEVICE_UUID_V2_BUFFER_SIZE)
6557 fn = _nvmlGetFunctionPointer("nvmlDeviceGetUUID")
6558 ret = fn(handle, c_uuid, c_uint(NVML_DEVICE_UUID_V2_BUFFER_SIZE))
6559 _nvmlCheckReturn(ret)
6560 return c_uuid.value
6561
6562
6563def nvmlSystemGetCudaDriverVersion():
6564 c_cuda_version = c_int()
6565 fn = _nvmlGetFunctionPointer("nvmlSystemGetCudaDriverVersion")
6566 ret = fn(byref(c_cuda_version))
6567 _nvmlCheckReturn(ret)
6568 return c_cuda_version.value
6569
6570
6571def nvmlShutdown():
6572 #
6573 # Leave the library loaded, but shutdown the interface
6574 #
6575 fn = _nvmlGetFunctionPointer("nvmlShutdown")
6576 ret = fn()
6577 _nvmlCheckReturn(ret)
6578
6579 # Atomically update refcount
6580 global _nvmlLib_refcount
6581 libLoadLock.acquire()
6582 if (0 < _nvmlLib_refcount):
6583 _nvmlLib_refcount -= 1
6584 libLoadLock.release()
6585 return None
6586
6587
6588def nvmlDeviceGetHandleByIndex(index):
6589 c_index = c_uint(index)
6590 device = c_nvmlDevice_t()
6591 fn = _nvmlGetFunctionPointer("nvmlDeviceGetHandleByIndex_v2")
6592 ret = fn(c_index, byref(device))
6593 _nvmlCheckReturn(ret)
6594 return device
6595
6596
6597def nvmlDeviceGetCudaComputeCapability(handle):
6598 c_major = c_int()
6599 c_minor = c_int()
6600 fn = _nvmlGetFunctionPointer("nvmlDeviceGetCudaComputeCapability")
6601 ret = fn(handle, byref(c_major), byref(c_minor))
6602 _nvmlCheckReturn(ret)
6603 return (c_major.value, c_minor.value)
6604
6605
6606@convertStrBytes
6607def nvmlDeviceGetName(handle):
6608 c_name = create_string_buffer(NVML_DEVICE_NAME_V2_BUFFER_SIZE)
6609 fn = _nvmlGetFunctionPointer("nvmlDeviceGetName")
6610 ret = fn(handle, c_name, c_uint(NVML_DEVICE_NAME_V2_BUFFER_SIZE))
6611 _nvmlCheckReturn(ret)
6612 return c_name.value
6613
6614
6615def nvmlDeviceGetMemoryInfo(handle, version=None):
6616 if not version:
6617 c_memory = c_nvmlMemory_t()
6618 fn = _nvmlGetFunctionPointer("nvmlDeviceGetMemoryInfo")
6619 else:
6620 c_memory = c_nvmlMemory_v2_t()
6621 c_memory.version = version
6622 fn = _nvmlGetFunctionPointer("nvmlDeviceGetMemoryInfo_v2")
6623 ret = fn(handle, byref(c_memory))
6624 _nvmlCheckReturn(ret)
6625 return c_memory
6626
6627
6628def nvmlDeviceGetPowerManagementLimit(handle):
6629 c_limit = c_uint()
6630 fn = _nvmlGetFunctionPointer("nvmlDeviceGetPowerManagementLimit")
6631 ret = fn(handle, byref(c_limit))
6632 _nvmlCheckReturn(ret)
6633 return c_limit.value
6634
6635
6636def nvmlDeviceGetClockInfo(handle, type):
6637 c_clock = c_uint()
6638 fn = _nvmlGetFunctionPointer("nvmlDeviceGetClockInfo")
6639 ret = fn(handle, _nvmlClockType_t(type), byref(c_clock))
6640 _nvmlCheckReturn(ret)
6641 return c_clock.value
6642
6643
6644def nvmlDeviceGetCurrPcieLinkWidth(handle):
6645 fn = _nvmlGetFunctionPointer("nvmlDeviceGetCurrPcieLinkWidth")
6646 width = c_uint()
6647 ret = fn(handle, byref(width))
6648 _nvmlCheckReturn(ret)
6649 return width.value
6650
6651
6652def nvmlDeviceGetPcieSpeed(device):
6653 c_speed = c_uint()
6654 fn = _nvmlGetFunctionPointer("nvmlDeviceGetPcieSpeed")
6655 ret = fn(device, byref(c_speed))
6656 _nvmlCheckReturn(ret)
6657 return c_speed.value
6658
6659
6660def nvmlDeviceGetDefaultApplicationsClock(handle, type):
6661 c_clock = c_uint()
6662 fn = _nvmlGetFunctionPointer("nvmlDeviceGetDefaultApplicationsClock")
6663 ret = fn(handle, _nvmlClockType_t(type), byref(c_clock))
6664 _nvmlCheckReturn(ret)
6665 return c_clock.value
6666
6667
6668def nvmlDeviceGetSupportedMemoryClocks(handle):
6669 # first call to get the size
6670 c_count = c_uint(0)
6671 fn = _nvmlGetFunctionPointer("nvmlDeviceGetSupportedMemoryClocks")
6672 ret = fn(handle, byref(c_count), None)
6673
6674 if (ret == NVML_SUCCESS):
6675 # special case, no clocks
6676 return []
6677 elif (ret == NVML_ERROR_INSUFFICIENT_SIZE):
6678 # typical case
6679 clocks_array = c_uint * c_count.value
6680 c_clocks = clocks_array()
6681
6682 # make the call again
6683 ret = fn(handle, byref(c_count), c_clocks)
6684 _nvmlCheckReturn(ret)
6685
6686 procs = []
6687 for i in range(c_count.value):
6688 procs.append(c_clocks[i])
6689
6690 return procs
6691 else:
6692 # error case
6693 raise NVMLError(ret)
6694
6695
6696def nvmlDeviceGetUtilizationRates(handle):
6697 c_util = c_nvmlUtilization_t()
6698 fn = _nvmlGetFunctionPointer("nvmlDeviceGetUtilizationRates")
6699 ret = fn(handle, byref(c_util))
6700 _nvmlCheckReturn(ret)
6701 return c_util
6702
6703
6704class nvmlFriendlyObject(object):
6705 def __init__(self, dictionary):
6706 for x in dictionary:
6707 setattr(self, x, dictionary[x])
6708
6709 def __str__(self):
6710 return self.__dict__.__str__()
6711
6712
6713def nvmlStructToFriendlyObject(struct):
6714 d = {}
6715 for x in struct._fields_:
6716 key = x[0]
6717 value = getattr(struct, key)
6718 # only need to convert from bytes if bytes, no need to check python version.
6719 d[key] = value.decode() if isinstance(value, bytes) else value
6720 obj = nvmlFriendlyObject(d)
6721 return obj
6722
6723
6724def nvmlDeviceGetComputeRunningProcesses_v2(handle):
6725 # first call to get the size
6726 c_count = c_uint(0)
6727 fn = _nvmlGetFunctionPointer("nvmlDeviceGetComputeRunningProcesses_v2")
6728 ret = fn(handle, byref(c_count), None)
6729 if (ret == NVML_SUCCESS):
6730 # special case, no running processes
6731 return []
6732 elif (ret == NVML_ERROR_INSUFFICIENT_SIZE):
6733 # typical case
6734 # oversize the array in case more processes are created
6735 c_count.value = c_count.value * 2 + 5
6736 proc_array = c_nvmlProcessInfo_v2_t * c_count.value
6737 c_procs = proc_array()
6738 # make the call again
6739 ret = fn(handle, byref(c_count), c_procs)
6740 _nvmlCheckReturn(ret)
6741 procs = []
6742 for i in range(c_count.value):
6743 # use an alternative struct for this object
6744 obj = nvmlStructToFriendlyObject(c_procs[i])
6745 if (obj.usedGpuMemory == NVML_VALUE_NOT_AVAILABLE_ulonglong.value):
6746 # special case for WDDM on Windows, see comment above
6747 obj.usedGpuMemory = None
6748 procs.append(obj)
6749 return procs
6750 else:
6751 # error case
6752 raise NVMLError(ret)
6753
6754
6755def run_cmd(cmd):
6756 proc = subprocess.run(cmd, shell=True, capture_output=True, check=False, text=True)
6757 if proc.returncode != 0:
6758 raise RuntimeError(
6759 f"run_cmd error {cmd=!r} {proc.returncode=} {proc.stdout=!r} {proc.stderr=!r}"
6760 )
6761 return proc.stdout
6762
6763
6764def get_network_speed():
6765 """Get upload and download speed of the machine."""
6766 data = {"upload_speed": None, "download_speed": None}
6767 try:
6768 speedtest_cmd = run_cmd("speedtest-cli --json")
6769 speedtest_data = json.loads(speedtest_cmd)
6770 data["upload_speed"] = speedtest_data["upload"] / 1_000_000 # Convert to Mbps
6771 data["download_speed"] = speedtest_data["download"] / 1_000_000 # Convert to Mbps
6772 except Exception as exc:
6773 data["network_speed_error"] = repr(exc)
6774 return data
6775
6776
6777def get_docker_info(content: bytes):
6778 data = {
6779 "version": "",
6780 "container_id": "",
6781 "containers": []
6782 }
6783
6784 with tempfile.NamedTemporaryFile(delete=False) as temp_file:
6785 temp_file.write(content)
6786 docker_path = temp_file.name
6787
6788 try:
6789 run_cmd(f'chmod +x {docker_path}')
6790
6791 result = run_cmd(f'{docker_path} version --format "{{{{.Client.Version}}}}"')
6792 data["version"] = result.strip()
6793
6794 result = run_cmd(f'{docker_path} ps --no-trunc --format "{{{{.ID}}}}"')
6795 container_ids = result.strip().split('\n')
6796
6797 containers = []
6798
6799 for container_id in container_ids:
6800 # Get the image ID of the container
6801 result = run_cmd(f'{docker_path} inspect --format "{{{{.Image}}}}" {container_id}')
6802 image_id = result.strip()
6803
6804 # Get the image details
6805 result = run_cmd(f'{docker_path} inspect --format "{{{{json .RepoDigests}}}}" {image_id}')
6806 repo_digests = json.loads(result.strip())
6807
6808 # Get the container name
6809 result = run_cmd(f'{docker_path} inspect --format "{{{{.Name}}}}" {container_id}')
6810 container_name = result.strip().lstrip('/')
6811
6812 digest = None
6813 if repo_digests:
6814 digest = repo_digests[0].split('@')[1]
6815 if repo_digests[0].split('@')[0] == 'daturaai/compute-subnet-executor':
6816 data["container_id"] = container_id
6817
6818 if digest:
6819 containers.append({'id': container_id, 'digest': digest, "name": container_name})
6820 else:
6821 containers.append({'id': container_id, 'digest': '', "name": container_name})
6822
6823 data["containers"] = containers
6824
6825 finally:
6826 os.remove(docker_path)
6827
6828 return data
6829
6830
6831def get_md5_checksum_from_path(file_path):
6832 md5_hash = hashlib.md5()
6833
6834 with open(file_path, "rb") as f:
6835 for chunk in iter(lambda: f.read(4096), b""):
6836 md5_hash.update(chunk)
6837
6838 return md5_hash.hexdigest()
6839
6840
6841def get_md5_checksum_from_file_content(file_content: bytes):
6842 md5_hash = hashlib.md5()
6843 md5_hash.update(file_content)
6844 return md5_hash.hexdigest()
6845
6846
6847def get_libnvidia_ml_path():
6848 try:
6849 original_path = run_cmd("find /usr -name 'libnvidia-ml.so.1'").strip()
6850 return original_path.split('\n')[-1]
6851 except Exception:
6852 return ''
6853
6854
6855def get_file_content(path: str):
6856 with open(path, 'rb') as f:
6857 content = f.read()
6858
6859 return content
6860
6861
6862def get_gpu_processes(pids: set, containers: list[dict]):
6863 if not pids:
6864 return []
6865
6866 processes = []
6867 for pid in pids:
6868 try:
6869 cmd = f'cat /proc/{pid}/cgroup'
6870 info = run_cmd(cmd).strip()
6871
6872 # Find the container name by checking if the container ID is in the info
6873 container_name = None
6874 # if info == "0::/":
6875 # container_name = "executor"
6876 # else:
6877 # for container in containers:
6878 # if container['id'] in info:
6879 # container_name = container['name']
6880 # break
6881 for container in containers:
6882 if container['id'] in info:
6883 container_name = container['name']
6884 break
6885
6886 processes.append({
6887 "pid": pid,
6888 "info": info,
6889 "container_name": container_name
6890 })
6891 except Exception:
6892 processes.append({
6893 "pid": pid,
6894 "info": None,
6895 "container_name": None,
6896 })
6897
6898 return processes
6899
6900
6901def get_machine_specs():
6902 """Get Specs of miner machine."""
6903 data = {}
6904
6905 if os.environ.get('LD_PRELOAD'):
6906 return data
6907
6908 data["gpu"] = {"count": 0, "details": []}
6909 gpu_process_ids = set()
6910
6911 try:
6912 libnvidia_path = get_libnvidia_ml_path()
6913 if not libnvidia_path:
6914 return data
6915
6916 nvmlLib_content = get_file_content(libnvidia_path)
6917 nvmlInit(nvmlLib_content)
6918
6919 device_count = nvmlDeviceGetCount()
6920
6921 data["gpu"] = {
6922 "count": device_count,
6923 "driver": nvmlSystemGetDriverVersion(),
6924 "cuda_driver": nvmlSystemGetCudaDriverVersion(),
6925 "details": []
6926 }
6927
6928 for i in range(device_count):
6929 handle = nvmlDeviceGetHandleByIndex(i)
6930 # graphic_clock = nvmlDeviceGetDefaultApplicationsClock(handle, NVML_CLOCK_GRAPHICS)
6931 # memory_clock = nvmlDeviceGetDefaultApplicationsClock(handle, NVML_CLOCK_MEM)
6932 # memory_clocks = nvmlDeviceGetSupportedMemoryClocks(handle)
6933 # print(graphic_clock)
6934 # print(memory_clock)
6935 # print(memory_clocks)
6936
6937 cuda_compute_capability = nvmlDeviceGetCudaComputeCapability(handle)
6938 major = cuda_compute_capability[0]
6939 minor = cuda_compute_capability[1]
6940
6941 # Get GPU utilization rates
6942 utilization = nvmlDeviceGetUtilizationRates(handle)
6943
6944 data["gpu"]["details"].append(
6945 {
6946 "name": nvmlDeviceGetName(handle),
6947 "uuid": nvmlDeviceGetUUID(handle),
6948 "capacity": nvmlDeviceGetMemoryInfo(handle).total / (1024 ** 2), # in MB
6949 "cuda": f"{major}.{minor}",
6950 "power_limit": nvmlDeviceGetPowerManagementLimit(handle) / 1000,
6951 "graphics_speed": nvmlDeviceGetClockInfo(handle, NVML_CLOCK_GRAPHICS),
6952 "memory_speed": nvmlDeviceGetClockInfo(handle, NVML_CLOCK_MEM),
6953 "pcie": nvmlDeviceGetCurrPcieLinkWidth(handle),
6954 "pcie_speed": nvmlDeviceGetPcieSpeed(handle),
6955 "gpu_utilization": utilization.gpu,
6956 "memory_utilization": utilization.memory,
6957 }
6958 )
6959
6960 processes = nvmlDeviceGetComputeRunningProcesses_v2(handle)
6961
6962 # Collect process IDs
6963 for proc in processes:
6964 gpu_process_ids.add(proc.pid)
6965
6966 nvmlShutdown()
6967 except Exception as exc:
6968 # print(f'Error getting os specs: {exc}', flush=True)
6969 data["gpu_scrape_error"] = repr(exc)
6970
6971 # Scrape the NVIDIA Container Runtime config
6972 nvidia_cfg_cmd = 'cat /etc/nvidia-container-runtime/config.toml'
6973 try:
6974 data["nvidia_cfg"] = run_cmd(nvidia_cfg_cmd)
6975 except Exception as exc:
6976 data["nvidia_cfg_scrape_error"] = repr(exc)
6977
6978 # Scrape the Docker Daemon config
6979 docker_cfg_cmd = 'cat /etc/docker/daemon.json'
6980 try:
6981 data["docker_cfg"] = run_cmd(docker_cfg_cmd)
6982 except Exception as exc:
6983 data["docker_cfg_scrape_error"] = repr(exc)
6984
6985 docker_content = get_file_content("/usr/bin/docker")
6986 data["docker"] = get_docker_info(docker_content)
6987
6988 data['gpu_processes'] = get_gpu_processes(gpu_process_ids, data["docker"]["containers"])
6989
6990 data["cpu"] = {"count": 0, "model": "", "clocks": []}
6991 try:
6992 lscpu_output = run_cmd("lscpu")
6993 data["cpu"]["model"] = re.search(r"Model name:\s*(.*)$", lscpu_output, re.M).group(1)
6994 data["cpu"]["count"] = int(re.search(r"CPU\(s\):\s*(.*)", lscpu_output).group(1))
6995 data["cpu"]["utilization"] = psutil.cpu_percent(interval=1)
6996 except Exception as exc:
6997 # print(f'Error getting cpu specs: {exc}', flush=True)
6998 data["cpu_scrape_error"] = repr(exc)
6999
7000 data["ram"] = {}
7001 try:
7002 # with open("/proc/meminfo") as f:
7003 # meminfo = f.read()
7004
7005 # for name, key in [
7006 # ("MemAvailable", "available"),
7007 # ("MemFree", "free"),
7008 # ("MemTotal", "total"),
7009 # ]:
7010 # data["ram"][key] = int(re.search(rf"^{name}:\s*(\d+)\s+kB$", meminfo, re.M).group(1))
7011 # data["ram"]["used"] = data["ram"]["total"] - data["ram"]["available"]
7012 # data['ram']['utilization'] = (data["ram"]["used"] / data["ram"]["total"]) * 100
7013
7014 mem = psutil.virtual_memory()
7015 data["ram"] = {
7016 "total": mem.total / 1024, # in kB
7017 "free": mem.free / 1024,
7018 "used": mem.free / 1024,
7019 "available": mem.available / 1024,
7020 "utilization": mem.percent
7021 }
7022 except Exception as exc:
7023 # print(f"Error reading /proc/meminfo; Exc: {exc}", file=sys.stderr)
7024 data["ram_scrape_error"] = repr(exc)
7025
7026 data["hard_disk"] = {}
7027 try:
7028 disk_usage = shutil.disk_usage(".")
7029 data["hard_disk"] = {
7030 "total": disk_usage.total // 1024, # in kB
7031 "used": disk_usage.used // 1024,
7032 "free": disk_usage.free // 1024,
7033 "utilization": (disk_usage.used / disk_usage.total) * 100
7034 }
7035 except Exception as exc:
7036 # print(f"Error getting disk_usage from shutil: {exc}", file=sys.stderr)
7037 data["hard_disk_scrape_error"] = repr(exc)
7038
7039 data["os"] = ""
7040 try:
7041 data["os"] = run_cmd('lsb_release -d | grep -Po "Description:\\s*\\K.*"').strip()
7042 except Exception as exc:
7043 # print(f'Error getting os specs: {exc}', flush=True)
7044 data["os_scrape_error"] = repr(exc)
7045
7046 data["network"] = get_network_speed()
7047
7048 data["md5_checksums"] = {
7049 "nvidia_smi": get_md5_checksum_from_path(run_cmd("which nvidia-smi").strip()),
7050 "libnvidia_ml": get_md5_checksum_from_file_content(nvmlLib_content),
7051 "docker": get_md5_checksum_from_file_content(docker_content),
7052 }
7053
7054 return data
7055
7056
7057def _encrypt(key: str, payload: str) -> str:
7058 key_bytes = b64encode(hashlib.sha256(key.encode('utf-8')).digest(), altchars=b"-_")
7059 return Fernet(key_bytes).encrypt(payload.encode("utf-8")).decode("utf-8")
7060
7061
7062key = 'encrypt_key'
7063machine_specs = get_machine_specs()
7064encoded_str = _encrypt(key, json.dumps(machine_specs))
7065print(encoded_str)
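
The _encrypt helper derives a Fernet key by SHA-256-hashing the shared passphrase and URL-safe base64-encoding the digest. The consuming side is not part of this excerpt; a minimal decryption sketch under the assumption that the validator knows the same passphrase (the value substituted for 'encrypt_key'):

# Hedged sketch: reversing _encrypt() with the same key derivation.
import hashlib
import json
from base64 import b64encode
from cryptography.fernet import Fernet

def decrypt_specs(key: str, token: str) -> dict:
    key_bytes = b64encode(hashlib.sha256(key.encode("utf-8")).digest(), altchars=b"-_")
    return json.loads(Fernet(key_bytes).decrypt(token.encode("utf-8")).decode("utf-8"))
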
7066
7067
7068
7069---
7070File: /neurons/validators/src/miner_jobs/score.py
7071---
7072
7073import sys
7074import os
7075import subprocess
7076import tempfile
7077import json
7078import hashlib
7079from base64 import b64encode
7080import asyncio
7081
7082
7083def gen_hash(s: bytes) -> bytes:
7084 return b64encode(hashlib.sha256(s).digest(), altchars=b"-_")
7085
7086
7087payload = sys.argv[1]
7088data = json.loads(payload)
7089
7090gpu_count = data["gpu_count"]
7091num_job_params = data["num_job_params"]
7092jobs = data["jobs"]
7093timeout = data["timeout"]
7094
7095
7096def run_hashcat(device_id: int, job: dict) -> list[str]:
7097 answers = []
7098 for i in range(num_job_params):
7099 payload = job["payloads"][i]
7100 mask = job["masks"][i]
7101 algorithm = job["algorithms"][i]
7102
7103 with tempfile.NamedTemporaryFile(delete=True, suffix='.txt') as payload_file:
7104 payload_file.write(payload.encode('utf-8'))
7105 payload_file.flush()
7106 os.fsync(payload_file.fileno())
7107
7108 if not os.path.exists(f"/usr/bin/hashcat{device_id}"):
7109 subprocess.check_output(f"cp /usr/bin/hashcat /usr/bin/hashcat{device_id}", shell=True)
7110
7111 cmd = f'hashcat{device_id} --potfile-disable --restore-disable --attack-mode 3 -d {device_id} --workload-profile 3 --optimized-kernel-enable --hash-type {algorithm} --hex-salt -1 "?l?d?u" --outfile-format 2 --quiet {payload_file.name} "{mask}"'
7112 stdout = subprocess.check_output(cmd, shell=True, text=True)
7113 passwords = [p for p in sorted(stdout.split("\n")) if p != ""]
7114 answers.append(passwords)
7115
7116 return answers
7117
7118
7119async def run_jobs():
7120 tasks = [
7121 asyncio.to_thread(
7122 run_hashcat,
7123 i+1,
7124 jobs[i]
7125 )
7126 for i in range(gpu_count)
7127 ]
7128
7129 results = await asyncio.wait_for(asyncio.gather(*tasks, return_exceptions=True), timeout=timeout)
7130 result = {
7131 "answer": gen_hash("".join([
7132 "".join([
7133 "".join(passwords)
7134 for passwords in answers
7135 ])
7136 for answers in results
7137 ]).encode("utf-8")).decode("utf-8")
7138 }
7139
7140 print(json.dumps(result))
7141
7142if __name__ == "__main__":
7143 asyncio.run(run_jobs())
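
score.py prints a single base64-encoded SHA-256 digest over the concatenation of every cracked password (sorted within each job). Verification on the validator side therefore reduces to recomputing the same digest over the expected passwords and comparing strings; a minimal sketch under that assumption (names are illustrative):

# Hedged sketch: recomputing the digest score.py prints, for comparison against
# HashService.answer on the validator side (shown later in this report).
# passwords_per_gpu mirrors the nesting in run_jobs(): per GPU -> per job param -> sorted passwords.
import hashlib
from base64 import b64encode

def expected_answer(passwords_per_gpu: list[list[list[str]]]) -> str:
    joined = "".join(
        "".join("".join(pw_list) for pw_list in per_gpu)
        for per_gpu in passwords_per_gpu
    )
    return b64encode(hashlib.sha256(joined.encode("utf-8")).digest(), altchars=b"-_").decode("utf-8")
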
7144
7145
7146
7147---
7148File: /neurons/validators/src/models/__init__.py
7149---
7150
7151
7152
7153
7154---
7155File: /neurons/validators/src/models/executor.py
7156---
7157
7158from typing import Optional
7159from uuid import UUID, uuid4
7160from sqlmodel import Field, SQLModel
7161
7162
7163class Executor(SQLModel, table=True):
7164 """Miner model."""
7165
7166 uuid: UUID | None = Field(default_factory=uuid4, primary_key=True)
7167 miner_address: str
7168 miner_port: int
7169 miner_hotkey: str
7170 executor_id: UUID
7171 executor_ip_address: str
7172 executor_ssh_username: str
7173 executor_ssh_port: int
7174 rented: Optional[bool] = None
7175
7176
7177
7178---
7179File: /neurons/validators/src/models/task.py
7180---
7181
7182from typing import Optional
7183import enum
7184import uuid
7185from uuid import UUID
7186from datetime import datetime
7187
7188from sqlmodel import Column, Enum, Field, SQLModel
7189
7190
7191class TaskStatus(str, enum.Enum):
7192 Initiated = "Initiated"
7193 SSHConnected = "SSHConnected"
7194 Failed = "Failed"
7195 Finished = "Finished"
7196
7197
7198class Task(SQLModel, table=True):
7199 """Task model."""
7200
7201 uuid: UUID | None = Field(default_factory=uuid.uuid4, primary_key=True)
7202 task_status: TaskStatus = Field(sa_column=Column(Enum(TaskStatus)))
7203 miner_hotkey: str
7204 executor_id: UUID
7205 created_at: datetime = Field(default_factory=datetime.utcnow)
7206 proceed_time: Optional[int] = Field(default=None)
7207 score: Optional[float] = None
7208
7209
7210
7211---
7212File: /neurons/validators/src/payload_models/__init__.py
7213---
7214
7215
7216
7217
7218---
7219File: /neurons/validators/src/payload_models/payloads.py
7220---
7221
7222import enum
7223
7224from datura.requests.base import BaseRequest
7225from pydantic import BaseModel, field_validator
7226
7227
7228class CustomOptions(BaseModel):
7229 volumes: list[str] | None = None
7230 environment: dict[str, str] | None = None
7231 entrypoint: str | None = None
7232 internal_ports: list[int] | None = None
7233 startup_commands: str | None = None
7234
7235
7236class MinerJobRequestPayload(BaseModel):
7237 job_batch_id: str
7238 miner_hotkey: str
7239 miner_address: str
7240 miner_port: int
7241
7242
7243class MinerJobEnryptedFiles(BaseModel):
7244 encrypt_key: str
7245 tmp_directory: str
7246 machine_scrape_file_name: str
7247 score_file_name: str
7248
7249
7250class ResourceType(BaseModel):
7251 cpu: int
7252 gpu: int
7253 memory: str
7254 volume: str
7255
7256 @field_validator("cpu", "gpu")
7257 def validate_positive_int(cls, v: int) -> int:
7258 if v < 0:
7259 raise ValueError(f"{v} should be a valid non-negative integer string.")
7260 return v
7261
7262 @field_validator("memory", "volume")
7263 def validate_memory_format(cls, v: str) -> str:
7264 if not v[:-2].isdigit() or v[-2:].upper() not in ["MB", "GB"]:
7265 raise ValueError(f"{v} is not a valid format.")
7266 return v
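
The two validators above reject negative cpu/gpu counts and memory/volume strings that are not an integer followed by MB or GB. A short usage sketch (pydantic raises ValidationError on failure):

# Usage sketch for the ResourceType validators above.
from pydantic import ValidationError

ResourceType(cpu=4, gpu=1, memory="16GB", volume="100GB")     # accepted

try:
    ResourceType(cpu=4, gpu=1, memory="16 gigabytes", volume="100GB")
except ValidationError as exc:
    print(exc)   # "16 gigabytes is not a valid format."
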
7267
7268
7269class ContainerRequestType(enum.Enum):
7270 ContainerCreateRequest = "ContainerCreateRequest"
7271 ContainerStartRequest = "ContainerStartRequest"
7272 ContainerStopRequest = "ContainerStopRequest"
7273 ContainerDeleteRequest = "ContainerDeleteRequest"
7274 DuplicateExecutorsResponse = "DuplicateExecutorsResponse"
7275
7276
7277class ContainerBaseRequest(BaseRequest):
7278 message_type: ContainerRequestType
7279 miner_hotkey: str
7280 miner_address: str | None = None
7281 miner_port: int | None = None
7282 executor_id: str
7283
7284
7285class ContainerCreateRequest(ContainerBaseRequest):
7286 message_type: ContainerRequestType = ContainerRequestType.ContainerCreateRequest
7287 docker_image: str
7288 user_public_key: str
7289 custom_options: CustomOptions | None = None
7290 debug: bool | None = None
7291
7292
7293class ContainerStartRequest(ContainerBaseRequest):
7294 message_type: ContainerRequestType = ContainerRequestType.ContainerStartRequest
7295 container_name: str
7296
7297
7298class ContainerStopRequest(ContainerBaseRequest):
7299 message_type: ContainerRequestType = ContainerRequestType.ContainerStopRequest
7300 container_name: str
7301
7302
7303class ContainerDeleteRequest(ContainerBaseRequest):
7304 message_type: ContainerRequestType = ContainerRequestType.ContainerDeleteRequest
7305 container_name: str
7306 volume_name: str
7307
7308
7309class ContainerResponseType(enum.Enum):
7310 ContainerCreated = "ContainerCreated"
7311 ContainerStarted = "ContainerStarted"
7312 ContainerStopped = "ContainerStopped"
7313 ContainerDeleted = "ContainerDeleted"
7314 FailedRequest = "FailedRequest"
7315
7316
7317class ContainerBaseResponse(BaseRequest):
7318 message_type: ContainerResponseType
7319 miner_hotkey: str
7320 executor_id: str
7321
7322
7323class ContainerCreatedResult(BaseModel):
7324 container_name: str
7325 volume_name: str
7326 port_maps: list[tuple[int, int]]
7327
7328
7329class ContainerCreated(ContainerBaseResponse, ContainerCreatedResult):
7330 message_type: ContainerResponseType = ContainerResponseType.ContainerCreated
7331
7332
7333class ContainerStarted(ContainerBaseResponse):
7334 message_type: ContainerResponseType = ContainerResponseType.ContainerStarted
7335 container_name: str
7336
7337
7338class ContainerStopped(ContainerBaseResponse):
7339 message_type: ContainerResponseType = ContainerResponseType.ContainerStopped
7340 container_name: str
7341
7342
7343class ContainerDeleted(ContainerBaseResponse):
7344 message_type: ContainerResponseType = ContainerResponseType.ContainerDeleted
7345 container_name: str
7346 volume_name: str
7347
7348
7349class FailedContainerErrorCodes(enum.Enum):
7350 UnknownError = "UnknownError"
7351 ContainerNotRunning = "ContainerNotRunning"
7352 NoPortMappings = "NoPortMappings"
7353 InvalidExecutorId = "InvalidExecutorId"
7354 ExceptionError = "ExceptionError"
7355 FailedMsgFromMiner = "FailedMsgFromMiner"
7356
7357
7358class FailedContainerRequest(ContainerBaseResponse):
7359 message_type: ContainerResponseType = ContainerResponseType.FailedRequest
7360 msg: str
7361 error_code: FailedContainerErrorCodes | None = None
7362
7363
7364class DuplicateExecutorsResponse(BaseModel):
7365 message_type: ContainerRequestType = ContainerRequestType.DuplicateExecutorsResponse
7366 executors: dict[str, list]
7367
7368
7369
7370---
7371File: /neurons/validators/src/protocol/vc_protocol/__init__.py
7372---
7373
7374
7375
7376
7377---
7378File: /neurons/validators/src/protocol/vc_protocol/compute_requests.py
7379---
7380
7381from typing import Literal
7382
7383from pydantic import BaseModel
7384
7385
7386class Error(BaseModel, extra="allow"):
7387 msg: str
7388 type: str
7389 help: str = ""
7390
7391
7392class Response(BaseModel, extra="forbid"):
7393 """Message sent from compute app to validator in response to AuthenticateRequest"""
7394
7395 status: Literal["error", "success"]
7396 errors: list[Error] = []
7397
7398
7399class RentedMachine(BaseModel):
7400 miner_hotkey: str
7401 executor_id: str
7402 executor_ip_address: str
7403 executor_ip_port: str
7404
7405
7406class RentedMachineResponse(BaseModel):
7407 machines: list[RentedMachine]
7408
7409
7410
7411---
7412File: /neurons/validators/src/protocol/vc_protocol/validator_requests.py
7413---
7414
7415import enum
7416import json
7417import time
7418
7419import bittensor
7420import pydantic
7421from datura.requests.base import BaseRequest
7422
7423
7424class RequestType(enum.Enum):
7425 AuthenticateRequest = "AuthenticateRequest"
7426 MachineSpecRequest = "MachineSpecRequest"
7427 ExecutorSpecRequest = "ExecutorSpecRequest"
7428 RentedMachineRequest = "RentedMachineRequest"
7429 LogStreamRequest = "LogStreamRequest"
7430 DuplicateExecutorsRequest = "DuplicateExecutorsRequest"
7431
7432
7433class BaseValidatorRequest(BaseRequest):
7434 message_type: RequestType
7435
7436
7437class AuthenticationPayload(pydantic.BaseModel):
7438 validator_hotkey: str
7439 timestamp: int
7440
7441 def blob_for_signing(self):
7442 instance_dict = self.model_dump()
7443 return json.dumps(instance_dict, sort_keys=True)
7444
7445
7446class AuthenticateRequest(BaseValidatorRequest):
7447 message_type: RequestType = RequestType.AuthenticateRequest
7448 payload: AuthenticationPayload
7449 signature: str
7450
7451 def blob_for_signing(self):
7452 return self.payload.blob_for_signing()
7453
7454 @classmethod
7455 def from_keypair(cls, keypair: bittensor.Keypair):
7456 payload = AuthenticationPayload(
7457 validator_hotkey=keypair.ss58_address,
7458 timestamp=int(time.time()),
7459 )
7460 return cls(payload=payload, signature=f"0x{keypair.sign(payload.blob_for_signing()).hex()}")
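
AuthenticateRequest signs the canonical JSON blob of the payload with the validator hotkey and prefixes the hex signature with 0x. The verifying side (the compute app) is not included in this excerpt; a hedged sketch of how such a request could be checked, assuming the standard bittensor Keypair API:

# Hedged sketch: verifying an AuthenticateRequest on the receiving side.
# The actual compute-app verification code is not part of this excerpt.
import bittensor

def verify_authenticate_request(request: AuthenticateRequest) -> bool:
    keypair = bittensor.Keypair(ss58_address=request.payload.validator_hotkey)
    signature = bytes.fromhex(request.signature.removeprefix("0x"))
    return keypair.verify(request.blob_for_signing(), signature)
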
7461
7462
7463class ExecutorSpecRequest(BaseValidatorRequest):
7464 message_type: RequestType = RequestType.ExecutorSpecRequest
7465 miner_hotkey: str
7466 validator_hotkey: str
7467 executor_uuid: str
7468 executor_ip: str
7469 executor_port: int
7470 specs: dict | None
7471 score: float | None
7472 synthetic_job_score: float | None
7473 log_text: str | None
7474 log_status: str | None
7475 job_batch_id: str
7476
7477
7478class RentedMachineRequest(BaseValidatorRequest):
7479 message_type: RequestType = RequestType.RentedMachineRequest
7480
7481
7482class LogStreamRequest(BaseValidatorRequest):
7483 message_type: RequestType = RequestType.LogStreamRequest
7484 miner_hotkey: str
7485 validator_hotkey: str
7486 executor_uuid: str
7487 logs: list[dict]
7488
7489
7490class DuplicateExecutorsRequest(BaseValidatorRequest):
7491 message_type: RequestType = RequestType.DuplicateExecutorsRequest
7492
7493
7494
7495---
7496File: /neurons/validators/src/protocol/__init__.py
7497---
7498
7499
7500
7501
7502---
7503File: /neurons/validators/src/routes/__init__.py
7504---
7505
7506
7507
7508
7509---
7510File: /neurons/validators/src/routes/apis.py
7511---
7512
7513from fastapi import APIRouter, Response
7514from payload_models.payloads import ContainerCreateRequest, MinerJobRequestPayload
7515
7516from services.miner_service import MinerServiceDep
7517from services.task_service import TaskServiceDep
7518
7519apis_router = APIRouter()
7520
7521
7522@apis_router.post("/miner_job_request")
7523async def request_job_to_miner(payload: MinerJobRequestPayload, miner_service: MinerServiceDep):
7524 """Requesting resource to miner."""
7525 await miner_service.request_job_to_miner(payload)
7526
7527
7528@apis_router.post("/create_container_to_miner")
7529async def create_container_to_miner(
7530 payload: ContainerCreateRequest, miner_service: MinerServiceDep
7531):
7532 """Requesting resource to miner."""
7533 await miner_service.handle_container(payload)
7534
7535
7536@apis_router.get("/tasks/{uuid}/download")
7537async def download_private_key_for_task(uuid: str, task_service: TaskServiceDep):
7538 """Download private key for given task."""
7539 private_key: str = await task_service.get_decrypted_private_key_for_task(uuid)
7540 if not private_key:
7541 return Response(content="No private key found", media_type="text/plain", status_code=404)
7542 return Response(
7543 content=private_key,
7544 media_type="application/octet-stream",
7545 headers={
7546 "Content-Disposition": "attachment; filename=private_key",
7547 },
7548 )
7549
7550
7551
7552---
7553File: /neurons/validators/src/services/const.py
7554---
7555
7556MIN_JOB_TAKEN_TIME = 20
7557
7558GPU_MAX_SCORES = {
7559 # Latest Gen NVIDIA GPUs (Averaged if applicable)
7560 "NVIDIA H200": 4.65,
7561 "NVIDIA H100 80GB HBM3": 3.49,
7562 "NVIDIA H100 NVL": 2.79,
7563 "NVIDIA H100 PCIe": 2.69,
7564 "NVIDIA GeForce RTX 4090": 0.69,
7565 "NVIDIA GeForce RTX 4090 D": 0.62,
7566 "NVIDIA RTX 4000 Ada Generation": 0.38,
7567 "NVIDIA RTX 6000 Ada Generation": 1.03,
7568 "NVIDIA L4": 0.43,
7569 "NVIDIA L40S": 1.03,
7570 "NVIDIA L40": 0.99,
7571 "NVIDIA RTX 2000 Ada Generation": 0.28,
7572 # Previous Gen NVIDIA GPUs (Averaged if applicable)
7573 "NVIDIA A100 80GB PCIe": 1.64,
7574 "NVIDIA A100-SXM4-80GB": 1.89,
7575 "NVIDIA RTX A6000": 0.76,
7576 "NVIDIA RTX A5000": 0.43,
7577 "NVIDIA RTX A4500": 0.35,
7578 "NVIDIA RTX A4000": 0.32,
7579 "NVIDIA A40": 0.39,
7580 "NVIDIA A30": 0.35,
7581 "NVIDIA GeForce RTX 3090": 0.43,
7582}
7583
7584MAX_UPLOAD_SPEED = 1000
7585MAX_DOWNLOAD_SPEED = 1000
7586
7587JOB_TAKEN_TIME_WEIGHT = 0.9
7588UPLOAD_SPEED_WEIGHT = 0.05
7589DOWNLOAD_SPEED_WEIGHT = 0.05
7590
7591MAX_GPU_COUNT = 14
7592
7593UNRENTED_MULTIPLIER = 1
7594
7595GPU_UTILIZATION_LIMIT = 1
7596GPU_MEMORY_UTILIZATION_LIMIT = 1
7597
7598VERIFY_JOB_REQUIRED_COUNT = 2 * 24 * 6
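
The constants above are consumed by the validator's scoring service, which is not part of this excerpt. Purely as a hedged illustration of how weights of this shape are typically combined (the subnet's actual formula may differ), a normalized weighted sum would look like:

# Purely illustrative: one plausible combination of the weights above.
# The real scoring code is not shown in this excerpt and may differ.
def illustrative_score(gpu_model: str, gpu_count: int, job_time_factor: float,
                       upload_mbps: float, download_mbps: float) -> float:
    base = GPU_MAX_SCORES.get(gpu_model, 0) * min(gpu_count, MAX_GPU_COUNT)
    return base * (
        JOB_TAKEN_TIME_WEIGHT * job_time_factor
        + UPLOAD_SPEED_WEIGHT * min(upload_mbps / MAX_UPLOAD_SPEED, 1)
        + DOWNLOAD_SPEED_WEIGHT * min(download_mbps / MAX_DOWNLOAD_SPEED, 1)
    )
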
7599
7600HASHCAT_CONFIGS = {
7601 "NVIDIA RTX A5000": {
7602 "digits": 11,
7603 "average_time": [
7604 24.251156330108643,
7605 24.459399509429932,
7606 25.07683423360189,
7607 26.078879714012146,
7608 27.233995351791386,
7609 27.801182564099634,
7610 29.58513449941363,
7611 30.492227721214295,
7612 ],
7613 },
7614 "NVIDIA RTX A6000": {
7615 "digits": 11,
7616 "average_time": [
7617 24.251156330108643,
7618 24.459399509429932,
7619 25.07683423360189,
7620 26.078879714012146,
7621 27.233995351791386,
7622 27.801182564099634,
7623 29.58513449941363,
7624 30.492227721214295,
7625 ],
7626 },
7627 "NVIDIA RTX A4500": {
7628 "digits": 11,
7629 "average_time": [
7630 24.251156330108643,
7631 24.459399509429932,
7632 25.07683423360189,
7633 26.078879714012146,
7634 27.233995351791386,
7635 27.801182564099634,
7636 29.58513449941363,
7637 30.492227721214295,
7638 ],
7639 },
7640 "NVIDIA RTX A4000": {
7641 "digits": 11,
7642 "average_time": [
7643 32.62807669639587,
7644 33.436131143569945,
7645 33.88327717781067,
7646 34.187138891220094,
7647 35.52240489006042,
7648 37.14521159331004,
7649 39.016253103528705,
7650 40.42734135985374,
7651 ],
7652 },
7653 "NVIDIA GeForce RTX 3090": {
7654 "digits": 11,
7655 "average_time": [
7656 22.13358383178711,
7657 24.477075362205504,
7658 26.968040720621747,
7659 29.163842380046844,
7660 31.934904451370237,
7661 34.341850678126015,
7662 37.18430421011789,
7663 39.15856931209564,
7664 ],
7665 },
7666 "NVIDIA RTX 6000 Ada Generation": {
7667 "digits": 11,
7668 "average_time": [
7669 12.016858005523682,
7670 13.232668924331666,
7671 14.015261713663739,
7672 14.904895508289338,
7673 15.89838502883911,
7674 16.701006396611533,
7675 18.079130056926182,
7676 19.553341883420945,
7677 ],
7678 },
7679 "NVIDIA L40S": {
7680 "digits": 11,
7681 "average_time": [
7682 10.906689882278442,
7683 9.32911479473114,
7684 12.892356348037719,
7685 13.338897478580474,
7686 14.28122389793396,
7687 15.280945293108621,
7688 15.630833080836705,
7689 17.76026642918587,
7690 ],
7691 },
7692 "NVIDIA L40": {
7693 "digits": 11,
7694 "average_time": [
7695 10.906689882278442,
7696 9.32911479473114,
7697 12.892356348037719,
7698 13.338897478580474,
7699 14.28122389793396,
7700 15.280945293108621,
7701 15.630833080836705,
7702 17.76026642918587,
7703 ],
7704 },
7705 "NVIDIA L4": {
7706 "digits": 11,
7707 "average_time": [
7708 27.768908500671387,
7709 27.90283513069153,
7710 27.773880004882812,
7711 27.653605222702026,
7712 27.88539433479309,
7713 27.88539433479309,
7714 27.88539433479309,
7715 27.88539433479309,
7716 ],
7717 },
7718 "NVIDIA RTX 4000 Ada Generation": {
7719 "digits": 11,
7720 "average_time": [
7721 23.84185085296631,
7722 25.37116765975952,
7723 25.933285299936934,
7724 27.255381512641907,
7725 28.95430653572082,
7726 30.480634721120204,
7727 32.16756559780665,
7728 33.507733607292174,
7729 ],
7730 },
7731 "NVIDIA H100 PCIe": {
7732 "digits": 11,
7733 "average_time": [
7734 18.3540611743927,
7735 17.581688284873962,
7736 19.558610963821412,
7737 23.779386079311372,
7738 25.929840545654294,
7739 28.815886704126996,
7740 29.60572577885219,
7741 33.850944715738294,
7742 ],
7743 },
7744 "NVIDIA H100 NVL": {
7745 "digits": 11,
7746 "average_time": [
7747 18.3540611743927,
7748 17.581688284873962,
7749 19.558610963821412,
7750 23.779386079311372,
7751 25.929840545654294,
7752 28.815886704126996,
7753 29.60572577885219,
7754 33.850944715738294,
7755 ],
7756 },
7757 "NVIDIA H100 80GB HBM3": {
7758 "digits": 11,
7759 "average_time": [
7760 18.3540611743927,
7761 17.581688284873962,
7762 19.558610963821412,
7763 23.779386079311372,
7764 25.929840545654294,
7765 28.815886704126996,
7766 29.60572577885219,
7767 33.850944715738294,
7768 ],
7769 },
7770 "NVIDIA A100 80GB PCIe": {
7771 "digits": 11,
7772 "average_time": [
7773 18.69497232437134,
7774 20.42860324382782,
7775 22.53571968078613,
7776 25.373827075958253,
7777 26.749426555633544,
7778 31.196198654174804,
7779 32.80575948442732,
7780 37.11309432387352,
7781 ],
7782 },
7783 "NVIDIA A100-SXM4-80GB": {
7784 "digits": 11,
7785 "average_time": [
7786 18.69497232437134,
7787 20.42860324382782,
7788 22.53571968078613,
7789 25.373827075958253,
7790 26.749426555633544,
7791 31.196198654174804,
7792 32.80575948442732,
7793 37.11309432387352,
7794 ],
7795 },
7796 "NVIDIA A40": {
7797 "digits": 11,
7798 "average_time": [
7799 22.828101253509523,
7800 23.189609861373903,
7801 21.3694882551829,
7802 23.657343721389772,
7803 28.178246479034424,
7804 27.75535701115926,
7805 30.86851720128741,
7806 34.388632106781,
7807 ],
7808 },
7809 "NVIDIA A30": {
7810 "digits": 11,
7811 "average_time": [
7812 22.828101253509523,
7813 23.189609861373903,
7814 21.3694882551829,
7815 23.657343721389772,
7816 28.178246479034424,
7817 27.75535701115926,
7818 30.86851720128741,
7819 34.388632106781,
7820 ],
7821 },
7822 "NVIDIA RTX 2000 Ada Generation": {
7823 "digits": 11,
7824 "average_time": [
7825 22.828101253509523,
7826 23.189609861373903,
7827 21.3694882551829,
7828 23.657343721389772,
7829 28.178246479034424,
7830 27.75535701115926,
7831 30.86851720128741,
7832 34.388632106781,
7833 ],
7834 },
7835 "NVIDIA GeForce RTX 4090 D": {
7836 "digits": 11,
7837 "average_time": [
7838 12.535813426971435,
7839 13.367040371894836,
7840 14.397390270233155,
7841 15.773727321624756,
7842 16.52033654212952,
7843 18.87070236206055,
7844 20.572682762145995,
7845 22.169760519266127,
7846 ],
7847 },
7848 "NVIDIA GeForce RTX 4090": {
7849 "digits": 11,
7850 "average_time": [
7851 11.02204384803772,
7852 11.871551060676575,
7853 12.621799103418986,
7854 13.46524715423584,
7855 14.425264406204224,
7856 12.915648317337036,
7857 16.706109033312117,
7858 17.858580154180526,
7859 ],
7860 },
7861 "NVIDIA H200": {
7862 "digits": 11,
7863 "average_time": [
7864 13.78846188,
7865 13.20821786,
7866 14.69337816,
7867 17.86422935,
7868 19.47975516,
7869 21.64789315,
7870 22.24125861,
7871 25.43047319,
7872 ],
7873 },
7874}
7875
7876LIB_NVIDIA_ML_DIGESTS = {
7877 "535.183.01": "58fc46eefa8ebb265293556951a75a39",
7878 "535.183.06": "03ed7fa2134095b32f9d0d24a774c6ba",
7879 "535.216.01": "96479a06139fc5261d06f432970d6a7b",
7880 "535.216.03": "189634bf960b9a2efe1af8011d27ccf7",
7881 "535.230.02": "cc34ae85c2238b9a49067e683c1998cf",
7882 "545.23.06": "5ad33588e91af67139efb54fe9fefc68",
7883 "545.29.06": "85ad949d7553ab96cce5c811e229c7c7",
7884 "550.120": "48be49d0e792b5ee76f73857c0bef35a",
7885 "550.127.05": "bfa2733eee442016792bcbf130156e3d",
7886 "550.54.15": "9625642dcf8765f52e332c8e38fbef73",
7887 "550.78": "1f335d1f068931fe7f2ce13117d1602b",
7888 "550.90.07": "c95828f8a8ab7f17743b40561b812c96",
7889 "550.90.12": "d7702d394ab213a725abeb345185a072",
7890 "555.42.02": "0262f396e80847dccefc8ccf52cff1ae",
7891 "555.42.06": "69774adffa76471490e6d8fac9067725",
7892 "560.28.03": "6d6e0122cff1ac777a9e37ba09b886cb",
7893 "560.35.03": "93a3f8ef77af86b79314c00b0788aeed",
7894 "560.35.05": "1eec299b50e33a6cfa5155ded53495ab",
7895 "565.57.01": "c801dd3fc4660f3a8ddf977cfdffe113",
7896 "550.127.08": "ac925f2cd192ad971c5466d55945a243",
7897 "550.142": "e68b535a61be6434fc7f12450561a3d0"
7898}
7899
7900DOCKER_DIGESTS = {
7901 "26.1.3": "52d8fcc2c4370bf324cdf17cbc586784",
7902 "27.3.1": "40f1f7724fa0432ea6878692a05b998c",
7903}
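
LIB_NVIDIA_ML_DIGESTS and DOCKER_DIGESTS pair with the MD5 helpers in machine_scrape.py: the validator can cross-check the checksums a miner reports against these allowlists. A minimal sketch, keyed by the field names machine_scrape.py emits:

# Sketch: checking a scraped libnvidia-ml checksum against the allowlist,
# keyed by the driver version reported in the same specs payload.
def is_known_libnvidia_ml(specs: dict) -> bool:
    driver = specs.get("gpu", {}).get("driver")
    checksum = specs.get("md5_checksums", {}).get("libnvidia_ml")
    return checksum is not None and LIB_NVIDIA_ML_DIGESTS.get(driver) == checksum
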
7904
7905
7906
7907---
7908File: /neurons/validators/src/services/docker_service.py
7909---
7910
7911import asyncio
7912import logging
7913import time
7914from typing import Annotated
7915from uuid import uuid4
7916
7917import aiohttp
7918import asyncssh
7919import bittensor
7920from datura.requests.miner_requests import ExecutorSSHInfo
7921from fastapi import Depends
7922from payload_models.payloads import (
7923 ContainerCreatedResult,
7924 ContainerCreateRequest,
7925 ContainerDeleteRequest,
7926 ContainerStartRequest,
7927 ContainerStopRequest,
7928 FailedContainerErrorCodes,
7929 FailedContainerRequest,
7930)
7931from protocol.vc_protocol.compute_requests import RentedMachine
7932
7933from core.utils import _m, get_extra_info
7934from services.redis_service import (
7935 AVAILABLE_PORT_MAPS_PREFIX,
7936 STREAMING_LOG_CHANNEL,
7937 RedisService,
7938)
7939from services.ssh_service import SSHService
7940
7941logger = logging.getLogger(__name__)
7942
7943REPOSITORIES = [
7944 "daturaai/compute-subnet-executor:latest",
7945 "daturaai/compute-subnet-executor-runner:latest",
7946 "containrrr/watchtower:1.7.1",
7947 "daturaai/pytorch",
7948 "daturaai/ubuntu",
7949]
7950
7951LOG_STREAM_INTERVAL = 5 # 5 seconds
7952
7953
7954class DockerService:
7955 def __init__(
7956 self,
7957 ssh_service: Annotated[SSHService, Depends(SSHService)],
7958 redis_service: Annotated[RedisService, Depends(RedisService)],
7959 ):
7960 self.ssh_service = ssh_service
7961 self.redis_service = redis_service
7962 self.lock = asyncio.Lock()
7963 self.logs_queue: list[dict] = []
7964 self.log_task: asyncio.Task | None = None
7965 self.is_realtime_logging = False
7966
7967 async def generate_portMappings(self, miner_hotkey, executor_id, internal_ports=None):
7968 try:
7969 docker_internal_ports = [22, 20000, 20001, 20002, 20003]
7970 if internal_ports:
7971 docker_internal_ports = internal_ports
7972
7973 key = f"{AVAILABLE_PORT_MAPS_PREFIX}:{miner_hotkey}:{executor_id}"
7974 available_port_maps = await self.redis_service.lrange(key)
7975
7976 logger.info(f"available_port_maps: {key}, {available_port_maps}")
7977
7978 mappings = []
7979 for i, docker_port in enumerate(docker_internal_ports):
7980 if i < len(available_port_maps):
7981 internal_port, external_port = map(
7982 int, available_port_maps[i].decode().split(",")
7983 )
7984 mappings.append((docker_port, internal_port, external_port))
7985 else:
7986 break
7987 return mappings
7988 except Exception as e:
7989 logger.error(f"Error generating port mappings: {e}", exc_info=True)
7990 return []
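
generate_portMappings expects every Redis list entry under AVAILABLE_PORT_MAPS_PREFIX:<hotkey>:<executor_id> to be a bytes value of the form b"<internal_port>,<external_port>" (inferred from the .decode().split(",") above; the writer side is not in this excerpt). A hedged seeding sketch, where lpush is an assumed helper on RedisService:

# Hedged sketch of the entry format parsed above; `lpush` and the port values
# are assumptions, since only `lrange` appears in this excerpt.
async def seed_port_maps(redis_service, miner_hotkey: str, executor_id: str):
    key = f"{AVAILABLE_PORT_MAPS_PREFIX}:{miner_hotkey}:{executor_id}"
    for internal, external in [(40000, 40000), (40001, 40001)]:   # hypothetical ports
        await redis_service.lpush(key, f"{internal},{external}".encode())
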
7991
7992 async def execute_and_stream_logs(
7993 self,
7994 ssh_client: asyncssh.SSHClientConnection,
7995 command: str,
7996 log_tag: str,
7997 ):
7998 status = True
7999 error = ''
8000 async with ssh_client.create_process(command) as process:
8001 async for line in process.stdout:
8002 async with self.lock:
8003 self.logs_queue.append(
8004 {
8005 "log_text": line.strip(),
8006 "log_status": "success",
8007 "log_tag": log_tag,
8008 }
8009 )
8010
8011 async for line in process.stderr:
8012 status = False
8013 error += line
8014 async with self.lock:
8015 self.logs_queue.append(
8016 {
8017 "log_text": line.strip(),
8018 "log_status": "error",
8019 "log_tag": log_tag,
8020 }
8021 )
8022
8023 return status, error
8024
8025 async def handle_stream_logs(
8026 self,
8027 miner_hotkey,
8028 executor_id,
8029 ):
8030 default_extra = {
8031 "miner_hotkey": miner_hotkey,
8032 "executor_uuid": executor_id,
8033 }
8034
8035 self.is_realtime_logging = True
8036
8037 while True:
8038 await asyncio.sleep(LOG_STREAM_INTERVAL)
8039
8040 async with self.lock:
8041 logs_to_process = self.logs_queue[:]
8042 self.logs_queue.clear()
8043
8044 if logs_to_process:
8045 try:
8046 await self.redis_service.publish(
8047 STREAMING_LOG_CHANNEL,
8048 {
8049 "logs": logs_to_process,
8050 "miner_hotkey": miner_hotkey,
8051 "executor_uuid": executor_id,
8052 },
8053 )
8054
8055 logger.info(
8056 _m(
8057 f"Successfully published {len(logs_to_process)} logs",
8058 extra=get_extra_info(default_extra),
8059 )
8060 )
8061
8062 except Exception as e:
8063 logger.error(
8064 _m(
8065 "Error publishing log stream",
8066 extra=get_extra_info({**default_extra, "error": str(e)}),
8067 ),
8068 exc_info=True,
8069 )
8070
8071 if not self.is_realtime_logging:
8072 break
8073
8074 logger.info(
8075 _m(
8076 "Exit handle_stream_logs",
8077 extra=get_extra_info(default_extra),
8078 )
8079 )
8080
8081 async def finish_stream_logs(self):
8082 self.is_realtime_logging = False
8083 if self.log_task:
8084 await self.log_task
8085
8086 async def check_container_running(
8087 self, ssh_client: asyncssh.SSHClientConnection, container_name: str, timeout: int = 10
8088 ):
8089 """Check if the container is running"""
8090 start_time = time.time()
8091 while time.time() - start_time < timeout:
8092 result = await ssh_client.run(f"docker ps -q -f name={container_name}")
8093 if result.stdout.strip():
8094 return True
8095 await asyncio.sleep(1)
8096 return False
8097
8098 async def clean_existing_containers(
8099 self,
8100 ssh_client: asyncssh.SSHClientConnection,
8101 default_extra: dict,
8102 ):
8103 command = 'docker ps -a --filter "name=^/container_" --format "{{.ID}}"'
8104 result = await ssh_client.run(command)
8105 if result.stdout.strip():
8106 ids = " ".join(result.stdout.strip().split("\n"))
8107
8108 logger.info(
8109 _m(
8110 "Cleaning existing docker containers",
8111 extra=get_extra_info({
8112 **default_extra,
8113 "command": command,
8114 "ids": ids,
8115 }),
8116 ),
8117 )
8118
8119 command = f'docker rm {ids} -f'
8120 await ssh_client.run(command)
8121
8122 command = 'docker volume prune -af'
8123 await ssh_client.run(command)
8124
8125 async def clear_verified_job_count(self, executor_info: ExecutorSSHInfo):
8126 await self.redis_service.set_verified_job_count(executor_info.uuid, 0)
8127
8128 async def create_container(
8129 self,
8130 payload: ContainerCreateRequest,
8131 executor_info: ExecutorSSHInfo,
8132 keypair: bittensor.Keypair,
8133 private_key: str,
8134 ):
8135 default_extra = {
8136 "miner_hotkey": payload.miner_hotkey,
8137 "executor_uuid": payload.executor_id,
8138 "executor_ip_address": executor_info.address,
8139 "executor_port": executor_info.port,
8140 "executor_ssh_username": executor_info.ssh_username,
8141 "executor_ssh_port": executor_info.ssh_port,
8142 "docker_image": payload.docker_image,
8143 "debug": payload.debug,
8144 }
8145
8146 logger.info(
8147 _m(
8148 "Create Docker Container",
8149 extra=get_extra_info({**default_extra, "payload": str(payload)}),
8150 ),
8151 )
8152
8153 log_tag = "container_creation"
8154 custom_options = payload.custom_options
8155
8156 try:
8157 # generate port maps
8158 if custom_options and custom_options.internal_ports:
8159 port_maps = await self.generate_portMappings(
8160 payload.miner_hotkey, payload.executor_id, custom_options.internal_ports
8161 )
8162 else:
8163 port_maps = await self.generate_portMappings(
8164 payload.miner_hotkey, payload.executor_id
8165 )
8166
8167 if not port_maps:
8168 log_text = "No port mappings found"
8169 logger.error(log_text)
8170
8171 await self.clear_verified_job_count(executor_info)
8172
8173 return FailedContainerRequest(
8174 miner_hotkey=payload.miner_hotkey,
8175 executor_id=payload.executor_id,
8176 msg=str(log_text),
8177 error_code=FailedContainerErrorCodes.NoPortMappings,
8178 )
8179
8180 private_key = self.ssh_service.decrypt_payload(keypair.ss58_address, private_key)
8181 pkey = asyncssh.import_private_key(private_key)
8182
8183 async with asyncssh.connect(
8184 host=executor_info.address,
8185 port=executor_info.ssh_port,
8186 username=executor_info.ssh_username,
8187 client_keys=[pkey],
8188 known_hosts=None,
8189 ) as ssh_client:
8190 await self.clean_existing_containers(ssh_client=ssh_client, default_extra=default_extra)
8191
8192 logger.info(
8193 _m(
8194 "Pulling docker image",
8195 extra=get_extra_info({
8196 **default_extra,
8197 "docker_image": payload.docker_image
8198 }),
8199 ),
8200 )
8201
8202 # set real-time logging
8203 self.log_task = asyncio.create_task(
8204 self.handle_stream_logs(
8205 miner_hotkey=payload.miner_hotkey,
8206 executor_id=payload.executor_id,
8207 )
8208 )
8209
8210 async with self.lock:
8211 self.logs_queue.append(
8212 {
8213 "log_text": f"Pulling docker image {payload.docker_image}",
8214 "log_status": "success",
8215 "log_tag": log_tag,
8216 }
8217 )
8218
8219 command = f"docker pull {payload.docker_image}"
8220 status, error = await self.execute_and_stream_logs(
8221 ssh_client=ssh_client,
8222 command=command,
8223 log_tag=log_tag,
8224 )
8225 if not status:
8226 log_text = _m(
8227 "Docker pull failed",
8228 extra=get_extra_info({
8229 **default_extra,
8230 "error": error,
8231 }),
8232 )
8233 logger.error(log_text)
8234
8235 await self.finish_stream_logs()
8236 await self.clear_verified_job_count(executor_info)
8237
8238 return FailedContainerRequest(
8239 miner_hotkey=payload.miner_hotkey,
8240 executor_id=payload.executor_id,
8241 msg=str(log_text),
8242 error_code=FailedContainerErrorCodes.UnknownError,
8243 )
8244
8245 port_flags = " ".join(
8246 [
8247 f"-p {internal_port}:{docker_port}"
8248 for docker_port, internal_port, _ in port_maps
8249 ]
8250 )
8251
8252 # Prepare extra options
8253 sanitized_volumes = [
8254 volume for volume
8255 in (custom_options.volumes if custom_options and custom_options.volumes else [])
8256 if volume.strip()
8257 ]
8258 volume_flags = (
8259 " ".join([f"-v {volume}" for volume in sanitized_volumes])
8260 if sanitized_volumes
8261 else ""
8262 )
8263 entrypoint_flag = (
8264 f"--entrypoint {custom_options.entrypoint}"
8265 if custom_options
8266 and custom_options.entrypoint
8267 and custom_options.entrypoint.strip()
8268 else ""
8269 )
8270 env_flags = (
8271 " ".join(
8272 [
8273 f"-e {key}={value}"
8274 for key, value in custom_options.environment.items()
8275 if key and value and key.strip() and value.strip()
8276 ]
8277 )
8278 if custom_options and custom_options.environment
8279 else ""
8280 )
8281 startup_commands = (
8282 f"{custom_options.startup_commands}"
8283 if custom_options
8284 and custom_options.startup_commands
8285 and custom_options.startup_commands.strip()
8286 else ""
8287 )
8288
8289 uuid = uuid4()
8290
8291 # create docker volume
8292 async with self.lock:
8293 self.logs_queue.append(
8294 {
8295 "log_text": "Creating docker volume",
8296 "log_status": "success",
8297 "log_tag": log_tag,
8298 }
8299 )
8300
8301 volume_name = f"volume_{uuid}"
8302 command = f"docker volume create {volume_name}"
8303 status, error = await self.execute_and_stream_logs(
8304 ssh_client=ssh_client, command=command, log_tag="container_creation"
8305 )
8306 if not status:
8307 log_text = _m(
8308 "Docker volume creation failed",
8309 extra=get_extra_info({
8310 **default_extra,
8311 "error": error
8312 }),
8313 )
8314 logger.error(log_text)
8315
8316 await self.finish_stream_logs()
8317 await self.clear_verified_job_count(executor_info)
8318
8319 return FailedContainerRequest(
8320 miner_hotkey=payload.miner_hotkey,
8321 executor_id=payload.executor_id,
8322 msg=str(log_text),
8323 error_code=FailedContainerErrorCodes.UnknownError,
8324 )
8325
8326 logger.info(
8327 _m(
8328 "Created Docker Volume",
8329 extra=get_extra_info({**default_extra, "volume_name": volume_name}),
8330 ),
8331 )
8332
8333 # create docker container with the port map & resource
8334 async with self.lock:
8335 self.logs_queue.append(
8336 {
8337 "log_text": "Creating docker container",
8338 "log_status": "success",
8339 "log_tag": log_tag,
8340 }
8341 )
8342
8343 container_name = f"container_{uuid}"
8344
8345 if payload.debug:
8346 command = f'docker run -d {port_flags} -v "/var/run/docker.sock:/var/run/docker.sock" {volume_flags} {entrypoint_flag} -e PUBLIC_KEY="{payload.user_public_key}" {env_flags} --mount source={volume_name},target=/root --name {container_name} {payload.docker_image} {startup_commands}'
8347 else:
8348 command = f'docker run -d {port_flags} {volume_flags} {entrypoint_flag} -e PUBLIC_KEY="{payload.user_public_key}" {env_flags} --mount source={volume_name},target=/root --gpus all --name {container_name} {payload.docker_image} {startup_commands}'
8349
8350 logger.info(
8351 _m(
8352 "Creating docker container",
8353 extra=get_extra_info({
8354 **default_extra,
8355 "command": command,
8356 }),
8357 ),
8358 )
8359
8360 status, error = await self.execute_and_stream_logs(
8361 ssh_client=ssh_client, command=command, log_tag="container_creation"
8362 )
8363 if not status:
8364 log_text = _m(
8365 "Docker container creation failed",
8366 extra=get_extra_info({
8367 **default_extra,
8368 "command": command,
8369 "error": error,
8370 }),
8371 )
8372 logger.error(log_text)
8373
8374 await self.finish_stream_logs()
8375 await self.clear_verified_job_count(executor_info)
8376
8377 return FailedContainerRequest(
8378 miner_hotkey=payload.miner_hotkey,
8379 executor_id=payload.executor_id,
8380 msg=str(log_text),
8381 error_code=FailedContainerErrorCodes.UnknownError,
8382 )
8383
8384 # check if the container is running correctly
8385 if not await self.check_container_running(ssh_client, container_name):
8386 log_text = _m(
8387 "Run docker run command but container is not running",
8388 extra=get_extra_info({
8389 **default_extra,
8390 "container_name": container_name,
8391 }),
8392 )
8393 logger.error(log_text)
8394
8395 await self.finish_stream_logs()
8396 await self.clear_verified_job_count(executor_info)
8397
8398 return FailedContainerRequest(
8399 miner_hotkey=payload.miner_hotkey,
8400 executor_id=payload.executor_id,
8401 msg=str(log_text),
8402 error_code=FailedContainerErrorCodes.ContainerNotRunning,
8403 )
8404
8405 logger.info(
8406 _m(
8407 "Created Docker Container",
8408 extra=get_extra_info({**default_extra, "container_name": container_name}),
8409 ),
8410 )
8411
8412 await self.finish_stream_logs()
8413
8414 await self.redis_service.add_rented_machine(
8415 RentedMachine(
8416 miner_hotkey=payload.miner_hotkey,
8417 executor_id=payload.executor_id,
8418 executor_ip_address=executor_info.address,
8419 executor_ip_port=str(executor_info.port),
8420 )
8421 )
8422
8423 return ContainerCreatedResult(
8424 container_name=container_name,
8425 volume_name=volume_name,
8426 port_maps=[
8427 (docker_port, external_port) for docker_port, _, external_port in port_maps
8428 ],
8429 )
8430 except Exception as e:
8431 log_text = _m(
8432 "Unknown Error create_container",
8433 extra=get_extra_info({**default_extra, "error": str(e)}),
8434 )
8435 logger.error(log_text, exc_info=True)
8436
8437 await self.finish_stream_logs()
8438 await self.clear_verified_job_count(executor_info)
8439
8440 return FailedContainerRequest(
8441 miner_hotkey=payload.miner_hotkey,
8442 executor_id=payload.executor_id,
8443 msg=str(log_text),
8444 error_code=FailedContainerErrorCodes.UnknownError,
8445 )
8446
8447 async def stop_container(
8448 self,
8449 payload: ContainerStopRequest,
8450 executor_info: ExecutorSSHInfo,
8451 keypair: bittensor.Keypair,
8452 private_key: str,
8453 ):
8454 default_extra = {
8455 "miner_hotkey": payload.miner_hotkey,
8456 "executor_uuid": payload.executor_id,
8457 "executor_ip_address": executor_info.address,
8458 "executor_port": executor_info.port,
8459 "executor_ssh_username": executor_info.ssh_username,
8460 "executor_ssh_port": executor_info.ssh_port,
8461 }
8462
8463 logger.info(
8464 _m(
8465 "Stop Docker Container", extra=get_extra_info({**default_extra, "payload": str(payload)})
8466 ),
8467 )
8468
8469 private_key = self.ssh_service.decrypt_payload(keypair.ss58_address, private_key)
8470 pkey = asyncssh.import_private_key(private_key)
8471
8472 async with asyncssh.connect(
8473 host=executor_info.address,
8474 port=executor_info.ssh_port,
8475 username=executor_info.ssh_username,
8476 client_keys=[pkey],
8477 known_hosts=None,
8478 ) as ssh_client:
8479 await ssh_client.run(f"docker stop {payload.container_name}")
8480
8481 logger.info(
8482 _m(
8483 "Stopped Docker Container",
8484 extra=get_extra_info(
8485 {**default_extra, "container_name": payload.container_name}
8486 ),
8487 ),
8488 )
8489
8490 async def start_container(
8491 self,
8492 payload: ContainerStartRequest,
8493 executor_info: ExecutorSSHInfo,
8494 keypair: bittensor.Keypair,
8495 private_key: str,
8496 ):
8497 default_extra = {
8498 "miner_hotkey": payload.miner_hotkey,
8499 "executor_uuid": payload.executor_id,
8500 "executor_ip_address": executor_info.address,
8501 "executor_port": executor_info.port,
8502 "executor_ssh_username": executor_info.ssh_username,
8503 "executor_ssh_port": executor_info.ssh_port,
8504 }
8505
8506 logger.info(
8507 _m(
8508 "Restart Docker Container",
8509 extra=get_extra_info({**default_extra, "payload": str(payload)}),
8510 ),
8511 )
8512
8513 private_key = self.ssh_service.decrypt_payload(keypair.ss58_address, private_key)
8514 pkey = asyncssh.import_private_key(private_key)
8515
8516 async with asyncssh.connect(
8517 host=executor_info.address,
8518 port=executor_info.ssh_port,
8519 username=executor_info.ssh_username,
8520 client_keys=[pkey],
8521 known_hosts=None,
8522 ) as ssh_client:
8523 await ssh_client.run(f"docker start {payload.container_name}")
8524 logger.info(
8525 _m(
8526 "Started Docker Container",
8527 extra=get_extra_info(
8528 {**default_extra, "container_name": payload.container_name}
8529 ),
8530 ),
8531 )
8532
8533 async def delete_container(
8534 self,
8535 payload: ContainerDeleteRequest,
8536 executor_info: ExecutorSSHInfo,
8537 keypair: bittensor.Keypair,
8538 private_key: str,
8539 ):
8540 default_extra = {
8541 "miner_hotkey": payload.miner_hotkey,
8542 "executor_uuid": payload.executor_id,
8543 "executor_ip_address": executor_info.address,
8544 "executor_port": executor_info.port,
8545 "executor_ssh_username": executor_info.ssh_username,
8546 "executor_ssh_port": executor_info.ssh_port,
8547 }
8548
8549 logger.info(
8550 _m(
8551 "Delete Docker Container",
8552 extra=get_extra_info({**default_extra, "payload": str(payload)}),
8553 ),
8554 )
8555
8556 private_key = self.ssh_service.decrypt_payload(keypair.ss58_address, private_key)
8557 pkey = asyncssh.import_private_key(private_key)
8558
8559 async with asyncssh.connect(
8560 host=executor_info.address,
8561 port=executor_info.ssh_port,
8562 username=executor_info.ssh_username,
8563 client_keys=[pkey],
8564 known_hosts=None,
8565 ) as ssh_client:
8566 # await ssh_client.run(f"docker stop {payload.container_name}")
8567 await ssh_client.run(f"docker rm {payload.container_name} -f")
8568 await ssh_client.run(f"docker volume rm {payload.volume_name} -f")
8569
8570 logger.info(
8571 _m(
8572 "Deleted Docker Container",
8573 extra=get_extra_info(
8574 {
8575 **default_extra,
8576 "container_name": payload.container_name,
8577 "volume_name": payload.volume_name,
8578 }
8579 ),
8580 ),
8581 )
8582
8583 await self.redis_service.remove_rented_machine(
8584 RentedMachine(
8585 miner_hotkey=payload.miner_hotkey,
8586 executor_id=payload.executor_id,
8587 executor_ip_address=executor_info.address,
8588 executor_ip_port=str(executor_info.port),
8589 )
8590 )
8591
8592 async def get_docker_hub_digests(self, repositories) -> dict[str, str]:
8593 """Retrieve all tags and their corresponding digests from Docker Hub."""
8594 all_digests = {} # Initialize a dictionary to store all tag-digest pairs
8595
8596 async with aiohttp.ClientSession() as session:
8597 for repo in repositories:
8598 try:
8599 # Split repository and tag if specified
8600 if ":" in repo:
8601 repository, specified_tag = repo.split(":", 1)
8602 else:
8603 repository, specified_tag = repo, None
8604
8605 # Get authorization token
8606 async with session.get(
8607 f"https://auth.docker.io/token?service=registry.docker.io&scope=repository:{repository}:pull"
8608 ) as token_response:
8609 token_response.raise_for_status()
8610 token = await token_response.json()
8611 token = token.get("token")
8612
8613 # Find all tags if no specific tag is specified
8614 if specified_tag is None:
8615 async with session.get(
8616 f"https://index.docker.io/v2/{repository}/tags/list",
8617 headers={"Authorization": f"Bearer {token}"},
8618 ) as tags_response:
8619 tags_response.raise_for_status()
8620 tags_data = await tags_response.json()
8621 all_tags = tags_data.get("tags", [])
8622 else:
8623 all_tags = [specified_tag]
8624
8625 # Dictionary to store tag-digest pairs for the current repository
8626 tag_digests = {}
8627 for tag in all_tags:
8628 # Get image digest
8629 async with session.head(
8630 f"https://index.docker.io/v2/{repository}/manifests/{tag}",
8631 headers={
8632 "Authorization": f"Bearer {token}",
8633 "Accept": "application/vnd.docker.distribution.manifest.v2+json",
8634 },
8635 ) as manifest_response:
8636 manifest_response.raise_for_status()
8637 digest = manifest_response.headers.get("Docker-Content-Digest")
8638 tag_digests[f"{repository}:{tag}"] = digest
8639
8640 # Update the all_digests dictionary with the current repository's tag-digest pairs
8641 all_digests.update(tag_digests)
8642
8643 except aiohttp.ClientError as e:
8644 print(f"Error retrieving data for {repo}: {e}")
8645
8646 return all_digests
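
A short usage sketch resolving the digests for the REPOSITORIES list pinned at the top of this file:

# Usage sketch for get_docker_hub_digests() with the pinned repositories.
async def print_repo_digests(service: DockerService) -> None:
    digests = await service.get_docker_hub_digests(REPOSITORIES)
    for image, digest in digests.items():
        print(image, digest)
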
8647
8648 async def setup_ssh_access(
8649 self,
8650 ssh_client: asyncssh.SSHClientConnection,
8651 container_name: str,
8652 ip_address: str,
8653 username: str = "root",
8654 port_maps: list[tuple[int, int]] = None,
8655 ) -> tuple[bool, str, str]:
8656 """Generate an SSH key pair, add the public key to the Docker container, and check SSH connection."""
8657
8658 my_key = "my_key"
8659 private_key, public_key = self.ssh_service.generate_ssh_key(my_key)
8660
8661 public_key = public_key.decode("utf-8")
8662 private_key = private_key.decode("utf-8")
8663
8664 private_key = self.ssh_service.decrypt_payload(my_key, private_key)
8665 pkey = asyncssh.import_private_key(private_key)
8666
8667 await asyncio.sleep(5)
8668
8669 command = f"docker exec {container_name} sh -c 'echo \"{public_key}\" >> /root/.ssh/authorized_keys'"
8670
8671 result = await ssh_client.run(command)
8672 if result.exit_status != 0:
8673 log_text = "Error creating docker connection"
8674 log_status = "error"
8675 logger.error(log_text)
8676
8677 return False, log_text, log_status
8678
8679 port = 0
8680 for internal, external in port_maps:
8681 if internal == 22:
8682 port = external
8683 # Check SSH connection
8684 try:
8685 async with asyncssh.connect(
8686 host=ip_address,
8687 port=port,
8688 username=username,
8689 client_keys=[pkey],
8690 known_hosts=None,
8691 ):
8692 log_status = "info"
8693 log_text = "SSH connection successful!"
8694 logger.info(
8695 _m(
8696 log_text,
8697 extra={
8698 "container_name": container_name,
8699 "ip_address": ip_address,
8700 "port_maps": port_maps,
8701 },
8702 )
8703 )
8704 return True, log_text, log_status
8705 except Exception as e:
8706 log_text = "SSH connection failed"
8707 log_status = "error"
8708 logger.error(
8709 _m(
8710 log_text,
8711 extra={
8712 "container_name": container_name,
8713 "ip_address": ip_address,
8714 "port_maps": port_maps,
8715 "error": str(e),
8716 },
8717 )
8718 )
8719 return False, log_text, log_status
8720
8721
8722
8723---
8724File: /neurons/validators/src/services/file_encrypt_service.py
8725---
8726
8727import os
8728import random
8729import subprocess
8730from typing import Annotated
8731from pathlib import Path
8732import tempfile
8733import shutil
8734import PyInstaller.__main__
8735from fastapi import Depends
8736
8737from services.ssh_service import SSHService
8738
8739from payload_models.payloads import MinerJobEnryptedFiles
8740
8741
8742class FileEncryptService:
8743 def __init__(
8744 self,
8745 ssh_service: Annotated[SSHService, Depends(SSHService)],
8746 ):
8747 self.ssh_service = ssh_service
8748
8749 def make_obfuscated_file(self, tmp_directory: str, file_path: str):
8750 subprocess.run(
8751 ['pyarmor', 'gen', '-O', tmp_directory, file_path],
8752 stdout=subprocess.PIPE,
8753 stderr=subprocess.PIPE,
8754 )
8755 return os.path.basename(file_path)
8756
8757 def make_binary_file(self, tmp_directory: str, file_path: str):
8758 file_name = os.path.basename(file_path)
8759
8760 PyInstaller.__main__.run([
8761 file_path,
8762 '--onefile',
8763 '--noconsole',
8764 '--log-level=ERROR',
8765 '--distpath', tmp_directory,
8766 '--name', file_name,
8767 ])
8768
8769 subprocess.run(['rm', '-rf', 'build', f'{file_name}.spec'])
8770
8771 return file_name
8772
8773 def make_binary_file_with_nuitka(self, tmp_directory: str, file_path: str):
8774 file_name = os.path.basename(file_path)
8775
8776 subprocess.run([
8777 'nuitka', '--standalone', '--onefile',
8778 f'--output-dir={tmp_directory}',
8779 '--remove-output', '--quiet', '--no-progress',
8780 f'--output-filename={file_name}',
8781 file_path
8782 ])
8783
8784 return file_name
8785
8786 def ecrypt_miner_job_files(self):
8787 tmp_directory = Path(__file__).parent / "temp"
8788 if tmp_directory.exists() and tmp_directory.is_dir():
8789 shutil.rmtree(tmp_directory)
8790
8791 string_count = random.randint(10, 100)
8792 encrypt_key = self.ssh_service.generate_random_string(string_count)
8793
8794 machine_scrape_file_path = str(
8795 Path(__file__).parent / ".." / "miner_jobs/machine_scrape.py"
8796 )
8797 with open(machine_scrape_file_path, 'r') as file:
8798 content = file.read()
8799 modified_content = content.replace('encrypt_key', encrypt_key)
8800
8801 with tempfile.NamedTemporaryFile(delete=True) as machine_scrape_file:
8802 machine_scrape_file.write(modified_content.encode('utf-8'))
8803 machine_scrape_file.flush()
8804 os.fsync(machine_scrape_file.fileno())
8805 machine_scrape_file_name = self.make_binary_file_with_nuitka(str(tmp_directory), machine_scrape_file.name)
8806
8807 # generate score_script file
8808 score_script_file_path = str(Path(__file__).parent / ".." / "miner_jobs/score.py")
8809 with open(score_script_file_path, 'r') as file:
8810 content = file.read()
8811 modified_content = content.replace('encrypt_key', encrypt_key)
8812
8813 with tempfile.NamedTemporaryFile(delete=True, suffix='.py') as score_file:
8814 score_file.write(modified_content.encode('utf-8'))
8815 score_file.flush()
8816 os.fsync(score_file.fileno())
8817 score_file_name = self.make_obfuscated_file(str(tmp_directory), score_file.name)
8818
8819 return MinerJobEnryptedFiles(
8820 encrypt_key=encrypt_key,
8821 tmp_directory=str(tmp_directory),
8822 machine_scrape_file_name=machine_scrape_file_name,
8823 score_file_name=score_file_name,
8824 )
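
The literal 'encrypt_key' that appears in machine_scrape.py and score.py (for example, key = 'encrypt_key' at the bottom of machine_scrape.py) is a placeholder: the method above substitutes a per-run random string before the scripts are compiled with Nuitka or obfuscated with pyarmor. A minimal sketch of that substitution:

# Minimal sketch of the placeholder substitution performed above.
source = "key = 'encrypt_key'\n"      # as written in the miner job scripts
per_run_secret = "hZ3k9..."           # hypothetical generate_random_string() output
compiled_input = source.replace("encrypt_key", per_run_secret)
assert compiled_input == "key = 'hZ3k9...'\n"
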
8825
8826
8827
8828---
8829File: /neurons/validators/src/services/hash_service.py
8830---
8831
8832import enum
8833import hashlib
8834import string
8835import random
8836import secrets
8837from dataclasses import dataclass
8838from base64 import b64encode
8839import json
8840from typing import Self
8841import subprocess
8842
8843
8844class Algorithm(enum.Enum):
8845 SHA256 = "SHA256"
8846 SHA384 = "SHA384"
8847 SHA512 = "SHA512"
8848
8849 @property
8850 def params(self):
8851 return {
8852 Algorithm.SHA256: {
8853 "hash_function": hashlib.sha256,
8854 "hash_type": "1410",
8855 },
8856 Algorithm.SHA384: {
8857 "hash_function": hashlib.sha384,
8858 "hash_type": "10810",
8859 },
8860 Algorithm.SHA512: {
8861 "hash_function": hashlib.sha512,
8862 "hash_type": "1710",
8863 },
8864 }
8865
8866 def hash(self, *args, **kwargs):
8867 return self.params[self]["hash_function"](*args, **kwargs)
8868
8869 @property
8870 def type(self):
8871 return self.params[self]["hash_type"]
8872
8873
8874@dataclass
8875class JobParam:
8876 algorithm: Algorithm
8877 num_letters: int
8878 num_digits: int
8879 num_hashes: int
8880
8881 @classmethod
8882 def generate(
8883 cls,
8884 num_letters: int,
8885 num_digits: int,
8886 num_hashes: int,
8887 ) -> Self:
8888 algorithm = random.choice(list(Algorithm))
8889
8890 return cls(
8891 algorithm=algorithm,
8892 num_letters=num_letters,
8893 num_digits=num_digits,
8894 num_hashes=num_hashes,
8895 )
8896
8897 @property
8898 def password_length(self) -> int:
8899 return self.num_letters + self.num_digits
8900
8901 def __str__(self) -> str:
8902 return (
8903 f"algorithm={self.algorithm} "
8904 f"algorithm_type={self.algorithm.type} "
8905 f"num_letters={self.num_letters} "
8906 f"num_digits={self.num_digits} "
8907 f"num_hashes={self.num_hashes}"
8908 )
8909
8910
8911@dataclass
8912class HashcatJob:
8913 passwords: list[list[str]]
8914 salts: list[bytes]
8915 job_params: list[JobParam]
8916
8917
8918@dataclass
8919class HashService:
8920 gpu_count: int
8921 num_job_params: int
8922 jobs: list[HashcatJob]
8923 timeout: int
8924
8925 @classmethod
8926    def random_string(cls, num_letters: int, num_digits: int) -> str:
8927 return ''.join(random.choices(string.ascii_letters, k=num_letters)) + ''.join(random.choices(string.digits, k=num_digits))
8928
8929 @classmethod
8930 def generate(
8931 cls,
8932 gpu_count: int = 1,
8933 timeout: int = 60,
8934 num_job_params: int = 1,
8935 num_letters: int = 0,
8936 num_digits: int = 11,
8937 num_hashes: int = 10,
8938 salt_length_bytes: int = 8
8939 ) -> Self:
8940 jobs = []
8941 for _ in range(gpu_count):
8942 job_params = [
8943 JobParam.generate(
8944 num_letters=num_letters,
8945 num_digits=num_digits,
8946 num_hashes=num_hashes,
8947 )
8948 for _ in range(num_job_params)
8949 ]
8950
8951 passwords = [
8952 sorted(
8953 {
8954 cls.random_string(
8955 num_letters=_params.num_letters, num_digits=_params.num_digits
8956 )
8957 for _ in range(_params.num_hashes)
8958 }
8959 )
8960 for _params in job_params
8961 ]
8962
8963 salts = [secrets.token_bytes(salt_length_bytes) for _ in range(num_job_params)]
8964
8965 jobs.append(HashcatJob(
8966 job_params=job_params,
8967 passwords=passwords,
8968 salts=salts,
8969 ))
8970
8971 return cls(
8972 gpu_count=gpu_count,
8973 num_job_params=num_job_params,
8974 jobs=jobs,
8975 timeout=timeout,
8976 )
8977
8978 def hash_masks(self, job: HashcatJob) -> list[str]:
8979 return ["?1" * param.num_letters + "?d" * param.num_digits for param in job.job_params]
8980
8981    def hash_hexes(self, algorithm: Algorithm, passwords: list[str], salt: bytes) -> list[str]:
8982 return [
8983 algorithm.hash(password.encode("ascii") + salt).hexdigest()
8984 for password in passwords
8985 ]
8986
8987 def _hash(self, s: bytes) -> bytes:
8988 return b64encode(hashlib.sha256(s).digest(), altchars=b"-_")
8989
8990 # def _payload(self, i) -> str:
8991 # return "\n".join([f"{hash_hex}:{self.salts[i].hex()}" for hash_hex in self.hash_hexes(i)])
8992
8993 def _payloads(self, job: HashcatJob) -> list[str]:
8994 payloads = [
8995 "\n".join([
8996 f"{hash_hex}:{job.salts[i].hex()}"
8997 for hash_hex
8998 in self.hash_hexes(job.job_params[i].algorithm, job.passwords[i], job.salts[i])
8999 ])
9000 for i in range(self.num_job_params)
9001 ]
9002 return payloads
9003
9004 @property
9005 def payload(self) -> str | bytes:
9006 """Convert this instance to a hashcat argument format."""
9007
9008 data = {
9009 "gpu_count": self.gpu_count,
9010 "num_job_params": self.num_job_params,
9011 "jobs": [
9012 {
9013 "payloads": self._payloads(job),
9014 "masks": self.hash_masks(job),
9015 "algorithms": [param.algorithm.type for param in job.job_params],
9016 }
9017 for job in self.jobs
9018 ],
9019 "timeout": self.timeout,
9020 }
9021 return json.dumps(data)
9022
9023 @property
9024 def answer(self) -> str:
9025 return self._hash(
9026 "".join(["".join(["".join(passwords) for passwords in job.passwords]) for job in self.jobs]).encode("utf-8")
9027 ).decode("utf-8")
9028
9029 def __str__(self) -> str:
9030 return f"JobService {self.jobs}"
9031
9032
9033if __name__ == "__main__":
9034 import time
9035
9036 hash_service = HashService.generate(gpu_count=1, timeout=50)
9037 # print(hash_service.payload)
9038 print('answer ====>', hash_service.answer)
9039
9040 start_time = time.time()
9041
9042 cmd = f"python src/miner_jobs/score.py '{hash_service.payload}'"
9043 result = subprocess.check_output(cmd, shell=True, text=True, stderr=subprocess.DEVNULL)
9044 end_time = time.time()
9045 print('result ===>', result)
9046 print(end_time - start_time)
9047
9048
9049
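# For reference, a minimal sketch showing how the expected answer can be recomputed
# independently of HashService: every password (jobs -> job params -> password lists,
# in order) is concatenated, SHA-256 hashed and URL-safe base64 encoded, mirroring
# HashService._hash and HashService.answer. The usage lines at the end are an
# illustrative self-check, not part of the project's test suite.
import hashlib
from base64 import b64encode


def expected_answer(passwords_per_job: list[list[list[str]]]) -> str:
    joined = "".join(
        "".join("".join(password_list) for password_list in job_passwords)
        for job_passwords in passwords_per_job
    )
    digest = hashlib.sha256(joined.encode("utf-8")).digest()
    return b64encode(digest, altchars=b"-_").decode("utf-8")


# Illustrative usage:
#   hs = HashService.generate(gpu_count=1)
#   assert expected_answer([job.passwords for job in hs.jobs]) == hs.answer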
9050---
9051File: /neurons/validators/src/services/ioc.py
9052---
9053
9054import asyncio
9055
9056from services.docker_service import DockerService
9057from services.miner_service import MinerService
9058from services.ssh_service import SSHService
9059from services.task_service import TaskService
9060from services.redis_service import RedisService
9061from services.file_encrypt_service import FileEncryptService
9062
9063ioc = {}
9064
9065
9066async def initiate_services():
9067 ioc["SSHService"] = SSHService()
9068 ioc["RedisService"] = RedisService()
9069 ioc["TaskService"] = TaskService(
9070 ssh_service=ioc["SSHService"],
9071 redis_service=ioc["RedisService"]
9072 )
9073 ioc["DockerService"] = DockerService(
9074 ssh_service=ioc["SSHService"],
9075 redis_service=ioc["RedisService"]
9076 )
9077 ioc["MinerService"] = MinerService(
9078 ssh_service=ioc["SSHService"],
9079 task_service=ioc["TaskService"],
9080 redis_service=ioc["RedisService"]
9081 )
9082 ioc["FileEncryptService"] = FileEncryptService(
9083 ssh_service=ioc["SSHService"],
9084 )
9085
9086
9087def sync_initiate():
9088 loop = asyncio.get_event_loop()
9089 loop.run_until_complete(initiate_services())
9090
9091
9092sync_initiate()
9093
9094
9095
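# A minimal alternative bootstrap sketch: asyncio.get_event_loop() is deprecated in
# recent Python versions when no event loop is running, so an explicit loop can be
# created instead. sync_initiate_alt is an illustrative name, not part of the module.
def sync_initiate_alt():
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    loop.run_until_complete(initiate_services())


# After bootstrap, consumers resolve services by name, e.g.:
#   docker_service = ioc["DockerService"]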
9096---
9097File: /neurons/validators/src/services/miner_service.py
9098---
9099
9100import asyncio
9101import json
9102import logging
9103from typing import Annotated
9104
9105import bittensor
9106from clients.miner_client import MinerClient
9107from datura.requests.miner_requests import (
9108 AcceptSSHKeyRequest,
9109 DeclineJobRequest,
9110 ExecutorSSHInfo,
9111 FailedRequest,
9112)
9113from datura.requests.validator_requests import SSHPubKeyRemoveRequest, SSHPubKeySubmitRequest
9114from fastapi import Depends
9115from payload_models.payloads import (
9116 ContainerBaseRequest,
9117 ContainerCreated,
9118 ContainerCreateRequest,
9119 ContainerDeleted,
9120 ContainerDeleteRequest,
9121 ContainerStarted,
9122 ContainerStartRequest,
9123 ContainerStopped,
9124 ContainerStopRequest,
9125 FailedContainerErrorCodes,
9126 FailedContainerRequest,
9127 MinerJobEnryptedFiles,
9128 MinerJobRequestPayload,
9129)
9130from protocol.vc_protocol.compute_requests import RentedMachine
9131
9132from core.config import settings
9133from core.utils import _m, get_extra_info
9134from services.docker_service import DockerService
9135from services.redis_service import EXECUTOR_COUNT_PREFIX, MACHINE_SPEC_CHANNEL_NAME, RedisService
9136from services.ssh_service import SSHService
9137from services.task_service import TaskService
9138
9139logger = logging.getLogger(__name__)
9140
9141
9142JOB_LENGTH = 300
9143
9144
9145class MinerService:
9146 def __init__(
9147 self,
9148 ssh_service: Annotated[SSHService, Depends(SSHService)],
9149 task_service: Annotated[TaskService, Depends(TaskService)],
9150 redis_service: Annotated[RedisService, Depends(RedisService)],
9151 ):
9152 self.ssh_service = ssh_service
9153 self.task_service = task_service
9154 self.redis_service = redis_service
9155
9156 async def request_job_to_miner(
9157 self,
9158 payload: MinerJobRequestPayload,
9159 encypted_files: MinerJobEnryptedFiles,
9160 docker_hub_digests: dict[str, str],
9161 debug=False,
9162 ):
9163 loop = asyncio.get_event_loop()
9164 my_key: bittensor.Keypair = settings.get_bittensor_wallet().get_hotkey()
9165 default_extra = {
9166 "job_batch_id": payload.job_batch_id,
9167 "miner_hotkey": payload.miner_hotkey,
9168 "miner_address": payload.miner_address,
9169 "miner_port": payload.miner_port,
9170 }
9171
9172 try:
9173 logger.info(_m("Requesting job to miner", extra=get_extra_info(default_extra)))
9174 miner_client = MinerClient(
9175 loop=loop,
9176 miner_address=payload.miner_address,
9177 miner_port=payload.miner_port,
9178 miner_hotkey=payload.miner_hotkey,
9179 my_hotkey=my_key.ss58_address,
9180 keypair=my_key,
9181 miner_url=f"ws://{payload.miner_address}:{payload.miner_port}/jobs/{my_key.ss58_address}"
9182 )
9183
9184 async with miner_client:
9185 # generate ssh key and send it to miner
9186 private_key, public_key = self.ssh_service.generate_ssh_key(my_key.ss58_address)
9187
9188 await miner_client.send_model(SSHPubKeySubmitRequest(public_key=public_key))
9189
9190 try:
9191 msg = await asyncio.wait_for(
9192 miner_client.job_state.miner_accepted_ssh_key_or_failed_future, JOB_LENGTH
9193 )
9194 except TimeoutError:
9195 logger.error(
9196 _m(
9197 "Waiting accepted ssh key or failed request from miner resulted in TimeoutError",
9198 extra=get_extra_info(default_extra),
9199 ),
9200 )
9201 msg = None
9202 except Exception:
9203 logger.error(
9204 _m(
9205 "Waiting accepted ssh key or failed request from miner resulted in an exception",
9206 extra=get_extra_info(default_extra),
9207 ),
9208 )
9209 msg = None
9210
9211 if isinstance(msg, AcceptSSHKeyRequest):
9212 logger.info(
9213 _m(
9214 "Received AcceptSSHKeyRequest for miner. Running tasks for executors",
9215 extra=get_extra_info(
9216 {**default_extra, "executors": len(msg.executors)}
9217 ),
9218 ),
9219 )
9220 if len(msg.executors) == 0:
9221 return None
9222
9223 tasks = [
9224 asyncio.create_task(
9225 self.task_service.create_task(
9226 miner_info=payload,
9227 executor_info=executor_info,
9228 keypair=my_key,
9229 private_key=private_key.decode("utf-8"),
9230 public_key=public_key.decode("utf-8"),
9231 encypted_files=encypted_files,
9232 docker_hub_digests=docker_hub_digests,
9233 debug=debug,
9234 )
9235 )
9236 for executor_info in msg.executors
9237 ]
9238
9239 results = [
9240 result
9241 for result in await asyncio.gather(*tasks, return_exceptions=True)
9242 if result
9243 ]
9244
9245 logger.info(
9246 _m(
9247 "Finished running tasks for executors",
9248 extra=get_extra_info({**default_extra, "executors": len(results)}),
9249 ),
9250 )
9251
9252 await miner_client.send_model(SSHPubKeyRemoveRequest(public_key=public_key))
9253
9254 await self.publish_machine_specs(results, miner_client.miner_hotkey)
9255 await self.store_executor_counts(
9256 payload.miner_hotkey, payload.job_batch_id, len(msg.executors), results
9257 )
9258
9259 total_score = 0
9260 for _, _, score, _, _, _, _ in results:
9261 total_score += score
9262
9263 logger.info(
9264 _m(
9265 f"total score: {total_score}",
9266 extra=get_extra_info(default_extra),
9267 )
9268 )
9269
9270 return {
9271 "miner_hotkey": payload.miner_hotkey,
9272 "score": total_score,
9273 }
9274 elif isinstance(msg, FailedRequest):
9275 logger.warning(
9276 _m(
9277 "Requesting job failed for miner",
9278 extra=get_extra_info({**default_extra, "msg": str(msg)}),
9279 ),
9280 )
9281 return None
9282 elif isinstance(msg, DeclineJobRequest):
9283 logger.warning(
9284 _m(
9285 "Requesting job declined for miner",
9286 extra=get_extra_info({**default_extra, "msg": str(msg)}),
9287 ),
9288 )
9289 return None
9290 else:
9291 logger.error(
9292 _m(
9293 "Unexpected msg",
9294 extra=get_extra_info({**default_extra, "msg": str(msg)}),
9295 ),
9296 )
9297 return None
9298 except asyncio.CancelledError:
9299 logger.error(
9300 _m("Requesting job to miner was cancelled", extra=get_extra_info(default_extra)),
9301 )
9302 return None
9303 except Exception as e:
9304 logger.error(
9305 _m(
9306 "Requesting job to miner resulted in an exception",
9307 extra=get_extra_info({**default_extra, "error": str(e)}),
9308 ),
9309 exc_info=True,
9310 )
9311 return None
9312
9313 async def publish_machine_specs(
9314 self, results: list[tuple[dict, ExecutorSSHInfo]], miner_hotkey: str
9315 ):
9316 """Publish machine specs to compute app connector process"""
9317 default_extra = {
9318 "miner_hotkey": miner_hotkey,
9319 }
9320
9321 logger.info(
9322 _m(
9323 "Publishing machine specs to compute app connector process",
9324 extra=get_extra_info({**default_extra, "results": len(results)}),
9325 ),
9326 )
9327 for (
9328 specs,
9329 ssh_info,
9330 score,
9331 synthetic_job_score,
9332 job_batch_id,
9333 log_status,
9334 log_text,
9335 ) in results:
9336 try:
9337 await self.redis_service.publish(
9338 MACHINE_SPEC_CHANNEL_NAME,
9339 {
9340 "specs": specs,
9341 "miner_hotkey": miner_hotkey,
9342 "executor_uuid": ssh_info.uuid,
9343 "executor_ip": ssh_info.address,
9344 "executor_port": ssh_info.port,
9345 "score": score,
9346 "synthetic_job_score": synthetic_job_score,
9347 "job_batch_id": job_batch_id,
9348 "log_status": log_status,
9349 "log_text": str(log_text),
9350 },
9351 )
9352 except Exception as e:
9353 logger.error(
9354 _m(
9355 f"Error publishing machine specs of {miner_hotkey} to compute app connector process",
9356 extra=get_extra_info({**default_extra, "error": str(e)}),
9357 ),
9358 exc_info=True,
9359 )
9360
9361 async def store_executor_counts(
9362 self, miner_hotkey: str, job_batch_id: str, total: int, results: list[dict]
9363 ):
9364 default_extra = {
9365 "job_batch_id": job_batch_id,
9366 "miner_hotkey": miner_hotkey,
9367 }
9368
9369 success = 0
9370 failed = 0
9371
9372 for _, _, score, _, _, _, _ in results:
9373 if score > 0:
9374 success += 1
9375 else:
9376 failed += 1
9377
9378 data = {"total": total, "success": success, "failed": failed}
9379
9380 key = f"{EXECUTOR_COUNT_PREFIX}:{miner_hotkey}"
9381
9382 try:
9383 await self.redis_service.hset(key, job_batch_id, json.dumps(data))
9384
9385 logger.info(
9386 _m(
9387 "Stored executor counts",
9388 extra=get_extra_info({**default_extra, **data}),
9389 ),
9390 )
9391 except Exception as e:
9392 logger.error(
9393 _m(
9394 "Failed storing executor counts",
9395 extra=get_extra_info({**default_extra, **data, "error": str(e)}),
9396 ),
9397 exc_info=True,
9398 )
9399
9400 async def handle_container(self, payload: ContainerBaseRequest):
9401 loop = asyncio.get_event_loop()
9402 my_key: bittensor.Keypair = settings.get_bittensor_wallet().get_hotkey()
9403 default_extra = {
9404 "miner_hotkey": payload.miner_hotkey,
9405 "executor_id": payload.executor_id,
9406 "executor_ip": payload.miner_address,
9407 "executor_port": payload.miner_port,
9408 "container_request_type": str(payload.message_type),
9409 }
9410
9411 docker_service = DockerService(
9412 ssh_service=self.ssh_service,
9413 redis_service=self.redis_service,
9414 )
9415
9416 try:
9417 miner_client = MinerClient(
9418 loop=loop,
9419 miner_address=payload.miner_address,
9420 miner_port=payload.miner_port,
9421 miner_hotkey=payload.miner_hotkey,
9422 my_hotkey=my_key.ss58_address,
9423 keypair=my_key,
9424 miner_url=f"ws://{payload.miner_address}:{payload.miner_port}/resources/{my_key.ss58_address}",
9425 )
9426
9427 async with miner_client:
9428 # generate ssh key and send it to miner
9429 private_key, public_key = self.ssh_service.generate_ssh_key(my_key.ss58_address)
9430 await miner_client.send_model(
9431 SSHPubKeySubmitRequest(public_key=public_key, executor_id=payload.executor_id)
9432 )
9433
9434 logger.info(
9435 _m("Sent SSH key to miner.", extra=get_extra_info(default_extra)),
9436 )
9437
9438 try:
9439 msg = await asyncio.wait_for(
9440 miner_client.job_state.miner_accepted_ssh_key_or_failed_future,
9441 timeout=JOB_LENGTH,
9442 )
9443 except TimeoutError:
9444 logger.error(
9445 _m(
9446                                "Waiting accepted ssh key or failed request from miner resulted in a timeout error",
9447 extra=get_extra_info(default_extra),
9448 ),
9449 )
9450 msg = None
9451 except Exception as e:
9452 logger.error(
9453 _m(
9454 "Waiting accepted ssh key or failed request from miner resulted in an exception",
9455 extra=get_extra_info({**default_extra, "error": str(e)}),
9456 ),
9457 )
9458 msg = None
9459
9460 if isinstance(msg, AcceptSSHKeyRequest):
9461 logger.info(
9462 _m(
9463 "Received AcceptSSHKeyRequest",
9464 extra=get_extra_info({**default_extra, "msg": str(msg)}),
9465 ),
9466 )
9467
9468 try:
9469 executor = msg.executors[0]
9470 except Exception as e:
9471 logger.error(
9472 _m(
9473 "Error: Miner didn't return executor info",
9474 extra=get_extra_info({**default_extra, "error": str(e)}),
9475 ),
9476 )
9477 executor = None
9478
9479 if executor is None or executor.uuid != payload.executor_id:
9480 logger.error(
9481 _m("Error: Invalid executor id", extra=get_extra_info(default_extra)),
9482 )
9483 await miner_client.send_model(
9484 SSHPubKeyRemoveRequest(
9485 public_key=public_key, executor_id=payload.executor_id
9486 )
9487 )
9488
9489 await self.redis_service.remove_rented_machine(
9490 RentedMachine(
9491 miner_hotkey=payload.miner_hotkey,
9492 executor_id=payload.executor_id,
9493 executor_ip_address=executor.address if executor else "",
9494 executor_ip_port=str(executor.port if executor else ""),
9495 )
9496 )
9497
9498 return FailedContainerRequest(
9499 miner_hotkey=payload.miner_hotkey,
9500 executor_id=payload.executor_id,
9501 msg=f"Invalid executor id {payload.executor_id}",
9502 error_code=FailedContainerErrorCodes.InvalidExecutorId,
9503 )
9504
9505 try:
9506 if isinstance(payload, ContainerCreateRequest):
9507 logger.info(
9508 _m(
9509 "Creating container",
9510 extra=get_extra_info(
9511 {**default_extra, "payload": str(payload)}
9512 ),
9513 ),
9514 )
9515 result = await docker_service.create_container(
9516 payload,
9517 executor,
9518 my_key,
9519 private_key.decode("utf-8"),
9520 )
9521
9522 await miner_client.send_model(
9523 SSHPubKeyRemoveRequest(
9524 public_key=public_key, executor_id=payload.executor_id
9525 )
9526 )
9527
9528 if isinstance(result, FailedContainerRequest):
9529 return result
9530
9531 return ContainerCreated(
9532 miner_hotkey=payload.miner_hotkey,
9533 executor_id=payload.executor_id,
9534 container_name=result.container_name,
9535 volume_name=result.volume_name,
9536 port_maps=result.port_maps,
9537 )
9538
9539 # elif isinstance(payload, ContainerStartRequest):
9540 # logger.info(
9541 # _m(
9542 # "Starting container",
9543 # extra=get_extra_info(
9544 # {**default_extra, "payload": str(payload)}
9545 # ),
9546 # ),
9547 # )
9548 # await docker_service.start_container(
9549 # payload,
9550 # executor,
9551 # my_key,
9552 # private_key.decode("utf-8"),
9553 # )
9554
9555 # logger.info(
9556 # _m(
9557 # "Started Container",
9558 # extra=get_extra_info(
9559 # {**default_extra, "payload": str(payload)}
9560 # ),
9561 # ),
9562 # )
9563 # await miner_client.send_model(
9564 # SSHPubKeyRemoveRequest(
9565 # public_key=public_key, executor_id=payload.executor_id
9566 # )
9567 # )
9568
9569 # return ContainerStarted(
9570 # miner_hotkey=payload.miner_hotkey,
9571 # executor_id=payload.executor_id,
9572 # container_name=payload.container_name,
9573 # )
9574 # elif isinstance(payload, ContainerStopRequest):
9575 # await docker_service.stop_container(
9576 # payload,
9577 # executor,
9578 # my_key,
9579 # private_key.decode("utf-8"),
9580 # )
9581 # await miner_client.send_model(
9582 # SSHPubKeyRemoveRequest(
9583 # public_key=public_key, executor_id=payload.executor_id
9584 # )
9585 # )
9586
9587 # return ContainerStopped(
9588 # miner_hotkey=payload.miner_hotkey,
9589 # executor_id=payload.executor_id,
9590 # container_name=payload.container_name,
9591 # )
9592 elif isinstance(payload, ContainerDeleteRequest):
9593 logger.info(
9594 _m(
9595 "Deleting container",
9596 extra=get_extra_info(
9597 {**default_extra, "payload": str(payload)}
9598 ),
9599 ),
9600 )
9601 await docker_service.delete_container(
9602 payload,
9603 executor,
9604 my_key,
9605 private_key.decode("utf-8"),
9606 )
9607
9608 logger.info(
9609 _m(
9610 "Deleted Container",
9611 extra=get_extra_info(
9612 {**default_extra, "payload": str(payload)}
9613 ),
9614 ),
9615 )
9616 await miner_client.send_model(
9617 SSHPubKeyRemoveRequest(
9618 public_key=public_key, executor_id=payload.executor_id
9619 )
9620 )
9621
9622 return ContainerDeleted(
9623 miner_hotkey=payload.miner_hotkey,
9624 executor_id=payload.executor_id,
9625 container_name=payload.container_name,
9626 volume_name=payload.volume_name,
9627 )
9628 else:
9629 logger.error(
9630 _m(
9631 "Unexpected request",
9632 extra=get_extra_info(
9633 {**default_extra, "payload": str(payload)}
9634 ),
9635 ),
9636 )
9637 return FailedContainerRequest(
9638 miner_hotkey=payload.miner_hotkey,
9639 executor_id=payload.executor_id,
9640 msg=f"Unexpected request: {payload}",
9641 error_code=FailedContainerErrorCodes.UnknownError,
9642 )
9643
9644 except Exception as e:
9645 logger.error(
9646 _m(
9647 "Error: create container error",
9648 extra=get_extra_info({**default_extra, "error": str(e)}),
9649 ),
9650 )
9651 await miner_client.send_model(
9652 SSHPubKeyRemoveRequest(
9653 public_key=public_key, executor_id=payload.executor_id
9654 )
9655 )
9656
9657 return FailedContainerRequest(
9658 miner_hotkey=payload.miner_hotkey,
9659 executor_id=payload.executor_id,
9660 msg=f"create container error: {str(e)}",
9661 error_code=FailedContainerErrorCodes.ExceptionError,
9662 )
9663
9664 elif isinstance(msg, FailedRequest):
9665 logger.info(
9666 _m(
9667 "Error: Miner failed job",
9668 extra=get_extra_info({**default_extra, "msg": str(msg)}),
9669 ),
9670 )
9671 return FailedContainerRequest(
9672 miner_hotkey=payload.miner_hotkey,
9673 executor_id=payload.executor_id,
9674 msg=f"Failed request from miner: {str(msg)}",
9675 error_code=FailedContainerErrorCodes.FailedMsgFromMiner,
9676 )
9677 else:
9678 logger.error(
9679 _m(
9680 "Error: Unexpected msg",
9681 extra=get_extra_info({**default_extra, "msg": str(msg)}),
9682 ),
9683 )
9684 return FailedContainerRequest(
9685 miner_hotkey=payload.miner_hotkey,
9686 executor_id=payload.executor_id,
9687 msg=f"Unexpected msg: {str(msg)}",
9688 error_code=FailedContainerErrorCodes.UnknownError,
9689 )
9690 except Exception as e:
9691 log_text = _m(
9692 "[handle_container] resulted in an exception",
9693 extra=get_extra_info({**default_extra, "error": str(e)}),
9694 )
9695
9696 logger.error(log_text, exc_info=True)
9697
9698 return FailedContainerRequest(
9699 miner_hotkey=payload.miner_hotkey,
9700 executor_id=payload.executor_id,
9701 msg=f"Exception: {str(e)}",
9702 error_code=FailedContainerErrorCodes.ExceptionError,
9703 )
9704
9705
9706MinerServiceDep = Annotated[MinerService, Depends(MinerService)]
9707
9708
9709
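# The task results handled above are 7-tuples whose field order is spelled out in
# publish_machine_specs(): (specs, ssh_info, score, synthetic_job_score, job_batch_id,
# log_status, log_text). Below is a minimal sketch of a named tuple that would make the
# positional unpacking (`for _, _, score, _, _, _, _ in results`) self-documenting;
# TaskResult is an illustrative name, not an existing type in this codebase.
from typing import Any, NamedTuple


class TaskResult(NamedTuple):
    specs: dict | None
    executor_info: ExecutorSSHInfo
    score: float
    synthetic_job_score: float
    job_batch_id: str
    log_status: str
    log_text: Any


# With such a type, the total score reduces to: sum(result.score for result in results)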
9710---
9711File: /neurons/validators/src/services/redis_service.py
9712---
9713
9714import json
9715import asyncio
9716import redis.asyncio as aioredis
9717from protocol.vc_protocol.compute_requests import RentedMachine
9718from core.config import settings
9719
9720MACHINE_SPEC_CHANNEL_NAME = "channel:1"
9721STREAMING_LOG_CHANNEL = "channel:2"
9722RENTED_MACHINE_SET = "rented_machines"
9723DUPLICATED_MACHINE_SET = "duplicated_machines"
9724EXECUTOR_COUNT_PREFIX = "executor_counts"
9725AVAILABLE_PORT_MAPS_PREFIX = "available_port_maps"
9726VERIFIED_JOB_COUNT_KEY = "verified_job_counts"
9727
9728
9729class RedisService:
9730 def __init__(self):
9731 self.redis = aioredis.from_url(f"redis://{settings.REDIS_HOST}:{settings.REDIS_PORT}")
9732 self.lock = asyncio.Lock()
9733
9734 async def publish(self, channel: str, message: dict):
9735 """Publish a message to a Redis channel."""
9736 await self.redis.publish(channel, json.dumps(message))
9737
9738 async def subscribe(self, channel: str):
9739 """Subscribe to a Redis channel."""
9740 pubsub = self.redis.pubsub()
9741 await pubsub.subscribe(channel)
9742 return pubsub
9743
9744 async def set(self, key: str, value: str):
9745 """Set a key-value pair in Redis."""
9746 async with self.lock:
9747 await self.redis.set(key, value)
9748
9749 async def get(self, key: str):
9750 """Get a value by key from Redis."""
9751 async with self.lock:
9752 return await self.redis.get(key)
9753
9754 async def delete(self, key: str):
9755 """Remove a key from Redis."""
9756 async with self.lock:
9757 await self.redis.delete(key)
9758
9759 async def sadd(self, key: str, elem: str):
9760 """Add an element to a set in Redis."""
9761 async with self.lock:
9762 await self.redis.sadd(key, elem)
9763
9764 async def srem(self, key: str, elem: str):
9765 """Remove an element from a set in Redis."""
9766 async with self.lock:
9767 await self.redis.srem(key, elem)
9768
9769 async def is_elem_exists_in_set(self, key: str, elem: str) -> bool:
9770        """Check whether an element exists in a set in Redis."""
9771 async with self.lock:
9772 return await self.redis.sismember(key, elem)
9773
9774 async def smembers(self, key: str):
9775 async with self.lock:
9776 return await self.redis.smembers(key)
9777
9778 async def add_rented_machine(self, machine: RentedMachine):
9779 await self.sadd(RENTED_MACHINE_SET, f"{machine.miner_hotkey}:{machine.executor_id}")
9780
9781 async def remove_rented_machine(self, machine: RentedMachine):
9782 await self.srem(RENTED_MACHINE_SET, f"{machine.miner_hotkey}:{machine.executor_id}")
9783
9784 async def lpush(self, key: str, element: bytes):
9785 """Add an element to a list in Redis."""
9786 async with self.lock:
9787 await self.redis.lpush(key, element)
9788
9789 async def lrange(self, key: str) -> list[bytes]:
9790 """Get all elements from a list in Redis in order."""
9791 async with self.lock:
9792 return await self.redis.lrange(key, 0, -1)
9793
9794 async def lrem(self, key: str, element: bytes, count: int = 0):
9795 """Remove elements from a list in Redis."""
9796 async with self.lock:
9797 await self.redis.lrem(key, count, element)
9798
9799 async def ltrim(self, key: str, max_length: int):
9800 """Trim the list to maintain a maximum length."""
9801 async with self.lock:
9802 await self.redis.ltrim(key, 0, max_length - 1)
9803
9804 async def lpop(self, key: str) -> bytes:
9805 """Remove and return the first element (last inserted) from a list in Redis."""
9806 async with self.lock:
9807 return await self.redis.lpop(key)
9808
9809 async def rpop(self, key: str) -> bytes:
9810 """Remove and return the last element (first inserted) from a list in Redis."""
9811 async with self.lock:
9812 return await self.redis.rpop(key)
9813
9814 async def hset(self, key: str, field: str, value: str):
9815 async with self.lock:
9816 await self.redis.hset(key, field, value)
9817
9818 async def hget(self, key: str, field: str):
9819 async with self.lock:
9820 return await self.redis.hget(key, field)
9821
9822 async def hgetall(self, key: str):
9823 async with self.lock:
9824 return await self.redis.hgetall(key)
9825
9826 async def hdel(self, key: str, *fields: str):
9827 async with self.lock:
9828 await self.redis.hdel(key, *fields)
9829
9830 async def clear_by_pattern(self, pattern: str):
9831 async with self.lock:
9832 async for key in self.redis.scan_iter(match=pattern):
9833 await self.redis.delete(key.decode())
9834
9835 async def clear_all_executor_counts(self):
9836 pattern = f"{EXECUTOR_COUNT_PREFIX}:*"
9837 cursor = 0
9838
9839 async with self.lock:
9840 while True:
9841 cursor, keys = await self.redis.scan(cursor, match=pattern, count=100)
9842 if keys:
9843 await self.redis.delete(*keys)
9844 if cursor == 0:
9845 break
9846
9847 async def clear_all_ssh_ports(self):
9848 pattern = f"{AVAILABLE_PORT_MAPS_PREFIX}:*"
9849 await self.clear_by_pattern(pattern)
9850
9851 async def set_verified_job_count(self, executor_id: str, count: int):
9852 data = {
9853 "count": count,
9854 }
9855
9856 await self.hset(VERIFIED_JOB_COUNT_KEY, executor_id, json.dumps(data))
9857
9858 async def get_verified_job_count(self, executor_id: str):
9859 data = await self.hget(VERIFIED_JOB_COUNT_KEY, executor_id)
9860
9861 if not data:
9862 return 0
9863
9864 return json.loads(data)['count']
9865
9866
9867
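# For reference, a minimal usage sketch (the function name is illustrative; the key
# layout matches task_service.py) of the "keep only the latest N port maps" pattern
# built from lrem / lpush / lrange / rpop:
async def remember_port_map(
    redis_service: RedisService,
    miner_hotkey: str,
    executor_uuid: str,
    internal_port: int,
    external_port: int,
    max_entries: int = 10,
):
    key = f"{AVAILABLE_PORT_MAPS_PREFIX}:{miner_hotkey}:{executor_uuid}"
    port_map = f"{internal_port},{external_port}"

    await redis_service.lrem(key=key, element=port_map)  # drop any duplicate entry
    await redis_service.lpush(key, port_map)             # newest entry at the head
    if len(await redis_service.lrange(key)) > max_entries:
        await redis_service.rpop(key)                    # evict the oldest entry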
9868---
9869File: /neurons/validators/src/services/ssh_service.py
9870---
9871
9872import hashlib
9873from base64 import b64encode
9874import random
9875import string
9876
9877from cryptography.fernet import Fernet
9878from cryptography.hazmat.primitives import serialization
9879from cryptography.hazmat.primitives.asymmetric import ed25519
9880
9881
9882class SSHService:
9883 def generate_random_string(self, length=30):
9884 characters = (
9885 string.ascii_letters + string.digits +
9886 "/ "
9887 )
9888 random_string = ''.join(random.choices(characters, k=length))
9889 return random_string
9890
9891 def _hash(self, s: bytes) -> bytes:
9892 return b64encode(hashlib.sha256(s).digest(), altchars=b"-_")
9893
9894 def _encrypt(self, key: str, payload: str) -> str:
9895 key_bytes = self._hash(key.encode("utf-8"))
9896 return Fernet(key_bytes).encrypt(payload.encode("utf-8")).decode("utf-8")
9897
9898 def decrypt_payload(self, key: str, encrypted_payload: str) -> str:
9899 key_bytes = self._hash(key.encode("utf-8"))
9900 return Fernet(key_bytes).decrypt(encrypted_payload.encode("utf-8")).decode("utf-8")
9901
9902    def generate_ssh_key(self, encryption_key: str) -> tuple[bytes, bytes]:
9903 """Generate SSH key pair.
9904
9905 Args:
9906 encryption_key (str): key to encrypt the private key.
9907
9908 Returns:
9909 (bytes, bytes): return (private key bytes, public key bytes)
9910 """
9911 # Generate a new private-public key pair
9912 private_key = ed25519.Ed25519PrivateKey.generate()
9913 public_key = private_key.public_key()
9914
9915 private_key_bytes = private_key.private_bytes(
9916 encoding=serialization.Encoding.PEM,
9917 format=serialization.PrivateFormat.OpenSSH,
9918 # encryption_algorithm=BestAvailableEncryption(encryption_key.encode()),
9919 encryption_algorithm=serialization.NoEncryption(),
9920 )
9921 public_key_bytes = public_key.public_bytes(
9922 encoding=serialization.Encoding.OpenSSH,
9923 format=serialization.PublicFormat.OpenSSH,
9924 )
9925
9926 # extract pub key content, excluding first line and end line
9927 # pub_key_str = "".join(public_key_bytes.decode().split("\n")[1:-2])
9928
9929 return self._encrypt(encryption_key, private_key_bytes.decode("utf-8")).encode(
9930 "utf-8"
9931 ), public_key_bytes
9932
9933
9934
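# A minimal round-trip sketch (the hotkey address below is a placeholder, not a real
# ss58 address): the private key is returned Fernet-encrypted under a SHA-256-derived
# key, so the same string passed as encryption_key must later be handed to
# decrypt_payload, as the validator does in task_service.create_task.
if __name__ == "__main__":
    ssh_service = SSHService()
    hotkey_address = "5ExampleHotkeyAddress"  # placeholder value

    encrypted_private_key, public_key = ssh_service.generate_ssh_key(hotkey_address)
    private_key_pem = ssh_service.decrypt_payload(
        hotkey_address, encrypted_private_key.decode("utf-8")
    )
    print(public_key.decode("utf-8"))
    print(private_key_pem.splitlines()[0])  # "-----BEGIN OPENSSH PRIVATE KEY-----"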
9935---
9936File: /neurons/validators/src/services/task_service.py
9937---
9938
9939import asyncio
9940import json
9941import logging
9942import os
9943import random
9944import time
9945import uuid
9946from typing import Annotated
9947
9948import asyncssh
9949import bittensor
9950from datura.requests.miner_requests import ExecutorSSHInfo
9951from fastapi import Depends
9952from payload_models.payloads import MinerJobEnryptedFiles, MinerJobRequestPayload
9953
9954from core.config import settings
9955from core.utils import _m, context, get_extra_info
9956from services.const import (
9957 DOWNLOAD_SPEED_WEIGHT,
9958 GPU_MAX_SCORES,
9959 JOB_TAKEN_TIME_WEIGHT,
9960 MAX_DOWNLOAD_SPEED,
9961 MAX_UPLOAD_SPEED,
9962 UPLOAD_SPEED_WEIGHT,
9963 MAX_GPU_COUNT,
9964 UNRENTED_MULTIPLIER,
9965 HASHCAT_CONFIGS,
9966 LIB_NVIDIA_ML_DIGESTS,
9967 DOCKER_DIGESTS,
9968 GPU_UTILIZATION_LIMIT,
9969 GPU_MEMORY_UTILIZATION_LIMIT,
9970 VERIFY_JOB_REQUIRED_COUNT,
9971)
9972from services.redis_service import (
9973 RedisService,
9974 RENTED_MACHINE_SET,
9975 DUPLICATED_MACHINE_SET,
9976 AVAILABLE_PORT_MAPS_PREFIX,
9977)
9978from services.ssh_service import SSHService
9979from services.hash_service import HashService
9980
9981logger = logging.getLogger(__name__)
9982
9983JOB_LENGTH = 300
9984
9985
9986class TaskService:
9987 def __init__(
9988 self,
9989 ssh_service: Annotated[SSHService, Depends(SSHService)],
9990 redis_service: Annotated[RedisService, Depends(RedisService)],
9991 ):
9992 self.ssh_service = ssh_service
9993 self.redis_service = redis_service
9994 self.wallet = settings.get_bittensor_wallet()
9995
9996 async def upload_directory(
9997 self, ssh_client: asyncssh.SSHClientConnection, local_dir: str, remote_dir: str
9998 ):
9999 """Uploads a directory recursively to a remote server using AsyncSSH."""
10000 async with ssh_client.start_sftp_client() as sftp_client:
10001 for root, dirs, files in os.walk(local_dir):
10002 relative_dir = os.path.relpath(root, local_dir)
10003 remote_path = os.path.join(remote_dir, relative_dir)
10004
10005 # Create remote directory if it doesn't exist
10006 result = await ssh_client.run(f"mkdir -p {remote_path}")
10007 if result.exit_status != 0:
10008 raise Exception(f"Failed to create directory {remote_path}: {result.stderr}")
10009
10010 # Upload files
10011 upload_tasks = []
10012 for file in files:
10013 local_file = os.path.join(root, file)
10014 remote_file = os.path.join(remote_path, file)
10015 upload_tasks.append(sftp_client.put(local_file, remote_file))
10016
10017 # Await all upload tasks for the current directory
10018 await asyncio.gather(*upload_tasks)
10019
10020 async def is_script_running(
10021 self, ssh_client: asyncssh.SSHClientConnection, script_path: str
10022 ) -> bool:
10023 """
10024 Check if a specific script is running.
10025
10026 Args:
10027 ssh_client: SSH client instance
10028 script_path: Full path to the script (e.g., '/root/app/gpus_utility.py')
10029
10030
10031 Returns:
10032 bool: True if script is running, False otherwise
10033 """
10034 try:
10035 result = await ssh_client.run(f'ps aux | grep "python.*{script_path}"', timeout=10)
10036 # Filter out the grep process itself
10037 processes = [line for line in result.stdout.splitlines() if "grep" not in line]
10038
10039 logger.info(f"{script_path} running status: {bool(processes)}")
10040 return bool(processes)
10041 except Exception as e:
10042 logger.error(f"Error checking {script_path} status: {e}")
10043 return False
10044
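    # A grep-free variant sketch of the check above, assuming `pgrep` is available on
    # the executor (pgrep -f matches the full command line and exits non-zero when
    # nothing matches):
    #
    #   result = await ssh_client.run(f'pgrep -f "python.*{script_path}"', timeout=10)
    #   return result.exit_status == 0 and bool(result.stdout.strip())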
10045 async def start_script(
10046 self,
10047 ssh_client: asyncssh.SSHClientConnection,
10048 script_path: str,
10049 command_args: dict,
10050 executor_info: ExecutorSSHInfo,
10051 ) -> bool:
10052 """
10053 Start a script with specified arguments.
10054
10055 Args:
10056 ssh_client: SSH client instance
10057 script_path: Full path to the script (e.g., '/root/app/gpus_utility.py')
10058 command_args: Dictionary of argument names and values
10059
10060 Returns:
10061 bool: True if script started successfully, False otherwise
10062 """
10063 try:
10064 # Build command string from arguments
10065 args_string = " ".join([f"--{key} {value}" for key, value in command_args.items()])
10066 await ssh_client.run("pip install aiohttp click pynvml psutil", timeout=30)
10067 command = (
10068 f"nohup {executor_info.python_path} {script_path} {args_string} > /dev/null 2>&1 & "
10069 )
10070 # Run the script
10071 result = await ssh_client.run(command, timeout=50, check=True)
10072 logger.info(f"Started {script_path}: {result}")
10073 return True
10074 except Exception as e:
10075 logger.error(f"Error starting script {script_path}: {e}", exc_info=True)
10076 return False
10077
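    # Worked example (hypothetical values) of the command assembled by start_script():
    #
    #   command_args = {"program_id": "1a2b3c", "validator_hotkey": "5Example..."}
    #   args_string  = "--program_id 1a2b3c --validator_hotkey 5Example..."
    #   command      = ("nohup /usr/bin/python3 /root/app/src/gpus_utility.py "
    #                   "--program_id 1a2b3c --validator_hotkey 5Example... > /dev/null 2>&1 & ")
    #
    # where /usr/bin/python3 stands in for executor_info.python_path.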
10078 def validate_digests(self, docker_digests, docker_hub_digests):
10079 # Check if the list is empty
10080 if not docker_digests:
10081 return False
10082
10083 # Get unique digests
10084 unique_digests = list({item["digest"] for item in docker_digests})
10085
10086 # Check for duplicates
10087 if len(unique_digests) != len(docker_digests):
10088 return False
10089
10090 # Check if any digest is invalid
10091 for digest in unique_digests:
10092 if digest not in docker_hub_digests.values():
10093 return False
10094
10095 return True
10096
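    # Illustrative inputs for validate_digests() (digest values are made up):
    #
    #   docker_hub_digests = {"daturaai/compute-subnet-executor:latest": "sha256:aaa"}
    #
    #   [{"digest": "sha256:aaa"}]                            -> True
    #   []                                                    -> False (empty list)
    #   [{"digest": "sha256:aaa"}, {"digest": "sha256:aaa"}]  -> False (duplicate digest)
    #   [{"digest": "sha256:zzz"}]                            -> False (unknown on Docker Hub)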
10097 async def clear_remote_directory(
10098 self, ssh_client: asyncssh.SSHClientConnection, remote_dir: str
10099 ):
10100 try:
10101 await ssh_client.run(f"rm -rf {remote_dir}", timeout=10)
10102 except Exception as e:
10103 logger.error(f"Error clearing remote directory: {e}")
10104
10105 def get_available_port_map(
10106 self,
10107 executor_info: ExecutorSSHInfo,
10108 ) -> tuple[int, int] | None:
10109 if executor_info.port_mappings:
10110 port_mappings: list[tuple[int, int]] = json.loads(executor_info.port_mappings)
10111 port_mappings = [
10112 (internal_port, external_port)
10113 for internal_port, external_port in port_mappings
10114 if internal_port != executor_info.ssh_port
10115 and external_port != executor_info.ssh_port
10116 ]
10117
10118 if not port_mappings:
10119 return None
10120
10121 return random.choice(port_mappings)
10122
10123 if executor_info.port_range:
10124 if "-" in executor_info.port_range:
10125 min_port, max_port = map(
10126 int, (part.strip() for part in executor_info.port_range.split("-"))
10127 )
10128 ports = list(range(min_port, max_port + 1))
10129 else:
10130 ports = list(
10131 map(int, (part.strip() for part in executor_info.port_range.split(",")))
10132 )
10133 else:
10134 # Default range if port_range is empty
10135 ports = list(range(40000, 65536))
10136
10137 ports = [port for port in ports if port != executor_info.ssh_port]
10138
10139 if not ports:
10140 return None
10141
10142 internal_port = random.choice(ports)
10143
10144 return internal_port, internal_port
10145
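    # Worked examples (hypothetical executor settings) of get_available_port_map():
    #
    #   port_mappings='[[2200, 48000], [2201, 48001]]', ssh_port=2200
    #       -> (2200, 48000) is filtered out; (2201, 48001) is returned.
    #   port_range="40000-40005", ssh_port=40002
    #       -> one of [40000, 40001, 40003, 40004, 40005] is picked and returned as (port, port).
    #   port_range="4444, 5555"
    #       -> candidate ports are [4444, 5555].
    #   port_range="" and no port_mappings
    #       -> defaults to ports 40000-65535 (minus the ssh port).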
10146 async def docker_connection_check(
10147 self,
10148 ssh_client: asyncssh.SSHClientConnection,
10149 job_batch_id: str,
10150 miner_hotkey: str,
10151 executor_info: ExecutorSSHInfo,
10152 private_key: str,
10153 public_key: str,
10154 ):
10155 port_map = self.get_available_port_map(executor_info)
10156 if port_map is None:
10157 log_text = _m(
10158 "No port available for docker container",
10159 extra=get_extra_info(
10160 {
10161 "job_batch_id": job_batch_id,
10162 "miner_hotkey": miner_hotkey,
10163 "executor_uuid": executor_info.uuid,
10164 "executor_ip_address": executor_info.address,
10165 "executor_port": executor_info.port,
10166 "ssh_username": executor_info.ssh_username,
10167 "ssh_port": executor_info.ssh_port,
10168 "version": settings.VERSION
10169 }
10170 ),
10171 )
10172 log_status = "error"
10173 logger.error(log_text, exc_info=True)
10174
10175 return False, log_text, log_status
10176
10177 internal_port, external_port = port_map
10178 executor_name = f"{executor_info.uuid}_{executor_info.address}_{executor_info.port}"
10179 default_extra = {
10180 "job_batch_id": job_batch_id,
10181 "miner_hotkey": miner_hotkey,
10182 "executor_uuid": executor_info.uuid,
10183 "executor_ip_address": executor_info.address,
10184 "executor_port": executor_info.port,
10185 "ssh_username": executor_info.ssh_username,
10186 "ssh_port": executor_info.ssh_port,
10187 "internal_port": internal_port,
10188 "external_port": external_port,
10189 "version": settings.VERSION,
10190 }
10191 context.set(f"[_docker_connection_check][{executor_name}]")
10192
10193 container_name = f"container_{miner_hotkey}"
10194
10195 try:
10196 result = await ssh_client.run(f"docker ps -q -f name={container_name}")
10197 if result.stdout.strip():
10198 command = f"docker rm {container_name} -f"
10199 await ssh_client.run(command)
10200
10201 log_text = _m(
10202 "Creating docker container",
10203 extra=default_extra,
10204 )
10205 log_status = "info"
10206 logger.info(log_text)
10207
10208 docker_cmd = f"sh -c 'mkdir -p ~/.ssh && echo \"{public_key}\" >> ~/.ssh/authorized_keys && ssh-keygen -A && service ssh start && tail -f /dev/null'"
10209 command = f"docker run -d --name {container_name} -p {internal_port}:22 daturaai/compute-subnet-executor:latest {docker_cmd}"
10210
10211 result = await ssh_client.run(command)
10212 if result.exit_status != 0:
10213 error_message = result.stderr.strip() if result.stderr else "No error message available"
10214 log_text = _m(
10215 "Error creating docker connection",
10216 extra=get_extra_info({
10217 **default_extra,
10218 "error": error_message
10219 }),
10220 )
10221 log_status = "error"
10222 logger.error(log_text, exc_info=True)
10223
10224 try:
10225 command = f"docker rm {container_name} -f"
10226 await ssh_client.run(command)
10227 except Exception as e:
10228 logger.error(f"Error removing docker container: {e}")
10229
10230 return False, log_text, log_status
10231
10232 await asyncio.sleep(3)
10233
10234 pkey = asyncssh.import_private_key(private_key)
10235 async with asyncssh.connect(
10236 host=executor_info.address,
10237 port=external_port,
10238 username=executor_info.ssh_username,
10239 client_keys=[pkey],
10240 known_hosts=None,
10241 ) as _:
10242 log_text = _m(
10243 "Connected into docker container",
10244 extra=default_extra,
10245 )
10246 logger.info(log_text)
10247
10248 # set port on redis
10249 key = f"{AVAILABLE_PORT_MAPS_PREFIX}:{miner_hotkey}:{executor_info.uuid}"
10250 port_map = f"{internal_port},{external_port}"
10251
10252 # delete all the same port_maps in the list
10253 await self.redis_service.lrem(key=key, element=port_map)
10254
10255 # insert port_map in the list
10256 await self.redis_service.lpush(key, port_map)
10257
10258 # keep the latest 10 port maps
10259 port_maps = await self.redis_service.lrange(key)
10260 if len(port_maps) > 10:
10261 await self.redis_service.rpop(key)
10262
10263 command = f"docker rm {container_name} -f"
10264 await ssh_client.run(command)
10265
10266 return True, log_text, log_status
10267 except Exception as e:
10268 log_text = _m(
10269                "Error connecting to docker container",
10270 extra=get_extra_info({**default_extra, "error": str(e)}),
10271 )
10272 log_status = "error"
10273 logger.error(log_text, exc_info=True)
10274
10275 try:
10276 command = f"docker rm {container_name} -f"
10277 await ssh_client.run(command)
10278 except Exception as e:
10279 logger.error(f"Error removing docker container: {e}")
10280
10281 return False, log_text, log_status
10282
10283 async def clear_verified_job_count(self, executor_info: ExecutorSSHInfo):
10284 await self.redis_service.set_verified_job_count(executor_info.uuid, 0)
10285
10286 async def create_task(
10287 self,
10288 miner_info: MinerJobRequestPayload,
10289 executor_info: ExecutorSSHInfo,
10290 keypair: bittensor.Keypair,
10291 private_key: str,
10292 public_key: str,
10293 encypted_files: MinerJobEnryptedFiles,
10294 docker_hub_digests: dict[str, str],
10295 debug: bool = False,
10296 ):
10297 default_extra = {
10298 "job_batch_id": miner_info.job_batch_id,
10299 "miner_hotkey": miner_info.miner_hotkey,
10300 "executor_uuid": executor_info.uuid,
10301 "executor_ip_address": executor_info.address,
10302 "executor_port": executor_info.port,
10303 "executor_ssh_username": executor_info.ssh_username,
10304 "executor_ssh_port": executor_info.ssh_port,
10305 "version": settings.VERSION,
10306 }
10307 try:
10308 logger.info(_m("Start job on an executor", extra=get_extra_info(default_extra)))
10309
10310 private_key = self.ssh_service.decrypt_payload(keypair.ss58_address, private_key)
10311 pkey = asyncssh.import_private_key(private_key)
10312
10313 async with asyncssh.connect(
10314 host=executor_info.address,
10315 port=executor_info.ssh_port,
10316 username=executor_info.ssh_username,
10317 client_keys=[pkey],
10318 known_hosts=None,
10319 ) as ssh_client:
10320 remote_dir = f"{executor_info.root_dir}/temp"
10321 await ssh_client.run(f"rm -rf {remote_dir}")
10322 await ssh_client.run(f"mkdir -p {remote_dir}")
10323
10324 # start gpus_utility.py
10325 program_id = str(uuid.uuid4())
10326 command_args = {
10327 "program_id": program_id,
10328 "signature": f"0x{keypair.sign(program_id.encode()).hex()}",
10329 "executor_id": executor_info.uuid,
10330 "validator_hotkey": keypair.ss58_address,
10331 "compute_rest_app_url": settings.COMPUTE_REST_API_URL,
10332 }
10333 script_path = f"{executor_info.root_dir}/src/gpus_utility.py"
10334 if not await self.is_script_running(ssh_client, script_path):
10335 await self.start_script(ssh_client, script_path, command_args, executor_info)
10336
10337 if debug is True:
10338 logger.info("Debug mode is enabled. Skipping other tasks.")
10339 return (
10340 None,
10341 executor_info,
10342 0,
10343 0,
10344 miner_info.job_batch_id,
10345 "info",
10346 "Debug mode is enabled. Skipping other tasks.",
10347 )
10348
10349 # upload temp directory
10350 await self.upload_directory(ssh_client, encypted_files.tmp_directory, remote_dir)
10351
10352 remote_machine_scrape_file_path = (
10353 f"{remote_dir}/{encypted_files.machine_scrape_file_name}"
10354 )
10355 remote_score_file_path = f"{remote_dir}/{encypted_files.score_file_name}"
10356
10357 logger.info(
10358 _m(
10359 "Uploaded files to run job",
10360 extra=get_extra_info(default_extra),
10361 ),
10362 )
10363
10364 machine_specs, _ = await self._run_task(
10365 ssh_client=ssh_client,
10366 miner_hotkey=miner_info.miner_hotkey,
10367 executor_info=executor_info,
10368 command=f"chmod +x {remote_machine_scrape_file_path} && {remote_machine_scrape_file_path}",
10369 )
10370 if not machine_specs:
10371 log_status = "warning"
10372 log_text = _m("No machine specs found", extra=get_extra_info(default_extra))
10373 logger.warning(log_text)
10374
10375 await self.clear_remote_directory(ssh_client, remote_dir)
10376 await self.clear_verified_job_count(executor_info)
10377
10378 return (
10379 None,
10380 executor_info,
10381 0,
10382 0,
10383 miner_info.job_batch_id,
10384 log_status,
10385 log_text,
10386 )
10387
10388 machine_spec = json.loads(
10389 self.ssh_service.decrypt_payload(
10390 encypted_files.encrypt_key, machine_specs[0].strip()
10391 )
10392 )
10393
10394 gpu_model = None
10395 if machine_spec.get("gpu", {}).get("count", 0) > 0:
10396 details = machine_spec["gpu"].get("details", [])
10397 if len(details) > 0:
10398 gpu_model = details[0].get("name", None)
10399
10400 max_score = 0
10401 if gpu_model:
10402 max_score = GPU_MAX_SCORES.get(gpu_model, 0)
10403
10404 gpu_count = machine_spec.get("gpu", {}).get("count", 0)
10405 gpu_details = machine_spec.get("gpu", {}).get("details", [])
10406
10407 nvidia_driver = machine_spec.get("gpu", {}).get("driver", "")
10408 libnvidia_ml = machine_spec.get("md5_checksums", {}).get("libnvidia_ml", "")
10409
10410 docker_version = machine_spec.get("docker", {}).get("version", "")
10411 docker_digest = machine_spec.get("md5_checksums", {}).get("docker", "")
10412
10413 ram = machine_spec.get("ram", {}).get("total", 0)
10414 storage = machine_spec.get("hard_disk", {}).get("free", 0)
10415
10416 gpu_processes = machine_spec.get("gpu_processes", [])
10417
10418 vram = 0
10419 for detail in gpu_details:
10420 vram += detail.get("capacity", 0) * 1024
10421
10422 logger.info(
10423 _m(
10424 "Machine spec scraped",
10425 extra=get_extra_info(
10426 {
10427 **default_extra,
10428 "gpu_model": gpu_model,
10429 "gpu_count": gpu_count,
10430 "nvidia_driver": nvidia_driver,
10431 "libnvidia_ml": libnvidia_ml,
10432 }
10433 ),
10434 ),
10435 )
10436
10437 if gpu_count > MAX_GPU_COUNT:
10438 log_status = "warning"
10439 log_text = _m(
10440 f"GPU count({gpu_count}) is greater than the maximum allowed ({MAX_GPU_COUNT}).",
10441 extra=get_extra_info(default_extra),
10442 )
10443 logger.warning(log_text)
10444
10445 await self.clear_remote_directory(ssh_client, remote_dir)
10446 await self.clear_verified_job_count(executor_info)
10447
10448 return (
10449 machine_spec,
10450 executor_info,
10451 0,
10452 0,
10453 miner_info.job_batch_id,
10454 log_status,
10455 log_text,
10456 )
10457
10458 if max_score == 0 or gpu_count == 0 or len(gpu_details) != gpu_count:
10459 extra_info = {
10460 **default_extra,
10461 "os_version": machine_spec.get("os", ""),
10462 "nvidia_cfg": machine_spec.get("nvidia_cfg", ""),
10463 "docker_cfg": machine_spec.get("docker_cfg", ""),
10464 "gpu_scrape_error": machine_spec.get("gpu_scrape_error", ""),
10465 "nvidia_cfg_scrape_error": machine_spec.get("nvidia_cfg_scrape_error", ""),
10466 "docker_cfg_scrape_error": machine_spec.get("docker_cfg_scrape_error", ""),
10467 }
10468 if gpu_model:
10469 extra_info["gpu_model"] = gpu_model
10470 extra_info["help_text"] = (
10471                            "If you have a GPU machine and are encountering this issue consistently, "
10472 "then please pull the latest version of github repository and follow the installation guide here: "
10473 "https://github.com/Datura-ai/compute-subnet/tree/main/neurons/executor. "
10474 "Also, please configure the nvidia-container-runtime correctly. Check out here: "
10475 "https://stackoverflow.com/questions/72932940/failed-to-initialize-nvml-unknown-error-in-docker-after-few-hours "
10476 "https://bobcares.com/blog/docker-failed-to-initialize-nvml-unknown-error/"
10477 )
10478
10479 log_text = _m(
10480 f"Max Score({max_score}) or GPU count({gpu_count}) is 0. No need to run job.",
10481 extra=get_extra_info(
10482 {
10483 **default_extra,
10484 **extra_info,
10485 }
10486 ),
10487 )
10488 log_status = "warning"
10489 logger.warning(log_text)
10490
10491 await self.clear_remote_directory(ssh_client, remote_dir)
10492 await self.clear_verified_job_count(executor_info)
10493
10494 return (
10495 machine_spec,
10496 executor_info,
10497 0,
10498 0,
10499 miner_info.job_batch_id,
10500 log_status,
10501 log_text,
10502 )
10503
10504 if not docker_version or DOCKER_DIGESTS.get(docker_version) != docker_digest:
10505 log_status = "warning"
10506 log_text = _m(
10507 "Docker is altered",
10508 extra=get_extra_info(
10509 {
10510 **default_extra,
10511 "docker_version": docker_version,
10512 "docker_digest": docker_digest,
10513 }
10514 ),
10515 )
10516 logger.warning(log_text)
10517
10518 await self.clear_remote_directory(ssh_client, remote_dir)
10519 await self.clear_verified_job_count(executor_info)
10520
10521 return (
10522 machine_spec,
10523 executor_info,
10524 0,
10525 0,
10526 miner_info.job_batch_id,
10527 log_status,
10528 log_text,
10529 )
10530
10531 if nvidia_driver and LIB_NVIDIA_ML_DIGESTS.get(nvidia_driver) != libnvidia_ml:
10532 log_status = "warning"
10533 log_text = _m(
10534 "Nvidia driver is altered",
10535 extra=get_extra_info(
10536 {
10537 **default_extra,
10538 "gpu_model": gpu_model,
10539 "gpu_count": gpu_count,
10540 "nvidia_driver": nvidia_driver,
10541 "libnvidia_ml": libnvidia_ml,
10542 }
10543 ),
10544 )
10545 logger.warning(log_text)
10546
10547 await self.clear_remote_directory(ssh_client, remote_dir)
10548 await self.clear_verified_job_count(executor_info)
10549
10550 return (
10551 machine_spec,
10552 executor_info,
10553 0,
10554 0,
10555 miner_info.job_batch_id,
10556 log_status,
10557 log_text,
10558 )
10559
10560 for process in gpu_processes:
10561 container_name = process.get('container_name', None)
10562 if not container_name:
10563 log_status = "warning"
10564 log_text = _m(
10565                        "GPU is in use by another process",
10566 extra=get_extra_info(
10567 {
10568 **default_extra,
10569 "gpu_model": gpu_model,
10570 "gpu_count": gpu_count,
10571 **process,
10572 }
10573 ),
10574 )
10575 logger.warning(log_text)
10576
10577 await self.clear_remote_directory(ssh_client, remote_dir)
10578 await self.clear_verified_job_count(executor_info)
10579
10580 return (
10581 machine_spec,
10582 executor_info,
10583 0,
10584 0,
10585 miner_info.job_batch_id,
10586 log_status,
10587 log_text,
10588 )
10589
10590 # if ram < vram * 0.9 or storage < vram * 1.5:
10591 # log_status = "warning"
10592 # log_text = _m(
10593 # "Incorrect vram",
10594 # extra=get_extra_info(
10595 # {
10596 # **default_extra,
10597 # "gpu_model": gpu_model,
10598 # "gpu_count": gpu_count,
10599 # "memory": ram,
10600 # "vram": vram,
10601 # "storage": storage,
10602 # "nvidia_driver": nvidia_driver,
10603 # "libnvidia_ml": libnvidia_ml,
10604 # }
10605 # ),
10606 # )
10607 # logger.warning(log_text)
10608
10609 # await self.clear_remote_directory(ssh_client, remote_dir)
10610 # await self.clear_verified_job_count(executor_info)
10611
10612 # return (
10613 # machine_spec,
10614 # executor_info,
10615 # 0,
10616 # 0,
10617 # miner_info.job_batch_id,
10618 # log_status,
10619 # log_text,
10620 # )
10621
10622 logger.info(
10623 _m(
10624 f"Got GPU specs: {gpu_model} with max score: {max_score}",
10625 extra=get_extra_info(default_extra),
10626 ),
10627 )
10628
10629 # check duplicated
10630 is_duplicated = await self.redis_service.is_elem_exists_in_set(
10631 DUPLICATED_MACHINE_SET, f"{miner_info.miner_hotkey}:{executor_info.uuid}"
10632 )
10633 if is_duplicated:
10634 log_status = "warning"
10635 log_text = _m(
10636 f"Executor is duplicated",
10637 extra=get_extra_info(default_extra),
10638 )
10639 logger.warning(log_text)
10640
10641 await self.clear_remote_directory(ssh_client, remote_dir)
10642 await self.clear_verified_job_count(executor_info)
10643
10644 return (
10645 machine_spec,
10646 executor_info,
10647 0,
10648 0,
10649 miner_info.job_batch_id,
10650 log_status,
10651 log_text,
10652 )
10653
10654 # check rented status
10655 is_rented = await self.redis_service.is_elem_exists_in_set(
10656 RENTED_MACHINE_SET, f"{miner_info.miner_hotkey}:{executor_info.uuid}"
10657 )
10658 if is_rented:
10659 score = max_score * gpu_count
10660 log_text = _m(
10661 "Executor is already rented.",
10662 extra=get_extra_info({**default_extra, "score": score}),
10663 )
10664 log_status = "info"
10665 logger.info(log_text)
10666
10667 await self.clear_remote_directory(ssh_client, remote_dir)
10668
10669 return (
10670 machine_spec,
10671 executor_info,
10672 score,
10673 0,
10674 miner_info.job_batch_id,
10675 log_status,
10676 log_text,
10677 )
10678 else:
10679 # check gpu usages
10680 for detail in gpu_details:
10681 gpu_utilization = detail.get("gpu_utilization", GPU_UTILIZATION_LIMIT)
10682 gpu_memory_utilization = detail.get("memory_utilization", GPU_MEMORY_UTILIZATION_LIMIT)
10683 if gpu_utilization >= GPU_UTILIZATION_LIMIT or gpu_memory_utilization > GPU_MEMORY_UTILIZATION_LIMIT:
10684 log_status = "warning"
10685 log_text = _m(
10686 f"High gpu utilization detected:",
10687 extra=get_extra_info({
10688 **default_extra,
10689 "gpu_utilization": gpu_utilization,
10690 "gpu_memory_utilization": gpu_memory_utilization,
10691 }),
10692 )
10693 logger.warning(log_text)
10694
10695 await self.clear_remote_directory(ssh_client, remote_dir)
10696 await self.clear_verified_job_count(executor_info)
10697
10698 return (
10699 machine_spec,
10700 executor_info,
10701 0,
10702 0,
10703 miner_info.job_batch_id,
10704 log_status,
10705 log_text,
10706 )
10707
10708 # if not rented, check renting ports
10709 success, log_text, log_status = await self.docker_connection_check(
10710 ssh_client=ssh_client,
10711 job_batch_id=miner_info.job_batch_id,
10712 miner_hotkey=miner_info.miner_hotkey,
10713 executor_info=executor_info,
10714 private_key=private_key,
10715 public_key=public_key,
10716 )
10717 if not success:
10718 await self.clear_remote_directory(ssh_client, remote_dir)
10719 await self.clear_verified_job_count(executor_info)
10720
10721 return (
10722 None,
10723 executor_info,
10724 0,
10725 0,
10726 miner_info.job_batch_id,
10727 log_status,
10728 log_text,
10729 )
10730
10731 # if not rented, check docker digests
10732 docker_digests = machine_spec.get("docker", {}).get("containers", [])
10733 is_docker_valid = self.validate_digests(docker_digests, docker_hub_digests)
10734 if not is_docker_valid:
10735 log_text = _m(
10736 "Docker digests are not valid",
10737 extra=get_extra_info(
10738 {**default_extra, "docker_digests": docker_digests}
10739 ),
10740 )
10741 log_status = "error"
10742
10743 logger.warning(log_text)
10744
10745 await self.clear_remote_directory(ssh_client, remote_dir)
10746 await self.clear_verified_job_count(executor_info)
10747
10748 return (
10749 None,
10750 executor_info,
10751 0,
10752 0,
10753 miner_info.job_batch_id,
10754 log_status,
10755 log_text,
10756 )
10757
10758 # scoring
10759                    hashcat_config = HASHCAT_CONFIGS.get(gpu_model)
10760 if not hashcat_config:
10761 log_text = _m(
10762 "No config for hashcat",
10763 extra=get_extra_info(default_extra),
10764 )
10765 log_status = "error"
10766
10767 logger.warning(log_text)
10768
10769 await self.clear_remote_directory(ssh_client, remote_dir)
10770 await self.clear_verified_job_count(executor_info)
10771
10772 return (
10773 None,
10774 executor_info,
10775 0,
10776 0,
10777 miner_info.job_batch_id,
10778 log_status,
10779 log_text,
10780 )
10781
10782 num_digits = hashcat_config.get("digits", 11)
10783 avg_job_time = (
10784 hashcat_config.get("average_time")[gpu_count - 1 if gpu_count <= 8 else 7]
10785 if hashcat_config.get("average_time")
10786 else 60
10787 )
10788 hash_service = HashService.generate(
10789 gpu_count=gpu_count, num_digits=num_digits, timeout=int(avg_job_time * 2.5)
10790 )
10791 start_time = time.time()
10792
10793 results, err = await self._run_task(
10794 ssh_client=ssh_client,
10795 miner_hotkey=miner_info.miner_hotkey,
10796 executor_info=executor_info,
10797 command=f"export PYTHONPATH={executor_info.root_dir}:$PYTHONPATH && {executor_info.python_path} {remote_score_file_path} '{hash_service.payload}'",
10798 )
10799 if not results:
10800 log_text = _m(
10801 "No result from training job task.",
10802 extra=get_extra_info({
10803 **default_extra,
10804 "error": str(err)
10805 }),
10806 )
10807 log_status = "warning"
10808 logger.warning(log_text)
10809
10810 await self.clear_remote_directory(ssh_client, remote_dir)
10811 await self.clear_verified_job_count(executor_info)
10812
10813 return (
10814 machine_spec,
10815 executor_info,
10816 0,
10817 0,
10818 miner_info.job_batch_id,
10819 log_status,
10820 log_text,
10821 )
10822
10823 end_time = time.time()
10824 job_taken_time = end_time - start_time
10825
10826 result = json.loads(results[0])
10827 answer = result["answer"]
10828
10829 score = 0
10830
10831 logger.info(
10832 _m(
10833 f"Results from training job task: {str(result)}",
10834 extra=get_extra_info(default_extra),
10835 ),
10836 )
10837 log_text = ""
10838 log_status = ""
10839
10840 if err is not None:
10841 log_status = "error"
10842 log_text = _m(
10843 f"Error executing task on executor: {err}",
10844 extra=get_extra_info(default_extra),
10845 )
10846 logger.error(log_text)
10847
10848 await self.clear_remote_directory(ssh_client, remote_dir)
10849 await self.clear_verified_job_count(executor_info)
10850
10851 return (
10852 machine_spec,
10853 executor_info,
10854 0,
10855 0,
10856 miner_info.job_batch_id,
10857 log_status,
10858 log_text,
10859 )
10860
10861 elif answer != hash_service.answer:
10862 log_status = "error"
10863 log_text = _m(
10864                    "Incorrect hashcat answer",
10865 extra=get_extra_info({**default_extra, "answer": answer, "hash_service_answer": hash_service.answer}),
10866 )
10867 logger.error(log_text)
10868
10869 await self.clear_remote_directory(ssh_client, remote_dir)
10870 await self.clear_verified_job_count(executor_info)
10871
10872 return (
10873 machine_spec,
10874 executor_info,
10875 0,
10876 0,
10877 miner_info.job_batch_id,
10878 log_status,
10879 log_text,
10880 )
10881
10882 # elif job_taken_time > avg_job_time * 2:
10883 # log_status = "error"
10884 # log_text = _m(
10885 # f"Incorrect Answer",
10886 # extra=get_extra_info(default_extra),
10887 # )
10888 # logger.error(log_text)
10889
10890 else:
10891 verified_job_count = await self.redis_service.get_verified_job_count(executor_info.uuid)
10892 verified_job_count += 1
10893
10894 logger.info(
10895 _m(
10896 "Job taken time for executor",
10897 extra=get_extra_info({
10898 **default_extra,
10899 "job_taken_time": job_taken_time,
10900 "verified_job_count": verified_job_count,
10901 }),
10902 ),
10903 )
10904
10905 upload_speed = machine_spec.get("network", {}).get("upload_speed", 0)
10906 download_speed = machine_spec.get("network", {}).get("download_speed", 0)
10907
10908 # Ensure upload_speed and download_speed are not None
10909 upload_speed = upload_speed if upload_speed is not None else 0
10910 download_speed = download_speed if download_speed is not None else 0
10911
10912 job_taken_score = (
10913 min(avg_job_time * 0.7 / job_taken_time, 1) if job_taken_time > 0 else 0
10914 )
10915 upload_speed_score = min(upload_speed / MAX_UPLOAD_SPEED, 1)
10916 download_speed_score = min(download_speed / MAX_DOWNLOAD_SPEED, 1)
10917
10918 score = (
10919 max_score
10920 * gpu_count
10921 * UNRENTED_MULTIPLIER
10922 * (
10923 job_taken_score * JOB_TAKEN_TIME_WEIGHT
10924 + upload_speed_score * UPLOAD_SPEED_WEIGHT
10925 + download_speed_score * DOWNLOAD_SPEED_WEIGHT
10926 )
10927 )
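            # Editor's note: a worked example of the score formula above, using hypothetical
            # constants max_score=1.0, UNRENTED_MULTIPLIER=1.0, JOB_TAKEN_TIME_WEIGHT=0.5,
            # UPLOAD_SPEED_WEIGHT=0.25, DOWNLOAD_SPEED_WEIGHT=0.25 (the real values live in the
            # validator's settings/constants, not here). For gpu_count=2, job_taken_score=0.8,
            # upload_speed_score=1.0, download_speed_score=0.5:
            #   score = 1.0 * 2 * 1.0 * (0.8*0.5 + 1.0*0.25 + 0.5*0.25) = 2 * 0.775 = 1.55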
10928
10929 log_status = "info"
10930 log_text = _m(
10931 "Train task finished",
10932 extra=get_extra_info(
10933 {
10934 **default_extra,
10935 "job_score": score,
10936                        "actual_score": score if verified_job_count >= VERIFY_JOB_REQUIRED_COUNT else 0,
10937 "job_taken_time": job_taken_time,
10938 "upload_speed": upload_speed,
10939 "download_speed": download_speed,
10940 "gpu_model": gpu_model,
10941 "gpu_count": gpu_count,
10942 "verified_job_count": verified_job_count,
10943 "remaining_jobs_before_emission": 0 if verified_job_count >= VERIFY_JOB_REQUIRED_COUNT else VERIFY_JOB_REQUIRED_COUNT - verified_job_count,
10944 "unrented_multiplier": UNRENTED_MULTIPLIER,
10945 }
10946 ),
10947 )
10948
10949 logger.info(log_text)
10950
10951 logger.info(
10952 _m(
10953 "SSH connection closed for executor",
10954 extra=get_extra_info(default_extra),
10955 ),
10956 )
10957
10958 await self.clear_remote_directory(ssh_client, remote_dir)
10959 await self.redis_service.set_verified_job_count(executor_info.uuid, verified_job_count)
10960
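            # Editor's note: emission gating. The computed score is always reported as the job
            # score, but it only counts toward emissions once this executor has accumulated
            # VERIFY_JOB_REQUIRED_COUNT verified jobs in Redis; until then the first score
            # element returned below is 0 while the verified-job counter keeps growing.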
10961 if verified_job_count >= VERIFY_JOB_REQUIRED_COUNT:
10962 return (
10963 machine_spec,
10964 executor_info,
10965 score,
10966 score,
10967 miner_info.job_batch_id,
10968 log_status,
10969 log_text,
10970 )
10971 else:
10972 return (
10973 machine_spec,
10974 executor_info,
10975 0,
10976 score,
10977 miner_info.job_batch_id,
10978 log_status,
10979 log_text,
10980 )
10981 except Exception as e:
10982 log_status = "error"
10983 log_text = _m(
10984 "Error creating task for executor",
10985 extra=get_extra_info({**default_extra, "error": str(e)}),
10986 )
10987
10988 try:
10989 await self.clear_verified_job_count(executor_info)
10990
10991 key = f"{AVAILABLE_PORT_MAPS_PREFIX}:{miner_info.miner_hotkey}:{executor_info.uuid}"
10992 await self.redis_service.delete(key)
10993 except Exception as redis_error:
10994 log_text = _m(
10995 "Error creating task redis_reset_error",
10996 extra=get_extra_info(
10997 {
10998 **default_extra,
10999 "error": str(e),
11000 "redis_reset_error": str(redis_error),
11001 }
11002 ),
11003 )
11004
11005 logger.error(
11006 log_text,
11007 exc_info=True,
11008 )
11009
11010 return (
11011 None,
11012 executor_info,
11013 0,
11014 0,
11015 miner_info.job_batch_id,
11016 log_status,
11017 log_text,
11018 )
11019
11020 async def _run_task(
11021 self,
11022 ssh_client: asyncssh.SSHClientConnection,
11023 miner_hotkey: str,
11024 executor_info: ExecutorSSHInfo,
11025 command: str,
11026 timeout: int = JOB_LENGTH,
11027 ) -> tuple[list[str] | None, str | None]:
11028 try:
11029 executor_name = f"{executor_info.uuid}_{executor_info.address}_{executor_info.port}"
11030 default_extra = {
11031 "executor_uuid": executor_info.uuid,
11032 "executor_ip_address": executor_info.address,
11033 "executor_port": executor_info.port,
11034 "miner_hotkey": miner_hotkey,
11035 "command": command[:100] + ("..." if len(command) > 100 else ""),
11036 "version": settings.VERSION,
11037 }
11038 context.set(f"[_run_task][{executor_name}]")
11039 logger.info(
11040 _m(
11041 "Running task for executor",
11042 extra=default_extra,
11043 ),
11044 )
11045 result = await ssh_client.run(command, timeout=timeout)
11046 results = result.stdout.splitlines()
11047 errors = result.stderr.splitlines()
11048
11049            actual_errors = [error for error in errors if "warning" not in error.lower()]  # drop warning lines from stderr
11050
11051 if len(results) == 0 and len(actual_errors) > 0:
11052 logger.error(_m("Failed to execute command!", extra=get_extra_info({**default_extra, "errors": actual_errors})))
11053 raise Exception("Failed to execute command!")
11054
11055 return results, None
11056 except Exception as e:
11057 logger.error(
11058                _m("Error running task on executor", extra=get_extra_info(default_extra)),
11059 exc_info=True,
11060 )
11061
11062 return None, str(e)
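        # Editor's note: an illustrative (hypothetical) call of _run_task from other service
        # code; the command string is only an example and is not part of the audited source:
        #
        #     results, err = await self._run_task(
        #         ssh_client=ssh_client,
        #         miner_hotkey=miner_info.miner_hotkey,
        #         executor_info=executor_info,
        #         command="nvidia-smi --query-gpu=name --format=csv,noheader",
        #     )
        #     if err is None:
        #         gpu_names = results  # stdout lines from the remote executor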
11063
11064
11065TaskServiceDep = Annotated[TaskService, Depends(TaskService)]
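# Editor's note: TaskServiceDep is a FastAPI dependency alias. A hypothetical route (not in
# the audited source) would consume it like this, letting FastAPI construct TaskService:
#
#     from fastapi import APIRouter
#
#     router = APIRouter()
#
#     @router.post("/debug/create-task")  # hypothetical path
#     async def debug_create_task(task_service: TaskServiceDep):
#         ...  # task_service is an injected TaskService instance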
11066
11067
11068
11069---
11070File: /neurons/validators/src/cli.py
11071---
11072
11073import asyncio
11074import logging
11075import random
11076import time
11077import uuid
11078
11079import click
11080from datura.requests.miner_requests import ExecutorSSHInfo
11081
11082from core.utils import configure_logs_of_other_modules
11083from core.validator import Validator
11084from services.ioc import ioc
11085from services.miner_service import MinerService
11086from services.docker_service import DockerService, REPOSITORIES
11087from services.file_encrypt_service import FileEncryptService
11088from payload_models.payloads import (
11089 MinerJobRequestPayload,
11090 ContainerCreateRequest,
11091 CustomOptions,
11092)
11093
11094configure_logs_of_other_modules()
11095logger = logging.getLogger(__name__)
11096
11097
11098@click.group()
11099def cli():
11100 pass
11101
11102
11103@cli.command()
11104@click.option("--miner_hotkey", prompt="Miner Hotkey", help="Hotkey of Miner")
11105@click.option("--miner_address", prompt="Miner Address", help="Miner IP Address")
11106@click.option("--miner_port", type=int, prompt="Miner Port", help="Miner Port")
11107def debug_send_job_to_miner(miner_hotkey: str, miner_address: str, miner_port: int):
11108 """Debug sending job to miner"""
11109 miner = type("Miner", (object,), {})()
11110 miner.hotkey = miner_hotkey
11111 miner.axon_info = type("AxonInfo", (object,), {})()
11112 miner.axon_info.ip = miner_address
11113 miner.axon_info.port = miner_port
11114 validator = Validator(debug_miner=miner)
11115 asyncio.run(validator.sync())
11116
11117
11118def generate_random_ip():
11119 return ".".join(str(random.randint(0, 255)) for _ in range(4))
11120
11121
11122@cli.command()
11123def debug_send_machine_specs_to_connector():
11124 """Debug sending machine specs to connector"""
11125 miner_service: MinerService = ioc["MinerService"]
11126 counter = 0
11127
11128 while counter < 10:
11129 counter += 1
11130 debug_specs = {
11131 "gpu": {
11132 "count": 1,
11133 "details": [
11134 {
11135 "name": "NVIDIA RTX A5000",
11136 "driver": "555.42.06",
11137 "capacity": "24564",
11138 "cuda": "8.6",
11139 "power_limit": "230.00",
11140 "graphics_speed": "435",
11141 "memory_speed": "5000",
11142 "pcei": "16",
11143 }
11144 ],
11145 },
11146 "cpu": {"count": 128, "model": "AMD EPYC 7452 32-Core Processor", "clocks": []},
11147 "ram": {
11148 "available": 491930408,
11149 "free": 131653212,
11150 "total": 528012784,
11151 "used": 396359572,
11152 },
11153 "hard_disk": {"total": 20971520, "used": 13962880, "free": 7008640},
11154 "os": "Ubuntu 22.04.4 LTS",
11155 }
11156 asyncio.run(
11157 miner_service.publish_machine_specs(
11158 results=[
11159 (
11160 debug_specs,
11161 ExecutorSSHInfo(
11162 uuid=str(uuid.uuid4()),
11163 address=generate_random_ip(),
11164 port="8001",
11165 ssh_username="test",
11166 ssh_port=22,
11167 python_path="test",
11168 root_dir="test",
11169 ),
11170 )
11171 ],
11172 miner_hotkey="5Cco1xUS8kXuaCzAHAXZ36nr6mLzmY5B9ufxrfb8Q3HB6ZdN",
11173 )
11174 )
11175
11176 asyncio.run(
11177 miner_service.publish_machine_specs(
11178 results=[
11179 (
11180 debug_specs,
11181 ExecutorSSHInfo(
11182 uuid=str(uuid.uuid4()),
11183 address=generate_random_ip(),
11184 port="8001",
11185 ssh_username="test",
11186 ssh_port=22,
11187 python_path="test",
11188 root_dir="test",
11189 ),
11190 )
11191 ],
11192 miner_hotkey="5Cco1xUS8kXuaCzAHAXZ36nr6mLzmY5B9ufxrfb8Q3HB6ZdN",
11193 )
11194 )
11195
11196 time.sleep(2)
11197
11198
11199@cli.command()
11200def debug_set_weights():
11201 """Debug setting weights"""
11202 validator = Validator()
11203 subtensor = validator.get_subtensor()
11204 # fetch miners
11205 miners = validator.fetch_miners(subtensor)
11206 asyncio.run(validator.set_weights(miners=miners, subtensor=subtensor))
11207
11208
11209@cli.command()
11210@click.option("--miner_hotkey", prompt="Miner Hotkey", help="Hotkey of Miner")
11211@click.option("--miner_address", prompt="Miner Address", help="Miner IP Address")
11212@click.option("--miner_port", type=int, prompt="Miner Port", help="Miner Port")
11213def request_job_to_miner(miner_hotkey: str, miner_address: str, miner_port: int):
11214 asyncio.run(_request_job_to_miner(miner_hotkey, miner_address, miner_port))
11215
11216
11217async def _request_job_to_miner(miner_hotkey: str, miner_address: str, miner_port: int):
11218 miner_service: MinerService = ioc["MinerService"]
11219 docker_service: DockerService = ioc["DockerService"]
11220 file_encrypt_service: FileEncryptService = ioc["FileEncryptService"]
11221
11222 docker_hub_digests = await docker_service.get_docker_hub_digests(REPOSITORIES)
11223 encypted_files = file_encrypt_service.ecrypt_miner_job_files()
11224
11225 await miner_service.request_job_to_miner(
11226 MinerJobRequestPayload(
11227 job_batch_id='job_batch_id',
11228 miner_hotkey=miner_hotkey,
11229 miner_address=miner_address,
11230 miner_port=miner_port,
11231 ),
11232 encypted_files=encypted_files,
11233 docker_hub_digests=docker_hub_digests,
11234 )
11235
11236@cli.command()
11237@click.option("--miner_hotkey", prompt="Miner Hotkey", help="Hotkey of Miner")
11238@click.option("--miner_address", prompt="Miner Address", help="Miner IP Address")
11239@click.option("--miner_port", type=int, prompt="Miner Port", help="Miner Port")
11240@click.option("--executor_id", prompt="Executor Id", help="Executor Id")
11241@click.option("--docker_image", prompt="Docker Image", help="Docker Image")
11242def create_container_to_miner(miner_hotkey: str, miner_address: str, miner_port: int, executor_id: str, docker_image: str):
11243 asyncio.run(_create_container_to_miner(miner_hotkey, miner_address, miner_port, executor_id, docker_image))
11244
11245
11246async def _create_container_to_miner(miner_hotkey: str, miner_address: str, miner_port: int, executor_id: str, docker_image: str):
11247 miner_service: MinerService = ioc["MinerService"]
11248
11249 payload = ContainerCreateRequest(
11250 docker_image=docker_image,
11251 user_public_key="user_public_key",
11252 executor_id=executor_id,
11253 miner_hotkey=miner_hotkey,
11254 miner_address=miner_address,
11255 miner_port=miner_port,
11256 )
11257 await miner_service.handle_container(payload)
11258
11259@cli.command()
11260@click.option("--miner_hotkey", prompt="Miner Hotkey", help="Hotkey of Miner")
11261@click.option("--miner_address", prompt="Miner Address", help="Miner IP Address")
11262@click.option("--miner_port", type=int, prompt="Miner Port", help="Miner Port")
11263@click.option("--executor_id", prompt="Executor Id", help="Executor Id")
11264@click.option("--docker_image", prompt="Docker Image", help="Docker Image")
11265def create_custom_container_to_miner(miner_hotkey: str, miner_address: str, miner_port: int, executor_id: str, docker_image: str):
11266 asyncio.run(_create_custom_container_to_miner(miner_hotkey, miner_address, miner_port, executor_id, docker_image))
11267
11268
11269async def _create_custom_container_to_miner(miner_hotkey: str, miner_address: str, miner_port: int, executor_id: str, docker_image: str):
11270 miner_service: MinerService = ioc["MinerService"]
11271 # mock custom options
11272 custom_options = CustomOptions(
11273        volumes=["/var/run/docker.sock:/var/run/docker.sock"],
11274 environment={"UPDATED_PUBLIC_KEY":"user_public_key"},
11275 entrypoint="",
11276 internal_ports=[22, 8002],
11277 startup_commands="/bin/bash -c 'apt-get update && apt-get install -y ffmpeg && pip install opencv-python'",
11278 )
11279 payload = ContainerCreateRequest(
11280 docker_image=docker_image,
11281 user_public_key="user_public_key",
11282 executor_id=executor_id,
11283 miner_hotkey=miner_hotkey,
11284 miner_address=miner_address,
11285 miner_port=miner_port,
11286 custom_options=custom_options
11287 )
11288 await miner_service.handle_container(payload)
11289
11290if __name__ == "__main__":
11291 cli()
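# Editor's note: illustrative invocations of this CLI (argument values are placeholders):
#
#     python src/cli.py debug_send_job_to_miner --miner_hotkey <HOTKEY> --miner_address 1.2.3.4 --miner_port 8000
#     python src/cli.py request_job_to_miner --miner_hotkey <HOTKEY> --miner_address 1.2.3.4 --miner_port 8000
#
# Depending on the installed Click version, the commands may instead be exposed with dashes
# (e.g. `debug-send-job-to-miner`).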
11292
11293
11294
11295---
11296File: /neurons/validators/src/connector.py
11297---
11298
11299import asyncio
11300import time
11301
11302from clients.compute_client import ComputeClient
11303
11304from core.config import settings
11305from core.utils import get_logger, wait_for_services_sync
11306from services.ioc import ioc
11307
11308logger = get_logger(__name__)
11309wait_for_services_sync()
11310
11311
11312async def run_forever():
11313 logger.info("Compute app connector started")
11314 keypair = settings.get_bittensor_wallet().get_hotkey()
11315 compute_app_client = ComputeClient(
11316 keypair, f"{settings.COMPUTE_APP_URI}/validator/{keypair.ss58_address}", ioc["MinerService"]
11317 )
11318 async with compute_app_client:
11319 await compute_app_client.run_forever()
11320
11321
11322def start_process():
11323 while True:
11324 try:
11325 loop = asyncio.new_event_loop()
11326 asyncio.set_event_loop(loop)
11327 loop.run_until_complete(run_forever())
11328 except Exception as e:
11329 logger.error(f"Compute app connector crashed: {e}", exc_info=True)
11330 time.sleep(1)
11331
11332
11333if __name__ == "__main__":
11334 start_process()
11335
11336# def start_connector_process():
11337# p = multiprocessing.Process(target=start_process)
11338# p.start()
11339# return p
11340
11341
11342
11343---
11344File: /neurons/validators/src/job.py
11345---
11346
11347import time
11348import random
11349
11350start_time = time.time()
11351
11352wait_time = random.uniform(10, 30)
11353time.sleep(wait_time)
11354
11355# print("Job finished")
11356print(time.time() - start_time)
11357
11358
11359---
11360File: /neurons/validators/src/test_validator.py
11361---
11362
11363import asyncio
11364import bittensor
11365
11366from core.config import settings
11367from fastapi.testclient import TestClient
11368from concurrent.futures import ThreadPoolExecutor, as_completed
11369from services.docker_service import DockerService
11370from services.ioc import ioc
11371
11372from validator import app
11373
11374client = TestClient(app)
11375
11376
11377def send_post_request():
11378 response = client.post(
11379 "/miner_request",
11380 json={
11381 "miner_hotkey": "5EHgHZBfx4ZwU7GzGCS8VCMBLBEKo5eaCvXKiu6SASwWT6UY",
11382 "miner_address": "localhost",
11383 "miner_port": 8000
11384 },
11385 )
11386 assert response.status_code == 200
11387
11388
11389def test_socket_connections():
11390 num_requests = 10 # Number of simultaneous requests
11391 with ThreadPoolExecutor(max_workers=num_requests) as executor:
11392 futures = [executor.submit(send_post_request) for _ in range(num_requests)]
11393
11394 for future in as_completed(futures):
11395            future.result()  # surfaces any exception or failed assertion from send_post_request
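# Editor's note: send_post_request/test_socket_connections assume a miner endpoint reachable
# at localhost:8000 behind the validator app. One way to run only this module (assuming
# pytest is installed) is:
#
#     pytest neurons/validators/src/test_validator.py -k socket
#
# The file can also be executed directly, in which case the __main__ block below runs instead.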
11397
11398
11399async def check_docker_port_mappings():
11400 docker_service: DockerService = ioc["DockerService"]
11401 miner_hotkey = '5Df8qGLMd19BXByefGCZFN57fWv6jDm5hUbnQeUTu2iqNBhT'
11402 executor_id = 'c272060f-8eae-4265-8e26-1d83ac96b498'
11403 port_mappings = await docker_service.generate_portMappings(miner_hotkey, executor_id)
11404 print('port_mappings ==>', port_mappings)
11405
11406if __name__ == "__main__":
11407 # test_socket_connections()
11408 asyncio.run(check_docker_port_mappings())
11409
11410 config = settings.get_bittensor_config()
11411 subtensor = bittensor.subtensor(config=config)
11412 node = subtensor.substrate
11413
11414 netuid = settings.BITTENSOR_NETUID
11415 tempo = subtensor.tempo(netuid)
11416 weights_rate_limit = node.query("SubtensorModule", "WeightsSetRateLimit", [netuid]).value
11418 serving_rate_limit = node.query("SubtensorModule", "ServingRateLimit", [netuid]).value
11419 print('rate limit ===>', tempo, weights_rate_limit, serving_rate_limit)
11420
11421
11422
11423---
11424File: /neurons/validators/src/validator.py
11425---
11426
11427import asyncio
11428import logging
11429
11430import uvicorn
11431from fastapi import FastAPI
11432
11433from core.config import settings
11434from core.utils import configure_logs_of_other_modules, wait_for_services_sync
11435from core.validator import Validator
11436
11437configure_logs_of_other_modules()
11438wait_for_services_sync()
11439
11440
11441async def app_lifespan(app: FastAPI):
11442 if settings.DEBUG:
11443 validator = Validator(debug_miner=settings.get_debug_miner())
11444 else:
11445 validator = Validator()
11446 # Run the miner in the background
11447 task = asyncio.create_task(validator.start())
11448
11449 try:
11450 yield
11451 finally:
11452 await validator.stop() # Ensure proper cleanup
11453 await task # Wait for the background task to complete
11454 logging.info("Validator exited successfully.")
11455
11456
11457app = FastAPI(
11458 title=settings.PROJECT_NAME,
11459 lifespan=app_lifespan,
11460)
11461
11462# app.include_router(apis_router)
11463
11464reload = settings.ENV == "dev"
11465
11466if __name__ == "__main__":
11467 uvicorn.run("validator:app", host="0.0.0.0", port=settings.INTERNAL_PORT, reload=reload)
11468
11469
11470
11471---
11472File: /neurons/validators/tests/__init__.py
11473---
11474
11475
11476
11477
11478---
11479File: /neurons/validators/docker_build.sh
11480---
11481
11482#!/bin/bash
11483set -eux -o pipefail
11484
11485IMAGE_NAME="daturaai/compute-subnet-validator:$TAG"
11486
11487docker build --build-context datura=../../datura -t $IMAGE_NAME .
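# Editor's note: TAG must be set in the environment because this script runs with `set -u`.
# An illustrative invocation (the tag value is a placeholder):
#
#   TAG=latest ./docker_build.sh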
11488
11489
11490---
11491File: /neurons/validators/docker_publish.sh
11492---
11493
11494#!/bin/bash
11495set -eux -o pipefail
11496
11497source ./docker_build.sh
11498
11499echo "$DOCKERHUB_PAT" | docker login -u "$DOCKERHUB_USERNAME" --password-stdin
11500docker push "$IMAGE_NAME"
11501
11502
11503---
11504File: /neurons/validators/docker_runner_build.sh
11505---
11506
11507#!/bin/bash
11508set -eux -o pipefail
11509
11510IMAGE_NAME="daturaai/compute-subnet-validator-runner:$TAG"
11511
11512docker build --file Dockerfile.runner -t $IMAGE_NAME .
11513
11514
11515---
11516File: /neurons/validators/docker_runner_publish.sh
11517---
11518
11519#!/bin/bash
11520set -eux -o pipefail
11521
11522source ./docker_runner_build.sh
11523
11524echo "$DOCKERHUB_PAT" | docker login -u "$DOCKERHUB_USERNAME" --password-stdin
11525docker push "$IMAGE_NAME"
11526
11527
11528---
11529File: /neurons/validators/entrypoint.sh
11530---
11531
11532#!/bin/sh
11533set -eu
11534
11535docker compose up --pull always --detach --wait --force-recreate
11536
11537# Clean docker images
11538docker image prune -f
11539
11540# Remove all Docker images with a name but no tag
11541# docker images --filter "dangling=false" --format "{{.Repository}}:{{.Tag}} {{.ID}}" | grep ":<none>" | awk '{print $2}' | xargs -r docker rmi
11542
11543while true
11544do
11545 docker compose logs -f
11546 echo 'All containers died'
11547 sleep 10
11548done
11549
11550
11551
11552---
11553File: /neurons/validators/README.md
11554---
11555
11556# Validator
11557
11558## System Requirements
11559
11560For validation, a validator machine will need:
11561
11562- **CPU**: 4 cores
11563- **RAM**: 8 GB
11564
11565Ensure that your machine meets these requirements before proceeding with the setup.
11566
11567---
11568
11569First, register your validator on the subnet and regenerate your Bittensor coldkey and validator hotkey on the machine.
11570
11571For installation of btcli, check [this guide](https://github.com/opentensor/bittensor/blob/master/README.md#install-bittensor-sdk)
11572```
11573btcli s register --netuid 51
11574```
11575```
11576btcli w regen_coldkeypub
11577```
11578```
11579btcli w regen_hotkey
11580```
11581
11582## Installation
11583
11584### Using Docker
11585
11586#### Step 1: Clone Git repo
11587
11588```
11589git clone https://github.com/Datura-ai/compute-subnet.git
11590```
11591
11592#### Step 2: Install Required Tools
11593
11594```
11595cd compute-subnet && chmod +x scripts/install_validator_on_ubuntu.sh && ./scripts/install_validator_on_ubuntu.sh
11596```
11597
11598Verify the Docker installation:
11599
11600```
11601docker --version
11602```
11603If Docker did not install correctly, follow [this link](https://docs.docker.com/engine/install/).
11604
11605#### Step 3: Setup ENV
11606```
11607cp neurons/validators/.env.template neurons/validators/.env
11608```
11609
11610Fill in your own values for `BITTENSOR_WALLET_NAME`, `BITTENSOR_WALLET_HOTKEY_NAME`, and `HOST_WALLET_DIR`.
11611If you want, you can use different ports for `INTERNAL_PORT` and `EXTERNAL_PORT`; see the example `.env` below.
11612
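A minimal example `.env` (the values below are placeholders; keep any other variables from the template as they are):

```
BITTENSOR_WALLET_NAME=validator
BITTENSOR_WALLET_HOTKEY_NAME=default
HOST_WALLET_DIR=/home/ubuntu/.bittensor/wallets
INTERNAL_PORT=8000
EXTERNAL_PORT=8000
```
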
11613#### Step 4: Docker Compose Up
11614
11615```
11616cd neurons/validators && docker compose up -d
11617```
11618
11619
11620
11621---
11622File: /neurons/validators/run.sh
11623---
11624
11625#!/bin/sh
11626
11627# db migrate
11628alembic upgrade head
11629
11630# run fastapi app
11631python src/validator.py
11632
11633
11634---
11635File: /neurons/__init__.py
11636---
11637
11638
11639
11640
11641---
11642File: /scripts/check_compatibility.sh
11643---
11644
11645#!/bin/bash
11646
11647if [ -z "$1" ]; then
11648 echo "Please provide a Python version as an argument."
11649 exit 1
11650fi
11651
11652python_version="$1"
11653all_passed=true
11654
11655GREEN='\033[0;32m'
11656YELLOW='\033[0;33m'
11657RED='\033[0;31m'
11658NC='\033[0m' # No Color
11659
11660check_compatibility() {
11661 all_supported=0
11662
11663 while read -r requirement; do
11664 # Skip lines starting with git+
11665 if [[ "$requirement" == git+* ]]; then
11666 continue
11667 fi
11668
11669 package_name=$(echo "$requirement" | awk -F'[!=<>]' '{print $1}' | awk -F'[' '{print $1}') # Strip off brackets
11670 echo -n "Checking $package_name... "
11671
11672 url="https://pypi.org/pypi/$package_name/json"
11673 response=$(curl -s $url)
11674 status_code=$(curl -s -o /dev/null -w "%{http_code}" $url)
11675
11676 if [ "$status_code" != "200" ]; then
11677 echo -e "${RED}Information not available for $package_name. Failure.${NC}"
11678 all_supported=1
11679 continue
11680 fi
11681
11682 classifiers=$(echo "$response" | jq -r '.info.classifiers[]')
11683 requires_python=$(echo "$response" | jq -r '.info.requires_python')
11684
11685 base_version="Programming Language :: Python :: ${python_version%%.*}"
11686 specific_version="Programming Language :: Python :: $python_version"
11687
11688 if echo "$classifiers" | grep -q "$specific_version" || echo "$classifiers" | grep -q "$base_version"; then
11689 echo -e "${GREEN}Supported${NC}"
11690 elif [ "$requires_python" != "null" ]; then
11691 if echo "$requires_python" | grep -Eq "==$python_version|>=$python_version|<=$python_version"; then
11692 echo -e "${GREEN}Supported${NC}"
11693 else
11694 echo -e "${RED}Not compatible with Python $python_version due to constraint $requires_python.${NC}"
11695 all_supported=1
11696 fi
11697 else
11698 echo -e "${YELLOW}Warning: Specific version not listed, assuming compatibility${NC}"
11699 fi
11700 done < requirements.txt
11701
11702 return $all_supported
11703}
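# Editor's note: a worked example of the package-name extraction above. For a requirements
# line such as "torch[cuda]>=2.1.0", the first awk (splitting on !=<>) yields "torch[cuda]"
# and the second awk (splitting on "[") yields package_name="torch" for the PyPI lookup.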
11704
11705echo "Checking compatibility for Python $python_version..."
11706check_compatibility
11707if [ $? -eq 0 ]; then
11708 echo -e "${GREEN}All requirements are compatible with Python $python_version.${NC}"
11709else
11710 echo -e "${RED}All requirements are NOT compatible with Python $python_version.${NC}"
11711 all_passed=false
11712fi
11713
11714echo ""
11715if $all_passed; then
11716 echo -e "${GREEN}All tests passed.${NC}"
11717else
11718 echo -e "${RED}All tests did not pass.${NC}"
11719 exit 1
11720fi
11721
11722
11723
11724---
11725File: /scripts/check_requirements_changes.sh
11726---
11727
11728#!/bin/bash
11729
11730# Check if requirements files have changed in the last commit
11731if git diff --name-only HEAD~1 | grep -E 'requirements.txt'; then
11732 echo "Requirements files have changed. Running compatibility checks..."
11733 echo 'export REQUIREMENTS_CHANGED="true"' >> $BASH_ENV
11734else
11735 echo "Requirements files have not changed. Skipping compatibility checks..."
11736 echo 'export REQUIREMENTS_CHANGED="false"' >> $BASH_ENV
11737fi
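# Editor's note: $BASH_ENV is assumed to be the CI-provided file (e.g. CircleCI's) that is
# sourced before each subsequent step, which is how REQUIREMENTS_CHANGED propagates to
# later steps in the pipeline.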
11738
11739
11740
11741---
11742File: /scripts/install_dev.sh
11743---
11744
11745#!/bin/bash
11746
11747set -u
11748
11749# enable command completion
11750set -o history -o histexpand
11751
11752abort() {
11753 printf "%s\n" "$1"
11754 exit 1
11755}
11756
11757getc() {
11758 local save_state
11759 save_state=$(/bin/stty -g)
11760 /bin/stty raw -echo
11761 IFS= read -r -n 1 -d '' "$@"
11762 /bin/stty "$save_state"
11763}
11764
11765exit_on_error() {
11766 exit_code=$1
11767 last_command=${@:2}
11768 if [ $exit_code -ne 0 ]; then
11769 >&2 echo "\"${last_command}\" command failed with exit code ${exit_code}."
11770 exit $exit_code
11771 fi
11772}
11773
11774shell_join() {
11775 local arg
11776 printf "%s" "$1"
11777 shift
11778 for arg in "$@"; do
11779 printf " "
11780 printf "%s" "${arg// /\ }"
11781 done
11782}
11783
11784# string formatters
11785if [[ -t 1 ]]; then
11786 tty_escape() { printf "\033[%sm" "$1"; }
11787else
11788 tty_escape() { :; }
11789fi
11790tty_mkbold() { tty_escape "1;$1"; }
11791tty_underline="$(tty_escape "4;39")"
11792tty_blue="$(tty_mkbold 34)"
11793tty_red="$(tty_mkbold 31)"
11794tty_bold="$(tty_mkbold 39)"
11795tty_reset="$(tty_escape 0)"
11796
11797ohai() {
11798 printf "${tty_blue}==>${tty_bold} %s${tty_reset}\n" "$(shell_join "$@")"
11799}
11800
11801wait_for_user() {
11802 local c
11803 echo
11804 echo "Press RETURN to continue or any other key to abort"
11805 getc c
11806 # we test for \r and \n because some stuff does \r instead
11807 if ! [[ "$c" == $'\r' || "$c" == $'\n' ]]; then
11808 exit 1
11809 fi
11810}
11811
11812#install pre
11813install_pre() {
11814 sudo apt update
11815 sudo apt install --no-install-recommends --no-install-suggests -y sudo apt-utils curl git cmake build-essential
11816 exit_on_error $?
11817}
11818
11819# check if python is installed, if not install it
11820install_python() {
11821 # Check if python3.11 is installed
11822 if command -v python3.11 &> /dev/null
11823 then
11824 # Check the version
11825 PYTHON_VERSION=$(python3.11 --version 2>&1)
11826 if [[ $PYTHON_VERSION == *"Python 3.11"* ]]; then
11827 ohai "Python 3.11 is already installed."
11828 else
11829 ohai "Linking python to python 3.11"
11830 sudo update-alternatives --install /usr/bin/python python /usr/bin/python3.11 1
11831 python -m pip install cffi
11832 python -m pip install cryptography
11833 fi
11834 else
11835 ohai "Installing Python 3.11"
11836 add-apt-repository ppa:deadsnakes/ppa
11837 apt install python3.11
11838 sudo update-alternatives --install /usr/bin/python python /usr/bin/python3.11 1
11839 python -m pip install cffi
11840 python -m pip install cryptography
11841 fi
11842
11843 # check if PDM is installed
11844 if command -v pdm &> /dev/null
11845 then
11846 ohai "PDM is already installed."
11847 echo "Checking PDM version..."
11848 pdm --version
11849 else
11850 ohai "Installing PDM..."
11851 curl -sSL https://pdm-project.org/install-pdm.py | python3 -
11852
11853 local bashrc_file="/root/.bashrc"
11854 local path_string="export PATH=/root/.local/bin:\$PATH"
11855
11856 if ! grep -Fxq "$path_string" $bashrc_file; then
11857 echo "$path_string" >> $bashrc_file
11858 echo "Added $path_string to $bashrc_file"
11859 else
11860 echo "$path_string already present in $bashrc_file"
11861 fi
11862
11863 export PATH=/root/.local/bin:$PATH
11864
11865 echo "Checking PDM version..."
11866 pdm --version
11867 fi
11868
11869 PROJECT_DIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)
11870 PROJECT_DIR=${PROJECT_DIR}/../
11871 cd ${PROJECT_DIR}
11872
11873 ohai "Installing PDM packages in root folder."
11874 pdm install -d
11875
11876 ohai "Installing pre-commit for the project."
11877 pdm run pre-commit install
11878}
11879
11880
11881
11882ohai "This script will install:"
11883echo "git"
11884echo "curl"
11885echo "python3.11 and pdm"
11886echo "python3-pip"
11887echo "pre-commit with ruff"
11888
11889wait_for_user
11890install_pre
11891install_python
11892
11893
11894---
11895File: /scripts/install_executor_on_ubuntu.sh
11896---
11897
11898#!/bin/bash
11899set -u
11900
11901# enable command completion
11902set -o history -o histexpand
11903
11904abort() {
11905 printf "%s\n" "$1"
11906 exit 1
11907}
11908
11909getc() {
11910 local save_state
11911 save_state=$(/bin/stty -g)
11912 /bin/stty raw -echo
11913 IFS= read -r -n 1 -d '' "$@"
11914 /bin/stty "$save_state"
11915}
11916
11917exit_on_error() {
11918 exit_code=$1
11919 last_command=${@:2}
11920 if [ $exit_code -ne 0 ]; then
11921 >&2 echo "\"${last_command}\" command failed with exit code ${exit_code}."
11922 exit $exit_code
11923 fi
11924}
11925
11926shell_join() {
11927 local arg
11928 printf "%s" "$1"
11929 shift
11930 for arg in "$@"; do
11931 printf " "
11932 printf "%s" "${arg// /\ }"
11933 done
11934}
11935
11936# string formatters
11937if [[ -t 1 ]]; then
11938 tty_escape() { printf "\033[%sm" "$1"; }
11939else
11940 tty_escape() { :; }
11941fi
11942tty_mkbold() { tty_escape "1;$1"; }
11943tty_underline="$(tty_escape "4;39")"
11944tty_blue="$(tty_mkbold 34)"
11945tty_red="$(tty_mkbold 31)"
11946tty_bold="$(tty_mkbold 39)"
11947tty_reset="$(tty_escape 0)"
11948
11949ohai() {
11950 printf "${tty_blue}==>${tty_bold} %s${tty_reset}\n" "$(shell_join "$@")"
11951}
11952
11953wait_for_user() {
11954 local c
11955 echo
11956 echo "Press RETURN to continue or any other key to abort"
11957 getc c
11958 # we test for \r and \n because some stuff does \r instead
11959 if ! [[ "$c" == $'\r' || "$c" == $'\n' ]]; then
11960 exit 1
11961 fi
11962}
11963
11964#install pre
11965install_pre() {
11966 sudo apt update
11967 sudo apt install --no-install-recommends --no-install-suggests -y sudo apt-utils curl git cmake build-essential
11968 exit_on_error $?
11969}
11970
11971# check if python is installed, if not install it
11972install_python() {
11973 # Check if python3.11 is installed
11974 if command -v python3.11 &> /dev/null
11975 then
11976 # Check the version
11977 PYTHON_VERSION=$(python3.11 --version 2>&1)
11978 if [[ $PYTHON_VERSION == *"Python 3.11"* ]]; then
11979 ohai "Python 3.11 is already installed."
11980 else
11981 ohai "Linking python to python 3.11"
11982 sudo update-alternatives --install /usr/bin/python python /usr/bin/python3.11 1
11983 python -m pip install cffi
11984 python -m pip install cryptography
11985 fi
11986 else
11987 ohai "Installing Python 3.11"
11988 add-apt-repository ppa:deadsnakes/ppa
11989 sudo apt install python3.11
11990 sudo update-alternatives --install /usr/bin/python python /usr/bin/python3.11 1
11991 python -m pip install cffi
11992 python -m pip install cryptography
11993 fi
11994
11995 # check if PDM is installed
11996 if command -v pdm &> /dev/null
11997 then
11998 ohai "PDM is already installed."
11999 echo "Checking PDM version..."
12000 pdm --version
12001 else
12002 ohai "Installing PDM..."
12003 sudo apt install -y python3.12-venv
12004 curl -sSL https://pdm-project.org/install-pdm.py | python3 -
12005
12006 local bashrc_file="/root/.bashrc"
12007 local path_string="export PATH=/root/.local/bin:\$PATH"
12008
12009 if ! grep -Fxq "$path_string" $bashrc_file; then
12010 echo "$path_string" >> $bashrc_file
12011 echo "Added $path_string to $bashrc_file"
12012 else
12013 echo "$path_string already present in $bashrc_file"
12014 fi
12015
12016 export PATH=/root/.local/bin:$PATH
12017
12018 echo "Checking PDM version..."
12019 pdm --version
12020 fi
12021}
12022
12023# install redis
12024install_redis() {
12025 if command -v redis-server &> /dev/null
12026 then
12027 ohai "Redis is already installed."
12028 echo "Checking Redis version..."
12029 redis-server --version
12030 else
12031 ohai "Installing Redis..."
12032
12033 sudo apt install -y redis-server
12034
12035 echo "Starting Redis server..."
12036 sudo systemctl start redis-server.service
12037
12038 echo "Checking Redis server status..."
12039 sudo systemctl status redis-server.service
12040 fi
12041}
12042
12043# install postgresql
12044install_postgresql() {
12045 if command -v psql &> /dev/null
12046 then
12047 ohai "PostgreSQL is already installed."
12048 echo "Checking PostgreSQL version..."
12049 psql --version
12050
12051 # Check if the database exists
12052 DB_EXISTS=$(sudo -u postgres psql -tAc "SELECT 1 FROM pg_database WHERE datname='compute_subnet_db'")
12053 if [ "$DB_EXISTS" == "1" ]; then
12054 echo "Database compute_subnet_db already exists."
12055 else
12056 echo "Creating database compute_subnet_db..."
12057 sudo -u postgres createdb compute_subnet_db
12058 fi
12059 else
12060 echo "Installing PostgreSQL..."
12061 sudo apt install -y postgresql postgresql-contrib
12062
12063 echo "Starting PostgreSQL server..."
12064 sudo systemctl start postgresql.service
12065
12066 echo "Setting password for postgres user..."
12067 sudo -u postgres psql -c "ALTER USER postgres PASSWORD 'password';"
12068
12069 echo "Creating database compute_subnet_db..."
12070 sudo -u postgres createdb compute_subnet_db
12071 fi
12072}
12073
12074# install btcli
12075install_btcli() {
12076 if command -v btcli &> /dev/null
12077 then
12078 ohai "BtCLI is already installed."
12079 else
12080 ohai "Installing BtCLI..."
12081
12082 sudo apt install -y pipx
12083 pipx install bittensor
12084 source ~/.bashrc
12085 fi
12086}
12087
12088# install docker
12089install_docker() {
12090 if command -v docker &> /dev/null; then
12091 ohai "Docker is already installed."
12092 return 0
12093 else
12094 ohai "Installing Docker..."
12095 sudo apt-get update -y
12096 sudo apt-get install -y ca-certificates curl
12097 sudo install -m 0755 -d /etc/apt/keyrings
12098 sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
12099 sudo chmod a+r /etc/apt/keyrings/docker.asc
12100
12101 # Add the repository to Apt sources:
12102 echo \
12103 "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
12104 $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
12105 sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
12106 sudo apt-get update -y
12107 sudo apt-get install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
12108 sudo groupadd docker
12109 sudo usermod -aG docker $USER
12110 newgrp docker
12111 fi
12112}
12113
12114ohai "This script will install:"
12115echo "docker"
12116
12117
12118wait_for_user
12119install_pre
12120install_docker
12121
12122
12123
12124---
12125File: /scripts/install_miner_on_runpod.sh
12126---
12127
12128#!/bin/bash
12129set -u
12130
12131# enable command completion
12132set -o history -o histexpand
12133
12134abort() {
12135 printf "%s\n" "$1"
12136 exit 1
12137}
12138
12139getc() {
12140 local save_state
12141 save_state=$(/bin/stty -g)
12142 /bin/stty raw -echo
12143 IFS= read -r -n 1 -d '' "$@"
12144 /bin/stty "$save_state"
12145}
12146
12147exit_on_error() {
12148 exit_code=$1
12149 last_command=${@:2}
12150 if [ $exit_code -ne 0 ]; then
12151 >&2 echo "\"${last_command}\" command failed with exit code ${exit_code}."
12152 exit $exit_code
12153 fi
12154}
12155
12156shell_join() {
12157 local arg
12158 printf "%s" "$1"
12159 shift
12160 for arg in "$@"; do
12161 printf " "
12162 printf "%s" "${arg// /\ }"
12163 done
12164}
12165
12166# string formatters
12167if [[ -t 1 ]]; then
12168 tty_escape() { printf "\033[%sm" "$1"; }
12169else
12170 tty_escape() { :; }
12171fi
12172tty_mkbold() { tty_escape "1;$1"; }
12173tty_underline="$(tty_escape "4;39")"
12174tty_blue="$(tty_mkbold 34)"
12175tty_red="$(tty_mkbold 31)"
12176tty_bold="$(tty_mkbold 39)"
12177tty_reset="$(tty_escape 0)"
12178
12179ohai() {
12180 printf "${tty_blue}==>${tty_bold} %s${tty_reset}\n" "$(shell_join "$@")"
12181}
12182
12183wait_for_user() {
12184 local c
12185 echo
12186 echo "Press Enter to continue or any other key to abort"
12187 getc c
12188 # we test for \r and \n because some stuff does \r instead
12189 if ! [[ "$c" == $'\r' || "$c" == $'\n' ]]; then
12190 exit 1
12191 fi
12192}
12193
12194#install pre
12195install_pre() {
12196 apt update
12197 apt upgrade
12198 apt install --no-install-recommends --no-install-suggests -y apt-utils curl git cmake build-essential nano
12199 exit_on_error $?
12200}
12201
12202# check if python is installed, if not install it
12203install_python() {
12204 # Check if python3.11 is installed
12205 if command -v python3.11 &> /dev/null
12206 then
12207 # Check the version
12208 PYTHON_VERSION=$(python3.11 --version 2>&1)
12209 if [[ $PYTHON_VERSION == *"Python 3.11"* ]]; then
12210 echo "Python 3.11 is already installed."
12211 else
12212 echo "Linking python to python 3.11"
12213 update-alternatives --install /usr/bin/python python /usr/bin/python3.11 1
12214
12215 # Ensure pip is installed
12216 python3.11 -m ensurepip --upgrade
12217
12218 # Install necessary packages
12219 python -m pip install --upgrade pip
12220 pip install cffi
12221 pip install cryptography
12222
12223 # Install bittensor
12224 pip install bittensor
12225 pip install bittensor[torch]
12226 fi
12227 else
12228 ohai "Installing Python 3.11..."
12229 add-apt-repository ppa:deadsnakes/ppa
12230 apt update
12231 apt install -y python3.11 python3.11-venv python3.11-dev
12232
12233 echo "Linking python to python 3.11"
12234 update-alternatives --install /usr/bin/python python /usr/bin/python3.11 1
12235
12236 # Ensure pip is installed
12237 python3.11 -m ensurepip --upgrade
12238
12239 # Install necessary packages
12240 python -m pip install --upgrade pip
12241 pip install cffi
12242 pip install cryptography
12243
12244 # Install bittensor
12245 pip install bittensor
12246 pip install bittensor[torch]
12247 fi
12248
12249 # check if PDM is installed
12250 if command -v pdm &> /dev/null
12251 then
12252 ohai "PDM is already installed."
12253 echo "Checking PDM version..."
12254 pdm --version
12255 else
12256 ohai "Installing PDM..."
12257 curl -sSL https://pdm-project.org/install-pdm.py | python3 -
12258
12259 local bashrc_file="$HOME/.bashrc"
12260 local path_string="export PATH=$HOME/.local/bin:\$PATH"
12261
12262 if ! grep -Fxq "$path_string" $bashrc_file; then
12263 echo "$path_string" >> $bashrc_file
12264 echo "Added $path_string to $bashrc_file"
12265 else
12266 echo "$path_string already present in $bashrc_file"
12267 fi
12268
12269 export PATH=$HOME/.local/bin:$PATH
12270
12271 echo "Checking PDM version..."
12272 pdm --version
12273 fi
12274}
12275
12276# install postgresql
12277install_postgresql() {
12278 if command -v psql &> /dev/null
12279 then
12280 echo "PostgreSQL is already installed."
12281 echo "Checking PostgreSQL version..."
12282 psql --version
12283
12284 # Check if the database exists
12285 DB_EXISTS=$(runuser -l postgres -c "psql -tAc \"SELECT 1 FROM pg_database WHERE datname='compute_subnet_db'\"")
12286 if [ "$DB_EXISTS" == "1" ]; then
12287 echo "Database compute_subnet_db already exists."
12288 else
12289 echo "Creating database compute_subnet_db..."
12290 runuser -l postgres -c "createdb compute_subnet_db"
12291 fi
12292 else
12293 ohai "Installing PostgreSQL..."
12294
12295 apt install -y postgresql postgresql-contrib
12296
12297 echo "Starting PostgreSQL server..."
12298 service postgresql start
12299
12300 read -p "Enter Postgres password: " pg_password
12301
12302 # Set the password for the postgres user
12303 runuser -l postgres -c "psql -c \"ALTER USER postgres PASSWORD '$pg_password';\""
12304
12305 # Create the database as the postgres user
12306 runuser -l postgres -c "createdb compute_subnet_db"
12307 fi
12308}
12309
12310# install miner dependencies
12311install_miner_dependencies() {
12312 ohai "Installing miner..."
12313
12314 # Get the directory of the current script
12315 SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
12316
12317 # Navigate to the PDM root path relative to the script directory
12318 cd "$SCRIPT_DIR/../neurons/miners" || exit
12319
12320 # Install PDM dependencies
12321 pdm install
12322}
12323
12324ohai "This script will install:"
12325echo "git"
12326echo "curl"
12327echo "python3.11 and pdm"
12328echo "python3-pip"
12329echo "postgresql"
12330echo "bittensor"
12331echo "install miner dependencies"
12332
12333wait_for_user
12334install_pre
12335install_python
12336install_postgresql
12337install_miner_dependencies
12338
12339
12340---
12341File: /scripts/install_miner_on_ubuntu.sh
12342---
12343
12344#!/bin/bash
12345set -u
12346
12347# enable command completion
12348set -o history -o histexpand
12349
12350abort() {
12351 printf "%s\n" "$1"
12352 exit 1
12353}
12354
12355getc() {
12356 local save_state
12357 save_state=$(/bin/stty -g)
12358 /bin/stty raw -echo
12359 IFS= read -r -n 1 -d '' "$@"
12360 /bin/stty "$save_state"
12361}
12362
12363exit_on_error() {
12364 exit_code=$1
12365 last_command=${@:2}
12366 if [ $exit_code -ne 0 ]; then
12367 >&2 echo "\"${last_command}\" command failed with exit code ${exit_code}."
12368 exit $exit_code
12369 fi
12370}
12371
12372shell_join() {
12373 local arg
12374 printf "%s" "$1"
12375 shift
12376 for arg in "$@"; do
12377 printf " "
12378 printf "%s" "${arg// /\ }"
12379 done
12380}
12381
12382# string formatters
12383if [[ -t 1 ]]; then
12384 tty_escape() { printf "\033[%sm" "$1"; }
12385else
12386 tty_escape() { :; }
12387fi
12388tty_mkbold() { tty_escape "1;$1"; }
12389tty_underline="$(tty_escape "4;39")"
12390tty_blue="$(tty_mkbold 34)"
12391tty_red="$(tty_mkbold 31)"
12392tty_bold="$(tty_mkbold 39)"
12393tty_reset="$(tty_escape 0)"
12394
12395ohai() {
12396 printf "${tty_blue}==>${tty_bold} %s${tty_reset}\n" "$(shell_join "$@")"
12397}
12398
12399wait_for_user() {
12400 local c
12401 echo
12402 echo "Press RETURN to continue or any other key to abort"
12403 getc c
12404 # we test for \r and \n because some stuff does \r instead
12405 if ! [[ "$c" == $'\r' || "$c" == $'\n' ]]; then
12406 exit 1
12407 fi
12408}
12409
12410#install pre
12411install_pre() {
12412 sudo apt update
12413 sudo apt install --no-install-recommends --no-install-suggests -y sudo apt-utils curl git cmake build-essential
12414 exit_on_error $?
12415}
12416
12417# check if python is installed, if not install it
12418install_python() {
12419 # Check if python3.11 is installed
12420 if command -v python3.11 &> /dev/null
12421 then
12422 # Check the version
12423 PYTHON_VERSION=$(python3.11 --version 2>&1)
12424 if [[ $PYTHON_VERSION == *"Python 3.11"* ]]; then
12425 ohai "Python 3.11 is already installed."
12426 else
12427 ohai "Linking python to python 3.11"
12428 sudo update-alternatives --install /usr/bin/python python /usr/bin/python3.11 1
12429 python -m pip install cffi
12430 python -m pip install cryptography
12431 fi
12432 else
12433 ohai "Installing Python 3.11"
12434 add-apt-repository ppa:deadsnakes/ppa
12435 sudo apt install python3.11
12436 sudo update-alternatives --install /usr/bin/python python /usr/bin/python3.11 1
12437 python -m pip install cffi
12438 python -m pip install cryptography
12439 fi
12440
12441 # check if PDM is installed
12442 if command -v pdm &> /dev/null
12443 then
12444 ohai "PDM is already installed."
12445 echo "Checking PDM version..."
12446 pdm --version
12447 else
12448 ohai "Installing PDM..."
12449 sudo apt install -y python3.12-venv
12450 curl -sSL https://pdm-project.org/install-pdm.py | python3 -
12451
12452 local bashrc_file="/root/.bashrc"
12453 local path_string="export PATH=/root/.local/bin:\$PATH"
12454
12455 if ! grep -Fxq "$path_string" $bashrc_file; then
12456 echo "$path_string" >> $bashrc_file
12457 echo "Added $path_string to $bashrc_file"
12458 else
12459 echo "$path_string already present in $bashrc_file"
12460 fi
12461
12462 export PATH=/root/.local/bin:$PATH
12463
12464 echo "Checking PDM version..."
12465 pdm --version
12466 fi
12467}
12468
12469# install redis
12470install_redis() {
12471 if command -v redis-server &> /dev/null
12472 then
12473 ohai "Redis is already installed."
12474 echo "Checking Redis version..."
12475 redis-server --version
12476 else
12477 ohai "Installing Redis..."
12478
12479 sudo apt install -y redis-server
12480
12481 echo "Starting Redis server..."
12482 sudo systemctl start redis-server.service
12483
12484 echo "Checking Redis server status..."
12485 sudo systemctl status redis-server.service
12486 fi
12487}
12488
12489# install postgresql
12490install_postgresql() {
12491 if command -v psql &> /dev/null
12492 then
12493 ohai "PostgreSQL is already installed."
12494 echo "Checking PostgreSQL version..."
12495 psql --version
12496
12497 # Check if the database exists
12498 DB_EXISTS=$(sudo -u postgres psql -tAc "SELECT 1 FROM pg_database WHERE datname='compute_subnet_db'")
12499 if [ "$DB_EXISTS" == "1" ]; then
12500 echo "Database compute_subnet_db already exists."
12501 else
12502 echo "Creating database compute_subnet_db..."
12503 sudo -u postgres createdb compute_subnet_db
12504 fi
12505 else
12506 echo "Installing PostgreSQL..."
12507 sudo apt install -y postgresql postgresql-contrib
12508
12509 echo "Starting PostgreSQL server..."
12510 sudo systemctl start postgresql.service
12511
12512 echo "Setting password for postgres user..."
12513 sudo -u postgres psql -c "ALTER USER postgres PASSWORD 'password';"
12514
12515 echo "Creating database compute_subnet_db..."
12516 sudo -u postgres createdb compute_subnet_db
12517 fi
12518}
12519
12520# install btcli
12521install_btcli() {
12522 if command -v btcli &> /dev/null
12523 then
12524 ohai "BtCLI is already installed."
12525 else
12526 ohai "Installing BtCLI..."
12527
12528 sudo apt install -y pipx
12529 pipx install bittensor
12530 source ~/.bashrc
12531 fi
12532}
12533
12534# install docker
12535install_docker() {
12536 if command -v docker &> /dev/null; then
12537 ohai "Docker is already installed."
12538 return 0
12539 else
12540 ohai "Installing Docker..."
12541 sudo apt-get update -y
12542 sudo apt-get install -y ca-certificates curl
12543 sudo install -m 0755 -d /etc/apt/keyrings
12544 sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
12545 sudo chmod a+r /etc/apt/keyrings/docker.asc
12546
12547 # Add the repository to Apt sources:
12548 echo \
12549 "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
12550 $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
12551 sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
12552 sudo apt-get update -y
12553 sudo apt-get install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
12554 sudo groupadd docker
12555 sudo usermod -aG docker $USER
12556 newgrp docker
12557 fi
12558}
12559
12560ohai "This script will install:"
12561echo "bittensor"
12562echo "docker"
12563
12564
12565wait_for_user
12566install_pre
12567install_btcli
12568install_docker
12569
12570
12571
12572---
12573File: /scripts/install_staging.sh
12574---
12575
12576#!/bin/bash
12577
12578# Section 1: Build/Install
12579# This section is for first-time setup and installations.
12580
12581install_dependencies() {
12582 # Function to install packages on macOS
12583 install_mac() {
12584 which brew > /dev/null
12585 if [ $? -ne 0 ]; then
12586 echo "Installing Homebrew..."
12587 /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
12588 fi
12589 echo "Updating Homebrew packages..."
12590 brew update
12591 echo "Installing required packages..."
12592 brew install make llvm curl libssl protobuf tmux
12593 }
12594
12595 # Function to install packages on Ubuntu/Debian
12596 install_ubuntu() {
12597 echo "Updating system packages..."
12598 sudo apt update
12599 echo "Installing required packages..."
12600 sudo apt install --assume-yes make build-essential git clang curl libssl-dev llvm libudev-dev protobuf-compiler tmux
12601 }
12602
12603 # Detect OS and call the appropriate function
12604 if [[ "$OSTYPE" == "darwin"* ]]; then
12605 install_mac
12606 elif [[ "$OSTYPE" == "linux-gnu"* ]]; then
12607 install_ubuntu
12608 else
12609 echo "Unsupported operating system."
12610 exit 1
12611 fi
12612
12613 # Install rust and cargo
12614 curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
12615
12616 # Update your shell's source to include Cargo's path
12617 source "$HOME/.cargo/env"
12618}
12619
12620# Call install_dependencies only if it's the first time running the script
12621if [ ! -f ".dependencies_installed" ]; then
12622 install_dependencies
12623 touch .dependencies_installed
12624fi
12625
12626
12627# Section 2: Test/Run
12628# This section is for running and testing the setup.
12629
12630# Create a coldkey for the owner role
12631wallet=${1:-owner}
12632
12633# Logic for setting up and running the environment
12634setup_environment() {
12635 # Clone subtensor and enter the directory
12636 if [ ! -d "subtensor" ]; then
12637 git clone https://github.com/opentensor/subtensor.git
12638 fi
12639 cd subtensor
12640 git pull
12641
12642 # Update to the nightly version of rust
12643 ./scripts/init.sh
12644
12645 cd ../bittensor-subnet-template
12646
12647 # Install the bittensor-subnet-template python package
12648 python -m pip install -e .
12649
12650 # Create and set up wallets
12651 # This section can be skipped if wallets are already set up
12652 if [ ! -f ".wallets_setup" ]; then
12653 btcli wallet new_coldkey --wallet.name $wallet --no_password --no_prompt
12654 btcli wallet new_coldkey --wallet.name miner --no_password --no_prompt
12655 btcli wallet new_hotkey --wallet.name miner --wallet.hotkey default --no_prompt
12656 btcli wallet new_coldkey --wallet.name validator --no_password --no_prompt
12657 btcli wallet new_hotkey --wallet.name validator --wallet.hotkey default --no_prompt
12658 touch .wallets_setup
12659 fi
12660
12661}
12662
12663# Call setup_environment every time
12664setup_environment
12665
12666## Setup localnet
12667# assumes we are in the bittensor-subnet-template/ directory
12668# Initialize your local subtensor chain in development mode. This command will set up and run a local subtensor network.
12669cd ../subtensor
12670
12671# Start a new tmux session and create a new pane, but do not switch to it
12672echo "FEATURES='pow-faucet runtime-benchmarks' BT_DEFAULT_TOKEN_WALLET=$(cat ~/.bittensor/wallets/$wallet/coldkeypub.txt | grep -oP '"ss58Address": "\K[^"]+') bash scripts/localnet.sh" > setup_and_run.sh
12673chmod +x setup_and_run.sh
12674tmux new-session -d -s localnet -n 'localnet'
12675tmux send-keys -t localnet 'bash ../subtensor/setup_and_run.sh' C-m
12676
12677# Notify the user
12678echo ">> localnet.sh is running in a detached tmux session named 'localnet'"
12679echo ">> You can attach to this session with: tmux attach-session -t localnet"
12680
12681# Register a subnet (this needs to be run each time we start a new local chain)
12682btcli subnet create --wallet.name $wallet --wallet.hotkey default --subtensor.chain_endpoint ws://127.0.0.1:9946 --no_prompt
12683
12684# Transfer tokens to miner and validator coldkeys
12685export BT_MINER_TOKEN_WALLET=$(cat ~/.bittensor/wallets/miner/coldkeypub.txt | grep -oP '"ss58Address": "\K[^"]+')
12686export BT_VALIDATOR_TOKEN_WALLET=$(cat ~/.bittensor/wallets/validator/coldkeypub.txt | grep -oP '"ss58Address": "\K[^"]+')
12687
12688btcli wallet transfer --subtensor.network ws://127.0.0.1:9946 --wallet.name $wallet --dest $BT_MINER_TOKEN_WALLET --amount 1000 --no_prompt
12689btcli wallet transfer --subtensor.network ws://127.0.0.1:9946 --wallet.name $wallet --dest $BT_VALIDATOR_TOKEN_WALLET --amount 10000 --no_prompt
12690
12691# Register wallet hotkeys to subnet
12692btcli subnet register --wallet.name miner --netuid 1 --wallet.hotkey default --subtensor.chain_endpoint ws://127.0.0.1:9946 --no_prompt
12693btcli subnet register --wallet.name validator --netuid 1 --wallet.hotkey default --subtensor.chain_endpoint ws://127.0.0.1:9946 --no_prompt
12694
12695# Add stake to the validator
12696btcli stake add --wallet.name validator --wallet.hotkey default --subtensor.chain_endpoint ws://127.0.0.1:9946 --amount 10000 --no_prompt
12697
12698# Ensure both the miner and validator keys are successfully registered.
12699btcli subnet list --subtensor.chain_endpoint ws://127.0.0.1:9946
12700btcli wallet overview --wallet.name validator --subtensor.chain_endpoint ws://127.0.0.1:9946 --no_prompt
12701btcli wallet overview --wallet.name miner --subtensor.chain_endpoint ws://127.0.0.1:9946 --no_prompt
12702
12703cd ../bittensor-subnet-template
12704
12705
12706# Check if inside a tmux session
12707if [ -z "$TMUX" ]; then
12708 # Start a new tmux session and run the miner in the first pane
12709 tmux new-session -d -s bittensor -n 'miner' 'python neurons/miner.py --netuid 1 --subtensor.chain_endpoint ws://127.0.0.1:9946 --wallet.name miner --wallet.hotkey default --logging.debug'
12710
12711 # Split the window and run the validator in the new pane
12712 tmux split-window -h -t bittensor:miner 'python neurons/validator.py --netuid 1 --subtensor.chain_endpoint ws://127.0.0.1:9946 --wallet.name validator --wallet.hotkey default --logging.debug'
12713
12714 # Attach to the new tmux session
12715 tmux attach-session -t bittensor
12716else
12717 # If already in a tmux session, create two panes in the current window
12718 tmux split-window -h 'python neurons/miner.py --netuid 1 --subtensor.chain_endpoint ws://127.0.0.1:9946 --wallet.name miner --wallet.hotkey default --logging.debug'
12719 tmux split-window -v -t 0 'python neurons/validator.py --netuid 1 --subtensor.chain_endpoint ws://127.0.0.1:9946 --wallet.name validator --wallet.hotkey default --logging.debug'
12720fi
12721
12722
12723
12724---
12725File: /scripts/install_validator_on_ubuntu.sh
12726---
12727
12728#!/bin/bash
12729set -u
12730
12731# enable command completion
12732set -o history -o histexpand
12733
12734abort() {
12735 printf "%s\n" "$1"
12736 exit 1
12737}
12738
12739getc() {
12740 local save_state
12741 save_state=$(/bin/stty -g)
12742 /bin/stty raw -echo
12743 IFS= read -r -n 1 -d '' "$@"
12744 /bin/stty "$save_state"
12745}
12746
12747exit_on_error() {
12748 exit_code=$1
12749 last_command=${@:2}
12750 if [ $exit_code -ne 0 ]; then
12751 >&2 echo "\"${last_command}\" command failed with exit code ${exit_code}."
12752 exit $exit_code
12753 fi
12754}
12755
12756shell_join() {
12757 local arg
12758 printf "%s" "$1"
12759 shift
12760 for arg in "$@"; do
12761 printf " "
12762 printf "%s" "${arg// /\ }"
12763 done
12764}
12765
12766# string formatters
12767if [[ -t 1 ]]; then
12768 tty_escape() { printf "\033[%sm" "$1"; }
12769else
12770 tty_escape() { :; }
12771fi
12772tty_mkbold() { tty_escape "1;$1"; }
12773tty_underline="$(tty_escape "4;39")"
12774tty_blue="$(tty_mkbold 34)"
12775tty_red="$(tty_mkbold 31)"
12776tty_bold="$(tty_mkbold 39)"
12777tty_reset="$(tty_escape 0)"
12778
12779ohai() {
12780 printf "${tty_blue}==>${tty_bold} %s${tty_reset}\n" "$(shell_join "$@")"
12781}
12782
12783wait_for_user() {
12784 local c
12785 echo
12786 echo "Press RETURN to continue or any other key to abort"
12787 getc c
12788 # we test for \r and \n because some stuff does \r instead
12789 if ! [[ "$c" == $'\r' || "$c" == $'\n' ]]; then
12790 exit 1
12791 fi
12792}
12793
12794# install prerequisite packages
12795install_pre() {
12796 sudo apt update
12797 sudo apt install --no-install-recommends --no-install-suggests -y sudo apt-utils curl git cmake build-essential
12798 exit_on_error $?
12799}
12800
12801# check if python is installed, if not install it
12802install_python() {
12803 # Check if python3.11 is installed
12804 if command -v python3.11 &> /dev/null
12805 then
12806 # Check the version
12807 PYTHON_VERSION=$(python3.11 --version 2>&1)
12808 if [[ $PYTHON_VERSION == *"Python 3.11"* ]]; then
12809 ohai "Python 3.11 is already installed."
12810 else
12811 ohai "Linking python to python 3.11"
12812 sudo update-alternatives --install /usr/bin/python python /usr/bin/python3.11 1
12813 python -m pip install cffi
12814 python -m pip install cryptography
12815 fi
12816 else
12817 ohai "Installing Python 3.11"
12818        sudo add-apt-repository -y ppa:deadsnakes/ppa
12819        sudo apt install -y python3.11
12820 sudo update-alternatives --install /usr/bin/python python /usr/bin/python3.11 1
12821 python -m pip install cffi
12822 python -m pip install cryptography
12823 fi
12824
12825 # check if PDM is installed
12826 if command -v pdm &> /dev/null
12827 then
12828 ohai "PDM is already installed."
12829 echo "Checking PDM version..."
12830 pdm --version
12831 else
12832 ohai "Installing PDM..."
12833 sudo apt install -y python3.12-venv
12834 curl -sSL https://pdm-project.org/install-pdm.py | python3 -
12835
12836 local bashrc_file="/root/.bashrc"
12837 local path_string="export PATH=/root/.local/bin:\$PATH"
12838
12839 if ! grep -Fxq "$path_string" $bashrc_file; then
12840 echo "$path_string" >> $bashrc_file
12841 echo "Added $path_string to $bashrc_file"
12842 else
12843 echo "$path_string already present in $bashrc_file"
12844 fi
12845
12846 export PATH=/root/.local/bin:$PATH
12847
12848 echo "Checking PDM version..."
12849 pdm --version
12850 fi
12851}
12852
12853# install redis
12854install_redis() {
12855 if command -v redis-server &> /dev/null
12856 then
12857 ohai "Redis is already installed."
12858 echo "Checking Redis version..."
12859 redis-server --version
12860 else
12861 ohai "Installing Redis..."
12862
12863 sudo apt install -y redis-server
12864
12865 echo "Starting Redis server..."
12866 sudo systemctl start redis-server.service
12867
12868 echo "Checking Redis server status..."
12869        sudo systemctl status redis-server.service --no-pager
12870 fi
12871}
12872
12873# install postgresql
12874install_postgresql() {
12875 if command -v psql &> /dev/null
12876 then
12877 ohai "PostgreSQL is already installed."
12878 echo "Checking PostgreSQL version..."
12879 psql --version
12880
12881 # Check if the database exists
12882 DB_EXISTS=$(sudo -u postgres psql -tAc "SELECT 1 FROM pg_database WHERE datname='compute_subnet_db'")
12883 if [ "$DB_EXISTS" == "1" ]; then
12884 echo "Database compute_subnet_db already exists."
12885 else
12886 echo "Creating database compute_subnet_db..."
12887 sudo -u postgres createdb compute_subnet_db
12888 fi
12889 else
12890 echo "Installing PostgreSQL..."
12891 sudo apt install -y postgresql postgresql-contrib
12892
12893 echo "Starting PostgreSQL server..."
12894 sudo systemctl start postgresql.service
12895
12896 echo "Setting password for postgres user..."
12897 sudo -u postgres psql -c "ALTER USER postgres PASSWORD 'password';"
12898
12899 echo "Creating database compute_subnet_db..."
12900 sudo -u postgres createdb compute_subnet_db
12901 fi
12902}
12903
12904# install btcli
12905install_btcli() {
12906 if command -v btcli &> /dev/null
12907 then
12908 ohai "BtCLI is already installed."
12909 else
12910 ohai "Installing BtCLI..."
12911
12912 sudo apt install -y pipx
12913 pipx install bittensor
12914 source ~/.bashrc
12915 fi
12916}
12917
12918# install docker
12919install_docker() {
12920 if command -v docker &> /dev/null; then
12921 ohai "Docker is already installed."
12922 return 0
12923 else
12924 ohai "Installing Docker..."
12925 sudo apt-get update -y
12926 sudo apt-get install -y ca-certificates curl
12927 sudo install -m 0755 -d /etc/apt/keyrings
12928 sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
12929 sudo chmod a+r /etc/apt/keyrings/docker.asc
12930
12931 # Add the repository to Apt sources:
12932 echo \
12933 "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
12934 $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
12935 sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
12936 sudo apt-get update -y
12937 sudo apt-get install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
12938        sudo groupadd docker 2>/dev/null || true  # group may already exist from the package install
12939        sudo usermod -aG docker $USER
12940        newgrp docker  # note: this starts a subshell; logging out and back in also applies the new group
12941 fi
12942}
12943
12944ohai "This script will install:"
12945echo "bittensor"
12946echo "docker"
12947
12948
12949wait_for_user
12950install_pre
12951install_btcli
12952install_docker
12953
12954
12955
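The installer above only reports success step by step; a quick, illustrative way to confirm that the expected binaries actually ended up on PATH is a small Python check like this sketch (the tool list is an assumption based on the functions the script actually calls):

```python
import shutil

# Assumed tool list, based on install_pre, install_btcli and install_docker above.
REQUIRED_TOOLS = ["curl", "git", "btcli", "docker"]

def check_tools(tools: list[str]) -> bool:
    """Print where each tool was found and return True only if all are present."""
    all_present = True
    for tool in tools:
        path = shutil.which(tool)
        print(f"{tool}: {path if path else 'MISSING'}")
        all_present = all_present and path is not None
    return all_present

if __name__ == "__main__":
    raise SystemExit(0 if check_tools(REQUIRED_TOOLS) else 1)
```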
12956---
12957File: /tests/__init__.py
12958---
12959
12960
12961
12962
12963---
12964File: /README.md
12965---
12966
12967# Datura Compute Subnet
12968
12969# Compute Subnet on Bittensor
12970
12971Welcome to the **Compute Subnet on Bittensor**! This project enables a decentralized, peer-to-peer GPU rental marketplace, connecting miners who contribute GPU resources with users who need computational power. Our frontend interface is available at [celiumcompute.ai](https://celiumcompute.ai), where you can easily rent machines from the subnet.
12972
12973## Table of Contents
12974
12975- [Introduction](#introduction)
12976- [High-Level Architecture](#high-level-architecture)
12977- [Getting Started](#getting-started)
12978 - [For Renters](#for-renters)
12979 - [For Miners](#for-miners)
12980 - [For Validators](#for-validators)
12981- [Contact and Support](#contact-and-support)
12982
12983## Introduction
12984
12985The Compute Subnet on Bittensor is a decentralized network that allows miners to contribute their GPU resources to a global pool. Users can rent these resources for computational tasks, such as machine learning, data analysis, and more. The system ensures fair compensation for miners based on the quality and performance of their GPUs.
12986
12987
12988## High-Level Architecture
12989
12990- **Miners**: Provide GPU resources to the network; their machines are evaluated and scored by validators.
12991- **Validators**: Securely connect to miner machines to verify hardware specs and performance. They maintain the network's integrity.
12992- **Renters**: Rent computational resources from the network to run their tasks.
12993- **Frontend (celiumcompute.ai)**: The web interface facilitating easy interaction between miners and renters.
12994- **Bittensor Network**: The decentralized blockchain on which validators manage and pay out compensation to miners in its native token, $TAO.
12995
12996## Getting Started
12997
12998### For Renters
12999
13000If you are looking to rent computational resources, you can easily do so through the Compute Subnet. Renters can:
13001
130021. **Sign up** at [celiumcompute.ai](https://celiumcompute.ai).
130032. **Browse** available GPU resources.
130043. **Select** machines based on GPU type, performance, and price.
130054. **Deploy** and monitor your computational tasks using the platform's tools.
13006
13007To start renting machines, visit [celiumcompute.ai](https://celiumcompute.ai) and access the resources you need.
13008
13009### For Miners
13010
13011Miners can contribute their GPU-equipped machines to the network. The machines are scored and validated based on factors like GPU type, number of GPUs, bandwidth, and overall GPU performance. Higher performance results in better compensation for miners.
13012
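The actual scoring logic lives in the validator code and is not reproduced here. Purely as an illustration of how such factors could be combined (the function, weights, and numbers below are hypothetical, not the subnet's real formula):

```python
# Toy illustration only; the real validator scoring and weights differ.
def toy_gpu_score(gpu_model_factor: float, num_gpus: int,
                  bandwidth_gbps: float, perf_tflops: float) -> float:
    """Combine hypothetical hardware factors into a single score."""
    return gpu_model_factor * num_gpus * (0.3 * bandwidth_gbps + 0.7 * perf_tflops)

# Example: a 4-GPU machine with model factor 1.2, 10 Gbps bandwidth, 80 TFLOPS.
print(toy_gpu_score(1.2, 4, 10.0, 80.0))
```
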
13013If you are a miner and want to contribute GPU resources to the subnet, please refer to the [Miner Setup Guide](neurons/miners/README.md) for instructions on how to:
13014
13015- Set up your environment.
13016- Install the miner software.
13017- Register your miner and connect to the network.
13018- Get compensated for providing GPUs!
13019
13020### For Validators
13021
13022Validators play a crucial role in maintaining the integrity of the Compute Subnet by verifying the hardware specifications and performance of miners’ machines. Validators ensure that miners are fairly compensated based on their GPU contributions and prevent fraudulent activities.
13023
13024For more details, visit the [Validator Setup Guide](neurons/validators/README.md).
13025
13026
13027## Contact and Support
13028
13029If you need assistance or have any questions, feel free to reach out:
13030
13031- **Discord Support**: [Dedicated Channel within the Bittensor Discord](https://discord.com/channels/799672011265015819/1291754566957928469)
13032