README.md  +117 -289
···
# Project

- To run this project use:
- 
- ```sh
- make run
- ```
- 
- To compile this project use:
- 
- ```sh
- make build
- ```
- 
- # TODO
- - [x] createdb.py
- - [x] testdb.py
- - [x] ls.py
- - [ ] meta-data.py
- - [x] implement `reg` to receive information from the data nodes
- 
- 
- # Assignment 04: Distributed File Systems
- 
- The components to implement are:
- 
- * **Metadata server**, which will function as an inode repository
- * **Data servers**, which will provide the disk space for file data blocks
- * **List client**, which will list the files available in the DFS
- * **Copy client**, which will copy files from and to the DFS
- 
- # Objectives
- 
- * Study the main components of a distributed file system
- * Get familiar with file management
- * Implement a distributed system
- 
- # Prerequisites
- 
- * Python:
-   * [www.python.org](http://www.python.org/)
- * Python SocketServer library: for **TCP** socket communication.
-   * https://docs.python.org/3/library/socketserver.html
- * uuid: to generate unique IDs for the data blocks
-   * https://docs.python.org/3/library/uuid.html
- * **Optionally** you may read about the json and sqlite3 libraries used in the
-   skeleton of the program.
-   * https://docs.python.org/3/library/json.html
-   * https://docs.python.org/3/library/sqlite3.html
- 
- ### **The metadata server's database manipulation functions**
- 
- No expertise in database management is required to accomplish this project.
- However, sqlite3 is used to store the file inodes in the metadata server. You
- don't need to understand the functions, but you do need to read the documentation
- of the functions that interact with the database. The metadata server database
- functions are defined in the file mds_db.py.
+ This is a distributed filesystem written in C!

- #### **Inode**
+ ## Documentation

- For this implementation an **inode** consists of:
+ You can read the documentation by:
+ - simply opening the source files in `./src`
+ - running `make docs`. This will create a directory called `docs`. Open it in
+   your browser like `file:///path/to/this/project/docs/index.html`.
+ - visiting the website: [dfs-docs](https://sona-tau.github.io/dfs-docs/files.html).

- * File name
- * File size
- * List of blocks
+ ## Usage

- #### **Block List**
+ First, you have to set up the project.

- The **block list** consists of a list of:
+ ```sh
+ make all
+ ```

- * data node address - the data node where the block is stored
- * data node port - the service port of the data node
- * data node block_id - the id assigned to the block
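
For concreteness, here is a minimal sketch of what one inode from the description above might look like once its block list is filled in. It is written in Python to match the skeleton the old README refers to; the dict layout and every value are made up for illustration.

```python
# Illustrative only: field names follow the inode/block-list description
# above, but the concrete layout and all values are assumptions.
example_inode = {
    "fname": "/home/cheo/asig.cpp",   # file name (its path in the DFS)
    "fsize": 30,                      # file size in bytes
    "blocks": [                       # block list: (address, port, block_id)
        ("192.168.1.10", 8001, "3f1c9a2e-aaaa-bbbb-cccc-000000000001"),
        ("192.168.1.11", 8002, "3f1c9a2e-aaaa-bbbb-cccc-000000000002"),
    ],
}
```
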
+ This will compile all the necessary executables and put them in a directory
+ called: `./.build`. The executables provided are:
+ - `metadata-server`
+ - `data-node`
+ - `ls`
+ - `copy`

- Functions:
- 
- * AddDataNode(address, port): Adds a new data node to the metadata server.
-   Receives an IP address and a port, i.e. the information needed to connect to the data node.
- 
- * GetDataNodes(): Returns a list of registered data node tuples **(address, port)**.
-   Useful to know to which data nodes the data blocks can be sent.
- * InsertFile(filename, fsize): Inserts a filename with its file size into the
-   database.
- * GetFiles(): Returns a list of the attributes of the files stored in the DFS.
-   (addr, file size)
- * AddBlockToInode(filename, blocks): Adds the list of data block information of
-   a file. The data block information consists of (address, port, block_id).
- * GetFileInode(filename): Returns the file size and the list of data block
-   information of a file. (fsize, block_list)
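
As a rough orientation, the sketch below strings these functions together in the order the metadata server would use them. It is a hedged Python sketch only: the module-level import/call style and the exact return shapes are assumptions, not part of the documented mds_db.py skeleton.

```python
# Hedged sketch: assumes mds_db.py exposes the helpers described above at
# module level and that the return shapes match their descriptions.
import mds_db

mds_db.AddDataNode("127.0.0.1", 8001)              # a data node registers
nodes = mds_db.GetDataNodes()                      # [(address, port), ...]

mds_db.InsertFile("/home/hola.txt", 200)           # create the file's inode
mds_db.AddBlockToInode("/home/hola.txt",
                       [("127.0.0.1", 8001, "some-block-id")])

fsize, block_list = mds_db.GetFileInode("/home/hola.txt")
for name, size in mds_db.GetFiles():               # assumed (name, size) pairs
    print(name, size)
```
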
- 
- ### **The packet manipulation functions**
- 
- The packet library is designed to serialize the communication data using the
- json library. No expertise with json is required to accomplish this assignment.
- These functions were developed to ease the packet generation process of the
- project. The packet library is defined in the file Packet.py.
- 
- In this project all packet objects have a packet type among the following
- command type options:
- 
- * reg: to register a data node
- * list: to ask for a list of files
- * put: to put a file in the DFS
- * get: to get files from the DFS
- * dblks: to add the data block ids to the files.
- 
- #### **Functions:**
- 
- ##### **General Functions**
- 
- * getEncodedPacket(): Returns a serialized packet ready to send through the
-   network. First you need to build the packet; see the Build**\<X\>**Packet
-   functions.
- * DecodePacket(packet): Receives a serialized message and turns it into a
-   packet object.
- * getCommand(): Returns the command type of the packet.
- 
- ##### **Packet Registration Functions**
- 
- * BuildRegPacket(addr, port): Builds a registration packet.
- * getAddr(): Returns the IP address of a server. Useful for registration
-   packets.
- * getPort(): Returns the port number of a server. Useful for registration
-   packets.
- 
- ##### **Packet List Functions**
- 
- * BuildListPacket(): Builds a list packet for file listing.
- * BuildListResponse(filelist): Builds a list response packet with the list of
-   files.
- * getFileArray(): Returns a list of files.
- 
- ##### **Get Packet Functions**
- 
- * BuildGetPacket(fname): Builds a get packet to request a file by name.
- * BuildGetResponse(metalist, fsize): Builds a list of the data node servers with
-   the blocks of a file, and the file size.
- * getFileName(): Returns the file name in a packet.
- * getDataNodes(): Returns a list of data servers.
- 
- ##### **Put Packet Functions (Put Blocks)**
- 
- * BuildPutPacket(fname, size): Builds a put packet to put fname and the file size
-   in the metadata server.
- * getFileInfo(): Returns the file info in a packet.
- * BuildPutResponse(metalist): Builds a list of data node servers where the data
-   blocks of a file can be stored, i.e. a list of available data servers.
- * BuildDataBlockPacket(fname, block_list): Builds a data block packet. It
-   contains the file name and the list of blocks for the file. See [block
-   list](http://ccom.uprrp.edu/~jortiz/clases/ccom4017/asig04/#block_list) to
-   review the content of a block list.
- * getDataBlocks(): Returns a list of data blocks.
- 
- ##### **Get Data Block Functions (Get Blocks)**
- 
- * BuildGetDataBlockPacket(blockid): Builds a get data block packet. Useful
-   when requesting a data block from a data node.
- * getBlockID(): Returns the block_id from a packet.
- 
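
To make the packet flow concrete, here is a hedged Python sketch of the list round trip using the helpers named above. The `Packet` class name, the socket framing, and the string encoding are assumptions rather than the documented API.

```python
# Hedged sketch of a "list" request/response built with the Packet helpers
# named above; class name, framing, and encoding are assumptions.
import socket
from Packet import Packet

def list_files(meta_addr, meta_port):
    req = Packet()
    req.BuildListPacket()                          # build the "list" command
    with socket.create_connection((meta_addr, meta_port)) as sock:
        sock.sendall(req.getEncodedPacket().encode())
        raw = sock.recv(4096).decode()

    resp = Packet()
    resp.DecodePacket(raw)                         # back into a packet object
    for fname, fsize in resp.getFileArray():       # assumed (name, size) pairs
        print(f"{fname} {fsize} bytes")
```
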
- # Instructions
- 
- Write and complete the code for an unreliable and insecure distributed file server
- following the specifications below.
- 
- ### **Design specifications**
- 
- For this project you will design and complete a distributed file system. You
- will write a DFS with tools to list the files, and to copy files from and to
- the DFS.
- 
- Your DFS will consist of:
- 
- * A metadata server: it will contain the metadata (inode) information of the
-   files in your file system. It will also keep a registry of the data servers
-   that are connected to the DFS.
- * Data nodes: the data nodes will contain chunks (some blocks) of the files that
-   you are storing in the DFS.
- * List command: a command to list the files stored in the DFS.
- * Copy command: a command that will copy files from and to the DFS.
- 
- ### **The metadata server**
- 
- The metadata server contains the metadata (inode) information of the files in
- your file system. It also keeps a registry of the data servers that are
- connected to the DFS.
- 
- Your metadata server must provide the following services:
- 
- 1. Listen to the data nodes that are part of the DFS. Every time a new data
-    node registers with the DFS, the metadata server must keep the contact
-    information of that data node. This is (IP address, listening port).
-    * To ease the implementation of the DFS, the directory file system must
-      contain three things:
-      * the path of the file in the file system (filename)
-      * the nodes that contain the data blocks of the file
-      * the file size
- 2. Every time a client (the list or copy command) contacts the metadata server
-    for:
-    * get: requesting to read a file: the metadata server must check whether the file
-      is in the DFS database, and if it is, it must return the nodes with the
-      block_ids that contain the file.
-    * put: requesting to write a file: the metadata server must:
-      * insert into the database the path of the new file (with its name) and its
-        size.
-      * return a list of available data nodes where the chunks of the
-        file can be written.
-    * dblks: then store the data blocks that carry the information of the data
-      nodes and the block ids of the file.
-    * list: requesting to list files:
-      * the metadata server must return a list with the files in the DFS and
-        their size.
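
A compact Python sketch of how that dispatch could look on the server side follows; it only reuses the function names listed earlier, and the handler shape, return values, and getFileInfo() result are assumptions.

```python
# Hedged sketch: dispatch one decoded request on its command type, using the
# Packet and mds_db names from the sections above; details are assumed.
import mds_db
from Packet import Packet

def handle_request(raw_message):
    p = Packet()
    p.DecodePacket(raw_message)
    cmd = p.getCommand()

    if cmd == "reg":                    # a data node registers itself
        mds_db.AddDataNode(p.getAddr(), p.getPort())
    elif cmd == "list":                 # client asks for the file listing
        return mds_db.GetFiles()
    elif cmd == "put":                  # reserve the inode, offer data nodes
        fname, fsize = p.getFileInfo()  # assumed return shape
        mds_db.InsertFile(fname, fsize)
        return mds_db.GetDataNodes()
    elif cmd == "get":                  # hand back the file's block list
        return mds_db.GetFileInode(p.getFileName())
    elif cmd == "dblks":                # attach the blocks the client wrote
        mds_db.AddBlockToInode(p.getFileName(), p.getDataBlocks())
```
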
+ NOTE: The `make all` command of this project produces lots of other
+ intermediary files that are useful for debugging and testing. Please ignore
+ these!

- The metadata server must be run as:
+ > If this is not the first time you run the project, you might want to clear the
+ > data directory. In the following configuration, you can do this by simply
+ > running `make clean-data`

- python meta-data.py \<port, default=8000\>
+ The first thing you have to do after compiling everything is create the
+ database. To do this, run the following command:

- If no port is specified, port 8000 is used by default.
+ ```sh
+ createdb
+ ```

- ### **The data node server**
+ This command does not take any parameters.

- The data node is the process that receives and saves the data blocks of the
- files. It must register with the metadata server as soon as it starts
- executing. The data node receives data from the clients when a client
- wants to write a file, and returns the data when a client wants to read a
- file.
+ Then, you have to start the metadata server:

- Your data node must provide the following services:
+ ```sh
+ metadata-server Port
+ ```

- 1. put: Listen for writes:
-    * The data node will receive blocks of data, store them under a unique id,
-      and return the unique id.
-    * Each node must have its own block storage path. You may run more than one
-      data node per system.
- 2. get: Listen for reads:
-    * The data node will receive requests for data blocks, and it must read the
-      data block and return its content.
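
As a sketch of the put/get behaviour just described, a data node can store each incoming block under a fresh uuid (the uuid module is listed in the prerequisites) inside its own data directory and read it back by that id. The function names and framing here are illustrative only.

```python
# Hedged sketch of the data-node storage path: persist a block under a new
# unique id and return that id; read it back later by the same id.
import os
import uuid

def store_block(data_dir, block_bytes):
    block_id = str(uuid.uuid4())
    with open(os.path.join(data_dir, block_id), "wb") as f:
        f.write(block_bytes)
    return block_id

def read_block(data_dir, block_id):
    with open(os.path.join(data_dir, block_id), "rb") as f:
        return f.read()
```
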
+ where:
+ - `Port` is any valid port number. This is the port that the
+   metadata server will be listening on.

- The data nodes must be run as:
+ After starting the metadata server, start a couple of data nodes:

- python data-node.py \<server address\> \<port\> \<data path\> \<metadata
- port, default=8000\>
+ ```sh
+ data-node IPv4 Port Path Port
+ ```

- Server address is the metadata server address, port is the data node's port
- number, data path is a path to a directory where the data blocks are stored, and
- metadata port is the optional metadata server port if it was run on a port
- other than the default.
+ where:
+ - `IPv4` is any valid `IPv4` address. This is the IP address of the metadata
+   server.
+ - `Port` is any valid port number. This is the port that the
+   metadata server is listening on.
+ - `Path` is the file path to the data directory for this data node.
+ - `Port` is any valid port number. This is the port that the data
+   node will be listening on.

- **Note:** Since you most probably do not have many different computers at your
- disposal, you may run more than one data node on the same computer, but their
- listening ports and data block directories must be different.
+ You can now copy files to and from the server. To do this, use the `copy`
+ command:

- ### **The list client**
+ ```sh
+ copy IPv4 Port [-s] Path [-s] Path
+ ```

- The list client just sends a list request to the metadata server and then waits
- for a list of file names with their sizes.
+ where:
+ - `IPv4` is any valid `IPv4` address. This is the IP address of the metadata
+   server.
+ - `Port` is any valid port number. This is the port that the
+   metadata server is listening on.
+ - `Path` is the file path to the source file that you want to copy.
+ - `Path` is the file path to the destination of the file you want to copy.
+ 
+ Notice that there is a `[-s]`. It must be supplied exactly once. This flag
+ indicates that the next path refers to a file that is on the server. For
+ example, the following are correct ways to use this command:

- The output must look like:
+ ```sh
+ copy 136.145.10.2 42069 -s /home/root/.bashrc /home/cheo/.bashrc
+ copy 136.145.10.2 42069 /etc/passwd -s /home/sona/important_files.txt
+ ```

- /home/cheo/asig.cpp 30 bytes
- /home/hola.txt 200 bytes
- /home/saludos.dat 2000 bytes
+ The following would be incorrect ways to use this command:

- The list client must be run as:
+ ```sh
+ copy 127.0.0.1 58008 -s /home/root/.bashrc -s /home/cheo/.bashrc
+ # ERROR:             ^^                    ^^
+ # The -s flag appears twice!

- python ls.py \<server\>:\<port, default=8000\>
+ copy 127.0.0.1 58008 /etc/passwd /home/sona/important_files.txt
+ # ERROR:             ^           ^
+ # The -s flag does not appear!
+ ```

- Where server is the metadata server IP and port is the metadata server port. If
- the port is not indicated, the default port 8000 is used and no ':' character
- is necessary.

- ### **The copy client**
+ To list the files that are on the server, you can use the `ls` command:

- The copy client is more complicated than the list client. It is in charge of
- copying files from and to the DFS.
+ ```sh
+ ls IPv4 Port
+ ```

- The copy client must:
+ where:
+ - `IPv4` is any valid `IPv4` address. This is the IP address of the metadata
+   server.
+ - `Port` is any valid port number. This is the port that the
+   metadata server is listening on.

- 1. Write files to the DFS:
-    * The client must send the metadata server the file name and size of the
-      file to write.
-    * Wait for the metadata server response with the list of available data
-      nodes.
-    * Send the data blocks to each data node.
-      * You may decide to divide the file over the number of data servers.
-      * You may divide the file into X-size blocks and send them to the data
-        servers in round robin.
- 2. Read files from the DFS:
-    * Contact the metadata server with the file name to read.
-    * Wait for the block list with the block ids and data server information.
-    * Retrieve the file blocks from the data servers.
-      * This part will depend on the division algorithm used in step (1).
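
The round-robin division mentioned in step 1 can be sketched as follows; the block size and the actual send step are placeholders, only the split-and-assign logic is shown.

```python
# Hedged sketch: cut the file into fixed-size blocks and assign block i to
# data node i % N, matching the round-robin option described above.
BLOCK_SIZE = 64 * 1024  # placeholder block size

def split_round_robin(path, data_nodes):
    """Return [(address, port, block_bytes), ...] in write order."""
    assignments = []
    with open(path, "rb") as f:
        index = 0
        while True:
            chunk = f.read(BLOCK_SIZE)
            if not chunk:
                break
            address, port = data_nodes[index % len(data_nodes)]
            assignments.append((address, port, chunk))
            index += 1
    return assignments
```
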
+ ### Example

- The copy client must be run as:
+ If you want to run this project as an example, try the following:

- Copy from the DFS:
+ First, make sure you compile all the executables.
+ ```sh
+ make all
+ ```

- python copy.py \<server\>:\<port\>:\<dfs file path\> \<destination file\>
+ Remember, this will put all the executables in a directory called `.build`.

- Copy to the DFS:
+ Then, launch the metadata server first!

- python copy.py \<source file\> \<server\>:\<port\>:\<dfs file path\>

- Where server is the metadata server IP address, and port is the metadata server
- port.
+ ```sh
+ ./.build/metadata-server 127.0.0.1 42069
+ ```

- # Creating an empty database
+ Leave this running in the background. Now launch several data nodes in other
+ terminals:

- The script createdb.py generates an empty database *dfs.db* for the project.
+ Terminal 1:
+ ```sh
+ ./.build/data-node 127.0.0.1 8001 ./.build/d1 42069
+ ```

- python createdb.py
+ Terminal 2:
+ ```sh
+ ./.build/data-node 127.0.0.1 8002 ./.build/d2 42069
+ ```

- # Deliverables
+ Terminal 3:
+ ```sh
+ ./.build/data-node 127.0.0.1 8003 ./.build/d3 42069
+ ```

- * The source code of the programs (well documented)
- * A README file with:
-   * a description of the programs, including a brief description of how they
-     work.
-   * who helped you or discussed issues with you to finish the program.
- * A video description of the project with implementation details. For any doubt,
-   please consult the professor.
+ I recommend that you put the data directory for the data nodes inside `.build`
+ so that cleaning up is a lot easier.

- # Rubric
+ Now we're ready to start copying files!

- * (10 pts) the programs run
- * (80 pts) quality of the working solutions
-   * (20 pts) Metadata server implemented correctly
-   * (25 pts) Data server implemented correctly
-   * (10 pts) List client implemented correctly
-   * (25 pts) Copy client implemented correctly
- * (10 pts) quality of the README
-   * (10 pts) description of the programs and how they work.
- * No project will be graded without submission of the video explaining how the
-   project was implemented.
+ There is a `./test.sh` file that can help you test out your files. It will create
+ a `500MB` file. I tested this with files up to `5GB`, so if you want to try that,
+ just add another 0 in the `./test.sh` file.
test.sh  +17 -10
···
#!/usr/bin/env sh

- gum log -l "info" -t ansic "Erasing test files"
+ mylog() {
+     local MSG="$*"
+     local DATE="$(date +"%a %b %d %H:%M:%S %Y")"
+ 
+     echo -e "$DATE \033[96mINFO\033[0m $MSG"
+ }
+ 
+ mylog "Erasing test files"
echo "rm 500MB.bin another_500MB.bin"
rm 500MB.bin another_500MB.bin

echo ""

- gum log -l "info" -t ansic "Testing ls"
+ mylog "Testing ls"
echo "./.build/ls 127.0.0.1 8000"
./.build/ls 127.0.0.1 8000

echo ""
- gum log -l "info" -t ansic "Testing copy from client to server"
+ mylog "Testing copy from client to server"
echo ""

- gum log -l "info" -t ansic "Creating 500MB file with random bytes, called \"500MB.bin\""
+ mylog "Creating 500MB file with random bytes, called \"500MB.bin\""
echo "cat /dev/random | head -c500000000 > 500MB.bin"
cat /dev/random | head -c500000000 > 500MB.bin

- gum log -l "info" -t ansic "The size of 500MB.bin is:"
+ mylog "The size of 500MB.bin is:"
echo "cat 500MB.bin | wc -c"
cat 500MB.bin | wc -c

echo ""

- gum log -l "info" -t ansic "Copying file to the server"
+ mylog "Copying file to the server"
echo "./.build/copy 127.0.0.1 8000 500MB.bin -s /somewhere/in/the/server/500MB.bin"
./.build/copy 127.0.0.1 8000 500MB.bin -s /somewhere/in/the/server/500MB.bin

echo ""

- gum log -l "info" -t ansic "Testing ls"
+ mylog "Testing ls"
echo "./.build/ls 127.0.0.1 8000"
./.build/ls 127.0.0.1 8000

echo ""
- gum log -l "info" -t ansic "Testing copy from server to client"
+ mylog "Testing copy from server to client"
echo ""

- gum log -l "info" -t ansic "Copying file from the server to the client"
+ mylog "Copying file from the server to the client"
echo "./.build/copy 127.0.0.1 8000 -s /somewhere/in/the/server/500MB.bin another_500MB.bin"
./.build/copy 127.0.0.1 8000 -s /somewhere/in/the/server/500MB.bin another_500MB.bin

echo ""

- gum log -l "info" -t ansic "Checking if both files are the same"
+ mylog "Checking if both files are the same"
diff -s 500MB.bin another_500MB.bin